3 functions to perform linear regression in Python
By Jia Liu in python regression statistics
December 12, 2021
So why are we here?
Recently I helped another PhD student in our department solve a problem in her Python script that performs linear regression. In this script, two linear regression models were fitted to the same input data using two methods: LinearRegression() from sklearn.linear_model, and the OLS function from statsmodels.api. The R square and adjusted R square calculated from these two models were very different. That inspired this post: there are usually several methods that perform the same task, and disagreement among their results can be a headache. Here we will go through three widely used methods for building linear regression models in Python, namely sklearn, statsmodels.api, and statsmodels.formula.api, using a randomly generated dataset, and see how we can avoid the problem.
Generate a random dataset
We will first import the packages we need, and then generate a random dataset:
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
import seaborn as sns
import statsmodels.api as sm
# Generate 'random' data
np.random.seed(0)
X = 2.5 * np.random.randn(150) + 1 # Array of 150 values with mean = 1, stddev = 2.5
res = 0.5 * np.random.randn(150) # Generate 150 residual/random error terms
y = 4 + 0.3 * X + res # Get the actual values of Y assuming y_i = 4 + 0.3x_i + e_i
# Create pandas dataframe to store our X and y values
df = pd.DataFrame(
{'X': X,
'y': y}
)
df.head()
## X y
## 0 5.410131 5.588918
## 1 2.000393 5.456789
## 2 3.446845 4.661676
## 3 6.602233 5.567451
## 4 5.668895 5.651442
Let’s visualize the regression:
sns.set_theme(color_codes=True)
sns.regplot(x="X", y="y", data = df).set(title = "y vs X")
There seems to be a positive trend between \(y\) and \(X\), but the figure also shows a fair amount of dispersion of the data points around the fitted regression line.
Linear regression
Since our goal here is to show how to apply linear regression with three methods in Python rather than to analyze any particular data, we will simply build the models on the simulated data, make predictions using the predictor \(X\) from the simulated data, and look at the model summaries.
Use sklearn
Build and fit the model:
from sklearn.linear_model import LinearRegression
model_sk = LinearRegression()
model_sk.fit(df[['X']], df['y'])
## LinearRegression()
print(f'The coefficient of `model_sk` is {model_sk.coef_[0]}')
## The coefficient of `model_sk` is 0.2935352606493269
print(f'The intercept of `model_sk` is {model_sk.intercept_}')
## The intercept of `model_sk` is 3.9733289338377187
Get the R square:
# Calculate R square
R2 = model_sk.score(df[['X']], df['y'])
print(f'The R square of `model_sk` is {R2}')
## The R square of `model_sk` is 0.7042485663051222
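Since we also planned to make predictions with the predictor \(X\), here is a minimal sketch (not part of the original script) showing the predictions from the fitted sklearn model; it also verifies that model_sk.score() is equivalent to computing the R square from those predictions with sklearn.metrics.r2_score:
from sklearn.metrics import r2_score
# Predictions for the simulated predictor X
y_pred = model_sk.predict(df[['X']])
# R square computed from the predictions; this should match model_sk.score() above
print(r2_score(df['y'], y_pred))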
There is no built-in function for the adjusted R square in LinearRegression of sklearn. Since $$R_{adj}^2 = 1 - \frac{(1-R^2)(n-1)}{n-k-1}$$ where \(n\) is the number of observations in the sample and \(k\) is the number of predictors, we can calculate the adjusted R square manually:
# Calculate adjusted R square
n = len(X)  # number of observations
k = 1       # number of predictors
adj_R2 = 1 - (1 - R2) * (n - 1) / (n - k - 1)
print(f'The adjusted R square of `model_sk` is {adj_R2}')
## The adjusted R square of `model_sk` is 0.7022502458071839
Use statsmodels.api
statsmodels.api.OLS fits an ordinary least squares model. One thing to notice, as mentioned in the Statsmodels API documentation: "An intercept is not included by default and should be added by the user. See statsmodels.tools.add_constant()." Whether or not an intercept is included can make a drastic difference to the model. Let's have a look.
- Default: Without intercept
Model summary:
import statsmodels.api as sm
model_statsm = sm.OLS(y, X)
results_1 = model_statsm.fit()
print(results_1.summary())
## OLS Regression Results
## =======================================================================================
## Dep. Variable: y R-squared (uncentered): 0.358
## Model: OLS Adj. R-squared (uncentered): 0.353
## Method: Least Squares F-statistic: 83.00
## Date: Tue, 14 Dec 2021 Prob (F-statistic): 5.09e-16
## Time: 15:20:58 Log-Likelihood: -403.54
## No. Observations: 150 AIC: 809.1
## Df Residuals: 149 BIC: 812.1
## Df Model: 1
## Covariance Type: nonrobust
## ==============================================================================
## coef std err t P>|t| [0.025 0.975]
## ------------------------------------------------------------------------------
## x1 0.9276 0.102 9.110 0.000 0.726 1.129
## ==============================================================================
## Omnibus: 0.930 Durbin-Watson: 0.418
## Prob(Omnibus): 0.628 Jarque-Bera (JB): 0.943
## Skew: -0.037 Prob(JB): 0.624
## Kurtosis: 2.619 Cond. No. 1.00
## ==============================================================================
##
## Notes:
## [1] R² is computed without centering (uncentered) since the model does not contain a constant.
## [2] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Parameters:
results_1.params
## array([0.92757837])
- Add an intercept
Model summary:
import statsmodels.api as sm
X_cons = sm.add_constant(X)
model_statsm = sm.OLS(y, X_cons)
results_2 = model_statsm.fit()
print(results_2.summary())
## OLS Regression Results
## ==============================================================================
## Dep. Variable: y R-squared: 0.704
## Model: OLS Adj. R-squared: 0.702
## Method: Least Squares F-statistic: 352.4
## Date: Tue, 14 Dec 2021 Prob (F-statistic): 5.49e-41
## Time: 15:21:00 Log-Likelihood: -104.36
## No. Observations: 150 AIC: 212.7
## Df Residuals: 148 BIC: 218.7
## Df Model: 1
## Covariance Type: nonrobust
## ==============================================================================
## coef std err t P>|t| [0.025 0.975]
## ------------------------------------------------------------------------------
## const 3.9733 0.045 88.572 0.000 3.885 4.062
## x1 0.2935 0.016 18.773 0.000 0.263 0.324
## ==============================================================================
## Omnibus: 0.109 Durbin-Watson: 2.235
## Prob(Omnibus): 0.947 Jarque-Bera (JB): 0.121
## Skew: 0.061 Prob(JB): 0.941
## Kurtosis: 2.933 Cond. No. 3.32
## ==============================================================================
##
## Notes:
## [1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Parameters:
results_2.params
## array([3.97332893, 0.29353526])
Summary:
- In our simulated data, we assumed there is an intercept. Thus the default statsmodels.api model, which assumes no intercept, did not fit the data very well.
- The coefficients from the model with an intercept are similar to the ones used to simulate the data, and its R square and adjusted R square are close to the ones from the sklearn model (see the quick check below).
- We can include an intercept in a statsmodels.api model by using the sm.add_constant() function to add a constant column to \(X\), as in the example above.
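As that quick check, here is a minimal sketch (not in the original script) confirming that the sklearn model and the statsmodels.api model with an intercept agree, using the objects fitted above:
import numpy as np
# The intercept and slope should match between the two fits
print(np.allclose(model_sk.intercept_, results_2.params[0]))
print(np.allclose(model_sk.coef_[0], results_2.params[1]))
# The R square and adjusted R square should match as well
print(np.isclose(R2, results_2.rsquared))
print(np.isclose(adj_R2, results_2.rsquared_adj))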
Use statsmodels.formula.api
statsmodels.formula.api allows users to define the regression model using an R-style formula.
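To illustrate what the formula syntax can express beyond a simple y ~ X, here is a small sketch (not part of the original analysis) that fits a quadratic model to the same simulated data; the I() wrapper tells the formula parser to treat X**2 as an arithmetic expression:
from statsmodels.formula.api import ols
# A quadratic model written in R-style formula notation, purely as an illustration
model_quad = ols(formula='y ~ X + I(X**2)', data=df).fit()
print(model_quad.params)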
- Default: With intercept
Note that unlike statsmodels.api, the default OLS model created by statsmodels.formula.api includes an intercept:
from statsmodels.formula.api import ols
formula = 'y ~ X'
model_sfa = ols(formula = formula, data = df).fit()
model_sfa.summary()
Dep. Variable: | y | R-squared: | 0.704 |
---|---|---|---|
Model: | OLS | Adj. R-squared: | 0.702 |
Method: | Least Squares | F-statistic: | 352.4 |
Date: | Tue, 14 Dec 2021 | Prob (F-statistic): | 5.49e-41 |
Time: | 15:21:02 | Log-Likelihood: | -104.36 |
No. Observations: | 150 | AIC: | 212.7 |
Df Residuals: | 148 | BIC: | 218.7 |
Df Model: | 1 | ||
Covariance Type: | nonrobust |
coef | std err | t | P>|t| | [0.025 | 0.975] | |
---|---|---|---|---|---|---|
Intercept | 3.9733 | 0.045 | 88.572 | 0.000 | 3.885 | 4.062 |
X | 0.2935 | 0.016 | 18.773 | 0.000 | 0.263 | 0.324 |
Omnibus: | 0.109 | Durbin-Watson: | 2.235 |
---|---|---|---|
Prob(Omnibus): | 0.947 | Jarque-Bera (JB): | 0.121 |
Skew: | 0.061 | Prob(JB): | 0.941 |
Kurtosis: | 2.933 | Cond. No. | 3.32 |
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
The parameters estimated by the above model are close to the coefficients we used to simulate the data. The R square and adjusted R square also agree with the previous sklearn and statsmodels.api (with intercept) methods. Great!
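To make that comparison explicit, here is a small sketch (not in the original post) that pulls the R square and adjusted R square directly from the formula-based fit and prints the sklearn values computed earlier next to them:
# R square and adjusted R square reported by the statsmodels.formula.api fit
print(model_sfa.rsquared, model_sfa.rsquared_adj)
# The values computed from the sklearn model earlier in the post
print(R2, adj_R2)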
- Exclude the intercept
To build a model without an intercept using statsmodels.formula.api (for example, when the data have already been mean-centered), we can define the formula as 'y ~ X - 1', where the -1 excludes the intercept:
from statsmodels.formula.api import ols
formula = 'y ~ X-1'
model_sfa = ols(formula = formula, data = df).fit()
model_sfa.summary()
Dep. Variable: | y | R-squared (uncentered): | 0.358 |
---|---|---|---|
Model: | OLS | Adj. R-squared (uncentered): | 0.353 |
Method: | Least Squares | F-statistic: | 83.00 |
Date: | Tue, 14 Dec 2021 | Prob (F-statistic): | 5.09e-16 |
Time: | 15:21:03 | Log-Likelihood: | -403.54 |
No. Observations: | 150 | AIC: | 809.1 |
Df Residuals: | 149 | BIC: | 812.1 |
Df Model: | 1 | ||
Covariance Type: | nonrobust |
coef | std err | t | P>|t| | [0.025 | 0.975] | |
---|---|---|---|---|---|---|
X | 0.9276 | 0.102 | 9.110 | 0.000 | 0.726 | 1.129 |
Omnibus: | 0.930 | Durbin-Watson: | 0.418 |
---|---|---|---|
Prob(Omnibus): | 0.628 | Jarque-Bera (JB): | 0.943 |
Skew: | -0.037 | Prob(JB): | 0.624 |
Kurtosis: | 2.619 | Cond. No. | 1.00 |
Notes:
[1] R² is computed without centering (uncentered) since the model does not contain a constant.
[2] Standard Errors assume that the covariance matrix of the errors is correctly specified.
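As a final check (again, a sketch that is not in the original script), the slope estimated by the formula 'y ~ X - 1' should match the one from the sm.OLS model fitted without a constant earlier:
import numpy as np
# model_sfa is now the no-intercept formula fit; results_1 is the sm.OLS fit without a constant
print(np.isclose(model_sfa.params['X'], results_1.params[0]))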