3 min read

How to debone linear regression output in Python

This post makes a trio with two most recent blogs on the same subject for R and Excel, this time in Python usking the sklearn package.

The example given in the official documentation (https://goo.gl/GGmUKK) is somewhat sketchy and uses a different dataset than the previous two examples. Although the example dataset is different, my main purpose is to illustrate the default output.

                        OLS Regression Results
Dep. Variable:                      y   R-squared:                       0.518
Model:                            OLS   Adj. R-squared:                  0.507
Method:                 Least Squares   F-statistic:                     46.27
Date:                Wed, 29 Aug 2018   Prob (F-statistic):           3.83e-62
Time:                        08:39:07   Log-Likelihood:                -2386.0
No. Observations:                 442   AIC:                             4794.
Df Residuals:                     431   BIC:                             4839.
Df Model:                          10
Covariance Type:            nonrobust
                 coef    std err          t      P>|t|      [0.025      0.975]
const        152.1335      2.576     59.061      0.000     147.071     157.196
x1           -10.0122     59.749     -0.168      0.867    -127.448     107.424
x2          -239.8191     61.222     -3.917      0.000    -360.151    -119.488
x3           519.8398     66.534      7.813      0.000     389.069     650.610
x4           324.3904     65.422      4.958      0.000     195.805     452.976
x5          -792.1842    416.684     -1.901      0.058   -1611.169      26.801
x6           476.7458    339.035      1.406      0.160    -189.621    1143.113
x7           101.0446    212.533      0.475      0.635    -316.685     518.774
x8           177.0642    161.476      1.097      0.273    -140.313     494.442
x9           751.2793    171.902      4.370      0.000     413.409    1089.150
x10           67.6254     65.984      1.025      0.306     -62.065     197.316
Omnibus:                        1.506   Durbin-Watson:                   2.029
Prob(Omnibus):                  0.471   Jarque-Bera (JB):                1.404
Skew:                           0.017   Prob(JB):                        0.496
Kurtosis:                       2.726   Cond. No.                         227.

Source: User JARH on Stackoveflow (https://goo.gl/vZ3UWP), using the same dataset as the official documentation.

The output is divided into three sections.

  1. OLS REGRESSION RESULTS lacks the “call,” identifying the dependent and independent variables. It contains less information on residuals than R and adds some additional information. The date/time stamp is a gift during the exploratory data phase before the analyst has started more formal procedures.

Four additional substantive fields are added

  • Covariance Type is a useful reminder whether the the requirements for a reliable linear regression are satisfied totally or not (“nonrobust”).
  • Log-likelihood specifies which measure was applied to deal with the role of error terms
  • AIC – the Akalike information criterion – is a “score” to assess the quality of competing models
  • BIC – the Baysian inforation criterion – performs a similar role
  1. The second section, below the “===============” line is similar to R and Excel in providing coefficients and standard error. Like Excel, it provides a confidence interval (but in standard 92.5/97.5 form for the 95% confidence interval) that has to be run separately in R. Unlike the other two programs it omits the convenient asterisk coding of significance levels.

  2. The third section, below the “================” has information omitted from both R and Excel or in a different form.

  • The Omnibus test is the F statistic from analysis of variance (ANOVA), included in Excel and separatly available in R. Skew and kurtosis describe asymmetries in the distribution curve. Durbin-Watson tests for autocorrelation. It is separately available in R; I haven’t checked in Excel. Jarque-Bera tests whether the data is normally distributed. It is separately available in R; I haven’t checked in Excel. The condition number is a measure of collinearity among the independent variables, which can have the effect of overemmphasizing redundant variables. It is separately available in R; I haven’t checked in Excel.

As sometimes said of economists, you can lay them end-to-end, and they would still point in different directions. Sklearn’s community of designers make different choices from those of R and the proprietary team of Microsoft Excel. Which is better or worse is a matter of personal taste, so long as each provides some functionality to produce the same output separately from the default for linear regression output.

If you don’t regularly use all three, I hope these guides have been useful.