4 min read

How to debone the output of a linear regression model

I was recently puzzling over an academic paper in the social sciences with a multiple linear regression model that seemed off. (Communication with the authors educated me on considerations that resolved my concerns.)

During the course of my puzzling, I compared the model output reported in the paper with what I expected to see based on experience. The embarrassing revelation dawned that I didn’t actually know what all those components meant.

So, I decided to find out. Data on Seattle area housing prices provide a convenient way to illustrate the usual output of a multiple linear regression model. This is a dataset of roughly 21,000 home sales with 19 variables on housing characteristics and sales price.
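
For the curious, here is a minimal sketch of the R that generates output like the block below; the CSV file name is a placeholder of mine, so point `read_csv` at wherever your copy of the data lives.

```r
library(readr)

print("Data Structure")

# Read the King County house-sales data; the file name here is a placeholder.
house <- read_csv("kc_house_data.csv")

# Show the column specification that read_csv guessed for each variable.
spec(house)

# Fit the five-predictor model whose summary appears below.
fit <- lm(price ~ bedrooms + bathrooms + sqft_living + sqft_lot + yr_built,
          data = house)
summary(fit)
```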

## [1] "Data Structure"
## Parsed with column specification:
## cols(
##   .default = col_double(),
##   id = col_character(),
##   date = col_datetime(format = "")
## )
## See spec(...) for full column specifications.
## cols(
##   id = col_character(),
##   date = col_datetime(format = ""),
##   price = col_double(),
##   bedrooms = col_double(),
##   bathrooms = col_double(),
##   sqft_living = col_double(),
##   sqft_lot = col_double(),
##   floors = col_double(),
##   waterfront = col_double(),
##   view = col_double(),
##   condition = col_double(),
##   grade = col_double(),
##   sqft_above = col_double(),
##   sqft_basement = col_double(),
##   yr_built = col_double(),
##   yr_renovated = col_double(),
##   zipcode = col_double(),
##   lat = col_double(),
##   long = col_double(),
##   sqft_living15 = col_double(),
##   sqft_lot15 = col_double()
## )
## 
## Call:
## lm(formula = price ~ bedrooms + bathrooms + sqft_living + sqft_lot + 
##     yr_built, data = house)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1784413  -131451   -16905   100357  3933748 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  6.006e+06  1.298e+05  46.281   <2e-16 ***
## bedrooms    -7.096e+04  2.246e+03 -31.592   <2e-16 ***
## bathrooms    8.114e+04  3.730e+03  21.755   <2e-16 ***
## sqft_living  3.044e+02  2.998e+00 101.544   <2e-16 ***
## sqft_lot    -3.388e-01  4.119e-02  -8.225   <2e-16 ***
## yr_built    -3.057e+03  6.686e+01 -45.730   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 245800 on 21607 degrees of freedom
## Multiple R-squared:  0.552,  Adjusted R-squared:  0.5519 
## F-statistic:  5325 on 5 and 21607 DF,  p-value: < 2.2e-16

Well now! I pick a handful of variables off the top of my head and presto! My model explains over 50% of the variation in price. Am I good or what?

Not really, and I’ll explain all the bonehead errors in another post (didn’t normalize, didn’t split into training and test sets, used variables that are not truly independent, didn’t check the distribution of the residuals, and a host of other sins). Let’s talk about the information you get from a multiple linear regression results table.
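
For a flavor of what a more careful pass might look like, here is a rough sketch; the 80/20 split and the log transform are illustrative choices on my part, not anything I actually did above.

```r
set.seed(42)

# Hold out a test set before fitting anything (an 80/20 split is one common choice).
train_idx <- sample(nrow(house), size = floor(0.8 * nrow(house)))
train <- house[train_idx, ]
test  <- house[-train_idx, ]

# Prices are heavily right-skewed, so modeling log(price) is one common remedy.
fit2 <- lm(log(price) ~ bedrooms + bathrooms + sqft_living + sqft_lot + yr_built,
           data = train)

# Inspect the residual diagnostics instead of trusting R-squared alone.
par(mfrow = c(2, 2))
plot(fit2)

# Judge the model on data it has never seen.
pred <- exp(predict(fit2, newdata = test))
rmse <- sqrt(mean((test$price - pred)^2))
rmse
```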

  1. The Call is an explicit statement of the model.
  2. The Residuals show a summary distribution of the differences between the observations and the predictions. In this example the median residual is about -$17,000, so half of the residuals fall below that value and half above, including one whopper where the observed price is almost $4 million above the prediction.
  3. The Coefficients include, in addition to the estimate and its standard error, a t value (the estimate divided by its standard error) and Pr(>|t|), the probability of seeing a t value at least that large in absolute value if the true coefficient were zero. In other words, it is the probability, extremely low for every coefficient in this example, of getting an estimate this far from zero purely by chance. (The first sketch after this list recomputes these by hand.)
  4. The Signif. codes are a shorthand key: the stars printed next to each coefficient indicate how small its p-value is.
  5. The Residual standard error indicates roughly how far, on average, the observed prices fall from the fitted regression line, in this case approximately $246,000. It is reported on 21,607 degrees of freedom: the 21,613 observations minus the 6 estimated coefficients.
  6. Multiple R-squared indicates the proportion of the variance in the dependent variable (price) that is explained by the model as a whole, here about 55%.
  7. Adjusted R-squared applies a penalty for the number of predictors, so that R-squared cannot be inflated simply by throwing more variables into the model.
  8. The F-statistic addresses the question of whether there is, in fact, any relationship between the independent variables, taken together, and the dependent variable. An F-statistic close to 1 would keep us from rejecting the null hypothesis that all of the coefficients are zero; a large value, as here, is evidence that at least one predictor is related to price. Its degrees of freedom are the number of predictors (5) and the number of observations minus the number of estimated coefficients (21,607). The p-value gives the probability of an F-statistic at least this large if the null hypothesis were true; the extremely low p-value in this example indicates that the F-statistic is highly unlikely to be a chance artifact. (The second sketch after this list works through the arithmetic.)
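
To make items 2 and 3 concrete, the residual summary and the coefficient tests can be recomputed by hand from the fitted model (the `fit` object from the sketch near the top of the post):

```r
# Item 2: the Residuals block is the five-number summary of observed minus predicted.
quantile(residuals(fit), probs = c(0, 0.25, 0.5, 0.75, 1))

# Item 3: each t value is the estimate divided by its standard error, and
# Pr(>|t|) is the two-sided tail probability of that t statistic under the
# null hypothesis that the true coefficient is zero. Shown here for sqft_living.
est   <- coef(summary(fit))["sqft_living", "Estimate"]
se    <- coef(summary(fit))["sqft_living", "Std. Error"]
t_val <- est / se
p_val <- 2 * pt(abs(t_val), df = df.residual(fit), lower.tail = FALSE)
c(t = t_val, p = p_val)
```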
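
And items 5 through 8 reduce to a couple of sums of squares; here is a sketch of the arithmetic behind the last three lines of the summary:

```r
y   <- model.response(model.frame(fit))  # observed prices
n   <- length(y)                         # number of observations (21,613)
k   <- length(coef(fit)) - 1             # number of predictors (5)
rss <- sum(residuals(fit)^2)             # residual sum of squares
tss <- sum((y - mean(y))^2)              # total sum of squares

# Item 5: residual standard error on n - k - 1 = 21,607 degrees of freedom.
rse <- sqrt(rss / (n - k - 1))

# Item 6: multiple R-squared, the share of the variance in price the model explains.
r2 <- 1 - rss / tss

# Item 7: adjusted R-squared penalizes R-squared for the number of predictors.
adj_r2 <- 1 - (rss / (n - k - 1)) / (tss / (n - 1))

# Item 8: the F-statistic compares explained to unexplained variance on k and
# n - k - 1 degrees of freedom; its p-value is the upper-tail probability.
f_stat <- ((tss - rss) / k) / (rss / (n - k - 1))
f_p    <- pf(f_stat, df1 = k, df2 = n - k - 1, lower.tail = FALSE)

c(RSE = rse, R2 = r2, adj_R2 = adj_r2, F = f_stat, p_value = f_p)
```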

As I said before, there are defects in my model that render these results garbage, and a real analysis would have to discuss whether the conditions required for multiple linear regression are actually satisfied. The table is here solely to illustrate the information it provides.

Just goes to show that you don’t really understand something until you’ve tried to explain it.