Logo

Evan Trock

LinkedIn
Resume
GitHub

Data Section

  1. This data shows observations about a sample of homes that have been built over a long span of time(1852-2008), It shows different types of add-ons for a house, square footage, type of street and information about the sale of the property but most of all, the sales price
  2. 1941 Observations in this data set
  3. Year Sold Ranging from 2006-2008
  4. No Duplicate Parcels
  5. Certain Variables have missing values including: v_Mas_Vnr_Area(Missing 18) v_BsmtFin_SF_1, v_BsmtFin_SF_2(Both missing 1), and v_Lot_Frontage(Missing 321)
  6. A couple of outliers in the Sales price category: I determined it is over 60000 for smaller houses, and lot size with lower prices looks to be over 70000.
  7. Data comes from a City called Ames in Iowa
  8. Continous variables include but are not limited to many area variables: Lot size, price, pool_area, porch area, as well as street front property size Notes: Strong correlation between price and these variables, not as big as expected in lot size, but large correlation in square feet of 1st and 2nd floor
  9. Categorical Variables include types and styles like roof type and style, neighborhood, lot configuration and type of road access Notes: Strong correlation between price and the condition of the house at sale, showing new houses go for more, and variables like Overall quality
  10. Discrete Variables include number of bedrooms, kitchen, bathrooms and so on
import pandas as pd
from statsmodels.formula.api import ols as sm_ols
import numpy as np
import seaborn as sns
from statsmodels.iolib.summary2 import summary_col # nicer tables
import matplotlib.pyplot as plt 

data = pd.read_csv('input_data2/housing_train.csv')
data.describe()

v_MS_SubClass v_Lot_Frontage v_Lot_Area v_Overall_Qual v_Overall_Cond v_Year_Built v_Year_Remod/Add v_Mas_Vnr_Area v_BsmtFin_SF_1 v_BsmtFin_SF_2 ... v_Wood_Deck_SF v_Open_Porch_SF v_Enclosed_Porch v_3Ssn_Porch v_Screen_Porch v_Pool_Area v_Misc_Val v_Mo_Sold v_Yr_Sold v_SalePrice
count 1941.000000 1620.000000 1941.000000 1941.000000 1941.000000 1941.000000 1941.000000 1923.000000 1940.000000 1940.000000 ... 1941.000000 1941.000000 1941.000000 1941.000000 1941.000000 1941.000000 1941.000000 1941.000000 1941.000000 1941.000000
mean 58.088614 69.301235 10284.770222 6.113344 5.568264 1971.321999 1984.073158 104.846074 436.986598 49.247938 ... 92.458011 49.157135 22.947965 2.249871 16.249871 3.386399 52.553838 6.431221 2006.998454 182033.238022
std 42.946015 23.978101 7832.295527 1.401594 1.087465 30.209933 20.837338 184.982611 457.815715 169.555232 ... 127.020523 70.296277 65.249307 22.416832 56.748086 43.695267 616.064459 2.745199 0.801736 80407.100395
min 20.000000 21.000000 1470.000000 1.000000 1.000000 1872.000000 1950.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 1.000000 2006.000000 13100.000000
25% 20.000000 58.000000 7420.000000 5.000000 5.000000 1953.000000 1965.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 5.000000 2006.000000 130000.000000
50% 50.000000 68.000000 9450.000000 6.000000 5.000000 1973.000000 1993.000000 0.000000 361.500000 0.000000 ... 0.000000 28.000000 0.000000 0.000000 0.000000 0.000000 0.000000 6.000000 2007.000000 161900.000000
75% 70.000000 80.000000 11631.000000 7.000000 6.000000 2001.000000 2004.000000 168.000000 735.250000 0.000000 ... 168.000000 72.000000 0.000000 0.000000 0.000000 0.000000 0.000000 8.000000 2008.000000 215000.000000
max 190.000000 313.000000 164660.000000 10.000000 9.000000 2008.000000 2009.000000 1600.000000 5644.000000 1474.000000 ... 1424.000000 742.000000 1012.000000 407.000000 576.000000 800.000000 17000.000000 12.000000 2008.000000 755000.000000

8 rows × 37 columns

# graph continuous data
plt.plot(data['v_2nd_Flr_SF'], data['v_SalePrice'], 'o')
plt.title('Lot Area vs Sale Price')
plt.show()

png

#Graph categorical data
sns.boxplot(x='v_Overall_Qual', y='v_SalePrice', data=data)
plt.title('Sale Price by Categorical variables')
Text(0.5, 1.0, 'Sale Price by Categorical variables')

png

print('Regression Results:')
reg1 = sm_ols('v_SalePrice ~  v_Lot_Area ', data=data).fit()
logLotArea = np.log(data['v_Lot_Area'])
reg2 = sm_ols('v_SalePrice ~  logLotArea ', data=data).fit()
logSalePrice = np.log(data['v_SalePrice'])
reg3 = sm_ols('logSalePrice ~  v_Lot_Area ', data=data).fit()
reg4 = sm_ols('logSalePrice ~  logLotArea ', data=data).fit()
reg5 = sm_ols('logSalePrice ~  v_Yr_Sold ', data=data).fit()
data['v_Yr_Sold_2007'] = (data['v_Yr_Sold'] == 2007).astype(int)
data['v_Yr_Sold_2008'] = (data['v_Yr_Sold'] == 2008).astype(int)
reg6 = sm_ols('logSalePrice ~  v_Yr_Sold_2007 + v_Yr_Sold_2008 ', data=data).fit()
log1stFlr = np.log(data['v_1st_Flr_SF'])
logQual = np.log(data['v_Overall_Qual'])
reg7 = sm_ols('logSalePrice ~  v_Kitchen_Qual + log1stFlr + v_Functional + logQual + v_Gr_Liv_Area ', data=data).fit()

print(summary_col(results=[reg1,reg2,reg3,reg4,reg5,reg6,reg7], # list the result obj here
                  float_format='%0.2f',
                  stars = True, # stars are easy way to see if anything is statistically significant
                  model_names=['No Log','Log Area','No AreaLog','Both Log','Yr Sold','2007+2008',
                               '5 Variable Model'],
                  )
     )

Regression Results:

======================================================================================================
                        No Log       Log Area   No AreaLog Both Log Yr Sold 2007+2008 5 Variable Model
------------------------------------------------------------------------------------------------------
Intercept            154789.55*** -327915.80*** 11.89***   9.41***  22.29   12.02***  8.48***         
                     (2911.59)    (30221.35)    (0.01)     (0.15)   (22.94) (0.02)    (0.11)          
v_Lot_Area           2.65***                    0.00***                                               
                     (0.23)                     (0.00)                                                
logLotArea                        56028.17***              0.29***                                    
                                  (3315.14)                (0.02)                                     
v_Yr_Sold                                                           -0.01                             
                                                                    (0.01)                            
v_Yr_Sold_2007                                                              0.03                      
                                                                            (0.02)                    
v_Yr_Sold_2008                                                              -0.01                     
                                                                            (0.02)                    
v_Kitchen_Qual[T.Fa]                                                                  -0.31***        
                                                                                      (0.03)          
v_Kitchen_Qual[T.Gd]                                                                  -0.10***        
                                                                                      (0.02)          
v_Kitchen_Qual[T.TA]                                                                  -0.23***        
                                                                                      (0.02)          
v_Functional[T.Maj2]                                                                  -0.13           
                                                                                      (0.08)          
v_Functional[T.Min1]                                                                  0.19***         
                                                                                      (0.05)          
v_Functional[T.Min2]                                                                  0.16***         
                                                                                      (0.05)          
v_Functional[T.Mod]                                                                   0.20***         
                                                                                      (0.06)          
v_Functional[T.Sal]                                                                   -0.61***        
                                                                                      (0.13)          
v_Functional[T.Sev]                                                                   -0.24           
                                                                                      (0.18)          
v_Functional[T.Typ]                                                                   0.24***         
                                                                                      (0.05)          
log1stFlr                                                                             0.27***         
                                                                                      (0.01)          
logQual                                                                               0.69***         
                                                                                      (0.02)          
v_Gr_Liv_Area                                                                         0.00***         
                                                                                      (0.00)          
R-squared            0.07         0.13          0.06       0.13     0.00    0.00      0.82            
R-squared Adj.       0.07         0.13          0.06       0.13     -0.00   0.00      0.82            
======================================================================================================
Standard errors in parentheses.
* p<.1, ** p<.05, ***p<.01
#3
print(summary_col(results=[reg5, reg6 ],  
                  float_format='%0.6f', 
                  stars=True, 
                  model_names=['Reg 5', 'reg6']
                 )
     )
=======================================
                  Reg 5        reg6    
---------------------------------------
Intercept      22.293213   12.022869***
               (22.936825) (0.016136)  
v_Yr_Sold      -0.005114               
               (0.011428)              
v_Yr_Sold_2007             0.025590    
                           (0.022246)  
v_Yr_Sold_2008             -0.010282   
                           (0.022848)  
R-squared      0.000103    0.001436    
R-squared Adj. -0.000412   0.000406    
=======================================
Standard errors in parentheses.
* p<.1, ** p<.05, ***p<.01
  1. Regression outputs and R-Squared Shown above
  2. $\beta_1$ = 56028.17. This means that if the lot area increases by 1%, the predicted sale price increases by approximately $560.28
  3. $\beta_1$ = 0.000013 it’s a very small positive number, it means that for each additional square foot of lot area, the sale price increases by a .013 cents
  4. Looking at R squared first, Model 2 and 4 have a leg up at .13. To distinguish them, the elasticity is more interpretable for model 2 than 4 so I beleive model 2 represents the data the best.
  5. $\beta_1$ = -0.01. This means that, holding the effects of sales in 2007 and 2008 constant, each additional year of sale is associated with a decrease of $0.01 in sale price.
  6. $\alpha$ = 12.02. This represents the predicted sale price when v_Yr_Sold, v_Yr_Sold_2007, and v_Yr_Sold_2008 are all zero. In the context of this model, it’s the baseline sale price for years other than 2007 or 2008
  7. $\beta_1$ = .03 : This means that, holding the year of sale and the effect of 2008 constant, sales in 2007 were associated with a 0.03 increase in sale price.
  8. The indicator variables in model 6 provide more explanatory power, leading to a higher R-squared, even though the overall R-squared values are still very low, with the data split between 2 years for model 6
  9. I used logSalePrice and v_1st_Flr_SF, measuring the 1st floor square footage versus the sale price
  10. 0.82 is the R-Squared from my 5 variable model
  11. Although it could be used for a predictive regression to predicting house prices in the 2007/2008 period, I don’t think that it necessarily should as the low R squared value of .001436 suggests that it is more of a correlation rather than causation and the 2 variables are not very related.
  12. Similar to number 11, model 5 has a very small R-Squared value at .000103 and should not be used for predictive analysis.