#!/usr/bin/env python
# coding: utf-8
# ### Data Description
# The details and description of the Boston Housing Data can be found here:
#
# https://archive.ics.uci.edu/ml/machine-learning-databases/housing/
#
# and
#
# https://towardsdatascience.com/things-you-didnt-know-about-the-boston-housing-dataset-2e87a6f960e8
# In[50]:
# import relevant libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# OPTIONAL:
# import pearson correlation library
from scipy.stats import pearsonr
# to show grid lines in plots
sns.set_style('whitegrid')
# to make all plots well positioned in the notebook
get_ipython().run_line_magic('matplotlib', 'inline')
# In[51]:
# Import the Boston dataset
from sklearn.datasets import load_boston
# In[52]:
# store the dataset
boston = load_boston()
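# NOTE: load_boston was removed in scikit-learn 1.2. If the import above fails,
# the hedged fallback below (adapted from scikit-learn's deprecation notice)
# rebuilds the same 'data' and 'target' arrays from the original source.
# Uncomment to use:
# data_url = "http://lib.stat.cmu.edu/datasets/boston"
# raw_df = pd.read_csv(data_url, sep=r"\s+", skiprows=22, header=None)
# data = np.hstack([raw_df.values[::2, :], raw_df.values[1::2, :2]])
# target = raw_df.values[1::2, 2]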
# In[53]:
# check the different keys
boston.keys()
# In[54]:
# You can choose to print each of these keys to see their content
# uncomment a line below and run the cell to see its output
## for data
# print(boston['data'])
## for target or predictors
# print(boston['target'])
## for feature_names OR predictor/column names
# print(boston['feature_names'])
## for DESCR: description of each feature/predictor
print(boston['DESCR'])
## for filename
# (not relevant)
# ## Exploratory Data Analysis
# #### Let's explore the data a little!
# ##### First, let us put the data in a Dataframe
# In[55]:
# check the current data type
type(boston['data'])
# In[56]:
# Convert the numpy array " boston['data'] " into a dataframe
bostonData_array = boston['data']
#boston_df = pd.DataFrame(bostonData_array, columns = ['CRIM','ZN','INDUS','CHAS', 'NOX', 'RM',
# 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT'])
boston_df = pd.DataFrame(bostonData_array, columns = boston['feature_names'])
# add the RESPONSE Variable
boston_df['Med. Worth of Home'] = boston['target']
boston_df.head()
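# In[ ]:
# A quick hedged sanity check (not in the original): confirm the column dtypes
# and that no values are missing before plotting.
boston_df.info()
boston_df.isnull().sum()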
# ###### (1) Compare
# the 'Average number of rooms per dwelling [RM]' (predictor) with the 'Median value of owner-occupied homes in $1000's' [boston['target']] (the response variable).
#
# ###### Does the correlation make sense?
# In[57]:
sns.jointplot(x = 'RM',
y = 'Med. Worth of Home',
data = boston_df ,
color = 'k',
stat_func = pearsonr) # shows r and p-value; removed in seaborn >= 0.11 (see the sketch below)
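# In[ ]:
# Since 'stat_func' was removed in seaborn 0.11+, a minimal sketch computing
# the same statistic directly with scipy:
r, p_value = pearsonr(boston_df['RM'], boston_df['Med. Worth of Home'])
print("pearsonr = %0.2f, p-value = %0.2e" % (r, p_value))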
# ###### YES, the correlation makes sense.
# There is a positive correlation/relationship between RM (average number of rooms per dwelling) and boston['target'] (median value of owner-occupied homes in $1000's). This suggests that a higher average number of rooms per dwelling is associated with a higher median home value.
# In[ ]:
# ##### (2) Let us do a pair plot to see the relationship between "selected" predictors/columns and their correlation
# In[58]:
sns.pairplot(data = boston_df[
['CRIM', 'RM', 'AGE', 'DIS', 'TAX', 'Med. Worth of Home']
]
)
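# In[ ]:
# A hedged complement (not in the original): a correlation heatmap shows the
# same pairwise relationships as the pair plot in one compact figure.
sns.heatmap(boston_df.corr(), annot = True, fmt = '.2f', cmap = 'coolwarm')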
# In[ ]:
# ### We can as well create a linear model plot
# In[59]:
sns.lmplot(x = 'RM', y = 'Med. Worth of Home', data = boston_df)
# In[60]:
## It would be nice to represent 'RM' and 'Med. Worth of Home' with a hex plot
# In[61]:
sns.jointplot(x = 'RM',
y = 'Med. Worth of Home',
data = boston_df ,
kind = 'hex', # also try 'scatter', 'reg', 'resid', or 'kde' (optional)
color = 'k',
stat_func = pearsonr) # removed in seaborn >= 0.11; use pearsonr directly if needed
# #### Interpretation:
# ** There is a dense region between 5.5 and 7.0 in 'RM', which corresponds to between 15.0 and 25.0 in the Median Worth of Home.
# ** This implies that most owner-occupied homes in Boston average about 6 rooms per dwelling, and the median worth of these homes is between $150,000 and $250,000.
# ** There is a strong correlation between the average number of rooms per dwelling and the Median Value or Worth of the Homes, with a correlation of about 0.7.
# In[ ]:
# ## Training and Testing Data
#
# Now that we've explored the data a bit, let's go ahead and split the data into training and testing sets.
#
# ** Our variable X will equal the numerical features/columns which is boston['data'] **
#
# ** Our variable y will equal the response variable which is boston['target'], i.e. Median value of owner-occupied homes in \$1000's **
# In[62]:
# predictors
X = boston['data']
# In[63]:
# response (the predictand)
y = boston['target']
# In[64]:
# Split data into training and testing set
from sklearn.model_selection import train_test_split
# In[65]:
X_train, X_test, y_train, y_test = train_test_split(X,
y,
test_size = 0.3, # 30% of the data for testing, 70% for training
random_state = 101)
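# In[ ]:
# A quick hedged check (not in the original) that the 70/30 split worked:
print('Train shape:', X_train.shape, ' Test shape:', X_test.shape)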
# #### Train the Model with train-data
# In[66]:
# import LinearRegression
from sklearn.linear_model import LinearRegression
# In[67]:
# create an object for the linear regression
lm = LinearRegression()
# In[68]:
# train and fit 'lm' on the training data
lm.fit(X = X_train,
y = y_train)
# In[69]:
# Get the coefficients of the predictors
print('Coefficients: ', lm.coef_)
# In[70]:
# Get the intercept of the model
lm.intercept_
# print('Intercept = %0.3f' % lm.intercept_) # to 3 decimal place
# In[ ]:
# In[71]:
## The model is thus given by:
print("Our linear model is: "
" 'Medain Value of Home (Y)' = {:.4} + {:.4}*CRIM + {:.4}*ZN + {:.4}*INDUS + {:.4}*CHAS + {:.4}*NOX + {:.4}*RM + "
" {:.4}*AGE + {:.4}*DIS + {:.4}*RAD + {:.4}*TAX + {:.4}*PTRATIO + {:.4}*B + "
" {:.4}*LSTAT ".format(
lm.intercept_,
lm.coef_[0], lm.coef_[1], lm.coef_[2], lm.coef_[3], lm.coef_[4], lm.coef_[5],
lm.coef_[6], lm.coef_[7], lm.coef_[8], lm.coef_[9], lm.coef_[10], lm.coef_[11], lm.coef_[12]))
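# In[ ]:
# A tidier, hedged alternative to the long format string above: pair each
# coefficient with its feature name using zip.
for name, coef in zip(boston['feature_names'], lm.coef_):
    print('%8s : %+.4f' % (name, coef))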
# In[ ]:
# ### Prediction of Model
# ** We evaluate the performance of our model by predicting on the test values! **
# In[72]:
predictions = lm.predict(X = X_test)
# In[73]:
predictions
# ### Compare:
# **Now let's see how strong the relationship is, between our predictions and the real (original y-values)**
# In[74]:
# Using a scatter plot to check for correlation
# Using matplotlib
plt.scatter(x = y_test,
y = predictions)
# In[75]:
# Using seaborn scatterplot
#sns.scatterplot(x = y_test, # original or real values from data
# y = predictions) # predicted values
# Using Seaborn Jointplot (so we can call the correlation value)
sns.jointplot(x = y_test,
y = predictions,
kind = 'scatter',
stat_func = pearsonr) # removed in seaborn >= 0.11; see the sketch below
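# In[ ]:
# As above, 'stat_func' is gone in seaborn >= 0.11; a minimal sketch computing
# the prediction/observation correlation directly:
r, p_value = pearsonr(y_test, predictions)
print('Correlation between y_test and predictions: %0.2f' % r)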
# ### Interpretation:
# ** There is clearly a strong correlation between our predictions and the original values. The correlation value is 0.85, hence the model is good enough to be used for real-world predictions. **
# In[ ]:
# ## Evaluating the Model
# ** Let's evaluate our model performance by calculating the mean errors (MAE, MSE, RMSE) and the variance score (R^2) **
# In[76]:
# import the library
from sklearn import metrics
# ##### Quick important note:
# ** The RMSE is the square root of the variance of the residuals. It indicates the absolute fit of the model to the data–how close the observed data points are to the model’s predicted values. Whereas R-squared is a relative measure of fit, RMSE is an absolute measure of fit. As the square root of a variance, RMSE can be interpreted as the standard deviation of the unexplained variance, and has the useful property of being in the same units as the response variable. **Lower values of RMSE indicate better fit**. RMSE is a good measure of how accurately the model predicts the response, and it is the most important criterion for fit if the main purpose of the model is prediction. (Source: https://www.theanalysisfactor.com/assessing-the-fit-of-regression-models/)**
# In[77]:
# for MAE:
mae = metrics.mean_absolute_error(y_true = y_test,
y_pred = predictions)
# for MSE:
mse = metrics.mean_squared_error(y_true = y_test,
y_pred = predictions)
# for RMSE
rmse = np.sqrt( metrics.mean_squared_error(y_true = y_test,
y_pred = predictions)
)
print("MAE: ", mae )
print("MSE: ", mse )
print("RMSE:", rmse)
# In[78]:
# Since the RMSE is low, we can say that our model accurately predicts the response.
# In[ ]:
# ## Residuals:
# ** Let's quickly explore the residuals to make sure everything is okay with our data. We do this by plotting a histogram of the residuals to check that they are normally distributed **
# In[79]:
# Using Seaborn
sns.distplot(a = (y_test - predictions), # residuals = observed minus predicted values
bins = 50)
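# In[ ]:
# NOTE: sns.distplot is deprecated since seaborn 0.11; a hedged equivalent
# using the newer API:
sns.histplot(y_test - predictions, bins = 50, kde = True)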
# In[80]:
# YES: there is a good level of normality in our residuals. Hence, we finally accept the model.
# In[ ]:
# ## Further Thoughts:
# ** We can tell how powerful each predictor/variable is in the model, using the coefficients of the predictors.
# ** We can tell the significance of each predictor/variable in predicting the response variable, using their p-values.
# ##### Effect of predictors on model
# In[81]:
# Create a dataframe that holds the coefficients alongside the column names
# recall the column names
boston_df.columns
# In[82]:
# remove the last column (i.e. the response variable)
boston_df.drop(labels = 'Med. Worth of Home', # name of the column to drop
axis = 1, # means column, axis = 0 means row
inplace = True) # make the drop permanent
# now check the columns again
boston_df.columns
# In[83]:
# DataFrame
cdf = pd.DataFrame(data = lm.coef_,
index = boston_df.columns,
columns = ['Coefficient'])
cdf
# In[84]:
# Let's sort the values by the Coefficient column
cdf.sort_values(by = ['Coefficient'],
ascending = False)
# ### Interpretation:
# ** We see the effect of each predictor (based on its coefficient) in the above table, in descending order.
# ** Clearly, 'CHAS' (Charles River dummy variable; = 1 if tract bounds river, 0 otherwise) and 'RM' (average number of rooms per dwelling) give the most positive (increasing) effect on the model.
# ** And 'NOX' (nitric oxides concentration, parts per 10 million) gives the most negative (decreasing) effect on the model.
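# In[ ]:
# A hedged visual complement (not in the original): a bar chart of the sorted
# coefficients makes the positive/negative effects easy to scan.
cdf.sort_values(by = 'Coefficient').plot(kind = 'barh', legend = False)
plt.xlabel('Coefficient')
plt.title('Linear model coefficients by predictor')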
# In[ ]: