A hands-on walkthrough of basic simple linear regression using this dataset
In [1]:
# Core libraries
import os
import pandas as pd
import numpy as np
import statsmodels.api as sm
In [2]:
# Load the data
boston = pd.read_csv('C:/Users/설위준/Desktop/05-11--machine-learning/Part 05~11) Machine Learning/06. 회귀분석/실습코드/Boston_house.csv')
boston.head()
Out[2]:
 | AGE | B | RM | CRIM | DIS | INDUS | LSTAT | NOX | PTRATIO | RAD | ZN | TAX | CHAS | Target
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
0 | 65.2 | 396.90 | 6.575 | 0.00632 | 4.0900 | 2.31 | 4.98 | 0.538 | 15.3 | 1 | 18.0 | 296 | 0 | 24.0 |
1 | 78.9 | 396.90 | 6.421 | 0.02731 | 4.9671 | 7.07 | 9.14 | 0.469 | 17.8 | 2 | 0.0 | 242 | 0 | 21.6 |
2 | 61.1 | 392.83 | 7.185 | 0.02729 | 4.9671 | 7.07 | 4.03 | 0.469 | 17.8 | 2 | 0.0 | 242 | 0 | 34.7 |
3 | 45.8 | 394.63 | 6.998 | 0.03237 | 6.0622 | 2.18 | 2.94 | 0.458 | 18.7 | 3 | 0.0 | 222 | 0 | 33.4 |
4 | 54.2 | 396.90 | 7.147 | 0.06905 | 6.0622 | 2.18 | 5.33 | 0.458 | 18.7 | 3 | 0.0 | 222 | 0 | 36.2 |
In [3]:
# Keep only the features, excluding the target
boston_data = boston.drop('Target',axis=1)
In [4]:
boston_data.describe()
# Summary statistics for the features
Out[4]:
 | AGE | B | RM | CRIM | DIS | INDUS | LSTAT | NOX | PTRATIO | RAD | ZN | TAX | CHAS
---|---|---|---|---|---|---|---|---|---|---|---|---|---
count | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 |
mean | 68.574901 | 356.674032 | 6.284634 | 3.613524 | 3.795043 | 11.136779 | 12.653063 | 0.554695 | 18.455534 | 9.549407 | 11.363636 | 408.237154 | 0.069170 |
std | 28.148861 | 91.294864 | 0.702617 | 8.601545 | 2.105710 | 6.860353 | 7.141062 | 0.115878 | 2.164946 | 8.707259 | 23.322453 | 168.537116 | 0.253994 |
min | 2.900000 | 0.320000 | 3.561000 | 0.006320 | 1.129600 | 0.460000 | 1.730000 | 0.385000 | 12.600000 | 1.000000 | 0.000000 | 187.000000 | 0.000000 |
25% | 45.025000 | 375.377500 | 5.885500 | 0.082045 | 2.100175 | 5.190000 | 6.950000 | 0.449000 | 17.400000 | 4.000000 | 0.000000 | 279.000000 | 0.000000 |
50% | 77.500000 | 391.440000 | 6.208500 | 0.256510 | 3.207450 | 9.690000 | 11.360000 | 0.538000 | 19.050000 | 5.000000 | 0.000000 | 330.000000 | 0.000000 |
75% | 94.075000 | 396.225000 | 6.623500 | 3.677083 | 5.188425 | 18.100000 | 16.955000 | 0.624000 | 20.200000 | 24.000000 | 12.500000 | 666.000000 | 0.000000 |
max | 100.000000 | 396.900000 | 8.780000 | 88.976200 | 12.126500 | 27.740000 | 37.970000 | 0.871000 | 22.000000 | 24.000000 | 100.000000 | 711.000000 | 1.000000 |
In [5]:
# Target variable
# 1978 Boston housing prices
# Median home value for each of 506 towns (in units of $1,000)
# Feature descriptions
# CRIM: per-capita crime rate
# INDUS: proportion of non-retail business acres
# NOX: nitric oxides concentration
# RM: average number of rooms per dwelling
# LSTAT: percentage of lower-status population
# B: proportion of Black residents
# PTRATIO: pupil-teacher ratio
# ZN: proportion of residential land zoned for lots over 25,000 sq. ft.
# CHAS: 1 if the tract bounds the Charles River, 0 otherwise
# AGE: proportion of homes built before 1940
# RAD: index of accessibility to radial highways
# DIS: weighted distance to employment centers
# TAX: property-tax rate
Simple linear regression with each of the three variables: crim / rm / lstat
In [5]:
## Set up the variables: target / crim / rm / lstat
target = boston[['Target']]
crim = boston[['CRIM']]
rm = boston[['RM']]
lstat = boston[['LSTAT']]
Linear regression: target ~ crim
In [6]:
# Add a constant (intercept) column to crim
crim1 = sm.add_constant(crim,has_constant="add")
crim1
Out[6]:
 | const | CRIM
---|---|---
0 | 1.0 | 0.00632 |
1 | 1.0 | 0.02731 |
2 | 1.0 | 0.02729 |
3 | 1.0 | 0.03237 |
4 | 1.0 | 0.06905 |
... | ... | ... |
501 | 1.0 | 0.06263 |
502 | 1.0 | 0.04527 |
503 | 1.0 | 0.06076 |
504 | 1.0 | 0.10959 |
505 | 1.0 | 0.04741 |
506 rows × 2 columns
In [7]:
# Fit with sm.OLS (an equivalent formula-API one-liner follows)
model1 = sm.OLS(target,crim1)
fitted_model1 = model1.fit()
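As an aside, the same model can be fit with the statsmodels formula API, which adds the intercept automatically; a minimal sketch, not a cell from the original notebook:

import statsmodels.formula.api as smf
smf.ols('Target ~ CRIM', data=boston).fit().params  # same const/CRIM estimates as fitted_model1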
In [8]:
# Print the results with summary()
fitted_model1.summary()
Out[8]:
Dep. Variable: | Target | R-squared: | 0.151 |
---|---|---|---|
Model: | OLS | Adj. R-squared: | 0.149 |
Method: | Least Squares | F-statistic: | 89.49 |
Date: | Fri, 20 Aug 2021 | Prob (F-statistic): | 1.17e-19 |
Time: | 15:51:04 | Log-Likelihood: | -1798.9 |
No. Observations: | 506 | AIC: | 3602. |
Df Residuals: | 504 | BIC: | 3610. |
Df Model: | 1 | ||
Covariance Type: | nonrobust |
 | coef | std err | t | P>|t| | [0.025 | 0.975]
---|---|---|---|---|---|---
const | 24.0331 | 0.409 | 58.740 | 0.000 | 23.229 | 24.837 |
CRIM | -0.4152 | 0.044 | -9.460 | 0.000 | -0.501 | -0.329 |
Omnibus: | 139.832 | Durbin-Watson: | 0.713 |
---|---|---|---|
Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 295.404 |
Skew: | 1.490 | Prob(JB): | 7.14e-65 |
Kurtosis: | 5.264 | Cond. No. | 10.1 |
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
- R² is 0.151, so the model explains only about 15% of the variance in Target.
- The CRIM coefficient is -0.4152, a negative effect, and its p-value indicates it is highly statistically significant.
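As a quick sanity check, R² can be recomputed from the residual and total sums of squares exposed on the results object; a small sketch added here, using standard OLSResults attributes:

print(1 - fitted_model1.ssr / fitted_model1.centered_tss)  # R^2 = 1 - SSR/SST
print(fitted_model1.rsquared)                              # same value as above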
In [9]:
## Print the regression coefficients
fitted_model1.params
Out[9]:
const    24.033106
CRIM     -0.415190
dtype: float64
Computing y_hat = beta0 + beta1 * X by hand
In [10]:
# Multiply the design matrix (X) by the coefficients
# Method 1 for computing y_hat
np.dot(crim1,fitted_model1.params)
Out[10]:
array([ 24.03048217, 24.02176733, 24.02177563, 24.01966646, 24.00443729, 24.02071274, 23.99644902, 23.97309042, 23.94540138, 23.96250722, 23.93973403, 23.98433377, 23.99416963, 23.77163594, 23.76823138, 23.77261995, 23.59552468, 23.70751396, 23.69982879, 23.73176107, 23.51337514, 23.67934745, 23.52139661, 23.62271965, 23.72160552, 23.68412214, 23.75413567, 23.63627976, 23.71216824, 23.61689868, 23.56360486, 23.4706396 , 23.45682622, 23.55492323, 23.36347899, 24.00646341, 23.99265003, 23.99983283, 23.96042712, 24.02163447, 24.01915993, 23.98019433, 23.97435675, 23.96694145, 23.98216648, 23.96193426, 23.95490093, 23.9379155 , 23.92770182, 23.94185981, 23.99626634, 24.01509937, 24.01085198, 24.01242555, 24.02745959, 24.02766303, 24.02457401, 24.02716065, 23.96898004, 23.99022532, 23.97110996, 23.96181385, 23.98732314, 23.9805846 , 24.02500581, 24.01822575, 24.01492499, 24.00907081, 23.97683128, 23.97989539, 23.99646148, 23.96719057, 23.99505814, 23.95198215, 24.00032275, 23.99361327, 23.99095191, 23.99695556, 24.00966453, 23.99828417, 24.0160294 , 24.01458038, 24.01791436, 24.01836277, 24.0121017 , 24.00929501, 24.0115661 , 24.00341592, 24.0096064 , 24.01109279, 24.01365866, 24.01678089, 24.01565573, 24.02116945, 24.0152779 , 23.98243635, 23.98534268, 23.98293873, 23.99911455, 24.00462412, 23.97138399, 23.98564162, 23.93812725, 23.94524776, 23.97514561, 23.97804364, 23.9620256 , 23.97864567, 23.97995351, 23.92364956, 23.98829469, 23.99123839, 23.98191736, 23.94088411, 23.97402045, 23.96196747, 23.97847544, 23.97042075, 23.97889063, 23.97300323, 24.0044622 , 24.00335779, 23.99449763, 23.97066986, 23.99221408, 23.96293071, 23.87228222, 23.92550961, 23.8979908 , 23.66721974, 23.89191657, 23.53780908, 23.78812315, 23.89616812, 23.62780988, 23.80152134, 23.89914918, 23.88682218, 23.92939164, 23.80702676, 23.91232732, 23.35691068, 22.6542385 , 22.33190553, 22.87898515, 23.04522734, 23.13835037, 23.04967818, 23.06530179, 22.89798841, 23.34530196, 23.41184866, 23.56536111, 23.14078753, 23.4460894 , 22.56540439, 23.01726842, 23.52508765, 23.47557206, 23.44145172, 23.50437796, 23.42553333, 23.2717427 , 23.40242384, 23.1021001 , 22.8190898 , 23.19849483, 23.28564742, 23.07800246, 23.01608513, 23.53179713, 23.07239739, 23.9753366 , 23.99500001, 23.99803505, 24.00543789, 24.00395151, 24.0105821 , 24.00552924, 24.00910818, 24.00575344, 24.00450787, 23.9953114 , 23.99155393, 23.99861217, 24.00799962, 24.00984721, 24.00040994, 23.98087939, 23.99835475, 23.99545672, 24.00441237, 23.99713409, 24.02402596, 24.02713159, 24.0273724 , 24.01645289, 24.0137334 , 24.0174618 , 24.02002768, 24.02572409, 24.01880287, 24.02406748, 24.018533 , 24.024765 , 23.97646592, 23.93774112, 23.92848238, 23.97669427, 23.85220362, 23.96067208, 23.87708597, 23.942931 , 23.97476364, 23.91288783, 23.9508902 , 24.0141735 , 24.00398888, 23.98714876, 23.98567068, 23.88443069, 23.86382895, 23.77421012, 23.77788871, 23.90218422, 23.81432996, 23.87444536, 23.86189001, 23.90930059, 23.84968341, 23.81014899, 23.84088968, 23.79425136, 23.89548305, 23.8471383 , 23.89590655, 23.81696642, 23.82059933, 23.99887789, 23.99469277, 23.98606927, 23.98904618, 23.99038309, 23.98014035, 23.94754376, 23.95366782, 23.89201206, 23.95149222, 23.96485304, 23.95391693, 23.97485498, 23.94421809, 23.99897338, 23.87992587, 24.01309815, 24.01837522, 24.02672055, 23.77920071, 23.75762327, 23.76047148, 23.80885775, 23.81134474, 23.8171491 , 23.69046625, 23.80472246, 23.71688895, 23.70689117, 23.79298503, 23.80869583, 23.99546918, 23.90889785, 23.96579968, 23.98552537, 
23.94098376, 24.00967283, 23.9932313 , 23.9896399 , 24.00766747, 23.99998229, 23.94575844, 24.01825067, 24.01772337, 24.00765916, 24.02687417, 24.02934455, 24.02855569, 24.02494769, 24.01703416, 24.01404894, 24.01526545, 24.01856621, 24.00036427, 24.01809705, 23.9987907 , 23.99906472, 23.97941377, 24.01080215, 23.97455189, 24.00625997, 24.01001744, 24.01476722, 24.01842089, 23.99463464, 23.99158715, 24.01020843, 24.0103579 , 24.00195445, 24.01262899, 23.82842567, 23.88803869, 22.9388805 , 23.70493563, 23.92445503, 23.92126222, 23.87981792, 23.92783053, 23.90096356, 23.93129321, 23.86619138, 23.83569565, 23.96352028, 23.95771177, 23.88731626, 23.91522535, 23.89148892, 23.95344777, 23.90710838, 23.93303286, 24.00563303, 24.00518878, 24.01423993, 24.01225117, 24.01871568, 24.01200205, 24.01758636, 24.01666049, 24.0188776 , 24.02048024, 24.01937998, 24.01028316, 24.00756782, 24.02770455, 24.02273472, 24.02254789, 24.02044702, 24.0201813 , 24.00752215, 24.02534212, 24.02687417, 24.02106981, 24.00731871, 24.00009855, 24.00302979, 24.02601057, 24.01524884, 23.98885104, 20.30346852, 22.43474816, 21.87338184, 22.26385169, 22.14734515, 22.44008751, 22.50594499, 22.2800109 , 22.5906189 , 22.14155324, 22.49816848, 18.4188202 , 21.99941285, 21.6789856 , 21.31827659, 20.19994497, 20.60062435, 19.42113105, 16.35283338, 15.8915985 , 17.68567721, 19.95448863, 14.21460344, 16.61502604, -12.90894703, 17.44220963, 20.21874479, 20.71470618, 15.69405096, 17.05301026, 13.90503757, 14.65100995, 18.08189329, 20.64858298, 21.14248918, 21.83548327, 19.22607466, 20.44388587, 18.4862471 , 20.41399632, 21.5950881 , 20.84775806, 8.10981167, 19.91585102, 13.63420895, 18.12237434, 20.04906067, 13.73568146, 6.79058608, -4.16694965, 15.43194134, 19.07112564, 20.95908303, 18.03846438, 2.80201916, 18.19939214, 16.22296186, 12.13549661, 5.0397702 , 16.52455607, 19.53485167, 13.26282125, -6.49753724, 19.12875405, 19.42972549, 21.11739508, 19.03081067, 21.10584033, 20.38270343, 17.44806381, 18.9481878 , 8.39625145, 20.97435373, 20.15568984, 20.50725636, 19.85533704, 21.35759926, 21.71590017, 18.25639776, 19.3994166 , 18.04573021, 17.73168029, 18.35409203, 20.13420789, 14.87770384, 19.99572118, 21.68048444, 19.89509566, 18.71771568, 19.60227857, 21.42236064, 19.91240494, 20.1597587 , 20.90837999, 21.24397414, 21.77399775, 21.91971708, 20.60857939, 20.08313949, 22.05996835, 22.09465335, 20.62830508, 20.81445565, 21.20932651, 22.03515658, 22.49976281, 21.27004809, 21.61622129, 20.77829672, 22.71961021, 22.46577118, 22.19701851, 17.56622696, 18.60445177, 22.22753085, 22.3563976 , 22.55142493, 22.10376262, 20.68842049, 21.3787449 , 22.0105441 , 17.79553655, 19.78446406, 18.08189329, 21.61503384, 21.66312533, 21.65358426, 22.8629422 , 23.04554703, 22.50783411, 21.66994691, 22.025383 , 23.97047057, 23.95697273, 23.9469708 , 23.98920395, 23.98688719, 23.96114955, 23.91703143, 23.95879127, 23.91286707, 23.92167741, 23.93382587, 23.95927289, 23.93994578, 24.00710281, 24.01431051, 24.00787921, 23.98760547, 24.013422 ])
In [11]:
## Method 2: compute y_hat with the predict() function
pred1 = fitted_model1.predict(crim1)
pred1
Out[11]:
0      24.030482
1      24.021767
2      24.021776
3      24.019666
4      24.004437
         ...
501    24.007103
502    24.014311
503    24.007879
504    23.987605
505    24.013422
Length: 506, dtype: float64
In [12]:
## Difference between the hand-computed y_hat and the one from predict()
np.dot(crim1,fitted_model1.params) - fitted_model1.predict(crim1)
# The two are identical!
Out[12]:
0      0.0
1      0.0
2      0.0
3      0.0
4      0.0
      ...
501    0.0
502    0.0
503    0.0
504    0.0
505    0.0
Length: 506, dtype: float64
Visualizing the fitted line
In [13]:
import matplotlib.pyplot as plt
plt.yticks(fontname="Arial")
plt.scatter(crim, target, label="data")  # the data points
plt.plot(crim, pred1, label="result")    # the fitted regression line
plt.legend()
plt.show()
In [14]:
plt.scatter(target,pred1)  # scatter of actual vs. predicted values
plt.xlabel("real_value")
plt.ylabel("pred_value")
plt.show()
# The fit does not look very good.
In [15]:
## Plot the residuals
fitted_model1.resid.plot()
plt.xlabel("residual_number")
plt.show()
# The residuals are clearly large.
In [16]:
## Compute the sum of the residuals
np.sum(fitted_model1.resid)
# The sum of the residuals is essentially zero (see the orthogonality check below).
Out[16]:
-5.684341886080801e-13
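This is no accident: when the design matrix contains an intercept, the normal equations force X'e = 0, so the residuals are orthogonal to every column of X, the constant column included. A one-line check, added here as a sketch:

np.dot(crim1.T, fitted_model1.resid)  # both entries are numerically ~0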
Fitting simple linear regressions with the rm and lstat variables, same as above
In [17]:
# Add constant terms
rm1 = sm.add_constant(rm,has_constant="add")
lstat1 = sm.add_constant(lstat,has_constant="add")
In [18]:
# Fit the regression models
model2 = sm.OLS(target,rm1)
fitted_model2 = model2.fit()
model3 = sm.OLS(target,lstat1)
fitted_model3 = model3.fit()
In [19]:
# Results for the rm model
fitted_model2.summary()
Out[19]:
Dep. Variable: | Target | R-squared: | 0.484 |
---|---|---|---|
Model: | OLS | Adj. R-squared: | 0.483 |
Method: | Least Squares | F-statistic: | 471.8 |
Date: | Fri, 20 Aug 2021 | Prob (F-statistic): | 2.49e-74 |
Time: | 15:54:32 | Log-Likelihood: | -1673.1 |
No. Observations: | 506 | AIC: | 3350. |
Df Residuals: | 504 | BIC: | 3359. |
Df Model: | 1 | ||
Covariance Type: | nonrobust |
 | coef | std err | t | P>|t| | [0.025 | 0.975]
---|---|---|---|---|---|---
const | -34.6706 | 2.650 | -13.084 | 0.000 | -39.877 | -29.465 |
RM | 9.1021 | 0.419 | 21.722 | 0.000 | 8.279 | 9.925 |
Omnibus: | 102.585 | Durbin-Watson: | 0.684 |
---|---|---|---|
Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 612.449 |
Skew: | 0.726 | Prob(JB): | 1.02e-133 |
Kurtosis: | 8.190 | Cond. No. | 58.4 |
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
In [20]:
# Results for the lstat model
fitted_model3.summary()
Out[20]:
Dep. Variable: | Target | R-squared: | 0.544 |
---|---|---|---|
Model: | OLS | Adj. R-squared: | 0.543 |
Method: | Least Squares | F-statistic: | 601.6 |
Date: | Fri, 20 Aug 2021 | Prob (F-statistic): | 5.08e-88 |
Time: | 15:54:34 | Log-Likelihood: | -1641.5 |
No. Observations: | 506 | AIC: | 3287. |
Df Residuals: | 504 | BIC: | 3295. |
Df Model: | 1 | ||
Covariance Type: | nonrobust |
 | coef | std err | t | P>|t| | [0.025 | 0.975]
---|---|---|---|---|---|---
const | 34.5538 | 0.563 | 61.415 | 0.000 | 33.448 | 35.659 |
LSTAT | -0.9500 | 0.039 | -24.528 | 0.000 | -1.026 | -0.874 |
Omnibus: | 137.043 | Durbin-Watson: | 0.892 |
---|---|---|---|
Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 291.373 |
Skew: | 1.453 | Prob(JB): | 5.36e-64 |
Kurtosis: | 5.319 | Cond. No. | 29.7 |
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
In [21]:
## Predict y_hat with each model
pred2 = fitted_model2.predict(rm1)
pred3 = fitted_model3.predict(lstat1)
In [22]:
## Visualize the rm model
import matplotlib.pyplot as plt
plt.scatter(rm,target,label="data")
plt.plot(rm,pred2,label="result")
plt.legend()
plt.show()
In [23]:
## Visualize the lstat fitted line
import matplotlib.pyplot as plt
plt.scatter(lstat,target,label="data")
plt.plot(lstat,pred3,label="result")
plt.legend()
plt.show()
In [24]:
# Plot the residuals of the rm model
fitted_model2.resid.plot()
plt.xlabel("residual_number")
plt.show()
In [25]:
# Plot the residuals of the lstat model
fitted_model3.resid.plot()
plt.xlabel("residual_number")
plt.show()
In [26]:
## Compare the residuals of the three models
fitted_model1.resid.plot(label="crim")
fitted_model2.resid.plot(label="rm")
fitted_model3.resid.plot(label="lstat")
plt.legend()
Out[26]:
<matplotlib.legend.Legend at 0x253bdbe4250>
In [27]:
## Select only the CRIM, RM, LSTAT columns from the boston data
x_data = boston[['CRIM','RM','LSTAT']]
x_data.head()
Out[27]:
 | CRIM | RM | LSTAT
---|---|---|---
0 | 0.00632 | 6.575 | 4.98 |
1 | 0.02731 | 6.421 | 9.14 |
2 | 0.02729 | 7.185 | 4.03 |
3 | 0.03237 | 6.998 | 2.94 |
4 | 0.06905 | 7.147 | 5.33 |
In [28]:
# Add a constant term
x_data1 = sm.add_constant(x_data,has_constant="add")
x_data1
Out[28]:
 | const | CRIM | RM | LSTAT
---|---|---|---|---
0 | 1.0 | 0.00632 | 6.575 | 4.98 |
1 | 1.0 | 0.02731 | 6.421 | 9.14 |
2 | 1.0 | 0.02729 | 7.185 | 4.03 |
3 | 1.0 | 0.03237 | 6.998 | 2.94 |
4 | 1.0 | 0.06905 | 7.147 | 5.33 |
... | ... | ... | ... | ... |
501 | 1.0 | 0.06263 | 6.593 | 9.67 |
502 | 1.0 | 0.04527 | 6.120 | 9.08 |
503 | 1.0 | 0.06076 | 6.976 | 5.64 |
504 | 1.0 | 0.10959 | 6.794 | 6.48 |
505 | 1.0 | 0.04741 | 6.030 | 7.88 |
506 rows × 4 columns
In [29]:
# Fit the regression model
multi_model = sm.OLS(target,x_data1)
fitted_model = multi_model.fit()
In [30]:
# Print the results with summary()
fitted_model.summary()
Out[30]:
Dep. Variable: | Target | R-squared: | 0.646 |
---|---|---|---|
Model: | OLS | Adj. R-squared: | 0.644 |
Method: | Least Squares | F-statistic: | 305.2 |
Date: | Fri, 20 Aug 2021 | Prob (F-statistic): | 1.01e-112 |
Time: | 15:56:17 | Log-Likelihood: | -1577.6 |
No. Observations: | 506 | AIC: | 3163. |
Df Residuals: | 502 | BIC: | 3180. |
Df Model: | 3 | ||
Covariance Type: | nonrobust |
 | coef | std err | t | P>|t| | [0.025 | 0.975]
---|---|---|---|---|---|---
const | -2.5623 | 3.166 | -0.809 | 0.419 | -8.783 | 3.658 |
CRIM | -0.1029 | 0.032 | -3.215 | 0.001 | -0.166 | -0.040 |
RM | 5.2170 | 0.442 | 11.802 | 0.000 | 4.348 | 6.085 |
LSTAT | -0.5785 | 0.048 | -12.135 | 0.000 | -0.672 | -0.485 |
Omnibus: | 171.754 | Durbin-Watson: | 0.822 |
---|---|---|---|
Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 628.308 |
Skew: | 1.535 | Prob(JB): | 3.67e-137 |
Kurtosis: | 7.514 | Cond. No. | 216. |
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Comparing with the simple linear regression coefficients
In [31]:
## Coefficients of the simple linear regressions
print(fitted_model1.params)
print(fitted_model2.params)
print(fitted_model3.params)
const    24.033106
CRIM     -0.415190
dtype: float64
const   -34.670621
RM        9.102109
dtype: float64
const    34.553841
LSTAT    -0.950049
dtype: float64
In [32]:
## Coefficients of the multiple linear regression
print(fitted_model.params)
const   -2.562251
CRIM    -0.102941
RM       5.216955
LSTAT   -0.578486
dtype: float64
The magnitudes of the coefficients have shrunk across the board: in the multiple regression each coefficient is a partial effect, estimated with the other predictors held fixed, so the simple-regression slopes that absorbed the effects of correlated variables get smaller. A compact side-by-side view is sketched below.
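A small comparison table, added here and reusing the fitted models above:

pd.DataFrame({
    'simple': {'CRIM': fitted_model1.params['CRIM'],
               'RM': fitted_model2.params['RM'],
               'LSTAT': fitted_model3.params['LSTAT']},
    'multiple': fitted_model.params[['CRIM', 'RM', 'LSTAT']],
})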
Computing beta with matrix algebra
In [33]:
from numpy import linalg  # beta via matrix algebra: beta_hat = (X'X)^-1 X'y
ba = linalg.inv(np.dot(x_data1.T,x_data1))  # linalg.inv computes the matrix inverse
np.dot(np.dot(ba, x_data1.T),target)  # matches the regression coefficients! (verified below)
Out[33]:
array([[-2.56225101], [-0.10294089], [ 5.21695492], [-0.57848582]])
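To confirm the two routes agree numerically, a one-liner added here as a sketch:

beta_hat = np.dot(np.dot(ba, x_data1.T), target)
np.allclose(beta_hat.ravel(), fitted_model.params.values)  # True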
In [34]:
# Compute y_hat
pred4 = fitted_model.predict(x_data1)
pred4
# a = np.dot(np.dot(ba, x_data1.T), target)
# np.dot(x_data1, a) prints the same result as above
Out[34]:
0      28.857718
1      25.645645
2      32.587463
3      32.241919
4      31.632888
         ...
501    26.232728
502    24.108202
503    30.562312
504    29.121871
505    24.332638
Length: 506, dtype: float64
Residual plot
In [35]:
fitted_model.resid.plot()
plt.xlabel("residual_number")
plt.show()
In [36]:
fitted_model1.resid.plot(label="crim")
fitted_model2.resid.plot(label="rm")
fitted_model3.resid.plot(label="lstat")
fitted_model.resid.plot(label="full")
plt.legend()
Out[36]:
<matplotlib.legend.Legend at 0x253bdbbeb50>
Multiple linear regression with the crim, rm, lstat, b, tax, age, zn, nox, indus variables
In [39]:
# Select all nine features from the boston data
x_data2 = boston[['CRIM','RM','LSTAT','B','TAX','AGE','ZN','NOX','INDUS']]
x_data2.head()
Out[39]:
 | CRIM | RM | LSTAT | B | TAX | AGE | ZN | NOX | INDUS
---|---|---|---|---|---|---|---|---|---
0 | 0.00632 | 6.575 | 4.98 | 396.90 | 296 | 65.2 | 18.0 | 0.538 | 2.31 |
1 | 0.02731 | 6.421 | 9.14 | 396.90 | 242 | 78.9 | 0.0 | 0.469 | 7.07 |
2 | 0.02729 | 7.185 | 4.03 | 392.83 | 242 | 61.1 | 0.0 | 0.469 | 7.07 |
3 | 0.03237 | 6.998 | 2.94 | 394.63 | 222 | 45.8 | 0.0 | 0.458 | 2.18 |
4 | 0.06905 | 7.147 | 5.33 | 396.90 | 222 | 54.2 | 0.0 | 0.458 | 2.18 |
In [40]:
# Add a constant term
x_data2_ = sm.add_constant(x_data2, has_constant='add')
# Fit the regression model
multi_model2 = sm.OLS(target,x_data2_)
fitted_multi_model2 = multi_model2.fit()
# Print the results
fitted_multi_model2.summary()
Out[40]:
Dep. Variable: | Target | R-squared: | 0.662 |
---|---|---|---|
Model: | OLS | Adj. R-squared: | 0.656 |
Method: | Least Squares | F-statistic: | 108.1 |
Date: | Fri, 20 Aug 2021 | Prob (F-statistic): | 5.76e-111 |
Time: | 15:59:41 | Log-Likelihood: | -1565.5 |
No. Observations: | 506 | AIC: | 3151. |
Df Residuals: | 496 | BIC: | 3193. |
Df Model: | 9 | ||
Covariance Type: | nonrobust |
 | coef | std err | t | P>|t| | [0.025 | 0.975]
---|---|---|---|---|---|---
const | -7.1088 | 3.828 | -1.857 | 0.064 | -14.631 | 0.413 |
CRIM | -0.0453 | 0.036 | -1.269 | 0.205 | -0.115 | 0.025 |
RM | 5.0922 | 0.458 | 11.109 | 0.000 | 4.192 | 5.993 |
LSTAT | -0.5651 | 0.057 | -9.854 | 0.000 | -0.678 | -0.452 |
B | 0.0090 | 0.003 | 2.952 | 0.003 | 0.003 | 0.015 |
TAX | -0.0060 | 0.002 | -2.480 | 0.013 | -0.011 | -0.001 |
AGE | 0.0236 | 0.014 | 1.653 | 0.099 | -0.004 | 0.052 |
ZN | 0.0294 | 0.013 | 2.198 | 0.028 | 0.003 | 0.056 |
NOX | 3.4838 | 3.833 | 0.909 | 0.364 | -4.047 | 11.014 |
INDUS | 0.0293 | 0.065 | 0.449 | 0.654 | -0.099 | 0.157 |
Omnibus: | 195.490 | Durbin-Watson: | 0.848 |
---|---|---|---|
Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 872.873 |
Skew: | 1.686 | Prob(JB): | 2.87e-190 |
Kurtosis: | 8.479 | Cond. No. | 1.04e+04 |
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.04e+04. This might indicate that there are
strong multicollinearity or other numerical problems.
In [42]:
# Coefficients of the model with only the three variables
fitted_model.params
Out[42]:
const   -2.562251
CRIM    -0.102941
RM       5.216955
LSTAT   -0.578486
dtype: float64
In [43]:
# Coefficients of the full model
fitted_multi_model2.params
Out[43]:
const   -7.108827
CRIM    -0.045293
RM       5.092238
LSTAT   -0.565133
B        0.008974
TAX     -0.006025
AGE      0.023619
ZN       0.029377
NOX      3.483832
INDUS    0.029270
dtype: float64
In [44]:
# Compare the residuals of the base and full models
import matplotlib.pyplot as plt
fitted_model.resid.plot(label="full")
fitted_multi_model2.resid.plot(label="full_add")
plt.legend()
Out[44]:
<matplotlib.legend.Legend at 0x253beadc070>
Checking multicollinearity with the correlation matrix and scatter plots
In [45]:
# Correlation matrix
x_data2.corr()
Out[45]:
 | CRIM | RM | LSTAT | B | TAX | AGE | ZN | NOX | INDUS
---|---|---|---|---|---|---|---|---|---
CRIM | 1.000000 | -0.219247 | 0.455621 | -0.385064 | 0.582764 | 0.352734 | -0.200469 | 0.420972 | 0.406583 |
RM | -0.219247 | 1.000000 | -0.613808 | 0.128069 | -0.292048 | -0.240265 | 0.311991 | -0.302188 | -0.391676 |
LSTAT | 0.455621 | -0.613808 | 1.000000 | -0.366087 | 0.543993 | 0.602339 | -0.412995 | 0.590879 | 0.603800 |
B | -0.385064 | 0.128069 | -0.366087 | 1.000000 | -0.441808 | -0.273534 | 0.175520 | -0.380051 | -0.356977 |
TAX | 0.582764 | -0.292048 | 0.543993 | -0.441808 | 1.000000 | 0.506456 | -0.314563 | 0.668023 | 0.720760 |
AGE | 0.352734 | -0.240265 | 0.602339 | -0.273534 | 0.506456 | 1.000000 | -0.569537 | 0.731470 | 0.644779 |
ZN | -0.200469 | 0.311991 | -0.412995 | 0.175520 | -0.314563 | -0.569537 | 1.000000 | -0.516604 | -0.533828 |
NOX | 0.420972 | -0.302188 | 0.590879 | -0.380051 | 0.668023 | 0.731470 | -0.516604 | 1.000000 | 0.763651 |
INDUS | 0.406583 | -0.391676 | 0.603800 | -0.356977 | 0.720760 | 0.644779 | -0.533828 | 0.763651 | 1.000000 |
In [46]:
# Visualize the correlation matrix as a heatmap
import seaborn as sns
cmap = sns.light_palette("darkgray", as_cmap=True)
sns.heatmap(x_data2.corr(), annot=True, cmap=cmap)
plt.show()
In [47]:
# Pairwise scatter plots of the features
sns.pairplot(x_data2)
plt.show()
Checking multicollinearity with VIF
In [48]:
from statsmodels.stats.outliers_influence import variance_inflation_factor
vif = pd.DataFrame()
vif["VIF Factor"] = [variance_inflation_factor(
x_data2.values, i) for i in range(x_data2.shape[1])]
vif["features"] = x_data2.columns
vif # 10만 넘어도 다중공선성이 있다고할 수 있는데 RM, NOX는 값이 크다..
Out[48]:
 | VIF Factor | features
---|---|---
0 | 1.917332 | CRIM |
1 | 46.535369 | RM |
2 | 8.844137 | LSTAT |
3 | 16.856737 | B |
4 | 19.923044 | TAX |
5 | 18.457503 | AGE |
6 | 2.086502 | ZN |
7 | 72.439753 | NOX |
8 | 12.642137 | INDUS |
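Under the hood, VIF_j = 1 / (1 - R_j²), where R_j² comes from regressing feature j on the remaining features. A manual check for RM, added here as a sketch; like variance_inflation_factor above, it fits the auxiliary regression without an intercept:

aux = sm.OLS(x_data2['RM'], x_data2.drop('RM', axis=1)).fit()
1 / (1 - aux.rsquared)  # ~46.54, matching the RM row above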
In [49]:
# VIF after dropping the NOX variable (x_data3)
vif = pd.DataFrame()
x_data3 = x_data2.drop('NOX', axis=1)
vif["VIF Factor"] = [variance_inflation_factor(x_data3.values, i) for i in range(x_data3.shape[1])]
vif["features"] = x_data3.columns
vif
Out[49]:
 | VIF Factor | features
---|---|---
0 | 1.916648 | CRIM |
1 | 30.806301 | RM |
2 | 8.171214 | LSTAT |
3 | 16.735751 | B |
4 | 18.727105 | TAX |
5 | 16.339792 | AGE |
6 | 2.074500 | ZN |
7 | 11.217461 | INDUS |
In [50]:
# VIF after dropping the NOX and RM variables (x_data4)
vif = pd.DataFrame()
x_data4 = x_data3.drop('RM', axis=1)
vif["VIF Factor"] = [variance_inflation_factor(x_data4.values, i) for i in range(x_data4.shape[1])]
vif["features"] = x_data4.columns
vif
Out[50]:
 | VIF Factor | features
---|---|---
0 | 1.907517 | CRIM |
1 | 7.933529 | LSTAT |
2 | 7.442569 | B |
3 | 16.233237 | TAX |
4 | 13.765377 | AGE |
5 | 1.820070 | ZN |
6 | 11.116823 | INDUS |
In [51]:
# Add a constant to x_data3 (NOX dropped) and x_data4 (NOX and RM dropped), then fit a model on each
x_data3_ = sm.add_constant(x_data3, has_constant='add')
x_data4_ = sm.add_constant(x_data4, has_constant='add')
multi_model3 = sm.OLS(target,x_data3_)
fitted_multi_model3=multi_model3.fit()
multi_model4 = sm.OLS(target,x_data4_)
fitted_multi_model4=multi_model4.fit()
In [52]:
# Compare the regression results
fitted_multi_model3.summary()
Out[52]:
Dep. Variable: | Target | R-squared: | 0.662 |
---|---|---|---|
Model: | OLS | Adj. R-squared: | 0.656 |
Method: | Least Squares | F-statistic: | 121.6 |
Date: | Fri, 20 Aug 2021 | Prob (F-statistic): | 7.62e-112 |
Time: | 16:04:44 | Log-Likelihood: | -1566.0 |
No. Observations: | 506 | AIC: | 3150. |
Df Residuals: | 497 | BIC: | 3188. |
Df Model: | 8 | ||
Covariance Type: | nonrobust |
 | coef | std err | t | P>|t| | [0.025 | 0.975]
---|---|---|---|---|---|---
const | -5.9162 | 3.596 | -1.645 | 0.101 | -12.981 | 1.149 |
CRIM | -0.0451 | 0.036 | -1.264 | 0.207 | -0.115 | 0.025 |
RM | 5.1027 | 0.458 | 11.138 | 0.000 | 4.203 | 6.003 |
LSTAT | -0.5628 | 0.057 | -9.825 | 0.000 | -0.675 | -0.450 |
B | 0.0087 | 0.003 | 2.880 | 0.004 | 0.003 | 0.015 |
TAX | -0.0056 | 0.002 | -2.344 | 0.019 | -0.010 | -0.001 |
AGE | 0.0287 | 0.013 | 2.179 | 0.030 | 0.003 | 0.055 |
ZN | 0.0284 | 0.013 | 2.130 | 0.034 | 0.002 | 0.055 |
INDUS | 0.0486 | 0.062 | 0.789 | 0.431 | -0.072 | 0.170 |
Omnibus: | 193.530 | Durbin-Watson: | 0.849 |
---|---|---|---|
Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 843.773 |
Skew: | 1.677 | Prob(JB): | 5.98e-184 |
Kurtosis: | 8.364 | Cond. No. | 8.44e+03 |
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 8.44e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
In [53]:
fitted_multi_model4.summary()
# Dropping RM clearly lowers the R-squared (a one-line comparison follows the table below).
Out[53]:
Dep. Variable: | Target | R-squared: | 0.577 |
---|---|---|---|
Model: | OLS | Adj. R-squared: | 0.571 |
Method: | Least Squares | F-statistic: | 97.20 |
Date: | Fri, 20 Aug 2021 | Prob (F-statistic): | 5.53e-89 |
Time: | 16:04:49 | Log-Likelihood: | -1622.3 |
No. Observations: | 506 | AIC: | 3261. |
Df Residuals: | 498 | BIC: | 3294. |
Df Model: | 7 | ||
Covariance Type: | nonrobust |
 | coef | std err | t | P>|t| | [0.025 | 0.975]
---|---|---|---|---|---|---
const | 29.6634 | 1.844 | 16.087 | 0.000 | 26.041 | 33.286 |
CRIM | -0.0329 | 0.040 | -0.825 | 0.410 | -0.111 | 0.045 |
LSTAT | -0.9256 | 0.053 | -17.589 | 0.000 | -1.029 | -0.822 |
B | 0.0046 | 0.003 | 1.384 | 0.167 | -0.002 | 0.011 |
TAX | -0.0048 | 0.003 | -1.814 | 0.070 | -0.010 | 0.000 |
AGE | 0.0703 | 0.014 | 4.993 | 0.000 | 0.043 | 0.098 |
ZN | 0.0513 | 0.015 | 3.490 | 0.001 | 0.022 | 0.080 |
INDUS | -0.0357 | 0.068 | -0.523 | 0.601 | -0.170 | 0.098 |
Omnibus: | 138.742 | Durbin-Watson: | 0.960 |
---|---|---|---|
Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 316.077 |
Skew: | 1.427 | Prob(JB): | 2.32e-69 |
Kurtosis: | 5.617 | Cond. No. | 3.85e+03 |
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 3.85e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
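For a one-line comparison of the three specifications, adjusted R² tells the same story; a small sketch added here:

print(fitted_multi_model2.rsquared_adj,  # all nine predictors: ~0.656
      fitted_multi_model3.rsquared_adj,  # NOX dropped: ~0.656
      fitted_multi_model4.rsquared_adj)  # NOX and RM dropped: ~0.571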
Train / test split
In [54]:
from sklearn.model_selection import train_test_split
X = x_data2_  # data containing all the variables
y = target
train_x, test_x, train_y, test_y = train_test_split(X, y, train_size=0.7, test_size=0.3,random_state = 1)
print(train_x.shape, test_x.shape, train_y.shape, test_y.shape)
(354, 10) (152, 10) (354, 1) (152, 1)
In [55]:
# train_x already includes the constant column (it came from x_data2_); fit the regression model
train_x.head()
fit_1 = sm.OLS(train_y,train_x)
fit_1 = fit_1.fit()
In [56]:
## Compare predictions on the test data with the true values (an out-of-sample R² check follows the plot)
plt.plot(np.array(fit_1.predict(test_x)),label="pred")
plt.plot(np.array(test_y),label="true")
plt.legend()
plt.show()
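Beyond eyeballing the plot, an out-of-sample R² summarizes test-set fit in a single number; a sketch using scikit-learn's r2_score, added here:

from sklearn.metrics import r2_score
r2_score(test_y['Target'], fit_1.predict(test_x))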
In [57]:
## Train/test splits for x_data3 and x_data4
X = x_data3_  # data with the NOX variable dropped
y = target
train_x2, test_x2, train_y2, test_y2 = train_test_split(X, y, train_size=0.7, test_size=0.3,random_state = 1)
X = x_data4_  # data with the NOX and RM variables dropped
y = target
train_x3, test_x3, train_y3, test_y3 = train_test_split(X, y, train_size=0.7, test_size=0.3,random_state = 1)
In [58]:
# Fit regression models on x_data3 / x_data4 (fit_2, fit_3)
fit_2 = sm.OLS(train_y2,train_x2)
fit_2 = fit_2.fit()
fit_3 = sm.OLS(train_y3,train_x3)
fit_3 = fit_3.fit()
In [59]:
# Compare test-set prediction errors across the three data variants
plt.plot(np.array(test_y2['Target']-fit_1.predict(test_x)),label="pred_full")
plt.plot(np.array(test_y2['Target']-fit_2.predict(test_x2)),label="pred_vif")
plt.plot(np.array(test_y2['Target']-fit_3.predict(test_x3)),label="pred_vif2")
plt.legend()
plt.show()
Comparing test-set performance with MSE
In [60]:
from sklearn.metrics import mean_squared_error
mean_squared_error(test_y['Target'],fit_1.predict(test_x))
Out[60]:
26.148631468819886
In [61]:
mean_squared_error(test_y['Target'],fit_2.predict(test_x2))
Out[61]:
26.140062609846407
In [62]:
mean_squared_error(test_y['Target'],fit_3.predict(test_x3))
# This model has the highest MSE, so it is the worst of the three (a single-loop comparison follows below).
Out[62]:
38.78845317912829
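The same comparison in a single loop; a sketch added here, reusing the fitted models and test splits from above:

for name, fit, tx in [('full', fit_1, test_x),
                      ('drop NOX', fit_2, test_x2),
                      ('drop NOX+RM', fit_3, test_x3)]:
    print(name, mean_squared_error(test_y['Target'], fit.predict(tx)))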
In [63]:
from IPython.core.display import display, HTML
display(HTML("<style>.container {width:80% !important;}</style>"))