A hands-on walkthrough of basic simple linear regression using this dataset
In [1]:
# Core libraries
import os
import pandas as pd
import numpy as np
import statsmodels.api as sm
In [2]:
# Load the data
boston = pd.read_csv('C:/Users/설위준/Desktop/05-11--machine-learning/Part 05~11) Machine Learning/06. 회귀분석/실습코드/Boston_house.csv')
boston.head()
Out[2]:
 | AGE | B | RM | CRIM | DIS | INDUS | LSTAT | NOX | PTRATIO | RAD | ZN | TAX | CHAS | Target
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
0 | 65.2 | 396.90 | 6.575 | 0.00632 | 4.0900 | 2.31 | 4.98 | 0.538 | 15.3 | 1 | 18.0 | 296 | 0 | 24.0 |
1 | 78.9 | 396.90 | 6.421 | 0.02731 | 4.9671 | 7.07 | 9.14 | 0.469 | 17.8 | 2 | 0.0 | 242 | 0 | 21.6 |
2 | 61.1 | 392.83 | 7.185 | 0.02729 | 4.9671 | 7.07 | 4.03 | 0.469 | 17.8 | 2 | 0.0 | 242 | 0 | 34.7 |
3 | 45.8 | 394.63 | 6.998 | 0.03237 | 6.0622 | 2.18 | 2.94 | 0.458 | 18.7 | 3 | 0.0 | 222 | 0 | 33.4 |
4 | 54.2 | 396.90 | 7.147 | 0.06905 | 6.0622 | 2.18 | 5.33 | 0.458 | 18.7 | 3 | 0.0 | 222 | 0 | 36.2 |
In [3]:
# Keep only the features, excluding the target
boston_data = boston.drop('Target',axis=1)
In [4]:
boston_data.describe()
# Summary statistics for the features
Out[4]:
 | AGE | B | RM | CRIM | DIS | INDUS | LSTAT | NOX | PTRATIO | RAD | ZN | TAX | CHAS
---|---|---|---|---|---|---|---|---|---|---|---|---|---
count | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 |
mean | 68.574901 | 356.674032 | 6.284634 | 3.613524 | 3.795043 | 11.136779 | 12.653063 | 0.554695 | 18.455534 | 9.549407 | 11.363636 | 408.237154 | 0.069170 |
std | 28.148861 | 91.294864 | 0.702617 | 8.601545 | 2.105710 | 6.860353 | 7.141062 | 0.115878 | 2.164946 | 8.707259 | 23.322453 | 168.537116 | 0.253994 |
min | 2.900000 | 0.320000 | 3.561000 | 0.006320 | 1.129600 | 0.460000 | 1.730000 | 0.385000 | 12.600000 | 1.000000 | 0.000000 | 187.000000 | 0.000000 |
25% | 45.025000 | 375.377500 | 5.885500 | 0.082045 | 2.100175 | 5.190000 | 6.950000 | 0.449000 | 17.400000 | 4.000000 | 0.000000 | 279.000000 | 0.000000 |
50% | 77.500000 | 391.440000 | 6.208500 | 0.256510 | 3.207450 | 9.690000 | 11.360000 | 0.538000 | 19.050000 | 5.000000 | 0.000000 | 330.000000 | 0.000000 |
75% | 94.075000 | 396.225000 | 6.623500 | 3.677083 | 5.188425 | 18.100000 | 16.955000 | 0.624000 | 20.200000 | 24.000000 | 12.500000 | 666.000000 | 0.000000 |
max | 100.000000 | 396.900000 | 8.780000 | 88.976200 | 12.126500 | 27.740000 | 37.970000 | 0.871000 | 22.000000 | 24.000000 | 100.000000 | 711.000000 | 1.000000 |
In [5]:
# Target variable
# 1978 Boston housing prices
# Median home value for each of 506 towns (in units of $1,000)
# Feature descriptions
# CRIM: per-capita crime rate
# INDUS: proportion of non-retail business acres
# NOX: nitric oxides concentration
# RM: average number of rooms per dwelling
# LSTAT: percentage of lower-status population
# B: proportion of Black residents
# PTRATIO: pupil-teacher ratio
# ZN: proportion of residential land zoned for lots over 25,000 sq. ft.
# CHAS: 1 if the tract bounds the Charles River, 0 otherwise
# AGE: proportion of homes built before 1940
# RAD: index of accessibility to radial highways
# DIS: weighted distance to employment centers
# TAX: property-tax rate
Simple linear regression with each of the three variables: crim / rm / lstat
In [5]:
## Set up the variables: target / crim / rm / lstat
target = boston[['Target']]
crim = boston[['CRIM']]
rm = boston[['RM']]
lstat = boston[['LSTAT']]
Linear regression: target ~ crim
In [6]:
# Add a constant (intercept) column to crim
crim1 = sm.add_constant(crim,has_constant="add")
crim1
Out[6]:
 | const | CRIM
---|---|---
0 | 1.0 | 0.00632 |
1 | 1.0 | 0.02731 |
2 | 1.0 | 0.02729 |
3 | 1.0 | 0.03237 |
4 | 1.0 | 0.06905 |
... | ... | ... |
501 | 1.0 | 0.06263 |
502 | 1.0 | 0.04527 |
503 | 1.0 | 0.06076 |
504 | 1.0 | 0.10959 |
505 | 1.0 | 0.04741 |
506 rows × 2 columns
In [7]:
# Fit with sm.OLS (an equivalent formula-API one-liner follows)
model1 = sm.OLS(target,crim1)
fitted_model1 = model1.fit()
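As an aside, the same model can be fit with the statsmodels formula API, which adds the intercept automatically; a minimal sketch, not a cell from the original notebook:

import statsmodels.formula.api as smf
smf.ols('Target ~ CRIM', data=boston).fit().params  # same const/CRIM estimates as fitted_model1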
In [8]:
# Print the results with summary()
fitted_model1.summary()
Out[8]:
Dep. Variable: | Target | R-squared: | 0.151 |
---|---|---|---|
Model: | OLS | Adj. R-squared: | 0.149 |
Method: | Least Squares | F-statistic: | 89.49 |
Date: | Fri, 20 Aug 2021 | Prob (F-statistic): | 1.17e-19 |
Time: | 15:51:04 | Log-Likelihood: | -1798.9 |
No. Observations: | 506 | AIC: | 3602. |
Df Residuals: | 504 | BIC: | 3610. |
Df Model: | 1 | ||
Covariance Type: | nonrobust |
 | coef | std err | t | P>|t| | [0.025 | 0.975]
---|---|---|---|---|---|---
const | 24.0331 | 0.409 | 58.740 | 0.000 | 23.229 | 24.837 |
CRIM | -0.4152 | 0.044 | -9.460 | 0.000 | -0.501 | -0.329 |
Omnibus: | 139.832 | Durbin-Watson: | 0.713 |
---|---|---|---|
Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 295.404 |
Skew: | 1.490 | Prob(JB): | 7.14e-65 |
Kurtosis: | 5.264 | Cond. No. | 10.1 |
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
- R² is 0.151, so the model explains only about 15% of the variance in Target.
- The CRIM coefficient is -0.4152, a negative effect, and its p-value indicates it is highly statistically significant.
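As a quick sanity check, R² can be recomputed from the residual and total sums of squares exposed on the results object; a small sketch added here, using standard OLSResults attributes:

print(1 - fitted_model1.ssr / fitted_model1.centered_tss)  # R^2 = 1 - SSR/SST
print(fitted_model1.rsquared)                              # same value as above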
In [9]:
## Print the regression coefficients
fitted_model1.params
Out[9]:
const    24.033106
CRIM     -0.415190
dtype: float64
Computing y_hat = beta0 + beta1 * X by hand
In [10]:
# Multiply the design matrix (X) by the coefficients
# Method 1 for computing y_hat
np.dot(crim1,fitted_model1.params)
Out[10]:
array([ 24.03048217, 24.02176733, 24.02177563, 24.01966646, 24.00443729, 24.02071274, 23.99644902, 23.97309042, 23.94540138, 23.96250722, 23.93973403, 23.98433377, 23.99416963, 23.77163594, 23.76823138, 23.77261995, 23.59552468, 23.70751396, 23.69982879, 23.73176107, 23.51337514, 23.67934745, 23.52139661, 23.62271965, 23.72160552, 23.68412214, 23.75413567, 23.63627976, 23.71216824, 23.61689868, 23.56360486, 23.4706396 , 23.45682622, 23.55492323, 23.36347899, 24.00646341, 23.99265003, 23.99983283, 23.96042712, 24.02163447, 24.01915993, 23.98019433, 23.97435675, 23.96694145, 23.98216648, 23.96193426, 23.95490093, 23.9379155 , 23.92770182, 23.94185981, 23.99626634, 24.01509937, 24.01085198, 24.01242555, 24.02745959, 24.02766303, 24.02457401, 24.02716065, 23.96898004, 23.99022532, 23.97110996, 23.96181385, 23.98732314, 23.9805846 , 24.02500581, 24.01822575, 24.01492499, 24.00907081, 23.97683128, 23.97989539, 23.99646148, 23.96719057, 23.99505814, 23.95198215, 24.00032275, 23.99361327, 23.99095191, 23.99695556, 24.00966453, 23.99828417, 24.0160294 , 24.01458038, 24.01791436, 24.01836277, 24.0121017 , 24.00929501, 24.0115661 , 24.00341592, 24.0096064 , 24.01109279, 24.01365866, 24.01678089, 24.01565573, 24.02116945, 24.0152779 , 23.98243635, 23.98534268, 23.98293873, 23.99911455, 24.00462412, 23.97138399, 23.98564162, 23.93812725, 23.94524776, 23.97514561, 23.97804364, 23.9620256 , 23.97864567, 23.97995351, 23.92364956, 23.98829469, 23.99123839, 23.98191736, 23.94088411, 23.97402045, 23.96196747, 23.97847544, 23.97042075, 23.97889063, 23.97300323, 24.0044622 , 24.00335779, 23.99449763, 23.97066986, 23.99221408, 23.96293071, 23.87228222, 23.92550961, 23.8979908 , 23.66721974, 23.89191657, 23.53780908, 23.78812315, 23.89616812, 23.62780988, 23.80152134, 23.89914918, 23.88682218, 23.92939164, 23.80702676, 23.91232732, 23.35691068, 22.6542385 , 22.33190553, 22.87898515, 23.04522734, 23.13835037, 23.04967818, 23.06530179, 22.89798841, 23.34530196, 23.41184866, 23.56536111, 23.14078753, 23.4460894 , 22.56540439, 23.01726842, 23.52508765, 23.47557206, 23.44145172, 23.50437796, 23.42553333, 23.2717427 , 23.40242384, 23.1021001 , 22.8190898 , 23.19849483, 23.28564742, 23.07800246, 23.01608513, 23.53179713, 23.07239739, 23.9753366 , 23.99500001, 23.99803505, 24.00543789, 24.00395151, 24.0105821 , 24.00552924, 24.00910818, 24.00575344, 24.00450787, 23.9953114 , 23.99155393, 23.99861217, 24.00799962, 24.00984721, 24.00040994, 23.98087939, 23.99835475, 23.99545672, 24.00441237, 23.99713409, 24.02402596, 24.02713159, 24.0273724 , 24.01645289, 24.0137334 , 24.0174618 , 24.02002768, 24.02572409, 24.01880287, 24.02406748, 24.018533 , 24.024765 , 23.97646592, 23.93774112, 23.92848238, 23.97669427, 23.85220362, 23.96067208, 23.87708597, 23.942931 , 23.97476364, 23.91288783, 23.9508902 , 24.0141735 , 24.00398888, 23.98714876, 23.98567068, 23.88443069, 23.86382895, 23.77421012, 23.77788871, 23.90218422, 23.81432996, 23.87444536, 23.86189001, 23.90930059, 23.84968341, 23.81014899, 23.84088968, 23.79425136, 23.89548305, 23.8471383 , 23.89590655, 23.81696642, 23.82059933, 23.99887789, 23.99469277, 23.98606927, 23.98904618, 23.99038309, 23.98014035, 23.94754376, 23.95366782, 23.89201206, 23.95149222, 23.96485304, 23.95391693, 23.97485498, 23.94421809, 23.99897338, 23.87992587, 24.01309815, 24.01837522, 24.02672055, 23.77920071, 23.75762327, 23.76047148, 23.80885775, 23.81134474, 23.8171491 , 23.69046625, 23.80472246, 23.71688895, 23.70689117, 23.79298503, 23.80869583, 23.99546918, 23.90889785, 23.96579968, 23.98552537, 
23.94098376, 24.00967283, 23.9932313 , 23.9896399 , 24.00766747, 23.99998229, 23.94575844, 24.01825067, 24.01772337, 24.00765916, 24.02687417, 24.02934455, 24.02855569, 24.02494769, 24.01703416, 24.01404894, 24.01526545, 24.01856621, 24.00036427, 24.01809705, 23.9987907 , 23.99906472, 23.97941377, 24.01080215, 23.97455189, 24.00625997, 24.01001744, 24.01476722, 24.01842089, 23.99463464, 23.99158715, 24.01020843, 24.0103579 , 24.00195445, 24.01262899, 23.82842567, 23.88803869, 22.9388805 , 23.70493563, 23.92445503, 23.92126222, 23.87981792, 23.92783053, 23.90096356, 23.93129321, 23.86619138, 23.83569565, 23.96352028, 23.95771177, 23.88731626, 23.91522535, 23.89148892, 23.95344777, 23.90710838, 23.93303286, 24.00563303, 24.00518878, 24.01423993, 24.01225117, 24.01871568, 24.01200205, 24.01758636, 24.01666049, 24.0188776 , 24.02048024, 24.01937998, 24.01028316, 24.00756782, 24.02770455, 24.02273472, 24.02254789, 24.02044702, 24.0201813 , 24.00752215, 24.02534212, 24.02687417, 24.02106981, 24.00731871, 24.00009855, 24.00302979, 24.02601057, 24.01524884, 23.98885104, 20.30346852, 22.43474816, 21.87338184, 22.26385169, 22.14734515, 22.44008751, 22.50594499, 22.2800109 , 22.5906189 , 22.14155324, 22.49816848, 18.4188202 , 21.99941285, 21.6789856 , 21.31827659, 20.19994497, 20.60062435, 19.42113105, 16.35283338, 15.8915985 , 17.68567721, 19.95448863, 14.21460344, 16.61502604, -12.90894703, 17.44220963, 20.21874479, 20.71470618, 15.69405096, 17.05301026, 13.90503757, 14.65100995, 18.08189329, 20.64858298, 21.14248918, 21.83548327, 19.22607466, 20.44388587, 18.4862471 , 20.41399632, 21.5950881 , 20.84775806, 8.10981167, 19.91585102, 13.63420895, 18.12237434, 20.04906067, 13.73568146, 6.79058608, -4.16694965, 15.43194134, 19.07112564, 20.95908303, 18.03846438, 2.80201916, 18.19939214, 16.22296186, 12.13549661, 5.0397702 , 16.52455607, 19.53485167, 13.26282125, -6.49753724, 19.12875405, 19.42972549, 21.11739508, 19.03081067, 21.10584033, 20.38270343, 17.44806381, 18.9481878 , 8.39625145, 20.97435373, 20.15568984, 20.50725636, 19.85533704, 21.35759926, 21.71590017, 18.25639776, 19.3994166 , 18.04573021, 17.73168029, 18.35409203, 20.13420789, 14.87770384, 19.99572118, 21.68048444, 19.89509566, 18.71771568, 19.60227857, 21.42236064, 19.91240494, 20.1597587 , 20.90837999, 21.24397414, 21.77399775, 21.91971708, 20.60857939, 20.08313949, 22.05996835, 22.09465335, 20.62830508, 20.81445565, 21.20932651, 22.03515658, 22.49976281, 21.27004809, 21.61622129, 20.77829672, 22.71961021, 22.46577118, 22.19701851, 17.56622696, 18.60445177, 22.22753085, 22.3563976 , 22.55142493, 22.10376262, 20.68842049, 21.3787449 , 22.0105441 , 17.79553655, 19.78446406, 18.08189329, 21.61503384, 21.66312533, 21.65358426, 22.8629422 , 23.04554703, 22.50783411, 21.66994691, 22.025383 , 23.97047057, 23.95697273, 23.9469708 , 23.98920395, 23.98688719, 23.96114955, 23.91703143, 23.95879127, 23.91286707, 23.92167741, 23.93382587, 23.95927289, 23.93994578, 24.00710281, 24.01431051, 24.00787921, 23.98760547, 24.013422 ])
In [11]:
## Method 2: compute y_hat with the predict() function
pred1 = fitted_model1.predict(crim1)
pred1
Out[11]:
0      24.030482
1      24.021767
2      24.021776
3      24.019666
4      24.004437
         ...
501    24.007103
502    24.014311
503    24.007879
504    23.987605
505    24.013422
Length: 506, dtype: float64
In [12]:
## Difference between the hand-computed y_hat and the one from predict()
np.dot(crim1,fitted_model1.params) - fitted_model1.predict(crim1)
# The two are identical!
Out[12]:
0      0.0
1      0.0
2      0.0
3      0.0
4      0.0
      ...
501    0.0
502    0.0
503    0.0
504    0.0
505    0.0
Length: 506, dtype: float64
Visualizing the fitted line
In [13]:
import matplotlib.pyplot as plt
plt.yticks(fontname="Arial")
plt.scatter(crim, target, label="data")  # the data points
plt.plot(crim, pred1, label="result")    # the fitted regression line
plt.legend()
plt.show()
In [14]:
plt.scatter(target,pred1)  # scatter of actual vs. predicted values
plt.xlabel("real_value")
plt.ylabel("pred_value")
plt.show()
# The fit does not look very good.
In [15]:
## Plot the residuals
fitted_model1.resid.plot()
plt.xlabel("residual_number")
plt.show()
# The residuals are clearly large.
In [16]:
## Compute the sum of the residuals
np.sum(fitted_model1.resid)
# The sum of the residuals is essentially zero (see the orthogonality check below).
Out[16]:
-5.684341886080801e-13
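This is no accident: when the design matrix contains an intercept, the normal equations force X'e = 0, so the residuals are orthogonal to every column of X, the constant column included. A one-line check, added here as a sketch:

np.dot(crim1.T, fitted_model1.resid)  # both entries are numerically ~0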
Fitting simple linear regressions with the rm and lstat variables, same as above
In [17]:
# Add constant terms
rm1 = sm.add_constant(rm,has_constant="add")
lstat1 = sm.add_constant(lstat,has_constant="add")
In [18]:
# Fit the regression models
model2 = sm.OLS(target,rm1)
fitted_model2 = model2.fit()
model3 = sm.OLS(target,lstat1)
fitted_model3 = model3.fit()
In [19]:
# Results for the rm model
fitted_model2.summary()
Out[19]:
Dep. Variable: | Target | R-squared: | 0.484 |
---|---|---|---|
Model: | OLS | Adj. R-squared: | 0.483 |
Method: | Least Squares | F-statistic: | 471.8 |
Date: | Fri, 20 Aug 2021 | Prob (F-statistic): | 2.49e-74 |
Time: | 15:54:32 | Log-Likelihood: | -1673.1 |
No. Observations: | 506 | AIC: | 3350. |
Df Residuals: | 504 | BIC: | 3359. |
Df Model: | 1 | ||
Covariance Type: | nonrobust |
 | coef | std err | t | P>|t| | [0.025 | 0.975]
---|---|---|---|---|---|---
const | -34.6706 | 2.650 | -13.084 | 0.000 | -39.877 | -29.465 |
RM | 9.1021 | 0.419 | 21.722 | 0.000 | 8.279 | 9.925 |
Omnibus: | 102.585 | Durbin-Watson: | 0.684 |
---|---|---|---|
Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 612.449 |
Skew: | 0.726 | Prob(JB): | 1.02e-133 |
Kurtosis: | 8.190 | Cond. No. | 58.4 |
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
In [20]:
# Results for the lstat model
fitted_model3.summary()
Out[20]:
Dep. Variable: | Target | R-squared: | 0.544 |
---|---|---|---|
Model: | OLS | Adj. R-squared: | 0.543 |
Method: | Least Squares | F-statistic: | 601.6 |
Date: | Fri, 20 Aug 2021 | Prob (F-statistic): | 5.08e-88 |
Time: | 15:54:34 | Log-Likelihood: | -1641.5 |
No. Observations: | 506 | AIC: | 3287. |
Df Residuals: | 504 | BIC: | 3295. |
Df Model: | 1 | ||
Covariance Type: | nonrobust |
 | coef | std err | t | P>|t| | [0.025 | 0.975]
---|---|---|---|---|---|---
const | 34.5538 | 0.563 | 61.415 | 0.000 | 33.448 | 35.659 |
LSTAT | -0.9500 | 0.039 | -24.528 | 0.000 | -1.026 | -0.874 |
Omnibus: | 137.043 | Durbin-Watson: | 0.892 |
---|---|---|---|
Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 291.373 |
Skew: | 1.453 | Prob(JB): | 5.36e-64 |
Kurtosis: | 5.319 | Cond. No. | 29.7 |
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
In [21]:
## Predict y_hat with each model
pred2 = fitted_model2.predict(rm1)
pred3 = fitted_model3.predict(lstat1)
In [22]:
## Visualize the rm model
import matplotlib.pyplot as plt
plt.scatter(rm,target,label="data")
plt.plot(rm,pred2,label="result")
plt.legend()
plt.show()
In [23]:
## Visualize the lstat fitted line
import matplotlib.pyplot as plt
plt.scatter(lstat,target,label="data")
plt.plot(lstat,pred3,label="result")
plt.legend()
plt.show()
In [24]:
# Plot the residuals of the rm model
fitted_model2.resid.plot()
plt.xlabel("residual_number")
plt.show()
In [25]:
# Plot the residuals of the lstat model
fitted_model3.resid.plot()
plt.xlabel("residual_number")
plt.show()
In [26]:
## Compare the residuals of the three models
fitted_model1.resid.plot(label="crim")
fitted_model2.resid.plot(label="rm")
fitted_model3.resid.plot(label="lstat")
plt.legend()
Out[26]:
<matplotlib.legend.Legend at 0x253bdbe4250>
In [27]:
## Select only the CRIM, RM, LSTAT columns from the boston data
x_data = boston[['CRIM','RM','LSTAT']]
x_data.head()
Out[27]:
 | CRIM | RM | LSTAT
---|---|---|---
0 | 0.00632 | 6.575 | 4.98 |
1 | 0.02731 | 6.421 | 9.14 |
2 | 0.02729 | 7.185 | 4.03 |
3 | 0.03237 | 6.998 | 2.94 |
4 | 0.06905 | 7.147 | 5.33 |
In [28]:
# Add a constant term
x_data1 = sm.add_constant(x_data,has_constant="add")
x_data1
Out[28]:
 | const | CRIM | RM | LSTAT
---|---|---|---|---
0 | 1.0 | 0.00632 | 6.575 | 4.98 |
1 | 1.0 | 0.02731 | 6.421 | 9.14 |
2 | 1.0 | 0.02729 | 7.185 | 4.03 |
3 | 1.0 | 0.03237 | 6.998 | 2.94 |
4 | 1.0 | 0.06905 | 7.147 | 5.33 |
... | ... | ... | ... | ... |
501 | 1.0 | 0.06263 | 6.593 | 9.67 |
502 | 1.0 | 0.04527 | 6.120 | 9.08 |
503 | 1.0 | 0.06076 | 6.976 | 5.64 |
504 | 1.0 | 0.10959 | 6.794 | 6.48 |
505 | 1.0 | 0.04741 | 6.030 | 7.88 |
506 rows × 4 columns
In [29]:
# Fit the regression model
multi_model = sm.OLS(target,x_data1)
fitted_model = multi_model.fit()
In [30]:
# Print the results with summary()
fitted_model.summary()
Out[30]:
Dep. Variable: | Target | R-squared: | 0.646 |
---|---|---|---|
Model: | OLS | Adj. R-squared: | 0.644 |
Method: | Least Squares | F-statistic: | 305.2 |
Date: | Fri, 20 Aug 2021 | Prob (F-statistic): | 1.01e-112 |
Time: | 15:56:17 | Log-Likelihood: | -1577.6 |
No. Observations: | 506 | AIC: | 3163. |
Df Residuals: | 502 | BIC: | 3180. |
Df Model: | 3 | ||
Covariance Type: | nonrobust |
 | coef | std err | t | P>|t| | [0.025 | 0.975]
---|---|---|---|---|---|---
const | -2.5623 | 3.166 | -0.809 | 0.419 | -8.783 | 3.658 |
CRIM | -0.1029 | 0.032 | -3.215 | 0.001 | -0.166 | -0.040 |
RM | 5.2170 | 0.442 | 11.802 | 0.000 | 4.348 | 6.085 |
LSTAT | -0.5785 | 0.048 | -12.135 | 0.000 | -0.672 | -0.485 |
Omnibus: | 171.754 | Durbin-Watson: | 0.822 |
---|---|---|---|
Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 628.308 |
Skew: | 1.535 | Prob(JB): | 3.67e-137 |
Kurtosis: | 7.514 | Cond. No. | 216. |
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Comparing with the simple linear regression coefficients
In [31]:
## Coefficients of the simple linear regressions
print(fitted_model1.params)
print(fitted_model2.params)
print(fitted_model3.params)
const    24.033106
CRIM     -0.415190
dtype: float64
const   -34.670621
RM        9.102109
dtype: float64
const    34.553841
LSTAT    -0.950049
dtype: float64
In [32]:
## Coefficients of the multiple linear regression
print(fitted_model.params)
const   -2.562251
CRIM    -0.102941
RM       5.216955
LSTAT   -0.578486
dtype: float64
The magnitudes of the coefficients have shrunk across the board: in the multiple regression each coefficient is a partial effect, estimated with the other predictors held fixed, so the simple-regression slopes that absorbed the effects of correlated variables get smaller. A compact side-by-side view is sketched below.
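A small comparison table, added here and reusing the fitted models above:

pd.DataFrame({
    'simple': {'CRIM': fitted_model1.params['CRIM'],
               'RM': fitted_model2.params['RM'],
               'LSTAT': fitted_model3.params['LSTAT']},
    'multiple': fitted_model.params[['CRIM', 'RM', 'LSTAT']],
})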
Computing beta with matrix algebra
In [33]:
from numpy import linalg  # beta via matrix algebra: beta_hat = (X'X)^-1 X'y
ba = linalg.inv(np.dot(x_data1.T,x_data1))  # linalg.inv computes the matrix inverse
np.dot(np.dot(ba, x_data1.T),target)  # matches the regression coefficients! (verified below)
Out[33]:
array([[-2.56225101], [-0.10294089], [ 5.21695492], [-0.57848582]])
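To confirm the two routes agree numerically, a one-liner added here as a sketch:

beta_hat = np.dot(np.dot(ba, x_data1.T), target)
np.allclose(beta_hat.ravel(), fitted_model.params.values)  # True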
In [34]:
# Compute y_hat
pred4 = fitted_model.predict(x_data1)
pred4
# a = np.dot(np.dot(ba, x_data1.T), target)
# np.dot(x_data1, a) prints the same result as above
Out[34]:
0      28.857718
1      25.645645
2      32.587463
3      32.241919
4      31.632888
         ...
501    26.232728
502    24.108202
503    30.562312
504    29.121871
505    24.332638
Length: 506, dtype: float64
Residual plot
In [35]:
fitted_model.resid.plot()
plt.xlabel("residual_number")
plt.show()
In [36]:
fitted_model1.resid.plot(label="crim")
fitted_model2.resid.plot(label="rm")
fitted_model3.resid.plot(label="lstat")
fitted_model.resid.plot(label="full")
plt.legend()
Out[36]:
<matplotlib.legend.Legend at 0x253bdbbeb50>
Multiple linear regression with the crim, rm, lstat, b, tax, age, zn, nox, indus variables
In [39]:
# Select all nine features from the boston data
x_data2 = boston[['CRIM','RM','LSTAT','B','TAX','AGE','ZN','NOX','INDUS']]
x_data2.head()
Out[39]:
 | CRIM | RM | LSTAT | B | TAX | AGE | ZN | NOX | INDUS
---|---|---|---|---|---|---|---|---|---
0 | 0.00632 | 6.575 | 4.98 | 396.90 | 296 | 65.2 | 18.0 | 0.538 | 2.31 |
1 | 0.02731 | 6.421 | 9.14 | 396.90 | 242 | 78.9 | 0.0 | 0.469 | 7.07 |
2 | 0.02729 | 7.185 | 4.03 | 392.83 | 242 | 61.1 | 0.0 | 0.469 | 7.07 |
3 | 0.03237 | 6.998 | 2.94 | 394.63 | 222 | 45.8 | 0.0 | 0.458 | 2.18 |
4 | 0.06905 | 7.147 | 5.33 | 396.90 | 222 | 54.2 | 0.0 | 0.458 | 2.18 |
In [40]:
# Add a constant term
x_data2_ = sm.add_constant(x_data2, has_constant='add')
# Fit the regression model
multi_model2 = sm.OLS(target,x_data2_)
fitted_multi_model2 = multi_model2.fit()
# Print the results
fitted_multi_model2.summary()
Out[40]:
Dep. Variable: | Target | R-squared: | 0.662 |
---|---|---|---|
Model: | OLS | Adj. R-squared: | 0.656 |
Method: | Least Squares | F-statistic: | 108.1 |
Date: | Fri, 20 Aug 2021 | Prob (F-statistic): | 5.76e-111 |
Time: | 15:59:41 | Log-Likelihood: | -1565.5 |
No. Observations: | 506 | AIC: | 3151. |
Df Residuals: | 496 | BIC: | 3193. |
Df Model: | 9 | ||
Covariance Type: | nonrobust |
 | coef | std err | t | P>|t| | [0.025 | 0.975]
---|---|---|---|---|---|---
const | -7.1088 | 3.828 | -1.857 | 0.064 | -14.631 | 0.413 |
CRIM | -0.0453 | 0.036 | -1.269 | 0.205 | -0.115 | 0.025 |
RM | 5.0922 | 0.458 | 11.109 | 0.000 | 4.192 | 5.993 |
LSTAT | -0.5651 | 0.057 | -9.854 | 0.000 | -0.678 | -0.452 |
B | 0.0090 | 0.003 | 2.952 | 0.003 | 0.003 | 0.015 |
TAX | -0.0060 | 0.002 | -2.480 | 0.013 | -0.011 | -0.001 |
AGE | 0.0236 | 0.014 | 1.653 | 0.099 | -0.004 | 0.052 |
ZN | 0.0294 | 0.013 | 2.198 | 0.028 | 0.003 | 0.056 |
NOX | 3.4838 | 3.833 | 0.909 | 0.364 | -4.047 | 11.014 |
INDUS | 0.0293 | 0.065 | 0.449 | 0.654 | -0.099 | 0.157 |
Omnibus: | 195.490 | Durbin-Watson: | 0.848 |
---|---|---|---|
Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 872.873 |
Skew: | 1.686 | Prob(JB): | 2.87e-190 |
Kurtosis: | 8.479 | Cond. No. | 1.04e+04 |
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.04e+04. This might indicate that there are
strong multicollinearity or other numerical problems.
In [42]:
# Coefficients of the model with only the three variables
fitted_model.params
Out[42]:
const   -2.562251
CRIM    -0.102941
RM       5.216955
LSTAT   -0.578486
dtype: float64
In [43]:
# Coefficients of the full model
fitted_multi_model2.params
Out[43]:
const   -7.108827
CRIM    -0.045293
RM       5.092238
LSTAT   -0.565133
B        0.008974
TAX     -0.006025
AGE      0.023619
ZN       0.029377
NOX      3.483832
INDUS    0.029270
dtype: float64
In [44]:
# Compare the residuals of the base and full models
import matplotlib.pyplot as plt
fitted_model.resid.plot(label="full")
fitted_multi_model2.resid.plot(label="full_add")
plt.legend()
Out[44]:
<matplotlib.legend.Legend at 0x253beadc070>
Checking multicollinearity with the correlation matrix and scatter plots
In [45]:
# Correlation matrix
x_data2.corr()
Out[45]:
 | CRIM | RM | LSTAT | B | TAX | AGE | ZN | NOX | INDUS
---|---|---|---|---|---|---|---|---|---
CRIM | 1.000000 | -0.219247 | 0.455621 | -0.385064 | 0.582764 | 0.352734 | -0.200469 | 0.420972 | 0.406583 |
RM | -0.219247 | 1.000000 | -0.613808 | 0.128069 | -0.292048 | -0.240265 | 0.311991 | -0.302188 | -0.391676 |
LSTAT | 0.455621 | -0.613808 | 1.000000 | -0.366087 | 0.543993 | 0.602339 | -0.412995 | 0.590879 | 0.603800 |
B | -0.385064 | 0.128069 | -0.366087 | 1.000000 | -0.441808 | -0.273534 | 0.175520 | -0.380051 | -0.356977 |
TAX | 0.582764 | -0.292048 | 0.543993 | -0.441808 | 1.000000 | 0.506456 | -0.314563 | 0.668023 | 0.720760 |
AGE | 0.352734 | -0.240265 | 0.602339 | -0.273534 | 0.506456 | 1.000000 | -0.569537 | 0.731470 | 0.644779 |
ZN | -0.200469 | 0.311991 | -0.412995 | 0.175520 | -0.314563 | -0.569537 | 1.000000 | -0.516604 | -0.533828 |
NOX | 0.420972 | -0.302188 | 0.590879 | -0.380051 | 0.668023 | 0.731470 | -0.516604 | 1.000000 | 0.763651 |
INDUS | 0.406583 | -0.391676 | 0.603800 | -0.356977 | 0.720760 | 0.644779 | -0.533828 | 0.763651 | 1.000000 |
In [46]:
# Visualize the correlation matrix as a heatmap
import seaborn as sns
cmap = sns.light_palette("darkgray", as_cmap=True)
sns.heatmap(x_data2.corr(), annot=True, cmap=cmap)
plt.show()
In [47]:
# Pairwise scatter plots of the features
sns.pairplot(x_data2)
plt.show()
Checking multicollinearity with VIF
In [48]:
from statsmodels.stats.outliers_influence import variance_inflation_factor
vif = pd.DataFrame()
vif["VIF Factor"] = [variance_inflation_factor(
x_data2.values, i) for i in range(x_data2.shape[1])]
vif["features"] = x_data2.columns
vif # 10만 넘어도 다중공선성이 있다고할 수 있는데 RM, NOX는 값이 크다..
Out[48]:
 | VIF Factor | features
---|---|---
0 | 1.917332 | CRIM |
1 | 46.535369 | RM |
2 | 8.844137 | LSTAT |
3 | 16.856737 | B |
4 | 19.923044 | TAX |
5 | 18.457503 | AGE |
6 | 2.086502 | ZN |
7 | 72.439753 | NOX |
8 | 12.642137 | INDUS |
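Under the hood, VIF_j = 1 / (1 - R_j²), where R_j² comes from regressing feature j on the remaining features. A manual check for RM, added here as a sketch; like variance_inflation_factor above, it fits the auxiliary regression without an intercept:

aux = sm.OLS(x_data2['RM'], x_data2.drop('RM', axis=1)).fit()
1 / (1 - aux.rsquared)  # ~46.54, matching the RM row above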
In [49]:
# VIF after dropping the NOX variable (x_data3)
vif = pd.DataFrame()
x_data3 = x_data2.drop('NOX', axis=1)
vif["VIF Factor"] = [variance_inflation_factor(x_data3.values, i) for i in range(x_data3.shape[1])]
vif["features"] = x_data3.columns
vif
Out[49]:
 | VIF Factor | features
---|---|---
0 | 1.916648 | CRIM |
1 | 30.806301 | RM |
2 | 8.171214 | LSTAT |
3 | 16.735751 | B |
4 | 18.727105 | TAX |
5 | 16.339792 | AGE |
6 | 2.074500 | ZN |
7 | 11.217461 | INDUS |
In [50]:
# VIF after dropping the NOX and RM variables (x_data4)
vif = pd.DataFrame()
x_data4 = x_data3.drop('RM', axis=1)
vif["VIF Factor"] = [variance_inflation_factor(x_data4.values, i) for i in range(x_data4.shape[1])]
vif["features"] = x_data4.columns
vif
Out[50]:
 | VIF Factor | features
---|---|---
0 | 1.907517 | CRIM |
1 | 7.933529 | LSTAT |
2 | 7.442569 | B |
3 | 16.233237 | TAX |
4 | 13.765377 | AGE |
5 | 1.820070 | ZN |
6 | 11.116823 | INDUS |
In [51]:
# Add a constant to x_data3 (NOX dropped) and x_data4 (NOX and RM dropped), then fit a model on each
x_data3_ = sm.add_constant(x_data3, has_constant='add')
x_data4_ = sm.add_constant(x_data4, has_constant='add')
multi_model3 = sm.OLS(target,x_data3_)
fitted_multi_model3=multi_model3.fit()
multi_model4 = sm.OLS(target,x_data4_)
fitted_multi_model4=multi_model4.fit()
In [52]:
# Compare the regression results
fitted_multi_model3.summary()
Out[52]:
Dep. Variable: | Target | R-squared: | 0.662 |
---|---|---|---|
Model: | OLS | Adj. R-squared: | 0.656 |
Method: | Least Squares | F-statistic: | 121.6 |
Date: | Fri, 20 Aug 2021 | Prob (F-statistic): | 7.62e-112 |
Time: | 16:04:44 | Log-Likelihood: | -1566.0 |
No. Observations: | 506 | AIC: | 3150. |
Df Residuals: | 497 | BIC: | 3188. |
Df Model: | 8 | ||
Covariance Type: | nonrobust |
 | coef | std err | t | P>|t| | [0.025 | 0.975]
---|---|---|---|---|---|---
const | -5.9162 | 3.596 | -1.645 | 0.101 | -12.981 | 1.149 |
CRIM | -0.0451 | 0.036 | -1.264 | 0.207 | -0.115 | 0.025 |
RM | 5.1027 | 0.458 | 11.138 | 0.000 | 4.203 | 6.003 |
LSTAT | -0.5628 | 0.057 | -9.825 | 0.000 | -0.675 | -0.450 |
B | 0.0087 | 0.003 | 2.880 | 0.004 | 0.003 | 0.015 |
TAX | -0.0056 | 0.002 | -2.344 | 0.019 | -0.010 | -0.001 |
AGE | 0.0287 | 0.013 | 2.179 | 0.030 | 0.003 | 0.055 |
ZN | 0.0284 | 0.013 | 2.130 | 0.034 | 0.002 | 0.055 |
INDUS | 0.0486 | 0.062 | 0.789 | 0.431 | -0.072 | 0.170 |
Omnibus: | 193.530 | Durbin-Watson: | 0.849 |
---|---|---|---|
Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 843.773 |
Skew: | 1.677 | Prob(JB): | 5.98e-184 |
Kurtosis: | 8.364 | Cond. No. | 8.44e+03 |
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 8.44e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
In [53]:
fitted_multi_model4.summary()
# Dropping RM clearly lowers the R-squared (a one-line comparison follows the table below).
Out[53]:
Dep. Variable: | Target | R-squared: | 0.577 |
---|---|---|---|
Model: | OLS | Adj. R-squared: | 0.571 |
Method: | Least Squares | F-statistic: | 97.20 |
Date: | Fri, 20 Aug 2021 | Prob (F-statistic): | 5.53e-89 |
Time: | 16:04:49 | Log-Likelihood: | -1622.3 |
No. Observations: | 506 | AIC: | 3261. |
Df Residuals: | 498 | BIC: | 3294. |
Df Model: | 7 | ||
Covariance Type: | nonrobust |
 | coef | std err | t | P>|t| | [0.025 | 0.975]
---|---|---|---|---|---|---
const | 29.6634 | 1.844 | 16.087 | 0.000 | 26.041 | 33.286 |
CRIM | -0.0329 | 0.040 | -0.825 | 0.410 | -0.111 | 0.045 |
LSTAT | -0.9256 | 0.053 | -17.589 | 0.000 | -1.029 | -0.822 |
B | 0.0046 | 0.003 | 1.384 | 0.167 | -0.002 | 0.011 |
TAX | -0.0048 | 0.003 | -1.814 | 0.070 | -0.010 | 0.000 |
AGE | 0.0703 | 0.014 | 4.993 | 0.000 | 0.043 | 0.098 |
ZN | 0.0513 | 0.015 | 3.490 | 0.001 | 0.022 | 0.080 |
INDUS | -0.0357 | 0.068 | -0.523 | 0.601 | -0.170 | 0.098 |
Omnibus: | 138.742 | Durbin-Watson: | 0.960 |
---|---|---|---|
Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 316.077 |
Skew: | 1.427 | Prob(JB): | 2.32e-69 |
Kurtosis: | 5.617 | Cond. No. | 3.85e+03 |
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 3.85e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
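For a one-line comparison of the three specifications, adjusted R² tells the same story; a small sketch added here:

print(fitted_multi_model2.rsquared_adj,  # all nine predictors: ~0.656
      fitted_multi_model3.rsquared_adj,  # NOX dropped: ~0.656
      fitted_multi_model4.rsquared_adj)  # NOX and RM dropped: ~0.571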
Train / test split
In [54]:
from sklearn.model_selection import train_test_split
X = x_data2_  # data containing all the variables
y = target
train_x, test_x, train_y, test_y = train_test_split(X, y, train_size=0.7, test_size=0.3,random_state = 1)
print(train_x.shape, test_x.shape, train_y.shape, test_y.shape)
(354, 10) (152, 10) (354, 1) (152, 1)
In [55]:
# train_x already includes the constant column (it came from x_data2_); fit the regression model
train_x.head()
fit_1 = sm.OLS(train_y,train_x)
fit_1 = fit_1.fit()
In [56]:
## Compare predictions on the test data with the true values (an out-of-sample R² check follows the plot)
plt.plot(np.array(fit_1.predict(test_x)),label="pred")
plt.plot(np.array(test_y),label="true")
plt.legend()
plt.show()
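Beyond eyeballing the plot, an out-of-sample R² summarizes test-set fit in a single number; a sketch using scikit-learn's r2_score, added here:

from sklearn.metrics import r2_score
r2_score(test_y['Target'], fit_1.predict(test_x))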
In [57]:
## Train/test splits for x_data3 and x_data4
X = x_data3_  # data with the NOX variable dropped
y = target
train_x2, test_x2, train_y2, test_y2 = train_test_split(X, y, train_size=0.7, test_size=0.3,random_state = 1)
X = x_data4_  # data with the NOX and RM variables dropped
y = target
train_x3, test_x3, train_y3, test_y3 = train_test_split(X, y, train_size=0.7, test_size=0.3,random_state = 1)
In [58]:
# Fit regression models on x_data3 / x_data4 (fit_2, fit_3)
fit_2 = sm.OLS(train_y2,train_x2)
fit_2 = fit_2.fit()
fit_3 = sm.OLS(train_y3,train_x3)
fit_3 = fit_3.fit()
In [59]:
# Compare test-set prediction errors across the three data variants
plt.plot(np.array(test_y2['Target']-fit_1.predict(test_x)),label="pred_full")
plt.plot(np.array(test_y2['Target']-fit_2.predict(test_x2)),label="pred_vif")
plt.plot(np.array(test_y2['Target']-fit_3.predict(test_x3)),label="pred_vif2")
plt.legend()
plt.show()
Comparing test-set performance with MSE
In [60]:
from sklearn.metrics import mean_squared_error
mean_squared_error(test_y['Target'],fit_1.predict(test_x))
Out[60]:
26.148631468819886
In [61]:
mean_squared_error(test_y['Target'],fit_2.predict(test_x2))
Out[61]:
26.140062609846407
In [62]:
mean_squared_error(test_y['Target'],fit_3.predict(test_x3))
# This model has the highest MSE, so it is the worst of the three (a single-loop comparison follows below).
Out[62]:
38.78845317912829
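The same comparison in a single loop; a sketch added here, reusing the fitted models and test splits from above:

for name, fit, tx in [('full', fit_1, test_x),
                      ('drop NOX', fit_2, test_x2),
                      ('drop NOX+RM', fit_3, test_x3)]:
    print(name, mean_squared_error(test_y['Target'], fit.predict(tx)))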
In [63]:
from IPython.core.display import display, HTML
display(HTML("<style>.container {width:80% !important;}</style>"))