
PCA + Python_Code

맨사설 2021. 8. 23. 16:09


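◎ The curse of dimensionality

As the number of dimensions grows, a fixed amount of data explains the patterns in that larger space less and less well.
Model complexity rises exponentially as dimensions are added.
Principal component analysis (PCA) is a dimensionality-reduction technique that transforms many mutually correlated variables into a small number of uncorrelated variables.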

◎ Definition of the covariance matrix

The covariance matrix describes the structure of the data: each entry records how similarly a pair of features varies, in other words how much the two features change together.
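As a minimal sketch of the definition above, the covariance matrix can be computed by hand and compared with NumPy's `np.cov`. The toy values in `data` are made up for illustration and are not from the iris data used later.

```python
import numpy as np

# Toy data (illustrative only): 5 observations (rows) of 2 features (columns).
data = np.array([[2.0, 1.0],
                 [4.0, 3.0],
                 [6.0, 5.0],
                 [8.0, 6.0],
                 [10.0, 10.0]])

# Center each feature, then average the products of deviations
# (sample covariance, so the denominator is n - 1).
centered = data - data.mean(axis=0)
cov_manual = centered.T @ centered / (data.shape[0] - 1)

# np.cov treats rows as variables by default; rowvar=False says
# that the features are in the columns.
cov_numpy = np.cov(data, rowvar=False)

print(cov_manual)
print(np.allclose(cov_manual, cov_numpy))
```

The diagonal holds each feature's variance, and the off-diagonal entry is positive here because the two toy features rise together.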

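◎ Principal Components

A way of reducing dimensionality while minimizing the loss of information.
The small number of variables used in the transformation are called principal components (PCs), or simply components.

☆ Obtaining the PCs

This is the process of finding the axis along which the covariance stretches the shape of the data, together with the axes orthogonal to it.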

◎ Understanding the mathematics behind PCA

ⓐ Computing the determinant, det(A)

ⓑ Eigenvalues

ⓒ Singular Value Decomposition (SVD)
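The three concepts listed above can be tried out on a small matrix with NumPy. This is only a sketch; the 2x2 symmetric matrix `A` is an arbitrary example chosen so that the results are easy to check by hand.

```python
import numpy as np

# Arbitrary symmetric example matrix (an assumption for illustration).
A = np.array([[2.0, 1.0],
              [1.0, 2.0]])

# (a) Determinant: 2*2 - 1*1 = 3.
det_A = np.linalg.det(A)

# (b) Eigenvalues and eigenvectors. eigh is meant for symmetric matrices
# and returns the eigenvalues in ascending order (here 1 and 3).
eigvals, eigvecs = np.linalg.eigh(A)

# (c) Singular value decomposition: A = U @ diag(s) @ Vt,
# with the singular values s in descending order.
U, s, Vt = np.linalg.svd(A)

print(det_A)
print(eigvals)
print(s)
```

For a symmetric positive-definite matrix like this one, the singular values coincide with the eigenvalues, and the determinant equals their product.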

◎ PCA procedure

Mean centering
Perform SVD
Use the SVD result to obtain the eigenvectors and eigenvalues of the covariance matrix
Compute the PC scores
Carry out the analysis using the PC scores
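The steps above can be sketched directly in NumPy. This is a minimal illustration under stated assumptions, not the implementation scikit-learn uses internally: `X_demo` is synthetic data generated here, and keeping `k = 2` components is an arbitrary choice.

```python
import numpy as np

# Synthetic data (an assumption): 100 observations, 3 features,
# where the first two features are strongly correlated.
rng = np.random.default_rng(0)
n = 100
base = rng.normal(size=n)
X_demo = np.column_stack([
    base + 0.1 * rng.normal(size=n),
    2.0 * base + 0.1 * rng.normal(size=n),
    rng.normal(size=n),
])

# 1. Mean centering
Xc = X_demo - X_demo.mean(axis=0)

# 2. SVD of the centered data: Xc = U @ diag(s) @ Vt
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

# 3. Eigenvectors of the covariance matrix are the rows of Vt;
#    its eigenvalues are s**2 / (n - 1).
eigenvectors = Vt.T
eigenvalues = s ** 2 / (n - 1)

# 4. PC scores: project the centered data onto the eigenvectors.
scores = Xc @ eigenvectors

# 5. Keep the first k scores for further analysis.
k = 2
scores_k = scores[:, :k]
print(eigenvalues)
print(scores_k[:5])
```

Because the scores are projections onto orthogonal eigenvectors, their covariance matrix is diagonal, which is what "uncorrelated variables" means in practice.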

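◎ Kernel PCA

Used when there appears to be a pattern among the observations but the relationship between the variables is not linear.
K (the kernel matrix) expresses the similarity between observations.
The more similar two observations are, the larger the value (the more dissimilar, the smaller).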
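A small sketch of the kernel-matrix idea, using scikit-learn's `rbf_kernel` and `KernelPCA`. The concentric-circles data from `make_circles`, the RBF kernel, and `gamma=10.0` are all assumptions chosen for illustration; the original post does not specify a kernel or dataset.

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.decomposition import KernelPCA
from sklearn.metrics.pairwise import rbf_kernel

# Two concentric circles: a clear pattern, but not a linear one.
X_circ, y_circ = make_circles(n_samples=200, factor=0.3, noise=0.05,
                              random_state=0)

# Kernel matrix K: K[i, j] = exp(-gamma * ||x_i - x_j||^2).
# Similar observations give values near 1, dissimilar ones near 0.
K = rbf_kernel(X_circ, gamma=10.0)

# Kernel PCA with the same kernel settings.
kpca = KernelPCA(n_components=2, kernel="rbf", gamma=10.0)
Z = kpca.fit_transform(X_circ)

print(K.shape)
print(Z[:5])
```

Each observation is perfectly similar to itself, so the diagonal of `K` is 1. Results depend heavily on `gamma`, so in practice it would need tuning.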

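Principal component analysis hands-on code

Reference: the scikit-learn website, which packages most machine learning methods as modules and provides examples and documentation for them: https://scikit-learn.org

1. Data preprocessing and exploration

Load the datasets module and PCA from the scikit-learn package.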

In [1]:

from sklearn import datasets
from sklearn.decomposition import PCA

Load pandas and numpy for data handling, and pyplot and seaborn for visualization.

In [2]:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

Load the iris data and inspect its structure.

In [3]:

iris = datasets.load_iris()
dir(iris)

Out[3]:

['DESCR',
 'data',
 'feature_names',
 'filename',
 'frame',
 'target',
 'target_names']

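For ease of explanation, only two of the independent variables are used: sepal length and petal length (columns 0 and 2).

In [4]:

X = iris.data[:, [0, 2]]  # use only two variables
y = iris.target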

In [7]:

iris.feature_names

Out[7]:

['sepal length (cm)',
 'sepal width (cm)',
 'petal length (cm)',
 'petal width (cm)']

In [5]:

print(X.shape)
feature_names=[iris.feature_names[0],iris.feature_names[2]]
df_X = pd.DataFrame(X)
df_X.head()

(150, 2)

Out[5]:

	0	1
0	5.1	1.4
1	4.9	1.4
2	4.7	1.3
3	4.6	1.5
4	5.0	1.4

In [8]:

print(y.shape)
df_Y = pd.DataFrame(y)
df_Y.head()

(150,)

Out[8]:

	0
0	0
1	0
2	0
3	0
4	0

Check for missing values.

In [9]:

print(df_X.isnull().sum())
print(df_Y.isnull().sum())

0    0
1    0
dtype: int64
0    0
dtype: int64

In [10]:

print(set(y))
iris.target_names

{0, 1, 2}

Out[10]:

array(['setosa', 'versicolor', 'virginica'], dtype='<U10')

Examine the distribution of the dependent variable (output variable, response variable).

In [11]:

df_Y[0].value_counts().plot(kind='bar')
plt.show()

Examine the distribution of the independent variables (attributes, input variables, explanatory variables).

In [12]:

for i in range(df_X.shape[1]):
    # histplot replaces the deprecated distplot; kde=True overlays the density curve
    sns.histplot(df_X[i], kde=True, stat="density")
    plt.title(feature_names[i])
    plt.show()

2. Using the PCA function and interpreting its output

Use the PCA function to obtain the PCs. Here, 2 PCs are extracted.

In [13]:

pca = PCA(n_components=2)
pca.fit(X)

Out[13]:

PCA(n_components=2)

The PC scores are obtained as shown below. These PC scores can then be used in a regression analysis.

In [14]:

PCscore = pca.transform(X)
PCscore[0:5]

Out[14]:

array([[-2.46024094, -0.24479165],
       [-2.53896211, -0.06093579],
       [-2.70961121,  0.08355948],
       [-2.56511594,  0.25420858],
       [-2.49960153, -0.15286372]])

In [15]:

eigens_v = pca.components_.transpose()
print(eigens_v)

[[ 0.39360585 -0.9192793 ]
 [ 0.9192793   0.39360585]]
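As a hedged cross-check of what `components_` contains, the sketch below compares it with the eigenvectors of the sample covariance matrix computed by `np.linalg.eigh`. It reloads the same two iris columns so that it runs on its own; the names `X_two` and `pca_check` are introduced here for illustration only. Eigenvectors are only defined up to sign, so the comparison uses absolute values.

```python
import numpy as np
from sklearn import datasets
from sklearn.decomposition import PCA

# Same two columns as above: sepal length and petal length.
X_two = datasets.load_iris().data[:, [0, 2]]
pca_check = PCA(n_components=2).fit(X_two)

# Eigen-decomposition of the sample covariance matrix.
cov = np.cov(X_two, rowvar=False)
vals, vecs = np.linalg.eigh(cov)      # eigenvalues in ascending order
order = np.argsort(vals)[::-1]        # reorder to descending, like PCA
vals, vecs = vals[order], vecs[:, order]

# Columns of components_.T should match the eigenvectors up to sign.
same_direction = np.allclose(np.abs(vecs), np.abs(pca_check.components_.T))
print(same_direction)
print(vals, pca_check.explained_variance_)
```

The eigenvalues line up with `explained_variance_`, which is the variance of the data along each PC.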

In [16]:

# Mean-center each column (PCA centers the data before projecting)
mX = X - X.mean(axis=0)
dfmX = pd.DataFrame(mX)

In [17]:

(mX @ eigens_v)[0:5]  # identical to the PC scores

Out[17]:

array([[-2.46024094, -0.24479165],
       [-2.53896211, -0.06093579],
       [-2.70961121,  0.08355948],
       [-2.56511594,  0.25420858],
       [-2.49960153, -0.15286372]])

In [18]:

plt.scatter(PCscore[:,0],PCscore[:,1])
plt.show()

3. Regression analysis using PCs

This time, PCs are extracted using all of the independent variables.

In [20]:

X2 = iris.data
pca2 = PCA(n_components=4)
pca2.fit(X2)

Out[20]:

PCA(n_components=4)

In [21]:

pca2.explained_variance_

Out[21]:

array([4.22824171, 0.24267075, 0.0782095 , 0.02383509])
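To judge how many PCs to keep, it can help to express these variances as proportions. A minimal sketch using scikit-learn's `explained_variance_ratio_` attribute follows; it refits PCA on the full iris data so that it runs on its own, and `pca_all` is a name introduced here.

```python
import numpy as np
from sklearn import datasets
from sklearn.decomposition import PCA

pca_all = PCA(n_components=4).fit(datasets.load_iris().data)

# Share of the total variance carried by each PC, and the running total.
ratio = pca_all.explained_variance_ratio_
cumulative = np.cumsum(ratio)

print(ratio.round(4))
print(cumulative.round(4))
```

Based on the values in Out[21], the first two PCs together account for roughly 98% of the total variance, which is why only two are kept in the next step.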

In [22]:

PCs=pca2.transform(X2)[:,0:2]

In [23]:

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix

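Fitting on the original four features, the sag solver reaches the default max_iter without converging, as the warning below shows. The sag solver is known to converge slowly on unscaled features, so this reflects the solver settings as well as model complexity.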

In [24]:

cif = LogisticRegression(solver='sag',multi_class='multinomial').fit(X2,y)

C:\work\envs\datascience\lib\site-packages\sklearn\linear_model\_sag.py:328: ConvergenceWarning: The max_iter was reached which means the coef_ did not converge
  warnings.warn("The max_iter was reached which means "

When only 2 PCs are extracted and used for the analysis, the model converges.

In [25]:

cif2 = LogisticRegression(solver='sag',multi_class='multinomial').fit(PCs,y)

In [26]:

confusion_matrix(y,cif2.predict(PCs))

Out[26]:

array([[50,  0,  0],
       [ 0, 47,  3],
       [ 0,  2, 48]], dtype=int64)

When 2 of the original variables are simply picked instead (here, the first two columns), model performance drops.

In [27]:

clf = LogisticRegression(solver='sag', max_iter=1000, random_state=0,
                             multi_class="multinomial").fit(X2[:,0:2], y)

In [28]:

confusion_matrix(y, clf.predict(X2[:,0:2]))

Out[28]:

array([[49,  1,  0],
       [ 0, 37, 13],
       [ 0, 14, 36]], dtype=int64)

As shown above, dimensionality reduction lowers model complexity while still using as much of the information as possible in the analysis.
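The comparison above can be condensed into one self-contained sketch. It differs from the cells above in two labeled ways: it uses `LogisticRegression` with its default solver and `max_iter=1000` instead of `sag`, and the helper `training_accuracy` is a hypothetical name introduced here. Note that it reports accuracy on the training data, as the confusion matrices above do, so it says nothing about performance on unseen data.

```python
from sklearn import datasets
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

iris_data = datasets.load_iris()
X_full, labels = iris_data.data, iris_data.target

# Two inputs of the same size: the first 2 PCs vs. the first 2 raw columns.
pcs_two = PCA(n_components=2).fit_transform(X_full)
raw_two = X_full[:, 0:2]


def training_accuracy(features, target):
    """Fit a logistic regression and score it on the same data (hypothetical helper)."""
    model = LogisticRegression(max_iter=1000).fit(features, target)
    return accuracy_score(target, model.predict(features))


acc_pcs = training_accuracy(pcs_two, labels)
acc_raw = training_accuracy(raw_two, labels)
print(f"2 PCs: {acc_pcs:.3f}, first 2 raw features: {acc_raw:.3f}")
```

A fairer evaluation would split the data into training and test sets and fit the PCA on the training part only.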



