[혼자 공부하는 머신러닝 + 딥러닝] Chapter 6 Unsupervised Learning

맨사설 2021. 9. 3. 19:30

06-1 Clustering Algorithms

Unsupervised learning: machine learning algorithms used when there is no target.

In [11]:

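# Fruit image data
import wget
url = 'https://bit.ly/fruits_300_data'
wget.download(url)  # saved as 'fruits_300_data'; rename it to 'fruits_300.npy' before loading it below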

100% [..........................................................................] 3000128 / 3000128

Out[11]:

'fruits_300_data'

In [1]:

import wget
import numpy as np
import matplotlib.pyplot as plt
fruits = np.load('fruits_300.npy')
print(fruits.shape)
# First dimension (300): number of samples; second (100): image height; third (100): image width

(300, 100, 100)

In [2]:

# Display the first sample
plt.imshow(fruits[0], cmap='gray')
plt.show()
# cmap='gray' renders values near 0 as dark and high values as bright

In [3]:

# cmap='gray_r' renders values near 0 as bright and high values as dark
plt.imshow(fruits[0], cmap='gray_r')
plt.show()

In [4]:

# Samples at index 100 and index 200
fig, axs = plt.subplots(1, 2)
axs[0].imshow(fruits[100], cmap='gray_r')
axs[1].imshow(fruits[200], cmap='gray_r')
plt.show()

In [5]:

# Indices 0-99 are apples, 100-199 pineapples, and 200-299 bananas; flatten each 100x100 image so every group becomes a 2D array of shape (100, 10000)
apple = fruits[0:100].reshape(-1, 100*100)
pineapple = fruits[100:200].reshape(-1, 100*100)
banana = fruits[200:300].reshape(-1, 100*100)

In [6]:

# Compute the mean pixel value of each apple sample (axis=1 averages across the 10,000 pixels in each row)
print(apple.mean(axis=1))

[ 88.3346  97.9249  87.3709  98.3703  92.8705  82.6439  94.4244  95.5999
  90.681   81.6226  87.0578  95.0745  93.8416  87.017   97.5078  87.2019
  88.9827 100.9158  92.7823 100.9184 104.9854  88.674   99.5643  97.2495
  94.1179  92.1935  95.1671  93.3322 102.8967  94.6695  90.5285  89.0744
  97.7641  97.2938 100.7564  90.5236 100.2542  85.8452  96.4615  97.1492
  90.711  102.3193  87.1629  89.8751  86.7327  86.3991  95.2865  89.1709
  96.8163  91.6604  96.1065  99.6829  94.9718  87.4812  89.2596  89.5268
  93.799   97.3983  87.151   97.825  103.22    94.4239  83.6657  83.5159
 102.8453  87.0379  91.2742 100.4848  93.8388  90.8568  97.4616  97.5022
  82.446   87.1789  96.9206  90.3135  90.565   97.6538  98.0919  93.6252
  87.3867  84.7073  89.1135  86.7646  88.7301  86.643   96.7323  97.2604
  81.9424  87.1687  97.2066  83.4712  95.9781  91.8096  98.4086 100.7823
 101.556  100.7027  91.6098  88.8976]
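
The difference between axis=1 (one mean per sample) and axis=0 (one mean per pixel) is easy to mix up, so here is a minimal sketch on a tiny array. The 4 images of 2x3 pixels are a made-up stand-in for the fruit data, not part of the original notebook.

```python
import numpy as np

# Hypothetical stand-in for the fruit data: 4 tiny "images" of 2x3 pixels.
images = np.arange(4 * 2 * 3, dtype=float).reshape(4, 2, 3)

# Flatten each image into one row, as fruits[0:100].reshape(-1, 100*100) does.
flat = images.reshape(-1, 2 * 3)

per_sample_mean = flat.mean(axis=1)  # one value per image (row)
per_pixel_mean = flat.mean(axis=0)   # one value per pixel position (column)

print(flat.shape)             # (4, 6)
print(per_sample_mean.shape)  # (4,)
print(per_pixel_mean.shape)   # (6,)
```

The histogram cell below uses the per-sample means, while the bar-chart cell uses the per-pixel means.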

In [7]:

# Visualize the distribution of per-sample means for all three fruits
# alpha=0.8 makes the bars slightly transparent so overlapping regions stay visible
plt.hist(np.mean(apple, axis=1), alpha=0.8)
plt.hist(np.mean(pineapple, axis=1), alpha=0.8)
plt.hist(np.mean(banana, axis=1), alpha=0.8)
plt.legend(['apple', 'pineapple', 'banana'])
plt.show()

In [8]:

# Compute the mean of each pixel and visualize it
fig, axs = plt.subplots(1, 3, figsize=(20, 5))
axs[0].bar(range(10000), np.mean(apple, axis=0))
axs[1].bar(range(10000), np.mean(pineapple, axis=0))
axs[2].bar(range(10000), np.mean(banana, axis=0))
plt.show()

In [9]:

# To tell apples, pineapples, and bananas apart, pick the images closest to a fruit's mean image
apple_mean = np.mean(apple, axis=0).reshape(100, 100)
pineapple_mean = np.mean(pineapple, axis=0).reshape(100, 100)
banana_mean = np.mean(banana, axis=0).reshape(100, 100)

abs_diff = np.abs(fruits - apple_mean)
abs_mean = np.mean(abs_diff, axis=(1,2)) # mean over axes 1 and 2 (height and width), one value per sample
print(abs_mean.shape)

(300,)

In [10]:

# Show the 100 images with the smallest mean absolute difference from the apple mean image
apple_index = np.argsort(abs_mean)[:100]
fig, axs = plt.subplots(10, 10, figsize=(10,10))
for i in range(10):
    for j in range(10):
        axs[i, j].imshow(fruits[apple_index[i*10 + j]], cmap='gray_r')
        axs[i, j].axis('off') # hide the axes
plt.show()
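
The "closest to the mean image" selection above can be tried on small synthetic data. This is only a sketch: the dark and bright 4x4 images are hypothetical stand-ins for two kinds of fruit.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical data: 5 dark 4x4 "images" followed by 5 bright ones.
dark = rng.uniform(0, 50, size=(5, 4, 4))
bright = rng.uniform(200, 255, size=(5, 4, 4))
images = np.concatenate([dark, bright])

# Mean image of the first group, analogous to apple_mean.
dark_mean = dark.mean(axis=0)

# Mean absolute pixel difference between every image and that mean image.
abs_mean = np.abs(images - dark_mean).mean(axis=(1, 2))

# Indices of the 5 images closest to the mean image.
closest = np.argsort(abs_mean)[:5]
print(sorted(closest.tolist()))
```

Because the two groups differ strongly in brightness, the five selected indices are exactly the dark images. Note that this approach needs to know which samples belong to a group in order to build the mean image, which is information a truly unsupervised setting does not have.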

☆ Clustering: the task of grouping similar samples together; a representative unsupervised learning task

☆ Cluster: a group produced by a clustering algorithm

06-2 k-Means

☆ k-means algorithm: starts by choosing cluster centers at random to form clusters, then moves each center and reassigns the samples, repeating these steps until it settles on well-fitting clusters.
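
The loop described above can be sketched in plain NumPy. This is a teaching sketch under simple assumptions (random samples as initial centers, Euclidean distance), not scikit-learn's KMeans implementation; the function name kmeans_sketch and the two-blob data are made up for illustration.

```python
import numpy as np

def kmeans_sketch(X, k, n_iter=100, seed=0):
    """Plain k-means iteration; a teaching sketch, not scikit-learn's code."""
    rng = np.random.default_rng(seed)
    # Step 1: pick k distinct samples at random as the initial centers.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Step 2: assign every sample to its nearest center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: move each center to the mean of its assigned samples
        # (an empty cluster keeps its previous center).
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        # Stop once the centers no longer move.
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels

# Hypothetical data: two well-separated blobs of 20 points each.
rng = np.random.default_rng(1)
X = np.concatenate([rng.normal(0, 0.5, (20, 2)), rng.normal(10, 0.5, (20, 2))])
centers, labels = kmeans_sketch(X, k=2)
print(np.bincount(labels))
```

On data this cleanly separated, each blob ends up in its own cluster regardless of which samples are drawn as initial centers.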

In [11]:

import numpy as np

fruits = np.load('fruits_300.npy')
fruits_2d = fruits.reshape(-1, 100*100) # flatten each 100x100 image into 10,000 features (a reshape, not dimensionality reduction)
fruits_2d.shape

Out[11]:

(300, 10000)

In [12]:

# Run k-means clustering with k=3
from sklearn.cluster import KMeans

km = KMeans(n_clusters=3, random_state=42)
km.fit(fruits_2d)
print(np.unique(km.labels_, return_counts=True))

(array([0, 1, 2]), array([111,  98,  91], dtype=int64))
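
For reference, np.unique with return_counts=True returns the distinct labels together with how many samples carry each one. A tiny sketch with made-up labels:

```python
import numpy as np

labels = np.array([0, 2, 1, 0, 2, 2])  # hypothetical cluster labels
values, counts = np.unique(labels, return_counts=True)
print(values, counts)  # [0 1 2] [2 1 3]
```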

In [13]:

# Distances from the sample at index 100 to each cluster center
print(km.transform(fruits_2d[100:101]))
# The distance to cluster 0 is the shortest, so predict() should return 0

[[3393.8136117  8837.37750892 5267.70439881]]

In [14]:

print(km.predict(fruits_2d[100:101]))

[0]
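
The link between transform() and predict() can be checked by hand: the predicted cluster is simply the index of the smallest distance. A small sketch using the distances printed above:

```python
import numpy as np

# Distances printed by km.transform(fruits_2d[100:101]) above.
distances = np.array([[3393.8136117, 8837.37750892, 5267.70439881]])

# The predicted cluster is the index of the smallest distance in each row.
nearest = distances.argmin(axis=1)
print(nearest)  # [0]
```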

In [15]:

# Number of iterations the algorithm ran
print(km.n_iter_)

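Finding the Optimal k

Inertia: the sum of the squared distances between each cluster center and the samples that belong to that cluster.

Elbow method: choose k at the point where inertia stops decreasing sharply as k grows.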
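
The inertia definition can be written out directly. The sketch below uses made-up 1-D data with hand-picked centers and labels (the helper name inertia is hypothetical), just to show how the value drops when k matches the real number of groups.

```python
import numpy as np

def inertia(X, centers, labels):
    """Sum of squared distances from each sample to its assigned center."""
    return float(np.sum((X - centers[labels]) ** 2))

# Hypothetical 1-D data with two obvious groups.
X = np.array([[0.0], [2.0], [10.0], [12.0]])

# k=1: a single center at the overall mean.
one = inertia(X, np.array([[6.0]]), np.array([0, 0, 0, 0]))
# k=2: one center per group.
two = inertia(X, np.array([[1.0], [11.0]]), np.array([0, 0, 1, 1]))

print(one, two)  # 104.0 4.0
```

Going from k=1 to k=2 cuts the inertia from 104 to 4; adding further clusters could only shave off the remaining 4, which is the kind of bend the elbow method looks for.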

In [16]:

# Elbow method
inertia = []
for k in range(2, 7):
    km = KMeans(n_clusters=k, random_state=42)
    km.fit(fruits_2d)
    inertia.append(km.inertia_)

plt.plot(range(2, 7), inertia)
plt.xlabel('k')
plt.ylabel('inertia')
plt.show()
# The curve bends at k=3.

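06-3 Principal Component Analysis

☆ Dimensionality reduction: selecting a subset of features that best represent the data, which shrinks the data and can improve the performance of a supervised learning model.

☆ Principal component analysis (PCA): a representative dimensionality reduction algorithm

☆ Explained variance: a record of how well each principal component captures the variance of the original data.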
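
To make these terms concrete, here is a minimal NumPy sketch of PCA via the singular value decomposition. It is an illustration on made-up data under the usual textbook formulation, not the scikit-learn code used in the cells below.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical data: 200 samples, 5 features, most variance along the first feature.
X = rng.normal(size=(200, 5)) * np.array([10.0, 3.0, 1.0, 0.5, 0.1])

# Center the data, then take the SVD; the rows of Vt are the principal components.
X_centered = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)

# Variance captured by each component, and its share of the total.
explained_variance = S ** 2 / (len(X) - 1)
explained_variance_ratio = explained_variance / explained_variance.sum()

# Project onto the first 2 components to reduce 5 features to 2.
X_pca = X_centered @ Vt[:2].T

print(X_pca.shape)
print(explained_variance_ratio.round(3))
```

The ratios come out in descending order and sum to 1, and the first component alone carries most of the variance in this example.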

In [17]:

import numpy as np
fruits = np.load('fruits_300.npy')
fruits_2d = fruits.reshape(-1,100*100)

In [18]:

# 50 principal components
from sklearn.decomposition import PCA
pca = PCA(n_components=50)
pca.fit(fruits_2d)

Out[18]:

PCA(n_components=50)

In [19]:

print(pca.components_.shape)

(50, 10000)

In [20]:

# Reduce each of the 300 samples to 50 components
fruits_pca = pca.transform(fruits_2d)
print(fruits_pca.shape)

(300, 50)

Reconstructing the Original Data

In [21]:

# Expand the 50 features back to 10,000 (100*100) features
fruits_inverse = pca.inverse_transform(fruits_pca)
print(fruits_inverse.shape)

(300, 10000)
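
Reconstruction amounts to multiplying the reduced data by the components and adding the mean back. The sketch below assumes no whitening and uses made-up data that lies exactly in a 2-D subspace, so the reconstruction is exact; with the fruit images, 50 components do not capture all of the variance, so the restored images are only approximations.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical low-rank data: 100 samples, 6 features lying in a 2-D subspace.
latent = rng.normal(size=(100, 2))
mixing = rng.normal(size=(2, 6))
X = latent @ mixing

mean = X.mean(axis=0)
_, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
components = Vt[:2]                      # keep 2 principal components

X_pca = (X - mean) @ components.T        # compress: (100, 6) -> (100, 2)
X_restored = X_pca @ components + mean   # reconstruct: (100, 2) -> (100, 6)

print(X_pca.shape, X_restored.shape)
print(np.allclose(X, X_restored))
```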

In [22]:

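# The explained variance records how well the components capture the variance of the original data; summing the ratios gives the total share retained.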
print(np.sum(pca.explained_variance_ratio_))

0.9215788108051702

In [23]:

# Plot the explained variance ratios
plt.plot(pca.explained_variance_ratio_)
plt.show()
# The first 10 principal components account for most of the variance.

Using PCA with Other Algorithms

In [25]:

# Check how training on the PCA-reduced data differs from training on the original data.
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression()
# Build the target: 0 = apple, 1 = pineapple, 2 = banana
target = np.array([0]*100 + [1]*100 + [2]*100)
from sklearn.model_selection import cross_validate
# Data before PCA
scores = cross_validate(lr, fruits_2d, target)
print(np.mean(scores['test_score']))

0.9966666666666667

In [26]:

# Data after PCA
scores = cross_validate(lr, fruits_pca, target)
print(np.mean(scores['test_score']))

1.0

The data reduced with PCA actually scores higher in accuracy (1.0 versus about 0.997).

In [27]:

# Ask PCA for as many principal components as needed to explain 50% of the variance
pca = PCA(n_components=0.5)
pca.fit(fruits_2d)
print(pca.n_components_)
# Just 2 features explain 50% of the variance
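
When n_components is a fraction, the number of components is chosen from the cumulative explained variance ratio. The sketch below shows the idea with made-up ratios; it illustrates the selection rule and is not scikit-learn's internal code.

```python
import numpy as np

# Hypothetical explained-variance ratios, sorted in descending order.
ratios = np.array([0.42, 0.10, 0.07, 0.05, 0.03])
cumulative = np.cumsum(ratios)

# Smallest number of components whose cumulative ratio exceeds the 0.5 target.
target = 0.5
n_components = int(np.argmax(cumulative > target)) + 1
print(n_components)  # 2
```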

In [28]:

fruits_pca = pca.transform(fruits_2d)
print(fruits_pca.shape)
# The data now has only 2 features

(300, 2)

In [29]:

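# Cross-validation result on the data with only 2 features
scores = cross_validate(lr, fruits_pca, target)
print(np.mean(scores['test_score']))
# The warning below says logistic regression did not fully converge and suggests increasing the number of iterations (max_iter)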

C:\work\envs\datascience\lib\site-packages\sklearn\linear_model\_logistic.py:763: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(

0.99
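
The ConvergenceWarning itself suggests two remedies: raise max_iter or scale the data. A minimal sketch of both, assuming synthetic two-feature data as a stand-in for the PCA output (the blob centers and spread are made up):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Hypothetical stand-in for the 2-component PCA output: three well-separated
# groups of 100 samples whose feature values are large and unscaled.
centers = np.array([[-3000.0, 0.0], [0.0, 3000.0], [3000.0, 0.0]])
X = np.concatenate([c + rng.normal(scale=300.0, size=(100, 2)) for c in centers])
y = np.repeat([0, 1, 2], 100)

# Standardize the features and allow more iterations than the default of 100.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scores = cross_validate(model, X, y)
mean_score = np.mean(scores['test_score'])
print(mean_score)
```

Putting the scaler inside the pipeline means it is fitted on the training folds only during cross-validation. Whether this changes the 0.99 score on the fruit data is not something this sketch can show.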

In [32]:

# Another advantage of PCA: visualization
from sklearn.cluster import KMeans
km = KMeans(n_clusters=3, random_state=42)
km.fit(fruits_pca)
for label in range(3):
    data = fruits_pca[km.labels_ == label]
    plt.scatter(data[:,0],data[:,1])
plt.legend(['apple','banana','pineapple'])  # assumes cluster labels 0, 1, 2 correspond to these fruits in this order
plt.show()

☆ PCA advantage 1: the data becomes easy to visualize.

☆ PCA advantage 2: it can improve performance or speed up training.

License: Attribution, NonCommercial, NoDerivatives
