[AI class w8d2] ML Basics - Classification Practice

makeitworth 2021. 6. 18. 22:07

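Probabilistic discriminative models

Multiclass logistic regression
Keep the relationships among the likelihood function, softmax, and sigmoid firmly in mind.
To apply the chain rule when deriving gradients, you need to understand how these functions depend on one another.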
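
One relationship worth checking by hand is that softmax over two classes collapses to a sigmoid of the difference of the two activations. The snippet below is a minimal numeric sketch of that identity; the activation values `a1` and `a2` are made up for illustration.

```python
import math

def sigmoid(a):
    """Logistic sigmoid: 1 / (1 + exp(-a))."""
    return 1.0 / (1.0 + math.exp(-a))

def softmax(activations):
    """Softmax over a list of activations."""
    # Subtracting the maximum keeps exp() from overflowing; the result is unchanged.
    m = max(activations)
    exps = [math.exp(a - m) for a in activations]
    total = sum(exps)
    return [e / total for e in exps]

a1, a2 = 2.0, -0.5  # hypothetical activations for two classes
p_softmax = softmax([a1, a2])[0]   # probability of class 1 under softmax
p_sigmoid = sigmoid(a1 - a2)       # sigmoid of the activation difference
print(p_softmax, p_sigmoid)        # the two values agree up to rounding
```

This is why binary logistic regression can be written with a single sigmoid output, while the multiclass model needs the full softmax.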

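Gradient Descent (batch)
- scikit-learn's make_classification makes it easy to generate a synthetic data set.
- The sigmoid function is essential to the logistic regression model. It is defined as σ(a) = 1 / (1 + exp(-a)): one divided by one plus the exponential of the negated input.
- Cost function: given the current parameter values w, compute the cost for input X and targets t.
- The gradient update uses all of the data at once, which is called a batch update. Here "batch" means the entire data set is used in a single shot. In deep learning, "batch size" usually refers to a mini-batch; in this section "batch" always means the full data set.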
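
The pieces listed above (sigmoid, cost, full-batch update) can be sketched in a few lines of NumPy. This is an illustrative sketch, not the course's reference code: the toy data is drawn with NumPy rather than make_classification to keep the example dependency-free, and the names `compute_cost`, `batch_gradient_descent`, `learning_rate`, and `n_iterations` are assumptions.

```python
import numpy as np

def sigmoid(a):
    """Logistic sigmoid: 1 / (1 + exp(-a))."""
    return 1.0 / (1.0 + np.exp(-a))

def compute_cost(X, t, w):
    """Mean cross-entropy cost for parameters w on inputs X and targets t."""
    y = sigmoid(X @ w)
    eps = 1e-12  # guards against log(0)
    return -np.mean(t * np.log(y + eps) + (1 - t) * np.log(1 - y + eps))

def batch_gradient_descent(X, t, learning_rate=0.1, n_iterations=500):
    """Every update uses the gradient computed on the full data set."""
    w = np.zeros(X.shape[1])
    cost_history = []
    for _ in range(n_iterations):
        y = sigmoid(X @ w)
        gradient = X.T @ (y - t) / len(t)  # gradient of the mean cross-entropy
        w -= learning_rate * gradient
        cost_history.append(compute_cost(X, t, w))
    return w, cost_history

# Toy data (hypothetical): two Gaussian blobs plus a bias column.
rng = np.random.default_rng(0)
X_pos = rng.normal(loc=1.5, scale=1.0, size=(100, 2))
X_neg = rng.normal(loc=-1.5, scale=1.0, size=(100, 2))
X_toy = np.c_[np.ones(200), np.vstack([X_pos, X_neg])]
t_toy = np.r_[np.ones(100), np.zeros(100)]

w_hat, cost_history = batch_gradient_descent(X_toy, t_toy)
accuracy = np.mean((sigmoid(X_toy @ w_hat) > 0.5) == t_toy)
print(cost_history[0], cost_history[-1], accuracy)
```

A mini-batch variant would compute `gradient` on a random subset of rows at each step instead of on all of `X`.
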
Binary classification

In binary classification, when the classes are not similarly distributed (class imbalance), accuracy alone is not a good metric.

The following should be considered as well.

Confusion matrix
- scikit-learn's confusion_matrix function is used to compute it.
- TP (True Positive): the actual class is positive and the model predicts positive
- FP (False Positive): the actual class is negative but the model predicts positive
- TN (True Negative): the actual class is negative and the model predicts negative
- FN (False Negative): the actual class is positive but the model predicts negative
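
The four counts can be tallied directly from a list of actual labels and a list of predictions. The sketch below uses made-up labels and a hypothetical helper named `confusion_counts`; it is meant only to make the definitions concrete.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, tn, fn) for boolean labels and boolean predictions."""
    tp = fp = tn = fn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted and actual:
            tp += 1          # predicted positive, actually positive
        elif predicted and not actual:
            fp += 1          # predicted positive, actually negative
        elif not predicted and not actual:
            tn += 1          # predicted negative, actually negative
        else:
            fn += 1          # predicted negative, actually positive
    return tp, fp, tn, fn

# Hypothetical labels: 3 actual positives, 5 actual negatives.
y_true = [True, True, True, False, False, False, False, False]
y_pred = [True, True, False, True, False, False, False, False]
tp, fp, tn, fn = confusion_counts(y_true, y_pred)
print(tp, fp, tn, fn)
```

For binary labels, scikit-learn's confusion_matrix arranges the same counts as [[TN, FP], [FN, TP]], with rows for the actual class and columns for the predicted class.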

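Precision: of the samples the model calls positive, the fraction that really are positive.
Recall: of the samples that really are positive, the fraction the model finds.
Some models have high precision and low recall, and others the reverse. Which is the better model? It depends on the application.
For medical diagnosis, recall matters more than precision. An FP means the model reports a disease the patient does not have; that is a nuisance but not a large risk.
An FN, however, means the model misses a disease the patient does have, which is a large risk. Recall must be high in this case.
Consider spam filtering, with spam as the positive class. An FP is a legitimate message flagged as spam, and if that message is important the risk is large. So for spam filtering precision matters more.
For video classification (is this video suitable for children?), precision matters more, with "suitable" as the positive class. Showing even one unsuitable video because the classifier got it wrong can create a large risk.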

Precision and recall can be tuned within a single model.
In the figure, images further to the right have higher scores.
To decide whether a given image is positive or negative, the model looks at its score and checks whether it is above or below a cutoff.
That cutoff is called the threshold.
With a very low threshold the model calls nearly everything positive. Recall approaches 1 because every actual positive is predicted positive, but precision is necessarily low.
With a high threshold, recall drops and precision rises.
A mid-range threshold is therefore common, but the choice also depends on how much precision and recall each matter, as in the cancer and spam examples.

Which point is best? A threshold just before the curve changes sharply is a good trade-off.

To tune precision and recall within one model, you need the model's score rather than just its true/false prediction.
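
To see the trade-off numerically, the sketch below sweeps three thresholds over a small set of made-up scores and labels. The helper name `precision_recall_at` and the convention of reporting precision as 1.0 when nothing is predicted positive are assumptions for this illustration.

```python
def precision_recall_at(scores, labels, threshold):
    """Precision and recall when predicting positive for score > threshold."""
    predicted = [s > threshold for s in scores]
    tp = sum(p and l for p, l in zip(predicted, labels))
    fp = sum(p and not l for p, l in zip(predicted, labels))
    fn = sum((not p) and l for p, l in zip(predicted, labels))
    # Assumed convention: precision is 1.0 when no sample is predicted positive.
    precision = tp / (tp + fp) if (tp + fp) else 1.0
    recall = tp / (tp + fn)
    return precision, recall

# Hypothetical classifier scores, sorted from low to high, with true labels.
scores = [-3.0, -1.5, -0.5, 0.2, 0.4, 1.0, 2.0, 3.5]
labels = [False, False, True, False, True, False, True, True]

for threshold in (-4.0, 0.0, 1.5):
    precision, recall = precision_recall_at(scores, labels, threshold)
    print(threshold, precision, recall)
# -4.0 -> precision 0.5, recall 1.0   (everything predicted positive)
#  0.0 -> precision 0.6, recall 0.75
#  1.5 -> precision 1.0, recall 0.5   (only the highest scores predicted positive)
```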

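Multiclass classification

In the multiclass case here (rather than the binary one), the classes are roughly evenly distributed, so imbalance is not much of an issue and plain accuracy is a reasonable metric without computing precision or recall.

Data Augmentation
- Data augmentation is a way to improve the performance of a trained model.
- Transform the images to create additional samples, add them to the training data, and retrain; this tends to give a more robust model.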

Practice - MNIST

In [1]:

# Python ≥3.5 is required
import sys
assert sys.version_info >= (3, 5)

# Scikit-Learn ≥0.20 is required
import sklearn
assert sklearn.__version__ >= "0.20"

# Common imports
import numpy as np
import os

# to make this notebook's output stable across runs
np.random.seed(42)

# To plot pretty figures
%matplotlib inline
import matplotlib as mpl
import matplotlib.pyplot as plt
mpl.rc('axes', labelsize=14)
mpl.rc('xtick', labelsize=12)
mpl.rc('ytick', labelsize=12)

# Where to save the figures
PROJECT_ROOT_DIR = "."
CHAPTER_ID = "classification"
IMAGES_PATH = os.path.join(PROJECT_ROOT_DIR, "images", CHAPTER_ID)
os.makedirs(IMAGES_PATH, exist_ok=True)

def save_fig(fig_id, tight_layout=True, fig_extension="png", resolution=300):
    path = os.path.join(IMAGES_PATH, fig_id + "." + fig_extension)
    print("Saving figure", fig_id)
    if tight_layout:
        plt.tight_layout()
    plt.savefig(path, format=fig_extension, dpi=resolution)

MNIST data

In [2]:

from sklearn.datasets import fetch_openml
mnist = fetch_openml('mnist_784', version=1, cache=True)
mnist.keys()

Out[2]:

dict_keys(['data', 'target', 'frame', 'categories', 'feature_names', 'target_names', 'DESCR', 'details', 'url'])

In [3]:

X, y = mnist["data"], mnist["target"]
X.shape

Out[3]:

(70000, 784)

In [4]:

X = np.array(X)

In [5]:

y

Out[5]:

0        5
1        0
2        4
3        1
4        9
        ..
69995    2
69996    3
69997    4
69998    5
69999    6
Name: class, Length: 70000, dtype: category
Categories (10, object): ['0', '1', '2', '3', ..., '6', '7', '8', '9']

In [6]:

%matplotlib inline
import matplotlib as mpl
import matplotlib.pyplot as plt

In [7]:

some_digit = X[2]
some_digit_image = some_digit.reshape(28, 28)
plt.imshow(some_digit_image, cmap=mpl.cm.binary)
plt.axis("off")

save_fig("some_digit_plot")
plt.show()

Saving figure some_digit_plot

In [8]:

some_digit

Out[8]:

array([  0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,
         0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,
         0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,
         0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,
         0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,
         0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,
         0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,
         0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,
         0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,
         0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,
         0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,
         0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,
         0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,
         0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,
         0.,   0.,   0.,   0.,   0.,   0.,  67., 232.,  39.,   0.,   0.,
         0.,   0.,   0.,   0.,   0.,   0.,   0.,  62.,  81.,   0.,   0.,
         0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,
         0., 120., 180.,  39.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,
         0.,   0., 126., 163.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,
         0.,   0.,   0.,   0.,   0.,   0.,   2., 153., 210.,  40.,   0.,
         0.,   0.,   0.,   0.,   0.,   0.,   0.,   0., 220., 163.,   0.,
         0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,
         0.,  27., 254., 162.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,
         0.,   0.,   0., 222., 163.,   0.,   0.,   0.,   0.,   0.,   0.,
         0.,   0.,   0.,   0.,   0.,   0.,   0., 183., 254., 125.,   0.,
         0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,  46., 245., 163.,
         0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,
         0.,   0., 198., 254.,  56.,   0.,   0.,   0.,   0.,   0.,   0.,
         0.,   0.,   0., 120., 254., 163.,   0.,   0.,   0.,   0.,   0.,
         0.,   0.,   0.,   0.,   0.,   0.,   0.,  23., 231., 254.,  29.,
         0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0., 159., 254.,
       120.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,
         0.,   0., 163., 254., 216.,  16.,   0.,   0.,   0.,   0.,   0.,
         0.,   0.,   0.,   0., 159., 254.,  67.,   0.,   0.,   0.,   0.,
         0.,   0.,   0.,   0.,   0.,  14.,  86., 178., 248., 254.,  91.,
         0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0., 159.,
       254.,  85.,   0.,   0.,   0.,  47.,  49., 116., 144., 150., 241.,
       243., 234., 179., 241., 252.,  40.,   0.,   0.,   0.,   0.,   0.,
         0.,   0.,   0.,   0.,   0., 150., 253., 237., 207., 207., 207.,
       253., 254., 250., 240., 198., 143.,  91.,  28.,   5., 233., 250.,
         0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,
         0., 119., 177., 177., 177., 177., 177.,  98.,  56.,   0.,   0.,
         0.,   0.,   0., 102., 254., 220.,   0.,   0.,   0.,   0.,   0.,
         0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,
         0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0., 169., 254.,
       137.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,
         0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,
         0.,   0.,   0.,   0., 169., 254.,  57.,   0.,   0.,   0.,   0.,
         0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,
         0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0., 169.,
       254.,  57.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,
         0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,
         0.,   0.,   0.,   0.,   0., 169., 255.,  94.,   0.,   0.,   0.,
         0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,
         0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,
       169., 254.,  96.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,
         0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,
         0.,   0.,   0.,   0.,   0.,   0., 169., 254., 153.,   0.,   0.,
         0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,
         0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,
         0., 169., 255., 153.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,
         0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,
         0.,   0.,   0.,   0.,   0.,   0.,   0.,  96., 254., 153.,   0.,
         0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,
         0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,
         0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,
         0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,
         0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,
         0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,
         0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,
         0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,
         0.,   0.,   0.])

In [9]:

# convert the labels from strings to integers
y = y.astype(np.uint8)

In [10]:

y

Out[10]:

0        5
1        0
2        4
3        1
4        9
        ..
69995    2
69996    3
69997    4
69998    5
69999    6
Name: class, Length: 70000, dtype: uint8

In [11]:

def plot_digit(data):
    image = data.reshape(28, 28)
    plt.imshow(image, cmap = mpl.cm.binary,
               interpolation="nearest")
    plt.axis("off")

In [12]:

def plot_digits(instances, images_per_row=10, **options):
    size = 28
    images_per_row = min(len(instances), images_per_row)
    images = [instance.reshape(size,size) for instance in instances]
    n_rows = (len(instances) - 1) // images_per_row + 1
    row_images = []
    n_empty = n_rows * images_per_row - len(instances)
    images.append(np.zeros((size, size * n_empty)))
    for row in range(n_rows):
        rimages = images[row * images_per_row : (row + 1) * images_per_row]
        row_images.append(np.concatenate(rimages, axis=1))
    image = np.concatenate(row_images, axis=0)
    plt.imshow(image, cmap = mpl.cm.binary, **options)
    plt.axis("off")

In [13]:

# plot the first 100 images

plt.figure(figsize=(9,9))
example_images = X[:100]
plot_digits(example_images, images_per_row=10)
save_fig("more_digits_plot")
plt.show()

Saving figure more_digits_plot

In [14]:

y[0]

Out[14]:

In [15]:

# split into training and test sets
X_train, X_test, y_train, y_test = X[:60000], X[60000:], y[:60000], y[60000:]

Binary classifier

Let's simplify the problem and identify only the digit 5.

In [16]:

y_train_5 = (y_train == 5)
y_test_5 = (y_test == 5)

In [17]:

y_train_5

Out[17]:

0         True
1        False
2        False
3        False
4        False
         ...  
59995    False
59996    False
59997     True
59998    False
59999    False
Name: class, Length: 60000, dtype: bool

Let's use a logistic regression model.

In [18]:

import warnings
warnings.filterwarnings(action='ignore')

In [19]:

# use the scikit-learn implementation
from sklearn.linear_model import LogisticRegression
log_clf = LogisticRegression(random_state=0).fit(X_train, y_train_5)

In [20]:

log_clf.predict([X[0],X[1],X[2]])

Out[20]:

array([ True, False, False])

Let's evaluate it with cross-validation (3 folds).

In [21]:

from sklearn.model_selection import cross_val_score
cross_val_score(log_clf, X_train, y_train_5, cv=3, scoring="accuracy")

Out[21]:

array([0.97525, 0.9732 , 0.9732 ])

Accuracy is above 97% on every cross-validation fold. Does the model look good? Is this really a good result? Let's find out. Declare Never5Classifier: a classifier that always predicts "not 5" (False).

In [22]:

from sklearn.base import BaseEstimator
class Never5Classifier(BaseEstimator):
    def fit(self, X, y=None):
        pass
    def predict(self, X):
        return np.zeros(len(X), dtype=bool)

In [23]:

never_5_clf = Never5Classifier()
cross_val_score(never_5_clf, X_train, y_train_5, cv=3, scoring="accuracy")

Out[23]:

array([0.91125, 0.90855, 0.90915])

In [24]:

never_5_clf.predict(X)

Out[24]:

array([False, False, False, ..., False, False, False])

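Only about 10% of the images are 5s, so always predicting "not 5" already gives about 90% accuracy. In other words, 97% accuracy may not be as impressive as it looks.
When the target classes are imbalanced, accuracy is not a good metric.
So what should we do? -> Use precision and recall (from the confusion matrix), or the F1 score.
If the target distribution is close to 50:50, accuracy is fine.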

Confusion matrix

In [25]:

# first, store the cross-validated predictions
from sklearn.model_selection import cross_val_predict

y_train_pred = cross_val_predict(log_clf, X_train, y_train_5, cv=3)

In [26]:

y_train_pred.shape

Out[26]:

(60000,)

In [27]:

# build the confusion matrix
from sklearn.metrics import confusion_matrix

confusion_matrix(y_train_5, y_train_pred)

Out[27]:

array([[54038,   541],
       [ 1026,  4395]])

precision = $\frac{TP}{TP+FP}$

recall = $\frac{TP}{TP+FN}$

In [28]:

from sklearn.metrics import precision_score, recall_score

precision_score(y_train_5, y_train_pred)

Out[28]:

0.8903970826580226

In [29]:

4395/(4395+541)

Out[29]:

0.8903970826580226

In [30]:

recall_score(y_train_5, y_train_pred)

Out[30]:

0.8107360265633647

In [31]:

4395/(4395+1026)

Out[31]:

0.8107360265633647

In [32]:

confusion_matrix(y_train_5, never_5_clf.predict(X)[:60000])

Out[32]:

array([[54579,     0],
       [ 5421,     0]])

In [33]:

precision_score(y_train_5, never_5_clf.predict(X)[:60000])

Out[33]:

0.0

In [34]:

recall_score(y_train_5, never_5_clf.predict(X)[:60000])

Out[34]:

0.0
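
The F1 score mentioned earlier is the harmonic mean of precision and recall, so it is high only when both are high. The sketch below plugs in the confusion-matrix counts reported above for the logistic regression model; the variable names are chosen here for illustration.

```python
# Counts from the confusion matrix of the logistic regression "5 detector":
# [[TN, FP], [FN, TP]] = [[54038, 541], [1026, 4395]]
tn, fp, fn, tp = 54038, 541, 1026, 4395
total = tn + fp + fn + tp

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
accuracy = (tp + tn) / total

# A classifier that always answers "not 5" gets every negative right
# and every positive wrong, so its accuracy is the share of negatives.
baseline_accuracy = (tn + fp) / total

print(precision, recall, f1, accuracy, baseline_accuracy)
```

scikit-learn exposes the same quantity as sklearn.metrics.f1_score. For the always-"not 5" classifier TP is 0, so there is nothing to average and scikit-learn reports an F1 of 0 by default, even though its accuracy is about 91%.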

Investigating error cases

With 89% precision, the model is not perfect.
Let's look at when the errors occur.

In [35]:

errors = (y_train_pred != y_train_5)

In [36]:

errors

Out[36]:

0        False
1        False
2        False
3        False
4        False
         ...  
59995    False
59996    False
59997    False
59998    False
59999    False
Name: class, Length: 60000, dtype: bool

In [37]:

# look at 100 of the misclassified samples
plt.figure(figsize=(9,9))
plot_digits(X_train[errors][:100], images_per_row=10)

save_fig("more_digits_plot")
plt.show()

Saving figure more_digits_plot

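Precision/Recall Trade-off

Suppose there are two models:
one has high precision but low recall,
the other has low precision but high recall.
Which one should we choose?

ex 1> Cancer diagnosis model

What does a precision failure look like?
The model says cancer, but the patient does not actually have cancer.

What does a recall failure look like?
The model says no cancer, but the patient actually has cancer.

The latter has far more serious consequences --> recall must be high.

ex 2> Spam mail classifier

What does a precision failure look like?
The model says spam, but the mail is not actually spam.

What does a recall failure look like?
The model says not spam, but the mail is actually spam.

The former has far more serious consequences --> precision must be high.

ex 3> Children's video rating classifier -> precision matters

The model says the video is suitable, but it actually is not.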

Many thresholds are possible. Which one should we choose?

In [38]:

# print the indices of the misclassified samples
for i in range(len(errors)):
    if errors[i]:
        print(i)

In [39]:

y_train_pred[48], y_train_5[48]

Out[39]:

(True, False)

This sample is not a 5, but the model predicted 5 (a false positive).

In [40]:

some_digit = X_train[48]

# decision_function returns a signed score: how far the sample lies from the decision boundary
y_scores = log_clf.decision_function([some_digit])
y_scores

Out[40]:

array([0.22419047])

In [41]:

some_digit_image = some_digit.reshape(28, 28)
plt.imshow(some_digit_image, cmap=mpl.cm.binary)
plt.axis("off")

save_fig("some_digit_plot")
plt.show()

Saving figure some_digit_plot

In [42]:

threshold = 0
y_some_digit_pred = (y_scores > threshold)
y_some_digit_pred

Out[42]:

array([ True])

With a threshold of 0, the score of about 0.22 is above the threshold, so the prediction is True.

In [43]:

threshold = 0.5
y_some_digit_pred = (y_scores > threshold)
y_some_digit_pred

Out[43]:

array([False])

Raising the threshold removed this one false positive. In general, a higher threshold raises precision and lowers recall.

In [44]:

# compute decision-function scores for the whole training set
y_scores = cross_val_predict(log_clf, X_train, y_train_5, cv=3,
                             method="decision_function")

In [45]:

y_scores.shape

Out[45]:

(60000,)

In [46]:

from sklearn.metrics import precision_recall_curve

precisions, recalls, thresholds = precision_recall_curve(y_train_5, y_scores)

In [47]:

precisions.shape

Out[47]:

(59897,)

In [48]:

thresholds.shape

Out[48]:

(59896,)

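The number of thresholds depends on the scores (precision_recall_curve keeps only distinct score values), so the arrays come out slightly shorter than 60000. precisions and recalls each have one more entry than thresholds.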

In [49]:

def plot_precision_vs_recall(precisions, recalls):
    plt.plot(recalls, precisions, "b-", linewidth=2)
    plt.xlabel("Recall", fontsize=16)
    plt.ylabel("Precision", fontsize=16)
    plt.axis([0, 1, 0, 1])
    plt.grid(True)

plt.figure(figsize=(8, 6))
plot_precision_vs_recall(precisions, recalls)
save_fig("precision_vs_recall_plot")
plt.show()

Saving figure precision_vs_recall_plot

Moving to the right (higher recall) corresponds to lower thresholds.

Multiclass Classification

The original problem (classify each image as one of the 10 digits)

In [50]:

from sklearn.linear_model import LogisticRegression
softmax_reg = LogisticRegression(multi_class="multinomial",solver="lbfgs", C=10)
softmax_reg.fit(X_train, y_train)

Out[50]:

LogisticRegression(C=10, multi_class='multinomial')

In [51]:

softmax_reg.predict(X_train)[:10]

Out[51]:

array([5, 0, 4, 1, 9, 2, 1, 3, 1, 4], dtype=uint8)

In [52]:

from sklearn.metrics import accuracy_score
y_pred = softmax_reg.predict(X_test)
accuracy_score(y_test, y_pred)

Out[52]:

0.9243

Data Augmentation

One strategy for improving performance: keep the y labels unchanged and shift the x images slightly to enlarge the training data.

In [53]:

from scipy.ndimage import shift

In [54]:

def shift_image(image, dx, dy):
    image = image.reshape((28, 28))
    shifted_image = shift(image, [dy, dx], cval=0, mode="constant")
    return shifted_image.reshape([-1])

In [55]:

image = X_train[1000]
# shift the image down by 5 pixels
shifted_image_down = shift_image(image, 0, 5)
# shift the image left by 5 pixels
shifted_image_left = shift_image(image, -5, 0)

plt.figure(figsize=(12,3))
plt.subplot(131)
plt.title("Original", fontsize=14)
plt.imshow(image.reshape(28, 28), interpolation="nearest", cmap="Greys")
plt.subplot(132)
plt.title("Shifted down", fontsize=14)
plt.imshow(shifted_image_down.reshape(28, 28), interpolation="nearest", cmap="Greys")
plt.subplot(133)
plt.title("Shifted left", fontsize=14)
plt.imshow(shifted_image_left.reshape(28, 28), interpolation="nearest", cmap="Greys")
plt.show()

Besides shifting up, down, left, and right, rotating the image is another option.
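
A rotation-based variant could look like the sketch below. It is not part of the original notebook: `rotate_image` is a hypothetical helper mirroring `shift_image`, it relies on `scipy.ndimage.rotate`, and the stand-in image is a synthetic vertical bar rather than an MNIST digit. Small angles are assumed, since large rotations can change what a digit means (for example 6 and 9).

```python
import numpy as np
from scipy.ndimage import rotate

def rotate_image(image, angle):
    """Rotate a flattened 28x28 image by `angle` degrees around its center."""
    image = image.reshape((28, 28))
    # reshape=False keeps the 28x28 size; pixels rotated in from outside are 0.
    rotated = rotate(image, angle, reshape=False, cval=0, mode="constant")
    # Spline interpolation can overshoot slightly, so clip to the pixel range.
    rotated = np.clip(rotated, 0, 255)
    return rotated.reshape([-1])

# Stand-in image (hypothetical): a vertical bar instead of a real digit.
demo_image = np.zeros((28, 28))
demo_image[4:24, 13:15] = 255.0

rotated_image = rotate_image(demo_image.reshape(-1), 10)
print(rotated_image.shape)
```

Rotated copies would be appended to the training set with their original labels, in the same way as the shifted copies.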

In [56]:

X_train_augmented = [image for image in X_train]
y_train_augmented = [label for label in y_train]

# add copies shifted by one pixel in each of four directions
for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
    for image, label in zip(X_train, y_train):
        X_train_augmented.append(shift_image(image, dx, dy))
        y_train_augmented.append(label)

X_train_augmented = np.array(X_train_augmented)
y_train_augmented = np.array(y_train_augmented)

In [57]:

X_train_augmented.shape

Out[57]:

(300000, 784)

In [58]:

# shuffle the augmented data, since leaving it ordered by shift direction could hurt training
shuffle_idx = np.random.permutation(len(X_train_augmented))
X_train_augmented = X_train_augmented[shuffle_idx]
y_train_augmented = y_train_augmented[shuffle_idx]

In [59]:

X_train_augmented.shape, X_train.shape

Out[59]:

((300000, 784), (60000, 784))

In [60]:

softmax_reg_augmented = LogisticRegression(multi_class="multinomial",solver="lbfgs", C=10)
softmax_reg_augmented.fit(X_train_augmented, y_train_augmented)

Out[60]:

LogisticRegression(C=10, multi_class='multinomial')

In [61]:

y_pred = softmax_reg_augmented.predict(X_test)
accuracy_score(y_test, y_pred)

Out[61]:

0.9279

Augmentation raised the accuracy slightly (from 0.9243 to 0.9279).

Titanic dataset

In [62]:

import numpy as np
import pandas as pd

In [63]:

train_data = pd.read_csv("titanic.csv")

In [64]:

train_data.head()

Out[64]:

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked
0	1	0	3	Braund, Mr. Owen Harris	male	22.0	1	0	A/5 21171	7.2500	NaN	S
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	0	PC 17599	71.2833	C85	C
2	3	1	3	Heikkinen, Miss. Laina	female	26.0	0	0	STON/O2. 3101282	7.9250	NaN	S
3	4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35.0	1	0	113803	53.1000	C123	S
4	5	0	3	Allen, Mr. William Henry	male	35.0	0	0	373450	8.0500	NaN	S

Attributes

Survived: that's the target, 0 means the passenger did not survive, while 1 means he/she survived.
Pclass: passenger class.
Name, Sex, Age: self-explanatory
SibSp: how many siblings & spouses of the passenger aboard the Titanic.
Parch: how many children & parents of the passenger aboard the Titanic.
Ticket: ticket id
Fare: price paid (in pounds)
Cabin: passenger's cabin number
Embarked: where the passenger embarked the Titanic

If an identifier is used as a feature, predictions on new data can become very poor. Do not include identifier-like variables as features.

In [65]:

train_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB

The Age, Cabin, and Embarked attributes have missing values.

The Cabin, Name, and Ticket attributes are ignored.

In [66]:

train_data.describe()

Out[66]:

	PassengerId	Survived	Pclass	Age	SibSp	Parch	Fare
count	891.000000	891.000000	891.000000	714.000000	891.000000	891.000000	891.000000
mean	446.000000	0.383838	2.308642	29.699118	0.523008	0.381594	32.204208
std	257.353842	0.486592	0.836071	14.526497	1.102743	0.806057	49.693429
min	1.000000	0.000000	1.000000	0.420000	0.000000	0.000000	0.000000
25%	223.500000	0.000000	2.000000	20.125000	0.000000	0.000000	7.910400
50%	446.000000	0.000000	3.000000	28.000000	0.000000	0.000000	14.454200
75%	668.500000	1.000000	3.000000	38.000000	1.000000	0.000000	31.000000
max	891.000000	1.000000	3.000000	80.000000	8.000000	6.000000	512.329200

We can see that fewer than 40% survived (the mean of Survived is about 0.38).

In [67]:

train_data["Survived"].value_counts()

Out[67]:

0    549
1    342
Name: Survived, dtype: int64

Let's look at the categorical attributes.

In [68]:

train_data["Pclass"].value_counts()

Out[68]:

3    491
1    216
2    184
Name: Pclass, dtype: int64

In [69]:

train_data["Sex"].value_counts()

Out[69]:

male      577
female    314
Name: Sex, dtype: int64

In [70]:

train_data["Embarked"].value_counts()

Out[70]:

S    644
C    168
Q     77
Name: Embarked, dtype: int64

In [71]:

from sklearn.base import BaseEstimator, TransformerMixin

# selects only the requested columns from a DataFrame
class DataFrameSelector(BaseEstimator, TransformerMixin):
    def __init__(self, attribute_names):
        self.attribute_names = attribute_names
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        return X[self.attribute_names]

Build a pipeline that handles the numerical attributes.

In [72]:

from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer

num_pipeline = Pipeline([
        ("select_numeric", DataFrameSelector(["Age", "SibSp", "Parch", "Fare"])),
        # fill missing values with the median
        ("imputer", SimpleImputer(strategy="median")),
    ])

In [73]:

num_pipeline.fit_transform(train_data)

Out[73]:

array([[22.    ,  1.    ,  0.    ,  7.25  ],
       [38.    ,  1.    ,  0.    , 71.2833],
       [26.    ,  0.    ,  0.    ,  7.925 ],
       ...,
       [28.    ,  1.    ,  2.    , 23.45  ],
       [26.    ,  0.    ,  0.    , 30.    ],
       [32.    ,  0.    ,  0.    ,  7.75  ]])
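
What the median imputation step does can be sketched with plain NumPy. This is a simplified illustration, not SimpleImputer's actual implementation; `impute_median` and the small array are made up for the example.

```python
import numpy as np

def impute_median(X):
    """Replace NaNs in each column with that column's median (ignoring NaNs)."""
    X = np.asarray(X, dtype=float).copy()
    medians = np.nanmedian(X, axis=0)          # one median per column
    rows, cols = np.where(np.isnan(X))         # positions of missing values
    X[rows, cols] = medians[cols]
    return X, medians

# Hypothetical (Age, Fare) rows with one missing value in each column.
ages_and_fares = np.array([
    [22.0, 7.25],
    [np.nan, 71.2833],
    [26.0, 7.925],
    [35.0, np.nan],
])
filled, medians = impute_median(ages_and_fares)
print(medians)
print(filled)
```

In a real pipeline the medians are learned once from the training data during fit and then reused when transforming new data, which is what the fit/transform split in SimpleImputer provides.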

In [74]:

# fill missing values with the most frequent value of each column
class MostFrequentImputer(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        self.most_frequent_ = pd.Series([X[c].value_counts().index[0] for c in X],
                                        index=X.columns)
        return self
    def transform(self, X, y=None):
        return X.fillna(self.most_frequent_)

In [75]:

from sklearn.preprocessing import OneHotEncoder

In [76]:

cat_pipeline = Pipeline([
        ("select_cat", DataFrameSelector(["Pclass", "Sex", "Embarked"])),
        ("imputer", MostFrequentImputer()),
        ("cat_encoder", OneHotEncoder(sparse=False)),  # on scikit-learn >= 1.2, pass sparse_output=False instead
    ])

In [77]:

cat_pipeline.fit_transform(train_data)

Out[77]:

array([[0., 0., 1., ..., 0., 0., 1.],
       [1., 0., 0., ..., 1., 0., 0.],
       [0., 0., 1., ..., 0., 0., 1.],
       ...,
       [0., 0., 1., ..., 0., 0., 1.],
       [1., 0., 0., ..., 1., 0., 0.],
       [0., 0., 1., ..., 0., 1., 0.]])

In [78]:

cat_pipeline.fit_transform(train_data)[0]

Out[78]:

array([0., 0., 1., 0., 1., 0., 0., 1.])
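
The eight values above can be reproduced by hand: three columns for Pclass, two for Sex, and three for Embarked. The sketch below encodes the first passenger (Pclass 3, male, embarked at S) with a hypothetical `one_hot` helper; the category order is assumed to be sorted, which matches the output shown.

```python
def one_hot(value, categories):
    """Return a list with 1.0 at the position of `value` and 0.0 elsewhere."""
    return [1.0 if value == c else 0.0 for c in categories]

# Categories in sorted order (assumed to match the encoder's ordering).
categories = {
    "Pclass": [1, 2, 3],
    "Sex": ["female", "male"],
    "Embarked": ["C", "Q", "S"],
}

# First passenger in the data set: third class, male, embarked at S.
passenger = {"Pclass": 3, "Sex": "male", "Embarked": "S"}

encoded = []
for column, cats in categories.items():
    encoded.extend(one_hot(passenger[column], cats))
print(encoded)  # [0.0, 0.0, 1.0, 0.0, 1.0, 0.0, 0.0, 1.0]
```

Together with the four numerical attributes, these eight columns account for the 12 features of the combined training matrix.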

Combine the categorical and numerical attributes.

In [79]:

from sklearn.pipeline import FeatureUnion
preprocess_pipeline = FeatureUnion(transformer_list=[
        ("num_pipeline", num_pipeline),
        ("cat_pipeline", cat_pipeline),
    ])

In [80]:

X_train = preprocess_pipeline.fit_transform(train_data)
X_train

Out[80]:

array([[22.,  1.,  0., ...,  0.,  0.,  1.],
       [38.,  1.,  0., ...,  1.,  0.,  0.],
       [26.,  0.,  0., ...,  0.,  0.,  1.],
       ...,
       [28.,  1.,  2., ...,  0.,  0.,  1.],
       [26.,  0.,  0., ...,  1.,  0.,  0.],
       [32.,  0.,  0., ...,  0.,  1.,  0.]])

In [81]:

X_train.shape

Out[81]:

(891, 12)

Target vector

In [82]:

y_train = train_data["Survived"]

In [83]:

y_train

Out[83]:

0      0
1      1
2      1
3      1
4      0
      ..
886    0
887    1
888    0
889    1
890    0
Name: Survived, Length: 891, dtype: int64

In [84]:

log_clf = LogisticRegression(random_state=0).fit(X_train, y_train)

In [85]:

# concatenate the model's score, the target value, and the input attributes
a = np.c_[log_clf.decision_function(X_train), y_train, X_train]

In [86]:

df = pd.DataFrame(data=a, columns=["Score", "Survived", "Age", "SibSp", "Parch", "Fare", "Pclass_1", "Pclass_2", "Pclass_3", "Female", "Male", "Embarked_C", "Embarked_Q", "Embarked_S"])

In [87]:

df

Out[87]:

	Score	Survived	Age	SibSp	Parch	Fare	Pclass_1	Pclass_2	Pclass_3	Female	Male	Embarked_C	Embarked_Q	Embarked_S
0	-2.333812	0.0	22.0	1.0	0.0	7.2500	0.0	0.0	1.0	0.0	1.0	0.0	0.0	1.0
1	2.346548	1.0	38.0	1.0	0.0	71.2833	1.0	0.0	0.0	1.0	0.0	1.0	0.0	0.0
2	0.483770	1.0	26.0	0.0	0.0	7.9250	0.0	0.0	1.0	1.0	0.0	0.0	0.0	1.0
3	1.997652	1.0	35.0	1.0	0.0	53.1000	1.0	0.0	0.0	1.0	0.0	0.0	0.0	1.0
4	-2.490752	0.0	35.0	0.0	0.0	8.0500	0.0	0.0	1.0	0.0	1.0	0.0	0.0	1.0
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
886	-1.020055	0.0	27.0	0.0	0.0	13.0000	0.0	1.0	0.0	0.0	1.0	0.0	0.0	1.0
887	2.829704	1.0	19.0	0.0	0.0	30.0000	1.0	0.0	0.0	1.0	0.0	0.0	0.0	1.0
888	-0.041455	0.0	28.0	1.0	2.0	23.4500	0.0	0.0	1.0	1.0	0.0	0.0	0.0	1.0
889	0.333715	1.0	26.0	0.0	0.0	30.0000	1.0	0.0	0.0	0.0	1.0	1.0	0.0	0.0
890	-2.051100	0.0	32.0	0.0	0.0	7.7500	0.0	0.0	1.0	0.0	1.0	0.0	1.0	0.0

891 rows × 14 columns

In [88]:

df.sort_values(by=['Score'], ascending=False)[:20]

Out[88]:

	Score	Survived	Age	SibSp	Parch	Fare	Pclass_1	Female	Embarked_C	Embarked_S
258	4.041730	1.0	35.0	0.0	0.0	512.3292	1.0	1.0	1.0	0.0
700	3.526282	1.0	18.0	1.0	0.0	227.5250	1.0	1.0	1.0	0.0
689	3.407084	1.0	15.0	0.0	1.0	211.3375	1.0	1.0	0.0	1.0
329	3.334674	1.0	16.0	0.0	1.0	57.9792	1.0	1.0	1.0	0.0
297	3.303046	0.0	2.0	1.0	2.0	151.5500	1.0	1.0	0.0	1.0
307	3.220966	1.0	17.0	1.0	0.0	108.9000	1.0	1.0	1.0	0.0
310	3.206418	1.0	24.0	0.0	0.0	83.1583	1.0	1.0	1.0	0.0
369	3.166488	1.0	24.0	0.0	0.0	69.3000	1.0	1.0	1.0	0.0
641	3.166488	1.0	24.0	0.0	0.0	69.3000	1.0	1.0	1.0	0.0
306	3.140391	1.0	28.0	0.0	0.0	110.8833	1.0	1.0	1.0	0.0
311	3.129693	1.0	18.0	2.0	2.0	262.3750	1.0	1.0	1.0	0.0
716	3.111692	1.0	38.0	0.0	0.0	227.5250	1.0	1.0	1.0	0.0
710	3.109451	1.0	24.0	0.0	0.0	49.5042	1.0	1.0	1.0	0.0
504	3.101930	1.0	16.0	0.0	0.0	86.5000	1.0	1.0	0.0	1.0
291	3.096663	1.0	19.0	1.0	0.0	91.0792	1.0	1.0	1.0	0.0
708	3.070492	1.0	22.0	0.0	0.0	151.5500	1.0	1.0	0.0	1.0
537	3.054589	1.0	30.0	0.0	0.0	106.4250	1.0	1.0	1.0	0.0
256	3.049102	1.0	28.0	0.0	0.0	79.2000	1.0	1.0	1.0	0.0
742	3.020260	1.0	21.0	2.0	2.0	262.3750	1.0	1.0	1.0	0.0
393	3.014705	1.0	23.0	1.0	0.0	113.2750	1.0	1.0	1.0	0.0

The higher the score, the more likely the passenger actually survived.

In [90]:

df.sort_values(by=['Score'])[:20]

Out[90]:

	Score	Age	SibSp	Parch	Fare	Pclass_3	Male	Embarked_Q	Embarked_S
159	-4.759969	28.0	8.0	2.0	69.5500	1.0	1.0	0.0	1.0
324	-4.759969	28.0	8.0	2.0	69.5500	1.0	1.0	0.0	1.0
201	-4.759969	28.0	8.0	2.0	69.5500	1.0	1.0	0.0	1.0
846	-4.759969	28.0	8.0	2.0	69.5500	1.0	1.0	0.0	1.0
851	-3.914177	74.0	0.0	0.0	7.7750	1.0	1.0	0.0	1.0
116	-3.455494	70.5	0.0	0.0	7.7500	1.0	1.0	1.0	0.0
326	-3.444396	61.0	0.0	0.0	6.2375	1.0	1.0	0.0	1.0
683	-3.369645	14.0	5.0	2.0	46.9000	1.0	1.0	0.0	1.0
94	-3.368523	59.0	0.0	0.0	7.2500	1.0	1.0	0.0	1.0
13	-3.339799	39.0	1.0	5.0	31.2750	1.0	1.0	0.0	1.0
860	-3.322094	41.0	2.0	0.0	14.1083	1.0	1.0	0.0	1.0
360	-3.294984	40.0	1.0	4.0	27.9000	1.0	1.0	0.0	1.0
59	-3.260211	11.0	5.0	2.0	46.9000	1.0	1.0	0.0	1.0
280	-3.254866	65.0	0.0	0.0	7.7500	1.0	1.0	1.0	0.0
152	-3.238546	55.5	0.0	0.0	8.0500	1.0	1.0	0.0	1.0
176	-3.221140	28.0	3.0	1.0	25.4667	1.0	1.0	0.0	1.0
104	-3.193999	37.0	2.0	0.0	7.9250	1.0	1.0	0.0	1.0
480	-3.187256	9.0	5.0	2.0	46.9000	1.0	1.0	0.0	1.0
631	-3.077265	51.0	0.0	0.0	7.0542	1.0	1.0	0.0	1.0
406	-3.075261	51.0	0.0	0.0	7.7500	1.0	1.0	0.0	1.0

Passengers with low scores generally did not survive.
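
For a binary logistic regression model, the decision-function score and the predicted probability are linked by the sigmoid: a score of 0 corresponds to a probability of 0.5. The sketch below converts a few scores taken from the tables above; the dictionary keys are just labels chosen for this example.

```python
import math

def sigmoid(score):
    """Map a decision-function score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-score))

# Scores copied from the tables above (DataFrame row index -> Score).
scores = {
    "row 258": 4.041730,    # highest score
    "row 2": 0.483770,
    "row 888": -0.041455,
    "row 159": -4.759969,   # lowest score
}

probabilities = {name: sigmoid(s) for name, s in scores.items()}
for name, p in probabilities.items():
    # Predicting "survived" when p > 0.5 is the same as thresholding the score at 0.
    print(name, round(p, 4), p > 0.5)
```

Moving the score threshold away from 0 is therefore equivalent to moving the probability cutoff away from 0.5, which ties the Titanic scores back to the precision/recall trade-off discussed for MNIST.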
