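6-6. k-Nearest Neighbors (KNN)

A machine learning technique that classifies data under the assumption that data points with similar features tend to belong to the same category.

Basic principle of KNN: the training data is stored as is; for a new data point, the K closest points in the training data are found, and the new point's category is labeled from theirs.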

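KNN can achieve high predictive power on a variety of classification and regression problems, such as handwriting recognition and satellite image analysis. Because it is nonparametric, it often performs well in classification settings where the decision boundary is very irregular.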

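KNN procedure

1) In the simplest case (K = 1), a single training data point is selected as the nearest neighbor and used for the prediction

2) On a 2-D plot, the new data point is expected to belong to the same group as the data point closest to it

3) A distance formula is used to compute the distance between the new data point and the existing ones; Euclidean distance is the most common choice

4) K has to be tuned, because the group a point is assigned to can change with K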
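
The steps above can be sketched in plain Python. This is a minimal illustration, not how sklearn implements KNN: the helper names `euclidean` and `knn_classify` and the toy points are hypothetical and chosen only so that the predicted label changes when K goes from 1 to 3.

```python
import math
from collections import Counter

def euclidean(a, b):
    """Straight-line (Euclidean) distance between two points of equal dimension."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def knn_classify(train_points, train_labels, query, k):
    """Label `query` by majority vote among its k nearest training points."""
    # Sort training indices by distance to the query and keep the k closest.
    nearest = sorted(range(len(train_points)),
                     key=lambda i: euclidean(train_points[i], query))[:k]
    votes = Counter(train_labels[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical toy data: two groups on a 2-D plane.
points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0), (3.0, 3.0),
          (6.0, 6.0), (6.5, 7.0), (7.0, 6.0)]
labels = ["A", "A", "A", "B", "B", "B", "B"]
query = (2.8, 2.8)

label_k1 = knn_classify(points, labels, query, k=1)  # the single nearest point is (3.0, 3.0)
label_k3 = knn_classify(points, labels, query, k=3)  # the vote is among the 3 nearest points
print(label_k1, label_k3)  # B A
```

With K = 1 the query takes the label of its single closest point; with K = 3 two of the three nearest points belong to the other group, so the vote flips. This is why K is treated as a parameter to tune.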

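KNN regression

1) KNN regression also considers nearby neighbors, but differs in that it predicts an individual numeric value rather than a category

2) KNN derives its regression curve from the K neighboring data points; unlike linear regression, it does not derive a single regression line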

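How a KNN regression model learns

1) With K = 3, to predict Y at a given X, find the 3 data points closest to the vertical line through that X and average their values. This average is the predicted Y at X

2) If the X axis is time, the KNN regression model predicts a local average at every time point. Connecting the predictions at all time points into one line gives a representative (smoothed) curve of the data used for prediction, so KNN is also sometimes used in time series analysis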
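
The K = 3 averaging described above can be sketched as follows. The function name `knn_regress` and the data values are hypothetical, and the sketch assumes a single feature and uniform weights.

```python
def knn_regress(xs, ys, x_query, k=3):
    """Predict y at x_query as the mean y of the k training points nearest in x."""
    # Rank training indices by horizontal distance to the query point.
    nearest = sorted(range(len(xs)), key=lambda i: abs(xs[i] - x_query))[:k]
    return sum(ys[i] for i in nearest) / k

# Hypothetical one-dimensional training data.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [1.0, 3.0, 2.0, 5.0, 4.0, 6.0]

# The 3 nearest x values to 3.2 are 3.0, 4.0 and 2.0, so the prediction
# is the mean of their y values: (2.0 + 5.0 + 3.0) / 3.
pred = knn_regress(xs, ys, 3.2, k=3)

# Predicting on a grid and joining the points gives the piecewise curve
# that takes the place of a single regression line.
grid = [1.0 + 0.5 * i for i in range(11)]
curve = [knn_regress(xs, ys, x, k=3) for x in grid]
```

Because every prediction is a local average, the resulting curve follows the data rather than a fixed functional form.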

Implementing it in code: sklearn's KNeighborsClassifier

Let's look at KNeighborsClassifier, the class that implements classification with KNN

In [8]:

# Load the data
import pandas as pd
df = pd.read_csv('https://raw.githubusercontent.com/ADPclass/ADP_book_ver01/main/data/indian_liver_patient.csv')
print(df['Dataset'].value_counts())  # the target variable is imbalanced
df.head()

1    416
2    167
Name: Dataset, dtype: int64

Out[8]:

	Age	Gender	Total_Bilirubin	Direct_Bilirubin	Alkaline_Phosphotase	Alamine_Aminotransferase	Aspartate_Aminotransferase	Total_Protiens	Albumin	Albumin_and_Globulin_Ratio	Dataset
0	65	Female	0.7	0.1	187	16	18	6.8	3.3	0.90	1
1	62	Male	10.9	5.5	699	64	100	7.5	3.2	0.74	1
2	62	Male	7.3	4.1	490	60	68	7.0	3.3	0.89	1
3	58	Male	1.0	0.4	182	14	20	6.8	3.4	1.00	1
4	72	Male	3.9	2.0	195	27	59	7.3	2.4	0.40	1

In [9]:

# Preprocessing (encode the Gender variable as 0/1)
import numpy as np
print(df.info())
# np.where() is a NumPy function that selects elements according to a condition
df['Gender'] = np.where(df['Gender']=='Female',0,1)

# Check for missing values (there are 4; since that is few, drop those rows)
print('Before dropping missing values:', df.isnull().sum())
df.dropna(axis=0, inplace=True)
print('After dropping missing values:', df.isnull().sum())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 583 entries, 0 to 582
Data columns (total 11 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   Age                         583 non-null    int64  
 1   Gender                      583 non-null    object 
 2   Total_Bilirubin             583 non-null    float64
 3   Direct_Bilirubin            583 non-null    float64
 4   Alkaline_Phosphotase        583 non-null    int64  
 5   Alamine_Aminotransferase    583 non-null    int64  
 6   Aspartate_Aminotransferase  583 non-null    int64  
 7   Total_Protiens              583 non-null    float64
 8   Albumin                     583 non-null    float64
 9   Albumin_and_Globulin_Ratio  579 non-null    float64
 10  Dataset                     583 non-null    int64  
dtypes: float64(5), int64(5), object(1)
memory usage: 50.2+ KB
None
Before dropping missing values: Age                           0
Gender                        0
Total_Bilirubin               0
Direct_Bilirubin              0
Alkaline_Phosphotase          0
Alamine_Aminotransferase      0
Aspartate_Aminotransferase    0
Total_Protiens                0
Albumin                       0
Albumin_and_Globulin_Ratio    4
Dataset                       0
dtype: int64
After dropping missing values: Age                           0
Gender                        0
Total_Bilirubin               0
Direct_Bilirubin              0
Alkaline_Phosphotase          0
Alamine_Aminotransferase      0
Aspartate_Aminotransferase    0
Total_Protiens                0
Albumin                       0
Albumin_and_Globulin_Ratio    0
Dataset                       0
dtype: int64

In [10]:

# Split into training and evaluation data
from sklearn.model_selection import train_test_split
x = df.drop(columns='Dataset')
y = df['Dataset']

X_train, X_test, y_train, y_test = train_test_split(x,y,test_size=0.3, stratify=y, random_state=1)

print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)

# Create a KNN classifier with KNeighborsClassifier
from sklearn.neighbors import KNeighborsClassifier
clf = KNeighborsClassifier()
clf.fit(X_train, y_train)

(405, 10) (174, 10) (405,) (174,)

Out[10]:

KNeighborsClassifier()


In [11]:

# Call predict on the fitted KNN classifier with the test data
# and store the predictions in pred
# Since this is classification, evaluate the model with confusion-matrix-based metrics
pred = clf.predict(X_test)

from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score, f1_score

test_cm = confusion_matrix(y_test, pred)
test_acc = accuracy_score(y_test, pred)
test_prc = precision_score(y_test, pred)
test_rcll = recall_score(y_test, pred)
test_f1 = f1_score(y_test, pred)

print(test_cm)
print('Accuracy',test_acc)
print('Precision',test_prc)
print('Recall',test_rcll)
print('f1',test_f1)

[[98 26]
 [30 20]]
Accuracy 0.6781609195402298
Precision 0.765625
Recall 0.7903225806451613
f1 0.7777777777777778
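
As a check on how to read these numbers, the metrics can be recomputed by hand from the confusion matrix printed above. The sketch assumes sklearn's usual layout (rows are actual classes, columns are predicted classes, in label order 1, 2) and that the scores were computed with the default positive label 1; the variable names are illustrative only.

```python
# Confusion matrix copied from the output above.
# Assumed layout: rows = actual class (1, 2), columns = predicted class (1, 2).
cm = [[98, 26],
      [30, 20]]

tp, fn = cm[0]  # actual class 1: predicted 1 / predicted 2
fp, tn = cm[1]  # actual class 2: predicted 1 / predicted 2
total = tp + fn + fp + tn

accuracy = (tp + tn) / total
precision = tp / (tp + fp)   # of everything predicted as class 1, how much was class 1
recall = tp / (tp + fn)      # of all actual class 1 rows, how many were found
f1 = 2 * precision * recall / (precision + recall)

# The same ratios from the point of view of the minority class 2.
recall_class2 = tn / (tn + fp)
precision_class2 = tn / (tn + fn)

print(accuracy, precision, recall, f1)
print(recall_class2, precision_class2)
```

Under those assumptions the first line reproduces the four reported scores, while recall for class 2 is only 20/50 = 0.4. The headline figures therefore describe the majority class, which matters given the class imbalance noted when the data was loaded.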

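Implementing it in code: sklearn's KNeighborsRegressor

Let's look at KNeighborsRegressor, the class that implements regression with KNN

In KNN regression, uniform and distance are the ways of assigning weights to the neighbors

uniform: All neighbors have the same weight. Each neighbor's value contributes equally, regardless of its distance.

distance: Weights are assigned in inverse proportion to the neighbors' distances. Closer neighbors contribute more, so their values have a larger influence on the prediction.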
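
The difference between the two schemes can be seen in a small hand calculation. The function `weighted_knn_predict` and the three neighbors below are hypothetical; the sketch assumes inverse-distance weights of 1/d and that no neighbor sits at distance zero, which would need separate handling.

```python
def weighted_knn_predict(distances, values, weights="uniform"):
    """Combine neighbor values into one prediction using the chosen weighting."""
    if weights == "uniform":
        w = [1.0] * len(values)             # every neighbor counts equally
    elif weights == "distance":
        w = [1.0 / d for d in distances]    # closer neighbors count more (assumes d > 0)
    else:
        raise ValueError("weights must be 'uniform' or 'distance'")
    return sum(wi * vi for wi, vi in zip(w, values)) / sum(w)

# Hypothetical neighbors of one query point: (distance, target value).
distances = [0.5, 1.0, 2.0]
values = [10.0, 20.0, 40.0]

uni = weighted_knn_predict(distances, values, "uniform")    # plain mean: 70 / 3
dis = weighted_knn_predict(distances, values, "distance")   # weights 2, 1, 0.5: 60 / 3.5
print(uni, dis)
```

The distance-weighted prediction is pulled toward the value of the closest neighbor (10.0), while the uniform prediction is the plain average of all three.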

In [12]:

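# Generate arbitrary data and run a regression analysis with KNN
import numpy as np
import pandas as pd  # used below for the results table

# Generate arbitrary sample data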
np.random.seed(0)
X = np.sort(5* np.random.rand(400,1), axis=0)
T = np.linspace(0,5,500)[:, np.newaxis]
y = np.sin(X).ravel()

# Add noise to the target data (uniform in (-0.5, 0.5])
y += 0.5 - np.random.rand(400)

# Split into training and evaluation data
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.3, random_state=1)
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)

# Create KNN estimators with KNeighborsRegressor, using different weights
from sklearn.neighbors import KNeighborsRegressor

knn_uni = KNeighborsRegressor(n_neighbors=20, weights='uniform')
knn_dis = KNeighborsRegressor(n_neighbors=20, weights='distance')

knn_uni.fit(X_train, y_train)
knn_dis.fit(X_train, y_train)

# Apply predict to the evaluation data
# Compare the two models with several regression performance metrics
uni_pred = knn_uni.predict(X_test)
dis_pred = knn_dis.predict(X_test)

from sklearn.metrics import mean_squared_error, mean_absolute_error
preds = [uni_pred, dis_pred]
weights = ['uniform','distance']
evls = ['mse','rmse','mae']

target = pd.DataFrame(index=weights, columns=evls)

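for pred, nm in zip(preds, weights):
    mse = mean_squared_error(y_test, pred)
    mae = mean_absolute_error(y_test, pred)
    rmse = np.sqrt(mse)

    # Use a single .loc[row, column] call; chained indexing such as
    # target.loc[nm]['mse'] may assign to a temporary copy instead of target
    target.loc[nm, 'mse'] = round(mse,2)
    target.loc[nm, 'mae'] = round(mae,2)
    target.loc[nm, 'rmse'] = round(rmse,2)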

target

(280, 1) (120, 1) (280,) (120,)

Out[12]:

	mse	rmse	mae
uniform	0.1	0.31	0.27
distance	0.11	0.34	0.28
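
For reference, the three metrics in the table follow the standard definitions below. The notation is an assumption introduced here: n is the number of evaluation samples (120 in this split), y_i is an actual target value, and \hat{y}_i is the corresponding prediction.

```latex
\begin{aligned}
\mathrm{MSE}  &= \frac{1}{n}\sum_{i=1}^{n} \left(y_i - \hat{y}_i\right)^2 \\
\mathrm{RMSE} &= \sqrt{\mathrm{MSE}} \\
\mathrm{MAE}  &= \frac{1}{n}\sum_{i=1}^{n} \left\lvert y_i - \hat{y}_i \right\rvert
\end{aligned}
```

Each value in the table is rounded to two decimals separately, so the rmse column need not equal the square root of the rounded mse column exactly.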

Visualize the two models to compare their differences intuitively

In [15]:

import matplotlib.pyplot as plt

plt.figure(figsize=(8,5))

for i, weights in enumerate(['uniform','distance']):
    knn = KNeighborsRegressor(n_neighbors=20, weights=weights)
    
    y_ = knn.fit(X,y).predict(T)
    
    plt.subplot(2, 1, i+1)
    plt.scatter(X, y, color='darkorange', label='data')
    plt.plot(T, y_, color='navy', label='prediction')
    plt.axis('tight')
    plt.legend()
    plt.title("KNeighborsRegressor (k = %i, weights = '%s')" % (20, weights))
    
plt.tight_layout()
plt.show()