Task Type 2
Service churn prediction data (classification)
Data description: predict whether a customer leaves the company's service from customer profile data (target variable: Exited)
x_train : https://raw.githubusercontent.com/Datamanim/datarepo/main/churnk/X_train.csv
y_train : https://raw.githubusercontent.com/Datamanim/datarepo/main/churnk/y_train.csv
x_test : https://raw.githubusercontent.com/Datamanim/datarepo/main/churnk/X_test.csv
y_test (for evaluation) : https://raw.githubusercontent.com/Datamanim/datarepo/main/churnk/y_test.csv
In [174]:
import pandas as pd
x_train = pd.read_csv('https://raw.githubusercontent.com/Datamanim/datarepo/main/churnk/X_train.csv')
x_test = pd.read_csv('https://raw.githubusercontent.com/Datamanim/datarepo/main/churnk/X_test.csv')
y_train = pd.read_csv('https://raw.githubusercontent.com/Datamanim/datarepo/main/churnk/y_train.csv')
y_test = pd.read_csv('https://raw.githubusercontent.com/Datamanim/datarepo/main/churnk/y_test.csv')
#display(x_train.head())
#display(y_train.head())
#print(x_train.info())
# The Gender column has inconsistent values and needs preprocessing
#print(x_train.nunique())
# No null values
#print(x_train.isnull().sum())
# Normalize Gender: the same category appears with different casing and stray spaces (e.g. 'Male' and ' male')
def normalize_gender(x):
    # Strip surrounding whitespace, then capitalize: ' male' -> 'Male', 'female' -> 'Female'
    return x.strip().capitalize()
x_train['Gender'] = x_train['Gender'].apply(normalize_gender)
x_test['Gender'] = x_test['Gender'].apply(normalize_gender)
# CustomerId and Surname look uninformative, so drop them
X_train = x_train.drop(columns=['CustomerId','Surname'])
y_train = y_train['Exited']
X_test = x_test.drop(columns=['CustomerId','Surname'])
y_test = y_test['Exited']
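# One-hot encode the object-type columns (Geography, Gender)
X_train = pd.get_dummies(X_train)
X_test = pd.get_dummies(X_test)
# Give the test set the same columns, in the same order, as the training set
X_test = X_test.reindex(columns=X_train.columns, fill_value=0)
# Split the data (train = 0.67, validation = 0.33)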
from sklearn.model_selection import train_test_split
X_train, X_vaildation, Y_train, Y_vaildation = train_test_split(X_train, y_train, test_size = 0.33, random_state=43, stratify=y_train)
print(X_train.shape, X_vaildation.shape, Y_train.shape, Y_vaildation.shape)
# Fit a RandomForestClassifier model
from sklearn.ensemble import RandomForestClassifier
RFC = RandomForestClassifier(random_state=1, oob_score = True)
RFC.fit(X_train, Y_train)
pred_train = RFC.predict(X_train)
pred_train_proba = RFC.predict_proba(X_train)[:,1]
pred_test = RFC.predict(X_vaildation)
pred_test_proba = RFC.predict_proba(X_vaildation)[:,1]
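# Compare the predictions with the true values to evaluate performance
# This is a classification problem, so use the confusion-matrix-based metrics accuracy, precision, recall, f1, plus roc_auc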
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score
train_acc = accuracy_score(Y_train, pred_train)
test_acc = accuracy_score(Y_vaildation, pred_test)
train_f1 = f1_score(Y_train, pred_train)
test_f1 = f1_score(Y_vaildation, pred_test)
train_recall = recall_score(Y_train, pred_train)
test_recall = recall_score(Y_vaildation, pred_test)
train_pre = precision_score(Y_train, pred_train)
test_pre = precision_score(Y_vaildation, pred_test)
train_roc_auc = roc_auc_score(Y_train, pred_train_proba)
test_roc_auc = roc_auc_score(Y_vaildation, pred_test_proba)
print('train_acc',train_acc)
print('train_f1',train_f1)
print('train_recall',train_recall)
print('train_pre',train_pre)
print('train_roc_auc',train_roc_auc)
print('\n')
print('test_acc',test_acc)
print('test_f1',test_f1)
print('test_recall',test_recall)
print('test_pre',test_pre)
print('test_roc_auc',test_roc_auc)
# Predict on the test data as well
pred_result = RFC.predict(X_test)
pred_result_proba = RFC.predict_proba(X_test)[:,1]
pd.DataFrame({'CustomerId':x_test.CustomerId, 'Exited':pred_result, 'Exited_proba':pred_result_proba}).to_csv('asdf1.csv', index=False)
(4354, 13) (2145, 13) (4354,) (2145,)
train_acc 1.0
train_f1 1.0
train_recall 1.0
train_pre 1.0
train_roc_auc 1.0

test_acc 0.8596736596736597
test_f1 0.5592972181551976
test_recall 0.43707093821510296
test_pre 0.7764227642276422
test_roc_auc 0.8503562452103173
Out[174]:
| | CustomerId | Exited | Exited_proba |
|---|---|---|---|
| 0 | 15601012 | 1 | 0.84 |
| 1 | 15734762 | 1 | 0.93 |
| 2 | 15586757 | 0 | 0.10 |
| 3 | 15590888 | 0 | 0.20 |
| 4 | 15726087 | 0 | 0.30 |
| ... | ... | ... | ... |
| 3496 | 15733966 | 0 | 0.44 |
| 3497 | 15669994 | 0 | 0.13 |
| 3498 | 15712403 | 1 | 0.93 |
| 3499 | 15643819 | 0 | 0.02 |
| 3500 | 15644962 | 0 | 0.00 |
3501 rows × 3 columns
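One-hot encoding the training and test frames separately, as in the cell above, only yields identical columns when both frames contain the same categories. A minimal sketch of how `reindex` can align the test columns to the training columns, using made-up toy frames (the values are hypothetical, not taken from the churn data):

```python
import pandas as pd

# Toy frames (hypothetical values): the test frame has no 'Spain' rows
train = pd.DataFrame({"Geography": ["France", "Spain", "Germany"], "Age": [40, 35, 52]})
test = pd.DataFrame({"Geography": ["France", "Germany"], "Age": [29, 61]})

train_encoded = pd.get_dummies(train)
test_encoded = pd.get_dummies(test)

# Reuse the training columns; categories missing from test become all-zero columns
test_aligned = test_encoded.reindex(columns=train_encoded.columns, fill_value=0)

print(list(train_encoded.columns))
print(list(test_encoded.columns))
print(list(test_aligned.columns))
```

Unlike indexing with `X_test[X_train.columns]`, `reindex` does not raise when a training column is absent from the test frame; it fills that column with `fill_value`.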
Job change prediction data
Data description: job change prediction data (target=1: changes jobs, target=0: does not change jobs)
x_train: https://raw.githubusercontent.com/Datamanim/datarepo/main/HRdata/X_train.csv
y_train: https://raw.githubusercontent.com/Datamanim/datarepo/main/HRdata/y_train.csv
x_test: https://raw.githubusercontent.com/Datamanim/datarepo/main/HRdata/X_test.csv
y_test (for evaluation) : https://raw.githubusercontent.com/Datamanim/datarepo/main/HRdata/y_test.csv
In [175]:
import pandas as pd
x_train = pd.read_csv('https://raw.githubusercontent.com/Datamanim/datarepo/main/HRdata/X_train.csv')
x_test = pd.read_csv('https://raw.githubusercontent.com/Datamanim/datarepo/main/HRdata/X_test.csv')
y_train = pd.read_csv('https://raw.githubusercontent.com/Datamanim/datarepo/main/HRdata/y_train.csv')
y_test = pd.read_csv('https://raw.githubusercontent.com/Datamanim/datarepo/main/HRdata/y_test.csv')
#display(x_train)
#display(y_train.head())
#print(x_train.info())
#print(x_train.nunique())
#print(x_train.isnull().sum())
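# There are missing values, but they are left unhandled here
# Drop the enrollee_id column
X_train = x_train.drop(columns='enrollee_id')
X_test = x_test.drop(columns='enrollee_id')
# Label-encode the categorical columns that have many unique values
from sklearn.preprocessing import LabelEncoder
for i in ['city','experience']:
    label_encoder = LabelEncoder()
    # Fit once on the train and test values together so both frames share one mapping
    # (calling fit_transform on each frame separately can give the same category different codes)
    label_encoder.fit(pd.concat([X_train[i], X_test[i]]).astype(str))
    X_train[i] = label_encoder.transform(X_train[i].astype(str))
    X_test[i] = label_encoder.transform(X_test[i].astype(str))
# One-hot encode the columns with few unique values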
X_train = pd.get_dummies(X_train)
y_train = y_train['target']
X_test = pd.get_dummies(X_test)
# Use the same column order as the training set
X_test = X_test[X_train.columns]
y_test = y_test['target']
# Split the data (train = 0.7, validation = 0.3)
from sklearn.model_selection import train_test_split
X_train, X_vaildation, Y_train, Y_vaildation = train_test_split(X_train, y_train, test_size=0.3, stratify=y_train, random_state=1)
# Fit a random forest model
from sklearn.ensemble import RandomForestClassifier
RFC = RandomForestClassifier(random_state=43, oob_score=True)
RFC.fit(X_train, Y_train)
# Evaluation metrics (classification, so confusion-matrix based): accuracy, precision, recall, f1, plus roc_auc
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score
pred_train = RFC.predict(X_train)
pred_train_proba = RFC.predict_proba(X_train)[:,1]
pred_test = RFC.predict(X_vaildation)
pred_test_proba = RFC.predict_proba(X_vaildation)[:,1]
# Evaluate performance
train_acc = accuracy_score(Y_train, pred_train)
test_acc = accuracy_score(Y_vaildation, pred_test)
train_pre = precision_score(Y_train, pred_train)
test_pre = precision_score(Y_vaildation, pred_test)
train_recall = recall_score(Y_train, pred_train)
test_recall = recall_score(Y_vaildation, pred_test)
train_f1 = f1_score(Y_train, pred_train)
test_f1 = f1_score(Y_vaildation, pred_test)
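# ROC AUC is computed from predicted probabilities, not hard class labels
# (the roc_auc values in the recorded output came from hard labels, so a rerun will differ)
train_roc_auc = roc_auc_score(Y_train, pred_train_proba)
test_roc_auc = roc_auc_score(Y_vaildation, pred_test_proba)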
print('train_acc',train_acc)
print('train_pre',train_pre)
print('train_recall',train_recall)
print('train_f1',train_f1)
print('train_roc_auc',train_roc_auc)
print('\n')
print('test_acc',test_acc)
print('test_pre',test_pre)
print('test_recall',test_recall)
print('test_f1',test_f1)
print('test_roc_auc',test_roc_auc)
pred_result = RFC.predict(X_test)
pred_result_proba = RFC.predict_proba(X_test)[:,1]
display(pd.DataFrame({'enrollee_id' : x_test.enrollee_id, 'target':pred_result, 'target_proba':pred_result_proba}))
#pd.DataFrame({'enrollee_id' : x_test.enrollee_id, 'target':pred_result}).to_csv('result.csv', index=False)
#pd.DataFrame({'enrollee_id' : x_test.enrollee_id, 'target':pred_result_proba}).to_csv('result_proba', index=False)
train_acc 0.9991968793024323
train_pre 0.9986187845303868
train_recall 0.9981592268752876
train_f1 0.9983889528193325
train_roc_auc 0.9988503608012385

test_acc 0.7764989293361885
test_pre 0.5684062059238364
test_recall 0.43240343347639487
test_f1 0.49116392443631934
test_roc_auc 0.6616368094628764
| | enrollee_id | target | target_proba |
|---|---|---|---|
| 0 | 7129 | 0.0 | 0.47 |
| 1 | 31037 | 0.0 | 0.42 |
| 2 | 22179 | 1.0 | 0.61 |
| 3 | 29724 | 1.0 | 0.59 |
| 4 | 17977 | 0.0 | 0.30 |
| ... | ... | ... | ... |
| 6701 | 3601 | 0.0 | 0.08 |
| 6702 | 2745 | 1.0 | 0.57 |
| 6703 | 18520 | 0.0 | 0.20 |
| 6704 | 10067 | 0.0 | 0.49 |
| 6705 | 8203 | 1.0 | 0.54 |
6706 rows × 3 columns
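ROC AUC measures how well a score ranks positives above negatives, so it is normally computed from predicted probabilities. Passing hard 0/1 predictions collapses the ranking to a single threshold and usually understates the score. A small sketch on hypothetical toy values (not taken from the HR data) that shows the difference:

```python
from sklearn.metrics import roc_auc_score

# Toy labels and scores (hypothetical values)
y_true = [0, 0, 0, 1, 1, 1]
y_score = [0.1, 0.2, 0.6, 0.3, 0.4, 0.9]

# Hard predictions at the usual 0.5 threshold
y_label = [1 if s >= 0.5 else 0 for s in y_score]

auc_from_scores = roc_auc_score(y_true, y_score)
auc_from_labels = roc_auc_score(y_true, y_label)

print("AUC from probabilities:", auc_from_scores)
print("AUC from hard labels:  ", auc_from_labels)
```

In this toy case the probability-based value is 7/9, while the label-based value drops to 0.5.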
On-time delivery prediction
Data description: whether an e-commerce shipment arrived on time (1: on time, 0: not on time)
x_train: https://raw.githubusercontent.com/Datamanim/datarepo/main/shipping/X_train.csv
y_train: https://raw.githubusercontent.com/Datamanim/datarepo/main/shipping/y_train.csv
x_test: https://raw.githubusercontent.com/Datamanim/datarepo/main/shipping/X_test.csv
y_test (for evaluation) : https://raw.githubusercontent.com/Datamanim/datarepo/main/shipping/y_test.csv
In [55]:
import pandas as pd
x_train = pd.read_csv('https://raw.githubusercontent.com/Datamanim/datarepo/main/shipping/X_train.csv')
x_test = pd.read_csv('https://raw.githubusercontent.com/Datamanim/datarepo/main/shipping/X_test.csv')
y_train = pd.read_csv('https://raw.githubusercontent.com/Datamanim/datarepo/main/shipping/y_train.csv')
y_test = pd.read_csv('https://raw.githubusercontent.com/Datamanim/datarepo/main/shipping/y_test.csv')
#display(x_train.head())
#display(y_train.head())
print(x_train.info())
# No obviously odd values, and the category counts are small enough for one-hot encoding
print(x_train.nunique())
# No null values
#print(x_train.isnull().sum())
# Drop the ID column
X_train = x_train.drop(columns='ID')
X_test = x_test.drop(columns='ID')
y_train = y_train['Reached.on.Time_Y.N']
y_test = y_test['Reached.on.Time_Y.N']
# One-hot encoding
X_train = pd.get_dummies(X_train)
X_test = pd.get_dummies(X_test)
X_test = X_test[X_train.columns]
# Split the data (train = 0.67, validation = 0.33)
from sklearn.model_selection import train_test_split
X_train, X_vaildation, Y_train, Y_vaildation = train_test_split(X_train, y_train, test_size=0.33, random_state=43, stratify=y_train)
print(X_train.shape, X_vaildation.shape, Y_train.shape, Y_vaildation.shape)
# Model with RandomForestClassifier
from sklearn.ensemble import RandomForestClassifier
RFC = RandomForestClassifier(random_state=42, oob_score=True)
RFC.fit(X_train, Y_train)
# Evaluate the predictions against the true values with confusion-matrix-based metrics
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score
train_pred = RFC.predict(X_train)
train_pred_proba = RFC.predict_proba(X_train)[:,1]
test_pred = RFC.predict(X_vaildation)
test_pred_pro = RFC.predict_proba(X_vaildation)[:,1]
train_acc = accuracy_score(Y_train, train_pred)
test_acc = accuracy_score(Y_vaildation, test_pred)
train_pre = precision_score(Y_train, train_pred)
test_pre = precision_score(Y_vaildation, test_pred)
train_recall = recall_score(Y_train, train_pred)
test_recall = recall_score(Y_vaildation, test_pred)
train_f1 = f1_score(Y_train, train_pred)
test_f1 = f1_score(Y_vaildation, test_pred)
train_roc_auc = roc_auc_score(Y_train, train_pred_proba)
test_roc_auc = roc_auc_score(Y_vaildation, test_pred_pro)
print('train_acc',train_acc)
print('train_pre',train_pre)
print('train_recall',train_recall)
print('train_f1',train_f1)
print('train_roc_auc',train_roc_auc)
print('\n')
print('test_acc',test_acc)
print('test_pre',test_pre)
print('test_recall',test_recall)
print('test_f1',test_f1)
print('test_roc_auc',test_roc_auc)
result_pred = RFC.predict(X_test)
result_pred_proba = RFC.predict_proba(X_test)[:,1]
pd.DataFrame({'ID':x_test.ID, 'Reached.on.Time_Y.N':result_pred, 'Reached.on.Time_Y.N_proba':result_pred_proba})
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6598 entries, 0 to 6597
Data columns (total 11 columns):
 #   Column               Non-Null Count  Dtype
---  ------               --------------  -----
 0   ID                   6598 non-null   int64
 1   Warehouse_block      6598 non-null   object
 2   Mode_of_Shipment     6598 non-null   object
 3   Customer_care_calls  6598 non-null   object
 4   Customer_rating      6598 non-null   int64
 5   Cost_of_the_Product  6598 non-null   int64
 6   Prior_purchases      6598 non-null   int64
 7   Product_importance   6598 non-null   object
 8   Gender               6598 non-null   object
 9   Discount_offered     6598 non-null   int64
 10  Weight_in_gms        6598 non-null   int64
dtypes: int64(6), object(5)
memory usage: 567.1+ KB
None
ID                     6598
Warehouse_block           5
Mode_of_Shipment          3
Customer_care_calls       6
Customer_rating           5
Cost_of_the_Product     215
Prior_purchases           8
Product_importance        3
Gender                    2
Discount_offered         65
Weight_in_gms          3365
dtype: int64
(4420, 24) (2178, 24) (4420,) (2178,)
train_acc 1.0
train_pre 1.0
train_recall 1.0
train_f1 1.0
train_roc_auc 1.0

test_acc 0.6643709825528007
test_pre 0.7480383609415867
test_recall 0.66
test_f1 0.7012668573763793
test_roc_auc 0.7492658139127387
Out[55]:
| | ID | Reached.on.Time_Y.N | Reached.on.Time_Y.N_proba |
|---|---|---|---|
| 0 | 6811 | 1 | 0.66 |
| 1 | 4320 | 0 | 0.37 |
| 2 | 5732 | 0 | 0.41 |
| 3 | 7429 | 0 | 0.38 |
| 4 | 2191 | 1 | 1.00 |
| ... | ... | ... | ... |
| 4396 | 2610 | 1 | 1.00 |
| 4397 | 3406 | 1 | 0.59 |
| 4398 | 10395 | 0 | 0.37 |
| 4399 | 3646 | 0 | 0.26 |
| 4400 | 573 | 1 | 1.00 |
4401 rows × 3 columns
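Every model in these cells is built with `oob_score=True`, but the out-of-bag estimate is never read. The sketch below shows how `oob_score_` can be inspected next to the training accuracy. It uses synthetic data from `make_classification` as a stand-in for the real features, so the numbers it prints say nothing about the shipping data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (an assumption for illustration only)
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

model = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=42)
model.fit(X, y)

# Accuracy on the rows the forest was trained on (optimistic for fully grown trees)
train_accuracy = model.score(X, y)
# Accuracy estimated from the rows each tree did not see in its bootstrap sample
oob_accuracy = model.oob_score_

print("train accuracy:", train_accuracy)
print("out-of-bag accuracy:", oob_accuracy)
```

The out-of-bag value is typically much closer to the validation accuracy than the near-perfect training scores reported above, so it gives a quick overfitting check without a separate split.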
Adult health checkup data
Data description: 2018 adult health checkup data (target variable: 흡연상태, i.e. smoking status; smoker: 1, non-smoker: 0)
x_train: https://raw.githubusercontent.com/Datamanim/datarepo/main/smoke/x_train.csv
x_test: https://raw.githubusercontent.com/Datamanim/datarepo/main/smoke/x_test.csv
y_train: https://raw.githubusercontent.com/Datamanim/datarepo/main/smoke/y_train.csv
y_test: https://raw.githubusercontent.com/Datamanim/datarepo/main/smoke/y_test.csv
import pandas as pd
x_train = pd.read_csv('https://raw.githubusercontent.com/Datamanim/datarepo/main/smoke/x_train.csv')
x_test = pd.read_csv('https://raw.githubusercontent.com/Datamanim/datarepo/main/smoke/x_test.csv')
y_train = pd.read_csv('https://raw.githubusercontent.com/Datamanim/datarepo/main/smoke/y_train.csv')
y_test = pd.read_csv('https://raw.githubusercontent.com/Datamanim/datarepo/main/smoke/y_test.csv')
display(x_train.head())
display(y_train.head())
In [79]:
#print(x_train.info())
#print(x_train.nunique()) # no unusual values
#print(x_train.isnull().sum()) # no null values
# Drop the ID column
X_train = x_train.drop(columns='ID')
X_test = x_test.drop(columns='ID')
Y_train = y_train['흡연상태']
Y_test = y_test['흡연상태']
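# No column has many categories, so one-hot encode everything
X_train = pd.get_dummies(X_train)
X_test = pd.get_dummies(X_test)
# Give the test set the same columns, in the same order, as the training set
X_test = X_test.reindex(columns=X_train.columns, fill_value=0)
# Check whether the target variable is imbalanced
#print(Y_train.value_counts())
# Split the data (train = 0.67, validation = 0.33)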
from sklearn.model_selection import train_test_split
X_train, X_vaildation, Y_train, Y_vaildation = train_test_split(X_train, Y_train, test_size=0.33, random_state=43, stratify = Y_train)
print(X_train.shape, X_vaildation.shape, Y_train.shape, Y_vaildation.shape)
# Model the data with RandomForestClassifier
from sklearn.ensemble import RandomForestClassifier
RFC = RandomForestClassifier(random_state=42, oob_score=True)
RFC.fit(X_train, Y_train)
# Predict on the training and validation data
pred_train = RFC.predict(X_train)
pred_train_proba = RFC.predict_proba(X_train)[:,1]
pred_vaildation = RFC.predict(X_vaildation)
pred_vaildation_proba = RFC.predict_proba(X_vaildation)[:,1]
# Compare the predictions with the true values to evaluate performance
# This is a classification problem, so use confusion-matrix-based metrics
# accuracy_score, precision_score, recall_score, f1_score, roc_auc_score
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score
train_acc = accuracy_score(Y_train, pred_train)
test_acc = accuracy_score(Y_vaildation, pred_vaildation)
train_pre = precision_score(Y_train, pred_train)
test_pre = precision_score(Y_vaildation, pred_vaildation)
train_recall = recall_score(Y_train, pred_train)
test_recall = recall_score(Y_vaildation, pred_vaildation)
train_f1 = f1_score(Y_train, pred_train)
test_f1 = f1_score(Y_vaildation, pred_vaildation)
train_roc_auc = roc_auc_score(Y_train, pred_train_proba)
test_roc_auc = roc_auc_score(Y_vaildation, pred_vaildation_proba)
print('train_acc',train_acc)
print('train_pre',train_pre)
print('train_recall',train_recall)
print('train_f1',train_f1)
print('train_roc_auc',train_roc_auc)
print('\n')
print('test_acc',test_acc)
print('test_pre',test_pre)
print('test_recall',test_recall)
print('test_f1',test_f1)
print('test_roc_auc',test_roc_auc)
result_pred = RFC.predict(X_test)
result_pred_proba = RFC.predict_proba(X_test)[:,1]
pd.DataFrame({'ID':x_test['ID'], '흡연상태':result_pred, '흡연상태_proba':result_pred_proba}).to_csv('asdf.csv')
(29850, 27) (14703, 27) (29850,) (14703,)
train_acc 1.0
train_pre 1.0
train_recall 1.0
train_f1 1.0
train_roc_auc 1.0

test_acc 0.7541318098347276
test_pre 0.6511114882063466
test_recall 0.7110822831727205
test_f1 0.679776773850651
test_roc_auc 0.8361711013566313
Out[79]:
| | ID | 흡연상태 | 흡연상태_proba |
|---|---|---|---|
| 0 | 8 | 0 | 0.23 |
| 1 | 17 | 0 | 0.38 |
| 2 | 20 | 1 | 0.92 |
| 3 | 24 | 0 | 0.18 |
| 4 | 25 | 0 | 0.15 |
| ... | ... | ... | ... |
| 11134 | 55676 | 0 | 0.00 |
| 11135 | 55681 | 0 | 0.01 |
| 11136 | 55683 | 0 | 0.00 |
| 11137 | 55684 | 0 | 0.33 |
| 11138 | 55691 | 1 | 0.91 |
11139 rows × 3 columns
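Each section repeats the same ten metric assignments and ten `print` calls. One way to cut the repetition is a small helper that returns all five metrics at once. The sketch below is only an illustration: `binary_metrics` is a hypothetical name, not something defined in the notebook, and the toy arrays are made up.

```python
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score)

def binary_metrics(y_true, y_pred, y_score):
    """Return the five binary-classification metrics used in these cells."""
    return {
        "acc": accuracy_score(y_true, y_pred),
        "pre": precision_score(y_true, y_pred),
        "recall": recall_score(y_true, y_pred),
        "f1": f1_score(y_true, y_pred),
        # ROC AUC takes the positive-class scores, not the hard predictions
        "roc_auc": roc_auc_score(y_true, y_score),
    }

# Toy values (hypothetical): scores are thresholded at 0.5 to get predictions
y_true = [0, 1, 1, 0, 1]
y_score = [0.2, 0.8, 0.4, 0.6, 0.9]
y_pred = [int(s >= 0.5) for s in y_score]

metrics = binary_metrics(y_true, y_pred, y_score)
for name, value in metrics.items():
    print(name, value)
```

With the notebook's variables, the call would look like `binary_metrics(Y_vaildation, pred_vaildation, pred_vaildation_proba)`.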
Airline passenger satisfaction data
Data description: airline passenger satisfaction (satisfaction column)
x_train: https://raw.githubusercontent.com/Datamanim/datarepo/main/airline/x_train.csv
x_test: https://raw.githubusercontent.com/Datamanim/datarepo/main/airline/x_test.csv
y_train: https://raw.githubusercontent.com/Datamanim/datarepo/main/airline/y_train.csv
y_test : https://raw.githubusercontent.com/Datamanim/datarepo/main/airline/y_test.csv
In [110]:
# Load the data
import pandas as pd
x_train = pd.read_csv('https://raw.githubusercontent.com/Datamanim/datarepo/main/airline/x_train.csv')
x_test = pd.read_csv('https://raw.githubusercontent.com/Datamanim/datarepo/main/airline/x_test.csv')
y_train = pd.read_csv('https://raw.githubusercontent.com/Datamanim/datarepo/main/airline/y_train.csv')
y_test = pd.read_csv('https://raw.githubusercontent.com/Datamanim/datarepo/main/airline/y_test.csv')
display(x_train)
display(y_train)
| | ID | Gender | Customer Type | Age | Type of Travel | Class | Flight Distance | Inflight wifi service | Departure/Arrival time convenient | Ease of Online booking | ... | Inflight entertainment | On-board service | Leg room service | Baggage handling | Checkin service | Inflight service | Cleanliness | Departure Delay in Minutes | Arrival Delay in Minutes | id |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | Female | Loyal Customer | 54 | Personal Travel | Eco | 1068 | 3 | 4 | 3 | ... | 5 | 5 | 3 | 5 | 3 | 5 | 3 | 47 | 22.0 | NaN |
| 1 | 2 | Male | Loyal Customer | 20 | Personal Travel | Eco | 1546 | 4 | 4 | 4 | ... | 4 | 3 | 3 | 4 | 4 | 4 | 4 | 5 | 2.0 | NaN |
| 2 | 3 | Male | Loyal Customer | 59 | Business travel | Business | 2962 | 0 | 4 | 0 | ... | 1 | 1 | 1 | 1 | 5 | 1 | 4 | 54 | 46.0 | NaN |
| 3 | 4 | Male | Loyal Customer | 35 | Business travel | Eco Plus | 106 | 5 | 4 | 4 | ... | 5 | 2 | 1 | 5 | 4 | 4 | 5 | 130 | 121.0 | NaN |
| 4 | 5 | Female | Loyal Customer | 9 | Business travel | Business | 2917 | 3 | 3 | 3 | ... | 4 | 4 | 4 | 5 | 4 | 3 | 4 | 0 | 0.0 | NaN |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 83118 | 103893 | Female | Loyal Customer | 52 | Personal Travel | Eco Plus | 748 | 3 | 5 | 3 | ... | 3 | 3 | 3 | 3 | 5 | 3 | 3 | 115 | 124.0 | NaN |
| 83119 | 103898 | Male | Loyal Customer | 56 | Personal Travel | Business | 601 | 1 | 4 | 1 | ... | 4 | 4 | 1 | 4 | 3 | 4 | 1 | 13 | 12.0 | NaN |
| 83120 | 103899 | Female | Loyal Customer | 26 | Business travel | Business | 1259 | 5 | 5 | 5 | ... | 5 | 4 | 2 | 5 | 5 | 5 | 5 | 0 | 0.0 | NaN |
| 83121 | 103901 | Male | disloyal Customer | 23 | Business travel | Eco | 833 | 5 | 5 | 5 | ... | 3 | 4 | 3 | 3 | 4 | 5 | 3 | 0 | 0.0 | NaN |
| 83122 | 103903 | Male | Loyal Customer | 35 | Personal Travel | Business | 362 | 3 | 5 | 3 | ... | 3 | 3 | 3 | 3 | 5 | 3 | 5 | 37 | 36.0 | NaN |
83123 rows × 24 columns
| | ID | satisfaction |
|---|---|---|
| 0 | 0 | neutral or dissatisfied |
| 1 | 2 | neutral or dissatisfied |
| 2 | 3 | satisfied |
| 3 | 4 | satisfied |
| 4 | 5 | satisfied |
| ... | ... | ... |
| 83118 | 103893 | neutral or dissatisfied |
| 83119 | 103898 | neutral or dissatisfied |
| 83120 | 103899 | satisfied |
| 83121 | 103901 | satisfied |
| 83122 | 103903 | neutral or dissatisfied |
83123 rows × 2 columns
In [163]:
# Inspect the data
#print(x_train.info())
#print(x_train.nunique())
#print(x_train.isnull().sum())
# Fill missing values with fillna; missing values in x_test must be filled with the train mean
# to avoid data leakage
x_train['Arrival Delay in Minutes'] = x_train['Arrival Delay in Minutes'].fillna(x_train['Arrival Delay in Minutes'].mean())
x_test['Arrival Delay in Minutes'] = x_test['Arrival Delay in Minutes'].fillna(x_train['Arrival Delay in Minutes'].mean())
# Drop the ID and id columns, then one-hot encode the object-type columns
X_train = x_train.drop(columns = ['ID','id'])
X_test = x_test.drop(columns = ['ID','id'])
Y_train = y_train['satisfaction']
Y_test = y_test['satisfaction']
X_train = pd.get_dummies(X_train)
X_test = pd.get_dummies(X_test)
X_test = X_test[X_train.columns]
# Split the data (train = 0.67, validation = 0.33)
from sklearn.model_selection import train_test_split
X_train, X_vaildation, Y_train, Y_vaildation = train_test_split(X_train,Y_train,test_size=0.33, random_state=43, stratify=Y_train)
print(X_train.shape, X_vaildation.shape, Y_train.shape, Y_vaildation.shape)
# Model the data with RandomForestClassifier
from sklearn.ensemble import RandomForestClassifier
RFC = RandomForestClassifier(random_state=42, oob_score=True)
RFC.fit(X_train, Y_train)
pred_train = RFC.predict(X_train)
pred_train_proba = RFC.predict_proba(X_train)[:,1]
pred_vaildation = RFC.predict(X_vaildation)
pred_vaildation_proba = RFC.predict_proba(X_vaildation)[:,1]
# Compare the predictions with the true values to evaluate performance
# This is a classification problem, so use confusion-matrix-based metrics
# accuracy_score, precision_score, recall_score, f1_score, roc_auc_score
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score
train_acc = accuracy_score(Y_train, pred_train)
test_acc = accuracy_score(Y_vaildation, pred_vaildation)
train_pre = precision_score(Y_train, pred_train, pos_label='satisfied')
test_pre = precision_score(Y_vaildation, pred_vaildation, pos_label='satisfied')
train_recall = recall_score(Y_train, pred_train, pos_label='satisfied')
test_recall = recall_score(Y_vaildation, pred_vaildation, pos_label='satisfied')
train_f1 = f1_score(Y_train, pred_train, pos_label='satisfied')
test_f1 = f1_score(Y_vaildation, pred_vaildation, pos_label='satisfied')
train_roc_auc = roc_auc_score(Y_train, pred_train_proba)
test_roc_auc = roc_auc_score(Y_vaildation, pred_vaildation_proba)
print('train_acc',train_acc)
print('train_pre',train_pre)
print('train_recall',train_recall)
print('train_f1',train_f1)
print('train_roc_auc',train_roc_auc)
print('\n')
print('test_acc',test_acc)
print('test_pre',test_pre)
print('test_recall',test_recall)
print('test_f1',test_f1)
print('test_roc_auc',test_roc_auc)
result_pred = RFC.predict(X_test)
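# predict_proba columns follow RFC.classes_ (sorted labels), so column 0 is P('neutral or dissatisfied');
# [:, 1] would give P('satisfied'), the pos_label used for the metrics above
result_pred_prd = RFC.predict_proba(X_test)[:,0]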
#print(result_pred)
#print(result_pred_prd)
pd.DataFrame({'ID':x_test['ID'], 'satisfaction':result_pred, 'satisfaction_proba':result_pred_prd})
(55692, 27) (27431, 27) (55692,) (27431,)
train_acc 0.9999640881993823
train_pre 1.0
train_recall 0.9999171259271536
train_f1 0.9999585612464778
train_roc_auc 1.0

test_acc 0.9601545696474791
test_pre 0.9679209294260447
test_recall 0.9391772524606713
test_f1 0.9533324793988301
test_roc_auc 0.993180218220634
Out[163]:
| | ID | satisfaction | satisfaction_proba |
|---|---|---|---|
| 0 | 1 | neutral or dissatisfied | 0.82 |
| 1 | 16 | neutral or dissatisfied | 1.00 |
| 2 | 17 | neutral or dissatisfied | 1.00 |
| 3 | 25 | neutral or dissatisfied | 0.99 |
| 4 | 27 | satisfied | 0.13 |
| ... | ... | ... | ... |
| 20776 | 103895 | neutral or dissatisfied | 0.96 |
| 20777 | 103896 | neutral or dissatisfied | 0.99 |
| 20778 | 103897 | satisfied | 0.00 |
| 20779 | 103900 | neutral or dissatisfied | 1.00 |
| 20780 | 103902 | neutral or dissatisfied | 0.94 |
20781 rows × 3 columns
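In the table above, rows predicted as 'neutral or dissatisfied' carry high `satisfaction_proba` values, because the column taken from `predict_proba` is the first class, not 'satisfied'. The columns of `predict_proba` follow the order of `classes_`, so looking up the index by label avoids guessing. A minimal sketch on a made-up one-feature dataset (the feature values are hypothetical; only the two label strings come from the data above):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Tiny toy dataset (hypothetical feature values)
X = np.array([[0.0], [0.1], [0.9], [1.0]])
y = np.array(["neutral or dissatisfied", "neutral or dissatisfied",
              "satisfied", "satisfied"])

model = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y)

# classes_ holds the labels in sorted order; predict_proba columns use the same order
positive_index = list(model.classes_).index("satisfied")
proba_satisfied = model.predict_proba(X)[:, positive_index]

print(model.classes_)
print("column for 'satisfied':", positive_index)
```

Selecting the column by label keeps the submitted probability consistent with the `pos_label='satisfied'` argument used for precision, recall, and F1.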
Water potability data
Data description: whether the water is potable (Potability column: 0, 1)
x_train: https://raw.githubusercontent.com/Datamanim/datarepo/main/waters/x_train.csv
x_test: https://raw.githubusercontent.com/Datamanim/datarepo/main/waters/x_test.csv
y_train: https://raw.githubusercontent.com/Datamanim/datarepo/main/waters/y_train.csv
y_test : https://raw.githubusercontent.com/Datamanim/datarepo/main/waters/y_test.csv
In [260]:
# Load the data
import pandas as pd
x_train = pd.read_csv('https://raw.githubusercontent.com/Datamanim/datarepo/main/waters/x_train.csv')
x_test = pd.read_csv('https://raw.githubusercontent.com/Datamanim/datarepo/main/waters/x_test.csv')
y_train = pd.read_csv('https://raw.githubusercontent.com/Datamanim/datarepo/main/waters/y_train.csv')
y_test = pd.read_csv('https://raw.githubusercontent.com/Datamanim/datarepo/main/waters/y_test.csv')
display(x_train)
display(y_train.head())
| | ID | ph | Hardness | Solids | Chloramines | Sulfate | Conductivity | Organic_carbon | Trihalomethanes | Turbidity |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 8.662710 | 173.531947 | 20333.079495 | 5.636388 | 439.787938 | 459.633120 | 16.283311 | 89.924253 | 5.120103 |
| 1 | 1 | NaN | 226.270824 | 15380.124079 | 6.661474 | NaN | 392.558205 | 14.083110 | 50.286395 | 4.516870 |
| 2 | 2 | 7.583770 | 217.283262 | 36343.407055 | 8.532726 | 375.964391 | 393.877683 | 17.442301 | 77.722257 | 3.642289 |
| 3 | 3 | 6.584813 | 182.375456 | 24723.106296 | 6.238920 | NaN | 414.350751 | 17.582615 | 78.213738 | 4.404132 |
| 4 | 4 | 7.179864 | 180.854211 | 10859.553752 | 8.263503 | 341.302486 | 358.056264 | 12.065317 | 83.329918 | 3.878447 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2615 | 3266 | 8.218171 | 116.061950 | 22465.687557 | 5.000451 | NaN | 408.360199 | 6.732794 | 90.993268 | 4.246446 |
| 2616 | 3268 | NaN | 287.370208 | 41325.443548 | 5.517280 | NaN | 375.724519 | 13.977338 | 97.859775 | 4.322325 |
| 2617 | 3270 | 6.503638 | 163.256634 | 15000.187769 | 7.641834 | 334.103120 | 517.428392 | 17.530660 | 37.765518 | 4.963184 |
| 2618 | 3271 | 7.243410 | 188.046296 | 20877.534841 | 8.454107 | NaN | 377.649344 | 16.357375 | 71.627136 | 4.466936 |
| 2619 | 3275 | 7.933068 | 204.724281 | 12732.888243 | 7.717187 | 331.087177 | 449.685602 | 15.669628 | 55.404080 | 4.534233 |
2620 rows × 10 columns
| | ID | Potability |
|---|---|---|
| 0 | 0 | 0 |
| 1 | 1 | 1 |
| 2 | 2 | 0 |
| 3 | 3 | 0 |
| 4 | 4 | 0 |
In [266]:
# Inspect the data
#print(x_train.info()) # nothing needs encoding
#print(x_train.nunique())
#print(x_train.isnull().sum())
# Sulfate has too many missing values, so drop those rows; the matching y_train rows must be dropped too
x_train.dropna(subset=['Sulfate'], inplace=True)
y_train = y_train.loc[y_train.index.isin(x_train.index)]
# Fill the remaining missing values with the training mean
# (assign the result back; inplace=True on a selected column is unreliable in recent pandas)
ph_mean = x_train['ph'].mean()
thm_mean = x_train['Trihalomethanes'].mean()
x_train['ph'] = x_train['ph'].fillna(ph_mean)
x_train['Trihalomethanes'] = x_train['Trihalomethanes'].fillna(thm_mean)
x_test.dropna(subset=['Sulfate'], inplace=True)
y_test = y_test.loc[y_test.index.isin(x_test.index)]
x_test['ph'] = x_test['ph'].fillna(ph_mean)
x_test['Trihalomethanes'] = x_test['Trihalomethanes'].fillna(thm_mean)
# Drop the ID column
X_train = x_train.drop(columns = 'ID')
X_test = x_test.drop(columns = 'ID')
Y_train = y_train['Potability']
Y_test = y_test['Potability']
# Split the data (train = 0.67, validation = 0.33)
from sklearn.model_selection import train_test_split
X_train, X_vaildation, Y_train, Y_vaildation = train_test_split(X_train, Y_train, test_size=0.33, stratify=Y_train, random_state=43)
print(X_train.shape, X_vaildation.shape, Y_train.shape, Y_vaildation.shape)
# This is a classification problem, so model it with RandomForestClassifier
from sklearn.ensemble import RandomForestClassifier
RFC = RandomForestClassifier(random_state = 42, oob_score=True)
RFC.fit(X_train, Y_train)
pred_train = RFC.predict(X_train)
pred_train_proba = RFC.predict_proba(X_train)[:,1]
pred_vaildation = RFC.predict(X_vaildation)
pred_vaildation_proba = RFC.predict_proba(X_vaildation)[:,1]
# Compare the predictions with the true values to evaluate performance
# This is classification, so use confusion-matrix-based metrics
# accuracy_score, precision_score, recall_score, f1_score, roc_auc_score
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score
train_acc = accuracy_score(Y_train, pred_train)
vaildation_acc = accuracy_score(Y_vaildation, pred_vaildation)
train_pre = precision_score(Y_train, pred_train)
vaildation_pre = precision_score(Y_vaildation, pred_vaildation)
train_recall = recall_score(Y_train, pred_train)
vaildation_recall = recall_score(Y_vaildation, pred_vaildation)
train_f1 = f1_score(Y_train, pred_train)
vaildation_f1 = f1_score(Y_vaildation, pred_vaildation)
train_roc_auc = roc_auc_score(Y_train, pred_train_proba)
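# Use roc_auc_score on the predicted probabilities (the recorded output shows the accuracy value here)
vaildation_roc_auc = roc_auc_score(Y_vaildation, pred_vaildation_proba)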
print('train_acc',train_acc)
print('train_pre',train_pre)
print('train_recall',train_recall)
print('train_f1',train_f1)
print('train_roc_auc',train_roc_auc)
print('\n')
print('vaildation_acc',vaildation_acc)
print('vaildation_pre',vaildation_pre)
print('vaildation_recall',vaildation_recall)
print('vaildation_f1',vaildation_f1)
print('vaildation_roc_auc',vaildation_roc_auc)
result_pred = RFC.predict(X_test)
result_pred_proba = RFC.predict_proba(X_test)[:,1]
#print(result_pred)
#print(result_pred_proba)
pd.DataFrame({'ID':x_test['ID'], 'Potability':result_pred, 'Potability_proba':result_pred_proba})
(1342, 9) (661, 9) (1342,) (661,)
train_acc 1.0
train_pre 1.0
train_recall 1.0
train_f1 1.0
train_roc_auc 1.0

vaildation_acc 0.6717095310136157
vaildation_pre 0.6510067114093959
vaildation_recall 0.3702290076335878
vaildation_f1 0.4720194647201946
vaildation_roc_auc 0.6717095310136157
Out[266]:
| | ID | Potability | Potability_proba |
|---|---|---|---|
| 0 | 16 | 0 | 0.44 |
| 1 | 20 | 1 | 0.54 |
| 3 | 37 | 0 | 0.35 |
| 4 | 38 | 0 | 0.24 |
| 6 | 43 | 0 | 0.15 |
| ... | ... | ... | ... |
| 648 | 3256 | 0 | 0.19 |
| 649 | 3257 | 0 | 0.21 |
| 651 | 3267 | 0 | 0.19 |
| 653 | 3272 | 1 | 0.90 |
| 655 | 3274 | 0 | 0.13 |
504 rows × 3 columns
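The imputation step in this section fills missing test values with statistics computed on the training data. A compact sketch of that pattern, using assignment rather than `inplace=True` and made-up toy frames (the numbers are hypothetical, only the column names come from the data above):

```python
import numpy as np
import pandas as pd

# Toy frames (hypothetical values)
train = pd.DataFrame({"ph": [6.0, np.nan, 8.0],
                      "Trihalomethanes": [50.0, 70.0, np.nan]})
test = pd.DataFrame({"ph": [np.nan, 7.5],
                     "Trihalomethanes": [np.nan, 66.0]})

# Column means computed on the training frame only (NaN values are skipped)
train_means = train[["ph", "Trihalomethanes"]].mean()

# fillna with a Series fills each column with the value stored under its name
train_filled = train.fillna(train_means)
test_filled = test.fillna(train_means)

print(train_means)
print(test_filled)
```

Because the means come from `train` alone, no information from the test rows leaks into the values used to fill them.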
Drug classification data
Data description: classify the drug to administer (target variable: Drug)
x_train: https://raw.githubusercontent.com/Datamanim/datarepo/main/drug/x_train.csv
x_test: https://raw.githubusercontent.com/Datamanim/datarepo/main/drug/x_test.csv
y_train: https://raw.githubusercontent.com/Datamanim/datarepo/main/drug/y_train.csv
y_test (for evaluation) : https://raw.githubusercontent.com/Datamanim/datarepo/main/drug/y_test.csv
In [372]:
# Load the data
import pandas as pd
x_train = pd.read_csv('https://raw.githubusercontent.com/Datamanim/datarepo/main/drug/x_train.csv')
x_test = pd.read_csv('https://raw.githubusercontent.com/Datamanim/datarepo/main/drug/x_test.csv')
y_train = pd.read_csv('https://raw.githubusercontent.com/Datamanim/datarepo/main/drug/y_train.csv')
y_test = pd.read_csv('https://raw.githubusercontent.com/Datamanim/datarepo/main/drug/y_test.csv')
display(x_train)
display(y_train.head()) # so this is a multiclass classification problem
#print(x_train.info())
#print(x_train.nunique()) # no obviously odd values
#print(x_train.isnull().sum()) # no null values
# Bin the Age column (values range from 15 to 74) into decade bands ['10s', '20s', '30s', ...]
def solution(x):
    if 10<=x<=19:
        return '10s'
    elif 20<=x<=29:
        return '20s'
    elif 30<=x<=39:
        return '30s'
    elif 40<=x<=49:
        return '40s'
    elif 50<=x<=59:
        return '50s'
    elif 60<=x<=69:
        return '60s'
    elif 70<=x<=79:
        return '70s'
x_train['Age'] = x_train['Age'].apply(solution)
x_test['Age'] = x_test['Age'].apply(solution)
# Drop the ID column
X_train = x_train.drop(columns='ID')
X_test = x_test.drop(columns='ID')
Y_train = y_train['Drug']
Y_test = y_test['Drug']
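# Age has many categories, so label-encode it; the other columns have few, so one-hot encode them
from sklearn.preprocessing import LabelEncoder
labelencoder = LabelEncoder()
# Fit once on the train and test values together so both frames share one mapping
labelencoder.fit(pd.concat([X_train['Age'], X_test['Age']]))
X_train['Age'] = labelencoder.transform(X_train['Age'])
X_test['Age'] = labelencoder.transform(X_test['Age'])
# One-hot encode the remaining columns
X_train = pd.get_dummies(X_train)
X_test = pd.get_dummies(X_test)
# Give the test set the same columns, in the same order, as the training set
X_test = X_test.reindex(columns=X_train.columns, fill_value=0)
# Log-transform the Na_to_K column
import numpy as np
X_train['Na_to_K'] = np.log(X_train['Na_to_K'])
X_test['Na_to_K'] = np.log(X_test['Na_to_K'])
# Split the data (train: 0.67, validation: 0.33)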
from sklearn.model_selection import train_test_split
X_train, X_vaildation, Y_train,Y_vaildation = train_test_split(X_train,Y_train,test_size=0.33, random_state=43)
#print(Y_train.value_counts())
#print(Y_vaildation.value_counts())
# The classes are imbalanced, so resample the training data with SMOTE before fitting the model
from imblearn.over_sampling import SMOTE
X_train, Y_train = SMOTE().fit_resample(X_train, Y_train)
#print(Y_train.value_counts())
# This is a multiclass problem, so fit a RandomForestClassifier
from sklearn.ensemble import RandomForestClassifier
RFC = RandomForestClassifier(random_state = 1, oob_score=True)
RFC.fit(X_train, Y_train)
pred_train = RFC.predict(X_train)
pred_train_proba = RFC.predict_proba(X_train)
pred_vaildation = RFC.predict(X_vaildation)
pred_vaildation_proba = RFC.predict_proba(X_vaildation)
# Compare the predictions with the true labels to evaluate performance
# Since this is a multiclass problem, use metrics derived from the confusion matrix
#accuracy_score, precision_score, recall_score, f1_score, roc_auc_score
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score
train_acc = accuracy_score(Y_train, pred_train)
vaildation_acc = accuracy_score(Y_vaildation, pred_vaildation)
train_pre = precision_score(Y_train, pred_train, average=None)
vaildation_pre = precision_score(Y_vaildation, pred_vaildation, average=None)
train_recall = recall_score(Y_train, pred_train, average=None)
vaildation_recll = recall_score(Y_vaildation, pred_vaildation, average=None)
train_f1 = f1_score(Y_train, pred_train, average=None)
vaildation_f1 = f1_score(Y_vaildation, pred_vaildation, average=None)
train_roc_auc = roc_auc_score(Y_train, pred_train_proba, multi_class='ovr')
vaildation_roc_auc = roc_auc_score(Y_vaildation, pred_vaildation_proba, multi_class='ovr')
print('train_acc',train_acc)
print('train_pre',train_pre)
print('train_recall',train_recall)
print('train_f1',train_f1)
print('train_roc_auc',train_roc_auc)
print('\n')
print('vaildation_acc',vaildation_acc)
print('vaildation_pre',vaildation_pre)
print('vaildation_recll',vaildation_recll)
print('vaildation_f1',vaildation_f1)
print('vaildation_roc_auc',vaildation_roc_auc)
# Predict on the test set; predict_proba is called once and its columns are sliced per class
result_pred = RFC.predict(X_test)
result_pred_proba = RFC.predict_proba(X_test)
pd.DataFrame({'ID': x_test['ID'], 'Drug': result_pred,
              'Drup_proba_0': result_pred_proba[:, 0],
              'Drup_proba_1': result_pred_proba[:, 1],
              'Drup_proba_2': result_pred_proba[:, 2],
              'Drup_proba_3': result_pred_proba[:, 3],
              'Drup_proba_4': result_pred_proba[:, 4]})
| | ID | Age | Sex | BP | Cholesterol | Na_to_K |
|---|---|---|---|---|---|---|
| 0 | 0 | 36 | F | NORMAL | HIGH | 16.753 |
| 1 | 1 | 47 | F | LOW | HIGH | 11.767 |
| 2 | 2 | 69 | F | NORMAL | HIGH | 10.065 |
| 3 | 3 | 35 | M | LOW | NORMAL | 9.170 |
| 4 | 4 | 49 | M | LOW | NORMAL | 11.014 |
| ... | ... | ... | ... | ... | ... | ... |
| 152 | 194 | 36 | M | LOW | NORMAL | 11.424 |
| 153 | 195 | 39 | F | NORMAL | NORMAL | 17.225 |
| 154 | 196 | 34 | M | NORMAL | HIGH | 22.456 |
| 155 | 198 | 42 | M | HIGH | NORMAL | 12.766 |
| 156 | 199 | 34 | F | LOW | NORMAL | 12.923 |
157 rows × 6 columns
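The decade grouping applied to the Age column in the cell above could also be written with `pd.cut`. The following is only a minimal sketch: the sample ages, the `'10s'`-style labels, and the names `ages`, `age_group`, and `age_code` are made up for illustration and are not taken from the notebook.

```python
import pandas as pd

# Hypothetical sample ages; the notebook itself reads them from x_train['Age'].
ages = pd.Series([15, 23, 36, 47, 59, 69, 74])

# Bin edges are left-closed, right-open: [10, 20), [20, 30), ..., [70, 80).
bins = list(range(10, 90, 10))          # [10, 20, ..., 80]
labels = [f"{b}s" for b in bins[:-1]]   # ['10s', '20s', ..., '70s']

age_group = pd.cut(ages, bins=bins, labels=labels, right=False)

# pd.cut returns an ordered categorical, so its integer codes keep the age order.
age_code = age_group.cat.codes

print(pd.DataFrame({"Age": ages, "group": age_group, "code": age_code}))
```

Because the bin edges are fixed up front, the same `bins` and `labels` can be applied to the training and test sets and give identical codes, with no separate encoder to fit.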
| | ID | Drug |
|---|---|---|
| 0 | 0 | 0 |
| 1 | 1 | 3 |
| 2 | 2 | 4 |
| 3 | 3 | 4 |
| 4 | 4 | 4 |
train_acc 1.0
train_pre [1. 1. 1. 1. 1.]
train_recall [1. 1. 1. 1. 1.]
train_f1 [1. 1. 1. 1. 1.]
train_roc_auc 1.0

vaildation_acc 0.9807692307692307
vaildation_pre [1. 1. 0.8 1. 1. ]
vaildation_recll [1. 0.83333333 1. 1. 1. ]
vaildation_f1 [1. 0.90909091 0.88888889 1. 1. ]
vaildation_roc_auc 1.0
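With `average=None`, precision, recall, and F1 come back as one value per class, as in the printed arrays. When a single summary number is needed, the per-class values can be macro-averaged. The sketch below uses small made-up label arrays (`y_true` and `y_pred` are hypothetical, not the notebook's data) to show how the two forms relate.

```python
import numpy as np
from sklearn.metrics import classification_report, f1_score

# Hypothetical labels for a 3-class problem (not the notebook's data).
y_true = np.array([0, 0, 1, 1, 2, 2])
y_pred = np.array([0, 0, 1, 2, 2, 2])

# average=None returns one F1 value per class.
per_class_f1 = f1_score(y_true, y_pred, average=None)

# average="macro" is the unweighted mean of the per-class values.
macro_f1 = f1_score(y_true, y_pred, average="macro")

print(per_class_f1, macro_f1)

# classification_report prints precision, recall, and F1 per class in one table.
print(classification_report(y_true, y_pred, digits=3))
```

Macro averaging weights every class equally, which may be a reasonable choice for imbalanced targets like this one; `average="weighted"` would instead weight each class by its support.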
Out[372]:
| | ID | Drug | Drup_proba_0 | Drup_proba_1 | Drup_proba_2 | Drup_proba_3 | Drup_proba_4 |
|---|---|---|---|---|---|---|---|
| 0 | 8 | 0 | 1.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| 1 | 9 | 4 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 |
| 2 | 14 | 1 | 0.00 | 1.00 | 0.00 | 0.00 | 0.00 |
| 3 | 25 | 4 | 0.02 | 0.00 | 0.03 | 0.01 | 0.94 |
| 4 | 26 | 0 | 0.97 | 0.03 | 0.00 | 0.00 | 0.00 |
| 5 | 27 | 4 | 0.14 | 0.00 | 0.00 | 0.11 | 0.75 |
| 6 | 41 | 4 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 |
| 7 | 50 | 4 | 0.04 | 0.00 | 0.00 | 0.00 | 0.96 |
| 8 | 53 | 0 | 0.93 | 0.06 | 0.00 | 0.00 | 0.01 |
| 9 | 56 | 1 | 0.01 | 0.95 | 0.02 | 0.00 | 0.02 |
| 10 | 59 | 2 | 0.01 | 0.00 | 0.99 | 0.00 | 0.00 |
| 11 | 60 | 4 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 |
| 12 | 65 | 4 | 0.01 | 0.00 | 0.00 | 0.01 | 0.98 |
| 13 | 66 | 2 | 0.01 | 0.00 | 0.99 | 0.00 | 0.00 |
| 14 | 69 | 2 | 0.01 | 0.22 | 0.76 | 0.00 | 0.01 |
| 15 | 71 | 2 | 0.14 | 0.07 | 0.79 | 0.00 | 0.00 |
| 16 | 75 | 4 | 0.00 | 0.00 | 0.00 | 0.01 | 0.99 |
| 17 | 80 | 2 | 0.07 | 0.28 | 0.62 | 0.00 | 0.03 |
| 18 | 92 | 0 | 0.99 | 0.01 | 0.00 | 0.00 | 0.00 |
| 19 | 93 | 4 | 0.01 | 0.01 | 0.00 | 0.00 | 0.98 |
| 20 | 94 | 4 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 |
| 21 | 98 | 0 | 0.99 | 0.00 | 0.00 | 0.00 | 0.01 |
| 22 | 99 | 1 | 0.02 | 0.94 | 0.04 | 0.00 | 0.00 |
| 23 | 109 | 1 | 0.00 | 1.00 | 0.00 | 0.00 | 0.00 |
| 24 | 110 | 0 | 0.95 | 0.05 | 0.00 | 0.00 | 0.00 |
| 25 | 113 | 4 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 |
| 26 | 117 | 2 | 0.00 | 0.01 | 0.98 | 0.00 | 0.01 |
| 27 | 121 | 0 | 0.97 | 0.00 | 0.00 | 0.00 | 0.03 |
| 28 | 125 | 0 | 1.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| 29 | 129 | 4 | 0.01 | 0.00 | 0.00 | 0.01 | 0.98 |
| 30 | 140 | 3 | 0.00 | 0.00 | 0.00 | 0.99 | 0.01 |
| 31 | 141 | 0 | 0.97 | 0.00 | 0.00 | 0.03 | 0.00 |
| 32 | 145 | 3 | 0.00 | 0.01 | 0.00 | 0.99 | 0.00 |
| 33 | 146 | 0 | 1.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| 34 | 147 | 4 | 0.13 | 0.00 | 0.00 | 0.12 | 0.75 |
| 35 | 149 | 1 | 0.03 | 0.93 | 0.04 | 0.00 | 0.00 |
| 36 | 161 | 0 | 0.99 | 0.00 | 0.00 | 0.00 | 0.01 |
| 37 | 169 | 0 | 0.82 | 0.00 | 0.00 | 0.00 | 0.18 |
| 38 | 171 | 0 | 0.91 | 0.09 | 0.00 | 0.00 | 0.00 |
| 39 | 173 | 0 | 0.93 | 0.00 | 0.00 | 0.06 | 0.01 |
| 40 | 177 | 1 | 0.01 | 0.96 | 0.03 | 0.00 | 0.00 |
| 41 | 179 | 3 | 0.00 | 0.00 | 0.00 | 1.00 | 0.00 |
| 42 | 197 | 0 | 0.98 | 0.00 | 0.00 | 0.00 | 0.02 |
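The probability columns in the table above follow the order of the fitted model's `classes_` attribute, so the column names can be generated from it rather than typed by hand. The following is a rough sketch on synthetic data: the toy features, the `proba_` prefix, and the names `model` and `proba_df` are assumptions for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Hypothetical toy data: 3 classes, 2 numeric features (not the drug dataset).
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 2))
y = np.repeat([0, 1, 2], 20)
X[y == 1] += 3.0   # shift the classes apart so the toy problem is learnable
X[y == 2] -= 3.0

model = RandomForestClassifier(n_estimators=50, random_state=1).fit(X, y)

# Call predict_proba once; its columns follow the order of model.classes_.
proba = model.predict_proba(X[:5])
proba_df = pd.DataFrame(proba, columns=[f"proba_{c}" for c in model.classes_])

# The predicted label is the class with the highest probability in each row.
proba_df.insert(0, "pred", model.classes_[proba.argmax(axis=1)])
print(proba_df)
```

Deriving the names from `classes_` keeps the table correct if the number or order of classes changes, and each row of probabilities sums to 1.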