Fraudulent Company Classification Data
Data description: fraudulent company classification (target variable Risk: 1 = fraud, 0 = normal)
x_train: https://raw.githubusercontent.com/Datamanim/datarepo/main/audit/x_train.csv
x_test: https://raw.githubusercontent.com/Datamanim/datarepo/main/audit/x_test.csv
y_train: https://raw.githubusercontent.com/Datamanim/datarepo/main/audit/y_train.csv
y_test (for evaluation): https://raw.githubusercontent.com/Datamanim/datarepo/main/audit/y_test.csv
In [224]:
# Load the data
import pandas as pd
x_train = pd.read_csv('https://raw.githubusercontent.com/Datamanim/datarepo/main/audit/x_train.csv')
x_test = pd.read_csv('https://raw.githubusercontent.com/Datamanim/datarepo/main/audit/x_test.csv')
y_train = pd.read_csv('https://raw.githubusercontent.com/Datamanim/datarepo/main/audit/y_train.csv')
y_test = pd.read_csv('https://raw.githubusercontent.com/Datamanim/datarepo/main/audit/y_test.csv')
#display(x_train)
display(y_train.head())
#print(x_train.info())
#print(x_train.nunique())
#print(x_train.isnull().sum())  # Money_Value has a single null value, so just drop that row
# Drop the row where Money_Value is null
# The matching y_train row has to be removed as well
x_train.dropna(subset=['Money_Value'], inplace=True)
y_train = y_train.loc[y_train.index.isin(x_train.index)]
x_test.dropna(subset=['Money_Value'], inplace=True)
y_test = y_test.loc[y_test.index.isin(x_test.index)]
# Drop the ID column. There looks to be a lot to do, but take it one step at a time.
X_train = x_train.drop(columns='ID')
X_test = x_test.drop(columns='ID')
Y_train = y_train['Risk']
Y_test = y_test['Risk']
# Split the data (train: 0.67, validation: 0.33)
from sklearn.model_selection import train_test_split
# Names use the 'vaildation' spelling so they match the references below
X_train, X_vaildation, Y_train, Y_vaildation = train_test_split(X_train, Y_train, test_size=0.33, random_state=43)
X_train.reset_index(drop=True, inplace=True)
X_vaildation.reset_index(drop=True, inplace=True)
# Scaling
from sklearn.preprocessing import StandardScaler
sds = StandardScaler()
sds.fit(X_train.drop(columns='LOCATION_ID'))
X_train_sc = sds.transform(X_train.drop(columns='LOCATION_ID'))
X_train_sc = pd.DataFrame(X_train_sc, columns=X_train.drop(columns='LOCATION_ID').columns)
X_train_sc['LOCATION_ID'] = X_train['LOCATION_ID']
X_vaildation_sc = sds.transform(X_vaildation.drop(columns='LOCATION_ID'))
X_vaildation_sc = pd.DataFrame(X_vaildation_sc, columns=X_vaildation.drop(columns='LOCATION_ID').columns)
X_vaildation_sc['LOCATION_ID'] = X_vaildation['LOCATION_ID']
# Log transform
# Note: standardized values at or below -1 become NaN or -inf under log1p
import numpy as np
for i in X_train_sc.select_dtypes(exclude='object').columns:
    X_train_sc[i] = X_train_sc[i]+0.00000001
    X_train_sc[i] = np.log1p(X_train_sc[i])
    X_vaildation_sc[i] = X_vaildation_sc[i]+0.00000001
    X_vaildation_sc[i] = np.log1p(X_vaildation_sc[i])
# Label encoding
# Note: the encoder is refit on each split, so the integer codes only match if both splits contain the same categories
from sklearn.preprocessing import LabelEncoder
labend = LabelEncoder()
for i in X_train_sc.select_dtypes(include='object').columns:
    X_train_sc[i] = labend.fit_transform(X_train_sc[i])
    X_vaildation_sc[i] = labend.fit_transform(X_vaildation_sc[i])
# The classes are imbalanced, so oversample with SMOTE
#Y_train.value_counts()
from imblearn.over_sampling import SMOTE
smote = SMOTE()
X_train_sc,Y_train = smote.fit_resample(X_train_sc, Y_train)
# This is a classification problem, so model with RandomForestClassifier
from sklearn.ensemble import RandomForestClassifier
RFC = RandomForestClassifier(random_state = 47, n_estimators=100)
RFC.fit(X_train_sc, Y_train)
pred_train = RFC.predict(X_train_sc)
pred_train_proba_0 = RFC.predict_proba(X_train_sc)[:,0]
pred_train_proba_1 = RFC.predict_proba(X_train_sc)[:,1]
pred_vaildation = RFC.predict(X_vaildation_sc)
pred_vaildation_proba_0 = RFC.predict_proba(X_vaildation_sc)[:,0]
pred_vaildation_proba_1 = RFC.predict_proba(X_vaildation_sc)[:,1]
# Compare the predictions with the actual values to evaluate performance
# This is classification, so use confusion-matrix based metrics plus ROC AUC
# accuracy_score, precision_score, recall_score, f1_score, roc_auc_score
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score
train_acc = accuracy_score(Y_train, pred_train)
vaildation_acc = accuracy_score(Y_vaildation, pred_vaildation)
train_pre = precision_score(Y_train, pred_train)
vaildation_pre = precision_score(Y_vaildation, pred_vaildation)
train_recall = recall_score(Y_train, pred_train)
vaildation_recall = recall_score(Y_vaildation, pred_vaildation)
train_f1 = f1_score(Y_train, pred_train)
vaildation_f1 = f1_score(Y_vaildation, pred_vaildation)
train_roc_auc = roc_auc_score(Y_train, pred_train_proba_1)
vaildation_roc_auc = roc_auc_score(Y_vaildation, pred_vaildation_proba_1)
print('train_acc',train_acc)
print('train_pre',train_pre)
print('train_recall',train_recall)
print('train_f1',train_f1)
print('train_roc_auc',train_roc_auc)
print('\n')
print('vaildation_acc',vaildation_acc)
print('vaildation_pre',vaildation_pre)
print('vaildation_recall',vaildation_recall)
print('vaildation_f1',vaildation_f1)
print('vaildation_roc_auc',vaildation_roc_auc)
# Apply the model to X_test for the final evaluation
# X_test needs the same scaling, log transform, and label encoding first
#display(X_test)
# Scaling
X_test_sc = sds.transform(X_test.drop(columns='LOCATION_ID'))
X_test_sc = pd.DataFrame(X_test_sc, columns = X_test.drop(columns='LOCATION_ID').columns)
# X_test keeps its original index after dropna, so assign LOCATION_ID by position
X_test_sc['LOCATION_ID'] = X_test['LOCATION_ID'].to_numpy()
# Log transform (same small offset as for the training data)
import numpy as np
for i in X_test_sc.select_dtypes(exclude='object'):
    X_test_sc[i] = X_test_sc[i]+0.00000001
    X_test_sc[i] = np.log1p(X_test_sc[i])
# Label encoding
from sklearn.preprocessing import LabelEncoder
labelencoder = LabelEncoder()
for i in X_test_sc.select_dtypes(include='object'):
    X_test_sc[i] = labelencoder.fit_transform(X_test_sc[i])
pred_test = RFC.predict(X_test_sc)
pred_test_proba_0 = RFC.predict_proba(X_test_sc)[:,0]
pred_test_proba_1 = RFC.predict_proba(X_test_sc)[:,1]
# Finally, score the test-set predictions against the actual values
# This is classification, so use the same confusion-matrix based metrics
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score
test_acc = accuracy_score(Y_test, pred_test)
test_pre = precision_score(Y_test, pred_test)
test_recall = recall_score(Y_test, pred_test)
test_f1 = f1_score(Y_test, pred_test)
test_roc_auc = roc_auc_score(Y_test, pred_test_proba_1)
print('\n')
print('test_acc',test_acc)
print('test_pre',test_pre)
print('test_recall',test_recall)
print('test_f1',test_f1)
print('test_roc_auc',test_roc_auc)
pd.DataFrame({'ID':x_test['ID'], 'Risk':pred_test, 'proba_0':pred_test_proba_0, 'proba_1':pred_test_proba_1})
| | ID | Risk |
|---|---|---|
| 0 | 0 | 0 |
| 1 | 2 | 0 |
| 2 | 3 | 1 |
| 3 | 4 | 0 |
| 4 | 5 | 0 |
train_acc 1.0
train_pre 1.0
train_recall 1.0
train_f1 1.0
train_roc_auc 1.0

vaildation_acc 1.0
vaildation_pre 1.0
vaildation_recall 1.0
vaildation_f1 1.0
vaildation_roc_auc 1.0

test_acc 1.0
test_pre 1.0
test_recall 1.0
test_f1 1.0
test_roc_auc 1.0
Out[224]:
| | ID | Risk | proba_0 | proba_1 |
|---|---|---|---|---|
| 0 | 1 | 1 | 0.00 | 1.00 |
| 1 | 9 | 1 | 0.00 | 1.00 |
| 2 | 10 | 0 | 0.86 | 0.14 |
| 3 | 11 | 0 | 1.00 | 0.00 |
| 4 | 12 | 0 | 1.00 | 0.00 |
| ... | ... | ... | ... | ... |
| 151 | 759 | 1 | 0.07 | 0.93 |
| 152 | 762 | 1 | 0.00 | 1.00 |
| 153 | 763 | 1 | 0.00 | 1.00 |
| 154 | 765 | 0 | 0.97 | 0.03 |
| 155 | 772 | 1 | 0.01 | 0.99 |
155 rows × 4 columns
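The cell above calls `LabelEncoder.fit_transform` separately on the train, validation, and test splits, so the same `LOCATION_ID` value can end up with a different integer in each split. A minimal sketch of one alternative follows, assuming scikit-learn's `OrdinalEncoder` is acceptable here; the toy values are hypothetical stand-ins, not rows from the audit data, and this is not the approach the post itself uses.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical stand-ins for the LOCATION_ID column in each split
train = pd.DataFrame({'LOCATION_ID': ['2', '8', 'NUH', '8']})
test = pd.DataFrame({'LOCATION_ID': ['8', 'NUH', 'SAFIDON']})

# Fit on the training split only; categories unseen in training map to -1
encoder = OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1)
train_codes = encoder.fit_transform(train[['LOCATION_ID']]).ravel()
test_codes = encoder.transform(test[['LOCATION_ID']]).ravel()

print(train_codes)
print(test_codes)
```

Because the encoder is fitted once, '8' and 'NUH' keep the same code in both splits, and a category that only appears in the test split gets the sentinel value instead of shifting the other codes.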
Sensor Data Motion-Type Classification Data
Data description: classify the motion type from sensor data (target variable pose: 0 or 1)
x_train: https://raw.githubusercontent.com/Datamanim/datarepo/main/muscle/x_train.csv
x_test: https://raw.githubusercontent.com/Datamanim/datarepo/main/muscle/x_test.csv
y_train: https://raw.githubusercontent.com/Datamanim/datarepo/main/muscle/y_train.csv
y_test (for evaluation): https://raw.githubusercontent.com/Datamanim/datarepo/main/muscle/y_test.csv
In [6]:
# Load the data
import pandas as pd
x_train = pd.read_csv('https://raw.githubusercontent.com/Datamanim/datarepo/main/muscle/x_train.csv')
x_test = pd.read_csv('https://raw.githubusercontent.com/Datamanim/datarepo/main/muscle/x_test.csv')
y_train = pd.read_csv('https://raw.githubusercontent.com/Datamanim/datarepo/main/muscle/y_train.csv')
y_test = pd.read_csv('https://raw.githubusercontent.com/Datamanim/datarepo/main/muscle/y_test.csv')
#display(x_train)
#display(y_train)
#print(x_train.info())
#print(x_train.nunique())
#print(x_train.isnull().sum())
#print(x_train.describe())
# First, drop the ID column
X_train = x_train.drop(columns='ID')
X_test = x_test.drop(columns='ID')
Y_train = y_train['pose']
Y_test = y_test['pose']
# Split the data (train: 0.67, validation: 0.33)
from sklearn.model_selection import train_test_split
X_train, X_validation, Y_train, Y_validation = train_test_split(X_train, Y_train, test_size=0.33, random_state=43)
# This is classification, so model with RandomForestClassifier
from sklearn.ensemble import RandomForestClassifier
RFC = RandomForestClassifier(random_state=1)
RFC.fit(X_train, Y_train)
pred_train = RFC.predict(X_train)
pred_validation = RFC.predict(X_validation)
pred_train_proba_0 = RFC.predict_proba(X_train)[:, 0]
pred_validation_proba_0 = RFC.predict_proba(X_validation)[:, 0]
pred_train_proba_1 = RFC.predict_proba(X_train)[:, 1]
pred_validation_proba_1 = RFC.predict_proba(X_validation)[:, 1]
# This is classification, so use confusion-matrix based metrics plus ROC AUC
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score
train_acc = accuracy_score(Y_train, pred_train)
validation_acc = accuracy_score(Y_validation, pred_validation)
train_pre = precision_score(Y_train, pred_train)
validation_pre = precision_score(Y_validation, pred_validation)
train_recall = recall_score(Y_train, pred_train)
validation_recall = recall_score(Y_validation, pred_validation)
train_f1 = f1_score(Y_train, pred_train)
validation_f1 = f1_score(Y_validation, pred_validation)
train_roc_auc = roc_auc_score(Y_train, pred_train_proba_1)
validation_roc_auc = roc_auc_score(Y_validation, pred_validation_proba_1)
print('train_acc', train_acc)
print('train_pre', train_pre)
print('train_recall', train_recall)
print('train_f1', train_f1)
print('train_roc_auc', train_roc_auc)
print('\n')
print('validation_acc', validation_acc)
print('validation_pre', validation_pre)
print('validation_recall', validation_recall)
print('validation_f1', validation_f1)
print('validation_roc_auc', validation_roc_auc)
# Finally, evaluate on the test data and produce the final CSV
pred_test = RFC.predict(X_test)
pred_test_proba_0 = RFC.predict_proba(X_test)[:, 0]
pred_test_proba_1 = RFC.predict_proba(X_test)[:, 1]
# Evaluate performance on the test data
test_acc = accuracy_score(Y_test, pred_test)
test_pre = precision_score(Y_test, pred_test)
test_recall = recall_score(Y_test, pred_test)
test_f1 = f1_score(Y_test, pred_test)
test_roc_auc = roc_auc_score(Y_test, pred_test_proba_1)
print('\n')
print('test_acc', test_acc)
print('test_pre', test_pre)
print('test_recall', test_recall)
print('test_f1', test_f1)
print('test_roc_auc', test_roc_auc)
# Show the test predictions as a DataFrame
pd.DataFrame({'ID': x_test['ID'], 'pose':pred_test, 'pred_test_proba_0':pred_test_proba_0, 'pred_test_proba_1':pred_test_proba_1})
train_acc 1.0
train_pre 1.0
train_recall 1.0
train_f1 1.0
train_roc_auc 1.0

validation_acc 0.9915309446254071
validation_pre 1.0
validation_recall 0.9831824062095731
validation_f1 0.9915198956294846
validation_roc_auc 0.9997954249897288

test_acc 0.9948409286328461
test_pre 1.0
test_recall 0.9891696750902527
test_f1 0.9945553539019965
test_roc_auc 1.0
Out[6]:
| | ID | pose | pred_test_proba_0 | pred_test_proba_1 |
|---|---|---|---|---|
| 0 | 1 | 1 | 0.01 | 0.99 |
| 1 | 3 | 1 | 0.00 | 1.00 |
| 2 | 8 | 1 | 0.05 | 0.95 |
| 3 | 10 | 1 | 0.01 | 0.99 |
| 4 | 17 | 0 | 0.98 | 0.02 |
| ... | ... | ... | ... | ... |
| 1158 | 5786 | 0 | 0.96 | 0.04 |
| 1159 | 5796 | 0 | 0.98 | 0.02 |
| 1160 | 5802 | 1 | 0.00 | 1.00 |
| 1161 | 5811 | 1 | 0.00 | 1.00 |
| 1162 | 5812 | 0 | 0.98 | 0.02 |
1163 rows × 4 columns
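The same five scores are computed by hand for the train, validation, and test splits in every cell of this post. As a rough sketch, they could be collected by one small helper; `summarize_scores` is a hypothetical name introduced here, and the toy labels and probabilities below are made up for illustration.

```python
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score)

def summarize_scores(y_true, y_pred, y_proba_1):
    """Return the five scores used in this post as a dict."""
    return {
        'acc': accuracy_score(y_true, y_pred),
        'pre': precision_score(y_true, y_pred),
        'recall': recall_score(y_true, y_pred),
        'f1': f1_score(y_true, y_pred),
        # ROC AUC needs the class-1 probability, not the hard prediction
        'roc_auc': roc_auc_score(y_true, y_proba_1),
    }

# Toy labels, predictions, and class-1 probabilities (not from the dataset)
y_true = [0, 1, 1, 0, 1]
y_pred = [0, 1, 0, 0, 1]
y_proba_1 = [0.1, 0.9, 0.4, 0.2, 0.8]

scores = summarize_scores(y_true, y_pred, y_proba_1)
for name, value in scores.items():
    print('toy_' + name, value)
```

With a helper like this, each split needs a single call, for example `summarize_scores(Y_validation, pred_validation, pred_validation_proba_1)`.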
In [7]:
# Load the data
import pandas as pd
x_train = pd.read_csv('https://raw.githubusercontent.com/Datamanim/datarepo/main/muscle/x_train.csv')
x_test = pd.read_csv('https://raw.githubusercontent.com/Datamanim/datarepo/main/muscle/x_test.csv')
y_train = pd.read_csv('https://raw.githubusercontent.com/Datamanim/datarepo/main/muscle/y_train.csv')
y_test = pd.read_csv('https://raw.githubusercontent.com/Datamanim/datarepo/main/muscle/y_test.csv')
#display(x_train)
#display(y_train)
#print(x_train.info())
#print(x_train.nunique())
#print(x_train.isnull().sum())
#print(x_train.describe())
# First, drop the ID column
X_train = x_train.drop(columns='ID')
X_test = x_test.drop(columns='ID')
Y_train = y_train['pose']
Y_test = y_test['pose']
# Log transform to bring the distributions closer to normal
import numpy as np
for i in X_train.select_dtypes(exclude='object').columns:
    # Shift columns with negative values so the minimum becomes 0
    if X_train[i].min() < 0:
        X_train[i] = X_train[i] + abs(X_train[i].min())
    # Log transform
    X_train[i] = np.log1p(X_train[i])
for i in X_test.select_dtypes(exclude='object').columns:
    # Note: this shift uses the test set's own minimum, which can differ from the train shift
    if X_test[i].min() < 0:
        X_test[i] = X_test[i] + abs(X_test[i].min())
    # Log transform
    X_test[i] = np.log1p(X_test[i])
# Split the data (train: 0.67, validation: 0.33)
from sklearn.model_selection import train_test_split
X_train, X_validation, Y_train, Y_validation = train_test_split(X_train, Y_train, test_size=0.33, random_state=43)
# Scale the numeric features
from sklearn.preprocessing import StandardScaler
sds = StandardScaler()
sds.fit(X_train)
X_train_sc = sds.transform(X_train)
X_train_sc = pd.DataFrame(X_train_sc, columns=X_train.columns)
X_validation_sc = sds.transform(X_validation)
X_validation_sc = pd.DataFrame(X_validation_sc, columns=X_validation.columns)
# Any object-type columns would need encoding at this point
# This dataset has none, so that step is skipped
# Oversample with SMOTE
#Y_train.value_counts()
from imblearn.over_sampling import SMOTE
smote = SMOTE()
X_train_sc,Y_train = smote.fit_resample(X_train_sc, Y_train)
# This is classification, so model with RandomForestClassifier
from sklearn.ensemble import RandomForestClassifier
RFC = RandomForestClassifier(random_state=1)
RFC.fit(X_train_sc, Y_train)
pred_train = RFC.predict(X_train_sc)
pred_validation = RFC.predict(X_validation_sc)
pred_train_proba_0 = RFC.predict_proba(X_train_sc)[:, 0]
pred_validation_proba_0 = RFC.predict_proba(X_validation_sc)[:, 0]
pred_train_proba_1 = RFC.predict_proba(X_train_sc)[:, 1]
pred_validation_proba_1 = RFC.predict_proba(X_validation_sc)[:, 1]
# Compare the predictions with the actual values to evaluate performance
# This is classification, so use confusion-matrix based metrics plus ROC AUC
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score
train_acc = accuracy_score(Y_train, pred_train)
validation_acc = accuracy_score(Y_validation, pred_validation)
train_pre = precision_score(Y_train, pred_train)
validation_pre = precision_score(Y_validation, pred_validation)
train_recall = recall_score(Y_train, pred_train)
validation_recall = recall_score(Y_validation, pred_validation)
train_f1 = f1_score(Y_train, pred_train)
validation_f1 = f1_score(Y_validation, pred_validation)
train_roc_auc = roc_auc_score(Y_train, pred_train_proba_1)
validation_roc_auc = roc_auc_score(Y_validation, pred_validation_proba_1)
print('train_acc', train_acc)
print('train_pre', train_pre)
print('train_recall', train_recall)
print('train_f1', train_f1)
print('train_roc_auc', train_roc_auc)
print('\n')
print('validation_acc', validation_acc)
print('validation_pre', validation_pre)
print('validation_recall', validation_recall)
print('validation_f1', validation_f1)
print('validation_roc_auc', validation_roc_auc)
# Finally, evaluate on the test data and produce the final CSV
# Scale the test data
X_test_sc = sds.transform(X_test)
X_test_sc = pd.DataFrame(X_test_sc, columns=X_test.columns)
pred_test = RFC.predict(X_test_sc)
pred_test_proba_0 = RFC.predict_proba(X_test_sc)[:, 0]
pred_test_proba_1 = RFC.predict_proba(X_test_sc)[:, 1]
# Evaluate performance on the test data
test_acc = accuracy_score(Y_test, pred_test)
test_pre = precision_score(Y_test, pred_test)
test_recall = recall_score(Y_test, pred_test)
test_f1 = f1_score(Y_test, pred_test)
test_roc_auc = roc_auc_score(Y_test, pred_test_proba_1)
print('\n')
print('test_acc', test_acc)
print('test_pre', test_pre)
print('test_recall', test_recall)
print('test_f1', test_f1)
print('test_roc_auc', test_roc_auc)
# Show the test predictions as a DataFrame
pd.DataFrame({'ID': x_test['ID'], 'pose':pred_test, 'pred_test_proba_0':pred_test_proba_0, 'pred_test_proba_1':pred_test_proba_1})
# Code to write the result to CSV
#pd.DataFrame({'ID': x_test['ID'], 'pose':pred_test, 'pred_test_proba_0':pred_test_proba_0, 'pred_test_proba_1':pred_test_proba_1}).to_csv('result.csv', index=False)
train_acc 1.0
train_pre 1.0
train_recall 1.0
train_f1 1.0
train_roc_auc 1.0

validation_acc 0.9921824104234528
validation_pre 1.0
validation_recall 0.9844760672703752
validation_f1 0.9921773142112125
validation_roc_auc 0.9998438099506644

test_acc 0.9819432502149613
test_pre 1.0
test_recall 0.9620938628158845
test_f1 0.9806807727690893
test_roc_auc 0.9999688783766961
Out[7]:
| | ID | pose | pred_test_proba_0 | pred_test_proba_1 |
|---|---|---|---|---|
| 0 | 1 | 1 | 0.38 | 0.62 |
| 1 | 3 | 1 | 0.34 | 0.66 |
| 2 | 8 | 1 | 0.46 | 0.54 |
| 3 | 10 | 1 | 0.40 | 0.60 |
| 4 | 17 | 0 | 0.91 | 0.09 |
| ... | ... | ... | ... | ... |
| 1158 | 5786 | 0 | 0.74 | 0.26 |
| 1159 | 5796 | 0 | 0.85 | 0.15 |
| 1160 | 5802 | 1 | 0.35 | 0.65 |
| 1161 | 5811 | 1 | 0.40 | 0.60 |
| 1162 | 5812 | 0 | 0.84 | 0.16 |
1163 rows × 4 columns
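In the cell above, `X_train` and `X_test` are each shifted by their own column minimum before `np.log1p`, so the same raw sensor reading can map to different transformed values in the two sets. The sketch below shows one possible alternative: learn the shift from the training data only and reuse it. `fit_shift` and `shifted_log1p` are hypothetical helper names, and the tiny frames are made-up values rather than rows from the muscle data.

```python
import numpy as np
import pandas as pd

def fit_shift(train_df):
    """Per-column offset that makes the training minimum non-negative."""
    return (-train_df.min()).clip(lower=0)

def shifted_log1p(df, shift):
    """Apply the stored offset, clip at 0, then take log1p."""
    # Clipping guards against test values below the training minimum
    return np.log1p((df + shift).clip(lower=0))

# Made-up stand-ins for two sensor columns
train = pd.DataFrame({'s1': [-3.0, 0.0, 5.0], 's2': [1.0, 2.0, 3.0]})
test = pd.DataFrame({'s1': [-4.0, 0.0], 's2': [2.0, 10.0]})

shift = fit_shift(train)          # learned from the training data only
train_log = shifted_log1p(train, shift)
test_log = shifted_log1p(test, shift)

print(shift)
print(test_log)
```

The clip is a design choice for this sketch: a test value below the training minimum is treated as if it sat at that minimum, which avoids NaN from `log1p` at the cost of losing a little information.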
Diabetes Prediction Data
Data description: predict whether a patient has diabetes (target variable Outcome: 1 = diabetes, 0 = normal)
x_train: https://raw.githubusercontent.com/Datamanim/datarepo/main/diabetes/x_train.csv
x_test: https://raw.githubusercontent.com/Datamanim/datarepo/main/diabetes/x_test.csv
y_train: https://raw.githubusercontent.com/Datamanim/datarepo/main/diabetes/y_train.csv
y_test: https://raw.githubusercontent.com/Datamanim/datarepo/main/diabetes/y_test.csv
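With a binary target such as Outcome, it can help to look at the class ratio before splitting and to keep that ratio the same in the train and validation parts. The following is only a sketch on synthetic data, assuming the `stratify` argument of `train_test_split` fits the task; the 30% positive rate is an invented number, not a property of the diabetes data.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 100 rows, 30% positive class
X = pd.DataFrame({'feature': np.arange(100)})
y = pd.Series([1] * 30 + [0] * 70, name='Outcome')

# Share of each class before splitting
print(y.value_counts(normalize=True))

# stratify=y keeps the class ratio roughly equal in both parts
X_tr, X_val, y_tr, y_val = train_test_split(
    X, y, test_size=0.33, random_state=43, stratify=y
)
print(y_tr.mean(), y_val.mean())
```

Without `stratify`, a small validation set can end up with noticeably more or fewer positives than the training set, which makes precision and recall harder to compare across splits.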
In [126]:
# Load the data
import pandas as pd
x_train = pd.read_csv('https://raw.githubusercontent.com/Datamanim/datarepo/main/diabetes/x_train.csv')
x_test = pd.read_csv('https://raw.githubusercontent.com/Datamanim/datarepo/main/diabetes/x_test.csv')
y_train = pd.read_csv('https://raw.githubusercontent.com/Datamanim/datarepo/main/diabetes/y_train.csv')
y_test = pd.read_csv('https://raw.githubusercontent.com/Datamanim/datarepo/main/diabetes/y_test.csv')
display(x_train)
display(y_train.head())
#print(x_train.info())
#print(x_train.nunique())  # nothing particularly unusual in the data
#print(x_train.isnull().sum())  # no null values
# Bin the Age column into 10-year groups (decades)
def solution(x):
    if 20<=x<=29:
        return '20s'
    elif 30<=x<=39:
        return '30s'
    elif 40<=x<=49:
        return '40s'
    elif 50<=x<=59:
        return '50s'
    elif 60<=x<=69:
        return '60s'
    elif 70<=x<=79:
        return '70s'
    else:
        # Anything outside 20-79 falls through to this group
        return '80s'
x_train['Age'] = x_train['Age'].apply(solution)
x_test['Age'] = x_test['Age'].apply(solution)
# Drop the ID column
X_train = x_train.drop(columns='ID')
X_test = x_test.drop(columns='ID')
y_train = y_train['Outcome'].to_numpy()
y_test = y_test['Outcome'].to_numpy()
# Log transform to bring the distributions closer to normal
import numpy as np
for i in X_train.select_dtypes(exclude='object').columns:
    # Shift columns with negative values so the minimum becomes 0
    if X_train[i].min() < 0:
        X_train[i] = X_train[i] + abs(X_train[i].min())
    X_train[i] = np.log1p(X_train[i])
for i in X_test.select_dtypes(exclude='object').columns:
    # Note: this shift uses the test set's own minimum
    if X_test[i].min() < 0:
        X_test[i] = X_test[i] + abs(X_test[i].min())
    X_test[i] = np.log1p(X_test[i])
# Split the data (train: 0.67, validation: 0.33)
from sklearn.model_selection import train_test_split
X_train, X_validation, Y_train, Y_validation = train_test_split(X_train, y_train, test_size=0.33, random_state=43)
X_train.reset_index(drop=True, inplace=True)
X_validation.reset_index(drop=True, inplace=True)
# Scaling
from sklearn.preprocessing import RobustScaler
object_columns = X_train.select_dtypes(include='object').columns
std_scaler = RobustScaler()
std_scaler.fit(X_train.drop(columns = object_columns))
X_train_sc = std_scaler.transform(X_train.drop(columns=object_columns))
X_train_sc = pd.DataFrame(X_train_sc, columns = X_train.drop(columns = object_columns).columns)
X_validation_sc = std_scaler.transform(X_validation.drop(columns=object_columns))
X_validation_sc = pd.DataFrame(X_validation_sc, columns = X_validation.drop(columns=object_columns).columns)
for i in object_columns:
    X_train_sc[i] = X_train[i]
    X_validation_sc[i] = X_validation[i]
# Age is the only object column; it has several categories, so apply label encoding
# Note: the encoder is refit on each split, so the codes only match if both splits contain the same Age groups
from sklearn.preprocessing import LabelEncoder
labelencoder = LabelEncoder()
for i in X_train_sc.select_dtypes(include='object').columns:
    X_train_sc[i] = labelencoder.fit_transform(X_train_sc[i])
    X_validation_sc[i] = labelencoder.fit_transform(X_validation_sc[i])
# The target is imbalanced, so oversample with SMOTE
#print(pd.Series(Y_train).value_counts())
from imblearn.over_sampling import SMOTE
X_train_sc, Y_train = SMOTE().fit_resample(X_train_sc, Y_train)
# Model with RandomForestClassifier
from sklearn.ensemble import RandomForestClassifier
RFC = RandomForestClassifier()
RFC.fit(X_train_sc, Y_train)
pred_train = RFC.predict(X_train_sc)
pred_validation = RFC.predict(X_validation_sc)
pred_train_proba_0 = RFC.predict_proba(X_train_sc)[:,0]
pred_train_proba_1 = RFC.predict_proba(X_train_sc)[:,1]
pred_validation_proba_0 = RFC.predict_proba(X_validation_sc)[:,0]
pred_validation_proba_1 = RFC.predict_proba(X_validation_sc)[:,1]
# Compare the predictions with the actual values to evaluate performance
# This is classification, so use confusion-matrix based metrics plus ROC AUC
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score
train_acc = accuracy_score(Y_train, pred_train)
validation_acc = accuracy_score(Y_validation, pred_validation)
train_pre = precision_score(Y_train, pred_train)
validation_pre = precision_score(Y_validation, pred_validation)
train_recall = recall_score(Y_train, pred_train)
validation_recall = recall_score(Y_validation, pred_validation)
train_f1 = f1_score(Y_train, pred_train)
validation_f1 = f1_score(Y_validation, pred_validation)
train_roc_auc = roc_auc_score(Y_train, pred_train_proba_1)
validation_roc_auc = roc_auc_score(Y_validation, pred_validation_proba_1)
print('train_acc',train_acc)
print('train_pre',train_pre)
print('train_recall',train_recall)
print('train_f1',train_f1)
print('train_roc_auc',train_roc_auc)
print('\n')
print('validation_acc',validation_acc)
print('validation_pre',validation_pre)
print('validation_recall',validation_recall)
print('validation_f1',validation_f1)
print('validation_roc_auc',validation_roc_auc)
# Finally, evaluate on the test data
# and output the final predictions as a DataFrame
#display(X_test)
# Scale the test data
object_columns = X_test.select_dtypes(include='object').columns
X_test_sc = std_scaler.transform(X_test.drop(columns = object_columns))
X_test_sc = pd.DataFrame(X_test_sc, columns = X_test.drop(columns = object_columns).columns)
X_test_sc['Age'] = X_test['Age']
# Label-encode the Age column
from sklearn.preprocessing import LabelEncoder
labelencoder = LabelEncoder()
for i in X_test_sc.select_dtypes(include='object').columns:
    X_test_sc[i] = labelencoder.fit_transform(X_test_sc[i])
pred_test = RFC.predict(X_test_sc)
pred_test_proba_0 = RFC.predict_proba(X_test_sc)[:,0]
pred_test_proba_1 = RFC.predict_proba(X_test_sc)[:,1]
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score
test_acc = accuracy_score(y_test, pred_test)
test_pre = precision_score(y_test, pred_test)
test_recall = recall_score(y_test, pred_test)
test_f1 = f1_score(y_test, pred_test)
test_roc_auc = roc_auc_score(y_test, pred_test_proba_1)
print('\n')
print('test_acc',test_acc)
print('test_pre',test_pre)
print('test_recall',test_recall)
print('test_f1',test_f1)
print('test_roc_auc',test_roc_auc)
# Build the final result as a DataFrame, then save it to CSV
result = pd.DataFrame({'ID':x_test['ID'], 'Outcome':pred_test, 'Outcome_pro_0':pred_test_proba_0, 'Outcome_pro_1':pred_test_proba_1})
#result.to_csv('asdf.csv', index=False)
result
| | ID | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 8 | 126 | 88 | 36 | 108 | 38.5 | 0.349 | 49 |
| 1 | 1 | 0 | 74 | 52 | 10 | 36 | 27.8 | 0.269 | 22 |
| 2 | 2 | 1 | 140 | 74 | 26 | 180 | 24.1 | 0.828 | 23 |
| 3 | 3 | 6 | 162 | 62 | 0 | 0 | 24.3 | 0.178 | 50 |
| 4 | 4 | 2 | 94 | 68 | 18 | 76 | 26.0 | 0.561 | 21 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 609 | 761 | 5 | 104 | 74 | 0 | 0 | 28.8 | 0.153 | 48 |
| 610 | 762 | 4 | 132 | 86 | 31 | 0 | 28.0 | 0.419 | 63 |
| 611 | 763 | 6 | 80 | 66 | 30 | 0 | 26.2 | 0.313 | 41 |
| 612 | 764 | 8 | 197 | 74 | 0 | 0 | 25.9 | 1.191 | 39 |
| 613 | 766 | 2 | 56 | 56 | 28 | 45 | 24.2 | 0.332 | 22 |
614 rows × 9 columns
| | ID | Outcome |
|---|---|---|
| 0 | 0 | 0 |
| 1 | 1 | 0 |
| 2 | 2 | 0 |
| 3 | 3 | 1 |
| 4 | 4 | 0 |
train_acc 1.0
train_pre 1.0
train_recall 1.0
train_f1 1.0
train_roc_auc 1.0

validation_acc 0.7487684729064039
validation_pre 0.6027397260273972
validation_recall 0.6666666666666666
validation_f1 0.6330935251798561
validation_roc_auc 0.8399690333996903

test_acc 0.8571428571428571
test_pre 0.7704918032786885
test_recall 0.8545454545454545
test_f1 0.810344827586207
test_roc_auc 0.9471992653810836
Out[126]:
| | ID | Outcome | Outcome_pro_0 | Outcome_pro_1 |
|---|---|---|---|---|
| 0 | 13 | 0 | 0.80 | 0.20 |
| 1 | 18 | 0 | 0.88 | 0.12 |
| 2 | 29 | 0 | 0.95 | 0.05 |
| 3 | 33 | 1 | 0.26 | 0.74 |
| 4 | 34 | 0 | 0.97 | 0.03 |
| ... | ... | ... | ... | ... |
| 149 | 751 | 1 | 0.24 | 0.76 |
| 150 | 752 | 0 | 0.69 | 0.31 |
| 151 | 759 | 0 | 1.00 | 0.00 |
| 152 | 765 | 1 | 0.03 | 0.97 |
| 153 | 767 | 1 | 0.08 | 0.92 |
154 rows × 4 columns
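The decade binning of Age in the cell above is written as a chain of `if`/`elif` branches followed by a separate label-encoding step. A possible shorter form, sketched here with `pd.cut`, produces ordered groups whose integer codes already follow age order. The bin edges, labels, and sample ages are assumptions for illustration; one behavioural difference is that ages below 20 become missing values here, whereas the original `else` branch puts them in the top group.

```python
import numpy as np
import pandas as pd

# Made-up ages for illustration
ages = pd.Series([21, 29, 30, 49, 63, 81], name='Age')

# Left-closed bins: [20, 30), [30, 40), ..., [80, inf)
bins = [20, 30, 40, 50, 60, 70, 80, np.inf]
labels = ['20s', '30s', '40s', '50s', '60s', '70s', '80s+']
age_group = pd.cut(ages, bins=bins, labels=labels, right=False)

# Ordered categories carry integer codes in age order
age_code = age_group.cat.codes
print(pd.DataFrame({'Age': ages, 'group': age_group, 'code': age_code}))
```

Because the codes come from the fixed bin list instead of from whichever groups happen to appear in a split, the train, validation, and test sets all share the same mapping.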