In [3]:
import pandas as pd
df = pd.read_csv('https://raw.githubusercontent.com/Datamanim/datarepo/main/krdatacertificate/e3_p1_1.csv')
df
Out[3]:
| | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value | ocean_proximity |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -122.23 | 37.88 | 41.0 | 880.0 | 129.0 | 322.0 | 126.0 | 8.3252 | 452600.0 | NEAR BAY |
| 1 | -122.22 | 37.86 | 21.0 | 7099.0 | 1106.0 | 2401.0 | 1138.0 | 8.3014 | 358500.0 | NEAR BAY |
| 2 | -122.24 | 37.85 | 52.0 | 1467.0 | 190.0 | 496.0 | 177.0 | 7.2574 | 352100.0 | NEAR BAY |
| 3 | -122.25 | 37.85 | 52.0 | 1274.0 | 235.0 | 558.0 | 219.0 | 5.6431 | 341300.0 | NEAR BAY |
| 4 | -122.25 | 37.85 | 52.0 | 1627.0 | 280.0 | 565.0 | 259.0 | 3.8462 | 342200.0 | NEAR BAY |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 20635 | -121.09 | 39.48 | 25.0 | 1665.0 | 374.0 | 845.0 | 330.0 | 1.5603 | 78100.0 | INLAND |
| 20636 | -121.21 | 39.49 | 18.0 | 697.0 | 150.0 | 356.0 | 114.0 | 2.5568 | 77100.0 | INLAND |
| 20637 | -121.22 | 39.43 | 17.0 | 2254.0 | 485.0 | 1007.0 | 433.0 | 1.7000 | 92300.0 | INLAND |
| 20638 | -121.32 | 39.43 | 18.0 | 1860.0 | 409.0 | 741.0 | 349.0 | 1.8672 | 84700.0 | INLAND |
| 20639 | -121.24 | 39.37 | 16.0 | 2785.0 | 616.0 | 1387.0 | 530.0 | 2.3886 | 89400.0 | INLAND |
20640 rows × 10 columns
Drop every row that contains at least one missing value. Then keep only the first 70% of the remaining rows, and compute the first quartile of the median_income column, rounded to two decimal places.
In [13]:
target = df.dropna()                                     # drop rows with any missing value
target = target[:int(len(target)*0.7)]                   # keep the first 70% of the remaining rows
print(round(target['median_income'].quantile(0.25),2))   # Q1 of median_income, rounded to 2 decimals
2.51
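For reference, the same answer can be computed in one chained expression; a minimal equivalent sketch, with the same assumption as above that "top 70%" means the first 70% of rows in file order:
cleaned = df.dropna()
print(round(cleaned.head(int(len(cleaned) * 0.7))['median_income'].quantile(0.25), 2))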
In [46]:
import pandas as pd
df =pd.read_csv('https://raw.githubusercontent.com/Datamanim/datarepo/main/krdatacertificate/e3_p1_2.csv')
df.head()
Out[46]:
| | Country Name | Country Code | Year | Value |
|---|---|---|---|---|
| 0 | Arab World | ARB | 1968 | 2576068.3 |
| 1 | Arab World | ARB | 1969 | 2843420.4 |
| 2 | Arab World | ARB | 1970 | 3138550.0 |
| 3 | Arab World | ARB | 1971 | 3642691.0 |
| 4 | Arab World | ARB | 1972 | 4331605.7 |
Count the countries whose GDP was at or below that year's average in 1990 but at or above that year's average in 2010.
In [56]:
target_1990 = df.loc[df['Year']==1990]
target_1990 = target_1990.loc[target_1990['Value']<=target_1990['Value'].mean()]  # at or below the 1990 mean
target_2010 = df.loc[df['Year']==2010]
target_2010 = target_2010.loc[target_2010['Value']>=target_2010['Value'].mean()]  # at or above the 2010 mean
display(target_2010)
display(target_1990)
result = set(target_2010['Country Code']) & set(target_1990['Country Code'])  # codes satisfying both conditions
print(len(result))
| | Country Name | Country Code | Year | Value |
|---|---|---|---|---|
| 183 | Early-demographic dividend | EAR | 2010 | 8.514323e+08 |
| 240 | East Asia & Pacific | EAS | 2010 | 1.693876e+09 |
| 297 | East Asia & Pacific (excluding high income) | EAP | 2010 | 7.879327e+08 |
| 354 | East Asia & Pacific (IDA & IBRD countries) | TEA | 2010 | 7.864009e+08 |
| 411 | Euro area | EMU | 2010 | 1.264525e+09 |
| 468 | Europe & Central Asia | ECS | 2010 | 2.092469e+09 |
| 496 | Europe & Central Asia (excluding high income) | ECA | 2010 | 3.156255e+08 |
| 524 | Europe & Central Asia (IDA & IBRD countries) | TEC | 2010 | 3.635513e+08 |
| 581 | European Union | EUU | 2010 | 1.699273e+09 |
| 712 | High income | HIC | 2010 | 4.520682e+09 |
| 769 | IBRD only | IBD | 2010 | 1.997395e+09 |
| 826 | IDA & IBRD total | IBT | 2010 | 2.142554e+09 |
| 1034 | Late-demographic dividend | LTE | 2010 | 1.351375e+09 |
| 1091 | Latin America & Caribbean | LCN | 2010 | 5.334533e+08 |
| 1148 | Latin America & Caribbean (excluding high income) | LAC | 2010 | 4.923605e+08 |
| 1205 | Latin America & the Caribbean (IDA & IBRD coun... | TLA | 2010 | 5.142115e+08 |
| 1294 | Low & middle income | LMY | 2010 | 2.075124e+09 |
| 1382 | Lower middle income | LMC | 2010 | 4.666965e+08 |
| 1431 | Middle East & North Africa | MEA | 2010 | 2.767962e+08 |
| 1536 | Middle income | MIC | 2010 | 2.045861e+09 |
| 1593 | North America | NAC | 2010 | 1.658358e+09 |
| 1650 | OECD members | OED | 2010 | 4.460625e+09 |
| 1796 | Post-demographic dividend | PST | 2010 | 4.241356e+09 |
| 2242 | Upper middle income | UMC | 2010 | 1.579223e+09 |
| 2299 | World | WLD | 2010 | 6.595453e+09 |
| 3463 | Brazil | BRA | 2010 | 2.208872e+08 |
| 4096 | China | CHN | 2010 | 6.100620e+08 |
| 5219 | France | FRA | 2010 | 2.646837e+08 |
| 5437 | Germany | DEU | 2010 | 3.417095e+08 |
| 6576 | Japan | JPN | 2010 | 5.700098e+08 |
| 11022 | United Kingdom | GBR | 2010 | 2.441173e+08 |
| 11079 | United States | USA | 2010 | 1.496437e+09 |
| | Country Name | Country Code | Year | Value |
|---|---|---|---|---|
| 22 | Arab World | ARB | 1990 | 44687716.2 |
| 79 | Caribbean small states | CSS | 1990 | 1788443.3 |
| 106 | Central Europe and the Baltics | CEB | 1990 | 25639112.0 |
| 277 | East Asia & Pacific (excluding high income) | EAP | 1990 | 66986302.0 |
| 334 | East Asia & Pacific (IDA & IBRD countries) | TEA | 1990 | 66850373.3 |
| ... | ... | ... | ... | ... |
| 11268 | Vietnam | VNM | 1990 | 647174.1 |
| 11325 | Virgin Islands (U.S.) | VIR | 1990 | 156470.0 |
| 11366 | Yemen, Rep. | YEM | 1990 | 564725.2 |
| 11423 | Zambia | ZMB | 1990 | 328521.7 |
| 11480 | Zimbabwe | ZWE | 1990 | 878381.7 |
194 rows × 4 columns
5
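For comparison, the same count can be derived from a pivot table; a minimal sketch over the same df (entities missing either year fall out of the mask, just as they fall out of the set intersection):
pv = df.pivot_table(index='Country Code', columns='Year', values='Value')
yearly_mean = df.groupby('Year')['Value'].mean()
mask = (pv[1990] <= yearly_mean[1990]) & (pv[2010] >= yearly_mean[2010])
print(mask.sum())  # expected to match the set-intersection answer above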
In [41]:
import pandas as pd
df = pd.read_csv('https://raw.githubusercontent.com/Datamanim/datarepo/main/krdatacertificate/e3_p1_3.csv')
df.head()
Out[41]:
| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3.0 | Braund, Mr. Owen Harris | male | 22.0 | 1.0 | 0.0 | A/5 21171 | 7.2500 | NaN | S |
| 1 | 2 | 1 | 1.0 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1.0 | 0.0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3.0 | Heikkinen, Miss. Laina | female | 26.0 | 0.0 | 0.0 | STON/O2. 3101282 | 7.9250 | C85 | S |
| 3 | 4 | 1 | 1.0 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1.0 | 0.0 | 113803 | 53.1000 | C123 | S |
| 4 | 5 | 0 | 3.0 | Allen, Mr. William Henry | male | 35.0 | 0.0 | 0.0 | 373450 | 8.0500 | C123 | S |
Print the name of the column with the most missing values.
In [44]:
target = df.isnull().sum().sort_values(ascending=False)  # missing-value counts, largest first
print(target.index[0])                                   # column with the most missing values
Fare
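The same column name can be read off with idxmax(); a one-line equivalent:
print(df.isnull().sum().idxmax())  # column with the most missing values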
Part 2
Travel insurance purchase classification: https://www.kaggle.com/datasets/tejashvi14/travel-insurance-prediction-data
DataUrl(train) = https://raw.githubusercontent.com/Datamanim/datarepo/main/krdatacertificate/e3_p2_train_.csv
DataUrl(test) = https://raw.githubusercontent.com/Datamanim/datarepo/main/krdatacertificate/e3_p2_test_.csv
Target variable: TravelInsurance. Submit the predicted probability that TravelInsurance equals 1. Evaluation metric: AUC.
The submission file must contain exactly two columns: ID and proba.
In [149]:
# Load the data
import pandas as pd
x_train = pd.read_csv('https://raw.githubusercontent.com/Datamanim/datarepo/main/krdatacertificate/e3_p2_train_.csv')
x_test = pd.read_csv('https://raw.githubusercontent.com/Datamanim/datarepo/main/krdatacertificate/e3_p2_test_.csv')
#display(x_train)
#display(x_test)
# Drop the ID column
X_train = x_train.drop(columns = ['ID','TravelInsurance'])
X_test = x_test.drop(columns = 'ID')
y_train = x_train[['ID','TravelInsurance']]
display(X_train)
display(y_train)
y_train = y_train['TravelInsurance']
#print(X_train.info())
#print(X_train.nunique())
#print(X_train.isnull().sum()) # no missing values
#print(X_test.isnull().sum())
# Cast the numeric-coded categorical column to object
X_train['ChronicDiseases'] = X_train['ChronicDiseases'].astype('object')
X_test['ChronicDiseases'] = X_test['ChronicDiseases'].astype('object')
# Replace outliers in numeric columns with the column mean (1.5*IQR fences)
def solution(data,col):
    q3 = data[col].quantile(0.75)
    q1 = data[col].quantile(0.25)
    iqr = q3 - q1
    upper = q3 + (1.5*iqr)
    lower = q1 - (1.5*iqr)
    # values outside the fences get the column mean (computed before replacement)
    data[col] = data[col].map(lambda x : data[col].mean() if (x>upper) or (x<lower) else x)
    return data
for i in X_train.select_dtypes(exclude='object').columns:
    X_train = solution(X_train, i)
    X_test = solution(X_test, i)
# Train/validation split
from sklearn.model_selection import train_test_split
X_train, X_validation, Y_train, Y_validation = train_test_split(X_train, y_train, test_size=0.33, random_state=1, stratify=y_train)
#print(X_train.shape, X_validation.shape, Y_train.shape, Y_validation.shape)
X_train.reset_index(drop=True, inplace=True)
X_validation.reset_index(drop=True, inplace=True)
# Log-transform the numeric columns (shift any negative column so log1p is defined)
import numpy as np
for i in X_train.select_dtypes(exclude='object').columns:
    if X_train[i].min() < 0:
        X_train[i] = X_train[i].map(lambda x : x+abs(X_train[i].min()))
    X_train[i] = np.log1p(X_train[i])
    if X_validation[i].min() < 0:
        X_validation[i] = X_validation[i].map(lambda x : x+abs(X_validation[i].min()))
    X_validation[i] = np.log1p(X_validation[i])
    if X_test[i].min() < 0:
        X_test[i] = X_test[i].map(lambda x : x+abs(X_test[i].min()))
    X_test[i] = np.log1p(X_test[i])
# Scale the numeric columns with StandardScaler (fit on train only)
obj_col = X_train.select_dtypes(include='object').columns
from sklearn.preprocessing import StandardScaler
sds = StandardScaler()
sds.fit(X_train.drop(columns = obj_col))
X_train_sc = sds.transform(X_train.drop(columns=obj_col))
X_train_sc = pd.DataFrame(X_train_sc, columns = X_train.drop(columns=obj_col).columns)
X_validation_sc = sds.transform(X_validation.drop(columns=obj_col))
X_validation_sc = pd.DataFrame(X_validation_sc, columns = X_validation.drop(columns=obj_col).columns)
X_test_sc = sds.transform(X_test.drop(columns=obj_col))
X_test_sc = pd.DataFrame(X_test_sc, columns = X_test.drop(columns=obj_col).columns)
for i in obj_col:
    X_train_sc[i] = X_train[i]
    X_validation_sc[i] = X_validation[i]
    X_test_sc[i] = X_test[i]
# One-hot encode (concat all splits first so the dummy columns stay aligned)
#display(X_test_sc)
X_full = pd.concat([X_train_sc, X_validation_sc, X_test_sc])
X_full = pd.get_dummies(X_full)
X_train_sc = X_full[:len(X_train_sc)]
X_validation_sc = X_full[len(X_train_sc):len(X_train_sc)+len(X_validation_sc)]
X_test_sc = X_full[len(X_train_sc)+len(X_validation_sc):]
#display(X_test_sc)
# Model selection: random forest vs. XGBoost
# Evaluation metric: AUC
from sklearn.ensemble import RandomForestClassifier
RFC = RandomForestClassifier(random_state=10)
RFC.fit(X_train_sc, Y_train)
pred_train = RFC.predict(X_train_sc)
pred_validation = RFC.predict(X_validation_sc)
pred_train_proba0 = RFC.predict_proba(X_train_sc)[:,0]
pred_train_proba1 = RFC.predict_proba(X_train_sc)[:,1]
pred_validation_proba0 = RFC.predict_proba(X_validation_sc)[:,0]
pred_validation_proba1 = RFC.predict_proba(X_validation_sc)[:,1]
from sklearn.metrics import roc_auc_score
roc_auc_score_train = roc_auc_score(Y_train, pred_train_proba1)
roc_auc_score_validation = roc_auc_score(Y_validation, pred_validation_proba1)
#print('roc_auc_score_train',roc_auc_score_train)
#print('roc_auc_score_validation',roc_auc_score_validation)
#print('\n')
# XGBoost (name the estimator xgb_model so it does not shadow the module)
import xgboost as xgb
xgb_model = xgb.XGBClassifier(random_state=11, n_estimators=30, max_depth=3)
xgb_model.fit(X_train_sc, Y_train)
pred_train = xgb_model.predict(X_train_sc)
pred_validation = xgb_model.predict(X_validation_sc)
pred_train_proba0 = xgb_model.predict_proba(X_train_sc)[:,0]
pred_train_proba1 = xgb_model.predict_proba(X_train_sc)[:,1]
pred_validation_proba0 = xgb_model.predict_proba(X_validation_sc)[:,0]
pred_validation_proba1 = xgb_model.predict_proba(X_validation_sc)[:,1]
from sklearn.metrics import roc_auc_score
roc_auc_score_train = roc_auc_score(Y_train, pred_train_proba1)
roc_auc_score_validation = roc_auc_score(Y_validation, pred_validation_proba1)
#print('roc_auc_score_train',roc_auc_score_train)
#print('roc_auc_score_validation',roc_auc_score_validation)
# XGBoost scores higher on validation, so select XGBoost
# Search for the best XGBoost hyperparameters
#from sklearn.model_selection import GridSearchCV
#import xgboost as xgb
#model = xgb.XGBClassifier(random_state=11)
#parameters = {
#'n_estimators' : [20,30,40],
#'max_depth' : [2,3,4]
#}
#gridshcv = GridSearchCV(model, parameters, cv=5)
#gridshcv.fit(X_train_sc, Y_train)
#print('best params', gridshcv.best_params_)
# Finally, print the validation AUC and save the test predictions to csv
# The file must contain only ID and proba
print(roc_auc_score_validation)
pred_test = xgb_model.predict(X_test_sc)
pred_test_proba0 = xgb_model.predict_proba(X_test_sc)[:,0]
pred_test_proba1 = xgb_model.predict_proba(X_test_sc)[:,1]
#pd.DataFrame({'ID':x_test['ID'], 'proba':pred_test_proba1}).to_csv('2017654.csv', index=False)
| | Age | Employment Type | GraduateOrNot | AnnualIncome | FamilyMembers | ChronicDiseases | FrequentFlyer | EverTravelledAbroad |
|---|---|---|---|---|---|---|---|---|
| 0 | 26 | Private Sector/Self Employed | Yes | 1400000 | 5 | 0 | No | Yes |
| 1 | 30 | Private Sector/Self Employed | No | 1450000 | 5 | 0 | Yes | Yes |
| 2 | 32 | Government Sector | Yes | 900000 | 4 | 0 | No | No |
| 3 | 26 | Private Sector/Self Employed | Yes | 1400000 | 7 | 0 | No | Yes |
| 4 | 34 | Private Sector/Self Employed | No | 1400000 | 3 | 1 | No | Yes |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1485 | 29 | Private Sector/Self Employed | Yes | 900000 | 5 | 0 | No | No |
| 1486 | 31 | Government Sector | No | 300000 | 3 | 0 | No | No |
| 1487 | 29 | Private Sector/Self Employed | Yes | 1100000 | 5 | 0 | Yes | No |
| 1488 | 30 | Private Sector/Self Employed | No | 550000 | 3 | 0 | Yes | Yes |
| 1489 | 28 | Private Sector/Self Employed | Yes | 800000 | 6 | 0 | No | Yes |
1490 rows × 8 columns
| | ID | TravelInsurance |
|---|---|---|
| 0 | 1008 | 1 |
| 1 | 199 | 1 |
| 2 | 86 | 0 |
| 3 | 560 | 1 |
| 4 | 161 | 1 |
| ... | ... | ... |
| 1485 | 584 | 0 |
| 1486 | 353 | 0 |
| 1487 | 1124 | 0 |
| 1488 | 1356 | 0 |
| 1489 | 1686 | 0 |
1490 rows × 2 columns
0.827598017124831
Out[149]:
| | ID | proba |
|---|---|---|
| 0 | 6 | 0.298050 |
| 1 | 9 | 0.289210 |
| 2 | 20 | 0.174476 |
| 3 | 21 | 0.051719 |
| 4 | 23 | 0.107967 |
| ... | ... | ... |
| 492 | 1964 | 0.174312 |
| 493 | 1970 | 0.254646 |
| 494 | 1973 | 0.948108 |
| 495 | 1976 | 0.239027 |
| 496 | 1982 | 0.976219 |
497 rows × 2 columns
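The concat-then-split trick above keeps the dummy columns aligned across train, validation, and test. A leaner alternative (a sketch, not the method used above) is to build dummies per split and reindex against the training columns; note that this silently drops categories unseen in training:
import pandas as pd
train = pd.DataFrame({'color': ['red', 'blue', 'red']})
test = pd.DataFrame({'color': ['blue', 'green']})  # 'green' never appears in train
train_d = pd.get_dummies(train)
test_d = pd.get_dummies(test).reindex(columns=train_d.columns, fill_value=0)
print(test_d)  # columns match train_d: 'color_green' is dropped, 'color_red' filled with 0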
Part 3
Grade each user by the difference (투약후 − 투약전): 'A등급' if the difference is below -3, 'B등급' if between -3 and 0, otherwise 'C등급'. Count how many users fall into grade A ('A등급').
In [156]:
import pandas as pd
df = pd.read_csv('https://raw.githubusercontent.com/Datamanim/datarepo/main/krdatacertificate/e3_p3_1.csv')
display(df.head())
df['차이등급'] = df['투약후'] - df['투약전']    # difference: post-dose minus pre-dose
# grade the difference; note the lambda sends both x == -3 and x >= 0 to 'C등급'
df['차이등급'] = df['차이등급'].map(lambda x : 'A등급' if x<-3 else 'B등급' if -3<x<0 else 'C등급')
target = df.loc[df['차이등급']=='A등급']
print(len(target))
| | ID | 투약전 | 투약후 |
|---|---|---|---|
| 0 | user_1 | 55.137 | 56.106 |
| 1 | user_2 | 66.584 | 60.409 |
| 2 | user_3 | 52.259 | 50.639 |
| 3 | user_4 | 77.081 | 69.164 |
| 4 | user_5 | 62.677 | 57.622 |
121
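The same grading can be written without the chained ternary using np.select; a hedged equivalent sketch (the default branch reproduces the lambda above, which sends both x == -3 and x >= 0 to 'C등급'):
import numpy as np
diff = df['투약후'] - df['투약전']
grade = np.select([diff < -3, (diff > -3) & (diff < 0)], ['A등급', 'B등급'], default='C등급')
print((grade == 'A등급').sum())  # expected to match the 121 above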
Compute the chi-square test statistic, rounded to three decimal places.
In [163]:
target = df['차이등급'].value_counts().to_frame()   # counts, most frequent first (in pandas >= 2.0 the column is named 'count', not '차이등급')
# expected frequencies under H0 (50% / 25% / 25%), listed in the same order as value_counts
target['expected'] = [target['차이등급'].sum()*0.5, target['차이등급'].sum()*0.25, target['차이등급'].sum()*0.25]
from scipy.stats import chisquare
s,p = chisquare(target['차이등급'], target['expected'])
print(round(s,3))
3.613
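As a cross-check, the goodness-of-fit statistic is just the sum of (O − E)²/E; computing it by hand from the observed and expected counts built above:
obs = target['차이등급']    # observed counts
exp = target['expected']
print(round((((obs - exp) ** 2) / exp).sum(), 3))  # matches the chisquare() statistic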
Compute the chi-square test p-value rounded to three decimal places, and at the 0.05 significance level state which hypothesis is supported (null/alternative).
In [165]:
print(round(p,3))
print('p-value is greater than 0.05, so we fail to reject the null hypothesis')
0.164
p-value is greater than 0.05, so we fail to reject the null hypothesis
This data records the rpm values of machines produced at factories A and B. We want to use a paired-sample t-test to determine whether factory B's products can be said to have higher rpm than factory A's.
In [168]:
import pandas as pd
df = pd.read_csv('https://raw.githubusercontent.com/Datamanim/datarepo/main/krdatacertificate/e3_p3_2_.csv')
df
Out[168]:
| | rpm | group |
|---|---|---|
| 0 | 163.6 | A |
| 1 | 246.9 | A |
| 2 | 234.9 | A |
| 3 | 156.9 | A |
| 4 | 226.6 | A |
| ... | ... | ... |
| 195 | 239.9 | B |
| 196 | 276.4 | B |
| 197 | 260.9 | B |
| 198 | 197.2 | B |
| 199 | 233.3 | B |
200 rows × 2 columns
Check whether each of factories A and B is normally distributed using the Shapiro-Wilk test (print each factory's p-value).
In [173]:
from scipy.stats import shapiro
a = df.loc[df['group']=='A']['rpm']
b = df.loc[df['group']=='B']['rpm']
print('a_pvalue',shapiro(a)[1])
print('b_pvalue',shapiro(b)[1])
a_pvalue 0.397915244102478
b_pvalue 0.9562830924987793
Both p-values exceed 0.05, so normality is not rejected for either factory. Next, check whether the two samples have equal variances with Levene's test.
In [174]:
from scipy.stats import levene
levene(a,b)
Out[174]:
LeveneResult(statistic=0.014634456943777612, pvalue=0.9038351280607142)
The Levene p-value (about 0.904) is above 0.05, so equal variances are not rejected. Now use a paired-sample t-test to test whether factory B's rpm can be said to be greater than factory A's. Print the p-value to three decimal places and state which of the null and alternative hypotheses is supported.
In [179]:
from scipy.stats import ttest_rel
s, p = ttest_rel(b,a,alternative='greater')   # one-sided: H1 is mean(b - a) > 0
if round(p,3) >= 0.05:
    print(p,'fail to reject the null hypothesis')
elif round(p,3) < 0.05:
    print(p,'reject the null in favor of the alternative')
0.008927208346711192 reject the null in favor of the alternative
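An aside, not what the exam asks for: rpm readings from two different factories are arguably independent samples rather than paired measurements, so an independent two-sample t-test would be the textbook choice; a sketch, using equal_var=True consistent with the Levene result above (the alternative= argument needs scipy >= 1.6):
from scipy.stats import ttest_ind
s2, p2 = ttest_ind(b, a, equal_var=True, alternative='greater')
print(round(p2, 3))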