In [3]:
import pandas as pd
df = pd.read_csv('https://raw.githubusercontent.com/Datamanim/datarepo/main/krdatacertificate/e3_p1_1.csv')
df
Out[3]:
| | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value | ocean_proximity |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -122.23 | 37.88 | 41.0 | 880.0 | 129.0 | 322.0 | 126.0 | 8.3252 | 452600.0 | NEAR BAY |
| 1 | -122.22 | 37.86 | 21.0 | 7099.0 | 1106.0 | 2401.0 | 1138.0 | 8.3014 | 358500.0 | NEAR BAY |
| 2 | -122.24 | 37.85 | 52.0 | 1467.0 | 190.0 | 496.0 | 177.0 | 7.2574 | 352100.0 | NEAR BAY |
| 3 | -122.25 | 37.85 | 52.0 | 1274.0 | 235.0 | 558.0 | 219.0 | 5.6431 | 341300.0 | NEAR BAY |
| 4 | -122.25 | 37.85 | 52.0 | 1627.0 | 280.0 | 565.0 | 259.0 | 3.8462 | 342200.0 | NEAR BAY |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 20635 | -121.09 | 39.48 | 25.0 | 1665.0 | 374.0 | 845.0 | 330.0 | 1.5603 | 78100.0 | INLAND |
| 20636 | -121.21 | 39.49 | 18.0 | 697.0 | 150.0 | 356.0 | 114.0 | 2.5568 | 77100.0 | INLAND |
| 20637 | -121.22 | 39.43 | 17.0 | 2254.0 | 485.0 | 1007.0 | 433.0 | 1.7000 | 92300.0 | INLAND |
| 20638 | -121.32 | 39.43 | 18.0 | 1860.0 | 409.0 | 741.0 | 349.0 | 1.8672 | 84700.0 | INLAND |
| 20639 | -121.24 | 39.37 | 16.0 | 2785.0 | 616.0 | 1387.0 | 530.0 | 2.3886 | 89400.0 | INLAND |
20640 rows × 10 columns
Drop every row that contains at least one missing value. Then keep only the first 70% of the remaining rows, and compute the first quartile of the median_income column, rounded to two decimal places.
In [13]:
target = df.dropna()                                     # drop rows with any missing value
target = target[:int(len(target)*0.7)]                   # keep the first 70% of the remaining rows
print(round(target['median_income'].quantile(0.25),2))   # Q1 of median_income, rounded to 2 decimals
2.51
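For reference, the same answer can be computed in one chained expression; a minimal equivalent sketch, with the same assumption as above that "top 70%" means the first 70% of rows in file order:
cleaned = df.dropna()
print(round(cleaned.head(int(len(cleaned) * 0.7))['median_income'].quantile(0.25), 2))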
In [46]:
import pandas as pd
df =pd.read_csv('https://raw.githubusercontent.com/Datamanim/datarepo/main/krdatacertificate/e3_p1_2.csv')
df.head()
Out[46]:
| | Country Name | Country Code | Year | Value |
|---|---|---|---|---|
| 0 | Arab World | ARB | 1968 | 2576068.3 |
| 1 | Arab World | ARB | 1969 | 2843420.4 |
| 2 | Arab World | ARB | 1970 | 3138550.0 |
| 3 | Arab World | ARB | 1971 | 3642691.0 |
| 4 | Arab World | ARB | 1972 | 4331605.7 |
Count the countries whose GDP was at or below that year's average in 1990 but at or above that year's average in 2010.
In [56]:
target_1990 = df.loc[df['Year']==1990]
target_1990 = target_1990.loc[target_1990['Value']<=target_1990['Value'].mean()]  # at or below the 1990 mean
target_2010 = df.loc[df['Year']==2010]
target_2010 = target_2010.loc[target_2010['Value']>=target_2010['Value'].mean()]  # at or above the 2010 mean
display(target_2010)
display(target_1990)
result = set(target_2010['Country Code']) & set(target_1990['Country Code'])  # codes satisfying both conditions
print(len(result))
| | Country Name | Country Code | Year | Value |
|---|---|---|---|---|
| 183 | Early-demographic dividend | EAR | 2010 | 8.514323e+08 |
| 240 | East Asia & Pacific | EAS | 2010 | 1.693876e+09 |
| 297 | East Asia & Pacific (excluding high income) | EAP | 2010 | 7.879327e+08 |
| 354 | East Asia & Pacific (IDA & IBRD countries) | TEA | 2010 | 7.864009e+08 |
| 411 | Euro area | EMU | 2010 | 1.264525e+09 |
| 468 | Europe & Central Asia | ECS | 2010 | 2.092469e+09 |
| 496 | Europe & Central Asia (excluding high income) | ECA | 2010 | 3.156255e+08 |
| 524 | Europe & Central Asia (IDA & IBRD countries) | TEC | 2010 | 3.635513e+08 |
| 581 | European Union | EUU | 2010 | 1.699273e+09 |
| 712 | High income | HIC | 2010 | 4.520682e+09 |
| 769 | IBRD only | IBD | 2010 | 1.997395e+09 |
| 826 | IDA & IBRD total | IBT | 2010 | 2.142554e+09 |
| 1034 | Late-demographic dividend | LTE | 2010 | 1.351375e+09 |
| 1091 | Latin America & Caribbean | LCN | 2010 | 5.334533e+08 |
| 1148 | Latin America & Caribbean (excluding high income) | LAC | 2010 | 4.923605e+08 |
| 1205 | Latin America & the Caribbean (IDA & IBRD coun... | TLA | 2010 | 5.142115e+08 |
| 1294 | Low & middle income | LMY | 2010 | 2.075124e+09 |
| 1382 | Lower middle income | LMC | 2010 | 4.666965e+08 |
| 1431 | Middle East & North Africa | MEA | 2010 | 2.767962e+08 |
| 1536 | Middle income | MIC | 2010 | 2.045861e+09 |
| 1593 | North America | NAC | 2010 | 1.658358e+09 |
| 1650 | OECD members | OED | 2010 | 4.460625e+09 |
| 1796 | Post-demographic dividend | PST | 2010 | 4.241356e+09 |
| 2242 | Upper middle income | UMC | 2010 | 1.579223e+09 |
| 2299 | World | WLD | 2010 | 6.595453e+09 |
| 3463 | Brazil | BRA | 2010 | 2.208872e+08 |
| 4096 | China | CHN | 2010 | 6.100620e+08 |
| 5219 | France | FRA | 2010 | 2.646837e+08 |
| 5437 | Germany | DEU | 2010 | 3.417095e+08 |
| 6576 | Japan | JPN | 2010 | 5.700098e+08 |
| 11022 | United Kingdom | GBR | 2010 | 2.441173e+08 |
| 11079 | United States | USA | 2010 | 1.496437e+09 |
| | Country Name | Country Code | Year | Value |
|---|---|---|---|---|
| 22 | Arab World | ARB | 1990 | 44687716.2 |
| 79 | Caribbean small states | CSS | 1990 | 1788443.3 |
| 106 | Central Europe and the Baltics | CEB | 1990 | 25639112.0 |
| 277 | East Asia & Pacific (excluding high income) | EAP | 1990 | 66986302.0 |
| 334 | East Asia & Pacific (IDA & IBRD countries) | TEA | 1990 | 66850373.3 |
| ... | ... | ... | ... | ... |
| 11268 | Vietnam | VNM | 1990 | 647174.1 |
| 11325 | Virgin Islands (U.S.) | VIR | 1990 | 156470.0 |
| 11366 | Yemen, Rep. | YEM | 1990 | 564725.2 |
| 11423 | Zambia | ZMB | 1990 | 328521.7 |
| 11480 | Zimbabwe | ZWE | 1990 | 878381.7 |
194 rows × 4 columns
5
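For comparison, the same count can be derived from a pivot table; a minimal sketch over the same df (entities missing either year fall out of the mask, just as they fall out of the set intersection):
pv = df.pivot_table(index='Country Code', columns='Year', values='Value')
yearly_mean = df.groupby('Year')['Value'].mean()
mask = (pv[1990] <= yearly_mean[1990]) & (pv[2010] >= yearly_mean[2010])
print(mask.sum())  # expected to match the set-intersection answer above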
In [41]:
import pandas as pd
df = pd.read_csv('https://raw.githubusercontent.com/Datamanim/datarepo/main/krdatacertificate/e3_p1_3.csv')
df.head()
Out[41]:
| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3.0 | Braund, Mr. Owen Harris | male | 22.0 | 1.0 | 0.0 | A/5 21171 | 7.2500 | NaN | S |
| 1 | 2 | 1 | 1.0 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1.0 | 0.0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3.0 | Heikkinen, Miss. Laina | female | 26.0 | 0.0 | 0.0 | STON/O2. 3101282 | 7.9250 | C85 | S |
| 3 | 4 | 1 | 1.0 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1.0 | 0.0 | 113803 | 53.1000 | C123 | S |
| 4 | 5 | 0 | 3.0 | Allen, Mr. William Henry | male | 35.0 | 0.0 | 0.0 | 373450 | 8.0500 | C123 | S |
Print the name of the column with the most missing values.
In [44]:
target = df.isnull().sum().sort_values(ascending=False)  # missing-value counts, largest first
print(target.index[0])                                   # column with the most missing values
Fare
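The same column name can be read off with idxmax(); a one-line equivalent:
print(df.isnull().sum().idxmax())  # column with the most missing values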
Part 2
Travel insurance purchase classification: https://www.kaggle.com/datasets/tejashvi14/travel-insurance-prediction-data
DataUrl(train) = https://raw.githubusercontent.com/Datamanim/datarepo/main/krdatacertificate/e3_p2_train_.csv
DataUrl(test) = https://raw.githubusercontent.com/Datamanim/datarepo/main/krdatacertificate/e3_p2_test_.csv
Target variable: TravelInsurance. Submit the predicted probability that TravelInsurance equals 1. Evaluation metric: AUC.
The submission file must contain exactly two columns: ID and proba.
In [149]:
# Load the data
import pandas as pd
x_train = pd.read_csv('https://raw.githubusercontent.com/Datamanim/datarepo/main/krdatacertificate/e3_p2_train_.csv')
x_test = pd.read_csv('https://raw.githubusercontent.com/Datamanim/datarepo/main/krdatacertificate/e3_p2_test_.csv')
#display(x_train)
#display(x_test)
# Drop the ID column
X_train = x_train.drop(columns = ['ID','TravelInsurance'])
X_test = x_test.drop(columns = 'ID')
y_train = x_train[['ID','TravelInsurance']]
display(X_train)
display(y_train)
y_train = y_train['TravelInsurance']
#print(X_train.info())
#print(X_train.nunique())
#print(X_train.isnull().sum()) # no missing values
#print(X_test.isnull().sum())
# Cast the numeric-coded categorical column to object
X_train['ChronicDiseases'] = X_train['ChronicDiseases'].astype('object')
X_test['ChronicDiseases'] = X_test['ChronicDiseases'].astype('object')
# Replace outliers in numeric columns with the column mean (1.5*IQR fences)
def solution(data,col):
    q3 = data[col].quantile(0.75)
    q1 = data[col].quantile(0.25)
    iqr = q3 - q1
    upper = q3 + (1.5*iqr)
    lower = q1 - (1.5*iqr)
    # values outside the fences get the column mean (computed before replacement)
    data[col] = data[col].map(lambda x : data[col].mean() if (x>upper) or (x<lower) else x)
    return data
for i in X_train.select_dtypes(exclude='object').columns:
    X_train = solution(X_train, i)
    X_test = solution(X_test, i)
# Train/validation split
from sklearn.model_selection import train_test_split
X_train, X_validation, Y_train, Y_validation = train_test_split(X_train, y_train, test_size=0.33, random_state=1, stratify=y_train)
#print(X_train.shape, X_validation.shape, Y_train.shape, Y_validation.shape)
X_train.reset_index(drop=True, inplace=True)
X_validation.reset_index(drop=True, inplace=True)
# Log-transform the numeric columns (shift any negative column so log1p is defined)
import numpy as np
for i in X_train.select_dtypes(exclude='object').columns:
    if X_train[i].min() < 0:
        X_train[i] = X_train[i].map(lambda x : x+abs(X_train[i].min()))
    X_train[i] = np.log1p(X_train[i])
    if X_validation[i].min() < 0:
        X_validation[i] = X_validation[i].map(lambda x : x+abs(X_validation[i].min()))
    X_validation[i] = np.log1p(X_validation[i])
    if X_test[i].min() < 0:
        X_test[i] = X_test[i].map(lambda x : x+abs(X_test[i].min()))
    X_test[i] = np.log1p(X_test[i])
# Scale the numeric columns with StandardScaler (fit on train only)
obj_col = X_train.select_dtypes(include='object').columns
from sklearn.preprocessing import StandardScaler
sds = StandardScaler()
sds.fit(X_train.drop(columns = obj_col))
X_train_sc = sds.transform(X_train.drop(columns=obj_col))
X_train_sc = pd.DataFrame(X_train_sc, columns = X_train.drop(columns=obj_col).columns)
X_validation_sc = sds.transform(X_validation.drop(columns=obj_col))
X_validation_sc = pd.DataFrame(X_validation_sc, columns = X_validation.drop(columns=obj_col).columns)
X_test_sc = sds.transform(X_test.drop(columns=obj_col))
X_test_sc = pd.DataFrame(X_test_sc, columns = X_test.drop(columns=obj_col).columns)
for i in obj_col:
    X_train_sc[i] = X_train[i]
    X_validation_sc[i] = X_validation[i]
    X_test_sc[i] = X_test[i]
# One-hot encode (concat all splits first so the dummy columns stay aligned)
#display(X_test_sc)
X_full = pd.concat([X_train_sc, X_validation_sc, X_test_sc])
X_full = pd.get_dummies(X_full)
X_train_sc = X_full[:len(X_train_sc)]
X_validation_sc = X_full[len(X_train_sc):len(X_train_sc)+len(X_validation_sc)]
X_test_sc = X_full[len(X_train_sc)+len(X_validation_sc):]
#display(X_test_sc)
# Model selection: random forest vs. XGBoost
# Evaluation metric: AUC
from sklearn.ensemble import RandomForestClassifier
RFC = RandomForestClassifier(random_state=10)
RFC.fit(X_train_sc, Y_train)
pred_train = RFC.predict(X_train_sc)
pred_validation = RFC.predict(X_validation_sc)
pred_train_proba0 = RFC.predict_proba(X_train_sc)[:,0]
pred_train_proba1 = RFC.predict_proba(X_train_sc)[:,1]
pred_validation_proba0 = RFC.predict_proba(X_validation_sc)[:,0]
pred_validation_proba1 = RFC.predict_proba(X_validation_sc)[:,1]
from sklearn.metrics import roc_auc_score
roc_auc_score_train = roc_auc_score(Y_train, pred_train_proba1)
roc_auc_score_validation = roc_auc_score(Y_validation, pred_validation_proba1)
#print('roc_auc_score_train',roc_auc_score_train)
#print('roc_auc_score_validation',roc_auc_score_validation)
#print('\n')
# XGBoost (name the estimator xgb_model so it does not shadow the module)
import xgboost as xgb
xgb_model = xgb.XGBClassifier(random_state=11, n_estimators=30, max_depth=3)
xgb_model.fit(X_train_sc, Y_train)
pred_train = xgb_model.predict(X_train_sc)
pred_validation = xgb_model.predict(X_validation_sc)
pred_train_proba0 = xgb_model.predict_proba(X_train_sc)[:,0]
pred_train_proba1 = xgb_model.predict_proba(X_train_sc)[:,1]
pred_validation_proba0 = xgb_model.predict_proba(X_validation_sc)[:,0]
pred_validation_proba1 = xgb_model.predict_proba(X_validation_sc)[:,1]
from sklearn.metrics import roc_auc_score
roc_auc_score_train = roc_auc_score(Y_train, pred_train_proba1)
roc_auc_score_validation = roc_auc_score(Y_validation, pred_validation_proba1)
#print('roc_auc_score_train',roc_auc_score_train)
#print('roc_auc_score_validation',roc_auc_score_validation)
# XGBoost scores higher on validation, so select XGBoost
# Search for the best XGBoost hyperparameters
#from sklearn.model_selection import GridSearchCV
#import xgboost as xgb
#model = xgb.XGBClassifier(random_state=11)
#parameters = {
#'n_estimators' : [20,30,40],
#'max_depth' : [2,3,4]
#}
#gridshcv = GridSearchCV(model, parameters, cv=5)
#gridshcv.fit(X_train_sc, Y_train)
#print('best params', gridshcv.best_params_)
# Finally, print the validation AUC and save the test predictions to csv
# The file must contain only ID and proba
print(roc_auc_score_validation)
pred_test = xgb_model.predict(X_test_sc)
pred_test_proba0 = xgb_model.predict_proba(X_test_sc)[:,0]
pred_test_proba1 = xgb_model.predict_proba(X_test_sc)[:,1]
#pd.DataFrame({'ID':x_test['ID'], 'proba':pred_test_proba1}).to_csv('2017654.csv', index=False)
| | Age | Employment Type | GraduateOrNot | AnnualIncome | FamilyMembers | ChronicDiseases | FrequentFlyer | EverTravelledAbroad |
|---|---|---|---|---|---|---|---|---|
| 0 | 26 | Private Sector/Self Employed | Yes | 1400000 | 5 | 0 | No | Yes |
| 1 | 30 | Private Sector/Self Employed | No | 1450000 | 5 | 0 | Yes | Yes |
| 2 | 32 | Government Sector | Yes | 900000 | 4 | 0 | No | No |
| 3 | 26 | Private Sector/Self Employed | Yes | 1400000 | 7 | 0 | No | Yes |
| 4 | 34 | Private Sector/Self Employed | No | 1400000 | 3 | 1 | No | Yes |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1485 | 29 | Private Sector/Self Employed | Yes | 900000 | 5 | 0 | No | No |
| 1486 | 31 | Government Sector | No | 300000 | 3 | 0 | No | No |
| 1487 | 29 | Private Sector/Self Employed | Yes | 1100000 | 5 | 0 | Yes | No |
| 1488 | 30 | Private Sector/Self Employed | No | 550000 | 3 | 0 | Yes | Yes |
| 1489 | 28 | Private Sector/Self Employed | Yes | 800000 | 6 | 0 | No | Yes |
1490 rows × 8 columns
| | ID | TravelInsurance |
|---|---|---|
| 0 | 1008 | 1 |
| 1 | 199 | 1 |
| 2 | 86 | 0 |
| 3 | 560 | 1 |
| 4 | 161 | 1 |
| ... | ... | ... |
| 1485 | 584 | 0 |
| 1486 | 353 | 0 |
| 1487 | 1124 | 0 |
| 1488 | 1356 | 0 |
| 1489 | 1686 | 0 |
1490 rows × 2 columns
0.827598017124831
Out[149]:
| | ID | proba |
|---|---|---|
| 0 | 6 | 0.298050 |
| 1 | 9 | 0.289210 |
| 2 | 20 | 0.174476 |
| 3 | 21 | 0.051719 |
| 4 | 23 | 0.107967 |
| ... | ... | ... |
| 492 | 1964 | 0.174312 |
| 493 | 1970 | 0.254646 |
| 494 | 1973 | 0.948108 |
| 495 | 1976 | 0.239027 |
| 496 | 1982 | 0.976219 |
497 rows × 2 columns
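The concat-then-split trick above keeps the dummy columns aligned across train, validation, and test. A leaner alternative (a sketch, not the method used above) is to build dummies per split and reindex against the training columns; note that this silently drops categories unseen in training:
import pandas as pd
train = pd.DataFrame({'color': ['red', 'blue', 'red']})
test = pd.DataFrame({'color': ['blue', 'green']})  # 'green' never appears in train
train_d = pd.get_dummies(train)
test_d = pd.get_dummies(test).reindex(columns=train_d.columns, fill_value=0)
print(test_d)  # columns match train_d: 'color_green' is dropped, 'color_red' filled with 0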
Part 3
Grade each user by the difference (투약후 − 투약전): 'A등급' if the difference is below -3, 'B등급' if between -3 and 0, otherwise 'C등급'. Count how many users fall into grade A ('A등급').
In [156]:
import pandas as pd
df = pd.read_csv('https://raw.githubusercontent.com/Datamanim/datarepo/main/krdatacertificate/e3_p3_1.csv')
display(df.head())
df['차이등급'] = df['투약후'] - df['투약전']    # difference: post-dose minus pre-dose
# grade the difference; note the lambda sends both x == -3 and x >= 0 to 'C등급'
df['차이등급'] = df['차이등급'].map(lambda x : 'A등급' if x<-3 else 'B등급' if -3<x<0 else 'C등급')
target = df.loc[df['차이등급']=='A등급']
print(len(target))
| | ID | 투약전 | 투약후 |
|---|---|---|---|
| 0 | user_1 | 55.137 | 56.106 |
| 1 | user_2 | 66.584 | 60.409 |
| 2 | user_3 | 52.259 | 50.639 |
| 3 | user_4 | 77.081 | 69.164 |
| 4 | user_5 | 62.677 | 57.622 |
121
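The same grading can be written without the chained ternary using np.select; a hedged equivalent sketch (the default branch reproduces the lambda above, which sends both x == -3 and x >= 0 to 'C등급'):
import numpy as np
diff = df['투약후'] - df['투약전']
grade = np.select([diff < -3, (diff > -3) & (diff < 0)], ['A등급', 'B등급'], default='C등급')
print((grade == 'A등급').sum())  # expected to match the 121 above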
Compute the chi-square test statistic, rounded to three decimal places.
In [163]:
target = df['차이등급'].value_counts().to_frame()   # counts, most frequent first (in pandas >= 2.0 the column is named 'count', not '차이등급')
# expected frequencies under H0 (50% / 25% / 25%), listed in the same order as value_counts
target['expected'] = [target['차이등급'].sum()*0.5, target['차이등급'].sum()*0.25, target['차이등급'].sum()*0.25]
from scipy.stats import chisquare
s,p = chisquare(target['차이등급'], target['expected'])
print(round(s,3))
3.613
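As a cross-check, the goodness-of-fit statistic is just the sum of (O − E)²/E; computing it by hand from the observed and expected counts built above:
obs = target['차이등급']    # observed counts
exp = target['expected']
print(round((((obs - exp) ** 2) / exp).sum(), 3))  # matches the chisquare() statistic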
Compute the chi-square test p-value rounded to three decimal places, and at the 0.05 significance level state which hypothesis is supported (null/alternative).
In [165]:
print(round(p,3))
print('p-value is greater than 0.05, so we fail to reject the null hypothesis')
0.164
p-value is greater than 0.05, so we fail to reject the null hypothesis
This data records the rpm values of machines produced at factories A and B. We want to use a paired-sample t-test to determine whether factory B's products can be said to have higher rpm than factory A's.
In [168]:
import pandas as pd
df = pd.read_csv('https://raw.githubusercontent.com/Datamanim/datarepo/main/krdatacertificate/e3_p3_2_.csv')
df
Out[168]:
| | rpm | group |
|---|---|---|
| 0 | 163.6 | A |
| 1 | 246.9 | A |
| 2 | 234.9 | A |
| 3 | 156.9 | A |
| 4 | 226.6 | A |
| ... | ... | ... |
| 195 | 239.9 | B |
| 196 | 276.4 | B |
| 197 | 260.9 | B |
| 198 | 197.2 | B |
| 199 | 233.3 | B |
200 rows × 2 columns
Check whether each of factories A and B is normally distributed using the Shapiro-Wilk test (print each factory's p-value).
In [173]:
from scipy.stats import shapiro
a = df.loc[df['group']=='A']['rpm']
b = df.loc[df['group']=='B']['rpm']
print('a_pvalue',shapiro(a)[1])
print('b_pvalue',shapiro(b)[1])
a_pvalue 0.397915244102478
b_pvalue 0.9562830924987793
Both p-values exceed 0.05, so normality is not rejected for either factory. Next, check whether the two samples have equal variances with Levene's test.
In [174]:
from scipy.stats import levene
levene(a,b)
Out[174]:
LeveneResult(statistic=0.014634456943777612, pvalue=0.9038351280607142)
The Levene p-value (about 0.904) is above 0.05, so equal variances are not rejected. Now use a paired-sample t-test to test whether factory B's rpm can be said to be greater than factory A's. Print the p-value to three decimal places and state which of the null and alternative hypotheses is supported.
In [179]:
from scipy.stats import ttest_rel
s, p = ttest_rel(b,a,alternative='greater')   # one-sided: H1 is mean(b - a) > 0
if round(p,3) >= 0.05:
    print(p,'fail to reject the null hypothesis')
elif round(p,3) < 0.05:
    print(p,'reject the null in favor of the alternative')
0.008927208346711192 reject the null in favor of the alternative
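An aside, not what the exam asks for: rpm readings from two different factories are arguably independent samples rather than paired measurements, so an independent two-sample t-test would be the textbook choice; a sketch, using equal_var=True consistent with the Levene result above (the alternative= argument needs scipy >= 1.6):
from scipy.stats import ttest_ind
s2, p2 = ttest_ind(b, a, equal_var=True, alternative='greater')
print(round(p2, 3))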