Big Data Analysis Engineer (빅데이터분석기사) Exam Prep
Practical Exam Type 1 Preparation
세용용용용
2023. 4. 27. 20:32
In [6]:
import pandas as pd
df = pd.read_csv('https://raw.githubusercontent.com/Datamanim/pandas/main/lol.csv', sep='\t')
df
Out[6]:
| gameId | creationTime | gameDuration | seasonId | winner | firstBlood | firstTower | firstInhibitor | firstBaron | firstDragon | ... | t2_towerKills | t2_inhibitorKills | t2_baronKills | t2_dragonKills | t2_riftHeraldKills | t2_ban1 | t2_ban2 | t2_ban3 | t2_ban4 | t2_ban5 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3326086514 | 1504279457970 | 1949 | 9 | 1 | 2 | 1 | 1 | 1 | 1 | ... | 5 | 0 | 0 | 1 | 1 | 114 | 67 | 43 | 16 | 51 |
| 1 | 3229566029 | 1497848803862 | 1851 | 9 | 1 | 1 | 1 | 1 | 0 | 1 | ... | 2 | 0 | 0 | 0 | 0 | 11 | 67 | 238 | 51 | 420 |
| 2 | 3327363504 | 1504360103310 | 1493 | 9 | 1 | 2 | 1 | 1 | 1 | 2 | ... | 2 | 0 | 0 | 1 | 0 | 157 | 238 | 121 | 57 | 28 |
| 3 | 3326856598 | 1504348503996 | 1758 | 9 | 1 | 1 | 1 | 1 | 1 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 164 | 18 | 141 | 40 | 51 |
| 4 | 3330080762 | 1504554410899 | 2094 | 9 | 1 | 2 | 1 | 1 | 1 | 1 | ... | 3 | 0 | 0 | 1 | 0 | 86 | 11 | 201 | 122 | 18 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 51485 | 3308904636 | 1503076540231 | 1944 | 9 | 2 | 1 | 2 | 2 | 0 | 2 | ... | 10 | 2 | 0 | 4 | 0 | 55 | -1 | 90 | 238 | 157 |
| 51486 | 3215685759 | 1496957179355 | 3304 | 9 | 2 | 1 | 1 | 2 | 2 | 2 | ... | 11 | 7 | 4 | 4 | 1 | 157 | 55 | 119 | 154 | 105 |
| 51487 | 3322765040 | 1504029863961 | 2156 | 9 | 2 | 2 | 2 | 2 | 0 | 1 | ... | 10 | 2 | 0 | 2 | 0 | 113 | 122 | 53 | 11 | 157 |
| 51488 | 3256675373 | 1499562036246 | 1475 | 9 | 2 | 2 | 2 | 2 | 0 | 2 | ... | 11 | 3 | 0 | 1 | 0 | 154 | 39 | 51 | 90 | 114 |
| 51489 | 3317333020 | 1503612754059 | 1445 | 9 | 1 | 1 | 1 | 1 | 1 | 2 | ... | 1 | 0 | 0 | 1 | 0 | 11 | 157 | 141 | 31 | 18 |
51490 rows × 61 columns
In [7]:
type(df)
Out[7]:
pandas.core.frame.DataFrame
Print the first 5 rows of the data
In [20]:
answer = df.head(5)
answer
Out[20]:
| gameId | creationTime | gameDuration | seasonId | winner | firstBlood | firstTower | firstInhibitor | firstBaron | firstDragon | ... | t2_towerKills | t2_inhibitorKills | t2_baronKills | t2_dragonKills | t2_riftHeraldKills | t2_ban1 | t2_ban2 | t2_ban3 | t2_ban4 | t2_ban5 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3326086514 | 1504279457970 | 1949 | 9 | 1 | 2 | 1 | 1 | 1 | 1 | ... | 5 | 0 | 0 | 1 | 1 | 114 | 67 | 43 | 16 | 51 |
| 1 | 3229566029 | 1497848803862 | 1851 | 9 | 1 | 1 | 1 | 1 | 0 | 1 | ... | 2 | 0 | 0 | 0 | 0 | 11 | 67 | 238 | 51 | 420 |
| 2 | 3327363504 | 1504360103310 | 1493 | 9 | 1 | 2 | 1 | 1 | 1 | 2 | ... | 2 | 0 | 0 | 1 | 0 | 157 | 238 | 121 | 57 | 28 |
| 3 | 3326856598 | 1504348503996 | 1758 | 9 | 1 | 1 | 1 | 1 | 1 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 164 | 18 | 141 | 40 | 51 |
| 4 | 3330080762 | 1504554410899 | 2094 | 9 | 1 | 2 | 1 | 1 | 1 | 1 | ... | 3 | 0 | 0 | 1 | 0 | 86 | 11 | 201 | 122 | 18 |
5 rows × 61 columns
Find the number of rows and columns in the data
In [21]:
print(df.shape)
print('rows:', df.shape[0])
print('cols:', df.shape[1])
(51490, 61)
rows: 51490
cols: 61
Print all of the columns
In [22]:
answer = df.columns
answer
Out[22]:
Index(['gameId', 'creationTime', 'gameDuration', 'seasonId', 'winner',
'firstBlood', 'firstTower', 'firstInhibitor', 'firstBaron',
'firstDragon', 'firstRiftHerald', 't1_champ1id', 't1_champ1_sum1',
't1_champ1_sum2', 't1_champ2id', 't1_champ2_sum1', 't1_champ2_sum2',
't1_champ3id', 't1_champ3_sum1', 't1_champ3_sum2', 't1_champ4id',
't1_champ4_sum1', 't1_champ4_sum2', 't1_champ5id', 't1_champ5_sum1',
't1_champ5_sum2', 't1_towerKills', 't1_inhibitorKills', 't1_baronKills',
't1_dragonKills', 't1_riftHeraldKills', 't1_ban1', 't1_ban2', 't1_ban3',
't1_ban4', 't1_ban5', 't2_champ1id', 't2_champ1_sum1', 't2_champ1_sum2',
't2_champ2id', 't2_champ2_sum1', 't2_champ2_sum2', 't2_champ3id',
't2_champ3_sum1', 't2_champ3_sum2', 't2_champ4id', 't2_champ4_sum1',
't2_champ4_sum2', 't2_champ5id', 't2_champ5_sum1', 't2_champ5_sum2',
't2_towerKills', 't2_inhibitorKills', 't2_baronKills', 't2_dragonKills',
't2_riftHeraldKills', 't2_ban1', 't2_ban2', 't2_ban3', 't2_ban4',
't2_ban5'],
dtype='object')
Print the name of the 6th column
In [25]:
answer = df.columns[5]
answer
Out[25]:
'firstBlood'
Check the data type of the 6th column
In [34]:
answer = df.iloc[:,5].dtype
answer
Out[34]:
dtype('int64')
How is the dataset's index structured?
In [35]:
answer = df.index
answer
Out[35]:
RangeIndex(start=0, stop=51490, step=1)
What is the 3rd value of the 6th column?
In [38]:
answer = df.iloc[2,5]
answer
Out[38]:
2
Load the data. The column names are in Korean, so the encoding needs appropriate handling.
https://raw.githubusercontent.com/Datamanim/pandas/main/Jeju.csv
In [40]:
import pandas as pd
df = pd.read_csv('https://raw.githubusercontent.com/Datamanim/pandas/main/Jeju.csv',encoding='euc-kr')
df
Out[40]:
| id | 일자 | 시도명 | 읍면동명 | 거주인구 | 근무인구 | 방문인구 | 총 유동인구 | 평균 속도 | 평균 소요 시간 | 평균 기온 | 일강수량 | 평균 풍속 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 22448 | 2018-01-01 | 제주시 | 도두동 | 32249.987 | 3418.266 | 102709.092 | 138377.345 | 39.556 | 29.167 | 5.0 | 0.0 | 2.50 |
| 1 | 22449 | 2018-01-01 | 제주시 | 외도동 | 213500.997 | 10341.172 | 112692.789 | 336534.958 | 32.900 | 30.900 | 5.0 | 0.0 | 2.50 |
| 2 | 22450 | 2018-01-01 | 제주시 | 이도2동 | 1212382.218 | 96920.834 | 541194.481 | 1850497.533 | 29.538 | 35.692 | 2.9 | 0.0 | 2.40 |
| 3 | 22451 | 2018-01-01 | 제주시 | 일도1동 | 33991.653 | 6034.253 | 72155.919 | 112181.825 | 30.000 | 23.500 | 2.9 | 0.0 | 2.40 |
| 4 | 22452 | 2018-01-01 | 서귀포시 | 대천동 | 155036.925 | 9403.969 | 150882.409 | 315323.303 | 41.583 | 14.375 | 5.1 | 0.0 | 2.30 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 9616 | 32064 | 2020-04-30 | 서귀포시 | 표선면 | 228260.005 | 16891.732 | 152832.449 | 397984.186 | 44.514 | 34.054 | 18.1 | 0.0 | 4.70 |
| 9617 | 32065 | 2020-04-30 | 제주시 | 한림읍 | 459959.064 | 26007.122 | 249700.419 | 735666.605 | 48.609 | 61.377 | 18.0 | 0.0 | 4.35 |
| 9618 | 32066 | 2020-04-30 | 제주시 | 도두동 | 28397.481 | 3144.895 | 84052.697 | 115595.073 | 41.053 | 29.421 | 20.3 | 0.0 | 3.00 |
| 9619 | 32067 | 2020-04-30 | 서귀포시 | 안덕면 | 348037.846 | 29106.286 | 251129.660 | 628273.792 | 46.595 | 49.189 | 17.6 | 0.0 | 3.50 |
| 9620 | 32068 | 2020-04-30 | 제주시 | 연동 | 1010643.372 | 65673.477 | 447622.068 | 1523938.917 | 40.863 | 27.765 | 14.1 | 0.0 | 4.80 |
9621 rows × 13 columns
Print the last 3 rows of the data
In [41]:
answer = df.tail(3)
answer
Out[41]:
| id | 일자 | 시도명 | 읍면동명 | 거주인구 | 근무인구 | 방문인구 | 총 유동인구 | 평균 속도 | 평균 소요 시간 | 평균 기온 | 일강수량 | 평균 풍속 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 9618 | 32066 | 2020-04-30 | 제주시 | 도두동 | 28397.481 | 3144.895 | 84052.697 | 115595.073 | 41.053 | 29.421 | 20.3 | 0.0 | 3.0 |
| 9619 | 32067 | 2020-04-30 | 서귀포시 | 안덕면 | 348037.846 | 29106.286 | 251129.660 | 628273.792 | 46.595 | 49.189 | 17.6 | 0.0 | 3.5 |
| 9620 | 32068 | 2020-04-30 | 제주시 | 연동 | 1010643.372 | 65673.477 | 447622.068 | 1523938.917 | 40.863 | 27.765 | 14.1 | 0.0 | 4.8 |
Print the columns holding numeric variables
In [45]:
answer = df.select_dtypes(include=[int,float]).columns
answer
Out[45]:
Index(['id', '거주인구', '근무인구', '방문인구', '총 유동인구', '평균 속도', '평균 소요 시간', '평균 기온',
'일강수량', '평균 풍속'],
dtype='object')
Print the columns holding categorical variables
In [46]:
answer = df.select_dtypes(include=object).columns
answer
Out[46]:
Index(['일자', '시도명', '읍면동명'], dtype='object')
Count the missing values in each column
In [50]:
answer = df.isnull().sum()
answer
Out[50]:
id 0
일자 0
시도명 0
읍면동명 0
거주인구 0
근무인구 0
방문인구 0
총 유동인구 0
평균 속도 0
평균 소요 시간 0
평균 기온 0
일강수량 0
평균 풍속 0
dtype: int64
Check each column's non-null count and data type at once
In [51]:
answer = df.info()
answer
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9621 entries, 0 to 9620
Data columns (total 13 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 id 9621 non-null int64
1 일자 9621 non-null object
2 시도명 9621 non-null object
3 읍면동명 9621 non-null object
4 거주인구 9621 non-null float64
5 근무인구 9621 non-null float64
6 방문인구 9621 non-null float64
7 총 유동인구 9621 non-null float64
8 평균 속도 9621 non-null float64
9 평균 소요 시간 9621 non-null float64
10 평균 기온 9621 non-null float64
11 일강수량 9621 non-null float64
12 평균 풍속 9621 non-null float64
dtypes: float64(9), int64(1), object(3)
memory usage: 977.3+ KB
Check the distribution of each numeric variable (quartiles, mean, standard deviation, max, min)
In [54]:
answer = df.describe()
answer
Out[54]:
| id | 거주인구 | 근무인구 | 방문인구 | 총 유동인구 | 평균 속도 | 평균 소요 시간 | 평균 기온 | 일강수량 | 평균 풍속 | |
|---|---|---|---|---|---|---|---|---|---|---|
| count | 9621.000000 | 9.621000e+03 | 9621.000000 | 9621.000000 | 9.621000e+03 | 9621.000000 | 9621.000000 | 9621.000000 | 9621.000000 | 9621.000000 |
| mean | 27258.000000 | 3.174315e+05 | 35471.201510 | 195889.561802 | 5.487922e+05 | 41.109084 | 37.215873 | 13.550828 | 6.972426 | 2.753171 |
| std | 2777.487804 | 2.982079e+05 | 40381.214775 | 140706.090325 | 4.608802e+05 | 8.758631 | 12.993786 | 7.745515 | 27.617260 | 1.498538 |
| min | 22448.000000 | 9.305552e+03 | 1407.936000 | 11538.322000 | 2.225181e+04 | 24.333000 | 12.667000 | -9.600000 | 0.000000 | 0.000000 |
| 25% | 24853.000000 | 9.539939e+04 | 12074.498000 | 99632.153000 | 2.216910e+05 | 34.250000 | 27.889000 | 7.600000 | 0.000000 | 1.700000 |
| 50% | 27258.000000 | 2.221105e+05 | 21960.928000 | 152805.335000 | 3.866935e+05 | 39.640000 | 34.500000 | 13.400000 | 0.000000 | 2.400000 |
| 75% | 29663.000000 | 4.106671e+05 | 40192.032000 | 236325.109000 | 6.406918e+05 | 49.105000 | 46.176000 | 19.700000 | 1.500000 | 3.400000 |
| max | 32068.000000 | 1.364504e+06 | 263476.965000 | 723459.209000 | 2.066484e+06 | 103.000000 | 172.200000 | 30.400000 | 587.500000 | 13.333000 |
Print the values of the 거주인구 column
In [58]:
answer = df['거주인구']
answer
Out[58]:
0 32249.987
1 213500.997
2 1212382.218
3 33991.653
4 155036.925
...
9616 228260.005
9617 459959.064
9618 28397.481
9619 348037.846
9620 1010643.372
Name: 거주인구, Length: 9621, dtype: float64
Compute the interquartile range (IQR) of the 평균 속도 column
In [65]:
# IQR = third quartile - first quartile
answer = df['평균 속도'].quantile(0.75) - df['평균 속도'].quantile(0.25)
answer
Out[65]:
14.854999999999997
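As a cross-check, quantile accepts a list, so both quartiles can be fetched in one call (a minimal sketch):
q1, q3 = df['평균 속도'].quantile([0.25, 0.75])
answer = q3 - q1  # same value as above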
Print the number of unique values in the 읍면동명 column
In [71]:
answer = len(df['읍면동명'].unique())
answer
Out[71]:
41
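Equivalently, nunique counts the distinct values directly (a one-line sketch):
answer = df['읍면동명'].nunique()  # 41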
Print all unique values of the 읍면동명 column
In [72]:
answer = df['읍면동명'].unique()
answer
Out[72]:
array(['도두동', '외도동', '이도2동', '일도1동', '대천동', '서홍동', '한경면', '송산동', '조천읍',
'일도2동', '영천동', '예래동', '대륜동', '삼도1동', '이호동', '건입동', '중앙동', '삼양동',
'삼도2동', '이도1동', '남원읍', '대정읍', '정방동', '효돈동', '아라동', '한림읍', '구좌읍',
'용담1동', '오라동', '화북동', '연동', '표선면', '중문동', '성산읍', '안덕면', '천지동',
'노형동', '동홍동', '용담2동', '봉개동', '애월읍'], dtype=object)
Load the data
In [73]:
import pandas as pd
df = pd.read_csv('https://raw.githubusercontent.com/Datamanim/pandas/main/chipo.csv')
df
Out[73]:
| order_id | quantity | item_name | choice_description | item_price | |
|---|---|---|---|---|---|
| 0 | 1 | 1 | Chips and Fresh Tomato Salsa | NaN | $2.39 |
| 1 | 1 | 1 | Izze | [Clementine] | $3.39 |
| 2 | 1 | 1 | Nantucket Nectar | [Apple] | $3.39 |
| 3 | 1 | 1 | Chips and Tomatillo-Green Chili Salsa | NaN | $2.39 |
| 4 | 2 | 2 | Chicken Bowl | [Tomatillo-Red Chili Salsa (Hot), [Black Beans... | $16.98 |
| ... | ... | ... | ... | ... | ... |
| 4617 | 1833 | 1 | Steak Burrito | [Fresh Tomato Salsa, [Rice, Black Beans, Sour ... | $11.75 |
| 4618 | 1833 | 1 | Steak Burrito | [Fresh Tomato Salsa, [Rice, Sour Cream, Cheese... | $11.75 |
| 4619 | 1834 | 1 | Chicken Salad Bowl | [Fresh Tomato Salsa, [Fajita Vegetables, Pinto... | $11.25 |
| 4620 | 1834 | 1 | Chicken Salad Bowl | [Fresh Tomato Salsa, [Fajita Vegetables, Lettu... | $8.75 |
| 4621 | 1834 | 1 | Chicken Salad Bowl | [Fresh Tomato Salsa, [Fajita Vegetables, Pinto... | $8.75 |
4622 rows × 5 columns
Extract the rows where the quantity column equals 3 and print the first 5 rows
In [80]:
answer = df.loc[df['quantity']==3].head()
answer
Out[80]:
| order_id | quantity | item_name | choice_description | item_price | |
|---|---|---|---|---|---|
| 409 | 178 | 3 | Chicken Bowl | [[Fresh Tomato Salsa (Mild), Tomatillo-Green C... | $32.94 |
| 445 | 193 | 3 | Bowl | [Braised Carnitas, Pinto Beans, [Sour Cream, C... | $22.20 |
| 689 | 284 | 3 | Canned Soft Drink | [Diet Coke] | $3.75 |
| 818 | 338 | 3 | Bottled Water | NaN | $3.27 |
| 850 | 350 | 3 | Canned Soft Drink | [Sprite] | $3.75 |
In [83]:
answer.reset_index(drop=True)
Out[83]:
| order_id | quantity | item_name | choice_description | item_price | |
|---|---|---|---|---|---|
| 0 | 178 | 3 | Chicken Bowl | [[Fresh Tomato Salsa (Mild), Tomatillo-Green C... | $32.94 |
| 1 | 193 | 3 | Bowl | [Braised Carnitas, Pinto Beans, [Sour Cream, C... | $22.20 |
| 2 | 284 | 3 | Canned Soft Drink | [Diet Coke] | $3.75 |
| 3 | 338 | 3 | Bottled Water | NaN | $3.27 |
| 4 | 350 | 3 | Canned Soft Drink | [Sprite] | $3.75 |
Extract the rows where quantity equals 3, reset the index to start at 0, and print the first 5 rows
In [86]:
answer = df[df['quantity']==3].head(5).reset_index(drop=True)
answer
Out[86]:
| order_id | quantity | item_name | choice_description | item_price | |
|---|---|---|---|---|---|
| 0 | 178 | 3 | Chicken Bowl | [[Fresh Tomato Salsa (Mild), Tomatillo-Green C... | $32.94 |
| 1 | 193 | 3 | Bowl | [Braised Carnitas, Pinto Beans, [Sour Cream, C... | $22.20 |
| 2 | 284 | 3 | Canned Soft Drink | [Diet Coke] | $3.75 |
| 3 | 338 | 3 | Bottled Water | NaN | $3.27 |
| 4 | 350 | 3 | Canned Soft Drink | [Sprite] | $3.75 |
Define a new DataFrame made up of the two columns quantity and item_price
In [87]:
answer = df[['quantity','item_price']]
answer
Out[87]:
| quantity | item_price | |
|---|---|---|
| 0 | 1 | $2.39 |
| 1 | 1 | $3.39 |
| 2 | 1 | $3.39 |
| 3 | 1 | $2.39 |
| 4 | 2 | $16.98 |
| ... | ... | ... |
| 4617 | 1 | $11.75 |
| 4618 | 1 | $11.75 |
| 4619 | 1 | $11.25 |
| 4620 | 1 | $8.75 |
| 4621 | 1 | $8.75 |
4622 rows × 2 columns
Strip the dollar sign from the item_price column, convert it to float, and store the result in a new_price column
In [99]:
df['new_price'] = df['item_price'].str.replace('$','').astype('float')
answer = df['new_price'].head()
answer
C:\Users\Public\Documents\ESTsoft\CreatorTemp/ipykernel_80432/1472329730.py:1: FutureWarning: The default value of regex will change from True to False in a future version. In addition, single character regular expressions will *not* be treated as literal strings when regex=True.
df['new_price'] = df['item_price'].str.replace('$','').astype('float')
Out[99]:
0 2.39
1 3.39
2 3.39
3 2.39
4 16.98
Name: new_price, dtype: float64
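The FutureWarning above appears because '$' is a regex metacharacter; passing regex=False makes str.replace treat it as a literal string and silences the warning (a sketch):
df['new_price'] = df['item_price'].str.replace('$', '', regex=False).astype(float)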
Extract the rows where item_name is Chicken Salad Bowl and reset the index
In [103]:
answer = df[df['item_name']=='Chicken Salad Bowl'].reset_index(drop=True)
answer.head(3)
Out[103]:
| order_id | quantity | item_name | choice_description | item_price | new_price | |
|---|---|---|---|---|---|---|
| 0 | 20 | 1 | Chicken Salad Bowl | [Fresh Tomato Salsa, [Fajita Vegetables, Pinto... | $8.75 | 8.75 |
| 1 | 60 | 2 | Chicken Salad Bowl | [Tomatillo Green Chili Salsa, [Sour Cream, Che... | $22.50 | 22.50 |
| 2 | 94 | 2 | Chicken Salad Bowl | [Fresh Tomato Salsa, [Fajita Vegetables, Pinto... | $22.50 | 22.50 |
Extract the rows where new_price is at most 9 and item_name is Chicken Salad Bowl
In [111]:
answer = df.loc[(df['new_price']<=9) &(df['item_name']=='Chicken Salad Bowl')]
answer.head(5)
Out[111]:
| order_id | quantity | item_name | choice_description | item_price | new_price | |
|---|---|---|---|---|---|---|
| 44 | 20 | 1 | Chicken Salad Bowl | [Fresh Tomato Salsa, [Fajita Vegetables, Pinto... | $8.75 | 8.75 |
| 256 | 111 | 1 | Chicken Salad Bowl | [Fresh Tomato Salsa, [Fajita Vegetables, Rice,... | $8.75 | 8.75 |
| 526 | 220 | 1 | Chicken Salad Bowl | [Roasted Chili Corn Salsa, [Black Beans, Sour ... | $8.75 | 8.75 |
| 528 | 221 | 1 | Chicken Salad Bowl | [Tomatillo Green Chili Salsa, [Fajita Vegetabl... | $8.75 | 8.75 |
| 529 | 221 | 1 | Chicken Salad Bowl | [Tomatillo Green Chili Salsa, [Fajita Vegetabl... | $8.75 | 8.75 |
Sort df by new_price in ascending order and reset the index
In [125]:
answer = df.sort_values(by='new_price', ascending=True).reset_index(drop=True)
answer.head(4)
Out[125]:
| order_id | quantity | item_name | choice_description | item_price | new_price | |
|---|---|---|---|---|---|---|
| 0 | 471 | 1 | Bottled Water | NaN | $1.09 | 1.09 |
| 1 | 338 | 1 | Canned Soda | [Coca Cola] | $1.09 | 1.09 |
| 2 | 1575 | 1 | Canned Soda | [Dr. Pepper] | $1.09 | 1.09 |
| 3 | 47 | 1 | Canned Soda | [Dr. Pepper] | $1.09 | 1.09 |
Print the rows of df whose item_name contains Chips
In [129]:
answer = df.loc[df['item_name'].str.contains('Chips')]
answer.head(5)
Out[129]:
| order_id | quantity | item_name | choice_description | item_price | new_price | |
|---|---|---|---|---|---|---|
| 0 | 1 | 1 | Chips and Fresh Tomato Salsa | NaN | $2.39 | 2.39 |
| 3 | 1 | 1 | Chips and Tomatillo-Green Chili Salsa | NaN | $2.39 | 2.39 |
| 6 | 3 | 1 | Side of Chips | NaN | $1.69 | 1.69 |
| 10 | 5 | 1 | Chips and Guacamole | NaN | $4.45 | 4.45 |
| 14 | 7 | 1 | Chips and Guacamole | NaN | $4.45 | 4.45 |
Output a DataFrame containing only the even-numbered columns of df
In [133]:
answer = df.loc[:,::2]
answer.head(5)
Out[133]:
| order_id | item_name | item_price | |
|---|---|---|---|
| 0 | 1 | Chips and Fresh Tomato Salsa | $2.39 |
| 1 | 1 | Izze | $3.39 |
| 2 | 1 | Nantucket Nectar | $3.39 |
| 3 | 1 | Chips and Tomatillo-Green Chili Salsa | $2.39 |
| 4 | 2 | Chicken Bowl | $16.98 |
Sort df by new_price in descending order and reset the index
In [140]:
answer = df.sort_values(by='new_price',ascending=False).reset_index(drop=True)
answer.head(4)
Out[140]:
| order_id | quantity | item_name | choice_description | item_price | new_price | |
|---|---|---|---|---|---|---|
| 0 | 1443 | 15 | Chips and Fresh Tomato Salsa | NaN | $44.25 | 44.25 |
| 1 | 1398 | 3 | Carnitas Bowl | [Roasted Chili Corn Salsa, [Fajita Vegetables,... | $35.25 | 35.25 |
| 2 | 511 | 4 | Chicken Burrito | [Fresh Tomato Salsa, [Fajita Vegetables, Rice,... | $35.00 | 35.00 |
| 3 | 1443 | 4 | Chicken Burrito | [Fresh Tomato Salsa, [Rice, Black Beans, Chees... | $35.00 | 35.00 |
Index the rows of df whose item_name is Steak Salad or Bowl
In [142]:
answer=df.loc[(df['item_name']=='Steak Salad') | (df['item_name']=='Bowl')]
answer
Out[142]:
| order_id | quantity | item_name | choice_description | item_price | new_price | |
|---|---|---|---|---|---|---|
| 445 | 193 | 3 | Bowl | [Braised Carnitas, Pinto Beans, [Sour Cream, C... | $22.20 | 22.20 |
| 664 | 276 | 1 | Steak Salad | [Tomatillo-Red Chili Salsa (Hot), [Black Beans... | $8.99 | 8.99 |
| 673 | 279 | 1 | Bowl | [Adobo-Marinated and Grilled Steak, [Sour Crea... | $7.40 | 7.40 |
| 752 | 311 | 1 | Steak Salad | [Tomatillo-Red Chili Salsa (Hot), [Black Beans... | $8.99 | 8.99 |
| 893 | 369 | 1 | Steak Salad | [Fresh Tomato Salsa (Mild), [Rice, Cheese, Sou... | $8.99 | 8.99 |
| 3502 | 1406 | 1 | Steak Salad | [[Lettuce, Fajita Veggies]] | $8.69 | 8.69 |
In [147]:
answer = df.loc[(df['item_name']=='Steak Salad')|(df['item_name']=='Bowl')].drop_duplicates('item_name')
answer
Out[147]:
| order_id | quantity | item_name | choice_description | item_price | new_price | |
|---|---|---|---|---|---|---|
| 445 | 193 | 3 | Bowl | [Braised Carnitas, Pinto Beans, [Sour Cream, C... | $22.20 | 22.20 |
| 664 | 276 | 1 | Steak Salad | [Tomatillo-Red Chili Salsa (Hot), [Black Beans... | $8.99 | 8.99 |
In [149]:
answer = df.loc[(df['item_name']=='Steak Salad') | (df['item_name']=='Bowl')].drop_duplicates('item_name',keep='last')
answer
Out[149]:
| order_id | quantity | item_name | choice_description | item_price | new_price | |
|---|---|---|---|---|---|---|
| 673 | 279 | 1 | Bowl | [Adobo-Marinated and Grilled Steak, [Sour Crea... | $7.40 | 7.40 |
| 3502 | 1406 | 1 | Steak Salad | [[Lettuce, Fajita Veggies]] | $8.69 | 8.69 |
In [152]:
answer = df.loc[df['new_price']>=df['new_price'].mean()]
answer
Out[152]:
| order_id | quantity | item_name | choice_description | item_price | new_price | |
|---|---|---|---|---|---|---|
| 4 | 2 | 2 | Chicken Bowl | [Tomatillo-Red Chili Salsa (Hot), [Black Beans... | $16.98 | 16.98 |
| 5 | 3 | 1 | Chicken Bowl | [Fresh Tomato Salsa (Mild), [Rice, Cheese, Sou... | $10.98 | 10.98 |
| 7 | 4 | 1 | Steak Burrito | [Tomatillo Red Chili Salsa, [Fajita Vegetables... | $11.75 | 11.75 |
| 8 | 4 | 1 | Steak Soft Tacos | [Tomatillo Green Chili Salsa, [Pinto Beans, Ch... | $9.25 | 9.25 |
| 9 | 5 | 1 | Steak Burrito | [Fresh Tomato Salsa, [Rice, Black Beans, Pinto... | $9.25 | 9.25 |
| ... | ... | ... | ... | ... | ... | ... |
| 4617 | 1833 | 1 | Steak Burrito | [Fresh Tomato Salsa, [Rice, Black Beans, Sour ... | $11.75 | 11.75 |
| 4618 | 1833 | 1 | Steak Burrito | [Fresh Tomato Salsa, [Rice, Sour Cream, Cheese... | $11.75 | 11.75 |
| 4619 | 1834 | 1 | Chicken Salad Bowl | [Fresh Tomato Salsa, [Fajita Vegetables, Pinto... | $11.25 | 11.25 |
| 4620 | 1834 | 1 | Chicken Salad Bowl | [Fresh Tomato Salsa, [Fajita Vegetables, Lettu... | $8.75 | 8.75 |
| 4621 | 1834 | 1 | Chicken Salad Bowl | [Fresh Tomato Salsa, [Fajita Vegetables, Pinto... | $8.75 | 8.75 |
2890 rows × 6 columns
36. In df, change the item_name value Izze to Fizzy Lizzy
In [159]:
df.loc[df['item_name']=='Izze','item_name'] = 'Fizzy Lizzy'
df.head()
Out[159]:
| order_id | quantity | item_name | choice_description | item_price | new_price | |
|---|---|---|---|---|---|---|
| 0 | 1 | 1 | Chips and Fresh Tomato Salsa | NaN | $2.39 | 2.39 |
| 1 | 1 | 1 | Fizzy Lizzy | [Clementine] | $3.39 | 3.39 |
| 2 | 1 | 1 | Nantucket Nectar | [Apple] | $3.39 | 3.39 |
| 3 | 1 | 1 | Chips and Tomatillo-Green Chili Salsa | NaN | $2.39 | 2.39 |
| 4 | 2 | 2 | Chicken Bowl | [Tomatillo-Red Chili Salsa (Hot), [Black Beans... | $16.98 | 16.98 |
37. Count the rows of df whose choice_description is NaN
In [162]:
answer = df['choice_description'].isnull().sum()
answer
Out[162]:
1246
38. Replace the NaN values of choice_description with NoData (use loc)
In [166]:
df.loc[df['choice_description'].isnull(),'choice_description']='NoData'
df.head()
Out[166]:
| order_id | quantity | item_name | choice_description | item_price | new_price | |
|---|---|---|---|---|---|---|
| 0 | 1 | 1 | Chips and Fresh Tomato Salsa | NoData | $2.39 | 2.39 |
| 1 | 1 | 1 | Fizzy Lizzy | [Clementine] | $3.39 | 3.39 |
| 2 | 1 | 1 | Nantucket Nectar | [Apple] | $3.39 | 3.39 |
| 3 | 1 | 1 | Chips and Tomatillo-Green Chili Salsa | NoData | $2.39 | 2.39 |
| 4 | 2 | 2 | Chicken Bowl | [Tomatillo-Red Chili Salsa (Hot), [Black Beans... | $16.98 | 16.98 |
39. Index the rows of df whose choice_description contains Black
In [168]:
answer = df.loc[df['choice_description'].str.contains('Black')]
answer.head(5)
Out[168]:
| order_id | quantity | item_name | choice_description | item_price | new_price | |
|---|---|---|---|---|---|---|
| 4 | 2 | 2 | Chicken Bowl | [Tomatillo-Red Chili Salsa (Hot), [Black Beans... | $16.98 | 16.98 |
| 7 | 4 | 1 | Steak Burrito | [Tomatillo Red Chili Salsa, [Fajita Vegetables... | $11.75 | 11.75 |
| 9 | 5 | 1 | Steak Burrito | [Fresh Tomato Salsa, [Rice, Black Beans, Pinto... | $9.25 | 9.25 |
| 11 | 6 | 1 | Chicken Crispy Tacos | [Roasted Chili Corn Salsa, [Fajita Vegetables,... | $8.75 | 8.75 |
| 12 | 6 | 1 | Chicken Soft Tacos | [Roasted Chili Corn Salsa, [Rice, Black Beans,... | $8.75 | 8.75 |
40. Count the rows of df whose choice_description does not contain Vegetables
In [171]:
answer = len(df.loc[~df['choice_description'].str.contains('Vegetables')])
answer
Out[171]:
3900
In [172]:
answer = df.loc[df['item_name'].str.startswith('N')]
answer.head(3)
Out[172]:
| order_id | quantity | item_name | choice_description | item_price | new_price | |
|---|---|---|---|---|---|---|
| 2 | 1 | 1 | Nantucket Nectar | [Apple] | $3.39 | 3.39 |
| 22 | 11 | 1 | Nantucket Nectar | [Pomegranate Cherry] | $3.39 | 3.39 |
| 105 | 46 | 1 | Nantucket Nectar | [Pineapple Orange Banana] | $3.39 | 3.39 |
In [173]:
answer = df.loc[df['item_name'].str.len()>=15]
answer.head(3)
Out[173]:
| order_id | quantity | item_name | choice_description | item_price | new_price | |
|---|---|---|---|---|---|---|
| 0 | 1 | 1 | Chips and Fresh Tomato Salsa | NoData | $2.39 | 2.39 |
| 2 | 1 | 1 | Nantucket Nectar | [Apple] | $3.39 | 3.39 |
| 3 | 1 | 1 | Chips and Tomatillo-Green Chili Salsa | NoData | $2.39 | 2.39 |
In [176]:
lst=[1.69, 2.39, 3.39, 4.45, 9.25, 10.98, 11.75, 16.98]
answer = len(df.loc[df['new_price'].isin(lst)])
answer
Out[176]:
1393
44. Load the data and print the first 5 rows
In [2]:
import pandas as pd
df = pd.read_csv('https://raw.githubusercontent.com/Datamanim/pandas/main/AB_NYC_2019.csv')
answer = df.head(5)
answer
Out[2]:
| id | name | host_id | host_name | neighbourhood_group | neighbourhood | latitude | longitude | room_type | price | minimum_nights | number_of_reviews | last_review | reviews_per_month | calculated_host_listings_count | availability_365 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2539 | Clean & quiet apt home by the park | 2787 | John | Brooklyn | Kensington | 40.64749 | -73.97237 | Private room | 149 | 1 | 9 | 2018-10-19 | 0.21 | 6 | 365 |
| 1 | 2595 | Skylit Midtown Castle | 2845 | Jennifer | Manhattan | Midtown | 40.75362 | -73.98377 | Entire home/apt | 225 | 1 | 45 | 2019-05-21 | 0.38 | 2 | 355 |
| 2 | 3647 | THE VILLAGE OF HARLEM....NEW YORK ! | 4632 | Elisabeth | Manhattan | Harlem | 40.80902 | -73.94190 | Private room | 150 | 3 | 0 | NaN | NaN | 1 | 365 |
| 3 | 3831 | Cozy Entire Floor of Brownstone | 4869 | LisaRoxanne | Brooklyn | Clinton Hill | 40.68514 | -73.95976 | Entire home/apt | 89 | 1 | 270 | 2019-07-05 | 4.64 | 1 | 194 |
| 4 | 5022 | Entire Apt: Spacious Studio/Loft by central park | 7192 | Laura | Manhattan | East Harlem | 40.79851 | -73.94399 | Entire home/apt | 80 | 10 | 9 | 2018-11-19 | 0.10 | 1 | 0 |
45. Count the frequency of each host_name, sort by host_name, and print the top 5
In [13]:
answer = df['host_name'].value_counts().sort_index()
answer.head(5)
Out[13]:
'Cil 1
(Ari) HENRY LEE 1
(Email hidden by Airbnb) 6
(Mary) Haiy 1
-TheQueensCornerLot 1
Name: host_name, dtype: int64
46. Count the frequency of each host_name and build a DataFrame sorted by frequency in descending order; name the frequency column counts
In [31]:
df.groupby('host_name').size().to_frame().rename(columns={0:'counts'}).sort_values(by='counts',ascending=False)
Out[31]:
| counts | |
|---|---|
| host_name | |
| Michael | 417 |
| David | 403 |
| Sonder (NYC) | 327 |
| John | 294 |
| Alex | 279 |
| ... | ... |
| Jerbean | 1 |
| Jerald | 1 |
| Jeonghoon | 1 |
| Jeny | 1 |
| 현선 | 1 |
11452 rows × 1 columns
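A shorter equivalent sketch: value_counts already returns frequencies in descending order, so only the column name needs to be set:
df['host_name'].value_counts().to_frame('counts')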
47. Count the neighbourhood values within each neighbourhood_group
In [50]:
df.groupby(['neighbourhood_group','neighbourhood'],as_index=False).size()
Out[50]:
| neighbourhood_group | neighbourhood | size | |
|---|---|---|---|
| 0 | Bronx | Allerton | 42 |
| 1 | Bronx | Baychester | 7 |
| 2 | Bronx | Belmont | 24 |
| 3 | Bronx | Bronxdale | 19 |
| 4 | Bronx | Castle Hill | 9 |
| ... | ... | ... | ... |
| 216 | Staten Island | Tottenville | 7 |
| 217 | Staten Island | West Brighton | 18 |
| 218 | Staten Island | Westerleigh | 2 |
| 219 | Staten Island | Willowbrook | 1 |
| 220 | Staten Island | Woodrow | 1 |
221 rows × 3 columns
48. Among the neighbourhood counts by neighbourhood_group, print the maximum values for each neighbourhood_group
In [57]:
df.groupby(['neighbourhood_group','neighbourhood'],as_index=False).size().groupby('neighbourhood_group',as_index=False).max()
Out[57]:
| neighbourhood_group | neighbourhood | size | |
|---|---|---|---|
| 0 | Bronx | Woodlawn | 70 |
| 1 | Brooklyn | Windsor Terrace | 3920 |
| 2 | Manhattan | West Village | 2658 |
| 3 | Queens | Woodside | 900 |
| 4 | Staten Island | Woodrow | 48 |
49. Compute the mean, variance, max, and min of price by neighbourhood_group
In [60]:
df[['neighbourhood_group','price']].groupby('neighbourhood_group').agg(['mean','var','max','min'])
Out[60]:
| price | ||||
|---|---|---|---|---|
| mean | var | max | min | |
| neighbourhood_group | ||||
| Bronx | 87.496792 | 11386.885081 | 2500 | 0 |
| Brooklyn | 124.383207 | 34921.719135 | 10000 | 0 |
| Manhattan | 196.875814 | 84904.159185 | 10000 | 0 |
| Queens | 99.517649 | 27923.130227 | 10000 | 10 |
| Staten Island | 114.812332 | 77073.088342 | 5000 | 13 |
50. Compute the mean, variance, max, and min of reviews_per_month by neighbourhood_group
In [65]:
df[['neighbourhood_group','reviews_per_month']].groupby('neighbourhood_group').agg(['mean','var','max','min'])
Out[65]:
| reviews_per_month | ||||
|---|---|---|---|---|
| mean | var | max | min | |
| neighbourhood_group | ||||
| Bronx | 1.837831 | 2.799878 | 10.34 | 0.02 |
| Brooklyn | 1.283212 | 2.299040 | 14.00 | 0.01 |
| Manhattan | 1.272131 | 2.651206 | 58.50 | 0.01 |
| Queens | 1.941200 | 4.897848 | 20.94 | 0.01 |
| Staten Island | 1.872580 | 2.840895 | 10.12 | 0.02 |
51. Compute the mean price by neighbourhood and neighbourhood_group
In [5]:
df.groupby(['neighbourhood','neighbourhood_group'],as_index=False).price.mean()
Out[5]:
| neighbourhood | neighbourhood_group | price | |
|---|---|---|---|
| 0 | Allerton | Bronx | 87.595238 |
| 1 | Arden Heights | Staten Island | 67.250000 |
| 2 | Arrochar | Staten Island | 115.000000 |
| 3 | Arverne | Queens | 171.779221 |
| 4 | Astoria | Queens | 117.187778 |
| ... | ... | ... | ... |
| 216 | Windsor Terrace | Brooklyn | 138.993631 |
| 217 | Woodhaven | Queens | 67.170455 |
| 218 | Woodlawn | Bronx | 60.090909 |
| 219 | Woodrow | Staten Island | 700.000000 |
| 220 | Woodside | Queens | 85.097872 |
221 rows × 3 columns
In [7]:
df.groupby(['neighbourhood','neighbourhood_group']).price.mean().unstack()
Out[7]:
| neighbourhood_group | Bronx | Brooklyn | Manhattan | Queens | Staten Island |
|---|---|---|---|---|---|
| neighbourhood | |||||
| Allerton | 87.595238 | NaN | NaN | NaN | NaN |
| Arden Heights | NaN | NaN | NaN | NaN | 67.25 |
| Arrochar | NaN | NaN | NaN | NaN | 115.00 |
| Arverne | NaN | NaN | NaN | 171.779221 | NaN |
| Astoria | NaN | NaN | NaN | 117.187778 | NaN |
| ... | ... | ... | ... | ... | ... |
| Windsor Terrace | NaN | 138.993631 | NaN | NaN | NaN |
| Woodhaven | NaN | NaN | NaN | 67.170455 | NaN |
| Woodlawn | 60.090909 | NaN | NaN | NaN | NaN |
| Woodrow | NaN | NaN | NaN | NaN | 700.00 |
| Woodside | NaN | NaN | NaN | 85.097872 | NaN |
221 rows × 5 columns
In [10]:
df.groupby(['neighbourhood','neighbourhood_group']).price.mean().unstack().fillna('-999')
Out[10]:
| neighbourhood_group | Bronx | Brooklyn | Manhattan | Queens | Staten Island |
|---|---|---|---|---|---|
| neighbourhood | |||||
| Allerton | 87.595238 | -999 | -999 | -999 | -999 |
| Arden Heights | -999 | -999 | -999 | -999 | 67.25 |
| Arrochar | -999 | -999 | -999 | -999 | 115.0 |
| Arverne | -999 | -999 | -999 | 171.779221 | -999 |
| Astoria | -999 | -999 | -999 | 117.187778 | -999 |
| ... | ... | ... | ... | ... | ... |
| Windsor Terrace | -999 | 138.993631 | -999 | -999 | -999 |
| Woodhaven | -999 | -999 | -999 | 67.170455 | -999 |
| Woodlawn | 60.090909 | -999 | -999 | -999 | -999 |
| Woodrow | -999 | -999 | -999 | -999 | 700.0 |
| Woodside | -999 | -999 | -999 | 85.097872 | -999 |
221 rows × 5 columns
54. Among the rows whose neighbourhood_group is Queens, compute the mean, variance, max, and min of price for each neighbourhood
In [24]:
df.loc[df['neighbourhood_group']=='Queens'].groupby('neighbourhood').price.agg(['mean','var','max','min'])
Out[24]:
| mean | var | max | min | |
|---|---|---|---|---|
| neighbourhood | ||||
| Arverne | 171.779221 | 37383.411141 | 1500 | 35 |
| Astoria | 117.187778 | 122428.811196 | 10000 | 25 |
| Bay Terrace | 142.000000 | 6816.400000 | 258 | 32 |
| Bayside | 157.948718 | 166106.470985 | 2600 | 30 |
| Bayswater | 87.470588 | 2330.889706 | 230 | 45 |
| Belle Harbor | 171.500000 | 8226.571429 | 350 | 85 |
| Bellerose | 99.357143 | 3093.016484 | 240 | 42 |
| Breezy Point | 213.333333 | 1008.333333 | 250 | 195 |
| Briarwood | 105.875000 | 18503.165909 | 1000 | 30 |
| Cambria Heights | 81.730769 | 2960.604615 | 250 | 31 |
| College Point | 88.000000 | 6445.555556 | 400 | 44 |
| Corona | 59.171875 | 2430.557292 | 359 | 23 |
| Ditmars Steinway | 95.029126 | 4214.112785 | 600 | 22 |
| Douglaston | 82.750000 | 2347.071429 | 178 | 40 |
| East Elmhurst | 81.183784 | 4369.487779 | 700 | 16 |
| Edgemere | 94.727273 | 4219.418182 | 200 | 40 |
| Elmhurst | 80.459916 | 3896.563005 | 443 | 15 |
| Far Rockaway | 165.862069 | 52575.980296 | 900 | 35 |
| Flushing | 93.514085 | 12417.177448 | 1500 | 15 |
| Forest Hills | 121.625000 | 44533.368881 | 2350 | 16 |
| Fresh Meadows | 99.500000 | 6394.451613 | 375 | 39 |
| Glendale | 90.796296 | 3230.882250 | 299 | 25 |
| Hollis | 88.642857 | 1483.016484 | 175 | 50 |
| Holliswood | 135.750000 | 5091.583333 | 239 | 79 |
| Howard Beach | 115.400000 | 4241.936842 | 250 | 40 |
| Jackson Heights | 80.897849 | 2150.470590 | 260 | 23 |
| Jamaica | 95.770563 | 8067.047130 | 1000 | 10 |
| Jamaica Estates | 182.947368 | 29006.052632 | 750 | 35 |
| Jamaica Hills | 132.125000 | 11304.696429 | 325 | 50 |
| Kew Gardens | 88.375000 | 2632.306452 | 219 | 35 |
| Kew Gardens Hills | 112.307692 | 6230.621538 | 399 | 40 |
| Laurelton | 95.333333 | 4355.529412 | 254 | 34 |
| Little Neck | 75.200000 | 2368.700000 | 149 | 32 |
| Long Island City | 127.465549 | 15529.503009 | 2000 | 30 |
| Maspeth | 83.645455 | 2223.607089 | 245 | 11 |
| Middle Village | 109.580645 | 3537.118280 | 265 | 28 |
| Neponsit | 274.666667 | 5625.333333 | 350 | 200 |
| Ozone Park | 85.274194 | 1705.415389 | 200 | 25 |
| Queens Village | 83.933333 | 2386.470056 | 320 | 25 |
| Rego Park | 83.877358 | 3659.003863 | 300 | 21 |
| Richmond Hill | 87.117021 | 3474.147449 | 300 | 28 |
| Ridgewood | 77.184397 | 2629.311889 | 375 | 20 |
| Rockaway Beach | 132.178571 | 6254.622078 | 545 | 49 |
| Rosedale | 76.694915 | 2963.939801 | 350 | 22 |
| South Ozone Park | 82.400000 | 4619.989744 | 400 | 29 |
| Springfield Gardens | 94.235294 | 2490.848739 | 300 | 40 |
| St. Albans | 100.828947 | 9489.903684 | 600 | 25 |
| Sunnyside | 84.865014 | 2727.746922 | 600 | 12 |
| Whitestone | 107.545455 | 13632.072727 | 400 | 35 |
| Woodhaven | 67.170455 | 1642.464864 | 250 | 10 |
| Woodside | 85.097872 | 5001.900636 | 500 | 28 |
55. Count the room_type values by neighbourhood_group and compute each room_type's share within its neighbourhood_group
In [50]:
answer = df[['neighbourhood_group','room_type']].groupby(['neighbourhood_group','room_type']).size().unstack()
answer.loc[:,:] = (answer.values / answer.sum(axis=1).values.reshape(-1,1))
answer
Out[50]:
| room_type | Entire home/apt | Private room | Shared room |
|---|---|---|---|
| neighbourhood_group | |||
| Bronx | 0.347388 | 0.597617 | 0.054995 |
| Brooklyn | 0.475478 | 0.503979 | 0.020543 |
| Manhattan | 0.609344 | 0.368496 | 0.022160 |
| Queens | 0.369926 | 0.595129 | 0.034945 |
| Staten Island | 0.471850 | 0.504021 | 0.024129 |
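The in-place assignment through .values can be avoided with div, which broadcasts the row sums along the index (a sketch):
counts = df.groupby(['neighbourhood_group', 'room_type']).size().unstack()
answer = counts.div(counts.sum(axis=1), axis=0)  # each row sums to 1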
In [89]:
import pandas as pd
df = pd.read_csv('https://raw.githubusercontent.com/Datamanim/pandas/main/BankChurnersUp.csv',index_col='Unnamed: 0')
df.shape
Out[89]:
(10127, 18)
57. Using the map function, recode the Income_Category categories as follows into a newIncome column
Unknown : N
Less than $40K : a
$40K - $60K : b
$60K - $80K : c
$80K - $120K : d
$120K + : e
In [32]:
dic = {
    'Unknown' : 'N',
    'Less than $40K' : 'a',
    '$40K - $60K' : 'b',
    '$60K - $80K' : 'c',
    '$80K - $120K' : 'd',
    '$120K +' : 'e'
}
df['newIncome'] = df['Income_Category'].map(dic)
df['newIncome']
Out[32]:
0 c
1 a
2 d
3 a
4 c
..
10122 b
10123 b
10124 a
10125 b
10126 a
Name: newIncome, Length: 10127, dtype: object
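replace(dic) would give a similar result; the practical difference is that map returns NaN for categories missing from dic, while replace leaves them unchanged (a sketch):
df['newIncome'] = df['Income_Category'].replace(dic)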
58. Using the apply function, recode the Income_Category categories as follows into a newIncome column
Unknown : N
Less than $40K : a
$40K - $60K : b
$60K - $80K : c
$80K - $120K : d
$120K + : e
In [33]:
def solution(x):
    if x == 'Unknown':
        return 'N'
    elif x == 'Less than $40K':
        return 'a'
    elif x == '$40K - $60K':
        return 'b'
    elif x == '$60K - $80K':
        return 'c'
    elif x == '$80K - $120K':
        return 'd'
    elif x == '$120K +':
        return 'e'
df['newIncome'] = df['Income_Category'].apply(solution)
df['newIncome']
Out[33]:
0 c
1 a
2 d
3 a
4 c
..
10122 b
10123 b
10124 a
10125 b
10126 a
Name: newIncome, Length: 10127, dtype: object
59. Using Customer_Age, define age bins in an AgeState column.
0~9 : 0
10~19 : 10
20~29 : 20 ...
Print the frequency of each bin.
In [34]:
def solution(age):
    if 0 <= age <= 9:
        return 0
    elif 10 <= age <= 19:
        return 10
    elif 20 <= age <= 29:
        return 20
    elif 30 <= age <= 39:
        return 30
    elif 40 <= age <= 49:
        return 40
    elif 50 <= age <= 59:
        return 50
    elif 60 <= age <= 69:
        return 60
    elif 70 <= age <= 79:
        return 70
df['AgeState'] = df['Customer_Age'].apply(solution)
df['AgeState'].value_counts().sort_index()
Out[34]:
20 195
30 1841
40 4561
50 2998
60 530
70 2
Name: AgeState, dtype: int64
In [35]:
df['AgeState'] = df['Customer_Age'].map(lambda x: (x//10)*10)
df['AgeState'].value_counts().sort_index()
Out[35]:
20 195
30 1841
40 4561
50 2998
60 530
70 2
Name: AgeState, dtype: int64
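pd.cut expresses the same binning declaratively (a sketch; it assumes every age is below 80, since values outside the bins become NaN, and unused bins appear with a count of 0):
df['AgeState'] = pd.cut(df['Customer_Age'], bins=list(range(0, 90, 10)),
                        right=False, labels=list(range(0, 80, 10)))
df['AgeState'].value_counts().sort_index()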
60. Define a newEduLevel column that is 1 when Education_Level contains the word Graduate and 0 otherwise, then print its frequencies
In [36]:
def solution(x):  # parameter renamed from 'str', which would shadow the built-in
    if 'Graduate' in x:
        return 1
    else:
        return 0
df['newEduLevel'] = df['Education_Level'].apply(solution)
df['newEduLevel'].value_counts()
Out[36]:
0 6483
1 3644
Name: newEduLevel, dtype: int64
In [37]:
df['newEduLevel'] = df['Education_Level'].map(lambda x : 1 if 'Graduate' in x else 0)
df['newEduLevel'].value_counts()
Out[37]:
0 6483
1 3644
Name: newEduLevel, dtype: int64
61. Define newLimit as 1 when Credit_Limit is 4500 or more and 0 otherwise. Print the frequency of each newLimit value
In [38]:
df['newLimit'] = df['Credit_Limit'].map(lambda x: 1 if x>=4500 else 0)
df['newLimit'].value_counts()
Out[38]:
1 5096
0 5031
Name: newLimit, dtype: int64
In [39]:
def solution(num):
    if num >= 4500:
        return 1
    else:
        return 0
df['newLimit'] = df['Credit_Limit'].apply(solution)
df['newLimit'].value_counts()
Out[39]:
1 5096
0 5031
Name: newLimit, dtype: int64
62. Define a newState column that is 1 when Marital_Status is Married and Card_Category is Platinum, and 0 otherwise. Print the frequency of each newState value
In [40]:
def solution(x):
    if x.Marital_Status == 'Married' and x.Card_Category == 'Platinum':
        return 1
    else:
        return 0
df['newState'] = df.apply(solution,axis=1)
df['newState'].value_counts()
Out[40]:
0 10120
1 7
Name: newState, dtype: int64
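apply with axis=1 runs a Python function row by row; a vectorized equivalent sketch:
df['newState'] = ((df['Marital_Status'] == 'Married') & (df['Card_Category'] == 'Platinum')).astype(int)
df['newState'].value_counts()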
63. Recode the Gender column in place: M to male, F to female. Print the frequency of each value
In [42]:
df['Gender'] = df['Gender'].map(lambda x : 'male' if x =='M' else 'female')
df['Gender'].value_counts()
Out[42]:
female 5358
male 4769
Name: Gender, dtype: int64
In [49]:
def solution(x):
    if x == 'M':
        return 'male'
    else:
        return 'female'
df['Gender'] = df['Gender'].apply(solution)
df['Gender'].value_counts()
Out[49]:
female 5358
male 4769
Name: Gender, dtype: int64
64. Load the data and identify each column's data type
In [2]:
import pandas as pd
df = pd.read_csv('https://raw.githubusercontent.com/Datamanim/pandas/main/timeTest.csv')
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6574 entries, 0 to 6573
Data columns (total 13 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Yr_Mo_Dy 6574 non-null object
1 RPT 6568 non-null float64
2 VAL 6571 non-null float64
3 ROS 6572 non-null float64
4 KIL 6569 non-null float64
5 SHA 6572 non-null float64
6 BIR 6574 non-null float64
7 DUB 6571 non-null float64
8 CLA 6572 non-null float64
9 MUL 6571 non-null float64
10 CLO 6573 non-null float64
11 BEL 6574 non-null float64
12 MAL 6570 non-null float64
dtypes: float64(12), object(1)
memory usage: 667.8+ KB
65. Convert Yr_Mo_Dy to the datetime64 type pandas can recognize
In [3]:
# Method 1 (unit-less 'datetime64' is rejected by newer pandas, so the unit is spelled out)
answer = df['Yr_Mo_Dy'].astype('datetime64[ns]')
answer.head(4)
# Method 2
answer = pd.to_datetime(df['Yr_Mo_Dy'])
answer.head(4)
Out[3]:
0 2061-01-01
1 2061-01-02
2 2061-01-03
3 2061-01-04
Name: Yr_Mo_Dy, dtype: datetime64[ns]
In [4]:
df['Yr_Mo_Dy']=pd.to_datetime(df['Yr_Mo_Dy'])
66. Print every unique year present in Yr_Mo_Dy
In [5]:
df['Yr_Mo_Dy'].dt.year.unique()
Out[5]:
array([2061, 2062, 2063, 2064, 2065, 2066, 2067, 2068, 2069, 2070, 1971,
1972, 1973, 1974, 1975, 1976, 1977, 1978], dtype=int64)
67. In Yr_Mo_Dy, any year of 2061 or later is bad data. Subtract 100 from those years and redefine the Yr_Mo_Dy column
In [6]:
df['Yr_Mo_Dy']= pd.to_datetime(df['Yr_Mo_Dy'].astype('str').map(lambda x : str(int(x[:4])-100)+x[4:] if x[:4]>='2061' else x))
df['Yr_Mo_Dy']
Out[6]:
0 1961-01-01
1 1961-01-02
2 1961-01-03
3 1961-01-04
4 1961-01-05
...
6569 1978-12-27
6570 1978-12-28
6571 1978-12-29
6572 1978-12-30
6573 1978-12-31
Name: Yr_Mo_Dy, Length: 6574, dtype: datetime64[ns]
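The same correction could be done after parsing, with DateOffset instead of string slicing, applied to the column as converted in problem 65 (a sketch; map returns object dtype, hence the re-parse):
fixed = pd.to_datetime(df['Yr_Mo_Dy'].map(lambda d: d - pd.DateOffset(years=100) if d.year >= 2061 else d))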
68. Compute each column's yearly mean
In [7]:
df.groupby(df['Yr_Mo_Dy'].dt.year).agg('mean').head(4)
Out[7]:
| RPT | VAL | ROS | KIL | SHA | BIR | DUB | CLA | MUL | CLO | BEL | MAL | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Yr_Mo_Dy | ||||||||||||
| 1961 | 12.299583 | 10.351796 | 11.362369 | 6.958227 | 10.881763 | 7.729726 | 9.733923 | 8.858788 | 8.647652 | 9.835577 | 13.502795 | 13.680773 |
| 1962 | 12.246923 | 10.110438 | 11.732712 | 6.960440 | 10.657918 | 7.393068 | 11.020712 | 8.793753 | 8.316822 | 9.676247 | 12.930685 | 14.323956 |
| 1963 | 12.813452 | 10.836986 | 12.541151 | 7.330055 | 11.724110 | 8.434712 | 11.075699 | 10.336548 | 8.903589 | 10.224438 | 13.638877 | 14.999014 |
| 1964 | 12.363661 | 10.920164 | 12.104372 | 6.787787 | 11.454481 | 7.570874 | 10.259153 | 9.467350 | 7.789016 | 10.207951 | 13.740546 | 14.910301 |
69. Create a weekday column mapped by day of week (Monday: 0 through Sunday: 6)
In [8]:
df['weekday'] = df['Yr_Mo_Dy'].dt.weekday
df['weekday'].head(3).to_frame()
Out[8]:
| weekday | |
|---|---|
| 0 | 6 |
| 1 | 0 |
| 2 | 1 |
70. Based on weekday, create a WeekCheck column that is 1 on weekends and 0 on weekdays
In [9]:
df['WeekCheck'] = df['weekday'].map(lambda x : 1 if x==5 or x==6 else 0)
df['WeekCheck'].head(3).to_frame()
Out[9]:
| WeekCheck | |
|---|---|
| 0 | 1 |
| 1 | 0 |
| 2 | 0 |
71. Ignoring year and day, compute each column's mean per month
In [10]:
answer = df.groupby(df['Yr_Mo_Dy'].dt.month).agg('mean')
answer
Out[10]:
| RPT | VAL | ROS | KIL | SHA | BIR | DUB | CLA | MUL | CLO | BEL | MAL | weekday | WeekCheck | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Yr_Mo_Dy | ||||||||||||||
| 1 | 14.847325 | 12.914560 | 13.299624 | 7.199498 | 11.667734 | 8.054839 | 11.819355 | 9.512047 | 9.543208 | 10.053566 | 14.550520 | 18.028763 | 2.989247 | 0.284946 |
| 2 | 13.710906 | 12.111122 | 12.879132 | 6.942411 | 11.551772 | 7.633858 | 11.206024 | 9.341437 | 9.313169 | 9.518051 | 13.728898 | 17.156142 | 3.005906 | 0.287402 |
| 3 | 13.158687 | 11.505842 | 12.648118 | 7.265907 | 11.554516 | 7.959409 | 11.310179 | 9.635896 | 9.700324 | 10.096953 | 13.810609 | 16.909317 | 3.000000 | 0.283154 |
| 4 | 12.555648 | 10.429759 | 12.204815 | 6.898037 | 10.677667 | 7.441389 | 10.221315 | 8.909056 | 8.930870 | 9.158019 | 12.664759 | 14.937611 | 3.011111 | 0.288889 |
| 5 | 11.724032 | 10.145619 | 11.550394 | 6.307487 | 10.224301 | 6.942061 | 8.797738 | 8.452903 | 8.040806 | 8.524857 | 12.767258 | 13.736039 | 2.982079 | 0.283154 |
| 6 | 10.451317 | 8.949704 | 10.361315 | 5.652278 | 9.529926 | 6.410093 | 8.009556 | 7.920796 | 7.639796 | 7.729185 | 12.246407 | 12.861818 | 3.007407 | 0.285185 |
| 7 | 9.992007 | 8.357778 | 9.349642 | 5.416935 | 9.302634 | 5.972348 | 7.843501 | 7.262760 | 7.544480 | 7.321416 | 11.676505 | 12.800789 | 3.001792 | 0.288530 |
| 8 | 10.213411 | 8.415143 | 9.993441 | 5.270681 | 8.901559 | 5.891057 | 7.772312 | 6.842025 | 7.240573 | 7.002783 | 11.110090 | 12.565943 | 2.991039 | 0.283154 |
| 9 | 11.458519 | 9.981002 | 10.756883 | 5.615176 | 9.766315 | 6.566222 | 8.609722 | 7.745677 | 7.610556 | 7.689278 | 12.686389 | 14.761963 | 3.009259 | 0.287037 |
| 10 | 12.660610 | 11.010681 | 11.453943 | 6.065215 | 10.550251 | 7.159910 | 9.387778 | 8.726308 | 8.347181 | 8.850376 | 14.155323 | 16.697151 | 2.998208 | 0.286738 |
| 11 | 13.200722 | 11.639500 | 12.293407 | 6.247611 | 10.501130 | 7.134333 | 10.814861 | 8.427167 | 8.604000 | 8.943167 | 13.815741 | 18.114185 | 2.992593 | 0.283333 |
| 12 | 14.446398 | 12.353602 | 13.212276 | 6.829910 | 11.301254 | 7.963710 | 11.849050 | 9.209355 | 9.447258 | 9.627670 | 14.259516 | 18.697599 | 3.017921 | 0.288530 |
72. Fill every missing value with the preceding value in its column; if the first row is missing, fill it with the following value
In [11]:
df = df.fillna(method='ffill').fillna(method='bfill')
df.isnull().sum()
Out[11]:
Yr_Mo_Dy 0
RPT 0
VAL 0
ROS 0
KIL 0
SHA 0
BIR 0
DUB 0
CLA 0
MUL 0
CLO 0
BEL 0
MAL 0
weekday 0
WeekCheck 0
dtype: int64
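Newer pandas versions deprecate the method= argument of fillna; the dedicated ffill/bfill methods are the drop-in equivalents (a sketch):
df = df.ffill().bfill()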
73. Compute the mean of every column grouped by year-month
In [18]:
answer = df.groupby(df['Yr_Mo_Dy'].dt.strftime('%Y-%m')).agg('mean')
answer
Out[18]:
| RPT | VAL | ROS | KIL | SHA | BIR | DUB | CLA | MUL | CLO | BEL | MAL | weekday | WeekCheck | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Yr_Mo_Dy | ||||||||||||||
| 1961-01 | 14.932581 | 12.084194 | 13.431613 | 7.736774 | 11.215806 | 8.588065 | 11.184839 | 9.270968 | 9.085806 | 10.107419 | 13.880968 | 14.703226 | 2.935484 | 0.290323 |
| 1961-02 | 16.269286 | 14.975357 | 14.324643 | 9.116786 | 13.852143 | 10.937500 | 11.890714 | 11.846071 | 11.821429 | 12.714286 | 18.583214 | 15.411786 | 3.000000 | 0.285714 |
| 1961-03 | 11.015806 | 11.296452 | 10.752903 | 7.137742 | 10.509355 | 8.866774 | 9.644194 | 9.829677 | 10.646129 | 11.251935 | 16.410968 | 15.637742 | 3.000000 | 0.258065 |
| 1961-04 | 10.722667 | 9.427667 | 9.998000 | 5.830667 | 8.435000 | 6.495000 | 6.925333 | 7.094667 | 7.342333 | 7.237000 | 11.147333 | 10.278333 | 3.166667 | 0.333333 |
| 1961-05 | 9.860968 | 8.735161 | 10.818065 | 5.892258 | 9.490323 | 6.574839 | 7.477097 | 8.177097 | 8.039355 | 8.499355 | 11.900323 | 12.011613 | 2.806452 | 0.258065 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1978-08 | 9.645161 | 8.259355 | 9.032258 | 4.502903 | 7.368065 | 5.935161 | 5.650323 | 5.417742 | 7.241290 | 5.536774 | 10.466774 | 12.054194 | 2.903226 | 0.258065 |
| 1978-09 | 10.913667 | 10.895000 | 10.635000 | 5.725000 | 10.372000 | 9.278333 | 10.790333 | 9.583000 | 10.069333 | 8.939000 | 15.680333 | 19.391333 | 3.100000 | 0.300000 |
| 1978-10 | 9.897742 | 8.670968 | 9.295806 | 4.721290 | 8.525161 | 6.774194 | 8.115484 | 7.337742 | 8.297742 | 8.243871 | 13.776774 | 17.150000 | 2.935484 | 0.290323 |
| 1978-11 | 16.151667 | 14.802667 | 13.508000 | 7.317333 | 11.475000 | 8.743000 | 11.492333 | 9.657333 | 10.701333 | 10.676000 | 17.404667 | 20.723000 | 2.966667 | 0.266667 |
| 1978-12 | 16.175484 | 13.748065 | 15.635161 | 7.094839 | 11.398710 | 9.241613 | 12.077419 | 10.194839 | 10.616774 | 11.028710 | 13.859677 | 21.371613 | 3.193548 | 0.322581 |
216 rows × 14 columns
74. First-difference the RPT column by date
In [21]:
answer = df['RPT'].diff()
answer
Out[21]:
0 NaN
1 -0.33
2 3.79
3 -7.92
4 2.75
...
6569 3.75
6570 -4.37
6571 0.79
6572 4.50
6573 1.83
Name: RPT, Length: 6574, dtype: float64
In [33]:
answer = df[['RPT','VAL']].rolling(7).mean()
answer.head(10)
Out[33]:
| RPT | VAL | |
|---|---|---|
| 0 | NaN | NaN |
| 1 | NaN | NaN |
| 2 | NaN | NaN |
| 3 | NaN | NaN |
| 4 | NaN | NaN |
| 5 | NaN | NaN |
| 6 | 14.124286 | 12.727143 |
| 7 | 13.541429 | 11.982857 |
| 8 | 13.237143 | 11.392857 |
| 9 | 12.504286 | 10.570000 |
76. Convert the 년-월-일:시 column to a datetime format pandas can recognize. In the data published by the city of Seoul, hour 0 is written as 24
In [63]:
import pandas as pd
df = pd.read_csv('https://raw.githubusercontent.com/Datamanim/pandas/main/seoul_pm.csv')
df
Out[63]:
| (년-월-일:시) | PM10등급 | PM10 | PM2.5등급 | PM2.5 | 오존등급 | 오존 | 이산화질소등급 | 이산화질소 | 일산화탄소등급 | 일산화탄소 | 아황산가스등급 | 아황산가스 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2021-05-15:15 | 보통 | 47.0 | 보통 | 19.0 | 좋음 | 0.017 | 좋음 | 0.023 | 좋음 | 0.4 | 좋음 | 0.003 |
| 1 | 2021-05-15:14 | 보통 | 43.0 | 보통 | 20.0 | 좋음 | 0.024 | 좋음 | 0.019 | 좋음 | 0.3 | 좋음 | 0.003 |
| 2 | 2021-05-15:13 | 보통 | 34.0 | 보통 | 24.0 | 보통 | 0.035 | 좋음 | 0.017 | 좋음 | 0.4 | 좋음 | 0.004 |
| 3 | 2021-05-15:12 | 보통 | 41.0 | 보통 | 27.0 | 보통 | 0.037 | 좋음 | 0.020 | 좋음 | 0.4 | 좋음 | 0.004 |
| 4 | 2021-05-15:11 | 보통 | 51.0 | 보통 | 34.0 | 보통 | 0.033 | 좋음 | 0.023 | 좋음 | 0.4 | 좋음 | 0.005 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1234 | 2021-03-25:05 | 보통 | 39.0 | 보통 | 18.0 | 좋음 | 0.026 | 좋음 | 0.025 | 좋음 | 0.4 | 좋음 | 0.003 |
| 1235 | 2021-03-25:04 | 보통 | 34.0 | 좋음 | 15.0 | 좋음 | 0.017 | 보통 | 0.033 | 좋음 | 0.4 | 좋음 | 0.002 |
| 1236 | 2021-03-25:03 | 보통 | 35.0 | 좋음 | 13.0 | 좋음 | 0.029 | 좋음 | 0.025 | 좋음 | 0.4 | 좋음 | 0.003 |
| 1237 | 2021-03-25:02 | 보통 | 35.0 | 좋음 | 13.0 | 보통 | 0.031 | 좋음 | 0.025 | 좋음 | 0.3 | 좋음 | 0.003 |
| 1238 | 2021-03-25:01 | 보통 | 42.0 | 좋음 | 13.0 | 좋음 | 0.022 | 보통 | 0.037 | 좋음 | 0.4 | 좋음 | 0.003 |
1239 rows × 13 columns
In [64]:
import datetime
def date_ft(x):
    if x[-2:] != '24':
        x = x.replace(':', ' ')
        return pd.to_datetime(x)
    else:
        x = pd.to_datetime(x.split(':')[0]) + datetime.timedelta(days=1)
        return x
df['(년-월-일:시)'] = df['(년-월-일:시)'].apply(date_ft)
df.head(3)
Out[64]:
| (년-월-일:시) | PM10등급 | PM10 | PM2.5등급 | PM2.5 | 오존등급 | 오존 | 이산화질소등급 | 이산화질소 | 일산화탄소등급 | 일산화탄소 | 아황산가스등급 | 아황산가스 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2021-05-15 15:00:00 | 보통 | 47.0 | 보통 | 19.0 | 좋음 | 0.017 | 좋음 | 0.023 | 좋음 | 0.4 | 좋음 | 0.003 |
| 1 | 2021-05-15 14:00:00 | 보통 | 43.0 | 보통 | 20.0 | 좋음 | 0.024 | 좋음 | 0.019 | 좋음 | 0.3 | 좋음 | 0.003 |
| 2 | 2021-05-15 13:00:00 | 보통 | 34.0 | 보통 | 24.0 | 보통 | 0.035 | 좋음 | 0.017 | 좋음 | 0.4 | 좋음 | 0.004 |
In [73]:
df['dayName'] = df['(년-월-일:시)'].dt.day_name()
df['dayName'].head(5)
Out[73]:
0 Saturday
1 Saturday
2 Saturday
3 Saturday
4 Saturday
Name: dayName, dtype: object
78. Count the frequency of each PM10등급 value by day
In [77]:
df['PM10등급'].value_counts()
Out[77]:
보통 651
좋음 361
나쁨 145
매우나쁨 76
Name: PM10등급, dtype: int64
In [89]:
answer = df.groupby(['dayName','PM10등급'],as_index=False).size()
answer
Out[89]:
| dayName | PM10등급 | size | |
|---|---|---|---|
| 0 | Friday | 나쁨 | 31 |
| 1 | Friday | 매우나쁨 | 17 |
| 2 | Friday | 보통 | 120 |
| 3 | Friday | 좋음 | 21 |
| 4 | Monday | 나쁨 | 1 |
| 5 | Monday | 매우나쁨 | 21 |
| 6 | Monday | 보통 | 83 |
| 7 | Monday | 좋음 | 63 |
| 8 | Saturday | 나쁨 | 31 |
| 9 | Saturday | 매우나쁨 | 27 |
| 10 | Saturday | 보통 | 71 |
| 11 | Saturday | 좋음 | 54 |
| 12 | Sunday | 나쁨 | 2 |
| 13 | Sunday | 매우나쁨 | 1 |
| 14 | Sunday | 보통 | 67 |
| 15 | Sunday | 좋음 | 98 |
| 16 | Thursday | 나쁨 | 41 |
| 17 | Thursday | 보통 | 144 |
| 18 | Thursday | 좋음 | 5 |
| 19 | Tuesday | 나쁨 | 13 |
| 20 | Tuesday | 매우나쁨 | 10 |
| 21 | Tuesday | 보통 | 71 |
| 22 | Tuesday | 좋음 | 74 |
| 23 | Wednesday | 나쁨 | 26 |
| 24 | Wednesday | 보통 | 95 |
| 25 | Wednesday | 좋음 | 46 |
79. Check that the timestamps are continuous with no missing entries
In [98]:
# Differencing the timestamps gives NaN for the first value; if all remaining differences are identical (2 unique values in total), the series is continuous.
answer = len(df['(년-월-일:시)'].diff().unique())
if answer == 2:
    answer = True
else:
    answer = False
answer
Out[98]:
True
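A more compact equivalent: drop the leading NaT and count the distinct differences (a sketch):
answer = df['(년-월-일:시)'].diff().dropna().nunique() == 1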
80. Compute the mean PM10 at 10 AM and at 10 PM (hour 22), respectively
In [140]:
answer = df.loc[(df['(년-월-일:시)'].dt.hour==10) | (df['(년-월-일:시)'].dt.hour==22)]
answer = answer[['(년-월-일:시)','PM10']].groupby(df['(년-월-일:시)'].dt.hour).agg('mean')
answer
Out[140]:
| PM10 | |
|---|---|
| (년-월-일:시) | |
| 10 | 70.384615 |
| 22 | 69.941176 |
In [151]:
df.groupby(df['(년-월-일:시)'].dt.hour).mean().loc[[10,22],['PM10']]
Out[151]:
| PM10 | |
|---|---|
| (년-월-일:시) | |
| 10 | 70.384615 |
| 22 | 69.941176 |
81. Make the date column the index
In [178]:
df = df.set_index('(년-월-일:시)',drop=True)
df.head(3)
Out[178]:
| PM10등급 | PM10 | PM2.5등급 | PM2.5 | 오존등급 | 오존 | 이산화질소등급 | 이산화질소 | 일산화탄소등급 | 일산화탄소 | 아황산가스등급 | 아황산가스 | dayName | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| (년-월-일:시) | |||||||||||||
| 2021-05-15 15:00:00 | 보통 | 47.0 | 보통 | 19.0 | 좋음 | 0.017 | 좋음 | 0.023 | 좋음 | 0.4 | 좋음 | 0.003 | Saturday |
| 2021-05-15 14:00:00 | 보통 | 43.0 | 보통 | 20.0 | 좋음 | 0.024 | 좋음 | 0.019 | 좋음 | 0.3 | 좋음 | 0.003 | Saturday |
| 2021-05-15 13:00:00 | 보통 | 34.0 | 보통 | 24.0 | 보통 | 0.035 | 좋음 | 0.017 | 좋음 | 0.4 | 좋음 | 0.004 | Saturday |
82. Resample the data by week and compute the min, max, mean, and standard deviation
In [187]:
answer = df.resample('W').agg(['min','max','mean','std'])
answer
Out[187]:
| PM10 | PM2.5 | 오존 | ... | 이산화질소 | 일산화탄소 | 아황산가스 | |||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| min | max | mean | std | min | max | mean | std | min | max | ... | mean | std | min | max | mean | std | min | max | mean | std | |
| (년-월-일:시) | |||||||||||||||||||||
| 2021-03-28 | 6.0 | 160.0 | 72.747368 | 43.345462 | 4.0 | 113.0 | 44.705263 | 29.551928 | 0.002 | 0.085 | ... | 0.044579 | 0.023722 | 0.3 | 1.4 | 0.611579 | 0.232408 | 0.002 | 0.006 | 0.003274 | 0.000961 |
| 2021-04-04 | 3.0 | 598.0 | 97.148810 | 129.911976 | 1.0 | 120.0 | 23.168675 | 22.399578 | 0.003 | 0.055 | ... | 0.027929 | 0.014978 | 0.3 | 0.9 | 0.445833 | 0.135741 | 0.002 | 0.004 | 0.002732 | 0.000541 |
| 2021-04-11 | 17.0 | 102.0 | 41.059524 | 16.325911 | 7.0 | 70.0 | 21.761905 | 11.479343 | 0.009 | 0.070 | ... | 0.022583 | 0.009562 | 0.3 | 0.7 | 0.389286 | 0.087573 | 0.002 | 0.004 | 0.002744 | 0.000569 |
| 2021-04-18 | 3.0 | 367.0 | 48.180723 | 43.254468 | 2.0 | 38.0 | 17.066265 | 7.867952 | 0.002 | 0.070 | ... | 0.023753 | 0.013553 | 0.3 | 0.6 | 0.386747 | 0.084954 | 0.002 | 0.004 | 0.002464 | 0.000579 |
| 2021-04-25 | 17.0 | 126.0 | 55.119048 | 26.659936 | 7.0 | 61.0 | 26.392857 | 13.094788 | 0.006 | 0.090 | ... | 0.028571 | 0.014640 | 0.3 | 0.8 | 0.457143 | 0.122142 | 0.001 | 0.011 | 0.003631 | 0.001763 |
| 2021-05-02 | 3.0 | 97.0 | 40.612121 | 24.813103 | 1.0 | 43.0 | 16.644578 | 8.850965 | 0.003 | 0.064 | ... | 0.020428 | 0.011676 | 0.3 | 0.6 | 0.392771 | 0.092485 | 0.001 | 0.006 | 0.002524 | 0.000768 |
| 2021-05-09 | 8.0 | 1024.0 | 161.660714 | 239.679148 | 3.0 | 172.0 | 34.738095 | 39.788248 | 0.002 | 0.073 | ... | 0.024187 | 0.012371 | 0.3 | 1.0 | 0.419277 | 0.103230 | 0.002 | 0.004 | 0.002771 | 0.000579 |
| 2021-05-16 | 16.0 | 111.0 | 40.014815 | 21.876855 | 7.0 | 76.0 | 21.577778 | 15.622633 | 0.004 | 0.123 | ... | 0.030793 | 0.009503 | 0.3 | 0.8 | 0.440741 | 0.094075 | 0.001 | 0.006 | 0.002459 | 0.001696 |
8 rows × 24 columns
83. Drop the Indicator column and strip the confidence-interval text from the First Tooltip column
In [221]:
import pandas as pd
df = pd.read_csv('https://raw.githubusercontent.com/Datamanim/pandas/main/under5MortalityRate.csv')
df.drop('Indicator',axis=1,inplace=True)
df['First Tooltip'] = df['First Tooltip'].map(lambda x : float(x.split(' ')[0]))
df
Out[221]:
| Location | Period | Dim1 | First Tooltip | |
|---|---|---|---|---|
| 0 | Afghanistan | 2019 | Both sexes | 60.27 |
| 1 | Afghanistan | 2019 | Male | 63.83 |
| 2 | Afghanistan | 2019 | Female | 56.57 |
| 3 | Afghanistan | 2018 | Both sexes | 62.54 |
| 4 | Afghanistan | 2018 | Male | 66.08 |
| ... | ... | ... | ... | ... |
| 29994 | Timor-Leste | 2002 | Both sexes | 96.97 |
| 29995 | Timor-Leste | 2002 | Male | 102.60 |
| 29996 | Timor-Leste | 2002 | Female | 91.04 |
| 29997 | Timor-Leste | 2001 | Both sexes | 102.50 |
| 29998 | Timor-Leste | 2001 | Male | 108.20 |
29999 rows × 4 columns
84. Extract only the cases where the year is 2015 or later and Dim1 is Both sexes
In [222]:
target = df.loc[(df['Period']>=2015)&(df['Dim1']=='Both sexes')]
target
Out[222]:
| Location | Period | Dim1 | First Tooltip | |
|---|---|---|---|---|
| 0 | Afghanistan | 2019 | Both sexes | 60.27 |
| 3 | Afghanistan | 2018 | Both sexes | 62.54 |
| 6 | Afghanistan | 2017 | Both sexes | 64.94 |
| 9 | Afghanistan | 2016 | Both sexes | 67.57 |
| 12 | Afghanistan | 2015 | Both sexes | 70.44 |
| ... | ... | ... | ... | ... |
| 29943 | Timor-Leste | 2019 | Both sexes | 44.22 |
| 29946 | Timor-Leste | 2018 | Both sexes | 45.62 |
| 29949 | Timor-Leste | 2017 | Both sexes | 47.27 |
| 29952 | Timor-Leste | 2016 | Both sexes | 49.01 |
| 29955 | Timor-Leste | 2015 | Both sexes | 50.76 |
860 rows × 4 columns
85. Using the data extracted in problem 84, build a DataFrame of mortality rates by country and year, as below
In [223]:
answer = target.pivot(index='Location',columns='Period',values='First Tooltip')
answer
Out[223]:
| Period | 2015 | 2016 | 2017 | 2018 | 2019 |
|---|---|---|---|---|---|
| Location | |||||
| Afghanistan | 70.44 | 67.57 | 64.94 | 62.54 | 60.27 |
| Albania | 9.57 | 9.42 | 9.42 | 9.53 | 9.68 |
| Algeria | 25.18 | 24.79 | 24.32 | 23.81 | 23.26 |
| Andorra | 3.53 | 3.37 | 3.22 | 3.09 | 2.97 |
| Angola | 88.20 | 84.21 | 80.62 | 77.67 | 74.69 |
| ... | ... | ... | ... | ... | ... |
| Syrian Arab Republic | 23.18 | 23.27 | 22.97 | 22.11 | 21.53 |
| Tajikistan | 37.75 | 36.82 | 35.81 | 34.80 | 33.78 |
| Thailand | 10.80 | 10.32 | 9.86 | 9.42 | 9.01 |
| The former Yugoslav Republic of Macedonia | 12.97 | 11.97 | 9.94 | 7.83 | 6.12 |
| Timor-Leste | 50.76 | 49.01 | 47.27 | 45.62 | 44.22 |
172 rows × 5 columns
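pivot requires each (Location, Period) pair to be unique and raises otherwise; if duplicates existed, pivot_table, which aggregates (mean by default), would be the fallback (a sketch):
answer = target.pivot_table(index='Location', columns='Period', values='First Tooltip')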
86. Compute the mean mortality rate per year for each Dim1 value
In [238]:
answer = df.pivot_table(index='Dim1', columns='Period', values='First Tooltip', aggfunc='mean')
answer
Out[238]:
| Period | 1950 | 1951 | 1952 | 1953 | 1954 | 1955 | 1956 | 1957 | 1958 | 1959 | ... | 2010 | 2011 | 2012 | 2013 | 2014 | 2015 | 2016 | 2017 | 2018 | 2019 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Dim1 | |||||||||||||||||||||
| Both sexes | 147.700816 | 155.537544 | 157.811094 | 156.147206 | 154.539857 | 155.797179 | 159.241163 | 156.170114 | 150.813222 | 150.574000 | ... | 37.718488 | 35.573663 | 34.290988 | 33.099360 | 32.053314 | 31.012093 | 29.956337 | 29.030465 | 28.083837 | 27.191744 |
| Female | 140.909796 | 149.210175 | 151.516094 | 150.250882 | 148.688286 | 149.843205 | 153.048721 | 149.988295 | 144.719667 | 144.451474 | ... | 34.953023 | 32.877616 | 31.654070 | 30.521337 | 29.524302 | 28.544360 | 27.542035 | 26.675291 | 25.782616 | 24.945349 |
| Male | 154.151224 | 161.538246 | 163.760781 | 161.742059 | 160.081000 | 161.456923 | 165.089535 | 162.015000 | 156.573556 | 156.375053 | ... | 40.340174 | 38.140291 | 36.793081 | 35.543663 | 34.446105 | 33.354302 | 32.242616 | 31.273198 | 30.283023 | 29.350349 |
3 rows × 70 columns
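pivot_table with aggfunc='mean' is shorthand for a groupby followed by unstack; the equivalent sketch below can be a useful starting point when a more complex aggregation is needed:

# same 3 × 70 table built explicitly from groupby
answer = df.groupby(['Dim1', 'Period'])['First Tooltip'].mean().unstack()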
87. Extract only the rows for Korea (KOR) from the data¶
In [239]:
import pandas as pd
df = pd.read_csv('https://raw.githubusercontent.com/Datamanim/pandas/main/winter.csv')
df
Out[239]:
| Year | City | Sport | Discipline | Athlete | Country | Gender | Event | Medal | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1924 | Chamonix | Biathlon | Biathlon | BERTHET, G. | FRA | Men | Military Patrol | Bronze |
| 1 | 1924 | Chamonix | Biathlon | Biathlon | MANDRILLON, C. | FRA | Men | Military Patrol | Bronze |
| 2 | 1924 | Chamonix | Biathlon | Biathlon | MANDRILLON, Maurice | FRA | Men | Military Patrol | Bronze |
| 3 | 1924 | Chamonix | Biathlon | Biathlon | VANDELLE, André | FRA | Men | Military Patrol | Bronze |
| 4 | 1924 | Chamonix | Biathlon | Biathlon | AUFDENBLATTEN, Adolf | SUI | Men | Military Patrol | Gold |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 5765 | 2014 | Sochi | Skiing | Snowboard | JONES, Jenny | GBR | Women | Slopestyle | Bronze |
| 5766 | 2014 | Sochi | Skiing | Snowboard | ANDERSON, Jamie | USA | Women | Slopestyle | Gold |
| 5767 | 2014 | Sochi | Skiing | Snowboard | MALTAIS, Dominique | CAN | Women | Snowboard Cross | Silver |
| 5768 | 2014 | Sochi | Skiing | Snowboard | SAMKOVA, Eva | CZE | Women | Snowboard Cross | Gold |
| 5769 | 2014 | Sochi | Skiing | Snowboard | TRESPEUCH, Chloe | FRA | Women | Snowboard Cross | Bronze |
5770 rows × 9 columns
In [241]:
answer = df.loc[df['Country']=='KOR']
answer
Out[241]:
| Year | City | Sport | Discipline | Athlete | Country | Gender | Event | Medal | |
|---|---|---|---|---|---|---|---|---|---|
| 2652 | 1992 | Albertville | Skating | Short Track Speed Skating | LEE, Jun-Ho | KOR | Men | 1000M | Bronze |
| 2653 | 1992 | Albertville | Skating | Short Track Speed Skating | KIM, Ki-Hoon | KOR | Men | 1000M | Gold |
| 2671 | 1992 | Albertville | Skating | Short Track Speed Skating | KIM, Ki-Hoon | KOR | Men | 5000M Relay | Gold |
| 2672 | 1992 | Albertville | Skating | Short Track Speed Skating | LEE, Jun-Ho | KOR | Men | 5000M Relay | Gold |
| 2673 | 1992 | Albertville | Skating | Short Track Speed Skating | MO, Ji-Soo | KOR | Men | 5000M Relay | Gold |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 5528 | 2014 | Sochi | Skating | Short Track Speed Skating | PARK, Seung-Hi | KOR | Women | 500M | Bronze |
| 5544 | 2014 | Sochi | Skating | Speed skating | JOO, Hyong Jun | KOR | Men | Team Pursuit | Silver |
| 5545 | 2014 | Sochi | Skating | Speed skating | KIM, Cheol Min | KOR | Men | Team Pursuit | Silver |
| 5546 | 2014 | Sochi | Skating | Speed skating | LEE, Seung Hoon | KOR | Men | Team Pursuit | Silver |
| 5565 | 2014 | Sochi | Skating | Speed skating | LEE, Sang Hwa | KOR | Women | 500M | Gold |
87 rows × 9 columns
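Boolean indexing with == handles a single country; isin generalizes the same selection to several at once (a sketch — 'JPN' is used purely as an illustrative second IOC code):

answer = df[df['Country'].isin(['KOR', 'JPN'])]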
88. From the Korean Olympic medalist data, build a DataFrame of medal counts by year¶
In [249]:
Ans = answer.pivot_table(index='Year', columns='Medal', aggfunc='size').fillna(0)
Ans
Out[249]:
| Medal | Bronze | Gold | Silver |
|---|---|---|---|
| Year | |||
| 1992 | 1.0 | 5.0 | 1.0 |
| 1994 | 1.0 | 8.0 | 1.0 |
| 1998 | 2.0 | 6.0 | 4.0 |
| 2002 | 0.0 | 5.0 | 2.0 |
| 2006 | 2.0 | 14.0 | 3.0 |
| 2010 | 2.0 | 6.0 | 10.0 |
| 2014 | 2.0 | 7.0 | 5.0 |
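pd.crosstab yields the same counts as the pivot_table/size combination without the fillna step, since missing combinations come back as integer zeros (a sketch on the same filtered answer):

Ans = pd.crosstab(answer['Year'], answer['Medal'])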
89. From the full data, count entries by gender for each type of Sport¶
In [251]:
Ans = df.pivot_table(index='Sport', columns='Gender', aggfunc='size')
Ans
Out[251]:
| Gender | Men | Women |
|---|---|---|
| Sport | ||
| Biathlon | 270 | 150 |
| Bobsleigh | 416 | 36 |
| Curling | 97 | 75 |
| Ice Hockey | 1231 | 305 |
| Luge | 135 | 45 |
| Skating | 665 | 564 |
| Skiing | 1130 | 651 |
90. From the full data, count the number of each Medal for each type of Discipline¶
In [253]:
Ans = df.pivot_table(index = 'Discipline', columns='Medal', aggfunc='size')
Ans
Out[253]:
| Medal | Bronze | Gold | Silver |
|---|---|---|---|
| Discipline | |||
| Alpine Skiing | 141 | 143 | 144 |
| Biathlon | 139 | 140 | 141 |
| Bobsleigh | 147 | 134 | 141 |
| Cross Country Skiing | 263 | 264 | 262 |
| Curling | 56 | 58 | 58 |
| Figure skating | 118 | 122 | 119 |
| Freestyle Skiing | 34 | 34 | 34 |
| Ice Hockey | 512 | 510 | 514 |
| Luge | 60 | 62 | 58 |
| Nordic Combined | 55 | 55 | 55 |
| Short Track Speed Skating | 96 | 97 | 97 |
| Skeleton | 10 | 10 | 10 |
| Ski Jumping | 68 | 69 | 70 |
| Snowboard | 30 | 30 | 30 |
| Speed skating | 190 | 193 | 197 |
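As in the previous two problems, aggfunc='size' simply counts rows per cell; the equivalent groupby sketch makes that explicit (append .fillna(0) if some Discipline/Medal combination never occurs):

Ans = df.groupby(['Discipline', 'Medal']).size().unstack()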
91. Combine df1 and df2 into a single DataFrame¶
In [365]:
import pandas as pd
df = pd.read_csv('https://raw.githubusercontent.com/Datamanim/pandas/main/mergeTEst.csv',index_col=0)
df1 = df.iloc[:4,:]
df2 = df.iloc[4:,:]
display(df1)
display(df2)
total = pd.concat([df1,df2],axis=0)
total
| 2010 | 2011 | 2012 | 2013 | 2014 | 2015 | 2016 | 2017 | 2018 | 2019 | |
|---|---|---|---|---|---|---|---|---|---|---|
| Location | ||||||||||
| Afghanistan | 64.023 | 61.640 | 59.367 | 57.170 | 55.08 | 53.107 | 51.267 | 49.560 | 47.983 | 46.453 |
| Albania | 11.803 | 10.807 | 9.943 | 9.267 | 8.79 | 8.493 | 8.363 | 8.363 | 8.453 | 8.597 |
| Algeria | 23.540 | 22.907 | 22.450 | 22.117 | 21.85 | 21.587 | 21.257 | 20.850 | 20.407 | 19.930 |
| Andorra | 4.240 | 4.033 | 3.843 | 3.667 | 3.49 | 3.330 | 3.187 | 3.060 | 2.933 | 2.827 |
| 2010 | 2011 | 2012 | 2013 | 2014 | 2015 | 2016 | 2017 | 2018 | 2019 | |
|---|---|---|---|---|---|---|---|---|---|---|
| Location | ||||||||||
| Angola | 75.713 | 71.280 | 67.233 | 63.570 | 60.430 | 57.757 | 55.510 | 53.460 | 51.757 | 50.093 |
| Antigua and Barbuda | 8.667 | 8.223 | 7.807 | 7.420 | 7.070 | 6.757 | 6.483 | 6.230 | 6.000 | 5.783 |
| Argentina | 12.887 | 12.380 | 11.840 | 11.283 | 10.733 | 10.203 | 9.683 | 9.177 | 8.680 | 8.227 |
| Armenia | 16.497 | 15.677 | 14.897 | 14.170 | 13.477 | 12.817 | 12.183 | 11.583 | 11.007 | 10.497 |
| Australia | 3.993 | 3.803 | 3.623 | 3.467 | 3.343 | 3.253 | 3.183 | 3.137 | 3.090 | 3.047 |
| Austria | 3.573 | 3.463 | 3.333 | 3.210 | 3.113 | 3.043 | 2.987 | 2.943 | 2.897 | 2.843 |
Out[365]:
| 2010 | 2011 | 2012 | 2013 | 2014 | 2015 | 2016 | 2017 | 2018 | 2019 | |
|---|---|---|---|---|---|---|---|---|---|---|
| Location | ||||||||||
| Afghanistan | 64.023 | 61.640 | 59.367 | 57.170 | 55.080 | 53.107 | 51.267 | 49.560 | 47.983 | 46.453 |
| Albania | 11.803 | 10.807 | 9.943 | 9.267 | 8.790 | 8.493 | 8.363 | 8.363 | 8.453 | 8.597 |
| Algeria | 23.540 | 22.907 | 22.450 | 22.117 | 21.850 | 21.587 | 21.257 | 20.850 | 20.407 | 19.930 |
| Andorra | 4.240 | 4.033 | 3.843 | 3.667 | 3.490 | 3.330 | 3.187 | 3.060 | 2.933 | 2.827 |
| Angola | 75.713 | 71.280 | 67.233 | 63.570 | 60.430 | 57.757 | 55.510 | 53.460 | 51.757 | 50.093 |
| Antigua and Barbuda | 8.667 | 8.223 | 7.807 | 7.420 | 7.070 | 6.757 | 6.483 | 6.230 | 6.000 | 5.783 |
| Argentina | 12.887 | 12.380 | 11.840 | 11.283 | 10.733 | 10.203 | 9.683 | 9.177 | 8.680 | 8.227 |
| Armenia | 16.497 | 15.677 | 14.897 | 14.170 | 13.477 | 12.817 | 12.183 | 11.583 | 11.007 | 10.497 |
| Australia | 3.993 | 3.803 | 3.623 | 3.467 | 3.343 | 3.253 | 3.183 | 3.137 | 3.090 | 3.047 |
| Austria | 3.573 | 3.463 | 3.333 | 3.210 | 3.113 | 3.043 | 2.987 | 2.943 | 2.897 | 2.843 |
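pd.concat stacks the frames exactly as given and does not check the index; passing verify_integrity=True makes it raise if the pieces share index labels, which can catch accidental overlaps (a sketch on the same df1/df2):

total = pd.concat([df1, df2], axis=0, verify_integrity=True)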
92. Combine df3 and df4 into a single DataFrame, keeping only the columns both contain¶
In [371]:
df3 = df.iloc[:2,:4]
df4 = df.iloc[5:,3:]
display(df3)
display(df4)
total = pd.concat([df3,df4],axis=0,join='inner')
total
| 2010 | 2011 | 2012 | 2013 | |
|---|---|---|---|---|
| Location | ||||
| Afghanistan | 64.023 | 61.640 | 59.367 | 57.170 |
| Albania | 11.803 | 10.807 | 9.943 | 9.267 |
| 2013 | 2014 | 2015 | 2016 | 2017 | 2018 | 2019 | |
|---|---|---|---|---|---|---|---|
| Location | |||||||
| Antigua and Barbuda | 7.420 | 7.070 | 6.757 | 6.483 | 6.230 | 6.000 | 5.783 |
| Argentina | 11.283 | 10.733 | 10.203 | 9.683 | 9.177 | 8.680 | 8.227 |
| Armenia | 14.170 | 13.477 | 12.817 | 12.183 | 11.583 | 11.007 | 10.497 |
| Australia | 3.467 | 3.343 | 3.253 | 3.183 | 3.137 | 3.090 | 3.047 |
| Austria | 3.210 | 3.113 | 3.043 | 2.987 | 2.943 | 2.897 | 2.843 |
Out[371]:
| 2013 | |
|---|---|
| Location | |
| Afghanistan | 57.170 |
| Albania | 9.267 |
| Antigua and Barbuda | 7.420 |
| Argentina | 11.283 |
| Armenia | 14.170 |
| Australia | 3.467 |
| Austria | 3.210 |
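join='inner' keeps only the columns common to both frames, which here is just 2013. The surviving column set can be checked directly (sketch):

# Index.intersection shows which columns an inner concat will keep
shared_cols = df3.columns.intersection(df4.columns)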
93. Combine df3 and df4 into a single DataFrame, keeping every column and replacing missing values with 0¶
In [375]:
total = pd.concat([df3,df4],axis=0,join='outer').fillna(0)
total
Out[375]:
| 2010 | 2011 | 2012 | 2013 | 2014 | 2015 | 2016 | 2017 | 2018 | 2019 | |
|---|---|---|---|---|---|---|---|---|---|---|
| Location | ||||||||||
| Afghanistan | 64.023 | 61.640 | 59.367 | 57.170 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |
| Albania | 11.803 | 10.807 | 9.943 | 9.267 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |
| Antigua and Barbuda | 0.000 | 0.000 | 0.000 | 7.420 | 7.070 | 6.757 | 6.483 | 6.230 | 6.000 | 5.783 |
| Argentina | 0.000 | 0.000 | 0.000 | 11.283 | 10.733 | 10.203 | 9.683 | 9.177 | 8.680 | 8.227 |
| Armenia | 0.000 | 0.000 | 0.000 | 14.170 | 13.477 | 12.817 | 12.183 | 11.583 | 11.007 | 10.497 |
| Australia | 0.000 | 0.000 | 0.000 | 3.467 | 3.343 | 3.253 | 3.183 | 3.137 | 3.090 | 3.047 |
| Austria | 0.000 | 0.000 | 0.000 | 3.210 | 3.113 | 3.043 | 2.987 | 2.943 | 2.897 | 2.843 |
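Note that fillna(0) conflates "no data" with a genuine zero rate. When that distinction matters, the NaN-preserving outer concat can be kept and the gaps counted instead (sketch):

total = pd.concat([df3, df4], axis=0, join='outer')
missing_per_year = total.isna().sum()   # how many locations lack each year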
94. Merge df5 and df6 into a single DataFrame using the merge function, with the Algeria column as the key, and output only rows present in both frames¶
In [382]:
df5 = df.T.iloc[:7,:3]
df6 = df.T.iloc[6:,2:5]
display(df5)
display(df6)
total = pd.merge(df5,df6, on='Algeria', how = 'inner')
total
| Location | Afghanistan | Albania | Algeria |
|---|---|---|---|
| 2010 | 64.023 | 11.803 | 23.540 |
| 2011 | 61.640 | 10.807 | 22.907 |
| 2012 | 59.367 | 9.943 | 22.450 |
| 2013 | 57.170 | 9.267 | 22.117 |
| 2014 | 55.080 | 8.790 | 21.850 |
| 2015 | 53.107 | 8.493 | 21.587 |
| 2016 | 51.267 | 8.363 | 21.257 |
| Location | Algeria | Andorra | Angola |
|---|---|---|---|
| 2016 | 21.257 | 3.187 | 55.510 |
| 2017 | 20.850 | 3.060 | 53.460 |
| 2018 | 20.407 | 2.933 | 51.757 |
| 2019 | 19.930 | 2.827 | 50.093 |
Out[382]:
| Location | Afghanistan | Albania | Algeria | Andorra | Angola |
|---|---|---|---|---|---|
| 0 | 51.267 | 8.363 | 21.257 | 3.187 | 55.51 |
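Merging on 'Algeria' uses its float values as the join key, so rows match only where the mortality figures happen to coincide exactly — here the single 2016 row. When the intent is to align the two frames on year, joining on the index is more robust (a sketch on the same df5/df6):

# index join; the overlapping 'Algeria' column returns as Algeria_x / Algeria_y
total = pd.merge(df5, df6, left_index=True, right_index=True, how='inner')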
95. Merge df5 and df6 into a single DataFrame using the merge function, with the Algeria column as the key, taking the union of the two¶
In [384]:
total = pd.merge(df5,df6,on='Algeria', how='outer')
total
Out[384]:
| Location | Afghanistan | Albania | Algeria | Andorra | Angola |
|---|---|---|---|---|---|
| 0 | 64.023 | 11.803 | 23.540 | NaN | NaN |
| 1 | 61.640 | 10.807 | 22.907 | NaN | NaN |
| 2 | 59.367 | 9.943 | 22.450 | NaN | NaN |
| 3 | 57.170 | 9.267 | 22.117 | NaN | NaN |
| 4 | 55.080 | 8.790 | 21.850 | NaN | NaN |
| 5 | 53.107 | 8.493 | 21.587 | NaN | NaN |
| 6 | 51.267 | 8.363 | 21.257 | 3.187 | 55.510 |
| 7 | NaN | NaN | 20.850 | 3.060 | 53.460 |
| 8 | NaN | NaN | 20.407 | 2.933 | 51.757 |
| 9 | NaN | NaN | 19.930 | 2.827 | 50.093 |
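With how='outer', adding indicator=True appends a _merge column recording whether each row came from the left frame, the right frame, or both, which is handy for sanity-checking the union (sketch):

total = pd.merge(df5, df6, on='Algeria', how='outer', indicator=True)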