Big Data Analysis Engineer (빅데이터분석기사) Exam Prep
Practical Exam Type 1 Preparation
세용용용용
2023. 4. 27. 20:32
In [6]:
import pandas as pd
df = pd.read_csv('https://raw.githubusercontent.com/Datamanim/pandas/main/lol.csv', sep='\t')
df
Out[6]:
| gameId | creationTime | gameDuration | seasonId | winner | firstBlood | firstTower | firstInhibitor | firstBaron | firstDragon | ... | t2_towerKills | t2_inhibitorKills | t2_baronKills | t2_dragonKills | t2_riftHeraldKills | t2_ban1 | t2_ban2 | t2_ban3 | t2_ban4 | t2_ban5 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3326086514 | 1504279457970 | 1949 | 9 | 1 | 2 | 1 | 1 | 1 | 1 | ... | 5 | 0 | 0 | 1 | 1 | 114 | 67 | 43 | 16 | 51 |
| 1 | 3229566029 | 1497848803862 | 1851 | 9 | 1 | 1 | 1 | 1 | 0 | 1 | ... | 2 | 0 | 0 | 0 | 0 | 11 | 67 | 238 | 51 | 420 |
| 2 | 3327363504 | 1504360103310 | 1493 | 9 | 1 | 2 | 1 | 1 | 1 | 2 | ... | 2 | 0 | 0 | 1 | 0 | 157 | 238 | 121 | 57 | 28 |
| 3 | 3326856598 | 1504348503996 | 1758 | 9 | 1 | 1 | 1 | 1 | 1 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 164 | 18 | 141 | 40 | 51 |
| 4 | 3330080762 | 1504554410899 | 2094 | 9 | 1 | 2 | 1 | 1 | 1 | 1 | ... | 3 | 0 | 0 | 1 | 0 | 86 | 11 | 201 | 122 | 18 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 51485 | 3308904636 | 1503076540231 | 1944 | 9 | 2 | 1 | 2 | 2 | 0 | 2 | ... | 10 | 2 | 0 | 4 | 0 | 55 | -1 | 90 | 238 | 157 |
| 51486 | 3215685759 | 1496957179355 | 3304 | 9 | 2 | 1 | 1 | 2 | 2 | 2 | ... | 11 | 7 | 4 | 4 | 1 | 157 | 55 | 119 | 154 | 105 |
| 51487 | 3322765040 | 1504029863961 | 2156 | 9 | 2 | 2 | 2 | 2 | 0 | 1 | ... | 10 | 2 | 0 | 2 | 0 | 113 | 122 | 53 | 11 | 157 |
| 51488 | 3256675373 | 1499562036246 | 1475 | 9 | 2 | 2 | 2 | 2 | 0 | 2 | ... | 11 | 3 | 0 | 1 | 0 | 154 | 39 | 51 | 90 | 114 |
| 51489 | 3317333020 | 1503612754059 | 1445 | 9 | 1 | 1 | 1 | 1 | 1 | 2 | ... | 1 | 0 | 0 | 1 | 0 | 11 | 157 | 141 | 31 | 18 |
51490 rows × 61 columns
In [7]:
type(df)
Out[7]:
pandas.core.frame.DataFrame
Print the first 5 rows of the data
In [20]:
answer = df.head(5)
answer
Out[20]:
| gameId | creationTime | gameDuration | seasonId | winner | firstBlood | firstTower | firstInhibitor | firstBaron | firstDragon | ... | t2_towerKills | t2_inhibitorKills | t2_baronKills | t2_dragonKills | t2_riftHeraldKills | t2_ban1 | t2_ban2 | t2_ban3 | t2_ban4 | t2_ban5 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3326086514 | 1504279457970 | 1949 | 9 | 1 | 2 | 1 | 1 | 1 | 1 | ... | 5 | 0 | 0 | 1 | 1 | 114 | 67 | 43 | 16 | 51 |
| 1 | 3229566029 | 1497848803862 | 1851 | 9 | 1 | 1 | 1 | 1 | 0 | 1 | ... | 2 | 0 | 0 | 0 | 0 | 11 | 67 | 238 | 51 | 420 |
| 2 | 3327363504 | 1504360103310 | 1493 | 9 | 1 | 2 | 1 | 1 | 1 | 2 | ... | 2 | 0 | 0 | 1 | 0 | 157 | 238 | 121 | 57 | 28 |
| 3 | 3326856598 | 1504348503996 | 1758 | 9 | 1 | 1 | 1 | 1 | 1 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 164 | 18 | 141 | 40 | 51 |
| 4 | 3330080762 | 1504554410899 | 2094 | 9 | 1 | 2 | 1 | 1 | 1 | 1 | ... | 3 | 0 | 0 | 1 | 0 | 86 | 11 | 201 | 122 | 18 |
5 rows × 61 columns
Find the number of rows and columns in the data
In [21]:
print(df.shape)
print('rows:', df.shape[0])
print('cols:', df.shape[1])
(51490, 61)
rows: 51490
cols: 61
Print all of the columns
In [22]:
answer = df.columns
answer
Out[22]:
Index(['gameId', 'creationTime', 'gameDuration', 'seasonId', 'winner',
'firstBlood', 'firstTower', 'firstInhibitor', 'firstBaron',
'firstDragon', 'firstRiftHerald', 't1_champ1id', 't1_champ1_sum1',
't1_champ1_sum2', 't1_champ2id', 't1_champ2_sum1', 't1_champ2_sum2',
't1_champ3id', 't1_champ3_sum1', 't1_champ3_sum2', 't1_champ4id',
't1_champ4_sum1', 't1_champ4_sum2', 't1_champ5id', 't1_champ5_sum1',
't1_champ5_sum2', 't1_towerKills', 't1_inhibitorKills', 't1_baronKills',
't1_dragonKills', 't1_riftHeraldKills', 't1_ban1', 't1_ban2', 't1_ban3',
't1_ban4', 't1_ban5', 't2_champ1id', 't2_champ1_sum1', 't2_champ1_sum2',
't2_champ2id', 't2_champ2_sum1', 't2_champ2_sum2', 't2_champ3id',
't2_champ3_sum1', 't2_champ3_sum2', 't2_champ4id', 't2_champ4_sum1',
't2_champ4_sum2', 't2_champ5id', 't2_champ5_sum1', 't2_champ5_sum2',
't2_towerKills', 't2_inhibitorKills', 't2_baronKills', 't2_dragonKills',
't2_riftHeraldKills', 't2_ban1', 't2_ban2', 't2_ban3', 't2_ban4',
't2_ban5'],
dtype='object')
Print the name of the 6th column
In [25]:
answer = df.columns[5]
answer
Out[25]:
'firstBlood'
Check the data type of the 6th column
In [34]:
answer = df.iloc[:,5].dtype
answer
Out[34]:
dtype('int64')
How is the dataset's index structured?
In [35]:
answer = df.index
answer
Out[35]:
RangeIndex(start=0, stop=51490, step=1)
What is the 3rd value of the 6th column?
In [38]:
answer = df.iloc[2,5]
answer
Out[38]:
2
Load the data. The column names are in Korean, so the encoding needs appropriate handling.
https://raw.githubusercontent.com/Datamanim/pandas/main/Jeju.csv
In [40]:
import pandas as pd
df = pd.read_csv('https://raw.githubusercontent.com/Datamanim/pandas/main/Jeju.csv',encoding='euc-kr')
df
Out[40]:
| id | 일자 | 시도명 | 읍면동명 | 거주인구 | 근무인구 | 방문인구 | 총 유동인구 | 평균 속도 | 평균 소요 시간 | 평균 기온 | 일강수량 | 평균 풍속 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 22448 | 2018-01-01 | 제주시 | 도두동 | 32249.987 | 3418.266 | 102709.092 | 138377.345 | 39.556 | 29.167 | 5.0 | 0.0 | 2.50 |
| 1 | 22449 | 2018-01-01 | 제주시 | 외도동 | 213500.997 | 10341.172 | 112692.789 | 336534.958 | 32.900 | 30.900 | 5.0 | 0.0 | 2.50 |
| 2 | 22450 | 2018-01-01 | 제주시 | 이도2동 | 1212382.218 | 96920.834 | 541194.481 | 1850497.533 | 29.538 | 35.692 | 2.9 | 0.0 | 2.40 |
| 3 | 22451 | 2018-01-01 | 제주시 | 일도1동 | 33991.653 | 6034.253 | 72155.919 | 112181.825 | 30.000 | 23.500 | 2.9 | 0.0 | 2.40 |
| 4 | 22452 | 2018-01-01 | 서귀포시 | 대천동 | 155036.925 | 9403.969 | 150882.409 | 315323.303 | 41.583 | 14.375 | 5.1 | 0.0 | 2.30 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 9616 | 32064 | 2020-04-30 | 서귀포시 | 표선면 | 228260.005 | 16891.732 | 152832.449 | 397984.186 | 44.514 | 34.054 | 18.1 | 0.0 | 4.70 |
| 9617 | 32065 | 2020-04-30 | 제주시 | 한림읍 | 459959.064 | 26007.122 | 249700.419 | 735666.605 | 48.609 | 61.377 | 18.0 | 0.0 | 4.35 |
| 9618 | 32066 | 2020-04-30 | 제주시 | 도두동 | 28397.481 | 3144.895 | 84052.697 | 115595.073 | 41.053 | 29.421 | 20.3 | 0.0 | 3.00 |
| 9619 | 32067 | 2020-04-30 | 서귀포시 | 안덕면 | 348037.846 | 29106.286 | 251129.660 | 628273.792 | 46.595 | 49.189 | 17.6 | 0.0 | 3.50 |
| 9620 | 32068 | 2020-04-30 | 제주시 | 연동 | 1010643.372 | 65673.477 | 447622.068 | 1523938.917 | 40.863 | 27.765 | 14.1 | 0.0 | 4.80 |
9621 rows × 13 columns
Print the last 3 rows of the data
In [41]:
answer = df.tail(3)
answer
Out[41]:
| id | 일자 | 시도명 | 읍면동명 | 거주인구 | 근무인구 | 방문인구 | 총 유동인구 | 평균 속도 | 평균 소요 시간 | 평균 기온 | 일강수량 | 평균 풍속 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 9618 | 32066 | 2020-04-30 | 제주시 | 도두동 | 28397.481 | 3144.895 | 84052.697 | 115595.073 | 41.053 | 29.421 | 20.3 | 0.0 | 3.0 |
| 9619 | 32067 | 2020-04-30 | 서귀포시 | 안덕면 | 348037.846 | 29106.286 | 251129.660 | 628273.792 | 46.595 | 49.189 | 17.6 | 0.0 | 3.5 |
| 9620 | 32068 | 2020-04-30 | 제주시 | 연동 | 1010643.372 | 65673.477 | 447622.068 | 1523938.917 | 40.863 | 27.765 | 14.1 | 0.0 | 4.8 |
Print the columns holding numeric variables
In [45]:
answer = df.select_dtypes(include=[int,float]).columns
answer
Out[45]:
Index(['id', '거주인구', '근무인구', '방문인구', '총 유동인구', '평균 속도', '평균 소요 시간', '평균 기온',
'일강수량', '평균 풍속'],
dtype='object')
Print the columns holding categorical variables
In [46]:
answer = df.select_dtypes(include=object).columns
answer
Out[46]:
Index(['일자', '시도명', '읍면동명'], dtype='object')
Count the missing values in each column
In [50]:
answer = df.isnull().sum()
answer
Out[50]:
id 0
일자 0
시도명 0
읍면동명 0
거주인구 0
근무인구 0
방문인구 0
총 유동인구 0
평균 속도 0
평균 소요 시간 0
평균 기온 0
일강수량 0
평균 풍속 0
dtype: int64
Check each column's non-null count and data type at once
In [51]:
answer = df.info()
answer
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9621 entries, 0 to 9620
Data columns (total 13 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 id 9621 non-null int64
1 일자 9621 non-null object
2 시도명 9621 non-null object
3 읍면동명 9621 non-null object
4 거주인구 9621 non-null float64
5 근무인구 9621 non-null float64
6 방문인구 9621 non-null float64
7 총 유동인구 9621 non-null float64
8 평균 속도 9621 non-null float64
9 평균 소요 시간 9621 non-null float64
10 평균 기온 9621 non-null float64
11 일강수량 9621 non-null float64
12 평균 풍속 9621 non-null float64
dtypes: float64(9), int64(1), object(3)
memory usage: 977.3+ KB
Check the distribution of each numeric variable (quartiles, mean, standard deviation, max, min)
In [54]:
answer = df.describe()
answer
Out[54]:
| id | 거주인구 | 근무인구 | 방문인구 | 총 유동인구 | 평균 속도 | 평균 소요 시간 | 평균 기온 | 일강수량 | 평균 풍속 | |
|---|---|---|---|---|---|---|---|---|---|---|
| count | 9621.000000 | 9.621000e+03 | 9621.000000 | 9621.000000 | 9.621000e+03 | 9621.000000 | 9621.000000 | 9621.000000 | 9621.000000 | 9621.000000 |
| mean | 27258.000000 | 3.174315e+05 | 35471.201510 | 195889.561802 | 5.487922e+05 | 41.109084 | 37.215873 | 13.550828 | 6.972426 | 2.753171 |
| std | 2777.487804 | 2.982079e+05 | 40381.214775 | 140706.090325 | 4.608802e+05 | 8.758631 | 12.993786 | 7.745515 | 27.617260 | 1.498538 |
| min | 22448.000000 | 9.305552e+03 | 1407.936000 | 11538.322000 | 2.225181e+04 | 24.333000 | 12.667000 | -9.600000 | 0.000000 | 0.000000 |
| 25% | 24853.000000 | 9.539939e+04 | 12074.498000 | 99632.153000 | 2.216910e+05 | 34.250000 | 27.889000 | 7.600000 | 0.000000 | 1.700000 |
| 50% | 27258.000000 | 2.221105e+05 | 21960.928000 | 152805.335000 | 3.866935e+05 | 39.640000 | 34.500000 | 13.400000 | 0.000000 | 2.400000 |
| 75% | 29663.000000 | 4.106671e+05 | 40192.032000 | 236325.109000 | 6.406918e+05 | 49.105000 | 46.176000 | 19.700000 | 1.500000 | 3.400000 |
| max | 32068.000000 | 1.364504e+06 | 263476.965000 | 723459.209000 | 2.066484e+06 | 103.000000 | 172.200000 | 30.400000 | 587.500000 | 13.333000 |
Print the values of the 거주인구 column
In [58]:
answer = df['거주인구']
answer
Out[58]:
0 32249.987
1 213500.997
2 1212382.218
3 33991.653
4 155036.925
...
9616 228260.005
9617 459959.064
9618 28397.481
9619 348037.846
9620 1010643.372
Name: 거주인구, Length: 9621, dtype: float64
Compute the interquartile range (IQR) of the 평균 속도 column
In [65]:
# IQR = third quartile - first quartile
answer = df['평균 속도'].quantile(0.75) - df['평균 속도'].quantile(0.25)
answer
Out[65]:
14.854999999999997
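As a cross-check, quantile accepts a list, so both quartiles can be fetched in one call (a minimal sketch):
q1, q3 = df['평균 속도'].quantile([0.25, 0.75])
answer = q3 - q1  # same value as above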
Print the number of unique values in the 읍면동명 column
In [71]:
answer = len(df['읍면동명'].unique())
answer
Out[71]:
41
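Equivalently, nunique counts the distinct values directly (a one-line sketch):
answer = df['읍면동명'].nunique()  # 41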
Print all unique values of the 읍면동명 column
In [72]:
answer = df['읍면동명'].unique()
answer
Out[72]:
array(['도두동', '외도동', '이도2동', '일도1동', '대천동', '서홍동', '한경면', '송산동', '조천읍',
'일도2동', '영천동', '예래동', '대륜동', '삼도1동', '이호동', '건입동', '중앙동', '삼양동',
'삼도2동', '이도1동', '남원읍', '대정읍', '정방동', '효돈동', '아라동', '한림읍', '구좌읍',
'용담1동', '오라동', '화북동', '연동', '표선면', '중문동', '성산읍', '안덕면', '천지동',
'노형동', '동홍동', '용담2동', '봉개동', '애월읍'], dtype=object)
Load the data
In [73]:
import pandas as pd
df = pd.read_csv('https://raw.githubusercontent.com/Datamanim/pandas/main/chipo.csv')
df
Out[73]:
| order_id | quantity | item_name | choice_description | item_price | |
|---|---|---|---|---|---|
| 0 | 1 | 1 | Chips and Fresh Tomato Salsa | NaN | $2.39 |
| 1 | 1 | 1 | Izze | [Clementine] | $3.39 |
| 2 | 1 | 1 | Nantucket Nectar | [Apple] | $3.39 |
| 3 | 1 | 1 | Chips and Tomatillo-Green Chili Salsa | NaN | $2.39 |
| 4 | 2 | 2 | Chicken Bowl | [Tomatillo-Red Chili Salsa (Hot), [Black Beans... | $16.98 |
| ... | ... | ... | ... | ... | ... |
| 4617 | 1833 | 1 | Steak Burrito | [Fresh Tomato Salsa, [Rice, Black Beans, Sour ... | $11.75 |
| 4618 | 1833 | 1 | Steak Burrito | [Fresh Tomato Salsa, [Rice, Sour Cream, Cheese... | $11.75 |
| 4619 | 1834 | 1 | Chicken Salad Bowl | [Fresh Tomato Salsa, [Fajita Vegetables, Pinto... | $11.25 |
| 4620 | 1834 | 1 | Chicken Salad Bowl | [Fresh Tomato Salsa, [Fajita Vegetables, Lettu... | $8.75 |
| 4621 | 1834 | 1 | Chicken Salad Bowl | [Fresh Tomato Salsa, [Fajita Vegetables, Pinto... | $8.75 |
4622 rows × 5 columns
Extract the rows where the quantity column equals 3 and print the first 5 rows
In [80]:
answer = df.loc[df['quantity']==3].head()
answer
Out[80]:
| order_id | quantity | item_name | choice_description | item_price | |
|---|---|---|---|---|---|
| 409 | 178 | 3 | Chicken Bowl | [[Fresh Tomato Salsa (Mild), Tomatillo-Green C... | $32.94 |
| 445 | 193 | 3 | Bowl | [Braised Carnitas, Pinto Beans, [Sour Cream, C... | $22.20 |
| 689 | 284 | 3 | Canned Soft Drink | [Diet Coke] | $3.75 |
| 818 | 338 | 3 | Bottled Water | NaN | $3.27 |
| 850 | 350 | 3 | Canned Soft Drink | [Sprite] | $3.75 |
In [83]:
answer.reset_index(drop=True)
Out[83]:
| order_id | quantity | item_name | choice_description | item_price | |
|---|---|---|---|---|---|
| 0 | 178 | 3 | Chicken Bowl | [[Fresh Tomato Salsa (Mild), Tomatillo-Green C... | $32.94 |
| 1 | 193 | 3 | Bowl | [Braised Carnitas, Pinto Beans, [Sour Cream, C... | $22.20 |
| 2 | 284 | 3 | Canned Soft Drink | [Diet Coke] | $3.75 |
| 3 | 338 | 3 | Bottled Water | NaN | $3.27 |
| 4 | 350 | 3 | Canned Soft Drink | [Sprite] | $3.75 |
Extract the rows where quantity equals 3, reset the index to start at 0, and print the first 5 rows
In [86]:
answer = df[df['quantity']==3].head(5).reset_index(drop=True)
answer
Out[86]:
| order_id | quantity | item_name | choice_description | item_price | |
|---|---|---|---|---|---|
| 0 | 178 | 3 | Chicken Bowl | [[Fresh Tomato Salsa (Mild), Tomatillo-Green C... | $32.94 |
| 1 | 193 | 3 | Bowl | [Braised Carnitas, Pinto Beans, [Sour Cream, C... | $22.20 |
| 2 | 284 | 3 | Canned Soft Drink | [Diet Coke] | $3.75 |
| 3 | 338 | 3 | Bottled Water | NaN | $3.27 |
| 4 | 350 | 3 | Canned Soft Drink | [Sprite] | $3.75 |
Define a new DataFrame made up of the two columns quantity and item_price
In [87]:
answer = df[['quantity','item_price']]
answer
Out[87]:
| quantity | item_price | |
|---|---|---|
| 0 | 1 | $2.39 |
| 1 | 1 | $3.39 |
| 2 | 1 | $3.39 |
| 3 | 1 | $2.39 |
| 4 | 2 | $16.98 |
| ... | ... | ... |
| 4617 | 1 | $11.75 |
| 4618 | 1 | $11.75 |
| 4619 | 1 | $11.25 |
| 4620 | 1 | $8.75 |
| 4621 | 1 | $8.75 |
4622 rows × 2 columns
Strip the dollar sign from the item_price column, convert it to float, and store the result in a new_price column
In [99]:
df['new_price'] = df['item_price'].str.replace('$','').astype('float')
answer = df['new_price'].head()
answer
C:\Users\Public\Documents\ESTsoft\CreatorTemp/ipykernel_80432/1472329730.py:1: FutureWarning: The default value of regex will change from True to False in a future version. In addition, single character regular expressions will *not* be treated as literal strings when regex=True.
df['new_price'] = df['item_price'].str.replace('$','').astype('float')
Out[99]:
0 2.39
1 3.39
2 3.39
3 2.39
4 16.98
Name: new_price, dtype: float64
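The FutureWarning above appears because '$' is a regex metacharacter; passing regex=False makes str.replace treat it as a literal string and silences the warning (a sketch):
df['new_price'] = df['item_price'].str.replace('$', '', regex=False).astype(float)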
Extract the rows where item_name is Chicken Salad Bowl and reset the index
In [103]:
answer = df[df['item_name']=='Chicken Salad Bowl'].reset_index(drop=True)
answer.head(3)
Out[103]:
| order_id | quantity | item_name | choice_description | item_price | new_price | |
|---|---|---|---|---|---|---|
| 0 | 20 | 1 | Chicken Salad Bowl | [Fresh Tomato Salsa, [Fajita Vegetables, Pinto... | $8.75 | 8.75 |
| 1 | 60 | 2 | Chicken Salad Bowl | [Tomatillo Green Chili Salsa, [Sour Cream, Che... | $22.50 | 22.50 |
| 2 | 94 | 2 | Chicken Salad Bowl | [Fresh Tomato Salsa, [Fajita Vegetables, Pinto... | $22.50 | 22.50 |
Extract the rows where new_price is at most 9 and item_name is Chicken Salad Bowl
In [111]:
answer = df.loc[(df['new_price']<=9) &(df['item_name']=='Chicken Salad Bowl')]
answer.head(5)
Out[111]:
| order_id | quantity | item_name | choice_description | item_price | new_price | |
|---|---|---|---|---|---|---|
| 44 | 20 | 1 | Chicken Salad Bowl | [Fresh Tomato Salsa, [Fajita Vegetables, Pinto... | $8.75 | 8.75 |
| 256 | 111 | 1 | Chicken Salad Bowl | [Fresh Tomato Salsa, [Fajita Vegetables, Rice,... | $8.75 | 8.75 |
| 526 | 220 | 1 | Chicken Salad Bowl | [Roasted Chili Corn Salsa, [Black Beans, Sour ... | $8.75 | 8.75 |
| 528 | 221 | 1 | Chicken Salad Bowl | [Tomatillo Green Chili Salsa, [Fajita Vegetabl... | $8.75 | 8.75 |
| 529 | 221 | 1 | Chicken Salad Bowl | [Tomatillo Green Chili Salsa, [Fajita Vegetabl... | $8.75 | 8.75 |
Sort df by new_price in ascending order and reset the index
In [125]:
answer = df.sort_values(by='new_price', ascending=True).reset_index(drop=True)
answer.head(4)
Out[125]:
| order_id | quantity | item_name | choice_description | item_price | new_price | |
|---|---|---|---|---|---|---|
| 0 | 471 | 1 | Bottled Water | NaN | $1.09 | 1.09 |
| 1 | 338 | 1 | Canned Soda | [Coca Cola] | $1.09 | 1.09 |
| 2 | 1575 | 1 | Canned Soda | [Dr. Pepper] | $1.09 | 1.09 |
| 3 | 47 | 1 | Canned Soda | [Dr. Pepper] | $1.09 | 1.09 |
Print the rows of df whose item_name contains Chips
In [129]:
answer = df.loc[df['item_name'].str.contains('Chips')]
answer.head(5)
Out[129]:
| order_id | quantity | item_name | choice_description | item_price | new_price | |
|---|---|---|---|---|---|---|
| 0 | 1 | 1 | Chips and Fresh Tomato Salsa | NaN | $2.39 | 2.39 |
| 3 | 1 | 1 | Chips and Tomatillo-Green Chili Salsa | NaN | $2.39 | 2.39 |
| 6 | 3 | 1 | Side of Chips | NaN | $1.69 | 1.69 |
| 10 | 5 | 1 | Chips and Guacamole | NaN | $4.45 | 4.45 |
| 14 | 7 | 1 | Chips and Guacamole | NaN | $4.45 | 4.45 |
Output a DataFrame containing only the even-numbered columns of df
In [133]:
answer = df.loc[:,::2]
answer.head(5)
Out[133]:
| order_id | item_name | item_price | |
|---|---|---|---|
| 0 | 1 | Chips and Fresh Tomato Salsa | $2.39 |
| 1 | 1 | Izze | $3.39 |
| 2 | 1 | Nantucket Nectar | $3.39 |
| 3 | 1 | Chips and Tomatillo-Green Chili Salsa | $2.39 |
| 4 | 2 | Chicken Bowl | $16.98 |
Sort df by new_price in descending order and reset the index
In [140]:
answer = df.sort_values(by='new_price',ascending=False).reset_index(drop=True)
answer.head(4)
Out[140]:
| order_id | quantity | item_name | choice_description | item_price | new_price | |
|---|---|---|---|---|---|---|
| 0 | 1443 | 15 | Chips and Fresh Tomato Salsa | NaN | $44.25 | 44.25 |
| 1 | 1398 | 3 | Carnitas Bowl | [Roasted Chili Corn Salsa, [Fajita Vegetables,... | $35.25 | 35.25 |
| 2 | 511 | 4 | Chicken Burrito | [Fresh Tomato Salsa, [Fajita Vegetables, Rice,... | $35.00 | 35.00 |
| 3 | 1443 | 4 | Chicken Burrito | [Fresh Tomato Salsa, [Rice, Black Beans, Chees... | $35.00 | 35.00 |
Index the rows of df whose item_name is Steak Salad or Bowl
In [142]:
answer=df.loc[(df['item_name']=='Steak Salad') | (df['item_name']=='Bowl')]
answer
Out[142]:
| order_id | quantity | item_name | choice_description | item_price | new_price | |
|---|---|---|---|---|---|---|
| 445 | 193 | 3 | Bowl | [Braised Carnitas, Pinto Beans, [Sour Cream, C... | $22.20 | 22.20 |
| 664 | 276 | 1 | Steak Salad | [Tomatillo-Red Chili Salsa (Hot), [Black Beans... | $8.99 | 8.99 |
| 673 | 279 | 1 | Bowl | [Adobo-Marinated and Grilled Steak, [Sour Crea... | $7.40 | 7.40 |
| 752 | 311 | 1 | Steak Salad | [Tomatillo-Red Chili Salsa (Hot), [Black Beans... | $8.99 | 8.99 |
| 893 | 369 | 1 | Steak Salad | [Fresh Tomato Salsa (Mild), [Rice, Cheese, Sou... | $8.99 | 8.99 |
| 3502 | 1406 | 1 | Steak Salad | [[Lettuce, Fajita Veggies]] | $8.69 | 8.69 |
In [147]:
answer = df.loc[(df['item_name']=='Steak Salad')|(df['item_name']=='Bowl')].drop_duplicates('item_name')
answer
Out[147]:
| order_id | quantity | item_name | choice_description | item_price | new_price | |
|---|---|---|---|---|---|---|
| 445 | 193 | 3 | Bowl | [Braised Carnitas, Pinto Beans, [Sour Cream, C... | $22.20 | 22.20 |
| 664 | 276 | 1 | Steak Salad | [Tomatillo-Red Chili Salsa (Hot), [Black Beans... | $8.99 | 8.99 |
In [149]:
answer = df.loc[(df['item_name']=='Steak Salad') | (df['item_name']=='Bowl')].drop_duplicates('item_name',keep='last')
answer
Out[149]:
| order_id | quantity | item_name | choice_description | item_price | new_price | |
|---|---|---|---|---|---|---|
| 673 | 279 | 1 | Bowl | [Adobo-Marinated and Grilled Steak, [Sour Crea... | $7.40 | 7.40 |
| 3502 | 1406 | 1 | Steak Salad | [[Lettuce, Fajita Veggies]] | $8.69 | 8.69 |
In [152]:
answer = df.loc[df['new_price']>=df['new_price'].mean()]
answer
Out[152]:
| order_id | quantity | item_name | choice_description | item_price | new_price | |
|---|---|---|---|---|---|---|
| 4 | 2 | 2 | Chicken Bowl | [Tomatillo-Red Chili Salsa (Hot), [Black Beans... | $16.98 | 16.98 |
| 5 | 3 | 1 | Chicken Bowl | [Fresh Tomato Salsa (Mild), [Rice, Cheese, Sou... | $10.98 | 10.98 |
| 7 | 4 | 1 | Steak Burrito | [Tomatillo Red Chili Salsa, [Fajita Vegetables... | $11.75 | 11.75 |
| 8 | 4 | 1 | Steak Soft Tacos | [Tomatillo Green Chili Salsa, [Pinto Beans, Ch... | $9.25 | 9.25 |
| 9 | 5 | 1 | Steak Burrito | [Fresh Tomato Salsa, [Rice, Black Beans, Pinto... | $9.25 | 9.25 |
| ... | ... | ... | ... | ... | ... | ... |
| 4617 | 1833 | 1 | Steak Burrito | [Fresh Tomato Salsa, [Rice, Black Beans, Sour ... | $11.75 | 11.75 |
| 4618 | 1833 | 1 | Steak Burrito | [Fresh Tomato Salsa, [Rice, Sour Cream, Cheese... | $11.75 | 11.75 |
| 4619 | 1834 | 1 | Chicken Salad Bowl | [Fresh Tomato Salsa, [Fajita Vegetables, Pinto... | $11.25 | 11.25 |
| 4620 | 1834 | 1 | Chicken Salad Bowl | [Fresh Tomato Salsa, [Fajita Vegetables, Lettu... | $8.75 | 8.75 |
| 4621 | 1834 | 1 | Chicken Salad Bowl | [Fresh Tomato Salsa, [Fajita Vegetables, Pinto... | $8.75 | 8.75 |
2890 rows × 6 columns
36. In df, change the item_name value Izze to Fizzy Lizzy
In [159]:
df.loc[df['item_name']=='Izze','item_name'] = 'Fizzy Lizzy'
df.head()
Out[159]:
| order_id | quantity | item_name | choice_description | item_price | new_price | |
|---|---|---|---|---|---|---|
| 0 | 1 | 1 | Chips and Fresh Tomato Salsa | NaN | $2.39 | 2.39 |
| 1 | 1 | 1 | Fizzy Lizzy | [Clementine] | $3.39 | 3.39 |
| 2 | 1 | 1 | Nantucket Nectar | [Apple] | $3.39 | 3.39 |
| 3 | 1 | 1 | Chips and Tomatillo-Green Chili Salsa | NaN | $2.39 | 2.39 |
| 4 | 2 | 2 | Chicken Bowl | [Tomatillo-Red Chili Salsa (Hot), [Black Beans... | $16.98 | 16.98 |
37. Count the rows of df whose choice_description is NaN
In [162]:
answer = df['choice_description'].isnull().sum()
answer
Out[162]:
1246
38. Replace the NaN values of choice_description with NoData (use loc)
In [166]:
df.loc[df['choice_description'].isnull(),'choice_description']='NoData'
df.head()
Out[166]:
| order_id | quantity | item_name | choice_description | item_price | new_price | |
|---|---|---|---|---|---|---|
| 0 | 1 | 1 | Chips and Fresh Tomato Salsa | NoData | $2.39 | 2.39 |
| 1 | 1 | 1 | Fizzy Lizzy | [Clementine] | $3.39 | 3.39 |
| 2 | 1 | 1 | Nantucket Nectar | [Apple] | $3.39 | 3.39 |
| 3 | 1 | 1 | Chips and Tomatillo-Green Chili Salsa | NoData | $2.39 | 2.39 |
| 4 | 2 | 2 | Chicken Bowl | [Tomatillo-Red Chili Salsa (Hot), [Black Beans... | $16.98 | 16.98 |
39. Index the rows of df whose choice_description contains Black
In [168]:
answer = df.loc[df['choice_description'].str.contains('Black')]
answer.head(5)
Out[168]:
| order_id | quantity | item_name | choice_description | item_price | new_price | |
|---|---|---|---|---|---|---|
| 4 | 2 | 2 | Chicken Bowl | [Tomatillo-Red Chili Salsa (Hot), [Black Beans... | $16.98 | 16.98 |
| 7 | 4 | 1 | Steak Burrito | [Tomatillo Red Chili Salsa, [Fajita Vegetables... | $11.75 | 11.75 |
| 9 | 5 | 1 | Steak Burrito | [Fresh Tomato Salsa, [Rice, Black Beans, Pinto... | $9.25 | 9.25 |
| 11 | 6 | 1 | Chicken Crispy Tacos | [Roasted Chili Corn Salsa, [Fajita Vegetables,... | $8.75 | 8.75 |
| 12 | 6 | 1 | Chicken Soft Tacos | [Roasted Chili Corn Salsa, [Rice, Black Beans,... | $8.75 | 8.75 |
40. Count the rows of df whose choice_description does not contain Vegetables
In [171]:
answer = len(df.loc[~df['choice_description'].str.contains('Vegetables')])
answer
Out[171]:
3900
In [172]:
answer = df.loc[df['item_name'].str.startswith('N')]
answer.head(3)
Out[172]:
| order_id | quantity | item_name | choice_description | item_price | new_price | |
|---|---|---|---|---|---|---|
| 2 | 1 | 1 | Nantucket Nectar | [Apple] | $3.39 | 3.39 |
| 22 | 11 | 1 | Nantucket Nectar | [Pomegranate Cherry] | $3.39 | 3.39 |
| 105 | 46 | 1 | Nantucket Nectar | [Pineapple Orange Banana] | $3.39 | 3.39 |
In [173]:
answer = df.loc[df['item_name'].str.len()>=15]
answer.head(3)
Out[173]:
| order_id | quantity | item_name | choice_description | item_price | new_price | |
|---|---|---|---|---|---|---|
| 0 | 1 | 1 | Chips and Fresh Tomato Salsa | NoData | $2.39 | 2.39 |
| 2 | 1 | 1 | Nantucket Nectar | [Apple] | $3.39 | 3.39 |
| 3 | 1 | 1 | Chips and Tomatillo-Green Chili Salsa | NoData | $2.39 | 2.39 |
In [176]:
lst=[1.69, 2.39, 3.39, 4.45, 9.25, 10.98, 11.75, 16.98]
answer = len(df.loc[df['new_price'].isin(lst)])
answer
Out[176]:
1393
44. Load the data and print the first 5 rows
In [2]:
import pandas as pd
df = pd.read_csv('https://raw.githubusercontent.com/Datamanim/pandas/main/AB_NYC_2019.csv')
answer = df.head(5)
answer
Out[2]:
| id | name | host_id | host_name | neighbourhood_group | neighbourhood | latitude | longitude | room_type | price | minimum_nights | number_of_reviews | last_review | reviews_per_month | calculated_host_listings_count | availability_365 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2539 | Clean & quiet apt home by the park | 2787 | John | Brooklyn | Kensington | 40.64749 | -73.97237 | Private room | 149 | 1 | 9 | 2018-10-19 | 0.21 | 6 | 365 |
| 1 | 2595 | Skylit Midtown Castle | 2845 | Jennifer | Manhattan | Midtown | 40.75362 | -73.98377 | Entire home/apt | 225 | 1 | 45 | 2019-05-21 | 0.38 | 2 | 355 |
| 2 | 3647 | THE VILLAGE OF HARLEM....NEW YORK ! | 4632 | Elisabeth | Manhattan | Harlem | 40.80902 | -73.94190 | Private room | 150 | 3 | 0 | NaN | NaN | 1 | 365 |
| 3 | 3831 | Cozy Entire Floor of Brownstone | 4869 | LisaRoxanne | Brooklyn | Clinton Hill | 40.68514 | -73.95976 | Entire home/apt | 89 | 1 | 270 | 2019-07-05 | 4.64 | 1 | 194 |
| 4 | 5022 | Entire Apt: Spacious Studio/Loft by central park | 7192 | Laura | Manhattan | East Harlem | 40.79851 | -73.94399 | Entire home/apt | 80 | 10 | 9 | 2018-11-19 | 0.10 | 1 | 0 |
45. Count the frequency of each host_name, sort by host_name, and print the top 5
In [13]:
answer = df['host_name'].value_counts().sort_index()
answer.head(5)
Out[13]:
'Cil 1
(Ari) HENRY LEE 1
(Email hidden by Airbnb) 6
(Mary) Haiy 1
-TheQueensCornerLot 1
Name: host_name, dtype: int64
46. Count the frequency of each host_name and build a DataFrame sorted by frequency in descending order; name the frequency column counts
In [31]:
df.groupby('host_name').size().to_frame().rename(columns={0:'counts'}).sort_values(by='counts',ascending=False)
Out[31]:
| counts | |
|---|---|
| host_name | |
| Michael | 417 |
| David | 403 |
| Sonder (NYC) | 327 |
| John | 294 |
| Alex | 279 |
| ... | ... |
| Jerbean | 1 |
| Jerald | 1 |
| Jeonghoon | 1 |
| Jeny | 1 |
| 현선 | 1 |
11452 rows × 1 columns
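A shorter equivalent sketch: value_counts already returns frequencies in descending order, so only the column name needs to be set:
df['host_name'].value_counts().to_frame('counts')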
47. Count the neighbourhood values within each neighbourhood_group
In [50]:
df.groupby(['neighbourhood_group','neighbourhood'],as_index=False).size()
Out[50]:
| neighbourhood_group | neighbourhood | size | |
|---|---|---|---|
| 0 | Bronx | Allerton | 42 |
| 1 | Bronx | Baychester | 7 |
| 2 | Bronx | Belmont | 24 |
| 3 | Bronx | Bronxdale | 19 |
| 4 | Bronx | Castle Hill | 9 |
| ... | ... | ... | ... |
| 216 | Staten Island | Tottenville | 7 |
| 217 | Staten Island | West Brighton | 18 |
| 218 | Staten Island | Westerleigh | 2 |
| 219 | Staten Island | Willowbrook | 1 |
| 220 | Staten Island | Woodrow | 1 |
221 rows × 3 columns
48. Among the neighbourhood counts by neighbourhood_group, print the maximum values for each neighbourhood_group
In [57]:
df.groupby(['neighbourhood_group','neighbourhood'],as_index=False).size().groupby('neighbourhood_group',as_index=False).max()
Out[57]:
| neighbourhood_group | neighbourhood | size | |
|---|---|---|---|
| 0 | Bronx | Woodlawn | 70 |
| 1 | Brooklyn | Windsor Terrace | 3920 |
| 2 | Manhattan | West Village | 2658 |
| 3 | Queens | Woodside | 900 |
| 4 | Staten Island | Woodrow | 48 |
49. Compute the mean, variance, max, and min of price by neighbourhood_group
In [60]:
df[['neighbourhood_group','price']].groupby('neighbourhood_group').agg(['mean','var','max','min'])
Out[60]:
| price | ||||
|---|---|---|---|---|
| mean | var | max | min | |
| neighbourhood_group | ||||
| Bronx | 87.496792 | 11386.885081 | 2500 | 0 |
| Brooklyn | 124.383207 | 34921.719135 | 10000 | 0 |
| Manhattan | 196.875814 | 84904.159185 | 10000 | 0 |
| Queens | 99.517649 | 27923.130227 | 10000 | 10 |
| Staten Island | 114.812332 | 77073.088342 | 5000 | 13 |
50. Compute the mean, variance, max, and min of reviews_per_month by neighbourhood_group
In [65]:
df[['neighbourhood_group','reviews_per_month']].groupby('neighbourhood_group').agg(['mean','var','max','min'])
Out[65]:
| reviews_per_month | ||||
|---|---|---|---|---|
| mean | var | max | min | |
| neighbourhood_group | ||||
| Bronx | 1.837831 | 2.799878 | 10.34 | 0.02 |
| Brooklyn | 1.283212 | 2.299040 | 14.00 | 0.01 |
| Manhattan | 1.272131 | 2.651206 | 58.50 | 0.01 |
| Queens | 1.941200 | 4.897848 | 20.94 | 0.01 |
| Staten Island | 1.872580 | 2.840895 | 10.12 | 0.02 |
51. Compute the mean price by neighbourhood and neighbourhood_group
In [5]:
df.groupby(['neighbourhood','neighbourhood_group'],as_index=False).price.mean()
Out[5]:
| neighbourhood | neighbourhood_group | price | |
|---|---|---|---|
| 0 | Allerton | Bronx | 87.595238 |
| 1 | Arden Heights | Staten Island | 67.250000 |
| 2 | Arrochar | Staten Island | 115.000000 |
| 3 | Arverne | Queens | 171.779221 |
| 4 | Astoria | Queens | 117.187778 |
| ... | ... | ... | ... |
| 216 | Windsor Terrace | Brooklyn | 138.993631 |
| 217 | Woodhaven | Queens | 67.170455 |
| 218 | Woodlawn | Bronx | 60.090909 |
| 219 | Woodrow | Staten Island | 700.000000 |
| 220 | Woodside | Queens | 85.097872 |
221 rows × 3 columns
In [7]:
df.groupby(['neighbourhood','neighbourhood_group']).price.mean().unstack()
Out[7]:
| neighbourhood_group | Bronx | Brooklyn | Manhattan | Queens | Staten Island |
|---|---|---|---|---|---|
| neighbourhood | |||||
| Allerton | 87.595238 | NaN | NaN | NaN | NaN |
| Arden Heights | NaN | NaN | NaN | NaN | 67.25 |
| Arrochar | NaN | NaN | NaN | NaN | 115.00 |
| Arverne | NaN | NaN | NaN | 171.779221 | NaN |
| Astoria | NaN | NaN | NaN | 117.187778 | NaN |
| ... | ... | ... | ... | ... | ... |
| Windsor Terrace | NaN | 138.993631 | NaN | NaN | NaN |
| Woodhaven | NaN | NaN | NaN | 67.170455 | NaN |
| Woodlawn | 60.090909 | NaN | NaN | NaN | NaN |
| Woodrow | NaN | NaN | NaN | NaN | 700.00 |
| Woodside | NaN | NaN | NaN | 85.097872 | NaN |
221 rows × 5 columns
In [10]:
df.groupby(['neighbourhood','neighbourhood_group']).price.mean().unstack().fillna('-999')
Out[10]:
| neighbourhood_group | Bronx | Brooklyn | Manhattan | Queens | Staten Island |
|---|---|---|---|---|---|
| neighbourhood | |||||
| Allerton | 87.595238 | -999 | -999 | -999 | -999 |
| Arden Heights | -999 | -999 | -999 | -999 | 67.25 |
| Arrochar | -999 | -999 | -999 | -999 | 115.0 |
| Arverne | -999 | -999 | -999 | 171.779221 | -999 |
| Astoria | -999 | -999 | -999 | 117.187778 | -999 |
| ... | ... | ... | ... | ... | ... |
| Windsor Terrace | -999 | 138.993631 | -999 | -999 | -999 |
| Woodhaven | -999 | -999 | -999 | 67.170455 | -999 |
| Woodlawn | 60.090909 | -999 | -999 | -999 | -999 |
| Woodrow | -999 | -999 | -999 | -999 | 700.0 |
| Woodside | -999 | -999 | -999 | 85.097872 | -999 |
221 rows × 5 columns
54. Among the rows whose neighbourhood_group is Queens, compute the mean, variance, max, and min of price for each neighbourhood
In [24]:
df.loc[df['neighbourhood_group']=='Queens'].groupby('neighbourhood').price.agg(['mean','var','max','min'])
Out[24]:
| mean | var | max | min | |
|---|---|---|---|---|
| neighbourhood | ||||
| Arverne | 171.779221 | 37383.411141 | 1500 | 35 |
| Astoria | 117.187778 | 122428.811196 | 10000 | 25 |
| Bay Terrace | 142.000000 | 6816.400000 | 258 | 32 |
| Bayside | 157.948718 | 166106.470985 | 2600 | 30 |
| Bayswater | 87.470588 | 2330.889706 | 230 | 45 |
| Belle Harbor | 171.500000 | 8226.571429 | 350 | 85 |
| Bellerose | 99.357143 | 3093.016484 | 240 | 42 |
| Breezy Point | 213.333333 | 1008.333333 | 250 | 195 |
| Briarwood | 105.875000 | 18503.165909 | 1000 | 30 |
| Cambria Heights | 81.730769 | 2960.604615 | 250 | 31 |
| College Point | 88.000000 | 6445.555556 | 400 | 44 |
| Corona | 59.171875 | 2430.557292 | 359 | 23 |
| Ditmars Steinway | 95.029126 | 4214.112785 | 600 | 22 |
| Douglaston | 82.750000 | 2347.071429 | 178 | 40 |
| East Elmhurst | 81.183784 | 4369.487779 | 700 | 16 |
| Edgemere | 94.727273 | 4219.418182 | 200 | 40 |
| Elmhurst | 80.459916 | 3896.563005 | 443 | 15 |
| Far Rockaway | 165.862069 | 52575.980296 | 900 | 35 |
| Flushing | 93.514085 | 12417.177448 | 1500 | 15 |
| Forest Hills | 121.625000 | 44533.368881 | 2350 | 16 |
| Fresh Meadows | 99.500000 | 6394.451613 | 375 | 39 |
| Glendale | 90.796296 | 3230.882250 | 299 | 25 |
| Hollis | 88.642857 | 1483.016484 | 175 | 50 |
| Holliswood | 135.750000 | 5091.583333 | 239 | 79 |
| Howard Beach | 115.400000 | 4241.936842 | 250 | 40 |
| Jackson Heights | 80.897849 | 2150.470590 | 260 | 23 |
| Jamaica | 95.770563 | 8067.047130 | 1000 | 10 |
| Jamaica Estates | 182.947368 | 29006.052632 | 750 | 35 |
| Jamaica Hills | 132.125000 | 11304.696429 | 325 | 50 |
| Kew Gardens | 88.375000 | 2632.306452 | 219 | 35 |
| Kew Gardens Hills | 112.307692 | 6230.621538 | 399 | 40 |
| Laurelton | 95.333333 | 4355.529412 | 254 | 34 |
| Little Neck | 75.200000 | 2368.700000 | 149 | 32 |
| Long Island City | 127.465549 | 15529.503009 | 2000 | 30 |
| Maspeth | 83.645455 | 2223.607089 | 245 | 11 |
| Middle Village | 109.580645 | 3537.118280 | 265 | 28 |
| Neponsit | 274.666667 | 5625.333333 | 350 | 200 |
| Ozone Park | 85.274194 | 1705.415389 | 200 | 25 |
| Queens Village | 83.933333 | 2386.470056 | 320 | 25 |
| Rego Park | 83.877358 | 3659.003863 | 300 | 21 |
| Richmond Hill | 87.117021 | 3474.147449 | 300 | 28 |
| Ridgewood | 77.184397 | 2629.311889 | 375 | 20 |
| Rockaway Beach | 132.178571 | 6254.622078 | 545 | 49 |
| Rosedale | 76.694915 | 2963.939801 | 350 | 22 |
| South Ozone Park | 82.400000 | 4619.989744 | 400 | 29 |
| Springfield Gardens | 94.235294 | 2490.848739 | 300 | 40 |
| St. Albans | 100.828947 | 9489.903684 | 600 | 25 |
| Sunnyside | 84.865014 | 2727.746922 | 600 | 12 |
| Whitestone | 107.545455 | 13632.072727 | 400 | 35 |
| Woodhaven | 67.170455 | 1642.464864 | 250 | 10 |
| Woodside | 85.097872 | 5001.900636 | 500 | 28 |
55. Count the room_type values by neighbourhood_group and compute each room_type's share within its neighbourhood_group
In [50]:
answer = df[['neighbourhood_group','room_type']].groupby(['neighbourhood_group','room_type']).size().unstack()
answer.loc[:,:] = (answer.values / answer.sum(axis=1).values.reshape(-1,1))
answer
Out[50]:
| room_type | Entire home/apt | Private room | Shared room |
|---|---|---|---|
| neighbourhood_group | |||
| Bronx | 0.347388 | 0.597617 | 0.054995 |
| Brooklyn | 0.475478 | 0.503979 | 0.020543 |
| Manhattan | 0.609344 | 0.368496 | 0.022160 |
| Queens | 0.369926 | 0.595129 | 0.034945 |
| Staten Island | 0.471850 | 0.504021 | 0.024129 |
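The in-place assignment through .values can be avoided with div, which broadcasts the row sums along the index (a sketch):
counts = df.groupby(['neighbourhood_group', 'room_type']).size().unstack()
answer = counts.div(counts.sum(axis=1), axis=0)  # each row sums to 1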
In [89]:
import pandas as pd
df = pd.read_csv('https://raw.githubusercontent.com/Datamanim/pandas/main/BankChurnersUp.csv',index_col='Unnamed: 0')
df.shape
Out[89]:
(10127, 18)
57. Using the map function, recode the Income_Category categories as follows into a newIncome column
Unknown : N
Less than $40K : a
$40K - $60K : b
$60K - $80K : c
$80K - $120K : d
$120K + : e
In [32]:
dic = {
    'Unknown' : 'N',
    'Less than $40K' : 'a',
    '$40K - $60K' : 'b',
    '$60K - $80K' : 'c',
    '$80K - $120K' : 'd',
    '$120K +' : 'e'
}
df['newIncome'] = df['Income_Category'].map(dic)
df['newIncome']
Out[32]:
0 c
1 a
2 d
3 a
4 c
..
10122 b
10123 b
10124 a
10125 b
10126 a
Name: newIncome, Length: 10127, dtype: object
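replace(dic) would give a similar result; the practical difference is that map returns NaN for categories missing from dic, while replace leaves them unchanged (a sketch):
df['newIncome'] = df['Income_Category'].replace(dic)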
58. Using the apply function, recode the Income_Category categories as follows into a newIncome column
Unknown : N
Less than $40K : a
$40K - $60K : b
$60K - $80K : c
$80K - $120K : d
$120K + : e
In [33]:
def solution(x):
    if x == 'Unknown':
        return 'N'
    elif x == 'Less than $40K':
        return 'a'
    elif x == '$40K - $60K':
        return 'b'
    elif x == '$60K - $80K':
        return 'c'
    elif x == '$80K - $120K':
        return 'd'
    elif x == '$120K +':
        return 'e'
df['newIncome'] = df['Income_Category'].apply(solution)
df['newIncome']
Out[33]:
0 c
1 a
2 d
3 a
4 c
..
10122 b
10123 b
10124 a
10125 b
10126 a
Name: newIncome, Length: 10127, dtype: object
59. Using Customer_Age, define age bins in an AgeState column.
0~9 : 0
10~19 : 10
20~29 : 20 ...
Print the frequency of each bin.
In [34]:
def solution(age):
    if 0 <= age <= 9:
        return 0
    elif 10 <= age <= 19:
        return 10
    elif 20 <= age <= 29:
        return 20
    elif 30 <= age <= 39:
        return 30
    elif 40 <= age <= 49:
        return 40
    elif 50 <= age <= 59:
        return 50
    elif 60 <= age <= 69:
        return 60
    elif 70 <= age <= 79:
        return 70
df['AgeState'] = df['Customer_Age'].apply(solution)
df['AgeState'].value_counts().sort_index()
Out[34]:
20 195
30 1841
40 4561
50 2998
60 530
70 2
Name: AgeState, dtype: int64
In [35]:
df['AgeState'] = df['Customer_Age'].map(lambda x: (x//10)*10)
df['AgeState'].value_counts().sort_index()
Out[35]:
20 195
30 1841
40 4561
50 2998
60 530
70 2
Name: AgeState, dtype: int64
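pd.cut expresses the same binning declaratively (a sketch; it assumes every age is below 80, since values outside the bins become NaN, and unused bins appear with a count of 0):
df['AgeState'] = pd.cut(df['Customer_Age'], bins=list(range(0, 90, 10)),
                        right=False, labels=list(range(0, 80, 10)))
df['AgeState'].value_counts().sort_index()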
60. Define a newEduLevel column that is 1 when Education_Level contains the word Graduate and 0 otherwise, then print its frequencies
In [36]:
def solution(x):  # parameter renamed from 'str', which would shadow the built-in
    if 'Graduate' in x:
        return 1
    else:
        return 0
df['newEduLevel'] = df['Education_Level'].apply(solution)
df['newEduLevel'].value_counts()
Out[36]:
0 6483
1 3644
Name: newEduLevel, dtype: int64
In [37]:
df['newEduLevel'] = df['Education_Level'].map(lambda x : 1 if 'Graduate' in x else 0)
df['newEduLevel'].value_counts()
Out[37]:
0 6483
1 3644
Name: newEduLevel, dtype: int64
61. Define newLimit as 1 when Credit_Limit is 4500 or more and 0 otherwise. Print the frequency of each newLimit value
In [38]:
df['newLimit'] = df['Credit_Limit'].map(lambda x: 1 if x>=4500 else 0)
df['newLimit'].value_counts()
Out[38]:
1 5096
0 5031
Name: newLimit, dtype: int64
In [39]:
def solution(num):
    if num >= 4500:
        return 1
    else:
        return 0
df['newLimit'] = df['Credit_Limit'].apply(solution)
df['newLimit'].value_counts()
Out[39]:
1 5096
0 5031
Name: newLimit, dtype: int64
62. Define a newState column that is 1 when Marital_Status is Married and Card_Category is Platinum, and 0 otherwise. Print the frequency of each newState value
In [40]:
def solution(x):
    if x.Marital_Status == 'Married' and x.Card_Category == 'Platinum':
        return 1
    else:
        return 0
df['newState'] = df.apply(solution,axis=1)
df['newState'].value_counts()
Out[40]:
0 10120
1 7
Name: newState, dtype: int64
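apply with axis=1 runs a Python function row by row; a vectorized equivalent sketch:
df['newState'] = ((df['Marital_Status'] == 'Married') & (df['Card_Category'] == 'Platinum')).astype(int)
df['newState'].value_counts()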
63. Recode the Gender column in place: M to male, F to female. Print the frequency of each value
In [42]:
df['Gender'] = df['Gender'].map(lambda x : 'male' if x =='M' else 'female')
df['Gender'].value_counts()
Out[42]:
female 5358
male 4769
Name: Gender, dtype: int64
In [49]:
def solution(x):
    if x == 'M':
        return 'male'
    else:
        return 'female'
df['Gender'] = df['Gender'].apply(solution)
df['Gender'].value_counts()
Out[49]:
female 5358
male 4769
Name: Gender, dtype: int64
64. Load the data and identify each column's data type
In [2]:
import pandas as pd
df = pd.read_csv('https://raw.githubusercontent.com/Datamanim/pandas/main/timeTest.csv')
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6574 entries, 0 to 6573
Data columns (total 13 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Yr_Mo_Dy 6574 non-null object
1 RPT 6568 non-null float64
2 VAL 6571 non-null float64
3 ROS 6572 non-null float64
4 KIL 6569 non-null float64
5 SHA 6572 non-null float64
6 BIR 6574 non-null float64
7 DUB 6571 non-null float64
8 CLA 6572 non-null float64
9 MUL 6571 non-null float64
10 CLO 6573 non-null float64
11 BEL 6574 non-null float64
12 MAL 6570 non-null float64
dtypes: float64(12), object(1)
memory usage: 667.8+ KB
65. Convert Yr_Mo_Dy to the datetime64 type pandas can recognize
In [3]:
# Method 1 (unit-less 'datetime64' is rejected by newer pandas, so the unit is spelled out)
answer = df['Yr_Mo_Dy'].astype('datetime64[ns]')
answer.head(4)
# Method 2
answer = pd.to_datetime(df['Yr_Mo_Dy'])
answer.head(4)
Out[3]:
0 2061-01-01
1 2061-01-02
2 2061-01-03
3 2061-01-04
Name: Yr_Mo_Dy, dtype: datetime64[ns]
In [4]:
df['Yr_Mo_Dy']=pd.to_datetime(df['Yr_Mo_Dy'])
66. Print every unique year present in Yr_Mo_Dy
In [5]:
df['Yr_Mo_Dy'].dt.year.unique()
Out[5]:
array([2061, 2062, 2063, 2064, 2065, 2066, 2067, 2068, 2069, 2070, 1971,
1972, 1973, 1974, 1975, 1976, 1977, 1978], dtype=int64)
67. In Yr_Mo_Dy, any year of 2061 or later is bad data. Subtract 100 from those years and redefine the Yr_Mo_Dy column
In [6]:
df['Yr_Mo_Dy']= pd.to_datetime(df['Yr_Mo_Dy'].astype('str').map(lambda x : str(int(x[:4])-100)+x[4:] if x[:4]>='2061' else x))
df['Yr_Mo_Dy']
Out[6]:
0 1961-01-01
1 1961-01-02
2 1961-01-03
3 1961-01-04
4 1961-01-05
...
6569 1978-12-27
6570 1978-12-28
6571 1978-12-29
6572 1978-12-30
6573 1978-12-31
Name: Yr_Mo_Dy, Length: 6574, dtype: datetime64[ns]
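The same correction could be done after parsing, with DateOffset instead of string slicing, applied to the column as converted in problem 65 (a sketch; map returns object dtype, hence the re-parse):
fixed = pd.to_datetime(df['Yr_Mo_Dy'].map(lambda d: d - pd.DateOffset(years=100) if d.year >= 2061 else d))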
68. Compute each column's yearly mean
In [7]:
df.groupby(df['Yr_Mo_Dy'].dt.year).agg('mean').head(4)
Out[7]:
| RPT | VAL | ROS | KIL | SHA | BIR | DUB | CLA | MUL | CLO | BEL | MAL | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Yr_Mo_Dy | ||||||||||||
| 1961 | 12.299583 | 10.351796 | 11.362369 | 6.958227 | 10.881763 | 7.729726 | 9.733923 | 8.858788 | 8.647652 | 9.835577 | 13.502795 | 13.680773 |
| 1962 | 12.246923 | 10.110438 | 11.732712 | 6.960440 | 10.657918 | 7.393068 | 11.020712 | 8.793753 | 8.316822 | 9.676247 | 12.930685 | 14.323956 |
| 1963 | 12.813452 | 10.836986 | 12.541151 | 7.330055 | 11.724110 | 8.434712 | 11.075699 | 10.336548 | 8.903589 | 10.224438 | 13.638877 | 14.999014 |
| 1964 | 12.363661 | 10.920164 | 12.104372 | 6.787787 | 11.454481 | 7.570874 | 10.259153 | 9.467350 | 7.789016 | 10.207951 | 13.740546 | 14.910301 |
69. Create a weekday column mapped by day of week (Monday: 0 through Sunday: 6)
In [8]:
df['weekday'] = df['Yr_Mo_Dy'].dt.weekday
df['weekday'].head(3).to_frame()
Out[8]:
| weekday | |
|---|---|
| 0 | 6 |
| 1 | 0 |
| 2 | 1 |
70. Based on weekday, create a WeekCheck column that is 1 on weekends and 0 on weekdays
In [9]:
df['WeekCheck'] = df['weekday'].map(lambda x : 1 if x==5 or x==6 else 0)
df['WeekCheck'].head(3).to_frame()
Out[9]:
| WeekCheck | |
|---|---|
| 0 | 1 |
| 1 | 0 |
| 2 | 0 |
71. Ignoring year and day, compute each column's mean per month
In [10]:
answer = df.groupby(df['Yr_Mo_Dy'].dt.month).agg('mean')
answer
Out[10]:
| RPT | VAL | ROS | KIL | SHA | BIR | DUB | CLA | MUL | CLO | BEL | MAL | weekday | WeekCheck | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Yr_Mo_Dy | ||||||||||||||
| 1 | 14.847325 | 12.914560 | 13.299624 | 7.199498 | 11.667734 | 8.054839 | 11.819355 | 9.512047 | 9.543208 | 10.053566 | 14.550520 | 18.028763 | 2.989247 | 0.284946 |
| 2 | 13.710906 | 12.111122 | 12.879132 | 6.942411 | 11.551772 | 7.633858 | 11.206024 | 9.341437 | 9.313169 | 9.518051 | 13.728898 | 17.156142 | 3.005906 | 0.287402 |
| 3 | 13.158687 | 11.505842 | 12.648118 | 7.265907 | 11.554516 | 7.959409 | 11.310179 | 9.635896 | 9.700324 | 10.096953 | 13.810609 | 16.909317 | 3.000000 | 0.283154 |
| 4 | 12.555648 | 10.429759 | 12.204815 | 6.898037 | 10.677667 | 7.441389 | 10.221315 | 8.909056 | 8.930870 | 9.158019 | 12.664759 | 14.937611 | 3.011111 | 0.288889 |
| 5 | 11.724032 | 10.145619 | 11.550394 | 6.307487 | 10.224301 | 6.942061 | 8.797738 | 8.452903 | 8.040806 | 8.524857 | 12.767258 | 13.736039 | 2.982079 | 0.283154 |
| 6 | 10.451317 | 8.949704 | 10.361315 | 5.652278 | 9.529926 | 6.410093 | 8.009556 | 7.920796 | 7.639796 | 7.729185 | 12.246407 | 12.861818 | 3.007407 | 0.285185 |
| 7 | 9.992007 | 8.357778 | 9.349642 | 5.416935 | 9.302634 | 5.972348 | 7.843501 | 7.262760 | 7.544480 | 7.321416 | 11.676505 | 12.800789 | 3.001792 | 0.288530 |
| 8 | 10.213411 | 8.415143 | 9.993441 | 5.270681 | 8.901559 | 5.891057 | 7.772312 | 6.842025 | 7.240573 | 7.002783 | 11.110090 | 12.565943 | 2.991039 | 0.283154 |
| 9 | 11.458519 | 9.981002 | 10.756883 | 5.615176 | 9.766315 | 6.566222 | 8.609722 | 7.745677 | 7.610556 | 7.689278 | 12.686389 | 14.761963 | 3.009259 | 0.287037 |
| 10 | 12.660610 | 11.010681 | 11.453943 | 6.065215 | 10.550251 | 7.159910 | 9.387778 | 8.726308 | 8.347181 | 8.850376 | 14.155323 | 16.697151 | 2.998208 | 0.286738 |
| 11 | 13.200722 | 11.639500 | 12.293407 | 6.247611 | 10.501130 | 7.134333 | 10.814861 | 8.427167 | 8.604000 | 8.943167 | 13.815741 | 18.114185 | 2.992593 | 0.283333 |
| 12 | 14.446398 | 12.353602 | 13.212276 | 6.829910 | 11.301254 | 7.963710 | 11.849050 | 9.209355 | 9.447258 | 9.627670 | 14.259516 | 18.697599 | 3.017921 | 0.288530 |
72. Fill every missing value with the preceding value in its column; if the first row is missing, fill it with the following value
In [11]:
df = df.fillna(method='ffill').fillna(method='bfill')
df.isnull().sum()
Out[11]:
Yr_Mo_Dy 0
RPT 0
VAL 0
ROS 0
KIL 0
SHA 0
BIR 0
DUB 0
CLA 0
MUL 0
CLO 0
BEL 0
MAL 0
weekday 0
WeekCheck 0
dtype: int64
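Newer pandas versions deprecate the method= argument of fillna; the dedicated ffill/bfill methods are the drop-in equivalents (a sketch):
df = df.ffill().bfill()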
73. Compute the mean of every column grouped by year-month
In [18]:
answer = df.groupby(df['Yr_Mo_Dy'].dt.strftime('%Y-%m')).agg('mean')
answer
Out[18]:
| RPT | VAL | ROS | KIL | SHA | BIR | DUB | CLA | MUL | CLO | BEL | MAL | weekday | WeekCheck | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Yr_Mo_Dy | ||||||||||||||
| 1961-01 | 14.932581 | 12.084194 | 13.431613 | 7.736774 | 11.215806 | 8.588065 | 11.184839 | 9.270968 | 9.085806 | 10.107419 | 13.880968 | 14.703226 | 2.935484 | 0.290323 |
| 1961-02 | 16.269286 | 14.975357 | 14.324643 | 9.116786 | 13.852143 | 10.937500 | 11.890714 | 11.846071 | 11.821429 | 12.714286 | 18.583214 | 15.411786 | 3.000000 | 0.285714 |
| 1961-03 | 11.015806 | 11.296452 | 10.752903 | 7.137742 | 10.509355 | 8.866774 | 9.644194 | 9.829677 | 10.646129 | 11.251935 | 16.410968 | 15.637742 | 3.000000 | 0.258065 |
| 1961-04 | 10.722667 | 9.427667 | 9.998000 | 5.830667 | 8.435000 | 6.495000 | 6.925333 | 7.094667 | 7.342333 | 7.237000 | 11.147333 | 10.278333 | 3.166667 | 0.333333 |
| 1961-05 | 9.860968 | 8.735161 | 10.818065 | 5.892258 | 9.490323 | 6.574839 | 7.477097 | 8.177097 | 8.039355 | 8.499355 | 11.900323 | 12.011613 | 2.806452 | 0.258065 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1978-08 | 9.645161 | 8.259355 | 9.032258 | 4.502903 | 7.368065 | 5.935161 | 5.650323 | 5.417742 | 7.241290 | 5.536774 | 10.466774 | 12.054194 | 2.903226 | 0.258065 |
| 1978-09 | 10.913667 | 10.895000 | 10.635000 | 5.725000 | 10.372000 | 9.278333 | 10.790333 | 9.583000 | 10.069333 | 8.939000 | 15.680333 | 19.391333 | 3.100000 | 0.300000 |
| 1978-10 | 9.897742 | 8.670968 | 9.295806 | 4.721290 | 8.525161 | 6.774194 | 8.115484 | 7.337742 | 8.297742 | 8.243871 | 13.776774 | 17.150000 | 2.935484 | 0.290323 |
| 1978-11 | 16.151667 | 14.802667 | 13.508000 | 7.317333 | 11.475000 | 8.743000 | 11.492333 | 9.657333 | 10.701333 | 10.676000 | 17.404667 | 20.723000 | 2.966667 | 0.266667 |
| 1978-12 | 16.175484 | 13.748065 | 15.635161 | 7.094839 | 11.398710 | 9.241613 | 12.077419 | 10.194839 | 10.616774 | 11.028710 | 13.859677 | 21.371613 | 3.193548 | 0.322581 |
216 rows × 14 columns
74. First-difference the RPT column by date
In [21]:
answer = df['RPT'].diff()
answer
Out[21]:
0 NaN
1 -0.33
2 3.79
3 -7.92
4 2.75
...
6569 3.75
6570 -4.37
6571 0.79
6572 4.50
6573 1.83
Name: RPT, Length: 6574, dtype: float64
In [33]:
answer = df[['RPT','VAL']].rolling(7).mean()
answer.head(10)
Out[33]:
| RPT | VAL | |
|---|---|---|
| 0 | NaN | NaN |
| 1 | NaN | NaN |
| 2 | NaN | NaN |
| 3 | NaN | NaN |
| 4 | NaN | NaN |
| 5 | NaN | NaN |
| 6 | 14.124286 | 12.727143 |
| 7 | 13.541429 | 11.982857 |
| 8 | 13.237143 | 11.392857 |
| 9 | 12.504286 | 10.570000 |
76. Convert the 년-월-일:시 column to a datetime format pandas can recognize. In the data published by the city of Seoul, hour 0 is written as 24
In [63]:
import pandas as pd
df = pd.read_csv('https://raw.githubusercontent.com/Datamanim/pandas/main/seoul_pm.csv')
df
Out[63]:
| (년-월-일:시) | PM10등급 | PM10 | PM2.5등급 | PM2.5 | 오존등급 | 오존 | 이산화질소등급 | 이산화질소 | 일산화탄소등급 | 일산화탄소 | 아황산가스등급 | 아황산가스 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2021-05-15:15 | 보통 | 47.0 | 보통 | 19.0 | 좋음 | 0.017 | 좋음 | 0.023 | 좋음 | 0.4 | 좋음 | 0.003 |
| 1 | 2021-05-15:14 | 보통 | 43.0 | 보통 | 20.0 | 좋음 | 0.024 | 좋음 | 0.019 | 좋음 | 0.3 | 좋음 | 0.003 |
| 2 | 2021-05-15:13 | 보통 | 34.0 | 보통 | 24.0 | 보통 | 0.035 | 좋음 | 0.017 | 좋음 | 0.4 | 좋음 | 0.004 |
| 3 | 2021-05-15:12 | 보통 | 41.0 | 보통 | 27.0 | 보통 | 0.037 | 좋음 | 0.020 | 좋음 | 0.4 | 좋음 | 0.004 |
| 4 | 2021-05-15:11 | 보통 | 51.0 | 보통 | 34.0 | 보통 | 0.033 | 좋음 | 0.023 | 좋음 | 0.4 | 좋음 | 0.005 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1234 | 2021-03-25:05 | 보통 | 39.0 | 보통 | 18.0 | 좋음 | 0.026 | 좋음 | 0.025 | 좋음 | 0.4 | 좋음 | 0.003 |
| 1235 | 2021-03-25:04 | 보통 | 34.0 | 좋음 | 15.0 | 좋음 | 0.017 | 보통 | 0.033 | 좋음 | 0.4 | 좋음 | 0.002 |
| 1236 | 2021-03-25:03 | 보통 | 35.0 | 좋음 | 13.0 | 좋음 | 0.029 | 좋음 | 0.025 | 좋음 | 0.4 | 좋음 | 0.003 |
| 1237 | 2021-03-25:02 | 보통 | 35.0 | 좋음 | 13.0 | 보통 | 0.031 | 좋음 | 0.025 | 좋음 | 0.3 | 좋음 | 0.003 |
| 1238 | 2021-03-25:01 | 보통 | 42.0 | 좋음 | 13.0 | 좋음 | 0.022 | 보통 | 0.037 | 좋음 | 0.4 | 좋음 | 0.003 |
1239 rows × 13 columns
In [64]:
import datetime
def date_ft(x):
    if x[-2:] != '24':
        x = x.replace(':', ' ')
        return pd.to_datetime(x)
    else:
        x = pd.to_datetime(x.split(':')[0]) + datetime.timedelta(days=1)
        return x
df['(년-월-일:시)'] = df['(년-월-일:시)'].apply(date_ft)
df.head(3)
Out[64]:
| (년-월-일:시) | PM10등급 | PM10 | PM2.5등급 | PM2.5 | 오존등급 | 오존 | 이산화질소등급 | 이산화질소 | 일산화탄소등급 | 일산화탄소 | 아황산가스등급 | 아황산가스 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2021-05-15 15:00:00 | 보통 | 47.0 | 보통 | 19.0 | 좋음 | 0.017 | 좋음 | 0.023 | 좋음 | 0.4 | 좋음 | 0.003 |
| 1 | 2021-05-15 14:00:00 | 보통 | 43.0 | 보통 | 20.0 | 좋음 | 0.024 | 좋음 | 0.019 | 좋음 | 0.3 | 좋음 | 0.003 |
| 2 | 2021-05-15 13:00:00 | 보통 | 34.0 | 보통 | 24.0 | 보통 | 0.035 | 좋음 | 0.017 | 좋음 | 0.4 | 좋음 | 0.004 |
In [73]:
df['dayName'] = df['(년-월-일:시)'].dt.day_name()
df['dayName'].head(5)
Out[73]:
0 Saturday
1 Saturday
2 Saturday
3 Saturday
4 Saturday
Name: dayName, dtype: object
78. Count the frequency of each PM10등급 value by day
In [77]:
df['PM10등급'].value_counts()
Out[77]:
보통 651
좋음 361
나쁨 145
매우나쁨 76
Name: PM10등급, dtype: int64
In [89]:
answer = df.groupby(['dayName','PM10등급'],as_index=False).size()
answer
Out[89]:
| dayName | PM10등급 | size | |
|---|---|---|---|
| 0 | Friday | 나쁨 | 31 |
| 1 | Friday | 매우나쁨 | 17 |
| 2 | Friday | 보통 | 120 |
| 3 | Friday | 좋음 | 21 |
| 4 | Monday | 나쁨 | 1 |
| 5 | Monday | 매우나쁨 | 21 |
| 6 | Monday | 보통 | 83 |
| 7 | Monday | 좋음 | 63 |
| 8 | Saturday | 나쁨 | 31 |
| 9 | Saturday | 매우나쁨 | 27 |
| 10 | Saturday | 보통 | 71 |
| 11 | Saturday | 좋음 | 54 |
| 12 | Sunday | 나쁨 | 2 |
| 13 | Sunday | 매우나쁨 | 1 |
| 14 | Sunday | 보통 | 67 |
| 15 | Sunday | 좋음 | 98 |
| 16 | Thursday | 나쁨 | 41 |
| 17 | Thursday | 보통 | 144 |
| 18 | Thursday | 좋음 | 5 |
| 19 | Tuesday | 나쁨 | 13 |
| 20 | Tuesday | 매우나쁨 | 10 |
| 21 | Tuesday | 보통 | 71 |
| 22 | Tuesday | 좋음 | 74 |
| 23 | Wednesday | 나쁨 | 26 |
| 24 | Wednesday | 보통 | 95 |
| 25 | Wednesday | 좋음 | 46 |
79. Check that the timestamps are continuous with no missing entries
In [98]:
# Differencing the timestamps gives NaN for the first value; if all remaining differences are identical (2 unique values in total), the series is continuous.
answer = len(df['(년-월-일:시)'].diff().unique())
if answer == 2:
    answer = True
else:
    answer = False
answer
Out[98]:
True
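A more compact equivalent: drop the leading NaT and count the distinct differences (a sketch):
answer = df['(년-월-일:시)'].diff().dropna().nunique() == 1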
80. Compute the mean PM10 at 10 AM and at 10 PM (hour 22), respectively
In [140]:
answer = df.loc[(df['(년-월-일:시)'].dt.hour==10) | (df['(년-월-일:시)'].dt.hour==22)]
answer = answer[['(년-월-일:시)','PM10']].groupby(df['(년-월-일:시)'].dt.hour).agg('mean')
answer
Out[140]:
| PM10 | |
|---|---|
| (년-월-일:시) | |
| 10 | 70.384615 |
| 22 | 69.941176 |
In [151]:
df.groupby(df['(년-월-일:시)'].dt.hour).mean().loc[[10,22],['PM10']]
Out[151]:
| PM10 | |
|---|---|
| (년-월-일:시) | |
| 10 | 70.384615 |
| 22 | 69.941176 |
81. Make the date column the index
In [178]:
df = df.set_index('(년-월-일:시)',drop=True)
df.head(3)
Out[178]:
| PM10등급 | PM10 | PM2.5등급 | PM2.5 | 오존등급 | 오존 | 이산화질소등급 | 이산화질소 | 일산화탄소등급 | 일산화탄소 | 아황산가스등급 | 아황산가스 | dayName | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| (년-월-일:시) | |||||||||||||
| 2021-05-15 15:00:00 | 보통 | 47.0 | 보통 | 19.0 | 좋음 | 0.017 | 좋음 | 0.023 | 좋음 | 0.4 | 좋음 | 0.003 | Saturday |
| 2021-05-15 14:00:00 | 보통 | 43.0 | 보통 | 20.0 | 좋음 | 0.024 | 좋음 | 0.019 | 좋음 | 0.3 | 좋음 | 0.003 | Saturday |
| 2021-05-15 13:00:00 | 보통 | 34.0 | 보통 | 24.0 | 보통 | 0.035 | 좋음 | 0.017 | 좋음 | 0.4 | 좋음 | 0.004 | Saturday |
82. Resample the data by week and compute the min, max, mean, and standard deviation
In [187]:
answer = df.resample('W').agg(['min','max','mean','std'])
answer
Out[187]:
| PM10 | PM2.5 | 오존 | ... | 이산화질소 | 일산화탄소 | 아황산가스 | |||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| min | max | mean | std | min | max | mean | std | min | max | ... | mean | std | min | max | mean | std | min | max | mean | std | |
| (년-월-일:시) | |||||||||||||||||||||
| 2021-03-28 | 6.0 | 160.0 | 72.747368 | 43.345462 | 4.0 | 113.0 | 44.705263 | 29.551928 | 0.002 | 0.085 | ... | 0.044579 | 0.023722 | 0.3 | 1.4 | 0.611579 | 0.232408 | 0.002 | 0.006 | 0.003274 | 0.000961 |
| 2021-04-04 | 3.0 | 598.0 | 97.148810 | 129.911976 | 1.0 | 120.0 | 23.168675 | 22.399578 | 0.003 | 0.055 | ... | 0.027929 | 0.014978 | 0.3 | 0.9 | 0.445833 | 0.135741 | 0.002 | 0.004 | 0.002732 | 0.000541 |
| 2021-04-11 | 17.0 | 102.0 | 41.059524 | 16.325911 | 7.0 | 70.0 | 21.761905 | 11.479343 | 0.009 | 0.070 | ... | 0.022583 | 0.009562 | 0.3 | 0.7 | 0.389286 | 0.087573 | 0.002 | 0.004 | 0.002744 | 0.000569 |
| 2021-04-18 | 3.0 | 367.0 | 48.180723 | 43.254468 | 2.0 | 38.0 | 17.066265 | 7.867952 | 0.002 | 0.070 | ... | 0.023753 | 0.013553 | 0.3 | 0.6 | 0.386747 | 0.084954 | 0.002 | 0.004 | 0.002464 | 0.000579 |
| 2021-04-25 | 17.0 | 126.0 | 55.119048 | 26.659936 | 7.0 | 61.0 | 26.392857 | 13.094788 | 0.006 | 0.090 | ... | 0.028571 | 0.014640 | 0.3 | 0.8 | 0.457143 | 0.122142 | 0.001 | 0.011 | 0.003631 | 0.001763 |
| 2021-05-02 | 3.0 | 97.0 | 40.612121 | 24.813103 | 1.0 | 43.0 | 16.644578 | 8.850965 | 0.003 | 0.064 | ... | 0.020428 | 0.011676 | 0.3 | 0.6 | 0.392771 | 0.092485 | 0.001 | 0.006 | 0.002524 | 0.000768 |
| 2021-05-09 | 8.0 | 1024.0 | 161.660714 | 239.679148 | 3.0 | 172.0 | 34.738095 | 39.788248 | 0.002 | 0.073 | ... | 0.024187 | 0.012371 | 0.3 | 1.0 | 0.419277 | 0.103230 | 0.002 | 0.004 | 0.002771 | 0.000579 |
| 2021-05-16 | 16.0 | 111.0 | 40.014815 | 21.876855 | 7.0 | 76.0 | 21.577778 | 15.622633 | 0.004 | 0.123 | ... | 0.030793 | 0.009503 | 0.3 | 0.8 | 0.440741 | 0.094075 | 0.001 | 0.006 | 0.002459 | 0.001696 |
8 rows × 24 columns
83. Drop the Indicator column and strip the confidence-interval text from the First Tooltip column
In [221]:
import pandas as pd
df = pd.read_csv('https://raw.githubusercontent.com/Datamanim/pandas/main/under5MortalityRate.csv')
df.drop('Indicator',axis=1,inplace=True)
df['First Tooltip'] = df['First Tooltip'].map(lambda x : float(x.split(' ')[0]))
df
Out[221]:
| Location | Period | Dim1 | First Tooltip | |
|---|---|---|---|---|
| 0 | Afghanistan | 2019 | Both sexes | 60.27 |
| 1 | Afghanistan | 2019 | Male | 63.83 |
| 2 | Afghanistan | 2019 | Female | 56.57 |
| 3 | Afghanistan | 2018 | Both sexes | 62.54 |
| 4 | Afghanistan | 2018 | Male | 66.08 |
| ... | ... | ... | ... | ... |
| 29994 | Timor-Leste | 2002 | Both sexes | 96.97 |
| 29995 | Timor-Leste | 2002 | Male | 102.60 |
| 29996 | Timor-Leste | 2002 | Female | 91.04 |
| 29997 | Timor-Leste | 2001 | Both sexes | 102.50 |
| 29998 | Timor-Leste | 2001 | Male | 108.20 |
29999 rows × 4 columns
84. Extract only the cases where the year is 2015 or later and Dim1 is Both sexes
In [222]:
target = df.loc[(df['Period']>=2015)&(df['Dim1']=='Both sexes')]
target
Out[222]:
| Location | Period | Dim1 | First Tooltip | |
|---|---|---|---|---|
| 0 | Afghanistan | 2019 | Both sexes | 60.27 |
| 3 | Afghanistan | 2018 | Both sexes | 62.54 |
| 6 | Afghanistan | 2017 | Both sexes | 64.94 |
| 9 | Afghanistan | 2016 | Both sexes | 67.57 |
| 12 | Afghanistan | 2015 | Both sexes | 70.44 |
| ... | ... | ... | ... | ... |
| 29943 | Timor-Leste | 2019 | Both sexes | 44.22 |
| 29946 | Timor-Leste | 2018 | Both sexes | 45.62 |
| 29949 | Timor-Leste | 2017 | Both sexes | 47.27 |
| 29952 | Timor-Leste | 2016 | Both sexes | 49.01 |
| 29955 | Timor-Leste | 2015 | Both sexes | 50.76 |
860 rows × 4 columns
85. Using the data extracted in problem 84, build a DataFrame of mortality rates by country and year, as below
In [223]:
answer = target.pivot(index='Location',columns='Period',values='First Tooltip')
answer
Out[223]:
| Period | 2015 | 2016 | 2017 | 2018 | 2019 |
|---|---|---|---|---|---|
| Location | |||||
| Afghanistan | 70.44 | 67.57 | 64.94 | 62.54 | 60.27 |
| Albania | 9.57 | 9.42 | 9.42 | 9.53 | 9.68 |
| Algeria | 25.18 | 24.79 | 24.32 | 23.81 | 23.26 |
| Andorra | 3.53 | 3.37 | 3.22 | 3.09 | 2.97 |
| Angola | 88.20 | 84.21 | 80.62 | 77.67 | 74.69 |
| ... | ... | ... | ... | ... | ... |
| Syrian Arab Republic | 23.18 | 23.27 | 22.97 | 22.11 | 21.53 |
| Tajikistan | 37.75 | 36.82 | 35.81 | 34.80 | 33.78 |
| Thailand | 10.80 | 10.32 | 9.86 | 9.42 | 9.01 |
| The former Yugoslav Republic of Macedonia | 12.97 | 11.97 | 9.94 | 7.83 | 6.12 |
| Timor-Leste | 50.76 | 49.01 | 47.27 | 45.62 | 44.22 |
172 rows × 5 columns
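pivot requires each (Location, Period) pair to be unique and raises otherwise; if duplicates existed, pivot_table, which aggregates (mean by default), would be the fallback (a sketch):
answer = target.pivot_table(index='Location', columns='Period', values='First Tooltip')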
86. Compute the mean mortality rate per year for each Dim1 value
In [238]:
answer = df.pivot_table(index='Dim1', columns='Period', values='First Tooltip', aggfunc='mean')
answer
Out[238]:
| Period | 1950 | 1951 | 1952 | 1953 | 1954 | 1955 | 1956 | 1957 | 1958 | 1959 | ... | 2010 | 2011 | 2012 | 2013 | 2014 | 2015 | 2016 | 2017 | 2018 | 2019 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Dim1 | |||||||||||||||||||||
| Both sexes | 147.700816 | 155.537544 | 157.811094 | 156.147206 | 154.539857 | 155.797179 | 159.241163 | 156.170114 | 150.813222 | 150.574000 | ... | 37.718488 | 35.573663 | 34.290988 | 33.099360 | 32.053314 | 31.012093 | 29.956337 | 29.030465 | 28.083837 | 27.191744 |
| Female | 140.909796 | 149.210175 | 151.516094 | 150.250882 | 148.688286 | 149.843205 | 153.048721 | 149.988295 | 144.719667 | 144.451474 | ... | 34.953023 | 32.877616 | 31.654070 | 30.521337 | 29.524302 | 28.544360 | 27.542035 | 26.675291 | 25.782616 | 24.945349 |
| Male | 154.151224 | 161.538246 | 163.760781 | 161.742059 | 160.081000 | 161.456923 | 165.089535 | 162.015000 | 156.573556 | 156.375053 | ... | 40.340174 | 38.140291 | 36.793081 | 35.543663 | 34.446105 | 33.354302 | 32.242616 | 31.273198 | 30.283023 | 29.350349 |
3 rows × 70 columns
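pivot_table with aggfunc='mean' is shorthand for a groupby followed by unstack; the equivalent sketch below can be a useful starting point when a more complex aggregation is needed:

# same 3 × 70 table built explicitly from groupby
answer = df.groupby(['Dim1', 'Period'])['First Tooltip'].mean().unstack()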
87. Extract only the rows for Korea (KOR) from the data¶
In [239]:
import pandas as pd
df = pd.read_csv('https://raw.githubusercontent.com/Datamanim/pandas/main/winter.csv')
df
Out[239]:
| Year | City | Sport | Discipline | Athlete | Country | Gender | Event | Medal | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1924 | Chamonix | Biathlon | Biathlon | BERTHET, G. | FRA | Men | Military Patrol | Bronze |
| 1 | 1924 | Chamonix | Biathlon | Biathlon | MANDRILLON, C. | FRA | Men | Military Patrol | Bronze |
| 2 | 1924 | Chamonix | Biathlon | Biathlon | MANDRILLON, Maurice | FRA | Men | Military Patrol | Bronze |
| 3 | 1924 | Chamonix | Biathlon | Biathlon | VANDELLE, André | FRA | Men | Military Patrol | Bronze |
| 4 | 1924 | Chamonix | Biathlon | Biathlon | AUFDENBLATTEN, Adolf | SUI | Men | Military Patrol | Gold |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 5765 | 2014 | Sochi | Skiing | Snowboard | JONES, Jenny | GBR | Women | Slopestyle | Bronze |
| 5766 | 2014 | Sochi | Skiing | Snowboard | ANDERSON, Jamie | USA | Women | Slopestyle | Gold |
| 5767 | 2014 | Sochi | Skiing | Snowboard | MALTAIS, Dominique | CAN | Women | Snowboard Cross | Silver |
| 5768 | 2014 | Sochi | Skiing | Snowboard | SAMKOVA, Eva | CZE | Women | Snowboard Cross | Gold |
| 5769 | 2014 | Sochi | Skiing | Snowboard | TRESPEUCH, Chloe | FRA | Women | Snowboard Cross | Bronze |
5770 rows × 9 columns
In [241]:
answer = df.loc[df['Country']=='KOR']
answer
Out[241]:
| Year | City | Sport | Discipline | Athlete | Country | Gender | Event | Medal | |
|---|---|---|---|---|---|---|---|---|---|
| 2652 | 1992 | Albertville | Skating | Short Track Speed Skating | LEE, Jun-Ho | KOR | Men | 1000M | Bronze |
| 2653 | 1992 | Albertville | Skating | Short Track Speed Skating | KIM, Ki-Hoon | KOR | Men | 1000M | Gold |
| 2671 | 1992 | Albertville | Skating | Short Track Speed Skating | KIM, Ki-Hoon | KOR | Men | 5000M Relay | Gold |
| 2672 | 1992 | Albertville | Skating | Short Track Speed Skating | LEE, Jun-Ho | KOR | Men | 5000M Relay | Gold |
| 2673 | 1992 | Albertville | Skating | Short Track Speed Skating | MO, Ji-Soo | KOR | Men | 5000M Relay | Gold |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 5528 | 2014 | Sochi | Skating | Short Track Speed Skating | PARK, Seung-Hi | KOR | Women | 500M | Bronze |
| 5544 | 2014 | Sochi | Skating | Speed skating | JOO, Hyong Jun | KOR | Men | Team Pursuit | Silver |
| 5545 | 2014 | Sochi | Skating | Speed skating | KIM, Cheol Min | KOR | Men | Team Pursuit | Silver |
| 5546 | 2014 | Sochi | Skating | Speed skating | LEE, Seung Hoon | KOR | Men | Team Pursuit | Silver |
| 5565 | 2014 | Sochi | Skating | Speed skating | LEE, Sang Hwa | KOR | Women | 500M | Gold |
87 rows × 9 columns
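Boolean indexing with == handles a single country; isin generalizes the same selection to several at once (a sketch — 'JPN' is used purely as an illustrative second IOC code):

answer = df[df['Country'].isin(['KOR', 'JPN'])]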
88. From the Korean Olympic medalist data, build a DataFrame of medal counts by year¶
In [249]:
Ans = answer.pivot_table(index='Year', columns='Medal', aggfunc='size').fillna(0)
Ans
Out[249]:
| Medal | Bronze | Gold | Silver |
|---|---|---|---|
| Year | |||
| 1992 | 1.0 | 5.0 | 1.0 |
| 1994 | 1.0 | 8.0 | 1.0 |
| 1998 | 2.0 | 6.0 | 4.0 |
| 2002 | 0.0 | 5.0 | 2.0 |
| 2006 | 2.0 | 14.0 | 3.0 |
| 2010 | 2.0 | 6.0 | 10.0 |
| 2014 | 2.0 | 7.0 | 5.0 |
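pd.crosstab yields the same counts as the pivot_table/size combination without the fillna step, since missing combinations come back as integer zeros (a sketch on the same filtered answer):

Ans = pd.crosstab(answer['Year'], answer['Medal'])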
89. From the full data, count entries by gender for each type of Sport¶
In [251]:
Ans = df.pivot_table(index='Sport', columns='Gender', aggfunc='size')
Ans
Out[251]:
| Gender | Men | Women |
|---|---|---|
| Sport | ||
| Biathlon | 270 | 150 |
| Bobsleigh | 416 | 36 |
| Curling | 97 | 75 |
| Ice Hockey | 1231 | 305 |
| Luge | 135 | 45 |
| Skating | 665 | 564 |
| Skiing | 1130 | 651 |
90. From the full data, count the number of each Medal for each type of Discipline¶
In [253]:
Ans = df.pivot_table(index = 'Discipline', columns='Medal', aggfunc='size')
Ans
Out[253]:
| Medal | Bronze | Gold | Silver |
|---|---|---|---|
| Discipline | |||
| Alpine Skiing | 141 | 143 | 144 |
| Biathlon | 139 | 140 | 141 |
| Bobsleigh | 147 | 134 | 141 |
| Cross Country Skiing | 263 | 264 | 262 |
| Curling | 56 | 58 | 58 |
| Figure skating | 118 | 122 | 119 |
| Freestyle Skiing | 34 | 34 | 34 |
| Ice Hockey | 512 | 510 | 514 |
| Luge | 60 | 62 | 58 |
| Nordic Combined | 55 | 55 | 55 |
| Short Track Speed Skating | 96 | 97 | 97 |
| Skeleton | 10 | 10 | 10 |
| Ski Jumping | 68 | 69 | 70 |
| Snowboard | 30 | 30 | 30 |
| Speed skating | 190 | 193 | 197 |
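As in the previous two problems, aggfunc='size' simply counts rows per cell; the equivalent groupby sketch makes that explicit (append .fillna(0) if some Discipline/Medal combination never occurs):

Ans = df.groupby(['Discipline', 'Medal']).size().unstack()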
91. Combine df1 and df2 into a single DataFrame¶
In [365]:
import pandas as pd
df = pd.read_csv('https://raw.githubusercontent.com/Datamanim/pandas/main/mergeTEst.csv',index_col=0)
df1 = df.iloc[:4,:]
df2 = df.iloc[4:,:]
display(df1)
display(df2)
total = pd.concat([df1,df2],axis=0)
total
| 2010 | 2011 | 2012 | 2013 | 2014 | 2015 | 2016 | 2017 | 2018 | 2019 | |
|---|---|---|---|---|---|---|---|---|---|---|
| Location | ||||||||||
| Afghanistan | 64.023 | 61.640 | 59.367 | 57.170 | 55.08 | 53.107 | 51.267 | 49.560 | 47.983 | 46.453 |
| Albania | 11.803 | 10.807 | 9.943 | 9.267 | 8.79 | 8.493 | 8.363 | 8.363 | 8.453 | 8.597 |
| Algeria | 23.540 | 22.907 | 22.450 | 22.117 | 21.85 | 21.587 | 21.257 | 20.850 | 20.407 | 19.930 |
| Andorra | 4.240 | 4.033 | 3.843 | 3.667 | 3.49 | 3.330 | 3.187 | 3.060 | 2.933 | 2.827 |
| 2010 | 2011 | 2012 | 2013 | 2014 | 2015 | 2016 | 2017 | 2018 | 2019 | |
|---|---|---|---|---|---|---|---|---|---|---|
| Location | ||||||||||
| Angola | 75.713 | 71.280 | 67.233 | 63.570 | 60.430 | 57.757 | 55.510 | 53.460 | 51.757 | 50.093 |
| Antigua and Barbuda | 8.667 | 8.223 | 7.807 | 7.420 | 7.070 | 6.757 | 6.483 | 6.230 | 6.000 | 5.783 |
| Argentina | 12.887 | 12.380 | 11.840 | 11.283 | 10.733 | 10.203 | 9.683 | 9.177 | 8.680 | 8.227 |
| Armenia | 16.497 | 15.677 | 14.897 | 14.170 | 13.477 | 12.817 | 12.183 | 11.583 | 11.007 | 10.497 |
| Australia | 3.993 | 3.803 | 3.623 | 3.467 | 3.343 | 3.253 | 3.183 | 3.137 | 3.090 | 3.047 |
| Austria | 3.573 | 3.463 | 3.333 | 3.210 | 3.113 | 3.043 | 2.987 | 2.943 | 2.897 | 2.843 |
Out[365]:
| 2010 | 2011 | 2012 | 2013 | 2014 | 2015 | 2016 | 2017 | 2018 | 2019 | |
|---|---|---|---|---|---|---|---|---|---|---|
| Location | ||||||||||
| Afghanistan | 64.023 | 61.640 | 59.367 | 57.170 | 55.080 | 53.107 | 51.267 | 49.560 | 47.983 | 46.453 |
| Albania | 11.803 | 10.807 | 9.943 | 9.267 | 8.790 | 8.493 | 8.363 | 8.363 | 8.453 | 8.597 |
| Algeria | 23.540 | 22.907 | 22.450 | 22.117 | 21.850 | 21.587 | 21.257 | 20.850 | 20.407 | 19.930 |
| Andorra | 4.240 | 4.033 | 3.843 | 3.667 | 3.490 | 3.330 | 3.187 | 3.060 | 2.933 | 2.827 |
| Angola | 75.713 | 71.280 | 67.233 | 63.570 | 60.430 | 57.757 | 55.510 | 53.460 | 51.757 | 50.093 |
| Antigua and Barbuda | 8.667 | 8.223 | 7.807 | 7.420 | 7.070 | 6.757 | 6.483 | 6.230 | 6.000 | 5.783 |
| Argentina | 12.887 | 12.380 | 11.840 | 11.283 | 10.733 | 10.203 | 9.683 | 9.177 | 8.680 | 8.227 |
| Armenia | 16.497 | 15.677 | 14.897 | 14.170 | 13.477 | 12.817 | 12.183 | 11.583 | 11.007 | 10.497 |
| Australia | 3.993 | 3.803 | 3.623 | 3.467 | 3.343 | 3.253 | 3.183 | 3.137 | 3.090 | 3.047 |
| Austria | 3.573 | 3.463 | 3.333 | 3.210 | 3.113 | 3.043 | 2.987 | 2.943 | 2.897 | 2.843 |
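pd.concat stacks the frames exactly as given and does not check the index; passing verify_integrity=True makes it raise if the pieces share index labels, which can catch accidental overlaps (a sketch on the same df1/df2):

total = pd.concat([df1, df2], axis=0, verify_integrity=True)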
92. Combine df3 and df4 into a single DataFrame, keeping only the columns both contain¶
In [371]:
df3 = df.iloc[:2,:4]
df4 = df.iloc[5:,3:]
display(df3)
display(df4)
total = pd.concat([df3,df4],axis=0,join='inner')
total
| 2010 | 2011 | 2012 | 2013 | |
|---|---|---|---|---|
| Location | ||||
| Afghanistan | 64.023 | 61.640 | 59.367 | 57.170 |
| Albania | 11.803 | 10.807 | 9.943 | 9.267 |
| 2013 | 2014 | 2015 | 2016 | 2017 | 2018 | 2019 | |
|---|---|---|---|---|---|---|---|
| Location | |||||||
| Antigua and Barbuda | 7.420 | 7.070 | 6.757 | 6.483 | 6.230 | 6.000 | 5.783 |
| Argentina | 11.283 | 10.733 | 10.203 | 9.683 | 9.177 | 8.680 | 8.227 |
| Armenia | 14.170 | 13.477 | 12.817 | 12.183 | 11.583 | 11.007 | 10.497 |
| Australia | 3.467 | 3.343 | 3.253 | 3.183 | 3.137 | 3.090 | 3.047 |
| Austria | 3.210 | 3.113 | 3.043 | 2.987 | 2.943 | 2.897 | 2.843 |
Out[371]:
| 2013 | |
|---|---|
| Location | |
| Afghanistan | 57.170 |
| Albania | 9.267 |
| Antigua and Barbuda | 7.420 |
| Argentina | 11.283 |
| Armenia | 14.170 |
| Australia | 3.467 |
| Austria | 3.210 |
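join='inner' keeps only the columns common to both frames, which here is just 2013. The surviving column set can be checked directly (sketch):

# Index.intersection shows which columns an inner concat will keep
shared_cols = df3.columns.intersection(df4.columns)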
93. Combine df3 and df4 into a single DataFrame, keeping every column and replacing missing values with 0¶
In [375]:
total = pd.concat([df3,df4],axis=0,join='outer').fillna(0)
total
Out[375]:
| 2010 | 2011 | 2012 | 2013 | 2014 | 2015 | 2016 | 2017 | 2018 | 2019 | |
|---|---|---|---|---|---|---|---|---|---|---|
| Location | ||||||||||
| Afghanistan | 64.023 | 61.640 | 59.367 | 57.170 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |
| Albania | 11.803 | 10.807 | 9.943 | 9.267 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |
| Antigua and Barbuda | 0.000 | 0.000 | 0.000 | 7.420 | 7.070 | 6.757 | 6.483 | 6.230 | 6.000 | 5.783 |
| Argentina | 0.000 | 0.000 | 0.000 | 11.283 | 10.733 | 10.203 | 9.683 | 9.177 | 8.680 | 8.227 |
| Armenia | 0.000 | 0.000 | 0.000 | 14.170 | 13.477 | 12.817 | 12.183 | 11.583 | 11.007 | 10.497 |
| Australia | 0.000 | 0.000 | 0.000 | 3.467 | 3.343 | 3.253 | 3.183 | 3.137 | 3.090 | 3.047 |
| Austria | 0.000 | 0.000 | 0.000 | 3.210 | 3.113 | 3.043 | 2.987 | 2.943 | 2.897 | 2.843 |
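Note that fillna(0) conflates "no data" with a genuine zero rate. When that distinction matters, the NaN-preserving outer concat can be kept and the gaps counted instead (sketch):

total = pd.concat([df3, df4], axis=0, join='outer')
missing_per_year = total.isna().sum()   # how many locations lack each year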
94. Merge df5 and df6 into a single DataFrame using the merge function, with the Algeria column as the key, and output only rows present in both frames¶
In [382]:
df5 = df.T.iloc[:7,:3]
df6 = df.T.iloc[6:,2:5]
display(df5)
display(df6)
total = pd.merge(df5,df6, on='Algeria', how = 'inner')
total
| Location | Afghanistan | Albania | Algeria |
|---|---|---|---|
| 2010 | 64.023 | 11.803 | 23.540 |
| 2011 | 61.640 | 10.807 | 22.907 |
| 2012 | 59.367 | 9.943 | 22.450 |
| 2013 | 57.170 | 9.267 | 22.117 |
| 2014 | 55.080 | 8.790 | 21.850 |
| 2015 | 53.107 | 8.493 | 21.587 |
| 2016 | 51.267 | 8.363 | 21.257 |
| Location | Algeria | Andorra | Angola |
|---|---|---|---|
| 2016 | 21.257 | 3.187 | 55.510 |
| 2017 | 20.850 | 3.060 | 53.460 |
| 2018 | 20.407 | 2.933 | 51.757 |
| 2019 | 19.930 | 2.827 | 50.093 |
Out[382]:
| Location | Afghanistan | Albania | Algeria | Andorra | Angola |
|---|---|---|---|---|---|
| 0 | 51.267 | 8.363 | 21.257 | 3.187 | 55.51 |
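Merging on 'Algeria' uses its float values as the join key, so rows match only where the mortality figures happen to coincide exactly — here the single 2016 row. When the intent is to align the two frames on year, joining on the index is more robust (a sketch on the same df5/df6):

# index join; the overlapping 'Algeria' column returns as Algeria_x / Algeria_y
total = pd.merge(df5, df6, left_index=True, right_index=True, how='inner')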
95. Merge df5 and df6 into a single DataFrame using the merge function, with the Algeria column as the key, taking the union of the two¶
In [384]:
total = pd.merge(df5,df6,on='Algeria', how='outer')
total
Out[384]:
| Location | Afghanistan | Albania | Algeria | Andorra | Angola |
|---|---|---|---|---|---|
| 0 | 64.023 | 11.803 | 23.540 | NaN | NaN |
| 1 | 61.640 | 10.807 | 22.907 | NaN | NaN |
| 2 | 59.367 | 9.943 | 22.450 | NaN | NaN |
| 3 | 57.170 | 9.267 | 22.117 | NaN | NaN |
| 4 | 55.080 | 8.790 | 21.850 | NaN | NaN |
| 5 | 53.107 | 8.493 | 21.587 | NaN | NaN |
| 6 | 51.267 | 8.363 | 21.257 | 3.187 | 55.510 |
| 7 | NaN | NaN | 20.850 | 3.060 | 53.460 |
| 8 | NaN | NaN | 20.407 | 2.933 | 51.757 |
| 9 | NaN | NaN | 19.930 | 2.827 | 50.093 |
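With how='outer', adding indicator=True appends a _merge column recording whether each row came from the left frame, the right frame, or both, which is handy for sanity-checking the union (sketch):

total = pd.merge(df5, df6, on='Algeria', how='outer', indicator=True)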