데이터 분석 기초
시계열 분석 기초(arima, sarima)
세용용용용
2023. 7. 25. 15:55
In [2]:
# 시계열 분해 간단 실습
import pandas as pd
import warnings
#데이터 불러오기
data = pd.read_csv("https://raw.githubusercontent.com/ADPclass/ADP_book_ver01/main/data/arima_data.csv", names = ['day','price'])
print(data.info())
data.head()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 60 entries, 0 to 59 Data columns (total 2 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 day 60 non-null object 1 price 60 non-null int64 dtypes: int64(1), object(1) memory usage: 1.1+ KB None
Out[2]:
| day | price | |
|---|---|---|
| 0 | 2013-01-01 | 3794 |
| 1 | 2013-02-01 | 3863 |
| 2 | 2013-03-01 | 5190 |
| 3 | 2013-04-01 | 5783 |
| 4 | 2013-05-01 | 6298 |
day타입이 object이므로 시계열 분석을 위해 datetime 형식으로 변환해주자!!!!¶
In [3]:
data['day'] = pd.to_datetime(data['day'])
print(data.info())
#시간을 index로 분석할 시계열 데이터의 값을 단일 컬럼으로 만들어야됨
data.set_index('day', inplace=True)
data.head()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 60 entries, 0 to 59 Data columns (total 2 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 day 60 non-null datetime64[ns] 1 price 60 non-null int64 dtypes: datetime64[ns](1), int64(1) memory usage: 1.1 KB None
Out[3]:
| price | |
|---|---|
| day | |
| 2013-01-01 | 3794 |
| 2013-02-01 | 3863 |
| 2013-03-01 | 5190 |
| 2013-04-01 | 5783 |
| 2013-05-01 | 6298 |
In [4]:
import matplotlib.pyplot as plt
plt.plot(data.index, data['price'])
Out[4]:
[<matplotlib.lines.Line2D at 0x1895e97d1f0>]
In [5]:
from statsmodels.tsa.seasonal import seasonal_decompose
ts = data
result = seasonal_decompose(ts, model='multiplicative')
plt.rcParams['figure.figsize'] = [12,8]
result.plot()
plt.show()
In [6]:
# 파이썬을 활용 정상성 검정 실습
from statsmodels.tsa.stattools import adfuller
# train, test 나눠준다
# 정상성 검정은 train 데이터에만 적용하고,
# 모델의 성능 평가는 test 데이터를 사용하여 진행하는 것이 일반적인 접근 방법
training = data[:"2016-12-01"]
test = data.drop(training.index)
# 데이터에 명확한 추세가 보이므로 "ct" 사용
adf = adfuller(training, regression='ct')
print("statistic: {}".format(adf[0]))
print("p-value: {}".format(adf[1]))
statistic: -1.9997199341328458 p-value: 0.6015863303793819
In [7]:
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
diff_data = training.diff(1)
diff_data = diff_data.dropna()
diff_data.plot()
Out[7]:
<AxesSubplot:xlabel='day'>
In [8]:
# 1차 차분한 그래프가 트렌드를 보이지 않기에 매개변수는 "c"값을 적용
adf = adfuller(diff_data)
print("statistic: {}".format(adf[0]))
print("p-value: {}".format(adf[1]))
statistic: -12.094547576926395 p-value: 2.0851606399613667e-22
3) AR모형과 MA모형¶
In [9]:
# 파이썬 활용 AR모형의 p값 찾기
plot_pacf(diff_data) #AR(p)의 값 확인 가능
plt.show()
(2) MA모형¶
과거의 예측 오차들의 가중이동평균으로 현재 시점의 데이터를 표현하는 모형!!!¶
즉, 과거의 예측 오차를 이용해 미래를 예측하는 모형이라고 할 수 있다!!!¶
과거 예측 오차들의 따라 가중이동평균은 달라짐, 그렇기에 MA모형은 최적의 모형이 되는 구간을 구하는 것이 중요¶
MA모형이 최적이 되게끔 하는 변수가 q이며 이 모형을 MA(q)모형 이라함!!!¶
ACF : 자기상관 함수로 시차에 따른 자기상관성을 의미¶
ACF 값을 시차에 따라 그래프로 시작화 해보면 최적의 q값을 찾을 수 있음(0으로 수혐할 떄 시차를 q로 설정)¶
In [10]:
# 파이썬 활용 MA모형 q값 찾기 실습
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
import matplotlib.pyplot as plt
plot_acf(diff_data) # MA(q)의 값 확인
Out[10]:
acf값을 확인했을때 약 시차2 이후에 0에 수렴하는 것을 알 수 있다. 즉, MA모형에서 최적의 q값은 2로 설정할 수 있다.¶
In [11]:
#앞서 사용했던 데이터의 p,d,q(2,1,2) 값을 ARIMA에 적용시켜 모델을 예측해보자
from statsmodels.tsa.arima.model import ARIMA
model = ARIMA(training, order=(2,1,2))
res = model.fit()
res.summary()
C:\Users\82108\anaconda3\lib\site-packages\statsmodels\tsa\base\tsa_model.py:473: ValueWarning: No frequency information was provided, so inferred frequency MS will be used.
self._init_dates(dates, freq)
C:\Users\82108\anaconda3\lib\site-packages\statsmodels\tsa\base\tsa_model.py:473: ValueWarning: No frequency information was provided, so inferred frequency MS will be used.
self._init_dates(dates, freq)
C:\Users\82108\anaconda3\lib\site-packages\statsmodels\tsa\base\tsa_model.py:473: ValueWarning: No frequency information was provided, so inferred frequency MS will be used.
self._init_dates(dates, freq)
C:\Users\82108\anaconda3\lib\site-packages\statsmodels\tsa\statespace\sarimax.py:966: UserWarning: Non-stationary starting autoregressive parameters found. Using zeros as starting parameters.
warn('Non-stationary starting autoregressive parameters'
C:\Users\82108\anaconda3\lib\site-packages\statsmodels\tsa\statespace\sarimax.py:978: UserWarning: Non-invertible starting MA parameters found. Using zeros as starting parameters.
warn('Non-invertible starting MA parameters found.'
C:\Users\82108\anaconda3\lib\site-packages\statsmodels\base\model.py:607: ConvergenceWarning: Maximum Likelihood optimization failed to converge. Check mle_retvals
warnings.warn("Maximum Likelihood optimization failed to "
Out[11]:
| Dep. Variable: | price | No. Observations: | 48 |
|---|---|---|---|
| Model: | ARIMA(2, 1, 2) | Log Likelihood | -375.875 |
| Date: | Tue, 25 Jul 2023 | AIC | 761.750 |
| Time: | 14:59:50 | BIC | 771.001 |
| Sample: | 01-01-2013 | HQIC | 765.231 |
| - 12-01-2016 | |||
| Covariance Type: | opg |
| coef | std err | z | P>|z| | [0.025 | 0.975] | |
|---|---|---|---|---|---|---|
| ar.L1 | -1.3167 | 0.190 | -6.933 | 0.000 | -1.689 | -0.944 |
| ar.L2 | -0.3190 | 0.191 | -1.673 | 0.094 | -0.693 | 0.055 |
| ma.L1 | 1.9700 | 0.243 | 8.109 | 0.000 | 1.494 | 2.446 |
| ma.L2 | 0.9949 | 0.242 | 4.119 | 0.000 | 0.522 | 1.468 |
| sigma2 | 4.457e+05 | 1.13e-06 | 3.93e+11 | 0.000 | 4.46e+05 | 4.46e+05 |
| Ljung-Box (L1) (Q): | 0.11 | Jarque-Bera (JB): | 0.38 |
|---|---|---|---|
| Prob(Q): | 0.74 | Prob(JB): | 0.83 |
| Heteroskedasticity (H): | 1.49 | Skew: | -0.21 |
| Prob(H) (two-sided): | 0.44 | Kurtosis: | 2.89 |
Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).
[2] Covariance matrix is singular or near-singular, with condition number 9.29e+26. Standard errors may be unstable.
변환 되지 않은 training 세트를(2,1,2) 로 학습시킨 결과를 회귀분석에서 보았떤 결과가 나타난다.¶
주의 깊에 봐야할 부분은¶
AIC와 AR,MA모델의 p-value이다.¶
1) AIC는 다른 모델과 비교할 떄 사용할 수 있으며 AIC가 작을수록 모델의 성능이 좋다고 할수 있다!!!¶
2) coef에서 ar,ma가 p-value 0.05이하이면 ar과ma 모형을 사용할 수 있다는 것이다.¶
3) ar,ma 뒤에 있는 L1,L2는 모델에서 사용하는 시차의 개념, 만약 p,d,q 에서 p값이 5라면(여기서는 2임) ar.L1~ar.L5 변수가 모델에서 사용된다!!!¶
In [12]:
plt.plot(res.predict())
plt.plot(training)
Out[12]:
[<matplotlib.lines.Line2D at 0x189644ddaf0>]
In [13]:
#예측 데이터
forecast_data = res.forecast(steps = len(test), alpha = 0.05)
pred_y = forecast_data
pred_y
C:\Users\82108\anaconda3\lib\site-packages\statsmodels\tsa\statespace\representation.py:374: FutureWarning: Unknown keyword arguments: dict_keys(['alpha']).Passing unknown keyword arguments will raise a TypeError beginning in version 0.15. warnings.warn(msg, FutureWarning)
Out[13]:
2017-01-01 5830.344320 2017-02-01 5508.143690 2017-03-01 5883.768662 2017-04-01 5491.991051 2017-05-01 5887.992295 2017-06-01 5491.583116 2017-07-01 5887.181951 2017-08-01 5492.780201 2017-09-01 5885.864326 2017-10-01 5494.133156 2017-11-01 5884.503309 2017-12-01 5495.493515 Freq: MS, Name: predicted_mean, dtype: float64
In [14]:
#실제 데이터
test_y = test
test_y
Out[14]:
| price | |
|---|---|
| day | |
| 2017-01-01 | 5236 |
| 2017-02-01 | 5299 |
| 2017-03-01 | 6744 |
| 2017-04-01 | 7927 |
| 2017-05-01 | 8561 |
| 2017-06-01 | 8930 |
| 2017-07-01 | 9960 |
| 2017-08-01 | 8548 |
| 2017-09-01 | 7843 |
| 2017-10-01 | 7620 |
| 2017-11-01 | 7676 |
| 2017-12-01 | 5809 |
잘 예측 했는지 시각화하여 확인해보자¶
In [15]:
plt.plot(pred_y, color='gold', label='Predict') #예상한 데이터
plt.plot(test_y, color='green', label='test') # 실제 가격 그래프
plt.legend()
plt.show()
In [16]:
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score
print("r2_score", r2_score(test_y, pred_y))
RMSE = mean_squared_error(test_y, pred_y)**0.5
print('RMSE', RMSE)
r2_score -1.6434021791291618 RMSE 2302.418748557072
In [1]:
# auto_arima 설치
!pip install pmdarima
Collecting pmdarima
Downloading pmdarima-2.0.3-cp39-cp39-win_amd64.whl (572 kB)
Collecting statsmodels>=0.13.2
Downloading statsmodels-0.14.0-cp39-cp39-win_amd64.whl (9.4 MB)
Requirement already satisfied: Cython!=0.29.18,!=0.29.31,>=0.29 in c:\users\82108\anaconda3\lib\site-packages (from pmdarima) (0.29.24)
Requirement already satisfied: urllib3 in c:\users\82108\anaconda3\lib\site-packages (from pmdarima) (1.26.7)
Requirement already satisfied: joblib>=0.11 in c:\users\82108\anaconda3\lib\site-packages (from pmdarima) (1.2.0)
Collecting numpy>=1.21.2
Downloading numpy-1.25.1-cp39-cp39-win_amd64.whl (15.1 MB)
Requirement already satisfied: scipy>=1.3.2 in c:\users\82108\anaconda3\lib\site-packages (from pmdarima) (1.7.1)
Requirement already satisfied: pandas>=0.19 in c:\users\82108\anaconda3\lib\site-packages (from pmdarima) (1.3.4)
Requirement already satisfied: setuptools!=50.0.0,>=38.6.0 in c:\users\82108\anaconda3\lib\site-packages (from pmdarima) (58.0.4)
Requirement already satisfied: scikit-learn>=0.22 in c:\users\82108\anaconda3\lib\site-packages (from pmdarima) (1.2.2)
Requirement already satisfied: pytz>=2017.3 in c:\users\82108\anaconda3\lib\site-packages (from pandas>=0.19->pmdarima) (2021.3)
Requirement already satisfied: python-dateutil>=2.7.3 in c:\users\82108\anaconda3\lib\site-packages (from pandas>=0.19->pmdarima) (2.8.2)
Requirement already satisfied: six>=1.5 in c:\users\82108\anaconda3\lib\site-packages (from python-dateutil>=2.7.3->pandas>=0.19->pmdarima) (1.16.0)
Requirement already satisfied: threadpoolctl>=2.0.0 in c:\users\82108\anaconda3\lib\site-packages (from scikit-learn>=0.22->pmdarima) (2.2.0)
Downloading numpy-1.22.4-cp39-cp39-win_amd64.whl (14.7 MB)
Requirement already satisfied: patsy>=0.5.2 in c:\users\82108\anaconda3\lib\site-packages (from statsmodels>=0.13.2->pmdarima) (0.5.2)
Collecting packaging>=21.3
Downloading packaging-23.1-py3-none-any.whl (48 kB)
Installing collected packages: numpy, packaging, statsmodels, pmdarima
Attempting uninstall: numpy
Found existing installation: numpy 1.20.3
Uninstalling numpy-1.20.3:
Successfully uninstalled numpy-1.20.3
Attempting uninstall: packaging
Found existing installation: packaging 21.0
Uninstalling packaging-21.0:
Successfully uninstalled packaging-21.0
Attempting uninstall: statsmodels
Found existing installation: statsmodels 0.12.2
Uninstalling statsmodels-0.12.2:
Successfully uninstalled statsmodels-0.12.2
Successfully installed numpy-1.22.4 packaging-23.1 pmdarima-2.0.3 statsmodels-0.14.0
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts. daal4py 2021.3.0 requires daal==2021.2.3, which is not installed. numba 0.54.1 requires numpy<1.21,>=1.17, but you have numpy 1.22.4 which is incompatible.
auto_arima 주요 매개변수¶
1) 데이터프레임(data_index를 가지고 1개의 관측값을 가진 데이터 입력)¶
2) information_criterion='aic'(주로 aic를 사용, 최상 모델을 찾기 위한 모델 평가지표)¶
3) m(분기별:4, 월별:12, 계절성이 없는 경우:1)¶
4) seansonal(계절성이 존재시 True, 아니면 False)¶
5) start_p >>> 최적AR(p) 모형을 찾을떄의 p의 시작값¶
6) max_p >>> 최적AR(P) 모형을 찾을떄의 p의 최댓값¶
7) start_q >>> MA(q) 모형 찾을떄 q의 시작값¶
8) max_q >>> MA(q) 모형 찾을떄 q의 최댓값¶
9) start_P, max_P >>> 계절성 자기 회귀 차수 P의 시작, 최대값¶
10) start_Q, max_Q >>> 계절성 이동 평균 차수 Q의 시작, 최댓값¶
In [17]:
#auto_arima 함수 호출
from pmdarima import auto_arima
# trace는 결과값에 학습정보를 표현하기 위해 True로 사용
auto_model = auto_arima(training, start_p=0, d=1, start_q=0,
max_p=3, max_q=3,
start_P=0, start_Q=0, max_P=3, max_Q=3, m=12,
seasonal=True, information_criterion='aic',
trace = True)
Performing stepwise search to minimize aic ARIMA(0,1,0)(0,1,0)[12] : AIC=481.846, Time=0.02 sec ARIMA(1,1,0)(1,1,0)[12] : AIC=482.652, Time=0.06 sec ARIMA(0,1,1)(0,1,1)[12] : AIC=482.466, Time=0.07 sec ARIMA(0,1,0)(1,1,0)[12] : AIC=483.637, Time=0.04 sec ARIMA(0,1,0)(0,1,1)[12] : AIC=483.669, Time=0.03 sec ARIMA(0,1,0)(1,1,1)[12] : AIC=inf, Time=0.15 sec ARIMA(1,1,0)(0,1,0)[12] : AIC=481.031, Time=0.03 sec ARIMA(1,1,0)(0,1,1)[12] : AIC=482.740, Time=0.07 sec ARIMA(1,1,0)(1,1,1)[12] : AIC=inf, Time=0.23 sec ARIMA(2,1,0)(0,1,0)[12] : AIC=482.616, Time=0.04 sec ARIMA(1,1,1)(0,1,0)[12] : AIC=482.682, Time=0.07 sec ARIMA(0,1,1)(0,1,0)[12] : AIC=480.687, Time=0.03 sec ARIMA(0,1,1)(1,1,0)[12] : AIC=482.403, Time=0.07 sec ARIMA(0,1,1)(1,1,1)[12] : AIC=inf, Time=0.19 sec ARIMA(0,1,2)(0,1,0)[12] : AIC=482.683, Time=0.04 sec ARIMA(1,1,2)(0,1,0)[12] : AIC=inf, Time=0.07 sec ARIMA(0,1,1)(0,1,0)[12] intercept : AIC=482.687, Time=0.03 sec Best model: ARIMA(0,1,1)(0,1,0)[12] Total fit time: 1.255 seconds
In [18]:
auto_model.summary()
Out[18]:
| Dep. Variable: | y | No. Observations: | 48 |
|---|---|---|---|
| Model: | SARIMAX(0, 1, 1)x(0, 1, [], 12) | Log Likelihood | -238.344 |
| Date: | Tue, 25 Jul 2023 | AIC | 480.687 |
| Time: | 15:19:07 | BIC | 483.798 |
| Sample: | 01-01-2013 | HQIC | 481.761 |
| - 12-01-2016 | |||
| Covariance Type: | opg |
| coef | std err | z | P>|z| | [0.025 | 0.975] | |
|---|---|---|---|---|---|---|
| ma.L1 | -0.3185 | 0.177 | -1.801 | 0.072 | -0.665 | 0.028 |
| sigma2 | 4.803e+04 | 1.64e+04 | 2.924 | 0.003 | 1.58e+04 | 8.02e+04 |
| Ljung-Box (L1) (Q): | 0.00 | Jarque-Bera (JB): | 1.15 |
|---|---|---|---|
| Prob(Q): | 0.95 | Prob(JB): | 0.56 |
| Heteroskedasticity (H): | 1.56 | Skew: | -0.14 |
| Prob(H) (two-sided): | 0.45 | Kurtosis: | 2.16 |
Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).
In [21]:
#학습셋으로 부터 test데이터 길이 만큼 예측
auto_pred_y = pd.DataFrame(auto_model.predict(n_periods=len(test)),
index = test.index)
auto_pred_y.columns = ['predicted_price']
auto_pred_y
Out[21]:
| predicted_price | |
|---|---|
| day | |
| 2017-01-01 | 5609.436974 |
| 2017-02-01 | 5761.436974 |
| 2017-03-01 | 7225.436974 |
| 2017-04-01 | 8298.436974 |
| 2017-05-01 | 8841.436974 |
| 2017-06-01 | 9452.436974 |
| 2017-07-01 | 10359.436974 |
| 2017-08-01 | 8777.436974 |
| 2017-09-01 | 8068.436974 |
| 2017-10-01 | 7832.436974 |
| 2017-11-01 | 7935.436974 |
| 2017-12-01 | 6279.436974 |
최종적으로 test데이터 길이만큼 예측을 했으니 실제test데이터와 차이가 얼마나 있는지 시각화 해보자¶
In [22]:
plt.figure(figsize=(10,6))
plt.plot(training, label='Train') # Train 데이터
plt.plot(auto_pred_y, label='Prediction') # 모델이 예측한 그래프
plt.plot(test, label='Test') # 실제 가격 그래프
plt.legend(loc='upper left')
plt.show()
평가지표인 R**2 값과 RMSE 값을 확인해보자¶
In [23]:
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score
print("r2_score", r2_score(test_y, auto_pred_y))
RMSE = mean_squared_error(test_y, auto_pred_y)**0.5
print("RMSE", RMSE)
r2_score 0.9305467077215415 RMSE 373.206423341773