
Time Series Analysis Project

Brief: This project aims to forecast the carbon dioxide (CO2) concentration in the atmosphere by analyzing its trend and seasonal behaviour.

Objective

Understand the CO2 concentration and its variation in the atmosphere by building a time series model.

Project Overview:


Index

1. Data Understanding
2. Pre-Processing
3. Building the Model
4. Evaluation
5. Conclusions and Lessons Learned


1. Data Understanding

First, we have to find data on carbon dioxide (CO2) concentration in the atmosphere. We can get this information from the Scripps Institution of Oceanography, which provides monthly, weekly, and daily datasets. For this project, we will use the monthly report: monthly_in_situ_co2_mlo.csv.

Note: To download the dataset, click HERE.

The data comes from the observatory located at about 3,400 m altitude on Mauna Loa Volcano on Hawaii Island. The CO2 measured there is not influenced by nearby local vegetation, and the prevailing winds bring well-mixed air to the site. Records are available from March 1958.

# Imports used throughout this post
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
# Create the dataframe from the csv file (skip the header comment rows):
cols = ["year", "month", "date1", "date2", "co2", "co2_season_adj", "co2_spline_season_adj", "co2_spline", "co2_fill_7", "co2_fill_8"]
df_co2 = pd.read_csv("./monthly_in_situ_co2_mlo.csv", skiprows=57, header=None, names=cols)
print('Dataframe has {:d} data points of {:d} features'.format(df_co2.shape[0], df_co2.shape[1]))
df_co2.head()

Output:
Dataframe has 780 data points of 10 features
	year	month	date1	date2	co2	co2_season_adj	co2_spline_season_adj	co2_spline	co2_fill_7	co2_fill_8
0	1958	1	21200	1958.0411	-99.99	-99.99	-99.99	-99.99	-99.99	-99.99
1	1958	2	21231	1958.1260	-99.99	-99.99	-99.99	-99.99	-99.99	-99.99
2	1958	3	21259	1958.2027	315.71	314.44	316.20	314.91	315.71	314.44
3	1958	4	21290	1958.2877	317.45	315.16	317.30	314.99	317.45	315.16
4	1958	5	21320	1958.3699	317.51	314.70	317.88	315.07	317.51	314.70


2. Pre-Processing

After a quick review, the raw dataset has some values that need to be pre-processed before they can be used in the model: we keep only the year, month, and co2 columns, add a fractional time column, and drop the -99.99 placeholder (missing) values.

# Pre-process the dataset:
# Keep only year, month and co2, and add a fractional "time" column (years since the start of the series)
df_co2 = df_co2.iloc[:, [0, 1, 4]]
df_co2["time"] = [(i + 0.5) / 12 for i in range(len(df_co2))]
# -99.99 is the missing-value flag in the source file; drop those rows
df_co2[df_co2 == -99.99] = np.nan
df_co2 = df_co2.dropna()
print('Dataframe has {:d} data points of {:d} features'.format(df_co2.shape[0], df_co2.shape[1]))
df_co2.head()

Output:
Dataframe has 768 data points of 4 features
	year	month	co2	time
2	1958	3	315.71	0.208333
3	1958	4	317.45	0.291667
4	1958	5	317.51	0.375000
6	1958	7	315.87	0.541667
7	1958	8	314.93	0.625000

For further modeling, we will split the data into training and test datasets (the most recent 20% of the series is held out as the test set).

# Take values for x and y
x = df_co2["time"].values
y = df_co2["co2"].values

# Split data (no shuffling, so the test set is the most recent 20% of the series)
x_train, x_test, y_train, y_test = train_test_split(x.reshape(-1, 1), y.reshape(-1, 1), test_size=0.20, shuffle=False)

# Define training and testing dataframes:
df_co2_train = df_co2[0:len(x_train)]
df_co2_test = df_co2[len(x_train):len(df_co2)]

print(x_train.shape, x_test.shape, y_train.shape, y_test.shape, df_co2_train.shape, df_co2_test.shape)

# Plotting The Keeling Curve (save before show, otherwise an empty figure is written)
plt.scatter(x_train[:, 0], y_train[:, 0], linewidth=1, color='r')
plt.xlabel('Year (ti)')
plt.ylabel('CO2 ppm')
plt.title('CO2 Concentration by Year')
plt.savefig('./images/CO2-curve.png')
plt.show()

CO2 Concentration by Year


3. Building the Model

We will model the CO2 concentration as the sum of a long-term trend, a seasonal (periodic) component, and a residual:

$$ C_i = F(t_i) + P_i + R_i $$

where $F(t_i)$ is the trend, $P_i$ the periodic (seasonal) component and $R_i$ the residual. Later, during evaluation, we will check whether this decomposition is meaningful. The main idea is that, after removing the trend and the seasonal pattern, the remaining residual should be stationary.

Trend pattern:

For this part, we fit several candidate trend models, such as a linear regression and polynomials of degree 2 and 3, and check which one fits the data better. A sketch of these fits is shown below.
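The original post only shows the comparison plot, not the fitting code itself, so the snippet below is a minimal sketch of how the linear, quadratic, and cubic trend fits and their error metrics could be produced with scikit-learn. The variable names (y_pred_train, y_pred_train_pol2, rmse_1, R_quadratic, etc.) are chosen to match those used in the plotting code that follows; they are assumptions, not the author's original code.

from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_percentage_error

def fit_polynomial_trend(x_tr, y_tr, x_te, degree):
    # Fit a polynomial trend of the given degree and predict on train and test
    poly = PolynomialFeatures(degree=degree)
    model = LinearRegression()
    model.fit(poly.fit_transform(x_tr), y_tr)
    return model.predict(poly.transform(x_tr)), model.predict(poly.transform(x_te))

# Linear, quadratic and cubic trend fits
y_pred_train, y_pred_test = fit_polynomial_trend(x_train, y_train, x_test, degree=1)
y_pred_train_pol2, y_pred_test_pol2 = fit_polynomial_trend(x_train, y_train, x_test, degree=2)
y_pred_train_pol3, y_pred_test_pol3 = fit_polynomial_trend(x_train, y_train, x_test, degree=3)

# Error metrics of each trend model on the test set
R_1, R_2, R_3 = [r2_score(y_test, p) for p in (y_pred_test, y_pred_test_pol2, y_pred_test_pol3)]
rmse_1, rmse_2, rmse_3 = [np.sqrt(mean_squared_error(y_test, p))
                          for p in (y_pred_test, y_pred_test_pol2, y_pred_test_pol3)]
mape_1, mape_2, mape_3 = [mean_absolute_percentage_error(y_test, p)
                          for p in (y_pred_test, y_pred_test_pol2, y_pred_test_pol3)]

# Keep the quadratic residuals on the training data for the seasonal analysis below
df_co2_train = df_co2_train.copy()
df_co2_train["R_quadratic"] = y_train[:, 0] - y_pred_train_pol2[:, 0]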

# Comparing Linear, Quadratic and Cubic Model Fit

plt.plot(x_train[:,0],y_train[:,0], linewidth = 1, color = 'blue', label ="Original Data Train")
plt.plot(x_test[:,0],y_test[:,0], linewidth = 1, color = 'lightblue', label ="Original Data Test")

plt.plot(x_train[:,0], y_pred_train, linewidth = 2, color = 'red', label ="Linear Fit on Data Train")
plt.plot(x_train[:,0], y_pred_train_pol2, linewidth = 2, color = 'orange', label ="Quadratic Fit on Data Train")
plt.plot(x_train[:,0], y_pred_train_pol3, linewidth = 2, color = 'green', label ="Cubic Fit on Data Train")

plt.axvline(x=x_test[0, 0], ymin=0, ymax=1, linewidth=2, color='red')  # mark the train/test split

plt.plot(x_test[:,0], y_pred_test, linewidth = 2, color = 'red', label ="Linear Fit on Data Test")
plt.plot(x_test[:,0], y_pred_test_pol2, linewidth = 2, color = 'orange', label ="Quadratic Fit on Data Test")
plt.plot(x_test[:,0], y_pred_test_pol3, linewidth = 2, color = 'green', label ="Cubic Fit on Data Test")

print("Error Model:")
print("=========================")
print("R2 score: ",R_1, R_2, R_3)
print("RMSE:",rmse_1, rmse_2, rmse_3)
print("MAPE: ", mape_1, mape_2, mape_3, "\n")

plt.legend()
plt.xlabel('Year')
plt.ylabel('CO2 ppm')
plt.title('Keeling Curve - CO2 Concentration')
plt.savefig('./images/Trendy-curve.jpg')
plt.show()

Trend Curve

Seasonal Pattern:

To model the seasonal pattern, we can use the residuals of the quadratic model, since it has a better RMSE than the other trend models. From these residuals, we need to find the periodic pattern that can be modeled.

plt.plot(x_train[:,0], df_co2_train["R_quadratic"].values, color ='b')
plt.xlabel('Year')
plt.ylabel('Residual')
plt.title('Quadratic Residual')
plt.savefig('./images/Seasonal-curve.jpg')
plt.show()

Seasonal Curve

As we can see, the residual has a clear seasonal (periodic) form. We will take the mean residual of each calendar month and interpolate it as a periodic function that is replicated year after year; a sketch of this step follows the figure below:

Sinusoidal
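The post does not include the code for this periodic component, so the following is a sketch of one plausible implementation using the mean quadratic residual per calendar month (the names monthly_means, P_train, P_test and rmse_sin are assumptions made to match the evaluation output later on):

# Seasonal component: mean quadratic residual for each calendar month (1..12)
monthly_means = df_co2_train.groupby("month")["R_quadratic"].mean()

# Replicate the 12 monthly values year after year, for train and test
P_train = df_co2_train["month"].map(monthly_means).values
P_test = df_co2_test["month"].map(monthly_means).values

# RMSE of the periodic component alone against the raw test series
# (one reading of the "Sinusoidal" RMSE reported in the evaluation section)
rmse_sin = np.sqrt(mean_squared_error(y_test[:, 0], P_test))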

Now that we have the trend and the seasonal pattern, it is time to join them and complete the model (a sketch of this step follows the figure below):

So $ C_i = F(t_i) + P(t_i) $

Final-model
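A sketch of how the combined series Ci_train and Ci_test used in the evaluation below could be assembled from the quadratic trend and the monthly-mean seasonal component (again an assumption, since the original code is not shown):

# Combined model: quadratic trend F(t) plus the periodic component P(t)
Ci_train = y_pred_train_pol2[:, 0] + P_train
Ci_test = y_pred_test_pol2[:, 0] + P_test

# RMSE of the combined model on the test set
rmse_ci = np.sqrt(mean_squared_error(y_test[:, 0], Ci_test))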


4. Evaluation

For this part, we evaluate the model on the test dataset, first visually and then with the error metrics below:

# Plotting The Keeling Curve
plt.scatter(x,y, linewidth = 1, color = 'blue', label ="Original Data" )
#plt.plot(x_train[:,0], y_pred_train_pol2, linewidth = 3, color = 'orange', label ="Quadratic Fit on Data Train")
#plt.plot(x_test[:,0], y_pred_test_pol2, linewidth = 3, color = 'red', label ="Quadratic Fit on Data Test")

plt.axvline(x=x_test[0, 0], ymin=0, ymax=1, linewidth=2, color='red')  # mark the train/test split

plt.plot(x_train[:,0],Ci_train, linewidth = 2, color = 'red', label ="F(t) + P(t) on Data Train")
plt.plot(x_test[:,0],Ci_test, linewidth = 2, color = 'orange', label ="F(t) + P(t) on Data Test")
plt.legend()
plt.xlabel('Year(ti)')
plt.ylabel('CO2 ppm')
plt.title('Keeling Curve - CO2 Concentration')
plt.savefig('images/FinalCi.png')
plt.show()

Final-ci

print("Error Models Comparinson:")
print("=========================")
print("RMSE - Linear Model: ",rmse_1)
print("RMSE - Quadratic Model: ", rmse_2)
print("RMSE - Sinoidal Model: ", rmse_sin)
print("RMSE - Linear+Quadratic Model: ", rmse_ci)

Error Models Comparison:
=========================
RMSE - Linear Model:  12.205045063273001
RMSE - Quadratic Model:  2.8185169840722186
RMSE - Sinusoidal Model:  403.2159909879654
RMSE - Quadratic+Sinusoidal Model:  1.6761379284135005
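The AIC and BIC figures below are reported without the corresponding code. Under the usual Gaussian-likelihood approximation they can be computed from the residual sum of squares of each model with k fitted parameters, for example (a hedged sketch; the helper name aic_bic and the example call are illustrative):

def aic_bic(y_true, y_pred, k):
    # AIC/BIC from the residual sum of squares, assuming Gaussian errors
    y_true = np.asarray(y_true).ravel()
    y_pred = np.asarray(y_pred).ravel()
    n = len(y_true)
    rss = np.sum((y_true - y_pred) ** 2)
    return n * np.log(rss / n) + 2 * k, n * np.log(rss / n) + k * np.log(n)

# Example: the nonlinear (quadratic) trend has 3 fitted parameters
aic_nonlinear, bic_nonlinear = aic_bic(y_test, y_pred_test_pol2, k=3)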


AIC with linear method is: 1918.707028906646
BIC with linear method is: 1921.7439815090595
--------------------
AIC with nonlinear method is: 767.0099028009191
BIC with nonlinear method is: 773.0838080057464
--------------------
AIC with sinusoidal method is: 42096479.288302734
BIC with sinusoidal method is: 42096494.47306575
--------------------
AIC with Nonlinear+Sinusoidal method is: 755.0368639627404
BIC with Nonlinear+Sinusoidal method is: 776.2955321796359
--------------------

Ratio Ft/Pi : 11.676531 
Ratio Pi/Ri : 1.702243 
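The post does not show how these ratios are computed. One plausible interpretation is that they compare the amplitude (range) of each component, confirming that the trend dominates the seasonal pattern and the seasonal pattern dominates the residual, which is what makes the decomposition meaningful. A hypothetical sketch under that assumption:

# Hypothetical amplitude ratios between the fitted components (np.ptp = max - min)
F_range = np.ptp(y_pred_train_pol2)             # range of the trend F(t)
P_range = np.ptp(P_train)                       # range of the seasonal component P(t)
R_range = np.ptp(y_train[:, 0] - Ci_train)      # range of the remaining residual R(t)
print("Ratio Ft/Pi : %f" % (F_range / P_range))
print("Ratio Pi/Ri : %f" % (P_range / R_range))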


5. Conclusions and Lessons Learned