XGBoost time series forecasting with sktime [ep#1]

Patiparn Nualchan
6 min read · Dec 25, 2022


An introduction to sktime, and an exploration of sequential data to get ready for modeling

photo modified by the author, based on https://www.indeed.com/career-advice/career-development/types-of-graphs-and-charts and a Juan Moyano/Stocksy photo

Hello, and welcome to my notes on a time series ML experiment from a sales forecasting project. I applied this at my real job to forecast sales a few weeks ahead. In this article we will talk about:

1. Time series data (a rough concept for dealing with a single series of sales data)

2. Time series analysis (EDA and stationarity testing)

3. Time series forecasting (modeling and prediction)

4. Time series cross-validation (temporal cross-validation)

5. Fine-tuning XGBoost (finding the best parameters)

Let’s get it started!!! In general we are familiar with tabular data (a DataFrame) with n features X1, X2, X3, …, Xn and a label Y (the target).

supervised learning overview, from https://www.tibco.com/reference-center/what-is-supervised-learning

We split the data (X_train, X_test, y_train, y_test), use the training data to build the model with fit(X_train, y_train), and then predict on X_test. A raw time series doesn’t come in this shape, so let’s talk about it!
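That familiar tabular workflow can be sketched with scikit-learn on hypothetical toy data (the feature matrix and coefficients below are made up for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

# Toy tabular data: 100 rows, 3 features X1..X3, and a target y
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, 2.0, 3.0]) + rng.normal(scale=0.1, size=100)

# The usual (shuffled) split -- fine for tabular data,
# but it would leak the future into the past for a time series
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

model = LinearRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)
print(model.score(X_test, y_test))  # R^2 close to 1 on this easy toy data
```

The point of the contrast: this random split is exactly what we cannot do once the rows are ordered in time.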

1. Time series: a single series of sales data (univariate). As shown below, it doesn’t have any features; it is just one column of sequential data.

What we have to do is transform the time sequence into a supervised task. We could define a function by hand, as in the code below, but we won’t. We use sktime!

from pandas import DataFrame, concat

def series_to_supervised(data, n_in=1, n_out=1):
    """Frame a series as a supervised dataset of lagged columns."""
    df = DataFrame(data)
    cols = list()
    # input sequence (t-n, ... t-1)
    for i in range(n_in, 0, -1):
        cols.append(df.shift(i))
    # forecast sequence (t, t+1, ... t+n-1)
    for i in range(0, n_out):
        cols.append(df.shift(-i))
    # put it all together
    agg = concat(cols, axis=1)
    # drop rows with NaN values
    agg.dropna(inplace=True)
    return agg.values

# `train` is the training series; `n_input` is the number of lags
data = series_to_supervised(train, n_in=n_input)
train_x, train_y = data[:, :-1], data[:, -1]

First of all, most of the information in this article comes from a good Medium article about time series with sktime, shown below.

I learned from it, explored further, and retold the story in my own taste. Thanks to it as a reference, and I hope you, the reader, enjoy my take as well.

sktime package: a framework for a wide range of time series machine learning tasks.

sktime has many functions that make working with time series more convenient. We start with the make_reduction function: it transforms a single series into lags of the data as features X. For example, Y = 5 comes from the preceding values 2, 3 and 4, which become lag_3, lag_2 and lag_1.

image by Rhys Kilian, from the article Build Complex Time Series Regression Pipelines with sktime
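Conceptually, this is just building lagged copies of the series. Here is a minimal pandas sketch of the idea (make_reduction does this, and more, for you; the toy series is made up to match the example above):

```python
import pandas as pd

y = pd.Series([2, 3, 4, 5, 6], name="y")

# Build lag features by hand: lag_k holds the value k steps back
lags = pd.concat({f"lag_{k}": y.shift(k) for k in (1, 2, 3)}, axis=1)
table = pd.concat([lags, y], axis=1).dropna()
print(table)
#    lag_1  lag_2  lag_3  y
# 3    4.0    3.0    2.0  5
# 4    5.0    4.0    3.0  6
```

Row 3 shows exactly the example from the text: the target 5 is explained by its three preceding values 4, 3 and 2.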

Before make_reduction, we can’t skip splitting the data; we apply make_reduction to the training data only. The split function is called temporal_train_test_split: it splits the data sequentially (the last 4 weeks, the last 3 months, the last 6 months, and so on). The difference from the usual train_test_split is shuffling: because the data is sequential, a time series can’t be split with shuffling or at random.

image by Marcello La Rosa, from Survey and Cross-benchmark Comparison of Remaining Time Prediction Methods in Business Process Monitoring
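At its core, temporal_train_test_split is a tail split. A sketch of the idea with plain pandas, on a hypothetical 10-point monthly series:

```python
import pandas as pd

y = pd.Series(range(10), index=pd.period_range("2022-01", periods=10, freq="M"))

test_size = 3
# No shuffling: the last `test_size` points become the test set,
# so the model is always evaluated on data that comes after the training data
y_train, y_test = y.iloc[:-test_size], y.iloc[-test_size:]

print(y_train.index.max() < y_test.index.min())  # True: train strictly precedes test
```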

One other important function is ForecastingHorizon: an array of relative or absolute values specifying the data points for which we want to generate forecasts.

from sktime.forecasting.model_selection import temporal_train_test_split
from sktime.forecasting.base import ForecastingHorizon

test_size = 12
window_length = 52

y = df.squeeze()

y_train, y_test = temporal_train_test_split(y, test_size=test_size)
fh = ForecastingHorizon(y_test.index, is_relative=False)

With those three functions we are almost ready to create a model, but not yet: first comes one of the most important steps for understanding and getting familiar with the data. That’s EDA >>> let’s go!!

2. Time series analysis (EDA and stationarity testing). Stepping back before splitting any data, we shouldn’t skip EDA.

image by author

The data has 2 columns, “date” and “milk” (sales of milk chocolate), recorded sequentially within each day. I explored the data at 3 levels:

  • resample by day (‘D’) >>> to get the sum of sales per day
  • add ‘year’, ‘month’ and ‘week_of_year’ features >>> for further exploration
  • visualize whole-year sales >>> an overview of sales by day across a year.
import pandas as pd
import calplot

df['date'] = pd.to_datetime(df['date'])
df.set_index('date', inplace=True)

df_day = df.resample('D').sum()
df_day = df_day.reset_index(drop=False)

df_day['year'] = df_day['date'].dt.year
df_day['month'] = df_day['date'].dt.month
# .dt.week is deprecated; use .dt.isocalendar().week instead
df_day['week_of_year'] = df_day['date'].dt.isocalendar().week

df_day_y21 = df_day[df_day['year'] == 2021]
df_day_y21.set_index('date', inplace=True)
calplot.calplot(df_day_y21['milk'], edgecolor=None, cmap='afmhot_r')

df_day_y22 = df_day[df_day['year'] == 2022]
df_day_y22.set_index('date', inplace=True)
calplot.calplot(df_day_y22['milk'], edgecolor=None, cmap='afmhot_r')

As shown above for both Y21 and Y22, sales were active in roughly three periods of the year. The first was in Feb, the second, somewhat ambiguous, ran from May (or Jun) until Aug, and the last ran from Nov until Dec. The next exploration is monthly:

  • resample by month (‘M’) >>> to get the sum of sales per month
  • also add ‘year’, ‘month’ and ‘week_of_year’ features
  • visualize whole-year sales >>> compare sales of Y21 and Y22.
df_month = df.resample('M').sum()
df_month = df_month.reset_index(drop=False)

df_month['year'] = df_month['date'].dt.year
df_month['month'] = df_month['date'].dt.month
df_month['week_of_year'] = df_month['date'].dt.isocalendar().week

eda_df = df_month.pivot(index='month', columns='year', values='milk')
eda_df.plot(kind='line')
eda_df.plot(kind='bar')

As shown above, the trend lines are not very different; sales were quite stable at around 4500–5000 units. I assume abnormal sales happened in Oct ’21. The last exploration is weekly:

  • resample by week (‘W’) >>> to get the sum of sales per week
  • also add ‘year’, ‘month’ and ‘week_of_year’ features
  • visualize whole-year sales >>> the continuity of sales week by week.
import matplotlib.pyplot as plt

df_week = df.resample('W').sum()
df_week = df_week.reset_index(drop=False)

df_week['year'] = df_week['date'].dt.year
df_week['month'] = df_week['date'].dt.month
df_week['week_of_year'] = df_week['date'].dt.isocalendar().week

df_week.set_index('date', inplace=True)
Cmilk_df = df_week[['milk']]

Cmilk_df.plot(figsize=(10, 3))
plt.show()

As shown above, weekly sales across the two years were quite stable at around 1000–1500 units per week. From the line plot we can roughly judge that this data is stationary, but to make it a formal conclusion we next test stationarity with the ADF test.

The code:

from statsmodels.tsa.stattools import adfuller

result = adfuller(Cmilk_df.values.flatten())

print('ADF Statistic: %f' % result[0])
print('p-value: %f' % result[1])

The p-value was well below the 0.05 significance level, so we can reject the null hypothesis and conclude that the series is stationary.

This test result means our model doesn’t need additional transformers to fix non-stationarity, such as the Deseasonalizer and Detrender included in sktime.

ep#1 conclusion

From the start up to this step, we have dealt with a single series: transformed it to get ready for a supervised task, explored and understood the character of the data through 3 levels of EDA (day, month and week), and tested stationarity (with the ADF test).

Therefore we are ready to throw many models at the data and see how each one predicts our next sales. I will pick one good model, make it robust, and learn how to fine-tune it.

>>>>> Let’s go to ep#2 (the complete notebook code is in ep#2) >>>>>> Thx.
