Sales Forecast with Only Monthly Sales Data [Deep Learning Time Series Forecasting]

Patiparn Nualchan
9 min read · Dec 29, 2022


Get to understand deep learning time series forecasting techniques. Let's talk!

source : https://www.investopedia.com/terms/f/forecasting.asp

Almost all of this content I digested from the well-known forecasting book "Deep Learning for Time Series Forecasting" by Jason Brownlee.

I condensed it to make it easy for me to consume, and I hope it can help you too 😄

Part I “Start with Data”

I will start with the data. In my experience, starting with the data helps you see the overview, and if you are familiar with the model you will realise that the data input is as important as model tuning or the other techniques that make your model robust.

What about "only monthly sales data"? Please see below.

There are no features X1, X2, X3, … that you can take to train a model, as you are used to doing for a regression model. This kind of data is called "sequential data" (a sequence of values by time step: day/month/year). This data has its own set of suitable techniques, called "time series forecasting".
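For instance, a minimal sketch of what such data looks like (hypothetical numbers):

from pandas import Series, date_range

# hypothetical monthly sales: one value per month, no extra feature columns
months = date_range(start='2010-01', periods=6, freq='MS')
sales = Series([1200, 1350, 980, 1500, 1620, 1100], index=months)
print(sales)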

[0] Jason Brownlee: Time series forecasting classes of methods that you can design experiments around include the following:

1. Baseline. Simple forecasting methods such as persistence and averages (see the sketch after this list).

2. Autoregression. The Box-Jenkins process and methods such as SARIMA.

3. Exponential Smoothing. Single, double and triple exponential smoothing methods.

4. Linear Machine Learning. Linear regression methods and variants such as regularization.

5. Nonlinear Machine Learning. kNN, decision trees, support vector regression and more.

6. Ensemble Machine Learning. Random forest, gradient boosting, stacking and more.

7. Deep Learning. MLPs, CNNs, LSTMs, and Hybrid models.
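As a concrete example of class 1, a persistence baseline simply carries the last observed value forward; a minimal sketch (hypothetical numbers):

# persistence baseline: the forecast is the last observed value
def persistence_forecast(history):
    return history[-1]

history = [1200, 1350, 980, 1500]      # hypothetical monthly sales
print(persistence_forecast(history))   # 1500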

In this article we focus on the deep learning methods.

Part II “Series as Supervised Learning”

As listed above, methods 4 through 7 are supervised learning methods, for which we have to train and test a model. But the data has no features! What should we do? Let's talk about it.

Series as Supervised Learning

If you are confused and wondering how sequential data is able to train a supervised model, I am confident the figure above can clear that up 😍

Supervised learning needs features X and labels y to learn from, and produces a model to predict unseen data. In this step we will look at how to transform sequential data into supervised input data, step by step.

First, we have to split all the data (108 months) into train and test sets. It's simple: just hold out the last 12 months (1 year) as the test set. >>> 96 months as the training set / 12 months as the test set <<<
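In code, a minimal split helper (a sketch of the train_test_split used later in this article) could look like this:

# split a univariate series: hold out the last n_test observations as the test set
def train_test_split(data, n_test):
    return data[:-n_test], data[-n_test:]

# e.g. train, test = train_test_split(data, 12) -> 96 train months / 12 test months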

Second, decide and fix how many features you need (how many prior time steps). >>> In this case we use 24 time steps to predict the next 1 time step <<< Let's talk about the CODE below.

from pandas import DataFrame, concat

# transform a univariate series into a supervised learning dataset
def series_to_supervised(data, n_in=1, n_out=1):
    df = DataFrame(data)
    cols = list()
    # input sequence (t-n, ... t-1)
    for i in range(n_in, 0, -1):
        cols.append(df.shift(i))
    # forecast sequence (t, t+1, ... t+n)
    for i in range(0, n_out):
        cols.append(df.shift(-i))
    # put it all together
    agg = concat(cols, axis=1)
    # drop rows with NaN values
    agg.dropna(inplace=True)
    return agg.values

# n_input = 24 lag observations in our case
data = series_to_supervised(train, n_in=n_input)
train_x, train_y = data[:, :-1], data[:, -1]
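To make the transformation concrete, here is a quick check on a toy series (hypothetical numbers, with n_in=3):

from numpy import array

toy = array([10, 20, 30, 40, 50, 60])
print(series_to_supervised(toy, n_in=3))
# [[10. 20. 30. 40.]
#  [20. 30. 40. 50.]
#  [30. 40. 50. 60.]]
# each row holds 3 lag values (X) followed by the value to predict (y)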

Part III "Deep Learning: MLPs, CNNs, LSTMs, and Hybrid Models" (fit and predict)

From Part II we got train_x and train_y, ready to fit to a model. Let me recap those models as follows.

(1) MLP and (2) CNN

source : https://www.researchgate.net/publication/336358944/figure/fig2/AS:811915186012166@1570587073649/The-common-architecture-of-MLP-and-CNN-designed-for-classification-and-regression-based.png

[1] The common architecture of MLP and CNN designed for classification- and regression-based neural networks. An MLP consists of fully-connected (FC) layers, while a CNN also contains convolution layers and pooling layers besides that.

(3) RNN(LSTM)

source : http://andrisfaesal.blogspot.com/2021/01/perbedaan-cnn-vs-rnn-vs-ann-untuk.html#

RNNs have feedback loops in the recurrent layer. This lets them maintain information in 'memory' over time. But it can be difficult to train standard RNNs to solve problems that require learning long-term temporal dependencies.

LSTM networks are a type of RNN that uses special units in addition to standard units. LSTM units include a ‘memory cell’ that can maintain information in memory for long periods of time. This memory cell lets them learn longer-term dependencies.

LSTMs deal with the vanishing and exploding gradient problem by introducing new gates, such as input and forget gates, which allow better control over the gradient flow and enable better preservation of "long-range dependencies".

source : https://ashutoshtripathi.com/2021/07/02/what-is-the-main-difference-between-rnn-and-lstm-nlp-rnn-vs-lstm/

(4) CNN-LSTM: a CNN that learns input features and an LSTM that interprets them.

(5) ConvLSTM: a combination of CNNs and LSTMs where the LSTM units read input data using the convolutional process of a CNN.

Training set: used to fit the models

As above, we will compare 5 models, (1) MLP, (2) CNN, (3) LSTM, (4) CNN-LSTM, (5) ConvLSTM, fitted to train_x and train_y, as we have already done the splitting and the series-to-supervised data transformation. Let's talk about the CODE below.

from numpy import array
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Flatten, LSTM, TimeDistributed
from tensorflow.keras.layers import Conv1D, MaxPooling1D, ConvLSTM2D

# fit an MLP model
def mlp_model_fit(train, config):
    # unpack config
    n_input, n_nodes, n_epochs, n_batch = config
    # prepare data
    data = series_to_supervised(train, n_in=n_input)
    train_x, train_y = data[:, :-1], data[:, -1]
    # define model
    model = Sequential()
    model.add(Dense(n_nodes, activation='relu', input_dim=n_input))
    model.add(Dense(1))
    model.compile(loss='mse', optimizer='adam')
    # fit
    model.fit(train_x, train_y, epochs=n_epochs, batch_size=n_batch, verbose=0)
    return model

# fit a CNN model
def cnn_model_fit(train, config):
    # unpack config
    n_input, n_filters, n_kernel, n_epochs, n_batch = config
    # prepare data
    data = series_to_supervised(train, n_in=n_input)
    train_x, train_y = data[:, :-1], data[:, -1]
    train_x = train_x.reshape((train_x.shape[0], train_x.shape[1], 1))
    # define model
    model = Sequential()
    model.add(Conv1D(filters=n_filters, kernel_size=n_kernel, activation='relu', input_shape=(n_input, 1)))
    model.add(Conv1D(filters=n_filters, kernel_size=n_kernel, activation='relu'))
    model.add(MaxPooling1D(pool_size=2))
    model.add(Flatten())
    model.add(Dense(1))
    model.compile(loss='mse', optimizer='adam')
    # fit
    model.fit(train_x, train_y, epochs=n_epochs, batch_size=n_batch, verbose=0)
    return model

# fit an LSTM model
def lstm_model_fit(train, config):
    # unpack config
    n_input, n_nodes, n_epochs, n_batch, n_diff = config
    # prepare data (optionally difference the series first)
    if n_diff > 0:
        train = difference(train, n_diff)
    data = series_to_supervised(train, n_in=n_input)
    train_x, train_y = data[:, :-1], data[:, -1]
    train_x = train_x.reshape((train_x.shape[0], train_x.shape[1], 1))
    # define model
    model = Sequential()
    model.add(LSTM(n_nodes, activation='relu', input_shape=(n_input, 1)))
    model.add(Dense(n_nodes, activation='relu'))
    model.add(Dense(1))
    model.compile(loss='mse', optimizer='adam')
    # fit
    model.fit(train_x, train_y, epochs=n_epochs, batch_size=n_batch, verbose=0)
    return model

# fit a CNN-LSTM model
def cl_model_fit(train, config):
    # unpack config
    n_seq, n_steps, n_filters, n_kernel, n_nodes, n_epochs, n_batch = config
    n_input = n_seq * n_steps
    # prepare data
    data = series_to_supervised(train, n_in=n_input)
    train_x, train_y = data[:, :-1], data[:, -1]
    train_x = train_x.reshape((train_x.shape[0], n_seq, n_steps, 1))
    # define model (input_shape goes on the TimeDistributed wrapper, not the inner Conv1D)
    model = Sequential()
    model.add(TimeDistributed(Conv1D(filters=n_filters, kernel_size=n_kernel, activation='relu'), input_shape=(None, n_steps, 1)))
    model.add(TimeDistributed(Conv1D(filters=n_filters, kernel_size=n_kernel, activation='relu')))
    model.add(TimeDistributed(MaxPooling1D(pool_size=2)))
    model.add(TimeDistributed(Flatten()))
    model.add(LSTM(n_nodes, activation='relu'))
    model.add(Dense(n_nodes, activation='relu'))
    model.add(Dense(1))
    model.compile(loss='mse', optimizer='adam')
    # fit
    model.fit(train_x, train_y, epochs=n_epochs, batch_size=n_batch, verbose=0)
    return model

# fit a ConvLSTM model
def conl_model_fit(train, config):
    # unpack config
    n_seq, n_steps, n_filters, n_kernel, n_nodes, n_epochs, n_batch = config
    n_input = n_seq * n_steps
    # prepare data
    data = series_to_supervised(train, n_in=n_input)
    train_x, train_y = data[:, :-1], data[:, -1]
    train_x = train_x.reshape((train_x.shape[0], n_seq, 1, n_steps, 1))
    # define model
    model = Sequential()
    model.add(ConvLSTM2D(filters=n_filters, kernel_size=(1, n_kernel), activation='relu', input_shape=(n_seq, 1, n_steps, 1)))
    model.add(Flatten())
    model.add(Dense(n_nodes, activation='relu'))
    model.add(Dense(1))
    model.compile(loss='mse', optimizer='adam')
    # fit
    model.fit(train_x, train_y, epochs=n_epochs, batch_size=n_batch, verbose=0)
    return model
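Note that the difference helper used in lstm_model_fit is not shown in this excerpt; a minimal sketch, following the book's helper conventions, could be:

# difference a series by a given interval (used when n_diff > 0)
def difference(data, interval):
    return [data[i] - data[i - interval] for i in range(interval, len(data))]

(When differencing is used, the prediction must later be inverted by adding back the value from interval steps earlier.)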

Test set: used to make predictions

Now that we have created the functions (def) for all 5 models, we are going to talk about the prediction part, which uses the test set. But first of all, let's recap our test set, how complex it is, and how to get it ready to use in model.predict().

>>> split: 96 months as the training set / 12 months as the test set <<<

>>> use 24 time steps to predict the next 1 time step <<<

It means each of the 12 test months is predicted from the 24 time steps prior to it; see below.

As you can see, our 12-time-step test set is test_y, which we will use to measure the prediction error. For test_X, we have to generate the 24 prior time steps for each of the 12 test points (test_y) by looping each observation back into the history, then slicing the history and reshaping it into x_input. Yes, test_X gets the name x_input! Let's see more in the CODE below.

# forecast with a pre-fit model
def model_predict(model, history, config):
    # unpack config (this matches the 5-element LSTM config; other models differ)
    n_input, _, _, _, _ = config
    # prepare data: the last n_input observations become the model input
    x_input = array(history[-n_input:]).reshape((1, n_input, 1))
    # forecast
    yhat = model.predict(x_input, verbose=0)
    return yhat[0]

There is a difference in input dimensionality that you need to know for each model, so in terms of the code to prepare the data, please see below.

# mlp: 2D input shape
x_input = array(history[-n_input:]).reshape(1, n_input)

# cnn: 3D input shape
train_x = train_x.reshape((train_x.shape[0], train_x.shape[1], 1))
x_input = array(history[-n_input:]).reshape((1, n_input, 1))

# lstm: 3D input shape
train_x = train_x.reshape((train_x.shape[0], train_x.shape[1], 1))
x_input = array(history[-n_input:]).reshape((1, n_input, 1))

# cnn-lstm: 4D input shape
train_x = train_x.reshape((train_x.shape[0], n_seq, n_steps, 1))
x_input = array(history[-n_input:]).reshape((1, n_seq, n_steps, 1))

# convlstm: 5D input shape
train_x = train_x.reshape((train_x.shape[0], n_seq, 1, n_steps, 1))
x_input = array(history[-n_input:]).reshape((1, n_seq, 1, n_steps, 1))

[0] Jason Brownlee: CNN-LSTM, the number of lag observations per sample is simply (n_seq * n_steps). This is now a 4-dimensional input array with the dimensions: [samples, subsequences, timesteps, features].

[0] Jason Brownlee: ConvLSTM, this type of model is called a Convolutional LSTM, or ConvLSTM for short. It is provided in Keras as a layer called ConvLSTM2D for 2D data. We can configure it for use with 1D sequence data by assuming that we have one row with multiple columns. As with the CNN-LSTM, the input data is split into subsequences where each subsequence has a fixed number of time steps, although we must also specify the number of rows in each subsequence, which in this case is fixed at 1. The shape is five-dimensional, with the dimensions: [samples, subsequences, rows, columns, features].
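A quick sanity check of those reshapes, assuming the example values n_input=24, n_seq=2, n_steps=12 (so n_seq * n_steps = n_input):

from numpy import array

n_input, n_seq, n_steps = 24, 2, 12    # assumed example values
history = list(range(100))             # dummy history

print(array(history[-n_input:]).reshape((1, n_input, 1)).shape)            # (1, 24, 1): cnn/lstm
print(array(history[-n_input:]).reshape((1, n_seq, n_steps, 1)).shape)     # (1, 2, 12, 1): cnn-lstm
print(array(history[-n_input:]).reshape((1, n_seq, 1, n_steps, 1)).shape)  # (1, 2, 1, 12, 1): convlstm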

Prediction

Now we have 5 models that are already fit, and we understand the x_input (test_X), so the next step is to predict. Let's see the CODE below.

# walk-forward validation for univariate data
def walk_forward_validation(data, n_test, cfg, model_fit=cnn_model_fit):
    # model_fit is swapped per experiment: mlp_model_fit, cnn_model_fit, ...
    predictions = list()
    # split dataset
    train, test = train_test_split(data, n_test)
    # fit model
    model = model_fit(train, cfg)
    # seed history with training dataset
    history = [x for x in train]
    # step over each time-step in the test set
    for i in range(len(test)):
        # make a forecast from the current history
        yhat = model_predict(model, history, cfg)
        # store forecast in list of predictions
        predictions.append(yhat)
        # add actual observation to history for the next loop
        history.append(test[i])
    # estimate prediction error
    error = measure_rmse(test, predictions)
    print(' > %.3f' % error)
    return error

# repeat evaluation of a config
def repeat_evaluate(data, config, n_test, n_repeats=30):
    # fit and evaluate the model n times
    scores = [walk_forward_validation(data, n_test, config) for _ in range(n_repeats)]
    return scores
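The measure_rmse helper used in walk_forward_validation is also not shown in this excerpt; a minimal sketch:

from math import sqrt
from sklearn.metrics import mean_squared_error

# root mean squared error between the test set and the predictions
def measure_rmse(actual, predicted):
    return sqrt(mean_squared_error(actual, predicted))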

We predict through walk-forward validation, which is a cross-validation technique for time-series data, and we repeat the evaluation n = 30 times.
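To compare models across the 30 repeats, a small summary helper (an assumed convenience, in the spirit of the book) is handy:

from numpy import mean, std

# summarize the distribution of RMSE scores from repeated evaluation
def summarize_scores(name, scores):
    print('%s: %.3f RMSE (+/- %.3f)' % (name, mean(scores), std(scores)))

# hypothetical usage:
# scores = repeat_evaluate(data, config, n_test=12)
# summarize_scores('cnn', scores)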

Prediction Result

By eye, we can see they all trend the same way, but we can't definitively separate them or measure which is the best model. To know that, we have to evaluate them and measure the prediction error numerically.

Part IV "Evaluation": measure prediction error

Any model needs evaluation to measure its error and decide whether the model is good enough to use. RMSE is one of the most popular measurements.
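For reference, over $n$ test points with actuals $y_i$ and forecasts $\hat{y}_i$:

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}$$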

The bar chart shows that the lowest RMSE_Score was cnn_predict, from the CNN model. It was very close to the MLP model, at around 1500. The RMSE_Score of the more complex models was higher, which means adding more complexity does not guarantee the best model (lowest error); it depends on the data and the problem as well.

Part V "Summary and Future Work"

First, let's plot the best model (CNN) against the actuals. Please see below.

  • We understood and learned how to transform time series data into supervised data (you can solve the problem with machine learning once the data is already in supervised form, e.g. ensemble machine learning: random forest, gradient boosting, stacking and more…)
  • In this project, I decided to talk about one-step prediction, but in the real world multi-step forecasting is more practical.
  • We will pick up our best model, CNN, and move on to learn about multi-step forecasting with a machine learning model added (XGBoost).

Thank you for your attention, from the beginning all the way to the end.

[0] Jason Brownlee, "Deep Learning for Time Series Forecasting".
[1] "Robust Sub-meter Level Indoor Localization with a Single WiFi Access Point — Regression versus Classification".
