Regression on Price Promotion data
How to use a regression model to predict offtake (sales_out) from sales_in, store count, and promotion data in the retail business.
Hello, it's me again, talking about sales forecasting/prediction in retail. I have worked in this domain for a long time, and sales always comes first.
In previous articles, I covered sales forecasting on a single (univariate) sales series with the sktime package, and sales forecasting with deep learning techniques.
In this article I will work on tabular data (a table, a DataFrame, whatever you call it), which means we work with multivariate data.
1. Data set
I have tabular train and test data (the test set is kept as unseen data) in CSV files. It contains: date, sales_in, number of stores, and offtake (average sales_out per store) as our target, all numerical, plus one categorical column, pro (price_promotion).
My interest in this project was the relation between sales_in and offtake, and between stores and offtake. How are they related? Should a larger number of stores give higher offtake? Should higher sales_in give higher offtake as well? Is that correct? That is what I would like to explore in this project. Let's go together!
After loading and some data cleaning, my first step was simply to generate more useful features from the existing ones, such as extracting year, month, and week_of_year from the date, and creating dummy variables from the categorical column.
Now we have expanded from the original 5 columns to 12 feature columns.
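As a sketch, the date expansion and dummy encoding can look like this (the toy rows and the exact column names, such as 'pro', are my assumptions based on the description above):

```python
import pandas as pd

# Toy stand-in for the training table described in the article
df = pd.DataFrame({
    "date": pd.to_datetime(["2021-01-04", "2021-06-14", "2021-12-20"]),
    "sales_in": [100, 120, 90],
    "stores": [50, 48, 45],
    "offtake": [8.0, 9.5, 7.2],
    "pro": ["RSP", "2for", "RSP"],
})

# Expand the date column into calendar features
df["year"] = df["date"].dt.year
df["month"] = df["date"].dt.month
df["week_of_year"] = df["date"].dt.isocalendar().week.astype(int)

# One-hot encode the categorical promotion column
df = pd.get_dummies(df, columns=["pro"], prefix="pro")
```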
2. Data Visualization
Now that the data has no missing values, we can explore it to gain more insight before building any model.
We start with our target, offtake, along with stores and sales_in.
- offtake: stable, moving up and down roughly in the range of 5–12
- stores: the trend line was noticeably declining
- sales_in: seemed stable, but with 3 strange peaks that needed investigation
For the 3 peaks in sales_in, I decided to treat them as outliers to remove, so the new chart above shows sales_in after outlier removal.
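The article does not say which removal rule was used; one common choice is an IQR fence, sketched here on made-up values:

```python
import pandas as pd

# Toy sales_in series with 3 suspicious peaks (values are made up)
sales_in = pd.Series([100, 105, 98, 102, 99, 500, 101, 97, 480, 103, 520, 100])

# Keep only points inside the 1.5 * IQR fence; the rest are flagged as outliers
q1, q3 = sales_in.quantile(0.25), sales_in.quantile(0.75)
iqr = q3 - q1
cleaned = sales_in[sales_in.between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)]
```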
For the categorical column ['Pro'], a countplot shows the usual promotion mix: the normal price (RSP) was the most frequent, followed by the other promotion levels.
In this article we will cut out the heavy promotion levels such as BOGO (buy 1 get 1) and buy2+1, keeping only the normal price-discount levels.
Heavy promotion is predictable: offtake will be high. We want to explore what we cannot expect or assume from intuition, so we cut it out! (In other words, this is where we need the power of machine learning.)
3. Feature checking
To describe each feature with statistical measurements, we inspected: 'types', 'counts', 'distincts', 'nulls', 'missing_ratio', 'uniques', 'skewness', 'kurtosis', and the correlation with the target. The code below builds this summary table.
obs = df_train.shape[0]
types = df_train.dtypes
counts = df_train.count()
distincts = df_train.nunique()
nulls = df_train.isnull().sum()
missing_ratio = (df_train.isnull().sum() / obs) * 100
uniques = df_train.apply(lambda x: x.unique().tolist())
skewness = df_train.skew(numeric_only=True)
kurtosis = df_train.kurt(numeric_only=True)
corr = df_train.corr(numeric_only=True)['offtake']
corr_col = 'corr offtake'
# name the table 'summary' rather than 'str', which shadows the Python builtin
summary = pd.concat([types, counts, distincts, nulls, missing_ratio,
                     uniques, skewness, kurtosis, corr], axis=1, sort=False)
summary.columns = ['types', 'counts', 'distincts', 'nulls', 'missing_ratio',
                   'uniques', 'skewness', 'kurtosis', corr_col]
summary.sort_values(by=corr_col, ascending=False)
'corr offtake' shows the correlation between each feature and the target, a number in the range of -1 to 1.
From the feature-checking table, we see that pro_2for had the strongest positive correlation with offtake, and pro_RSP the strongest negative one. Our features of interest were weaker: sales_in was slightly negative and stores moderately positive.
This chart shows the correlations, easy to see and summarize. Our two main variables of focus, Sales_in and Stores, were at 0.03 and 0.20.
The scatterplots support the scores of 0.03 and 0.20 for Sales_in and Stores.
Up to this step, we can roughly conclude that offtake rises when promotions run and falls back at the RSP price, while our features of interest, Sales_in and Stores, have little effect on offtake. It seems like a simple conclusion, but it makes sense and is practical in the real world. The question at this point is how to use these features to predict offtake, and which feature is the most important by score.
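One way to ask the "which feature matters most" question from the correlation side is to rank features by the absolute value of their correlation with the target. This sketch uses toy random data, not the article's numbers:

```python
import numpy as np
import pandas as pd

# Toy data standing in for the cleaned training table
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(100, 3)),
                  columns=["sales_in", "stores", "offtake"])

# Correlation keeps the sign; ranking by absolute value orders by strength
corr = df.corr()["offtake"].drop("offtake")
ranked = corr.abs().sort_values(ascending=False)
```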
4. Preprocessing
Any model needs correctly formatted data to train on, so preprocessing is needed.
This article has 2 data sets, train and test. We import and do all the work on the train set and keep the test set as unseen data for testing the final model.
Working on the train data, we start by splitting it into train and validation sets. In this project the data was small, just 102 records, so I decided to split it as train[:90] and valid[90:].
It is better to check the variation of the target (y) in both the train and validation sets; they should be close and behave similarly, as shown below.
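The sequential split can be sketched like this (90 records for train, the remaining 12 for validation; the values here are a stand-in, not the real table):

```python
import numpy as np
import pandas as pd

# Stand-in for the 102-record training table described above
df_train = pd.DataFrame({"offtake": np.linspace(5, 12, 102)})

# Keep time order: no shuffling, just slice the first 90 rows for training
train, valid = df_train.iloc[:90], df_train.iloc[90:]
```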
Standardization and normalization
Standardizing centers the features around 0 with a standard deviation of 1. This is important when we compare measurements that have different units: variables measured at different scales do not contribute equally to the analysis and may end up creating a bias.
The goal of normalization is to change the values of numeric columns in the dataset to a common scale, without distorting differences in the ranges of values. Not every dataset requires normalization; it is needed only when features have different ranges.
The article above is a good explanation of standardization and normalization.
For my project, I used MinMaxScaler() for normalization and StandardScaler() for standardization.
Again, a distplot to check the variation of X in both the train and validation sets after standardizing and normalizing. It looked good.
Now we have X_train_std ready to fit into the model.
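A minimal scaling sketch, fitting on the train set only and reusing the fitted scaler on validation (the feature values here are made up):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X_train = np.array([[100.0, 50.0], [120.0, 48.0], [90.0, 45.0], [110.0, 47.0]])
X_valid = np.array([[105.0, 49.0], [95.0, 46.0]])

# StandardScaler: zero mean, unit variance per column (fit on train only)
std = StandardScaler().fit(X_train)
X_train_std = std.transform(X_train)
X_valid_std = std.transform(X_valid)

# MinMaxScaler: rescale each column into the [0, 1] range
mm = MinMaxScaler().fit(X_train)
X_train_mm = mm.transform(X_train)
```

Fitting only on the train set and transforming the validation set with the same scaler avoids leaking validation statistics into training.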
5. Model & Measure regression performance
As the headline "Regression" says, I will explore regression models to predict the sales. In total I built 5 tree-based models:
- DecisionTreeRegressor (as baseline)
- DecisionTreeRegressor + TimeSeriesSplit Validation
- DecisionTreeRegressor + TimeSeriesSplit Validation + more params
- RandomForestRegressor + TimeSeriesSplit Validation + more params
- XGBoost + TimeSeriesSplit Validation + more params
For metric evaluation, I chose R-squared for the score (higher is better) and MAE (mean_absolute_error) for the error (lower is better).
Let me show some interesting parts and compare them for you.
1. DecisionTreeRegressor (as baseline) predicted as shown below.
2. DecisionTreeRegressor + TimeSeriesSplit Validation.
For the second model we add TimeSeriesSplit validation on top of the baseline. (I decided to split sequentially and not shuffle the data; to me that makes sense and is practical when using the model and explaining it to my team and boss.)
cv_ls is a variable containing the train/validation splits. I set n_splits to 5, dividing the whole data into 5 folds, and test_size to 10, meaning 10 data points are held out for validation in each fold, stepping forward in the next fold.
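A sketch of that split setup with scikit-learn's TimeSeriesSplit:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(90).reshape(-1, 1)  # 90 sequential training records

# 5 folds, each validating on the next 10 points; no shuffling
tscv = TimeSeriesSplit(n_splits=5, test_size=10)
cv_ls = list(tscv.split(X))
```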
In both 1 and 2, the model performed well on the train set with an R-squared above 0.90, but badly on validation, with a negative R-squared and large errors. We need a better model.
3. DecisionTreeRegressor + TimeSeriesSplit Validation + more params
It improved, but the error was still high. By eye we can see 2 data points in validation with a big gap, while the other points were good.
4. RandomForestRegressor + TimeSeriesSplit Validation +more params
RandomForest closed one of the 2 big gaps above, and the error dropped below 1.0. Good, but we explored further.
5. XGBoost + TimeSeriesSplit Validation + more params
XGBoost beat RandomForest in both R-squared and error on validation, and likewise on the train set. That makes XGBoost the best model in this project!
Result summary for each model, on the validation set only:
- model_1: R-squared = -0.218 | MAE = 1.372
- model_2: R-squared = -0.397 | MAE = 1.527
- model_3: R-squared = 0.149 | MAE = 1.126
- model_4: R-squared = 0.248 | MAE = 0.947
- model_5: R-squared = 0.337 | MAE = 0.866 (Best Model)
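The comparison pattern behind those numbers can be sketched as below. I use scikit-learn regressors as a stand-in, since XGBRegressor follows the same fit/predict interface; the data and scores here are synthetic, not the article's:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score, mean_absolute_error

# Synthetic data shaped like the project: 102 rows, a few features
rng = np.random.default_rng(42)
X = rng.normal(size=(102, 5))
y = 2.0 * X[:, 0] + rng.normal(scale=0.5, size=102)

# Sequential split, as in the article: first 90 rows train, last 12 validate
X_tr, X_va, y_tr, y_va = X[:90], X[90:], y[:90], y[90:]

results = {}
for name, model in [
    ("DecisionTree", DecisionTreeRegressor(max_depth=3, random_state=0)),
    ("RandomForest", RandomForestRegressor(n_estimators=100, random_state=0)),
]:
    model.fit(X_tr, y_tr)
    pred = model.predict(X_va)
    results[name] = (r2_score(y_va, pred), mean_absolute_error(y_va, pred))
```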
Feature Importance
The next step is to use XGBoost, our best model, to get the importance scores (.feature_importances_).
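Reading the scores can look like this. I show RandomForestRegressor as a stand-in since XGBRegressor exposes the same .feature_importances_ attribute; the data is synthetic, built so pro_RSP dominates:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic data where the target depends mostly on pro_RSP
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(100, 3)),
                 columns=["pro_RSP", "sales_in", "stores"])
y = -2.0 * X["pro_RSP"] + rng.normal(scale=0.1, size=100)

model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)
# Importances sum to 1; sort to rank the features
importances = pd.Series(model.feature_importances_, index=X.columns)
importances = importances.sort_values(ascending=False)
```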
RSP and pro_2for were our top 2 features, matching the assumption we made from their correlations earlier. Our 2 main features of interest, Sales_in and Stores, scored low, also matching our assumption.
Anyway, it makes sense. Lastly, we take the best model to predict on the unseen data (test set) and measure model performance with the same metrics.
6. Predict on Unseen data (test set)
We have 10 records ready for the final test of our best model, XGBoost, but first we need to run the data through the same manipulation process as the train data... the complete test set is shown below.
Next, the preprocessing step to get the data ready for prediction.
By eye, there is no big gap between the test and predicted data, and the scores were R-squared = 0.635 | MAE = 0.575 on the unseen (test) set: good, and even better than the train and validation results of model_5.
Conclusion
- We explored multivariate data (not just a single series): how to predict offtake from sales_in, stores, and promotion data in the retail business domain.
- Correlation is important for a rough look at the data, and for testing questions and setting assumptions against our domain experience.
- Validation is important for any model.
- Tree-based regressors are useful and flexible.
Next: the data set in this project was very small. To make the model more and more powerful, more training data is needed (at least 1000 train records to 100 test). Anyway, these predictions are already used in my real work, so this was enough for me from this project ^^
Lastly, I have an idea for the next project: convert the promotion price into a binary class, 1 (promotion) or 0 (no promotion), then build a classification model to classify which class each record belongs to.
Hope to see you in the next project: Classification on Price Promotion data.
>> full code here https://github.com/MossMojito/Regression
Thanks for reading :)