Regression on Price Promotion data
How to use a regression model to predict offtake (sales_out) from sales_in, store count, and promotion data in the retail business.
Hello, it's me again, talking about sales forecasting/prediction in retail. I have worked in this domain for a long time, and sales always comes first.
In previous articles, I covered sales forecasting on a single (univariate) sales series with the sktime package, and sales forecasting with deep learning techniques.
In this article I will work on tabular data (a table, a DataFrame, whatever you call it), which means we work with multivariate data.
1. Data set
I have tabular train and test data (the test set is kept as unseen data) in CSV files. It contains: date, sales_in, number of stores, and offtake (average sales_out per store) as our target, all numerical, plus one categorical column, pro (price_promotion).
My interest in this project was the relation between sales_in and offtake, and between stores and offtake. How are they related? Should a larger number of stores give higher offtake? Should higher sales_in give higher offtake as well? Is that correct? That is what I would like to explore in this project. Let's go together!
After loading and some data cleaning, my first step was simply to generate more useful features from the existing ones, such as extracting year, month, and week_of_year from the date, and creating dummy variables from the categorical column.
Now we have expanded from the original 5 columns to 12 feature columns.
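As a sketch, the date expansion and dummy encoding can look like this (the toy rows and the exact column names, such as 'pro', are my assumptions based on the description above):

```python
import pandas as pd

# Toy stand-in for the training table described in the article
df = pd.DataFrame({
    "date": pd.to_datetime(["2021-01-04", "2021-06-14", "2021-12-20"]),
    "sales_in": [100, 120, 90],
    "stores": [50, 48, 45],
    "offtake": [8.0, 9.5, 7.2],
    "pro": ["RSP", "2for", "RSP"],
})

# Expand the date column into calendar features
df["year"] = df["date"].dt.year
df["month"] = df["date"].dt.month
df["week_of_year"] = df["date"].dt.isocalendar().week.astype(int)

# One-hot encode the categorical promotion column
df = pd.get_dummies(df, columns=["pro"], prefix="pro")
```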
2. Data Visualization
Now that the data has no missing values, we can explore it to gain more insight before building any model.
We start with our target, offtake, along with stores and sales_in.
- offtake: stable, moving up and down roughly in the range of 5–12
- stores: the trend line was noticeably declining
- sales_in: seemed stable, but with 3 strange peaks that needed investigation
For the 3 peaks in sales_in, I decided to treat them as outliers to remove, so the new chart above shows sales_in after outlier removal.
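The article does not say which removal rule was used; one common choice is an IQR fence, sketched here on made-up values:

```python
import pandas as pd

# Toy sales_in series with 3 suspicious peaks (values are made up)
sales_in = pd.Series([100, 105, 98, 102, 99, 500, 101, 97, 480, 103, 520, 100])

# Keep only points inside the 1.5 * IQR fence; the rest are flagged as outliers
q1, q3 = sales_in.quantile(0.25), sales_in.quantile(0.75)
iqr = q3 - q1
cleaned = sales_in[sales_in.between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)]
```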
For the categorical column ['Pro'], a countplot shows the usual promotion mix: the normal price (RSP) was the most frequent, followed by the other promotion levels.
In this article we will cut out the heavy promotion levels such as BOGO (buy 1 get 1) and buy2+1, keeping only the normal price-discount levels.
Heavy promotion is predictable: offtake will be high. We want to explore what we cannot expect or assume from intuition, so we cut it out! (In other words, this is where we need the power of machine learning.)
3. Feature checking
To describe each feature with statistical measurements, we inspected: 'types', 'counts', 'distincts', 'nulls', 'missing_ratio', 'uniques', 'skewness', 'kurtosis', and the correlation with the target. The code below builds this summary table.
obs = df_train.shape[0]
types = df_train.dtypes
counts = df_train.count()
distincts = df_train.nunique()
nulls = df_train.isnull().sum()
missing_ratio = (df_train.isnull().sum() / obs) * 100
uniques = df_train.apply(lambda x: x.unique().tolist())
skewness = df_train.skew(numeric_only=True)
kurtosis = df_train.kurt(numeric_only=True)
corr = df_train.corr(numeric_only=True)['offtake']
corr_col = 'corr offtake'
# name the table 'summary' rather than 'str', which shadows the Python builtin
summary = pd.concat([types, counts, distincts, nulls, missing_ratio,
                     uniques, skewness, kurtosis, corr], axis=1, sort=False)
summary.columns = ['types', 'counts', 'distincts', 'nulls', 'missing_ratio',
                   'uniques', 'skewness', 'kurtosis', corr_col]
summary.sort_values(by=corr_col, ascending=False)
'corr offtake' shows the correlation between each feature and the target, a number in the range of -1 to 1.
From the feature-checking table, we see that pro_2for had the strongest positive correlation with offtake, and pro_RSP the strongest negative one. Our features of interest were weaker: sales_in was slightly negative and stores moderately positive.
This chart shows the correlations, easy to see and summarize. Our two main variables of focus, Sales_in and Stores, were at 0.03 and 0.20.
The scatterplots support the scores of 0.03 and 0.20 for Sales_in and Stores.
Up to this step, we can roughly conclude that offtake rises when promotions run and falls back at the RSP price, while our features of interest, Sales_in and Stores, have little effect on offtake. It seems like a simple conclusion, but it makes sense and is practical in the real world. The question at this point is how to use these features to predict offtake, and which feature is the most important by score.
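One way to ask the "which feature matters most" question from the correlation side is to rank features by the absolute value of their correlation with the target. This sketch uses toy random data, not the article's numbers:

```python
import numpy as np
import pandas as pd

# Toy data standing in for the cleaned training table
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(100, 3)),
                  columns=["sales_in", "stores", "offtake"])

# Correlation keeps the sign; ranking by absolute value orders by strength
corr = df.corr()["offtake"].drop("offtake")
ranked = corr.abs().sort_values(ascending=False)
```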
4. Preprocessing
Any model needs correctly formatted data to train on, so preprocessing is needed.
This article has 2 data sets, train and test. We import and do all the work on the train set and keep the test set as unseen data for testing the final model.
Working on the train data, we start by splitting it into train and validation sets. In this project the data was small, just 102 records, so I decided to split it as train[:90] and valid[90:].
It is better to check the variation of the target (y) in both the train and validation sets; they should be close and behave similarly, as shown below.
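The sequential split can be sketched like this (90 records for train, the remaining 12 for validation; the values here are a stand-in, not the real table):

```python
import numpy as np
import pandas as pd

# Stand-in for the 102-record training table described above
df_train = pd.DataFrame({"offtake": np.linspace(5, 12, 102)})

# Keep time order: no shuffling, just slice the first 90 rows for training
train, valid = df_train.iloc[:90], df_train.iloc[90:]
```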
Standardization and normalization
Standardizing centers the features around 0 with a standard deviation of 1. This is important when we compare measurements that have different units: variables measured at different scales do not contribute equally to the analysis and may end up creating a bias.
The goal of normalization is to change the values of numeric columns in the dataset to a common scale, without distorting differences in the ranges of values. Not every dataset requires normalization; it is needed only when features have different ranges.
The article above is a good explanation of standardization and normalization.
For my project, I used MinMaxScaler() for normalization and StandardScaler() for standardization.
Again, a distplot to check the variation of X in both the train and validation sets after standardizing and normalizing. It looked good.
Now we have X_train_std ready to fit into the model.
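A minimal scaling sketch, fitting on the train set only and reusing the fitted scaler on validation (the feature values here are made up):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X_train = np.array([[100.0, 50.0], [120.0, 48.0], [90.0, 45.0], [110.0, 47.0]])
X_valid = np.array([[105.0, 49.0], [95.0, 46.0]])

# StandardScaler: zero mean, unit variance per column (fit on train only)
std = StandardScaler().fit(X_train)
X_train_std = std.transform(X_train)
X_valid_std = std.transform(X_valid)

# MinMaxScaler: rescale each column into the [0, 1] range
mm = MinMaxScaler().fit(X_train)
X_train_mm = mm.transform(X_train)
```

Fitting only on the train set and transforming the validation set with the same scaler avoids leaking validation statistics into training.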
5. Model & Measure regression performance
As the headline "Regression" says, I will explore regression models to predict the sales. In total I built 5 tree-based models:
- DecisionTreeRegressor (as baseline)
- DecisionTreeRegressor + TimeSeriesSplit Validation
- DecisionTreeRegressor + TimeSeriesSplit Validation + more params
- RandomForestRegressor + TimeSeriesSplit Validation + more params
- XGBoost + TimeSeriesSplit Validation + more params
For metric evaluation, I chose R-squared for the score (higher is better) and MAE (mean_absolute_error) for the error (lower is better).
Let me show some interesting parts and compare them for you.
1. DecisionTreeRegressor (as baseline) predicted as shown below.
2. DecisionTreeRegressor + TimeSeriesSplit Validation.
For the second model we add TimeSeriesSplit validation on top of the baseline. (I decided to split sequentially and not shuffle the data; to me that makes sense and is practical when using the model and explaining it to my team and boss.)
cv_ls is a variable containing the train/validation splits. I set n_splits to 5, dividing the whole data into 5 folds, and test_size to 10, meaning 10 data points are held out for validation in each fold, stepping forward in the next fold.
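A sketch of that split setup with scikit-learn's TimeSeriesSplit:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(90).reshape(-1, 1)  # 90 sequential training records

# 5 folds, each validating on the next 10 points; no shuffling
tscv = TimeSeriesSplit(n_splits=5, test_size=10)
cv_ls = list(tscv.split(X))
```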
In both 1 and 2, the model performed well on the train set with an R-squared above 0.90, but badly on validation, with a negative R-squared and large errors. We need a better model.
3. DecisionTreeRegressor + TimeSeriesSplit Validation + more params
It improved, but the error was still high. By eye we can see 2 data points in validation with a big gap, while the other points were good.
4. RandomForestRegressor + TimeSeriesSplit Validation +more params
RandomForest closed one of the 2 big gaps above, and the error dropped below 1.0. Good, but we explored further.
5. XGBoost + TimeSeriesSplit Validation + more params
XGBoost beat RandomForest in both R-squared and error on validation, and likewise on the train set. That makes XGBoost the best model in this project!
Result summary for each model, on the validation set only:
- model_1: R-squared = -0.218 | MAE = 1.372
- model_2: R-squared = -0.397 | MAE = 1.527
- model_3: R-squared = 0.149 | MAE = 1.126
- model_4: R-squared = 0.248 | MAE = 0.947
- model_5: R-squared = 0.337 | MAE = 0.866 (Best Model)
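The comparison pattern behind those numbers can be sketched as below. I use scikit-learn regressors as a stand-in, since XGBRegressor follows the same fit/predict interface; the data and scores here are synthetic, not the article's:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score, mean_absolute_error

# Synthetic data shaped like the project: 102 rows, a few features
rng = np.random.default_rng(42)
X = rng.normal(size=(102, 5))
y = 2.0 * X[:, 0] + rng.normal(scale=0.5, size=102)

# Sequential split, as in the article: first 90 rows train, last 12 validate
X_tr, X_va, y_tr, y_va = X[:90], X[90:], y[:90], y[90:]

results = {}
for name, model in [
    ("DecisionTree", DecisionTreeRegressor(max_depth=3, random_state=0)),
    ("RandomForest", RandomForestRegressor(n_estimators=100, random_state=0)),
]:
    model.fit(X_tr, y_tr)
    pred = model.predict(X_va)
    results[name] = (r2_score(y_va, pred), mean_absolute_error(y_va, pred))
```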
Feature Importance
The next step is to use XGBoost, our best model, to get the importance scores (.feature_importances_).
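Reading the scores can look like this. I show RandomForestRegressor as a stand-in since XGBRegressor exposes the same .feature_importances_ attribute; the data is synthetic, built so pro_RSP dominates:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic data where the target depends mostly on pro_RSP
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(100, 3)),
                 columns=["pro_RSP", "sales_in", "stores"])
y = -2.0 * X["pro_RSP"] + rng.normal(scale=0.1, size=100)

model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)
# Importances sum to 1; sort to rank the features
importances = pd.Series(model.feature_importances_, index=X.columns)
importances = importances.sort_values(ascending=False)
```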
RSP and pro_2for were our top 2 features, matching the assumption we made from their correlations earlier. Our 2 main features of interest, Sales_in and Stores, scored low, also matching our assumption.
Anyway, it makes sense. Lastly, we take the best model to predict on the unseen data (test set) and measure model performance with the same metrics.
6. Predict on Unseen data (test set)
We have 10 records ready for the final test of our best model, XGBoost, but first we need to run the data through the same manipulation process as the train data... the complete test set is shown below.
Next, the preprocessing step to get the data ready for prediction.
By eye, there is no big gap between the test and predicted data, and the scores were R-squared = 0.635 | MAE = 0.575 on the unseen (test) set: good, and even better than the train and validation results of model_5.
Conclusion
- We explored multivariate data (not just a single series): how to predict offtake from sales_in, stores, and promotion data in the retail business domain.
- Correlation is important for a rough look at the data, and for testing questions and setting assumptions against our domain experience.
- Validation is important for any model.
- Tree-based regressors are useful and flexible.
Next: the data set in this project was very small. To make the model more and more powerful, more training data is needed (at least 1000 train records to 100 test). Anyway, these predictions are already used in my real work, so this was enough for me from this project ^^
Lastly, I have an idea for the next project: convert the promotion price into a binary class, 1 (promotion) or 0 (no promotion), then build a classification model to classify which class each record belongs to.
Hope to see you in the next project: Classification on Price Promotion data.
>> full code here https://github.com/MossMojito/Regression
Thanks for reading :)