Classification of Promotion on Sales Data

Patiparn Nualchan
9 min read · Jan 29, 2023


This continues from the previous part, Regression on Price Promotion data. This time I convert RSP and all promotions into classes 0 and 1 and learn how to classify them.

Classification

Classification is the other part of supervised learning, where the target is categorical (a class) rather than continuous. That means we cannot calculate a sum or an average of the target; exploration mostly relies on frequency-based calculations such as a countplot.

Image from Freepik

1. Data set

I manipulated the data a bit to switch it from regression data to classification data, as shown below.

Classification data by author

I did the same data cleaning and generated the same features from ‘date’ as before.

Classification data (use) by author

I also renamed the target feature from ‘pro_npro’ to ‘pro_class’. We have now expanded from the original 6 feature columns to 9.
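For readers who want to follow along, here is a minimal sketch of this preparation step. The file name and the exact values inside ‘pro_npro’ are assumptions for illustration; the real code is in the notebook linked at the end.

```python
import pandas as pd

# a rough sketch of the preparation step (file name is an assumption)
df_train = pd.read_csv('promotion_sales_train.csv', parse_dates=['date'])

# generate calendar features from 'date', same as in the Regression part
df_train['month'] = df_train['date'].dt.month
df_train['week_of_year'] = df_train['date'].dt.isocalendar().week.astype(int)

# collapse every promotion type into one class: 0 = RSP, 1 = any promotion,
# and store it under the new target name 'pro_class'
# (assumes 'pro_npro' holds the text label 'RSP' for non-promotion rows)
df_train['pro_class'] = (df_train['pro_npro'] != 'RSP').astype(int)
df_train = df_train.drop(columns=['pro_npro'])
```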

2. Data Visualization

As mentioned above, exploring a class target is more limited, but still possible. I first took a look inside class 1 (pro), because I had combined all of the price promotions together and it is useful to see each promotion type inside it.

inside of class 1 (promotion) by author
Target class 0 (RSP) and 1 (Pro) by author

By eye we can see and say it is balanced. (If the minority class is far smaller than the majority class, we call it an imbalanced case, and some action has to be taken before building any model; simple approaches are up- or down-sampling. The point is to avoid the majority class dominating the minority: if the imbalance stays until the end, the model will simply answer with the majority class for everything, so imbalance is a very important issue in classification.) Luckily, our data needs no such treatment ^^

Imbalance simple approaches, by my PhD: Eakasit_Pacharawongsakda
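As a side note, here is a rough sketch of how the balance could be checked and, if it had been needed, how the minority class could be up-sampled with sklearn.utils.resample. It reuses the df_train and ‘pro_class’ names from the earlier sketch, and treating class 1 as the minority is only for illustration.

```python
import pandas as pd
from sklearn.utils import resample

# quick balance check on the target
print(df_train['pro_class'].value_counts(normalize=True))

# if the classes had been imbalanced, a simple fix is to up-sample the
# minority class (sketch only; our data did not need this)
minority = df_train[df_train['pro_class'] == 1]
majority = df_train[df_train['pro_class'] == 0]
minority_up = resample(minority,
                       replace=True,              # sample with replacement
                       n_samples=len(majority),   # match the majority size
                       random_state=42)
df_balanced = pd.concat([majority, minority_up])
```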

Anyway, we can explore the target class through the other continuous features by plotting them and then separating the classes by color, by sub-plots, and so on.

offtake plot by date color by target class
stores plot by date color by target class
sales_in plot by date color by target class

You can see that offtake seems to separate the classes, with class 0 in the low range and class 1 higher, but the separation is much weaker for the others (stores and sales_in).

offtake displot color by target class
store displot color by target class
sales_in displot color by target class

[ *Remark: sales_in seems to have some outstanding data points that could be removed as outliers. Usually an outlier is a mistake or some other wrong recording, but in this case all of the outstanding values were correct (verified by the author). In the previous Regression task I decided to remove them and did so; for this classification task I will keep them and see the effect. ]

Correlation heatmap by author

As a final check, I used a heatmap to show the correlation of ‘pro_class’ with all the other features. (We can compute a correlation only because we keep the target class as the numbers 0 and 1; strictly speaking it should be a categorical type, and then this calculation would not apply.)

Correlation is a world of its own, with many types of coefficients whose choice depends on several factors, the main one being the data type, as in the table below.

Correlation coefficient type

The default of .corr() is Pearson’s, which is what our heatmap shows.
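Roughly, the heatmap above comes from something like the sketch below, assuming the df_train from the earlier sketches; the styling options are just my illustration.

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Pearson correlation (the default of .corr()) between all numeric columns,
# including the 0/1 target, drawn as a heatmap
corr = df_train.select_dtypes(include='number').corr()   # method='pearson' by default
sns.heatmap(corr, annot=True, cmap='coolwarm', vmin=-1, vmax=1)
plt.title('Correlation heatmap (Pearson)')
plt.show()

# Spearman or Kendall can be requested explicitly when the data type calls for it:
# .corr(method='spearman'), .corr(method='kendall')
```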

3. Preprocessing

Now we are ready for the next step, the model, but we still need a preprocessing step to get our data ready too.

This time I decided not to do standardization or normalization, because from my reading, learning, and experience, tree-based algorithms (decision tree, random forest, or any boosting) do not really need scaled data. I am not expert enough to say they never need it; I just chose to skip it and see how it goes 😆

Therefore, we split df_train into train and valid sets, then turned the selected features [‘month’, ‘week_of_year’, ‘sales_in’, ‘stores’, ‘offtake’] into arrays (a sketch follows the screenshots below).

split train and Valid by author
did it as array by author
X_train[0:5] by author
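A minimal sketch of this split and array conversion, assuming the df_train from the earlier sketches; the 15-row validation size is my reading of the screenshots and the exact indexing is an assumption.

```python
features = ['month', 'week_of_year', 'sales_in', 'stores', 'offtake']
target = 'pro_class'

# time-ordered split: keep the last 15 rows as validation, the rest as train
df_train_part = df_train.iloc[:-15]
df_valid_part = df_train.iloc[-15:]

X_train = df_train_part[features].values   # .values turns the frame into array([...])
y_train = df_train_part[target].values
X_valid = df_valid_part[features].values
y_valid = df_valid_part[target].values

print(X_train[0:5])
```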

4. Model & measuring classification performance

I used the same tree-based models, but I think the main thing to talk about here is measurement. Classification has its own unique way of measuring whether a predicted class is correct or not: the confusion matrix.

Classification measurement by my phd : Eakasit_Pacharawongsakda

The confusion matrix is the result table: TP = True Positive, FP = False Positive, TN = True Negative, FN = False Negative. These are the counts we use for the calculations.

From TP/FP/TN/FN we can calculate Precision, Recall, Accuracy, and F1.
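Here is a tiny illustration with made-up labels (not from the project) of how those four counts turn into the scores:

```python
from sklearn.metrics import confusion_matrix, classification_report

# toy labels, just to show how the four counts become the scores
y_true = [0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
precision = tp / (tp + fp)                   # of everything predicted 1, how much really was 1
recall    = tp / (tp + fn)                   # of everything really 1, how much we caught
accuracy  = (tp + tn) / (tp + tn + fp + fn)  # overall hit rate
f1        = 2 * precision * recall / (precision + recall)

print(precision, recall, accuracy, f1)
print(classification_report(y_true, y_pred))  # the report format used throughout this post
```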

4.1 DecisionTreeClassifier

I used a DecisionTree as the baseline, same as before, and measured it with the F1 score (which already combines Precision and Recall) together with Accuracy.
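A minimal sketch of the baseline, using the arrays from the split sketch above. The hyperparameters here are defaults, not my tuned values, and the scores quoted below come from my actual run, not from this exact snippet.

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, classification_report

# baseline: a plain decision tree on the five selected features
dt_model = DecisionTreeClassifier(random_state=42)
dt_model.fit(X_train, y_train)

for name, X, y in [('train', X_train, y_train), ('valid', X_valid, y_valid)]:
    pred = dt_model.predict(X)
    print(name, 'accuracy:', accuracy_score(y, pred))
    print(classification_report(y, pred, target_names=['RSP', 'Pro']))
```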

real offtake and DecisionTreeClassifier prediction result plot comparison on train by author
DecisionTreeClassifier prediction report on train by author

The baseline prediction on train was super powerful: 98% accuracy and 99% F1 on both RSP and Pro, with only 1 misclassified point (yellow highlight).

real offtake and DecisionTreeClassifier prediction result plot comparison on valid by author
DecisionTreeClassifier prediction report on valid by author

On valid it was still good: 86% accuracy and 91% F1 on RSP, with 2 misclassified points (one in each class).

4.2 RandomForestClassifier

Next, 100 trees as a forest, with the same validation (TimeSeriesSplit).
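A rough sketch of this step; the arrays come from the earlier sketches, and the number of splits and the F1 scoring are assumptions on my side.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import TimeSeriesSplit, cross_val_score

# 100 trees, validated with a time-aware split so later weeks never leak
# into earlier training folds
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)

tscv = TimeSeriesSplit(n_splits=5)
scores = cross_val_score(rf_model, X_train, y_train, cv=tscv, scoring='f1')
print('F1 per fold:', scores)

rf_model.fit(X_train, y_train)   # final fit on the full train part
```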

real offtake and RandomForestClassifier prediction result plot comparison on train by author
RandomForestClassifier prediction report on train by author

RandomForestClassifier missed 1 data point, same as DT, and the score was the same: 98% accuracy and 99% F1 on both RSP and Pro.

real offtake and RandomForestClassifier prediction result plot comparison on valid by author
RandomForestClassifier prediction report on valid by author

On valid it had the same score and the same 2 misclassified points as DT.

4.3 XGBClassifier

I love XGBoost, so you will always see Xgb in my projects 😄
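For completeness, a minimal sketch of fitting it on the same arrays (default parameters here, not my tuned ones):

```python
from xgboost import XGBClassifier

# gradient-boosted trees on the same arrays as the other models
xgb_model = XGBClassifier(n_estimators=100, eval_metric='logloss', random_state=42)
xgb_model.fit(X_train, y_train)
print(xgb_model.predict(X_valid))
```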

real offtake and XGBClassifier prediction result plot comparison on train by author
XGBClassifier prediction report on train by author

Xgb gave the same prediction result as DT and Rnd: 98% accuracy and 99% F1 on both RSP and Pro.

real offtake and XGBClassifier prediction result plot comparison on valid by author
XGBClassifier prediction report on valid by author

It also had the same score and the same 2 misclassified points on valid as DT and Rnd.

Model summary

  • DT: train 98% Acc / 99% F1 on both | valid 86% Acc / 91% and 75% F1 on RSP and Pro
  • Rnd: train 98% Acc / 99% F1 on both | valid 86% Acc / 91% and 75% F1 on RSP and Pro
  • Xgb: train 98% Acc / 99% F1 on both | valid 86% Acc / 91% and 75% F1 on RSP and Pro

And the confusion matrices of all 3 models were identical: [[56 0] [1 3]] on train and [[10 1] [1 3]] on valid. All tree-based classifiers learned and predicted in the same way, with super high results on train and high results on valid. If we stopped at this step, we would conclude that all 3 models are powerful and perfect for classifying the promotion class. But is that a correct conclusion without a final test on unseen data?

Feature_importances

Before going to the unseen-data test, we should know which feature was the most important for the model’s prediction. Since all 3 models above gave the same performance, to look at feature_importances_ I picked the lightest model, the simple DecisionTree.

As you can see, offtake has an outstanding score compared to the others, as we already saw and realized in the visualization part. A sketch of reading the importances follows.
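A minimal sketch, assuming the dt_model and feature list from the earlier sketches:

```python
import pandas as pd

# importance of each feature in the fitted decision tree, sorted for readability
importances = pd.Series(dt_model.feature_importances_, index=features)
print(importances.sort_values(ascending=False))
```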

4.4 Guest model: Logistic Regression

Since the importance score of offtake was by far the strongest, I had the idea of simplifying the project by cutting the other features off and keeping only ‘offtake’.
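A minimal sketch of this single-feature model, reusing the names from the split sketch above; the scores quoted below are from my actual run, not from this snippet.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

# keep only the strongest feature; sklearn expects a 2-D array,
# hence the double brackets
X_train_off = df_train_part[['offtake']].values
X_valid_off = df_valid_part[['offtake']].values

lr_model = LogisticRegression()
lr_model.fit(X_train_off, y_train)

print(classification_report(y_valid, lr_model.predict(X_valid_off),
                            target_names=['RSP', 'Pro']))
```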

train and val data for Logistic Regression by author
real offtake and Logistic Regression prediction result plot comparison on train by author
Logistic Regression prediction report on train by author
real offtake and Logistic Regression prediction result plot comparison on valid by author
Logistic Regression prediction report on valid by author

Logistic Regression trained on only the ‘offtake’ feature was more general, as you can see from the yellow highlights: many data points missed their class, so accuracy was 77% and F1 was 84% on RSP and 64% on Pro.

On valid, Logistic Regression gave the same result as the tree-based models: 86% accuracy and 91% and 75% F1 on RSP and Pro.

The result was interesting and different from the tree-based models, especially on the train data, where it was more general, so I will pick Logistic Regression to test on unseen data next 😁

5. Unseen data (df_test)

Now we have completed the training and evaluation steps on the train and validation sets. Next we are going to test our models on unseen data.

df_test by author

After reading the data, we did some cleaning, added the date features, and as a final preparation converted the dataframe into the X_test array, the same way as X_train, ready for testing.
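A rough sketch of this preparation, assuming the test file also carries the date column and the true ‘pro_class’ labels (the file name is an assumption):

```python
import pandas as pd

# unseen data, prepared exactly like the training set
df_test = pd.read_csv('promotion_sales_test.csv', parse_dates=['date'])
df_test['month'] = df_test['date'].dt.month
df_test['week_of_year'] = df_test['date'].dt.isocalendar().week.astype(int)

X_test = df_test[features].values             # same feature order as X_train
y_test = df_test['pro_class'].values          # true labels, for scoring the test
X_test_off = df_test[['offtake']].values      # single-feature version for Logistic Regression
```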

df_test by author
X_test array by author

5.1 Test with DecisionTreeClassifier

As mentioned above, all the tree-based models gave the same result, so DecisionTree was good enough to pick for the test.

real offtake and DecisionTreeClassifier prediction result plot comparison on test by author
DecisionTreeClassifier prediction report on test by author

Sadly, the DecisionTreeClassifier with its super-powerful 98% train accuracy could not predict any Pro class (1) on the test data.

5.2 test with guest model : Logistic Regression

I hoped that a more general model like Logistic Regression could predict the Pro class for at least one data point. Let’s see how it predicts on our test set.

real offtake and Logistic Regression prediction result plot comparison on test by author
Logistic Regression prediction report on test by author

Wow! Logistic Regression with only ‘offtake’ predicted with 100% accuracy, double-confirmed by the perfect ROC chart below ^^

Logistic Regression ROC on test by author
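The ROC chart came from something like this sketch, reusing lr_model, X_test_off, and y_test from the earlier sketches:

```python
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score

# ROC from the predicted probability of class 1 (Pro) on the test set
proba = lr_model.predict_proba(X_test_off)[:, 1]
fpr, tpr, _ = roc_curve(y_test, proba)

plt.plot(fpr, tpr, label=f'AUC = {roc_auc_score(y_test, proba):.2f}')
plt.plot([0, 1], [0, 1], linestyle='--', color='grey')   # chance line
plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.legend()
plt.show()
```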

Conclusion

  • Classification is one of the important algorithm families to learn, practice, and get familiar with. The main discussion usually revolves around the confusion matrix (try not to get confused by the confusion matrix 😄).
  • Tree-based algorithms are wonderful in terms of how widely they are used for both regression and classification tasks, but sometimes we need something simpler: Regression, K-means, SVM, KNN, Naive Bayes, and so on.
  • There is no need to stick to other people’s methods if they don’t fit your problem or your case. Be confident in your idea, do it, and hope it raises no errors 😄
  • A data scientist is a scientist: every action is an experiment in your science lab. Explosions can happen, but you will learn how to avoid them next time.

Thank you for reading. I hope you enjoyed my experiment, and see you next time 😊

Full code is on my GitHub as usual: https://github.com/MossMojito/classification

