Classification of Promotions on Sales Data
This post continues from the Regression on Price Promotion data project. It converts RSP and all of the promotions into classes 0 and 1, and we learn how to classify them.
Classification
Classification is another branch of supervised learning, one where the target is a category/class rather than continuous data. That means we cannot calculate a sum or an average on the target; the main way to explore it is frequency-based calculation, such as a countplot.
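For example, a minimal sketch of a frequency-based look at the class target (the toy DataFrame here is placeholder data, not the post's dataset):

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Placeholder data standing in for the real sales DataFrame
df = pd.DataFrame({'pro_class': [0, 0, 1, 0, 1, 1, 0]})

# Frequency-based exploration: count the rows in each class
sns.countplot(x='pro_class', data=df)
plt.title('Class frequency of pro_class')
plt.show()
```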
1. Data set
I manipulated the data a bit to switch it from regression data to classification data, as shown below.
I also did the same data cleaning and generated features from 'date' as before.
Then I changed the name of the target feature from pro_npro to 'pro_class'; we have now expanded from the original 6 columns to 9 feature columns.
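A rough sketch of that conversion, assuming the column names from the post ('date', pro_npro renamed to 'pro_class') and placeholder values:

```python
import pandas as pd

# Placeholder rows; the real data comes from the regression project
df = pd.DataFrame({
    'date': ['2020-01-06', '2020-01-13', '2020-01-20'],
    'pro_npro': ['npro', 'pro', 'npro'],
})

# Generate features from 'date', as in the regression task
df['date'] = pd.to_datetime(df['date'])
df['month'] = df['date'].dt.month
df['week_of_year'] = df['date'].dt.isocalendar().week.astype(int)

# Switch the target to classes 0/1 and rename it to 'pro_class'
df['pro_class'] = (df['pro_npro'] == 'pro').astype(int)
df = df.drop(columns=['pro_npro'])
```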
2. Data Visualization
As mentioned above, exploring a class target is more limited, but it can still be done. I first took a look inside class 1 (pro), because I had combined all of the price promotions together and it is better to see each promotion inside it.
By eye we can see and say that it is balanced. (If the minority class has a large gap from the majority class, it is called an imbalanced case. You have to take some action before creating any model; simple approaches are up- or down-sampling. This avoids the majority dominating the minority: if the imbalance stays, the final model will reply with the majority class for everything, so imbalance is a very important issue in classification.) Luckily, our data needs no such treatment ^^
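For reference, a minimal sketch of checking the balance and of the simple up-sampling you would do in an imbalanced case (the toy data here is deliberately imbalanced; our real data was not):

```python
import pandas as pd
from sklearn.utils import resample

# Deliberately imbalanced toy data: 8 majority rows vs 2 minority rows
df = pd.DataFrame({'pro_class': [0]*8 + [1]*2, 'offtake': range(10)})
print(df['pro_class'].value_counts())  # eyeball the class balance

majority = df[df['pro_class'] == 0]
minority = df[df['pro_class'] == 1]

# Simple up-sampling: draw minority rows with replacement until sizes match
minority_up = resample(minority, replace=True,
                       n_samples=len(majority), random_state=42)
df_balanced = pd.concat([majority, minority_up])
print(df_balanced['pro_class'].value_counts())
```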
Anyway, we can explore the target class through the other continuous features by plotting them and then separating each class by color, sub-graphs, and so on.
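Something along these lines (placeholder values again; the real columns are offtake, sales_in, and stores):

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Placeholder rows standing in for the real dataset
df = pd.DataFrame({
    'offtake':   [10, 12, 30, 35, 11, 33],
    'sales_in':  [5, 6, 9, 6, 5, 8],
    'stores':    [100, 102, 101, 99, 100, 103],
    'pro_class': [0, 0, 1, 1, 0, 1],
})

# One sub-plot per continuous feature, classes separated along the x-axis
fig, axes = plt.subplots(1, 3, figsize=(12, 4))
for ax, col in zip(axes, ['offtake', 'sales_in', 'stores']):
    sns.stripplot(x='pro_class', y=col, data=df, ax=ax)
    ax.set_title(col)
plt.tight_layout()
plt.show()
```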
You can see that offtake seems to separate the classes, with class 0 in the low range and class 1 higher, but there is not much separation in the others (stores and sales_in).
[ *Remark: sales_in seems to have some outstanding data points, and normally they should be removed as outliers. Usually an outlier is a mistake or some other wrong recording, but in this case all of the outstanding values were correct (verified by the author). In the previous task, Regression, I decided to remove them and did so; for this classification I would like to keep them and see their effect. ]
A final check with a heatmap shows the correlation of 'pro_class' with all the other features. (We can calculate corr because we still keep the target class as the numbers 0 and 1; formally and correctly it should be a category type, which could not be used in this calculation.)
Correlation is a world of its own, with many named types to use depending on many factors; the main factor is the type of data, as in the table below.
The default of .corr() is Pearson's, which is what our heatmap result uses.
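A minimal sketch of that check (the toy DataFrame is placeholder data):

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Placeholder numeric data; pro_class stays as 0/1 so .corr() can use it
df = pd.DataFrame({
    'offtake':   [10, 12, 30, 35, 11, 33],
    'sales_in':  [5, 6, 9, 6, 5, 8],
    'stores':    [100, 102, 101, 99, 100, 103],
    'pro_class': [0, 0, 1, 1, 0, 1],
})

corr = df.corr(method='pearson')  # 'pearson' is the default
sns.heatmap(corr, annot=True, cmap='coolwarm')
plt.show()
```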
3. Preprocessing
Now we are ready for the next step, the model, but we need a preprocessing step to get our data ready too.
This time I decided not to do standardizing or normalization because, from my research, learning, and experience, tree-based algorithms (decision tree, random forest, or any boosting) do not really need the data scaled. But I'm not expert enough to say "it is not needed". I just decided not to do it; let's see how it goes 😆
Therefore, we split df_train into train and valid sets, then turned the selected features ['month', 'week_of_year', 'sales_in', 'stores', 'offtake'] into arrays (array([])), as in the sketch below.
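A sketch of that split, with a toy df_train standing in for the real one (the 80/20 split point is my assumption):

```python
import numpy as np
import pandas as pd

# Toy df_train standing in for the real training DataFrame
rng = np.random.default_rng(0)
df_train = pd.DataFrame({
    'month': rng.integers(1, 13, 75),
    'week_of_year': rng.integers(1, 53, 75),
    'sales_in': rng.normal(100, 10, 75),
    'stores': rng.integers(90, 110, 75),
    'offtake': rng.normal(50, 20, 75),
    'pro_class': rng.integers(0, 2, 75),
})

features = ['month', 'week_of_year', 'sales_in', 'stores', 'offtake']

# Time-ordered split: first 80% train, last 20% valid (my assumption)
split = int(len(df_train) * 0.8)
train, valid = df_train.iloc[:split], df_train.iloc[split:]

X_train, y_train = train[features].to_numpy(), train['pro_class'].to_numpy()
X_valid, y_valid = valid[features].to_numpy(), valid['pro_class'].to_numpy()
```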
4. Model & Measuring classification performance
I used the same tree-based models, but I think the main thing to talk about is measurement. Classification has a unique way to measure whether a predicted class is correct or not: it is called the confusion matrix.
The confusion matrix is the result table: TP = True Positive / FP = False Positive / TN = True Negative / FN = False Negative. They are the variables for the calculations.
From TP/FP/TN/FN we can calculate Precision = TP/(TP+FP), Recall = TP/(TP+FN), Accuracy = (TP+TN)/(TP+FP+TN+FN), and F1 = 2·Precision·Recall/(Precision+Recall).
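As a worked example, plugging in the valid confusion matrix we will meet later ([[10 1] [1 3]], with Pro = class 1 as positive):

```python
# Counts read off the valid confusion matrix [[10 1] [1 3]]
TP, FP, TN, FN = 3, 1, 10, 1

precision = TP / (TP + FP)                          # 0.75
recall = TP / (TP + FN)                             # 0.75
accuracy = (TP + TN) / (TP + FP + TN + FN)          # 13/15, the ~86% below
f1 = 2 * precision * recall / (precision + recall)  # 0.75 -> Pro's 75% F1
print(precision, recall, accuracy, f1)
```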
4.1 DecisionTreeClassifier
I used a decision tree as a baseline, as before, and measured it with the F1 score (which already combines precision and recall) together with accuracy.
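A minimal sketch, reusing X_train/y_train/X_valid/y_valid from the preprocessing sketch above (the hyperparameters are my assumptions, not the post's exact settings):

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, f1_score

dt = DecisionTreeClassifier(random_state=42)
dt.fit(X_train, y_train)

# Per-class F1 gives one score for RSP (0) and one for Pro (1)
for name, X, y in [('train', X_train, y_train), ('valid', X_valid, y_valid)]:
    pred = dt.predict(X)
    print(name, 'acc:', accuracy_score(y, pred),
          'f1 per class:', f1_score(y, pred, average=None))
```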
The baseline's prediction on train was super powerful: 98% accuracy and 99% F1 on both RSP and Pro, with only 1 misclassified sample (yellow highlight).
And on valid it was still good at 86% accuracy and 91% F1 on RSP, with 2 misclassified samples (one from each class).
4.2 RandomForestClassifier
Next, 100 trees as a forest, with the same validation (TimeSeriesSplit).
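Roughly like this, again reusing the arrays from the preprocessing sketch (the number of splits is my assumption):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import TimeSeriesSplit, cross_val_score

rf = RandomForestClassifier(n_estimators=100, random_state=42)

# Time-aware validation: each fold trains on the past, validates on the future
tscv = TimeSeriesSplit(n_splits=5)
print('F1 per fold:', cross_val_score(rf, X_train, y_train,
                                      cv=tscv, scoring='f1'))
rf.fit(X_train, y_train)
```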
RandomForestClassifier misclassified 1 sample, the same as the DT, and the score was the same: 98% accuracy and 99% F1 on both RSP and Pro.
On valid it had the same score and the same 2 misclassified samples as the DT.
4.3 XGBClassifier
I love XGBoost, so you will always see Xgb in my projects 😄
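A minimal fit, with mostly default settings (eval_metric here is just my choice to keep the log quiet):

```python
from xgboost import XGBClassifier

xgb = XGBClassifier(n_estimators=100, eval_metric='logloss')
xgb.fit(X_train, y_train)
print('valid accuracy:', xgb.score(X_valid, y_valid))
```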
Xgb gave the same prediction results as the DT and Rnd: 98% accuracy and 99% F1 on both RSP and Pro.
It also had the same score and the same 2 misclassified samples as the DT and Rnd.
Model summary
- DT - Train: 98% Acc / 99% F1 on both | Valid: 86% Acc, 91% / 75% F1 on RSP / Pro
- Rnd - Train: 98% Acc / 99% F1 on both | Valid: 86% Acc, 91% / 75% F1 on RSP / Pro
- Xgb - Train: 98% Acc / 99% F1 on both | Valid: 86% Acc, 91% / 75% F1 on RSP / Pro
And the confusion matrices of all 3 were identical: [[56 0] [1 3]] on train and [[10 1] [1 3]] on valid. All the tree-based classifiers learned and predicted in the same way, with super high results on train and high results on valid. If we stopped at this step, we would conclude that all 3 models were powerful and perfect for classifying the promotion class, but is that the correct conclusion without a final test on unseen data?
Feature_importances
Before going to the unseen-data test, we have to know which feature was the most important for the models' predictions. As all 3 models above gave the same performance, to see feature_importances_ I picked the lightest model, the simple decision tree, as sketched below.
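A quick way to read feature_importances_ off the fitted tree from the earlier sketch:

```python
import pandas as pd

# Importance scores of the fitted decision tree, sorted for easy reading
features = ['month', 'week_of_year', 'sales_in', 'stores', 'offtake']
importances = pd.Series(dt.feature_importances_, index=features)
print(importances.sort_values(ascending=False))
```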
As you can see, offtake's score stood out from the others, as we already saw and realized in the visualization part.
4.4 Guest model: Logistic Regression
Since the importance score of offtake was the strongest, I had the idea to simplify this project by cutting the other features off and keeping only 'offtake'.
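Something like this, reusing the arrays from the preprocessing sketch ('offtake' is column index 4 in the feature list; the default LogisticRegression settings are my assumption):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

# Keep only the 'offtake' column (index 4 in the feature list)
X_train_off = X_train[:, [4]]
X_valid_off = X_valid[:, [4]]

logreg = LogisticRegression()
logreg.fit(X_train_off, y_train)
print(classification_report(y_valid, logreg.predict(X_valid_off)))
```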
Logistic regression on train with only the 'offtake' feature was more general, as you can see in the yellow highlight. Many samples missed their class, so accuracy was 77%, and F1 was 84% on RSP and 64% on Pro.
On valid, the logistic regression result was the same as the tree-based models: 86% accuracy and 91% / 75% F1 on RSP / Pro.
The result was interesting and different from the tree-based models, especially on the train data where it was more general, so I will pick logistic regression to test on unseen data next 😁
5. Unseen data (df_test)
Now we have completed the training and evaluation steps with the train and validation sets. Next we are going to test our models on unseen data.
After reading the data, we did the same cleaning, added the date features, and finally prepared it from a DataFrame into an array X_test, the same as X_train, ready for testing.
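The same pipeline as for the training data, sketched with a hypothetical file name and the post's column names:

```python
import pandas as pd

df_test = pd.read_csv('df_test.csv')  # hypothetical file name
df_test['date'] = pd.to_datetime(df_test['date'])
df_test['month'] = df_test['date'].dt.month
df_test['week_of_year'] = df_test['date'].dt.isocalendar().week.astype(int)

features = ['month', 'week_of_year', 'sales_in', 'stores', 'offtake']
X_test = df_test[features].to_numpy()
y_test = df_test['pro_class'].to_numpy()
```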
5.1 Test with DecisionTreeClassifier
As mentioned above, all the tree-based models gave the same results, so the decision tree was good enough to pick for the test.
Sadly, the DecisionTreeClassifier, with its super powerful 98% train accuracy, could not predict any Pro class (1) on the test data.
5.2 Test with guest model: Logistic Regression
I hoped the more general model, logistic regression, could predict the Pro class; even 1 sample would be OK. Let's see how it predicts on our test set.
Wow! Logistic regression with only 'offtake' predicted with 100% accuracy, double-confirmed by the perfect ROC chart below ^^
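The ROC chart can be drawn like this, continuing from the test-preparation and logistic-regression sketches above:

```python
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, auc

# ROC from the class-1 (Pro) probabilities on the test set
proba = logreg.predict_proba(X_test[:, [4]])[:, 1]
fpr, tpr, _ = roc_curve(y_test, proba)

plt.plot(fpr, tpr, label=f'AUC = {auc(fpr, tpr):.2f}')
plt.plot([0, 1], [0, 1], linestyle='--')  # chance line
plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.legend()
plt.show()
```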
Conclusion
- Classification is one of the important algorithm families to learn, practice, and get familiar with. The main issue is usually discussed around the confusion matrix (don't get confused by the confusion matrix 😄)
- Tree-based algorithms are wonderful in terms of how widely they are used for both regression and classification tasks, but sometimes we need something simpler: logistic regression, K-means, SVM, KNN, Naive Bayes, and so on.
- No need to fix on and stick to other people's methods if they don't fix your problem or your case. Be confident in your idea, do it, and hope it's a result, not an error alert 😄
- A data scientist is a scientist: every action of yours is an experiment in the science lab. A bomb can happen, but you will learn how to avoid it next time.
Thank you for reading; I hope you enjoyed my experiment. See you 😊
The full code is on my GitHub as usual: https://github.com/MossMojito/classification