Backorder Prediction Problem

Rikin Nayak

Material backorder is a standard problem related to the availability of a product when it is required for further production or when a customer wants to purchase it. Here, machine learning is used to predict the probability that a product will go on backorder. Such predictions are very helpful, since a backorder tendency risks losing customers by delaying production or leaving the product out of stock exactly when a customer wants it.

Introduction

Inventory management is the practice of sourcing, storing, and selling products; in other words, it is the system for keeping the right product available at the right time. If a product is not available when it is required, overall production may be affected, and because of such an issue the client may switch to another company or product, resulting in a lost customer. At the same time, inefficient management may keep a product in stock even when it is not required, which creates storage problems. Here, a machine learning model classifies, based on historic data, whether a product went on backorder. In this type of problem, the choice of performance metric plays a very important role.

Business Objective

  1. Latency for the prediction is not a strict concern.
  2. Misclassification may result in wrong production or storage decisions, or in a shortage of products.
  3. This is a binary classification problem: 1 if the product went on backorder, 0 if it did not.

Selecting a performance metric

Backorders happen rarely, so in this type of problem accuracy is not a useful metric: it works well only on balanced data sets. Here the F1 score may be chosen to evaluate the performance of the model. Mathematically, it can be expressed as:

F1 = 2 * (precision * recall) / (precision + recall)

The F1 score strikes a balance between precision and recall.

Precision: the number of correct positive results divided by the number of positive results predicted by the classifier.

Recall: the number of correct positive results divided by the number of all relevant samples (all samples that should have been identified as positive).

High precision alone can be misleading here: the model may predict very few backorder conditions and be right about most of them, which is problematic for the company because only a small share of the true backorders is caught. High recall alone is also not enough: a model that flags many products as backorders will cover almost all actual backorder conditions, which seems like the right goal, but it also triggers unnecessary production and storage, which is costly in terms of manufacturing and storing items unnecessarily.

For highly imbalanced data, the macro-F1 score is a good option since it gives the same importance to each class: it will be low for models that perform well on the common class while performing poorly on the rare class.

Area Under Curve

Area Under Curve (AUC) is one of the most widely used evaluation metrics for binary classification problems. The AUC of a classifier equals the probability that the classifier ranks a randomly chosen positive example higher than a randomly chosen negative example.

The False Positive Rate (FPR) and True Positive Rate (TPR) both take values in the range [0, 1] and are computed at varying threshold values. AUC is the area under the ROC curve, which plots the True Positive Rate against the False Positive Rate across these thresholds.
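
As a concrete illustration, both metrics can be computed with scikit-learn in a few lines; this is only a minimal sketch, and the variable names (y_true, y_pred, y_score) are placeholders, not taken from the original post:

from sklearn.metrics import f1_score, roc_auc_score

# Placeholder ground truth and model outputs, for illustration only
y_true = [0, 0, 0, 1, 0, 1]                # actual labels (1 = went on backorder)
y_pred = [0, 0, 1, 1, 0, 1]                # hard class predictions
y_score = [0.1, 0.2, 0.6, 0.9, 0.3, 0.8]   # predicted probability of class 1

print(f1_score(y_true, y_pred, average='macro'))  # macro-F1: unweighted mean of per-class F1
print(roc_auc_score(y_true, y_score))             # AUC over varying thresholds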

Data set

To understand the problem we use the backorder prediction data set available on Kaggle. The data set has the following features:

SKU — Random ID for the product

national_inv — Current inventory level for the part

lead_time — Transit time for product (if available)

in_transit_qty — Amount of product in transit from source

forecast_3_month — Forecast sales for the next 3 months

forecast_6_month — Forecast sales for the next 6 months

forecast_9_month — Forecast sales for the next 9 months

sales_1_month — Sales quantity for the prior 1 month time period

sales_3_month — Sales quantity for the prior 3 month time period

sales_6_month — Sales quantity for the prior 6 month time period

sales_9_month — Sales quantity for the prior 9 month time period

min_bank — Minimum recommended amount to stock

potential_issue — Source issue for part identified

pieces_past_due — Parts overdue from source

perf_6_month_avg — Source performance for prior 6 month period

perf_12_month_avg — Source performance for prior 12 month period

local_bo_qty — Amount of stock orders overdue

deck_risk — Part risk flag

oe_constraint — Part risk flag

ppap_risk — Part risk flag

stop_auto_buy — Part risk flag

rev_stop — Part risk flag

went_on_backorder — Product actually went on backorder. This is the target value.

Now let's start coding!!!
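
Assuming the training CSV from the Kaggle competition has been saved locally (the file name below is an assumption, not taken from the original post), loading it might look like this:

import pandas as pd

# Hypothetical file name for the Kaggle backorder training data
df = pd.read_csv('Kaggle_Training_Dataset_v2.csv')

print(df.shape)   # number of rows and columns
print(df.dtypes)  # which features are numeric vs. object (categorical)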

Data Analysis

From the above data information we can see that, out of 21 features and 1 label, 15 features have numeric values and the rest are non-numeric, so first we have to find a way to convert these non-numeric features into numeric ones.

Also, the data distribution is imbalanced: only about 0.7% of the samples have label 1 (i.e., a backorder was initiated).

Let us take a more detailed look at what data is missing.

The table below gives the total number of missing values per feature.

The lead_time feature has the highest number of NaN values, so we cannot simply drop all of those rows; how the NaNs in lead_time are replaced is discussed after the EDA.
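
These counts can be reproduced with pandas (a minimal sketch, assuming the DataFrame df loaded earlier):

# Count missing values per feature, largest first
print(df.isnull().sum().sort_values(ascending=False))

# Class balance of the target
print(df['went_on_backorder'].value_counts(normalize=True))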

Univariate Analysis

Features in train and test data are:

Index(['sku', 'national_inv', 'lead_time', 'in_transit_qty', 'forecast_3_month', 'forecast_6_month', 'forecast_9_month', 'sales_1_month', 'sales_3_month', 'sales_6_month', 'sales_9_month', 'min_bank', 'potential_issue', 'pieces_past_due', 'perf_6_month_avg', 'perf_12_month_avg', 'local_bo_qty', 'deck_risk', 'oe_constraint', 'ppap_risk', 'stop_auto_buy', 'rev_stop', 'went_on_backorder'], dtype='object')

SKU

SKU is a random ID generated for the product, so we can ignore it for prediction.

For the categorical features, the only values present are Yes and No.

stop_auto_buy

oe_constraint

In the same way, we can plot the rest of the categorical features: ppap_risk, deck_risk, and rev_stop. From these plots we can see, for each feature value "Yes" and "No", what percentage of instances initiated a backorder.
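
A minimal sketch of such plots, assuming seaborn and the DataFrame df from above (the column names are from the data set; the styling is my own assumption, not the post's exact plots):

import seaborn as sns
import matplotlib.pyplot as plt

# Backorder counts for each value of a categorical feature
for col in ['stop_auto_buy', 'oe_constraint', 'ppap_risk', 'deck_risk', 'rev_stop']:
    sns.countplot(data=df, x=col, hue='went_on_backorder')
    plt.title(col)
    plt.show()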

Now, for the numeric features 'in_transit_qty', 'forecast_3_month', 'forecast_6_month', 'forecast_9_month', 'min_bank', 'local_bo_qty', 'pieces_past_due', 'sales_1_month', 'sales_3_month', 'sales_6_month', 'sales_9_month', we will try to learn about the data distribution. First we check the range of all numeric features using box plots.
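
A box plot of each numeric feature, split by the target, can be sketched as follows (again assuming df; the exact plots in the post may differ):

import seaborn as sns
import matplotlib.pyplot as plt

numeric_cols = ['in_transit_qty', 'forecast_3_month', 'forecast_6_month',
                'forecast_9_month', 'min_bank', 'local_bo_qty', 'pieces_past_due',
                'sales_1_month', 'sales_3_month', 'sales_6_month', 'sales_9_month']

# One box plot per feature, grouped by whether a backorder occurred
for col in numeric_cols:
    sns.boxplot(data=df, x='went_on_backorder', y=col)
    plt.show()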

Next, we will look at each feature one by one:

national_inv

The national_inv feature shows the current inventory level for the part. Negative values correspond to items that were ordered even though they were not available. From the box plot it can be seen that national_inv sits around zero for the events when a backorder was initiated.

in_transit_qty

in_transit_qty is the amount of product in transit from the source. Most values of this feature are zero, so we will use quantile values to analyze it.

About 70% of the samples have value 0 for this feature, and 99% have a value less than 565.
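
These quantiles can be read off directly with pandas:

# Inspect the distribution through its quantiles
print(df['in_transit_qty'].quantile([0.5, 0.7, 0.9, 0.99, 1.0]))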

forecast_3_month, forecast_6_month, forecast_9_month

These are the forecast sales for the next 3, 6, and 9 months. We will use box plots for these features.

Forecast sales for the next 3, 6, and 9 months are higher for products that went on backorder compared to products that did not. We can measure the correlation between these features, and if they are highly correlated we can keep one of them and drop the others.

From the correlation map it can be seen that forecast_6_month and forecast_9_month are highly correlated, so we can drop one of them.
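
Such a correlation map takes only a few lines (a sketch, assuming df and seaborn as before):

import seaborn as sns
import matplotlib.pyplot as plt

forecast_cols = ['forecast_3_month', 'forecast_6_month', 'forecast_9_month']
sns.heatmap(df[forecast_cols].corr(), annot=True)  # Pearson correlation by default
plt.show()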

min_bank

For the min_bank feature, the data for the two labels overlap heavily.

local_bo_qty

local_bo_qty is the amount of stock orders overdue; 98% of the samples have value zero. This feature may be ignored, since almost all of its values are zero.

pieces_past_due

pieces_past_due is the number of parts overdue from the source; 98% of the samples have value zero. This feature may be ignored, since almost all of its values are zero.

sales_1_month, sales_3_month, sales_6_month, sales_9_month

From the above plots it can be observed that, for the prior 1, 3, and 6 month periods, sales are higher for products that went on backorder.

perf_6_month_avg, perf_12_month_avg

Value counts for the perf_6_month_avg feature are given below.

Only 10% of the samples have a value below 0. perf_6_month_avg and perf_12_month_avg are highly correlated.

Bivariate Analysis

Scatter plots

From all the scatter plots it can be seen that the two classes overlap heavily and are very hard to separate using these plots alone.

Preprocessing

  1. The numeric features are at different scales, and for some of them the range is very large, so such features need to be normalized.
  2. The categorical features all have two possible values, Yes and No, which can be converted into the binary numbers 1 and 0.
  3. SKU is a unique identifier that is not useful for classification, so it can be dropped.
  4. Some features have negative values that carry meaningful information, so those should be kept as they are.
  5. lead_time has 115,617 NaN values; they are replaced using model-based imputation.
  6. Very few other samples have NaNs, so we can drop them.
  7. The data is highly imbalanced, so it should be balanced by sampling or another method after splitting into train and test sets.

Data Preparation

Missing data: lead_time

We have used model-based imputation to replace the missing values. The missingpy library is used for missing-data imputation; it supports two types of models, random-forest-based imputation and KNN-based imputation. The remaining missing values are dropped, since they are very few in number compared to the overall data.
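
A minimal sketch of the random-forest-based variant, using missingpy's MissForest on the numeric columns (the post does not show the exact call, so treat the details as assumptions):

from missingpy import MissForest

numeric_cols = ['national_inv', 'lead_time', 'in_transit_qty', 'forecast_3_month',
                'forecast_6_month', 'forecast_9_month', 'sales_1_month',
                'sales_3_month', 'sales_6_month', 'sales_9_month', 'min_bank',
                'pieces_past_due', 'perf_6_month_avg', 'perf_12_month_avg',
                'local_bo_qty']

# Random-forest-based imputation; NaNs in lead_time are filled
# using the other numeric features
imputer = MissForest(random_state=0)
df[numeric_cols] = imputer.fit_transform(df[numeric_cols])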

All categorical features have two values, Yes and No, so we convert them into the binary values 1 and 0.

All other numeric features are normalized.
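
A sketch of both steps, assuming a pandas mapping for the flags and scikit-learn's MinMaxScaler for normalization (the post does not state which scaler was used):

from sklearn.preprocessing import MinMaxScaler

flag_cols = ['potential_issue', 'deck_risk', 'oe_constraint',
             'ppap_risk', 'stop_auto_buy', 'rev_stop', 'went_on_backorder']

# Map Yes/No flags to 1/0
for col in flag_cols:
    df[col] = df[col].map({'Yes': 1, 'No': 0})

# Scale the remaining numeric features to [0, 1]
num_cols = [c for c in df.select_dtypes(include='number').columns
            if c not in flag_cols]
df[num_cols] = MinMaxScaler().fit_transform(df[num_cols])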

Now get input features as X and labels as y:

X = df.drop(['went_on_backorder', 'sku', 'pieces_past_due'], axis=1)
y = df['went_on_backorder']

Testing on Various Models

The data here is highly imbalanced, so most machine learning models, which work well on roughly balanced data sets, will perform poorly on it. To address this, sampling techniques should be used. Some techniques for handling imbalanced data are random sampling, random over-sampling, the Synthetic Minority Over-sampling Technique (SMOTE), and cost-sensitive learning frameworks.
We have to make sure that sampling is applied only to the training data, since we have to test our model under real conditions, where the data will be highly imbalanced. Here we apply these techniques with different bagging and boosting algorithms.

Random Over-Sampling with Random Forest Classifier
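
A minimal sketch of this setup with imbalanced-learn and scikit-learn; the hyperparameters are placeholders, since the post's exact values are not shown:

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from imblearn.over_sampling import RandomOverSampler

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# Over-sample the minority class in the training split only
X_res, y_res = RandomOverSampler(random_state=0).fit_resample(X_train, y_train)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_res, y_res)
print(roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1]))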

SMOTE for oversample with Random Forest Classifier
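
SMOTE synthesizes new minority-class samples instead of duplicating existing ones, so swapping it in is essentially a one-line change (a sketch under the same assumptions as above):

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from imblearn.over_sampling import SMOTE

# Synthetic minority over-sampling on the training split
X_res, y_res = SMOTE(random_state=0).fit_resample(X_train, y_train)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_res, y_res)
print(roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1]))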

AdaBoost Classifier

Gradient Boosting Classifier
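
The learning-rate sweep behind the results below can be written as a loop; this is only a sketch, with other hyperparameters assumed to be scikit-learn defaults and X_res, y_res, X_test, y_test reused from the sketches above:

from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score

# Sweep the learning rate and report test AUC for each setting
for lr in [0.05, 0.075, 0.1, 0.25, 0.5, 0.75, 1.0]:
    gbc = GradientBoostingClassifier(learning_rate=lr, random_state=0)
    gbc.fit(X_res, y_res)
    score = roc_auc_score(y_test, gbc.predict_proba(X_test)[:, 1])
    print(f'Gradient Boosting algorithm at lr {lr}: AUC score is {score}')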

Gradient Boosting algorithm at lr 0.05: AUC score is 0.829915077329052
Gradient Boosting algorithm at lr 0.075: AUC score is 0.834305562163702
Gradient Boosting algorithm at lr 0.1: AUC score is 0.8398305574688069
Gradient Boosting algorithm at lr 0.25: AUC score is 0.8520098317669961
Gradient Boosting algorithm at lr 0.5: AUC score is 0.8544669969654695
Gradient Boosting algorithm at lr 0.75: AUC score is 0.8507290958344305
Gradient Boosting algorithm at lr 1: AUC score is 0.8518704733014806

Stacking Classifier with cross-validation
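
scikit-learn's StackingClassifier fits base models on cross-validated folds and a meta-model on their out-of-fold predictions. The post does not list which models were stacked, so the base learners below are assumptions:

from sklearn.ensemble import StackingClassifier, RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression

stack = StackingClassifier(
    estimators=[('rf', RandomForestClassifier(random_state=0)),
                ('gb', GradientBoostingClassifier(random_state=0))],
    final_estimator=LogisticRegression(),
    cv=5)  # out-of-fold predictions feed the meta-learner
stack.fit(X_res, y_res)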

Now, after the classical ML models, last but not least we will look at the performance of a multi-layer perceptron on the same problem.

MLP based classification
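
The post does not show the network architecture, so the following is only a minimal sketch using scikit-learn's MLPClassifier with an assumed two-hidden-layer configuration, reusing X_res, y_res, X_test, y_test from the sketches above:

from sklearn.neural_network import MLPClassifier
from sklearn.metrics import roc_auc_score

# Two hidden layers of 64 and 32 units (an assumed configuration)
mlp = MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=200, random_state=0)
mlp.fit(X_res, y_res)
print('AUC:', roc_auc_score(y_test, mlp.predict_proba(X_test)[:, 1]))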

AUC Score is: 0.9405758870258877

Conclusions

We have discussed a case study of the backorder prediction problem using machine learning, walking through the problem from EDA to data preprocessing. After preparing the data, we tested various ML models and an MLP-based classifier; based on our experiments, the best AUC score of 0.94 was obtained with the MLP-based classifier.

