Handling Missing Data

Handling missing data involves identifying and addressing gaps in a dataset where certain values are absent.

Types of Missing Data

Missing Completely at Random (MCAR)

The probability that a value is missing is unrelated to any variable in the dataset, observed or unobserved. For example, a person accidentally skips a question in an application.

Missing at Random (MAR)

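The probability that a value is missing depends on other observed variables, but not on the missing value itself. For instance, younger applicants may be less likely to report their income, so the gaps in income can be explained by age, which is recorded.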

Missing Not at Random (MNAR)

The probability that a value is missing depends on the unobserved value itself. For example, borrowers with a poor credit score might not disclose their credit history.

Methods to Handle Missing Data

Deleting Missing Data

Deletion removes the rows or columns that contain missing values. This simplifies the dataset and avoids having to make assumptions about the missing data. Deletion is reasonable when the amount of missing data is very small and removing it will not significantly affect the analysis. It is not recommended when a large portion of the dataset is missing, because the gaps may themselves carry valuable information or may have arisen for specific reasons.

In the data below, one row is missing both the borrower name and the loan status.

Original Loan Dataset:

Loan ID	Borrower Name	Loan Amount	Loan Status
LOAN001	Alice Smith	10,000	Approved
LOAN002	NULL	15,000	NULL
LOAN003	Carol Lee	20,000	Pending

In this dataset, Loan ID LOAN002 has null values, so that row is removed.

After Deletion:

Loan ID	Borrower Name	Loan Amount	Loan Status
LOAN001	Alice Smith	10,000	Approved
LOAN003	Carol Lee	20,000	Pending
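The deletion step above can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the field names and the helper drop_incomplete_rows are hypothetical, and None stands in for NULL.

```python
# Listwise deletion: keep only rows in which every field is present.
# None stands in for NULL; the rows mirror the example table above.
# Field names and the helper name are illustrative assumptions.
loans = [
    {"loan_id": "LOAN001", "borrower": "Alice Smith", "amount": 10_000, "status": "Approved"},
    {"loan_id": "LOAN002", "borrower": None, "amount": 15_000, "status": None},
    {"loan_id": "LOAN003", "borrower": "Carol Lee", "amount": 20_000, "status": "Pending"},
]


def drop_incomplete_rows(rows):
    """Return a new list containing only rows with no missing (None) values."""
    return [row for row in rows if all(value is not None for value in row.values())]


complete_loans = drop_incomplete_rows(loans)
print([row["loan_id"] for row in complete_loans])  # ['LOAN001', 'LOAN003']
```

Returning a new list leaves the original rows untouched, which makes it easy to check how many records the deletion removed.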

Mean Imputation

Mean imputation replaces missing numerical values with the mean of the existing values in the same column. This keeps the dataset complete without removing any rows. It is normally applied to numerical data that is roughly symmetrically distributed. It is not recommended for skewed data, because extreme values pull the mean away from typical values and can distort the results.

The dataset below has a missing loan amount.

Original Dataset:

Loan ID	Loan Amount
LOAN001	10,000
LOAN002	NULL
LOAN003	20,000

Here, the missing loan amount is replaced with the mean of the existing values: (10,000 + 20,000) / 2 = 15,000.

After Mean Imputation:

Loan ID	Loan Amount
LOAN001	10,000
LOAN002	15,000
LOAN003	20,000
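As a rough sketch, the same calculation can be written with Python's statistics module. The dictionary layout and the helper impute_mean are assumptions made for this example; None stands in for NULL.

```python
from statistics import mean

# Loan amounts keyed by loan ID; None stands in for NULL.
amounts = {"LOAN001": 10_000, "LOAN002": None, "LOAN003": 20_000}


def impute_mean(values):
    """Replace each missing value with the mean of the observed values."""
    observed = [v for v in values.values() if v is not None]
    fill = mean(observed)
    return {key: (fill if v is None else v) for key, v in values.items()}


imputed_amounts = impute_mean(amounts)
print(imputed_amounts["LOAN002"])  # 15000
```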

Median Imputation

Median imputation replaces missing values with the median of the existing data. The median is the middle value in a sorted list of numbers; for a list with an even number of values, it is the average of the two middle values. It works well for skewed data or data with outliers because, unlike the mean, it is not affected by extreme values. For symmetrically distributed data, however, the mean may better represent the central value.

Let us consider a loan dataset with missing loan amounts and interest rates.

Original dataset:

Loan ID	Borrower Name	Loan Amount	Interest Rate (%)
LOAN001	Alice Smith	10,000	5.5
LOAN002	Bob Johnson	NULL	6.0
LOAN003	Carol Lee	15,000	NULL
LOAN004	David Clark	20,000	5.0
LOAN005	Eva Brown	NULL	7.0

Loan Amount:

Existing values: 10,000, 15,000, 20,000
Median = Middle value = 15,000 (since the list has an odd number of values).

Interest Rate:

Existing values: 5.5, 6.0, 5.0, 7.0
Sorted list: 5.0, 5.5, 6.0, 7.0
Median = Average of the two middle values = (5.5 + 6.0) / 2 = 5.75.

After Median Imputation:

Loan ID	Borrower Name	Loan Amount	Interest Rate (%)
LOAN001	Alice Smith	10,000	5.5
LOAN002	Bob Johnson	15,000	6.0
LOAN003	Carol Lee	15,000	5.75
LOAN004	David Clark	20,000	5.0
LOAN005	Eva Brown	15,000	7.0
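The two median calculations above could be reproduced as follows. This is a minimal sketch: the column lists and the helper impute_median are hypothetical names, and None stands in for NULL.

```python
from statistics import median

# Columns in row order LOAN001..LOAN005; None stands in for NULL.
loan_amounts = [10_000, None, 15_000, 20_000, None]
interest_rates = [5.5, 6.0, None, 5.0, 7.0]


def impute_median(column):
    """Replace each missing value with the median of the observed values."""
    fill = median(v for v in column if v is not None)
    return [fill if v is None else v for v in column]


print(impute_median(loan_amounts))    # [10000, 15000, 15000, 20000, 15000]
print(impute_median(interest_rates))  # [5.5, 6.0, 5.75, 5.0, 7.0]
```

statistics.median handles both the odd-length and even-length cases, so no special handling is needed for the interest rate column.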

Mode Imputation

Mode imputation replaces missing values in categorical data with the most common category in the dataset. It works well when one category clearly dominates, such as a loan status column in which most loans are Approved. This method is simple; however, avoid it when the most frequent category does not accurately represent the missing data.

Let us consider a loan dataset with missing loan statuses. Mode imputation will replace the missing values with the most frequent category in the dataset.

Original dataset:

Loan ID	Borrower Name	Loan Status
LOAN001	Alice Smith	Approved
LOAN002	Bob Johnson	NULL
LOAN003	Carol Lee	Approved
LOAN004	David Clark	Pending
LOAN005	Eva Brown	NULL

Identify the Most Frequent Category (Mode)

Approved: 2 times, Pending: 1 time, NULL: 2 missing values

The most frequent category is “Approved.”

After Mode Imputation:

Loan ID	Borrower Name	Loan Status
LOAN001	Alice Smith	Approved
LOAN002	Bob Johnson	Approved
LOAN003	Carol Lee	Approved
LOAN004	David Clark	Pending
LOAN005	Eva Brown	Approved
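A short sketch of the same steps, using collections.Counter to count categories. The helper impute_mode is a hypothetical name; None stands in for NULL.

```python
from collections import Counter

# Loan statuses in row order LOAN001..LOAN005; None stands in for NULL.
statuses = ["Approved", None, "Approved", "Pending", None]


def impute_mode(column):
    """Replace each missing value with the most frequent observed category."""
    counts = Counter(v for v in column if v is not None)
    # most_common(1) returns a list with one (category, count) pair.
    # If two categories tie, the one counted first wins.
    fill, _ = counts.most_common(1)[0]
    return [fill if v is None else v for v in column]


print(impute_mode(statuses))
# ['Approved', 'Approved', 'Approved', 'Pending', 'Approved']
```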

Arbitrary Value Imputation

This method can be used for both numerical and categorical values, and is often applied when the data is suspected to be MNAR. Each missing entry is replaced with a fixed placeholder. For numerical columns, the placeholder is chosen to lie well outside the range of the observed values, beyond either end of the sorted data, so that imputed entries stay distinguishable from real observations. For categorical columns, a new category is introduced. For example, in the loan dataset there may be missing entries for gender. In that case, each missing value is assigned the new category ‘Missing’.
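A minimal sketch of placeholder imputation for one categorical and one numerical column. The sample values, the sentinel -1, and the helper impute_constant are all assumptions chosen for illustration; the sentinel only makes sense because no real credit score in the sample is negative.

```python
def impute_constant(column, placeholder):
    """Replace each missing value (None) with a fixed placeholder."""
    return [placeholder if v is None else v for v in column]


# Categorical column: missing entries become their own category.
genders = ["Female", None, "Male", None]
print(impute_constant(genders, "Missing"))
# ['Female', 'Missing', 'Male', 'Missing']

# Numerical column: the sentinel must lie outside the observed range
# so that imputed entries cannot be mistaken for real values.
credit_scores = [700, None, 750, 800]
SENTINEL = -1  # hypothetical choice for this sample
observed = [v for v in credit_scores if v is not None]
assert SENTINEL < min(observed) or SENTINEL > max(observed)
print(impute_constant(credit_scores, SENTINEL))  # [700, -1, 750, 800]
```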

Forward/Backward Fill Imputation

Forward fill replaces each missing value with the last known value before it; backward fill replaces it with the next known value after it. Both are commonly used in time-series data, where neighboring values can logically carry forward or backward, such as loan repayments or stock prices. They leave no gaps in the series over time, but they assume the value did not change during the gap. Avoid them when the missing values fall at important points, such as a sudden change in the series, because the fill would hide that change.

Let us consider a loan repayment dataset where some daily repayment amounts are missing.

Original dataset:

Date	Repayment Amount
01-01-2025	1,000
01-02-2025	NULL
01-03-2025	NULL
01-04-2025	1,200
01-05-2025	NULL

Apply Forward Fill

Replace the missing value on 01-02-2025 with the repayment amount from 01-01-2025 (1,000).
Replace the missing value on 01-03-2025 with the value now held by 01-02-2025 (1,000).
Replace the missing value on 01-05-2025 with the repayment amount from 01-04-2025 (1,200).

After Forward Fill:

Date	Repayment Amount
01-01-2025	1,000
01-02-2025	1,000
01-03-2025	1,000
01-04-2025	1,200
01-05-2025	1,200
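Forward fill on the repayment series above might be sketched like this, with backward fill shown for comparison. The function names are hypothetical, and None stands in for NULL.

```python
def forward_fill(values):
    """Carry the last observed value forward into each gap.

    Leading gaps stay None because there is nothing to carry forward.
    """
    filled, last_seen = [], None
    for v in values:
        if v is not None:
            last_seen = v
        filled.append(last_seen)
    return filled


def backward_fill(values):
    """Carry the next observed value backward by filling the reversed list."""
    return forward_fill(values[::-1])[::-1]


# Daily repayments from 01-01-2025 to 01-05-2025; None stands in for NULL.
repayments = [1_000, None, None, 1_200, None]
print(forward_fill(repayments))   # [1000, 1000, 1000, 1200, 1200]
print(backward_fill(repayments))  # [1000, 1200, 1200, 1200, None]
```

The two directions disagree on the middle gaps and leave different ends unfilled, which is worth checking before choosing one for a given series.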

Regression Imputation

Regression imputation is a method used when missing values can be predicted based on relationships with other variables in the dataset. It works where variables are strongly correlated, allowing predictions to fill in the gaps. This method is particularly useful when the missing data depends on other known information, such as predicting a loan amount based on income and credit score. If the relationships between variables are weak, avoid regression imputation, as it could lead to inaccurate or biased results.

Let us consider a loan dataset with missing loan amounts. Using regression imputation, we can predict the missing values from the borrower’s income and credit score. For illustration, assume the regression equation for loan amount is:

Loan Amount = (Income × 0.2) + (Credit Score × 15) − 5,000

Original dataset:

Loan ID	Income	Credit Score	Loan Amount
LOAN001	50,000	700	10,000
LOAN002	60,000	750	NULL
LOAN003	70,000	800	20,000
LOAN004	55,000	720	NULL

Apply Regression Imputation:

For LOAN002:

Loan Amount = (60,000 × 0.2) + (750 × 15) − 5,000
Loan Amount = 12,000 + 11,250 − 5,000 = 18,250

For LOAN004:

Loan Amount = (55,000 × 0.2) + (720 × 15) − 5,000
Loan Amount = 11,000 + 10,800 − 5,000 = 16,800

After Regression Imputation:

Loan ID	Income	Credit Score	Loan Amount
LOAN001	50,000	700	10,000
LOAN002	60,000	750	18,250
LOAN003	70,000	800	20,000
LOAN004	55,000	720	16,800
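The prediction step above can be sketched as follows. The coefficients are taken directly from the formula in the text; fitting them from data is outside this sketch. The field names and helper functions are hypothetical, and None stands in for NULL.

```python
# Coefficients from the formula above:
# Loan Amount = (Income x 0.2) + (Credit Score x 15) - 5,000
INCOME_COEF = 0.2
SCORE_COEF = 15
INTERCEPT = -5_000


def predict_loan_amount(income, credit_score):
    """Apply the regression equation and round to a whole currency unit."""
    return round(income * INCOME_COEF + credit_score * SCORE_COEF + INTERCEPT)


def impute_by_regression(rows):
    """Fill missing loan amounts with the regression prediction."""
    filled = []
    for row in rows:
        amount = row["amount"]
        if amount is None:
            amount = predict_loan_amount(row["income"], row["credit_score"])
        filled.append({**row, "amount": amount})
    return filled


loans = [
    {"loan_id": "LOAN001", "income": 50_000, "credit_score": 700, "amount": 10_000},
    {"loan_id": "LOAN002", "income": 60_000, "credit_score": 750, "amount": None},
    {"loan_id": "LOAN003", "income": 70_000, "credit_score": 800, "amount": 20_000},
    {"loan_id": "LOAN004", "income": 55_000, "credit_score": 720, "amount": None},
]

for row in impute_by_regression(loans):
    print(row["loan_id"], row["amount"])
```

Observed amounts are left as they are; only the two missing entries are replaced by predictions.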

K-Nearest Neighbors (KNN) Imputation

K-Nearest Neighbors (KNN) imputation predicts missing values by finding the most similar data points (neighbors) in the dataset and using their values to fill in the gaps. It works well when there are clear relationships between variables.
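To make the idea concrete, here is a small sketch of KNN imputation in plain Python. The records are hypothetical and not taken from the tables above. The sketch assumes Euclidean distance on min-max scaled features and fills each gap with the average of the k nearest complete rows; other distance measures and weightings are equally valid choices.

```python
from math import dist

# Hypothetical records: (income, credit_score, loan_amount).
# None marks a missing loan amount.
records = [
    (50_000, 700, 10_000),
    (52_000, 710, 11_000),
    (70_000, 800, 20_000),
    (72_000, 790, 21_000),
    (51_000, 705, None),
]


def knn_impute(rows, k=2):
    """Fill a missing last column with the mean of the k nearest complete rows."""
    features = [row[:-1] for row in rows]
    lows = [min(col) for col in zip(*features)]
    highs = [max(col) for col in zip(*features)]

    def scale(point):
        # Min-max scaling so income does not dominate the distance.
        return [
            (x - lo) / (hi - lo) if hi > lo else 0.0
            for x, lo, hi in zip(point, lows, highs)
        ]

    complete = [(scale(row[:-1]), row[-1]) for row in rows if row[-1] is not None]
    result = []
    for row in rows:
        if row[-1] is not None:
            result.append(row)
            continue
        target = scale(row[:-1])
        nearest = sorted(complete, key=lambda item: dist(target, item[0]))[:k]
        estimate = sum(value for _, value in nearest) / len(nearest)
        result.append((*row[:-1], estimate))
    return result


print(knn_impute(records, k=2)[-1])  # (51000, 705, 10500.0)
```

The incomplete row sits closest to the first two records, so its estimate is the average of their loan amounts.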

Multiple Imputation by Chained Equations (MICE)

Multiple Imputation by Chained Equations (MICE) is a more advanced method that creates several complete datasets with different predictions for missing values, analyses each one, and combines the results. This approach captures the uncertainty around missing data and provides more reliable results but is computationally demanding and requires expertise to implement effectively.
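The final "combine the results" step is commonly done with Rubin's rules, which the sketch below illustrates. It covers only the pooling step, not the chained-equation imputations themselves, and the numbers are hypothetical: each estimate is assumed to be the same statistic (for example, a mean loan amount) computed on one of three imputed datasets, together with its squared standard error.

```python
from statistics import mean, variance


def pool_estimates(estimates, within_variances):
    """Pool per-dataset results with Rubin's rules.

    Returns the pooled estimate and its total variance:
    total = within + (1 + 1/m) * between
    """
    m = len(estimates)
    pooled = mean(estimates)
    within = mean(within_variances)   # average sampling variance
    between = variance(estimates)     # sample variance across imputed datasets
    total = within + (1 + 1 / m) * between
    return pooled, total


# Hypothetical results from three imputed datasets.
estimates = [15_000, 15_400, 14_600]
within_variances = [250_000, 250_000, 250_000]

pooled, total_variance = pool_estimates(estimates, within_variances)
print(pooled)  # 15000
print(total_variance)
```

The between-dataset term is what captures the uncertainty due to the missing data: the more the imputed datasets disagree, the larger the total variance.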