Handling Missing Data

Handling missing data involves identifying and addressing gaps in a dataset where certain values are absent.
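
As a starting point, the gaps have to be located before they can be addressed. The sketch below assumes the data lives in a pandas DataFrame; the `loans` table and its column names are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical loan data; column names and values are invented for illustration.
loans = pd.DataFrame({
    "income": [52000, np.nan, 61000, 48000, np.nan],
    "credit_score": [710, 650, np.nan, 690, 720],
    "loan_status": ["Approved", "Approved", None, "Rejected", "Approved"],
})

# Number of missing values in each column.
missing_counts = loans.isna().sum()

# Fraction of missing values in each column (between 0 and 1).
missing_share = loans.isna().mean()

print(missing_counts)
print(missing_share)
```

Looking at the share of missing values per column is usually the first input into choosing one of the methods described later.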

Types of Missing Data

Missing Completely at Random (MCAR)

Data is missing without any pattern: the chance that a value is missing is unrelated to any variable in the dataset, observed or not. For example, a person accidentally skips a question in an application.

Missing at Random (MAR)

Data is missing for a reason that is related to other observed variables, but not to the missing value itself. For instance, younger applicants may be less likely to report their income, regardless of how much they earn.

Missing Not at Random (MNAR)

Data is missing for a reason that depends on the unobserved value itself. For example, borrowers with a poor credit score might not disclose their credit history.
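
The three mechanisms can be illustrated with a small simulation. This is a sketch under stated assumptions: it uses the standard statistical definitions (MCAR independent of everything, MAR dependent on another observed variable, MNAR dependent on the missing value itself), and the `age` and `income` variables, probabilities, and thresholds are all invented.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
age = rng.integers(20, 70, size=n)           # always observed
income = rng.normal(50_000, 15_000, size=n)  # the variable that will have gaps

# MCAR: every income value has the same 20% chance of being missing.
mcar_mask = rng.random(n) < 0.2

# MAR: the chance of a gap depends on age (observed), not on income itself.
mar_mask = rng.random(n) < np.where(age < 30, 0.4, 0.1)

# MNAR: the chance of a gap depends on the income value that goes unreported.
mnar_mask = rng.random(n) < np.where(income > 65_000, 0.5, 0.05)

for name, mask in [("MCAR", mcar_mask), ("MAR", mar_mask), ("MNAR", mnar_mask)]:
    observed_mean = income[~mask].mean()
    print(f"{name}: mean of observed income = {observed_mean:,.0f}")
```

In this setup the MNAR case pulls the observed mean below the true mean, because high incomes are the ones most often hidden.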

Methods to Handle Missing Data

Deleting Missing Data

Deletion removes the rows or columns that contain missing values. This simplifies the dataset and avoids the need to make assumptions about the missing data. Affected rows can be deleted when the amount of missing data is very small and removing them will not significantly impact the analysis. Deletion is not recommended when a large portion of the dataset has missing values, because the gaps may themselves carry valuable insights or arise for specific reasons.
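
A minimal sketch of deletion with pandas `dropna` follows. The `loans` table and the 50% threshold are assumptions chosen for illustration, not a recommended cut-off.

```python
import numpy as np
import pandas as pd

# Hypothetical data: "notes" is mostly empty, "income" has a single gap.
loans = pd.DataFrame({
    "income": [52000, np.nan, 61000, 48000],
    "credit_score": [710, 650, 700, 690],
    "notes": [np.nan, np.nan, np.nan, "late payment"],
})

# Drop every row that has at least one missing value.
rows_dropped = loans.dropna(axis=0)

# Drop every column that has at least one missing value.
cols_dropped = loans.dropna(axis=1)

# Keep only columns in which at least half of the values are present,
# then drop the few rows that still contain a gap.
min_present = int(len(loans) * 0.5)
mostly_complete = loans.dropna(axis=1, thresh=min_present)
cleaned = mostly_complete.dropna(axis=0)
```

Dropping rows directly leaves a single row here, because the sparse `notes` column disqualifies almost every record. Removing that column first keeps three of the four rows.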

Mean Imputation

Mean imputation is the process of replacing missing numerical values with the mean value of the existing values in the same column. This ensures the dataset remains complete without removing any rows. Mean imputation is normally applied to numerical data that is approximately normally distributed. It is not recommended for skewed datasets, because the mean is pulled toward extreme values and can distort the results.
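
A minimal pandas sketch of mean imputation, using an invented `income` column:

```python
import numpy as np
import pandas as pd

# Hypothetical column with two gaps.
income = pd.Series([40000, 50000, np.nan, 60000, np.nan], name="income")

# Series.mean() skips NaN values by default.
mean_value = income.mean()

# Replace every gap with the mean of the observed values.
income_filled = income.fillna(mean_value)
print(income_filled)
```

Every gap receives the same value, so the filled column has less spread than the original data would have had.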

Median Imputation

Median imputation replaces missing values with the middle value of the existing data. The median is the middle value in a sorted list of numbers. For datasets with an even number of values, the median is the average of the two middle values. It works well for datasets with outliers or skewed data because it is not affected by extreme values, unlike the mean. However, for symmetric data without outliers, the mean might better represent the central value.
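
The difference from the mean shows up clearly when one value is extreme. The sketch below uses an invented `income` column with a single very large entry.

```python
import numpy as np
import pandas as pd

# Hypothetical column with one gap and one extreme value.
income = pd.Series([30000, 35000, np.nan, 40000, 1_000_000], name="income")

median_value = income.median()  # average of 35000 and 40000 (four observed values)
mean_value = income.mean()      # dragged upward by the 1,000,000 entry

# Fill the gap with the median rather than the mean.
income_filled = income.fillna(median_value)
print(median_value, mean_value)
```

Here the median stays close to the typical applicant, while the mean lands far above three of the four observed values.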

Mode Imputation

Mode imputation replaces missing values in categorical data with the most common category in the dataset. It works well when one category clearly dominates, such as a loan status of Approved. This method is simple; however, avoid using mode imputation if the most frequent category doesn’t represent the missing data accurately.
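
A short sketch of mode imputation on an invented `loan_status` column:

```python
import pandas as pd

# Hypothetical categorical column with two gaps.
status = pd.Series(
    ["Approved", "Approved", None, "Rejected", "Approved", None],
    name="loan_status",
)

# mode() ignores missing values and returns a Series, because ties are possible;
# taking the first entry is an arbitrary tie-breaking choice.
most_common = status.mode().iloc[0]

status_filled = status.fillna(most_common)
print(status_filled.value_counts())
```

Note that the dominant category becomes even more dominant after filling, which is worth keeping in mind when class balance matters.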

Arbitrary Value Imputation

This method can be used for both numerical and categorical values, typically when the data is MNAR. The missing entries are assigned a fixed value that does not occur in the observed data; for numerical columns, this is a value far beyond either end of the observed range. The imputed entries therefore stay distinguishable from real observations, so the fact that a value was missing is preserved rather than hidden. For example, in a loan dataset there may be missing entries for gender. In that case, the missing value is assigned a new value ‘Missing’.
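
A minimal sketch of arbitrary value imputation in pandas. The columns are invented, and the numeric sentinel `-1` is an assumption: it only works because no real credit score can take that value.

```python
import numpy as np
import pandas as pd

# Hypothetical loan data with gaps in a categorical and a numerical column.
loans = pd.DataFrame({
    "gender": ["F", None, "M", None],
    "credit_score": [710, np.nan, 650, 690],
})

# Categorical column: introduce an explicit new category.
loans["gender"] = loans["gender"].fillna("Missing")

# Numerical column: use a sentinel far outside the plausible range (assumed here to be -1).
loans["credit_score"] = loans["credit_score"].fillna(-1)
print(loans)
```

Sentinel values suit tree-based models better than linear ones, since a linear model would treat `-1` as a genuinely low score.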

Forward/Backward Fill Imputation

Forward fill replaces each missing value with the last known value before it; backward fill replaces it with the next known value after it. Both are commonly used in time-series data, where neighboring values can logically carry forward or backward, such as loan repayments or stock prices. This helps maintain trends and ensures there are no gaps in the data over time. Avoid these methods if the missing values occur at important points, where a carried-over value would hide what actually happened.
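
Both directions can be sketched with pandas `ffill` and `bfill` on an invented daily price series:

```python
import numpy as np
import pandas as pd

# Hypothetical daily prices with gaps, including one at the very start.
prices = pd.Series(
    [np.nan, 101.0, np.nan, np.nan, 104.0],
    index=pd.date_range("2024-01-01", periods=5, freq="D"),
)

# Forward fill: carry the last known value forward.
forward = prices.ffill()   # the first value stays NaN, nothing precedes it

# Backward fill: pull the next known value backward.
backward = prices.bfill()  # the leading gap becomes 101.0

print(forward)
print(backward)
```

Forward fill cannot fill a gap at the start of a series, and backward fill cannot fill one at the end, so the two are sometimes chained.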

Regression Imputation

Regression imputation is a method used when missing values can be predicted based on relationships with other variables in the dataset. It works where variables are strongly correlated, allowing predictions to fill in the gaps. This method is particularly useful when the missing data depends on other known information, such as predicting a loan amount based on income and credit score. If the relationships between variables are weak, avoid regression imputation, as it could lead to inaccurate or biased results.
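
One possible sketch of regression imputation uses scikit-learn's `LinearRegression`. The data, the column names, and the choice of a plain linear model are all assumptions for illustration; any regressor that fits the relationship could be substituted.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical loan data; the last loan_amount is missing.
loans = pd.DataFrame({
    "income": [30000, 45000, 60000, 75000, 90000, 50000],
    "credit_score": [620, 640, 710, 730, 800, 680],
    "loan_amount": [10000, 15500, 20000, 24000, 30500, np.nan],
})
features = ["income", "credit_score"]

# Split rows by whether the target value is known.
known = loans[loans["loan_amount"].notna()]
unknown = loans[loans["loan_amount"].isna()]

# Fit on complete rows, then predict the missing targets.
model = LinearRegression().fit(known[features], known["loan_amount"])
loans.loc[unknown.index, "loan_amount"] = model.predict(unknown[features])
print(loans)
```

The predicted values fall exactly on the fitted line, so this approach understates the natural variability of the data unless noise is added to the predictions.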

K-Nearest Neighbors (KNN) Algorithm

K-Nearest Neighbors (KNN) imputation predicts missing values by finding the most similar data points (neighbors) in the dataset and using their values to fill in the gaps. It works well when there are clear relationships between variables.
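
scikit-learn provides `KNNImputer` for this. The sketch below uses invented income and credit-score values and leaves the features unscaled for brevity; in practice, features on very different scales are usually standardized first so that one of them does not dominate the distance.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Hypothetical rows of [income, credit_score]; the last credit score is missing.
X = np.array([
    [30000, 600],
    [32000, 610],
    [90000, 800],
    [92000, 790],
    [88000, np.nan],
])

# Fill each gap with the average of that feature over the 2 nearest rows,
# where distance is computed on the features both rows have observed.
imputer = KNNImputer(n_neighbors=2)
X_filled = imputer.fit_transform(X)
print(X_filled)
```

The last row is closest in income to the two high-income rows, so its credit score is filled with the average of theirs.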

Multiple Imputation by Chained Equations (MICE)

Multiple Imputation by Chained Equations (MICE) is a more advanced method that creates several complete datasets with different predictions for missing values, analyses each one, and combines the results. This approach captures the uncertainty around missing data and provides more reliable results but is computationally demanding and requires expertise to implement effectively.
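
A rough sketch of the idea can be built on scikit-learn's `IterativeImputer`, which is an experimental, MICE-inspired imputer and not a full MICE implementation. Each call returns one completed dataset, so several are generated here by varying the random seed with `sample_posterior=True`. The simulated data is invented, and the pooling step only averages a point estimate; proper pooling would also combine the variances.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables the next import)
from sklearn.impute import IterativeImputer

# Simulated data: loan amount depends on income, with about 20% of loans missing.
rng = np.random.default_rng(0)
income = rng.normal(50, 10, size=200)
loan = 0.4 * income + rng.normal(0, 2, size=200)
X = np.column_stack([income, loan])
X[rng.random(200) < 0.2, 1] = np.nan

# Create several completed datasets and analyse each one (here: mean loan amount).
estimates = []
for seed in range(5):
    imputer = IterativeImputer(sample_posterior=True, random_state=seed, max_iter=10)
    completed = imputer.fit_transform(X)
    estimates.append(completed[:, 1].mean())

# Combine the per-dataset results into one pooled estimate.
pooled_mean = float(np.mean(estimates))
print(estimates, pooled_mean)
```

The spread of the individual estimates gives a feel for how much uncertainty the missing values introduce.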