Unlocking the Power of Data: Mastering Machine Learning Imputation Techniques with Mathematical Precision
PART-1
INTRODUCTION
Machine learning is an essential part of modern data science, and it involves training models on large datasets to make predictions or classify data. However, real-world datasets are often incomplete or contain missing values, which can significantly impact the accuracy of machine learning models. This is where imputation techniques come into play.
Whether you're a data scientist or a machine learning enthusiast, this blog is for you. By the end of this post, you will have a solid understanding of imputation techniques in machine learning and the math behind them, empowering you to confidently tackle missing data in your projects.
What is imputation?
Imputation is a technique used to fill in missing values or data points in a dataset. The process of imputation involves estimating the missing values based on the available data in the dataset.
Types of missing values
Missing Completely at Random (MCAR): In this type, the missingness is unrelated to any observed or unobserved variables in the dataset. The missing values occur randomly.
For example:
- In a survey, some respondents skipped certain questions due to distractions or random errors.
Missing at Random (MAR): In this type, the missingness is related to some observed variables in the dataset but not the missing variable itself. The missing values occur systematically based on other variables.
For example:
- In a health study, older participants might be more likely to skip questions about their smoking habits. Since age is recorded in the dataset, the missingness depends on an observed variable (age) rather than on the unrecorded smoking status itself.
Missing Not at Random (MNAR): In this type, the missingness is related to the missing variable itself, which means the missingness is driven by unobserved factors or factors not included in the dataset. The missing values occur systematically and are dependent on the value of the missing variable.
For example:
- In an income survey, high-income individuals might be less likely to disclose their income, leading to a higher proportion of missing values for high-income observations.
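Before choosing an imputation technique, it helps to quantify how much data is missing in each column. A minimal pandas sketch (the column names and values here are illustrative, not from a real survey):

```python
import pandas as pd
import numpy as np

# Illustrative survey data with missing entries
df = pd.DataFrame({
    'age': [25, np.nan, 47, 33, np.nan],
    'income': [50000, 62000, np.nan, 58000, 45000]
})

# Count of missing values per column
missing_counts = df.isnull().sum()
print(missing_counts)       # age: 2, income: 1

# Fraction of missing values per column
missing_fractions = df.isnull().mean()
print(missing_fractions)    # age: 0.4, income: 0.2
```

Note that these counts only tell you how much is missing, not why; distinguishing MCAR, MAR, and MNAR requires domain knowledge about how the data was collected.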
Mean imputation
Mean imputation is a method used to fill in missing values in a dataset by replacing them with the mean value of that column or feature.
It is a simple and widely used technique that works well for numerical data.
Formula
To calculate the mean of a dataset, we add up all the available values and divide by the total number of values:
$$\text{Mean} = \frac{x_1 + x_2 + \dots + x_N}{N}$$
Example
Suppose we have a dataset with the following values
[10, 12, 15, 8, NaN, 17, 20, NaN, 25]
In this dataset, there are two missing values represented by NaN. To fill in these missing values using mean imputation, we first need to calculate the mean of the available values in the dataset:
mean = (10 + 12 + 15 + 8 + 17 + 20 + 25) / 7 = 107 / 7 ≈ 15.2857
We can then substitute the missing values with the mean value:
[10, 12, 15, 8, 15.2857, 17, 20, 15.2857, 25]
CODE
Code for handling missing values in a DataFrame using mean imputation:
import pandas as pd
import numpy as np
# Create a sample dataframe with missing values
df = pd.DataFrame({
    'A': [10, 12, 15, 8, np.nan, 17, 20, np.nan, 25],
    'B': [1, np.nan, 3, np.nan, 5, 6, np.nan, 8, 9]
})
# Calculate mean of each column
means = df.mean()
# Impute missing values with means
df = df.fillna(means)
print(df)
OUTPUT
A B
0 10.000000 1.000000
1 12.000000 5.333333
2 15.000000 3.000000
3 8.000000 5.333333
4 15.285714 5.000000
5 17.000000 6.000000
6 20.000000 5.333333
7 15.285714 8.000000
8 25.000000 9.000000
Median imputation
Median imputation is another simple technique used to fill in missing values in a dataset by replacing them with the median of the non-missing values in that column.
The median is the middle value when the data is sorted in ascending or descending order.
Steps to compute
Sort the dataset in ascending order.
Determine the number of observations in the dataset, denoted by n.
CASE 1: If n is odd, the median is the middle observation of the sorted dataset. For example, if the dataset has 7 observations, the median is the 4th observation when the dataset is sorted in ascending order.
Example
Suppose we have the following dataset:
X = {5, 1, 3, 7, 2, 8, 4}
We first sort the dataset in ascending order:
X_sorted = {1, 2, 3, 4, 5, 7, 8}
Since the dataset has 7 observations, which is odd, the median is the middle observation, which is the 4th observation in this case:
median(X) = 4
CASE 2: If n is even, the median is the average of the two middle observations of the sorted dataset. For example, if the dataset has 8 observations, the median is the average of the 4th and 5th observations when the dataset is sorted in ascending order.
Example
Suppose we have the following dataset:
X = {5, 1, 3, 7, 2, 8, 4, 6}
We first sort the dataset in ascending order:
X_sorted = {1, 2, 3, 4, 5, 6, 7, 8}
Since the dataset has 8 observations, which is even, we take the average of the 4th and 5th observations to find the median:
median(X) = (4 + 5) / 2 = 4.5
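The odd and even cases above can be captured in a short helper function. This is a sketch of the textbook definition, not how pandas computes the median internally:

```python
def median(values):
    # Sort the observations in ascending order
    s = sorted(values)
    n = len(s)
    mid = n // 2
    if n % 2 == 1:
        # Odd n: the middle observation
        return s[mid]
    # Even n: average of the two middle observations
    return (s[mid - 1] + s[mid]) / 2

print(median([5, 1, 3, 7, 2, 8, 4]))     # odd case  -> 4
print(median([5, 1, 3, 7, 2, 8, 4, 6]))  # even case -> 4.5
```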
CODE
Code for handling missing values in a DataFrame using median imputation:
import pandas as pd
import numpy as np
# create a sample DataFrame
df = pd.DataFrame({'A': [1, 2, np.nan, 4, 5], 'B': [6, np.nan, 8, 9, 10]})
# print the original DataFrame
print('Original DataFrame:\n', df)
# perform median imputation on the DataFrame
df = df.fillna(df.median())
# print the imputed DataFrame
print('Imputed DataFrame:\n', df)
OUTPUT
Original DataFrame:
A B
0 1.0 6.0
1 2.0 NaN
2 NaN 8.0
3 4.0 9.0
4 5.0 10.0
Imputed DataFrame:
A B
0 1.0 6.0
1 2.0 8.0
2 3.0 8.0
3 4.0 9.0
4 5.0 10.0
Mode Imputation
Mode imputation is a technique used to fill in missing values in a dataset by replacing them with the most frequently occurring value in that column or feature.
It is commonly used for categorical or discrete variables and helps preserve the original distribution of the data.
CODE
Code for handling missing values in a DataFrame using mode imputation:
import pandas as pd
# Create a sample DataFrame with missing values
df = pd.DataFrame({'A': ['apple', 'banana', 'orange', None, 'banana'],
                   'B': ['red', None, 'yellow', 'yellow', 'green'],
                   'C': ['cat', 'dog', 'dog', 'cat', None]})
# Print the original DataFrame
print('Original DataFrame:\n', df)
# Perform mode imputation on the DataFrame
df_filled = df.apply(lambda x: x.fillna(x.mode()[0]))
# Print the imputed DataFrame
print('Imputed DataFrame:\n', df_filled)
NOTE: The mode() function returns the most frequent value in each column, and [0] extracts the first mode in case multiple modes exist.
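The tie-breaking behaviour is easy to see in isolation. pandas' Series.mode() returns all tied values in sorted order, so [0] picks the first of them:

```python
import pandas as pd

# 'cat' and 'dog' both appear twice, so this column has two modes
s = pd.Series(['cat', 'dog', 'dog', 'cat'])

print(list(s.mode()))   # ['cat', 'dog'] -- ties come back in sorted order
print(s.mode()[0])      # 'cat' -- the value fillna would use
```

This is exactly why column C in the example below (a tie between 'cat' and 'dog') is filled with 'cat'.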
OUTPUT
Original DataFrame:
A B C
0 apple red cat
1 banana None dog
2 orange yellow dog
3 None yellow cat
4 banana green None
Imputed DataFrame:
A B C
0 apple red cat
1 banana yellow dog
2 orange yellow dog
3 banana yellow cat
4 banana green cat
Random sampling imputation
Random sampling imputation is a method used to fill in missing values in a dataset by randomly selecting values from the observed values in the same variable.
It assumes that the observed values are representative of the underlying distribution and can be used to impute the missing values.
Steps to compute
Step 1: Let X be a random variable with n observations, including some missing values denoted as Xmiss.
e.g.
X = [1, 2, NaN, NaN, 5]
Step 2: Identify the set of observed values, Xobs, by excluding the missing values from X.
Xobs = [1, 2, 5]
Step 3: Let Xrand be a set of values drawn at random, with replacement, from the observed values Xobs.
NOTE: The number of sampled values should equal the count of missing values for the corresponding variable or feature.
Xrand = [2, 5] (one possible draw)
Step 4: Replace each xmiss ∈ Xmiss with the corresponding value from Xrand.
For the first missing value, suppose the value drawn from Xobs is 2.
Replace the first missing value with the selected value:
X = [1, 2, 2, NaN, 5]
For the second missing value, suppose the value drawn is 5.
Replace the second missing value with the selected value:
X = [1, 2, 2, 5, 5]
CODE
Code for handling missing values in a DataFrame using random sampling imputation:
import pandas as pd
import numpy as np
# Create a sample DataFrame with missing values
data = {'A': [1, 2, np.nan, np.nan, 5],
        'B': [6, np.nan, 8, 9, 10],
        'C': [11, 12, 13, np.nan, 15]}
df = pd.DataFrame(data)
# Define a function to perform random sampling imputation
def random_impute(column):
    # Boolean mask identifying the missing values in the column
    missing_indices = column.isnull()
    # Select the observed values using the inverse of the mask
    observed_values = column[~missing_indices]
    # Sample from the observed values with replacement; the number of
    # sampled values matches the count of missing values in the column
    imputed_values = np.random.choice(observed_values, size=missing_indices.sum(), replace=True)
    # Assign the sampled values to the missing positions
    column[missing_indices] = imputed_values
    return column
# Apply the function to each column in the DataFrame
for column in df.columns:
    df[column] = random_impute(df[column])
# Display the imputed DataFrame
print(df)
OUTPUT
Before random sampling (the original DataFrame, shown for reference):
A B C
0 1.0 6.0 11.0
1 2.0 NaN 12.0
2 NaN 8.0 13.0
3 NaN 9.0 NaN
4 5.0 10.0 15.0
After random sampling (one possible result, since the values are drawn at random):
A B C
0 1.0 6.0 11.0
1 2.0 8.0 12.0
2 1.0 8.0 13.0
3 1.0 9.0 13.0
4 5.0 10.0 15.0
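Because the imputed values are drawn at random, re-running the code above gives different numbers each time. For reproducible experiments you can seed the generator; this sketch uses NumPy's default_rng (the seed value 42 is arbitrary):

```python
import pandas as pd
import numpy as np

# Seeded generator: the same seed reproduces the same draws
rng = np.random.default_rng(seed=42)

df = pd.DataFrame({'A': [1, 2, np.nan, np.nan, 5],
                   'B': [6, np.nan, 8, 9, 10]})

for col in df.columns:
    missing = df[col].isnull()
    observed = df.loc[~missing, col]
    # Draw from the observed values, with replacement, one per missing cell
    df.loc[missing, col] = rng.choice(observed.to_numpy(),
                                      size=missing.sum(),
                                      replace=True)

print(df)
```

Using .loc for the assignment also avoids modifying a column Series in place, which can trigger pandas' chained-assignment warnings.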
When to use these imputations?
| Imputation Method | When to Use |
| --- | --- |
| Mean Imputation | When the missing data is missing completely at random (MCAR) or missing at random (MAR) and the variable has a roughly normal distribution. Quick and simple, but it underestimates the variance and can distort the distribution if the data is not normally distributed. |
| Median Imputation | When the data is not normally distributed or contains outliers; the median is more robust to outliers than the mean. Still assumes the data is MCAR or MAR. |
| Mode Imputation | When dealing with categorical or nominal data. Replaces missing values with the most frequent category in the variable. Assumes the data is MCAR or MAR. |
| Random Sampling Imputation | When the data is MCAR or MAR and you want to preserve the variability in the data. Randomly selects values from the observed data, retaining the original distribution and capturing the uncertainty of the imputed values. |
Conclusion
In this blog post, we explored the concept of missing values and the challenges they pose in data analysis. We discussed different types of missing values, including MCAR, MAR, and MNAR, which help us understand the patterns and make informed decisions about handling missing data.
We covered commonly used imputation methods: Mean, Median, Mode, and Random Sampling. Each method has its own strengths and limitations, and the right choice depends on the characteristics of the dataset and the nature of the missing data.
Mean imputation replaces missing values with the mean of the available data, Median imputation uses the median, Mode imputation uses the most frequent value, and Random Sampling imputation randomly selects values from the observed data.
It's crucial to consider the nature of the data, the missing data mechanism, and the research question when selecting an imputation method. Some methods may introduce bias or distort patterns, while others preserve data integrity.
Imputation methods play a crucial role in handling missing values and ensuring reliable data analysis. In the next installment, we will delve deeper into additional techniques for missing value imputation.
Thank you for reading, and I hope you enjoyed it! I look forward to continuing this exploration in the next installment.