
Handling Missing Data in Pandas

Important: Missing data is one of the most common issues in real-world datasets. Proper handling is crucial for accurate analysis.

Introduction to Missing Data

Missing data occurs when no data value is stored for a variable in an observation. It's a common problem in data analysis that can lead to biased estimates and reduced statistical power if not handled properly.

Types of Missing Data

MCAR: Missing Completely At Random
MAR: Missing At Random
MNAR: Missing Not At Random

Common Representations

NaN - Not a Number
None - Python null
NaT - Not a Time
Custom placeholders
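As a rough illustration of the representations listed above, the sketch below builds a small, made-up DataFrame (the column names and the -999 and "N/A" placeholders are assumptions chosen for the example) and shows that isna() recognises NaN, None and NaT but not custom placeholders until they are converted.

```python
import numpy as np
import pandas as pd

# Hypothetical raw data mixing several missing-value representations
raw = pd.DataFrame({
    "score": [91.5, np.nan, 78.0, -999],     # NaN plus a numeric sentinel
    "city": ["Pune", None, "N/A", "Delhi"],  # None plus a text placeholder
    "joined": pd.to_datetime(["2021-03-01", None, "2022-07-15", None]),  # NaT
})

# isna() counts NaN, None and NaT, but not the custom placeholders
before = raw.isna().sum()

# Convert the placeholders to real missing values, column by column
cleaned = raw.copy()
cleaned["score"] = cleaned["score"].replace(-999, np.nan)
cleaned["city"] = cleaned["city"].replace("N/A", np.nan)
after = cleaned.isna().sum()

print("Missing counts before conversion:")
print(before)
print("Missing counts after conversion:")
print(after)
```

When the placeholders are known before loading, the na_values argument of pd.read_csv can perform the same conversion at read time.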

Impact of Missing Data

Reduced sample size
Biased estimates
Reduced statistical power
Algorithm failures

Sample Dataset with Missing Values
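The examples in the sections that follow operate on a DataFrame named df with Name, Age, Salary, Experience and Department columns. The values below are purely illustrative assumptions, chosen so that every column has at least one gap; any dataset with the same column names should work.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; only the column names are taken from the tutorial
df = pd.DataFrame({
    "Name": ["Alice", "Bob", None, "David", "Eva", "Frank", "Grace", "Henry"],
    "Age": [25, np.nan, 35, 42, np.nan, 31, 29, 38],
    "Salary": [50000, 62000, np.nan, 81000, 58000, np.nan, 54000, 72000],
    "Experience": [2, 5, 10, np.nan, 4, 6, 3, 12],
    "Department": ["IT", "HR", "IT", None, "Finance", "IT", "HR", "Finance"],
})

print(df)
print("\nMissing values per column:")
print(df.isna().sum())
```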

Detecting Missing Values

Before handling missing data, you need to identify where and how much missing data exists in your dataset. Pandas provides several methods for detecting missing values.

Missing Data Detection Methods

import pandas as pd
import numpy as np

print("=== MISSING DATA DETECTION ===")

# Basic detection methods
print("\n1. isnull() - Detect missing values:")
print(df.isnull())

print("\n2. notnull() - Detect non-missing values:")
print(df.notnull())

print("\n3. Sum of missing values per column:")
missing_sum = df.isnull().sum()
print(missing_sum)

print("\n4. Percentage of missing values per column:")
missing_percent = (df.isnull().sum() / len(df)) * 100
print(missing_percent.round(2))

print("\n5. Check if any value is missing in each row:")
print(df.isnull().any(axis=1))

print("\n6. Check if entire row is missing:")
print(df.isnull().all(axis=1))

# Advanced detection
print("\n7. Columns with missing values:")
cols_with_missing = df.columns[df.isnull().any()].tolist()
print(cols_with_missing)

print("\n8. Rows with missing values:")
rows_with_missing = df[df.isnull().any(axis=1)]
print(f"Rows with missing values: {len(rows_with_missing)}")
print(rows_with_missing)

print("\n9. Pattern of missing values:")
missing_pattern = df.isnull().astype(int)
print("Missing pattern (1 = missing, 0 = present):")
print(missing_pattern)

# Specific column analysis
print("\n10. Analyze missing values in Age column:")
age_missing = df[df['Age'].isnull()]
print("Rows with missing Age:")
print(age_missing)

Detection Methods Summary:

Method	Description	Returns	Use Case
`.isnull()`	Detect missing values	Boolean DataFrame	General detection
`.notnull()`	Detect non-missing values	Boolean DataFrame	Finding complete data
`.isna()`	Alias for isnull()	Boolean DataFrame	Same as isnull()
`.notna()`	Alias for notnull()	Boolean DataFrame	Same as notnull()
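The detection calls above are often combined into a single report. The helper below is one possible sketch; the name missing_summary and the demo frame are hypothetical and not part of pandas.

```python
import numpy as np
import pandas as pd

def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Return per-column missing counts and percentages, worst first."""
    counts = frame.isna().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(frame) * 100).round(2),
    })
    return summary.sort_values("missing", ascending=False)

# Small hypothetical frame used only to exercise the helper
demo = pd.DataFrame({
    "Age": [25, np.nan, 35, np.nan],
    "Salary": [50000, 62000, np.nan, 81000],
    "Name": ["Alice", "Bob", "Cara", "David"],
})
report = missing_summary(demo)
print(report)
```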

Basic Handling Methods

There are two main approaches to handling missing data: deletion and imputation. The choice depends on the amount and pattern of missingness.

Basic Missing Data Handling

import pandas as pd
import numpy as np

print("=== HANDLING MISSING DATA ===")

# Method 1: Deletion
print("\n1. Deletion Methods:")

# Drop rows with any missing values
df_drop_any = df.dropna()
print(f"After dropping any NA: {df_drop_any.shape}")

# Drop rows where all values are missing
df_drop_all = df.dropna(how='all')
print(f"After dropping all NA: {df_drop_all.shape}")

# Drop columns with missing values
df_drop_cols = df.dropna(axis=1)
print(f"After dropping NA columns: {df_drop_cols.shape}")

# Drop rows with specific threshold
df_drop_thresh = df.dropna(thresh=4)  # Keep rows with at least 4 non-NA values
print(f"After threshold drop: {df_drop_thresh.shape}")

# Method 2: Imputation - Numerical data
print("\n2. Numerical Data Imputation:")

# Mean, median, and mode imputation (one strategy per column)
df_mean = df.copy()
df_mean['Age'] = df_mean['Age'].fillna(df_mean['Age'].mean())
df_mean['Salary'] = df_mean['Salary'].fillna(df_mean['Salary'].median())
df_mean['Experience'] = df_mean['Experience'].fillna(df_mean['Experience'].mode()[0])
print("After mean/median/mode imputation:")
print(df_mean[['Age', 'Salary', 'Experience']])

# Forward/backward fill
df_ffill = df.copy()
# fillna(method=...) is deprecated in recent pandas; use ffill()/bfill() directly
df_ffill['Age'] = df_ffill['Age'].ffill()
df_ffill['Salary'] = df_ffill['Salary'].bfill()
print("\nAfter forward/backward fill:")
print(df_ffill[['Age', 'Salary']])

# Method 3: Imputation - Categorical data
print("\n3. Categorical Data Imputation:")

df_cat = df.copy()
df_cat['Name'] = df_cat['Name'].fillna('Unknown')
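# Option A: fill with the most frequent category
df_cat['Department'] = df_cat['Department'].fillna(df_cat['Department'].mode()[0])
# Option B (alternative): use an explicit label instead
# df_cat['Department'] = df_cat['Department'].fillna('Not Specified')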
print("After categorical imputation:")
print(df_cat[['Name', 'Department']])

# Method 4: Advanced imputation
print("\n4. Advanced Imputation Techniques:")

# Interpolation for time series-like data
df_interpolate = df.copy()
df_interpolate['Age'] = df_interpolate['Age'].interpolate()
print("After interpolation:")
print(df_interpolate['Age'])

# Group-based imputation
df_group = df.copy()
df_group['Age'] = df_group.groupby('Department')['Age'].transform(
    lambda x: x.fillna(x.mean())
)
print("\nAfter group-based imputation:")
print(df_group[['Department', 'Age']])

When to Use Deletion

MCAR data
Small percentage missing (<5%)
Large dataset
Missingness in unimportant variables
Exploratory analysis

When to Use Imputation

MAR data
Larger percentage missing
Small dataset
Important variables
Final analysis

Advanced Imputation Techniques

For more sophisticated analysis, advanced imputation methods can provide better results by preserving relationships in the data.

Advanced Imputation Methods

import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer, KNNImputer

print("=== ADVANCED IMPUTATION TECHNIQUES ===")

# Create a more complex dataset for demonstration
complex_data = {
    'Feature1': [1, 2, np.nan, 4, 5, np.nan, 7, 8, 9, 10],
    'Feature2': [np.nan, 2, 3, 4, np.nan, 6, 7, 8, 9, 10],
    'Feature3': [1, np.nan, 3, 4, 5, 6, np.nan, 8, 9, 10],
    'Target': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
}

df_complex = pd.DataFrame(complex_data)
print("Complex dataset with missing values:")
print(df_complex)

# Method 1: scikit-learn SimpleImputer
print("\n1. scikit-learn SimpleImputer:")

# Mean strategy
imputer_mean = SimpleImputer(strategy='mean')
df_sklearn_mean = pd.DataFrame(
    imputer_mean.fit_transform(df_complex),
    columns=df_complex.columns
)
print("Mean imputation:")
print(df_sklearn_mean)

# Method 2: KNN Imputation
print("\n2. K-Nearest Neighbors Imputation:")

imputer_knn = KNNImputer(n_neighbors=2)
df_knn = pd.DataFrame(
    imputer_knn.fit_transform(df_complex),
    columns=df_complex.columns
)
print("KNN imputation:")
print(df_knn)

# Method 3: Multiple Imputation with fancyimpute (conceptual)
print("\n3. Multiple Imputation (Conceptual):")
# Note: fancyimpute would need to be installed separately

# Method 4: Custom imputation functions
print("\n4. Custom Imputation Functions:")

def custom_imputer(series):
    """Custom imputation based on data characteristics"""
    if series.dtype == 'object':
        return series.fillna('Missing')
    else:
        # Use median for strongly skewed data, mean otherwise
        if abs(series.skew()) > 1:  # Strong skew in either direction
            return series.fillna(series.median())
        else:  # Roughly symmetric distribution
            return series.fillna(series.mean())

df_custom = df.apply(custom_imputer)
print("Custom imputation:")
print(df_custom)

# Method 5: Predictive imputation (simplified)
print("\n5. Predictive Imputation Concept:")

# Placeholder only: no model is fitted here. Missing values are filled with
# the target mean of the complete rows; feature_cols is accepted but unused.
def predictive_impute(df, target_col, feature_cols):
    """Placeholder for predictive imputation (fills with complete-case mean)."""
    df_temp = df.dropna()
    if len(df_temp) > 1:
        # Mean of the target among rows with no missing values
        return df[target_col].fillna(df_temp[target_col].mean())
    return df[target_col]

df_predictive = df.copy()
df_predictive['Salary'] = predictive_impute(df_predictive, 'Salary', ['Age', 'Experience'])
print("Predictive imputation (simplified):")
print(df_predictive[['Age', 'Experience', 'Salary']])

Advanced Methods Comparison:

Method	Description	Pros	Cons
KNN Imputation	Uses similar records	Preserves relationships	Computationally expensive
Multiple Imputation	Creates multiple datasets	Accounts for uncertainty	Complex implementation
Predictive Models	Uses machine learning	Can be accurate when predictors are informative	Overfitting risk
Time Series Methods	Uses temporal patterns	Good for time data	Specific to time series

Missing Data Analysis

Understanding the pattern and mechanism of missing data is crucial for choosing the right handling method and interpreting results correctly.

Missing Data Pattern Analysis

import pandas as pd
import numpy as np

print("=== MISSING DATA ANALYSIS ===")

# Analyze missing data patterns
print("\n1. Missing Data Pattern Analysis:")

# Missing data matrix
missing_matrix = df.isnull()
print("Missing data matrix:")
print(missing_matrix)

# Correlation of missingness
print("\n2. Correlation of Missingness:")
# Convert booleans to 0/1 first; columns with no missing values show NaN
missing_corr = missing_matrix.astype(int).corr()
print("Correlation between missingness patterns:")
print(missing_corr)

# Missing data by groups
print("\n3. Missing Data by Department:")
if 'Department' in df.columns:
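    # Mean of the boolean mask per group = share of missing values
    # (rows whose Department is itself missing are excluded from the groups)
    missing_by_dept = df.isnull().groupby(df['Department']).mean() * 100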
    print("Missing percentage by department:")
    print(missing_by_dept.round(2))

# Statistical analysis of missing vs non-missing
print("\n4. Statistical Comparison:")

if 'Salary' in df.columns and 'Age' in df.columns:
    # Compare salaries for rows with/without missing age
    salary_with_missing_age = df[df['Age'].isnull()]['Salary'].mean()
    salary_without_missing_age = df[df['Age'].notnull()]['Salary'].mean()
    
    print(f"Average salary with missing age: {salary_with_missing_age:.2f}")
    print(f"Average salary without missing age: {salary_without_missing_age:.2f}")
    print(f"Difference: {abs(salary_with_missing_age - salary_without_missing_age):.2f}")

# Impact analysis
print("\n5. Impact of Missing Data Handling:")

salary_imputed = df['Salary'].fillna(df['Salary'].mean())

# Mean imputation leaves the mean unchanged but shrinks the spread
print(f"Mean before imputation: {df['Salary'].mean():.2f}")
print(f"Mean after mean imputation: {salary_imputed.mean():.2f}")
print(f"Std before imputation: {df['Salary'].std():.2f}")
print(f"Std after mean imputation: {salary_imputed.std():.2f}")

# Missing data mechanism analysis
print("\n6. Missing Data Mechanism:")

# Test for MCAR (Missing Completely At Random)
# If missingness is unrelated to any variable, it might be MCAR
print("Testing for MCAR pattern...")

# MAR (Missing At Random) test
# If missingness is related to other observed variables
if 'Age' in df.columns and 'Salary' in df.columns:
    age_missing_related = df[df['Salary'].isnull()]['Age'].mean()
    age_not_missing_related = df[df['Salary'].notnull()]['Age'].mean()
    print(f"Age when Salary missing: {age_missing_related:.2f}")
    print(f"Age when Salary not missing: {age_not_missing_related:.2f}")

MCAR

Missing Completely At Random

Missingness is unrelated to any variable

Deletion is unbiased (but shrinks the sample)
Simple imputation works
Least problematic

MAR

Missing At Random

Missingness related to observed data

Imputation recommended
Model-based methods work
Common in practice

MNAR

Missing Not At Random

Missingness related to unobserved data

Most problematic
Advanced methods needed
Sensitivity analysis crucial
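The practical difference between these mechanisms can be seen in a small simulation. The sketch below is only illustrative: the income variable, the 30% drop rate and the rule that values above 55 go missing more often are all assumptions made up for the demonstration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
income = pd.Series(rng.normal(loc=50, scale=10, size=10_000))
true_mean = income.mean()

# MCAR: every value has the same 30% chance of being missing
mcar = income.mask(rng.random(len(income)) < 0.30)

# MNAR: high values are more likely to be missing (depends on the value itself)
drop_prob = np.where(income > 55, 0.60, 0.10)
mnar = income.mask(rng.random(len(income)) < drop_prob)

print(f"True mean:          {true_mean:.2f}")
print(f"Observed mean MCAR: {mcar.mean():.2f}")
print(f"Observed mean MNAR: {mnar.mean():.2f}")
```

Under MCAR the observed mean should stay close to the true mean, while under this MNAR rule it is pulled downward because large values are under-represented among the observed data.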

Best Practices and Common Pitfalls

Following best practices ensures that missing data handling doesn't introduce bias or distort your analysis results.

Best Practices Framework

import pandas as pd
import numpy as np

print("=== BEST PRACTICES AND PITFALLS ===")

# Common pitfalls and how to avoid them
print("\n1. Common Pitfalls in Missing Data Handling:")

# Pitfall 1: Ignoring missing data
print("\nPitfall 1: Ignoring missing data")
print("Many algorithms will fail or produce biased results with missing values")

# Pitfall 2: Always using deletion
print("\nPitfall 2: Always using deletion")
print("Deletion can lead to biased results and loss of information")

# Pitfall 3: Simple mean imputation for all cases
print("\nPitfall 3: Simple mean imputation for all cases")
print("Can underestimate variance and distort relationships")

# Best practices
print("\n2. Best Practices:")

# Practice 1: Understand the mechanism
print("\nPractice 1: Understand missing data mechanism")
print("- MCAR: Missing Completely At Random")
print("- MAR: Missing At Random") 
print("- MNAR: Missing Not At Random")

# Practice 2: Explore patterns
print("\nPractice 2: Explore missing data patterns")
print("Use visualization and statistical tests")

# Practice 3: Choose appropriate method
print("\nPractice 3: Choose appropriate imputation method")
print("Based on data type, amount, and mechanism")

# Practice 4: Sensitivity analysis
print("\nPractice 4: Perform sensitivity analysis")
print("Compare different imputation methods")

# Practical implementation
print("\n3. Practical Implementation Framework:")

def handle_missing_data(df, strategy='auto', max_missing=0.5):
    """Comprehensive missing data handling framework"""
    
    df_clean = df.copy()
    
    # Step 1: Analyze missingness
    missing_percent = df_clean.isnull().sum() / len(df_clean)
    
    # Step 2: Remove columns with too much missing data
    cols_to_drop = missing_percent[missing_percent > max_missing].index
    df_clean = df_clean.drop(columns=cols_to_drop)
    print(f"Dropped columns: {list(cols_to_drop)}")
    
    # Step 3: Handle remaining missing values
    for col in df_clean.columns:
        if df_clean[col].isnull().any():
            if df_clean[col].dtype == 'object':
                # Categorical data
                if strategy == 'mode' or df_clean[col].isnull().mean() < 0.05:
                    df_clean[col] = df_clean[col].fillna(df_clean[col].mode()[0])
                else:
                    df_clean[col] = df_clean[col].fillna('Missing')
            else:
                # Numerical data
                if strategy == 'median' or abs(df_clean[col].skew()) > 1:
                    df_clean[col] = df_clean[col].fillna(df_clean[col].median())
                else:
                    df_clean[col] = df_clean[col].fillna(df_clean[col].mean())
    
    return df_clean

# Test the framework
print("\nTesting comprehensive missing data handler:")
df_handled = handle_missing_data(df)
print("After handling:")
print(df_handled)
print(f"Missing values remaining: {df_handled.isnull().sum().sum()}")

Common Pitfalls to Avoid

Ignoring missing data entirely
Always deleting missing cases
Using mean imputation blindly
Not documenting handling methods
Assuming MCAR without testing
Ignoring the impact on variance

Recommended Workflow

Explore missing data patterns
Determine missingness mechanism
Choose appropriate method
Implement handling strategy
Validate results
Document process
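The "Validate results" step can be approached as a small sensitivity analysis: apply several handling strategies to the same column and compare summary statistics. The sketch below uses a made-up salary column; the strategy names are just labels chosen for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical salary column with two missing entries
salary = pd.Series([50000, 62000, np.nan, 81000, 58000, np.nan, 54000, 72000])

# Candidate handling strategies applied to the same data
candidates = {
    "drop": salary.dropna(),
    "mean": salary.fillna(salary.mean()),
    "median": salary.fillna(salary.median()),
}

# One row per strategy: sample size, mean and standard deviation
comparison = pd.DataFrame({
    name: {"n": len(s), "mean": s.mean(), "std": s.std()}
    for name, s in candidates.items()
}).T

print(comparison.round(2))
```

If conclusions change noticeably between rows of this table, the choice of method matters for the analysis and should be documented.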

Quick Reference Guide

Basic Operations:

# Detection
df.isnull().sum()        # Count missing per column
df[df.col.isnull()]      # Rows with missing values

# Deletion
df.dropna()              # Drop rows with any NA
df.dropna(axis=1)        # Drop columns with any NA
df.dropna(thresh=4)      # Keep rows with 4+ non-NA

# Imputation
df.fillna(value)         # Fill with specific value
df.ffill()               # Forward fill (fillna(method=...) is deprecated)
df.fillna(df.mean(numeric_only=True))  # Fill numeric columns with their mean

Advanced Operations:

# Group-based imputation
df.groupby('group')['col'].transform(
    lambda x: x.fillna(x.mean())
)

# Conditional imputation
df['col'] = np.where(
    df.col.isnull(), 
    df.other_col, 
    df.col
)

# Multiple columns
df.fillna({'col1': value1, 'col2': value2})

# Interpolation
df.interpolate()         # Linear interpolation
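For data indexed by timestamps, interpolate() can also weight by elapsed time. The sketch below, using made-up readings on irregularly spaced dates, contrasts the default linear method (which treats points as equally spaced) with method="time".

```python
import numpy as np
import pandas as pd

# Hypothetical readings on irregularly spaced dates
readings = pd.Series(
    [1.0, np.nan, 4.0],
    index=pd.to_datetime(["2024-01-01", "2024-01-02", "2024-01-04"]),
)

linear = readings.interpolate()                # treats points as equally spaced
by_time = readings.interpolate(method="time")  # weights by elapsed time

print("Default linear interpolation:")
print(linear)
print("Time-weighted interpolation:")
print(by_time)
```

The missing reading is one day into a three-day gap, so the time-weighted estimate sits closer to the first value than the default midpoint does.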

Next: After handling missing data, we'll learn about data transformation and manipulation techniques to prepare data for analysis.
