Pandas Performance Optimization

Performance optimization is crucial when working with large datasets in pandas. This guide covers techniques to make your pandas code faster and more memory-efficient.

1. Memory Usage Optimization

Reduce memory footprint by optimizing data types.

Downcast numerical types: Use the smallest integer/float type that holds the column's values at the precision you need
Use categories: Convert low-cardinality strings to category dtype
Sparse data: Use sparse arrays for data with many zeros
Memory profiling: Monitor usage with memory_usage(deep=True), which also counts the contents of string columns

Data Type	Memory Usage	Optimization
`int64`	8 bytes	Downcast to `int8`/`int16`/`int32`
`float64`	8 bytes	Downcast to `float32`
`object` (strings)	Variable	Convert to `category` if low cardinality
`bool`	1 byte	Already compact; the nullable `boolean` dtype adds a mask byte per value
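
The table above can be tried out with a small sketch. The column names and the generated data are made up for illustration; the actual savings depend on your value ranges and on how many distinct strings a column holds.

```python
import numpy as np
import pandas as pd

# Illustrative data; column names are hypothetical.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "count": rng.integers(0, 100, size=100_000),                     # int64
    "price": rng.random(100_000) * 100,                              # float64
    "level": rng.choice(["low", "medium", "high"], size=100_000),    # strings
})

before = df.memory_usage(deep=True).sum()

small = df.copy()
small["count"] = pd.to_numeric(small["count"], downcast="integer")  # values 0..99 fit in int8
small["price"] = pd.to_numeric(small["price"], downcast="float")    # trades precision for size
small["level"] = small["level"].astype("category")                  # 3 distinct values

after = small.memory_usage(deep=True).sum()
print(small.dtypes)
print(f"{before / 1024**2:.2f} MB -> {after / 1024**2:.2f} MB")
```

Downcasting floats reduces precision, so it is worth checking that downstream calculations tolerate it.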

2. Vectorized Operations

Avoid loops and use vectorized operations for better performance.

# SLOW: Using iterrows
for index, row in df.iterrows():
    df.loc[index, 'new_col'] = row['col1'] * 2

# FAST: Vectorized operation
df['new_col'] = df['col1'] * 2

# SLOW: Using apply with lambda
df['new_col'] = df['col1'].apply(lambda x: x * 2)

# FAST: Built-in methods
df['new_col'] = df['col1'] * 2

3. Efficient Data Filtering

Choose the right method for filtering data.

Boolean indexing: Fastest for simple conditions
query() method: Good for complex expressions
loc[] accessor: Versatile but can be slower
Avoid: iterrows(), itertuples() for filtering
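
As a rough sketch of the three filtering styles listed above (the DataFrame and its column names are assumptions made for the example), all of them select the same rows:

```python
import numpy as np
import pandas as pd

# Illustrative data; column names are hypothetical.
rng = np.random.default_rng(1)
df = pd.DataFrame({
    "value1": rng.standard_normal(1_000),
    "value2": rng.uniform(0, 100, 1_000),
})

mask = (df["value1"] > 0) & (df["value2"] < 50)

by_mask = df[mask]                                  # boolean indexing
by_query = df.query("value1 > 0 and value2 < 50")   # query() with an expression string
by_loc = df.loc[mask, ["value2"]]                   # loc[] filters rows and picks columns at once

print(len(by_mask), len(by_query), len(by_loc))
```

Which one is fastest depends on the data size and the expression, so time them on your own data before settling on one.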

Performance Optimization Techniques

🚀 Speed Optimization

Vectorization: Use built-in operations
NumPy integration: Leverage NumPy functions
Method chaining: Reduce intermediate variables
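Avoid copies: Don't rely on inplace=True for this; most inplace operations still build a copy internally, so plain reassignment is usually just as cheap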
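
A minimal sketch of how vectorization, NumPy functions, and method chaining can combine in one pipeline. The `price` and `qty` columns are invented for the example.

```python
import numpy as np
import pandas as pd

# Illustrative data; column names are hypothetical.
df = pd.DataFrame({"price": [10.0, 25.0, 40.0, 5.0], "qty": [3, 0, 2, 8]})

summary = (
    df
    .assign(revenue=lambda d: d["price"] * d["qty"])        # vectorized arithmetic
    .assign(log_revenue=lambda d: np.log1p(d["revenue"]))   # NumPy ufunc applied to a Series
    .query("qty > 0")                                       # drop rows with no sales
    .sort_values("revenue", ascending=False)
    .reset_index(drop=True)
)
print(summary)
```

Each step returns a new DataFrame, so the original `df` is left unchanged and no intermediate variables are needed.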

💾 Memory Optimization

Data type optimization: Use appropriate dtypes
Chunk processing: Process large files in chunks
Garbage collection: Use del and gc.collect()
Sparse data: Use sparse arrays

🔧 Advanced Techniques

Dask integration: For out-of-core computation
Numba JIT: Just-in-time compilation
Caching: Memoize expensive operations
Parallel processing: Use multiprocessing

📊 Monitoring Tools

Memory profiler: Track memory usage
Line profiler: Line-by-line timing
cProfile: Function-level profiling
IPython/Jupyter magics: %timeit, %%time
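
Of the tools above, cProfile ships with Python, so it is the easiest to try. The sketch below profiles a hypothetical `summarize` function (not part of pandas) and prints the five entries with the highest cumulative time.

```python
import cProfile
import io
import pstats

import numpy as np
import pandas as pd

def summarize(df):
    """Hypothetical workload to profile."""
    return df.groupby("key")["value"].agg(["mean", "sum"])

rng = np.random.default_rng(3)
df = pd.DataFrame({
    "key": rng.integers(0, 10, 50_000),
    "value": rng.standard_normal(50_000),
})

profiler = cProfile.Profile()
profiler.enable()
result = summarize(df)
profiler.disable()

# Write the report to a string buffer, sorted by cumulative time.
buffer = io.StringIO()
pstats.Stats(profiler, stream=buffer).sort_stats("cumulative").print_stats(5)
print(buffer.getvalue())
```

In a notebook, `%timeit summarize(df)` gives a quicker answer when only the total run time matters.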

Common Performance Pitfalls

❌ Anti-Patterns

Using iterrows() for vectorizable operations
Chained indexing: df[df['a'] > 1]['b'] (use df.loc[df['a'] > 1, 'b'] instead)
Repeated append() in loops (DataFrame.append() was removed in pandas 2.0)
Not specifying dtypes when reading CSV
Using apply() when vectorized methods exist
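
Two of these anti-patterns have short fixes, sketched here on a toy DataFrame invented for the example.

```python
import pandas as pd

df = pd.DataFrame({"a": [0, 2, 3], "b": [10, 20, 30]})

# Anti-pattern (chained indexing): the assignment may land on a temporary copy.
# df[df["a"] > 1]["b"] = 0

# Fix: one .loc call selects the rows and the column together.
df.loc[df["a"] > 1, "b"] = 0
print(df)

# Anti-pattern: growing a DataFrame row by row inside a loop.
# Fix: collect the pieces in a list and concatenate once at the end.
pieces = [pd.DataFrame({"a": [i], "b": [i * 10]}) for i in range(3)]
combined = pd.concat(pieces, ignore_index=True)
print(combined)
```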

✅ Best Practices

Use vectorized operations whenever possible
Specify dtypes when reading data
Use concat() on a list of DataFrames instead of append()
Process large files in chunks
Use appropriate data structures
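
As a small sketch of specifying dtypes and reading in chunks, the example below reads an in-memory CSV. The column names and contents are made up; with a real file you would pass its path instead of the `StringIO` object.

```python
import io

import pandas as pd

# Stand-in for a CSV file on disk; contents are illustrative.
csv_text = "user_id,category,amount\n1,A,1.5\n2,B,2.5\n3,A,3.5\n"

# Declaring dtypes up front avoids type inference and oversized defaults.
df = pd.read_csv(
    io.StringIO(csv_text),
    dtype={"user_id": "int32", "category": "category", "amount": "float32"},
)
print(df.dtypes)

# Chunked reading: aggregate per chunk instead of holding the whole file in memory.
total = 0.0
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2, dtype={"amount": "float64"}):
    total += chunk["amount"].sum()
print(total)
```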

Example: Comprehensive Performance Optimization

Performance Optimization Examples

import pandas as pd
import numpy as np
import time
from memory_profiler import memory_usage  # third-party: pip install memory-profiler

# Set display options for better output
pd.set_option('display.max_columns', None)
pd.set_option('display.width', 1000)

print("="*70)
print("PANDAS PERFORMANCE OPTIMIZATION GUIDE")
print("="*70)

# 1. MEMORY USAGE OPTIMIZATION
print("\n1. MEMORY USAGE OPTIMIZATION")
print("-" * 50)

# Create a large dataset to demonstrate memory optimization
def create_large_dataset(n_rows=1000000):
    """Create a large dataset for performance testing"""
    np.random.seed(42)
    
    data = {
        'user_id': np.random.randint(1, 10000, n_rows),
        'category': np.random.choice(['A', 'B', 'C', 'D', 'E'], n_rows),
        'value1': np.random.randn(n_rows),
        'value2': np.random.uniform(0, 100, n_rows),
        'value3': np.random.exponential(2, n_rows),
        'timestamp': pd.date_range('2023-01-01', periods=n_rows, freq='min'),  # 'T' alias is deprecated
        'description': np.random.choice(['low', 'medium', 'high'], n_rows)
    }
    return pd.DataFrame(data)

# Create dataset
df_large = create_large_dataset(1000000)
print(f"Original dataset shape: {df_large.shape}")
print(f"Original memory usage: {df_large.memory_usage(deep=True).sum() / 1024**2:.2f} MB")

# Check data types
print("\nOriginal data types:")
print(df_large.dtypes)

# Optimize data types
def optimize_dtypes(df):
    """Optimize data types to reduce memory usage"""
    df_opt = df.copy()
    
    # Convert object columns to category if cardinality is low
    for col in df_opt.select_dtypes(include=['object']):
        if df_opt[col].nunique() / len(df_opt) < 0.5:  # If unique values < 50%
            df_opt[col] = df_opt[col].astype('category')
    
    # Downcast numerical columns ('integer'/'floating' match every width on every platform)
    for col in df_opt.select_dtypes(include=['integer']):
        df_opt[col] = pd.to_numeric(df_opt[col], downcast='integer')
    
    # Note: float32 keeps roughly 7 significant digits
    for col in df_opt.select_dtypes(include=['floating']):
        df_opt[col] = pd.to_numeric(df_opt[col], downcast='float')
    
    return df_opt

df_optimized = optimize_dtypes(df_large)
print(f"\nOptimized memory usage: {df_optimized.memory_usage(deep=True).sum() / 1024**2:.2f} MB")
print("\nOptimized data types:")
print(df_optimized.dtypes)

# Memory savings
original_memory = df_large.memory_usage(deep=True).sum()
optimized_memory = df_optimized.memory_usage(deep=True).sum()
savings = (original_memory - optimized_memory) / original_memory * 100
print(f"\nMemory savings: {savings:.1f}%")

# 2. VECTORIZED OPERATIONS VS ITERATION
print("\n\n2. VECTORIZED OPERATIONS VS ITERATION")
print("-" * 50)

# Create sample data
df_sample = df_optimized.head(10000).copy()

# Method 1: Iteration with iterrows (SLOW)
def slow_calculation(df):
    """Slow method using iterrows"""
    start_time = time.perf_counter()  # finer resolution than time.time()
    results = []
    for index, row in df.iterrows():
        if row['value1'] > 0:
            results.append(row['value2'] * 2)
        else:
            results.append(row['value2'] * 0.5)
    df['slow_result'] = results
    return time.perf_counter() - start_time

# Method 2: Vectorized operations (FAST)
def fast_calculation(df):
    """Fast method using vectorization"""
    start_time = time.perf_counter()  # avoids a 0.0 result that would break the ratio below
    df['fast_result'] = np.where(df['value1'] > 0, df['value2'] * 2, df['value2'] * 0.5)
    return time.perf_counter() - start_time

# Compare performance
time_slow = slow_calculation(df_sample)
time_fast = fast_calculation(df_sample)

print(f"Iterrows method: {time_slow:.4f} seconds")
print(f"Vectorized method: {time_fast:.4f} seconds")
print(f"Speed improvement: {time_slow / time_fast:.1f}x faster")

# 3. EFFICIENT DATA FILTERING
print("\n\n3. EFFICIENT DATA FILTERING")
print("-" * 50)

# Create a large dataset for filtering tests
df_filter = df_optimized.copy()

# Method comparison for filtering
def time_filtering_operations():
    times = {}
    
    # Boolean indexing
    start = time.time()
    filtered1 = df_filter[(df_filter['value1'] > 0) & (df_filter['value2'] < 50)]
    times['Boolean Indexing'] = time.time() - start
    
    # Query method
    start = time.time()
    filtered2 = df_filter.query('value1 > 0 and value2 < 50')
    times['Query Method'] = time.time() - start
    
    # loc with conditions
    start = time.time()
    filtered3 = df_filter.loc[(df_filter['value1'] > 0) & (df_filter['value2'] < 50)]
    times['Loc Method'] = time.time() - start
    
    return times

filter_times = time_filtering_operations()
print("Filtering method performance:")
for method, t in filter_times.items():
    print(f"{method}: {t:.4f} seconds")

# 4. GROUPBY OPTIMIZATION
print("\n\n4. GROUPBY OPTIMIZATION")
print("-" * 50)

def time_groupby_operations():
    times = {}
    
    # Basic groupby (observed=True skips unused categories of a categorical key)
    start = time.time()
    result1 = df_optimized.groupby('category', observed=True)['value1'].mean()
    times['Basic GroupBy'] = time.time() - start
    
    # Optimized groupby with multiple aggregations
    start = time.time()
    result2 = df_optimized.groupby('category', observed=True).agg({
        'value1': ['mean', 'std', 'count'],
        'value2': ['min', 'max', 'mean']
    })
    times['Multi-Agg GroupBy'] = time.time() - start
    
    # Groupby with transform
    start = time.time()
    result3 = df_optimized.groupby('category', observed=True)['value1'].transform('mean')
    times['GroupBy Transform'] = time.time() - start
    
    return times

groupby_times = time_groupby_operations()
print("GroupBy operation performance:")
for operation, t in groupby_times.items():
    print(f"{operation}: {t:.4f} seconds")

# 5. DATAFRAME CONCATENATION VS APPEND
print("\n\n5. DATAFRAME CONCATENATION OPTIMIZATION")
print("-" * 50)

def time_concatenation():
    times = {}
    
    # Split data into chunks
    chunks = [df_optimized.iloc[i:i+10000] for i in range(0, len(df_optimized), 10000)]
    
    # Method 1: Growing a DataFrame inside a loop (inefficient)
    # DataFrame.append() was removed in pandas 2.0; concat in a loop has the same cost pattern
    start = time.time()
    result_append = chunks[0]
    for chunk in chunks[1:10]:  # Use first 10 chunks to save time
        result_append = pd.concat([result_append, chunk], ignore_index=True)
    times['Append-Style Loop'] = time.time() - start
    
    # Method 2: Using a single concat call (efficient)
    start = time.time()
    result_concat = pd.concat(chunks[:10], ignore_index=True)
    times['Concat Method'] = time.time() - start
    
    # Method 3: concat on a list comprehension (same work as Method 2, shown for comparison)
    start = time.time()
    result_list = pd.concat([chunk for chunk in chunks[:10]], ignore_index=True)
    times['List Concat Method'] = time.time() - start
    
    return times

concat_times = time_concatenation()
print("Concatenation method performance:")
for method, t in concat_times.items():
    print(f"{method}: {t:.4f} seconds")

# 6. DATATYPE-SPECIFIC OPTIMIZATIONS
print("\n\n6. DATATYPE-SPECIFIC OPTIMIZATIONS")
print("-" * 50)

# String operations optimization
def time_string_operations():
    times = {}
    
    # Create string data
    df_strings = pd.DataFrame({
        'text': ['hello world'] * 10000 + ['foo bar'] * 10000
    })
    
    # Method 1: Using apply with lambda
    start = time.time()
    df_strings['upper_apply'] = df_strings['text'].apply(lambda x: x.upper())
    times['Apply Lambda'] = time.time() - start
    
    # Method 2: Using vectorized string operations
    start = time.time()
    df_strings['upper_vectorized'] = df_strings['text'].str.upper()
    times['Vectorized String'] = time.time() - start
    
    return times

string_times = time_string_operations()
print("String operation performance:")
for method, t in string_times.items():
    print(f"{method}: {t:.4f} seconds")

# 7. MEMORY-EFFICIENT CHUNK PROCESSING
print("\n\n7. CHUNK PROCESSING FOR LARGE FILES")
print("-" * 50)

def process_large_file_chunked(file_path, chunk_size=10000):
    """Process large files in chunks to avoid memory issues"""
    results = []
    
    # Read file in chunks
    chunk_reader = pd.read_csv(file_path, chunksize=chunk_size)
    
    for i, chunk in enumerate(chunk_reader):
        # Process each chunk
        chunk_processed = chunk.copy()
        chunk_processed['processed_value'] = chunk_processed.iloc[:, 0] * 2  # Example processing
        
        results.append(chunk_processed)
        
        # Print progress
        if (i + 1) % 10 == 0:
            print(f"Processed {((i + 1) * chunk_size):,} rows...")
    
    # Combine results
    final_result = pd.concat(results, ignore_index=True)
    return final_result

print("Chunk processing function defined. Use for large CSV files.")

# 8. PERFORMANCE MONITORING TOOLS
print("\n\n8. PERFORMANCE MONITORING")
print("-" * 50)

def monitor_performance(func, *args, **kwargs):
    """Monitor function performance"""
    
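    # Memory usage of the whole process, sampled in MiB while func runs
    # (memory_profiler calls func here, so func executes twice overall)
    mem_usage = memory_usage((func, args, kwargs))
    max_memory = max(mem_usage)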
    
    # Execution time
    start_time = time.time()
    result = func(*args, **kwargs)
    execution_time = time.time() - start_time
    
    return result, execution_time, max_memory

# Example usage
def example_heavy_operation(df):
    """Example function to monitor"""
    return df.groupby('category', observed=True).agg({'value1': 'mean', 'value2': 'sum'})

result, exec_time, max_mem = monitor_performance(example_heavy_operation, df_optimized)
print(f"Operation completed in {exec_time:.2f} seconds")
print(f"Maximum memory usage: {max_mem:.2f} MB")

# 9. NUMPY INTEGRATION FOR PERFORMANCE
print("\n\n9. NUMPY INTEGRATION")
print("-" * 50)

def numpy_vs_pandas_operations():
    """Compare NumPy vs Pandas operations"""
    times = {}
    
    # Large array
    large_array = np.random.randn(1000000)
    large_series = pd.Series(large_array)
    
    # Pandas operation
    start = time.time()
    pandas_result = large_series * 2 + 1
    times['Pandas Operation'] = time.time() - start
    
    # NumPy operation
    start = time.time()
    numpy_result = large_array * 2 + 1
    times['NumPy Operation'] = time.time() - start
    
    # Convert to NumPy, operate, convert back
    start = time.time()
    numpy_optimized = pd.Series((large_series.to_numpy() * 2 + 1))
    times['Pandas-NumPy Hybrid'] = time.time() - start
    
    return times

numpy_times = numpy_vs_pandas_operations()
print("NumPy vs Pandas performance:")
for operation, t in numpy_times.items():
    print(f"{operation}: {t:.4f} seconds")

# 10. CACHING FOR REPEATED OPERATIONS
print("\n\n10. CACHING STRATEGIES")
print("-" * 50)

from functools import lru_cache

@lru_cache(maxsize=128)
def expensive_calculation(x, y):
    """Expensive calculation that benefits from caching"""
    time.sleep(0.001)  # Simulate expensive operation
    return x * y + x**2 - y**2

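def demonstrate_caching():
    """Demonstrate caching benefits"""
    
    # Without caching: call the undecorated function that lru_cache exposes as __wrapped__
    uncached = expensive_calculation.__wrapped__
    start = time.perf_counter()
    for i in range(1000):
        result = uncached(i % 10, i % 5)  # Repeated patterns
    time_no_cache = time.perf_counter() - start
    
    # With caching: start from an empty cache, so only the first call per argument pair is slow
    expensive_calculation.cache_clear()
    start = time.perf_counter()
    for i in range(1000):
        result = expensive_calculation(i % 10, i % 5)
    time_with_cache = time.perf_counter() - start
    
    return time_no_cache, time_with_cache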

time_no_cache, time_with_cache = demonstrate_caching()
print(f"Without caching: {time_no_cache:.2f} seconds")
print(f"With caching: {time_with_cache:.2f} seconds")
print(f"Caching improvement: {time_no_cache / time_with_cache:.1f}x faster")

print("\n" + "="*70)
print("PERFORMANCE OPTIMIZATION SUMMARY")
print("="*70)
print("Key takeaways:")
print("1. Use vectorized operations instead of iteration")
print("2. Optimize data types to reduce memory usage")
print("3. Prefer concat() over append() for combining DataFrames")
print("4. Use chunk processing for large files")
print("5. Leverage NumPy for numerical operations")
print("6. Monitor performance with profiling tools")
print("7. Implement caching for repeated calculations")

Optimization Techniques

astype() - Data type conversion
to_numeric() - Numerical optimization
category dtype - String optimization
Vectorized operations - Speed improvement
Chunk processing - Memory management

Performance Tools

memory_usage() - Memory profiling
%timeit - Timing magic command
line_profiler - Line-by-line profiling
memory_profiler - Memory usage tracking
cProfile - Comprehensive profiling

Important: Always profile before optimizing. Use the right tool for the job - sometimes pandas might not be the best choice for extremely large datasets (consider Dask or PySpark).

Pro Tip: Use pd.api.types.is_string_dtype() and pd.api.types.is_numeric_dtype() to check data types before optimization.
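
A short sketch of the dtype checks mentioned in the tip, used here to sort columns into groups before optimizing them. The DataFrame is a toy example.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype, is_string_dtype

# Toy data; column names are hypothetical.
df = pd.DataFrame({
    "n": [1, 2, 3],
    "x": [0.5, 1.5, 2.5],
    "s": ["a", "b", "a"],
})

# Candidates for downcasting (note: bool columns also count as numeric here)
numeric_cols = [c for c in df.columns if is_numeric_dtype(df[c])]
# Candidates for conversion to category
string_cols = [c for c in df.columns if is_string_dtype(df[c])]

print(numeric_cols, string_cols)
```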

Quick Performance Checklist

# ✅ Performance Checklist:
# 1. Use vectorized operations instead of loops
# 2. Optimize data types (downcast numbers, use categories)
# 3. Use concat() instead of append() for multiple DataFrames
# 4. Process large files in chunks with chunksize parameter
# 5. Use query() for complex filtering conditions
# 6. Leverage NumPy for mathematical operations
# 7. Use method chaining to avoid intermediate variables
# 8. Monitor memory usage with memory_usage(deep=True)
# 9. Avoid unnecessary copies; inplace=True rarely saves memory, so prefer reassignment
# 10. Consider alternative libraries (Dask, Polars) for huge datasets
