Pandas Performance Optimization
Performance optimization is crucial when working with large datasets in pandas. This guide covers techniques to make your pandas code faster and more memory-efficient.
1. Memory Usage Optimization
Reduce memory footprint by optimizing data types.
- Downcast numerical types: Use smallest possible integer/float types
- Use categories: Convert low-cardinality strings to category dtype
- Sparse data: Use sparse arrays for data with many zeros
- Memory profiling: Monitor usage with memory_usage()
| Data Type | Memory Usage (per value) | Optimization |
|---|---|---|
| int64 | 8 bytes | Downcast to int8/int16/int32 |
| float64 | 8 bytes | Downcast to float32 |
| object (strings) | Variable | Convert to category if low cardinality |
| bool | 1 byte | Already compact; store True/False flags as bool rather than object |
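The dtype optimizations in the table above can be sketched as follows. This is a minimal illustration on made-up data; the column names (user_id, price, country, clicks) are hypothetical, and the actual savings depend on your data. Note that float32 trades precision for memory.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; column names are illustrative only.
rng = np.random.default_rng(0)
n = 100_000
df = pd.DataFrame({
    "user_id": rng.integers(0, 30_000, size=n),           # integer column
    "price": rng.random(n) * 100,                          # float64 column
    "country": rng.choice(["US", "DE", "JP", "BR"], n),    # low-cardinality strings
    "clicks": rng.choice([0, 0, 0, 0, 5], n),              # mostly zeros
})

before = df.memory_usage(deep=True).sum()

optimized = df.copy()
# Downcast to the smallest integer type that holds every value.
optimized["user_id"] = pd.to_numeric(optimized["user_id"], downcast="integer")
# float32 halves the footprint at the cost of precision.
optimized["price"] = optimized["price"].astype("float32")
# Low-cardinality strings become integer codes plus a small lookup table.
optimized["country"] = optimized["country"].astype("category")
# Sparse storage keeps only the values that differ from fill_value.
optimized["clicks"] = optimized["clicks"].astype(pd.SparseDtype("int64", fill_value=0))

after = optimized.memory_usage(deep=True).sum()
print(f"before: {before / 1e6:.2f} MB, after: {after / 1e6:.2f} MB")
print(optimized.dtypes)
```

Passing deep=True matters for string columns: without it, memory_usage() counts only the pointers, not the strings themselves.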
2. Vectorized Operations
Avoid loops and use vectorized operations for better performance.
# SLOW: Using iterrows
for index, row in df.iterrows():
    df.loc[index, 'new_col'] = row['col1'] * 2
# FAST: Vectorized operation
df['new_col'] = df['col1'] * 2
# SLOW: Using apply with lambda
df['new_col'] = df['col1'].apply(lambda x: x * 2)
# FAST: Built-in methods
df['new_col'] = df['col1'] * 2
3. Efficient Data Filtering
Choose the right method for filtering data.
- Boolean indexing: Fastest for simple conditions
- query() method: Good for complex expressions
- loc[] accessor: Versatile but can be slower
- Avoid: iterrows(), itertuples() for filtering
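The three filtering styles listed above might look like this on a small frame. The data and column names are invented for illustration; which style is fastest on real data is worth measuring rather than assuming.

```python
import pandas as pd

# Hypothetical data for illustration.
df = pd.DataFrame({
    "a": [1, 5, 3, 8, 2],
    "b": [10, 20, 30, 40, 50],
    "c": ["x", "y", "x", "y", "x"],
})

# Boolean indexing: build a mask, then select rows with it.
# Each condition needs parentheses because & binds tighter than > and ==.
mask = (df["a"] > 2) & (df["c"] == "x")
by_mask = df[mask]

# query(): the same condition written as a string expression.
by_query = df.query("a > 2 and c == 'x'")

# loc[]: filter rows and pick columns in a single step.
by_loc = df.loc[mask, ["a", "b"]]

print(by_mask)
print(by_loc)
```

All three select the same rows here; loc[] additionally narrows the columns, which avoids a second indexing step.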
Performance Optimization Techniques
🚀 Speed Optimization
- Vectorization: Use built-in operations
- NumPy integration: Leverage NumPy functions
- Method chaining: Reduce intermediate variables
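- Avoid unnecessary copies: Do not rely on inplace=True for this, since many operations still copy data internally; reassigning the result is usually just as efficient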
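As a small sketch of the NumPy integration and method chaining points above: np.where replaces a row-by-row if/else, and a chain of assign, query and sort_values avoids naming each intermediate frame. The columns col1, col2, sign and scaled are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical data for illustration.
df = pd.DataFrame({"col1": [1, -2, 3, -4], "col2": [10.0, 20.0, 30.0, 40.0]})

# NumPy integration: a vectorized if/else over the whole column.
df["sign"] = np.where(df["col1"] >= 0, "pos", "neg")

# Method chaining: each step returns a new frame, so no
# intermediate variables are needed.
result = (
    df
    .assign(scaled=lambda d: d["col1"] * d["col2"])  # add a derived column
    .query("scaled > 0")                             # keep positive products
    .sort_values("scaled", ascending=False)          # largest first
    .reset_index(drop=True)
)

print(result)
```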
💾 Memory Optimization
- Data type optimization: Use appropriate dtypes
- Chunk processing: Process large files in chunks
- Garbage collection: Use del and gc.collect()
- Sparse data: Use sparse arrays
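Chunk processing can be sketched like this. An in-memory string stands in for a large CSV file so the example is self-contained; in practice you would pass a file path to read_csv. The column names and the tiny chunksize are chosen only for illustration.

```python
import gc
import io

import pandas as pd

# Stand-in for a large file on disk (hypothetical columns).
csv_text = "category,value\n" + "\n".join(
    f"{'ab'[i % 2]},{i}" for i in range(10)
)

partials = []
# chunksize makes read_csv return an iterator of DataFrames,
# so only one chunk is held in memory at a time.
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=4, dtype={"value": "int32"}):
    # Reduce each chunk to a small partial result before moving on.
    partials.append(chunk.groupby("category")["value"].sum())

# Combine the per-chunk results into the final aggregate.
totals = pd.concat(partials).groupby(level=0).sum()

# Drop the reference to the intermediate list and ask the collector to run.
del partials
gc.collect()

print(totals)
```

The pattern only saves memory when each chunk is reduced to something small; collecting the raw chunks in a list would rebuild the full dataset.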
🔧 Advanced Techniques
- Dask integration: For out-of-core computation
- Numba JIT: Just-in-time compilation
- Caching: Memoize expensive operations
- Parallel processing: Use multiprocessing
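Of the techniques above, caching is the easiest to show with the standard library alone. The sketch below memoizes a deliberately simple stand-in function with functools.lru_cache, so repeated values in a column are computed only once. The function expensive_lookup is hypothetical; a real one would wrap something slow such as a lookup or a parse.

```python
from functools import lru_cache

import pandas as pd


@lru_cache(maxsize=None)
def expensive_lookup(code: str) -> int:
    # Stand-in for a slow computation (hypothetical).
    return sum(ord(ch) for ch in code)


s = pd.Series(["US", "DE", "US", "JP", "DE", "US"])

# map() calls the function once per element, but the cache
# answers repeated values without recomputing them.
result = s.map(expensive_lookup)

info = expensive_lookup.cache_info()
print(f"computed {info.misses} values, reused {info.hits}")
```

This pays off when the column has many repeated values and the function is pure, meaning the same input always gives the same output.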
📊 Monitoring Tools
- Memory profiler: Track memory usage
- Line profiler: Line-by-line timing
- cProfile: Function-level profiling
- IPython/Jupyter magics: %timeit, %%time
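Outside a notebook, the standard-library timeit and cProfile modules cover the same ground as the tools above. The sketch below times a loop against its vectorized equivalent and then profiles the slow version; the helper names with_iterrows and vectorized are made up for this example, and absolute timings will vary by machine.

```python
import cProfile
import io
import pstats
import timeit

import numpy as np
import pandas as pd

df = pd.DataFrame({"col1": np.arange(2_000)})


def with_iterrows(frame):
    # Row-by-row version, kept only for comparison.
    out = []
    for _, row in frame.iterrows():
        out.append(row["col1"] * 2)
    return pd.Series(out, index=frame.index)


def vectorized(frame):
    # Whole-column version.
    return frame["col1"] * 2


# timeit: total wall time over a fixed number of runs.
slow = timeit.timeit(lambda: with_iterrows(df), number=3)
fast = timeit.timeit(lambda: vectorized(df), number=3)
print(f"iterrows: {slow:.4f}s  vectorized: {fast:.4f}s")

# cProfile: function-level breakdown of where the time goes.
profiler = cProfile.Profile()
profiler.enable()
with_iterrows(df)
profiler.disable()

stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(5)
print(stream.getvalue())
```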
Common Performance Pitfalls
❌ Anti-Patterns
- Using iterrows() for vectorizable operations
- Chained indexing: df[df['a'] > 1]['b']
- Repeated append() in loops
- Not specifying dtypes when reading CSV
- Using apply() when vectorized methods exist
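The chained indexing pitfall above deserves a short illustration, because it fails quietly. A chained expression first builds a temporary filtered frame and then indexes into it, so an assignment may land on the temporary object instead of the original. A single .loc call does the row filter and column selection in one step. The data here is made up.

```python
import pandas as pd

# Hypothetical data for illustration.
df = pd.DataFrame({"a": [0, 2, 3], "b": [10, 20, 30]})

# Anti-pattern (shown as a comment, not executed):
#   df[df["a"] > 1]["b"] = 0
# The first [] returns a new object, so the assignment may never reach df.

# Preferred: one .loc call with a row mask and a column label.
df.loc[df["a"] > 1, "b"] = 0

# Reading works the same way.
selected = df.loc[df["a"] > 1, "b"]

print(df)
```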
✅ Best Practices
- Use vectorized operations whenever possible
- Specify dtypes when reading data
- Use concat() instead of append() (DataFrame.append was removed in pandas 2.0)
- Process large files in chunks
- Use appropriate data structures
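Two of the practices above, specifying dtypes on read and combining frames with a single concat(), can be sketched together. The CSV content and column names are invented, and an in-memory string replaces a real file so the example runs as is.

```python
import io

import pandas as pd

# Stand-in for a CSV file on disk (hypothetical columns).
csv_text = "id,city,amount\n1,Paris,9.5\n2,Rome,3.25\n3,Paris,7.0\n"

# Declaring dtypes up front skips type inference and avoids
# loading everything as int64/float64/object first.
df = pd.read_csv(
    io.StringIO(csv_text),
    dtype={"id": "int32", "city": "category", "amount": "float32"},
)

# Collect pieces in a plain list and concatenate once at the end.
# Growing a DataFrame inside the loop would copy the data on every step.
pieces = [pd.DataFrame({"id": [i], "square": [i * i]}) for i in range(5)]
combined = pd.concat(pieces, ignore_index=True)

print(df.dtypes)
print(combined)
```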
Example: Comprehensive Performance Optimization
Performance Optimization Examples
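One way to pull the dtype techniques together is a small helper that inspects each column and picks a cheaper type. The function below is a sketch under stated assumptions, not a pandas API: optimize_dtypes and its max_category_ratio threshold are hypothetical names, and the 0.5 cutoff is an arbitrary starting point to tune for your data. It relies on the type-checking helpers in pandas.api.types.

```python
import numpy as np
import pandas as pd
from pandas.api.types import is_float_dtype, is_integer_dtype, is_string_dtype


def optimize_dtypes(frame, max_category_ratio=0.5):
    """Return a copy of frame with cheaper dtypes where that looks safe.

    Hypothetical helper: max_category_ratio is the highest share of
    unique values at which a string column is converted to category.
    """
    out = frame.copy()
    for col in out.columns:
        series = out[col]
        if is_integer_dtype(series):
            out[col] = pd.to_numeric(series, downcast="integer")
        elif is_float_dtype(series):
            # Downcasts to float32 only when the values survive the conversion.
            out[col] = pd.to_numeric(series, downcast="float")
        elif is_string_dtype(series):
            unique_ratio = series.nunique() / max(len(series), 1)
            if unique_ratio <= max_category_ratio:
                out[col] = series.astype("category")
    return out


# Made-up data for illustration.
raw = pd.DataFrame({
    "id": np.arange(1000),
    "score": np.tile([1.5, 2.25, 3.0, 4.75], 250),
    "status": np.tile(["open", "closed"], 500),
})

optimized = optimize_dtypes(raw)

before = raw.memory_usage(deep=True).sum()
after = optimized.memory_usage(deep=True).sum()
print(optimized.dtypes)
print(f"before: {before} bytes, after: {after} bytes")
```

The helper works on a copy so the original frame stays available for comparison; on very large frames you may prefer to convert columns one at a time instead.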
Optimization Techniques
- astype() - Data type conversion
- to_numeric() - Numerical optimization
- category dtype - String optimization
- Vectorized operations - Speed improvement
- Chunk processing - Memory management
Performance Tools
- memory_usage() - Memory profiling
- %timeit - Timing magic command
- line_profiler - Line-by-line profiling
- memory_profiler - Memory usage tracking
- cProfile - Comprehensive profiling
Important: Always profile before optimizing, and use the right tool for the job. For datasets that do not fit comfortably in memory, pandas may not be the best choice (consider Dask or PySpark).
Pro Tip: Use pd.api.types.is_string_dtype() and pd.api.types.is_numeric_dtype() to check data types before optimization.
Quick Performance Checklist
# ✅ Performance Checklist:
# 1. Use vectorized operations instead of loops
# 2. Optimize data types (downcast numbers, use categories)
# 3. Use concat() instead of append() for multiple DataFrames
# 4. Process large files in chunks with chunksize parameter
# 5. Use query() for complex filtering conditions
# 6. Leverage NumPy for mathematical operations
# 7. Use method chaining to avoid intermediate variables
# 8. Monitor memory usage with memory_usage(deep=True)
# 9. Avoid unnecessary copies; do not rely on inplace=True, which often still copies internally
# 10. Consider alternative libraries (Dask, Polars) for huge datasets