Pandas GroupBy Operations
The GroupBy operation is one of the most powerful features in Pandas, enabling you to split data into groups, apply functions to each group, and combine the results. It's essential for aggregation, transformation, and filtering operations.
1. The GroupBy Process: Split-Apply-Combine
GroupBy follows a three-step process:
- Split: Divide the data into groups based on specified criteria
- Apply: Apply a function to each group independently
- Combine: Combine the results into a new data structure
Visualization: DataFrame → Split by Group → Apply Function → Combine Results
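To make the three steps concrete, the sketch below performs split, apply, and combine by hand and then compares the result with a single groupby() call. The DataFrame and its column names (team, points) are invented for illustration.

```python
import pandas as pd

# Hypothetical sample data, used only for illustration.
df = pd.DataFrame({
    "team": ["A", "B", "A", "B", "A"],
    "points": [10, 20, 30, 40, 50],
})

# Split: one sub-DataFrame per distinct value of "team".
pieces = {key: group for key, group in df.groupby("team")}

# Apply: compute a result for each piece independently.
totals = {key: group["points"].sum() for key, group in pieces.items()}

# Combine: assemble the per-group results into a single Series.
manual = pd.Series(totals, name="points").rename_axis("team")

# The same three steps expressed as one chained call.
builtin = df.groupby("team")["points"].sum()

print(manual)
print(builtin)
```

Both Series hold the same per-team totals; in practice the chained form is the one you would normally write.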
2. Basic GroupBy Syntax and Methods
| Method | Description | Example |
|---|---|---|
| groupby() | Create GroupBy object | df.groupby('column') |
| sum() | Sum of each group | grouped.sum() |
| mean() | Average of each group | grouped.mean() |
| count() | Count of non-NA values in each group | grouped.count() |
| agg() | Multiple aggregations | grouped.agg(['sum', 'mean']) |
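As a minimal sketch of the methods in the table, the example below builds one GroupBy object and calls each method on it. The data and the names department and salary are assumptions made for the example.

```python
import pandas as pd

# Hypothetical sample data, used only for illustration.
df = pd.DataFrame({
    "department": ["Sales", "Sales", "IT", "IT", "HR"],
    "salary": [50000, 60000, 70000, 80000, 55000],
})

# Create the GroupBy object and select the numeric column to aggregate.
grouped = df.groupby("department")["salary"]

total = grouped.sum()      # sum of each group
average = grouped.mean()   # average of each group
n = grouped.count()        # number of non-NA values in each group

# Several aggregations at once: one output column per function.
summary = grouped.agg(["sum", "mean"])
print(summary)
```

Selecting the salary column before aggregating keeps non-numeric columns out of the calculation.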
3. Common Aggregation Functions
- sum() - Sum of values
- mean() - Arithmetic mean
- median() - Median value
- std() - Standard deviation
- var() - Variance
- min() - Minimum value
- max() - Maximum value
- count() - Count of non-NA values
- size() - Number of rows in each group, including NA values
- first()/last() - First/last value
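The following sketch runs several of these functions on a small column that contains one missing value, which also shows how count() and size() differ. The store and revenue names and the numbers are made up for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data with one missing revenue value.
df = pd.DataFrame({
    "store": ["North", "North", "North", "South", "South"],
    "revenue": [100.0, np.nan, 300.0, 200.0, 400.0],
})

by_store = df.groupby("store")["revenue"]

# Several built-in aggregations computed in one call.
stats = by_store.agg(
    ["sum", "mean", "median", "std", "var", "min", "max", "first", "last"]
)

# count() ignores the missing value; size() counts every row.
counts = by_store.count()
sizes = by_store.size()

print(stats)
print(counts)
print(sizes)
```

For the North store, count() reports 2 while size() reports 3, because one of its three rows has no revenue value.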
Basic GroupBy Examples
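The original examples for this section are not reproduced here, so the block below is only a stand-in sketch of a few basic operations: counting groups, iterating over them, pulling out a single group, and aggregating one column. The data and column names are hypothetical.

```python
import pandas as pd

# Hypothetical sample data, used only for illustration.
df = pd.DataFrame({
    "category": ["fruit", "veg", "fruit", "veg", "fruit"],
    "item": ["apple", "carrot", "banana", "leek", "cherry"],
    "price": [1.2, 0.5, 0.8, 1.1, 3.0],
})

grouped = df.groupby("category")

# Number of distinct groups.
print(grouped.ngroups)

# Iterating yields (group key, sub-DataFrame) pairs.
for name, group in grouped:
    print(name, len(group))

# Retrieve the rows of one group.
fruit = grouped.get_group("fruit")

# Aggregate a single column per group.
max_price = grouped["price"].max()
print(max_price)
```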
4. Advanced GroupBy Techniques
| Technique | Description | Use Case |
|---|---|---|
| transform() | Return an object with the same index as the input, with each group's result broadcast to its rows | Adding group-wise statistics to original data |
| filter() | Filter groups based on conditions | Selecting groups that meet specific criteria |
| apply() | Apply custom function to each group | Complex group-wise operations |
| Multiple Columns | Group by multiple columns | Hierarchical grouping analysis |
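As a rough illustration of the first two rows of the table, the sketch below uses transform() to attach a group mean to every row and filter() to keep only groups with more than one member. The team and score columns are invented for the example.

```python
import pandas as pd

# Hypothetical sample data, used only for illustration.
df = pd.DataFrame({
    "team": ["A", "A", "B", "B", "C"],
    "score": [10, 20, 30, 50, 5],
})

# transform: the result has one value per original row,
# so it can be assigned straight back as a new column.
df["team_mean"] = df.groupby("team")["score"].transform("mean")
df["diff_from_mean"] = df["score"] - df["team_mean"]

# filter: keep every row of the groups for which the function returns True.
multi_member = df.groupby("team").filter(lambda g: len(g) > 1)

print(df)
print(multi_member)
```

Team C has a single row, so filter() drops it, while transform() leaves the number of rows unchanged.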
Advanced GroupBy Operations
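The original examples for this section are not reproduced here. As a stand-in, this sketch covers the remaining two techniques: grouping by multiple columns and applying a custom function with apply(). The data, the column names, and the helper value_range are all hypothetical.

```python
import pandas as pd

# Hypothetical sample data, used only for illustration.
df = pd.DataFrame({
    "region": ["East", "East", "East", "West", "West", "West"],
    "product": ["pen", "pen", "ink", "pen", "ink", "ink"],
    "units": [5, 7, 2, 3, 9, 4],
})

# Grouping by two columns produces a result indexed by (region, product).
by_region_product = df.groupby(["region", "product"])["units"].sum()

# A custom function (hypothetical helper) applied to each group's values.
def value_range(values):
    """Spread between the largest and smallest value in a group."""
    return values.max() - values.min()

units_range = df.groupby("region")["units"].apply(value_range)

print(by_region_product)
print(units_range)
```

A built-in aggregation would be preferable if one existed for this calculation; apply() is used here because the range is a custom computation.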
Real-World E-commerce Example
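The original e-commerce example is not reproduced here. The sketch below shows what such an analysis might look like on a small, entirely made-up orders table; every column name and value is an assumption for illustration.

```python
import pandas as pd

# Hypothetical order data; all names and numbers are invented.
orders = pd.DataFrame({
    "order_id": [1, 2, 3, 4, 5, 6],
    "customer": ["ana", "ben", "ana", "cy", "ben", "ana"],
    "category": ["books", "toys", "toys", "books", "books", "books"],
    "quantity": [2, 1, 3, 1, 4, 1],
    "unit_price": [12.0, 25.0, 25.0, 12.0, 12.0, 30.0],
})
orders["revenue"] = orders["quantity"] * orders["unit_price"]

# Per-category summary using named aggregation for readable column names.
category_summary = orders.groupby("category").agg(
    n_orders=("order_id", "count"),
    units=("quantity", "sum"),
    revenue=("revenue", "sum"),
)

# Total spend per customer, highest first.
customer_spend = (
    orders.groupby("customer")["revenue"].sum().sort_values(ascending=False)
)

# Each order's share of its category's revenue, via transform().
category_total = orders.groupby("category")["revenue"].transform("sum")
orders["share_of_category"] = orders["revenue"] / category_total

print(category_summary)
print(customer_spend)
```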
GroupBy Best Practices
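- Select only the columns you need before aggregating, rather than aggregating the entire DataFrame
- Chain operations to keep the pipeline readable and avoid unnecessary intermediate variables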
- Use agg() for multiple aggregations
- Consider using pd.pivot_table() for simple cases
- Reset index after grouping for cleaner DataFrames
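A short sketch of two of these practices: computing several aggregations in one agg() call, and using pd.pivot_table() for a simple two-way summary. The DataFrame and its columns (city, year, sales, visits) are hypothetical.

```python
import pandas as pd

# Hypothetical sample data, used only for illustration.
df = pd.DataFrame({
    "city": ["Oslo", "Oslo", "Rome", "Rome"],
    "year": [2022, 2023, 2022, 2023],
    "sales": [100, 150, 80, 120],
    "visits": [10, 12, 8, 9],
})

# Several aggregations in a single agg() call, with explicit output names.
summary = df.groupby("city").agg(
    total_sales=("sales", "sum"),
    mean_sales=("sales", "mean"),
    max_visits=("visits", "max"),
)

# A simple two-way summary written with pivot_table ...
pivot = pd.pivot_table(
    df, index="city", columns="year", values="sales", aggfunc="sum"
)

# ... and the equivalent groupby followed by unstack.
via_groupby = df.groupby(["city", "year"])["sales"].sum().unstack("year")

print(summary)
print(pivot)
```

The pivot_table and groupby versions produce the same table here; which one reads better is a matter of taste.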
Performance Tips
- Avoid using apply() when built-in methods exist
- Use categorical data for grouping columns
- Sort data before grouping if needed
- Use as_index=False to keep grouping columns as regular columns
- Consider Dask for very large datasets
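The sketch below illustrates three of these tips on a tiny, made-up DataFrame: a built-in method in place of apply(), a categorical grouping column, and as_index=False. On data this small there is no measurable speed difference, so the example only shows that the results match. Passing observed=True explicitly is a choice made here so that only categories present in the data are reported.

```python
import pandas as pd

# Hypothetical sample data, used only for illustration.
df = pd.DataFrame({
    "grade": ["a", "b", "a", "c", "b", "a"],
    "score": [1.0, 2.0, 3.0, 4.0, 6.0, 5.0],
})

# Built-in method versus the equivalent apply() with a Python function.
with_builtin = df.groupby("grade")["score"].mean()
with_apply = df.groupby("grade")["score"].apply(lambda s: s.mean())

# Grouping on a categorical column.
df["grade_cat"] = df["grade"].astype("category")
by_category = df.groupby("grade_cat", observed=True)["score"].mean()

# as_index=False keeps the grouping column as a regular column.
flat = df.groupby("grade", as_index=False)["score"].mean()

print(with_builtin)
print(flat)
```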
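Important: Creating a GroupBy object is lazy. Calling groupby() only records how the rows should be grouped; no results are computed until you call a method such as sum(), agg(), or transform(). This makes it cheap to create one GroupBy object and reuse it for several computations.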
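This deferred behaviour can be observed directly: the object returned by groupby() is not a DataFrame of results, and values appear only once an aggregation is called. The data below is a hypothetical minimal example.

```python
import pandas as pd

# Hypothetical sample data, used only for illustration.
df = pd.DataFrame({"key": ["x", "y", "x"], "value": [1, 2, 3]})

# No aggregation has been computed yet; this is only a grouping description.
grouped = df.groupby("key")
print(type(grouped).__name__)

# Results are computed when an aggregation method is called.
totals = grouped["value"].sum()

# The same GroupBy object can be reused for another computation.
largest = grouped["value"].max()

print(totals)
print(largest)
```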
Pro Tip: Use .reset_index() after GroupBy operations to move the group keys out of the index and back into regular columns, giving you a flat DataFrame.
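A minimal sketch of the difference, using invented region, quarter, and sales columns: the aggregated result is indexed by the group keys until reset_index() turns them back into columns. The name total_sales is chosen here for illustration.

```python
import pandas as pd

# Hypothetical sample data, used only for illustration.
df = pd.DataFrame({
    "region": ["N", "N", "S", "S"],
    "quarter": ["Q1", "Q2", "Q1", "Q1"],
    "sales": [10, 20, 30, 40],
})

# The group keys become a two-level index on the result.
indexed = df.groupby(["region", "quarter"])["sales"].sum()

# reset_index() moves the keys back into columns;
# name= sets the column name for the aggregated values.
flat = indexed.reset_index(name="total_sales")

print(indexed)
print(flat)
```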