modin — Drop-in pandas at Scale#
What it is#
Modin is a DataFrame library that parallelises pandas operations across all available CPU cores with a single import change. It replaces import pandas as pd with import modin.pandas as pd and intercepts the same API, distributing work across partitions using Ray (default), Dask, or Unidist as the execution engine. Code that already works with pandas runs on modin without modification; modin falls back to pandas transparently for unsupported operations. It is the lowest-friction path for speeding up existing pandas pipelines on multi-core machines.
Install#
pip install modin[ray] # Ray backend (recommended, best out-of-the-box speed)
pip install modin[dask] # Dask backend (useful if Dask is already in your stack)
pip install modin[all] # all backends
Output: (none — exits 0 on success)
Quick example#
import modin.pandas as pd # only this line changes
df = pd.read_csv("large.csv") # reads in parallel across cores
print(df.groupby("category")["value"].mean())
Output:
category
electronics 299.42
furniture 189.10
software 149.99
Name: value, dtype: float64
When / why to use it#
- Existing pandas codebase you want to speed up without rewriting to polars or Dask.
- Multi-core machines (8+ cores) where single-threaded pandas leaves most hardware idle.
- Large CSVs or Parquet files where
read_csv/read_parquetis the bottleneck. - Teams that cannot retrain on a new DataFrame API.
- Ad-hoc exploration where polars’ expression-based API feels unfamiliar.
Common pitfalls#
[!WARNING] Mixing modin and pandas DataFrames — passing a modin DataFrame to a function that calls
pd.DataFrame()with the stockpandasimport creates a plain pandas DataFrame, silently losing the distributed partitioning. Keep all objects in the same library within a pipeline.
[!WARNING] Small DataFrames are slower — modin has coordination overhead from the execution engine. For DataFrames under ~100 MB, plain pandas is often faster. Modin’s parallelism pays off only at scale.
[!WARNING]
inplace=Trueis unreliable — modin does not guarantee thatinplace=Truemodifies the original object. Reassign instead:df = df.dropna().
[!WARNING] Unsupported operations fall back silently — modin falls back to pandas for ~15% of the API surface. The fallback is correct but single-threaded. Check
modin.utils.has_metadataor watch forUserWarning: Distributing...messages to understand what’s running in parallel.
[!TIP] Set
MODIN_ENGINE=ray(ordask) as an environment variable before importing modin to control the backend without touching code. This makes it easy to switch backends per-environment.
[!TIP]
df._to_pandas()extracts a plain pandas DataFrame from a modin object. Use it when you need to pass data to a library that does not accept modin DataFrames.
Richer example — multi-file CSV aggregation#
import modin.pandas as pd
import glob
# Read and concat multiple large CSVs in parallel
files = glob.glob("data/logs_*.csv")
df = pd.concat([pd.read_csv(f) for f in files], ignore_index=True)
# Group by date and action, count events
summary = (
df.groupby(["date", "action"])["user_id"]
.count()
.reset_index(name="events")
.sort_values("events", ascending=False)
)
print(summary.head(5))
Output:
date action events
3 2026-01-03 page_view 48291
1 2026-01-01 page_view 45982
7 2026-01-07 add_cart 38741
2 2026-01-02 search 33019
5 2026-01-05 page_view 31284
Backend selection#
Modin supports three execution engines. The backend is selected once at import time (or via environment variable) and cannot be changed mid-session.
# Option 1 — environment variable (set before Python starts)
# MODIN_ENGINE=ray python script.py
# Option 2 — modin.config (set before first import of modin.pandas)
import modin.config as cfg
cfg.Engine.put("ray") # or "dask", "unidist"
import modin.pandas as pd
# Option 3 — Ray directly (start cluster before modin import)
import ray
ray.init(num_cpus=8) # or ray.init(address="auto") for remote cluster
import modin.pandas as pd
| Backend | Best for |
|---|---|
| Ray | Single machine, best default performance, easy install |
| Dask | Already using Dask, distributed cluster, lower memory overhead |
| Unidist | MPI environments, HPC clusters |
modin.config — tuning parameters#
modin.config exposes runtime knobs for partition size, engine, and logging. Changes must be made before importing modin.pandas.
import modin.config as cfg
cfg.Engine.put("ray")
cfg.NPartitions.put(16) # override number of row partitions (default: nCPU)
cfg.RayRedisAddress.put(None) # use local Ray cluster
cfg.BenchmarkMode.put(False) # set True to disable fallback (raises instead of falling back)
import modin.pandas as pd
Check current values:
import modin.config as cfg
print(cfg.Engine.get()) # ray
print(cfg.NPartitions.get()) # 8 (on 8-core machine)
Interop with pandas#
Converting between modin and pandas is explicit and inexpensive — both share Arrow-compatible memory when possible.
import modin.pandas as mpd
import pandas as pd
# modin → pandas
df_modin = mpd.read_csv("data.csv")
df_pandas = df_modin._to_pandas()
print(type(df_pandas)) # <class 'pandas.core.frame.DataFrame'>
# pandas → modin
df_back = mpd.DataFrame(df_pandas)
print(type(df_back)) # <class 'modin.pandas.dataframe.DataFrame'>
# Use modin in a function that expects pandas
import numpy as np
result = pd.DataFrame(np.corrcoef(df_modin["a"], df_modin["b"])) # works via __array__
Operations that fall back to pandas#
Modin handles the most common operations natively. The following operations silently fall back to single-threaded pandas:
| Operation | Status |
|---|---|
read_csv, read_parquet, read_excel | Parallel ✅ |
groupby().agg() | Parallel ✅ |
merge / join | Parallel ✅ |
sort_values | Parallel ✅ |
apply(axis=0) (column-wise) | Parallel ✅ |
apply(axis=1) (row-wise) | Fallback ⚠️ |
iterrows, itertuples | Fallback ⚠️ |
to_sql | Fallback ⚠️ |
Custom groupby().apply() | Fallback ⚠️ |
MultiIndex operations | Partial ⚠️ |
[!TIP] Row-wise
apply(axis=1)is the most common performance trap. If you need per-row custom logic at scale, move to polars’map_elementsor vectorise with numpy instead.
Benchmarks — when modin wins#
Modin’s speedup is proportional to core count and dataset size. Rough guidelines:
| Scenario | modin vs pandas |
|---|---|
read_csv (2 GB, 8 cores) | ~4–6× faster |
groupby().agg() (50 M rows) | ~3–5× faster |
merge (two 10 M row tables) | ~2–4× faster |
sort_values (10 M rows) | ~2–3× faster |
apply(axis=1) | ~1× (fallback) |
| DataFrame under 10 MB | ~0.5–0.8× (overhead dominates) |
For datasets over ~10 GB or when you need strict performance guarantees, polars is typically faster than modin because its Rust/Arrow core has lower coordination overhead than modin’s Ray-based partitioning.
Dropping back to pandas for a single operation#
When you know an operation falls back anyway, it is cleaner to extract pandas explicitly for that step:
import modin.pandas as mpd
df = mpd.read_parquet("large.parquet") # parallel read
# Complex custom groupby — do it in pandas
pandas_df = df._to_pandas()
result = pandas_df.groupby("user_id").apply(
lambda g: g.nlargest(3, "value")
)
# Continue in modin
df_result = mpd.DataFrame(result.reset_index(drop=True))
Partition inspection#
import modin.pandas as pd
df = pd.read_csv("data.csv")
# Number of row partitions
from modin.config import NPartitions
print(NPartitions.get())
# Underlying pandas partitions (for debugging)
parts = df._query_compiler._modin_frame._partitions
print(f"Partition grid: {len(parts)} rows × {len(parts[0])} cols")
Quick reference#
| Task | Code |
|---|---|
| Import swap | import modin.pandas as pd |
| Set backend | modin.config.Engine.put("ray") |
| Set partitions | modin.config.NPartitions.put(16) |
| To pandas | df._to_pandas() |
| From pandas | modin.pandas.DataFrame(pandas_df) |
| Read CSV parallel | pd.read_csv("file.csv") |
| Read Parquet parallel | pd.read_parquet("file.parquet") |
| groupby (parallel) | df.groupby("k").agg({"v": "sum"}) |
| merge (parallel) | pd.merge(left, right, on="id") |
| Check engine | modin.config.Engine.get() |
| Start Ray manually | ray.init(num_cpus=N) before import |