
pandas — DataFrames

Load, filter, transform, and aggregate tabular data with pandas. Covers DataFrame creation, read_csv, groupby, merge, and the SettingWithCopy pitfall.



What it is#

pandas provides two main structures: DataFrame (2-D labeled table) and Series (1-D labeled array). It is the standard library for reading, cleaning, transforming, and analyzing tabular data in Python.
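The relationship between the two structures can be seen in a couple of lines (the column name here is illustrative):

```python
import pandas as pd

# A Series is one labeled column; a DataFrame is a table of them.
s = pd.Series([10, 20, 30], name="score")
df = pd.DataFrame({"score": s})

# Selecting a single column from a DataFrame gives back a Series.
col = df["score"]
```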

Install#

pip install pandas

Quick example#

import pandas as pd

df = pd.DataFrame({
    "name": ["Alice", "Bob", "Charlie"],
    "dept": ["Eng", "Eng", "HR"],
    "score": [92, 78, 85],
})
print(df[df["score"] > 80])

Output:

      name dept  score
0    Alice  Eng     92
2  Charlie   HR     85

When / why to use it#

  • Reading CSV, Excel, JSON, or SQL tables: pd.read_csv(), pd.read_excel(), pd.read_sql().
  • Cleaning and reshaping data before analysis or ML.
  • Exploratory data analysis with df.describe(), groupby, pivot_table.
  • Joining multiple datasets: pd.merge().

Common pitfalls#

[!WARNING] SettingWithCopyWarning — modifying a slice of a DataFrame may not modify the original. Use .loc for assignment:

df.loc[df["score"] > 80, "grade"] = "A"  # βœ… correct
df[df["score"] > 80]["grade"] = "A"       # ❌ may silently do nothing
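When you genuinely want an independent subset to mutate, take an explicit copy; modifying the copy then cannot affect (or warn about) the original. A sketch:

```python
import pandas as pd

df = pd.DataFrame({"score": [92, 78, 85]})

# .copy() makes the subset its own object, so assignment is unambiguous.
top = df[df["score"] > 80].copy()
top["grade"] = "A"
```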

[!WARNING] inplace=True is a trap — many operations accept inplace=True, but it doesn't save memory and makes code harder to chain. Prefer reassignment: df = df.dropna().

[!TIP] Use df.dtypes and df.info() early to understand what you loaded. Numeric columns imported as object (string) will break aggregations silently.
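A sketch of catching and fixing a mistyped column (the column name is illustrative); pd.to_numeric converts explicitly, and errors="coerce" turns unparseable values into NaN instead of raising:

```python
import pandas as pd

# Numbers that arrive as strings get the catch-all "object" dtype.
df = pd.DataFrame({"score": ["92", "78", "85"]})
assert df["score"].dtype == object  # summing these would concatenate, not add

# Convert explicitly before aggregating.
df["score"] = pd.to_numeric(df["score"], errors="coerce")
```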

Reading and writing files#

import pandas as pd

# CSV
df = pd.read_csv("data.csv")
df.to_csv("output.csv", index=False)

# Excel
df = pd.read_excel("report.xlsx", sheet_name="Sheet1")

# JSON
df = pd.read_json("records.json", orient="records")

# from a SQL query
import sqlite3
conn = sqlite3.connect("app.db")
df = pd.read_sql("SELECT * FROM users WHERE active = 1", conn)
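pd.read_csv also takes options that are worth applying at load time rather than fixing afterwards; a sketch using an in-memory buffer in place of a file path (the column names are illustrative):

```python
import io
import pandas as pd

csv = "id,when,amount\n1,2024-01-02,9.5\n2,2024-01-03,4.0\n"

df = pd.read_csv(
    io.StringIO(csv),            # stands in for "data.csv"
    usecols=["when", "amount"],  # read only the columns you need
    parse_dates=["when"],        # parse this column as datetimes
    dtype={"amount": "float64"}, # pin dtypes up front
)
```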

Richer example β€” groupby and merge#

import pandas as pd

sales = pd.DataFrame({
    "product": ["A", "B", "A", "B", "A"],
    "region": ["East", "East", "West", "West", "East"],
    "amount": [100, 200, 150, 250, 120],
})

# Group by region and product, aggregate
summary = (
    sales.groupby(["region", "product"])["amount"]
    .agg(total="sum", orders="count")
    .reset_index()
)
print(summary)
print()

# Merge with a product lookup table
products = pd.DataFrame({"product": ["A", "B"], "name": ["Widget", "Gadget"]})
merged = summary.merge(products, on="product")
print(merged[["region", "name", "total", "orders"]])

Output:

  region product  total  orders
0   East       A    220       2
1   East       B    200       1
2   West       A    150       1
3   West       B    250       1

  region    name  total  orders
0   East  Widget    220       2
1   East  Gadget    200       1
2   West  Widget    150       1
3   West  Gadget    250       1
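By default merge performs an inner join, keeping only keys present in both frames; the how parameter changes that. A small sketch with an incomplete lookup table (data here is illustrative):

```python
import pandas as pd

left = pd.DataFrame({"product": ["A", "B"], "total": [220, 200]})
lookup = pd.DataFrame({"product": ["A"], "name": ["Widget"]})

# how="left" keeps every row from the left frame; unmatched names become NaN.
joined = left.merge(lookup, on="product", how="left")
```

With the default how="inner", the row for product B would be dropped entirely, which is easy to miss in a long pipeline.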

Useful operations quick reference#

| Task | Code |
| --- | --- |
| Filter rows | df[df["col"] > 5] |
| Select columns | df[["a", "b"]] |
| Add column | df["new"] = df["a"] * 2 |
| Drop rows with NaN | df.dropna(subset=["col"]) |
| Fill NaN | df["col"].fillna(0) |
| Sort | df.sort_values("col", ascending=False) |
| Rename columns | df.rename(columns={"old": "new"}) |
| Apply function | df["col"].apply(lambda x: x.strip()) |
| Pivot | df.pivot_table(values="sales", index="region", columns="product", aggfunc="sum") |
| Value counts | df["status"].value_counts() |
| Describe | df.describe() |
| Shape | df.shape |
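The pivot entry in the table is the least obvious one; a sketch of what it produces (data is illustrative):

```python
import pandas as pd

sales = pd.DataFrame({
    "region": ["East", "East", "West"],
    "product": ["A", "B", "A"],
    "amount": [100, 200, 150],
})

# Rows become regions, columns become products, cells sum the amounts;
# missing region/product combinations come out as NaN.
wide = sales.pivot_table(values="amount", index="region",
                         columns="product", aggfunc="sum")
```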