pandas β DataFrames#
What it is#
pandas provides two main structures: DataFrame (2-D labeled table) and Series (1-D labeled array). It is the standard library for reading, cleaning, transforming, and analyzing tabular data in Python.
Install#
pip install pandas
Quick example#
import pandas as pd
df = pd.DataFrame({
"name": ["Alice", "Bob", "Charlie"],
"dept": ["Eng", "Eng", "HR"],
"score": [92, 78, 85],
})
print(df[df["score"] > 80])
Output:
name dept score
0 Alice Eng 92
2 Charlie HR 85
When / why to use it#
- Reading CSV, Excel, JSON, or SQL tables:
pd.read_csv(),pd.read_excel(),pd.read_sql(). - Cleaning and reshaping data before analysis or ML.
- Exploratory data analysis with
df.describe(),groupby,pivot_table. - Joining multiple datasets:
pd.merge().
Common pitfalls#
[!WARNING] SettingWithCopyWarning β modifying a slice of a DataFrame may not modify the original. Use
.locfor assignment:df.loc[df["score"] > 80, "grade"] = "A" # β correct df[df["score"] > 80]["grade"] = "A" # β may silently do nothing
[!WARNING]
inplace=Trueis a trap β many operations acceptinplace=Truebut it doesnβt save memory and makes code harder to chain. Prefer reassignment:df = df.dropna().
[!TIP] Use
df.dtypesanddf.info()early to understand what you loaded. Numeric columns imported asobject(string) will break aggregations silently.
Reading and writing files#
import pandas as pd
# CSV
df = pd.read_csv("data.csv")
df.to_csv("output.csv", index=False)
# Excel
df = pd.read_excel("report.xlsx", sheet_name="Sheet1")
# JSON
df = pd.read_json("records.json", orient="records")
# from a SQL query
import sqlite3
conn = sqlite3.connect("app.db")
df = pd.read_sql("SELECT * FROM users WHERE active = 1", conn)
Richer example β groupby and merge#
import pandas as pd
sales = pd.DataFrame({
"product": ["A", "B", "A", "B", "A"],
"region": ["East", "East", "West", "West", "East"],
"amount": [100, 200, 150, 250, 120],
})
# Group by region and product, aggregate
summary = (
sales.groupby(["region", "product"])["amount"]
.agg(total="sum", orders="count")
.reset_index()
)
print(summary)
print()
# Merge with a product lookup table
products = pd.DataFrame({"product": ["A", "B"], "name": ["Widget", "Gadget"]})
merged = summary.merge(products, on="product")
print(merged[["region", "name", "total", "orders"]])
Output:
region product total orders
0 East A 220 2
1 East B 200 1
2 West A 150 1
3 West B 250 1
region name total orders
0 East Widget 220 2
1 East Gadget 200 1
2 West Widget 150 1
3 West Gadget 250 1
Useful operations quick reference#
| Task | Code |
|---|---|
| Filter rows | df[df["col"] > 5] |
| Select columns | df[["a", "b"]] |
| Add column | df["new"] = df["a"] * 2 |
| Drop rows with NaN | df.dropna(subset=["col"]) |
| Fill NaN | df["col"].fillna(0) |
| Sort | df.sort_values("col", ascending=False) |
| Rename columns | df.rename(columns={"old": "new"}) |
| Apply function | df["col"].apply(lambda x: x.strip()) |
| Pivot | df.pivot_table(values="sales", index="region", columns="product", aggfunc="sum") |
| Value counts | df["status"].value_counts() |
| Describe | df.describe() |
| Shape | df.shape |