
ChromaDB — Embedded Vector Database

Store and query vector embeddings locally or over a network with ChromaDB. Covers client types, collections, add, query, metadata filters, embedding functions, and LangChain/LlamaIndex integration.



What it is

ChromaDB is an open-source vector database designed for AI applications. It stores embeddings (dense float vectors) alongside documents and metadata, and retrieves the nearest neighbours to a query vector using approximate nearest-neighbour search. Chroma runs embedded (in-process, no server), as a persistent local store, or as a client/server pair. It is the default vector store for many LangChain and LlamaIndex tutorials because it requires zero infrastructure to get started.
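To make "nearest neighbours to a query vector" concrete, here is a minimal pure-Python sketch of an exact (brute-force) search over a tiny store. It is not how Chroma works internally: Chroma uses an approximate index, while this loop compares the query against every stored vector. The two-dimensional vectors and the `nearest` helper are made up for illustration.

```python
from typing import List, Tuple


def squared_l2(a: List[float], b: List[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(
    query: List[float],
    store: List[Tuple[str, List[float], str]],
    n_results: int = 1,
) -> List[Tuple[float, str, str]]:
    """Return the n_results closest (distance, id, document) triples."""
    scored = [(squared_l2(query, vec), doc_id, doc) for doc_id, vec, doc in store]
    scored.sort(key=lambda item: item[0])  # smallest distance first
    return scored[:n_results]


# Hypothetical 2-d "embeddings"; real models produce hundreds of dimensions.
store = [
    ("doc1", [0.9, 0.1], "Python is a high-level language."),
    ("doc2", [0.1, 0.9], "Rust is a systems language."),
]

hits = nearest([0.8, 0.2], store, n_results=1)
print(hits[0][1], hits[0][2])
```

A vector database does the same job, but with an index that avoids scanning every vector and with storage for the documents and metadata alongside.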

Install

pip install chromadb

Output: pip prints its download and install progress and exits 0 on success.

Quick example

import chromadb

client = chromadb.Client()   # in-memory

collection = client.create_collection("my_docs")

collection.add(
    documents=["Python is a high-level language.", "Rust is a systems language."],
    ids=["doc1", "doc2"],
)

results = collection.query(query_texts=["scripting language"], n_results=1)
print(results["documents"])
print(results["distances"])

Output (the exact distance depends on the embedding model and Chroma version):

[['Python is a high-level language.']]
[[0.6321]]

When / why to use it

  • Adding semantic search to an application without deploying a separate database.
  • Storing and querying embeddings in a RAG pipeline alongside LangChain or LlamaIndex.
  • Prototyping: Chroma runs in-process with zero setup; switch to a persistent or server client for production.
  • Metadata-filtered retrieval: combine vector similarity with structured filters (where={"category": "news"}).
  • Multi-tenant systems: one collection per tenant, all in the same Chroma instance.

Common pitfalls

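[!WARNING] Duplicate IDs: add() never overwrites. Passing the same ID twice in one call raises chromadb.errors.DuplicateIDError, and adding an ID that already exists in the collection raises chromadb.errors.IDAlreadyExistsError in some Chroma versions and is silently skipped in others. Use upsert() when you may be re-adding existing documents.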

[!WARNING] Dimension mismatch — all vectors in a collection must have the same dimension. Mixing embedding models (e.g. OpenAI 1536-dim and HuggingFace 768-dim) in one collection raises a dimension error on the second add.

[!WARNING] chromadb.Client() is in-memory only — data is lost when the process exits. Use chromadb.PersistentClient(path="./chroma_db") for data that must survive restarts.

[!TIP] The default embedding function (DefaultEmbeddingFunction) runs all-MiniLM-L6-v2 locally through ONNX Runtime and produces 384-dimensional vectors. It is accurate enough for prototyping and requires no API key. The model files are downloaded on first use, so the first call needs network access.

[!TIP] Pass include=["documents", "metadatas", "distances"] to query() to control what the response contains. Omitting documents saves bandwidth when you only need IDs.
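The duplicate-ID and dimension pitfalls above can be caught before calling add(). The following is a small sketch of such a pre-flight check in plain Python; `validate_batch` and its `expected_dim` parameter are hypothetical helpers, not part of Chroma.

```python
from collections import Counter
from typing import List, Optional, Sequence


def validate_batch(
    ids: Sequence[str],
    embeddings: Sequence[Sequence[float]],
    expected_dim: Optional[int] = None,
) -> List[str]:
    """Return a list of problems; an empty list means the batch looks safe."""
    problems = []

    # IDs repeated inside the same batch.
    dupes = sorted(doc_id for doc_id, n in Counter(ids).items() if n > 1)
    if dupes:
        problems.append(f"duplicate ids in batch: {dupes}")

    # One embedding per ID.
    if len(ids) != len(embeddings):
        problems.append(f"{len(ids)} ids but {len(embeddings)} embeddings")

    # Every vector must share one dimension, matching the collection's.
    dims = {len(vec) for vec in embeddings}
    if len(dims) > 1:
        problems.append(f"mixed dimensions: {sorted(dims)}")
    elif expected_dim is not None and dims and dims != {expected_dim}:
        problems.append(f"expected dimension {expected_dim}, got {sorted(dims)}")

    return problems


print(validate_batch(["a", "a"], [[0.0] * 3, [0.0] * 4]))
```

Running the check on each batch and refusing to call add() when the list is non-empty turns a mid-ingestion failure into an early, readable error.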

Client types

Chroma offers three client modes. Switch modes by changing only the client construction line.

import chromadb

# In-memory — data lost on exit
client = chromadb.Client()

# Persistent — saved to disk, survives restarts
client = chromadb.PersistentClient(path="./chroma_db")

# HTTP client — connects to a running Chroma server
client = chromadb.HttpClient(host="localhost", port=8000)

Start the Chroma server for the HTTP client:

chroma run --path ./chroma_db --port 8000

Output: the server prints a startup banner and keeps running in the foreground until you stop it (Ctrl+C).

Creating and managing collections

A collection groups documents with the same embedding dimension. Collections are created once and retrieved by name on subsequent runs.

import chromadb

client = chromadb.PersistentClient(path="./chroma_db")

# Create (fails if exists)
col = client.create_collection("research_papers")

# Get or create (idempotent)
col = client.get_or_create_collection("research_papers")

# Get existing (raises if missing)
col = client.get_collection("research_papers")

# List all collections
print(client.list_collections())   # names or Collection objects, depending on the Chroma version

# Delete
client.delete_collection("research_papers")

# Collection metadata and distance function
col = client.create_collection(
    "products",
    metadata={"hnsw:space": "cosine"},  # cosine | l2 (default) | ip
)

Adding documents

The add() method stores documents with their embeddings (or lets Chroma embed them) and optional metadata.

import chromadb

client = chromadb.PersistentClient(path="./chroma_db")
col = client.get_or_create_collection("articles")

# Add with auto-embedding (uses default embedding function)
col.add(
    documents=[
        "ChromaDB is an open-source vector database for AI applications.",
        "LangChain is a framework for building LLM-powered pipelines.",
        "PyTorch is a deep learning framework developed by Meta.",
    ],
    ids=["art_001", "art_002", "art_003"],
    metadatas=[
        {"category": "database", "year": 2023},
        {"category": "framework",  "year": 2022},
        {"category": "ml",         "year": 2016},
    ],
)

print(f"Collection count: {col.count()}")

Output:

Collection count: 3

# Add with pre-computed embeddings (skips the embedding step)
import numpy as np

col.add(
    embeddings=np.random.rand(2, 384).tolist(),   # must match collection dimension
    documents=["Document A", "Document B"],
    ids=["doc_a", "doc_b"],
)
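For large ingests it is common to call add() in chunks rather than with one huge list. The sketch below shows one way to slice parallel lists into batches; `batched_records` and the default `batch_size` of 100 are assumptions for illustration, and the right size depends on your Chroma version and document length.

```python
from typing import Iterator, List, Sequence, Tuple


def batched_records(
    ids: Sequence[str],
    documents: Sequence[str],
    batch_size: int = 100,
) -> Iterator[Tuple[List[str], List[str]]]:
    """Yield (ids, documents) slices of at most batch_size items each."""
    if len(ids) != len(documents):
        raise ValueError("ids and documents must have the same length")
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(ids), batch_size):
        end = start + batch_size
        yield list(ids[start:end]), list(documents[start:end])


all_ids = [f"doc_{i}" for i in range(5)]
all_docs = ["a", "b", "c", "d", "e"]
batches = list(batched_records(all_ids, all_docs, batch_size=2))
print([len(batch_ids) for batch_ids, _ in batches])

# With a real collection, each batch would be passed to add():
# for batch_ids, batch_docs in batched_records(all_ids, all_docs):
#     col.add(ids=batch_ids, documents=batch_docs)
```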

Querying

query() takes one or more query texts (or pre-computed query embeddings) and returns the n_results nearest neighbours.

import chromadb

client = chromadb.PersistentClient(path="./chroma_db")
col = client.get_or_create_collection("articles")

results = col.query(
    query_texts=["vector similarity search database"],
    n_results=2,
    include=["documents", "metadatas", "distances"],
)

for doc, meta, dist in zip(
    results["documents"][0],
    results["metadatas"][0],
    results["distances"][0],
):
    print(f"[{dist:.4f}] {doc[:60]} | {meta}")

Output:

[0.2841] ChromaDB is an open-source vector database for AI applic | {'category': 'database', 'year': 2023}
[0.5912] LangChain is a framework for building LLM-powered pipeli | {'category': 'framework', 'year': 2022}

# Batch queries (multiple query texts at once)
results = col.query(
    query_texts=["machine learning", "database storage"],
    n_results=1,
)
print(results["documents"])   # list of lists, one per query

Output:

[['PyTorch is a deep learning framework developed by Meta.'],
 ['ChromaDB is an open-source vector database for AI applications.']]
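Because every field in a query() result is a list of lists (one inner list per query text), it can help to flatten the response into one record per hit. The helper below is a sketch that works on any dict with that shape; `flatten_hits` is a hypothetical name, and it assumes that fields left out of include= come back as None.

```python
from typing import Any, Dict, List


def flatten_hits(results: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Turn a query()-shaped dict into a flat list of per-hit records."""
    hits = []
    for q, ids in enumerate(results["ids"]):
        # Fall back to None placeholders for fields that were not included.
        docs = results["documents"][q] if results.get("documents") else [None] * len(ids)
        dists = results["distances"][q] if results.get("distances") else [None] * len(ids)
        for doc_id, doc, dist in zip(ids, docs, dists):
            hits.append({"query": q, "id": doc_id, "document": doc, "distance": dist})
    return hits


# A hand-written dict mimicking the batch query above (distances not included).
sample = {
    "ids": [["art_003"], ["art_001"]],
    "documents": [
        ["PyTorch is a deep learning framework developed by Meta."],
        ["ChromaDB is an open-source vector database for AI applications."],
    ],
    "distances": None,
}

flat = flatten_hits(sample)
for hit in flat:
    print(hit["query"], hit["id"])
```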

Metadata filters: where and where_document

where= filters by document metadata before scoring; where_document= filters by document text content. Both use a MongoDB-style operator dict.

import chromadb

client = chromadb.PersistentClient(path="./chroma_db")
col = client.get_or_create_collection("articles")

# Exact match
results = col.query(
    query_texts=["vector database"],
    n_results=5,
    where={"category": "database"},
)

# Numeric comparison
results = col.query(
    query_texts=["deep learning"],
    n_results=5,
    where={"year": {"$gte": 2020}},
)

# Multiple conditions (AND)
results = col.query(
    query_texts=["framework"],
    n_results=5,
    where={"$and": [{"category": "framework"}, {"year": {"$gte": 2022}}]},
)

# OR
results = col.query(
    query_texts=["model"],
    n_results=5,
    where={"$or": [{"category": "database"}, {"category": "ml"}]},
)

# Text content filter
results = col.query(
    query_texts=["pipeline"],
    n_results=5,
    where_document={"$contains": "LangChain"},
)

print(results["ids"])

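Supported where operators: $eq, $ne, $gt, $gte, $lt, $lte, $in, $nin, $and, $or. where_document accepts $contains and $not_contains, which can also be combined with $and and $or.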
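The semantics of the where dict can be sketched as a small recursive evaluator over a single metadata record. This is an illustration of how the operators compose, not Chroma's implementation; in particular, treating a missing key as "no match" for every operator is an assumption of this sketch.

```python
import operator
from typing import Any, Dict

# Comparison operators applied as op(metadata_value, operand).
_COMPARATORS = {
    "$eq": operator.eq,
    "$ne": operator.ne,
    "$gt": operator.gt,
    "$gte": operator.ge,
    "$lt": operator.lt,
    "$lte": operator.le,
    "$in": lambda value, options: value in options,
    "$nin": lambda value, options: value not in options,
}


def matches(metadata: Dict[str, Any], where: Dict[str, Any]) -> bool:
    """Return True if one metadata record satisfies a where-style filter."""
    for key, condition in where.items():
        if key == "$and":
            if not all(matches(metadata, sub) for sub in condition):
                return False
        elif key == "$or":
            if not any(matches(metadata, sub) for sub in condition):
                return False
        else:
            if key not in metadata:
                return False  # assumption: missing keys never match
            if not isinstance(condition, dict):
                condition = {"$eq": condition}  # {"k": v} is shorthand for $eq
            for op, operand in condition.items():
                if not _COMPARATORS[op](metadata[key], operand):
                    return False
    return True


rows = [
    {"category": "database", "year": 2023},
    {"category": "framework", "year": 2022},
    {"category": "ml", "year": 2016},
]
recent_frameworks = {"$and": [{"category": "framework"}, {"year": {"$gte": 2022}}]}
print([row for row in rows if matches(row, recent_frameworks)])
```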

Upsert: add or update

upsert() inserts new documents and updates existing ones by ID. Use it when your ingestion pipeline may re-process the same source documents.

import chromadb

client = chromadb.PersistentClient(path="./chroma_db")
col = client.get_or_create_collection("wiki")

# First run — inserts
col.upsert(documents=["Python was created by Guido van Rossum."], ids=["py_001"])

# Second run — updates in place (same ID, new content)
col.upsert(documents=["Python was created by Guido van Rossum in 1991."], ids=["py_001"])

print(col.get(ids=["py_001"])["documents"])

Output:

['Python was created by Guido van Rossum in 1991.']
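Upsert only updates in place if a re-processed document maps to the same ID as before. One way to get that, sketched below, is to derive the ID from where the text came from rather than from its content. The `stable_id` helper, the source path, and the 16-character digest length are all hypothetical choices.

```python
import hashlib


def stable_id(source: str, chunk_index: int, prefix: str = "doc") -> str:
    """Derive a deterministic ID from a document's source and chunk position."""
    key = f"{source}#{chunk_index}".encode("utf-8")
    digest = hashlib.sha256(key).hexdigest()[:16]
    return f"{prefix}_{digest}"


first_run = stable_id("wiki/python.md", 0)
second_run = stable_id("wiki/python.md", 0)   # same source, edited content
other_chunk = stable_id("wiki/python.md", 1)

print(first_run == second_run)   # same ID, so upsert() overwrites the old entry
print(first_run == other_chunk)  # different chunk, different ID
```

Hashing the content instead would give every edit a new ID, so the stale version would remain in the collection next to the new one.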

Update and delete

import chromadb

client = chromadb.PersistentClient(path="./chroma_db")
col = client.get_or_create_collection("articles")

# Update metadata only
col.update(ids=["art_001"], metadatas=[{"category": "database", "year": 2024}])

# Update document and metadata
col.update(
    ids=["art_002"],
    documents=["LangChain builds LLM-powered applications and agents."],
    metadatas=[{"category": "framework", "year": 2024}],
)

# Delete by ID
col.delete(ids=["art_003"])

# Delete by metadata filter
col.delete(where={"year": {"$lt": 2020}})

print(f"Remaining: {col.count()}")

Embedding functions

Chroma’s embedding functions convert raw text to vectors. Swap them at collection creation time.

import chromadb
from chromadb.utils import embedding_functions

# Default (all-MiniLM-L6-v2, local, no API key)
ef = embedding_functions.DefaultEmbeddingFunction()

# OpenAI
ef = embedding_functions.OpenAIEmbeddingFunction(
    api_key="sk-...",
    model_name="text-embedding-3-small",
)

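# Sentence Transformers (local, any sentence-transformers model;
# requires pip install sentence-transformers).
# HuggingFaceEmbeddingFunction, by contrast, calls the hosted Hugging Face API and needs an api_key.
ef = embedding_functions.SentenceTransformerEmbeddingFunction(
    model_name="BAAI/bge-base-en-v1.5",
)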

# Google Generative AI
ef = embedding_functions.GoogleGenerativeAiEmbeddingFunction(
    api_key="...",
    model_name="models/text-embedding-004",
)

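# Attach the chosen embedding function when creating or opening the collection
client = chromadb.PersistentClient(path="./chroma_db")
col = client.get_or_create_collection("docs", embedding_function=ef)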

[!WARNING] Pass the same embedding function to every get_collection() or get_or_create_collection() call. Older Chroma versions do not persist it, and a collection opened without it falls back to the default embedding function, whose 384-dimensional vectors will not match a collection built with a different model.
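An embedding function is, at its core, a callable that maps a list of texts to a list of equal-length float vectors. The toy class below sketches that contract with a hashing trick so it runs without any model or API key. It is an assumption-laden illustration: `HashingEmbeddingFunction` is hypothetical, its vectors carry no real semantics, and recent Chroma versions may require subclassing chromadb.EmbeddingFunction and implementing extra methods before a custom function can be attached to a collection.

```python
import hashlib
import math
from typing import List


class HashingEmbeddingFunction:
    """Toy embedding: hash each token into one of `dim` buckets, then L2-normalise."""

    def __init__(self, dim: int = 64) -> None:
        self.dim = dim

    def __call__(self, input: List[str]) -> List[List[float]]:
        # The parameter is named `input` to mirror the callable shape Chroma expects.
        vectors = []
        for text in input:
            vec = [0.0] * self.dim
            for token in text.lower().split():
                digest = hashlib.md5(token.encode("utf-8")).hexdigest()
                vec[int(digest, 16) % self.dim] += 1.0
            norm = math.sqrt(sum(v * v for v in vec)) or 1.0  # avoid dividing by zero
            vectors.append([v / norm for v in vec])
        return vectors


ef = HashingEmbeddingFunction(dim=32)
vecs = ef(["vector database", "vector database", "deep learning"])
print(len(vecs), len(vecs[0]))
```

Whatever function is used, the fixed output dimension is what ties it to a collection: every add() and query() must go through a function that produces vectors of that same length.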

LangChain integration

from langchain_chroma import Chroma
from langchain_openai import OpenAIEmbeddings
from langchain_core.documents import Document
import os

embeddings = OpenAIEmbeddings(
    model="text-embedding-3-small",
    api_key=os.environ["OPENAI_API_KEY"],
)

# Create vectorstore from documents
docs = [
    Document(page_content="ChromaDB stores embeddings.", metadata={"source": "chroma_docs"}),
    Document(page_content="LangChain builds LLM chains.", metadata={"source": "lc_docs"}),
]
vectorstore = Chroma.from_documents(
    docs,
    embedding=embeddings,
    persist_directory="./chroma_lc",
)

# As a retriever
retriever = vectorstore.as_retriever(search_kwargs={"k": 2})
results = retriever.invoke("vector similarity")
for doc in results:
    print(doc.page_content)

Output:

ChromaDB stores embeddings.
LangChain builds LLM chains.

LlamaIndex integration

from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, StorageContext
from llama_index.vector_stores.chroma import ChromaVectorStore
import chromadb

chroma_client = chromadb.PersistentClient(path="./chroma_li")
collection = chroma_client.get_or_create_collection("research")

vector_store = ChromaVectorStore(chroma_collection=collection)
storage_ctx  = StorageContext.from_defaults(vector_store=vector_store)

docs  = SimpleDirectoryReader("./docs").load_data()
index = VectorStoreIndex.from_documents(docs, storage_context=storage_ctx)

engine = index.as_query_engine()
print(engine.query("What is multi-head attention?"))

Distance functions

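| Value | Meaning | Best for |
| --- | --- | --- |
| l2 (default) | Squared Euclidean distance; smaller = more similar | Unnormalised embeddings |
| cosine | Cosine distance; 0 = identical, 2 = opposite | Normalised/sentence embeddings |
| ip | Inner product; a larger product = more similar (the reported distance is 1 - product, so smaller is still closer) | When vectors are already normalised |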

Set at collection creation: metadata={"hnsw:space": "cosine"}.
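The three spaces can be written out in a few lines of plain Python. This sketch assumes the conventions of the underlying HNSW index, where l2 is the squared Euclidean distance and ip is reported as 1 minus the dot product so that smaller always means closer; check the distances your Chroma version returns before relying on exact values.

```python
import math
from typing import Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def l2_squared(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance (no square root taken)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def inner_product_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """1 - dot product; only bounded when vectors are normalised."""
    return 1.0 - dot(a, b)


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """1 - cosine similarity; ranges from 0 (same direction) to 2 (opposite)."""
    return 1.0 - dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


east, north, west = [1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]
print(cosine_distance(east, east), cosine_distance(east, north), cosine_distance(east, west))
print(l2_squared(east, north), inner_product_distance(east, east))
```

For unit-length vectors, cosine and ip produce the same ranking, which is why ip is only a sensible choice once embeddings are normalised.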

Quick reference

| Task | Code |
| --- | --- |
| In-memory client | chromadb.Client() |
| Persistent client | chromadb.PersistentClient(path="./dir") |
| HTTP client | chromadb.HttpClient(host="host", port=8000) |
| Get or create | client.get_or_create_collection("name") |
| Add documents | col.add(documents=[...], ids=[...], metadatas=[...]) |
| Add embeddings | col.add(embeddings=[[...]], documents=[...], ids=[...]) |
| Query | col.query(query_texts=["..."], n_results=5) |
| Metadata filter | col.query(..., where={"key": "value"}) |
| Text filter | col.query(..., where_document={"$contains": "word"}) |
| Upsert | col.upsert(documents=[...], ids=[...]) |
| Update | col.update(ids=[...], documents=[...], metadatas=[...]) |
| Delete by ID | col.delete(ids=["id1"]) |
| Delete by filter | col.delete(where={"year": {"$lt": 2020}}) |
| Count | col.count() |
| Get by ID | col.get(ids=["id1"]) |
| Cosine distance | create_collection("name", metadata={"hnsw:space": "cosine"}) |
| OpenAI embeddings | OpenAIEmbeddingFunction(api_key=..., model_name="text-embedding-3-small") |
| Local sentence-transformers embeddings | SentenceTransformerEmbeddingFunction(model_name="BAAI/bge-base-en-v1.5") |
| LangChain store | Chroma.from_documents(docs, embedding=embeddings, persist_directory="./dir") |