Building a RAG Pipeline with LangChain and PostgreSQL
Retrieval-Augmented Generation lets an LLM answer questions about your data without fine-tuning. Instead of retraining the model, you retrieve relevant documents and include them in the prompt context. The LLM generates an answer grounded in your actual data.
This guide builds a production RAG pipeline using LangChain, OpenAI embeddings, and PostgreSQL with pgvector — no Pinecone or external vector database required.
Architecture
Indexing:  Documents → Chunk → Embed → Store in pgvector
Querying:  User Query → Embed → Similarity Search → Top-K Results → LLM Prompt → Answer
Two phases: indexing (process documents once) and querying (search and generate on each request).
Setup
pip install langchain langchain-openai langchain-community pgvector psycopg2-binary
Enable pgvector in PostgreSQL:
CREATE EXTENSION IF NOT EXISTS vector;
CREATE TABLE documents (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    content TEXT NOT NULL,
    metadata JSONB DEFAULT '{}'::jsonb,
    embedding vector(1536)
);

CREATE INDEX ON documents USING ivfflat (embedding vector_cosine_ops)
    WITH (lists = 100);
The vector(1536) column matches the dimensionality of OpenAI text-embedding-3-small embeddings. The IVFFlat index enables fast approximate nearest-neighbor search; because IVFFlat builds its cluster centroids from the rows present when the index is created, create it after loading your data (or reindex afterward) for good recall.
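The distance function behind vector_cosine_ops is plain cosine distance, which pgvector exposes through its <=> operator. A pure-Python sketch of what the database computes for each row:

```python
import math

def cosine_distance(a: list[float], b: list[float]) -> float:
    """Cosine distance as pgvector's vector_cosine_ops defines it:
    1 - (a · b) / (|a| * |b|). 0 = same direction, 1 = orthogonal, 2 = opposite."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (norm_a * norm_b)

print(cosine_distance([1.0, 0.0], [2.0, 0.0]))  # 0.0 (same direction)
print(cosine_distance([1.0, 0.0], [0.0, 1.0]))  # 1.0 (orthogonal)
```

The IVFFlat index avoids computing this against every row by clustering vectors into `lists` buckets and scanning only the nearest buckets.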
Document Chunking
Large documents must be split into chunks that fit within the LLM context window:
from langchain.text_splitter import RecursiveCharacterTextSplitter
splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    separators=["\n\n", "\n", ". ", " ", ""],
)
chunks = splitter.split_text(document_text)
Key parameters:
- chunk_size=1000: Each chunk is approximately 1000 characters (~250 tokens)
- chunk_overlap=200: Adjacent chunks share 200 characters to preserve context across boundaries
- separators: Split on paragraph breaks first, then sentences, then words
Chunk size is a tradeoff. Smaller chunks give more precise retrieval but lose context. Larger chunks preserve context but reduce precision. 1000 characters works well for most use cases.
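To see what chunk_size and chunk_overlap actually do, here is a deliberately naive fixed-window chunker in plain Python. RecursiveCharacterTextSplitter additionally prefers the separators listed above, so its real boundaries land on paragraphs and sentences where possible; the windowing idea is the same:

```python
def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 200) -> list[str]:
    """Naive fixed-window chunker: each chunk starts chunk_size - overlap
    characters after the previous one, so adjacent chunks share `overlap` chars."""
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, max(len(text) - overlap, 1), step)]

doc = "".join(str(i % 10) for i in range(2500))  # 2500-char toy document
chunks = chunk_text(doc, chunk_size=1000, overlap=200)
print([len(c) for c in chunks])  # [1000, 1000, 900]
```

The 200 shared characters mean a sentence that straddles a boundary appears whole in at least one chunk, which is exactly why overlap helps retrieval.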
Embedding and Storage
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import PGVector
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
CONNECTION_STRING = "postgresql://user:pass@localhost:5432/mydb"
vector_store = PGVector(
    connection_string=CONNECTION_STRING,
    embedding_function=embeddings,
    collection_name="documents",
)

# Index documents
vector_store.add_texts(
    texts=chunks,
    metadatas=[{"source": "api-docs", "chunk": i} for i in range(len(chunks))],
)
Each chunk is embedded into a 1536-dimensional vector and stored in PostgreSQL. Metadata (source, chunk index) is stored alongside for filtering.
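Conceptually, the vector store is doing nothing more exotic than the in-memory sketch below. The toy_embed function is a made-up letter-count embedding purely for illustration; PGVector does the ranking in SQL with an index instead of a Python sort:

```python
import math

def toy_embed(text: str) -> list[float]:
    """Made-up embedding for illustration only: letter-frequency counts."""
    return [float(text.lower().count(c)) for c in "abcdefghijklmnopqrstuvwxyz"]

def cosine_sim(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

class ToyVectorStore:
    """In-memory stand-in for PGVector: store (embedding, text, metadata)
    rows and rank them by cosine similarity at query time."""

    def __init__(self, embed):
        self.embed = embed  # function: str -> vector
        self.rows = []      # (embedding, text, metadata) tuples

    def add_texts(self, texts, metadatas):
        for text, meta in zip(texts, metadatas):
            self.rows.append((self.embed(text), text, meta))

    def similarity_search(self, query, k=4):
        q = self.embed(query)
        ranked = sorted(self.rows, key=lambda row: cosine_sim(q, row[0]), reverse=True)
        return [(text, meta) for _, text, meta in ranked[:k]]

store = ToyVectorStore(toy_embed)
store.add_texts(
    ["alpha beta", "gamma delta", "beta beta"],
    [{"chunk": 0}, {"chunk": 1}, {"chunk": 2}],
)
print(store.similarity_search("beta", k=2))
```

Swapping toy_embed for a real embedding model and the Python sort for an indexed SQL query gives you the production pipeline.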
Retrieval
from langchain_openai import ChatOpenAI
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate
retriever = vector_store.as_retriever(
    search_type="similarity",
    search_kwargs={"k": 4},
)
template = """Use the following context to answer the question.
If the context does not contain the answer, say "I don't have enough information to answer that."

Context:
{context}

Question: {question}

Answer:"""
prompt = PromptTemplate(template=template, input_variables=["context", "question"])
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=retriever,
    chain_type_kwargs={"prompt": prompt},
    return_source_documents=True,
)
result = qa_chain.invoke({"query": "How do I authenticate API requests?"})
print(result["result"])
print(result["source_documents"])
The retriever finds the 4 most similar chunks to the user question, includes them as context in the prompt, and the LLM generates an answer grounded in that context.
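The chain_type="stuff" strategy is simply string assembly: every retrieved chunk is concatenated ("stuffed") into the {context} slot of the prompt template. A minimal sketch of that step, using invented example chunks:

```python
def build_prompt(template: str, chunks: list[str], question: str) -> str:
    """The "stuff" strategy: join all retrieved chunks into one context
    string and substitute it into the prompt template."""
    context = "\n\n".join(chunks)
    return template.format(context=context, question=question)

template = """Use the following context to answer the question.
If the context does not contain the answer, say "I don't have enough information to answer that."

Context:
{context}

Question: {question}

Answer:"""

prompt = build_prompt(
    template,
    chunks=["Use the Authorization header.", "Tokens expire after 24 hours."],
    question="How do I authenticate API requests?",
)
print(prompt)
```

This is also why chunk_size matters: with k=4 chunks of ~1000 characters each, roughly 1000 tokens of context get stuffed into every request.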
Improving Retrieval Quality
Diversifying Results with MMR
True hybrid search combines vector similarity with keyword matching (for example, PostgreSQL full-text search alongside pgvector). A simpler improvement that is built into the retriever is Maximal Marginal Relevance (MMR) re-ranking:
retriever = vector_store.as_retriever(
    search_type="mmr",  # Maximal Marginal Relevance
    search_kwargs={
        "k": 4,
        "fetch_k": 20,       # fetch 20 candidates, re-rank down to 4
        "lambda_mult": 0.7,  # balance relevance (1.0) against diversity (0.0)
    },
)
MMR prevents returning 4 near-identical chunks. It fetches 20 candidates, then selects 4 that are both relevant and diverse.
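The MMR selection itself is a short greedy loop. A sketch of the standard algorithm (LangChain's internal scoring may differ in detail):

```python
import math

def cosine_sim(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

def mmr_select(query_vec, candidates, k=4, lambda_mult=0.7):
    """Greedy MMR: repeatedly pick the candidate maximizing
    lambda * relevance-to-query - (1 - lambda) * max-similarity-to-already-picked."""
    selected, remaining = [], list(range(len(candidates)))

    def score(i):
        relevance = cosine_sim(query_vec, candidates[i])
        redundancy = max(
            (cosine_sim(candidates[i], candidates[j]) for j in selected),
            default=0.0,
        )
        return lambda_mult * relevance - (1 - lambda_mult) * redundancy

    while remaining and len(selected) < k:
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected

# Two duplicate candidates plus one orthogonal one: plain top-2 similarity
# would return both duplicates; MMR keeps one and adds the diverse candidate.
picked = mmr_select([1.0, 0.2], [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]], k=2, lambda_mult=0.5)
print(picked)  # [0, 2]
```

With lambda_mult=1.0 the redundancy term vanishes and this degenerates to plain similarity ranking, which is why 0.5–0.7 is the useful range.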
Metadata Filtering
retriever = vector_store.as_retriever(
    search_kwargs={
        "k": 4,
        "filter": {"source": "api-docs"},
    },
)
If you have documents from multiple sources, filter to the relevant source before searching. This dramatically improves precision.
Production Considerations
Embedding costs. text-embedding-3-small costs $0.02 per million tokens. A 100-page document has approximately 25,000 tokens, so it costs about $0.0005 to embed. At query time only the question itself is embedded (typically a few dozen tokens), so per-query embedding cost is negligible; the LLM call dominates.
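The arithmetic, using the figures above:

```python
PRICE_PER_MTOK = 0.02  # USD per million tokens, text-embedding-3-small

def embedding_cost_usd(tokens: int) -> float:
    """Cost of embedding `tokens` tokens at $0.02 per million."""
    return tokens * PRICE_PER_MTOK / 1_000_000

# One 100-page document at ~250 tokens per page
print(f"${embedding_cost_usd(100 * 250):.4f}")  # $0.0005
```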
Caching. Cache frequent queries and their results. A Redis layer between the user and the RAG pipeline can reduce embedding API calls by 60–80%.
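A Redis deployment is out of scope here, but the caching idea fits in a few lines. This in-memory stand-in keys answers by a hash of the normalized query; in production you would swap the dict for Redis GET/SET with a TTL:

```python
import hashlib

class QueryCache:
    """Minimal stand-in for a Redis layer: memoize answers by a normalized
    hash of the query so repeat questions skip embedding, retrieval, and the LLM."""

    def __init__(self):
        self._store = {}

    @staticmethod
    def _key(query: str) -> str:
        # Collapse whitespace and case so trivial variants share one entry
        normalized = " ".join(query.lower().split())
        return hashlib.sha256(normalized.encode()).hexdigest()

    def get(self, query: str):
        return self._store.get(self._key(query))

    def set(self, query: str, answer: str):
        self._store[self._key(query)] = answer

cache = QueryCache()
cache.set("How do I authenticate API requests?", "Use the Authorization header.")
print(cache.get("how do I  authenticate API requests?"))  # cache hit despite variants
```

Exact-match caching only helps with literally repeated questions; semantic caching (matching on embedding similarity) catches paraphrases but adds its own failure modes.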
Chunking strategy matters more than model choice. Bad chunks fed to GPT-4 produce worse answers than good chunks fed to GPT-4o-mini. Invest time in chunking.
Evaluation. Measure retrieval quality with precision@k: of the top-k retrieved chunks, how many are actually relevant to the question? A precision@4 below 0.5 means your chunking or embedding strategy needs work.
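precision@k is a one-liner to compute once you have relevance judgments for a set of test questions:

```python
def precision_at_k(retrieved_ids: list[str], relevant_ids: set[str], k: int = 4) -> float:
    """Fraction of the top-k retrieved chunks that are actually relevant."""
    top_k = retrieved_ids[:k]
    if not top_k:
        return 0.0
    hits = sum(1 for chunk_id in top_k if chunk_id in relevant_ids)
    return hits / len(top_k)

# 2 of the top 4 retrieved chunks are relevant -> 0.5, the "needs work" threshold
print(precision_at_k(["c1", "c7", "c3", "c9"], {"c1", "c3", "c5"}, k=4))  # 0.5
```

Averaging this over a few dozen hand-labeled questions gives a cheap regression test for any chunking or embedding change.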
Takeaways
RAG is not magic — it is a search problem with a language model on top. The quality of your RAG pipeline depends on three things: chunk quality, embedding relevance, and prompt design. PostgreSQL with pgvector eliminates the need for a separate vector database, keeping your infrastructure simple.
Start with a basic pipeline, measure retrieval precision, and iterate on the chunking strategy before upgrading the model.