Building a RAG Pipeline with LangChain and PostgreSQL

· 14 min read · AI & Machine Learning

Build a production RAG pipeline using LangChain, OpenAI embeddings, and PostgreSQL with pgvector for document question-answering.

Retrieval-Augmented Generation lets an LLM answer questions about your data without fine-tuning. Instead of retraining the model, you retrieve relevant documents and include them in the prompt context. The LLM generates an answer grounded in your actual data.

This guide builds a production RAG pipeline using LangChain, OpenAI embeddings, and PostgreSQL with pgvector — no Pinecone or external vector database required.

Architecture

Documents → Chunk → Embed → Store in pgvector
                                    ↓
User Query → Embed → Similarity Search → Top-K Results → LLM Prompt → Answer

Two phases: indexing (process documents once) and querying (search and generate on each request).

Setup

pip install langchain langchain-openai langchain-community pgvector psycopg2-binary

Enable pgvector in PostgreSQL:

CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE documents (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  content TEXT NOT NULL,
  metadata JSONB DEFAULT '{}'::jsonb,
  embedding vector(1536)
);

CREATE INDEX ON documents USING ivfflat (embedding vector_cosine_ops)
WITH (lists = 100);

The vector(1536) column matches the 1536 dimensions of OpenAI's text-embedding-3-small embeddings. The IVFFlat index enables fast approximate nearest-neighbor search; for good recall, build it after the table is populated so the clusters reflect real data. Note that LangChain's PGVector store manages its own tables (langchain_pg_collection and langchain_pg_embedding), so the schema above is what you would use when querying pgvector directly.

Document Chunking

Large documents must be split into chunks small enough that the top-k retrieved results fit together within the LLM context window:

from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    separators=["\n\n", "\n", ". ", " ", ""],
)

chunks = splitter.split_text(document_text)  # document_text: the raw text of the document to index

Key parameters:

  • chunk_size=1000: Each chunk is approximately 1000 characters (~250 tokens)
  • chunk_overlap=200: Adjacent chunks share 200 characters to preserve context across boundaries
  • separators: Split on paragraph breaks first, then sentences, then words

Chunk size is a tradeoff. Smaller chunks give more precise retrieval but lose context. Larger chunks preserve context but reduce precision. 1000 characters works well for most use cases.
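To make the overlap concrete, here is a deliberately simplified sliding-window chunker. It is a sketch of the mechanics only, not LangChain's actual implementation, which tries the separator hierarchy before falling back to hard character cuts:

```python
def chunk_text(text: str, chunk_size: int = 1000, chunk_overlap: int = 200) -> list[str]:
    """Naive chunker: each chunk starts chunk_size - chunk_overlap
    characters after the previous one, so adjacent chunks share
    chunk_overlap characters."""
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

chunks = chunk_text("".join(str(i % 10) for i in range(2500)))
print([len(c) for c in chunks])             # [1000, 1000, 900, 100]
print(chunks[1][:200] == chunks[0][-200:])  # True: the 200-character overlap
```

The overlap is what preserves a sentence that would otherwise be cut in half at a chunk boundary: both neighboring chunks contain it in full.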

Embedding and Storage

from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import PGVector

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

CONNECTION_STRING = "postgresql://user:pass@localhost:5432/mydb"

vector_store = PGVector(
    connection_string=CONNECTION_STRING,
    embedding_function=embeddings,
    collection_name="documents",
)

# Index documents
vector_store.add_texts(
    texts=chunks,
    metadatas=[{"source": "api-docs", "page": i} for i in range(len(chunks))],
)

Each chunk is embedded into a 1536-dimensional vector and stored in PostgreSQL. Metadata (here a source name and the chunk index) is stored alongside for filtering.
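Under the hood, similarity search ranks the stored vectors by cosine similarity to the query vector. A toy version with 3-dimensional vectors in place of 1536 (the names and vectors are illustrative, not a pgvector API):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

def top_k(query: list[float], docs: dict[str, list[float]], k: int = 4) -> list[str]:
    """Rank stored document vectors by similarity to the query vector."""
    ranked = sorted(docs, key=lambda d: cosine_similarity(query, docs[d]), reverse=True)
    return ranked[:k]

docs = {
    "auth-chunk":    [0.9, 0.1, 0.0],
    "billing-chunk": [0.1, 0.9, 0.0],
    "intro-chunk":   [0.5, 0.5, 0.1],
}
print(top_k([1.0, 0.0, 0.0], docs, k=2))  # ['auth-chunk', 'intro-chunk']
```

pgvector does the same ranking in SQL with the `<=>` cosine-distance operator, using the IVFFlat index to avoid scanning every row.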

Retrieval

from langchain_openai import ChatOpenAI
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate

retriever = vector_store.as_retriever(
    search_type="similarity",
    search_kwargs={"k": 4},
)

template = """Use the following context to answer the question. 
If the context does not contain the answer, say "I don't have enough information to answer that."

Context:
{context}

Question: {question}

Answer:"""

prompt = PromptTemplate(template=template, input_variables=["context", "question"])

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=retriever,
    chain_type_kwargs={"prompt": prompt},
    return_source_documents=True,
)

result = qa_chain.invoke({"query": "How do I authenticate API requests?"})
print(result["result"])
print(result["source_documents"])

The retriever finds the 4 most similar chunks to the user question, includes them as context in the prompt, and the LLM generates an answer grounded in that context.
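The chain_type="stuff" setting means the chain simply "stuffs" every retrieved chunk into the {context} slot of the template. Roughly what happens per request, with a hypothetical helper name (the real chain adds document objects, callbacks, and serialization):

```python
TEMPLATE = """Use the following context to answer the question.
If the context does not contain the answer, say "I don't have enough information to answer that."

Context:
{context}

Question: {question}

Answer:"""

def build_stuff_prompt(chunks: list[str], question: str) -> str:
    """Concatenate the top-k chunks with blank lines and fill the template."""
    return TEMPLATE.format(context="\n\n".join(chunks), question=question)

prompt = build_stuff_prompt(
    ["API requests are authenticated with a Bearer token.",
     "Tokens expire after 24 hours."],
    "How do I authenticate API requests?",
)
print(prompt)
```

This is also why chunk size matters: four 1000-character chunks add roughly 1000 tokens of context to every request.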

Improving Retrieval Quality

Diverse Results with MMR

Re-rank vector search results for diversity using Maximal Marginal Relevance:

retriever = vector_store.as_retriever(
    search_type="mmr",  # Maximal Marginal Relevance
    search_kwargs={
        "k": 4,
        "fetch_k": 20,  # Fetch 20, re-rank to 4
        "lambda_mult": 0.7,  # Balance relevance and diversity
    },
)

MMR prevents returning 4 near-identical chunks. It fetches 20 candidates, then selects 4 that are both relevant and diverse.
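A common greedy formulation of MMR scores each candidate as λ·relevance minus (1−λ)·similarity to the chunks already selected. The sketch below uses illustrative helpers (`mmr_select`, `cos`), not LangChain APIs, and a deliberately low λ to exaggerate the diversity effect on toy 2-dimensional vectors:

```python
import math

def cos(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

def mmr_select(query, candidates: dict, k: int = 4, lambda_mult: float = 0.7) -> list[str]:
    """Greedy MMR: repeatedly pick the candidate maximizing
    lambda * relevance - (1 - lambda) * max similarity to already-selected."""
    selected: list[str] = []
    remaining = dict(candidates)
    while remaining and len(selected) < k:
        def score(doc_id):
            relevance = cos(query, remaining[doc_id])
            redundancy = max((cos(remaining[doc_id], candidates[s]) for s in selected),
                             default=0.0)
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        del remaining[best]
    return selected

cands = {"a": [1.0, 0.0], "near-dup-of-a": [1.0, 0.01], "diverse": [0.6, 0.8]}
# Plain similarity would return "a" and its near duplicate;
# MMR penalizes the duplicate and swaps in the diverse chunk.
print(mmr_select([1.0, 0.0], cands, k=2, lambda_mult=0.3))  # ['a', 'diverse']
```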

Metadata Filtering

retriever = vector_store.as_retriever(
    search_kwargs={
        "k": 4,
        "filter": {"source": "api-docs"},
    },
)

If you have documents from multiple sources, filter to the relevant source before searching. This dramatically improves precision.

Production Considerations

Embedding costs. text-embedding-3-small costs $0.02 per million tokens. A 100-page document is roughly 25,000 tokens, so it costs about $0.0005 to embed. Query-time embedding is cheaper still: each request embeds only the question itself, typically a few dozen tokens.
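The arithmetic as a helper, using the price quoted above (verify against current OpenAI pricing before budgeting):

```python
PRICE_PER_MILLION_TOKENS = 0.02  # text-embedding-3-small, USD

def embedding_cost_usd(num_tokens: int) -> float:
    """Cost of embedding num_tokens at the per-million-token rate."""
    return num_tokens / 1_000_000 * PRICE_PER_MILLION_TOKENS

print(embedding_cost_usd(25_000))      # a ~100-page document: about $0.0005
print(embedding_cost_usd(30))          # a typical query: effectively free
```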

Caching. Cache frequent queries and their results. A Redis layer between the user and the RAG pipeline can reduce embedding API calls by 60–80%.
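A minimal sketch of the caching idea, using an in-memory dict in place of Redis (in production you would use a Redis client with a TTL; `answer_with_cache` and `fake_pipeline` are hypothetical names):

```python
import hashlib

cache: dict[str, str] = {}  # in-memory stand-in for a Redis instance

def cache_key(query: str) -> str:
    # Normalize so trivially different phrasings of the same query share an entry.
    return hashlib.sha256(query.strip().lower().encode("utf-8")).hexdigest()

def answer_with_cache(query: str, run_pipeline) -> str:
    key = cache_key(query)
    if key not in cache:
        cache[key] = run_pipeline(query)  # only cache misses hit the RAG pipeline
    return cache[key]

calls = []
def fake_pipeline(q: str) -> str:
    calls.append(q)
    return "Use a Bearer token."

answer_with_cache("How do I authenticate?", fake_pipeline)
answer_with_cache("  how do I authenticate?  ", fake_pipeline)
print(len(calls))  # 1: the second, normalized-identical query is served from cache
```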

Chunking strategy matters more than the LLM model. Bad chunks with GPT-4 produce worse results than good chunks with GPT-4o-mini. Invest time in chunking.

Evaluation. Measure retrieval quality with precision@k: of the top-k retrieved chunks, how many are actually relevant to the question? A precision@4 below 0.5 means your chunking or embedding strategy needs work.
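Once you have labeled which chunks are relevant to each test question, precision@k is a one-liner. A sketch with hypothetical chunk IDs:

```python
def precision_at_k(retrieved_ids: list[str], relevant_ids: set[str], k: int = 4) -> float:
    """Fraction of the top-k retrieved chunks that are actually relevant."""
    top = retrieved_ids[:k]
    return sum(1 for doc_id in top if doc_id in relevant_ids) / k

# 2 of the top 4 retrieved chunks are in the labeled-relevant set.
print(precision_at_k(["c7", "c2", "c9", "c4"], relevant_ids={"c7", "c9"}))  # 0.5
```

Average this over a few dozen representative questions; a single query tells you little.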

Takeaways

RAG is not magic — it is a search problem with a language model on top. The quality of your RAG pipeline depends on three things: chunk quality, embedding relevance, and prompt design. PostgreSQL with pgvector eliminates the need for a separate vector database, keeping your infrastructure simple.

Start with a basic pipeline, measure retrieval precision, and iterate on the chunking strategy before upgrading the LLM model.