RAG Project - Complete Documentation

📚 Project Overview

This is a production-ready Retrieval-Augmented Generation (RAG) system built from the following components:

Framework: LangChain (RAG orchestration)
Embeddings: Google Gemini API (semantic understanding)
Vector DB: Pinecone (scalable vector storage)
LLM: Google Gemini (gemini-2.5-flash, generation)
UI: Streamlit (web interface)

🏗️ Architecture

┌─────────────────────────────────────────────────────────┐
│                    User Interface                        │
│                   (Streamlit Web App)                    │
│  - File Upload - Chat Interface - Display Results       │
└────────────────────┬────────────────────────────────────┘
                     │
┌────────────────────▼────────────────────────────────────┐
│              RAG Processing Pipeline                     │
├─────────────────────────────────────────────────────────┤
│                                                          │
│  1. Document Processor                                  │
│     - Text Extraction (.txt, .pdf, .docx)              │
│     - Text Chunking (with overlap)                     │
│                                                          │
│  2. Embedding Service                                   │
│     - Generate vectors via Google Gemini               │
│     - Dimension: 768                                    │
│                                                          │
│  3. Vector Storage                                      │
│     - Pinecone index management                        │
│     - Metadata storage                                 │
│                                                          │
│  4. RAG Chain                                           │
│     - Retrieval (semantic search)                      │
│     - Generation (LLM response)                        │
│     - Prompt engineering                               │
│                                                          │
└─────────────────────────────────────────────────────────┘

📂 File Structure Explained

rag-project/
│
├── src/                          # Main source code
│   ├── __init__.py              # Package initialization
│   │
│   ├── config/                  # Configuration module
│   │   ├── __init__.py
│   │   └── config.py            # Centralized config (reads .env)
│   │
│   ├── rag/                     # Core RAG implementation
│   │   ├── __init__.py
│   │   ├── pinecone_manager.py  # Pinecone CRUD operations
│   │   ├── embedding_service.py # Google Gemini embeddings
│   │   ├── document_processor.py# Document pipeline
│   │   └── rag_chain.py         # LangChain RAG chain
│   │
│   └── utils/                   # Utility modules
│       ├── __init__.py
│       ├── helpers.py           # Logging, formatting
│       ├── chunking.py          # Text splitting logic
│       └── text_processor.py    # File parsing
│
├── app.py                       # Streamlit web interface
├── main.py                      # CLI entry point
├── setup_project.py             # Setup automation
├── requirements.txt             # Python dependencies
│
├── .env.template                # Config template
├── .env                         # Config (create from template)
│
└── README.md                    # User documentation

🔄 Workflow Diagram

Processing Documents

User Upload File
       ↓
    Extract Text (text_processor.py)
       ↓
    Split into Chunks (chunking.py)
       ↓
    Generate Embeddings (embedding_service.py)
       ↓
    Upsert to Pinecone (pinecone_manager.py)
       ↓
    Document Ready for Queries

Answering Questions

User Question (Chat Interface)
       ↓
    Generate Question Embedding (embedding_service.py)
       ↓
    Search Pinecone for Similar Chunks (pinecone_manager.py)
       ↓
    Retrieve Top-K Results (default: 5)
       ↓
    Format Context from Retrieved Chunks
       ↓
    Send to Gemini with Custom Prompt (rag_chain.py)
       ↓
    Stream Response to User
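The "Format Context from Retrieved Chunks" step can be sketched as below. This is a minimal illustration, not the code in rag_chain.py: the function name `format_context` and the `max_chars` budget are hypothetical, and each match is assumed to carry the metadata fields described under Data Schema.

```python
def format_context(matches: list[dict], max_chars: int = 4000) -> str:
    """Join retrieved chunk previews into one numbered context block."""
    parts: list[str] = []
    used = 0
    for rank, match in enumerate(matches, start=1):
        meta = match["metadata"]
        entry = (
            f"[{rank}] (source: {meta['source']}, chunk {meta['chunk_index']})\n"
            f"{meta['text']}"
        )
        # Stop before exceeding the (hypothetical) character budget.
        if used + len(entry) > max_chars:
            break
        parts.append(entry)
        used += len(entry)
    return "\n\n".join(parts)


matches = [
    {"metadata": {"source": "a.txt", "chunk_index": 0, "text": "Alpha text."}},
    {"metadata": {"source": "b.txt", "chunk_index": 3, "text": "Beta text."}},
]
context = format_context(matches)
print(context)
```

Numbering the entries makes it easier to tie an answer back to a specific source chunk.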

🔌 API Integrations

Google Gemini API

Models Used:
- models/embedding-001 - Text embeddings
- gemini-2.5-flash - Text generation
Key Operations:
- embed_content() - Generate embeddings
- ChatGoogleGenerativeAI() - LLM interface

Pinecone API

Index: rag-documents-index (configurable)
Key Operations:
- create_index() - Initialize vector database
- upsert() - Store vectors with metadata
- query() - Semantic search

💾 Data Schema

Pinecone Vector Format

{
  "id": "filename_0_a1b2c3d4",
  "values": [0.23, 0.45, ...],  // 768-dimensional embedding
  "metadata": {
    "chunk_index": 0,
    "source": "document.txt",
    "text": "First 500 characters of chunk..."
  }
}

Document Metadata

chunk_index: Position in source document
source: Original filename
text: Content preview (first 500 chars)
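A record in this layout could be assembled as follows. This is a sketch under stated assumptions: `build_vector_record` is a hypothetical helper, and the 8-character ID suffix is assumed here to be a content hash; the project may derive it differently.

```python
import hashlib


def build_vector_record(source: str, chunk_index: int, text: str,
                        embedding: list[float], preview_chars: int = 500) -> dict:
    """Assemble one record with the id / values / metadata layout."""
    # Assumption: the suffix is the first 8 hex digits of a SHA-256 content hash.
    suffix = hashlib.sha256(text.encode("utf-8")).hexdigest()[:8]
    stem = source.rsplit(".", 1)[0]  # "document.txt" -> "document"
    return {
        "id": f"{stem}_{chunk_index}_{suffix}",
        "values": embedding,
        "metadata": {
            "chunk_index": chunk_index,
            "source": source,
            "text": text[:preview_chars],  # preview only, not the full chunk
        },
    }


record = build_vector_record("document.txt", 0, "x" * 1200, [0.0] * 768)
print(record["id"], len(record["metadata"]["text"]))
```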

⚙️ Configuration Parameters

Parameter	Default	Purpose
`CHUNK_SIZE`	1000	Characters per chunk
`CHUNK_OVERLAP`	200	Character overlap between chunks
`RETRIEVAL_TOP_K`	5	Number of results to retrieve
`EMBEDDING_DIMENSION`	768	Embedding vector dimension
`LANGCHAIN_VERBOSE`	False	Enable verbose logging
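One way to load and check these parameters is sketched below. The `Settings` class is hypothetical and only mirrors the table; it is not the `Config` class in config.py. The validation rules (for example, overlap smaller than chunk size) are assumptions about what sensible checks look like.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class Settings:
    chunk_size: int
    chunk_overlap: int
    retrieval_top_k: int
    embedding_dimension: int
    langchain_verbose: bool

    @classmethod
    def from_env(cls, env: Mapping[str, str] = os.environ) -> "Settings":
        # Defaults match the table above; values arrive as strings.
        return cls(
            chunk_size=int(env.get("CHUNK_SIZE", "1000")),
            chunk_overlap=int(env.get("CHUNK_OVERLAP", "200")),
            retrieval_top_k=int(env.get("RETRIEVAL_TOP_K", "5")),
            embedding_dimension=int(env.get("EMBEDDING_DIMENSION", "768")),
            langchain_verbose=env.get("LANGCHAIN_VERBOSE", "False").strip().lower()
            in {"1", "true", "yes"},
        )

    def validate(self) -> None:
        if self.chunk_size <= 0:
            raise ValueError("CHUNK_SIZE must be positive")
        if not 0 <= self.chunk_overlap < self.chunk_size:
            raise ValueError("CHUNK_OVERLAP must be >= 0 and smaller than CHUNK_SIZE")
        if self.retrieval_top_k < 1:
            raise ValueError("RETRIEVAL_TOP_K must be at least 1")
        if self.embedding_dimension <= 0:
            raise ValueError("EMBEDDING_DIMENSION must be positive")


# Pass a plain dict here so the example does not depend on the real environment.
settings = Settings.from_env({"CHUNK_SIZE": "800"})
settings.validate()
print(settings)
```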

🔐 Security & Safety

Built-in Safeguards

Hallucination Prevention
- Custom prompt instructs the model to refuse out-of-scope questions
- Refusal message: "I don't have information in the uploaded documents to answer that."
Context Verification
- Only uses retrieved documents as context
- No external data sources
Source Attribution
- Links answers back to source documents
- Shows document excerpts
Logging
- All operations logged for audit trail
- Configurable log levels
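The refusal safeguard might look roughly like this. The prompt wording, `build_prompt`, and `answer` are illustrative assumptions rather than the template used in rag_chain.py; only the refusal sentence is taken from this document. The `llm` argument is any callable that maps a prompt string to a response string.

```python
from typing import Callable, Optional

REFUSAL = "I don't have information in the uploaded documents to answer that."

# Hypothetical template: restricts the model to the retrieved context.
PROMPT_TEMPLATE = (
    "Answer the question using only the context below.\n"
    "If the context does not contain the answer, reply exactly:\n"
    "{refusal}\n\n"
    "Context:\n{context}\n\n"
    "Question: {question}\n"
    "Answer:"
)


def build_prompt(context: str, question: str) -> Optional[str]:
    """Return the prompt, or None when nothing was retrieved."""
    if not context.strip():
        return None
    return PROMPT_TEMPLATE.format(refusal=REFUSAL, context=context, question=question)


def answer(context: str, question: str, llm: Callable[[str], str]) -> str:
    prompt = build_prompt(context, question)
    # With no context, refuse locally instead of calling the model.
    return REFUSAL if prompt is None else llm(prompt)


print(answer("", "What is the refund policy?", llm=lambda p: "unused"))
```

A prompt instruction reduces out-of-scope answers but cannot guarantee them away, so the empty-context short-circuit is a cheap extra guard.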

🚀 Performance Optimization

Chunking Strategy

Recursive character splitting
Respects semantic boundaries (paragraphs, sentences)
Configurable size and overlap
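To illustrate how size, overlap, and boundary preference interact, here is a simplified sliding-window splitter. It is not LangChain's recursive splitter and not the code in chunking.py; `split_with_overlap` and its separator list are assumptions made for this sketch.

```python
def split_with_overlap(text: str, chunk_size: int = 1000, overlap: int = 200,
                       separators: tuple[str, ...] = ("\n\n", ". ", " ")) -> list[str]:
    """Split text into chunks of at most chunk_size characters.

    Each chunk starts `overlap` characters before the previous chunk ended.
    """
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    chunks: list[str] = []
    start = 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        if end < len(text):
            # Prefer a paragraph break, then a sentence end, then a space.
            # Only look past the overlap region so every step moves forward.
            for sep in separators:
                cut = text.rfind(sep, start + overlap + 1, end)
                if cut != -1:
                    end = cut + len(sep)
                    break
        chunks.append(text[start:end])
        if end == len(text):
            break
        start = end - overlap
    return chunks


sample = "First paragraph. It has two sentences.\n\nSecond paragraph follows here."
for chunk in split_with_overlap(sample, chunk_size=40, overlap=10):
    print(repr(chunk))
```

The overlap means a sentence that straddles a boundary still appears whole in at least one chunk, at the cost of storing some text twice.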

Embedding Strategy

Batch processing for multiple texts
Caching not yet implemented (can be added around the embedding calls)
Async support ready
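Batching and caching could be layered over any embedding call as sketched here. `CachedEmbedder` is a hypothetical wrapper, and `fake_embed_batch` stands in for the real API call so the example runs offline; neither is part of embedding_service.py.

```python
from typing import Callable, Iterator, Sequence

Vector = list[float]


def batched(items: Sequence[str], batch_size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of at most batch_size items."""
    for i in range(0, len(items), batch_size):
        yield items[i:i + batch_size]


class CachedEmbedder:
    def __init__(self, embed_batch: Callable[[Sequence[str]], list[Vector]],
                 batch_size: int = 32) -> None:
        self._embed_batch = embed_batch
        self._batch_size = batch_size
        self._cache: dict[str, Vector] = {}

    def embed_texts(self, texts: Sequence[str]) -> list[Vector]:
        # Deduplicate (keeping order) and skip texts already in the cache.
        missing = [t for t in dict.fromkeys(texts) if t not in self._cache]
        for batch in batched(missing, self._batch_size):
            for text, vector in zip(batch, self._embed_batch(batch)):
                self._cache[text] = vector
        return [self._cache[t] for t in texts]


calls: list[int] = []


def fake_embed_batch(batch: Sequence[str]) -> list[Vector]:
    # Stand-in for the embedding API: records batch sizes, returns toy vectors.
    calls.append(len(batch))
    return [[float(len(t)), float(t.count(" "))] for t in batch]


embedder = CachedEmbedder(fake_embed_batch, batch_size=2)
vectors = embedder.embed_texts(["a", "b b", "c", "a"])
print(len(vectors), calls)
```

The in-memory dictionary is the simplest possible cache; a persistent store would be needed to keep embeddings across runs.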

Retrieval Strategy

Vector similarity search (cosine similarity)
Top-K filtering
Metadata filtering support
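The three ideas above (cosine scoring, top-K, metadata filtering) are shown together in this small in-memory sketch. It is an exhaustive stand-in for illustration only; Pinecone performs the search server-side, and `top_k` and its exact-match filter are assumptions of this example.

```python
import math
from typing import Optional


def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query: list[float], records: list[dict], k: int = 5,
          metadata_filter: Optional[dict] = None) -> list[tuple[float, dict]]:
    """Return the k best (score, record) pairs, highest score first."""
    metadata_filter = metadata_filter or {}
    scored = []
    for rec in records:
        meta = rec["metadata"]
        # Keep only records whose metadata matches every filter key exactly.
        if all(meta.get(key) == value for key, value in metadata_filter.items()):
            scored.append((cosine_similarity(query, rec["values"]), rec))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


records = [
    {"id": "a", "values": [1.0, 0.0], "metadata": {"source": "a.txt"}},
    {"id": "b", "values": [0.0, 1.0], "metadata": {"source": "b.txt"}},
    {"id": "c", "values": [1.0, 1.0], "metadata": {"source": "a.txt"}},
]
results = top_k([1.0, 0.0], records, k=2)
print([(round(score, 3), rec["id"]) for score, rec in results])
```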

🛠️ Development Guide

Adding New Features

New File Formats

# Add to text_processor.py
elif file_ext == ".new_format":
    return TextProcessor._extract_from_new_format(file_path)

Custom Prompt Templates

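# Modify in rag_chain.py _create_qa_chain()
from langchain_core.prompts import PromptTemplate  # assumes langchain-core is installed

# The template text should reference both {context} and {question}.
CUSTOM_PROMPT = PromptTemplate(
    template="Your custom template...",
    input_variables=["context", "question"]
)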

New Retrieval Strategies

# Create in rag_chain.py
def retrieve_with_reranking(self, question: str):
    # Custom retrieval logic goes here
    raise NotImplementedError
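As one possible shape for such a strategy, the sketch below reranks retrieved chunks by blending the vector score with simple word overlap. The function `rerank_by_overlap`, the 50/50 weighting, and the tokenizer are illustrative assumptions; a production reranker would more likely use a dedicated model.

```python
import re


def _tokens(text: str) -> set[str]:
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def rerank_by_overlap(question: str, candidates: list[tuple[float, str]],
                      top_n: int = 5) -> list[tuple[float, str]]:
    """Reorder (vector_score, text) pairs by a blended relevance score."""
    q = _tokens(question)

    def combined(item: tuple[float, str]) -> float:
        vector_score, text = item
        # Fraction of question words that also appear in the chunk.
        overlap = len(q & _tokens(text)) / len(q) if q else 0.0
        return 0.5 * vector_score + 0.5 * overlap  # arbitrary equal weighting

    return sorted(candidates, key=combined, reverse=True)[:top_n]


candidates = [
    (0.80, "Pinecone stores vectors"),
    (0.78, "Chunk overlap defaults to 200 characters"),
]
reranked = rerank_by_overlap("What is the default chunk overlap?", candidates)
print(reranked[0][1])
```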

📊 Monitoring & Debugging

Enable Verbose Logging

# In .env file
LANGCHAIN_VERBOSE=True
LOG_LEVEL=DEBUG

Check Index Stats

from src.rag import PineconeManager

pm = PineconeManager()
stats = pm.get_index_stats()
print(stats)  # Shows vector counts, dimensions

🧪 Testing

Test Document Processing

# Create test file
echo "Test content" > test.txt

# Process it
python main.py process test.txt

Test RAG Chain

from src.rag import RAGChain

chain = RAGChain()
result = chain.query("Test question?")
print(result["answer"])

📝 Code Examples

Process Documents Programmatically

from src.rag import DocumentProcessor

processor = DocumentProcessor()
chunks = processor.process_file("document.txt", "document.txt")
print(f"Created {chunks} chunks")

Query Documents

from src.rag import RAGChain

chain = RAGChain()
result = chain.query("What is the main topic?")
print(result["answer"])
for doc in result["source_documents"]:
    print(f"Source: {doc.metadata['source']}")

🔄 Update & Maintenance

Updating Documents

Process new documents (appends to index)
Use --namespace flag for isolation
Start from an empty index if needed: set PINECONE_INDEX_NAME in .env to a new name (the old index is left in place)

Clearing Data

Streamlit UI: "Clear All Data" button
CLI: Create a new index under a different name

📈 Scalability

Suitable For

Small to medium document repositories (millions of vectors)
Real-time query performance needs
Multi-tenant support (via namespaces)
Cost-effective vector storage

When to Scale

Consider vector database partitioning
Implement caching layer
Add async batch processing
Monitor Pinecone index size

🐛 Common Issues & Solutions

Issue	Cause	Solution
No embeddings generated	Invalid API key	Check GOOGLE_API_KEY
Connection refused	API timeout	Check internet, retry
Hallucinated answers	Prompt design	Adjust prompt template
Slow queries	Large TOP_K	Reduce RETRIEVAL_TOP_K
Memory issues	Large documents	Reduce CHUNK_SIZE
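For the "Check internet, retry" row, a retry helper with exponential backoff might look like this. `call_with_retry` is a hypothetical helper, and the choice of exceptions to retry on is an assumption; the actual exceptions raised depend on the client library in use.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retry(func: Callable[[], T], attempts: int = 3, base_delay: float = 1.0,
                    retry_on: tuple = (ConnectionError, TimeoutError),
                    sleep: Callable[[float], None] = time.sleep) -> T:
    """Call func, retrying on transient errors with exponential backoff."""
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except retry_on:
            if attempt == attempts:
                raise  # out of attempts: surface the last error
            # Delay doubles each attempt; small jitter avoids synchronized retries.
            sleep(base_delay * 2 ** (attempt - 1) + random.uniform(0, 0.1))
    raise ValueError("attempts must be at least 1")


# Demo: a call that fails twice, then succeeds. Delays are recorded, not slept.
state = {"calls": 0}
delays: list[float] = []


def flaky() -> str:
    state["calls"] += 1
    if state["calls"] < 3:
        raise ConnectionError("simulated network failure")
    return "ok"


result = call_with_retry(flaky, sleep=delays.append)
print(result, len(delays))
```

Passing `sleep` as a parameter keeps the helper testable without real waiting.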

📚 Module Reference

config.py

Config class with all settings
validate() method for config checks

embedding_service.py

EmbeddingService for generating vectors
embed_text() - single text
embed_texts() - batch processing

pinecone_manager.py

PineconeManager for index operations
create_index() - setup
upsert_vectors() - store
query_vectors() - retrieve

document_processor.py

DocumentProcessor for full pipeline
process_file() - single file
process_multiple_files() - batch

rag_chain.py

RAGChain for Q&A
query() - get answers
is_relevant_to_documents() - check relevance

🎓 Learning Path

Beginner: Use Streamlit UI only
Intermediate: Explore CLI commands
Advanced: Modify code and add features
Expert: Integrate into production systems

📞 Support Resources

Google Generative AI: https://ai.google.dev/
Pinecone Docs: https://docs.pinecone.io/
LangChain: https://python.langchain.com/
Streamlit: https://docs.streamlit.io/

Version: 1.0.0
Last Updated: December 2024
Status: Production Ready
