RAG Architecture Design
When to Use This Skill
Use this skill when:
- RAG architecture tasks - Designing retrieval-augmented generation pipelines, including chunking, embedding, retrieval, and context assembly strategies
- Planning or design - Need guidance on RAG architecture approaches
- Best practices - Want to follow established patterns and standards
Overview
Retrieval-Augmented Generation (RAG) combines retrieval from a knowledge base with LLM generation to provide accurate, grounded responses. The architecture matters because each stage (chunking, embedding, retrieval, context assembly) limits the quality, latency, and cost of every stage after it.
RAG Pipeline Architecture
┌─────────────────────────────────────────────────────────────────┐
│                          RAG Pipeline                           │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  ┌──────────────────────────────────────────────────────────┐   │
│  │                    INDEXING PIPELINE                     │   │
│  │                                                          │   │
│  │  Documents → Chunking → Embedding → Vector Store         │   │
│  │                                                          │   │
│  └──────────────────────────────────────────────────────────┘   │
│                                                                 │
│  ┌──────────────────────────────────────────────────────────┐   │
│  │                      QUERY PIPELINE                      │   │
│  │                                                          │   │
│  │  Query → Embedding → Retrieval → Reranking → Context →   │   │
│  │  LLM Generation → Response                               │   │
│  │                                                          │   │
│  └──────────────────────────────────────────────────────────┘   │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
Document Chunking
Chunking Strategies
| Strategy | Description | Best For |
|----------|-------------|----------|
| Fixed Size | Split by character/token count | Simple, general |
| Sentence | Split at sentence boundaries | Prose, articles |
| Paragraph | Split at paragraph boundaries | Structured docs |
| Semantic | Split by topic/meaning | Technical docs |
| Recursive | Hierarchical splitting | Mixed content |
| Document Structure | Use headers, sections | Manuals, specs |
Chunk Size Guidelines
| Document Type | Chunk Size | Overlap |
|---------------|------------|---------|
| FAQ | 100-300 tokens | 10-20% |
| Articles | 300-500 tokens | 15-25% |
| Technical Docs | 500-1000 tokens | 20-30% |
| Legal/Contracts | 200-400 tokens | 25-35% |
| Code | 50-150 lines | By function |
Chunking Implementation
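using System;
using System.Collections.Generic;
using System.Linq;

public class DocumentChunker
{
    // Dispatches to the chunking routine selected in the options.
    // FixedSizeChunk, SentenceChunk, SemanticChunk, MergeSmallChunks, and
    // CountTokens are helpers assumed to be defined elsewhere in this class.
    public IEnumerable<Chunk> ChunkDocument(
        Document document,
        ChunkingOptions options)
    {
        return options.Strategy switch
        {
            ChunkingStrategy.FixedSize =>
                FixedSizeChunk(document.Content, options.ChunkSize, options.Overlap),
            ChunkingStrategy.Sentence =>
                SentenceChunk(document.Content, options.MaxSentences),
            ChunkingStrategy.Semantic =>
                SemanticChunk(document.Content, options.SemanticThreshold),
            ChunkingStrategy.Recursive =>
                RecursiveChunk(document.Content, options),
            _ => throw new NotSupportedException(
                $"Chunking strategy '{options.Strategy}' is not supported.")
        };
    }

    // Tries separators from coarsest (paragraph) to finest (word) and uses the
    // first one whose pieces all fit within the chunk size.
    private IEnumerable<Chunk> RecursiveChunk(
        string content,
        ChunkingOptions options)
    {
        var separators = new[] { "\n\n", "\n", ". ", " " };
        foreach (var separator in separators)
        {
            var splits = content.Split(separator);
            if (splits.All(s => CountTokens(s) <= options.ChunkSize))
            {
                // Recombine adjacent small pieces up to the chunk size
                return MergeSmallChunks(splits, options.ChunkSize, options.Overlap)
                    .Select((text, i) => new Chunk
                    {
                        Id = Guid.NewGuid(),
                        Content = text,
                        Index = i,
                        Metadata = new ChunkMetadata
                        {
                            TokenCount = CountTokens(text),
                            Separator = separator
                        }
                    });
            }
        }
        // No separator produced small enough pieces; fall back to fixed-size windows
        return FixedSizeChunk(content, options.ChunkSize, options.Overlap);
    }
}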
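The chunker above calls `FixedSizeChunk` without showing it. A minimal sketch of fixed-size chunking with overlap might look like the following. It is an illustration under stated assumptions, not the original helper: the `FixedSizeChunker` class is hypothetical, whitespace-separated words stand in for model tokens, and it returns plain strings rather than `Chunk` objects.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class FixedSizeChunker
{
    // Splits text into windows of at most chunkSize tokens; each window
    // repeats the last `overlap` tokens of the previous window.
    // Assumption: whitespace-separated words approximate model tokens.
    public static List<string> Chunk(string content, int chunkSize, int overlap)
    {
        if (chunkSize <= 0)
            throw new ArgumentOutOfRangeException(nameof(chunkSize));
        if (overlap < 0 || overlap >= chunkSize)
            throw new ArgumentOutOfRangeException(nameof(overlap));

        var tokens = content.Split(
            new[] { ' ', '\n', '\r', '\t' },
            StringSplitOptions.RemoveEmptyEntries);

        var chunks = new List<string>();
        var stride = chunkSize - overlap; // tokens advanced per window

        for (var start = 0; start < tokens.Length; start += stride)
        {
            var window = tokens.Skip(start).Take(chunkSize);
            chunks.Add(string.Join(" ", window));

            // Stop once a window reaches the end of the text, so the tail
            // is not emitted again as a chunk made only of overlap.
            if (start + chunkSize >= tokens.Length)
                break;
        }
        return chunks;
    }
}
```

With `chunkSize` 4 and `overlap` 1, the ten-word input `"a b c d e f g h i j"` yields three chunks: `a b c d`, `d e f g`, and `g h i j`. A real pipeline would count tokens with the embedding model's tokenizer instead of splitting on whitespace.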
Embedding Strategies
Embedding Model Selection
| Model | Dimensions | Speed | Quality | Cost |
|-------|------------|-------|---------|------|
| text-embedding-3-small | 1536 | Fast | Good | Low |
| text-embedding-3-large | 3072 | Medium | Excellent | Medium |
| text-embedding-ada-002 | 1536 | Fast | Good | Low |
| Cohere embed-v3 | 1024 | Fast | Excellent | Medium |
| BGE-large | 1024 | Medium | Excellent | Free (local) |
Embedding Best Practices
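using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

public class EmbeddingService
{
    private readonly IEmbeddingClient _client;
    private readonly SemaphoreSlim _rateLimiter;

    // maxConcurrentRequests is an illustrative default; tune it to the provider's limits
    public EmbeddingService(IEmbeddingClient client, int maxConcurrentRequests = 4)
    {
        _client = client;
        // Caps in-flight embedding requests across all callers of this service
        _rateLimiter = new SemaphoreSlim(maxConcurrentRequests);
    }

    public async Task<float[][]> EmbedBatch(
        IEnumerable<string> texts,
        CancellationToken ct)
    {
        var embeddings = new List<float[]>();
        // Process in batches of 100 to avoid rate and request-size limits
        foreach (var batch in texts.Chunk(100))
        {
            await _rateLimiter.WaitAsync(ct);
            try
            {
                // Chunk already yields string[], so no extra copy is needed
                var batchEmbeddings = await _client.EmbedAsync(batch, ct);
                embeddings.AddRange(batchEmbeddings);
            }
            finally
            {
                _rateLimiter.Release();
            }
        }
        return embeddings.ToArray();
    }

    public async Task<float[]> EmbedQuery(string query, CancellationToken ct)
    {
        // Some models (for example the E5 family) expect a "query: " prefix on
        // queries; drop or change the prefix for models that do not use one
        var formattedQuery = $"query: {query}";
        return await _client.EmbedAsync(formattedQuery, ct);
    }
}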
Vector Store Design
Store Selection
| Store | Type | Scalability | Features |
|-------|------|-------------|----------|
| Azure AI Search | Managed | High | Hybrid search, filters |
| Pinecone | Managed | High | Simple API |
| Qdrant | Self-hosted/Managed | High | Payload filters |
| Weaviate | Self-hosted/Managed | High | GraphQL, modules |
| Chroma | Self-hosted | Medium | Simple, local dev |
| pgvector | PostgreSQL extension | Medium | SQL integration |
Index Design
using System.Collections.Generic;

public class VectorIndexSchema
{
    public string IndexName { get; set; } = "documents";

    public List<VectorField> VectorFields { get; set; } =
    [
        new VectorField
        {
            Name = "content_vector",
            // Must match the output size of the embedding model in use
            Dimensions = 1536,
            Similarity = SimilarityMetric.Cosine,
            IndexType = IndexType.HNSW,
            HnswConfig = new HnswConfig
            {
                M = 16,               // graph links per node
                EfConstruction = 100, // candidate list size at build time
                EfSearch = 40         // candidate list size at query time
            }
        }
    ];

    public List<MetadataField> MetadataFields { get; set; } =
    [
        new MetadataField("document_id", FieldType.String, Filterable: true),
        new MetadataField("source", FieldType.String, Filterable: true),
        new MetadataField("created_at", FieldType.DateTime, Filterable: true),
        new MetadataField("category", FieldType.StringArray, Filterable: true),
        new MetadataField("content", FieldType.Text, Searchable: true)
    ];
}
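The schema above selects `SimilarityMetric.Cosine`. As a reference for what that metric computes, here is a minimal sketch of cosine similarity between two embedding vectors. The `VectorMath` class is hypothetical, and returning 0 for a zero-length vector is a convention chosen for this sketch; vector stores compute the metric internally, so this code is for understanding and offline checks only.

```csharp
using System;

public static class VectorMath
{
    // Cosine similarity: dot(a, b) / (|a| * |b|), ranging from -1 to 1.
    public static double CosineSimilarity(float[] a, float[] b)
    {
        if (a.Length != b.Length)
            throw new ArgumentException("Vectors must have the same dimensions.");

        double dot = 0, normA = 0, normB = 0;
        for (var i = 0; i < a.Length; i++)
        {
            dot += (double)a[i] * b[i];
            normA += (double)a[i] * a[i];
            normB += (double)b[i] * b[i];
        }

        // Convention for this sketch: a zero vector has similarity 0
        if (normA == 0 || normB == 0)
            return 0;

        return dot / (Math.Sqrt(normA) * Math.Sqrt(normB));
    }
}
```

Because the result depends only on direction and not on magnitude, vectors from the same model can be compared without normalizing them first.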
Retrieval Strategies
Retrieval Methods
| Method | Description | Pros | Cons |
|--------|-------------|------|------|
| Vector Search | Semantic similarity | Handles synonyms | May miss exact matches |
| Keyword Search | BM25/TF-IDF | Exact matches | Misses synonyms |
| Hybrid | Vector + Keyword | Best of both | More complex |
| Multi-Query | Generate query variations | Better recall | Higher cost |
| HyDE | Search with a generated hypothetical answer | Better precision | Added latency |
Hybrid Search Implementation
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

public class HybridRetriever
{
    private readonly IVectorStore _vectorStore;
    private readonly ISearchClient _keywordSearch;

    public HybridRetriever(IVectorStore vectorStore, ISearchClient keywordSearch)
    {
        _vectorStore = vectorStore;
        _keywordSearch = keywordSearch;
    }

    public async Task<List<SearchResult>> Retrieve(
        string query,
        RetrievalOptions options,
        CancellationToken ct)
    {
        // Run vector and keyword search in parallel
        var vectorTask = _vectorStore.SearchAsync(
            query,
            options.TopK * 2, // Retrieve more for fusion
            ct);
        var keywordTask = _keywordSearch.SearchAsync(
            query,
            options.TopK * 2,
            ct);
        await Task.WhenAll(vectorTask, keywordTask);
        var vectorResults = await vectorTask;
        var keywordResults = await keywordTask;

        // Reciprocal Rank Fusion
        var fused = ReciprocalRankFusion(
            vectorResults,
            keywordResults,
            options.VectorWeight,
            options.KeywordWeight);
        return fused.Take(options.TopK).ToList();
    }

    private List<SearchResult> ReciprocalRankFusion(
        List<SearchResult> vectorResults,
        List<SearchResult> keywordResults,
        float vectorWeight,
        float keywordWeight,
        int k = 60)
    {
        var scores = new Dictionary<string, float>();
        // Keep the first result seen per id so content and source survive fusion
        var byId = new Dictionary<string, SearchResult>();

        for (int i = 0; i < vectorResults.Count; i++)
        {
            var id = vectorResults[i].Id;
            byId.TryAdd(id, vectorResults[i]);
            scores.TryAdd(id, 0);
            scores[id] += vectorWeight / (k + i + 1);
        }
        for (int i = 0; i < keywordResults.Count; i++)
        {
            var id = keywordResults[i].Id;
            byId.TryAdd(id, keywordResults[i]);
            scores.TryAdd(id, 0);
            scores[id] += keywordWeight / (k + i + 1);
        }

        return scores
            .OrderByDescending(kv => kv.Value)
            .Select(kv => new SearchResult
            {
                Id = kv.Key,
                Score = kv.Value,
                // Carry over the fields that context assembly reads later
                Content = byId[kv.Key].Content,
                Source = byId[kv.Key].Source
            })
            .ToList();
    }
}
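The fusion loop in `ReciprocalRankFusion` can also be written as a formula. This restates the code rather than adding a new method. It assumes 1-based ranks (the code's `i + 1`) and that a document missing from one list contributes nothing for that list. The symbols $w_v$, $w_k$, $r_v$, and $r_k$ (vector and keyword weights and ranks) are introduced here for illustration.

```latex
\mathrm{score}(d) = \frac{w_v}{k + r_v(d)} + \frac{w_k}{k + r_k(d)}, \qquad k = 60
```

For example, with $w_v = w_k = 1$, a document ranked 1st by vector search and 3rd by keyword search scores $1/61 + 1/63 \approx 0.0323$, while a document ranked 1st by only one of the two scores $1/61 \approx 0.0164$. Agreement between the two lists therefore outranks a single top hit.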
Context Assembly
Context Window Management
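using System.Collections.Generic;
using System.Text;

public class ContextAssembler
{
    private readonly int _maxTokens;

    public ContextAssembler(int maxTokens)
    {
        _maxTokens = maxTokens;
    }

    // reservedTokens is held back for everything other than retrieved context.
    // CountTokens is a helper assumed to be defined elsewhere in this class.
    public string AssembleContext(
        List<SearchResult> results,
        string query,
        int reservedTokens = 500)
    {
        var availableTokens = _maxTokens - reservedTokens;
        var context = new StringBuilder();
        var usedTokens = 0;

        // Results arrive sorted by relevance from retrieval
        foreach (var result in results)
        {
            var header = $"[Source: {result.Source}]";
            // Count the source header too, since it also consumes the budget
            var chunkTokens = CountTokens(header) + CountTokens(result.Content);
            if (usedTokens + chunkTokens > availableTokens)
                break;

            context.AppendLine(header);
            context.AppendLine(result.Content);
            context.AppendLine();
            usedTokens += chunkTokens;
        }
        return context.ToString();
    }
}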
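`CountTokens` is used by both the chunker and the context assembler but is not defined in this document. When the model's tokenizer is not available, a rough stand-in can estimate the count from character length. The sketch below is a hypothetical helper that assumes about 4 characters per token, a commonly cited rule of thumb for English text. Treat the result as an approximation and use the real tokenizer where budgets are tight.

```csharp
using System;

public static class TokenEstimator
{
    // Estimates token count from character length.
    // Assumption: roughly charsPerToken characters per token (English text).
    public static int EstimateTokens(string text, double charsPerToken = 4.0)
    {
        if (string.IsNullOrEmpty(text))
            return 0;
        if (charsPerToken <= 0)
            throw new ArgumentOutOfRangeException(nameof(charsPerToken));

        // Round up so short strings never count as zero tokens
        return (int)Math.Ceiling(text.Length / charsPerToken);
    }
}
```

Rounding up errs toward overestimating, which is the safer direction when the goal is to stay inside a context window.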
RAG Evaluation
Evaluation Metrics
| Metric | Description | Target |
|--------|-------------|--------|
| Retrieval Precision | Relevant docs / Retrieved docs | > 80% |
| Retrieval Recall | Retrieved relevant / All relevant | > 70% |
| Answer Accuracy | Correct answers | > 90% |
| Faithfulness | Answer supported by context | > 95% |
| Answer Relevancy | Answer matches query | > 85% |
Evaluation Framework
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

public class RagEvaluator
{
    // CalculatePrecision, CalculateRecall, EvaluateAnswer, and CheckFaithfulness
    // are helpers assumed to be defined elsewhere in this class.
    public async Task<EvaluationReport> Evaluate(
        List<TestCase> testCases,
        IRagPipeline pipeline,
        CancellationToken ct)
    {
        var results = new List<TestResult>();
        foreach (var testCase in testCases)
        {
            var response = await pipeline.Query(testCase.Query, ct);
            results.Add(new TestResult
            {
                Query = testCase.Query,
                ExpectedAnswer = testCase.ExpectedAnswer,
                ActualAnswer = response.Answer,
                RetrievedDocs = response.Sources,
                RelevantDocs = testCase.RelevantDocs,
                Metrics = new TestMetrics
                {
                    RetrievalPrecision = CalculatePrecision(
                        response.Sources, testCase.RelevantDocs),
                    RetrievalRecall = CalculateRecall(
                        response.Sources, testCase.RelevantDocs),
                    AnswerCorrect = await EvaluateAnswer(
                        response.Answer, testCase.ExpectedAnswer),
                    Faithful = await CheckFaithfulness(
                        response.Answer, response.Context)
                }
            });
        }
        return new EvaluationReport(results);
    }
}
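The evaluator calls `CalculatePrecision` and `CalculateRecall` without defining them. A minimal sketch of both, following the definitions in the metrics table, could look like this. The `RetrievalMetrics` class is hypothetical, documents are assumed to be identified by string ids, and returning 0 when the denominator is empty is a convention chosen for this sketch.

```csharp
using System.Collections.Generic;
using System.Linq;

public static class RetrievalMetrics
{
    // Precision: relevant retrieved docs / all retrieved docs.
    public static double Precision(
        IEnumerable<string> retrievedIds,
        IEnumerable<string> relevantIds)
    {
        var retrieved = new HashSet<string>(retrievedIds);
        var relevant = new HashSet<string>(relevantIds);
        if (retrieved.Count == 0)
            return 0.0; // convention: nothing retrieved scores 0

        var hits = retrieved.Count(id => relevant.Contains(id));
        return (double)hits / retrieved.Count;
    }

    // Recall: relevant retrieved docs / all relevant docs.
    public static double Recall(
        IEnumerable<string> retrievedIds,
        IEnumerable<string> relevantIds)
    {
        var retrieved = new HashSet<string>(retrievedIds);
        var relevant = new HashSet<string>(relevantIds);
        if (relevant.Count == 0)
            return 0.0; // convention: no known relevant docs scores 0

        var hits = relevant.Count(id => retrieved.Contains(id));
        return (double)hits / relevant.Count;
    }
}
```

For retrieved ids `a, b, c, d` and relevant ids `a, c, e`, two of the four retrieved documents are relevant, so precision is 0.5, and two of the three relevant documents were found, so recall is about 0.67. Using sets means duplicate ids in either list are counted once.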
Architecture Template
# RAG Architecture: [System Name]
## Overview
[Brief description of the RAG system purpose]
## Components
### Document Processing
- **Source**: [Document sources]
- **Chunking**: [Strategy and parameters]
- **Embedding**: [Model and dimensions]
### Vector Store
- **Provider**: [Azure AI Search / Pinecone / etc.]
- **Index**: [Index configuration]
- **Metadata**: [Stored fields]
### Retrieval
- **Method**: [Vector / Hybrid / Multi-query]
- **Top-K**: [Number of results]
- **Filters**: [Applied filters]
### Generation
- **Model**: [LLM model]
- **Context Window**: [Token allocation]
- **Prompt**: [Template reference]
## Data Flow
[Mermaid diagram of the pipeline]
## Performance Targets
| Metric | Target |
|--------|--------|
| Retrieval Latency | < 200ms |
| E2E Latency | < 3s |
| Answer Accuracy | > 90% |
Validation Checklist
Integration Points
Inputs from:
- Data sources → Documents to index
- model-selection skill → Embedding/LLM choice
Outputs to:
- prompt-engineering skill → Context integration
- token-budgeting skill → Cost estimation
- Application code → RAG implementation