Building RAG Applications with Semantic Kernel and Azure OpenAI
Retrieval-Augmented Generation (RAG) lets you build AI applications that work with private data by retrieving relevant context at query time instead of fine-tuning. Let's build a production-oriented RAG system using Semantic Kernel and Azure OpenAI, a stack that fits naturally into existing .NET applications.
The RAG Pattern
RAG has three phases: Ingestion (chunk documents → embed → store in vector DB), Retrieval (embed query → search for similar chunks), and Generation (inject chunks into prompt → LLM generates grounded response).
User Query → [Embed] → [Vector Search] → [Top-K Chunks]
                                                ↓
                                     [Prompt Template + LLM]
                                                ↓
                                        Grounded Response
Understanding these phases is critical because each one introduces points where quality can degrade. Poor chunking during ingestion leads to irrelevant retrieval, which causes hallucinated generation. Let's build each phase with production-grade C# patterns using Semantic Kernel's abstractions.
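To make the retrieval phase concrete before bringing in any services, here is a minimal in-memory sketch. It assumes cosine similarity as the ranking metric and uses hand-written 3-dimensional vectors in place of real embeddings; the `ToyRetriever` class and the sample texts are hypothetical and not part of Semantic Kernel.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Toy corpus: in a real system these vectors come from the embedding model.
var chunks = new List<(string Text, float[] Vector)>
{
    ("Reset your password from the account page.", new[] { 0.9f, 0.1f, 0.0f }),
    ("Invoices are emailed on the first of the month.", new[] { 0.1f, 0.9f, 0.1f }),
    ("Password rules: 12 characters minimum.", new[] { 0.8f, 0.2f, 0.1f }),
};

// Stand-in for the embedded user query.
float[] query = { 1.0f, 0.0f, 0.0f };

foreach (var (text, score) in ToyRetriever.TopK(chunks, query, k: 2))
    Console.WriteLine($"{score:F3}  {text}");

public static class ToyRetriever
{
    // Cosine similarity: dot product divided by the product of the vector lengths.
    public static double Cosine(float[] a, float[] b)
    {
        if (a.Length != b.Length)
            throw new ArgumentException("Vectors must have the same length.");

        double dot = 0, normA = 0, normB = 0;
        for (var i = 0; i < a.Length; i++)
        {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return (normA == 0 || normB == 0) ? 0 : dot / (Math.Sqrt(normA) * Math.Sqrt(normB));
    }

    // Score every chunk against the query and keep the k best.
    public static List<(string Text, double Score)> TopK(
        IEnumerable<(string Text, float[] Vector)> chunks, float[] query, int k) =>
        chunks.Select(c => (c.Text, Score: Cosine(c.Vector, query)))
              .OrderByDescending(r => r.Score)
              .Take(k)
              .ToList();
}
```

The two password-related chunks rank ahead of the invoice chunk. A vector database does the same job at scale, typically with an approximate index instead of this brute-force scan.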
Setting Up Semantic Kernel
First, configure Semantic Kernel with Azure OpenAI services and a vector store. Semantic Kernel provides a unified abstraction layer that makes it easy to swap between LLM providers and vector databases:
using Microsoft.SemanticKernel;

var builder = WebApplication.CreateBuilder(args);

// Register the kernel with chat completion and embedding generation.
// Note: in some Semantic Kernel versions the embedding registration is marked
// experimental and needs a warning suppression to compile.
builder.Services.AddKernel()
    .AddAzureOpenAIChatCompletion(
        deploymentName: "gpt-4o",
        endpoint: builder.Configuration["AzureOpenAI:Endpoint"]!,
        apiKey: builder.Configuration["AzureOpenAI:ApiKey"]!)
    .AddAzureOpenAITextEmbeddingGeneration(
        deploymentName: "text-embedding-3-large",
        endpoint: builder.Configuration["AzureOpenAI:Endpoint"]!,
        apiKey: builder.Configuration["AzureOpenAI:ApiKey"]!);

// Register Azure AI Search as the vector store.
builder.Services.AddAzureAISearchVectorStore(
    new Uri(builder.Configuration["AzureSearch:Endpoint"]!),
    new Azure.AzureKeyCredential(
        builder.Configuration["AzureSearch:ApiKey"]!));

builder.Services.AddScoped<RagService>();

var app = builder.Build();

// AskRequest is assumed to be a simple request type exposing a Question property.
app.MapPost("/api/ask", async (AskRequest request, RagService ragService) =>
{
    var response = await ragService.AskAsync(request.Question);
    return Results.Ok(new { answer = response.Answer, sources = response.Sources });
});

app.Run();
Use text-embedding-3-large (3072 dims) for best retrieval quality, or text-embedding-3-small (1536 dims) to save costs. Avoid ada-002 for new projects — the newer models offer better multilingual support and dimensionality reduction options.
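The cost side of that choice is mostly arithmetic. Assuming vectors are stored as 32-bit floats, one 3072-dimension embedding takes 3072 × 4 = 12,288 bytes before any index overhead. The sketch below works that out for a million chunks and also shows the manual form of dimensionality reduction for the text-embedding-3 models: keep the first N values and re-normalize to unit length. The `EmbeddingMath` helper is hypothetical. When the API's `dimensions` request parameter is available through your SDK version, that is usually the simpler route.

```csharp
using System;
using System.Linq;

// Raw float32 storage for one million chunks at each size (index overhead not included).
foreach (var dims in new[] { 3072, 1536, 256 })
{
    var gib = EmbeddingMath.StorageBytes(dims, chunkCount: 1_000_000) / (1024.0 * 1024 * 1024);
    Console.WriteLine($"{dims} dims: {gib:F2} GiB");
}

// Shorten a toy 4-dimensional vector to 2 dimensions.
// (3, 4) has length 5, so the result is approximately (0.6, 0.8).
var shortened = EmbeddingMath.TruncateAndNormalize(new[] { 3f, 4f, 1f, 2f }, 2);
Console.WriteLine(string.Join(" | ", shortened));

public static class EmbeddingMath
{
    // Bytes needed to store chunkCount vectors of the given size as float32.
    public static long StorageBytes(int dimensions, long chunkCount) =>
        (long)dimensions * sizeof(float) * chunkCount;

    // Keep the leading values, then rescale so the vector has unit length again.
    public static float[] TruncateAndNormalize(float[] embedding, int dimensions)
    {
        if (dimensions <= 0 || dimensions > embedding.Length)
            throw new ArgumentOutOfRangeException(nameof(dimensions));

        var head = embedding.Take(dimensions).ToArray();
        var norm = Math.Sqrt(head.Sum(x => (double)x * x));
        if (norm == 0) return head;

        return head.Select(x => (float)(x / norm)).ToArray();
    }
}
```

Whichever size you pick, the vector dimension configured on the search index has to match the vectors you store, and queries must be embedded at the same size.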
Document Ingestion Pipeline
Before you can query, you need to ingest documents. Here's a practical chunking and embedding pipeline:
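using System.Text;
using Microsoft.Extensions.VectorData;
using Microsoft.SemanticKernel.Embeddings;

public class DocumentIngestionService
{
    private readonly ITextEmbeddingGenerationService _embeddingService;
    private readonly IVectorStore _vectorStore;

    // Both dependencies come from the DI container configured at startup.
    public DocumentIngestionService(
        ITextEmbeddingGenerationService embeddingService,
        IVectorStore vectorStore)
    {
        _embeddingService = embeddingService;
        _vectorStore = vectorStore;
    }

    // Note: method names in the vector store abstractions changed between
    // preview and later releases; match these calls to your installed version.
    public async Task IngestDocumentAsync(string title, string content)
    {
        var chunks = ChunkDocument(content, maxTokens: 512, overlap: 50);

        var collection = _vectorStore
            .GetCollection<string, DocumentChunk>("knowledge-base");
        await collection.CreateCollectionIfNotExistsAsync();

        foreach (var chunk in chunks)
        {
            var embedding = await _embeddingService
                .GenerateEmbeddingAsync(chunk.Text);

            chunk.Embedding = embedding;
            chunk.DocumentTitle = title;
            chunk.Id = Guid.NewGuid().ToString();

            await collection.UpsertAsync(chunk);
        }
    }

    // Paragraph-aware chunking. Token counts are approximated as 4 characters
    // per token, and the overlap is counted in words rather than tokens.
    // A single paragraph longer than the budget is kept whole, not split.
    private static List<DocumentChunk> ChunkDocument(
        string content, int maxTokens, int overlap)
    {
        var paragraphs = content.Split("\n\n", StringSplitOptions.RemoveEmptyEntries);
        var chunks = new List<DocumentChunk>();
        var currentChunk = new StringBuilder();

        foreach (var paragraph in paragraphs)
        {
            // Flush only when there is something to flush; otherwise an
            // oversized first paragraph would emit an empty chunk.
            if (currentChunk.Length > 0 &&
                currentChunk.Length + paragraph.Length > maxTokens * 4)
            {
                chunks.Add(new DocumentChunk { Text = currentChunk.ToString().Trim() });

                // Keep overlap from previous chunk
                var words = currentChunk.ToString().Split(' ');
                currentChunk.Clear();
                currentChunk.Append(string.Join(' ', words.TakeLast(overlap)));
                currentChunk.Append(' ');
            }

            currentChunk.AppendLine(paragraph);
        }

        if (currentChunk.Length > 0)
            chunks.Add(new DocumentChunk { Text = currentChunk.ToString().Trim() });

        return chunks;
    }
}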
Retrieval and Response Generation
The core RAG service ties embedding, search, and generation together:
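using Microsoft.Extensions.VectorData;
using Microsoft.SemanticKernel;
using Microsoft.SemanticKernel.ChatCompletion;
using Microsoft.SemanticKernel.Connectors.AzureOpenAI;
using Microsoft.SemanticKernel.Embeddings;

public class RagService
{
    private readonly Kernel _kernel;
    private readonly IVectorStore _vectorStore;
    private readonly ITextEmbeddingGenerationService _embeddingService;

    public RagService(
        Kernel kernel,
        IVectorStore vectorStore,
        ITextEmbeddingGenerationService embeddingService)
    {
        _kernel = kernel;
        _vectorStore = vectorStore;
        _embeddingService = embeddingService;
    }

    public async Task<RagResponse> AskAsync(
        string question, int topK = 5, float minRelevance = 0.75f)
    {
        // 1. Embed the query with the same model used at ingestion time.
        var queryEmbedding = await _embeddingService
            .GenerateEmbeddingAsync(question);

        // 2. Fetch the top-K nearest chunks. The search method name differs
        //    between vector store package versions.
        var collection = _vectorStore
            .GetCollection<string, DocumentChunk>("knowledge-base");
        var searchResults = await collection.VectorizedSearchAsync(
            queryEmbedding, new VectorSearchOptions { Top = topK });

        // 3. Drop weak matches. The score scale depends on the store and the
        //    distance metric, so tune minRelevance against real queries.
        var relevantChunks = new List<DocumentChunk>();
        await foreach (var result in searchResults.Results)
            if (result.Score >= minRelevance)
                relevantChunks.Add(result.Record);

        // Refusing is better than answering without grounding.
        if (relevantChunks.Count == 0)
            return new RagResponse("Not enough information in the documentation.", []);

        // 4. Build a grounded prompt that labels each chunk with its source.
        var context = string.Join("\n\n---\n\n",
            relevantChunks.Select(c => $"[Source: {c.DocumentTitle}]\n{c.Text}"));

        var prompt = $"""
            Answer based ONLY on the provided context. Cite sources.
            ## Context
            {context}
            ## Question
            {question}
            """;

        // 5. Generate the answer with a low temperature for factual output.
        var chatService = _kernel.GetRequiredService<IChatCompletionService>();
        var chatHistory = new ChatHistory();
        chatHistory.AddUserMessage(prompt);

        var response = await chatService.GetChatMessageContentAsync(
            chatHistory,
            new AzureOpenAIPromptExecutionSettings
            {
                Temperature = 0.1f,
                MaxTokens = 1024,
            });

        return new RagResponse(
            response.Content ?? "Unable to generate a response.",
            relevantChunks.Select(c => c.DocumentTitle).Distinct().ToList());
    }
}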
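The snippets above reference `AskRequest`, `RagResponse`, and `DocumentChunk` without defining them. The shapes below are inferred from how the code uses them, so treat them as an assumption rather than the original definitions. In particular, the vector store connector expects the key, data, and vector properties of `DocumentChunk` to carry its mapping attributes; those attribute names differ between package versions, so they are left out of this sketch.

```csharp
using System;
using System.Collections.Generic;

// Quick shape check: these are the members the earlier snippets rely on.
var chunk = new DocumentChunk { Id = "1", DocumentTitle = "Handbook", Text = "Hello" };
RagResponse empty = new("Not enough information in the documentation.", []);
Console.WriteLine($"{chunk.DocumentTitle}: {empty.Sources.Count} sources");

// Request body for POST /api/ask.
public record AskRequest(string Question);

// The answer plus the distinct document titles it was grounded on.
public record RagResponse(string Answer, IReadOnlyList<string> Sources);

// One stored chunk. Add the vector store key/data/vector attributes that
// match your installed package version before using it with a real store.
public class DocumentChunk
{
    public string Id { get; set; } = string.Empty;
    public string DocumentTitle { get; set; } = string.Empty;
    public string Text { get; set; } = string.Empty;

    // Matches the ReadOnlyMemory<float> returned by the embedding service.
    public ReadOnlyMemory<float> Embedding { get; set; }
}
```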
Improving Retrieval Quality
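Pure vector search gets you most of the way: roughly 70%, as a rule of thumb rather than a benchmark. To reach 90% or better, add hybrid search (vector + keyword) and query expansion: ask the LLM to rephrase the query into 2-3 alternative phrasings, search with each, and merge the results. Query expansion alone can improve answer accuracy by around 25%, though the gain depends heavily on your data, so measure it against your own evaluation set.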
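Both techniques leave you with several ranked result lists to merge: one per search mode or per query phrasing. One common way to combine them is Reciprocal Rank Fusion (RRF), sketched below. The post does not prescribe a merge strategy, so this is an assumption; the chunk IDs are made up, and k = 60 is the constant conventionally used with RRF.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Ranked chunk IDs per search, best first. The IDs are made up for illustration.
var vectorHits = new[] { "c1", "c2", "c3" };
var keywordHits = new[] { "c3", "c1", "c4" };
var rephrasedHits = new[] { "c3", "c5" };

var fused = RankFusion.Fuse(new[] { vectorHits, keywordHits, rephrasedHits }, topK: 3);
Console.WriteLine(string.Join(", ", fused)); // prints "c3, c1, c2"

public static class RankFusion
{
    // RRF: score(id) = sum over lists of 1 / (k + rank), with rank starting at 1.
    // Items that rank well in several lists rise to the top.
    public static List<string> Fuse(
        IEnumerable<IReadOnlyList<string>> rankedLists, int topK, int k = 60)
    {
        var scores = new Dictionary<string, double>();

        foreach (var list in rankedLists)
            for (var i = 0; i < list.Count; i++)
                scores[list[i]] = scores.GetValueOrDefault(list[i]) + 1.0 / (k + i + 1);

        return scores
            .OrderByDescending(p => p.Value)
            .ThenBy(p => p.Key, StringComparer.Ordinal) // deterministic tie-break
            .Take(topK)
            .Select(p => p.Key)
            .ToList();
    }
}
```

Here `c3` wins because it appears in all three lists, even though it is only third in the vector results. If your search service offers built-in hybrid ranking, prefer that for the vector + keyword case and keep a merge like this for combining the expanded queries.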
Chunking matters most: Start with 512-token paragraph-aware chunks with 50-token overlap. For code-heavy content, split on function/class boundaries instead.
Production tips:
- Keep context to 3-5 chunks (~2000-3000 tokens) — more context means more noise
- Cache embeddings with a 24-hour TTL to reduce costs
- Build an evaluation dataset early — you can't improve what you can't measure
- Log every query, retrieved chunks, and generated response for debugging and quality tracking
- Set up relevance scoring thresholds — returning "I don't know" is better than a hallucinated answer
- Consider implementing a feedback loop where users can rate answer quality to continuously improve your retrieval pipeline
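As one illustration of the caching tip, here is a minimal in-memory sketch of a query-embedding cache with a time-to-live. The `EmbeddingCache` class is hypothetical, the embedding call is passed in as a delegate so the example runs without any service, and keying on trimmed, lower-cased text assumes that case differences do not matter for your queries.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

var calls = 0;

// Stand-in for the real embedding call; counts how often the "model" is hit.
Task<float[]> FakeEmbed(string text)
{
    calls++;
    return Task.FromResult(new[] { (float)text.Length });
}

var cache = new EmbeddingCache(FakeEmbed, TimeSpan.FromHours(24));
await cache.GetAsync("How do I reset my password?");
await cache.GetAsync("how do i reset my password?  "); // same key after normalization
Console.WriteLine($"Embedding calls: {calls}"); // prints "Embedding calls: 1"

public sealed class EmbeddingCache
{
    private readonly ConcurrentDictionary<string, (float[] Vector, DateTimeOffset ExpiresAt)> _entries = new();
    private readonly Func<string, Task<float[]>> _embed;
    private readonly TimeSpan _ttl;
    private readonly Func<DateTimeOffset> _now;

    // The clock is injectable so expiry can be tested without waiting.
    public EmbeddingCache(
        Func<string, Task<float[]>> embed, TimeSpan ttl, Func<DateTimeOffset>? now = null)
    {
        _embed = embed;
        _ttl = ttl;
        _now = now ?? (() => DateTimeOffset.UtcNow);
    }

    public async Task<float[]> GetAsync(string text)
    {
        var trimmed = text.Trim();
        var key = trimmed.ToLowerInvariant();

        // Serve from cache while the entry is still fresh.
        if (_entries.TryGetValue(key, out var entry) && entry.ExpiresAt > _now())
            return entry.Vector;

        // Miss or expired: embed again and overwrite the entry.
        var vector = await _embed(trimmed);
        _entries[key] = (vector, _now() + _ttl);
        return vector;
    }
}
```

This sketch never evicts expired entries and does not deduplicate concurrent misses for the same key. For production use, a bounded cache such as `IMemoryCache` or a distributed cache is the safer choice.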
Key Takeaways
- Chunking is the most important decision. Start with 512-token chunks and iterate based on retrieval quality.
- Use hybrid search from day one — pure vector search misses exact-match scenarios.
- Keep temperature low (0.1) for grounded, factual responses.
- Query expansion delivers outsized impact with minimal effort.
- Cache embeddings aggressively — the same queries hit your system repeatedly.
- Build evaluation infrastructure early — RAG quality is hard to judge manually.
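A first evaluation harness can be very small. The sketch below computes two standard retrieval metrics, hit rate at k and mean reciprocal rank (MRR), over a hand-labeled set of questions. The `EvalCase` shape, the sample questions, and the document titles are all hypothetical; in practice the retrieved titles would come from running each question through your own retrieval step.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical evaluation set: the question, the document that should be
// retrieved, and what the retriever actually returned (best first).
var cases = new List<EvalCase>
{
    new EvalCase("How do I reset my password?", "Account Guide", new[] { "Account Guide", "Billing FAQ" }),
    new EvalCase("When are invoices sent?", "Billing FAQ", new[] { "Account Guide", "Billing FAQ" }),
    new EvalCase("What is the SLA?", "Support Policy", new[] { "Account Guide", "Billing FAQ" }),
};

Console.WriteLine($"Hit rate@2: {RetrievalMetrics.HitRate(cases, k: 2):F2}");
Console.WriteLine($"MRR:        {RetrievalMetrics.MeanReciprocalRank(cases):F2}");

public record EvalCase(string Question, string ExpectedTitle, IReadOnlyList<string> RetrievedTitles);

public static class RetrievalMetrics
{
    // Fraction of questions whose expected document appears in the top k results.
    public static double HitRate(IReadOnlyCollection<EvalCase> cases, int k) =>
        cases.Count == 0
            ? 0
            : cases.Count(c => c.RetrievedTitles.Take(k).Contains(c.ExpectedTitle)) / (double)cases.Count;

    // Average of 1 / rank of the expected document (0 when it was not retrieved).
    public static double MeanReciprocalRank(IReadOnlyCollection<EvalCase> cases) =>
        cases.Count == 0
            ? 0
            : cases.Average(c =>
            {
                var index = c.RetrievedTitles.ToList().IndexOf(c.ExpectedTitle);
                return index < 0 ? 0.0 : 1.0 / (index + 1);
            });
}
```

With these three cases the hit rate at 2 is 2/3 and the MRR is 0.5. Re-running the same set after each chunking or retrieval change shows whether the change actually helped.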
Ajit Gangurde
Software Engineer II at Microsoft | 15+ years in .NET & Azure