RAG vs Fine-Tuning for Defense NLP Applications

RAG grounds language model outputs in specific source documents with auditable attribution, while fine-tuning adapts the model's parametric knowledge to a domain. For defense intelligence requiring traceable evidence chains, RAG dominates — but embedding fine-tuning within the RAG pipeline combines both strengths, achieving 94.2% retrieval accuracy compared to 87.3% for untuned baselines.

"RAG and fine-tuning combined consistently outperform either approach alone — but for applications where knowledge must be traceable to specific documents, retrieval-augmented generation is not optional. It is the only architecture that provides the attribution chain that high-stakes decisions require."

The choice between RAG and fine-tuning is the most consequential architectural decision for any defense AI system that processes intelligence documents (see also how RAG systems work for defense applications). The decision determines whether generated outputs are grounded in citable source material (RAG) or drawn from the model's internalized knowledge (fine-tuning) — a distinction with direct implications for attribution, auditability, and institutional trust.

Lewis et al. introduced retrieval-augmented generation in their 2020 NeurIPS paper Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks, demonstrating that coupling a pre-trained model with an external retrieval mechanism improved factual accuracy on knowledge-intensive tasks. The 2024 survey by Gao et al., Retrieval-Augmented Generation for Large Language Models, documented the subsequent evolution of RAG architectures and their growing adoption in enterprise and government applications.

How RAG and Fine-Tuning Differ

RAG retrieves evidence from an external document collection at query time and generates text grounded in that evidence. Fine-tuning modifies the model's weights using domain-specific training data, embedding domain knowledge into the model's parameters. The fundamental distinction is whether knowledge lives in the documents (RAG) or in the model (fine-tuning).
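The query-time flow described above can be sketched as a minimal retrieve-then-ground loop. Everything in this sketch is illustrative: the bag-of-words `embed` function stands in for a dense embedding model, and the report IDs and texts are invented.

```python
from collections import Counter
import math

def embed(text):
    # Toy bag-of-words vector; a real pipeline would use a dense
    # embedding model (ideally one tuned to domain vocabulary).
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query, corpus, k=2):
    # Query-time retrieval: rank source passages against the query.
    q = embed(query)
    ranked = sorted(corpus.items(),
                    key=lambda kv: cosine(q, embed(kv[1])), reverse=True)
    return ranked[:k]

def build_grounded_prompt(question, evidence):
    # The generator sees only retrieved, citable passages, so every
    # claim in its output can be traced to a document ID.
    cited = "\n".join(f"[{doc_id}] {text}" for doc_id, text in evidence)
    return f"Answer using only the cited evidence:\n{cited}\n\nQuestion: {question}"

corpus = {
    "RPT-014": "Vessel X conducted ship-to-ship transfers outside port limits.",
    "RPT-022": "Routine fisheries patrol reported no anomalies.",
}
evidence = retrieve("ship-to-ship transfers by Vessel X", corpus, k=1)
prompt = build_grounded_prompt("What links Vessel X to sanctions evasion?", evidence)
```

The fine-tuning alternative has no analogue of `evidence` here: the answer would come from model weights, with nothing to cite.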

| Dimension | RAG | Model Fine-Tuning | Embedding Fine-Tuning (within RAG) |
| --- | --- | --- | --- |
| Knowledge source | External documents (query-time retrieval) | Model parameters (training-time absorption) | External documents with domain-adapted retrieval |
| Attribution | Sentence-level citation to source documents | None — model generates from parametric memory | Sentence-level citation with improved retrieval accuracy |
| Auditability | Full — every claim traces to a source passage | None — no traceable evidence chain | Full — with higher-accuracy evidence selection |
| Knowledge update | Add documents to the collection (no retraining) | Retrain model on updated data (expensive) | Add documents + periodic embedding re-tuning |
| Hallucination control | Constrained by retrieved evidence | Model may generate plausible but unsupported claims | Constrained by more accurately retrieved evidence |
| Domain vocabulary | Depends on embedding model's vocabulary coverage | Model learns domain vocabulary during fine-tuning | Embedding model adapts to domain vocabulary |
| Cost | Inference cost + vector database infrastructure | High training cost + ongoing inference | Moderate fine-tuning cost + inference + vector database |
| Defense suitability | High — meets attribution requirements | Low — no provenance chain for intelligence products | Highest — attribution + domain accuracy |

Why RAG Dominates Defense Intelligence

Defense intelligence products require attribution chains — every analytical judgment must trace to source reporting. Fine-tuned models generate from internalized knowledge without citable sources, making their outputs unsuitable for formal intelligence products regardless of accuracy.

According to GDIT's 2025 analysis, How Adaptive RAG Makes Generative AI More Reliable for Defense Missions, RAG is the most reliable deployment methodology for generative AI services in defense contexts because it grounds model outputs in authoritative document collections rather than parametric memory.

The attribution requirement is not a best practice — it is an institutional standard. Intelligence products that cannot trace claims to source reporting cannot be used in formal assessments, warning products, or decision briefs. A fine-tuned model that generates a correct assessment without citing its sources produces text that is analytically useful but institutionally unusable.

The NGA's 2025 deployment of automated intelligence products, reported by Military.com, uses RAG-based architectures for this reason — the generated products must trace to the source reporting that supports them.

The Case for Embedding Fine-Tuning Within RAG

Pure RAG with general-purpose embeddings leaves retrieval accuracy on the table. Embedding fine-tuning within the RAG pipeline adapts the retrieval layer to defense-domain vocabulary without sacrificing the attribution architecture that RAG provides.

Domain-specific embedding fine-tuning, as documented in a 2024 Voyage AI study, improves retrieval accuracy by 6 to 7 percentage points on average compared to general-purpose embeddings. A joint Cisco and NVIDIA 2024 enterprise fine-tuning study reported similar improvements in regulated industries. The research by Karpukhin et al. in Dense Passage Retrieval for Open-Domain Question Answering (EMNLP, 2020) established that retrieval quality is primarily an encoder problem.
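Retrieval encoders in the DPR lineage are typically trained with an in-batch-negatives contrastive objective over query–passage pairs. A minimal sketch of that loss on toy vectors follows; this is a generic illustration of the technique, not DLRA's or any vendor's actual training code.

```python
import math

def info_nce_loss(query_vecs, passage_vecs, temperature=0.05):
    # In-batch-negatives contrastive loss (DPR-style): query i should
    # score highest against passage i; every other passage in the
    # batch serves as a free negative.
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    losses = []
    for i, q in enumerate(query_vecs):
        scores = [dot(q, p) / temperature for p in passage_vecs]
        log_denom = math.log(sum(math.exp(s) for s in scores))
        losses.append(log_denom - scores[i])  # -log softmax(positive)
    return sum(losses) / len(losses)

# Aligned query/passage pairs yield near-zero loss; mismatched pairs
# are heavily penalized. That penalty is the gradient signal that
# pulls the encoder toward domain vocabulary during fine-tuning.
aligned = info_nce_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = info_nce_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```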

DLRA's approach combines RAG architecture (for attribution and auditability) with embedding fine-tuning (for domain accuracy): domain-tuned embeddings achieve 94.2% top-5 retrieval accuracy on defense intelligence documents compared to 87.3% for general-purpose embeddings on the same evaluation set. The generation model operates downstream of this retrieval layer, citing only the evidence that the domain-tuned retrieval surfaces.
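The 94.2% and 87.3% figures are top-5 retrieval accuracy, a metric that is straightforward to compute over any labeled evaluation set. A sketch (the document IDs here are invented):

```python
def top_k_accuracy(ranked_results, gold_ids, k=5):
    # Fraction of queries whose gold passage ID appears anywhere in
    # the top-k retrieved results for that query.
    hits = sum(gold in ranked[:k]
               for ranked, gold in zip(ranked_results, gold_ids))
    return hits / len(gold_ids)

# Two toy queries: the retriever surfaces the gold document for the
# first query but misses it entirely for the second.
ranked = [["d3", "d7", "d1", "d9", "d2"],
          ["d8", "d4", "d6", "d5", "d0"]]
gold = ["d1", "d2"]
print(top_k_accuracy(ranked, gold))  # 0.5
```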

This combined approach outperforms both pure RAG (general-purpose embeddings, approximately 87% accuracy) and pure fine-tuning (no attribution chain) for defense intelligence applications. Benchmarks demonstrating these improvements on defense intelligence tasks are published at defense-nlp-benchmarks.

Comparison: Three Approaches Applied to Defense Intelligence

Three architectural approaches — pure RAG, pure model fine-tuning, and RAG with embedding fine-tuning — produce different outcomes when applied to common defense intelligence scenarios. The following comparison shows how each approach handles threat report triage, entity extraction, and assessment drafting.

| Scenario | Pure RAG (General Embeddings) | Pure Model Fine-Tuning | RAG + Embedding Fine-Tuning |
| --- | --- | --- | --- |
| Analyst queries "What indicators link Vessel X to sanctions evasion?" | Retrieves passages from threat reports. ~87% chance the top-5 results contain the correct evidence. | Generates an answer from parametric memory. May be accurate but cannot cite which report contained the indicators. | Retrieves passages with domain-tuned accuracy. ~94% chance the top-5 results contain the correct evidence, with source citations. |
| Analyst requests a threat assessment brief | Generates brief from retrieved evidence. Attribution present but some retrieved passages may be near-misses. | Generates fluent brief without source attribution. Cannot be used as a formal intelligence product. | Generates brief from more accurately retrieved evidence. Sentence-level provenance with 94.2% retrieval accuracy foundation. |
| New threat reports arrive (100 per day) | Added to document collection immediately. No retraining needed. | Model does not know about new reports until retrained (days to weeks). | Added to document collection immediately. Embedding model re-tuned periodically (quarterly). |

When Fine-Tuning (Without RAG) Makes Sense

Model fine-tuning without RAG is appropriate for defense applications that do not require attribution to specific source documents — classification tasks, entity recognition, translation, and format conversion where the output is a label or structured extraction rather than a narrative assessment.

| Application | RAG Needed? | Fine-Tuning Appropriate? | Rationale |
| --- | --- | --- | --- |
| Threat assessment drafting | Yes | Embedding fine-tuning within RAG | Attribution chain required |
| Report triage and prioritization | Yes | Embedding fine-tuning within RAG | Evidence-based relevance scoring |
| Named entity recognition | No | Yes | Output is structured labels, not narrative |
| Document classification | No | Yes | Output is a category, not a cited claim |
| Translation | No | Yes | Output is translated text, not analysis |
| Signal-to-text preprocessing | No | Yes | Output is structured text, not analytical product |

Cost and Infrastructure Comparison

| Factor | Pure RAG | Pure Fine-Tuning | RAG + Embedding Fine-Tuning |
| --- | --- | --- | --- |
| Initial setup | Vector database + document ingestion pipeline | Training data curation + fine-tuning compute (GPU-weeks) | Vector database + training data + embedding fine-tuning (GPU-hours) |
| Ongoing compute | Retrieval + generation inference | Generation inference only | Retrieval + generation inference |
| Knowledge updates | Add documents (minutes) | Retrain model (days–weeks) | Add documents (minutes) + periodic re-tune (hours, quarterly) |
| Infrastructure | Vector DB + LLM serving | LLM serving only | Vector DB + LLM serving |
| Data requirements | Document corpus | Thousands of labeled training examples | Document corpus + query-passage triplets |
| Sovereignty | Vector DB and LLM on sovereign infrastructure | Fine-tuned model on sovereign infrastructure | Vector DB and models on sovereign infrastructure |
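The GPU-weeks versus GPU-hours gap in the setup row can be made concrete with back-of-envelope arithmetic. All figures below are assumptions chosen only to match those scales, not measured costs:

```python
# Back-of-envelope GPU-hour comparison (assumed, illustrative figures).
embedding_ft_gpu_hours = 6 * 1    # ~6 hours on a single GPU
model_ft_gpu_hours = 10 * 24 * 8  # ~10 days on an 8-GPU node

ratio = model_ft_gpu_hours / embedding_ft_gpu_hours
print(f"Model fine-tuning uses ~{ratio:.0f}x the GPU-hours")
```

Under these assumptions the difference is roughly two to three orders of magnitude, which is why periodic embedding re-tuning is affordable where full-model retraining is not.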

According to Deloitte's 2024 report The Future of Intelligence Analysis, IC analysts could reclaim roughly 364 hours per analyst per year with AI-enabled support. The architectural choice — RAG, fine-tuning, or the combination — determines whether those reclaimed hours come with the attribution quality that intelligence institutions require.

Frequently Asked Questions

Is RAG or fine-tuning better for defense intelligence? For intelligence analysis that requires auditable attribution — threat assessments, intelligence briefs, warning products — RAG is the required architecture because it provides traceable evidence chains. Embedding fine-tuning within the RAG pipeline adds domain accuracy (94.2% vs. 87.3%) while preserving attribution. Pure model fine-tuning without RAG is appropriate for structured extraction tasks (NER, classification) that do not require narrative attribution.

Can RAG and fine-tuning be combined? The most effective approach for defense intelligence combines RAG architecture (for attribution) with embedding fine-tuning (for domain accuracy). This is DLRA's approach: domain-tuned embeddings improve retrieval accuracy by 6.9 percentage points while the RAG architecture maintains sentence-level provenance for every generated claim.

Why not just fine-tune a larger model instead of using RAG? A fine-tuned model generates from parametric memory without citable sources. For defense intelligence products that require attribution chains, this is institutionally unusable regardless of the model's accuracy. RAG provides the attribution architecture, and embedding fine-tuning within RAG provides the domain accuracy.

How expensive is embedding fine-tuning compared to model fine-tuning? Embedding fine-tuning (300M–1B parameter models) completes in hours on a single GPU server. Full model fine-tuning (7B–70B+ parameter models) requires days to weeks on multi-GPU clusters. Embedding fine-tuning is orders of magnitude less expensive while providing the highest-impact accuracy improvement for retrieval-dependent workflows.

How does RAG handle knowledge updates? New documents are added to the vector database immediately — no model retraining required. This is a critical advantage for intelligence organizations that receive hundreds of new reports daily. Embedding models are re-fine-tuned periodically (typically quarterly) to accommodate vocabulary evolution, but the document collection itself updates continuously.
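The continuous-update property shows up even in a toy index: ingestion is embed-and-store, so a report is searchable the moment it is added. In this sketch, sparse term counts stand in for dense embeddings and the report IDs are invented.

```python
from collections import Counter

class VectorIndex:
    # Minimal in-memory index: adding a document requires no model
    # retraining, so new reports become queryable immediately.
    def __init__(self):
        self.vectors = {}  # doc_id -> sparse term-count vector

    def add(self, doc_id, text):
        self.vectors[doc_id] = Counter(text.lower().split())

    def search(self, query, k=5):
        q = Counter(query.lower().split())
        def score(doc_id):
            v = self.vectors[doc_id]
            return sum(q[t] * v[t] for t in q)
        return sorted(self.vectors, key=score, reverse=True)[:k]

index = VectorIndex()
index.add("RPT-101", "vessel x loitered near transfer point")
index.add("RPT-200", "new indicators link vessel x to transfer activity")
results = index.search("vessel x transfer", k=5)  # both reports, instantly
```

A production deployment would use a dense-vector store instead, but the update path is the same: embed the new document, insert it, done; only the periodic embedding re-tune touches model weights.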