💡 Lessons Learned: Implementing a Text-to-SQL Platform in Production
This page documents key lessons and practical insights from building a Text-to-SQL platform at a large tech company with a massive data lake. The goal: share hands-on experience with technical readers building scalable LLM-powered SQL systems.
🛠️ 1. Starting Small: The Early Roadmap
We began last year, when context-window limits were a major bottleneck (many models at the time accepted only ~2,048–4,096 prompt tokens). Our initial experiment used a model we refer to as Llama NSQL and an open-source platform called Vanna, which required three core context inputs:
- Database schema / DDL – all tables, relationships, constraints.
- Documentation / metadata – table/column business meanings, comments, descriptions.
- Question–SQL example pairs – natural language questions mapped to correct SQL for your schema.
With just these, Vanna could generate near-perfect SQL on a small, simple database. It was essentially a Retrieval-Augmented Generation (RAG) approach – ideal for simple cases.
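As a sketch of how those three context inputs combine at query time, here is a minimal RAG-style prompt builder. It is illustrative only, not Vanna's actual API: `score_overlap`, the input dictionaries, and the template wording are all assumptions (real systems use embedding similarity rather than word overlap).

```python
def score_overlap(question: str, text: str) -> int:
    """Toy relevance score: count of shared lowercase words."""
    return len(set(question.lower().split()) & set(text.lower().split()))

def build_prompt(question: str, ddl: dict, docs: dict, qa_pairs: list, k: int = 2) -> str:
    """Retrieve the k most relevant tables and example pairs, then fill a prompt template."""
    tables = sorted(
        ddl,
        key=lambda t: score_overlap(question, t + " " + docs.get(t, "")),
        reverse=True,
    )[:k]
    examples = sorted(qa_pairs, key=lambda p: score_overlap(question, p[0]), reverse=True)[:k]
    parts = ["You are a SQL-generation agent.", "Schema:"]
    parts += [ddl[t] for t in tables]
    parts.append("Examples:")
    parts += [f"Q: {q}\nSQL: {s}" for q, s in examples]
    parts += ["Question:", question, "Generate optimized SQL."]
    return "\n".join(parts)
```

With a toy two-table schema, a question about orders pulls in only the `orders` DDL and the closest example pair, keeping the prompt small.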
📈 2. Scaling & Roadmap Visual
We mapped out the following sequence:
| Phase | Focus | Challenge |
|---|---|---|
| Phase 1 | Small DB, simple tables | Lightweight context, manageable token count |
| Phase 2 | Multiple tables, complex joins | Schema grows, context window becomes a limit |
| Phase 3 | Agent-based system, multi-tool orchestration | Latency + many LLM calls + retrieval overhead |
| Phase 4 | Production optimization | Reduce end-to-end latency, manage hallucination, scale |
✅ Visual roadmap (you can insert a timeline graphic): Start → Scale → Agentization → Optimization
📊 3. Scaling Challenges
When applying to many tables and complex joins:
- Latency grew significantly.
- SQL accuracy dropped for multi-table, multi-join queries.
We explored community tools and discovered Bird, which hosts Text-to-SQL challenges. The CHESS system from its leaderboard inspired us to adopt an agent-based architecture for reasoning + SQL generation.
⚡ 4. Latency & Context Window Constraints
In our early agent implementation:
- Many LLM calls per query (tool chaining + retrieval).
- Model context window couldn't accommodate full schema + metadata + examples.
- A single query execution sometimes took ~30 minutes.
We realized: we must optimize what we send into the model and how often.
🧹 5. Workflow Optimization & Code Snippet
We reduced end-to-end execution to ~30 seconds by:
- Pruning irrelevant schema parts.
- Chunking metadata.
- Minimizing LLM calls (fewer hops).
- Tightening prompt structure.
Example: Prompt Template

```
You are a SQL-generation agent.
Schema:
{{ddl_snippet}}
Metadata:
{{metadata_snippet}}
Examples:
{{qa_pairs}}
Question:
{{user_question}}
Generate optimized SQL.
```

Example: Preprocessing Step
```python
# Prune schema to just the tables relevant to the question
relevant_tables = extract_relevant_tables(question, full_schema)
ddl_snippet = format_ddl(relevant_tables)
```

Important highlight: token budget matters. Research on schema pruning in Text-to-SQL (see arXiv) shows models degrade when input tokens exceed ~4,000–8,000. In our case, we trimmed context to a consistent ~3,000–4,000 tokens per query.
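A sketch of that trimming step, under the assumption of a crude chars/4 token estimate (real budgeting should use the model's own tokenizer); `trim_to_budget` and the priority ordering of sections are hypothetical:

```python
def estimate_tokens(text: str) -> int:
    """Rough heuristic: ~4 characters per token for English/SQL text (tokenizers vary)."""
    return max(1, len(text) // 4)

def trim_to_budget(sections: list, budget: int = 3500) -> list:
    """Keep whole context sections, in most-relevant-first order, until the budget is spent."""
    kept, used = [], 0
    for section in sections:
        cost = estimate_tokens(section)
        if used + cost > budget:
            break  # drop this section and everything less relevant
        kept.append(section)
        used += cost
    return kept
```

Dropping whole sections (rather than truncating mid-DDL) keeps each retained table definition syntactically intact.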
🧠 6. Key Technical Lessons
- Data quality is king. Good metadata + correct examples yield better SQL.
- LLM evolution brings new trade-offs. Bigger context windows help (e.g., GPT-4/5-class models), but hallucinations increase.
- Multi-agent architecture is complex. Agents need communication protocols, orchestration, failure modes.
- Prompt + data = true asset. Models are tools; your input quality defines output.
- You cannot be purely agile. Rapid AI evolution means you must plan your architecture, toolchain, and scaling strategy carefully.
📌 7. Practical Advice
- Begin with a narrow scope (few tables) to validate your method.
- Use RAG with schema + metadata + QA examples.
- Prune and chunk your context to stay within model limits (~3,000–4,000 tokens).
- Minimize LLM calls: combine retrieval + generation steps.
- Monitor SQL accuracy, hallucination, execution failures.
- Build modular architecture: retrieval module, generation module, revision module.
- Document everything: schema changes, prompt iterations, agent flows.
❓ 8. FAQ & Follow-up Questions
This section addresses common questions and clarifications that often arise when implementing Text-to-SQL systems in production.
Q1: How do you handle schema changes without breaking the system?
Answer: Schema changes are one of the biggest operational challenges. We recommend:
- Version-controlled metadata: Store your schema documentation, DDL, and QA examples in git with clear versioning
- Automated schema validation: Run automated tests that check if the current model can still generate valid SQL for known questions after schema changes
- Graceful degradation: If the model references deprecated tables/columns, have a validation layer that catches this before execution
- Notification pipeline: Alert the data team when schema changes impact high-usage query patterns
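The "graceful degradation" layer above can be sketched as follows. This is a simplified illustration: it catches only table names via a regex on FROM/JOIN clauses, whereas a production check would use a real SQL parser; `find_unknown_tables` is a hypothetical helper.

```python
import re

def find_unknown_tables(sql: str, known_tables: set) -> set:
    """Flag table names referenced after FROM/JOIN that no longer exist in the live schema."""
    referenced = {
        m.group(1).lower()
        for m in re.finditer(r"\b(?:FROM|JOIN)\s+([A-Za-z_][A-Za-z0-9_]*)", sql, re.IGNORECASE)
    }
    return referenced - {t.lower() for t in known_tables}
```

If the returned set is non-empty, the query is rejected before execution and the user (or a repair agent) is told which tables are gone.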
Q2: What's the right balance between context window size and retrieval quality?
Answer: This is a trade-off we constantly tune. Our findings:
- Sweet spot: ~3,000–4,000 tokens per query (as mentioned in section 5)
- Over-stuffing context: Beyond ~8k tokens, model attention degrades and hallucinations increase
- Retrieval precision over quantity: Better to include 3 highly relevant tables + metadata than 20 tables with sparse details
- Chunking strategy: Break large schemas into semantic chunks (related tables, common join paths) and retrieve only what's needed
Q3: How do you measure SQL accuracy beyond "it runs"?
Answer: Execution success is necessary but not sufficient. We track:
- Exact match: Does SQL match a human-written reference query?
- Execution accuracy: Does it run without errors?
- Result correctness: Does it return the right answer (using test datasets with known expected results)?
- Semantic equivalence: Is the SQL logically correct even if syntactically different from reference?
- User feedback: Do users accept the result or rephrase the query?
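Result correctness and semantic equivalence can both be approximated by executing candidate and reference SQL against a small test database and comparing the returned rows. A sketch using SQLite (`same_results` is a hypothetical helper; queries where row order matters need extra care):

```python
import sqlite3

def same_results(candidate_sql: str, reference_sql: str, conn: sqlite3.Connection) -> bool:
    """Treat two queries as equivalent if they return the same multiset of rows."""
    rows_a = sorted(conn.execute(candidate_sql).fetchall())
    rows_b = sorted(conn.execute(reference_sql).fetchall())
    return rows_a == rows_b
```

This catches the "syntactically different but logically correct" case: two filters written differently pass as long as they select the same rows on the test data.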
Q4: Should we fine-tune a model or stick with prompt engineering?
Answer: Start with prompt engineering + RAG. Consider fine-tuning only when:
- You have high-quality, domain-specific training data (thousands of NL → SQL pairs)
- Your schema is highly specialized (industry-specific terminology, non-standard patterns)
- You've exhausted prompt optimizations and still see consistent failure patterns
- You have ML infrastructure for model training, deployment, and monitoring
For most teams, prompt engineering + schema pruning + good metadata yields better ROI.
Q5: How do you handle multi-turn conversations and follow-up questions?
Answer: Conversational context adds complexity. Our approach:
- Context windowing: Maintain previous 2–3 turns of question→SQL→result in the prompt
- Reference resolution: Explicitly track pronouns and references ("it", "those", "same period")
- Intent classification: Detect if a follow-up is a refinement, new query, or result exploration
- Session state: Store intermediate results in memory so follow-ups can reference them
- Fallback to explicit: When ambiguous, ask the user to clarify rather than guess
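The context-windowing idea can be sketched with a bounded buffer of recent turns; `SessionContext` is a hypothetical illustration:

```python
from collections import deque

class SessionContext:
    """Keep the last few (question, sql, result_summary) turns for prompt construction."""

    def __init__(self, max_turns: int = 3):
        self.turns = deque(maxlen=max_turns)  # oldest turn evicted automatically

    def add(self, question: str, sql: str, result_summary: str) -> None:
        self.turns.append((question, sql, result_summary))

    def render(self) -> str:
        """Serialize retained turns for inclusion in the next prompt."""
        return "\n".join(f"Q: {q}\nSQL: {s}\nResult: {r}" for q, s, r in self.turns)
```

Capping at 2–3 turns keeps conversational context from eating the token budget that section 5 fought to reclaim.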
✨ 9. Best Practices & Improvements Roadmap
Based on our production experience, here's a prioritized checklist of improvements and best practices, organized by implementation effort and impact.
🚀 Quick Wins (Implement in 1–2 weeks)
| Priority | Practice | Impact |
|---|---|---|
| ⭐⭐⭐ | Add SQL validation layer | Catches syntax errors before execution |
| ⭐⭐⭐ | Implement query logging | Enables debugging and usage analysis |
| ⭐⭐ | Create 20–50 seed QA pairs | Immediate accuracy boost for common patterns |
| ⭐⭐ | Add schema metadata comments | Better schema understanding |
| ⭐ | Basic rate limiting | Prevents cost overruns |
Implementation details:
```python
from typing import Tuple

import sqlglot  # third-party SQL parser: pip install sqlglot

# SQL validation example
def validate_sql(sql: str) -> Tuple[bool, str]:
    """Check SQL for obvious errors before execution."""
    try:
        sqlglot.parse(sql, dialect="your_db")  # use your dialect, e.g. "postgres"
        # Further checks worth adding:
        # - SELECT * (often problematic)
        # - missing WHERE clauses (accidental full scans)
        # - unsupported functions
        return True, "Valid"
    except Exception as e:
        return False, str(e)
```

🗓️ Medium-term Improvements (1–2 months)
| Feature | Description | Benefit |
|---|---|---|
| Schema pruning agent | Pre-process query to identify relevant tables | Reduces context tokens by 50–70% |
| Self-correction loop | Model reviews and fixes its own SQL | Catches 15–30% of errors |
| Few-shot example retrieval | Dynamically fetch relevant QA examples | Improves accuracy for complex patterns |
| Query result cache | Cache identical questions + SQL | Reduces latency for repeated queries |
| A/B testing framework | Compare prompts, models, strategies | Data-driven optimization |
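Of these, the query result cache is the simplest to sketch. A hypothetical illustration keyed on a normalized question (a real system also needs TTLs and invalidation when underlying data changes):

```python
import hashlib

class QueryCache:
    """Cache SQL + results keyed by a normalized question, so repeated asks skip the LLM."""

    def __init__(self):
        self._store = {}

    @staticmethod
    def _key(question: str) -> str:
        # Normalize case and whitespace so trivially different phrasings hit the same entry
        normalized = " ".join(question.lower().split())
        return hashlib.sha256(normalized.encode()).hexdigest()

    def get(self, question: str):
        return self._store.get(self._key(question))

    def put(self, question: str, sql: str, rows) -> None:
        self._store[self._key(question)] = (sql, rows)
```

A cache hit removes every LLM call from the path, so for popular dashboard-style questions this is the single biggest latency win.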
Architecture sketch: question → schema pruning → example retrieval → SQL generation → validation → execution, with a self-correction loop feeding errors back to the generator.
🎯 Long-term Strategic Investments (3–6 months)
| Initiative | Effort | Strategic Value |
|---|---|---|
| Fine-tuned domain model | High | 10–25% accuracy gain for specialized schemas |
| Multi-agent architecture | High | Handles complex, multi-step queries |
| Human-in-the-loop feedback | Medium | Continual improvement, data collection |
| Usage analytics dashboard | Medium | Monitor adoption, failure patterns |
| Automated testing pipeline | High | Regression testing as schema evolves |
Key architectural patterns to explore:

1. ReAct Agent Pattern
   - Reason: Analyze question and plan approach
   - Act: Generate SQL and execute
   - Observe: Check results
   - Iterate: Refine if needed

2. Revision Agent
   - Separate "generator" and "reviewer" roles
   - Generator creates initial SQL
   - Reviewer checks against schema, constraints, business rules
   - Mimics human code review process

3. Ensemble Methods
   - Run multiple prompts/models in parallel
   - Rank results by confidence, execution success, user feedback
   - Choose best or present options to user
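A minimal sketch of the ensemble selection step: assuming candidates arrive sorted by model confidence, pick the first that executes cleanly (`pick_best` and the `execute` callback are hypothetical; real ranking would also weigh result plausibility and user feedback):

```python
def pick_best(candidates, execute):
    """Return the first candidate SQL that executes successfully, in confidence order.

    `execute` runs SQL against the target database and raises on failure.
    """
    for sql in candidates:
        try:
            execute(sql)
            return sql
        except Exception:
            continue  # this candidate failed; fall through to the next
    return None  # nothing executed cleanly; caller should ask the user or escalate
```

Execution success is a cheap, strong filter: a candidate that references a missing column is eliminated without any extra LLM call.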
📊 Monitoring & Observability Essentials
Don't build what you can't measure. Implement these metrics:
| Metric Category | Key Metrics | Target |
|---|---|---|
| Accuracy | Exact match rate, Execution success, User satisfaction | over 85% execution success |
| Latency | End-to-end time, LLM calls per query | under 30 seconds p95 |
| Cost | Tokens per query, LLM cost per user | Track weekly |
| Usage | Queries per user, Top failure patterns | Identify gaps |
| Business | Time saved vs. manual SQL writing, Adoption rate | ROI analysis |
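A sketch of per-query logging that feeds these metrics; the field names are assumptions, and a real pipeline would ship these JSON lines to your analytics store:

```python
import json
import time

def log_query(question: str, sql: str, n_llm_calls: int, n_tokens: int,
              ok: bool, started: float) -> str:
    """Emit one JSON line per query; aggregate offline for success rate, p95 latency, cost."""
    record = {
        "question": question,
        "sql": sql,
        "llm_calls": n_llm_calls,
        "tokens": n_tokens,
        "success": ok,
        "latency_s": round(time.time() - started, 3),
    }
    return json.dumps(record)
```

One line per query is enough to compute every row of the table above except the business metrics, which need adoption data from outside the system.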
🔄 10. Ongoing Reflection & Future Work
Building production-level Text-to-SQL systems isn't just coding. It's data engineering + prompt design + system architecture + human insight. Models may get better, but if your metadata is messy or your schema is outdated, you'll struggle. Human expertise remains the core.
📍 Recommended Roadmap Diagram (to include)
- Start: Small DB → RAG → Simple SQL
- Scale: Many tables → Complex joins → Agentization
- Optimize: Latency reduction → Context window workarounds → Hallucination checks
- Mature: Multi-tenant, live datalake, revision agents, usage monitoring
🔍 11. Why Text-to-SQL Is Difficult: A Clear Summary
Text-to-SQL may look simple from the outside ("just convert a sentence into SQL"), but in practice it is one of the hardest LLM application problems, involving multiple complex layers from NLP to database theory.
Here are the core challenges:
1️⃣ Natural Language Is Ambiguous & Context-Dependent
Human language is messy. Queries may be incomplete, vague, or contain hidden assumptions.
Examples:
- "show me sales last month" β sales of what? which region? which table is "sales"?
- "top customers" β by revenue? order count? last 30 days?
LLMs must infer intent and connect it to actual database meaning. This ambiguity is one of the largest barriers.
2️⃣ Mapping Intent → SQL Requires Precise Logical Translation
Converting natural language into SQL is not a direct translation – it's reasoning + planning.
A Text-to-SQL system must:
- identify entities (tables, columns, metrics)
- map user intent to schema
- generate syntactically valid SQL
- ensure semantics are correct (produces meaningful results)
Even if grammar is correct, a small semantic mistake can produce:
- wrong joins
- missing filters
- incorrect GROUP BY logic
SQL is extremely strict, while language is extremely flexible – a huge mismatch.
3️⃣ Deep Schema Understanding Is Required
A real-world database is often huge and messy:
- hundreds of tables
- thousands of columns
- complicated relationships
- inconsistent naming conventions
- implicit business rules not documented in schema
A Text-to-SQL model must behave like a database expert, not just a language model. It needs to know:
- how tables connect
- foreign key paths
- cardinality
- which columns are meaningful
- which tables should NEVER be joined
This is far beyond simple NLP.
4️⃣ Error Handling & Reliability Are Hard
A production system must gracefully handle:
- Invalid queries (e.g., wrong columns, missing tables)
- Bad user inputs (vague, incorrect, contradictory)
- Unexpected execution results (e.g., query returns millions of rows)
The system must:
- detect errors
- correct SQL
- explain issues in human language
- provide meaningful feedback
This adds a full engineering layer on top of LLM generation.
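The detect/correct steps above can be sketched as a bounded retry loop that feeds the database error back into a repair step (in practice, an LLM call); `run_with_repair` and both callbacks are hypothetical:

```python
def run_with_repair(sql, execute, repair, max_attempts=3):
    """Execute SQL; on failure, hand the error message to `repair` and retry.

    `execute` runs SQL and raises on failure; `repair(sql, error)` returns fixed SQL
    (typically by prompting an LLM with the broken query and the error text).
    """
    for _ in range(max_attempts):
        try:
            return execute(sql)
        except Exception as err:
            sql = repair(sql, str(err))  # feed the error back for self-correction
    raise RuntimeError("SQL could not be repaired within the attempt budget")
```

Bounding the attempts matters: each retry is another LLM call, and an unrepairable query should surface a human-readable explanation instead of looping.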
5️⃣ High-Quality Training Data Is Scarce & Expensive
Building accurate Text-to-SQL requires large, diverse training sets that reflect real business logic.
You need:
- thousands of NL → SQL pairs
- complete schema + metadata
- examples across many query types
- domain-specific instructions
Manually creating this data is extremely costly. Open-source datasets (Spider, Bird, WikiSQL, …) are helpful but do not match real enterprise schemas, which are wider, messier, and contain business-specific rules.
6️⃣ Integration, Testing & Production Deployment Are Complex
Integrating Text-to-SQL into a live enterprise environment is non-trivial. You must handle:
- permission models
- database connections
- latency optimization
- schema versioning
- caching
- security / compliance
- performance tuning
- user feedback loops
And then comes testing:
- unit test queries
- edge cases
- cross-schema scenarios
- load testing
- regression testing as schema changes
Testing Text-to-SQL is one of the hardest parts because the output space is huge and many SQLs are correct but not equivalent.
🎯 In Summary: Why Is Text-to-SQL Hard?
Because it requires the combination of:
- NLP understanding – interpret ambiguous human language
- Logical reasoning – plan steps and structure SQL
- Database expertise – understand schema, joins, constraints
- Software engineering – validate, execute, optimize, debug
- Data engineering – maintain metadata + training sets
- Product design – provide clear feedback to users
It's not just "generate SQL." It's the intersection of several entire disciplines.
This is why even big tech companies struggle with it.
🛠️ 12. Tools & Technologies Mentioned
- Vanna: Open-source Text-to-SQL platform
- Bird: Text-to-SQL challenge platform
- Llama NSQL: SQL-generation model used in our initial experiment
- RAG (Retrieval-Augmented Generation): Core architectural pattern
- Agent-based systems: Multi-tool orchestration approach