
πŸ’‘ Lessons Learned: Implementing a Text-to-SQL Platform in Production

This page documents key lessons and practical insights from building a Text-to-SQL platform at a large tech company with a massive data lake. The goal: share hands-on experience with technical readers who build scalable LLM-powered SQL systems.


πŸ› οΈ 1. Starting Small: The Early Roadmap

We began last year, when context-window limits were a major bottleneck (many models at the time accepted only ~2,048–4,096 prompt tokens). Our initial experiment used a model we refer to as Llama NSQL and an open-source platform called Vanna, which requires three core context inputs:

  1. Database schema / DDL β€” all tables, relationships, constraints.
  2. Documentation / metadata β€” table/column business meanings, comments, descriptions.
  3. Question–SQL example pairs β€” natural language questions mapped to correct SQL for your schema.

With just these, Vanna could generate near-perfect SQL on a small, simple database. It was essentially a RAG-based approach (Retrieval-Augmented Generation) β€” ideal for simple cases.
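Conceptually, those three inputs are simply concatenated into the model's prompt. A minimal sketch of that assembly step (helper and variable names are illustrative, not Vanna's actual API):

```python
def build_prompt(ddl: str, docs: str, examples: list[tuple[str, str]], question: str) -> str:
    """Assemble the three context inputs plus the user question into one prompt.

    Hypothetical RAG-style layout: schema, documentation, and QA pairs up front,
    the user question last so the model completes with SQL.
    """
    example_block = "\n".join(f"Q: {q}\nSQL: {sql}" for q, sql in examples)
    return (
        "You are a SQL-generation assistant.\n"
        f"Schema:\n{ddl}\n"
        f"Documentation:\n{docs}\n"
        f"Examples:\n{example_block}\n"
        f"Question: {question}\nSQL:"
    )

prompt = build_prompt(
    "CREATE TABLE sales (id INT, amount REAL, sold_at DATE);",
    "sales.amount is revenue in USD.",
    [("total revenue?", "SELECT SUM(amount) FROM sales;")],
    "How many sales rows are there?",
)
```

On a small schema, everything fits; the rest of this page is about what happens when it no longer does.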


πŸ“ˆ 2. Scaling & Roadmap Visual

We mapped out the following sequence:

| Phase | Focus | Challenge |
| --- | --- | --- |
| Phase 1 | Small DB, simple tables | Lightweight context, manageable token count |
| Phase 2 | Multiple tables, complex joins | Schema grows, context window becomes a limit |
| Phase 3 | Agent-based system, multi-tool orchestration | Latency + many LLM calls + retrieval overhead |
| Phase 4 | Production optimization | Reduce end-to-end latency, manage hallucination, scale |

β†’ Visual roadmap (timeline graphic to insert): Start β†’ Scale β†’ Agentization β†’ Optimization


πŸ”Ž 3. Scaling Challenges

When applying to many tables and complex joins:

  • Latency grew significantly.
  • SQL accuracy dropped for multi-table, multi-join queries.

We explored community tools and discovered BIRD, which hosts Text-to-SQL benchmarks and challenges. The CHESS system from its leaderboard inspired us to adopt an agent-based architecture for reasoning + SQL generation.


⚑ 4. Latency & Context Window Constraints

In our early agent implementation:

  • Many LLM calls per query (tool chaining + retrieval).
  • Model context window couldn't accommodate full schema + metadata + examples.
  • A single query execution sometimes took ~30 minutes.

We realized: we must optimize what we send into the model and how often.


🧹 5. Workflow Optimization & Code Snippet

We reduced end-to-end execution to ~30 seconds by:

  • Pruning irrelevant schema parts.
  • Chunking metadata.
  • Minimizing LLM calls (fewer hops).
  • Tightening prompt structure.

Example: Prompt Template

You are a SQL-generation agent.
Schema:
{{ddl_snippet}}
Metadata:
{{metadata_snippet}}
Examples:
{{qa_pairs}}
Question:
{{user_question}}
Generate optimized SQL.

Example: Preprocessing Step

# Prune schema
relevant_tables = extract_relevant_tables(question, full_schema)
ddl_snippet = format_ddl(relevant_tables)

Important highlight: token budget matters. Research on schema pruning in Text-to-SQL shows models degrade when input grows beyond roughly 4,000–8,000 tokens. In our case, we trimmed context until it consistently stayed at ~3,000–4,000 tokens per query.
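A minimal sketch of enforcing such a budget, assuming chunks arrive already sorted by relevance and using a rough 4-characters-per-token heuristic (a production system would use the model's real tokenizer, e.g. tiktoken):

```python
def estimate_tokens(text: str) -> int:
    # Rough heuristic: ~4 characters per token for English text and SQL.
    # Swap in your model's actual tokenizer for accurate counts.
    return max(1, len(text) // 4)

def trim_to_budget(chunks: list[str], budget_tokens: int = 3500) -> list[str]:
    """Keep the highest-priority chunks (list is pre-sorted by relevance)
    until adding the next chunk would exceed the token budget."""
    kept, used = [], 0
    for chunk in chunks:
        cost = estimate_tokens(chunk)
        if used + cost > budget_tokens:
            break
        kept.append(chunk)
        used += cost
    return kept
```

Because the most relevant chunks come first, dropping the tail is the cheapest safe way to stay inside the window.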


🧠 6. Key Technical Lessons

  1. Data quality is king. Good metadata + correct examples yield better SQL.
  2. LLM evolution brings new trade-offs. Bigger context windows (e.g., GPT-4-class and newer frontier models) help, but hallucinations can increase with longer inputs.
  3. Multi-agent architecture is complex. Agents need communication protocols, orchestration, and well-defined failure modes.
  4. Prompt + data = true asset. Models are tools; your input quality defines output quality.
  5. You cannot be purely agile. Rapid AI evolution means you must plan architecture, tool chain, and scaling strategy carefully.

🌟 7. Practical Advice

  • Begin with a narrow scope (few tables) to validate your method.
  • Use RAG with schema + metadata + QA examples.
  • Prune and chunk your context to stay within model limits (~3,000–4,000 tokens).
  • Minimize LLM calls: combine retrieval + generation steps.
  • Monitor SQL accuracy, hallucination, execution failures.
  • Build modular architecture: retrieval module, generation module, revision module.
  • Document everything: schema changes, prompt iterations, agent flows.

❓ 8. FAQ & Follow-up Questions

This section addresses common questions and clarifications that often arise when implementing Text-to-SQL systems in production.

Q1: How do you handle schema changes without breaking the system?

Answer: Schema changes are one of the biggest operational challenges. We recommend:

  • Version-controlled metadata: Store your schema documentation, DDL, and QA examples in git with clear versioning
  • Automated schema validation: Run automated tests that check if the current model can still generate valid SQL for known questions after schema changes
  • Graceful degradation: If the model references deprecated tables/columns, have a validation layer that catches this before execution
  • Notification pipeline: Alert the data team when schema changes impact high-usage query patterns

Q2: What's the right balance between context window size and retrieval quality?

Answer: This is a trade-off we constantly tune. Our findings:

  • Sweet spot: ~3,000–4,000 tokens per query (as mentioned in section 5)
  • Over-stuffing context: Beyond ~8k tokens, model attention degrades and hallucinations increase
  • Retrieval precision over quantity: Better to include 3 highly relevant tables + metadata than 20 tables with sparse details
  • Chunking strategy: Break large schemas into semantic chunks (related tables, common join paths) and retrieve only what's needed

Q3: How do you measure SQL accuracy beyond "it runs"?

Answer: Execution success is necessary but not sufficient. We track:

  • Exact match: Does SQL match a human-written reference query?
  • Execution accuracy: Does it run without errors?
  • Result correctness: Does it return the right answer (using test datasets with known expected results)?
  • Semantic equivalence: Is the SQL logically correct even if syntactically different from reference?
  • User feedback: Do users accept the result or rephrase the query?

Q4: Should we fine-tune a model or stick with prompt engineering?

Answer: Start with prompt engineering + RAG. Consider fine-tuning only when:

  • You have high-quality, domain-specific training data (thousands of NLβ†’SQL pairs)
  • Your schema is highly specialized (industry-specific terminology, non-standard patterns)
  • You've exhausted prompt optimizations and still see consistent failure patterns
  • You have ML infrastructure for model training, deployment, and monitoring

For most teams, prompt engineering + schema pruning + good metadata yields better ROI.

Q5: How do you handle multi-turn conversations and follow-up questions?

Answer: Conversational context adds complexity. Our approach:

  • Context windowing: Maintain previous 2–3 turns of questionβ†’SQLβ†’result in the prompt
  • Reference resolution: Explicitly track pronouns and references ("it", "those", "same period")
  • Intent classification: Detect if a follow-up is a refinement, new query, or result exploration
  • Session state: Store intermediate results in memory so follow-ups can reference them
  • Fallback to explicit: When ambiguous, ask the user to clarify rather than guess

✨ 9. Best Practices & Improvements Roadmap

Based on our production experience, here's a prioritized checklist of improvements and best practices, organized by implementation effort and impact.

πŸš€ Quick Wins (Implement in 1–2 weeks)

| Priority | Practice | Impact |
| --- | --- | --- |
| ⭐⭐⭐ | Add SQL validation layer | Catches syntax errors before execution |
| ⭐⭐⭐ | Implement query logging | Enables debugging and usage analysis |
| ⭐⭐ | Create 20–50 seed QA pairs | Immediate accuracy boost for common patterns |
| ⭐⭐ | Add schema metadata comments | Better schema understanding |
| ⭐ | Basic rate limiting | Prevents cost overruns |

Implementation details:

from typing import Tuple

import sqlglot  # third-party: pip install sqlglot

# SQL validation example
def validate_sql(sql: str) -> Tuple[bool, str]:
    """Check SQL for obvious errors before execution."""
    try:
        sqlglot.parse(sql, read="postgres")  # substitute your database's dialect
        # Further checks worth adding here:
        #   - reject SELECT * (often problematic)
        #   - flag missing WHERE clauses (accidental full scans)
        #   - flag functions your warehouse doesn't support
        return True, "Valid"
    except sqlglot.errors.ParseError as e:
        return False, str(e)

πŸ—οΈ Medium-term Improvements (1–2 months)

| Feature | Description | Benefit |
| --- | --- | --- |
| Schema pruning agent | Pre-process query to identify relevant tables | Reduces context tokens by 50–70% |
| Self-correction loop | Model reviews and fixes its own SQL | Catches 15–30% of errors |
| Few-shot example retrieval | Dynamically fetch relevant QA examples | Improves accuracy for complex patterns |
| Query result cache | Cache identical questions + SQL | Reduces latency for repeated queries |
| A/B testing framework | Compare prompts, models, strategies | Data-driven optimization |
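The query-result cache row above can be sketched as a small class, keyed by a normalized question so trivially different phrasings (case, extra spaces) still hit the cache. This only catches near-identical questions; semantic caching would need embeddings. All names are hypothetical:

```python
import hashlib

class QueryCache:
    """Cache (sql, rows) keyed by a normalized question, so repeated
    questions skip the LLM call entirely."""

    def __init__(self):
        self._store: dict[str, tuple] = {}

    @staticmethod
    def _key(question: str) -> str:
        # Lowercase and collapse whitespace before hashing.
        normalized = " ".join(question.lower().split())
        return hashlib.sha256(normalized.encode()).hexdigest()

    def get(self, question: str):
        return self._store.get(self._key(question))

    def put(self, question: str, sql: str, rows) -> None:
        self._store[self._key(question)] = (sql, rows)

cache = QueryCache()
cache.put("Total sales?", "SELECT SUM(amount) FROM sales", [(42,)])
```

Remember to invalidate the cache on schema changes or data refreshes, or stale rows will be served.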

Architecture sketch (diagram to insert): question β†’ retrieval (schema pruning + example fetch) β†’ generation β†’ validation/self-correction β†’ execution β†’ cache.

🎯 Long-term Strategic Investments (3–6 months)

| Initiative | Effort | Strategic Value |
| --- | --- | --- |
| Fine-tuned domain model | High | 10–25% accuracy gain for specialized schemas |
| Multi-agent architecture | High | Handles complex, multi-step queries |
| Human-in-the-loop feedback | Medium | Continual improvement, data collection |
| Usage analytics dashboard | Medium | Monitor adoption, failure patterns |
| Automated testing pipeline | High | Regression testing as schema evolves |

Key architectural patterns to explore:

  1. ReAct Agent Pattern

    • Reason: Analyze question and plan approach
    • Act: Generate SQL and execute
    • Observe: Check results
    • Iterate: Refine if needed
  2. Revision Agent

    • Separate "generator" and "revision" roles
    • Generator creates initial SQL
    • Reviewer checks against schema, constraints, business rules
    • Mimics human code review process
  3. Ensemble Methods

    • Run multiple prompts/models in parallel
    • Rank results by confidence, execution success, user feedback
    • Choose best or present options to user

πŸ“Š Monitoring & Observability Essentials

Don't build what you can't measure. Implement these metrics:

| Metric Category | Key Metrics | Target |
| --- | --- | --- |
| Accuracy | Exact match rate, execution success, user satisfaction | >85% execution success |
| Latency | End-to-end time, LLM calls per query | <30 seconds p95 |
| Cost | Tokens per query, LLM cost per user | Track weekly |
| Usage | Queries per user, top failure patterns | Identify gaps |
| Business | Time saved vs. manual SQL writing, adoption rate | ROI analysis |
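The p95 latency target above needs a percentile function on your logged timings; a nearest-rank sketch is enough for a dashboard (a metrics library would normally do this for you):

```python
def p95(latencies_seconds: list[float]) -> float:
    """95th-percentile latency via the nearest-rank method:
    sort, then take the value at ceiling-adjusted rank 0.95 * n."""
    if not latencies_seconds:
        raise ValueError("no samples")
    ordered = sorted(latencies_seconds)
    index = max(0, int(round(0.95 * len(ordered))) - 1)
    return ordered[index]
```

Feed it the end-to-end timings from your query log and alert when the result drifts above the 30-second target.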

🌟 10. Ongoing Reflection & Future Work

Building production-level Text-to-SQL systems isn't just coding. It's data engineering + prompt design + system architecture + human insight. Models may get better, but if your metadata is messy or schema is outdated, you'll struggle. Human expertise remains the core.


πŸ“Œ Recommended Roadmap Diagram (to include)

  • Start: Small DB β†’ RAG β†’ Simple SQL
  • Scale: Many tables β†’ Complex joins β†’ Agentization
  • Optimize: Latency reduction β†’ Context window workarounds β†’ Hallucination checks
  • Mature: Multi-tenant, live datalake, revision agents, usage monitoring

🌐 11. Why Text-to-SQL Is Difficult β€” A Clear Summary

Text-to-SQL may look simple from the outside ("just convert a sentence into SQL"), but in practice it is one of the hardest LLM application problems, involving multiple complex layers from NLP to database theory.

The core challenges:

1️⃣ Natural Language Is Ambiguous & Context-Dependent

Human language is messy. Queries may be incomplete, vague, or contain hidden assumptions.

Examples:

  • "show me sales last month" β†’ sales of what? which region? which table is "sales"?
  • "top customers" β†’ by revenue? order count? last 30 days?

LLMs must infer intent and connect it to actual database meaning. This ambiguity is one of the largest barriers.

2️⃣ Mapping Intent β†’ SQL Requires Precise Logical Translation

Converting natural language into SQL is not a direct translation β€” it's reasoning + planning.

A Text-to-SQL system must:

  • identify entities (tables, columns, metrics)
  • map user intent to schema
  • generate syntactically valid SQL
  • ensure semantics are correct (produces meaningful results)

Even if grammar is correct, a small semantic mistake can produce:

  • wrong joins
  • missing filters
  • incorrect GROUP BY logic

SQL is extremely strict, while language is extremely flexible β€” a huge mismatch.

3️⃣ Deep Schema Understanding Is Required

A real-world database is often huge and messy:

  • hundreds of tables
  • thousands of columns
  • complicated relationships
  • inconsistent naming conventions
  • implicit business rules not documented in schema

A Text-to-SQL model must behave like a database expert, not just a language model. It needs to know:

  • how tables connect
  • foreign key paths
  • cardinality
  • which columns are meaningful
  • which tables should NEVER be joined

This is far beyond simple NLP.

4️⃣ Error Handling & Reliability Are Hard

A production system must gracefully handle:

❌ Invalid queries (e.g., wrong columns, missing tables)
❌ Bad user inputs (vague, incorrect, contradictory)
❌ Unexpected execution results (e.g., a query that returns millions of rows)

The system must:

  • detect errors
  • correct SQL
  • explain issues in human language
  • provide meaningful feedback

This adds a full engineering layer on top of LLM generation.
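A small piece of that layer, sketched with Python's built-in sqlite3: execute the SQL, convert database errors into user-facing explanations, and guard against result sets that are unexpectedly huge (the row limit and message wording are illustrative):

```python
import sqlite3

def run_with_feedback(sql: str, db: sqlite3.Connection, row_limit: int = 10_000):
    """Execute SQL; return (rows, message). Failures and oversized results
    become human-readable feedback instead of raw exceptions."""
    try:
        # Fetch one row past the limit so truncation is detectable.
        rows = db.execute(sql).fetchmany(row_limit + 1)
    except sqlite3.Error as exc:
        return None, f"Query failed: {exc}. Check table and column names."
    if len(rows) > row_limit:
        return rows[:row_limit], f"Result truncated to the first {row_limit} rows."
    return rows, "OK"

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE t (x INT)")
db.executemany("INSERT INTO t VALUES (?)", [(1,), (2,), (3,)])
```

The returned message can go straight back to the user, or into a self-correction loop as feedback for the next generation attempt.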

5️⃣ High-Quality Training Data Is Scarce & Expensive

Building accurate Text-to-SQL requires large, diverse training sets that reflect real business logic.

You need:

  • thousands of NL β†’ SQL pairs
  • complete schema + metadata
  • examples across many query types
  • domain-specific instructions

Manually creating this data is extremely costly. Open-source datasets (Spider, BIRD, WikiSQL…) are helpful but do not match real enterprise schemas, which are wider, messier, and full of business-specific rules.

6️⃣ Integration, Testing & Production Deployment Are Complex

Integrating Text-to-SQL into a live enterprise environment is non-trivial. You must handle:

  • permission models
  • database connections
  • latency optimization
  • schema versioning
  • caching
  • security / compliance
  • performance tuning
  • user feedback loops

And then comes testing:

  • unit test queries
  • edge cases
  • cross-schema scenarios
  • load testing
  • regression testing as schema changes

Testing Text-to-SQL is one of the hardest parts because the output space is huge and many SQLs are correct but not equivalent.

🎯 In Summary: Why Is Text-to-SQL Hard?

Because it requires the combination of:

  • NLP understanding β†’ interpret ambiguous human language
  • Logical reasoning β†’ plan steps and structure SQL
  • Database expertise β†’ understand schema, joins, constraints
  • Software engineering β†’ validate, execute, optimize, debug
  • Data engineering β†’ maintain metadata + training sets
  • Product design β†’ provide clear feedback to users

It's not just "generate SQL." It's the intersection of several entire disciplines.

This is why even big tech companies struggle with it.


πŸ› οΈ 12. Tools & Technologies Mentioned

  • Vanna: Open-source Text-to-SQL platform
  • Bird: Text-to-SQL challenge platform
  • LLama NSQL: Custom model implementation
  • RAG (Retrieval-Augmented Generation): Core architectural pattern
  • Agent-based systems: Multi-tool orchestration approach