💡 Lessons Learned: Implementing a Text-to-SQL Platform in Production
This page documents key lessons and practical insights from building a Text-to-SQL platform at a large tech company with a massive data lake. The goal: share hands-on experience with technical readers building scalable LLM-powered SQL systems.
🛠️ 1. Starting Small: The Early Roadmap
We began last year, when context-window limits were a major bottleneck (many models at the time accepted only ~2,048–4,096 prompt tokens). Our initial experiment used a model we refer to as Llama NSQL and an open-source platform called Vanna, which required three core context inputs:
- Database schema / DDL – all tables, relationships, constraints.
- Documentation / metadata – table/column business meanings, comments, descriptions.
- Question–SQL example pairs – natural language questions mapped to correct SQL for your schema.
With just these, Vanna could generate near-perfect SQL on a small, simple database. It was essentially a Retrieval-Augmented Generation (RAG) approach – ideal for simple cases.
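As a sketch of how those three context inputs combine at query time, here is a minimal RAG-style prompt builder. It is illustrative only, not Vanna's actual API: `score_overlap`, the input dictionaries, and the template wording are all assumptions (real systems use embedding similarity rather than word overlap).

```python
def score_overlap(question: str, text: str) -> int:
    """Toy relevance score: count of shared lowercase words."""
    return len(set(question.lower().split()) & set(text.lower().split()))

def build_prompt(question: str, ddl: dict, docs: dict, qa_pairs: list, k: int = 2) -> str:
    """Retrieve the k most relevant tables and example pairs, then fill a prompt template."""
    tables = sorted(
        ddl,
        key=lambda t: score_overlap(question, t + " " + docs.get(t, "")),
        reverse=True,
    )[:k]
    examples = sorted(qa_pairs, key=lambda p: score_overlap(question, p[0]), reverse=True)[:k]
    parts = ["You are a SQL-generation agent.", "Schema:"]
    parts += [ddl[t] for t in tables]
    parts.append("Examples:")
    parts += [f"Q: {q}\nSQL: {s}" for q, s in examples]
    parts += ["Question:", question, "Generate optimized SQL."]
    return "\n".join(parts)
```

With a toy two-table schema, a question about orders pulls in only the `orders` DDL and the closest example pair, keeping the prompt small.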
📈 2. Scaling & Roadmap Visual
We mapped out the following sequence:
| Phase | Focus | Challenge |
|---|---|---|
| Phase 1 | Small DB, simple tables | Lightweight context, manageable token count |
| Phase 2 | Multiple tables, complex joins | Schema grows, context window becomes a limit |
| Phase 3 | Agent-based system, multi-tool orchestration | Latency + many LLM calls + retrieval overhead |
| Phase 4 | Production optimization | Reduce end-to-end latency, manage hallucination, scale |
✅ Visual roadmap (you can insert a timeline graphic): Start → Scale → Agentization → Optimization
📊 3. Scaling Challenges
When applying to many tables and complex joins:
- Latency grew significantly.
- SQL accuracy dropped for multi-table, multi-join queries.
We explored community tools and discovered Bird, which hosts Text-to-SQL challenges. The CHESS system from its leaderboard inspired us to adopt an agent-based architecture for reasoning + SQL generation.
⚡ 4. Latency & Context Window Constraints
In our early agent implementation:
- Many LLM calls per query (tool chaining + retrieval).
- Model context window couldn't accommodate full schema + metadata + examples.
- A single query execution sometimes took ~30 minutes.
We realized: we must optimize what we send into the model and how often.
🧹 5. Workflow Optimization & Code Snippet
We reduced end-to-end execution to ~30 seconds by:
- Pruning irrelevant schema parts.
- Chunking metadata.
- Minimizing LLM calls (fewer hops).
- Tightening prompt structure.
Example: Prompt Template

```
You are a SQL-generation agent.
Schema:
{{ddl_snippet}}
Metadata:
{{metadata_snippet}}
Examples:
{{qa_pairs}}
Question:
{{user_question}}
Generate optimized SQL.
```

Example: Preprocessing Step
```python
# Prune schema to just the tables relevant to the question
relevant_tables = extract_relevant_tables(question, full_schema)
ddl_snippet = format_ddl(relevant_tables)
```

Important highlight: token budget matters. Research on schema pruning in Text-to-SQL (see arXiv) shows models degrade when input tokens exceed ~4,000–8,000. In our case, we trimmed context to a consistent ~3,000–4,000 tokens per query.
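A sketch of that trimming step, under the assumption of a crude chars/4 token estimate (real budgeting should use the model's own tokenizer); `trim_to_budget` and the priority ordering of sections are hypothetical:

```python
def estimate_tokens(text: str) -> int:
    """Rough heuristic: ~4 characters per token for English/SQL text (tokenizers vary)."""
    return max(1, len(text) // 4)

def trim_to_budget(sections: list, budget: int = 3500) -> list:
    """Keep whole context sections, in most-relevant-first order, until the budget is spent."""
    kept, used = [], 0
    for section in sections:
        cost = estimate_tokens(section)
        if used + cost > budget:
            break  # drop this section and everything less relevant
        kept.append(section)
        used += cost
    return kept
```

Dropping whole sections (rather than truncating mid-DDL) keeps each retained table definition syntactically intact.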
🧠 6. Key Technical Lessons
- Data quality is king. Good metadata + correct examples yield better SQL.
- LLM evolution brings new trade-offs. Bigger context windows help (e.g., GPT-4/5-class models), but hallucinations increase.
- Multi-agent architecture is complex. Agents need communication protocols, orchestration, failure modes.
- Prompt + data = true asset. Models are tools; your input quality defines output.
- You cannot be purely agile. Rapid AI evolution means you must plan your architecture, toolchain, and scaling strategy carefully.
📌 7. Practical Advice
- Begin with a narrow scope (few tables) to validate your method.
- Use RAG with schema + metadata + QA examples.
- Prune and chunk your context to stay within model limits (~3,000–4,000 tokens).
- Minimize LLM calls: combine retrieval + generation steps.
- Monitor SQL accuracy, hallucination, execution failures.
- Build modular architecture: retrieval module, generation module, revision module.
- Document everything: schema changes, prompt iterations, agent flows.
❓ 8. FAQ & Follow-up Questions
This section addresses common questions and clarifications that often arise when implementing Text-to-SQL systems in production.
Q1: How do you handle schema changes without breaking the system?
Answer: Schema changes are one of the biggest operational challenges. We recommend:
- Version-controlled metadata: Store your schema documentation, DDL, and QA examples in git with clear versioning
- Automated schema validation: Run automated tests that check if the current model can still generate valid SQL for known questions after schema changes
- Graceful degradation: If the model references deprecated tables/columns, have a validation layer that catches this before execution
- Notification pipeline: Alert the data team when schema changes impact high-usage query patterns
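The "graceful degradation" layer above can be sketched as follows. This is a simplified illustration: it catches only table names via a regex on FROM/JOIN clauses, whereas a production check would use a real SQL parser; `find_unknown_tables` is a hypothetical helper.

```python
import re

def find_unknown_tables(sql: str, known_tables: set) -> set:
    """Flag table names referenced after FROM/JOIN that no longer exist in the live schema."""
    referenced = {
        m.group(1).lower()
        for m in re.finditer(r"\b(?:FROM|JOIN)\s+([A-Za-z_][A-Za-z0-9_]*)", sql, re.IGNORECASE)
    }
    return referenced - {t.lower() for t in known_tables}
```

If the returned set is non-empty, the query is rejected before execution and the user (or a repair agent) is told which tables are gone.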
Q2: What's the right balance between context window size and retrieval quality?
Answer: This is a trade-off we constantly tune. Our findings:
- Sweet spot: ~3,000–4,000 tokens per query (as mentioned in section 5)
- Over-stuffing context: Beyond ~8k tokens, model attention degrades and hallucinations increase
- Retrieval precision over quantity: Better to include 3 highly relevant tables + metadata than 20 tables with sparse details
- Chunking strategy: Break large schemas into semantic chunks (related tables, common join paths) and retrieve only what's needed
Q3: How do you measure SQL accuracy beyond "it runs"?
Answer: Execution success is necessary but not sufficient. We track:
- Exact match: Does SQL match a human-written reference query?
- Execution accuracy: Does it run without errors?
- Result correctness: Does it return the right answer (using test datasets with known expected results)?
- Semantic equivalence: Is the SQL logically correct even if syntactically different from reference?
- User feedback: Do users accept the result or rephrase the query?
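Result correctness and semantic equivalence can both be approximated by executing candidate and reference SQL against a small test database and comparing the returned rows. A sketch using SQLite (`same_results` is a hypothetical helper; queries where row order matters need extra care):

```python
import sqlite3

def same_results(candidate_sql: str, reference_sql: str, conn: sqlite3.Connection) -> bool:
    """Treat two queries as equivalent if they return the same multiset of rows."""
    rows_a = sorted(conn.execute(candidate_sql).fetchall())
    rows_b = sorted(conn.execute(reference_sql).fetchall())
    return rows_a == rows_b
```

This catches the "syntactically different but logically correct" case: two filters written differently pass as long as they select the same rows on the test data.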
Q4: Should we fine-tune a model or stick with prompt engineering?
Answer: Start with prompt engineering + RAG. Consider fine-tuning only when:
- You have high-quality, domain-specific training data (thousands of NL → SQL pairs)
- Your schema is highly specialized (industry-specific terminology, non-standard patterns)
- You've exhausted prompt optimizations and still see consistent failure patterns
- You have ML infrastructure for model training, deployment, and monitoring
For most teams, prompt engineering + schema pruning + good metadata yields better ROI.
Q5: How do you handle multi-turn conversations and follow-up questions?
Answer: Conversational context adds complexity. Our approach:
- Context windowing: Maintain previous 2–3 turns of question→SQL→result in the prompt
- Reference resolution: Explicitly track pronouns and references ("it", "those", "same period")
- Intent classification: Detect if a follow-up is a refinement, new query, or result exploration
- Session state: Store intermediate results in memory so follow-ups can reference them
- Fallback to explicit: When ambiguous, ask the user to clarify rather than guess
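The context-windowing idea can be sketched with a bounded buffer of recent turns; `SessionContext` is a hypothetical illustration:

```python
from collections import deque

class SessionContext:
    """Keep the last few (question, sql, result_summary) turns for prompt construction."""

    def __init__(self, max_turns: int = 3):
        self.turns = deque(maxlen=max_turns)  # oldest turn evicted automatically

    def add(self, question: str, sql: str, result_summary: str) -> None:
        self.turns.append((question, sql, result_summary))

    def render(self) -> str:
        """Serialize retained turns for inclusion in the next prompt."""
        return "\n".join(f"Q: {q}\nSQL: {s}\nResult: {r}" for q, s, r in self.turns)
```

Capping at 2–3 turns keeps conversational context from eating the token budget that section 5 fought to reclaim.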
✨ 9. Best Practices & Improvements Roadmap
Based on our production experience, here's a prioritized checklist of improvements and best practices, organized by implementation effort and impact.
🚀 Quick Wins (Implement in 1–2 weeks)
| Priority | Practice | Impact |
|---|---|---|
| ⭐⭐⭐ | Add SQL validation layer | Catches syntax errors before execution |
| ⭐⭐⭐ | Implement query logging | Enables debugging and usage analysis |
| ⭐⭐ | Create 20–50 seed QA pairs | Immediate accuracy boost for common patterns |
| ⭐⭐ | Add schema metadata comments | Better schema understanding |
| ⭐ | Basic rate limiting | Prevents cost overruns |
Implementation details:
```python
from typing import Tuple

import sqlglot  # third-party SQL parser: pip install sqlglot

# SQL validation example
def validate_sql(sql: str) -> Tuple[bool, str]:
    """Check SQL for obvious errors before execution."""
    try:
        sqlglot.parse(sql, dialect="your_db")  # use your dialect, e.g. "postgres"
        # Further checks worth adding:
        # - SELECT * (often problematic)
        # - missing WHERE clauses (accidental full scans)
        # - unsupported functions
        return True, "Valid"
    except Exception as e:
        return False, str(e)
```

🗓️ Medium-term Improvements (1–2 months)
| Feature | Description | Benefit |
|---|---|---|
| Schema pruning agent | Pre-process query to identify relevant tables | Reduces context tokens by 50–70% |
| Self-correction loop | Model reviews and fixes its own SQL | Catches 15–30% of errors |
| Few-shot example retrieval | Dynamically fetch relevant QA examples | Improves accuracy for complex patterns |
| Query result cache | Cache identical questions + SQL | Reduces latency for repeated queries |
| A/B testing framework | Compare prompts, models, strategies | Data-driven optimization |
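Of these, the query result cache is the simplest to sketch. A hypothetical illustration keyed on a normalized question (a real system also needs TTLs and invalidation when underlying data changes):

```python
import hashlib

class QueryCache:
    """Cache SQL + results keyed by a normalized question, so repeated asks skip the LLM."""

    def __init__(self):
        self._store = {}

    @staticmethod
    def _key(question: str) -> str:
        # Normalize case and whitespace so trivially different phrasings hit the same entry
        normalized = " ".join(question.lower().split())
        return hashlib.sha256(normalized.encode()).hexdigest()

    def get(self, question: str):
        return self._store.get(self._key(question))

    def put(self, question: str, sql: str, rows) -> None:
        self._store[self._key(question)] = (sql, rows)
```

A cache hit removes every LLM call from the path, so for popular dashboard-style questions this is the single biggest latency win.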
Architecture sketch: question → schema pruning → example retrieval → SQL generation → validation → execution, with a self-correction loop feeding errors back to the generator.
🎯 Long-term Strategic Investments (3–6 months)
| Initiative | Effort | Strategic Value |
|---|---|---|
| Fine-tuned domain model | High | 10–25% accuracy gain for specialized schemas |
| Multi-agent architecture | High | Handles complex, multi-step queries |
| Human-in-the-loop feedback | Medium | Continual improvement, data collection |
| Usage analytics dashboard | Medium | Monitor adoption, failure patterns |
| Automated testing pipeline | High | Regression testing as schema evolves |
Key architectural patterns to explore:

1. ReAct Agent Pattern
   - Reason: Analyze question and plan approach
   - Act: Generate SQL and execute
   - Observe: Check results
   - Iterate: Refine if needed

2. Revision Agent
   - Separate "generator" and "reviewer" roles
   - Generator creates initial SQL
   - Reviewer checks against schema, constraints, business rules
   - Mimics human code review process

3. Ensemble Methods
   - Run multiple prompts/models in parallel
   - Rank results by confidence, execution success, user feedback
   - Choose best or present options to user
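A minimal sketch of the ensemble selection step: assuming candidates arrive sorted by model confidence, pick the first that executes cleanly (`pick_best` and the `execute` callback are hypothetical; real ranking would also weigh result plausibility and user feedback):

```python
def pick_best(candidates, execute):
    """Return the first candidate SQL that executes successfully, in confidence order.

    `execute` runs SQL against the target database and raises on failure.
    """
    for sql in candidates:
        try:
            execute(sql)
            return sql
        except Exception:
            continue  # this candidate failed; fall through to the next
    return None  # nothing executed cleanly; caller should ask the user or escalate
```

Execution success is a cheap, strong filter: a candidate that references a missing column is eliminated without any extra LLM call.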
📊 Monitoring & Observability Essentials
Don't build what you can't measure. Implement these metrics:
| Metric Category | Key Metrics | Target |
|---|---|---|
| Accuracy | Exact match rate, Execution success, User satisfaction | over 85% execution success |
| Latency | End-to-end time, LLM calls per query | under 30 seconds p95 |
| Cost | Tokens per query, LLM cost per user | Track weekly |
| Usage | Queries per user, Top failure patterns | Identify gaps |
| Business | Time saved vs. manual SQL writing, Adoption rate | ROI analysis |
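A sketch of per-query logging that feeds these metrics; the field names are assumptions, and a real pipeline would ship these JSON lines to your analytics store:

```python
import json
import time

def log_query(question: str, sql: str, n_llm_calls: int, n_tokens: int,
              ok: bool, started: float) -> str:
    """Emit one JSON line per query; aggregate offline for success rate, p95 latency, cost."""
    record = {
        "question": question,
        "sql": sql,
        "llm_calls": n_llm_calls,
        "tokens": n_tokens,
        "success": ok,
        "latency_s": round(time.time() - started, 3),
    }
    return json.dumps(record)
```

One line per query is enough to compute every row of the table above except the business metrics, which need adoption data from outside the system.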
🔄 10. Ongoing Reflection & Future Work
Building production-level Text-to-SQL systems isn't just coding. It's data engineering + prompt design + system architecture + human insight. Models may get better, but if your metadata is messy or your schema is outdated, you'll struggle. Human expertise remains the core.
📍 Recommended Roadmap Diagram (to include)
- Start: Small DB → RAG → Simple SQL
- Scale: Many tables → Complex joins → Agentization
- Optimize: Latency reduction → Context window workarounds → Hallucination checks
- Mature: Multi-tenant, live datalake, revision agents, usage monitoring
🔍 11. Why Text-to-SQL Is Difficult: A Clear Summary
Text-to-SQL may look simple from the outside ("just convert a sentence into SQL"), but in practice it is one of the hardest LLM application problems, involving multiple complex layers from NLP to database theory.
Here are the core challenges:
1️⃣ Natural Language Is Ambiguous & Context-Dependent
Human language is messy. Queries may be incomplete, vague, or contain hidden assumptions.
Examples:
- "show me sales last month" β sales of what? which region? which table is "sales"?
- "top customers" β by revenue? order count? last 30 days?
LLMs must infer intent and connect it to actual database meaning. This ambiguity is one of the largest barriers.
2️⃣ Mapping Intent → SQL Requires Precise Logical Translation
Converting natural language into SQL is not a direct translation – it's reasoning + planning.
A Text-to-SQL system must:
- identify entities (tables, columns, metrics)
- map user intent to schema
- generate syntactically valid SQL
- ensure semantics are correct (produces meaningful results)
Even if grammar is correct, a small semantic mistake can produce:
- wrong joins
- missing filters
- incorrect GROUP BY logic
SQL is extremely strict, while language is extremely flexible – a huge mismatch.
3️⃣ Deep Schema Understanding Is Required
A real-world database is often huge and messy:
- hundreds of tables
- thousands of columns
- complicated relationships
- inconsistent naming conventions
- implicit business rules not documented in schema
A Text-to-SQL model must behave like a database expert, not just a language model. It needs to know:
- how tables connect
- foreign key paths
- cardinality
- which columns are meaningful
- which tables should NEVER be joined
This is far beyond simple NLP.
4️⃣ Error Handling & Reliability Are Hard
A production system must gracefully handle:
- Invalid queries (e.g., wrong columns, missing tables)
- Bad user inputs (vague, incorrect, contradictory)
- Unexpected execution results (e.g., query returns millions of rows)
The system must:
- detect errors
- correct SQL
- explain issues in human language
- provide meaningful feedback
This adds a full engineering layer on top of LLM generation.
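The detect/correct steps above can be sketched as a bounded retry loop that feeds the database error back into a repair step (in practice, an LLM call); `run_with_repair` and both callbacks are hypothetical:

```python
def run_with_repair(sql, execute, repair, max_attempts=3):
    """Execute SQL; on failure, hand the error message to `repair` and retry.

    `execute` runs SQL and raises on failure; `repair(sql, error)` returns fixed SQL
    (typically by prompting an LLM with the broken query and the error text).
    """
    for _ in range(max_attempts):
        try:
            return execute(sql)
        except Exception as err:
            sql = repair(sql, str(err))  # feed the error back for self-correction
    raise RuntimeError("SQL could not be repaired within the attempt budget")
```

Bounding the attempts matters: each retry is another LLM call, and an unrepairable query should surface a human-readable explanation instead of looping.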
5️⃣ High-Quality Training Data Is Scarce & Expensive
Building accurate Text-to-SQL requires large, diverse training sets that reflect real business logic.
You need:
- thousands of NL → SQL pairs
- complete schema + metadata
- examples across many query types
- domain-specific instructions
Manually creating this data is extremely costly. Open-source datasets (Spider, Bird, WikiSQL, …) are helpful but do not match real enterprise schemas, which are wider, messier, and contain business-specific rules.
6️⃣ Integration, Testing & Production Deployment Are Complex
Integrating Text-to-SQL into a live enterprise environment is non-trivial. You must handle:
- permission models
- database connections
- latency optimization
- schema versioning
- caching
- security / compliance
- performance tuning
- user feedback loops
And then comes testing:
- unit test queries
- edge cases
- cross-schema scenarios
- load testing
- regression testing as schema changes
Testing Text-to-SQL is one of the hardest parts because the output space is huge and many SQLs are correct but not equivalent.
🎯 In Summary: Why Is Text-to-SQL Hard?
Because it requires the combination of:
- NLP understanding – interpret ambiguous human language
- Logical reasoning – plan steps and structure SQL
- Database expertise – understand schema, joins, constraints
- Software engineering – validate, execute, optimize, debug
- Data engineering – maintain metadata + training sets
- Product design – provide clear feedback to users
It's not just "generate SQL." It's the intersection of several entire disciplines.
This is why even big tech companies struggle with it.
🛠️ 12. Tools & Technologies Mentioned
- Vanna: Open-source Text-to-SQL platform
- Bird: Text-to-SQL challenge platform
- Llama NSQL: SQL-generation model used in our initial experiment
- RAG (Retrieval-Augmented Generation): Core architectural pattern
- Agent-based systems: Multi-tool orchestration approach