In today’s data-driven world, the ability to extract insights from databases through natural language queries has emerged as a transformative capability. Text-to-SQL systems—models that translate human language questions into structured query language (SQL)—represent one of the most practical applications of natural language processing (NLP) in enterprise environments. These systems promise to democratize data access, allowing non-technical users to interact with databases through intuitive questions rather than complex code.
But how do we know whether these systems are actually delivering on their promise? How can we systematically evaluate whether a text-to-SQL model generates not just valid SQL, but optimal, efficient queries that produce accurate results? This is a problem our teams faced while trying to evaluate the efficacy of text-to-SQL systems. As organizations increasingly depend on AI agents and agentic systems for critical business decisions, the need for robust evaluation frameworks has never been more pressing.
The Multifaceted Nature of SQL Evaluation
Text-to-SQL evaluation is deceptively complex. Unlike standard NLP tasks where string-based metrics might suffice, evaluating SQL generation requires consideration of multiple dimensions:
- Syntactic Correctness is merely the starting point. A query might be syntactically valid but semantically flawed. Consider "SELECT * FROM employees WHERE salary > 100000" versus "SELECT * FROM employees WHERE salary > 10000": both are valid SQL, yet they carry dramatically different business implications.
- Execution Accuracy goes deeper, asking whether the generated SQL produces the same results as the intended query. Two semantically different queries might coincidentally return identical results on a particular dataset, creating a false impression of equivalence.
- Computational Efficiency adds another dimension. In enterprise settings with massive databases, an inefficient query that retrieves millions of records before filtering them can bring systems to a halt, regardless of its correctness.
- Logical Completeness examines whether the SQL query captures all the logical conditions required in the natural language question. Missing a single filter condition could mean the difference between actionable intelligence and misleading information.
Traditional evaluation methods often fall short in addressing these multifaceted concerns. Many existing benchmarks rely primarily on string matching or execution accuracy, neglecting the critical aspects of efficiency and logical completeness. Others use static datasets that fail to represent the diversity of real-world databases and queries.
In production environments, text-to-SQL systems face challenges that laboratory evaluations often miss:
- Databases with hundreds of tables and complex relationships
- Domain-specific terminology and jargon
- Ambiguous natural language questions
- Performance requirements for large-scale data
- Multiple SQL dialects across different database systems
Implementation Approaches for Evaluation Systems
Current text-to-SQL evaluation implementations typically fall into several categories, each with distinct advantages and limitations:
String Matching-Based Evaluation
The most straightforward approach compares generated SQL strings against reference queries. Implementations range from exact matching to various normalization techniques that account for SQL’s flexibility (e.g., column order invariance in SELECT clauses). Libraries like `sqlparse` and `sqlglot` are commonly used to standardize queries before comparison.
```python
import sqlparse

def normalize_query(query):
    # Standardize keyword case, identifier case, whitespace, and comments
    # so that textually similar queries compare equal.
    return sqlparse.format(
        query,
        keyword_case="upper",
        identifier_case="lower",
        strip_comments=True,
        reindent=True,
    )

def compare_queries(generated, reference):
    return normalize_query(generated) == normalize_query(reference)
```
This approach is simple to implement but struggles with SQL’s expressivity—two semantically equivalent queries can be syntactically different.
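To make the limitation concrete, here is a hypothetical pair of queries (reusing `compare_queries` from the sketch above) that return identical rows but still fail the comparison, because normalization does not reorder predicates:

```python
# Hypothetical example: logically identical filters, different predicate order.
q1 = "SELECT name FROM employees WHERE salary > 100000 AND department = 'ENG'"
q2 = "SELECT name FROM employees WHERE department = 'ENG' AND salary > 100000"

print(compare_queries(q1, q2))  # False, even though both queries return the same rows
```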
Execution-Based Evaluation
A more robust approach executes both the generated and reference queries against a database and compares the results:
```python
def execute_and_compare(conn, generated_sql, reference_sql):
    gen_cursor = conn.cursor()
    ref_cursor = conn.cursor()

    gen_cursor.execute(generated_sql)
    ref_cursor.execute(reference_sql)

    gen_results = gen_cursor.fetchall()
    ref_results = ref_cursor.fetchall()

    return compare_result_sets(gen_results, ref_results)
```
This method addresses many limitations of string matching but introduces other challenges. It requires a functioning database environment and cannot identify inefficient queries that happen to produce correct results. Implementations must also handle result set comparison nuances, such as row ordering and data type variations.
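A minimal sketch of the `compare_result_sets` helper referenced above, assuming the queries are unordered and applying only light numeric type coercion, might look like this:

```python
from decimal import Decimal

def compare_result_sets(gen_results, ref_results):
    """Order-insensitive comparison with light numeric coercion.

    Assumes the queries carry no ORDER BY; ordered queries would need a
    positional comparison instead.
    """
    def normalize(rows):
        def coerce(value):
            # Treat ints, floats, and Decimals as interchangeable numerics
            if isinstance(value, (int, float, Decimal)) and not isinstance(value, bool):
                return float(value)
            return value
        return sorted((tuple(coerce(v) for v in row) for row in rows), key=repr)

    return normalize(gen_results) == normalize(ref_results)
```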
Multi-Metric Evaluation with LLM Judges
A more sophisticated implementation combines multiple metrics into a composite score while leveraging an LLM as a judge. Although LLM-as-a-judge is more commonly applied to evaluating unstructured outputs, we used the technique to assess semantic equivalence and other qualitative aspects of SQL queries:
````python
def evaluate_query_with_llm(generated_sql, reference_sql, db_connection, schema, llm_service):
    # First check basic validity and execution
    try:
        gen_results, gen_execution_time = execute_query(db_connection, generated_sql)
        ref_results, ref_execution_time = execute_query(db_connection, reference_sql)

        # Query is valid if it executes without error
        validity_score = 1.0

        # Calculate execution accuracy through result comparison
        execution_score = calculate_result_similarity(gen_results, ref_results)
    except Exception as e:
        # Invalid query: return early with zeroed scores
        return {
            "validity": 0.0,
            "execution": 0.0,
            "efficiency": 0.0,
            "logical": 0.0,
            "composite": 0.0,
            "error": str(e),
        }

    # Use the LLM to evaluate efficiency
    efficiency_prompt = f"""
Evaluate the computational efficiency of these two SQL queries:

Generated query:
```sql
{generated_sql}
```

Reference query:
```sql
{reference_sql}
```

Database schema:
```sql
{schema}
```

Consider factors like:
- Number and complexity of joins
- Use of appropriate indexes
- Filtering strategy (early vs. late filtering)
- Subquery usage
- Aggregate function efficiency

Assign a score between 0 and 1, where:
- 1: The generated query is as efficient or more efficient than the reference
- 0: The generated query is significantly less efficient
- Values between represent relative efficiency

Return a JSON with format: {{"score": float, "explanation": string}}
"""
    efficiency_response = llm_service.call(efficiency_prompt)
    efficiency_result = parse_llm_response(efficiency_response)
    efficiency_score = efficiency_result.get("score", 0.0)

    # Use the LLM to evaluate logical correctness
    logical_prompt = f"""
Evaluate if the generated SQL query captures all the logical conditions required by the reference query:

Generated query:
```sql
{generated_sql}
```

Reference query:
```sql
{reference_sql}
```

Consider:
- Are all required JOIN conditions present?
- Are all filtering conditions (WHERE) properly included?
- Are all grouping and aggregation operations logically equivalent?
- Would the queries produce identical results on ANY valid database with this schema, not just the current data?

Assign a score between 0 and 1, where:
- 1: The generated query is logically equivalent to the reference
- 0: The generated query misses critical logical elements
- Values between represent partial logical correctness

Return a JSON with format: {{"score": float, "explanation": string}}
"""
    logical_response = llm_service.call(logical_prompt)
    logical_result = parse_llm_response(logical_response)
    logical_score = logical_result.get("score", 0.0)

    # Calculate execution time efficiency ratio
    if ref_execution_time > 0 and gen_execution_time > 0:
        time_efficiency = min(ref_execution_time / gen_execution_time, 1.0)
    else:
        time_efficiency = 0.0

    # Combine algorithmic and LLM-based efficiency assessments
    efficiency_score = 0.7 * efficiency_score + 0.3 * time_efficiency

    # Calculate composite score with weighted components
    composite_score = (
        0.15 * validity_score
        + 0.45 * execution_score
        + 0.25 * efficiency_score
        + 0.15 * logical_score
    )

    return {
        "validity": validity_score,
        "execution": execution_score,
        "efficiency": efficiency_score,
        "logical": logical_score,
        "composite": composite_score,
        "efficiency_explanation": efficiency_result.get("explanation", ""),
        "logical_explanation": logical_result.get("explanation", ""),
    }


def parse_llm_response(response):
    """Parse the JSON response returned by the LLM judge."""
    import json
    import re

    try:
        return json.loads(response)
    except Exception:
        # Fallback if the response isn't proper JSON
        score_match = re.search(r'"score":\s*([\d.]+)', response)
        explanation_match = re.search(r'"explanation":\s*"([^"]+)"', response)
        return {
            "score": float(score_match.group(1)) if score_match else 0.5,
            "explanation": explanation_match.group(1) if explanation_match else "No explanation provided",
        }
````
This implementation pattern leverages LLMs to evaluate aspects of SQL queries that are difficult to assess through traditional metrics alone:
- Semantic Equivalence: LLMs can recognize when two syntactically different queries are logically equivalent, going beyond simple string comparison or result matching.
- Efficiency Assessment: While execution time provides one measure of efficiency, LLMs can evaluate query structure and design patterns that might impact performance on larger datasets or different database states.
- Logical Completeness: LLMs can assess whether a query correctly implements all the logical conditions implied in the original question, even when the current test database doesn’t expose these distinctions.
- Implementation Quality: LLMs can evaluate subjective aspects of queries such as readability, maintainability, and adherence to best practices.
The key advantages of this approach include:
- Nuanced Scoring: Rather than binary correct/incorrect judgments, LLMs provide graduated scores that reflect degrees of correctness.
- Explanatory Feedback: LLMs provide natural language explanations of their assessments, which can be valuable for debugging and improving models.
- Adaptability: The same framework can evaluate queries across different SQL dialects by incorporating dialect-specific considerations into the prompts, as sketched below.
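As a rough illustration of that adaptability, dialect-specific guidance can be folded directly into the judge prompt. The helper and dialect notes below are hypothetical, not part of any particular library:

```python
def build_efficiency_prompt(generated_sql, reference_sql, schema, dialect="snowflake"):
    # Hypothetical prompt builder: injects dialect-specific hints for the LLM judge.
    dialect_notes = {
        "snowflake": "Consider micro-partition pruning, clustering keys, and warehouse sizing.",
        "postgres": "Consider index usage, join order, and CTE materialization.",
    }
    return f"""Evaluate the computational efficiency of these two {dialect} SQL queries.
{dialect_notes.get(dialect, "")}

Generated query:
{generated_sql}

Reference query:
{reference_sql}

Database schema:
{schema}

Return a JSON with format: {{"score": float, "explanation": string}}"""
```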
However, this approach also presents implementation challenges:
- Consistency: LLM evaluations may vary based on prompt wording, model version, and other factors.
- Calibration: Ensuring that LLM-assigned scores align with human expert judgments requires careful prompt engineering and validation.
- Cost and Latency: Using LLM APIs adds time and financial costs to the evaluation process.
- Reproducibility: The non-deterministic nature of LLM outputs can affect reproducibility of evaluations.
Most effective implementations mitigate these challenges by combining algorithmic metrics with LLM-based assessments, as shown in the example where execution time efficiency supplements the LLM’s structural analysis of query efficiency.
Database Interface Implementations
A critical component of any text-to-SQL evaluation system is the database interface, which must support query execution across different database engines:
```python
import time

class DatabaseInterface:
    def __init__(self, connection_params):
        # create_connection stands in for engine-specific connection setup
        self.connection = create_connection(connection_params)

    def execute_query(self, query):
        try:
            cursor = self.connection.cursor()
            start_time = time.time()
            cursor.execute(query)
            execution_time = time.time() - start_time
            results = cursor.fetchall()
            return results, execution_time
        except Exception:
            return None, 0
```
For enterprise environments, implementations must support specific database systems such as Snowflake:
```python
class SnowflakeDatabaseInterface(DatabaseInterface):
    def __init__(self, account, user, password, warehouse, database, schema):
        import snowflake.connector

        self.connection = snowflake.connector.connect(
            account=account,
            user=user,
            password=password,
            warehouse=warehouse,
            database=database,
            schema=schema,
        )
```
Handling dialect differences and ensuring thread safety for parallel evaluations are key implementation challenges.
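One way to keep parallel evaluation thread-safe is to give each task its own connection, since DB-API connections are generally not safe to share across threads. A minimal sketch, assuming the `DatabaseInterface` above and illustrative test-case field names:

```python
from concurrent.futures import ThreadPoolExecutor

def execute_cases_in_parallel(test_cases, connection_params, max_workers=4):
    # Open a separate DatabaseInterface per task so no connection is shared
    # between threads; a per-worker connection pool would be more economical.
    def run_case(case):
        db = DatabaseInterface(connection_params)
        try:
            results, exec_time = db.execute_query(case["generated_sql"])
            return {"question": case["question"], "results": results, "time": exec_time}
        finally:
            db.connection.close()

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(run_case, test_cases))
```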
Visualization and Reporting
Comprehensive evaluation systems need effective visualization and reporting components:
```python
import os

import matplotlib.pyplot as plt
import pandas as pd

def generate_evaluation_report(results, output_dir):
    # Create DataFrame for easier analysis
    df = pd.DataFrame(results)

    # Overall statistics
    validity_rate = df['validity'].mean()
    execution_accuracy = df['execution'].mean()
    avg_efficiency = df['efficiency'].mean()

    # Generate visualizations
    plt.figure(figsize=(10, 6))
    metrics = ['validity', 'execution', 'efficiency', 'composite']
    avg_scores = [df[metric].mean() for metric in metrics]
    plt.bar(metrics, avg_scores)
    plt.title('Average Performance Metrics')
    plt.ylabel('Score')
    plt.ylim(0, 1)
    plt.savefig(os.path.join(output_dir, 'performance_metrics.png'))
```
Effective implementations provide both high-level summaries and detailed per-query reports, enabling both quick overview and in-depth analysis.
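A small sketch of such a per-query report, reusing the pandas and os imports from the snippet above and the keys returned by the evaluation function:

```python
def write_per_query_report(results, output_dir):
    # One row per evaluated question, so individual failures can be inspected
    # alongside the aggregate chart.
    df = pd.DataFrame(results)
    columns = ["validity", "execution", "efficiency", "logical", "composite",
               "efficiency_explanation", "logical_explanation"]
    df.reindex(columns=columns).to_csv(
        os.path.join(output_dir, "per_query_results.csv"), index=False)
```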
Evaluating Cortex Analyst in Snowflake
To demonstrate the practical application of the evaluation framework, we conducted a comprehensive assessment of Snowflake’s Cortex Analyst, a text-to-SQL system integrated directly into the Snowflake data platform. Cortex Analyst represents an interesting case study as it combines large language model capabilities with native understanding of Snowflake’s SQL dialect and optimization features.
Experimental Setup
The evaluation was conducted using the framework described in previous sections, with the following configuration:
- Database Environment: Snowflake warehouse
- Test Dataset: 10 natural language questions spanning various complexity levels
- Evaluation Metrics: Validity, execution accuracy, efficiency, logical correctness, and composite score
- LLM Judge: GPT-4o for assessing efficiency and logical correctness
The test queries covered a range of SQL operations including simple selections, complex joins, aggregations, window functions, and nested subqueries. Questions were designed to test both basic functionality and advanced analytical capabilities that leverage Snowflake-specific features.
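For reference, each test case pairs a natural language question with a hand-written reference query. The structure below is illustrative only; the question, SQL, and tags are hypothetical rather than drawn from the actual dataset:

```python
# Illustrative test case structure; field names and values are hypothetical.
test_case = {
    "question": "What is the average order value per region for 2023?",
    "reference_sql": (
        "SELECT r.region_name, AVG(o.order_total) AS avg_order_value "
        "FROM orders o JOIN regions r ON o.region_id = r.region_id "
        "WHERE YEAR(o.order_date) = 2023 "
        "GROUP BY r.region_name"
    ),
    "complexity": "medium",                      # simple | medium | complex
    "features": ["join", "aggregation", "filter"],
}
```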
Performance Results
The evaluation results, shown in the performance metrics chart, reveal several interesting patterns in Cortex Analyst’s capabilities:
Validity (0.60): Cortex Analyst achieved a 60% validity rate, meaning 6 of the 10 generated queries executed without syntax or runtime errors. This suggests room for improvement in basic SQL generation, particularly for complex query structures or edge cases in schema interpretation.
Execution Accuracy (0.56): The execution accuracy of 56% indicates that slightly more than half of the valid queries produced results matching the expected ground truth. This metric reveals that even when queries execute successfully, semantic correctness remains a challenge.
Efficiency (0.28): The efficiency score of 28% represents the most significant performance gap. This low score suggests that Cortex Analyst often generates queries that, while functionally correct, are not optimally structured for Snowflake’s architecture. Common inefficiencies observed included:
- Suboptimal join ordering that doesn’t leverage Snowflake’s automatic clustering
- Missing opportunities to use Snowflake-specific optimization hints
- Unnecessary data movement between compute nodes
- Inefficient use of window functions where simpler aggregations would suffice
Logical Correctness (0.38): With a logical correctness score of 38%, the evaluation revealed that Cortex Analyst sometimes misses subtle logical requirements implied in natural language questions. This includes handling of NULL values, proper implementation of business logic constraints, and complete capture of filtering conditions.
Composite Score (0.47): The overall composite score of 47% reflects the weighted combination of all metrics, suggesting that while Cortex Analyst shows promise, significant improvements are needed across all evaluation dimensions.
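As a sanity check, the reported averages reproduce the composite score when combined with the weights used in the evaluation sketch earlier (0.15 validity, 0.45 execution, 0.25 efficiency, 0.15 logical):

```python
# Weighted combination of the reported average scores
composite = 0.15 * 0.60 + 0.45 * 0.56 + 0.25 * 0.28 + 0.15 * 0.38
print(round(composite, 2))  # 0.47
```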
Analysis of Performance Patterns
The evaluation revealed several interesting patterns in Cortex Analyst’s behavior:
- Simple Query Excellence: For straightforward SELECT statements with basic filtering, Cortex Analyst performed well, achieving near-perfect validity and accuracy scores.
- Complex Join Challenges: Performance degraded significantly with multi-table joins, particularly when involving more than three tables or requiring complex join conditions.
- Aggregation Inconsistencies: While basic aggregations (SUM, COUNT, AVG) were handled well, more complex analytical functions showed inconsistent results.
- Snowflake Feature Utilization: Despite being a native Snowflake service, Cortex Analyst showed limited use of Snowflake-specific optimizations like clustering keys or materialized views.
Methodological Considerations
Beyond implementation details, several methodological considerations affect the design of evaluation systems:
1. Test data selection: How representative and diverse is the test dataset? Does it cover various query complexities and SQL features?
2. Reference query quality: Are the reference queries genuinely optimal solutions to the natural language questions?
3. Metric interpretation: What thresholds of performance indicate acceptable or excellent performance?
4. Cross-database evaluation: How does performance vary across different database systems and dialects?
5. Evaluation reproducibility: Can evaluation results be consistently reproduced given the same system and test data?
These considerations should inform the design and implementation of evaluation systems, regardless of the specific technical approach chosen.
Conclusion
Implementing robust text-to-SQL evaluation systems requires addressing multiple dimensions of query quality, from syntax validity to execution accuracy to computational efficiency. Current approaches range from simple string matching to sophisticated multi-metric systems that leverage LLMs as judges. The implementation patterns discussed here provide a foundation for developing evaluation systems tailored to specific needs and environments. As text-to-SQL technology continues to evolve, so too will the methodologies and tools for assessing its performance—ultimately helping bridge the gap between natural language understanding and database query execution.
Future work in this area might explore more robust efficiency metrics, better handling of query equivalence, and standardized benchmarks across different database systems. The goal remains consistent: providing meaningful, multi-faceted assessment of how well text-to-SQL systems translate natural language intent into effective database queries.