General SQL Parser .NET Architecture¶
Overview¶
General SQL Parser .NET (GSP) is a self-contained C# library that parses SQL from 15 database vendors (DB2, Greenplum, Hive, Impala, Informix, MDX, MSSQL, MySQL, Netezza, Oracle, PostgreSQL, Redshift, Snowflake, Sybase, Teradata). It targets net10.0 for modern .NET and netstandard2.0 for compatibility with .NET Framework 4.6.1+. It has a multi-layered architecture designed for accuracy, extensibility, and per-vendor isolation.
High-level architecture¶
graph TB
subgraph "Input Layer"
A[SQL text / file / stream]
end
subgraph "Lexical Analysis Layer"
B[Per-vendor lexers: TLexerOracle, TLexerMssql, ...]
end
subgraph "Syntax Analysis Layer"
C[Per-vendor parsers: TParserOracleSql, TParserMssqlSql, ...]
C2[SqlParserFactory + ISqlParser delegates]
end
subgraph "AST Layer"
D[TParseTreeNode tree]
D1[Statement nodes (TSelectSqlStatement, ...)]
D2[Expression nodes (TExpression, TFunctionCall, ...)]
D3[Table nodes (TTable, TJoinExpr)]
D4[Clause nodes (TWhereClause, ...)]
end
subgraph "Output Layer"
F[Applications]
F1[SQL formatting (gudusoft.gsqlparser.pp)]
F2[Script writer (gudusoft.gsqlparser.scriptWriter)]
F3[Visitors / lineage]
end
A --> B
B --> C
C --> C2
C2 --> D
D --> F
style A fill:#e1f5fe
style D fill:#f3e5f5
style F fill:#e8f5e8
Core components¶
1. The entry point — TGSqlParser¶
The TGSqlParser class (in TGSqlParser.cs) is the only type a typical consumer interacts with. It owns the per-vendor lexer/parser pair, the source-token list, the syntax-error bag, and the resulting statement list.
1 2 3 4 5 6 7 8 9 10 11 | |
Three entry-point methods:
parse()/checkSyntax()— full lex + parse, builds the AST.tokenizeSqltext()— lex only, buildssourcetokenlistwithout an AST.getrawsqlstatements()— splits a script into raw statements without syntax checking.
2. Per-vendor parser delegates (parser/)¶
In recent versions, the per-vendor logic that used to live as a giant switch inside TGSqlParser has been refactored into a delegate-per-vendor architecture under gsp_dotnet_core/src/gsqlparser/parser/:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 | |
TGSqlParser constructs the matching delegate via SqlParserFactory.Create(vendor, this) and delegates per-vendor responsibilities (lexer construction, tokenisation, raw-statement extraction, parse-failure recovery) to it. Adding a new vendor means writing one new T<V>SqlParser class and registering it in the factory — the central dispatcher stays clean.
3. Per-vendor command tables (sqlcmds/)¶
A parallel split lives under gsp_dotnet_core/src/gsqlparser/sqlcmds/, with one TSqlCmds<V> class per vendor. These hold the dispatch table mapping parsed statement IDs to the right TCustomSqlStatement subclass for that vendor — for example, TSqlCmdsMssql knows that statement ID 217 maps to TMssqlCreateProcedure.
This is the layer that lets the same generic parse loop produce strongly-typed, vendor-specific statement classes.
4. Lexer and parser drivers¶
For each vendor, two driver classes live at the root of gsp_dotnet_core/src/gsqlparser/:
TLexer<V>.cs— the lexer driver, loadsparser/<vendor>/<vendor>_lex_table.zipTParser<V>Sql.cs— the parser driver, loadsparser/<vendor>/<vendor>_yacc_table.zip
These drivers are generated from .l (lex) and .y (yacc) grammar sources in the sibling gsp_dotnet_parser/ repository via the build_tables.sh toolchain. They are committed to the source tree so that a fresh clone builds without running the generator.
5. AST nodes (nodes/)¶
All AST nodes inherit from TParseTreeNode. The most common nodes:
| Node | Purpose | Class |
|---|---|---|
| Identifier | Object name (table, column, schema) | TObjectName |
| Literal | String / numeric / date constant | TConstant |
| Data type | VARCHAR(50), NUMBER(10,2), ... |
TTypeName |
| Function | SUM(x), LENGTH(s), ... |
TFunctionCall |
| Constraint | NOT NULL, CHECK (...), ... |
TConstraint |
| Expression / condition | Any boolean or arithmetic expression | TExpression |
| Result column | A single item in the SELECT list | TResultColumn |
| Table reference | A FROM-clause entry | TTable |
| Join expression | Nested join | TJoinExpr |
Visitor support is provided via IExpressionVisitor and the abstract TParseTreeVisitor base class.
6. Statement wrappers (stmt/)¶
Each kind of SQL statement has a high-level wrapper class that exposes statement-specific properties (WhereClause, ResultColumnList, targetTable, ...) on top of the generic AST. The plain ANSI cases live directly under stmt/ (TSelectSqlStatement, TInsertSqlStatement, TUpdateSqlStatement, ...). Vendor-specific statements live in subnamespaces:
stmt/mssql/—TMssqlBlock,TMssqlCreateProcedure,TMssqlExecute, ...stmt/oracle/—TPlsqlCreatePackage,TSqlplusCmdStatement, ...stmt/teradata/—TTeradataBTEQCmd, ...stmt/snowflake/— Snowflake-specific DDL
TGSqlParser constructs the right subclass based on the matched grammar rule, so code that does parser.sqlstatements.get(0) is TPlsqlCreatePackage Just Works.
7. Pretty printer (pp/)¶
The SQL formatter under gsp_dotnet_core/src/gsqlparser/pp/ takes a parsed AST and re-emits it as nicely-formatted SQL. Subpackages split responsibilities:
pp.processor— top-level driverpp.stmtformatter— per-statement formatters (FormatterFactory)pp.para— formatting options (GFmtOpt,GFmtOptFactory)pp.output/pp.print— output device abstraction
This is the engine behind the formatsql demo.
8. Script writer (scriptWriter/)¶
TScriptGenerator + TScriptWriter regenerate SQL from a (possibly modified) AST. Unlike the pretty printer, the script writer is meant for round-tripping — modify the AST, then emit the modified SQL for execution.
9. Convert helpers (convert/)¶
joinConverter.cs translates legacy join syntax between dialects (Oracle's (+) notation <-> ANSI joins).
10. Resources (Resources/)¶
Per-vendor keyword lists, function lists, and parser symbol tables ship as embedded resources:
Resources/<vendor>_keyword.properties— reserved-word listResources/<vendor>_function.properties— known built-in functionsResources/dataflow/<vendor>/function.properties— DataFlowAnalyzer per-function argument lineage hintsResources/parser_symbol_tables.zip— shared parser symbol-table data
Per-database conditional compilation¶
The library csproj wraps each vendor's parser-table data (parser/<vendor>/*.zip) in an <ItemGroup Condition="'$(includeXxx)' == 'true'"> block. This lets you ship a custom build that contains only the dialects you license. The published NuGet has all 15 dialects baked in.
The per-vendor T<V>SqlParser and T<V>Lexer .cs files are always compiled — only the embedded .zip data tables are conditional. This keeps the build matrix manageable.
Processing flow¶
Phase 1: Tokenisation¶
flowchart LR
A[SQL text] --> B[Character stream]
B --> C[TLexer<V>]
C --> D[TSourceTokenList]
subgraph "Token types"
F[Keywords]
G[Identifiers]
H[Literals]
I[Operators]
J[Whitespace]
K[Comments]
end
D --> F
D --> G
D --> H
D --> I
D --> J
D --> K
The lexer is a state machine generated from the vendor's .l (lex) file. It handles vendor-specific keyword recognition, literal formats, custom operators, and comment syntax variations.
Phase 2: Parsing¶
flowchart TD
A[Token stream] --> B[Grammar tables]
B --> C[TParser<V>Sql]
C --> D{Syntax valid?}
D -->|yes| E[Build AST nodes]
D -->|no| F[Push TSyntaxError]
E --> G[Per-vendor T<V>SqlParser delegate]
G --> H[TGSqlParser.sqlstatements]
F --> I[Continue or abort]
The parser is a yacc-generated LALR(1) parser. Errors are recoverable — a syntax error in one statement of a multi-statement script does not abort the others.
Vendor-specific dispatch¶
1 2 3 4 5 6 7 | |
Most consumer code does not need this — the per-vendor parser already produces the correct AST.
Design strengths¶
- Per-vendor isolation — adding a new dialect doesn't touch existing vendor code, thanks to the parser-delegate split.
- Self-contained — no third-party dependencies. The grammar tables ship as embedded resources.
- Cross-platform — multi-targets
net10.0andnetstandard2.0, so it runs on modern .NET, .NET Framework 4.6.1+, Mono, Unity, and Xamarin. - Round-tripping — the script writer lets consumers modify the AST and regenerate executable SQL.
- Backwards-compatible API — the public surface is stable across releases (per Gudu's compat policy).
Trade-offs¶
- Large library — the prebuilt NuGet is on the order of tens of MB because each of the 15 grammar tables ships as embedded data. Custom builds with
/p:includeXxx=falseflags can shrink this significantly. - Memory footprint — large SQL files produce large ASTs. For multi-megabyte scripts, parse statement-by-statement (see Performance Optimization).
- Not thread-safe per instance —
TGSqlParserinstances cannot be shared across threads. Use one instance per thread, or pool them.
Use-case patterns¶
SQL validation¶
1 2 3 4 5 | |
Data lineage analysis¶
The DataFlowAnalyzer demo at gsp_demo_dotnet/src/demos/dataFlowAnalyzer/ wraps TGSqlParser with column-level lineage tracking. It is the same engine used in production by Gudu's SQLFlow service.
SQL transformation / rewriting¶
1 2 3 4 | |
Conclusion¶
GSP .NET is a sophisticated, accuracy-focused SQL parser with broad dialect coverage and a clean per-vendor architecture. The recent refactor into per-vendor parser delegates and per-vendor command tables makes it straightforward to extend with new dialects or to specialise behaviour for a particular vendor without affecting the rest of the library.
For getting-started material, see the Quick Start and tutorials. For the full API surface, see the API reference.