Skip to content

General SQL Parser .NET Architecture

Overview

General SQL Parser .NET (GSP) is a self-contained C# library that parses SQL from 15 database vendors (DB2, Greenplum, Hive, Impala, Informix, MDX, MSSQL, MySQL, Netezza, Oracle, PostgreSQL, Redshift, Snowflake, Sybase, Teradata). It targets net10.0 for modern .NET and netstandard2.0 for compatibility with .NET Framework 4.6.1+. It has a multi-layered architecture designed for accuracy, extensibility, and per-vendor isolation.

High-level architecture

graph TB
    subgraph "Input Layer"
        A[SQL text / file / stream]
    end

    subgraph "Lexical Analysis Layer"
        B[Per-vendor lexers: TLexerOracle, TLexerMssql, ...]
    end

    subgraph "Syntax Analysis Layer"
        C[Per-vendor parsers: TParserOracleSql, TParserMssqlSql, ...]
        C2[SqlParserFactory + ISqlParser delegates]
    end

    subgraph "AST Layer"
        D[TParseTreeNode tree]
        D1[Statement nodes (TSelectSqlStatement, ...)]
        D2[Expression nodes (TExpression, TFunctionCall, ...)]
        D3[Table nodes (TTable, TJoinExpr)]
        D4[Clause nodes (TWhereClause, ...)]
    end

    subgraph "Output Layer"
        F[Applications]
        F1[SQL formatting (gudusoft.gsqlparser.pp)]
        F2[Script writer (gudusoft.gsqlparser.scriptWriter)]
        F3[Visitors / lineage]
    end

    A --> B
    B --> C
    C --> C2
    C2 --> D
    D --> F

    style A fill:#e1f5fe
    style D fill:#f3e5f5
    style F fill:#e8f5e8

Core components

1. The entry point — TGSqlParser

The TGSqlParser class (in TGSqlParser.cs) is the only type a typical consumer interacts with. It owns the per-vendor lexer/parser pair, the source-token list, the syntax-error bag, and the resulting statement list.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
var parser = new TGSqlParser(EDbVendor.dbvoracle);
parser.sqltext = "SELECT * FROM employees WHERE dept = 'IT'";
int result = parser.parse();
if (result == 0)
{
    var statements = parser.sqlstatements;
}
else
{
    Console.WriteLine(parser.Errormessage);
}

Three entry-point methods:

  • parse() / checkSyntax() — full lex + parse, builds the AST.
  • tokenizeSqltext() — lex only, builds sourcetokenlist without an AST.
  • getrawsqlstatements() — splits a script into raw statements without syntax checking.

2. Per-vendor parser delegates (parser/)

In recent versions, the per-vendor logic that used to live as a giant switch inside TGSqlParser has been refactored into a delegate-per-vendor architecture under gsp_dotnet_core/src/gsqlparser/parser/:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
parser/
  ISqlParser.cs              # the interface every delegate implements
  AbstractSqlParser.cs       # shared base with parse-loop helpers
  SqlParserFactory.cs        # vendor enum -> delegate instance
  TDb2SqlParser.cs
  TGreenplumSqlParser.cs
  THiveSqlParser.cs
  TImpalaSqlParser.cs
  TInformixSqlParser.cs
  TMdxSqlParser.cs
  TMssqlSqlParser.cs
  TMysqlSqlParser.cs
  TNetezzaSqlParser.cs
  TOracleSqlParser.cs
  TPostgresqlSqlParser.cs
  TRedshiftSqlParser.cs
  TSnowflakeSqlParser.cs
  TSybaseSqlParser.cs
  TTeradataSqlParser.cs

TGSqlParser constructs the matching delegate via SqlParserFactory.Create(vendor, this) and delegates per-vendor responsibilities (lexer construction, tokenisation, raw-statement extraction, parse-failure recovery) to it. Adding a new vendor means writing one new T<V>SqlParser class and registering it in the factory — the central dispatcher stays clean.

3. Per-vendor command tables (sqlcmds/)

A parallel split lives under gsp_dotnet_core/src/gsqlparser/sqlcmds/, with one TSqlCmds<V> class per vendor. These hold the dispatch table mapping parsed statement IDs to the right TCustomSqlStatement subclass for that vendor — for example, TSqlCmdsMssql knows that statement ID 217 maps to TMssqlCreateProcedure.

This is the layer that lets the same generic parse loop produce strongly-typed, vendor-specific statement classes.

4. Lexer and parser drivers

For each vendor, two driver classes live at the root of gsp_dotnet_core/src/gsqlparser/:

  • TLexer<V>.cs — the lexer driver, loads parser/<vendor>/<vendor>_lex_table.zip
  • TParser<V>Sql.cs — the parser driver, loads parser/<vendor>/<vendor>_yacc_table.zip

These drivers are generated from .l (lex) and .y (yacc) grammar sources in the sibling gsp_dotnet_parser/ repository via the build_tables.sh toolchain. They are committed to the source tree so that a fresh clone builds without running the generator.

5. AST nodes (nodes/)

All AST nodes inherit from TParseTreeNode. The most common nodes:

Node Purpose Class
Identifier Object name (table, column, schema) TObjectName
Literal String / numeric / date constant TConstant
Data type VARCHAR(50), NUMBER(10,2), ... TTypeName
Function SUM(x), LENGTH(s), ... TFunctionCall
Constraint NOT NULL, CHECK (...), ... TConstraint
Expression / condition Any boolean or arithmetic expression TExpression
Result column A single item in the SELECT list TResultColumn
Table reference A FROM-clause entry TTable
Join expression Nested join TJoinExpr

Visitor support is provided via IExpressionVisitor and the abstract TParseTreeVisitor base class.

6. Statement wrappers (stmt/)

Each kind of SQL statement has a high-level wrapper class that exposes statement-specific properties (WhereClause, ResultColumnList, targetTable, ...) on top of the generic AST. The plain ANSI cases live directly under stmt/ (TSelectSqlStatement, TInsertSqlStatement, TUpdateSqlStatement, ...). Vendor-specific statements live in subnamespaces:

  • stmt/mssql/TMssqlBlock, TMssqlCreateProcedure, TMssqlExecute, ...
  • stmt/oracle/TPlsqlCreatePackage, TSqlplusCmdStatement, ...
  • stmt/teradata/TTeradataBTEQCmd, ...
  • stmt/snowflake/ — Snowflake-specific DDL

TGSqlParser constructs the right subclass based on the matched grammar rule, so code that does parser.sqlstatements.get(0) is TPlsqlCreatePackage Just Works.

7. Pretty printer (pp/)

The SQL formatter under gsp_dotnet_core/src/gsqlparser/pp/ takes a parsed AST and re-emits it as nicely-formatted SQL. Subpackages split responsibilities:

  • pp.processor — top-level driver
  • pp.stmtformatter — per-statement formatters (FormatterFactory)
  • pp.para — formatting options (GFmtOpt, GFmtOptFactory)
  • pp.output / pp.print — output device abstraction

This is the engine behind the formatsql demo.

8. Script writer (scriptWriter/)

TScriptGenerator + TScriptWriter regenerate SQL from a (possibly modified) AST. Unlike the pretty printer, the script writer is meant for round-tripping — modify the AST, then emit the modified SQL for execution.

9. Convert helpers (convert/)

joinConverter.cs translates legacy join syntax between dialects (Oracle's (+) notation <-> ANSI joins).

10. Resources (Resources/)

Per-vendor keyword lists, function lists, and parser symbol tables ship as embedded resources:

  • Resources/<vendor>_keyword.properties — reserved-word list
  • Resources/<vendor>_function.properties — known built-in functions
  • Resources/dataflow/<vendor>/function.properties — DataFlowAnalyzer per-function argument lineage hints
  • Resources/parser_symbol_tables.zip — shared parser symbol-table data

Per-database conditional compilation

The library csproj wraps each vendor's parser-table data (parser/<vendor>/*.zip) in an <ItemGroup Condition="'$(includeXxx)' == 'true'"> block. This lets you ship a custom build that contains only the dialects you license. The published NuGet has all 15 dialects baked in.

The per-vendor T<V>SqlParser and T<V>Lexer .cs files are always compiled — only the embedded .zip data tables are conditional. This keeps the build matrix manageable.

Processing flow

Phase 1: Tokenisation

flowchart LR
    A[SQL text] --> B[Character stream]
    B --> C[TLexer<V>]
    C --> D[TSourceTokenList]

    subgraph "Token types"
        F[Keywords]
        G[Identifiers]
        H[Literals]
        I[Operators]
        J[Whitespace]
        K[Comments]
    end

    D --> F
    D --> G
    D --> H
    D --> I
    D --> J
    D --> K

The lexer is a state machine generated from the vendor's .l (lex) file. It handles vendor-specific keyword recognition, literal formats, custom operators, and comment syntax variations.

Phase 2: Parsing

flowchart TD
    A[Token stream] --> B[Grammar tables]
    B --> C[TParser<V>Sql]
    C --> D{Syntax valid?}
    D -->|yes| E[Build AST nodes]
    D -->|no| F[Push TSyntaxError]
    E --> G[Per-vendor T<V>SqlParser delegate]
    G --> H[TGSqlParser.sqlstatements]
    F --> I[Continue or abort]

The parser is a yacc-generated LALR(1) parser. Errors are recoverable — a syntax error in one statement of a multi-statement script does not abort the others.

Vendor-specific dispatch

1
2
3
4
5
6
7
// Example: vendor-specific behaviour driven by EDbVendor
switch (parser.DbVendor.Code) {
    case EDbVendor.dbvoracle: /* CONNECT BY, ROWNUM */ break;
    case EDbVendor.dbvmssql:  /* TOP, OUTPUT */ break;
    case EDbVendor.dbvpostgresql: /* LATERAL, UNNEST */ break;
    case EDbVendor.dbvsnowflake: /* QUALIFY */ break;
}

Most consumer code does not need this — the per-vendor parser already produces the correct AST.

Design strengths

  • Per-vendor isolation — adding a new dialect doesn't touch existing vendor code, thanks to the parser-delegate split.
  • Self-contained — no third-party dependencies. The grammar tables ship as embedded resources.
  • Cross-platform — multi-targets net10.0 and netstandard2.0, so it runs on modern .NET, .NET Framework 4.6.1+, Mono, Unity, and Xamarin.
  • Round-tripping — the script writer lets consumers modify the AST and regenerate executable SQL.
  • Backwards-compatible API — the public surface is stable across releases (per Gudu's compat policy).

Trade-offs

  • Large library — the prebuilt NuGet is on the order of tens of MB because each of the 15 grammar tables ships as embedded data. Custom builds with /p:includeXxx=false flags can shrink this significantly.
  • Memory footprint — large SQL files produce large ASTs. For multi-megabyte scripts, parse statement-by-statement (see Performance Optimization).
  • Not thread-safe per instanceTGSqlParser instances cannot be shared across threads. Use one instance per thread, or pool them.

Use-case patterns

SQL validation

1
2
3
4
5
var parser = new TGSqlParser(EDbVendor.dbvoracle) { sqltext = sqlScript };
if (parser.parse() == 0)
    AnalyseComplexity(parser.sqlstatements);
else
    ReportSyntaxErrors(parser.SyntaxErrors);

Data lineage analysis

The DataFlowAnalyzer demo at gsp_demo_dotnet/src/demos/dataFlowAnalyzer/ wraps TGSqlParser with column-level lineage tracking. It is the same engine used in production by Gudu's SQLFlow service.

SQL transformation / rewriting

1
2
3
4
var parser = new TGSqlParser(EDbVendor.dbvoracle) { sqltext = sourceSql };
parser.parse();
// Mutate the AST
// Emit via TScriptGenerator (see Advanced Features tutorial)

Conclusion

GSP .NET is a sophisticated, accuracy-focused SQL parser with broad dialect coverage and a clean per-vendor architecture. The recent refactor into per-vendor parser delegates and per-vendor command tables makes it straightforward to extend with new dialects or to specialise behaviour for a particular vendor without affecting the rest of the library.

For getting-started material, see the Quick Start and tutorials. For the full API surface, see the API reference.