Skip to content

Parse-tree internals

This page is the conceptual map you need before you start walking parse trees with General SQL Parser .NET. It covers what the AST is, how it's shaped, what the visitor contract guarantees, and why the design is the way it is. If you just want copy-pasteable recipes, jump to the Walk the parse tree how-to instead.

What the parser actually produces

When you call parser.parse() on TGSqlParser, the output is a list of statements, each one a tree of nodes:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
TGSqlParser
  └── sqlstatements : TStatementList         (one entry per ;-separated statement)
        ├── [0] TSelectSqlStatement          (one of many TCustomSqlStatement subclasses)
        │     ├── ResultColumnList           (the SELECT list)
        │     ├── joins / tables             (the FROM clause)
        │     ├── WhereClause                (the WHERE clause)
        │     ├── GroupByClause
        │     ├── OrderbyClause
        │     └── ...
        ├── [1] TInsertSqlStatement
        ├── [2] TUpdateSqlStatement
        └── ...

Two ideas are doing the work here:

  1. Statements are first-class. The top-level container is a list of statements, not a single root node. A 12-statement script gives you 12 entries in sqlstatements. Each one is its own self-contained tree.
  2. Every tree node descends from TParseTreeNode. Statements, clauses, expressions, table references, column references — they're all TParseTreeNode subclasses, so the same traversal machinery works on all of them.

The base class — TParseTreeNode

Defined in gsp_dotnet_core/src/gsqlparser/nodes/TParseTreeNode.cs. Every node carries:

Member What it is Why it matters
dbvendor EDbVendor of the script that produced this node A node knows its own dialect; visitors can branch on it.
startToken, endToken The first and last source tokens covered by this node Lets you map any AST node back to a (line, column) range in the original SQL — the foundation for highlighting, error reporting, and source-preserving rewrites.
Gsqlparser Back-reference to the owning TGSqlParser Useful for tools that walk the tree and need to look up other state (errors, source text).
accept(TParseTreeVisitor) Calls visitor.preVisit(this), then acceptChildren, then postVisit(this) The dispatcher entry point — see below.
acceptChildren(TParseTreeVisitor) Walks every direct child of this node and calls accept on each Concrete subclasses override this to expose their child fields in a defined order.

There is no parent pointer on TParseTreeNode. The tree is walked top-down. If you need parent information during a walk, carry it in your visitor's state (a stack), or compute it from startToken.posinlist.

Why no parent pointer?

The tree is built bottom-up by the parser, so giving every node a parent pointer would either require a second pass to set them or impose mutability constraints on every constructor. The visitor pattern obviates the need: preVisit is called before any child, postVisit is called after every descendant has been visited, so a visitor that maintains a Stack<TParseTreeNode> of "current ancestors" gets parent information for free.

The two trees that touch each AST node

When you parse SQL, two parallel data structures exist for every statement:

Structure Lives on What it is What you use it for
AST TCustomSqlStatement and its tree of TParseTreeNode children Logical structure: SELECT has a column list, a from list, a where clause, etc. Semantic analysis, lineage, transformation.
Token list TGSqlParser.sourcetokenlist (and per-statement subsets) Linear sequence of every lexical token: keywords, identifiers, whitespace, comments Source-preserving rewrites, exact-position error messages, the formatter.

Each AST node points back into the token list via startToken and endToken. Comments and whitespace are tokens but are not AST nodes. If you see a node in the AST, it has a token range; if you see a token in the source, it may or may not have a node attached.

The statement hierarchy (stmt/)

Top-level entries in sqlstatements are subclasses of TCustomSqlStatement (defined at gsp_dotnet_core/src/gsqlparser/TCustomSqlStatement.cs). The base class exposes a small set of "every statement might have these" properties:

Property Type Present on
sqlstatementtype ESqlStatementType enum always
tables TTableList SELECT, INSERT, UPDATE, DELETE, MERGE
joins TJoinList SELECT
ResultColumnList TResultColumnList SELECT, INSERT INTO ... SELECT, anywhere a projection appears
WhereClause TWhereClause SELECT, UPDATE, DELETE

The first thing most walkers do is switch on sqlstatementtype to decide which subclass to cast to:

1
2
3
4
5
6
7
8
switch (stmt.sqlstatementtype)
{
    case ESqlStatementType.sstselect:  HandleSelect((TSelectSqlStatement) stmt); break;
    case ESqlStatementType.sstinsert:  HandleInsert((TInsertSqlStatement) stmt); break;
    case ESqlStatementType.sstupdate:  HandleUpdate((TUpdateSqlStatement) stmt); break;
    case ESqlStatementType.sstdelete:  HandleDelete((TDeleteSqlStatement) stmt); break;
    // ...
}

The full enum (ESqlStatementType) has over 200 entries — every DDL statement, every dialect-specific extension, every PL/SQL block type. Most of them inherit TCustomSqlStatement directly; some (especially CREATE TABLE / CREATE PROCEDURE / etc.) live under stmt/<vendor>/ because the syntax is dialect-specific.

The two visitor interfaces

There are two visitor interfaces. Knowing which to use is half the battle.

TParseTreeVisitor — the general-purpose visitor

Defined at gsp_dotnet_core/src/gsqlparser/nodes/TParseTreeVisitor.cs. A class with hundreds of virtual preVisit(SomeNodeType) / postVisit(SomeNodeType) methods — one pair per concrete node class.

You inherit from it, override the preVisit overloads for the node types you care about, then call node.acceptChildren(yourVisitor). The dispatcher takes care of calling the right overload based on the node's runtime type.

Why pre + post: preVisit runs on the way down. postVisit runs on the way back up. If you're tracking nesting depth, building a parent stack, or measuring subtree sizes, you need both.

1
2
3
4
5
6
7
8
9
class FunctionFinder : TParseTreeVisitor
{
    public override void preVisit(TFunctionCall node)
    {
        Console.WriteLine("found: " + node.FunctionName);
    }
}

stmt.acceptChildren(new FunctionFinder());

That's the whole pattern. Override only what you care about; everything else is a no-op default.

IExpressionVisitor — the expression-only walker

Defined at gsp_dotnet_core/src/gsqlparser/nodes/IExpressionVisitor.cs:

1
2
3
4
public interface IExpressionVisitor
{
    bool exprVisit(TParseTreeNode pNode, bool isLeafNode);
}

A simpler shape: one exprVisit(node, isLeaf) method, called once per node in an expression tree. The boolean tells you whether the node is a leaf (column ref, literal) vs an internal node (operator, function call). Returning false aborts the walk early.

Used when you're inside a TExpression (a WHERE condition, a CASE WHEN, a function argument) and want to walk only that expression without descending into other clauses. Most consumers reach for TParseTreeVisitor first; IExpressionVisitor is the right choice when you're already holding a TExpression and want to enumerate its operands.

How accept and acceptChildren are wired

The base TParseTreeNode.accept(v) does this:

1
2
3
4
5
6
public virtual void accept(TParseTreeVisitor v)
{
    v.preVisit(this);          // dispatched to the right overload by overload resolution
    this.acceptChildren(v);
    v.postVisit(this);
}

Every concrete subclass overrides acceptChildren to traverse its own field set in a defined order. For example, TSelectSqlStatement.acceptChildren(v) walks the CTE list, the result-column list, the from-list, the where clause, the group-by, the order-by, the limit, and so on — in source order. Calling acceptChildren(v) on a node you hold gives you a complete depth-first walk of its subtree, with preVisit firing before each node and postVisit after.

Two corollaries fall out of this:

  • Visit order is deterministic and source-faithful. The visitor will see clauses in the order the SQL author wrote them. This matters for code that emits commentary or builds linear reports.
  • Nothing is hidden. If a clause is in the AST, the visitor sees it. If a feature is missing from the visitor surface, the parser didn't expose it as a node — which is a parser bug to file, not a "did I forget to traverse it?" question.

Source positions: the startToken / endToken pair

Every node carries a startToken and endToken pointing into TGSqlParser.sourcetokenlist. Each TSourceToken has lineNo, columnNo, astext (raw source text), tokencode, and tokentype. So:

  • Position highlighting (e.g. an IDE selecting "this expression"): (node.startToken.lineNo, node.startToken.columnNo) to (node.endToken.lineNo, node.endToken.columnNo + node.endToken.astext.Length).
  • Source-preserving rewrites (e.g. "rename column X to Y"): mutate the token's astext and re-emit the whole token list. The formatter and TScriptGenerator rely on this.
  • Comments: comment tokens are interleaved with solid tokens in the token list. Walking a node's [startToken..endToken] range hits both — no separate "comment list" needed.

If you build a custom rewriter, prefer mutating tokens over rewriting nodes. A token mutation always re-emits cleanly; a structural AST mutation may need careful re-linking and risks losing comments.

Why the design is shaped this way

A few load-bearing choices, with their reasons:

Decision Rationale
Pseudo-enums (InnerEnum) instead of true C# enums for ENodeType etc. The codebase was originally auto-ported from Java. The pattern lets ENodeType double as a runtime registry mapping IDs to FQ class names — the parser uses that registry to instantiate nodes when reading lex/yacc tables.
Visitor methods on a single huge TParseTreeVisitor class instead of per-node IVisitor<T> interfaces Keeps the visitor surface discoverable in one file, lets users override one method without implementing the whole hierarchy. The C# overload resolver does the type dispatch for free.
No parent pointer on nodes Avoids constructor-time mutability and parent-pointer corruption during AST mutation. Callers that need parents either track them in visitor state or compute them via startToken.posinlist.
Separate AST + token list The AST captures meaning; the token list captures exact source. Splitting them makes the formatter, the script writer, and the parser's own error-reporting all simpler.
Vendor-specific node subclasses under nodes/<vendor>/ Most SQL is shared, but a handful of constructs (Oracle (+) joins, MSSQL OPENROWSET, Hive LATERAL VIEW, Snowflake FLATTEN) only exist in one dialect. They get their own node types, and the visitor surface gets matching overloads.

See also