Parse-tree internals¶
This page is the conceptual map you need before you start walking parse trees with General SQL Parser .NET. It covers what the AST is, how it's shaped, what the visitor contract guarantees, and why the design is the way it is. If you just want copy-pasteable recipes, jump to the Walk the parse tree how-to instead.
What the parser actually produces¶
When you call parser.parse() on TGSqlParser, the output is a list of statements, each one a tree of nodes:
1 2 3 4 5 6 7 8 9 10 11 12 | |
Two ideas are doing the work here:
- Statements are first-class. The top-level container is a list of statements, not a single root node. A 12-statement script gives you 12 entries in
sqlstatements. Each one is its own self-contained tree. - Every tree node descends from
TParseTreeNode. Statements, clauses, expressions, table references, column references — they're allTParseTreeNodesubclasses, so the same traversal machinery works on all of them.
The base class — TParseTreeNode¶
Defined in gsp_dotnet_core/src/gsqlparser/nodes/TParseTreeNode.cs. Every node carries:
| Member | What it is | Why it matters |
|---|---|---|
dbvendor |
EDbVendor of the script that produced this node |
A node knows its own dialect; visitors can branch on it. |
startToken, endToken |
The first and last source tokens covered by this node | Lets you map any AST node back to a (line, column) range in the original SQL — the foundation for highlighting, error reporting, and source-preserving rewrites. |
Gsqlparser |
Back-reference to the owning TGSqlParser |
Useful for tools that walk the tree and need to look up other state (errors, source text). |
accept(TParseTreeVisitor) |
Calls visitor.preVisit(this), then acceptChildren, then postVisit(this) |
The dispatcher entry point — see below. |
acceptChildren(TParseTreeVisitor) |
Walks every direct child of this node and calls accept on each |
Concrete subclasses override this to expose their child fields in a defined order. |
There is no parent pointer on TParseTreeNode. The tree is walked top-down. If you need parent information during a walk, carry it in your visitor's state (a stack), or compute it from startToken.posinlist.
Why no parent pointer?
The tree is built bottom-up by the parser, so giving every node a parent pointer would either require a second pass to set them or impose mutability constraints on every constructor. The visitor pattern obviates the need: preVisit is called before any child, postVisit is called after every descendant has been visited, so a visitor that maintains a Stack<TParseTreeNode> of "current ancestors" gets parent information for free.
The two trees that touch each AST node¶
When you parse SQL, two parallel data structures exist for every statement:
| Structure | Lives on | What it is | What you use it for |
|---|---|---|---|
| AST | TCustomSqlStatement and its tree of TParseTreeNode children |
Logical structure: SELECT has a column list, a from list, a where clause, etc. | Semantic analysis, lineage, transformation. |
| Token list | TGSqlParser.sourcetokenlist (and per-statement subsets) |
Linear sequence of every lexical token: keywords, identifiers, whitespace, comments | Source-preserving rewrites, exact-position error messages, the formatter. |
Each AST node points back into the token list via startToken and endToken. Comments and whitespace are tokens but are not AST nodes. If you see a node in the AST, it has a token range; if you see a token in the source, it may or may not have a node attached.
The statement hierarchy (stmt/)¶
Top-level entries in sqlstatements are subclasses of TCustomSqlStatement (defined at gsp_dotnet_core/src/gsqlparser/TCustomSqlStatement.cs). The base class exposes a small set of "every statement might have these" properties:
| Property | Type | Present on |
|---|---|---|
sqlstatementtype |
ESqlStatementType enum |
always |
tables |
TTableList |
SELECT, INSERT, UPDATE, DELETE, MERGE |
joins |
TJoinList |
SELECT |
ResultColumnList |
TResultColumnList |
SELECT, INSERT INTO ... SELECT, anywhere a projection appears |
WhereClause |
TWhereClause |
SELECT, UPDATE, DELETE |
The first thing most walkers do is switch on sqlstatementtype to decide which subclass to cast to:
1 2 3 4 5 6 7 8 | |
The full enum (ESqlStatementType) has over 200 entries — every DDL statement, every dialect-specific extension, every PL/SQL block type. Most of them inherit TCustomSqlStatement directly; some (especially CREATE TABLE / CREATE PROCEDURE / etc.) live under stmt/<vendor>/ because the syntax is dialect-specific.
The two visitor interfaces¶
There are two visitor interfaces. Knowing which to use is half the battle.
TParseTreeVisitor — the general-purpose visitor¶
Defined at gsp_dotnet_core/src/gsqlparser/nodes/TParseTreeVisitor.cs. A class with hundreds of virtual preVisit(SomeNodeType) / postVisit(SomeNodeType) methods — one pair per concrete node class.
You inherit from it, override the preVisit overloads for the node types you care about, then call node.acceptChildren(yourVisitor). The dispatcher takes care of calling the right overload based on the node's runtime type.
Why pre + post: preVisit runs on the way down. postVisit runs on the way back up. If you're tracking nesting depth, building a parent stack, or measuring subtree sizes, you need both.
1 2 3 4 5 6 7 8 9 | |
That's the whole pattern. Override only what you care about; everything else is a no-op default.
IExpressionVisitor — the expression-only walker¶
Defined at gsp_dotnet_core/src/gsqlparser/nodes/IExpressionVisitor.cs:
1 2 3 4 | |
A simpler shape: one exprVisit(node, isLeaf) method, called once per node in an expression tree. The boolean tells you whether the node is a leaf (column ref, literal) vs an internal node (operator, function call). Returning false aborts the walk early.
Used when you're inside a TExpression (a WHERE condition, a CASE WHEN, a function argument) and want to walk only that expression without descending into other clauses. Most consumers reach for TParseTreeVisitor first; IExpressionVisitor is the right choice when you're already holding a TExpression and want to enumerate its operands.
How accept and acceptChildren are wired¶
The base TParseTreeNode.accept(v) does this:
1 2 3 4 5 6 | |
Every concrete subclass overrides acceptChildren to traverse its own field set in a defined order. For example, TSelectSqlStatement.acceptChildren(v) walks the CTE list, the result-column list, the from-list, the where clause, the group-by, the order-by, the limit, and so on — in source order. Calling acceptChildren(v) on a node you hold gives you a complete depth-first walk of its subtree, with preVisit firing before each node and postVisit after.
Two corollaries fall out of this:
- Visit order is deterministic and source-faithful. The visitor will see clauses in the order the SQL author wrote them. This matters for code that emits commentary or builds linear reports.
- Nothing is hidden. If a clause is in the AST, the visitor sees it. If a feature is missing from the visitor surface, the parser didn't expose it as a node — which is a parser bug to file, not a "did I forget to traverse it?" question.
Source positions: the startToken / endToken pair¶
Every node carries a startToken and endToken pointing into TGSqlParser.sourcetokenlist. Each TSourceToken has lineNo, columnNo, astext (raw source text), tokencode, and tokentype. So:
- Position highlighting (e.g. an IDE selecting "this expression"):
(node.startToken.lineNo, node.startToken.columnNo)to(node.endToken.lineNo, node.endToken.columnNo + node.endToken.astext.Length). - Source-preserving rewrites (e.g. "rename column X to Y"): mutate the token's
astextand re-emit the whole token list. The formatter andTScriptGeneratorrely on this. - Comments: comment tokens are interleaved with solid tokens in the token list. Walking a node's
[startToken..endToken]range hits both — no separate "comment list" needed.
If you build a custom rewriter, prefer mutating tokens over rewriting nodes. A token mutation always re-emits cleanly; a structural AST mutation may need careful re-linking and risks losing comments.
Why the design is shaped this way¶
A few load-bearing choices, with their reasons:
| Decision | Rationale |
|---|---|
Pseudo-enums (InnerEnum) instead of true C# enums for ENodeType etc. |
The codebase was originally auto-ported from Java. The pattern lets ENodeType double as a runtime registry mapping IDs to FQ class names — the parser uses that registry to instantiate nodes when reading lex/yacc tables. |
Visitor methods on a single huge TParseTreeVisitor class instead of per-node IVisitor<T> interfaces |
Keeps the visitor surface discoverable in one file, lets users override one method without implementing the whole hierarchy. The C# overload resolver does the type dispatch for free. |
| No parent pointer on nodes | Avoids constructor-time mutability and parent-pointer corruption during AST mutation. Callers that need parents either track them in visitor state or compute them via startToken.posinlist. |
| Separate AST + token list | The AST captures meaning; the token list captures exact source. Splitting them makes the formatter, the script writer, and the parser's own error-reporting all simpler. |
Vendor-specific node subclasses under nodes/<vendor>/ |
Most SQL is shared, but a handful of constructs (Oracle (+) joins, MSSQL OPENROWSET, Hive LATERAL VIEW, Snowflake FLATTEN) only exist in one dialect. They get their own node types, and the visitor surface gets matching overloads. |
See also¶
- Walk the parse tree — practical recipes built on the concepts above
- Software Architecture — the broader component-level view (parser ↔ AST ↔ statement classes ↔ formatter)
- AST Tree Nodes Reference — the per-node catalog (what fields each concrete node exposes)