
Performance Considerations

A practical guide to GSP .NET's performance characteristics — what's fast, what's slow, and what to do about it.

Typical numbers

On a recent x86_64 Linux machine running .NET 10:

Workload                                              Time
Simple SELECT (one table, one WHERE)                  < 1 ms
Complex SELECT (multi-join, CTE, window functions)    1–10 ms
Stored procedure (a few hundred lines)                10–100 ms
Tokenisation only                                     ~5–10× faster than full parsing
Steady-state batch parsing (reusing one TGSqlParser)  1000+ statements/sec for typical SELECTs

These are order-of-magnitude figures — your mileage will vary based on SQL complexity, vendor (Oracle PL/SQL is the slowest), and CPU.

What's fast

  • Tokenisation — running tokenizeSqltext() skips parsing entirely. Use this when you only need keyword/identifier/literal info.
  • Re-parsing on the same TGSqlParser instance — once the grammar tables are loaded into memory, subsequent parse() calls are cheap.
  • ANSI DML on most vendors — SELECT/INSERT/UPDATE/DELETE without procedural blocks parse in well under a millisecond.

What's slow

  • Parser construction — loading the embedded grammar tables takes hundreds of milliseconds the first time. Reuse instances.
  • Oracle PL/SQL — the largest single grammar in the library. Procedural blocks are ~10× slower than equivalently-sized DML.
  • Pathologically nested expressions — a 1000-deep nesting of binary operators can blow the parser stack and is slow even when it succeeds. Real SQL doesn't look like this.
  • Multi-megabyte single statements — uncommon, but the AST scales linearly with statement size.

Tactics

Reuse parser instances

The single biggest performance win:

var parser = new TGSqlParser(EDbVendor.dbvoracle);  // construct once, outside the loop
foreach (string sql in batch)
{
    parser.sqltext = sql;
    if (parser.parse() != 0) continue;  // non-zero return means a parse error
    Process(parser.sqlstatements);
}

Parse statement-by-statement for large files

For multi-MB scripts, split first and parse one statement at a time. This keeps peak memory bounded.
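A minimal sketch of the splitting step. StreamStatements here is a hypothetical helper, not part of GSP: it naively splits on semicolons outside single-quoted string literals, which is enough for plain DML scripts. Real-world SQL (PL/SQL blocks, comments, dollar-quoting) needs a smarter splitter — or GSP's own tokeniser.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Text;

public static class SqlSplitter
{
    // Naive statement splitter: yields one statement per semicolon,
    // ignoring semicolons inside single-quoted string literals.
    public static IEnumerable<string> StreamStatements(TextReader reader)
    {
        var current = new StringBuilder();
        bool inString = false;
        int c;
        while ((c = reader.Read()) != -1)
        {
            char ch = (char)c;
            if (ch == '\'') inString = !inString;
            if (ch == ';' && !inString)
            {
                string stmt = current.ToString().Trim();
                if (stmt.Length > 0) yield return stmt;
                current.Clear();
            }
            else
            {
                current.Append(ch);
            }
        }
        string tail = current.ToString().Trim();
        if (tail.Length > 0) yield return tail;  // last statement may lack a ';'
    }
}
```

Each yielded statement is assigned to parser.sqltext in turn, so only one statement's text and AST are live at any moment; wrapping File.OpenText(path) gives the path-based form used in the memory-management example below.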

Tokenise when you don't need an AST

1
2
3
parser.sqltext = sql;
parser.tokenizeSqltext();
foreach (var token in parser.sourcetokenlist) { /* ... */ }

Multi-threading

TGSqlParser is not thread-safe. For parallel workloads, give each thread its own instance, or pool them with ConcurrentBag<TGSqlParser>. See Performance Optimization for code samples.
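A minimal sketch of the pooling idea — the pool type and names below are illustrative, not part of GSP. Instantiated as SimplePool&lt;TGSqlParser&gt; with a factory of () => new TGSqlParser(EDbVendor.dbvoracle), each worker rents a pre-built parser instead of paying the construction cost, and no rented instance is ever shared between two threads at once.

```csharp
using System;
using System.Collections.Concurrent;

// Illustrative ConcurrentBag-backed pool: rent an existing instance
// if one is free, otherwise build a new one with the factory, and
// put it back in the bag when done. ConcurrentBag is thread-safe.
public sealed class SimplePool<T>
{
    private readonly ConcurrentBag<T> _bag = new ConcurrentBag<T>();
    private readonly Func<T> _factory;

    public SimplePool(Func<T> factory) => _factory = factory;

    public T Rent() => _bag.TryTake(out var item) ? item : _factory();

    public void Return(T item) => _bag.Add(item);
}
```

The rent/parse/return cycle amortises the expensive grammar-table load across all threads while keeping each parse confined to one thread.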

Memory management

The AST holds references to the entire sourcetokenlist for line/column tracking. After processing a statement, drop the parser reference so the AST can be collected:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
public static IEnumerable<MyResult> Process(string path)
{
    foreach (string stmt in StreamStatements(path))
    {
        var parser = new TGSqlParser(EDbVendor.dbvoracle) { sqltext = stmt };
        if (parser.parse() != 0) continue;
        yield return Analyse(parser.sqlstatements.get(0));
        // parser & AST eligible for GC at end of iteration
    }
}

For long-running services, an explicit GC.Collect() after major batches can help — the AST objects are large enough to land in Gen2 and stick around until the next full collection.
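For example — the generation and mode arguments here are a judgment call, not a GSP requirement:

```csharp
using System;

// After a large batch, request a Gen2 collection so the retired
// ASTs don't linger until the runtime's next natural full GC.
// GCCollectionMode.Optimized lets the GC skip the collection if
// it judges it not worthwhile.
GC.Collect(2, GCCollectionMode.Optimized);
GC.WaitForPendingFinalizers();
```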

When to suspect the parser

Most performance problems in code that uses GSP are not in the parser. Common culprits, in descending order:

  1. Allocations in your visitor code — building strings inside a preVisit hook that runs on every node.
  2. String concatenation in a tight loop — use StringBuilder.
  3. Parser construction in a tight loop — see "Reuse parser instances" above.
  4. Re-parsing the same SQL repeatedly — cache the AST.
  5. Parsing with the heavyweight PL/SQL grammar on workloads that don't actually need it — of the production workloads we've seen, none truly did.
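For culprit 2 the fix is mechanical — this example is illustrative only; the loop body stands in for whatever your visitor accumulates:

```csharp
using System.Collections.Generic;
using System.Text;

public static class Culprits
{
    // Slow: each += allocates a fresh string, O(n^2) total copying.
    //   string acc = "";
    //   foreach (var name in names) acc += name + ", ";
    //
    // Fast: one growable buffer, amortised O(n).
    public static string JoinNames(IEnumerable<string> names)
    {
        var sb = new StringBuilder();
        foreach (var name in names)
            sb.Append(name).Append(", ");
        return sb.ToString();
    }
}
```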

If you've ruled all of those out and still see slow parsing, please file an issue with the offending SQL — there's almost certainly a grammar pessimisation we can fix.

Profiling

dotnet-counters monitor --process-id <pid> and dotnet-trace collect work fine against GSP-using processes. For micro-benchmarking, BenchmarkDotNet is the standard.

See also