
Performance Optimization

Tips and techniques for optimising parser throughput when you process thousands of statements or multi-megabyte SQL scripts.

Reuse parser instances

Parser construction is the slowest single operation because it loads embedded grammar tables from the assembly resources. For batch workloads, instantiate one parser per vendor and reuse it.

using gudusoft.gsqlparser;

// Once
var parser = new TGSqlParser(EDbVendor.dbvoracle);

// Per statement
foreach (string sql in batch)
{
    parser.sqltext = sql;
    if (parser.parse() != 0) continue;
    Process(parser.sqlstatements);
}

Re-parsing on the same instance is cheap — the grammar tables stay in memory.

Parse statement-by-statement for large files

Loading a 100 MB SQL script as one string and calling parse() will allocate a proportionally large AST. For huge files, split the input by the vendor's delimiter (typically ;, plus GO for MSSQL) and parse one statement at a time.

foreach (string statement in SplitStatements(File.ReadAllText(largeFile)))
{
    parser.sqltext = statement;
    if (parser.parse() == 0)
        Process(parser.sqlstatements.get(0));
}
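SplitStatements is left to you here; below is a minimal sketch, assuming statements are separated by ; and that single-quoted string literals are the only place a ; can appear without terminating a statement. Comments, q-quoted literals, and GO batch separators are not handled — a production splitter must deal with all three.

```csharp
using System.Collections.Generic;

static class SqlSplitter
{
    // Naive statement splitter: splits on ';' while skipping single-quoted
    // string literals. Comments and vendor batch separators are NOT handled.
    public static IEnumerable<string> SplitStatements(string script)
    {
        int start = 0;
        bool inString = false;
        for (int i = 0; i < script.Length; i++)
        {
            char c = script[i];
            if (c == '\'') inString = !inString;
            else if (c == ';' && !inString)
            {
                string stmt = script.Substring(start, i - start).Trim();
                if (stmt.Length > 0) yield return stmt;
                start = i + 1;
            }
        }
        string tail = script.Substring(start).Trim();
        if (tail.Length > 0) yield return tail;
    }
}
```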

For Oracle, use parser.getrawsqlstatements() first — it splits the file without doing full syntax checking, which is enough to feed individual statements back through parse().
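That pattern might look like the sketch below. The return type of getrawsqlstatements() varies between GSP versions; here it is assumed to be an enumerable of statement objects whose ToString() yields the raw text — check the API of your version before relying on this.

```csharp
// Sketch, not a definitive implementation: split first, then parse each
// raw statement individually so only one AST is alive at a time.
parser.sqltext = File.ReadAllText(largeFile);
foreach (var raw in parser.getrawsqlstatements())
{
    parser.sqltext = raw.ToString(); // assumed accessor; version-dependent
    if (parser.parse() == 0)
        Process(parser.sqlstatements.get(0));
}
```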

Tokenise instead of parse

If you only need lexical info (keywords, identifiers, literals) — for example to count distinct table-name tokens — call tokenizeSqltext() instead. It runs the lexer only and skips parsing entirely.

parser.sqltext = sql;
parser.tokenizeSqltext();
foreach (TSourceToken tok in parser.sourcetokenlist)
    Console.WriteLine($"{tok.tokencode}\t{tok.astext}");

Tokenisation is roughly 5–10× faster than full parsing on the same input.
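The distinct-identifier count mentioned above can be sketched like this. The token-type check is an assumption — the enum name for identifier tokens varies by GSP version, so verify it against your version's API:

```csharp
using System;
using System.Collections.Generic;

// Count distinct identifier tokens using the lexer only (no parse).
var names = new HashSet<string>(StringComparer.OrdinalIgnoreCase);
parser.sqltext = sql;
parser.tokenizeSqltext();
foreach (TSourceToken tok in parser.sourcetokenlist)
    if (tok.tokentype == ETokenType.ttidentifier) // enum name is version-dependent
        names.Add(tok.astext);
Console.WriteLine($"{names.Count} distinct identifiers");
```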

Multi-threading

A TGSqlParser instance is not thread-safe. For parallel workloads, give each thread its own parser:

Parallel.ForEach(
    sqls,
    () => new TGSqlParser(EDbVendor.dbvoracle), // thread-local parser
    (sql, _, parser) =>
    {
        parser.sqltext = sql;
        if (parser.parse() == 0)          // skip statements that fail to parse
            Process(parser.sqlstatements);
        return parser;
    },
    _ => { });

You can pool parsers across threads with ConcurrentBag<TGSqlParser> if construction overhead becomes the bottleneck.
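A pooled variant might look like this sketch (error handling omitted; ParsePooled and Process are placeholder names):

```csharp
using System.Collections.Concurrent;

var pool = new ConcurrentBag<TGSqlParser>();

void ParsePooled(string sql)
{
    // Take an idle parser from the pool, or build one if the pool is empty.
    if (!pool.TryTake(out var parser))
        parser = new TGSqlParser(EDbVendor.dbvoracle);
    try
    {
        parser.sqltext = sql;
        if (parser.parse() == 0)
            Process(parser.sqlstatements);
    }
    finally
    {
        pool.Add(parser); // return the instance for reuse by any thread
    }
}
```

Because each parser is exclusively held between TryTake and Add, the not-thread-safe instances are never shared concurrently.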

Avoid triggering PL/SQL when you don't need it

Oracle's PL/SQL grammar is the largest single grammar in the library. If your workload is pure DML/DDL and never contains procedural blocks, you never trigger it and pay nothing extra. But any time the parser encounters a BEGIN..END block or CREATE PROCEDURE, expect roughly an order-of-magnitude slowdown compared with a SELECT of the same size, so plan for it.
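If you want to route procedural statements to a slower path or a separate queue, a cheap keyword pre-check is usually enough. A sketch, assuming these keywords cover your procedural workload (ProceduralCheck is a hypothetical helper, and the heuristic can false-positive on identifiers that happen to match):

```csharp
using System.Text.RegularExpressions;

static class ProceduralCheck
{
    private static readonly Regex ProceduralMarker = new Regex(
        @"\b(BEGIN|DECLARE|CREATE\s+(OR\s+REPLACE\s+)?(PROCEDURE|FUNCTION|PACKAGE|TRIGGER))\b",
        RegexOptions.IgnoreCase | RegexOptions.Compiled);

    // True if the statement likely contains a procedural block
    // and will therefore parse an order of magnitude slower.
    public static bool LooksProcedural(string sql) => ProceduralMarker.IsMatch(sql);
}
```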

Memory management

The AST keeps references to source tokens for line/column tracking. After processing each statement, drop your reference to the parser so the AST can be collected:

public static IEnumerable<MyResult> ProcessFile(string path)
{
    foreach (string stmt in StreamStatements(path))
    {
        var parser = new TGSqlParser(EDbVendor.dbvoracle) { sqltext = stmt };
        if (parser.parse() != 0) continue;
        yield return Process(parser.sqlstatements.get(0));
        // parser & AST eligible for GC at end of iteration
    }
}

For long-running services, consider an explicit GC.Collect() after major batches — the AST objects are large enough to land in Gen2 and stick.
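One pattern for a batch boundary, as a sketch (measure first; forced collections can hurt throughput more than they help):

```csharp
ProcessBatch(batch);            // all parser/AST references dropped inside
GC.Collect();                   // full blocking collection, including Gen2
GC.WaitForPendingFinalizers();  // let finalizers release native resources
GC.Collect();                   // reclaim objects kept alive by finalization
```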

Profile before you optimise

dotnet-counters and dotnet-trace work fine against GSP. Typical bottlenecks, in descending order of impact:

  1. String allocation in your own code (e.g. building output strings inside the visitor pre-visit hook)
  2. Parser construction (mitigated by reuse, see above)
  3. PL/SQL or stored-procedure parsing for procedural workloads
  4. Lexer table loading (one-time, only matters for short-lived processes)
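For example, to watch GC and allocation-rate counters while a batch runs, or to capture a CPU trace for offline analysis (substitute your process id for <pid>):

```shell
# Live GC / allocation-rate counters for the parsing process
dotnet-counters monitor --process-id <pid> System.Runtime

# CPU trace viewable in speedscope or PerfView
dotnet-trace collect --process-id <pid> --format speedscope
```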