IslandSQL Final Episode 10: Parsing PL/pgSQL

2024-07-29

Introduction

IslandSQL is a parser for SQL files targeting OracleDB or PostgreSQL. The parser is available on Maven Central and can process SQL*Plus, SQLcl or psql statements besides SQL statements. However, the focus is on statements with static DML statements and code in PL/SQL and PL/pgSQL. For static code analysis, for example.

In PostgreSQL create function, create procedure and do accept code as a string. This simplifies parsing and the implementation of additional languages. As a result, you can write functions and procedures in SQL, PL/pgSQL, PL/Tcl, PL/Perl and PL/Python in any standard PostgreSQL distribution.

Starting with IslandSQL version 0.10 it’s possible to parse SQL and PL/pgSQL in strings and extend the parse tree accordingly. In this blog post, I will explain how this works.

This VS Code extension uses IslandSQL in a language server to report syntax errors and to produce parse trees as shown in this blog post.

PL/pgSQL as String

Let’s look at a do statement executing a PL/pgSQL block provided as a string.

1) hello_world.sql

do '
begin
   raise notice $$Hello World!$$;
end
';

NOTICE:  Hello World!
DO

The parse tree in IslandSQL version 0.9 looks as follows:

Look at the parse tree. It is quite simple. Interesting is the PL/pgSQL block. The content in single quotes is represented as a single token. The big rectangle at the bottom. It’s a single token regardless of the code size.

PL/pgSQL as Subtree

Parsing PL/pgSQL as a string is easy. The lexer produces the token for the string containing the PL/pgSQL code. The parser does not need to understand the PL/pgSQL at all.

But this does not help analyse the PL/pgSQL code. We need a parser that

understands PL/pgSQL
parses PL/pgSQL code provided as a string
extends the main parse tree by sub-parse trees with PL/pgSQL code

The IslandSQL parser in version 0.10 does exactly that. It produces this parse tree by default:

Parse tree with PL/pgSQL as string and as subtree

The PL/pgSQL code is represented twice in this parse tree. Once as a single token. And once as a node named postgresqlPlpgsqlCode.

PL/pgSQL as Subtree Only

Maybe you do not like it when PL/pgSQL code is represented twice in the parse tree. In this case, you can override the default options when creating an IslandSQL document.

The IslandSQL library requires Java 8 or newer. In the next example, we use Java 22 to reduce boilerplate code. Implicitly declared classes are still in preview. Hence we have to pass the parameters --enable-preview --source 22 when running the HelloWorld.java program.

Explaining HelloWorld.java

In line 13 we create the IslandSQL document with a series of parameters. One of them is removeCode(true) in line 19 which oversteers the default behaviour and removes the code as a string from the parse tree.

In line 21 we print a profile. Thanks to Cary Millsap, I can’t deal with performance issues without thinking about a bicycle, a Ferrari and kissing… The profile helps to understand where the parser has spent its time.

Finally, in lines 22 and 23 we print the parse tree. Once as a simple textual hierarchy and once as a DOT graph. You can visualise the result in Edotor.net, for example.

2) HelloWorld.java

import ch.islandsql.grammar.IslandSqlDialect;
import ch.islandsql.grammar.IslandSqlDocument;
import ch.islandsql.grammar.util.ParseTreeUtil;

void main() {
    var source = """
            do '
            begin
               raise notice $$Hello World!$$;
            end
            ';
            """;
    var doc = new IslandSqlDocument.Builder()
            .sql(source)
            .hideOutOfScopeTokens(false)
            .dialect(IslandSqlDialect.POSTGRESQL)
            .profile(true)
            .subtrees(true)
            .removeCode(true)
            .build();
    System.out.println(doc.getParserMetrics().printProfile());
    System.out.println(ParseTreeUtil.printParseTree(doc.getFile()));
    System.out.println(ParseTreeUtil.dotParseTree(doc.getFile()));
}

java --enable-preview --source 22 -cp .:islandsql-0.10.0.jar:antlr4-runtime-4.13.1.jar HelloWorld.java

Profile
=======

Total memory used by parser    : 5’164 KB
Total time spent in parser     : 27.918 ms
Total time recorded by profiler: 14.167 ms (100%)

Rule Name (Decision)                          Time (ms) Percent Invocations Lookahead Max Lookahead Ambiguities Errors
---------------------------------------- -------------- ------- ----------- --------- ------------- ----------- ------
string (1887)                                    10.302   72.72           2         0             0           0      0
plsqlStatement (1129)                             1.337    9.44           1         0             0           0      0
postgresqlDo (576)                                0.750    5.30           1         0             0           0      0
postgresqlPlpgsqlCode (6)                         0.735    5.19           2         0             0           0      0
statement (11)                                    0.566    3.99           1         0             0           0      0
sqlEnd (1888)                                     0.314    2.21           1         0             0           0      0
postgresqlRaiseStatement (1260)                   0.134    0.95           1         0             0           0      0
postgresqlRaiseStatement (1269)                   0.030    0.21           1         0             0           0      0

file
  statement
    doStatement
      postgresqlDo
        K_DO:do
        postgresqlPlpgsqlCode
          K_BEGIN:begin
          plsqlStatement
            postgresqlRaiseStatement
              K_RAISE:raise
              raiseLevel
                K_NOTICE:notice
              string:dollarString
                DOLLAR_STRING:$$Hello World!$$
              SEMI:;
          K_END:end
      sqlEnd
        SEMI:;
  <EOF>

digraph islandSQL {
  bgcolor="transparent"
  "1155757579" [shape=ellipse label="file" style=filled fillcolor="#bfe6ff" fontname="Helvetica"]
  "1155757579" -> "1785111044"
  "1155757579" -> "1482748887"
  "1785111044" [shape=ellipse label="statement" style=filled fillcolor="#bfe6ff" fontname="Helvetica"]
  "1785111044" -> "494894055"
  "494894055" [shape=ellipse label="doStatement" style=filled fillcolor="#bfe6ff" fontname="Helvetica"]
  "494894055" -> "1123226989"
  "494894055" -> "500885941"
  "1123226989" [shape=ellipse label="postgresqlDo" style=filled fillcolor="#bfe6ff" fontname="Helvetica"]
  "1123226989" -> "1115381650"
  "1123226989" -> "616412281"
  "1115381650" [shape=box label="do" style=filled fillcolor="#fadabd" fontname="Helvetica"]
  "616412281" [shape=ellipse label="postgresqlPlpgsqlCode" style=filled fillcolor="#bfe6ff" fontname="Helvetica"]
  "616412281" -> "2118096382"
  "616412281" -> "878861517"
  "616412281" -> "746394140"
  "2118096382" [shape=box label="begin" style=filled fillcolor="#fadabd" fontname="Helvetica"]
  "878861517" [shape=ellipse label="plsqlStatement" style=filled fillcolor="#bfe6ff" fontname="Helvetica"]
  "878861517" -> "1705665942"
  "1705665942" [shape=ellipse label="postgresqlRaiseStatement" style=filled fillcolor="#bfe6ff" fontname="Helvetica"]
  "1705665942" -> "1731763384"
  "1705665942" -> "1100619942"
  "1705665942" -> "87242619"
  "1705665942" -> "864248990"
  "1731763384" [shape=box label="raise" style=filled fillcolor="#fadabd" fontname="Helvetica"]
  "1100619942" [shape=ellipse label="raiseLevel" style=filled fillcolor="#bfe6ff" fontname="Helvetica"]
  "1100619942" -> "285074186"
  "285074186" [shape=box label="notice" style=filled fillcolor="#fadabd" fontname="Helvetica"]
  "87242619" [shape=ellipse label="dollarString" style=filled fillcolor="#bfe6ff" fontname="Helvetica"]
  "87242619" -> "15892131"
  "15892131" [shape=box label="$$Hello World!$$" style=filled fillcolor="#fadabd" fontname="Helvetica"]
  "864248990" [shape=box label=";" style=filled fillcolor="#fadabd" fontname="Helvetica"]
  "746394140" [shape=box label="end" style=filled fillcolor="#fadabd" fontname="Helvetica"]
  "500885941" [shape=ellipse label="sqlEnd" style=filled fillcolor="#bfe6ff" fontname="Helvetica"]
  "500885941" -> "484841769"
  "484841769" [shape=box label=";" style=filled fillcolor="#fadabd" fontname="Helvetica"]
  "1482748887" [shape=box label="<EOF>" style=filled fillcolor="#fadabd" fontname="Helvetica"]
}

The parse tree looks now like this:

Parse tree with PL/pgSQL as subtree only

SQL Dialect

In line 16 of the previous HelloWorld.java program we set the dialect to POSTGRESQL. Is that required? – No, it’s not. But when do we need to set the SQL dialect in IslandSQL? – When the lexical incompatibility between OracleDB and PostgreSQL leads to syntax errors in the code to be parsed.

What? – Let me explain.

Identifiers

OracleDB and PostgreSQL use different characters to build an identifier. The following table shows the differences. The allowed characters are listed in square brackets. Read \p{Alpha} as any alphabetic letter in the character set of the database.

DBMS	Allowed as First Character	Allowed in Subsequent Characters
OracleDB	`[\p{Alpha}]`	[_$#0-9\p{Alpha}]
PostgreSQL	`[_\p{Alpha}]`	[_$0-9\p{Alpha}]

PostgreSQL allows identifiers that start with an underscore. That’s not a problem. However, OracleDB allows the hash sign (#) to be used in an identifier. That leads to unexpected results when the PostgreSQL code uses the bitwise XOR operator without spaces around it. Here’s an example

3) One or two identifiers in select_list?

select a#b from t;

PostgreSQL expects that the columns a and b exists in table t.

OracleDB expects that the column a#b exists in table t.

In this case, IslandSQL can parse the code without errors. However, the parse tree might not look as expected. That’s a documented limitation and cannot be influenced by setting the dialect. At least not in version 0.10 of IslandSQL. Nevertheless, it shows the impact of a lexical incompatibility.

Inquiry Directives

PL/SQL supports predefined and custom Inquiry Directives. These directives are lexically incompatible with PostgreSQL dollar-quoted string constants.

Here’s an example:

4) Using custom inquiry directives

alter session set plsql_ccflags = 'custom1:41, custom2:42';
begin
   dbms_output.put_line($$custom1);
   dbms_output.put_line($$custom2 || '(2)');
end;

By default, this causes a parse error, because $$custom1);\n dbms_output.put_line($$ is identified as a dollar-quoted string constant by the lexer. As a result, the code is interpreted like this:

5) Visualising how two custom inquiry directives are treated as a string

alter session set plsql_ccflags = 'custom1:41, custom2:42';
begin
   dbms_output.put_line('custom1);
   dbms_output.put_line('custom2 || '(2)');
end;

This makes it clearer, why we get a syntax error in line 4 at custom2.

In this case, we have to set the SQL dialect to ORACLEDB to parse custom_inquiry_directives.sql without errors.

Predefined Inquiry Directives

To simplify the use, we want to avoid specifying the SQL dialect. One way to achieve that is to handle predefined inquiry directives in the GENERIC SQL dialect.

Here’s an example, that does not report syntax errors:

6) Using predefined inquiry directives

begin
   dbms_output.put_line($$plsql_line);
   dbms_output.put_line($$plsql_line || '(2)');
end;

We know the predefined inquiry directives and can deal with them in the lexer.

However, this special treatment can cause problems in corner cases like this one:

7) Using the name of a predefined inquiry directive at the start of a dollar-quoted string constant

do '
begin
   raise notice $$plsql_line is a predefined inquiry directive$$;
end
';

In such cases, we should use the POSTGRESQL dialect to avoid parse errors.

Detect SQL Dialect

Is there a way to detect the SQL dialect of an SQL input automatically? Sure. Several. I’m sure there is a way to use an LLM to get a reasonable result. I’m more of a rule-based guy. So we could parse the code with one dialect and on parse errors try other dialects. If all dialects produce errors, we could choose the one with the fewest errors.

This sounds costly, right? Therefore I decided to start with a simple SQL dialect detection mechanism. The current implementation looks like this:

8) Excerpt of IslandSqlDocument.java in version 0.10.0

private static IslandSqlDialect guessDialect(String sql) {
    return sql.contains("\n/\n") ? IslandSqlDialect.ORACLEDB : IslandSqlDialect.GENERIC;
}

In other words, if the code contains a slash followed by a newline character in the first column of a line, then we go with the ORACLEDB dialect. In all other cases, we go with the GENERIC SQL dialect.

My first version was even simpler. However, I had to add a final \n to the search string to ensure that SQL code containing multiline comments is not recognized as ORACLEDB dialect. Files with Windows newline characters are always recognized as GENERIC. That’s the price when trying to keep things simple and fast.

Now, the ORACLEDB SQL dialect is correctly detected in the next example. As a result, no syntax errors are reported.

9) Using custom inquiry directives in a PL/SQL Block ending on slash

alter session set plsql_ccflags = 'custom1:41, custom2:42';
begin
   dbms_output.put_line($$custom1);
   dbms_output.put_line($$custom2 || '(2)');
end;
/

The SQL dialect detection mechanism kicks in when no SQL dialect is specified (null).

Conclusion

Developing a single grammar for OracleDB and PostgreSQL was an interesting work. I learned a lot about the underlying DBMS. I often looked at the grammar documentation and did not understand it fully. So I had to run the provided examples or create some myself. The typical test cases are based on working examples, extended by tests based on the cartesian product of a subset of clauses that verified if I had defined the order, the optionality, and the cardinality according to the documentation.

A challenge is the undocumented stuff. There are various reasons why something is not documented. In the end, it does not matter why something is missing. The parser fails when processing code that works but should not according to the docs. This cannot be covered by tests based on the official documentation. I found some bugs while processing real-life code. And I expect to find more.

This is the season finale of IslandSQL. There won’t be a second season. However, there might be some spin-offs since I plan to build products based on IslandSQL. Therefore I plan to keep the parser compatible with the latest versions of OracleDB and PostgreSQL.

Feedback is welcome. Please leave your comments on this blog post or open an issue in the IslandSQL GitHub repository for questions, bugs or feature requests.

Thank you.