Introduction
IslandSQL is a parser for SQL files targeting OracleDB or PostgreSQL. The parser is available on Maven Central and can process SQL*Plus, SQLcl or psql statements besides SQL statements. However, the focus is on statements with static DML statements and code in PL/SQL and PL/pgSQL. For static code analysis, for example.
In PostgreSQL create function
, create procedure
and do
accept code as a string. This simplifies parsing and the implementation of additional languages. As a result, you can write functions and procedures in SQL, PL/pgSQL, PL/Tcl, PL/Perl and PL/Python in any standard PostgreSQL distribution.
Starting with IslandSQL version 0.10 it’s possible to parse SQL and PL/pgSQL in strings and extend the parse tree accordingly. In this blog post, I will explain how this works.
This VS Code extension uses IslandSQL in a language server to report syntax errors and to produce parse trees as shown in this blog post.
PL/pgSQL as String
Let’s look at a do
statement executing a PL/pgSQL block provided as a string.
do '
begin
raise notice $$Hello World!$$;
end
';
NOTICE: Hello World!
DO
The parse tree in IslandSQL version 0.9 looks as follows:
Look at the parse tree. It is quite simple. Interesting is the PL/pgSQL block. The content in single quotes is represented as a single token. The big rectangle at the bottom. It’s a single token regardless of the code size.
PL/pgSQL as Subtree
Parsing PL/pgSQL as a string is easy. The lexer produces the token for the string containing the PL/pgSQL code. The parser does not need to understand the PL/pgSQL at all.
But this does not help analyse the PL/pgSQL code. We need a parser that
- understands PL/pgSQL
- parses PL/pgSQL code provided as a string
- extends the main parse tree by sub-parse trees with PL/pgSQL code
The IslandSQL parser in version 0.10 does exactly that. It produces this parse tree by default:
The PL/pgSQL code is represented twice in this parse tree. Once as a single token. And once as a node named postgresqlPlpgsqlCode
.
PL/pgSQL as Subtree Only
Maybe you do not like it when PL/pgSQL code is represented twice in the parse tree. In this case, you can override the default options when creating an IslandSQL document.
The IslandSQL library requires Java 8 or newer. In the next example, we use Java 22 to reduce boilerplate code. Implicitly declared classes are still in preview. Hence we have to pass the parameters --enable-preview --source 22
when running the HelloWorld.java
program.
Explaining HelloWorld.java
In line 13 we create the IslandSQL document with a series of parameters. One of them is removeCode(true)
in line 19 which oversteers the default behaviour and removes the code as a string from the parse tree.
In line 21 we print a profile. Thanks to Cary Millsap, I can’t deal with performance issues without thinking about a bicycle, a Ferrari and kissing… The profile helps to understand where the parser has spent its time.
Finally, in lines 22 and 23 we print the parse tree. Once as a simple textual hierarchy and once as a DOT graph. You can visualise the result in Edotor.net, for example.
import ch.islandsql.grammar.IslandSqlDialect;
import ch.islandsql.grammar.IslandSqlDocument;
import ch.islandsql.grammar.util.ParseTreeUtil;
void main() {
var source = """
do '
begin
raise notice $$Hello World!$$;
end
';
""";
var doc = new IslandSqlDocument.Builder()
.sql(source)
.hideOutOfScopeTokens(false)
.dialect(IslandSqlDialect.POSTGRESQL)
.profile(true)
.subtrees(true)
.removeCode(true)
.build();
System.out.println(doc.getParserMetrics().printProfile());
System.out.println(ParseTreeUtil.printParseTree(doc.getFile()));
System.out.println(ParseTreeUtil.dotParseTree(doc.getFile()));
}
java --enable-preview --source 22 -cp .:islandsql-0.10.0.jar:antlr4-runtime-4.13.1.jar HelloWorld.java
Profile
=======
Total memory used by parser : 5’164 KB
Total time spent in parser : 27.918 ms
Total time recorded by profiler: 14.167 ms (100%)
Rule Name (Decision) Time (ms) Percent Invocations Lookahead Max Lookahead Ambiguities Errors
---------------------------------------- -------------- ------- ----------- --------- ------------- ----------- ------
string (1887) 10.302 72.72 2 0 0 0 0
plsqlStatement (1129) 1.337 9.44 1 0 0 0 0
postgresqlDo (576) 0.750 5.30 1 0 0 0 0
postgresqlPlpgsqlCode (6) 0.735 5.19 2 0 0 0 0
statement (11) 0.566 3.99 1 0 0 0 0
sqlEnd (1888) 0.314 2.21 1 0 0 0 0
postgresqlRaiseStatement (1260) 0.134 0.95 1 0 0 0 0
postgresqlRaiseStatement (1269) 0.030 0.21 1 0 0 0 0
file
statement
doStatement
postgresqlDo
K_DO:do
postgresqlPlpgsqlCode
K_BEGIN:begin
plsqlStatement
postgresqlRaiseStatement
K_RAISE:raise
raiseLevel
K_NOTICE:notice
string:dollarString
DOLLAR_STRING:$$Hello World!$$
SEMI:;
K_END:end
sqlEnd
SEMI:;
<EOF>
digraph islandSQL {
bgcolor="transparent"
"1155757579" [shape=ellipse label="file" style=filled fillcolor="#bfe6ff" fontname="Helvetica"]
"1155757579" -> "1785111044"
"1155757579" -> "1482748887"
"1785111044" [shape=ellipse label="statement" style=filled fillcolor="#bfe6ff" fontname="Helvetica"]
"1785111044" -> "494894055"
"494894055" [shape=ellipse label="doStatement" style=filled fillcolor="#bfe6ff" fontname="Helvetica"]
"494894055" -> "1123226989"
"494894055" -> "500885941"
"1123226989" [shape=ellipse label="postgresqlDo" style=filled fillcolor="#bfe6ff" fontname="Helvetica"]
"1123226989" -> "1115381650"
"1123226989" -> "616412281"
"1115381650" [shape=box label="do" style=filled fillcolor="#fadabd" fontname="Helvetica"]
"616412281" [shape=ellipse label="postgresqlPlpgsqlCode" style=filled fillcolor="#bfe6ff" fontname="Helvetica"]
"616412281" -> "2118096382"
"616412281" -> "878861517"
"616412281" -> "746394140"
"2118096382" [shape=box label="begin" style=filled fillcolor="#fadabd" fontname="Helvetica"]
"878861517" [shape=ellipse label="plsqlStatement" style=filled fillcolor="#bfe6ff" fontname="Helvetica"]
"878861517" -> "1705665942"
"1705665942" [shape=ellipse label="postgresqlRaiseStatement" style=filled fillcolor="#bfe6ff" fontname="Helvetica"]
"1705665942" -> "1731763384"
"1705665942" -> "1100619942"
"1705665942" -> "87242619"
"1705665942" -> "864248990"
"1731763384" [shape=box label="raise" style=filled fillcolor="#fadabd" fontname="Helvetica"]
"1100619942" [shape=ellipse label="raiseLevel" style=filled fillcolor="#bfe6ff" fontname="Helvetica"]
"1100619942" -> "285074186"
"285074186" [shape=box label="notice" style=filled fillcolor="#fadabd" fontname="Helvetica"]
"87242619" [shape=ellipse label="dollarString" style=filled fillcolor="#bfe6ff" fontname="Helvetica"]
"87242619" -> "15892131"
"15892131" [shape=box label="$$Hello World!$$" style=filled fillcolor="#fadabd" fontname="Helvetica"]
"864248990" [shape=box label=";" style=filled fillcolor="#fadabd" fontname="Helvetica"]
"746394140" [shape=box label="end" style=filled fillcolor="#fadabd" fontname="Helvetica"]
"500885941" [shape=ellipse label="sqlEnd" style=filled fillcolor="#bfe6ff" fontname="Helvetica"]
"500885941" -> "484841769"
"484841769" [shape=box label=";" style=filled fillcolor="#fadabd" fontname="Helvetica"]
"1482748887" [shape=box label="<EOF>" style=filled fillcolor="#fadabd" fontname="Helvetica"]
}
The parse tree looks now like this:
SQL Dialect
In line 16 of the previous HelloWorld.java
program we set the dialect to POSTGRESQL
. Is that required? – No, it’s not. But when do we need to set the SQL dialect in IslandSQL? – When the lexical incompatibility between OracleDB and PostgreSQL leads to syntax errors in the code to be parsed.
What? – Let me explain.
Identifiers
OracleDB and PostgreSQL use different characters to build an identifier. The following table shows the differences. The allowed characters are listed in square brackets. Read \p{Alpha}
as any alphabetic letter in the character set of the database.
DBMS | Allowed as First Character | Allowed in Subsequent Characters |
---|---|---|
OracleDB | [\p{Alpha}] | [_$#0-9\p{Alpha}] |
PostgreSQL | [_\p{Alpha}] | [_$0-9\p{Alpha}] |
PostgreSQL allows identifiers that start with an underscore. That’s not a problem. However, OracleDB allows the hash sign (#
) to be used in an identifier. That leads to unexpected results when the PostgreSQL code uses the bitwise XOR operator without spaces around it. Here’s an example
select a#b from t;
PostgreSQL expects that the columns a
and b
exists in table t
.
OracleDB expects that the column a#b
exists in table t
.
In this case, IslandSQL can parse the code without errors. However, the parse tree might not look as expected. That’s a documented limitation and cannot be influenced by setting the dialect. At least not in version 0.10 of IslandSQL. Nevertheless, it shows the impact of a lexical incompatibility.
Inquiry Directives
PL/SQL supports predefined and custom Inquiry Directives. These directives are lexically incompatible with PostgreSQL dollar-quoted string constants.
Here’s an example:
alter session set plsql_ccflags = 'custom1:41, custom2:42';
begin
dbms_output.put_line($$custom1);
dbms_output.put_line($$custom2 || '(2)');
end;
By default, this causes a parse error, because $$custom1);\n dbms_output.put_line($$
is identified as a dollar-quoted string constant by the lexer. As a result, the code is interpreted like this:
alter session set plsql_ccflags = 'custom1:41, custom2:42';
begin
dbms_output.put_line('custom1);
dbms_output.put_line('custom2 || '(2)');
end;
This makes it clearer, why we get a syntax error in line 4 at custom2
.
In this case, we have to set the SQL dialect to ORACLEDB
to parse custom_inquiry_directives.sql
without errors.
Predefined Inquiry Directives
To simplify the use, we want to avoid specifying the SQL dialect. One way to achieve that is to handle predefined inquiry directives in the GENERIC
SQL dialect.
Here’s an example, that does not report syntax errors:
begin
dbms_output.put_line($$plsql_line);
dbms_output.put_line($$plsql_line || '(2)');
end;
We know the predefined inquiry directives and can deal with them in the lexer.
However, this special treatment can cause problems in corner cases like this one:
do '
begin
raise notice $$plsql_line is a predefined inquiry directive$$;
end
';
In such cases, we should use the POSTGRESQL
dialect to avoid parse errors.
Detect SQL Dialect
Is there a way to detect the SQL dialect of an SQL input automatically? Sure. Several. I’m sure there is a way to use an LLM to get a reasonable result. I’m more of a rule-based guy. So we could parse the code with one dialect and on parse errors try other dialects. If all dialects produce errors, we could choose the one with the fewest errors.
This sounds costly, right? Therefore I decided to start with a simple SQL dialect detection mechanism. The current implementation looks like this:
private static IslandSqlDialect guessDialect(String sql) {
return sql.contains("\n/\n") ? IslandSqlDialect.ORACLEDB : IslandSqlDialect.GENERIC;
}
In other words, if the code contains a slash followed by a newline character in the first column of a line, then we go with the ORACLEDB
dialect. In all other cases, we go with the GENERIC
SQL dialect.
My first version was even simpler. However, I had to add a final \n
to the search string to ensure that SQL code containing multiline comments is not recognized as ORACLEDB
dialect. Files with Windows newline characters are always recognized as GENERIC. That’s the price when trying to keep things simple and fast.
Now, the ORACLEDB SQL dialect is correctly detected in the next example. As a result, no syntax errors are reported.
alter session set plsql_ccflags = 'custom1:41, custom2:42';
begin
dbms_output.put_line($$custom1);
dbms_output.put_line($$custom2 || '(2)');
end;
/
The SQL dialect detection mechanism kicks in when no SQL dialect is specified (null
).
Conclusion
Developing a single grammar for OracleDB and PostgreSQL was an interesting work. I learned a lot about the underlying DBMS. I often looked at the grammar documentation and did not understand it fully. So I had to run the provided examples or create some myself. The typical test cases are based on working examples, extended by tests based on the cartesian product of a subset of clauses that verified if I had defined the order, the optionality, and the cardinality according to the documentation.
A challenge is the undocumented stuff. There are various reasons why something is not documented. In the end, it does not matter why something is missing. The parser fails when processing code that works but should not according to the docs. This cannot be covered by tests based on the official documentation. I found some bugs while processing real-life code. And I expect to find more.
This is the season finale of IslandSQL. There won’t be a second season. However, there might be some spin-offs since I plan to build products based on IslandSQL. Therefore I plan to keep the parser compatible with the latest versions of OracleDB and PostgreSQL.
Feedback is welcome. Please leave your comments on this blog post or open an issue in the IslandSQL GitHub repository for questions, bugs or feature requests.
Thank you.