IslandSQL Episode 8: What’s New in Oracle Database 23.4?

2024-06-15

Introduction

In the last episode, we looked at some features in PostgreSQL which I miss in the Oracle Database. The IslandSQL grammar now covers PL/SQL and the related DDL statements. The implementation was more complex than expected, mainly because of the incompatibilities between PostgreSQL and the Oracle Database. I will probably deal with this topic in a future blog post.

Now I’d like to talk about the new features in Oracle Database 23ai. Not about all features in the New Features Guide, but only about some changes since the release of 23.3, which was known as 23c. It’s hard to find changes from 23.3 to 23.4 in the documentation. So, I guess it’s worth a blog post. I focus on the features that are relevant to the IslandSQL grammar. In other words, the interesting ones from a developer’s perspective.

1. Vector Data Type

A vector is a number array for which you can optionally define the number of dimensions (size) and the data type (int8, float32 or float64) of the dimension values. This data type is the basis for the vector search functionality.

Here’s a slightly amended example from the documentation creating and populating a table with a vector data type.

1) Table with vector column

drop table if exists galaxies purge;
create table galaxies (
   id        number             not null primary key,
   name      varchar2(10 char)  not null unique,
   embedding vector(5, int8)    not null, -- 5 dimensions, stored as int8
   doc       varchar2(120 char) not null
);

insert into galaxies 
   (id, name, embedding, doc)
values 
   (1, 'M31',     '[0,2,2,0,0]', 'Messier 31 is a barred spiral galaxy in the Andromeda constellation which has a lot of barred spiral galaxies.'),
   (2, 'M33',     '[0,0,1,0,0]', 'Messier 33 is a spiral galaxy in the Triangulum constellation.'),
   (3, 'M58',     '[1,1,1,0,0]', 'Messier 58 is an intermediate barred spiral galaxy in the Virgo constellation.'),
   (4, 'M63',     '[0,0,1,0,0]', 'Messier 63 is a spiral galaxy in the Canes Venatici constellation.'),
   (5, 'M77',     '[0,1,1,0,0]', 'Messier 77 is a barred spiral galaxy in the Cetus constellation.'),
   (6, 'M91',     '[0,1,1,0,0]', 'Messier 91 is a barred spiral galaxy in the Coma Berenices constellation.'),
   (7, 'M49',     '[0,0,0,1,1]', 'Messier 49 is a giant elliptical galaxy in the Virgo constellation.'),
   (8, 'M60',     '[0,0,0,0,1]', 'Messier 60 is an elliptical galaxy in the Virgo constellation.'),
   (9, 'NGC1073', '[0,1,1,0,0]', 'NGC 1073 is a barred spiral galaxy in Cetus constellation.');
commit;

The vector in this example has 5 dimensions with the following meaning:

Number of occurrences of intermediate in the doc column
Number of occurrences of barred in the doc column
Number of occurrences of spiral in the doc column
Number of occurrences of giant in the doc column
Number of occurrences of elliptical in the doc column

2. Vector Functions and PL/SQL Packages

The vector_distance function is a key functionality for similarity searches. Let’s say we want to see galaxies that are similar to NGC1073 (barred and spiral). The following query shows how vector functions can help to get this result:

2) Different vector_distance metrics in action

with ngc1073 as (select vector('[0,1,1,0,0]', 5, int8) as query_vector)
select name, 
       round(vector_distance(embedding, query_vector, cosine), 3) as cosine_distance,
       round(vector_distance(embedding, query_vector, dot), 3) as inner_product,
       round(vector_distance(embedding, query_vector, euclidean), 3) as l2_distance,
       round(vector_distance(embedding, query_vector, euclidean_squared), 3) as l2_squared,
       round(vector_distance(embedding, query_vector, hamming), 3) as hamming_distance,
       round(vector_distance(embedding, query_vector, manhattan), 3) as l1_distance
  from galaxies, ngc1073
 order by cosine_distance;

NAME       COSINE_DISTANCE INNER_PRODUCT L2_DISTANCE L2_SQUARED HAMMING_DISTANCE L1_DISTANCE
---------- --------------- ------------- ----------- ---------- ---------------- -----------
M31                      0            -4       1.414          2                2           2
M77                      0            -2           0          0                0           0
M91                      0            -2           0          0                0           0
NGC1073                  0            -2           0          0                0           0
M58                   .184            -2           1          1                1           1
M63                   .293            -1           1          1                1           1
M33                   .293            -1           1          1                1           1
M60                      1             0       1.732          3                3           3
M49                      1             0           2          4                4           4

9 rows selected.

M77 and M31 have identical vectors and they are expected to be very similar. However, M31 is interesting. It has a similar shape to NGC1073 but the words barred and spiral appear twice. This is a good match only for some distance metrics.

A conventional query, e.g. based on Oracle Text might be good enough to find similar galaxies. However, if you have vectors with tons of dimensions then a vector similarity search might become appealing. Especially, since you can index vector columns to speed up your queries.

The following SQL functions are provided for the vector data type in the Oracle Database:

vector_chunks
vector or to_vector
vector_dimension_count or vector_dims
vector_dimension_format
vector_distance or cosine_distance or inner_product or l1_distance or l2_distance
vector_embedding
vector_norm
vector_serialize or from_vector

Furthermore, the following supplied PL/SQL packages provide additional functionality related to the vector data type:

3. Shorthand Operators for Distances

For some vector_distance metrics shorthand operators (<=>, <->, <#>) are available. Here’s an example:

3) Shorthand operators for vector_distance metrics

with ngc1073 as (select vector('[0,1,1,0,0]', 5, int8) as query_vector)
select name, 
       round(embedding <=> query_vector, 3) as cosine_distance,
       round(embedding <-> query_vector, 3) as l2_distance,
       round(embedding <#> query_vector, 3) as inner_product
  from galaxies, ngc1073
 order by cosine_distance;

NAME       COSINE_DISTANCE L2_DISTANCE INNER_PRODUCT
---------- --------------- ----------- -------------
M31                      0       1.414            -4
M77                      0           0            -2
M91                      0           0            -2
NGC1073                  0           0            -2
M58                   .184           1            -2
M63                   .293           1            -1
M33                   .293           1            -1
M60                      1       1.732             0
M49                      1           2             0

9 rows selected.

4. Approximate Similarity Searches

You can imagine producing a top-N result of the previous queries using the row_limiting_clause. This clause was introduced in 12.1 to limit search results. Exactly to be precise. In 23.4 the clause was extended to support approximate similarity searches. The idea is to get a good enough result with a better performance when using vector indexes. Here’s an example:

4) Approximate Similarity Search

select name, embedding
  from galaxies
 order by embedding <=> vector('[0,1,1,0,0]', 5, int8)
 fetch approx first 3 rows only
  with target accuracy 80 percent;

NAME       EMBEDDING           
---------- --------------------
M31        [0,2,2,0,0]         
M91        [0,1,1,0,0]         
M77        [0,1,1,0,0]

5. Source and Destination Predicates in Graph Operator

The graph operator is a new 23ai feature. In 23.4 it got two additional predicates: source_predicate and destination_predicate. It allows us to test if a vertex is a source or a destination of an edge. The direction of the arrow so to speak. Here’s a formatted example from the documentation, based on this model.

5) Source and destination predicates

select *
  from graph_table (students_graph
          match (p1 is person) -[e is friends]- (p2 is person)
          where p1.name = 'Mary'
          columns (
             e.friendship_id,
             e.meeting_date,
             case 
                when p1 is source of e then 
                   p1.name 
                else
                   p2.name 
             end as from_person,
             case 
                when p1 is destination of e then
                   p1.name
               else
                   p2.name 
             end as to_person
          )
       )
 order by friendship_id;

FRIENDSHIP_ID MEETING_DATE        FROM_PERSON TO_PERSON 
------------- ------------------- ----------- ----------
            1 19.09.2000 00:00:00 Mary        Alice     
            5 19.09.2000 00:00:00 Mary        John      
            7 10.07.2001 00:00:00 Bob         Mary

6. Breaking Change for Inlined MLE Call Specification

In 23.3 the following code works.

6) MLE Call Specification in 23.3

create or replace function get42 return number is 
   mle language javascript q'[return 42;]';
/
select get42();

Function GET42 compiled


   GET42()
----------
        42

In 24.4 the same code produces this error:

Function GET42 compiled

LINE/COL  ERROR
--------- -------------------------------------------------------------
2/28      PLS-00881: missing closing delimiter 'q'[return' for MLE language code
Errors: check compiler log

The reason is a change in the syntax of the inlined MLE call specification. The JavaScript code cannot be passed as a string anymore. Instead, a new type of delimiter must be used as documented here. You can use almost any sequence of characters as a delimiter for the JavaScript code. The same character sequence must be used for the start and the end delimiter. Exceptions are the pairs (), [], {} and <>.

PostgreSQL dollar-quoted string constants are also valid delimiters. This works in 23.4:

7) MLE Call Specification in 23.4

create or replace function get42 return number is 
   mle language javascript $code$ return 42;$code$;
/

select get42();

Function GET42 compiled


   GET42()
----------
        42

Please note that the space after the first $code$ is required for the Oracle Database to recognize the end of the delimiter.

This change might simplify the implementation of additional MLE languages. Nevertheless, this is a breaking change that requires an amendment of the existing code base.

Outlook

In the next episode, the IslandSQL grammar will be extended to cover the missing statements with a query block. This means the

create view,
create materialized view and
create table statement

for PostgreSQL and the Oracle Database. Only two episodes are left until the end of the first IslandSQL season.

1 Comment

IslandSQL Episode 9: GraphQL, JSON and Flexible Schemas With Duality Views - Philipp Salvisberg's Blog says:

2024-06-27 at 13:31

[…] the last episode, we looked at some new features in Oracle Database 23.4. The IslandSQL grammar now covers all […]

Reply