By Franck Pachot

.

I plan to write several posts about OracleText indexes, which is a feature that is not used enough in my opinion. It’s available in all editions and can index small text or large documents to search by words. When you create an OracleText index, a few tables are created to store the words and the association between those words and the table row that contains the document. I’ll start to show how document inserts are processed.

Create the table and index

I’m creating a simple table with a CLOB

SQL> create table DEMO_CTX_FRAG
     (num number constraint DEMO_CTX_FRAG_PK primary key,txt clob);

Table created.

and a simple OracleText on that column

SQL> create index DEMO_CTX_INDEX on DEMO_CTX_FRAG(txt)
     indextype is ctxsys.context;

Index created.

That creates the following tables:

  • DR$DEMO_CTX_INDEX$I which stores the tokens (e.g words)
  • DR$DEMO_CTX_INDEX$K which index the documents (docid) and links them to the table ROWID
  • DR$DEMO_CTX_INDEX$R which stores the opposite way navigation (get ROWID from a docid)
  • DR$DEMO_CTX_INDEX$N which stores docid for deferred maintenance cleanup.

Inserts

I’m inserting a row with some text in the clob column

SQL> insert into DEMO_CTX_FRAG values (0001,'Hello World');

1 row created.

I commit

SQL> commit;

Commit complete.

And here is what we have in the OracleText tables:

SQL> select * from DR$DEMO_CTX_INDEX$K;
no rows selected

SQL> select * from DR$DEMO_CTX_INDEX$R;
no rows selected

SQL> select * from DR$DEMO_CTX_INDEX$I;
no rows selected

SQL> select * from DR$DEMO_CTX_INDEX$N;
no rows selected

Nothing is stored here yet. Which means that we cannot find our newly inserted row from an OracleText search.

By default, all inserts maintain the OracleText tables asynchronously.
The inserted row is referenced in a CTXSYS queuing table that stores the pending inserts:

SQL> select * from CTXSYS.DR$PENDING;

   PND_CID    PND_PID PND_ROWID          PND_TIMES P
---------- ---------- ------------------ --------- -
      1400          0 AAAXUtAAKAAABWlAAA 13-FEB-15 N

and we have a view over it:

SQL> select pnd_index_name,pnd_rowid,pnd_timestamp from ctx_user_pending;

PND_INDEX_NAME                 PND_ROWID          PND_TIMES
------------------------------ ------------------ ---------
DEMO_CTX_INDEX                 AAAXUtAAKAAABWlAAA 13-FEB-15

Synchronization

let’s synchronize:

SQL> exec ctx_ddl.sync_index('DEMO_CTX_INDEX');

PL/SQL procedure successfully completed.

The queuing table has been processed:

SQL> select pnd_index_name,pnd_rowid,pnd_timestamp from ctx_user_pending;

no rows selected

and here is how that document is sotred in our OracleText tables.

$K records one document (docid=1) and the table rowid that contains it:

SQL> select * from DR$DEMO_CTX_INDEX$K;

     DOCID TEXTKEY
---------- ------------------
         1 AAAXUtAAKAAABWlAAA

$R table stores the docid -> rowid is a non-relational way:

SQL> select * from DR$DEMO_CTX_INDEX$R;

    ROW_NO DATA
---------- ------------------------------------------------------------
         0 00001752D0002800000569404141

How is it stored? It’s an array of ROWIDs which are fixed length. Then from the docid we can directly go to the offset and get the rowid. Because DATA is limited to 4000 bytes, there are several rows. But a docid determines the ROW_NO as well as the offset in DATA.

$I stores the tokens (which are the words here as we have TEXT token – which is the type 0) as well as their location information:

SQL> select * from DR$DEMO_CTX_INDEX$I;

TOKEN_TEXT TOKEN_TYPE TOKEN_FIRST TOKEN_LAST TOKEN_COUNT TOKEN_INFO
---------- ---------- ----------- ---------- ----------- ----------
HELLO               0           1          1           1 008801
WORLD               0           1          1           1 008802

For each word it stores the range of docid that contains the work (token_first and token_last are those docid) and token_info stores in an binary way the occurrences of the word within the documents (it stores pairs of docid and offest within the document). It’s a BLOB but is limited to 4000 bytes so that it is stored inline. Which means that if a token is present in a lot of document, several lines in $I will be needed, each covering a different range of docid. This has changed in 12c and we will see that in future blog posts.

Thus, we can have several rows for one token. This is the first cause of fragmentation. Searching for documents that contain such a word will have to read several lines of the $I table.

The $N has nothing here because we synchronized only inserts and there is nothing to cleanup.

SQL> select * from DR$DEMO_CTX_INDEX$N;

no rows selected

Several inserts

I will insert two lines, which also contain the ‘hello’ word.

SQL> insert into DEMO_CTX_FRAG values (0002,'Hello Moon, hello, hello');

1 row created.

SQL> insert into DEMO_CTX_FRAG values (0003,'Hello Mars');

1 row created.

SQL> commit;

Commit complete.

And I synchronize:

SQL> exec ctx_ddl.sync_index('DEMO_CTX_INDEX');

PL/SQL procedure successfully completed.

So, I’ve now 3 documents:

SQL> select * from DR$DEMO_CTX_INDEX$K;

     DOCID TEXTKEY
---------- ------------------
         1 AAAXUtAAKAAABWlAAA
         2 AAAXUtAAKAAABWlAAB
         3 AAAXUtAAKAAABWlAAC

The reverse mapping array has increased:

SQL> select * from DR$DEMO_CTX_INDEX$R;

    ROW_NO DATA
---------- ------------------------------------------------------------
         0 00001752D000280000056940414100001752D00028000005694041420000

And now the tokens:

SQL> select * from DR$DEMO_CTX_INDEX$I;

TOKEN_TEXT TOKEN_TYPE TOKEN_FIRST TOKEN_LAST TOKEN_COUNT TOKEN_INFO
---------- ---------- ----------- ---------- ----------- ----------
HELLO               0           1          1           1 008801
WORLD               0           1          1           1 008802
HELLO               0           2          3           2 0098010201
MARS                0           3          3           1 008802
MOON                0           2          2           1 008802

What is interesting here is that the previous lines (docid 1) have not been updated and new lines have been inserted for docid 2 and 3.

  • ‘moon’ is only in docid 2
  • ‘mars’ is only in docid 3
  • ‘hello’ is in 2 (token_count) documents, from docid 2 to docid 3 (token_first and token_last)

This is the other cause of fragmentation coming from frequent sync. Each sync will add new rows. However, when multiple documents are processed in the same sync, then only one $I entry per token is needed.

There is a third cause of fragmentation. We see here that the token_info is larger for that HELLO covering docid 2 to 3 because there are several occurrences of the token. All that must fit in memory when we synchronize. So it’s good to synchronize when we have several documents (so that the common tokens are not too fragmented) but we need also to have enough memory. The default is 12M and is usually too small. It can be increased with the ‘index memory’ parameter of the index. And there is also a maximum set by ctx_adm.set_parameter for which the default (50M) is also probably too low.

Nothing yet in the $N table that we will see in the next post:

SQL> select * from DR$DEMO_CTX_INDEX$N;

no rows selected

summary

The important points here is that inserted document are visible only after synchronization, and synchronizing too frequently will cause fragmentation. If you need to synchronize in real time (on commit) and you commit for each document inserted, then you will probably have to plan frequent index optimization. If on the other hand we are able to synchronize only when we have inserted a lot of documents, then fragmentation is reduced according that we had enough memory to process all documents in one pass.

The next post will be about deletes and updates.