ZT oracle全文索引

asword發表於2009-02-10
全文索引通常用在文字或網頁搜尋中,比如類似對 "大學生活真好!" 或者一個檔案進行索引,查詢方法跟使用平常的SELECT的區別是,需要在WHERE條件中使用一個特定的運算子,全文索引分幾個類別,分別使用於不同的搜尋環境,不同的類別使用不同的運算子進行搜尋.沒有全文索引時,這類查詢通常用下面的sql查詢:
SELECT id,doc FROM table WHERE doc like ‘%國家%’;
[@more@]
全文索引通常用在文字或網頁搜尋中,比如類似對 "大學生活真好!" 或者一個檔案進行索引,查詢方法跟使用平常的SELECT的區別是,需要在WHERE條件中使用一個特定的運算子,全文索引分幾個類別,分別使用於不同的搜尋環境,不同的類別使用不同的運算子進行搜尋.沒有全文索引時,這類查詢通常用下面的sql查詢:
SELECT id,doc FROM table WHERE doc like ‘%國家%’;
這些查詢能夠很好search我們想要的,但隨著資料庫的增長,查詢速度是不可忍受的,資料庫只能透過full table scan來完成這項工作.有了全文索引後,能夠很好的解決這類問題,oracle會採用一種類似普通索引的技術來索引文件.其實在內部oracle利用了特殊的詞法(lexer)分割器,詞法分割器跟語言有關係,比如:english、日語、漢字等都有不同的分割器.分割器將文字根據平常的使用習慣分割成最小的片語單元進行儲存,oracle 稱他為token.比如"大學生活真好!" oracke可能會分割成"大學生"、"生活"、“真好"、"生活真好"等小單元進行儲存,所以建立後的全文索引的容量是表的好多倍.索引的內容可以來自表中的某個列或者多個列或子表的列或者某個檔案或某個網頁等.
  • 索引類別:

到10GR2為止,共有四種全文搜尋型別:

Index Type Description Supported Preferences and Parameters Query Operator Notes
CONTEXT Use this index to build a text retrieval application when your text consists of large coherent documents. You can index documents of different formats such as MS Word, HTML or plain text.

With a context index, you can customize your index in a variety of ways.

This index type requires CTX_DDL.SYNC_INDEX after DML on base table.

All CREATE INDEX preferences and parameters supported except for INDEX SET.

These supported parameters include the index partition clause, and the format, charset, and language columns.

CONTAINS

Grammar is called the CONTEXT grammar, which supports a rich set of operations.

The CTXCAT grammar can be used with query templating.

Supports all documents services and query services.

Supports indexing of partitioned text tables.

CTXCAT Use this index type for better mixed query performance. Typically, with this index type, you index small documents or text fragments. Other columns in the base table, such as item names, prices and descriptions can be included in the index to improve mixed query performance.

This index type is transactional, automatically updating itself after DML to base table. No CTX_DDL.SYNC_INDEX is necessary.

INDEX SET

LEXER

STOPLIST

STORAGE

WORDLIST (only prefix_index attribute supported for Japanese data)

Format, charset, and language columns not supported.

Table and index partitioning not supported.

CATSEARCH

Grammar is called CTXCAT, which supports logical operations, phrase queries, and wildcarding.

The CONTEXT grammar can be used with query templating.

Theme querying is supported.

This index is larger and takes longer to build than a CONTEXT index.

The size of a CTXCAT index is related to the total amount of text to be indexed, number of indexes in the index set, and number of columns indexed. Carefully consider your queries and your resources before adding indexes to the index set.

The CTXCAT index does not support table and index partitioning, documents services (highlighting, markup, themes, and gists) or query services (explain, query feedback, and browse words.)

CTXRULE Use CTXRULE index to build a document classification or routing application. The CTXRULE index is an index created on a table of queries, where the queries define the classification or routing criteria. See . MATCHES Single documents (plain text, HTML, or XML) can be classified using the MATCHES operator, which turns a document into a set of queries and finds the matching rows in the CTXRULE index.
CTXPATH Create this index when you need to speed up existsNode() queries on an XMLType column. STORAGE Use with existsNode() Can only create this index on XMLType column.

Although this index type can be helpful for existsNode() queries, it is not required for XML searching. See

  • 建立語法

CREATE INDEX [schema.]index ON [schema.]table(column) INDEXTYPE IS
ctxsys.context [ONLINE]
[LOCAL [(PARTITION [partition] [PARAMETERS('paramstring')]
[, PARTITION [partition] [PARAMETERS('paramstring')]])]
[PARAMETERS(paramstring)] [PARALLEL n] [UNUSABLE]];
這裡的context還可以換成ctxcat、ctxrule、ctxpath幾個型別,支援分割槽索引.這裡很重要的部分
是PARAMETERS部分,定義了索引的各種建立條件.
PARAMETERS(paramstring)

The syntax for paramstring is as follows:

paramstring =
'[DATASTORE datastore_pref]  定義要所索引的資料來自哪裡,
 [FILTER filter_pref] 定義過濾器,
 [CHARSET COLUMN charset_column_name]  當選用CHARSET_FILTER過濾器時,顯示指定
                                   一個過濾器需要的列
 [FORMAT COLUMN format_column_name] 過濾器需要格式化的列

 [LEXER lexer_pref] 定義一個此法分析器
 [LANGUAGE COLUMN language_column_name]  多語言環境中,顯示指定lexer分析的多語言列

 [WORDLIST wordlist_pref]  定義一個常用搜尋詞表
 [STORAGE storage_pref]  定義索引的儲存引數
 [STOPLIST stoplist] 定義不需要index的片語列表
 [SECTION GROUP section_group] 定義一個索引區域(通常用在索引html、xml的環境)
 [MEMORY memsize] 分配索引建立需要的記憶體空間(in 10G)
 [POPULATE | NOPOPULATE] 是否建立一個空的索引
 [[METADATA] SYNC (MANUAL | EVERY "interval-string" | ON COMMIT)] 定義索引同步時間(in 10G)
 [TRANSACTIONAL]' 是否開啟事務更新(in 10G)
  • 索引建立      
  •  [oracle@ique ~]$ sqlplus mayp/mayp
SQL*Plus : Release 10.2.0.4.0 - Production on Thu Aug 21 19:45:43 2008
Copyright (c) 1982, 2007, Oracle.  All Rights Reserved.
Connected to:Oracle Database 10g Enterprise Edition Release 10.2.0.4.0 - ProductionWith the Partitioning, OLAP, Data Mining and Real Application Testing option
SQL> create table oratext(id number,name varchar2(30),address varchar2(200)) 
pctfree 10 tablespace assm/  2  
Table created.
SQL> alter table oratext add constraint pk_oratext primary key(id) using index/  2  
Table altered.
SQL> exec ctx_ddl.create_preference('mayp_storage','basic_storage');Elapsed: 00:00:00.22
SQL> exec ctx_ddl.set_attribute('mayp_storage','I_TABLE_CLAUSE','TABLESPACE TEST');Elapsed: 00:00:00.11SQL> exec ctx_ddl.set_attribute('mayp_storage','K_TABLE_CLAUSE','TABLESPACE TEST');Elapsed: 00:00:00.04SQL> exec ctx_ddl.set_attribute('mayp_storage','R_TABLE_CLAUSE','TABLESPACE TEST');Elapsed: 00:00:00.06
SQL> set feedback onSQL> exec ctx_ddl.set_attribute('mayp_storage','N_TABLE_CLAUSE','TABLESPACE TEST');
PL/SQL procedure successfully completed.
Elapsed: 00:00:00.01SQL> exec ctx_ddl.set_attribute('mayp_storage','I_INDEX_CLAUSE','TABLESPACE TEST COMPRESS 2');
PL/SQL procedure successfully completed.
Elapsed: 00:00:00.01SQL> exec ctx_ddl.set_attribute('mayp_storage','P_TABLE_CLAUSE','TABLESPACE TEST');
PL/SQL procedure successfully completed.
Elapsed: 00:00:00.04SQL> exec ctx_ddl.create_preference('mayp_lexer','chinese_lexer')
PL/SQL procedure successfully completed.
Elapsed: 00:00:00.02SQL> exec ctx_ddl.create_preference('mayp_datastore','direct_datastore');
PL/SQL procedure successfully completed.
Elapsed: 00:00:00.06
SQL> create index oratext_address on oratext(address) indextype is ctxsys.context 
parameters('lexer mayp_lexer datastore mayp_datastore storage mayp_storage')/  2  
Index created.
Elapsed: 00:00:04.48
這裡我重新定了了三個索引的引數:mayp_datastore(要索引的資料的儲存方式),mayp_storage(索引的儲存位置),
mayp_lexer(詞法分割器).在索引建立完之後,會發現多了以dr$index_name開頭的四張表和
一個dr$索引和oratext_address索引:
SQL> select table_name,tablespace_name from user_tables;
Tab.Name                       TABLESPACE_NAME------------------------------ ------------------------------DR$ORATEXT_ADDRESS$I           TESTDR$ORATEXT_ADDRESS$R           TESTSYS_IOT_OVER_45158             MSSMSYS_IOT_MAP_45158              MSSMORATEXT                        ASSMDR$ORATEXT_ADDRESS$NDR$ORATEXT_ADDRESS$K
SQL> select index_name,tablespace_name from user_indexes;
Ind.Name                       TABLESPACE_NAME------------------------------ ------------------------------SYS_IL0000011906C00002$$       TESTPK_P                           TESTSYS_IL0000011907C00002$$       TESTDR$ORATEXT_ADDRESS$X           TESTSYS_IL0000046825C00006$$       TESTSYS_IL0000046830C00002$$       TESTORATEXT_ADDRESSSYS_IL0000046726C00006$$       TESTSYS_IL0000046731C00002$$       TEST
其中DR$ORATEXT_ADDRESS$I這張表最關鍵,它儲存被lexer分割後的token,也就是我們可以利用搜尋的關鍵字 .
SQL> select token_text from dr$oratext_address$i;
TOKEN_TEXT----------------------------------------------------------------1128工業海街號街南路南蘇桐蘇州市桐路星海園區
13 rows selected.
Elapsed: 00:00:00.01
透過一個查詢來看看:
SQL> select name,address from  oratext where contains(address,'園區')>0;     
NAME                           ADDRESS------------------------------ --------------------------------------------------王皓                           蘇州市工業園區星海街南11號司馬格達                       蘇州市工業園區蘇桐路28號張磊                           蘇州市工業園區蘇桐路28號
3 rows selected.
Elapsed: 00:00:01.49
SQL> select name,address from  oratext where contains(address,'蘇桐路')>0;
NAME                           ADDRESS------------------------------ --------------------------------------------------司馬格達                       蘇州市工業園區蘇桐路28號張磊                           蘇州市工業園區蘇桐路28號
2 rows selected.
Elapsed: 00:00:00.07
SQL> select name,address from  oratext where contains(address,'星海街')>0;
NAME                           ADDRESS------------------------------ --------------------------------------------------王皓                           蘇州市工業園區星海街南11號
1 row selected.
  • 索引維護
ctxsys.context全文索引在發生DML語句後,不同自動同步索引,必須透過ctx_ddl.sync_index同步,
可以定義一個job進行同步,
SQL> insert into oratext values(4,'馬成','蘇州市觀前街7號');
1 row created.
Elapsed: 00:00:00.08SQL> commit;
Commit complete.
Elapsed: 00:00:00.00
SQL> select name,address from oratext where contains(address,'觀前街')>0;
no rows selected
Elapsed: 00:00:00.09
SQL> exec ctx_ddl.sync_index('oratext_address','2m');
PL/SQL procedure successfully completed.
Elapsed: 00:00:00.18
SQL> select name,address from oratext where contains(address,'觀前街')>0;
NAME                           ADDRESS------------------------------ --------------------------------------------------馬成                           蘇州市觀前街7號
1 row selected.
Elapsed: 00:00:00.02
在索引的同步過程中,可能會產生碎片,我們可以定時透過ctx_ddl.optimize_index維護索引.
ctx_report package可以報告全文索引的一些統計資訊:
ctx_report.index_size:報告索引的大小
Elapsed: 00:00:00.00SQL> select ctx_report.index_size('oratext_address') from dual;
CTX_REPORT.INDEX_SIZE('ORATEXT_ADDRESS')--------------------------------------------------------------------------------===========================================================================                    INDEX SIZE FOR MAYP.ORATEXT_ADDRESS===========================================================================TABLE:                               MAYP.DR$ORATEXT_ADDRESS$ITABLESPACE NAME:                     TESTBLOCKS ALLOCATED:                                                        8BLOCKS USED:                                                             3BYTES ALLOCATED:                                         65,536 (64.00 KB)BYTES USED:                                              24,576 (24.00 KB)
LOB SEGMENT:                         MAYP.SYS_LOB0000046825C00006$$TABLE NAME:                          MAYP.DR$ORATEXT_ADDRESS$ILOB COLUMN:                          TOKEN_INFOTABLESPACE NAME:                     TESTBLOCKS ALLOCATED:                                                        8BLOCKS USED:                                                             2BYTES ALLOCATED:                                         65,536 (64.00 KB)BYTES USED:                                              16,384 (16.00 KB)
INDEX (NORMAL):                      MAYP.DR$ORATEXT_ADDRESS$XTABLE NAME:                          MAYP.DR$ORATEXT_ADDRESS$ITABLESPACE NAME:                     TEST.........................
TOTALS FOR INDEX MAYP.ORATEXT_ADDRESS---------------------------------------------------------------------------TOTAL BLOCKS ALLOCATED:                                                 72TOTAL BLOCKS USED:                                                      19TOTAL BYTES ALLOCATED:                                 589,824 (576.00 KB)TOTAL BYTES USED:                                      155,648 (152.00 KB)
ctx_report.create_index_script:建立索引定義指令碼
SQL> select ctx_report.create_index_script('oratext_address') from dual;
CTX_REPORT.CREATE_INDEX_SCRIPT('ORATEXT_ADDRESS')--------------------------------------------------------------------------------begin  ctx_ddl.create_preference('"ORATEXT_ADDRESS_DST"','DIRECT_DATASTORE');end;/
begin  ctx_ddl.create_preference('"ORATEXT_ADDRESS_FIL"','NULL_FILTER');end;..........................
create index "MAYP"."ORATEXT_ADDRESS"  on "MAYP"."ORATEXT"      ("ADDRESS")  indextype is ctxsys.context  parameters('    datastore       "ORATEXT_ADDRESS_DST"    filter          "ORATEXT_ADDRESS_FIL"    section group   "ORATEXT_ADDRESS_SGP"    lexer           "ORATEXT_ADDRESS_LEX"    wordlist        "ORATEXT_ADDRESS_WDL"    stoplist        "ORATEXT_ADDRESS_SPL"    storage         "ORATEXT_ADDRESS_STO"  ')/
begin  ctx_output.end_log;end;/1 row selected.
Elapsed: 00:00:00.23
ctx_report.index_stats:報告索引的內部結構,已經碎片的程度
SQL> declare    l_index_stats clob;    begin     ctx_report.index_stats(index_name=>'oratext_address',report=>l_index_stats);     dbms_output.put_line(l_index_stats);  end;   /  2    3    4    5    6    7  ===========================================================================                  STATISTICS FOR"MAYP"."ORATEXT_ADDRESS"===========================================================================
indexed documents:                                                      3allocated docids:4$I rows:                                                               17
---------------------------------------------------------------------------                             TOKENSTATISTICS---------------------------------------------------------------------------
unique tokens:                                                         17average $I rows per token:1.00tokens with most $I rows:  園區 (0:TEXT)                                                         1  星海 (0:TEXT)                                                         1  桐路 (0:TEXT)1  蘇州市 (0:TEXT)                                                       1  蘇桐 (0:TEXT)                                                         1  前街 (0:TEXT)1
...........
---------------------------------------------------------------------------                         FRAGMENTATIONSTATISTICS---------------------------------------------------------------------------
total size of $I data:                                                 93
$I rows:17estimated $I rows if optimal:                                          17estimated row fragmentation:                                          0 %
garbage docids:1estimated garbage size:                                                19
most fragmented tokens:  園區 (0:TEXT)                                                       0 %  星海 (0:TEXT)0 %  桐路 (0:TEXT)                                                       0 %  蘇州市 (0:TEXT)                                                     0 %  蘇桐 (0:TEXT)0 %  前街 (0:TEXT)                                                       0 %  南 (0:TEXT)                                                         0 %  路 (0:TEXT)0 %  街南 (0:TEXT)                                                       0 %  街 (0:TEXT)                                                         0 %  號 (0:TEXT)0 %  海街 (0:TEXT)                                                       0 %  觀前 (0:TEXT)                                                       0 %  工業 (0:TEXT)0 %  7 (0:TEXT)                                                          0 %  28 (0:TEXT)                                                         0 %  11 (0:TEXT)0 %
PL/SQL procedure successfully completed.
Elapsed: 00:00:00.03這個索引沒有碎片.
--結束
參考文件:
Oracle Text Application Developer's Guide Release 9.2
Oracle Text Reference Release 9.2
Oracle® Text Application Developer's Guide 10g Release 2 (10.2)

來自 “ ITPUB部落格 ” ,連結:http://blog.itpub.net/82387/viewspace-1017150/,如需轉載,請註明出處,否則將追究法律責任。

相關文章