Published by Gervase Booth. Modified over 9 years ago.
1
Oracle Text saves your time
Oracle Text Search saves your time
Anna Suwalska, European Organization for Nuclear Research, Geneva
OracleWorld Paris 2003
2
Content
- CERN
- Engineering Data Management System at CERN
- Oracle Text
- How we profit from this technology
- Conclusion
3
Content: CERN
4
CERN - European Organization for Nuclear Research
- The world's largest particle physics research laboratory
- Founded in 1954, CERN today has 20 member states
- 2400 staff
- Over 6500 scientists come here to use the research facilities
- 500 universities, over 80 nationalities
- CERN explores what matter is made of and what forces hold it together
- The WWW was born here
5
LHC - The Large Hadron Collider Project
6
LHC - Cryodipole
7
Content: EDMS, Engineering Data Management System
8
EDMS - Engineering Data Management System
- EDMS Portal
- EDMS Common layer
- Axalant, MP5, other DBs
- Design data, documents and drawings, asset tracking, work management
15
EDMS - Engineering Data Management System
Managing:
- Structures
- Complete life cycle for single and compound documents
- Document versioning
- Document approval processes (comments collector)
- Assemblies
- Equipment workflow, data
- Installation (jobs, locations, etc.)
16
EDMS mandate
- Manage a full description of the LHC project's engineering data over its lifetime (>25 years)
- Support and coordinate engineering work, information and data workflow
- Establish knowledge transfer: evolving staff, many short-term visitors
- A full description of the machine and its components throughout their lifecycle must be constantly available to all concerned parties
- Help trace solutions to all problems occurring in the machine
- Provide an efficient search tool to support the requirements above: our choice is Oracle Text
Lifecycle phases: Design, Installation, Operation, Dismantling
23
Oracle Text: our choice
Our needs
- Index metadata & files: the first-line search is done on metadata, but the ability to index files is essential
- Bilingual: the official CERN languages are English and French, and we have to support both
- Performance: response time is very important
- Simplicity: simple for users, simple to develop, simple to maintain
What Oracle Text offers
- Supports most document formats
- Supports 39 languages
- Very efficient for searches within big collections of data
- Results come with a scoring methodology that helps to navigate through a result set
- Standard SQL statements
- Easy to maintain with ALTER INDEX or the CTX_DDL package
- Comes as an option in the RDBMS, at no additional cost
24
Content: Oracle Text
25
Oracle Text
- Enables the building of a text query application and a document classification application
- Takes care of: indexing; searching (word and theme); viewing text
- Uses standard SQL
26
CONTEXT Index Creation

    CREATE INDEX index_name ON table_name(column_name)
      INDEXTYPE IS CTXSYS.CONTEXT
      PARAMETERS('parameters string');

Options accepted in the parameters string:

    [datastore datastore_pref]
    [filter filter_pref]
    [charset column charset_column_name]
    [format column format_column_name]
    [lexer lexer_pref]
    [language column language_column_name]
    [wordlist wordlist_pref]
    [storage storage_pref]
    [stoplist stoplist]
    [section group section_group]
    [memory memsize]
    [populate | nopopulate]
27
Types of indexes
CONTEXT
- Query operator: CONTAINS
- Used for: large coherent documents
- Column types: CLOB, BLOB, BFILE, CHAR, VARCHAR2, XML
- Characteristics: built on a text column; the most complete of the three types
CTXCAT
- Query operator: CATSEARCH
- Used for: indexing small text fragments and related information; improving mixed query performance
- Column types: CHAR, VARCHAR2
- Characteristics: combined index on a text column and one or more other columns. Transactional: no synchronization is needed after DML. Creation can take longer because of the sub-indexes. Supports INDEX SET, LEXER*, STOPLIST, STORAGE, WORDLIST*. Has its own query language.
CTXRULE
- Query operator: MATCHES
- Used for: building a document classification application
- Column types: VARCHAR2, CLOB
- Characteristics: built on a column containing a set of queries. Supports LEXER (only BASIC). Does not support a number of operators.
28
Index Maintenance & Optimization

    ALTER INDEX index_name REBUILD [ONLINE] [PARAMETERS('parameters string')];

    ALTER INDEX cdi_text_ctx REBUILD ONLINE PARAMETERS('optimize fast');
    ALTER INDEX cdi_text_ctx REBUILD ONLINE PARAMETERS('optimize full maxtime 10');
    ALTER INDEX cdi_text_ctx REBUILD ONLINE PARAMETERS('optimize full');
29
Index Maintenance & Optimization

    ALTER INDEX index_name REBUILD [ONLINE] [PARAMETERS('parameters string')];

CTX_DDL package
- CTX_DDL.OPTIMIZE_INDEX
- CTX_DDL.SYNC_INDEX
30
DML processing
- INSERT: a new row is inserted in the DR$PENDING queue and is not available for query before synchronization
- UPDATE: the existing ROWID is placed in DR$PENDING; neither the new nor the old content is available for query before synchronization
- DELETE: the row is immediately unavailable for query (marked as invalid), but is only removed when optimization is complete
- CTX_USER_PENDING (CTX_PENDING) view: used to check the records waiting for synchronization
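The visibility rules above can be made concrete with a small in-memory model. This is only an illustrative sketch in Python, not Oracle code: the class `ContextIndexModel`, its method names, and the `garbage` counter are hypothetical and simply mirror the INSERT, UPDATE, DELETE, synchronization and optimization rules listed on the slide.

```python
class ContextIndexModel:
    """Toy model of the DML visibility rules on the slide (not Oracle code)."""

    def __init__(self):
        self.indexed = {}   # rowid -> text that queries can currently see
        self.pending = {}   # rowid -> text waiting for synchronization
        self.garbage = 0    # invalidated entries waiting for optimization

    def insert(self, rowid, text):
        # New rows only enter the pending queue.
        self.pending[rowid] = text

    def update(self, rowid, text):
        # The old content disappears at once; the new content waits in the queue.
        if rowid in self.indexed:
            del self.indexed[rowid]
            self.garbage += 1
        self.pending[rowid] = text

    def delete(self, rowid):
        # The row is hidden immediately, but its entry lingers until optimization.
        self.pending.pop(rowid, None)
        if rowid in self.indexed:
            del self.indexed[rowid]
            self.garbage += 1

    def sync(self):
        # Synchronization makes all pending rows searchable.
        self.indexed.update(self.pending)
        self.pending.clear()

    def optimize(self):
        # Optimization physically removes the invalidated entries.
        self.garbage = 0

    def search(self, word):
        word = word.lower()
        return sorted(r for r, t in self.indexed.items() if word in t.lower().split())
```

With this model, a row inserted as "LHC magnet" is not returned by `search("magnet")` until `sync()` is called, which is the behaviour that motivates a frequent synchronization job.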
31
Scoring
“To calculate a relevance score for a returned document in a word query, Oracle uses an inverse frequency algorithm based on Salton's formula. Inverse frequency scoring assumes that frequently occurring terms in a document set are noise terms, and so these terms are scored lower. For a document to score high, the query term must occur frequently in the document but infrequently in the document set as a whole.” (Oracle Text Reference, Release 9.0.1)
Example
- In the data set: M occurrences of TERM1 and N occurrences of TERM2, with M >> N
- A document contains an equal number (n) of occurrences of TERM1 and TERM2
- Result: SCORE for querying TERM1 < SCORE for querying TERM2
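The example can be reproduced with a simplified inverse-frequency weight. The sketch below is a stand-in chosen for illustration, not Oracle's exact scoring formula: the function name, the use of the natural logarithm, the document counts, and the choice to count documents containing the term (rather than total occurrences in the data set) are all assumptions.

```python
import math


def inverse_frequency_score(term_count_in_doc, docs_with_term, total_docs):
    """Simplified inverse-frequency weight: f * (1 + ln(N / n)).

    f = occurrences of the term in the document,
    n = number of documents containing the term,
    N = total number of documents.
    """
    if term_count_in_doc == 0 or docs_with_term == 0:
        return 0.0
    return term_count_in_doc * (1 + math.log(total_docs / docs_with_term))


# Both terms occur 3 times in the document; TERM1 is common in the
# collection, TERM2 is rare (hypothetical counts).
score_term1 = inverse_frequency_score(3, docs_with_term=9000, total_docs=10000)
score_term2 = inverse_frequency_score(3, docs_with_term=50, total_docs=10000)
print(score_term1 < score_term2)
```

The rare term gets the higher weight even though both terms occur equally often in the document, which matches the result stated in the example.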
32
Query Operators
- Boolean: AND, OR, NOT, MINUS. Example: lhc AND magnet AND NOT cryogenic
- Linguistics: ABOUT, STEM, SYNonym, Translation Term, Broader / Narrower / Preferred / Related Term. Examples: SYN (science), ABOUT (particle)
- Others: FUZZY, NEAR, SOUNDEX, WITHIN, SQE. Example:

    begin
      ctx_query.store_sqe('particle', 'atom, molecule proton');
    end;

    'SQE (particle)'
33
CTX packages
- CTX_ADM: administer servers and the data dictionary (only the ctxsys user)
- CTX_DDL: create and manage preferences, section groups and stoplists; manage indexes
- CTX_DOC: document presentation features (only for CONTEXT indexes)
- CTX_OUTPUT: manage logs for the indexes
- CTX_QUERY: generate query feedback, count hits, and create SQEs (stored query expressions)
- CTX_THES: manage and browse the thesaurus
34
Content: How we profit from this technology
35
EDMS metadata index preferences
36
EDMS search for both languages
Example search terms: Version 1.5; accelerateur lhc; méthode
37
EDMS metadata index preferences
38
Escaping characters to query them
- To query on reserved words or symbols such as “minus”, “-” or “near”, they must be escaped.
- There are two ways to escape them: braces “{}” or a backslash “\”.
- We had to hard-code the escaping for each symbol and word. A standard “dictionary table” listing the reserved characters would be useful.
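The "dictionary table" idea can be sketched as a lookup-driven escaper that runs on the user's input before it reaches CONTAINS. The following Python sketch is an assumption-laden illustration, not the EDMS implementation: the function names are hypothetical, and the word list is deliberately partial, so the authoritative lists of reserved words and characters should be taken from the Oracle Text Reference for the release in use.

```python
# Hypothetical lookup tables. The word list is incomplete and should be
# checked against the Oracle Text Reference for your release.
RESERVED_WORDS = {"about", "and", "fuzzy", "minus", "near", "not", "or", "within"}
RESERVED_CHARS = set("&|~-,;=>*%_$?!(){}[]\\")


def escape_token(token):
    """Escape one whitespace-delimited token of user input."""
    if token.lower() in RESERVED_WORDS:
        # Braces turn a reserved word into an ordinary search term.
        return "{" + token + "}"
    # A backslash escapes a single reserved character.
    return "".join("\\" + ch if ch in RESERVED_CHARS else ch for ch in token)


def escape_free_text(text):
    """Escape every token of a free-text search string."""
    return " ".join(escape_token(t) for t in text.split())


print(escape_free_text("minus"))
print(escape_free_text("LHC-Q-EI-0002"))
```

Note that escaping `%` and `_` also disables wildcard searches, so an application that wants to offer wildcards would leave those two characters out of the table.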
39
EDMS metadata index preferences
It is important to know how users will search the data, and what kind of data you are going to index, before you actually do it.
40
EDMS Index Maintenance & Optimization
Environment
- Hardware & system: two-node cluster based on two Sun SPARC 450, running Solaris 2.6 + Sun Cluster 2.1
- RDBMS: 8.1.7.4
- ~500 MB SGA size
- 60-80 concurrent users (during working hours)
Meta data
- 4 000 new documents (monthly)
- 4 000 document updates (monthly)
- 266 500 drawings
- 74 000 test documents
Files
- 88 800 files (CSV, DOC, DOT, HTM, HTML, MPP, PDF, PPT, PS, RES, TXT) / 148 000 total
- 45 GB of files / 78 GB total
41
EDMS Index Maintenance & Optimization
Meta data
- Index synchronization: every 10 min, takes a few seconds
- Index optimization: every weekend, takes ~30 min

    PROCEDURE rebuild_metedata_ctx IS
    BEGIN
      EXECUTE IMMEDIATE ('alter index CDI_TEXT_CTX rebuild online parameters(''sync'')');
    END;

    PROCEDURE optimize_metedata_ctx IS
    BEGIN
      EXECUTE IMMEDIATE ('alter index CDI_TEXT_CTX rebuild online parameters(''optimize full'')');
    END;
42
EDMS Index Maintenance & Optimization
Files
- Synchronize every 24h?
- Optimize (fast, full) every month?
43
Scoring

    SQL> SELECT c_id, score(10) FROM compound_doc_info
         WHERE CONTAINS(c_text, 'lhc', 10) > 0 AND c_id = 1738594907;

          C_ID  SCORE(10)
    ----------  ---------
    1738594907          9

    SQL> SELECT c_id, score(10) FROM compound_doc_info
         WHERE CONTAINS(c_text, 'evolution', 10) > 0 AND c_id = 1738594907;

          C_ID  SCORE(10)
    ----------  ---------
    1738594907         15
44
Using the thesaurus
Propose the RT (Related Term) if nothing is found with the original term(s).

    DECLARE
      xtab ctx_thes.exp_tab;
      ….
    BEGIN
      ctxsys.ctx_thes.rt(xtab, p_term, 'edms_thes');
      FOR i IN 1..xtab.COUNT LOOP
        IF xtab(i).xrel = C_RELETED_TERM THEN
          htp.anchor(
            L_DOC_SEARCH
              || '?cookie=' || cookie
              || '&p_search_type=' || p_search_type
              || '&p_free_text=' || LOWER(xtab(i).xphrase),
            LOWER(xtab(i).xphrase)
          );
        END IF;
      END LOOP;
    END;

It would be nice to have a spell checker/corrector that uses the existing tokens.
45
Using the thesaurus - example
46
Using the thesaurus - example
49
Querying with Oracle Text versus standard SQL
p_free_text is a single word or an exact sentence.
Oracle Text
- Statement: WHERE CONTAINS (c_text, p_free_text) > 0
- Time*: 83 ms (821 ms)
- Characteristics: fast
Standard SQL
- Statement: WHERE UPPER(c_text) LIKE '%'||UPPER(p_free_text)||'%'
- Time*: 03.98 s (39.14 s)
- Characteristics: underperforming
* Tests done with TOra 1.3.8 (in parentheses: repeated 10x)
50
Querying with Oracle Text versus standard SQL
p_free_text is an expression with the OR operator.
Oracle Text
- Statement: WHERE CONTAINS (c_text, p_free_text) > 0
- Time*: 103 ms (01:03)
- Characteristics: fast
Standard SQL
- Statement: WHERE ( UPPER(c_text) LIKE '%'||UPPER(p_text_1)||'%' OR UPPER(c_text) LIKE '%'||UPPER(p_text_2)||'%' )
- Time*: 09:09 (1:22.09)
- Characteristics: underperforming
* Tests done with TOra 1.3.8 (in parentheses: repeated 10x)
53
Querying with Oracle Text
Total 02.36s
Total 02.31s
Total 02.25s
54
Mixed queries
“LHC-Q-EI-0002” is a document number. The search is done on:
1) the document number column, using a standard index
2) the context index
55
Indexing various file formats
- Formatted documents such as Microsoft Word and PDF have to be filtered.
- The File_format column stores the value “TEXT” or “BINARY”.
- INSO_FILTER ignores all rows with “TEXT” in the format column.
- NULL_FILTER is used for plain text and HTML formats.
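One way to fill such a format column is to derive it from the file extension when a file is loaded. The Python sketch below illustrates this under stated assumptions: the slide does not say how EDMS populated the column, the function name is hypothetical, and the set of extensions treated as plain text or HTML is a guess based on the "plain text and HTML" wording.

```python
import os

# Assumption: only these extensions are treated as plain text or HTML.
# The slide does not list which extensions map to which value.
TEXT_EXTENSIONS = {".txt", ".htm", ".html"}


def format_column_value(filename):
    """Return the value to store in the format column for this file."""
    ext = os.path.splitext(filename)[1].lower()
    # "TEXT" rows skip filtering; everything else is sent through the filter.
    return "TEXT" if ext in TEXT_EXTENSIONS else "BINARY"


for name in ("notes.txt", "index.HTML", "report.PDF", "minutes.doc"):
    print(name, format_column_value(name))
```

Defaulting unknown extensions to "BINARY" is the cautious choice here, since it sends an unrecognised file through the filter rather than indexing its raw bytes as text.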
56
Some indexing problems we have
- The creation of an interMedia Text index (with URL_DATASTORE) fails with ORA-4030 (out of process memory).
- After an apparently successful indexing of PDF files (using INSO_FILTER), some are indexed only “partially”, without any error being recorded in the error table.
- In June 2002 this was identified as a memory leak, fixed in 8.1.7.4.0.
- We now observe the same ORA-4030 error with 8.1.7.4.0 OPS.
- Result: it is very difficult to verify whether a document is correctly indexed.
57
Content: Conclusion
58
Conclusion
Oracle Text is worth using because of:
- Performance
- Simplicity of the code (integrated with Oracle, no external search engine)
- Simplicity of the index maintenance
- Functional features: bilingual support, special query operators, thesaurus
- Document presentation features
59
Thank you
EDMS service: https://edms.cern.ch
This presentation: https://edms.cern.ch/file/402581/1/Oracle_Text_OracleWorld2003.ppt
Contact: Anna.Suwalska@cern.ch