1 www.semantec.de
Advanced searching with Oracle Text
Indexing and searching in text and documents
Author: Krasen Paskalev, Certified Oracle DBA
Semantec GmbH, D-71083 Herrenberg

2 Agenda
Motivation
– Problems when searching in documents
Oracle Text features
– Oracle Text searching capabilities
– Document sources, formats and languages
– How indexing works
– Index types
A business case

3 The Need
Find document X by keyword Y:
SELECT doc_name FROM documents WHERE UPPER(text) LIKE '%CAT%'
Problems with this approach:
– It also matches words we don't need, such as APPLICATION and VACATION
– Too slow: often results in full table scans
– No information about relevance
– No search in files (Word, PDF, Excel)
– No advanced searching

4 Finding information
The major tasks of information systems:
– Store
– Retrieve
We are experts in storing both structured and non-structured data
How do we find what we need...
– Fast
– Precisely
– Effectively

5 Agenda
Motivation
Oracle Text features
– Oracle Text searching capabilities
– Document sources, formats and languages
– How indexing works
– Index types
A business case

6 What is Oracle Text?
Formerly known as ConText (8.0) and interMedia Text (8i)
Uses standard SQL to index, search and analyze text and documents stored in the Oracle database, in files and on the Web
Allows advanced searching, including keyword search, pattern matching, boolean expressions, etc.
Supports multiple languages

7 Example of Oracle Text search
Normal search:
SELECT doc_name FROM documents WHERE UPPER(text) LIKE '%SPACE%'
Oracle Text index:
SELECT doc_name FROM documents WHERE CONTAINS(text, 'space', 1) > 0 ORDER BY score(1) DESC

8 Boolean expressions
AND (&) – 'mouse & wireless'
OR (|) – 'mouse | wireless'
NOT (~) – 'mouse ~ wireless'
ACCUMulate (,) – 'mouse, monitor, cd'
SELECT doc_name FROM documents WHERE CONTAINS(text, 'mouse | wireless', 1) > 0 ORDER BY score(1) DESC

9 Proximity
NEAR – 'mouse' is within 5 words of 'wireless'
SELECT doc_name FROM documents WHERE CONTAINS(text, 'NEAR((mouse,wireless),5)', 1) > 0 ORDER BY score(1) DESC

10 Expansion operators
Expand the list of words searched for
Wildcard (%, _) – '_ing', 'monito%'
Soundex (!) – words that sound similar
– '!sing' -> sing sink
Fuzzy – words that are spelled similarly
– 'fuzzy(sing,70,10,weight)' -> sing king sink
Stem ($) – words having the same linguistic root
– '$sing' -> sing sang sung
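The expansion operators can be combined with the boolean operators in one query. The following is only a sketch: it reuses the documents table and text column from the earlier slides and assumes a CONTEXT index already exists on that column. The alias relevance is made up for the illustration.

```sql
-- Assumes a CONTEXT index already exists on documents(text).
-- $sing    : stem expansion (sing, sang, sung, ...)
-- ?sing    : short form of the fuzzy operator with default settings
-- monito%  : wildcard expansion
SELECT doc_name, score(1) AS relevance
FROM documents
WHERE CONTAINS(text, '$sing | ?sing | monito%', 1) > 0
ORDER BY score(1) DESC;
```

Which words each operator actually expands to depends on the lexer and wordlist settings of the index, so the result lists on the slide should be read as illustrations.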

11 Thesauri
Relationships between the words in Oracle Text are stored in a thesaurus:
– Synonym rings
– Hierarchical relations: broader and narrower terms
– Associative relations: related terms
– Translations

12 Thesauri examples
Theme search – 'ABOUT(economics)'
Broader term – 'BT(cat)' -> animal
Narrower term – 'NT(animal)' -> cat dog
Associative relation – 'RT(cat)' -> kitten
Translated term – 'TR(cat)' -> cat gato
Synonym – 'SYN(cat)' -> cat tiger

13 Document sections
For documents having internal structure, like XML and HTML, sections can be defined and indexed
Sample document text: Harry Potter
– 'harry WITHIN book'
– 'rowling WITHIN book@author'
Sample document text: I like my cat.
XPath functions:
– 'cat INPATH(A/B)'
– 'HASPATH(A/B)'
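Section searching only works when the index is created with a section group. The sketch below assumes the documents table from the earlier slides holds XML in its text column; the section group name doc_auto_sg and the index name doc_xml_idx are hypothetical. AUTO_SECTION_GROUP derives sections from the XML tags and names attribute sections tag@attribute, which matches the book@author form used on the slide.

```sql
-- Hypothetical names: doc_auto_sg, doc_xml_idx.
BEGIN
  ctx_ddl.create_section_group('doc_auto_sg', 'AUTO_SECTION_GROUP');
END;
/

-- This would take the place of a plain CONTEXT index on the same column.
CREATE INDEX doc_xml_idx ON documents(text)
  INDEXTYPE IS CTXSYS.CONTEXT
  PARAMETERS ('SECTION GROUP doc_auto_sg');

-- Finds documents in which "harry" occurs inside a <book> element.
SELECT doc_name
FROM documents
WHERE CONTAINS(text, 'harry WITHIN book', 1) > 0;
```

The INPATH and HASPATH operators need a PATH_SECTION_GROUP instead; the rest of the setup would look the same.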

14 Location of documents
Direct – Text is stored directly in a text column
Multi-column – Text is in multiple columns
Detail – Text is in multiple rows of a detail table
Nested – Text is stored in a nested table
File – Documents are stored externally as files
URL – Documents are stored externally as files on the Internet
User – Documents are synthesized at index time by a stored procedure

15 Direct and Multi-column documents
Direct: table documents (doc_name, author, text)
Multi-column: table documents (doc_name, author, text)
Allowed datatypes: CHAR, VARCHAR, VARCHAR2, BLOB, CLOB, BFILE, XMLType

16 Detail and Nested documents
Detail: master table documents (doc_name, author) with detail table doc_details (doc_name, seq_no, text)
Nested: table documents (doc_name, author, doc_nst), where doc_nst is a nested table (seq_no, text)

17 File and URL documents
File: table documents (doc_name, author, text); the column stores the document's location in the file system
– File1: /location1/file1.doc
– File2: /location1/file2.doc
URL: table documents (doc_name, author, text); the column stores the document's location on the Web
– URL1: http://www.mysite.com/file1.doc
– URL2: http://www.mysite.com/file2.doc

18 Document formats
Over 150 document formats are supported, including:
– Microsoft Word, Excel, PowerPoint, Project
– HTML
– XML
– PDF

19 Languages
Oracle Text supports indexing of text in different languages, including:
– English, German, other western European languages
– Japanese, Chinese, Korean, ...

20 German language features
Composite word indexing
– VERTRAGSANLAGE
Alternate spelling
– ÖFFNEN / OEFFNEN
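Both German features are switched on through a lexer preference. The following sketch assumes the documents table from the earlier slides; the preference name german_lexer and the index name docs_de_idx are hypothetical.

```sql
-- Hypothetical names: german_lexer, docs_de_idx.
BEGIN
  ctx_ddl.create_preference('german_lexer', 'BASIC_LEXER');
  -- Index the parts of composite words such as VERTRAGSANLAGE.
  ctx_ddl.set_attribute('german_lexer', 'COMPOSITE', 'GERMAN');
  -- Treat OE as an alternate spelling of Ö, and similarly for the other umlauts.
  ctx_ddl.set_attribute('german_lexer', 'ALTERNATE_SPELLING', 'GERMAN');
END;
/

CREATE INDEX docs_de_idx ON documents(text)
  INDEXTYPE IS CTXSYS.CONTEXT
  PARAMETERS ('LEXER german_lexer');
```

With such a lexer, a query for OEFFNEN would be expected to also find documents containing ÖFFNEN.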

21 How does indexing work?

22 Index types
Oracle Text supports 3 types of indexes:
– CONTEXT
– CTXCAT
– CTXRULE

23 CONTEXT
Use this index when your text consists of large, coherent documents
It is not transactional and needs periodic synchronization
Supports all Oracle Text features
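Periodic synchronization is done with the CTX_DDL package. A minimal sketch, assuming a CONTEXT index called myindex (the name used on the index creation slide); in practice the call would typically be scheduled as a database job.

```sql
-- Apply pending inserts, updates and deletes to the CONTEXT index.
BEGIN
  ctx_ddl.sync_index('myindex');
END;
/

-- Frequent synchronization fragments the index; an occasional optimization compacts it.
BEGIN
  ctx_ddl.optimize_index('myindex', 'FULL');
END;
/
```

Rows changed since the last synchronization are not reflected in CONTAINS queries until sync_index has run.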

24 CTXCAT
Use this index for better performance of mixed queries
Best for indexing small text fragments
This index is maintained automatically when data is changed
Does not support all features:
– No sections
– Only single-column document location
– ...
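A mixed query combines a text condition with a condition on a structured column. The sketch below uses a hypothetical table auction(item_id, title, price) that does not appear in the presentation; the index set and index names are also made up.

```sql
-- Hypothetical table: auction(item_id NUMBER, title VARCHAR2(100), price NUMBER)
BEGIN
  ctx_ddl.create_index_set('auction_iset');
  -- Make the price column available to the structured part of the query.
  ctx_ddl.add_index('auction_iset', 'price');
END;
/

CREATE INDEX auction_titlex ON auction(title)
  INDEXTYPE IS CTXSYS.CTXCAT
  PARAMETERS ('INDEX SET auction_iset');

-- Text condition and structured condition are answered by the same index.
SELECT item_id, title, price
FROM auction
WHERE CATSEARCH(title, 'camera', 'price < 200') > 0;
```

CTXCAT indexes are queried with CATSEARCH rather than CONTAINS, and no synchronization call is needed after DML.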

25 CTXRULE
Used to build document classification or routing applications
A table of queries and corresponding categories is defined, identifying the classification or routing criteria
Each incoming document can be classified into a category using the corresponding queries
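The classification flow described above could look roughly like this. The table doc_rules, its rows and the index name are hypothetical and serve only as an illustration.

```sql
-- Hypothetical rule table: one query per category.
CREATE TABLE doc_rules (
  rule_id  NUMBER PRIMARY KEY,
  category VARCHAR2(30),
  query    VARCHAR2(2000)
);

INSERT INTO doc_rules VALUES (1, 'hardware', 'mouse | monitor');
INSERT INTO doc_rules VALUES (2, 'billing', 'invoice & payment');
COMMIT;

-- The CTXRULE index is built on the column holding the queries.
CREATE INDEX doc_rules_idx ON doc_rules(query)
  INDEXTYPE IS CTXSYS.CTXRULE;

-- Classify an incoming document: return every category whose query matches its text.
SELECT category
FROM doc_rules
WHERE MATCHES(query, 'My wireless mouse stopped working') > 0;
```

Here the direction is reversed compared to CONTAINS: the queries are indexed and the document text is the search argument.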

26 Index creation
CREATE INDEX myindex ON docs(text) INDEXTYPE IS CTXSYS.CONTEXT;
A number of preferences can be specified:
Datastore – How are your documents stored?
Filter – How can the documents be converted to plain text?
Lexer – What language is being indexed?
Wordlist – How are stem and fuzzy queries to be expanded?
Storage – How should the index data be stored?
Stop list – Which words or themes should not be indexed?
Section group – How are document sections defined?

27 Present search results
Filter – converts documents from their format to plain text or HTML
Highlight – generates offsets (location in document) of the text matching your query
Example text: The cat is jumping on the floor.

28 Agenda
Motivation
Oracle Text features
A business case

29 A business case
At Semantec we have a mission-critical collaboration platform: Service Manager
Our customers communicate with us using Service Manager
It makes it possible to plan, track, control and report on all objectives, projects and activities

30 The application
The application has a number of text fields

31 The application
It also has attachments

32 The searching needs
We have developed a complex search using LIKE, but...
– No search in attachments
– No score
– No boolean operators
– No chance to peek at fragments of the text found

33 The solution
We have created 2 Oracle Text indexes:
– A multi-column index on the columns Name, Description and Notes
– A file index on the attachments
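The presentation does not show how these two indexes were defined. The following is only a guess at what such a setup might look like: the preference and index names are hypothetical, the column list simply follows the slide text, and the indexed columns dummy_ctxindx and id_context are taken from the queries in the presentation.

```sql
BEGIN
  -- Hypothetical preference names; the column list is an assumption based on the slide.
  ctx_ddl.create_preference('sm_multi_ds', 'MULTI_COLUMN_DATASTORE');
  ctx_ddl.set_attribute('sm_multi_ds', 'COLUMNS', 'name, description, notes');

  -- For a file datastore, the indexed column holds the path of each file.
  ctx_ddl.create_preference('sm_file_ds', 'FILE_DATASTORE');
END;
/

-- The index is created on a placeholder column; the datastore supplies the real text.
CREATE INDEX sm_services_ctx ON sm_services(dummy_ctxindx)
  INDEXTYPE IS CTXSYS.CONTEXT
  PARAMETERS ('DATASTORE sm_multi_ds');

-- Binary attachments (Word, PDF, ...) would also need a FILTER preference in
-- PARAMETERS; which one to name depends on the Oracle version.
CREATE INDEX sm_upload_ctx ON sm_upload(id_context)
  INDEXTYPE IS CTXSYS.CONTEXT
  PARAMETERS ('DATASTORE sm_file_ds');
```

With a multi-column datastore the index is not maintained when only the listed columns change, so applications usually also touch the placeholder column on every update.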

34 The solution
We searched in both indexes
After finding the results, we highlighted the portions of the text containing them
SELECT score(1) AS relevance, s.service_id
FROM sm_services s
WHERE CONTAINS(s.dummy_ctxindx, :srch, 1) > 0
UNION ALL
SELECT score(2) AS relevance, s.service_id
FROM sm_services s, sm_upload a
WHERE s.id = a.service_id
AND CONTAINS(a.id_context, :srch, 2) > 0
ORDER BY relevance DESC, service_id

35 The result
Searched text – stem
The score
Link to open the file
Link to open the application at the item containing the attachment
Highlighted portions of the text

36 Indexing performance
522 – number of documents
178 MB – total size of the documents
15 min – indexing time
86162 – number of different words
13 MB – size of the index

37 Searching performance
Standard search -> Time: 10.56 sec
SELECT id FROM sm_services WHERE UPPER(name) LIKE '%MANAGER%' OR UPPER(customer_descr) LIKE '%MANAGER%' OR UPPER(supplier_note) LIKE '%MANAGER%'
Oracle Text search -> Time: 0.20 sec
SELECT id FROM sm_services WHERE CONTAINS(dummy_ctxindx, 'manager', 1) > 0
50 times faster!

38 Summary
Fully integrated with the database
Indexes everything...
...located everywhere
Powerful text search capabilities
Oracle Text talks German
"Google-izes" your application

39 Want to know more?
Company: Semantec GmbH.
Name: Krasen Paskalev, Armin Singer, Peter Kopecki
Address: Benzstr. 32, D-71083 Herrenberg, Germany
Telephone / Fax: +49(7032)9130-0, +49(7032)9130-12, +49(7032)9130-22
E-Mail: krasen.paskalev@semantec.bg, singer@semantec.de
Internet: www.semantec.de
Meet us here: booth 2C on the ground floor

