Benchmarking Oracle 8i Intermedia Text Background for this benchmark Interesting new features in OIMT Benchmarking, methodology and problems Results Conclusions.

Presentation transcript:

Benchmarking Oracle 8i Intermedia Text
- Background for this benchmark
- Interesting new features in OIMT
- Benchmarking, methodology and problems
- Results
- Conclusions
ODF Benchmarking Oracle8i Intermedia Text

Background for this benchmark
- The task of the thesis project
  - The EDMS Search Engine
- CERN's EDMS
  - CADIM/EDB, managing product data documents
  - MP5, managing physical components
  - Oracle 7, database platform
- Implement Oracle8i in the future?

Oracle 8i Intermedia
- A product embedded in Oracle8i
- Intermedia allows a unified technique for accessing various types of data such as:
  - Text
  - Documents
  - Images
  - Audio
  - Video

Features in Intermedia Text
- Database integration
  - Mixed queries against multiple text columns
  - Single SQL API
  - Indexes on most database columns
  - All index data in the database
  - Automatic triggers on text columns to detect changes
- Full-text search
  - Exact word and phrase search, operators, removing stopwords
  - XML and HTML document section searching

Features in Intermedia Text
- About search (all languages)
  - Parses any text search to perform an optimal search
  - Complements full-text search
- Theme identification (English only)
  - Identifies strong themes in documents using a "theme base"
  - "Thematic" information is put into the index
  - Theme search is available via the ABOUT operator
  - May be customized for specific terminology

Features in Intermedia Text
- Document services
  - View documents as plain text or in HTML format
  - Store documents in a database, in a file system or at a URL address
- Multilingual text search
  - Full-text search in most languages, including Japanese, Chinese and Korean
  - Support for all Oracle NLS character sets
  - Stemming and fuzzy search for Dutch, English, French, German and Spanish
  - Base-letter support and alternate spelling for Western European languages

Features in Intermedia Text
- To be investigated:
  - The text indexing technique
  - The new query operators
  - How to use this in the EDMS Search Engine

OIMT text indexes
- An OIMT text index can be created on:
  - varchar2 columns
  - Large object (LOB) columns
  - BFILE columns, indexing entire files
- Allowing queries like:
    select id from table where contains(column, 'Atlas') > 0;
      1  One of the detectors in the LHC ring is ATLAS
      2  The Atlasmountains are situated in the north of Africa
      -> 1
    … contains(column, 'Atlas inner detector') > 0;
      1  The problems with the ATLAS inner detector...
      2  The inner detector of ATLAS...
      -> 1
- Query optimization
  - Selects the best execution plan based on analyze table … compute statistics;
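The two example queries above match whole words and phrases, not substrings. A minimal Python sketch of that behaviour follows. It is not Oracle's implementation: it assumes a simplified tokenizer that lowercases the text and splits on non-alphanumeric characters, and the helper names `tokenize`, `contains_phrase` and `like_substring` are hypothetical. The substring test is case-sensitive, in the spirit of LIKE '%Atlas%'.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def contains_phrase(text, phrase):
    """True if the words of `phrase` occur consecutively as whole tokens in `text`."""
    words, wanted = tokenize(text), tokenize(phrase)
    n = len(wanted)
    return any(words[i:i + n] == wanted for i in range(len(words) - n + 1))

def like_substring(text, fragment):
    """Case-sensitive substring test, in the spirit of LIKE '%fragment%'."""
    return fragment in text

# The rows from the first example query on the slide.
word_rows = {
    1: "One of the detectors in the LHC ring is ATLAS",
    2: "The Atlasmountains are situated in the north of Africa",
}
# The rows from the phrase example on the slide.
phrase_rows = {
    1: "The problems with the ATLAS inner detector...",
    2: "The inner detector of ATLAS...",
}

word_hits = [r for r, t in word_rows.items() if contains_phrase(t, "Atlas")]
like_hits = [r for r, t in word_rows.items() if like_substring(t, "Atlas")]
phrase_hits = [r for r, t in phrase_rows.items()
               if contains_phrase(t, "Atlas inner detector")]

print(word_hits)    # [1]  whole-word match ignores "Atlasmountains"
print(like_hits)    # [2]  substring match finds "Atlasmountains" but not "ATLAS"
print(phrase_hits)  # [1]  only row 1 has the three words in sequence
```

Under these assumptions the word query and the substring query return different rows for the same search term, which is worth keeping in mind when comparing CONTAINS with LIKE.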

OIMT text indexes
- An OIMT text index is an "inverted" index
    1  the LHC accelerator
    2  the LEP accelerator
    ->
    accelerator: row #1 position 2, row #2 position 2
    LHC: row #1 position 1
    LEP: row #2 position 1
- A text index is built up of five objects
  - Four tables (I, K, N, R) and one b-tree index (X)
- Create OIMT text index statement
    create index myindex on table(column) indextype is ctxsys.context;
  - Note that the table must have a primary key
- Updating, two choices
  - Automatically, by starting ctxsrv and committing
  - Rebuild "manually"
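The inverted index in the example above can be reproduced with a short Python sketch. This only illustrates the data structure, not how OIMT stores it in its tables. It assumes that "the" is a stopword, that positions are counted from 0 and that stopwords still occupy a position, which is what the positions on the slide suggest. The name `build_inverted_index` is hypothetical.

```python
from collections import defaultdict

STOPWORDS = {"the"}  # assumed stoplist for this sketch

def build_inverted_index(rows):
    """Map each non-stopword token to a list of (row id, word position) pairs."""
    index = defaultdict(list)
    for row_id, text in rows.items():
        for position, token in enumerate(text.lower().split()):
            if token not in STOPWORDS:
                index[token].append((row_id, position))
    return dict(index)

rows = {1: "the LHC accelerator", 2: "the LEP accelerator"}
index = build_inverted_index(rows)

print(index["accelerator"])  # [(1, 2), (2, 2)]
print(index["lhc"])          # [(1, 1)]
print(index["lep"])          # [(2, 1)]
```

A word query then becomes a dictionary lookup instead of a scan over every row, which is the reason an inverted index is expected to outperform a full table scan.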

OIMT text indexes
The indexing "pipeline": Datastore -> Filter -> Sectioner -> Lexer
- Datastore
  - Loops over the rows and reads data out of the column
  - Or from remote servers, accessed via http or ftp, via pointers
- Filter
  - Transforms the data into a text representation
  - Output can be in HTML or XML
- Sectioner
  - Takes the output from the filter and converts it to plain text
  - Different sectioners for different formats
  - Detects important section tags
- Lexer
  - Separates text into tokens and words
  - Removes stopwords
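The four stages can be pictured as a chain of functions. The Python sketch below is a toy model under stated assumptions, not OIMT code: the datastore is a dictionary, the filter only decodes bytes, the sectioner strips HTML-like tags with a regular expression, and the stoplist is invented. All function names are hypothetical.

```python
import re

STOPWORDS = {"the", "of", "a", "is"}  # assumed stoplist for this sketch

def datastore(table):
    """Loop over the rows and yield (row id, raw column data)."""
    for row_id, data in table.items():
        yield row_id, data

def filter_stage(raw):
    """Turn stored data into a text representation (here: decode bytes)."""
    return raw.decode("utf-8") if isinstance(raw, bytes) else raw

def sectioner(marked_up):
    """Return plain text plus the names of the opening section tags seen."""
    sections = re.findall(r"<\s*([A-Za-z][A-Za-z0-9]*)", marked_up)
    plain = re.sub(r"<[^>]+>", " ", marked_up)
    return plain, sections

def lexer(plain):
    """Split plain text into lowercase tokens and drop stopwords."""
    tokens = re.findall(r"[a-z0-9]+", plain.lower())
    return [t for t in tokens if t not in STOPWORDS]

def run_pipeline(table):
    """Run every row through datastore -> filter -> sectioner -> lexer."""
    result = {}
    for row_id, raw in datastore(table):
        plain, sections = sectioner(filter_stage(raw))
        result[row_id] = (lexer(plain), sections)
    return result

table = {1: b"<title>The magnets of the LHC</title>"}
pipeline_output = run_pipeline(table)
print(pipeline_output)  # {1: (['magnets', 'lhc'], ['title'])}
```

Each stage only sees the output of the previous one, which mirrors how the preference classes let one stage be swapped without touching the others.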

OIMT text indexes
- The preference system allows customization of text indexes
- Classes of "customizable" objects:
  DATASTORE, FILTER, SECTION_GROUP, LEXER, WORDLIST, STOPLIST, STORAGE
- Create a preference to customize an OIMT text index:
    execute ctx_ddl.create_preference('my_pref', 'BASIC_LEXER');
    execute ctx_ddl.set_attribute('my_pref', 'INDEX_THEMES', 'YES');
    create index my_index on table(column) indextype is ctxsys.context parameters('LEXER my_pref');
- Default preferences
  - Unset preferences get their value from the default system

OIMT query operators
- Scoring operators: WEIGHT(*), THRESHOLD(>), ACCUM(,), MINUS(-)
  Examples:
    … contains(column, '(edms*2) AND cms') > 10;
    … contains(column, 'edms, cms') > 0;
      edms and cms together score higher than each word alone
- Word expansion operators: WILDCARD(%), FUZZY(?), STEM($), SOUNDEX(!), EQUIV(=)
  Examples:
    … contains(column, '?mignets') > 0;
      Fuzzy corrects misspellings
      1  The magnets of the LHC accelerator…
      -> 1
    … contains(column, '$go') > 0;
      Stem considers e.g. go, went, gone as the "same" word, as well as plurals, magnet=magnets
      1  I will go to the cinema.
      2  I went back home.
      3  The train has gone.
      -> 1,2,3
    … contains(column, '!dog') > 0;
      Soundex retrieves all words which sound alike
      1  I have a dog.
      2  Someone has dug a hole.
      -> 1,2
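The Soundex example above can be checked with a small Python sketch of the classic American Soundex coding. Whether OIMT uses exactly this variant is not stated on the slides, so treat it as an assumption. The function names `soundex` and `soundex_hits`, and the third row (borrowed from the stem example), are only there for illustration.

```python
import re

# Letter-to-digit table of the classic Soundex coding.
CODES = {}
for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")):
    for letter in letters:
        CODES[letter] = digit

def soundex(word):
    """Return the four-character Soundex code of `word` ('' for no letters)."""
    word = "".join(ch for ch in word.lower() if ch.isalpha())
    if not word:
        return ""
    encoded = word[0].upper()
    previous = CODES.get(word[0], "")
    for ch in word[1:]:
        digit = CODES.get(ch, "")
        if digit and digit != previous:
            encoded += digit
        if ch not in "hw":  # h and w do not separate equal codes
            previous = digit
    return (encoded + "000")[:4]

def soundex_hits(rows, term):
    """Row ids containing at least one word with the same code as `term`."""
    target = soundex(term)
    return [row_id for row_id, text in rows.items()
            if any(soundex(w) == target
                   for w in re.findall(r"[a-z]+", text.lower()))]

rows = {
    1: "I have a dog.",
    2: "Someone has dug a hole.",
    3: "The train has gone.",
}
print(soundex("dog"), soundex("dug"))  # D200 D200
print(soundex_hits(rows, "dog"))       # [1, 2]
```

Because vowels are ignored after the first letter, "dog" and "dug" collapse to the same code, which is why both rows are returned.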

OIMT query operators
- Proximity operator: NEAR(;)
  Example:
    … contains(column, 'NEAR((edms,cms), 4, TRUE)') > 0;
      1  EDMS is available for CMS, ATLAS, LHCb...
      2  The EDMS at CERN manages all product data documents of CMS
      -> 1
- Section limiting and theme operators: WITHIN, ABOUT
  - ABOUT is used as a "general" operator, optimizing the query by including stem ($) and theme search if available
- Thesaurus operator: SYN
  Example:
    SYN(dog) == {boxer} | {rottweiler} | {terrier}
- Boolean operators: AND, OR, NOT
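The NEAR example can be sketched in Python as a distance test on word positions. This is a simplification with assumed semantics: two terms are "near" when their positions differ by at most `max_span`, and `ordered=True` requires the first term to come before the second. Oracle's exact definition of the span may differ, and the function name `near` is hypothetical.

```python
import re

def near(text, first, second, max_span, ordered=False):
    """True if `first` and `second` occur within `max_span` word positions."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    first_pos = [i for i, t in enumerate(tokens) if t == first.lower()]
    second_pos = [i for i, t in enumerate(tokens) if t == second.lower()]
    for a in first_pos:
        for b in second_pos:
            if ordered and b < a:
                continue  # wrong order: `second` appears before `first`
            if abs(b - a) <= max_span:
                return True
    return False

rows = {
    1: "EDMS is available for CMS, ATLAS, LHCb...",
    2: "The EDMS at CERN manages all product data documents of CMS",
}
near_hits = [row_id for row_id, text in rows.items()
             if near(text, "edms", "cms", 4, ordered=True)]
print(near_hits)  # [1]
```

In row 1 the two terms are four positions apart, in row 2 they are nine positions apart, so only row 1 satisfies a span of 4.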

OIMT benchmarks
- Is SELECT … CONTAINS(column, 'word') > 0; faster than SELECT … LIKE '%word%'; ?
- How do indexes on entire files perform?
  - What is the price in terms of memory, storage and maintenance?
  - Is retrieval still fast?
- Is updating flexible?
- Do the query operators perform as expected?

Comparing CONTAINS and LIKE, retrieval times
[Figure: retrieval time [s] (y-axis) against table size [# of rows] (x-axis) for two series: OIMT's CONTAINS, and a full table scan using LIKE, as in EDMS Search today.]


Indexing entire files
- Tests with smaller numbers of files (<1000) work fine
- Encountered serious problems when indexing up to 9000 files:
  - Security bug
    A user gets the database owner's OS privileges when accessing files through FILE DATASTORE. Special roles will be introduced in the version to manage this problem.
  - Certain Excel files are not indexed (bug)
    Said to be fixed in the version
  - Storage problems: tablespace, temp, shared pool, LOB segments
    An OIMT text index is complex; several storage parameters have to be extended, which was non-trivial.
  - Irrelevant error messages
    Sometimes when problems occur, irrelevant error messages or no error messages at all are returned, making troubleshooting difficult.

Indexing entire files
Statistics about the 9000 file index:
  Total index size:                          1.0 GB
  Number of files:                           8942
  Accumulated file size:                     4.4 GB
  Total index size / accumulated file size:  23.6 %
  Average index size per file:               115.8 KB/file
  Average file size:                         490.1 KB/file
  Creation time:                             7:06 hours
  Creation time per file:                    2.9 seconds/file
Updating: works fine, 2 seconds/inserted row
Querying: 1-3 seconds/query (depends on the query!)
Machine: SUN, 4 CPU, 300 MHz, 1.5 GB RAM
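The derived figures above follow from the per-file averages and the creation time. The short Python check below recomputes them, assuming 1 GB = 10^6 KB for the totals. The variable names are hypothetical.

```python
files = 8942
avg_index_kb = 115.8                   # average index size per file
avg_file_kb = 490.1                    # average file size
creation_seconds = 7 * 3600 + 6 * 60   # 7:06 hours

# Index size as a percentage of the indexed file size.
index_ratio_pct = round(avg_index_kb / avg_file_kb * 100, 1)
# Index creation time per file.
seconds_per_file = round(creation_seconds / files, 1)
# Totals, assuming 1 GB = 10^6 KB.
total_index_gb = round(avg_index_kb * files / 1e6, 1)
total_file_gb = round(avg_file_kb * files / 1e6, 1)

print(index_ratio_pct)   # 23.6
print(seconds_per_file)  # 2.9
print(total_index_gb)    # 1.0
print(total_file_gb)     # 4.4
```

The 23.6 % ratio comes from the unrounded averages. Dividing the rounded totals (1.0 GB by 4.4 GB) would give roughly 22.7 % instead.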

A view of the test table
[Figure: the test table; the index is created on its file reference column.]

Conclusions
- Indexing entire files works, but is not trivial to do in version 8.1.6; to be improved in 8.1.7?
- Querying is fast, both for indexed varchar2 columns and file columns
- Updating text indexes can be done automatically
- Index preferences may be customized
  - Multilingual text indexes
- Query operators
  - Work mainly as expected, providing powerful tools for an advanced search engine
  - Wildcards are very slow when executed on file-indexed columns
- O8i IMT
  - A very interesting "platform" for the future EDMS Search Engine
  - Will provide tools for fast full-text searches, with both simple and advanced queries

The EDMS Search Engine Interface
- Application interfaces are non-trivial to create
  - It is important to make the interface user-friendly
  - Many hits may be retrieved from full-text searches; limiting these may be crucial
- Some ideas:
  - Ask the users for hints
  - Keep the interface as simple as possible, avoid too many graphical objects etc.
  - A simple and an advanced search option
  - Menus for choosing query operators, scoring etc.

An OIMT Test Search Engine: 9000 indexed files from the EDMS production database