Text in Oracle The Search Platform and Ultra Search Omar Alonso, Senior Product Manager, Oracle Corp. Stefan Buchta, Principal Product Manager, Oracle.

Presentation transcript:

Text in Oracle: The Search Platform and Ultra Search. Omar Alonso, Senior Product Manager, Oracle Corp. Stefan Buchta, Principal Product Manager, Oracle Corp. NoCOUG, May 16th, 2001

Agenda  What is Oracle Text?  Introducing Oracle Text  Text in the database – Why Integration is Key  Performance and scalability  Ease of Use  Global Solutions  Search Quality  Specialized Indexes  XML  Document Services  Ultra Search  Summary

What is Oracle Text?  Formerly known as interMedia Text  Oracle Text adds powerful text search and intelligent text management capabilities to the Oracle database.  Oracle Text: – Is fully integrated with the database – Offers premier text search quality – Provides several advanced features for text management, document services, XML, etc. – Has the best set of internationalization features for multilingual text search applications.

Introducing Oracle Text – An example
create index description_idx on product_information(product_description)
  indextype is ctxsys.context;
select score(1), product_id, product_name
  from product_information
  where contains(product_description, 'monitor NEAR "high resolution"', 1) > 0
  order by score(1) desc;
SCORE(1) PRODUCT_ID PRODUCT_NAME
Monitor 21/HR
Monitor 17/HR
LCD Monitor 11/PM
Plasma Monitor 10/XGA
Monitor 21/HR/M
Monitor 17/HR/F
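
To make the proximity query above more concrete, here is a minimal Python sketch of a positional inverted index with a NEAR-style check. It is an illustration only and says nothing about how Oracle Text is implemented internally; the function names `build_index` and `near`, the `max_span` window, and the sample documents are all assumptions.

```python
import re
from collections import defaultdict

def build_index(docs):
    """Map each lowercase token to {doc_id: [token positions]}."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for pos, token in enumerate(re.findall(r"\w+", text.lower())):
            index[token][doc_id].append(pos)
    return index

def near(index, term_a, term_b, max_span=10):
    """Return sorted ids of docs where both terms occur within max_span tokens."""
    postings_a = index.get(term_a, {})
    postings_b = index.get(term_b, {})
    hits = []
    # Only documents containing both terms can match.
    for doc_id in postings_a.keys() & postings_b.keys():
        if any(abs(a - b) <= max_span
               for a in postings_a[doc_id]
               for b in postings_b[doc_id]):
            hits.append(doc_id)
    return sorted(hits)

# Hypothetical product descriptions, not taken from the slides.
docs = {
    1: "21 inch monitor with high resolution display",
    2: "keyboard and mouse bundle",
    3: "monitor stand, sold separately from any resolution upgrade kit",
}
index = build_index(docs)
```

Calling `near(index, "monitor", "resolution", max_span=4)` keeps only the document where the two words sit close together; widening the window admits the looser match as well.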

Integration with the database  The attempt to separate text and normal business (structured) data fails: – Cost – Complexity – High latency of development and deployment – Performance

No Integration - Separate Everything. Diagram labels: Application; Repository; Index; Search Engine (API); Oracle Database; File System; B-Tree; Inverted; SQL; C API

Full Integration – text, index, API, optimizer. Diagram labels: Application; Repository; Index; Search Engine (API); Oracle Database; B-Tree; SQL

Integration Benefits  Low cost  Low complexity  High performance  High integrity  Manageability  Leveraging existing skills

Oracle Uses Oracle Text  Oracle Internet File System  Oracle Portal  Oracle CRM  Oracle E-Business Suite  Oracle eXchange  Ultra Search  Oracle.com  OTN

Oracle Internet File System

Oracle E-Business Suite

Performance – illustration  Large doc set – 100 GB (20 million web pages)  Hardware: Enterprise Sparc  Task: web query – Web-style query syntax – 2-3 words – Return first 100 hits  40 queries/second  90% of requests take < 0.5 second  7 hours to create index
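
The figures on this slide imply some derived rates that are easy to check. The short Python calculation below is only back-of-the-envelope arithmetic on the numbers quoted above; it assumes decimal units (1 GB = 1000 MB) and a constant load, neither of which the slide states.

```python
# Figures quoted on the slide.
PAGES = 20_000_000
CORPUS_GB = 100
INDEX_HOURS = 7
QUERIES_PER_SECOND = 40

index_seconds = INDEX_HOURS * 3600

# Indexing throughput implied by "7 hours to create index".
pages_per_second = PAGES / index_seconds
mb_per_second = CORPUS_GB * 1000 / index_seconds  # assumes 1 GB = 1000 MB

# Average page size implied by 100 GB over 20 million pages.
avg_page_kb = CORPUS_GB * 1_000_000 / PAGES

# Sustained query volume if 40 queries/second held for a full day.
queries_per_day = QUERIES_PER_SECOND * 86_400

print(f"{pages_per_second:.0f} pages/s, {mb_per_second:.2f} MB/s indexed")
print(f"{avg_page_kb:.1f} KB average page, {queries_per_day:,} queries/day")
```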

Performance – Query throughput  Oracle Text vs one of the best-known specialist Text search engines

Ease of Use, Ease of Development  Simple SQL and PL/SQL interface – Can be used by any developer that knows SQL – Can be called by any tool that knows SQL – Using any language: Java, JSP, PL/SQL, C, etc.  Choice of datastores – Stored in the database – Stored in the file system – Stored on the web (URL) – User-defined datastore

Global Solutions  Basic indexing/search works in any NLS language  Special support for Japanese, Chinese, Korean  Theme search and services available in any single-byte, white space-delimited language  Can mix languages, character sets in a single column  Can query across languages

Chinese, Japanese, Korean Text
Character sets:
  Japanese: JA16SJIS, JA16EUC
  Simplified Chinese: GBK, GB
  Traditional Chinese: BIG5, EUC, TRIS
  Korean: KO16KSC5601
  Unicode: UTF8
Lexing:
  Lexical segmentation for Japanese, Chinese
  Morphological segmentation for Korean

Multilingual Search

Cross-language queries  Can mix languages and character sets within a document collection (e.g. Chinese and English documents)  Can use English to query e.g. Chinese terms, or vice versa.  Query a term which is expressed differently in simplified and traditional Chinese.
select score(1), product_id, product_name
  from product_information
  where contains(product_description, 'TRSYN(monitor, Chinese)', 1) > 0
  order by score(1) desc;
Find products whose description contains 'monitor' or its Chinese equivalents.
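
The idea behind the translation-synonym query above can be sketched in a few lines of Python: expand the query term with its translations from a thesaurus, then match any of the expanded forms. This is a toy illustration, not the Oracle Text thesaurus API; the `THESAURUS` structure, the function names, and the sample entries are assumptions made for the example.

```python
# Hypothetical thesaurus: term -> language -> translation synonyms.
THESAURUS = {
    "monitor": {"chinese": ["显示器", "顯示器"]},  # simplified and traditional forms
}

def trsyn(term, language, thesaurus=THESAURUS):
    """Return the term followed by its translations in the given language."""
    translations = thesaurus.get(term.lower(), {}).get(language.lower(), [])
    return [term] + translations

def matches(text, term, language):
    """True if the text contains the term or any of its translations."""
    lowered = text.lower()
    return any(candidate.lower() in lowered for candidate in trsyn(term, language))
```

With this expansion a single English query term finds documents written in either Chinese script as well as English ones.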

Search Quality  Exact word  Boolean expression  Phrase  Proximity  Fuzzy  Stemming  Wildcards – Prefix, substring index  Thesaurus, multiple Thesauri  ABOUT search  Theme (concept-based) search  Accumulate scores  Term weighting  Advanced XML search  XPath support  Query Feedback

ABOUT – themes and theme queries
"We ordered a bottle of chardonnay to go with the fish, and cabernet sauvignon for the steak …"
select id from docs where contains(text, 'ABOUT(wine)') > 0
  The knowledge base allows Oracle Text to associate words and concepts.  The knowledge base contains over 400,000 concepts.  You can extend the knowledge base to include – Words and concepts from your specialist field, e.g. medicine – Associations of words and spellings to guide novice/internet users

Catalog Index  Optimized for response time on small text fields  True transactional DML  Supports structured query, including range query  Subset of CONTEXT query language – No fuzzy, stemming, about – User-friendly web-like query syntax

Classification  CTXRULE is an index type designed for classification/routing applications  Efficiently takes a document and finds matching queries  Diagram labels: Incoming documents; Classification application (compares against rules); Matched documents; Perform action
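
Classification inverts the usual search direction: the queries are stored and each incoming document is tested against them. The Python sketch below illustrates that idea with rules expressed as sets of required words. It is not the CTXRULE implementation; the `RULES` table, the category names, and the `classify` function are assumptions for illustration.

```python
import re

# Hypothetical rule table: category -> list of queries.
# Each query is a set of words that must ALL appear in the document.
RULES = {
    "hardware": [{"monitor"}, {"keyboard", "mouse"}],
    "wine": [{"chardonnay"}, {"cabernet", "sauvignon"}],
}

def classify(document, rules=RULES):
    """Return the sorted categories whose stored queries match the document."""
    tokens = set(re.findall(r"\w+", document.lower()))
    return sorted(
        category
        for category, queries in rules.items()
        # A category matches if any one of its queries is fully contained.
        if any(query <= tokens for query in queries)
    )
```

A routing application would then perform an action per returned category, for example forwarding the document to a mailbox.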

Prefix, substring index  Prefix and Substring are flavors of the CONTEXT index  Prefix will add more tokens to the CONTEXT index to efficiently process prefix searches (e.g. 'ora%')  Substring will add an index on substrings of each token, to efficiently process substring searches (e.g. '%oxy%')
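
A rough Python sketch of why the extra tokens help: if prefixes (or substrings) of each word are indexed ahead of time, a wildcard search becomes a direct lookup instead of a scan over every token. The function names and the length limits `min_len` and `max_len` are assumptions for this sketch, not Oracle Text parameters.

```python
from collections import defaultdict

def prefix_tokens(token, min_len=3, max_len=4):
    """Prefixes of the token with lengths from min_len to max_len."""
    return [token[:n] for n in range(min_len, min(max_len, len(token)) + 1)]

def substrings(token, min_len=3):
    """All substrings of the token with at least min_len characters."""
    return {
        token[i:j]
        for i in range(len(token))
        for j in range(i + min_len, len(token) + 1)
    }

def build_prefix_index(tokens):
    """Map each stored prefix to the set of full tokens that start with it."""
    index = defaultdict(set)
    for token in tokens:
        for prefix in prefix_tokens(token):
            index[prefix].add(token)
    return index

# A search for 'ora%' becomes one dictionary lookup.
prefix_index = build_prefix_index(["oracle", "oracular", "orbit", "text"])
```

The trade-off is index size: each token contributes several extra entries, which is why such options are typically enabled only when wildcard queries are common.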

Storing XML in Oracle  Decomposition – decompose documents into atomic elements – store elements in columns/rows – compose XML documents using SQL  xmltype – store XML as xmltype, use xmltype methods  Store as LOB or varchar – Store XML as-is, in a LOB or VARCHAR – Search using Oracle Text section searching or XPath

Content search and XML
  Create index
create index BOOKINDEX on BOOKS(text) indextype is ctxsys.context;
  Query by content
select price from books where contains(text, 'Harry Potter') > 0 order by price desc;
  Create index to include section info
create index BOOKINDEX on BOOKS(text) indextype is ctxsys.context
  parameters ('section group my_auto_section_group');
  Limit content search to a section of text
select price from books where contains(text, 'Harry Potter within title') > 0 order by price desc;
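
To show what restricting a search to a section means, the following Python sketch checks whether a phrase occurs inside a named XML element rather than anywhere in the document. It uses the standard `xml.etree.ElementTree` module and is only an analogy for the WITHIN operator; the `contains_within` function and the sample documents are assumptions.

```python
import xml.etree.ElementTree as ET

def contains_within(xml_text, phrase, section):
    """True if the phrase occurs in the text of any <section> element."""
    root = ET.fromstring(xml_text)
    needle = phrase.lower()
    return any(
        needle in "".join(element.itertext()).lower()
        for element in root.iter(section)
    )

# Hypothetical documents: the phrase is in the title of one, the review of the other.
book_a = "<book><title>Harry Potter and the Goblet of Fire</title><review>A great read</review></book>"
book_b = "<book><title>Fantasy Classics</title><review>Better than Harry Potter</review></book>"
```

A plain content search would return both books; limiting the search to the `title` section returns only the first.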

Advanced XML searches  Nested section search The Matrix Introduction to Matrix Algebra select price from media where contains(desc, ‘matrix within title within movie’)>0  Search inside attribute values Bridge of Birds select title from books where contains(text, ‘Hughart within

More advanced XML searches  map multiple tags to same name The Diamond Age or, A Young Lady’s Illustrated Primer (map H1 and H2 to section name of “headline”) select author from articles where contains(text, ‘Diamond within headline’)>0  doctype limiters to handle tag collisions … … … 123 Meheula Pkwy … map (foo)address to “ ”, (bar)address to “address”

Document Services  Extract Themes (major concepts) – Extract hierarchical structure  Extract Gist – Generic or Point-of-View – Sentence- or Paragraph- level  View a document as HTML – Highlight search terms, highlight navigation  Return results in a table or a PL/SQL table  Basis for Clustering, More Like This, …
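
As one small illustration of the document services listed above, this Python sketch renders plain text as HTML with the search terms highlighted. It is a generic example, not the Oracle Text document services API; the `highlight` function and its `tag` parameter are assumptions.

```python
import html
import re

def highlight(text, terms, tag="b"):
    """Escape text for HTML and wrap whole-word matches of terms in <tag>."""
    if not terms:
        return html.escape(text)
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(term) for term in terms) + r")\b",
        re.IGNORECASE,
    )
    parts, last = [], 0
    for match in pattern.finditer(text):
        # Escape the text between matches separately so the added tags survive.
        parts.append(html.escape(text[last:match.start()]))
        parts.append(f"<{tag}>{html.escape(match.group(0))}</{tag}>")
        last = match.end()
    parts.append(html.escape(text[last:]))
    return "".join(parts)
```

Matching is done on the raw text before escaping, so a term such as "amp" cannot accidentally match inside an HTML entity.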

Summary  Fully integrated with the database  Premier text search quality  Advanced features for text management, document services, and XML.  Best multilingual features in the market.

Introducing Oracle Ultra Search

Issues in Corporate Search  Information Management Crisis – Explosive Growth of Information flowing over corporate Intranets. – Knowledge scattered across: IT repositories, billions of documents, and data fragments. – Non-Uniform Information  Structured in databases.  Unstructured - Word processing doc., presentations.

Impacts of Bad Search  Customers - Turn to competitor’s Website.  Employees - Waste time and money on useless searches.  Oracle Ultra Search – Solves problem of finding relevant information. – Across your company’s many disparate information repositories.

Oracle Ultra Search  Out-of-the-Box solution that – Searches text across multiple repositories  Databases, HTML Pages, Files, Mail Servers. – Provides the best relevance ranking and globalization in the industry. – Provides value added Portal functionality. – Presents Web style interface.  Built onto Oracle’s proven, reliable Text Retrieval software and Oracle9i server.

Oracle Ultra Search Docum. Title Relevance

Ultra Search Applications  Portal Search – Most powerful search for Oracle9iAS Portal. – Build your own portal. – Special 'Portlet' crawls inside and outside of Portal Repository.  Canned Web Search for Oracle Text  Library or Archive Search  Content Management Platform Search

Search Multiple Repositories

Value Added Portal Functionality  ‘Canned’ Web-Style Search  Aggregates Information For Indexing – Documents stay in their own repositories. – Search returns ‘normalized’ results, uniformly ranked by relevance.  Organize & Categorize Content From Multiple Repositories – Extract valuable metadata. – Improve search by narrowing through ‘fielded search’.

‘Out-of-the-Box’ Web-Style Search  Oracle Text Application – Uses public Text interfaces. – Enhanced with expertise about gathering and indexing information for best quality search. – No coding against low level API’s.  Oracle Text Retrieval Engine – Highly integrated with Oracle9i server. – Best interoperability with dynamic data. – Scalability and Reliability of Oracle platform.

Aggregates Information  Stages: Gather, Analyze, Make Queryable, Maintain  Gather – Crawls Web, corporate repositories  Analyze – Create index required for querying, filter  Make Queryable – Embed through API  Maintain – Schedule crawling – Easy Administration

Powerful Fielded Search  Narrow search to parts of document - title, body, name of author.  Extract and use repository metadata – Word processing documents: Author, Title. – Databases: Identify Columns. – Header/Body/Attachment.  Unify repositories in common, logical terms. – Uniform set of results, ranked by overall relevance.
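
The unification step described above can be sketched as a mapping from each repository's native metadata names to a shared set of logical fields, followed by a search restricted to one field. The Python below is a toy model; the repository names, native field names, `FIELD_MAP`, `normalize`, and `fielded_search` are all hypothetical and are not part of Ultra Search.

```python
# Hypothetical mapping: repository -> {native field name: common field name}.
FIELD_MAP = {
    "word_docs": {"Author": "author", "Title": "title"},
    "hr_database": {"EMP_NAME": "author", "DOC_SUBJECT": "title"},
    "mail": {"From": "author", "Subject": "title"},
}

def normalize(source, record, field_map=FIELD_MAP):
    """Rename a record's native metadata fields to the common field names."""
    mapping = field_map[source]
    return {
        common: record[native]
        for native, common in mapping.items()
        if native in record
    }

def fielded_search(records, field, term):
    """Return normalized records whose given field contains the term."""
    needle = term.lower()
    return [r for r in records if needle in r.get(field, "").lower()]

records = [
    normalize("word_docs", {"Author": "Alonso", "Title": "Text in Oracle"}),
    normalize("mail", {"From": "Buchta", "Subject": "Ultra Search demo"}),
]
```

Once every record uses the same field names, one query such as `fielded_search(records, "author", "buchta")` works across all repositories.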

Flexible Metadata Mapping Search Term Repositories Metadata Fields

Ultra Search Architecture

Architecture  Simple, Robust Architecture Built on: – Oracle9i Server Platform – Oracle’s Text Retrieval Engine  Flexible Deployment – Server-Tier – Mid-Tier

Ultra Search Components  Crawler  Server Component  Query API & Application  Administration Tool  Mail API

Ultra Search Crawler  Multi-Threaded JAVA process. – Gathers documents from repositories you specify on a set schedule. – Maps and analyzes link relationships. – Filters (150+) Non-HTML Documents, extracts valuable metadata. – Indexes documents and data fragments.  Flexible Configuration – Run on one or more machines: ‘Remote crawling’

Ultra Search Crawler  Set Inclusion/Exclusion Domains – Limit crawling to corporate net or specific sections of it.  Maintain Fresh Search Results – Set crawling schedules for each Web site or repository.
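
A crawler's inclusion/exclusion check can be as simple as a host test plus a list of blocked URL prefixes. The Python sketch below shows one way to express it with the standard `urllib.parse` module; it is not the Ultra Search crawler's logic, and the `allowed` function, its parameters, and the example hosts are assumptions.

```python
from urllib.parse import urlparse

def allowed(url, include_domains, exclude_prefixes=()):
    """True if the URL's host is inside an included domain and no excluded prefix applies."""
    host = (urlparse(url).hostname or "").lower()
    # Accept the domain itself or any of its subdomains, but not look-alikes.
    in_scope = any(
        host == domain or host.endswith("." + domain)
        for domain in include_domains
    )
    excluded = any(url.startswith(prefix) for prefix in exclude_prefixes)
    return in_scope and not excluded
```

Comparing on a leading dot keeps a host like `badexample.com` from slipping through a rule written for `example.com`.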

Crawling Abilities  Web Sites (HTTP Protocol)  Database Tables – Oracle and any ODBC-compliant database. – Local (Ultra Search instance) or remote database. – Crawls both full-text and 'fielded' columns.  Files (file:// Protocol) – Ultra Search filters, extracts text and metadata.  Emails (IMAP Protocol) – Crawl and index mailing lists through IMAP.

Ultra Search Query API  ‘Embed’ Ultra Search in your Portal or Application. – Customize look-and-feel to your requirements. – Easy integration with your application.  API for JAVA (JSP) and PL/SQL (PSP).  Returns data with or without HTML markup. – Build: Basic Search Form, Search Result Form...  Includes Highly Functional Query Application.

Java Query API Illustration

PL/SQL Query API Illustration

Administration Environment  Browser-based, Self-Service Web Application.  Define Ultra Search Instances.  Configure and Schedule Crawler.  Set Query Options To Narrow Searches. – Document Attributes (e.g. TITLE, AUTHOR). – Define ‘Data Source Groups’.  Manage Administrative Users.

Administration Environment

Summary  Eliminate the Chaos Inside Your Firewalls !  Oracle Ultra Search – Crawls, Indexes, and makes searchable your Intranet. – Provides Web-style search without the need for coding. – Organizes, categorizes, and unifies content from multiple repositories. – Leverages Oracle9i platform - reliable, scalable, always available.

Q & A: Questions and Answers