Who Needs All Those Indexes? One is Enough
Bruce Lindsay, IBM Almaden Research Center



Stored Data is Heterogeneous
Most stored data is NOT well structured
  – Text & semi-structured
  – Sparse, multi-valued, & multi-occurrence attributes
Much value latent in un-structured data
Text analytic tools can extract value
  – Beyond the words: names, roles, concepts, …
Text analytics: searching for meaning in the content
  – Semantic & knowledge driven analysis
  – Expensive: big dictionaries, byte-by-byte, big inputs and outputs
  – Stateless → easy scale-out

Text Analytics
[Diagram labels: Data Source, Object, Dictionary, analytic1, analytic2, Attributes/Values, to Index]
Derive { } from inputs
  – Language, words (stems, part-of-speech, …)
  – Context (title, bold, anchor text, …)
  – Concepts (person, organization, role, product, …)
  – Classification (complaint, fraud, spam, xxx, …)
  – Meta-data (to/from, subject, date, title, abstract, reference, …)
Domain- and customer-specific analysis offers most value
Analytics-produced attributes induce index schema
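The derivation step above can be sketched in a few lines. This is a minimal illustration, not the system described in the talk: the function names (`word_analytic`, `metadata_analytic`, `complaint_classifier`, `run_analytics`, `induced_schema`), the dict-shaped object, and the tiny complaint dictionary are all assumptions made for the example. It shows stateless analytics emitting (attribute, value) pairs and the index schema falling out of whichever attributes the chosen analytics produce.

```python
import re
from typing import Callable, Dict, Iterable, List, Tuple

Pair = Tuple[str, str]                      # (attribute, value)
Analytic = Callable[[Dict], Iterable[Pair]]  # stateless: object in, pairs out

def _words(obj: Dict) -> List[str]:
    return re.findall(r"[a-z0-9]+", obj.get("body", "").lower())

def word_analytic(obj: Dict) -> Iterable[Pair]:
    # One posting per word occurrence in the body text.
    for w in _words(obj):
        yield ("word", w)

def metadata_analytic(obj: Dict) -> Iterable[Pair]:
    # Copy through whichever meta-data fields the object carries.
    for field in ("from", "to", "subject", "date"):
        if field in obj:
            yield (field, str(obj[field]))

# Hypothetical dictionary; a real analytic would use a much larger one.
COMPLAINT_TERMS = {"refund", "broken", "complaint"}

def complaint_classifier(obj: Dict) -> Iterable[Pair]:
    # Toy dictionary-driven classification.
    if COMPLAINT_TERMS & set(_words(obj)):
        yield ("class", "complaint")

def run_analytics(obj: Dict, analytics: List[Analytic]) -> List[Pair]:
    # Each analytic sees only the object, so calls can run on any node.
    pairs: List[Pair] = []
    for analytic in analytics:
        pairs.extend(analytic(obj))
    return pairs

def induced_schema(pairs: Iterable[Pair]) -> List[str]:
    # The "schema" is simply the set of attributes the analytics emitted.
    return sorted({attr for attr, _ in pairs})
```

Swapping in a different list of analytics changes the attributes that reach the index without touching the stored object, which is one reading of "analytics-produced attributes induce index schema".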

Text Indexing
Logical index over MANY entries per object
  – Large index, even with aggressive compression
  – Non-transactional
Scale-out needed
  – Capacity: single index too big for one (commodity) node
  – Ingest throughput: concurrent insert to index fragments
  – Query response: fan-out / fan-in for query parallelism
Query
  – Predicates over matches
  – Match scoring: magic weighting of predicate importance & position
  – Query planning & optimization probably needed
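A rough sketch of the fragment and fan-out / fan-in idea follows. Everything here is assumed for illustration: the class names `IndexFragment` and `ScaledOutIndex`, routing by CRC32 of the object id, conjunctive predicates, and a score that is simply a per-attribute weight times the match count. The slides do not say how fragments are chosen or how scores are computed, and a real deployment would query fragments in parallel rather than in a loop.

```python
import heapq
import zlib
from collections import defaultdict
from typing import Dict, List, Tuple

Pair = Tuple[str, str]  # (attribute, value)

class IndexFragment:
    """One partition of the logical index, e.g. one commodity node."""

    def __init__(self) -> None:
        # (attribute, value) -> object id -> number of occurrences
        self.postings = defaultdict(lambda: defaultdict(int))

    def add(self, obj_id: str, pairs: List[Pair]) -> None:
        for pair in pairs:
            self.postings[pair][obj_id] += 1

    def search(self, predicates: List[Pair],
               weights: Dict[str, float]) -> List[Tuple[float, str]]:
        # Keep only objects that match every predicate.
        matching = None
        for pred in predicates:
            ids = set(self.postings.get(pred, {}))
            matching = ids if matching is None else matching & ids
        hits = []
        for obj_id in matching or ():
            # Assumed scoring: attribute weight times occurrence count.
            score = sum(weights.get(pred[0], 1.0) * self.postings[pred][obj_id]
                        for pred in predicates)
            hits.append((score, obj_id))
        return hits

class ScaledOutIndex:
    def __init__(self, n_fragments: int) -> None:
        self.fragments = [IndexFragment() for _ in range(n_fragments)]

    def insert(self, obj_id: str, pairs: List[Pair]) -> None:
        # Each object lands in exactly one fragment, so inserts to
        # different fragments do not contend with each other.
        shard = zlib.crc32(obj_id.encode("utf-8")) % len(self.fragments)
        self.fragments[shard].add(obj_id, pairs)

    def query(self, predicates: List[Pair], weights: Dict[str, float],
              k: int = 10) -> List[Tuple[float, str]]:
        # Fan-out: ask every fragment. Fan-in: merge partial results.
        partial = [hit for frag in self.fragments
                   for hit in frag.search(predicates, weights)]
        return heapq.nlargest(k, partial)
```

Because an object's postings all live in one fragment, each fragment can evaluate the full predicate list locally and the fan-in step only has to merge ranked hits.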

What about Data Processing?
select / project / join / aggregate
Add “value” postings to index for keys and measures
  – Select: { } → → {obj1}
  – Project: { } → → {val2}
  – Join: { } → → {obj2}
  – Project: { } → → {measVal}
  – Aggregation: sum({measVal})
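The select / project / join / aggregate chain above can be mimicked with a toy index that keeps "value" postings next to the ordinary ones. This is only a sketch under assumptions: the `ValueIndex` class, its two dictionaries, and the customer/order attribute names (`region`, `cust_id`, `order_cust`, `amount`) are invented for the example and are not taken from the talk.

```python
from collections import defaultdict
from typing import Any, Iterable, List, Set

class ValueIndex:
    def __init__(self) -> None:
        # Ordinary postings: (attribute, value) -> object ids
        self.by_pair = defaultdict(set)
        # "Value" postings: object id -> attribute -> values (keys, measures)
        self.by_obj = defaultdict(lambda: defaultdict(list))

    def add(self, obj_id: str, attr: str, value: Any) -> None:
        self.by_pair[(attr, value)].add(obj_id)
        self.by_obj[obj_id][attr].append(value)

    def select(self, attr: str, value: Any) -> Set[str]:
        # Predicate on (attribute, value) yields a set of objects.
        return set(self.by_pair.get((attr, value), set()))

    def project(self, obj_ids: Iterable[str], attr: str) -> List[Any]:
        # Read one attribute's values for the given objects.
        return [v for o in sorted(obj_ids)
                for v in self.by_obj.get(o, {}).get(attr, [])]

    def join(self, values: Iterable[Any], attr: str) -> Set[str]:
        # Look projected key values up again as predicates on another attribute.
        out: Set[str] = set()
        for v in values:
            out |= self.select(attr, v)
        return out

def total_for_region(idx: ValueIndex, region: str) -> int:
    obj1 = idx.select("region", region)     # Select  -> {obj1}
    val2 = idx.project(obj1, "cust_id")     # Project -> {val2}
    obj2 = idx.join(val2, "order_cust")     # Join    -> {obj2}
    meas = idx.project(obj2, "amount")      # Project -> {measVal}
    return sum(meas)                        # Aggregation: sum({measVal})
```

With customer c1 in region "west" (cust_id 1), customer c2 in "east" (cust_id 2), and orders of 30 and 12 for customer 1 plus 99 for customer 2, `total_for_region(idx, "west")` returns 42.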

Architecture
Obj → storeMgr
Query → queryPlanner → queryDriver → ranked results
[Diagram labels: ObjStore, Obj Queue, Indexer, Analytics, Index Fragment, file, … scale-out …]

Conclusions
Derived value from un-structured objects
  – Much value latent in un-structured data
  – Value extracted via analytic tools
  – Value captured in scalable index
  – Value exploited via query and data processing
Architecture
  – Index independent object store schema
  – Application choice of object analytics induces index schema
  – Scaled-out analytics and index