Who Needs All Those Indexes? One is Enough
Bruce Lindsay
IBM Almaden Research Center

Stored Data is Heterogeneous
Most stored data is NOT well structured
  – Text & semi-structured
  – Sparse, multi-valued, & multi-occurrence attributes
Much value latent in un-structured data
Text analytic tools can extract value
  – Beyond the words: names, roles, concepts, …
Text analytics: searching for meaning in the content
  – Semantic & knowledge driven analysis
  – Expensive: big dictionaries, byte-by-byte, big inputs and outputs
  – Stateless: easy scale-out

Text Analytics
[Diagram labels: Data Source, Object, Dictionary, analytic1, analytic2, Attributes/Values, to Index]
Derive { } from inputs
  – Language, words (stems, part-of-speech, …)
  – Context (title, bold, anchor text, …)
  – Concepts (person, organization, role, product, …)
  – Classification (complaint, fraud, spam, xxx, …)
  – Meta-data (to/from, subject, date, title, abstract, reference, …)
Domain- and customer-specific analysis offers the most value
Analytics-produced attributes induce the index schema
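
A minimal sketch of the idea on the slide above: stateless analytics turn each object into attribute/value pairs, and the set of attributes that the chosen analytics emit becomes the index schema. The analytic functions, the keyword list, and the `(attribute, value) -> object ids` index layout are illustrative assumptions, not the system described in the talk.

```python
import re
from collections import defaultdict

# Hypothetical analytics: each maps an object's text to (attribute, value) pairs.
def word_analytic(text):
    return [("word", w.lower()) for w in re.findall(r"[A-Za-z]+", text)]

def classification_analytic(text):
    # Toy dictionary-driven classifier; the term list is an assumption.
    complaint_terms = {"refund", "broken", "complaint"}
    words = {w.lower() for w in re.findall(r"[A-Za-z]+", text)}
    return [("class", "complaint")] if words & complaint_terms else []

def derive_postings(obj_id, text, analytics):
    """Run every analytic over one object. No shared state is used,
    so different objects could be processed on different nodes."""
    return [(attr, value, obj_id)
            for analytic in analytics
            for attr, value in analytic(text)]

def build_index(objects, analytics):
    index = defaultdict(set)  # (attribute, value) -> object ids
    for obj_id, text in objects.items():
        for attr, value, oid in derive_postings(obj_id, text, analytics):
            index[(attr, value)].add(oid)
    return index

objects = {"doc1": "The printer arrived broken", "doc2": "Great printer"}
index = build_index(objects, [word_analytic, classification_analytic])

# The schema is whatever attributes the chosen analytics happened to emit.
schema = sorted({attr for attr, _ in index})
```

Adding or removing an analytic changes `schema` without touching how the objects themselves are stored.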

Text Indexing
Logical index over MANY entries per object
  – Large index, even with aggressive compression
  – Non-transactional
Scale-out needed
  – Capacity: single index too big for one (commodity) node
  – Ingest throughput: concurrent insert to index fragments
  – Query response: fan-out / fan-in for query parallelism
Query
  – Predicates over matches
  – Match scoring: magic weighting of predicate importance & position
  – Query planning & optimization probably needed
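
The scale-out and query points above can be sketched as one logical index split into fragments, with inserts routed to a single fragment and queries fanned out to all fragments and merged. The class names, the partitioning by CRC32 of the object id, and the additive weight scoring are assumptions made for illustration. The fragments here live in one process, where a real deployment would place them on separate nodes.

```python
import heapq
import zlib
from collections import defaultdict

class IndexFragment:
    """One partition of the logical index (hypothetical structure)."""
    def __init__(self):
        self.postings = defaultdict(set)  # term -> object ids

    def insert(self, obj_id, terms):
        for term in terms:
            self.postings[term].add(obj_id)

    def search(self, weighted_terms):
        # Score = sum of the weights of the query terms the object matches.
        scores = defaultdict(float)
        for term, weight in weighted_terms.items():
            for obj_id in self.postings.get(term, ()):
                scores[obj_id] += weight
        return list(scores.items())

class ScaledOutIndex:
    def __init__(self, n_fragments):
        self.fragments = [IndexFragment() for _ in range(n_fragments)]

    def insert(self, obj_id, terms):
        # Each object lands in exactly one fragment, so fragments can
        # ingest concurrently and each local score is already complete.
        shard = zlib.crc32(obj_id.encode("utf-8")) % len(self.fragments)
        self.fragments[shard].insert(obj_id, terms)

    def search(self, weighted_terms, k):
        # Fan-out: every fragment evaluates the query.
        # Fan-in: merge the partial hit lists and keep the top k.
        hits = [hit for frag in self.fragments
                for hit in frag.search(weighted_terms)]
        return heapq.nlargest(k, hits, key=lambda hit: hit[1])

idx = ScaledOutIndex(3)
idx.insert("a", ["index", "scale"])
idx.insert("b", ["index"])
idx.insert("c", ["text"])
results = idx.search({"index": 1.0, "scale": 2.0}, k=2)
```

Partitioning by object rather than by term keeps scoring local to a fragment, at the cost of sending every query to every fragment.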

What about Data Processing?
select / project / join / aggregate
Add “value” postings to index for keys and measures
  Select: { } {obj1}
  Project: { } {val2}
  Join: { } {obj2}
  Project: { } {measVal}
  Aggregation: sum({measVal})
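
One way to read the select / project / join / aggregate chain above, as a sketch: alongside the usual `(attribute, value) -> objects` postings, the index keeps "value" postings that map an object and attribute back to its values, so keys and measures can be read straight from the index. The `ValueIndex` class, the attribute names, and the customer/order data are hypothetical.

```python
from collections import defaultdict

class ValueIndex:
    def __init__(self):
        self.by_value = defaultdict(set)    # (attr, value) -> object ids
        self.by_object = defaultdict(list)  # (obj_id, attr) -> values ("value" postings)

    def add(self, obj_id, attr, value):
        self.by_value[(attr, value)].add(obj_id)
        self.by_object[(obj_id, attr)].append(value)

    def select(self, attr, value):
        """Objects whose attribute equals the given value."""
        return set(self.by_value.get((attr, value), ()))

    def project(self, obj_ids, attr):
        """Values of one attribute for the given objects (duplicates kept)."""
        return [v for obj_id in sorted(obj_ids)
                for v in self.by_object.get((obj_id, attr), ())]

    def join(self, values, attr):
        """Objects whose attribute matches any of the given values."""
        matched = set()
        for value in set(values):
            matched |= self.by_value.get((attr, value), set())
        return matched

ix = ValueIndex()
ix.add("c1", "region", "west"); ix.add("c1", "custKey", 100)
ix.add("c2", "region", "east"); ix.add("c2", "custKey", 200)
ix.add("o1", "orderCust", 100); ix.add("o1", "amount", 30)
ix.add("o2", "orderCust", 100); ix.add("o2", "amount", 12)
ix.add("o3", "orderCust", 200); ix.add("o3", "amount", 5)

obj1 = ix.select("region", "west")     # Select
val2 = ix.project(obj1, "custKey")     # Project the join key
obj2 = ix.join(val2, "orderCust")      # Join on the key values
meas_val = ix.project(obj2, "amount")  # Project the measure
total = sum(meas_val)                  # Aggregation
```

The measure projection returns a list rather than a set so that equal amounts on different orders are each counted in the sum.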

Architecture
[Diagram labels: ObjStore, storeMgr, Obj Queue, Analytics, Indexer, Index Fragment, file, … scale-out …, Query, queryPlanner, queryDriver, ranked results]

Conclusions
Derived value from un-structured objects
  – Much value latent in un-structured data
  – Value extracted via analytic tools
  – Value captured in scalable index
  – Value exploited via query and data processing
Architecture
  – Index independent object store schema
  – Application choice of object analytics induces index schema
  – Scaled-out analytics and index