Solr Team CS5604: Cloudera Search in IDEAL Nikhil Komawar, Ananya Choudhury, Rich Gruss Tuesday May 5, 2015 Department of Computer Science Virginia Tech,

Slides:

Advertisements

Similar presentations

Integrated Digital Event Web Archive and Library (IDEAL) and Aid for Curators Archive-It Partner Meeting Montgomery, Alabama Mohamed Farag & Prashant Chandrasekar.

Advertisements

Pete Bohman Adam Kunk.  Introduction  Related Work  System Overview  Indexing Scheme  Ranking  Evaluation  Conclusion.

Web Archive Content Analysis: Disaster Events Case Study IIPC 2015 General Assembly Stanford University and Internet Archive Mohamed Farag Dr. Edward A.

Ontology Classifications Acknowledgement Abstract Content from simulation systems is useful in defining domain ontologies. We describe a digital library.

Google Bigtable A Distributed Storage System for Structured Data Hadi Salimi, Distributed Systems Laboratory, School of Computer Engineering, Iran University.

1 External Sorting for Query Processing Yanlei Diao UMass Amherst Feb 27, 2007 Slides Courtesy of R. Ramakrishnan and J. Gehrke.

Capacity Planning in SharePoint Capacity Planning Process of evaluating a technology … Deciding … Hardware … Variety of Ways Different Services.

Big Data and Hadoop and DLRL Introduction to the DLRL Hadoop Cluster Sunshin Lee and Edward A. Fox DLRL, CS, Virginia Tech Feb. 18, 2015 presentation for.

Big Data and Hadoop and DLRL Introduction to the DLRL Hadoop Cluster Sunshin Lee and Edward A. Fox DLRL, CS, Virginia Tech 21 May 2015 presentation for.

Large-Scale Content-Based Image Retrieval Project Presentation CMPT 880: Large Scale Multimedia Systems and Cloud Computing Under supervision of Dr. Mohamed.

Take An Internal Look at Hadoop Hairong Kuang Grid Team, Yahoo! Inc

Hadoop Team: Role of Hadoop in the IDEAL Project ●Jose Cadena ●Chengyuan Wen ●Mengsu Chen CS5604 Spring 2015 Instructor: Dr. Edward Fox.

CS 5604 Spring 2015 Classification Xuewen Cui Rongrong Tao Ruide Zhang May 5th, 2015.

Web Archives, IDEAL, and PBL Overview Edward A. Fox Digital Library Research Laboratory Dept. of Computer Science Virginia Tech Blacksburg, VA, USA 21.

USING HADOOP & HBASE TO BUILD CONTENT RELEVANCE & PERSONALIZATION Tools to build your big data application Ameya Kanitkar.

BTrees & Sorting 11/3. Announcements I hope you had a great Halloween. Regrade requests were due a few minutes ago…

 DATABASE DATABASE  DATABASE ENVIRONMENT DATABASE ENVIRONMENT  WHY STUDY DATABASE WHY STUDY DATABASE  DBMS & ITS FUNCTIONS DBMS & ITS FUNCTIONS 

HBase A column-centered database 1. Overview An Apache project Influenced by Google’s BigTable Built on Hadoop ▫A distributed file system ▫Supports Map-Reduce.

Panagiotis Antonopoulos Microsoft Corp Ioannis Konstantinou National Technical University of Athens Dimitrios Tsoumakos.

A Metadata Based Approach For Supporting Subsetting Queries Over Parallel HDF5 Datasets Vignesh Santhanagopalan Graduate Student Department Of CSE.

ALMA Integrated Computing Team Coordination & Planning Meeting #1 Santiago, April 2013 Evaluation of mongoDB for Persistent Storage of Monitoring.

Goodbye rows and tables, hello documents and collections.

Experimenting Lucene Index on HBase in an HPC Environment Xiaoming Gao Vaibhav Nachankar Judy Qiu.

University of North Texas Libraries Building Search Systems for Digital Library Collections Mark E. Phillips Texas Conference on Digital Libraries May.

Summary of Alma-OSF’s Evaluation of MongoDB for Monitoring Data Heiko Sommer June 13, 2013 Heavily based on the presentation by Tzu-Chiang Shen, Leonel.

Introduction to Nutch CSCI 572: Information Retrieval and Search Engines Summer 2010.

Reducing Noise CS5604: Final Presentation Xiangwen Wang, Prashant Chandrasekar.

Indexing UMLS concepts with Apache Lucene Julien Thibault University of Utah Department of Biomedical Informatics.

Iccha Sethi Serdar Aslan Team 1 Virginia Tech Information Storage and Retrieval CS 5604 Instructor: Dr. Edward Fox 10/11/2010.

VIRGINIA TECH BLACKSBURG CS 4624 MUSTAFA ALY & GASPER GULOTTA CLIENT: MOHAMED MAGDY IDEAL Pages.

CERN IT Department CH-1211 Geneva 23 Switzerland t CF Computing Facilities Agile Infrastructure Monitoring CERN IT/CF.

1 IBM Academic Initiative Introduction for Pamplin School of Business Virginia Tech – October 13, 2011 “IBM Academic Skills Cloud and Computing Education.

Crisis, Tragedy and Recovery Network (CTRnet) Slides by Kiran Chitturi, Edward A. Fox, and the CTRnet team

Problem Based Learning To Build And Search Tweet And Web Archives Richard Gruss Edward A. Fox Digital Library Research Laboratory Dept. of Computer Science.

Teaching Big Data Through Problem-Based Learning Richard Gruss, Business Information Technology, Virginia Tech Tarek Kanan Software Engineering Department.

Scalable data access with Impala Zbigniew Baranowski Maciej Grzybek Daniel Lanza Garcia Kacper Surdy.

GFURR seminar Can Collecting, Archiving, Analyzing, and Accessing Webpages and Tweets Enhance Resilience Research and Education? Edward A. Fox, Andrea.

DATA MANAGEMENT 1) File StructureFile Structure 2) Physical OrganisationPhysical Organisation 3) Logical OrganisationLogical Organisation 4) File OrganisationFile.

Apache Solr Dima Ionut Daniel. Contents What is Apache Solr? Architecture Features Core Solr Concepts Configuration Conclusions Bibliography.

Information Storage and Retrieval(CS 5604) Collaborative Filtering 4/28/2016 Tianyi Li, Pranav Nakate, Ziqian Song Department of Computer Science Blacksburg,

Big Data Processing of School Shooting Archives

Efficient Multi-User Indexing for Secure Keyword Search

CS6604 Digital Libraries Global Events Team Final Presentation

CS522 Advanced database Systems

Collection Management (Tweets) Final Presentation

Collection Management Webpages

Presenters: Radha Krishnan, Sneha Mehta April 28, 2016

Storage and Indexes Chapter 8 & 9

Collection Management

ArchiveSpark Andrej Galad 12/6/2016 CS-5974 – Independent Study

Building Search Systems for Digital Library Collections

CLA Team Final Presentation CS 5604 Information Storage and Retrieval

Virginia Tech, Blacksburg, VA 24061

Text Classification CS5604 Information Retrieval and Storage – Spring 2016 Virginia Polytechnic Institute and State University Blacksburg, VA Professor:

File Organizations Chapter 8 “How index-learning turns no student pale

Clustering and Topic Analysis

Virginia Tech Blacksburg CS 4624

Clustering tweets and webpages

CS 5604 Information Storage and Retrieval

CS6604 Digital Libraries IDEAL Webpages Presented by

Team FE Final Presentation

External Sorting The slides for this text are organized into chapters. This lecture covers Chapter 11. Chapter 1: Introduction to Database Systems Chapter.

Collection Management Webpages Final Presentation

Selected Topics: External Sorting, Join Algorithms, …

CS6604 Digital Libraries IDEAL Webpages Presented by

Information Storage and Retrieval

Tweet URL Analysis Guoxin Sun, Kehan Lyu, Liyan Li

Adam Lech Joseph Pontani Matthew Bollinger

Rafał Kuć – Sematext sematext.com

File Organizations and Indexing

Presentation transcript:

Solr Team CS5604: Cloudera Search in IDEAL Nikhil Komawar, Ananya Choudhury, Rich Gruss Tuesday May 5, 2015 Department of Computer Science Virginia Tech, Blacksburg

Outline 1.Schema design 2.Indexing 3.Custom Search

System Overview 1.CPU a.Intel i5 Haswell Quad core 3.3 Ghz Xeon 2.RAM a.660 GB in total b.32 GB in each of the 19 Hadoop nodes c.4 GB in the manager node d.16 GB in the tweet DB nodes e.16 GB in the HDFS backup node 3. Storage a. 60 TB across Hadoop, manager, and tweet DB nodes b.11.3 TB for backup 4.Number of nodes a.19 Hadoop nodes b.1 Manager node c.2 Tweet DB nodes d.1 HDFS backup node

Solr Schema: Design

Solr Schema: Effect Solr idx size depends on: ●Number of fields ●Stored vs. not stored ●Type of field o example: string vs. text ●Index issues: o Recommended to add H/W o Alternatively, design schema and tune Java/Solr configurations

Solr Schema: Future ●Fewer stored fields ●NRT indexer o Consider using facet.method=fcs, helps during first request ●Index size lowering o D x S + U  D is the document count  S is the the size of the data type  ints - 4bytes, 8 bytes for doubles, U cumulative size of the unique field values

Document ID: Design Syntax: Collection_name--Counter_value Use: Noise, Reduction, HBase, Solr/Lucene

Document ID: Effect ● Affects Lucene indexing ● Does not affect Solr index size ● Fastest: Zero padded sequential ● Slowest: Random UUID generated using some languages (UUID v4) ● Our current design ○ Performance tradeoffs ○ Somewhere in the middle

ID generation in pipeline LuceneSolrHBase NR RAW ID addition

Document ID: Future Preprocessing ●Current: o Concurrency o `test_and_set` o Batch processing ●Recommendation: o Binary encoded UUID o Parallel o CPU hit Querying ●Current: o Faster fetching o Lower disk I/O ●Recommendation: o Sequentially assigned value or unhashed timestamp o Batch processing vs. asynchronous processing

Indexing

Hbase Indexer - Morphline morphlines: [ { commands: [ { extractHBaseCells { mappings: [{ inputColumn: "original:text_clean" outputField: "text" type: string source: value } { inputColumn: "analysis:ner_people" outputField: "ner_people_multiple" type: string source: value } ] } } { split { inputField: "ner_people_multiple" outputField: "ner_people" separator: "|" } maps to HBase column family and column qualifier defined in schema.xml

Indexing - Jobs

Results WebpagesTweets

Search 1: Adjust boost levels in Query Parser (solrconfig.xml) These numbers are conjectural. A more disciplined approach: Relevance = f(q,d) For each of 100 queries, induce a logistic regression model. Find average contribution of each field to relevance.

Search

2: Reorder documents on the basis of Social Importance Score.... solr.SearchHandler edu.vt.dlib.ideal.solr.SocialBoostComponent

Search 3: Supplement result list, if necessary, by retrieving documents from collections that include topics related to the search. Get topic list straight from HBase.

Search Results {'query': 'election', 'num_results': , 'precision': 0.998, 'time': } {'query': 'elect', 'num_results': 3247, 'precision': 0.978, 'time': } {'query': 'revolution', 'num_results': 13048, 'precision': 0.95, 'time': } {'query': 'uprising', 'num_results': 1769, 'precision': 0.851, 'time': } {'query': 'storm', 'num_results': , 'precision': 0.999, 'time': } {'query': 'winter', 'num_results': , 'precision': 0.999, 'time': } {'query': 'ebola', 'num_results': , 'precision': 1.0, 'time': } {'query': 'disease', 'num_results': 6802, 'precision': 0.993, 'time': } {'query': 'bomb', 'num_results': 33924, 'precision': 0.857, 'time': } {'query': 'explosion', 'num_results': 1224, 'precision': 0.284, 'time': } {'query': 'crash', 'num_results': , 'precision': 0.995, 'time': } {'query': 'plane crash', 'num_results': , 'precision': 1.0, 'time': } {'query': 'shooting', 'num_results': 5366, 'precision': 0.744, 'time': } {'query': 'paris shooting', 'num_results': 446, 'precision': 0.446, 'time': } {'query': 'terrorist attack', 'num_results': 1143, 'precision': 0.768, 'time': } (first 1000 results)

Future Work 1.Relevance feedback to derive logistic regression, evaluate with F1 score. 2.More Solr nodes -> more index shards. 3.Boost by length 4.Innovative inputs (social graph) 5.More sophisticated traversal of hierarchical classification

Acknowledgement We are especially thankful to The NSF grant IIS , III: Small: Integrated Digital Event Archiving and Library (IDEAL) for the funding that supported the infrastructure and the data used in the project. Dr. Edward A. Fox for being the guide on the side for us and for coordinating efforts of all the teams to help us make it a successful class project. The GTA (Sunshin), the GRA (Mohamed) and other students of the class who supported us with ideas as well as efforts during the semester. The authors and the contributors of open source projects, blogs and wiki pages from where we borrowed some ideas and solutions.

Thank you! Q & A