CS 5604 Information Storage and Retrieval

Slides:



Advertisements
Similar presentations
GMD German National Research Center for Information Technology Darmstadt University of Technology Perspectives and Priorities for Digital Libraries Research.
Advertisements

Module 5: Routing BizTalk Messages. Overview Lesson 1: Introduction to Message Routing Lesson 2: Configuring Message Routing Lesson 3: Monitoring Orchestrations.
Chapter 5: Introduction to Information Retrieval
Integrated Digital Event Web Archive and Library (IDEAL) and Aid for Curators Archive-It Partner Meeting Montgomery, Alabama Mohamed Farag & Prashant Chandrasekar.
Fast Track to ColdFusion 9. Getting Started with ColdFusion Understanding Dynamic Web Pages ColdFusion Benchmark Introducing the ColdFusion Language Introducing.
Databases & Data Warehouses Chapter 3 Database Processing.
Xpantrac connection with IDEAL Sloane Neidig, Samantha Johnson, David Cabrera, Erika Hoffman CS /6/2014.
Search Search Drupal with Apache Solr with CERN Web Communications Group – Copyright 2013.
Hadoop Team: Role of Hadoop in the IDEAL Project ●Jose Cadena ●Chengyuan Wen ●Mengsu Chen CS5604 Spring 2015 Instructor: Dr. Edward Fox.
CS 5604 Spring 2015 Classification Xuewen Cui Rongrong Tao Ruide Zhang May 5th, 2015.
Web Archives, IDEAL, and PBL Overview Edward A. Fox Digital Library Research Laboratory Dept. of Computer Science Virginia Tech Blacksburg, VA, USA 21.
Battle of the Giants Apache Solr 4.0 vs ElasticSearch 0.20 Rafał Kuć – sematext.com.
Building Search Portals With SP2013 Search. 2 SharePoint 2013 Search  Introduction  Changes in the Architecture  Result Sources  Query Rules/Result.
University of North Texas Libraries Building Search Systems for Digital Library Collections Mark E. Phillips Texas Conference on Digital Libraries May.
Revolutionizing enterprise web development Searching with Solr.
Solr Team CS5604: Cloudera Search in IDEAL Nikhil Komawar, Ananya Choudhury, Rich Gruss Tuesday May 5, 2015 Department of Computer Science Virginia Tech,
Introduction to Digital Libraries hussein suleman uct cs honours 2003.
GUIDED BY DR. A. J. AGRAWAL Search Engine By Chetan R. Rathod.
Tweets Metadata May 4, 2015 CS Multimedia, Hypertext and Information Access Department of Computer Science Virginia Polytechnic Institute and State.
Iccha Sethi Serdar Aslan Team 1 Virginia Tech Information Storage and Retrieval CS 5604 Instructor: Dr. Edward Fox 10/11/2010.
VIRGINIA TECH BLACKSBURG CS 4624 MUSTAFA ALY & GASPER GULOTTA CLIENT: MOHAMED MAGDY IDEAL Pages.
ProjFocusedCrawler CS5604 Information Storage and Retrieval, Fall 2012 Virginia Tech December 4, 2012 Mohamed M. G. Farag Mohammed Saquib Khan Prasad Krishnamurthi.
1 IBM Academic Initiative Introduction for Pamplin School of Business Virginia Tech – October 13, 2011 “IBM Academic Skills Cloud and Computing Education.
ESG-CET Meeting, Boulder, CO, April 2008 Gateway Implementation 4/30/2008.
Crisis, Tragedy and Recovery Network (CTRnet) Slides by Kiran Chitturi, Edward A. Fox, and the CTRnet team
Problem Based Learning To Build And Search Tweet And Web Archives Richard Gruss Edward A. Fox Digital Library Research Laboratory Dept. of Computer Science.
Apache Solr Dima Ionut Daniel. Contents What is Apache Solr? Architecture Features Core Solr Concepts Configuration Conclusions Bibliography.
1 Copyright © 2008, Oracle. All rights reserved. Repository Basics.
Information Storage and Retrieval(CS 5604) Collaborative Filtering 4/28/2016 Tianyi Li, Pranav Nakate, Ziqian Song Department of Computer Science Blacksburg,
VT Web Archiving Anthony Rinaldi and Dev Mehta CS 4624 Clients: Mohamed Magdy and Tarek Kanan Blacksburg, VA 5/6/2014.
Big Data Processing of School Shooting Archives
Information Retrieval in Practice
CS 405G: Introduction to Database Systems
CS6604 Digital Libraries Global Events Team Final Presentation
Collection Management (Tweets) Final Presentation
Collection Management Webpages
Presenters: Radha Krishnan, Sneha Mehta April 28, 2016
Collection Management
ArchiveSpark Andrej Galad 12/6/2016 CS-5974 – Independent Study
Common Crawl Mining Team: Brian Clarke, Tommy Dean, Ali Pasha, Casey Butenhoff Manager: Don Sanderson (Eastman Chemical Company) Client: Ken Denmark.
Zenodo Data Archive Irtiza Delwar, Michael Culhane, John Sizemore, Gil Turner Client: Dr. Seungwon Yang Instructor: Dr. Edward A. Fox CS 4624 Multimedia,
Enabling Scalable and HA Ingestion and Real-Time Big Data Insights for the Enterprise OCJUG, 2014.
UNC Digital Library Project
Building Search Systems for Digital Library Collections
Extraction, aggregation and classification at Web Scale
CLA Team Final Presentation CS 5604 Information Storage and Retrieval
Federated & Meta Search
Virginia Tech, Blacksburg, VA 24061
Text Classification CS5604 Information Retrieval and Storage – Spring 2016 Virginia Polytechnic Institute and State University Blacksburg, VA Professor:
Front End Final Presentation
Trail Study Kevin Cianfarini, Shane Davies, Marshall Hansen, Andrew Eason … CS4624: Multimedia, Hypertext, and Information Access Instructor: Dr. Edward.
Clustering and Topic Analysis
Virginia Tech Blacksburg CS 4624
Clustering tweets and webpages
CS6604 Digital Libraries IDEAL Webpages Presented by
Applying Key Phrase Extraction to aid Invalidity Search
Multimedia Database Virginia Polytechnic Institute and State University Blacksburg, VA CS 4624 Multimedia, Hypertext and Information Access Client.
Team FE Final Presentation
Sam Fisher, Josh Horn, Johanna Pinsirikul, Taylor Sims
Collection Management Webpages Final Presentation
Wikipedia Hadoop Steven Stulga Spring 2016
CS6604 Digital Libraries IDEAL Webpages Presented by
Information Storage and Retrieval
News Event Detection Website Joe Acanfora, Briana Crabb, Jeff Morris
Katrina Database SearchKat
Lucene/Solr Architecture
Getting Started With Solr
Rafał Kuć – Sematext sematext.com
敦群數位科技有限公司(vanGene Digital Inc.) 游家德(Jade Yu.)
Battle of the Giants Apache Solr 4.0 vs ElasticSearch 0.20
Presentation transcript:

CS 5604 Information Storage and Retrieval Presenters: Andrej Galad, Long Xia, Shivam Maharshi, Tingting Jiang Spring 2016 CS 5604 Information Retrieval and Storage Virginia Polytechnic Institute and State University Blacksburg, VA Professor: Dr. E. Fox

Project Overview Integrated Digital Event Archive and Library (IDEAL) project Data source: social media (tweets, related web pages) Goal: build a state-of-the-art information retrieval system Management: separate teams, Solr team, Front-end team Solr team’s responsibility Data storage and HBase schema Indexing Custom search (query handler, ranking function, etc.) Support for other teams (Front-end, Collaborative filtering) Get people working

Data Storage and HBase Schema Why use HBase Non-relational, column-family-oriented, key-value-based database Great scalability and flexibility How data stored HBase schema Import data into HBase

Indexing Indexing pipeline we use Morphline LilyIndexer transform data from HBase to Solr. With Solr, we can perform search query, do custom ranking, etc, which Shivam and Andrej will talk more on this.

Indexing Two types indexers Morphlines Lily HBase Batch Indexer Lily HBase Near Real-time (NRT) Indexer Morphlines Data extracting, transforming, and loading to Solr Morphlines configuration file Solr Schema

Solr schema.xml & solrconfig.xml Static & Dynamic Fields Default & Copy Fields Stop & Profanity words

Morphline Configuration Mappings from Hbase cells to Solr fields (31 fields) Split fields into Multi-valued fields (4 fields)

Solr Search Admin UI

Custom Ranking Solr score (tf-idf) + custom scores (other teams) Custom Relevance Score = WTopic * (Document Score)Topic + WClustering * (Document Score)Clustering + WCollection * (Document Score)Collection Weight techniques Multiple linear regressions Empirical analysis for the Fractional Relevant Documents Ultimately… Query Boosting + Query expansion + Re-ranking + Pseudo-Relevance feedback

Solr Search Components Solr - pluggable web application Custom handlers, components, libraries Dynamic linking Custom classloaders Declarative discovery - solrconfig.xml Pain while debugging!!! Sample Component

Solr: Component Configuration Build and upload JAR(s) + dependencies to all Solr nodes

Solr: Component Configuration Build and upload JAR(s) + dependencies to all Solr nodes Register component in solrconfig.xml

Solr: Component Configuration Build and upload JAR(s) + dependencies to all Solr nodes Register component in solrconfig.xml Update configuration and reload collection $ solrctl instancedir --update <collection_name> <collection_configuration> $ solrctl collection --reload <collection_name>

Solr: Component Verification

Query Manipulation Query Expansion In-memory Lucene index based on ideal-cs5604s16-topic-words Schema: label, collection_id, words

Query Manipulation Re-ranking Tf-idf + weight 1 * custom score 1 + ...

Pseudo Relevance Feedback

Pseudo Relevance Feedback

Problems Faced Reflection, reflection, reflection Lack of solid documentation attempt => failure Cluster upgrade API versions mismatch Getting the data into HBase from all teams Pain to debug Solr on cluster Insufficient access privilege to the cluster

Lessons Learned Future Work Clear scope and requirement details Clear contract and team deliverables More effective communications Future Work 1. Precision and Recall evaluation 2. Performance improvements 3. Calculate the weights for custom ranking

Acknowledgement NSF grant IIS - 1319578, III: Small: Integrated Digital Event Archiving and Library (IDEAL) Dr. Edward A. Fox GRA: Sunshin Lee and Mohamed Magdy Farag All other teams