1CONFIDENTIAL | Thinking Lucene Think Lucid Grant Ingersoll Chief Scientist Lucid Imagination Enhancing Discovery with Solr and Mahout.

Presentation transcript:

Slide 1: Enhancing Discovery with Solr and Mahout
Grant Ingersoll, Chief Scientist, Lucid Imagination

Slide 2: Evolution
CONFIDENTIAL | Copyright Lucid Imagination
Diagram labels: Documents, Models, Feature Selection, User Interaction, Clicks, Ratings/Reviews, Learning to Rank, Social Graph, Queries, Phrases, NLP, Content Relationships, Page Rank etc., Organization

Slide 3: Minding the Intersection
Search, Discovery, Analytics

Slide 4: Discussion Topics
Background
  - Apache Mahout
  - Apache Solr and Lucene
Recommendations with Mahout
  - Collaborative Filtering
Discovery with Solr and Mahout

Slide 5: Apache Lucene in a Nutshell
Java-based Application Programming Interface (API) for adding search and indexing functionality to applications
Fast and efficient scoring and indexing algorithms
Lots of contributions to make common tasks easier:
  - Highlighting, spatial, Query Parsers, Benchmarking tools, etc.
Most widely deployed search library on the planet

Slide 6: Apache Solr in a Nutshell
Lucene-based Search Server + other features and functionality
Access Lucene over HTTP:
  - Java, XML, Ruby, Python, .NET, JSON, PHP, etc.
Most programming tasks in Lucene are taken care of in Solr
Faceting (guided navigation, filters, etc.)
Replication and distributed search support
Lucene Best Practices

Slide 7: Apache Mahout in a Nutshell
An Apache Software Foundation project to create scalable machine learning libraries under the Apache Software License
The Three C's:
  - Collaborative Filtering (recommenders)
  - Clustering
  - Classification
Others:
  - Frequent Item Mining
  - Primitive collections
  - Math stuff

Slide 8: Recommendations with Mahout

Slide 9: Recommenders
Collaborative Filtering (CF)
  - Provide recommendations based solely on preferences expressed between users and items
  - "People who watched this also watched that"
Content-based Recommendations (CBR)
  - Provide recommendations based on the attributes of the items and the user profile
  - 'Modern Family' is a sitcom, Bob likes sitcoms => Suggest Modern Family to Bob
Mahout is geared towards CF, but can be extended to do CBR
  - Classification can also be used for CBR
Aside: search engines can also solve these problems

Slide 10: To Rate or Not?
Example: Should we recommend Frankenstein to Bob?

          Dracula   Jane Eyre   Frankenstein   Java Programming
  Bob     1         4           ???            -
  Mary    5         1           4              -

In many instances, users don't provide actual ratings
  - Clicks, views, etc.
Non-Boolean ratings can also often introduce unnecessary noise
  - Even a low rating often has a positive correlation with highly rated items in the real world
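The Boolean view of the table above can be sketched as follows. This is a minimal illustration, not Mahout code: it assumes each user's history is reduced to the set of books they interacted with, and it uses the Tanimoto (Jaccard) coefficient as the similarity, which is one common choice for Boolean preferences. The function names are hypothetical.

```python
def tanimoto(a, b):
    """Tanimoto (Jaccard) coefficient: |A & B| / |A | B|."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Boolean view of the slide's table: the rating values are dropped and
# only "has this user interacted with the book?" is kept.
seen = {
    "Bob": {"Dracula", "Jane Eyre"},
    "Mary": {"Dracula", "Jane Eyre", "Frankenstein"},
}


def recommend(user, seen):
    """Score unseen items by the similarity of the users who have seen them."""
    scores = {}
    for other, items in seen.items():
        if other == user:
            continue
        sim = tanimoto(seen[user], items)
        for item in items - seen[user]:
            scores[item] = scores.get(item, 0.0) + sim
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


print(recommend("Bob", seen))
```

With the ratings ignored, Bob and Mary overlap on two of three books, so Frankenstein is suggested to Bob even though their numeric ratings disagree.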

Slide 11: Collaborative Filtering with Mahout
Extensive framework for collaborative filtering
Recommenders
  - User based
  - Item based
  - Slope One
Online and Offline support
  - Offline can utilize Hadoop
Diagram: a matrix of users (User ... User n) by items (Item 1, Item 2, ... Item m), producing recommendations for User X

Slide 12: User Similarity
Diagram: Users 1 to 4 against Items 1 to 4
What should we recommend for User 1?
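A user-based recommender answers this question by finding the users most similar to User 1 and suggesting what they liked. The sketch below is a plain in-memory illustration under stated assumptions: the ratings are invented (the slide's diagram is not in the transcript), similarity is cosine over co-rated items, and the neighbourhood size `k` is a hypothetical parameter. It is not Mahout's API.

```python
from math import sqrt

# Hypothetical ratings; the values on the original slide are not available.
ratings = {
    "User 1": {"Item 1": 5.0, "Item 2": 3.0},
    "User 2": {"Item 1": 5.0, "Item 2": 3.0, "Item 3": 4.0},
    "User 3": {"Item 1": 1.0, "Item 2": 5.0, "Item 4": 5.0},
    "User 4": {"Item 3": 2.0, "Item 4": 4.0},
}


def cosine(u, v):
    """Cosine similarity restricted to the items both users rated."""
    common = set(u) & set(v)
    if not common:
        return 0.0
    dot = sum(u[i] * v[i] for i in common)
    norm_u = sqrt(sum(u[i] ** 2 for i in common))
    norm_v = sqrt(sum(v[i] ** 2 for i in common))
    return dot / (norm_u * norm_v)


def recommend(user, ratings, k=1):
    """Predict ratings for unseen items from the k most similar users."""
    mine = ratings[user]
    neighbours = sorted(
        ((cosine(mine, prefs), name)
         for name, prefs in ratings.items() if name != user),
        reverse=True,
    )[:k]
    totals, weights = {}, {}
    for sim, name in neighbours:
        if sim <= 0:
            continue
        for item, rating in ratings[name].items():
            if item not in mine:
                totals[item] = totals.get(item, 0.0) + sim * rating
                weights[item] = weights.get(item, 0.0) + sim
    preds = {item: totals[item] / weights[item] for item in totals}
    return sorted(preds.items(), key=lambda kv: (-kv[1], kv[0]))


print(recommend("User 1", ratings, k=1))
```

In this invented data User 2 rates the shared items exactly as User 1 does, so User 2's remaining item is the one recommended.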

Slide 13: Item Similarity
Diagram: Users 1 to 4 against Items 1 to 4
What should we recommend for User 1?
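An item-based recommender turns the same question around: it compares items by the users who rated them, then scores unseen items by their similarity to what User 1 already rated. The sketch below assumes invented ratings and cosine similarity over each item's full rating column (missing ratings count as zero); the helper names are hypothetical and this is not Mahout's implementation.

```python
from math import sqrt

# Hypothetical ratings; the values on the original slide are not available.
ratings = {
    "User 1": {"Item 1": 5.0, "Item 2": 1.0},
    "User 2": {"Item 1": 4.0, "Item 3": 5.0},
    "User 3": {"Item 2": 5.0, "Item 4": 4.0},
    "User 4": {"Item 1": 5.0, "Item 3": 4.0, "Item 4": 1.0},
}


def item_vectors(ratings):
    """Transpose user -> item ratings into item -> user ratings."""
    vectors = {}
    for user, prefs in ratings.items():
        for item, rating in prefs.items():
            vectors.setdefault(item, {})[user] = rating
    return vectors


def cosine(a, b):
    """Cosine similarity of two item columns; missing ratings count as 0."""
    dot = sum(a[u] * b[u] for u in set(a) & set(b))
    norm = sqrt(sum(x * x for x in a.values())) * sqrt(sum(x * x for x in b.values()))
    return dot / norm if norm else 0.0


def recommend(user, ratings):
    """Predict a rating for each unseen item as a similarity-weighted average."""
    vectors = item_vectors(ratings)
    mine = ratings[user]
    preds = {}
    for candidate, vec in vectors.items():
        if candidate in mine:
            continue
        sims = [(cosine(vec, vectors[item]), rating) for item, rating in mine.items()]
        weight = sum(s for s, _ in sims)
        if weight > 0:
            preds[candidate] = sum(s * r for s, r in sims) / weight
    return sorted(preds.items(), key=lambda kv: (-kv[1], kv[0]))


print(recommend("User 1", ratings))
```

Because item-to-item similarities depend only on the rating columns, they can be computed ahead of time, which is one reason this variant is often used when there are far more users than items.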

Slide 14: Slope One
Intuition: There is a linear relationship between rated items
  - Y = mX + b where m = 1
Solve for b upfront based on existing ratings: b = (Y - X)
  - Find the average difference in preference value for every pair of items
Online can be very fast, but requires up-front computation and memory

  User   Item 1   Item 2
  A      3.5      2
  B      ?        3

User A: 3.5 - 2 = 1.5
Item 1 (User B) = 3 + 1.5 = 4.5
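The worked example above can be reproduced with a short sketch. It assumes the weighted variant of Slope One, where each item pair's average difference is weighted by the number of users who rated both items; the function names are hypothetical and the code is an illustration, not Mahout's recommender.

```python
from collections import defaultdict

# The two users from the slide's table.
ratings = {
    "A": {"Item 1": 3.5, "Item 2": 2.0},
    "B": {"Item 2": 3.0},
}


def average_diffs(ratings):
    """Up-front step: average rating difference for every ordered item pair."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for prefs in ratings.values():
        for i, ri in prefs.items():
            for j, rj in prefs.items():
                if i != j:
                    sums[(i, j)] += ri - rj
                    counts[(i, j)] += 1
    diffs = {pair: sums[pair] / counts[pair] for pair in sums}
    return diffs, counts


def predict(user_prefs, target, diffs, counts):
    """Online step: Y = X + b, averaged over the user's rated items."""
    numerator = 0.0
    denominator = 0
    for j, rj in user_prefs.items():
        if (target, j) in diffs:
            c = counts[(target, j)]
            numerator += (rj + diffs[(target, j)]) * c
            denominator += c
    return numerator / denominator if denominator else None


diffs, counts = average_diffs(ratings)
print(predict(ratings["B"], "Item 1", diffs, counts))
```

The table of differences is the up-front cost the slide mentions; once it exists, each prediction is a few additions.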

Slide 15: Online and Offline Recommendations
Online
  - Predates Hadoop
  - Designed to run on a single node (matrix size of ~100M interactions)
  - API for integrating with your application
Offline
  - Hadoop based
  - Designed to run on a large cluster
  - Several approaches: RecommenderJob, ItemSimilarityJob, ParallelALSFactorizationJob

Slide 16: RecommenderJob
Essentially does matrix multiplication using distributed techniques
$MAHOUT_HOME/bin/examples/asf- -examples.sh
Diagram: (matrix) X (User A) = Recs
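The multiplication in the diagram can be sketched on a single machine. The example below assumes the matrix is an item-to-item co-occurrence matrix and the vector is User A's Boolean preferences, which is one common reading of this approach; the data and function names are invented, and nothing here reflects how the Hadoop job distributes the work.

```python
from itertools import permutations

# Hypothetical interaction histories (sets of items per user).
histories = {
    "User A": {"Item 1", "Item 2"},
    "User B": {"Item 1", "Item 2", "Item 3"},
    "User C": {"Item 2", "Item 3", "Item 4"},
}


def cooccurrence(histories):
    """Count, for every ordered item pair, how many users have both items."""
    items = sorted({item for history in histories.values() for item in history})
    index = {item: n for n, item in enumerate(items)}
    matrix = [[0] * len(items) for _ in items]
    for history in histories.values():
        for a, b in permutations(history, 2):
            matrix[index[a]][index[b]] += 1
    return items, matrix


def recommend(user, histories):
    """Multiply the co-occurrence matrix by the user's preference vector."""
    items, matrix = cooccurrence(histories)
    vector = [1 if item in histories[user] else 0 for item in items]
    scores = [sum(row[k] * vector[k] for k in range(len(items))) for row in matrix]
    recs = [
        (item, score)
        for item, score in zip(items, scores)
        if item not in histories[user] and score > 0
    ]
    return sorted(recs, key=lambda kv: (-kv[1], kv[0]))


print(recommend("User A", histories))
```

Each output score is the number of times a candidate item co-occurred with something User A already has, so items that appear alongside more of the user's history rank higher.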

Slide 17: Discovery with Solr

Slide 18: Discovery with Solr
Goals:
  - Guide users to results without having to guess at keywords
  - Encourage serendipity
  - Never show empty results
Out of the Box:
  - Faceting
  - Spell Checking
  - More Like This
  - Clustering (Carrot2)
Extend:
  - Clustering (with Mahout)
  - Frequent Item Mining (with Mahout)

Slide 19: Clustering
Automatically group similar content together to aid users in discovering related items and/or avoiding repetitive content
Solr has search result clustering
  - Pluggable
  - Default implementation uses Carrot2
Mahout has Hadoop-based large-scale clustering
  - K-Means, Minhash, Dirichlet, Canopy, Spectral, etc.
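To make "group similar content together" concrete, here is a minimal in-memory sketch of K-Means, the first algorithm in the list above. It is the textbook algorithm on invented 2-D points with caller-supplied starting centroids, not Mahout's Hadoop implementation; in practice the points would be document vectors.

```python
def kmeans(points, centroids, iterations=10):
    """Alternate between assigning points to the nearest centroid and
    moving each centroid to the mean of its assigned points."""
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(
                range(len(centroids)),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])),
            )
            clusters[nearest].append(p)
        new_centroids = [
            tuple(sum(dim) / len(cluster) for dim in zip(*cluster))
            if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break  # assignments are stable
        centroids = new_centroids
    return centroids, clusters


# Hypothetical data: two visibly separate groups of points.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.0), (8.0, 9.5)]
centroids, clusters = kmeans(points, [points[0], points[3]])
print(clusters)
```

The result depends on the starting centroids, which is why seeding strategies such as Canopy are often run before K-Means.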

Slide 20: Discovery In Action
Pre-reqs:
  - Apache Ant 1.7.x, Subversion (SVN)
Command Line 1:
  - svn co https://svn.apache.org/repos/asf/lucene/dev/trunk solr-trunk
  - cd solr-trunk/solr/
  - ant example
  - cd example
  - java -Dsolr.clustering.enabled=true -jar start.jar
Command Line 2:
  - cd exampledocs; java -jar post.jar *.xml
Links (truncated): ...e=true ...e=true

Slide 21: Solr + Mahout

Slide 22: Basics
Most Mahout tasks are offline
Solr provides many touch points for integration:
  - ClusteringEngine: clustering results
  - SearchComponent: suggestions (related searches, clusters, MLT, spellchecking)
  - UpdateProcessor: classification of documents
  - FunctionQuery

Slide 23: Frequent Itemset Mining
Discover frequently co-occurring items
Use Case: Related Searches from Solr Logs
Hadoop and sequential versions
  - Parallel FP Growth
Input:
  - TAB SPACE SPACE
  - Comma, pipe also allowed as delimiters
Example:
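As a rough illustration of what "frequently co-occurring items" means, the sketch below counts every small itemset by brute force and keeps those that meet a minimum support. This is deliberately not FP-Growth, which reaches the same answer without enumerating every combination; the transactions, the `min_support` and `max_size` parameters, and the function names are all hypothetical. Tokens are split on space, comma, or pipe, following the delimiters named on the slide.

```python
import re
from collections import Counter
from itertools import combinations


def parse(line):
    """Split one transaction line on spaces, commas, or pipes."""
    return [token for token in re.split(r"[ ,|]+", line.strip()) if token]


def frequent_itemsets(transactions, min_support=2, max_size=3):
    """Brute-force support counting for itemsets of up to max_size items."""
    counts = Counter()
    for transaction in transactions:
        items = sorted(set(transaction))
        for size in range(1, max_size + 1):
            for combo in combinations(items, size):
                counts[combo] += 1
    return {itemset: n for itemset, n in counts.items() if n >= min_support}


# Hypothetical query log, one query per line.
lines = [
    "solr faceting",
    "solr,faceting,tutorial",
    "mahout|clustering",
    "solr clustering",
]
result = frequent_itemsets([parse(line) for line in lines])
for itemset, support in sorted(result.items(), key=lambda kv: (-kv[1], kv[0])):
    print(itemset, support)
```

Here "solr" and "faceting" appear together in two queries, so the pair survives the support threshold and could be offered as a related search.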

Slide 24: FIM on Solr Query Logs
Goal:
  - Extract user queries from Solr logs
  - Feed into FIM to generate Related Keyword Searches
Context:
  - Solr Query logs
Commands:
  bin/mahout regexconverter --input $PATH_TO_LOGS --output /tmp/solr/output --regex "(?<=(\?|&)q=).*?(?=&|$)" --overwrite --transformerClass url --formatterClass fpg
  bin/mahout fpg --input /tmp/solr/output/ -o /tmp/solr/fim/output -k 25 -s 2 --method mapreduce
  bin/mahout seqdumper --seqFile /tmp/solr2/results/frequentpatterns/part-r
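The regular expression passed to regexconverter can be tried in isolation. The sketch below applies the same pattern with Python's `re` module and URL-decodes the match; the log lines are invented request URLs, and real Solr log lines may be laid out differently, so treat the sample input as an assumption.

```python
import re
from urllib.parse import unquote_plus

# Same pattern as the --regex argument: capture the value of the q parameter,
# i.e. everything after "?q=" or "&q=" up to the next "&" or end of line.
QUERY_RE = re.compile(r"(?<=(\?|&)q=).*?(?=&|$)")


def extract_queries(lines):
    """Return the decoded q parameter from each line that has one."""
    queries = []
    for line in lines:
        match = QUERY_RE.search(line.strip())
        if match:
            queries.append(unquote_plus(match.group(0)))
    return queries


# Hypothetical request lines.
sample = [
    "GET /solr/select?q=chris+hostetter&rows=10",
    "GET /solr/select?rows=10&fq=type:webcast&q=faceted%20search",
    "GET /solr/admin/ping",
]
print(extract_queries(sample))
```

The lookbehind requires `?` or `&` directly before `q=`, so a filter parameter such as `fq=` is not mistaken for the query.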

Slide 25: Output
Key: Chris: Value:
  ([Chris, Hostetter],870)
  ([Chris],870)
  ([Search, Faceted, Chris, Hostetter, Webcast, Power, Mastering],18)
  ([Search, Faceted, Chris, Hostetter, Webcast, Power],18)
  ([Search, Faceted, Chris, Hostetter],18)
  ([Solr, new, Chris, Hostetter, webcast, along, sponsors, DZone, QA, Refcard],12)
  ([Solr, new, Chris, Hostetter, webcast, along, sponsors, DZone],12)
  ([Solr, new, Chris, Hostetter, webcast, along, sponsors],12)
  ([Solr, new, Chris, Hostetter, webcast, along],12)
  ([Solr, new, Chris, Hostetter, webcast],12)
  ([Solr, new, Chris, Hostetter],12)

Slide 26: Resources

Slide 27: Appendix

Slide 28: Mahout Overview
Diagram labels:
  - Applications
  - Examples
  - Recommenders, Clustering, Classification, Freq. Pattern Mining, Genetic
  - Math: Vectors/Matrices/SVD
  - Utilities/Integration: Lucene/Vectorizer
  - Collections (primitives)
  - Apache Hadoop
See