Site Technology TOI Fest Q1 2015 Celebration From Keyword-based Search to Semantic Search, How Big Data Enables That?



Outline
– Introduction
– Required Data Sources
– Big Data Platform
– Semantic Search at CB
– Future Work and Conclusions

Introduction

Keyword-based Search
Traditional search engines (e.g., Lucene, Solr, Elasticsearch) tokenize text and find documents containing those tokens and their linguistic variations:
– User's search: machine learning
  Tokenization: ["machine", "learning"] => Stemming: ["machin", "learn"]
  Final query: machin AND learn
  This could match a document for a "machinist" who has "learned" something.
– software architect => … => software AND architect
  Might identify a building architect requiring knowledge of specialized architecture software.
– account manager => … => account AND manage
  Will match text such as "need to manage the process and account for any variances".
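The false match described above can be reproduced with a minimal sketch. The suffix-stripping `stem` function below is a toy stand-in chosen for illustration, not the stemmer used by Lucene, Solr, or Elasticsearch, and the names `stem`, `analyze`, and `matches` are hypothetical.

```python
import re

# Toy suffix list (an assumption for illustration; real engines use
# stemmers such as Porter or language-specific analyzers).
SUFFIXES = ("ist", "ing", "ed", "e")


def stem(token):
    """Strip the first matching suffix, keeping a stem of at least 3 letters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def analyze(text):
    """Lowercase, tokenize on alphanumeric runs, and stem each token."""
    return [stem(t) for t in re.findall(r"[a-z0-9]+", text.lower())]


def matches(query, document):
    """AND semantics: every stemmed query term must appear in the document."""
    doc_terms = set(analyze(document))
    return all(term in doc_terms for term in analyze(query))


print(analyze("machine learning"))
print(matches("machine learning", "A machinist who has learned welding"))
```

Because both "machine" and "machinist" reduce to the same stem, the AND query matches a document that has nothing to do with machine learning.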

Semantic Search
We need a way to identify and search for the meaning of keyword phrases, not just the individual text tokens, e.g.:
machine learning = "machine learning" OR "data scientist" OR "mahout" OR "svm" OR "neural networks"
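A minimal sketch of this kind of expansion, assuming a precomputed table of related phrases. The `RELATED_TERMS` table and the `expand` function are hypothetical; the single entry is copied from the example above rather than taken from any real output.

```python
# Hypothetical lookup table: phrase -> related phrases.
# In the presentation these relations are mined from search logs.
RELATED_TERMS = {
    "machine learning": ["data scientist", "mahout", "svm", "neural networks"],
}


def expand(phrase, related=RELATED_TERMS):
    """Return an OR query containing the phrase and its related phrases."""
    terms = [phrase] + related.get(phrase.lower(), [])
    return " OR ".join(f'"{t}"' for t in terms)


print(expand("machine learning"))
```

A phrase with no known relations is returned as a single quoted term, so the expansion never narrows the original search.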

Possible Solutions
– Natural Language Processing (NLP): not a good option for CB (different languages)
– Statistical ML models:
  ○ Language-agnostic
  ○ Human-readable
  ○ High accuracy
  ○ Fast and scalable
– Manual taxonomies: not scalable; manpower required in every supported language

Required Data Sources
– Search logs (billions)
  ○ Job seekers
  ○ Recruiters
– Classified users (millions)
– Black-listed keywords (e.g., stopwords)

Big Data Platform

Hadoop Platform
– Distributed storage and processing platform
– Scalable to petabytes or greater
– Our clusters:
  ○ Production: 68 DataNodes; ~800TB configured, over 600TB used (replication factor 3), mostly compressed data; combined ~1400 CPU threads, ~4TB RAM
  ○ DR: 42 DataNodes, 1.4PB
– SQL Server tables refreshed daily
– Table data stored in SequenceFile format (binary, compressed)
– Looking into row-column store formats

Processing on Hadoop
– MapReduce (Java)
  ○ Distribution of work (map)
  ○ Aggregation of work output (reduce)
– Hive: SQL-like language
– Sqoop: transfer of data between HDFS and relational DBs
– Oozie: workflow management; scheduling of HDFS operations, MapReduce, Hive, Sqoop

Processing on Hadoop (cont.)
– Q2: Spark (Java, Scala, Python, etc.)
– Will still support MapReduce, but Spark is the future.

CB Semantic Search

Our Target
User's query:
machine learning research and development Portland, OR software engineer AND hadoop java
Traditional search engine parsing:
(machine AND learning AND research AND development AND portland) OR (software AND engineer AND hadoop AND java)
Ideal parsing:
"machine learning" AND "research and development" AND "Portland, OR" AND "software engineer" AND hadoop AND java
Semantically enhanced query:
("machine learning" OR "computer vision" OR "data mining" OR matlab) AND ("research and development" OR "r&d") AND ("Portland, OR" OR "Portland, Oregon") AND ("software engineer" OR "software developer") AND (hadoop OR "big data" OR hbase OR hive) AND (java OR j2ee)
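One way to get from the raw query to the ideal parsing is greedy longest-match segmentation against a dictionary of known phrases. The sketch below assumes such a dictionary already exists; `KNOWN_PHRASES`, `segment`, and `to_query` are hypothetical names, and the slides do not say that CB's parser works this way.

```python
# Hypothetical phrase dictionary, stored as lowercase token tuples.
# In practice it would be derived from data, not hand-written.
KNOWN_PHRASES = {
    ("machine", "learning"),
    ("research", "and", "development"),
    ("portland,", "or"),
    ("software", "engineer"),
}
MAX_LEN = max(len(p) for p in KNOWN_PHRASES)


def segment(query):
    """Split a query into quoted known phrases and single keywords."""
    tokens = query.split()
    units = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate phrase first, down to two tokens.
        for n in range(min(MAX_LEN, len(tokens) - i), 1, -1):
            window = tokens[i:i + n]
            if tuple(t.lower() for t in window) in KNOWN_PHRASES:
                units.append('"' + " ".join(window) + '"')
                i += n
                break
        else:
            # No phrase starts here: keep the keyword, drop bare operators.
            if tokens[i] not in ("AND", "OR"):
                units.append(tokens[i])
            i += 1
    return units


def to_query(query):
    """Join the segmented units with AND."""
    return " AND ".join(segment(query))


print(to_query("machine learning research and development "
               "Portland, OR software engineer AND hadoop java"))
```

Matching phrases before looking at operators is what keeps the "OR" in "Portland, OR" and the "and" in "research and development" from being read as Boolean operators.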

Abstract Model
– Mine user search logs
– Collaborative filtering
– Remove noise

[Diagram labels: Job Seeker Search Behavior; Recruiter Search Behavior; Content-based Filtering]

PGMHD
[Diagram labels: Java Developer, .NET Developer, Nurse, Health Care; Java, J2EE, C#, Care giver, RN, Senior Home]

Map/Reduce job that finds and scores similar searches run by the same users:
○ Jane searched for "registered nurse" and "r.n." and "nurse".
○ Zeke searched for "java developer" and "scala" and "jvm" and "j2ee".
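The shape of such a job can be sketched in plain Python by simulating the map, shuffle, and reduce steps in memory. This is an illustration under assumptions, not the production job (which the slides say runs as MapReduce on Hadoop); the record layout `(user, query)` and all function names are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def map_phase(log_records):
    """Emit (user, normalized query) for every search log record."""
    for user, query in log_records:
        yield user, query.strip().lower()


def shuffle(pairs):
    """Group the distinct queries by user, as the framework would by key."""
    grouped = defaultdict(set)
    for user, query in pairs:
        grouped[user].add(query)
    return grouped


def reduce_phase(grouped):
    """Count, for each pair of queries, how many users searched for both."""
    counts = defaultdict(int)
    for queries in grouped.values():
        for a, b in combinations(sorted(queries), 2):
            counts[(a, b)] += 1
    return dict(counts)


logs = [
    ("jane", "registered nurse"), ("jane", "r.n."), ("jane", "nurse"),
    ("zeke", "java developer"), ("zeke", "scala"),
    ("amy", "nurse"), ("amy", "Registered Nurse"),
]
pair_counts = reduce_phase(shuffle(map_phase(logs)))
print(pair_counts[("nurse", "registered nurse")])
```

Sorting each user's queries before pairing gives every pair a single canonical key, so ("nurse", "registered nurse") and its reverse are counted together.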

Similarity Scores
– Co-occurrence score
– Point-wise mutual information score
– Probabilistic-based similarity score
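The slides name these scores without giving formulas. As a sketch, the first two can be computed with their textbook definitions: co-occurrence as the number of users who searched for both terms, and point-wise mutual information as log2(P(a,b) / (P(a) P(b))) with probabilities estimated over users. Whether CB uses exactly these estimates is an assumption, and the probabilistic-based score is not sketched because the slides do not define it. The `users_by_term` layout is hypothetical.

```python
import math


def co_occurrence(users_by_term, term_a, term_b):
    """Number of users who searched for both terms."""
    return len(users_by_term[term_a] & users_by_term[term_b])


def pmi(users_by_term, term_a, term_b, total_users):
    """Point-wise mutual information of two terms, estimated over users."""
    both = co_occurrence(users_by_term, term_a, term_b)
    if both == 0:
        return float("-inf")  # the terms never co-occur
    p_ab = both / total_users
    p_a = len(users_by_term[term_a]) / total_users
    p_b = len(users_by_term[term_b]) / total_users
    return math.log2(p_ab / (p_a * p_b))


# Toy data: term -> set of user ids who searched for it.
users_by_term = {"rn": {1, 2}, "nurse": {1, 2, 3}, "java": {4}}
print(co_occurrence(users_by_term, "rn", "nurse"))
print(pmi(users_by_term, "rn", "nurse", total_users=4))
```

A positive PMI means the two terms are searched by the same users more often than their individual popularity would predict, which corrects for very common terms that co-occur with everything.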

Sample Results
– Cashier => retail, retail cashier, customer service, cashiers
– CDL => cdl driver, cdl a, driver
– Data Scientist => machine learning, big data

Special Cases
– Synonyms:
  ○ cpa => Certified Public Accountant
  ○ rn => Registered Nurse
  ○ r.n. => Registered Nurse
– Ambiguous terms*:
  ○ driver => driver (trucking) ~80%
  ○ driver => driver (software) ~20%

Conclusions and Future Work
– Semantic search focuses on understanding the meaning behind the search keywords.
– Semantic search at CB was enabled by implementing a workflow that analyzes billions of search logs using the Big Data platform.
– The workflow runs continuously to handle any manual curation proposed by data analysts in near real time.

Conclusions and Future Work (cont.)
– We plan to start using Spark to analyze the queries we receive in real time.
– We plan to use the semantic search API intensively in our recommendation engine to improve the quality of the recommendations.

Acknowledgment
I would like to thank Trey Grainger for his continuous support in making semantic search possible and for providing the content of this presentation. I would also like to thank the Search Relevancy and Recommendations team, who took responsibility for building the API that makes this semantic search useful.

Publication
Crowdsourced query augmentation through semantic discovery of domain-specific jargon, IEEE Big Data 2014