Yelp Dataset Challenge


1 Yelp Dataset Challenge
Campus Arc II, 16 April 2015 Mehdy Davary, Computer science department (IIUN)

2 About The Challenge Dataset
1.2M reviews and 400K tips by 250K users for 42K businesses
400K business attributes, e.g., hours, parking availability, ambience
Social network of 250K users for a total of 1.9M social edges
Aggregated check-ins over time for each of the 42K businesses
Cities: U.K.: Edinburgh; U.S.: Pittsburgh, Charlotte, Urbana-Champaign, Phoenix, Las Vegas, Madison

3 The Hortonworks Sandbox
Platform
The Hortonworks Sandbox is a single-node implementation of the Hortonworks Data Platform (HDP). It is a personal, portable Hadoop environment.
H2O on the Hortonworks Data Platform is a fully open-source predictive analytics platform.
Neo4j is a graph database that stores data as a graph of nodes and relationships. Neo4j uses Cypher queries to work with graph data.
[Diagram: The Hortonworks Sandbox, H2O, Sentiment Analysis, Neo4j]

4 The Hortonworks Sandbox
So far we have loaded all five Yelp JSON data files into Hadoop as tables that are sortable and searchable. We mainly use HCatalog, Pig, Python, and Hive to load and process the data.
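As a rough illustration of the loading step, here is a minimal Python sketch. It assumes each Yelp file holds one JSON object per line and that nested fields such as `votes` are flattened into columns before the rows are handed to Hive or Pig. The helper names `flatten_record` and `load_json_lines` and the sample records are hypothetical, not taken from the presentation.

```python
import json


def flatten_record(record, sep="_"):
    """Flatten nested dicts into one flat row, e.g. votes.useful -> votes_useful."""
    row = {}
    for key, value in record.items():
        if isinstance(value, dict):
            # Recurse so deeper nesting is flattened as well.
            for sub_key, sub_value in flatten_record(value, sep).items():
                row[f"{key}{sep}{sub_key}"] = sub_value
        else:
            row[key] = value
    return row


def load_json_lines(text):
    """Parse one JSON object per non-empty line and flatten each into a row."""
    return [flatten_record(json.loads(line))
            for line in text.splitlines() if line.strip()]


# Made-up user records shaped like the `user` fields listed in the deck.
sample = "\n".join([
    '{"user_id": "u1", "fans": 2, "votes": {"funny": 1, "useful": 5, "cool": 0}}',
    '{"user_id": "u2", "fans": 0, "votes": {"funny": 0, "useful": 0, "cool": 0}}',
])
rows = load_json_lines(sample)
print(rows[0])
```

Flat rows of this kind map directly onto table columns, which is what makes the data sortable and searchable once it is registered in HCatalog.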

5 H2O
H2O is a statistical analysis engine that uses the Hadoop Distributed File System (HDFS) as its storage platform and provides a user-friendly interface for easy querying.

6 Neo4j
The real power of Neo4j is in connected data. To associate any two nodes, we add a relationship that describes how the records are related.
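To make the node-and-relationship idea concrete, the following Python sketch derives undirected friendship edges from user records before they would be written to a graph database. It assumes that each user's `friends` field is a list of user IDs; the function name `friendship_edges` and the sample users are hypothetical, and no Neo4j API is called here.

```python
def friendship_edges(users):
    """Return undirected friendship edges as sorted (user_id, user_id) pairs."""
    edges = set()
    for user in users:
        for friend_id in user.get("friends", []):
            if friend_id != user["user_id"]:
                # Sorting the pair makes (u1, u2) and (u2, u1) the same edge.
                edges.add(tuple(sorted((user["user_id"], friend_id))))
    return edges


# Made-up users; each record lists the IDs of that user's friends.
users = [
    {"user_id": "u1", "friends": ["u2", "u3"]},
    {"user_id": "u2", "friends": ["u1"]},
    {"user_id": "u3", "friends": ["u1"]},
]
edges = friendship_edges(users)
print(sorted(edges))
```

Each resulting pair corresponds to one relationship between two user nodes.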

7 To analyze Hortonworks sandbox data with Excel 2013
Hortonworks ODBC driver (64-bit) installed and configured.
Microsoft Excel 2013 Professional Plus 64-bit.
Use the Microsoft Query feature to access Hortonworks sandbox data.
Use the Excel Power View feature to analyze the data.

8 About reviews on “Restaurants”
5 important dimensions: Food, Service, Ambience, Deals/Discounts, Quality-Price Ratio
Raw data: yelp_academic_dataset_review.json, yelp_academic_dataset_business.json
A review can be associated with multiple dimensions (categories) at the same time.

9 data preparation for data mining
The table compares all reviews, the total reviews on “Restaurants”, and the reduced number of reviews on “Restaurants” obtained by using (review.useful > 3 AND review.cool > 2 AND review.stars > 3 AND business.review_count > 5) as filtering factors.

            All businesses   All restaurants   Restaurants (r.useful > 3, r.cool > 2, r.stars > 3, b.review_count > 5)
Review      1’127’525        706’290           22’584
Business    42’153
User        252’898
Tip         403’210
Checkin     31’617
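The filtering factors above can be sketched in Python as follows. This is a minimal illustration rather than the Pig or Hive query actually used: it assumes reviews carry flat `useful`, `cool`, and `stars` fields, that each business has a `categories` list containing "Restaurants", and the function name `select_reviews` and sample records are hypothetical.

```python
def select_reviews(reviews, businesses):
    """Keep restaurant reviews with useful > 3, cool > 2, stars > 3,
    written for businesses with review_count > 5."""
    by_id = {b["business_id"]: b for b in businesses}
    selected = []
    for r in reviews:
        b = by_id.get(r["business_id"])
        # Skip reviews of unknown businesses or non-restaurants.
        if b is None or "Restaurants" not in b["categories"]:
            continue
        if (r["useful"] > 3 and r["cool"] > 2 and r["stars"] > 3
                and b["review_count"] > 5):
            selected.append(r)
    return selected


# Made-up records for illustration only.
businesses = [
    {"business_id": "b1", "categories": ["Restaurants", "Pizza"], "review_count": 40},
    {"business_id": "b2", "categories": ["Doctors"], "review_count": 12},
    {"business_id": "b3", "categories": ["Restaurants"], "review_count": 3},
]
reviews = [
    {"review_id": "r1", "business_id": "b1", "useful": 5, "cool": 3, "stars": 4},
    {"review_id": "r2", "business_id": "b1", "useful": 1, "cool": 0, "stars": 5},
    {"review_id": "r3", "business_id": "b2", "useful": 9, "cool": 9, "stars": 5},
    {"review_id": "r4", "business_id": "b3", "useful": 9, "cool": 9, "stars": 5},
]
kept = select_reviews(reviews, businesses)
print([r["review_id"] for r in kept])
```

All four conditions must hold at once, which is why the filtered set is so much smaller than the full set of restaurant reviews.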

10
review: funny: int, useful: int, cool: int, user_id: string, review_id: string, stars: int, text: string, date: string, type: string, business_id: string
business: attributes: string, business_id: string, full_address: string, open: boolean, hours: string, categories: string, city: string, review_count: int, name: string, neighborhoods: string, longitude: float, state: string, stars: float, latitude: float, type: string
user: yelping_since: string, votes: {funny: 1, useful: 5, cool: 0}, string, name: string, review_count: int, user_id: string, friends: string, fans: int, average_stars: float, type: string, compliments: string, elite: string

11 Java implementation of the NLTK in Hadoop
Retrieving the parts of speech (verbs, nouns, adjectives, etc.) from each sentence using the Stanford NLP parser (The Stanford NLP Group).

12
Original review:
Unfortunately, the frustration of being Dr. Goldberg's patient is a repeat of the experience I've had with so many other doctors in NYC -- good doctor, terrible staff. It seems that his staff simply never answers the phone. It usually takes 2 hours of repeated calling to get an answer. Who has time for that or wants to deal with it? I have run into this problem with many other doctors and I just don't get it. You have office workers, you have patients with medical needs, why isn't anyone answering the phone? It's incomprehensible and not work the aggravation. It's with regret that I feel that I have to give Dr. Goldberg 2 stars.

After stop-word removal:
Unfortunately, frustration Dr. Goldberg's patient repeat experience I've doctors NYC -- good doctor, terrible staff. It staff simply answers phone. It takes 2 hours repeated calling answer. Who time deal it? run problem doctors it. You office workers, patients medical needs, answering phone? It's incomprehensible work aggravation. It's regret feel give Dr. Goldberg 2 stars.

Part-of-speech tags:
((Unfortunately,RB),(frustration,NN),(being,VB),(Goldberg,NNP),(patient,NN),(repeat,NN),(experience,NN),('ve,VB),(had,VB),(so,RB),(many,JJ),(other,JJ),(doctors,NN),(NYC,NNP),(good,JJ),(doctor,NN),(terrible,JJ),(staff,NN),(seems,VB),(staff,NN),(simply,RB),(never,RB),(answers,VB),(phone,NN),(usually,RB),(takes,VB),(hours,NN),(repeated,VB),(calling,VB),(get,VB),(answer,NN),(time,NN),(wants,VB),(deal,VB),(have,VB),(run,VB),(problem,NN),(many,JJ),(other,JJ),(doctors,NN),(just,RB),(do,VB),(n't,RB),(get,VB),(have,VB),(office,NN),(workers,NN),(have,VB),(patients,NN),(medical,JJ),(needs,NN),(n't,RB),(anyone,NN),(answering,VB),(phone,NN),('s,VB),(incomprehensible,NN),(not,RB),(work,VB),(aggravation,NN),('s,VB),(regret,NN),(feel,VB),(have,VB),(give,VB),(Goldberg,NNP),(stars,NN))

Sentence split:
{(Unfortunately, the frustration of being Dr. Goldberg's patient is a repeat of the experience I've had with so many other doctors in NYC -- good doctor, terrible staff.),(It seems that his staff simply never answers the phone.),(It usually takes 2 hours of repeated calling to get an answer.),(Who has time for that or wants to deal with it?),(I have run into this problem with many other doctors and I just don't get it.),(You have office workers, you have patients with medical needs, why isn't anyone answering the phone?),(It's incomprehensible and not work the aggravation.),(It's with regret that I feel that I have to give Dr. Goldberg 2 stars.)}
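The stop-word removal seen in the review example can be approximated with a few lines of Python. This is only a sketch: the presentation's pipeline ran in Java on Hadoop, and the tiny `STOP_WORDS` set and the function name `remove_stop_words` below are hypothetical stand-ins for whatever list was actually used.

```python
import string

# Hypothetical, deliberately tiny stop-word list for illustration.
STOP_WORDS = {"the", "of", "being", "is", "a", "had", "with",
              "so", "many", "other", "in"}


def remove_stop_words(text):
    """Drop whitespace-separated tokens whose punctuation-stripped,
    lowercased form is in STOP_WORDS; keep all other tokens unchanged."""
    kept = []
    for token in text.split():
        core = token.strip(string.punctuation).lower()
        if core in STOP_WORDS:
            continue
        kept.append(token)
    return " ".join(kept)


sentence = ("Unfortunately, the frustration of being Dr. Goldberg's patient "
            "is a repeat of the experience")
cleaned = remove_stop_words(sentence)
print(cleaned)
# prints: Unfortunately, frustration Dr. Goldberg's patient repeat experience
```

Tokens keep their original punctuation and casing, which matches the look of the filtered text in the example.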

13
Retrieving the parts of speech (verbs, nouns, adjectives, etc.) from the sentence using the Stanford NLP parser.
Using SentiWordNet to find the positive and negative values related to each part of speech.
Summing up the positive and negative values obtained to calculate a net positive and a net negative value for the sentence.
SentiWordNet: a lexical resource for opinion mining
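The summing step can be sketched in Python as below. The scores in `LEXICON` are invented placeholders, not real SentiWordNet values, and the lookup by (word, tag) pair and the function name `sentence_scores` are assumptions made for illustration.

```python
# Hypothetical (word, POS tag) -> (positive, negative) scores.
# These numbers are placeholders, NOT actual SentiWordNet entries.
LEXICON = {
    ("good", "JJ"): (0.75, 0.0),
    ("terrible", "JJ"): (0.0, 0.625),
}


def sentence_scores(tagged_words):
    """Sum positive and negative scores over the tagged words of a sentence.
    Words missing from the lexicon contribute (0.0, 0.0)."""
    net_positive = 0.0
    net_negative = 0.0
    for word, tag in tagged_words:
        positive, negative = LEXICON.get((word.lower(), tag), (0.0, 0.0))
        net_positive += positive
        net_negative += negative
    return net_positive, net_negative


# Tagged words taken from the phrase "good doctor, terrible staff".
tagged = [("good", "JJ"), ("doctor", "NN"), ("terrible", "JJ"), ("staff", "NN")]
scores = sentence_scores(tagged)
print(scores)
```

Keeping the two totals separate, rather than collapsing them into one number, preserves the case where a sentence is both strongly positive and strongly negative.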
