Towards the Web of Concepts: Extracting Concepts from Large Datasets Aditya G. Parameswaran Stanford University Joint work with: Hector Garcia-Molina (Stanford)

Slides:



Advertisements
Similar presentations
Automatic Timeline Generation from News Articles Josh Taylor and Jessica Jenkins.
Advertisements

Date : 2013/05/27 Author : Anish Das Sarma, Lujun Fang, Nitin Gupta, Alon Halevy, Hongrae Lee, Fei Wu, Reynold Xin, Gong Yu Source : SIGMOD’12 Speaker.
Active Learning for Streaming Networked Data Zhilin Yang, Jie Tang, Yutao Zhang Computer Science Department, Tsinghua University.
Large-Scale Entity-Based Online Social Network Profile Linkage.
WWW 2014 Seoul, April 8 th SNOW 2014 Data Challenge Two-level message clustering for topic detection in Twitter Georgios Petkos, Symeon Papadopoulos, Yiannis.
TI: An Efficient Indexing Mechanism for Real-Time Search on Tweets Chun Chen 1, Feng Li 2, Beng Chin Ooi 2, and Sai Wu 2 1 Zhejiang University, 2 National.
Pete Bohman Adam Kunk.  Introduction  Related Work  System Overview  Indexing Scheme  Ranking  Evaluation  Conclusion.
A Linguistic Approach for Semantic Web Service Discovery International Symposium on Management Intelligent Systems 2012 (IS-MiS 2012) July 13, 2012 Jordy.
NYU ANLP-00 1 Automatic Discovery of Scenario-Level Patterns for Information Extraction Roman Yangarber Ralph Grishman Pasi Tapanainen Silja Huttunen.
Bring Order to Your Photos: Event-Driven Classification of Flickr Images Based on Social Knowledge Date: 2011/11/21 Source: Claudiu S. Firan (CIKM’10)
Data Mining Association Analysis: Basic Concepts and Algorithms
Kalervo Järvelin – Issues in Context Modeling – ESF IRiX Workshop - Glasgow THEORETICAL ISSUES IN CONTEXT MODELLING Kalervo Järvelin
Information Retrieval in Practice
SMS-Based Web Search for Low-end Mobile Devices Jay Chen New York University Lakshmi Subramanian New York University Eric Brewer University of California.
Search Engines and Information Retrieval
Query Operations: Automatic Local Analysis. Introduction Difficulty of formulating user queries –Insufficient knowledge of the collection –Insufficient.
LinkSelector: A Web Mining Approach to Hyperlink Selection for Web Portals Xiao Fang University of Arizona 10/18/2002.
INFO 624 Week 3 Retrieval System Evaluation
1 Matching DOM Trees to Search Logs for Accurate Webpage Clustering Deepayan Chakrabarti Rupesh Mehta.
«Tag-based Social Interest Discovery» Proceedings of the 17th International World Wide Web Conference (WWW2008) Xin Li, Lei Guo, Yihong Zhao Yahoo! Inc.,
Active Sampling for Entity Matching Aditya Parameswaran Stanford University Jointly with: Kedar Bellare, Suresh Iyengar, Vibhor Rastogi Yahoo! Research.
Extracting Key Terms From Noisy and Multi-theme Documents Maria Grineva, Maxim Grinev and Dmitry Lizorkin Institute for System Programming of RAS.
Search Engines and Information Retrieval Chapter 1.
An Integrated Approach to Extracting Ontological Structures from Folksonomies Huairen Lin, Joseph Davis, Ying Zhou ESWC 2009 Hyewon Lim October 9 th, 2009.
RuleML-2007, Orlando, Florida1 Towards Knowledge Extraction from Weblogs and Rule-based Semantic Querying Xi Bai, Jigui Sun, Haiyan Che, Jin.
Estimating Importance Features for Fact Mining (With a Case Study in Biography Mining) Sisay Fissaha Adafre School of Computing Dublin City University.
Time-focused density-based clustering of trajectories of moving objects Margherita D’Auria Mirco Nanni Dino Pedreschi.
Discovering the Intrinsic Cardinality and Dimensionality of Time Series using MDL BING HU THANAWIN RAKTHANMANON YUAN HAO SCOTT EVANS1 STEFANO LONARDI EAMONN.
A Simple Unsupervised Query Categorizer for Web Search Engines Prashant Ullegaddi and Vasudeva Varma Search and Information Extraction Lab Language Technologies.
PAUL ALEXANDRU CHIRITA STEFANIA COSTACHE SIEGFRIED HANDSCHUH WOLFGANG NEJDL 1* L3S RESEARCH CENTER 2* NATIONAL UNIVERSITY OF IRELAND PROCEEDINGS OF THE.
Computer Science 101 Database Concepts. Database Collection of related data Models real world “universe” Reflects changes Specific purposes and audience.
Ruirui Li, Ben Kao, Bin Bi, Reynold Cheng, Eric Lo Speaker: Ruirui Li 1 The University of Hong Kong.
Pete Bohman Adam Kunk. What is real-time search? What do you think as a class?
Thanks to Bill Arms, Marti Hearst Documents. Last time Size of information –Continues to grow IR an old field, goes back to the ‘40s IR iterative process.
Exploring Online Social Activities for Adaptive Search Personalization CIKM’10 Advisor : Jia Ling, Koh Speaker : SHENG HONG, CHUNG.
« Performance of Compressed Inverted List Caching in Search Engines » Proceedings of the International World Wide Web Conference Commitee, Beijing 2008)
Pete Bohman Adam Kunk. Real-Time Search  Definition: A search mechanism capable of finding information in an online fashion as it is produced. Technology.
Towards Robust Indexing for Ranked Queries Dong Xin, Chen Chen, Jiawei Han Department of Computer Science University of Illinois at Urbana-Champaign VLDB.
April 14, 2003Hang Cui, Ji-Rong Wen and Tat- Seng Chua 1 Hierarchical Indexing and Flexible Element Retrieval for Structured Document Hang Cui School of.
WEB SEARCH PERSONALIZATION WITH ONTOLOGICAL USER PROFILES Data Mining Lab XUAN MAN.
Aditya Akella The Performance Benefits of Multihoming Aditya Akella CMU With Bruce Maggs, Srini Seshan, Anees Shaikh and Ramesh Sitaraman.
Mining Topic-Specific Concepts and Definitions on the Web Bing Liu, etc KDD03 CS591CXZ CS591CXZ Web mining: Lexical relationship mining.
Contextual Ranking of Keywords Using Click Data Utku Irmak, Vadim von Brzeski, Reiner Kraft Yahoo! Inc ICDE 09’ Datamining session Summarized.
Output URL Bidding Panagiotis Papadimitriou, Hector Garcia-Molina, (Stanford University) Ali Dasdan, Santanu Kolay (Ebay Inc)
MINING COLOSSAL FREQUENT PATTERNS BY CORE PATTERN FUSION FEIDA ZHU, XIFENG YAN, JIAWEI HAN, PHILIP S. YU, HONG CHENG ICDE07 Advisor: Koh JiaLing Speaker:
Jiafeng Guo(ICT) Xueqi Cheng(ICT) Hua-Wei Shen(ICT) Gu Xu (MSRA) Speaker: Rui-Rui Li Supervisor: Prof. Ben Kao.
An Iterative Approach to Extract Dictionaries from Wikipedia for Under-resourced Languages G. Rohit Bharadwaj Niket Tandon Vasudeva Varma Search and Information.
Finding Experts Using Social Network Analysis 2007 IEEE/WIC/ACM International Conference on Web Intelligence Yupeng Fu, Rongjing Xiang, Yong Wang, Min.
Emerging Trend Detection Shenzhi Li. Introduction What is an Emerging Trend? –An Emerging Trend is a topic area for which one can trace the growth of.
1 Masters Thesis Presentation By Debotosh Dey AUTOMATIC CONSTRUCTION OF HASHTAGS HIERARCHIES UNIVERSITAT ROVIRA I VIRGILI Tarragona, June 2015 Supervised.
Threshold Setting and Performance Monitoring for Novel Text Mining Wenyin Tang and Flora S. Tsai School of Electrical and Electronic Engineering Nanyang.
Named Entity Recognition in Query Jiafeng Guo 1, Gu Xu 2, Xueqi Cheng 1,Hang Li 2 1 Institute of Computing Technology, CAS, China 2 Microsoft Research.
Intelligent Database Systems Lab 國立雲林科技大學 National Yunlin University of Science and Technology 1 Improving the performance of personal name disambiguation.
1 Adaptive Subjective Triggers for Opinionated Document Retrieval (WSDM 09’) Kazuhiro Seki, Kuniaki Uehara Date: 11/02/09 Speaker: Hsu, Yu-Wen Advisor:
Extracting and Ranking Product Features in Opinion Documents Lei Zhang #, Bing Liu #, Suk Hwan Lim *, Eamonn O’Brien-Strain * # University of Illinois.
Chapter. 3: Retrieval Evaluation 1/2/2016Dr. Almetwally Mostafa 1.
Identifying “Best Bet” Web Search Results by Mining Past User Behavior Author: Eugene Agichtein, Zijian Zheng (Microsoft Research) Source: KDD2006 Reporter:
TWinner : Understanding News Queries with Geo-content using Twitter Satyen Abrol,Latifur Khan University of Texas at Dallas,Department of Computer Science.
Refined Online Citation Matching and Adaptive Canonical Metadata Construction CSE 598B Course Project Report Huajing Li.
Pete Bohman Adam Kunk.  Introduction  Related Work  System Overview  Indexing Scheme  Ranking  Evaluation  Conclusion.
Instance Discovery and Schema Matching With Applications to Biological Deep Web Data Integration Tantan Liu, Fan Wang, Gagan Agrawal {liut, wangfa,
Information Retrieval in Practice
User Modeling for Personal Assistant
Neighborhood - based Tag Prediction
Wikitology Wikipedia as an Ontology
Toshiyuki Shimizu (Kyoto University)
Information Retrieval
Panagiotis G. Ipeirotis Luis Gravano
Efficient Processing of Top-k Spatial Preference Queries
Fraction-Score: A New Support Measure for Co-location Pattern Mining
Presentation transcript:

Towards the Web of Concepts: Extracting Concepts from Large Datasets Aditya G. Parameswaran Stanford University Joint work with: Hector Garcia-Molina (Stanford) and Anand Rajaraman (Kosmix Corp.) 1

Motivating Examples tax assessors san antonio 2 tax san assessors 324 antonio 2855 tax assessors 273 assessors san < 5 san antonio 2385

Motivating Examples Lord of the rings ◦ Lord of the ◦ Of the rings Microsoft Research Redmond ◦ Microsoft Research ◦ Research Redmond Computer Networks ◦ Computer ◦ Networks 3

The Web of Concepts (WoC) Concepts are: Entities, events and topics People are searching for Web of concepts contains: Concepts Relationships between concepts Metadata on concepts 4 Japanese restaurants in Palo Alto Homma’s Sushi Fuki Sushi Teriyaki bowl Hours: M-F 9-5 Expensive Hours: M-S 5-9 Cheap

How does the WoC help us? Improve search ◦ Find concepts the query relates to ◦ Return metadata  E.g., Homma’s Sushi Timings, Phone No., … ◦ Return related concepts  E.g., Fuki Sushi, … Rank content better Discover intent 5 Japanese restaurants in Palo Alto Homma’s Sushi Fuki Sushi Teriyaki bowl M-F 9-5 Expensive M-S 5-9 Cheap

How to construct the WoC? Standard sources ◦ Wikipedia, Freebase, … ◦ Small fraction of actual concepts ◦ Missing:  restaurants, hotels, scientific concepts, places, … Updating the WoC is critical ◦ Timely results ◦ New events, establishments, …, ◦ Old concepts not already known 66 Japanese restaurants in Palo Alto Homma’s Sushi Fuki Sushi Teriyaki bowl M-F 9-5 Expensive M-S 5-9 Cheap

Desiderata Be agnostic towards ◦ Context ◦ Natural Language 7 Concepts Web-pages Query Logs Tags Tweets Blogs K-grams With frequency Concept Extraction

Our Definition of Concepts Concepts are: ◦ k-grams representing  Real / imaginary entities, events, … that  People are searching for / interested in ◦ Concise  E.g., Harry Potter over The Wizard Harry Potter  Keeps the WoC small and manageable ◦ Popular  Precision higher 8 Concepts K-grams With frequency Concept Extraction

Previous Work Frequent Item-set Mining ◦ Not quite frequent item-sets  k-gram can be a concept even if k-1-gram is not ◦ Different support thresholds required for each k ◦ But, can be used as a first step Term extraction ◦ IR method of extracting terms to populate indexes ◦ Typically uses NLP techniques, and not popularity ◦ One technique that takes popularity into account 9

Notation Sub-concepts of San Antonio : San & Antonio Sub-concepts of San Antonio Texas : ◦ San Antonio & Antonio Texas Super-concepts of San : San Antonio, … Support (San Antonio) = 2385 Pre-confidence of San Antonio: 2385 / Post-confidence of San Antonio: 2385 / K-gramFrequency San14585 Antonio2855 San Antonio2385

Empirical Property Observed on Wikipedia 11 If k-gram a 1 a 2 …a k for k > 2 is a concept then at least one of the sub-concepts a 1 a 2 …a k-1 and a 2 a 3 …a k is not a concept. kBoth ConceptsOne concept or More Lord of the Rings Manhattan Acting School Microsoft Research Redmond Computer Networks

“Indicators” that we look for 1. Popular 2. Scores highly compared to sub- and super-concepts ◦ “Lord of the rings” better than  “Lord of the” and “Of the rings”. ◦ “Lord of the rings” better than  “Lord of the rings soundtrack” 3. Does not represent part of a sentence ◦ “Barack Obama Said Yesterday” ◦ Not required for tags, query logs 12

Outline of Approach S = {} For k = 1 to n ◦ Evaluate all k-grams w.r.t. k-1-grams  Add some k-grams to S  Discard some k-1-grams from S Precisely k-grams until k = n-1 that satisfy indicators are extracted ◦ Under perfect evaluation of concepts w.r.t. sub-concepts ◦ Proof in Paper 13

Detailed Algorithm S = {} For k = 1 to n ◦ For all k-grams s (two sub-concepts r and t)  If support(s) < f(k)  Continue  If min (pre-conf(s), post-conf(s)) > threshold  S = S U {s} – {r, t}  Elseif pre-conf(s) > threshold & >> post-conf(s) & t Є S  S = S U {s} – {r}  Elseif post-conf(s) > threshold & >> pre-conf(s) & r Є S  S = S U {s} – {t} 14 Indicator 1 1 Indicator 2: r & t are not concepts Indicator 2: r is not a concept Indicator 2: t is not a concept

Experiments: Methodology AOL Query Log Dataset ◦ 36M queries and 1.5M unique terms. ◦ Evaluation using Humans (Via M.Turk) ◦ Plus Wikipedia  (For experiments on varying parameters) ◦ Experimentally set thresholds Compared against ◦ C-Value Algorithm:  a term-extraction algorithm with popularity built in ◦ Naïve Algorithm:  simply based on frequency 15

Raw Numbers concepts extracted an absolute precision of 0.95 rated against Wikipedia and Mechanical Turk. For same volume of 2, 3, and 4-gram concepts, our algorithm gave ◦ Fewer absolute errors (369) vs.  C-Value (557) and Naïve (997) ◦ Greater Non-Wiki Precision (0.84) vs.  C-Value (0.75) and Naïve (0.66) 16

Head-to-head Comparison 17

Experiments on varying thresholds 18

On Varying Size of Log 19

Ongoing Work (with A. Das Sarma, H. G.-Molina, N. Polyzotis and J. Widom) How do we attach a new concept c to the web of concepts? ◦ Via human input ◦ But: costly, so need to minimize # questions ◦ Questions of the form: Is c a kind of X? ◦ Equivalent to Human-Assisted Graph Search ◦ Algorithms/Complexity results in T.R. 20

Conclusions & Future Work Adapted frequent-itemset metrics to extract fresh concepts from large datasets of any kind Lots more to do with query logs ◦ Relationship extraction between concepts  Parent-Child  Sibling ◦ Metadata extraction Automatic learning of thresholds 21