Yin Yang (Hong Kong University of Science and Technology) Nilesh Bansal (University of Toronto) Wisam Dakka (Google) Panagiotis Ipeirotis (New York University)


Yin Yang (Hong Kong University of Science and Technology) Nilesh Bansal (University of Toronto) Wisam Dakka (Google) Panagiotis Ipeirotis (New York University) Nick Koudas (University of Toronto) Dimitris Papadias (Hong Kong University of Science and Technology)

• Explosion of Web 2.0 content
  - blogs, micro-blogs, social networking
• Need for “cross reference” on the web
  - after reading a news article, we wonder whether any blogs discuss it
  - and vice versa

• A service of the BlogScope system
  - a real blog search engine serving 20K users/day
• Input: a text document
• Output: relevant blog posts
• Methodology
  - extract key phrases from the input document
  - use these phrases to query BlogScope

• Novel Query-by-Document (QBD) model
• Practical phrase extractor
• Phrase set enhancement with Wikipedia knowledge (QBD-W)
• Evaluation of all proposed methods using Amazon Mechanical Turk
  - Human annotators take the tasks seriously because they are paid for them

• Example of relevance feedback (RF)
• Distinctions between RF and QBD
  - RF involves user interaction, while QBD does not
  - RF is most effective for improving recall, whereas QBD aims at both high precision and high recall
  - RF starts with a keyword query; QBD takes a document directly as input

• Two classes of methods
  - Very slow but accurate, from the machine learning community
  - Practical, but less accurate than the above (our method falls into this category)
• Phrase extraction in QBD has distinct goals
  - Document retrieval accuracy is more important than the accuracy of the phrase set itself
  - A better phrase extractor is not necessarily more suitable for QBD, as shown in our experiments

• Query expansion
  - Used when the user’s keyword set does not express her intent properly
• PageRank, TrustRank, …
  - QBD-W follows this framework
• Wikipedia mining

• Recall that Query-by-Document
  - Extracts key phrases from the input document
  - And then queries a search engine with them
• Idea: given a query document D
  - Identify all phrases in D
  - Score each individual phrase
  - Obtain the set of phrases with the highest scores, and refine it

• Process the document with a Part-of-Speech (POS) tagger
  - Nouns, adjectives, verbs, …
• We compiled a list of POS patterns
  - Indexed by a POS trie forest
• Each term sequence following such a POS pattern is considered a phrase
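The pattern-matching step above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the document is already tagged with single-letter POS tags (N, J, C, V, ...), it uses one plain trie where the slides mention a trie forest, and the names `build_trie` and `extract_phrases` are hypothetical.

```python
# Hypothetical sketch: match POS patterns against a tagged token sequence.
# Assumes single-letter tags; the pattern list comes from the slide's table.
PATTERNS = ["N", "JN", "NN", "JJN", "NNN", "JCJN", "JNNN", "NNNN", "NNNNN"]
END = "$"  # marker key for "a complete pattern ends here"


def build_trie(patterns):
    """Build a nested-dict trie keyed by POS tag."""
    root = {}
    for pattern in patterns:
        node = root
        for tag in pattern:
            node = node.setdefault(tag, {})
        node[END] = True
    return root


def extract_phrases(tagged, trie):
    """Return every term sequence whose tag sequence matches a pattern.

    tagged: list of (word, tag) pairs in document order.
    """
    phrases = []
    for start in range(len(tagged)):
        node = trie
        for end in range(start, len(tagged)):
            tag = tagged[end][1]
            if tag not in node:
                break  # no pattern continues with this tag
            node = node[tag]
            if END in node:
                words = [word for word, _ in tagged[start:end + 1]]
                phrases.append(" ".join(words))
    return phrases


TRIE = build_trie(PATTERNS)
sample = [("global", "J"), ("warming", "N"), ("is", "V"), ("real", "J")]
print(extract_phrases(sample, TRIE))
```

Every matching sequence is kept at this stage, including overlapping ones; scoring and duplicate elimination happen later in the pipeline.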

Pattern   Instance
N         Nintendo
JN        global warming
NN        Apple computer
JJN       declarative approximate selection
NNN       computer science department
JCJN      efficient and effective algorithm
JNNN      Junior United States Senator
NNNN      Microsoft Host Integration Server
…         …
NNNNN     United States President Barack Obama

• Two scoring functions
  - f_t, based on TF/IDF
  - f_l, based on the concept of mutual information

• Extracts the most characteristic phrases from the input document D
• But may produce term sequences that are not really phrases
  - Example: “moment Dow Jones” in “at this moment Dow Jones”
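A TF/IDF-style phrase score could look like the sketch below. The slides do not give the formula for f_t, so the per-term sum, the smoothed IDF, and the names `tfidf_phrase_score`, `doc_freq`, and `num_docs` are all assumptions made for illustration.

```python
import math
from collections import Counter


def tfidf_phrase_score(phrase, doc_tokens, doc_freq, num_docs):
    """Sum tf * idf over the terms of a phrase (assumed form, not the paper's f_t).

    doc_tokens: lower-cased tokens of the input document D.
    doc_freq:   term -> number of background documents containing the term.
    num_docs:   size of the background collection.
    """
    term_freq = Counter(doc_tokens)
    score = 0.0
    for term in phrase.split():
        # Add-one smoothing keeps unseen terms from dividing by zero.
        idf = math.log(num_docs / (1 + doc_freq.get(term, 0)))
        score += term_freq[term] * idf
    return score


tokens = ["dow", "jones", "fell", "dow"]
freqs = {"dow": 9, "jones": 9, "fell": 99}
print(tfidf_phrase_score("dow jones", tokens, freqs, num_docs=100))
```

A score of this kind rewards terms that are frequent in D and rare elsewhere, but it says nothing about whether the terms belong together, which is why a sequence such as “moment Dow Jones” can still score well.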

• Mutual information (MI): measures how much more often a pair of events co-occurs than their individual probabilities would predict
• Eliminates non-phrases
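To illustrate how a mutual-information score can separate real phrases from accidental term sequences, here is a sketch using pointwise mutual information on a pair of adjacent terms. The slides do not spell out f_l, so this form, the function name `pmi`, and the corpus counts in the example are assumptions.

```python
import math


def pmi(pair_count, x_count, y_count, total):
    """Pointwise mutual information: log( p(x, y) / (p(x) * p(y)) ).

    All counts are taken over the same background corpus of `total` positions.
    """
    if pair_count == 0:
        return float("-inf")  # the pair never co-occurs
    p_xy = pair_count / total
    p_x = x_count / total
    p_y = y_count / total
    return math.log(p_xy / (p_x * p_y))


# Made-up counts: "dow jones" nearly always appears together,
# while "moment dow" co-occurs less often than chance would predict.
print(pmi(pair_count=90, x_count=100, y_count=100, total=10_000))
print(pmi(pair_count=1, x_count=1_000, y_count=100, total=10_000))
```

With these counts the first value is positive and the second is negative, so a threshold on the score would keep “Dow Jones” and reject “moment Dow”.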

• Take the top-k phrases with the highest scores
• Eliminate duplicates
  - Two different phrases may carry similar meanings
  - Remove phrases that
    ▪ Are subsumed by another phrase with a higher score
    ▪ Differ from a better phrase only in the last term
    ▪ And other rules …
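The two elimination rules named above can be sketched as follows. Only those two rules are covered (the "other rules" are not specified in the slides), and the function names are hypothetical.

```python
def is_sublist(small, big):
    """True if `small` occurs as a contiguous run inside `big`."""
    n = len(small)
    return any(big[i:i + n] == small for i in range(len(big) - n + 1))


def refine_phrases(scored, k):
    """Take the top-k phrases by score, then drop near-duplicates.

    scored: dict mapping phrase -> score.
    """
    ranked = sorted(scored, key=scored.get, reverse=True)[:k]
    kept = []
    for phrase in ranked:  # best first, so `kept` only holds better phrases
        terms = phrase.split()
        redundant = False
        for better in kept:
            better_terms = better.split()
            subsumed = is_sublist(terms, better_terms)
            # Same length, same prefix, different final term. Single-term
            # phrases are excluded because their prefix is trivially equal.
            last_term_only = (
                len(terms) > 1
                and len(terms) == len(better_terms)
                and terms[:-1] == better_terms[:-1]
            )
            if subsumed or last_term_only:
                redundant = True
                break
        if not redundant:
            kept.append(phrase)
    return kept


candidates = {
    "united states senator": 0.9,
    "united states": 0.7,
    "united states senate": 0.6,
    "global warming": 0.5,
    "warming": 0.2,
}
print(refine_phrases(candidates, k=4))
```

Because near-duplicates are removed after the cut, the refined set can contain fewer than k phrases.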

• Motivation:
  - The user may also be interested in web documents that are related to the given one but do not contain the same key phrases
  - Example: after reading an article on Michelle Obama, the user may also want to learn about her husband and past American presidents
• Main idea:
  - Obtain an initial phrase set with QBD
  - Use Wikipedia knowledge to identify phrases that are related to the initial phrases
  - Our method follows the spreading-activation framework

• Given an initial phrase set
  - Locate the nodes corresponding to these phrases on the Wiki Graph
  - Assign weights to these nodes
  - Iteratively spread node weights to neighbors
    ▪ Assume the random surfer model
    ▪ With a certain probability, return to one of the initial nodes

• S is the initial phrase set
• Initial weights are normalized
• s(c_v) is the score of c_v, assigned by QBD

               Wii    Sony   Nintendo   Play Station   Tomb Raider
Wii            0      2/10   7/10       1/10           0
Sony           0      0      0          4/4            0
Nintendo       5/6    1/6    0          0              0
Play Station   2/11   6/11   1/11       0              2/11
Tomb Raider    0      0      0          1/1            0

• With probability α_v′, proceed to a neighbor
• Otherwise, return to one of the initial nodes
• α_v′ is a function of the node v′

• α_v is not a constant, unlike in other algorithms (e.g., TrustRank)
• α_v gets smaller, and eventually drops to zero, for nodes increasingly farther away from the initial ones
• Reduces the CPU overhead of RelevanceRank computation, since only a subset of nodes is considered
• Important, as RelevanceRank is calculated online
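The random walk with a node-dependent α can be sketched as below, using the five-node transition table from the earlier slide. The slides do not give the actual form of α_v, so the linear decay with hop distance (`alpha0 - decay * distance`, floored at zero), the seed scores, and the name `relevance_rank` are assumptions for illustration only.

```python
from collections import deque

# Transition probabilities from the example table (row -> column).
P = {
    "Wii": {"Sony": 2 / 10, "Nintendo": 7 / 10, "Play Station": 1 / 10},
    "Sony": {"Play Station": 4 / 4},
    "Nintendo": {"Wii": 5 / 6, "Sony": 1 / 6},
    "Play Station": {"Wii": 2 / 11, "Sony": 6 / 11,
                     "Nintendo": 1 / 11, "Tomb Raider": 2 / 11},
    "Tomb Raider": {"Play Station": 1 / 1},
}


def hop_distances(graph, seeds):
    """Breadth-first hop distance from the nearest seed node."""
    dist = {s: 0 for s in seeds}
    queue = deque(seeds)
    while queue:
        u = queue.popleft()
        for v in graph[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def relevance_rank(graph, seed_scores, alpha0=0.8, decay=0.4, iters=50):
    """Random walk that restarts at the seeds with probability 1 - alpha_v."""
    total = sum(seed_scores.values())
    start = {v: seed_scores.get(v, 0.0) / total for v in graph}  # normalized
    dist = hop_distances(graph, list(seed_scores))
    # Assumed decay: alpha shrinks with distance and bottoms out at zero.
    alpha = {v: max(0.0, alpha0 - decay * dist[v]) if v in dist else 0.0
             for v in graph}
    rank = dict(start)
    for _ in range(iters):
        nxt = {v: 0.0 for v in graph}
        restart = 0.0
        for u, mass in rank.items():
            a = alpha[u] if graph[u] else 0.0  # dangling nodes always restart
            for v, p in graph[u].items():
                nxt[v] += a * mass * p
            restart += (1.0 - a) * mass
        for v in graph:
            nxt[v] += restart * start[v]
        rank = nxt
    return rank


scores = relevance_rank(P, {"Wii": 2.0, "Nintendo": 1.0})
for node, value in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{node:13s} {value:.3f}")
```

With these settings Tomb Raider, two hops from the seeds, gets α = 0: it can receive weight but never passes any on, which is how a decaying α limits the set of nodes that take part in the computation.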

[Table: Iteration | Wii | Sony | Nintendo | Play Station; rows … through Infinite; cell values not preserved]

• Methodology
  - Employ human annotators at Amazon MTurk
• Dataset
  - A random sample of news articles from the New York Times, the Economist, Reuters, and the Financial Times during Aug-Sep 2007
• Competitors for phrase extraction
  - QBD-TFIDF (tf-idf scoring)
  - QBD-MI (mutual information scoring)
  - QBD-YAHOO (Yahoo! phrase extractor)

• Quality of Phrase Retrieval
• Quality of Document Retrieval
• Efficiency
  - The total running time of QBD is negligible

[Figure: running time (seconds) versus l_max]

• We propose
  - the query-by-document model
  - two effective phrase extraction algorithms
  - enhancing the phrase set with the Wikipedia graph
• Future work
  - more sophisticated phrase extraction (e.g., with additional background knowledge)
  - blog matching using key phrases