Lycos Retriever: An Information Fusion Engine Brian Ulicny.

Retriever: Directory Page

Retriever: Image Selection

Retriever: Subtopic Page

Why Retriever?
- Topical queries vastly outnumber questions.
- Standard search results are too numerous and contain junk, even in the top 10 results, due to SEO efforts.
- Topical summaries answer “What do I need to know about ?”
- Topic summary resources like Wikipedia have become increasingly popular.
- But Wikipedia depends on human effort, so coverage is uneven and idiosyncratic.
- Wikipedia reflects the point of view of the most engaged or partisan contributor.
- Retriever as an automatically updated first-draft Wikipedia.

Retriever: Processes
1. Mine query logs for Topics
2. Categorize Topics
   - Naïve Bayesian categorizer built on DMOZ pages; name guesser
3. Disambiguate Topics
   - Disambiguator trained on DMOZ
4. Formulate Document Retrieval Query
5. Parse Retrieved Documents
6. Identify allowed alternate/reduced forms of Topic based on Category
8. Select Paragraphs
   - Must have Topic as Discourse Topic
9. Identify Best Images
10. Delete Duplicate Paragraphs
   - Near duplicates, too.
11. Arrange Paragraphs by Verb
   - What is it? What does it have? What has it done? What happened to it?
12. Select Subtopics
13. Do editorial fixes on Passages
14. Construct Page/Directory
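The slides do not say how step 10 detects near duplicates. As an illustration only, one common approach is to compare paragraphs by word shingles and Jaccard overlap. The function names, the shingle size `k=3`, and the `0.8` threshold below are assumptions made for this sketch, not details of Retriever.

```python
import re


def shingles(text, k=3):
    """Return the set of k-word shingles in a paragraph (lowercased)."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two shingle sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def drop_near_duplicates(paragraphs, threshold=0.8):
    """Keep the first paragraph of each near-duplicate group.

    The threshold is a hypothetical value chosen for illustration.
    """
    kept, kept_shingles = [], []
    for paragraph in paragraphs:
        s = shingles(paragraph)
        if all(jaccard(s, t) < threshold for t in kept_shingles):
            kept.append(paragraph)
            kept_shingles.append(s)
    return kept
```

Punctuation and case are ignored, so a mirrored copy that differs only in formatting is dropped, while paragraphs that share little wording are both kept.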

Paragraph Filters
Must Have:
- Some form of Topic as Discourse Topic
- At least 3 grammatical sentences
Should Have:
- Highest number of unique NPs
Must NOT Have:
- Any exophors (except in quotations)
- Topic-insertion spam
    The American Civil Herbal Viagra War was fought Herbal Viagra…
- Too many mentions of the topic
- (Erotic) fan fiction or obscenities
- Search engine snippets
- Duplicates
    Wikipedia mirrors are everywhere
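Two of the filters above (the three-sentence minimum and the cap on topic mentions) can be sketched without a parser. This is a rough approximation, not Retriever's implementation: the regex sentence splitter, the literal string match on the topic, and the `max_mentions=4` cap are all assumptions, and the real system checks grammaticality and alternate topic forms, which this sketch does not.

```python
import re


def split_sentences(paragraph):
    """Crude splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", paragraph.strip())
    return [p for p in parts if p]


def topic_mentions(paragraph, topic):
    """Count case-insensitive literal occurrences of the topic string."""
    return len(re.findall(re.escape(topic), paragraph, flags=re.IGNORECASE))


def passes_filters(paragraph, topic, min_sentences=3, max_mentions=4):
    """Apply the sentence-count and mention-count filters.

    max_mentions is a hypothetical cap; the slides give no number.
    """
    if len(split_sentences(paragraph)) < min_sentences:
        return False
    mentions = topic_mentions(paragraph, topic)
    # Zero mentions: the topic cannot be the discourse topic.
    # Too many mentions: likely topic-insertion spam.
    return 0 < mentions <= max_mentions
```

A paragraph stuffed with a repeated phrase, as in the Herbal Viagra example, fails the mention cap even when it has enough sentences.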

Subtopics
- Use best chunks for Overview page(s)
- Identify topic superstrings
    Topic: Marie Curie
    Superstring: Marie Curie Fellowship; MC Institute
- Else cluster by frequent common NPs
- Take into account reduced mentions:
    Topic: Charlie Sheen; most frequent NP: Richards
    But Subtopic should be: ‘Denise Richards’
    However: “new” is not always “New York”
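The superstring and reduced-mention ideas above can be illustrated with plain token matching. This is a minimal sketch under stated assumptions: noun phrases are taken as already extracted, the function names are hypothetical, and abbreviated superstrings such as "MC Institute" are not handled.

```python
from collections import Counter


def contains_phrase(tokens, phrase):
    """True if `phrase` occurs as a contiguous run inside `tokens`."""
    n = len(phrase)
    return any(tokens[i:i + n] == phrase for i in range(len(tokens) - n + 1))


def topic_superstrings(topic, noun_phrases):
    """Noun phrases that strictly contain the topic, e.g. 'Marie Curie Fellowship'."""
    topic_tokens = topic.lower().split()
    found = []
    for phrase in noun_phrases:
        tokens = phrase.lower().split()
        if len(tokens) > len(topic_tokens) and contains_phrase(tokens, topic_tokens):
            if phrase not in found:
                found.append(phrase)
    return found


def expand_reduced_mention(short_np, noun_phrases):
    """Map a reduced mention such as 'Richards' to the most frequent
    longer noun phrase ending in it; return it unchanged if none exists."""
    short = short_np.lower()
    counts = Counter(
        phrase for phrase in noun_phrases
        if len(phrase.split()) > 1 and phrase.lower().split()[-1] == short
    )
    return counts.most_common(1)[0][0] if counts else short_np
```

The expansion is deliberately limited to matches on the final word. As the slide warns with “new” versus “New York”, expanding every shared token would over-generate.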

Coherence
- Pseudo-coherence achieved by stringing together paragraphs with same Discourse Topic.
- Discourse Topic is based on form and position of phrase.
  - As (a) subject of first sentence:
      Police said that Lindsay Lohan was charged…
  - Or in fronted material:
      For Lindsay Lohan, 2005 was full of surprises…
- Not the statistical notion of aboutness usual in IR.
- Information packaged by paying attention to the information conveyed by verb/predicate.
- Alternate (but not anaphoric) references provide variety.
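The positional notion of Discourse Topic can be approximated with surface patterns on the first sentence. Retriever parses its documents, so the sketch below is only a stand-in: the three patterns, the preposition list, and the function names are assumptions chosen to cover the two example sentences on the slide.

```python
import re

# Hypothetical list of prepositions that can front a topic phrase.
FRONTING_PREPOSITIONS = ("for", "with", "to", "according to")


def first_sentence(paragraph):
    """Return the text up to the first sentence-ending mark."""
    return re.split(r"(?<=[.!?…])\s+", paragraph.strip(), maxsplit=1)[0]


def is_discourse_topic(paragraph, topic_forms):
    """True if any allowed form of the topic sits in a topic position
    of the first sentence, judged by surface patterns only."""
    lowered = first_sentence(paragraph).lower()
    for form in topic_forms:
        f = form.lower()
        # Pattern 1: the sentence opens with the topic (stand-in for "subject").
        if re.match(re.escape(f) + r"\b", lowered):
            return True
        # Pattern 2: fronted material, e.g. "For <Topic>, ...".
        if "," in lowered:
            head = lowered.split(",", 1)[0]
            if any(head == f"{prep} {f}" for prep in FRONTING_PREPOSITIONS):
                return True
        # Pattern 3: subject of a reported clause, e.g. "... said that <Topic> ...".
        if re.search(r"\bthat\s+" + re.escape(f) + r"\b", lowered):
            return True
    return False
```

Passing several entries in `topic_forms` corresponds to the allowed alternate and reduced forms of the topic; a mention later in the sentence, such as an object, does not count.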

Similar Work
- FactBites.com
    Sentence extraction; grouped by source
- Strzalkowski and Colleagues (GE)
    Summarization by paragraph extraction
- Google Current (Current TV)
    Features on top-gaining queries
- Artequakt (EU funded; U of Southampton UK)
    Create artist bios; convert found texts to logical format; NLG from logical representation.
- Document Understanding Conference (DUC)
    “Summarization as Information Synthesis for Task”
    Sentence-level fusion; no IR component
- Black Hat: Spam Blogs

Evaluation
- Categorization (982 topics): 93.5% precision (revised)
- Disambiguation (100 topics): 83% unambiguous (live)
    If it isn’t ambiguous in DMOZ, we don’t disambiguate.
- Chunking (642 chunks): 88.8% relevant (83.4% relevant as categorized)
- Subtopics (1861 chunks): 88.5% chunks relevant to subtopic (live)
- Images (83 images): 85.5% relevant (revised)

Retriever Goals
- Generate topical summaries on popular topics
  - By extracting and arranging paragraphs from source documents
  - In a coherent, readable and attractive structure
  - Consisting of overview and subtopics
- Monetize with focused advertisements
- Allow spiders to crawl to generate traffic
- Abide by Fair Use/Copyright Laws
- Much more to be done
    Temporal ordering, hyperlinking, anaphora, 2nd pass for subtopics, …

Questions?
Lycos Retriever: An Information Fusion Engine
Brian Ulicny, Versatile Information Systems
Lycos Retriever is currently not being updated, and images are not live.