Lycos Retriever: An Information Fusion Engine Brian Ulicny.

1 Lycos Retriever: An Information Fusion Engine Brian Ulicny

2 Retriever: Directory Page

3 Retriever: Image Selection

4 Retriever: Subtopic Page

5 Why Retriever?
- Topical queries vastly outnumber questions.
- Standard search results are too numerous and contain junk, even in the top 10 results, due to SEO efforts.
- Topical summaries answer “What do I need to know about [topic]?”
- Topic summary resources like Wikipedia have become increasingly popular.
- But Wikipedia depends on human effort, so coverage is uneven and idiosyncratic.
- Wikipedia reflects the point of view of the most engaged or partisan contributor.
- Retriever as an automatically updated first-draft Wikipedia.

6 Retriever: Processes
1. Mine query logs for Topics
2. Categorize Topics
   - Naïve Bayesian categorizer built on DMOZ pages; name guesser
3. Disambiguate Topics
   - Disambiguator trained on DMOZ
4. Formulate Document Retrieval Query
5. Parse Retrieved Documents
6. Identify allowed alternate/reduced forms of Topic based on Category
8. Select Paragraphs
   - Must have Topic as Discourse Topic
9. Identify Best Images
10. Delete Duplicate Paragraphs
   - Near duplicates, too.
11. Arrange Paragraphs by Verb
   - What is it? What does it have? What has it done? What happened to it?
12. Select Subtopics
13. Do editorial fixes on Passages
14. Construct Page/Directory
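The slides do not say how step 10 detects near duplicates. As one possible illustration, the sketch below compares paragraphs by word shingles and Jaccard similarity, a common technique for this task. The function names, the shingle size `k=3`, and the `threshold=0.6` cutoff are all assumptions made for the example, not details of Retriever.

```python
import re

def shingles(text, k=3):
    """Return the set of k-word shingles in a paragraph (lowercased)."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        # Very short text: treat the whole word sequence as one shingle.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def drop_near_duplicates(paragraphs, threshold=0.6):
    """Keep the first paragraph of each group of near duplicates.

    The threshold is a hypothetical value chosen for illustration.
    """
    kept = []
    kept_shingles = []
    for paragraph in paragraphs:
        s = shingles(paragraph)
        # Keep the paragraph only if it is dissimilar to everything kept so far.
        if all(jaccard(s, t) < threshold for t in kept_shingles):
            kept.append(paragraph)
            kept_shingles.append(s)
    return kept
```

This pairwise comparison is quadratic in the number of paragraphs, which is acceptable for the handful of candidate paragraphs on one topic; a larger collection would call for something like MinHash.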

7 Paragraph Filters
Must have:
- Some form of Topic as Discourse Topic
- At least 3 grammatical sentences
Should have:
- Highest number of unique NPs
Must NOT have:
- Any exophors (except in quotations)
- Topic-insertion spam, e.g. “The American Civil Herbal Viagra War was fought Herbal Viagra…”
- Too many mentions of the topic
- (Erotic) fan fiction or obscenities
- Search engine snippets
- Duplicates (Wikipedia mirrors are everywhere)
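A few of these filters can be approximated with surface checks alone. The sketch below is only an illustration under stated assumptions: the function names, the `max_mentions=5` limit, and the caller-supplied `blocklist` are hypothetical, and "topic in the first sentence" is a crude stand-in for the Discourse Topic test. Exophor detection, grammaticality, and snippet detection are not attempted here.

```python
import re

def split_sentences(paragraph):
    """Crude splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", paragraph.strip()) if s]

def passes_filters(paragraph, topic_forms, max_mentions=5, blocklist=frozenset()):
    """Return True if the paragraph survives the simplified filters.

    topic_forms: allowed surface forms of the topic. Overlapping forms
        (e.g. "Marie Curie" and "Curie") would be double-counted.
    max_mentions: hypothetical cap standing in for "too many mentions".
    blocklist: lowercase words the caller treats as obscene.
    """
    sentences = split_sentences(paragraph)
    if len(sentences) < 3:  # must have at least 3 sentences
        return False
    lowered = paragraph.lower()
    forms = [f.lower() for f in topic_forms]
    # Stand-in for the Discourse Topic test: a topic form in the first sentence.
    first = sentences[0].lower()
    if not any(f in first for f in forms):
        return False
    # Reject topic-stuffed paragraphs.
    if sum(lowered.count(f) for f in forms) > max_mentions:
        return False
    # Reject paragraphs containing any blocklisted word.
    if set(re.findall(r"\w+", lowered)) & set(blocklist):
        return False
    return True
```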

8 Subtopics
- Use best chunks for Overview page(s)
- Identify topic superstrings
  - Topic: Marie Curie
  - Superstring: Marie Curie Fellowship; MC Institute
- Else cluster by frequent common NPs
- Take into account reduced mentions:
  - Topic: Charlie Sheen; most frequent NP: Richards
  - But Subtopic should be: ‘Denise Richards’
  - However: “new” is not always “New York”
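The superstring and reduced-mention ideas can be sketched over a list of already-extracted noun phrases. This is a minimal illustration, not Retriever's method: both function names are hypothetical, matching is plain word-sequence comparison, and abbreviated forms such as "MC Institute" are not handled.

```python
from collections import Counter

def topic_superstrings(topic, noun_phrases):
    """Count noun phrases that contain the topic as a contiguous word
    sequence and are strictly longer than the topic itself."""
    topic_words = topic.lower().split()
    n = len(topic_words)
    counts = Counter()
    for phrase in noun_phrases:
        words = phrase.lower().split()
        if len(words) <= n:
            continue
        if any(words[i:i + n] == topic_words for i in range(len(words) - n + 1)):
            counts[phrase] += 1
    return counts.most_common()

def expand_reduced_mention(short_np, noun_phrases):
    """Map a reduced mention (e.g. a bare surname) to the most frequent
    longer noun phrase that ends with it; return it unchanged if none exists.

    Matching on the END of the phrase suits surnames. Expanding from the
    start would be riskier, as the "new" vs. "New York" caveat suggests.
    """
    short_words = short_np.lower().split()
    k = len(short_words)
    counts = Counter(
        phrase for phrase in noun_phrases
        if len(phrase.split()) > k and phrase.lower().split()[-k:] == short_words
    )
    if not counts:
        return short_np
    return counts.most_common(1)[0][0]
```

With noun phrases drawn from pages about Charlie Sheen, `expand_reduced_mention("Richards", nps)` would return whichever full name ending in "Richards" occurs most often in `nps`.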

9 Coherence
- Pseudo-coherence is achieved by stringing together paragraphs with the same Discourse Topic.
- Discourse Topic is based on form and position of phrase:
  - As (a) subject of first sentence: “Police said that Lindsay Lohan was charged…”
  - Or in fronted material: “For Lindsay Lohan, 2005 was full of surprises…”
  - Not the statistical notion of aboutness usual in IR.
- Information is packaged by paying attention to the information conveyed by the verb/predicate.
- Alternate (but not anaphoric) references provide variety.
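Retriever parses documents, so its Discourse Topic test presumably relies on syntactic structure. Without a parser, the positional part of the idea can only be roughly approximated. The sketch below uses a purely positional proxy: the topic must begin within the first few words of the first sentence. The function name and the `window=6` value are assumptions for illustration, and this proxy cannot tell a true subject from any other early mention.

```python
import re

def looks_like_discourse_topic(paragraph, topic, window=6):
    """Positional proxy for the Discourse Topic test.

    Returns True when `topic` starts within the first `window` words of
    the paragraph's first sentence. This covers both slide examples
    (an embedded subject and fronted material) but is NOT a syntactic check.
    """
    # Take the first sentence using a crude punctuation-based split.
    first = re.split(r"(?<=[.!?])\s+", paragraph.strip(), maxsplit=1)[0]
    lowered = first.lower()
    pos = lowered.find(topic.lower())
    if pos == -1:
        return False
    # Number of whitespace-separated words that precede the topic.
    words_before = len(lowered[:pos].split())
    return words_before < window
```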

10 Similar Work
- FactBites.com
  - Sentence extraction; grouped by source
- Strzalkowski and colleagues (GE)
  - Summarization by paragraph extraction
- Google Current (Current TV)
  - Features on top-gaining queries
- Artequakt (EU funded; U of Southampton, UK)
  - Create artist bios; convert found texts to logical format; NLG from logical representation.
- Document Understanding Conference (DUC)
  - “Summarization as Information Synthesis for Task”
  - Sentence-level fusion; no IR component
- Black Hat: Spam Blogs

11 Evaluation
- Categorization (982 topics): 93.5% precision (revised)
- Disambiguation (100 topics): 83% unambiguous (live)
  - If it isn’t ambiguous in DMOZ, we don’t disambiguate.
- Chunking (642 chunks): 88.8% relevant (83.4% relevant as categorized)
- Subtopics (1861 chunks): 88.5% of chunks relevant to subtopic (live)
- Images (83 images): 85.5% relevant (revised)

12 Retriever Goals
- Generate topical summaries on popular topics
  - By extracting and arranging paragraphs from source documents
  - In a coherent, readable and attractive structure
  - Consisting of overview and subtopics
- Monetize with focused advertisements
- Allow spiders to crawl to generate traffic
- Abide by Fair Use/Copyright Laws
- Much more to be done
  - Temporal ordering, hyperlinking, anaphora, 2nd pass for subtopics, …

13 Questions?
Lycos Retriever: An Information Fusion Engine
Brian Ulicny, Versatile Information Systems
Bulicny@vistology.com
Lycos Retriever: http://www.lycos.com/retriever.html (currently not being updated, and images not live)

