1 The BINGO! System for Information Portal Generation and Expert Web Search Sergej Sizov, Michael Biwer, Jens Graupmann, Stefan Siersdorfer, Martin Theobald,

Slides:



Advertisements
Similar presentations
Improvements and extras Paul Thomas CSIRO. Overview of the lectures 1.Introduction to information retrieval (IR) 2.Ranked retrieval 3.Probabilistic retrieval.
Advertisements

Haystack: Per-User Information Environment 1999 Conference on Information and Knowledge Management Eytan Adar et al Presented by Xiao Hu CS491CXZ.
Chapter 5: Introduction to Information Retrieval
Crawling, Ranking and Indexing. Organizing the Web The Web is big. Really big. –Over 3 billion pages, just in the indexable Web The Web is dynamic Problems:
Query Chains: Learning to Rank from Implicit Feedback Paper Authors: Filip Radlinski Thorsten Joachims Presented By: Steven Carr.
Explorations in Tag Suggestion and Query Expansion Jian Wang and Brian D. Davison Lehigh University, USA SSM 2008 (Workshop on Search in Social Media)
A Quality Focused Crawler for Health Information Tim Tang.
“ The Anatomy of a Large-Scale Hypertextual Web Search Engine ” Presented by Ahmed Khaled Al-Shantout ICS
Information Retrieval in Practice
Search Engines and Information Retrieval
WebMiningResearch ASurvey Web Mining Research: A Survey Raymond Kosala and Hendrik Blockeel ACM SIGKDD, July 2000 Presented by Shan Huang, 4/24/2007.
IR & Metadata. Metadata Didn’t we already talk about this? We discussed what metadata is and its types –Data about data –Descriptive metadata is external.
Semantic Web and Web Mining: Networking with Industry and Academia İsmail Hakkı Toroslu IST EVENT 2006.
CS347 Review Slides (IR Part II) June 6, 2001 ©Prabhakar Raghavan.
6/16/20151 Recent Results in Automatic Web Resource Discovery Soumen Chakrabartiv Presentation by Cui Tao.
A Topic Specific Web Crawler and WIE*: An Automatic Web Information Extraction Technique using HPS Algorithm Dongwon Lee Database Systems Lab.
WebMiningResearchASurvey Web Mining Research: A Survey Raymond Kosala and Hendrik Blockeel ACM SIGKDD, July 2000 Presented by Shan Huang, 4/24/2007 Revised.
Chapter 19: Information Retrieval
Mining the Medical Literature Chirag Bhatt October 14 th, 2004.
Information Retrieval
Chapter 5: Information Retrieval and Web Search
Overview of Search Engines
CS344: Introduction to Artificial Intelligence Vishal Vachhani M.Tech, CSE Lecture 34-35: CLIR and Ranking in IR.
Challenges in Information Retrieval and Language Modeling Michael Shepherd Dalhousie University Halifax, NS Canada.
Language Identification of Search Engine Queries Hakan Ceylan Yookyung Kim Department of Computer Science Yahoo! Inc. University of North Texas 2821 Mission.
Search Engines and Information Retrieval Chapter 1.
TREC 2009 Review Lanbo Zhang. 7 tracks Web track Relevance Feedback track (RF) Entity track Blog track Legal track Million Query track (MQ) Chemical IR.
©2008 Srikanth Kallurkar, Quantum Leap Innovations, Inc. All rights reserved. Apollo – Automated Content Management System Srikanth Kallurkar Quantum Leap.
Bringing Order to the Web: Automatically Categorizing Search Results Hao Chen, CS Division, UC Berkeley Susan Dumais, Microsoft Research ACM:CHI April.
1 BINGO! and Daffodil: Personalized Exploration of Digital Libraries and Web Sources Martin Theobald Max-Planck-Institut für Informatik Claus-Peter Klas.
Xiaoying Gao Computer Science Victoria University of Wellington Intelligent Agents COMP 423.
Detecting Semantic Cloaking on the Web Baoning Wu and Brian D. Davison Lehigh University, USA WWW 2006.
Question Answering.  Goal  Automatically answer questions submitted by humans in a natural language form  Approaches  Rely on techniques from diverse.
When Experts Agree: Using Non-Affiliated Experts To Rank Popular Topics Meital Aizen.
Topical Crawlers for Building Digital Library Collections Presenter: Qiaozhu Mei.
CROSSMARC Web Pages Collection: Crawling and Spidering Components Vangelis Karkaletsis Institute of Informatics & Telecommunications NCSR “Demokritos”
WebMining Web Mining By- Pawan Singh Piyush Arora Pooja Mansharamani Pramod Singh Praveen Kumar 1.
Focused Crawling for both Topical Relevance and Quality of Medical Information By Tim Tang, David Hawking, Nick Craswell, Kathy Griffiths CIKM ’05 November,
Xiaoying Gao Computer Science Victoria University of Wellington Intelligent Agents COMP 423.
Math Information Retrieval Zhao Jin. Zhao Jin. Math Information Retrieval Examples: –Looking for formulas –Collect teaching resources –Keeping updated.
Chapter 6: Information Retrieval and Web Search
Introduction to Digital Libraries hussein suleman uct cs honours 2003.
Search Engine Architecture
The Anatomy of a Large-Scale Hyper textual Web Search Engine S. Brin, L. Page Presenter :- Abhishek Taneja.
Collocations and Information Management Applications Gregor Erbach Saarland University Saarbrücken.
Publication Spider Wang Xuan 07/14/2006. What is publication spider Gathering publication pages Using focused crawling With the help of Search Engine.
Searching the web Enormous amount of information –In 1994, 100 thousand pages indexed –In 1997, 100 million pages indexed –In June, 2000, 500 million pages.
Intelligent Web Topics Search Using Early Detection and Data Analysis by Yixin Yang Presented by Yixin Yang (Advisor Dr. C.C. Lee) Presented by Yixin Yang.
Algorithmic Detection of Semantic Similarity WWW 2005.
Digital libraries and web- based information systems Mohsen Kamyar.
Automatic Video Tagging using Content Redundancy Stefan Siersdorfer 1, Jose San Pedro 2, Mark Sanderson 2 1 L3S Research Center, Germany 2 University of.
ProjFocusedCrawler CS5604 Information Storage and Retrieval, Fall 2012 Virginia Tech December 4, 2012 Mohamed M. G. Farag Mohammed Saquib Khan Prasad Krishnamurthi.
Augmenting Focused Crawling using Search Engine Queries Wang Xuan 10th Nov 2006.
Exploring in the Weblog Space by Detecting Informative and Affective Articles Xiaochuan Ni, Gui-Rong Xue, Xiao Ling, Yong Yu Shanghai Jiao-Tong University.
Identifying “Best Bet” Web Search Results by Mining Past User Behavior Author: Eugene Agichtein, Zijian Zheng (Microsoft Research) Source: KDD2006 Reporter:
Xiaoying Gao Computer Science Victoria University of Wellington COMP307 NLP 4 Information Retrieval.
Toward Entity Retrieval over Structured and Text Data Mayssam Sayyadian, Azadeh Shakery, AnHai Doan, ChengXiang Zhai Department of Computer Science University.
Integrated Departmental Information Service IDIS provides integration in three aspects Integrate relational querying and text retrieval Integrate search.
BINGO!: Bookmark-Induced Gathering of Information Sergej Sizov, Martin Theobald, Stefan Siersdorfer, Gerhard Weikum University of the Saarland Germany.
Search Engine and Optimization 1. Introduction to Web Search Engines 2.
(Big) data accessing Prof. Wenwen Li School of Geographical Sciences and Urban Planning 5644 Coor Hall
Vertical Search for Courses of UIUC Homepage Classification The aim of the Course Search project is to construct a database of UIUC courses across all.
School of Computer Science & Engineering
Information Retrieval
Introduction to Information Retrieval
Chapter 5: Information Retrieval and Web Search
Panagiotis G. Ipeirotis Luis Gravano
Observations on DB & IR and Conclusions
Web Mining Research: A Survey
Deep SEARCH 9 A new tool in the box for automatic content classification: DS9 Machine Learning uses Hybrid Semantic AI ConTech November.
Presentation transcript:

1 The BINGO! System for Information Portal Generation and Expert Web Search Sergej Sizov, Michael Biwer, Jens Graupmann, Stefan Siersdorfer, Martin Theobald, Gerhard Weikum, Patrick Zimmer Outline: Problem & research challenge Our work in progress Future research avenues University of the Saarland, Saarbrücken, Germany

2 query = „ARIES recovery open source download“ (In-) Effectivity of Web Search Engines Google (and Yahoo) top 3: CpSc 862 Assignment 6 CpSc 862 Assignment 6... You can open it by double clicking the Mars... (logfunc... and any deviations from the standard Aries recovery... Tutorial: Application Servers and Associated Technologies Tutorial: Application Servers and Associated Technologies... an OIA for his inventions (ARIES, ARIES/IM, Commit_LSN... AppServersTutorial_SIGMOD2002_Slides.pdf Recovery from Spiritual Abuse Recovery from Spiritual Abuse... with you no matter how long your recovery takes... powerful form of spiritual abuse and a source of spiritual... their denial in hopes that... biblestudies/sp_abuse.pdf Sourceforge: BZFlagBZFlag - OpenSource OpenGL Multiplayer Multiplatform battlezone... JBoss.orgJBoss.org - The JBoss/Server is the leading Open Source... Dev-C++Dev-C++ - Dev-C++ is an full-featured Integrated... libraries... Topic –> Database –> Database Engines/Servers But there is hope: exploit structure explore neighborhood use topic directory

3 From Observations to Research Avenues Key observation: yes, there are ways to find what you are searching, but intellectual time is expensive !  requires „intelligent“ automation Research Avenues: Leverage machine learning: automatic classification Structure and annotate information: XML Organize documents „semantically“: ontologies More computer time for better result: focused crawling Goal: should be able to find results for advanced info request in one day with < 5 min intellectual effort that the best human experts can find with infinite time

4 Challenge: Expert Web Queries Which Shakespeare drama has a scene where a woman talks a Scottish nobleman into murder? Which professors from D are teaching DBS and have research projects on XML? D DBS XML Who was the Italian woman that I met at the PC meeting where Moshe Vardi was PC Chair? Find Fermat‘s last / Wiles‘ theorem in MathML format. Where can I download an open source implementation of the ARIES recovery algorithm? Are there any theorems isomorphic to my new conjecture? Find related theorems. What are Chernoff-Hoeffding bounds? Find the text and notes of the western song Raw Hide.

5 Challenge: (Meta-) Portal Building Which gene expression data from Barrett tissue in the esophagus exhibit high levels of gene A01g ? Are there metabolic models for acid reflux that could be related to the gene expression data? Who are the top researchers in the database system community? Who is working on using machine learning techniques for searching XML data? Find information about public subsidies for plumbers. Find new EU regulations that affect an electrician‘s business. What are the most important results in large deviation theory?

6 Our Research Agenda Web Applications Intranet Applications Deep Web Semantic Web Web Service Generator (UDDI, WSDL) BINGO! DB / Cache / Queue Mgr Craw- ler Focus Mgr Classi- fier Filter (XML...) Onto- logies Search Engine XXL Index Mgr Onto- logy Mgr Query Processor Ontology Service Portal Explorer HTML2XML Auto Annotation Materialography Portal Chamber of Crafts Portal...

7 Crawler Classifier Focused Crawling Link Analysis automatially build personal topic directory (Soumen Chakrabarti et al. 1999) Root Semistrutured Data DB Core Technology Web Retrieval Data Mining XML seeds training critical issues: classifier accuracy feature selection quality of training data

8 Root Semistrutured Data DB Core Technology Web Retrieval Data Mining XML Crawler Classifier The BINGO! Focused Crawler Link Analysis seeds training topic-specific archetypes high SVM confidence high authority score re-training

9 Classification using Support Vector Machines (SVM) n training vectors with components (x 1,..., x m, C) with C = +1 oder -1 x1 x2  C CC  ? Compute separating hyperplane that maximizes the min. distance of all positive and negative samples to the hyperplane  solve quadratic programming problem Training: Test new vector for membership in C: Decision: Distance of from hyperplane yields classification confidence

10 BINGO! Adaptive Re-training and Focus Control for each topic V do { archetypes(V) := top docs of SVM confidence ranking  top authorities of V ; remove from archetypes(V) all docs d that do not satisfy confidence(d)  mean confidence(V) ; recompute feature selection based on archetypes(V) ; recompute SVM model for V with archetypes(V) as training data } combine re-training with two-phase crawling strategy: learning phase: aims to find archetypes (high precision)  hard focus for crawling vicinity of training docs harvesting phase: aims to find results (high recall)  soft focus for long-range exploration with tunnelling

11 Feature Space Construction & Meta Strategies single term frequencies or tf*idf with top n cross-entropy terms term pairs within proximity window (e.g., support vector, match about world championship) text terms from hyperlink neighbors anchor text terms from neighbors (e.g., click here for soccer results ) possible strategies: strategy with best ratio of estimated precision to runtime cost meta strategies (over m feature spaces for class k): unanimous decision: weighted average: with  estimator (Joachims‘00) for precision of model for class k based on leave-one-out training

12 Experiment on Information Portal Generation (1) for single-topic portal on „Database Research“ start with only 2 initial seeds: homepages of DeWitt and Gray goal: automatic gathering of DBLP author homepages (with DBLP excluded from crawl) learning phase for improved feature selection and classification: depth-first crawl limited to domains of seeds followed by archetype selection and retraining (for high precision) harvesting phase for building a rich portal: prioritized breadth-first crawl (for high recall)

13 Experiment on Information Portal Generation (2) result after 12 hours (on commodity PC): 3 mio. URLs crawled on hosts, 1 mio. pages analyzed, 0.5 mio. pages positively classified found 7000 homepages out of DBLP authors, 712 authors of the top 1000 DBLP authors with 267 among the 1000 best crawl results + postprocessing for querying and analysis: ranking by SVM confidence, authority score, etc. clustering, relevance feedback, etc.

Ongoing and Future Work Exploiting ontological knowledge e.g.: search for a „woman talking someone into murder“ Exploiting linguistic analysis methods e.g.: „... cut his throat...“ act: killing subject:...object:... Construct richer feature spaces Deep Web exploration with auto-generated queries Generalized links & semantic joins, e.g. named entities Identifying semantically coherent units Combining focused crawling with XML search  auto-annotation of HTML, Latex, PDF, etc. docs  cross-document querying à la XXL User guidance & portal admin methodology

Towards “Semantically Coherent” Units Teaching: DBS IR... Research: XML Auto-tuning... Homepage of... Teaching Research Address:... vs. Homepage of... My work: J2EE XML WSDL... My hobbies: Poisonous frogs South Amerian stamps Canoeing...

Ongoing and Future Work Exploiting ontological knowledge e.g.: search for a „woman talking someone into murder“ Exploiting linguistic analysis methods e.g.: „... cut his throat...“ act: killing subject:...object:... Construct richer feature spaces Deep Web exploration with auto-generated queries Generalized links & semantic joins, e.g. named entities Identifying semantically coherent units Combining focused crawling with XML search  auto-annotation of HTML, Latex, PDF, etc. docs  cross-document querying à la XXL User guidance & portal admin methodology

17 Concluding Remarks XML and Semantic Web are key assets, but by themselves not sufficient; we need to cope with diversity, incompleteness, and uncertainty  absolute need for ranked retrieval long-term goal: exploit the Web‘s potential for being the world‘s largest knowledge base view information organization and information search as dual views of the same problem combine techniques from DBS, IR, CL, AI, and ML need better theory about quality/efficiency tradeoffs as well as large-scale experiments