BINGO!: Bookmark-Induced Gathering of Information Sergej Sizov, Martin Theobald, Stefan Siersdorfer, Gerhard Weikum University of the Saarland Germany.

Sergej Sizov BINGO!: Bookmark-Induced Gathering of Information Part I System Overview

Motivation
Web search engines
  The vector space model
  Link analysis & authority ranking
Information demands
  Mass queries ("madonna tour")
  Needle-in-a-haystack queries ("solidarity eisler")

Overview (II)
[Figure: the WWW mapped onto a topic tree with ROOT and the topics Semistructured Data, DB Core Technology, Networking, Workflow and E-Services, Web Retrieval, Data Mining, XML]

Focused Crawling
[Figure: components Crawler, Queue, Results, Classifier]

Focused Crawling (2)
Key aspects:
  the mathematical model and algorithm used for the classifier (e.g., Naive Bayes vs. SVM)
  the feature set on which the classifier bases its decision (e.g., all terms vs. a careful selection of the "most discriminative" terms)
  the quality of the training data

Focused Crawling (3)
[Figure: components Crawler, Queue, SVM Classifier, HITS, Re-Training, SVM Archetypes, Hubs, Authorities]

System Overview
[Figure: components Crawler, Document Analyzer, Feature Selection, Classifier, Adaptive Re-Training, Link Analyzer; data: WWW, URL Queue, Docs, Feature Vectors, Ontology Index, Training Docs, Bookmarks, Hubs & Authorities]

Part II System Components

Focus Manager
Focusing strategies:
  Depth-first (df)
  Breadth-first (bf)
  Strong focus (learning phase)
  Soft focus (harvesting phase)
  Tunneling

Focus Manager (2)
Sample URL prioritization
[Figure: link graph of numbered pages annotated with classification results: confidence = 0.3, topic = A; confidence = 0.4, topic = A; confidence = 0.85, topic = A; confidence = 0.6, topic = B]
DF strong order: 1–2–5–3–6–4–9–10..
BF strong order: 1–2–5–3–4–6–9–10..
DF soft order: 1–2–5–6–3–7–8–4–9–10..
BF soft order: 1–2–5–3–6–4–7–8–9–10..
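
The orderings listed above depend on definitions that the transcript does not preserve. As a rough sketch only, assume that depth-first ranks queued links by the classification confidence of the linking page (highest first), that breadth-first ranks them by link distance from the bookmarks (smallest first, confidence as tie-breaker), that strong focus admits only links from pages classified into the topic being crawled, and that soft focus admits links from pages accepted by any topic. These readings and every name in the code (`Link`, `admit`, `priority`, `crawl_order`) are assumptions for illustration, not the authors' definitions.

```python
import heapq
from dataclasses import dataclass


@dataclass(frozen=True)
class Link:
    url: str
    depth: int          # link distance from the seed bookmarks
    source_topic: str   # topic assigned to the page that contains the link
    confidence: float   # classifier confidence of that page


def admit(link, crawl_topic, focus):
    """Assumed focus rule: 'strong' keeps only links found on pages of the
    crawled topic; 'soft' keeps links from any positively classified page."""
    if focus == "strong":
        return link.source_topic == crawl_topic
    return True


def priority(link, order):
    """Assumed ordering rule. Smaller tuples are dequeued first."""
    if order == "df":
        return (-link.confidence, link.depth)
    return (link.depth, -link.confidence)


def crawl_order(links, crawl_topic, focus, order):
    """Return URLs in the order a priority queue would release them."""
    heap = [(priority(l, order), l.url)
            for l in links if admit(l, crawl_topic, focus)]
    heapq.heapify(heap)
    return [heapq.heappop(heap)[1] for _ in range(len(heap))]


links = [
    Link("a", 1, "A", 0.3),
    Link("b", 2, "A", 0.85),
    Link("c", 1, "B", 0.6),
]
print(crawl_order(links, "A", "strong", "df"))
print(crawl_order(links, "A", "soft", "bf"))
```

Encoding each strategy as a sort key keeps the queue itself generic, so switching from the learning phase to the harvesting phase would only change the `focus` argument.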

Feature Selection
Mutual Information (MI) criterion for term Xi and topic Vj, where:
  A is the number of documents in Vj containing Xi
  B is the number of documents containing Xi in "competitive" topics
  C is the number of documents in Vj without Xi
  N is the overall number of documents in Vj and its competitive topics
Time complexity: O(n) + O(mk) for n documents, m terms and k competitive topics.
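
The MI formula itself is missing from the transcript. One widely used estimate that takes exactly the counts A, B, C and N defined above is log(A·N / ((A+B)·(A+C))), known from Yang and Pedersen's comparison of feature selection methods. Whether BINGO! uses this exact form is an assumption; the function names and the sample counts below are hypothetical.

```python
import math


def mi_weight(a, b, c, n):
    """Assumed MI estimate: log(A*N / ((A+B)*(A+C))).

    a: documents in the topic that contain the term
    b: documents in competitive topics that contain the term
    c: documents in the topic that lack the term
    n: all documents in the topic and its competitive topics
    """
    if a == 0:
        return float("-inf")  # the term never occurs in the topic
    return math.log((a * n) / ((a + b) * (a + c)))


def top_features(counts, n, k):
    """Rank terms by MI weight and keep the k best.

    counts maps each term to its (a, b, c) triple.
    """
    ranked = sorted(counts, key=lambda t: mi_weight(*counts[t], n), reverse=True)
    return ranked[:k]


# Made-up counts: 10 topic documents and 10 competitive documents.
counts = {"sql": (8, 1, 2), "the": (10, 10, 0)}
print(top_features(counts, n=20, k=1))
```

A term that is spread evenly over all topics gets a weight near zero, while a term concentrated in the topic gets a positive weight, which is the behaviour a discriminative feature selector needs.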

Feature Selection (2)
Top features for the topic "DB Core Technology" with regard to tf*idf (left) and MI (right)

  Term (tf*idf)    Term (MI)
  below            storag
  et               modifi
  graph            sql
  involv           disk
  accomplish       pointer
  backup           deadlock
  command          redo
  exactli          implement
  feder            correctli
  histor           size

Classifier
Input: n training vectors with components (x_1, ..., x_m, C), where C = +1 or C = -1
Training: compute the separating hyperplane
Classification: check on which side of the hyperplane a vector lies
[Figure: hyperplane separating V from ¬V in the (x1, x2) plane, with margin δ and one unclassified point]

Hierarchical Classification
Recursive classification by the taxonomy tree.
Decisions are based on topic-specific feature spaces.
[Figure: taxonomy tree with ROOT and the topics Semistructured Data, DB Core Technology, Networking, Workflow and E-Services, Web Retrieval, Data Mining, XML; surviving annotations: Semistructured Data, 0.4, Data Mining]
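
A minimal sketch of recursive classification along a taxonomy tree, under assumptions the slide does not state: every topic node carries its own classifier that returns a signed confidence, a document descends into the child with the highest positive confidence, and descent stops when no child accepts it. `TopicNode`, `assign_topic` and the toy classifiers are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class TopicNode:
    name: str
    # Hypothetical per-topic classifier: feature weights -> signed confidence.
    # Each one reads only its own features (topic-specific feature space).
    classify: Callable[[Dict[str, float]], float]
    children: List["TopicNode"] = field(default_factory=list)


def assign_topic(root, doc):
    """Walk down the tree and return the path of accepted topics."""
    path = [root.name]
    node = root
    while node.children:
        scored = [(child.classify(doc), child) for child in node.children]
        best_conf, best = max(scored, key=lambda pair: pair[0])
        if best_conf <= 0:
            break  # no child accepts the document; stay at this level
        path.append(best.name)
        node = best
    return path


root = TopicNode("ROOT", lambda d: 1.0, [
    TopicNode("DB Core Technology", lambda d: d.get("sql", 0.0) - 0.5),
    TopicNode("Data Mining", lambda d: d.get("cluster", 0.0) - 0.5),
])
print(assign_topic(root, {"sql": 1.0}))
```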

Link Analysis
The HITS Algorithm
Web graph G = (S, E)
Iterative approximation of the dominant eigenvectors of $A^T A$ and $A A^T$
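
The iteration can be sketched as a power method, assuming A is the adjacency matrix of G (the usual HITS setup; the slide does not define A). Authority scores are updated as a = Aᵀh and hub scores as h = Aa, each followed by normalization, so that a converges towards the dominant eigenvector of AᵀA and h towards that of AAᵀ. The function name, the edge-list input and the iteration count are illustrative choices.

```python
import math


def hits(edges, iterations=50):
    """Power iteration for hub and authority scores.

    edges: list of (source, target) pairs, one per hyperlink.
    Returns (hub, auth) dictionaries with L2-normalized scores.
    """
    nodes = sorted({n for edge in edges for n in edge})
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}

    def normalize(scores):
        norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
        return {n: v / norm for n, v in scores.items()}

    for _ in range(iterations):
        # a = A^T h: a page is a good authority if good hubs point to it
        auth = {n: 0.0 for n in nodes}
        for src, dst in edges:
            auth[dst] += hub[src]
        auth = normalize(auth)
        # h = A a: a page is a good hub if it points to good authorities
        hub = {n: 0.0 for n in nodes}
        for src, dst in edges:
            hub[src] += auth[dst]
        hub = normalize(hub)
    return hub, auth


edges = [("h1", "a"), ("h2", "a"), ("h1", "b")]
hub, auth = hits(edges)
print(max(auth, key=auth.get), max(hub, key=hub.get))
```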

Retraining Based on Archetypes
Two sources of potential archetypes:
  Link analysis → Nauth good authorities
  SVM classifier → Nconf best-rated docs
To avoid the "topic drift" phenomenon, the classification confidence of an archetype must be higher than the mean confidence of the previous iteration's training documents.

Retraining (2)
if {at least one topic has more than Nmax positive documents
    or all topics have more than Nmin positive documents} {
  for each topic Vi {
    link analysis using all documents of Vi as base set;
    hubs(Vi) = top Nhub documents;
    authorities(Vi) = top Nauth documents;
    sort docs of Vi in descending order of confidence;
    archetypes(Vi) = top Nconf from confidence ranking ∪ authorities(Vi);
    remove from archetypes(Vi) all docs with confidence < mean confidence of the previous iteration's training documents;
    archetypes(Vi) = archetypes(Vi) ∪ bookmarks(Vi)
  };
  for each topic Vi {
    perform feature selection based on archetypes(Vi);
    re-compute SVM decision model for Vi
  };
  re-initialize URL queue using hubs(Vi) of all topics
}
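
The retraining trigger and the archetype selection for a single topic can be sketched as follows. The data shapes are assumptions: `confidence` maps each document of the topic to its SVM confidence, `authorities` is the result of the link analysis, and `prev_mean` is the mean confidence of the previous iteration's training documents. Function and parameter names are hypothetical.

```python
def should_retrain(positives, n_min, n_max):
    """positives maps each topic to its number of positive documents."""
    return (any(count > n_max for count in positives.values())
            or all(count > n_min for count in positives.values()))


def select_archetypes(confidence, authorities, bookmarks, n_conf, prev_mean):
    """Pick the training set for one topic.

    Candidates are the n_conf most confident documents plus the authorities.
    Candidates below prev_mean are dropped to limit topic drift, and the
    original bookmarks are always kept.
    """
    ranked = sorted(confidence, key=confidence.get, reverse=True)
    candidates = set(ranked[:n_conf]) | set(authorities)
    kept = {doc for doc in candidates if confidence.get(doc, 0.0) >= prev_mean}
    return kept | set(bookmarks)


confidence = {"d1": 0.9, "d2": 0.7, "d3": 0.2, "d4": 0.5}
archetypes = select_archetypes(confidence, ["d3", "d4"], ["b1"],
                               n_conf=2, prev_mean=0.5)
print(sorted(archetypes))
```

Adding the bookmarks after the confidence filter means the user's original training documents can never be displaced by crawled pages.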

Part III Evaluation

Testbed
Bookmarks: homepages of researchers in the various areas
  Leaf nodes were filled with bookmarks
  The total training data comprised 81 documents
Focused crawl:
  Crawling time: 6h
  Visited: pages (1800 hosts), link distances 1 –
  positively classified (675 different hosts)
  Entire crawl: 7 iterations with re-training
Parameters: Nmin = 50, Nmax = 200, Nhub = 50, Nauth = 20, Nconf = 20
Feature selection: MI criterion, best 300 for each topic
Authority ranking: HITS algorithm

Crawling Precision

  Iteration   Data Mining   XML    Entire ontology
  1           0.98          0.94   0.98
  2                         0.93   0.98
  3           0.99          0.97   0.96
  4           0.87          0.99   0.97
  5           0.90          0.95   0.96
  6           0.98                 0.95
  7           0.94          0.97   0.96

Crawling Precision (2)
[Table: columns Iteration; BINGO!; with focusing, no MI; no focusing, no MI]

Crawling Recall
[Table: columns Iteration; Data Mining; XML; Entire ontology]

Archetype Selection
Topic "Data Mining"
[Table: columns URL; SVM confidence]

Archetype Selection (2)

  Iteration   Data Mining   XML      Entire ontology
  1           10 (1)        5 (0)    24 (4)
  2           10 (2)        11 (0)   27 (5)
  3           9 (1)         17 (1)   32 (4)
  4           8 (0)         7 (0)    29 (3)
  5           22 (2)        26 (2)   62 (8)
  6           43 (4)        12 (2)   77 (10)
  7           38 (0)        13 (1)   75 (8)

Feature Selection
Topic "Data Mining"
Feature, MI weight: mine, knowledg, olap, frame, pattern, genet, discov, miner, cluster, dataset 0.044

Future Work
  Large-scale experiments (portal generator)
  Annotation and semantic classification of HTML sources (e.g. transformation of HTML to XML for improved data management, detection of "information units")
  Advanced feature construction and feature selection algorithms
  Fault tolerance on document collections with wrong samples, adaptive re-training...

Crawler
Key features:
  asynchronous DNS lookups with caching
  multiple download attempts
  advanced duplicate recognition
  following multiple redirects
  advanced topic-balanced URL queue
  document filters for common datatypes
  focusing strategies

Classifier (II)
Training: find the hyperplane that separates the samples with maximum margin (a quadratic optimization task).
Classification: test on which side of the hyperplane an unlabeled vector y lies. Very efficient, with runtime in O(m).
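
The formulas on this slide did not survive extraction. For a linear SVM, the textbook decision rule is the sign of w·y + b, which takes one pass over the m features and is consistent with the O(m) runtime stated above. The sketch assumes a weight vector `w` and a bias `b` produced by training; both names, and the idea of reading the raw score as a confidence, are assumptions rather than details from the slides.

```python
def svm_decision(w, b, y):
    """Signed score of vector y relative to the hyperplane w.x + b = 0.

    Runs in O(m) for m features: one multiply-add per component.
    """
    return sum(wi * yi for wi, yi in zip(w, y)) + b


def classify(w, b, y):
    """Return +1 if y lies on the positive side of the hyperplane, else -1."""
    return 1 if svm_decision(w, b, y) > 0 else -1


# Hypothetical trained model over two features.
w, b = [1.0, -1.0], 0.0
print(classify(w, b, [2.0, 1.0]), classify(w, b, [0.0, 3.0]))
```

Training, the quadratic optimization step, is left to an SVM library; only the cheap classification step is shown here.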

Related Work
  General-purpose crawling
  Focused crawling
  Authority ranking
  Classification of Web documents
  Web ontologies