BINGO!: Bookmark-Induced Gathering of Information Sergej Sizov, Martin Theobald, Stefan Siersdorfer, Gerhard Weikum University of the Saarland Germany
Part I: System Overview
Motivation
  Web search engines
  - The vector space model
  - Link analysis & authority ranking
  Information demands
  - Mass queries (“madonna tour”)
  - Needle-in-a-haystack queries (“solidarity eisler”)
Overview (II)
  [Figure: topic tree fed from the WWW; node labels: ROOT, Semistructured Data, DB Core Technology, Networking, Workflow and E-Services, Web Retrieval, Data Mining, XML]
Focused Crawling
  [Figure: focused crawling loop; component labels: Crawler, Queue, Classifier, Results]
Focused Crawling (2)
  Key aspects:
  - the mathematical model and algorithm used for the classifier (e.g., Naive Bayes vs. SVM)
  - the feature set on which the classifier bases its decision (e.g., all terms vs. a careful selection of the "most discriminative" terms)
  - the quality of the training data
Focused Crawling (3)
  [Figure: focused crawling loop with re-training; component labels: Crawler, Queue, SVM Classifier, Re-Training, HITS, SVM Archetypes, Hubs, Authorities]
System Overview
  [Figure: system architecture; component labels: WWW, Crawler, URL Queue, Docs, Document Analyzer, Feature Vectors, Feature Selection, Classifier, Ontology Index, Training Docs, Bookmarks, Adaptive Re-Training, Link Analyzer, Hubs & Authorities]
Part II: System Components
Focus Manager
  Focusing strategies
  - Depth-first (df):
  - Breadth-first (bf):
  - Strong focus (learning phase)
  - Soft focus (harvesting phase)
  - Tunneling
Focus Manager (2)
  Sample URL Prioritization
  [Figure: sample link tree with numbered nodes; surviving node labels: confidence = 0.3, topic = A; confidence = 0.4, topic = A; confidence = 0.85, topic = A; confidence = 0.6, topic = B]
  DF strong order: 1–2–5–3–6–4–9–10..
  BF strong order: 1–2–5–3–4–6–9–10..
  DF soft order:   1–2–5–6–3–7–8–4–9–10..
  BF soft order:   1–2–5–3–6–4–7–8–9–10..
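One plausible reading of the depth-first and breadth-first orders is a priority queue keyed first on link depth and then on classifier confidence. The sketch below illustrates only that reading. The tuple layout, the function name `crawl_order`, and the sample links are hypothetical, and the strong/soft focus rules are not modeled.

```python
import heapq

def crawl_order(links, strategy):
    """Order candidate links for crawling.

    links: iterable of (url, depth, confidence) tuples (hypothetical layout).
    strategy: "df" visits deeper links first, "bf" visits shallower links first.
    Ties on depth are broken by higher classifier confidence (an assumption).
    """
    heap = []
    for url, depth, confidence in links:
        depth_key = -depth if strategy == "df" else depth
        # heapq is a min-heap, so confidence is negated to pop the best first.
        heapq.heappush(heap, (depth_key, -confidence, url))
    ordered = []
    while heap:
        ordered.append(heapq.heappop(heap)[2])
    return ordered

# Hypothetical sample: two links at depth 1 and two at depth 2.
sample = [("a", 1, 0.3), ("b", 1, 0.85), ("c", 2, 0.6), ("d", 2, 0.4)]
df_order = crawl_order(sample, "df")
bf_order = crawl_order(sample, "bf")
print(df_order, bf_order)
```

Keeping the depth key and the confidence key in one tuple means a single queue can serve both strategies; only the sign of the depth key changes.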
Feature Selection
  Mutual Information (MI) criterion:
  - A is the number of documents in Vj containing Xi
  - B is the number of documents with Xi in "competitive" topics
  - C is the number of documents in Vj without Xi
  - N is the overall number of documents in Vj and its competitive topics
  Time complexity: O(n) + O(mk) for n documents, m terms, and k competitive topics.
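The definitions of A, B, C, and N match the standard document-count estimate of mutual information, MI(Xi, Vj) ≈ log(A·N / ((A + C)(A + B))). Assuming that this is the criterion meant here, a minimal Python sketch of scoring and ranking terms could look as follows. The function names and the sample counts are hypothetical.

```python
import math

def mutual_information(a, b, c, n):
    """Estimate MI for one term/topic pair from document counts.

    a: documents in the topic that contain the term
    b: documents in competitive topics that contain the term
    c: documents in the topic that do not contain the term
    n: all documents in the topic and its competitive topics
    Assumes the standard estimate log(A*N / ((A+C)*(A+B))).
    """
    if a == 0:
        return float("-inf")  # the term never occurs in the topic
    return math.log((a * n) / ((a + c) * (a + b)))

def top_features(counts, n, k):
    """counts maps term -> (A, B, C); return the k terms with the highest MI."""
    scores = {term: mutual_information(a, b, c, n)
              for term, (a, b, c) in counts.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]

# Hypothetical counts: the topic holds 40 of N = 100 documents.
counts = {
    "deadlock": (30, 2, 10),   # frequent in the topic, rare elsewhere
    "implement": (35, 50, 5),  # frequent in every topic
}
print(top_features(counts, n=100, k=1))
```

A term that is common in every topic gets a score near zero, which is why such a criterion can favor topic-specific terms over merely frequent ones.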
Feature Selection (2)
  Top features for the topic "DB Core Technology" with regard to tf*idf (left) and MI (right)
  tf*idf       MI
  below        storag
  et           modifi
  graph        sql
  involv       disk
  accomplish   pointer
  backup       deadlock
  command      redo
  exactli      implement
  feder        correctli
  histor       size
Classifier
  [Figure: separating hyperplane between the classes V and ¬V in the (x1, x2) plane, with margin δ and an unlabeled point "?"]
  Input: n training vectors with components (x1, ..., xm, C) and C = +1 or C = -1
  Training: Compute
  Classification: Check
Hierarchical Classification
  Recursive classification by the taxonomy tree.
  Decisions based on topic-specific feature spaces.
  [Figure: topic tree with node labels ROOT, Semistructured Data, DB Core Technology, Networking, Workflow and E-Services, Web Retrieval, Data Mining, XML; additional labels: Semistructured Data, 0.4, Data Mining]
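The recursive descent through the taxonomy can be sketched as below. This is an illustration under stated assumptions, not the system's code: each node carries a hypothetical linear classifier over its own feature space, a document follows the child with the highest positive confidence, and classification stops when no child accepts it. The tree shape, weights, and names (`Topic`, `classify`) are made up for the example.

```python
from dataclasses import dataclass, field

@dataclass
class Topic:
    name: str
    weights: dict = field(default_factory=dict)  # term -> weight (topic-specific features)
    bias: float = 0.0
    children: list = field(default_factory=list)

    def confidence(self, doc):
        """Linear decision value computed over this topic's own features only."""
        return sum(w * doc.get(term, 0.0) for term, w in self.weights.items()) + self.bias

def classify(node, doc):
    """Return the path of topic names from node down to the accepted topic."""
    if not node.children:
        return [node.name]
    best = max(node.children, key=lambda child: child.confidence(doc))
    if best.confidence(doc) <= 0:
        return [node.name]  # no child accepts the document
    return [node.name] + classify(best, doc)

# Hypothetical two-level tree and weights.
root = Topic("ROOT", children=[
    Topic("Semistructured Data", {"xml": 0.5, "mine": 0.5}, -0.2, children=[
        Topic("Data Mining", {"mine": 1.0}, -0.5),
        Topic("XML", {"xml": 1.0}, -0.5),
    ]),
    Topic("DB Core Technology", {"deadlock": 1.0}, -0.5),
])
doc = {"xml": 1.0, "mine": 0.2}
path = classify(root, doc)
print(path)
```

Because every node scores the document with its own weights, a term that separates siblings at one level does not need to be informative at any other level.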
Link Analysis
  The HITS Algorithm
  Web graph G = (S, E)
  Iterative approximation of the dominant eigenvectors of A^T A and A A^T:
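The iteration can be illustrated with a short power-iteration sketch in plain Python. It assumes the usual HITS updates (authority scores from A^T h, hub scores from A a, followed by L2 normalization); the edge list and the fixed iteration count are hypothetical choices for the example.

```python
import math

def hits(edges, iterations=50):
    """Return (hub, authority) score dicts for directed edges given as (src, dst) pairs."""
    nodes = sorted({node for edge in edges for node in edge})
    hub = {node: 1.0 for node in nodes}
    auth = {}
    for _ in range(iterations):
        # Authority update, a = A^T h: sum the hub scores of all pages linking in.
        auth = {node: 0.0 for node in nodes}
        for src, dst in edges:
            auth[dst] += hub[src]
        # Hub update, h = A a: sum the authority scores of all pages linked to.
        hub = {node: 0.0 for node in nodes}
        for src, dst in edges:
            hub[src] += auth[dst]
        # Normalize both vectors so the scores stay bounded.
        for scores in (auth, hub):
            norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
            for node in scores:
                scores[node] /= norm
    return hub, auth

# Hypothetical graph: three pages point to a1, one of them also to a2.
edges = [("h1", "a1"), ("h1", "a2"), ("h2", "a1"), ("h3", "a1")]
hub, auth = hits(edges)
print(max(auth, key=auth.get), max(hub, key=hub.get))
```

Repeating the two updates amounts to power iteration on A^T A for the authority vector and on A A^T for the hub vector.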
Retraining based on Archetypes
  Two sources of potential archetypes:
  - Link analysis → Nauth good authorities
  - SVM classifier → Nconf best-rated docs
  To avoid the "topic drift" phenomenon: the classification confidence of an archetype must be higher than the mean confidence of the previous iteration's training documents.
Retraining (2)
  if {at least one topic has more than Nmax positive documents
      or all topics have more than Nmin positive documents} {
    for each topic Vi {
      link analysis using all documents of Vi as base set;
      hubs (Vi) = top Nhub documents;
      authorities (Vi) = top Nauth documents;
      sort docs of Vi in descending order of confidence;
      archetypes (Vi) = top Nconf from confidence ranking ∪ authorities (Vi);
      remove from archetypes (Vi) all docs with confidence < mean of the previous iteration;
      archetypes (Vi) = archetypes (Vi) ∪ bookmarks (Vi) };
    for each topic Vi {
      perform feature selection based on archetypes (Vi);
      re-compute SVM decision model for Vi };
    re-initialize URL queue using hubs (Vi) }
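The trigger condition and the archetype selection for a single topic can be sketched in Python as follows. This is a hedged illustration of the pseudocode, not the system's implementation: the function names, argument layout, and sample values are hypothetical, and documents exactly at the previous mean are kept, following the pseudocode's `<` test.

```python
def should_retrain(positives_per_topic, n_min, n_max):
    """True if one topic exceeds n_max positives or every topic exceeds n_min."""
    counts = list(positives_per_topic.values())
    return any(c > n_max for c in counts) or all(c > n_min for c in counts)

def select_archetypes(confidences, authorities, bookmarks, n_conf, prev_mean):
    """Pick the new training documents for one topic.

    confidences: dict url -> SVM confidence for the topic's positive documents
    authorities: URLs of the top authorities from link analysis (keys of confidences)
    bookmarks: the user's original bookmarks, always retained
    """
    ranked = sorted(confidences, key=confidences.get, reverse=True)
    candidates = set(ranked[:n_conf]) | set(authorities)
    # Drop candidates below the previous iteration's mean confidence (topic drift guard).
    kept = {url for url in candidates if confidences[url] >= prev_mean}
    return kept | set(bookmarks)

# Hypothetical values for one topic.
confidences = {"u1": 0.9, "u2": 0.7, "u3": 0.4, "u4": 0.2}
archetypes = select_archetypes(confidences, authorities={"u3", "u4"},
                               bookmarks={"b1"}, n_conf=2, prev_mean=0.3)
print(sorted(archetypes))
```

Adding the bookmarks after the filtering step means the original training documents can never be displaced by crawled pages.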
Part III: Evaluation
Testbed
  Bookmarks:
  - homepages of researchers in the various areas
  - leaf nodes were filled with bookmarks
  - the total training data comprised 81 documents
  Focused crawl:
  - Crawling time: 6h
  - Visited: pages (1800 hosts), link distances 1 –
  - positively classified (675 different hosts)
  Entire crawl: 7 iterations with re-training.
  Parameters: Nmin = 50, Nmax = 200, Nhub = 50, Nauth = 20, Nconf = 20.
  Feature selection: MI criterion, best 300 for each topic
  Authority ranking: HITS algorithm
Crawling Precision
  Iteration   Data Mining   XML    Entire ontology
  1           0.98          0.94   0.98
  2                         0.93   0.98
  3           0.99          0.97   0.96
  4           0.87          0.99   0.97
  5           0.90          0.95   0.96
  6           0.98                 0.95
  7           0.94          0.97   0.96
Crawling Precision (2)
  Iteration   BINGO!   with focusing, no MI   no focusing, no MI
Crawling Recall
  Iteration   Data Mining   XML   Entire ontology
Archetype Selection
  Topic “Data Mining”:
  URL   SVM confidence
Archetype Selection (2)
  Iteration   Data Mining   XML      Entire ontology
  1           10 (1)        5 (0)    24 (4)
  2           10 (2)        11 (0)   27 (5)
  3           9 (1)         17 (1)   32 (4)
  4           8 (0)         7 (0)    29 (3)
  5           22 (2)        26 (2)   62 (8)
  6           43 (4)        12 (2)   77 (10)
  7           38 (0)        13 (1)   75 (8)
Feature Selection
  Topic “Data Mining”:
  Feature    MI weight
  mine
  knowledg
  olap
  frame
  pattern
  genet
  discov
  miner
  cluster
  dataset    0.044
Future Work
  - Large-scale experiments (portal generator)
  - Annotation and semantic classification of HTML sources (e.g., transformation of HTML to XML for improved data management, detection of “information units”)
  - Advanced feature construction and feature selection algorithms
  - Fault tolerance on document collections with wrong samples, adaptive re-training...
Crawler
  Key features:
  - asynchronous DNS lookups with caching
  - multiple download attempts
  - advanced duplicate recognition
  - following multiple redirects
  - advanced topic-balanced URL queue
  - document filters for common datatypes
  - focusing strategies
Classifier (II)
  Training: Find hyperplane that separates the samples with maximum margin (quadratic optimization task):
  Classification: Test unlabeled vector y for
  Very efficient runtime in O(m)
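In the textbook linear SVM formulation, training yields a weight vector w and an offset b, and an unlabeled vector y is assigned to the positive class when w·y + b ≥ 0. Assuming that formulation, the sketch below shows why classification costs O(m) for m features: it is a single dot product. The names are hypothetical, and using the distance to the hyperplane as a confidence value is an assumption for illustration.

```python
import math

def decision_value(w, b, y):
    """Compute w·y + b with one pass over the m components."""
    return sum(wi * yi for wi, yi in zip(w, y)) + b

def classify(w, b, y):
    """Return (label, distance): label is +1 or -1, distance is measured to the hyperplane."""
    value = decision_value(w, b, y)
    label = 1 if value >= 0 else -1
    norm = math.sqrt(sum(wi * wi for wi in w))
    return label, abs(value) / norm

# Hypothetical trained model with m = 2 features.
w, b = [3.0, 4.0], -5.0
positive = classify(w, b, [3.0, 4.0])
negative = classify(w, b, [0.0, 0.0])
print(positive, negative)
```

The quadratic optimization happens only at training time; once w and b are fixed, each new document needs no more work than reading its feature vector.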
Related Work
  - General-purpose crawling
  - Focused crawling
  - Authority ranking
  - Classification of Web documents
  - Web ontologies