
Focused Crawling and Collection Synthesis
Donna Bergmark, Cornell Information Systems
CUL Metadata WG Meeting, December 20, 2002

Outline
– Crawlers
– Collection Synthesis
– Focused Crawling
– Some Results
– Student Project (Fall 2002)

Definition
Spider = robot = crawler. Crawlers are computer programs that roam the Web with the goal of automating specific tasks related to the Web.

Crawlers – some background
– Resource discovery
– Crawlers and internet history
– Crawling and crawlers
– Mercator

Resource Discovery
Finding info on the Web:
– Surfing (random strategy; the goal is serendipity)
– Searching (inverted indices; specific info)
– Crawling ("all" the info)
Uses for crawling: find stuff, gather stuff, check stuff.

Crawlers and internet history
– 1991: HTTP
– 1992: 26 servers
– 1993: 60+ servers; self-register; archie
– 1994 (early): first crawlers
– 1996: search engines abound
– 1998: focused crawling
– 1999: web graph studies
– 2002: use for digital libraries

Crawling and Crawlers
The Web overlays the internet; a crawl overlays the Web, starting from a seed.
[Figure: a crawl spreading outward from a seed page]

Crawler Issues
– The web is so big
– Visit order
– The URL itself
– Politeness
– Robot traps
– The hidden web
– System considerations

Standard for Robot Exclusion
– Martijn Koster (1994)
– Maintained by the webmaster
– Forbids access to pages, directories
– Commonly excluded: /cgi-bin/
– Adherence is voluntary for the crawler
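A polite crawler checks robots.txt before fetching. As a minimal sketch, Python's standard urllib.robotparser can parse the rules and answer per-URL questions; the host name and the two rules below are hypothetical examples, not from the talk.

```python
# Voluntary robots.txt adherence via the standard-library parser.
from urllib import robotparser

rp = robotparser.RobotFileParser()
# Normally: rp.set_url("http://example.org/robots.txt"); rp.read()
# Here we parse an inline example instead of fetching one.
rp.parse([
    "User-agent: *",
    "Disallow: /cgi-bin/",
])

# The commonly excluded /cgi-bin/ directory is off-limits; other pages are fine.
print(rp.can_fetch("MyCrawler", "http://example.org/cgi-bin/search"))  # False
print(rp.can_fetch("MyCrawler", "http://example.org/index.html"))      # True
```

Adherence stays voluntary: nothing forces the crawler to consult `can_fetch` before downloading.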

Robot Traps
– Cycles in the Web graph
– Infinite links on a page
– Traps set out by the webmaster

The Hidden Web
– Dynamic pages are increasing
– Subscription pages
– Username-and-password pages
Research is in progress on how crawlers can "get into" the hidden web.

System Issues
Crawlers are complicated systems. Efficiency is of utmost importance, and crawlers are demanding of system and network resources.


Mercator Features
– Written in Java
– One file configures a crawl
– Can add your own code: extend one or more of Mercator's base classes, or add totally new classes called by your own
– Industrial-strength crawler: uses its own DNS and java.net package

Collection Synthesis
The NSDL:
– National Science Digital Library
– Educational materials for K-through-grave
– A collection of digital collections
Collection (automatically derived): 20–50 items on a topic, represented by their URLs, expository in nature; precision trumps recall.

Crawler is the Key
– A general search engine is good for precise results, few in number
– A search engine must cover all topics, not just scientific ones
– For automatic collection assembly, a Web crawler is needed
– A focused crawler is the key

Focused Crawling

Focused Crawling
[Figure: a breadth-first crawl from root R visits everything; a focused crawl from R prunes off-topic pages, marked X]

Collections and Clusters
– Traditional: the document universe is divided into clusters, or collections; each collection is represented by its centroid
– Web: the size of the document universe is infinite, so agglomerative clustering is used instead
Two aspects:
– A collection descriptor
– A rule for when items belong to that collection

[Figure: two example collections, one with Q = 0.2 and one with Q = 0.6]

The Setup
A virtual collection of items about Chebyshev Polynomials.

Adding a Centroid
An empty collection of items about Chebyshev Polynomials.

Document Vector Space
A classic information retrieval technique:
– Each word is a dimension in N-space
– Each document is a vector in N-space
– Normalize the weights
Both the "centroid" and the downloaded document are term vectors.
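A minimal sketch of the vector-space model described above: each document becomes a term-frequency vector normalized to unit length, and the match against a centroid is the cosine (a dot product, since both vectors are normalized). The toy texts are illustrative only.

```python
import math
from collections import Counter

def term_vector(text):
    counts = Counter(text.lower().split())           # raw term frequencies
    norm = math.sqrt(sum(f * f for f in counts.values()))
    return {t: f / norm for t, f in counts.items()}  # unit-length vector

def cosine(v, w):
    # Dot product over shared terms; both vectors are already normalized.
    return sum(weight * w.get(term, 0.0) for term, weight in v.items())

doc = term_vector("chebyshev polynomials are orthogonal polynomials")
centroid = term_vector("chebyshev polynomials")
print(round(cosine(doc, centroid), 3))  # 0.802
```

Raw term frequencies stand in for the tf-idf weights used in the talk; only the weighting scheme would change, not the cosine machinery.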

Agglomerate
A collection with 3 items about Chebyshev Polynomials.

Where does the Centroid come from?
"Chebyshev Polynomials": a really good centroid for a collection about Chebyshev Polynomials.

Building a Centroid
1. Google("Chebyshev Polynomials") → {url1 … url-n}
2. Let H be a hash (k, v) where k = word, v = frequency
3. For each url in {url1 … url-n} do:
     D ← download(url)
     V ← term vector(D)
     For each term t in V: if t is not in H, add it; otherwise increment H(t)
4. Compute tf-idf weights. C ← top 20 terms.
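The procedure above can be sketched in a few lines. This is a hedged illustration, not the talk's actual code: the downloader and the idf table are stand-ins (the pages are passed in as plain text, and `idf` is assumed to be supplied from some reference corpus).

```python
from collections import Counter

def build_centroid(pages, idf, k=20):
    """pages: list of downloaded document texts; idf: term -> idf weight."""
    freq = Counter()
    for text in pages:
        freq.update(text.lower().split())            # H(t) += term frequency
    # tf-idf weight for every term seen in the result set
    weights = {t: f * idf.get(t, 1.0) for t, f in freq.items()}
    top = sorted(weights, key=weights.get, reverse=True)[:k]
    return {t: weights[t] for t in top}              # C = top k terms
```

For real input, `pages` would be the texts behind the top Google hits for the topic phrase, as on the slide.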

Dictionary
Given centroids C1, C2, C3 …, the dictionary is C1 + C2 + C3 …:
– Terms are the union of the terms in the Ci
– Term frequencies are the total frequency across the Ci
– Document frequency of t is how many Ci contain t
– Term IDF is as from Berkeley
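The dictionary construction above is a straightforward merge. A sketch, with one assumption flagged: the slide takes its IDF "from Berkeley", so the logarithmic idf used here is a common textbook form standing in for that formula, not a quote of it.

```python
import math
from collections import Counter

def build_dictionary(centroids):
    """centroids: list of term -> frequency dicts (one per collection)."""
    tf, df = Counter(), Counter()
    for c in centroids:
        tf.update(c)           # total term frequency across all centroids
        df.update(c.keys())    # one count per centroid containing the term
    n = len(centroids)
    # Assumed idf form; the talk's "Berkeley" formula may differ.
    idf = {t: math.log(n / df[t]) for t in df}
    return tf, df, idf
```

The union of terms falls out for free: every term in any centroid ends up as a key of `tf`.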

Focused Crawling
Recall the cartoon for a focused crawl. A simple way to do it is with two "knobs".
[Figure: focused crawl from root R with pruned pages X]

Focusing the Crawl
– Threshold: a page is on-topic if its correlation to the closest centroid is above this value
– Cutoff: follow links from pages whose "distance" from the closest on-topic ancestor is less than the cutoff
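The two knobs can be sketched as one decision per downloaded page. The function and its names are illustrative, not from the actual crawler: it classifies the page against the threshold, updates its distance from the last on-topic ancestor, and decides via the cutoff whether to follow its links.

```python
def crawl_decision(correlation, parent_distance, threshold, cutoff):
    """correlation: cosine to the closest centroid;
    parent_distance: linking page's distance from its on-topic ancestor."""
    on_topic = correlation >= threshold
    distance = 0 if on_topic else parent_distance + 1  # reset on a hit
    follow_links = distance <= cutoff
    return on_topic, distance, follow_links
```

With threshold 0.3 and cutoff 1, an on-topic page resets its distance to 0, its off-topic child (distance 1) still has its links followed, and an off-topic grandchild (distance 2) is pruned.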

Illustration
[Figure: crawl tree with cutoff = 1, pages kept when corr >= threshold]

[Figure: items ranked from closest to furthest]

Collection "Evaluation"
Assume higher correlations are good. With human relevance assessments, one can also compute a "precision" curve: precision P(n) after considering the n most highly ranked items is the number of relevant items divided by n.
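The precision curve above is a one-liner once the human judgments are in hand; the ranked list of booleans here is a toy example.

```python
def precision_at(relevant_flags, n):
    """relevant_flags: ranked list of True/False relevance judgments."""
    return sum(relevant_flags[:n]) / n  # relevant among the top n, over n

ranked = [True, True, False, True]      # toy relevance assessments
print(precision_at(ranked, 2))          # 1.0
print(precision_at(ranked, 4))          # 0.75
```

Plotting `precision_at(ranked, n)` for n = 1 … len(ranked) gives the curve.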

[Figure: results with cutoff = 0, threshold = 0.3]


Tunneling with Cutoff
A path looks like: nugget – dud – dud … dud – nugget. Notation: 0 – X – X … X – 0.
– Fixed cutoff: 0 – X1 – X2 … Xc
– Adaptive cutoff: 0 – X1 – X2 … X?

Statistics
– Collected 500,000 documents
– Number of seeds: 4
– Path data for all but seeds
– 6620 completed paths (0-x…x-0)
– 100,000s of incomplete paths (0-x…x…)

[Chart: nuggets that are x steps from a nugget]

[Chart: nuggets that are x steps from a seed and/or a nugget]

Better parents have better children.

Using the Empirical Observations
– Use the path history
– Use the page quality (the cosine correlation)
– The current distance should increase exponentially as you get away from quality nodes
Distance = 0 if this is a nugget; otherwise the larger of 1 and (1 − corr) · exp(2 × parent's distance / cutoff).
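The adaptive rule can be sketched directly from the slide. One assumption is flagged in the code: the slide's "1 or (1-corr) exp(…)" is read here as taking the larger of the two values, which is an interpretation of the garbled slide text rather than a quote of the ECDL paper.

```python
import math

def adaptive_distance(correlation, parent_distance, cutoff, threshold):
    """Distance of a page from its closest on-topic ancestor."""
    if correlation >= threshold:   # a nugget: distance resets to zero
        return 0.0
    # Otherwise distance grows exponentially as quality ancestors recede.
    # Reading "1 or ..." as max(1, ...) is an assumption (see lead-in).
    return max(1.0, (1 - correlation)
               * math.exp(2 * parent_distance / cutoff))
```

A high-correlation page anywhere on the path resets the distance, so good pages keep the tunnel open; low-correlation chains blow past the cutoff quickly.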

Results
Details are in the ECDL paper.
– Smaller frontier → more docs/second
– More documents downloaded in the same time
– Higher-scoring documents were downloaded
– A cutoff of 20 averaged 7 steps at the cutoff

Fall 2002 Student Project
[Diagram: pipeline from a query ("Chebyshev P.s") through Mercator, term vectors, centroids and a dictionary, to collection URLs and an HTML collection description]

Conclusion
– We've covered crawling: history, technology, use
– Focused crawling with tunneling
– Adaptive cutoff with tunneling
– We have a good experimental setup for exploring automatic collection synthesis