Published by Mariah Rose. Modified over 9 years ago.
1
Introduction to Web Crawling and Regular Expression
CSC4170 Web Intelligence and Social Computing
Tutorial 1
Tutor: Tom Chao Zhou
Email: czhou@cse.cuhk.edu.hk
2
Outline
- Course & Tutors Information
- Introduction to Web Crawling
  - Utilities of a crawler
  - Features of a crawler
  - Architecture of a crawler
- Introduction to Regular Expression
- Appendix
3
Course and Tutors Information
- Course homepage: http://wiki.cse.cuhk.edu.hk/irwin.king/teaching/csc4170/2009
- Tutors:
  - Xin Xin. Email: xxin@cse.cuhk.edu.hk. Venue: Room 101
  - Tom (me). Email: czhou@cse.cuhk.edu.hk. Venue: Room 114A
4
Utilities of a crawler
- Also known as: Web crawler, spider.
- Definition: A Web crawler is a computer program that browses the World Wide Web in a methodical, automated manner. (Wikipedia)
- Utilities:
  - Gather pages from the Web.
  - Support a search engine, perform data mining, and so on.
- Objects gathered:
  - Text, video, images, and so on.
  - Link structure.
5
Features of a crawler
Must provide:
- Robustness: the crawler must survive spider traps, such as:
  - Infinitely deep directory structures: http://foo.com/bar/foo/bar/foo/...
  - Pages filled with a large number of characters.
- Politeness: respect which pages can be crawled and which cannot.
  - Robots exclusion protocol: robots.txt
  - Example, from http://blog.sohu.com/robots.txt:
      User-agent: *
      Disallow: /manage/
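The politeness check above can be sketched with Python's standard `urllib.robotparser`. This is a minimal illustration, not a full crawler component: the two rules are copied from the slide, while the user-agent name `ExampleBot` and the two page paths are made up for the example. The rules are parsed from a string, so no network access is needed.

```python
from urllib.robotparser import RobotFileParser

# Rules copied from the robots.txt excerpt on the slide.
ROBOTS_TXT = """\
User-agent: *
Disallow: /manage/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Build a parser from robots.txt text that is already in memory."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# "ExampleBot" and both paths are hypothetical, chosen only for illustration.
allowed = parser.can_fetch("ExampleBot", "http://blog.sohu.com/index.html")
blocked = parser.can_fetch("ExampleBot", "http://blog.sohu.com/manage/settings")
print(allowed, blocked)
```

In a real crawler the robots.txt text would be fetched once per host and cached, so that the check does not cost an extra request per URL.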
6
Features of a crawler (Cont’d)
Should provide:
- Distributed
- Scalable
- Performance and efficiency
- Quality
- Freshness
- Extensible
7
Architecture of a crawler
[Diagram of crawler components: www, DNS, Fetch, Parse, Content Seen?, URL Filter, Dup URL Elim, URL Frontier, Doc Fingerprint, Robots templates, URL set]
8
Architecture of a crawler (Cont’d)
- URL Frontier: contains the URLs yet to be fetched in the current crawl. At first, a seed set is stored in the URL Frontier, and the crawler begins by taking a URL from the seed set.
- DNS: domain name service resolution. Looks up the IP address for a domain name.
- Fetch: generally uses the HTTP protocol to fetch the URL.
- Parse: the page is parsed. Text (and images, videos, etc.) and links are extracted.
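The Frontier and Parse steps can be sketched with the standard library alone. This is an illustrative sketch under stated assumptions: `SAMPLE_HTML` and the seed URL are made-up stand-ins for what the Fetch step would return, and `LinkExtractor` and `parse_page` are hypothetical helper names, not part of any crawler described in the slides.

```python
from collections import deque
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collects the href of every <a> tag and the visible text of a page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.text_parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

    def handle_data(self, data):
        stripped = data.strip()
        if stripped:
            self.text_parts.append(stripped)


def parse_page(html: str):
    """Parse step: return (text, links) extracted from an HTML string."""
    extractor = LinkExtractor()
    extractor.feed(html)
    extractor.close()
    return " ".join(extractor.text_parts), extractor.links


# Hypothetical page standing in for the output of the Fetch step.
SAMPLE_HTML = (
    '<html><body><p>Hello</p>'
    '<a href="/a.html">A</a> <a href="http://example.com/b">B</a>'
    '</body></html>'
)

frontier = deque(["http://example.com/"])  # the seed set
url = frontier.popleft()                   # take a URL from the frontier
text, links = parse_page(SAMPLE_HTML)
print(url, text, links)
```

The extracted links are still raw here (one of them is relative); they would pass through the URL Filter and duplicate elimination before being added to the frontier.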
9
Architecture of a crawler (Cont’d)
- Content Seen?: tests whether a web page with the same content has already been seen at another URL. This requires a way to compute a fingerprint of a web page.
- URL Filter:
  - Decides whether the extracted URL should be excluded from the frontier (robots.txt).
  - URLs should be normalized: relative links must be resolved against the URL of the page they appear on. E.g. the relative link "Disclaimers" on the page en.wikipedia.org/wiki/Main_Page.
- Dup URL Elim: the URL is checked against those already seen, and duplicates are eliminated.
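A rough sketch of these three checks follows. It makes several assumptions that the slides do not state: the fingerprint is a SHA-256 hash of whitespace-collapsed, lower-cased text (which only catches exact duplicates, not near-duplicates), normalization means resolving relative links and dropping the fragment, and the relative href `/wiki/Disclaimers#top` is a made-up value for illustration.

```python
import hashlib
from urllib.parse import urldefrag, urljoin

seen_fingerprints = set()  # plays the role of the "Doc Fingerprint" store
seen_urls = set()          # plays the role of the "URL set" store


def fingerprint(page_text: str) -> str:
    """Hash of the text after collapsing whitespace and lower-casing."""
    normalized = " ".join(page_text.split()).lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def content_seen(page_text: str) -> bool:
    """Content Seen?: True if identical content was recorded before."""
    fp = fingerprint(page_text)
    if fp in seen_fingerprints:
        return True
    seen_fingerprints.add(fp)
    return False


def normalize(base_url: str, href: str) -> str:
    """Resolve a (possibly relative) link and drop its #fragment."""
    absolute = urljoin(base_url, href)
    without_fragment, _fragment = urldefrag(absolute)
    return without_fragment


def is_new_url(url: str) -> bool:
    """Dup URL Elim: True only the first time a URL is offered."""
    if url in seen_urls:
        return False
    seen_urls.add(url)
    return True


# The href below is hypothetical; only the base page comes from the slide.
resolved = normalize("http://en.wikipedia.org/wiki/Main_Page", "/wiki/Disclaimers#top")
print(resolved)
```

In-memory sets are enough for a toy crawl; a large crawl would need a disk-backed or distributed store, and a near-duplicate measure rather than an exact hash.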
10
Other issues:
- Housekeeping tasks:
  - Log crawl progress statistics: URLs crawled, frontier size, etc. (every few seconds)
  - Checkpointing: a snapshot of the crawler’s state (the URL frontier) is committed to disk. (every few hours)
- Priority of URLs in the URL frontier:
  - Change rate.
  - Quality.
- Politeness:
  - Avoid repeated fetch requests to a host within a short time span. Otherwise the crawler may be blocked.
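Priority and politeness can be combined in a single frontier structure. The sketch below is one possible design, not the one used by any particular crawler: `PoliteFrontier`, its two-second default delay, and the numeric priorities are all assumptions made for illustration. The current time is passed in as a parameter so that the behaviour is easy to follow.

```python
import heapq
from urllib.parse import urlsplit


class PoliteFrontier:
    """Priority queue of URLs that spaces out requests to the same host."""

    def __init__(self, min_delay_seconds: float = 2.0):
        self.min_delay = min_delay_seconds
        self._heap = []          # entries: (-priority, insertion order, url)
        self._counter = 0
        self._next_allowed = {}  # host -> earliest time of the next fetch

    def add(self, url: str, priority: float) -> None:
        heapq.heappush(self._heap, (-priority, self._counter, url))
        self._counter += 1

    def next_url(self, now: float):
        """Return the best URL whose host may be contacted at `now`, or None."""
        deferred = []
        chosen = None
        while self._heap:
            item = heapq.heappop(self._heap)
            host = urlsplit(item[2]).netloc
            if self._next_allowed.get(host, 0.0) <= now:
                self._next_allowed[host] = now + self.min_delay
                chosen = item[2]
                break
            deferred.append(item)  # host contacted too recently, try later
        for item in deferred:
            heapq.heappush(self._heap, item)
        return chosen


frontier = PoliteFrontier(min_delay_seconds=2.0)
frontier.add("http://a.example/1", priority=5)
frontier.add("http://a.example/2", priority=4)
frontier.add("http://b.example/1", priority=1)

first = frontier.next_url(now=0.0)   # highest priority overall
second = frontier.next_url(now=0.5)  # a.example is still cooling down
print(first, second)
```

In a running crawler `now` would come from `time.monotonic()`, and the priority would be derived from signals such as the page's change rate and quality.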
11
Regular Expression
- Usage: regular expressions provide a concise and flexible means of identifying strings of text of interest, such as particular characters, words, or patterns of characters.
- Today’s target: introduce the basic principles.
- A tool to verify a regular expression: Regex Tester
  http://www.dotnet2themax.com/blogs/fbalena/PermaLink,guid,13bce26d-7755-441e-92b3-1eb5f9e859f9.aspx
12
Regular Expression
- Metacharacters: similar to wildcards in Windows, e.g.: *.doc
- Target: detect email addresses
13
Regular Expression
- \b: matches the beginning or end of a word. E.g.: \bhi\b finds "hi" only as a whole word.
- \w: matches a letter, a digit, or an underscore.
- .: matches any character except the newline.
- *: the item before * can be repeated any number of times (including zero). E.g.: \bhi\b.*\bLucy\b
- +: the item before + can be repeated one or more times.
- []: matches any one of the characters listed inside. E.g.: \b[aeiou]+[a-zA-Z]*\b
- {n}: repeat exactly n times
- {n,}: repeat n or more times
- {n,m}: repeat n to m times
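The metacharacters above can be tried out with Python's `re` module, whose syntax agrees with the slide for all of these constructs. The sample strings are made up for illustration, and the last pattern uses `[0-9]` simply as another character class.

```python
import re

sentence = "hi there, this is him; hi Lucy"

# \b keeps "hi" from matching inside "this" or "him".
whole_words = re.findall(r"\bhi\b", sentence)

# .* is greedy: the match runs from the first "hi" to the last "Lucy".
span = re.search(r"\bhi\b.*\bLucy\b", sentence).group()

# One or more vowels at the start of a word, then any letters.
vowel_words = re.findall(r"\b[aeiou]+[a-zA-Z]*\b", "an apple, the eel, 42 owls")

# {n,m}: whole numbers that are two to three digits long.
short_numbers = re.findall(r"\b[0-9]{2,3}\b", "1 22 333 4444")

print(whole_words)
print(span)
print(vowel_words)
print(short_numbers)
```

Raw strings (`r"..."`) are used so that Python does not interpret `\b` as a backspace character before the pattern reaches the regex engine.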
14
Regular Expression
- Target: detect email addresses
- Specification: A@B
  - A: a combination of English letters a to z, digits, or the characters . _ % + -
  - B: cse.cuhk.edu.hk or cuhk.edu.hk (English letters)
- Answer: \b[a-z0-9._%+-]+@[a-z.]+\.[a-z]{2}\b
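The answer can be checked in Python as follows. The first pattern is the one from the slide, applied to the two tutor addresses given earlier; the sample sentences and the stricter `CUHK_ONLY` pattern are additions for illustration only, showing one way to restrict B to exactly the two domains named in the specification. The `(?:...)?` group it uses is an optional non-capturing group, which the slides do not cover.

```python
import re

# Pattern from the slide: accepts any lower-case domain ending in a
# two-letter top-level part, not only the CUHK domains.
EMAIL = re.compile(r"\b[a-z0-9._%+-]+@[a-z.]+\.[a-z]{2}\b")

# Hypothetical stricter variant: B must be cse.cuhk.edu.hk or cuhk.edu.hk.
CUHK_ONLY = re.compile(r"\b[a-z0-9._%+-]+@(?:cse\.)?cuhk\.edu\.hk\b")

text = "Tutors: xxin@cse.cuhk.edu.hk and czhou@cse.cuhk.edu.hk; not an address: foo@bar"
found = EMAIL.findall(text)

mixed = "a@cuhk.edu.hk b@gmail.com.hk"
loose = EMAIL.findall(mixed)
strict = CUHK_ONLY.findall(mixed)

print(found)
print(loose)
print(strict)
```

Both patterns match lower-case addresses only; passing `re.IGNORECASE` to `re.compile` would accept upper-case letters as well.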
15
Appendix
- Mercator Crawler: http://mias.uiuc.edu/files/tutorials/mercator.pdf
- Regular Expression tutorial: http://www.regular-expressions.info/tutorial.html
16
Questions?