Introduction to Web Crawling and Regular Expression
CSC4170 Web Intelligence and Social Computing, Tutorial 1
Tutor: Tom Chao Zhou
Email: czhou@cse.cuhk.edu.hk

Outline
- Course & Tutors Information
- Introduction to Web Crawling
  - Utilities of a crawler
  - Features of a crawler
  - Architecture of a crawler
- Introduction to Regular Expression
- Appendix

Course and Tutors Information
- Course homepage: http://wiki.cse.cuhk.edu.hk/irwin.king/teaching/csc4170/2009
- Tutors:
  - Xin Xin. Email: xxin@cse.cuhk.edu.hk. Venue: Room 101
  - Tom (me). Email: czhou@cse.cuhk.edu.hk. Venue: Room 114A

Utilities of a crawler
- Also called: Web crawler, spider.
- Definition: A Web crawler is a computer program that browses the World Wide Web in a methodical, automated manner. (Wikipedia)
- Utilities:
  - Gather pages from the Web.
  - Support a search engine, perform data mining, and so on.
- Objects gathered:
  - Text, video, images, and so on.
  - Link structure.

Features of a crawler
Must provide:
- Robustness: the crawler must survive spider traps, such as:
  - Infinitely deep directory structures: http://foo.com/bar/foo/bar/foo/...
  - Pages filled with a very large number of characters.
- Politeness: respect which pages can be crawled and which cannot.
  - Robots exclusion protocol: robots.txt
  - Example, from http://blog.sohu.com/robots.txt:
      User-agent: *
      Disallow: /manage/
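The robots.txt check above can be sketched with Python's standard `urllib.robotparser` module. This is only an illustration: the rules are the two lines quoted on the slide, parsed from a string rather than downloaded, and the crawler name `ExampleBot` and the two page paths are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# The two rules quoted on the slide; a real crawler would download the file.
ROBOTS_TXT = """\
User-agent: *
Disallow: /manage/
"""

def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt content that is already held in memory."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser

parser = build_parser(ROBOTS_TXT)

# "ExampleBot" and both paths are made up for this example.
allowed_manage = parser.can_fetch("ExampleBot", "http://blog.sohu.com/manage/settings")
allowed_other = parser.can_fetch("ExampleBot", "http://blog.sohu.com/some-post")

print("manage page allowed:", allowed_manage)
print("other page allowed:", allowed_other)
```

A polite crawler would run such a check before every fetch and skip any URL for which `can_fetch` returns `False`.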

Features of a crawler (Cont’d)
Should provide:
- Distributed
- Scalable
- Performance and efficiency
- Quality
- Freshness
- Extensible

Architecture of a crawler
[Figure: crawler architecture diagram. Labels shown: WWW, DNS, Fetch, Parse, Content Seen?, URL Filter, Dup URL Elim, URL Frontier, Doc Fingerprint, Robots templates, URL set.]

Architecture of a crawler (Cont’d)
- URL Frontier: contains the URLs yet to be fetched in the current crawl. At first, a seed set is stored in the URL Frontier, and the crawler begins by taking a URL from the seed set.
- DNS: domain name service resolution. Looks up the IP address for a domain name.
- Fetch: generally uses the HTTP protocol to fetch the page at the URL.
- Parse: the page is parsed. Text (and images, videos, etc.) and links are extracted.
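The frontier, fetch, and parse steps can be put together in a minimal sketch. It assumes a tiny in-memory "web" (the `FAKE_WEB` dictionary, with made-up pages on example.com) in place of real HTTP requests, so the DNS step is omitted; `LinkExtractor`, `fetch`, and `crawl` are illustrative names, not part of any crawler library.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Parse step: collect the href value of every <a> tag."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

# Stand-in for the Web: URL -> HTML. All pages here are invented.
FAKE_WEB = {
    "http://example.com/": '<a href="/a">A</a> <a href="/b">B</a>',
    "http://example.com/a": '<a href="/b">B</a> <a href="/">home</a>',
    "http://example.com/b": "no links here",
}

def fetch(url: str) -> str:
    """Fetch step: a real crawler would issue an HTTP GET here."""
    return FAKE_WEB.get(url, "")

def crawl(seeds):
    frontier = deque(seeds)   # URL Frontier, initialised with the seed set
    known = set(seeds)        # URLs that were ever added to the frontier
    visited = []
    while frontier:
        url = frontier.popleft()
        visited.append(url)
        extractor = LinkExtractor()
        extractor.feed(fetch(url))
        for href in extractor.links:
            absolute = urljoin(url, href)  # resolve relative links
            if absolute not in known:
                known.add(absolute)
                frontier.append(absolute)
    return visited

crawl_order = crawl(["http://example.com/"])
print(crawl_order)
```

Using a queue gives a breadth-first crawl; a production frontier would also apply priorities and politeness delays.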

Architecture of a crawler (Cont’d)
- Content Seen?: tests whether a web page with the same content has already been seen at another URL. This requires a way to compute a fingerprint of a web page.
- URL Filter: decides whether an extracted URL should be excluded from the frontier (e.g., because of robots.txt). URLs should also be normalized: a relative link, such as the "Disclaimers" link on en.wikipedia.org/wiki/Main_Page, must be resolved against the URL of the page it appears on.
- Dup URL Elim: the URL is checked against the URLs already known, and duplicates are eliminated.
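A minimal sketch of the "Content Seen?" test and of URL normalization follows. It assumes a SHA-256 hash of whitespace- and case-normalized text as the fingerprint, which only detects exact duplicates; the slides do not say which fingerprint to use. The function names and the relative link `/wiki/Help#top` are hypothetical.

```python
import hashlib
from urllib.parse import urljoin, urlsplit, urlunsplit

def fingerprint(page_text: str) -> str:
    """Hash the text after collapsing whitespace and lowercasing it."""
    normalized = " ".join(page_text.split()).lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

seen_fingerprints = set()

def content_seen(page_text: str) -> bool:
    """Return True if a page with the same fingerprint was seen before."""
    fp = fingerprint(page_text)
    if fp in seen_fingerprints:
        return True
    seen_fingerprints.add(fp)
    return False

def normalize_url(base_url: str, href: str) -> str:
    """Resolve a relative link, lowercase the host, drop the #fragment."""
    parts = urlsplit(urljoin(base_url, href))
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), parts.path or "/", parts.query, "")
    )

first = content_seen("Hello   World")
second = content_seen("hello world\n")
resolved = normalize_url("http://EN.wikipedia.org/wiki/Main_Page", "/wiki/Help#top")
print(first, second, resolved)
```

Catching near-duplicate pages would need a different kind of fingerprint than a plain hash.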

Other issues:
- Housekeeping tasks:
  - Log crawl progress statistics: URLs crawled, frontier size, etc. (every few seconds)
  - Checkpointing: a snapshot of the crawler’s state (the URL frontier) is committed to disk. (every few hours)
- Priority of URLs in the URL frontier:
  - Change rate.
  - Quality.
- Politeness:
  - Avoid repeated fetch requests to a host within a short time span. Otherwise, the crawler may be blocked.
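Priority and politeness can be combined in one frontier, sketched below under stated assumptions: a lower number means a higher priority, each host must rest for a fixed `min_delay_seconds` between fetches, and time is passed in explicitly so the example runs without sleeping. The class `PoliteFrontier` and the example hosts are hypothetical.

```python
import heapq
from urllib.parse import urlsplit

class PoliteFrontier:
    """Priority frontier that enforces a minimum delay per host."""

    def __init__(self, min_delay_seconds: float = 2.0):
        self.min_delay = min_delay_seconds
        self._heap = []            # entries: (priority, insertion order, url)
        self._counter = 0
        self._next_allowed = {}    # host -> earliest time of the next fetch

    def add(self, url: str, priority: int) -> None:
        heapq.heappush(self._heap, (priority, self._counter, url))
        self._counter += 1

    def pop_ready(self, now: float):
        """Return the best URL whose host may be contacted at `now`, else None."""
        deferred = []
        chosen = None
        while self._heap:
            item = heapq.heappop(self._heap)
            host = urlsplit(item[2]).netloc
            if self._next_allowed.get(host, 0.0) <= now:
                chosen = item[2]
                self._next_allowed[host] = now + self.min_delay
                break
            deferred.append(item)
        for item in deferred:      # put back the URLs that must wait
            heapq.heappush(self._heap, item)
        return chosen

frontier = PoliteFrontier(min_delay_seconds=2.0)
frontier.add("http://a.example/1", priority=0)
frontier.add("http://a.example/2", priority=1)
frontier.add("http://b.example/1", priority=2)

picks = [frontier.pop_ready(t) for t in (0.0, 0.5, 1.0, 2.0)]
print(picks)
```

At time 0.5 the second URL on `a.example` has the better priority but is skipped because its host was contacted half a second earlier.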

Regular Expression
- Usage: regular expressions provide a concise and flexible means of identifying strings of text of interest, such as particular characters, words, or patterns of characters.
- Today’s target: introduce the basic principles.
- A tool to verify regular expressions: Regex Tester
  http://www.dotnet2themax.com/blogs/fbalena/PermaLink,guid,13bce26d-7755-441e-92b3-1eb5f9e859f9.aspx

Regular Expression
- Metacharacter: similar to the wildcard in Windows, e.g.: *.doc
- Target: detect the email address

Regular Expression
- \b: stands for the beginning or end of a word.
  E.g.: \bhi\b finds exactly the word "hi"
- \w: matches a letter, a digit, or an underscore.
- .: matches any character except the newline.
- *: the item before * can be repeated any number of times (including zero).
  E.g.: \bhi\b.*\bLucy\b
- +: the item before + can be repeated one or more times.
- []: matches any one of the characters inside the brackets.
  E.g.: \b[aeiou]+[a-zA-Z]*\b
- {n}: repeat exactly n times
- {n,}: repeat n or more times
- {n,m}: repeat n to m times
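The patterns above can be tried out with Python's `re` module, as sketched here. The sample sentences and the digit pattern `\b[0-9]{2,4}\b` are made up for illustration; the other patterns are the ones listed on the slide.

```python
import re

greeting = "hi there, history says hi to Lucy"

# \bhi\b matches "hi" only as a whole word, so "history" is skipped.
whole_word_hits = re.findall(r"\bhi\b", greeting)

# .* lets any characters sit between the word "hi" and the word "Lucy".
hi_then_lucy = re.search(r"\bhi\b.*\bLucy\b", greeting).group()

# [aeiou]+ then [a-zA-Z]*: words that begin with a lowercase vowel.
vowel_words = re.findall(r"\b[aeiou]+[a-zA-Z]*\b", "an apple is under the old tree")

# {2,4}: whole runs of two to four digits.
digit_runs = re.findall(r"\b[0-9]{2,4}\b", "7 42 2009 123456")

print(whole_word_hits)
print(hi_then_lucy)
print(vowel_words)
print(digit_runs)
```

Raw strings (`r"..."`) keep Python from interpreting `\b` as a backspace character before the pattern reaches the regex engine.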

Regular Expression
- Target: detect the email address
- Specifications: A@B
  - A: any combination of the English letters a to z, digits, and the characters . _ % + -
  - B: cse.cuhk.edu.hk or cuhk.edu.hk (English characters)
- Answer:
  \b[a-z0-9._%+-]+@[a-z.]+\.[a-z]{2}\b
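The answer pattern can be checked against the two tutor addresses from this tutorial, as in the sketch below. The surrounding sentence, `tom@localhost`, and `user@example.com` are invented test inputs.

```python
import re

# The pattern from the slide, unchanged.
EMAIL = re.compile(r"\b[a-z0-9._%+-]+@[a-z.]+\.[a-z]{2}\b")

sample = "Contact czhou@cse.cuhk.edu.hk or xxin@cse.cuhk.edu.hk, not tom@localhost."
found = EMAIL.findall(sample)

# {2} demands a two-letter final label, so a ".com" address is rejected.
com_match = EMAIL.search("user@example.com")

print(found)
print(com_match)
```

The pattern is tailored to the stated specification (lowercase letters, a two-letter ending such as `.hk`); it is not a general-purpose email validator.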

Appendix
- Mercator Crawler: http://mias.uiuc.edu/files/tutorials/mercator.pdf
- Regular Expression tutorial: http://www.regular-expressions.info/tutorial.html

Questions?
