
1 Web Crawling

2 Next week I am attending a meeting, Monday through Wednesday. I said I could go only if I can get back for class. My flight is due in PHL at 5:22 pm.
–That is really tight for being here by 6:15.
–May we have a delayed start to class, at 7:00?
If something goes wrong and I will be later than that, I will let you know by e-mail or a post on Blackboard.

3 Web crawling – Why?
One form of gathering information. We all know about information overload:
–The numbers are staggering
–More is coming
The challenge of dealing with information, and data, will be with us for a long time. There is more out there than we might immediately expect.

4 How much information is there?
Data summarization, trend detection, and anomaly detection are key technologies.
[Chart: the scale of recorded information, from kilobytes (10^3) through mega, giga, tera, peta, exa, and zetta up to yottabytes (10^24) – a book, a photo, a movie, all books (words), all multimedia, everything recorded. Footnote on the small prefixes: 24 yocto, 21 zepto, 18 atto, 15 femto, 12 pico, 9 nano, 6 micro, 3 milli. Slide source: Jim Gray, Microsoft Research (modified).]
Soon most everything will be recorded and indexed. Most bytes will never be seen by humans. These require algorithms, data and knowledge representation, and knowledge of the domain.
See also Mike Lesk, How much information is there: http://www.lesk.com/mlesk/ksg97/ksg.html
See Lyman & Varian, How much information: http://www.sims.berkeley.edu/research/projects/how-much-info/

5 Astronomy and Computing
The Large Synoptic Survey Telescope (LSST): over 30 thousand gigabytes (30TB) of images will be generated every night during the decade-long LSST sky survey.
LSST and Google share many of the same goals: organizing massive quantities of data and making it useful. http://lsst.org/lsst/google

6 Google and Information
From the New York Times, April 16, 2012:
J. Trevor Hughes, president of the International Association of Privacy Professionals, said the Google case represented what happened when technical employees of technology companies made 'innocent' decisions about collecting data that could infuriate consumers and in turn invite regulatory inquiry. "This is one of the most significant risks we see in the information age today," he said. "Project managers and software developers don't understand the sensitivity associated with data."
The Federal Communications Commission fined and censured Google for obstructing an inquiry into its Street View project, which had collected Internet communications from potentially millions of unknowing households as specially equipped cars drove slowly by. The data was a snapshot of what people were doing online at the moment the cars rolled by — e-mailing a lover, texting jokes to a buddy, balancing a checkbook, looking up an ailment. Google spent more than two years scooping up that information, from January 2008 to April 2010.
http://bits.blogs.nytimes.com/2012/04/16/daily-report-unanswered-questions-about-google/

7 Ocean Observatories
NEPTUNE Canada ocean network is part of the Ocean Networks Canada (ONC) Observatory. Our network extends the Internet from the rocky coast to the deep abyss. We gather live data and video from instruments on the seafloor, making them freely available to the world, 24/7. http://www.neptunecanada.ca/ Live video from the seafloor, more than 2 km deep.
OOI Data Policy: All OOI data, including data from OOI core sensors and all proposed sensors added by Principal Investigators, will be rapidly disseminated, open, and freely available (within constraints of national security). Rapidly disseminated implies that data will be made available as soon as technically feasible, but generally in near real-time, with latencies as small as seconds for the cabled components. In limited cases, individual PIs who have developed a data source that becomes part of the OOI network may request exclusive rights to the data for a period of no more than one year from the onset of the data stream. http://www.oceanobservatories.org/about/frequently-asked-questions/

8 Crawling – the how
Agenda for tonight:
–The web environment
–An architecture for crawling
–Issues of politeness
–Some technical assistance

9 First, what is crawling?
A web crawler (aka a spider or a robot) is a program that:
–Starts with one or more URLs – the seed. Other URLs will be found in the pages pointed to by the seed URLs; they will be the starting point for further crawling.
–Uses the standard protocols for requesting a resource from a server, respecting server policies and politeness.
–Parses the resource obtained and extracts additional URLs from the fetched page.
–Implements policies about duplicate content.
–Recognizes and eliminates duplicate or unwanted URLs.
–Adds found URLs to the queue and continues from the request-to-server step.

10 An exercise
Go to any URL you frequently use. If you used that as a starting point for a crawl, how many pages could you get to if your crawl depth is 3?
–That is, you go to each link on the original page, each link pointed to by those first links, and then each link pointed to by the next set.
As always, work in groups of 2 or 3. Report just the number of links found.

11 The Web Environment: Depth of the Web A URL gives access to a web page. That page may have links to other pages. Some pages are generated only when information is provided through a form. –These pages cannot be discovered just by crawling. The surface web is huge. The deeper web is unfathomable.

12 Anatomy of a URL
http://www.csc.villanova.edu/~cassel
That is a pointer to a web page. Three parts:
–http – the protocol to use for retrieving the page. Other protocols, such as ftp, can be used instead.
–www.csc.villanova.edu – the name of the domain. csc is a subdomain of the villanova domain.
–~cassel – abbreviation for the subdirectory html in the directory cassel at the machine associated with www.csc.villanova.edu. index.html is the default page to return if no other file is specified.
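
To make the three parts concrete, here is a minimal sketch using Python 2's urlparse module (matching the urllib2 examples later in these slides); the URL is the one from the slide with an explicit file name added.

from urlparse import urlparse

parts = urlparse("http://www.csc.villanova.edu/~cassel/index.html")
print parts.scheme    # 'http' – the protocol
print parts.netloc    # 'www.csc.villanova.edu' – the domain name
print parts.path      # '/~cassel/index.html' – the path within the site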

13 The major domain categories
Generic categories:
–.net – Originally restricted to major participants in maintaining the Internet. Now open.
–.org – Generally non-profit organizations, including professional organizations such as acm.org
–.com – Commercial organizations such as amazon.com, etc.
–.edu – Restricted to higher education (post-secondary) institutions. High schools and elementary schools are not allowed to use it.
–.gov – Government organizations, such as nsf.gov
–.mil – Military sites
Country codes:
–.us Example: http://www.dot.state.pa.us/ (PA Dept of Transportation)
–.it
–.uk Uses second-level domains such as ac.uk or co.uk
–And other country designations. Who is .tv? The islands of Tuvalu.
Newer ones: .biz, .name, etc. All regulated by the Internet Assigned Numbers Authority (IANA).

14 If not http:// then what? Other protocols can be specified in the request to a server: –file:// local file on the current host –ftp:// use the ftp protocol to fetch the file –Etc.

15 Domain categories
The domain categories serve to partition the universe of domain names. Domain Name Servers (DNS) do lookups to translate a domain name to an IP address. An IP address locates a particular machine and makes a communication path known.
–Most common still: 32-bit IPv4 addresses
–Newer: 128-bit IPv6 (see the note on the next slide)

16 IPv6 note
                Accessible via IPv6    Total    Percentage
Web servers     453                    1118     25.2%
Mail servers    201                    1118     11.1%
DNS servers     1596                   5815     27.4%
Last Updated: Tue Apr 17 00:45:18 2012 UTC
Source: http://www.mrp.net/IPv6_Survey.html

17 Web servers A server will typically have many programs running, several listening for network connections. –A port number (16 bits) identifies the specific process for the desired connection. –Default port for web connections: 80 –If other than 80, it must be specified in the URL

18 Exercise: What is where?
Your project is running on a specific server at a specific port. Can you find the exact "address" of your project?
–Use nslookup from a unix prompt (also available from a Windows command prompt)
–Example: nslookup monet.csc.villanova.edu returns
Server: ns1.villanova.edu
Address: 153.104.1.2
Name: monet.csc.villanova.edu
Address: 153.104.202.173
Note that a local domain name server (ns1.villanova.edu) replied.
So the "phone number" of the apache server on monet is 153.104.202.173:80
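
A rough Python equivalent of the exercise, assuming Python 2 as in the later slides; the host name is just the example from the slide, so substitute your own project's server.

import socket

host = "monet.csc.villanova.edu"      # example host from the slide; use your own server
ip = socket.gethostbyname(host)       # the same lookup nslookup performs
print "Name:", host
print "Address:", ip
print "Apache's \"phone number\":", ip + ":80"   # append the default web port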

19 Crawler features
A crawler must be:
–Robust: Survive spider traps – websites that fool a spider into fetching large or limitless numbers of pages within the domain. Some are deliberate; some are errors in site design.
–Polite: Crawlers can interfere with the normal operation of a web site. Servers have policies, both implicit and explicit, about the allowed frequency of visits by crawlers. Responsible crawlers obey these; others become recognized and rejected outright.
Ref: Manning, Introduction to Information Retrieval

20 Crawler features
A crawler should be:
–Distributed: able to execute on multiple systems
–Scalable: the architecture should allow additional machines to be added as needed
–Efficient: performance is a significant issue if crawling a large web
–Useful: quality standards should determine which pages to fetch
–Fresh: keep the results up to date by crawling pages repeatedly on some organized schedule
–Extensible: a modular, well-crafted architecture allows the crawler to expand to handle new formats, protocols, etc.
Ref: Manning, Introduction to Information Retrieval

21 Scale A one month crawl of a billion pages requires fetching several hundred pages per second It is easy to lose sight of the numbers when dealing with data sources on the scale of the Web. –30 days * 24 hours/day * 60 minutes/hour * 60 seconds/minute = 2,592,000 seconds –1,000,000,000 pages/2,592,000 seconds = 385.8 pages/second Note that those numbers assume that the crawling is continuous Ref: Manning Introduction to Information Retrieval

22 Google Search
See http://video.google.com/videoplay?docid=-1243280683715323550&hl=en#
Marissa Mayer of Google on how a search happens at Google.

23 Web Operation
Basic Client Server model – the http protocol, HyperText Transfer Protocol
–A few simple commands allow communication between the server and an application requesting something from the server – usually a browser, but not always.
–Server: the site where the content resides. Most of the web is served up by Apache and its byproducts.
–Client: the program requesting something from the server. Browsers most often, but also web crawlers and other applications.

24 HTTP: GET and POST
GET <path> HTTP/<version>
–Requests that the server send the specified page back to the requestor.
–The version number allows compatible communication.
–The server sends a header and the requested file (page).
–Additional requests can follow.
POST
–Similar to a GET but allows additional information to be sent to the server.
–Useful for purchases or page edits.
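
As an illustration of what a GET request looks like from a program, here is a minimal sketch using Python 2's httplib module; the host and path are only examples.

import httplib

conn = httplib.HTTPConnection("www.csc.villanova.edu", 80)   # connect to the server on port 80
conn.request("GET", "/~cassel/")      # httplib sends "GET /~cassel/ HTTP/1.1" plus headers for us
resp = conn.getresponse()
print resp.status, resp.reason        # e.g. 200 OK
body = resp.read()                    # the requested page
print len(body), "bytes received"
conn.close()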

25 HEAD
HEAD <path> HTTP/<version>
Useful for checking whether a previously fetched web page has changed. The request results in header information, but not the page itself.
Response:
–Confirm http version compatibility
–Date:
–Server:
–Last-Modified:
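
A short sketch of using HEAD to check Last-Modified without downloading the page, under the same assumptions as the GET sketch above (Python 2's httplib, illustrative host and path).

import httplib

conn = httplib.HTTPConnection("www.csc.villanova.edu", 80)
conn.request("HEAD", "/~cassel/")     # headers only; the body is not sent
resp = conn.getresponse()
print resp.status, resp.reason
print "Server:", resp.getheader("Server")
print "Last-Modified:", resp.getheader("Last-Modified")   # compare with the value saved at the last crawl
conn.close()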

26 Full set of HTTP commands
The methods defined by HTTP/1.1 are:
OPTIONS
GET
HEAD
POST
PUT
DELETE
TRACE
CONNECT
A crawler normally needs only GET and HEAD.

27 Search Search engines, whether general engines like Google or Yahoo, or special purpose search engines in an application, do not crawl the web looking for results after receiving a query. –That would take much too long and provide unacceptable performance Search engines actually search a carefully constructed database with indices created for efficiently locating content

28 Architecture of a Search Engine Ref: Manning Introduction to Information Retrieval

29 Crawling in Context So, we see that crawling is just one step in a complex process of acquiring information from the Web to use in any application. Usually, we will want to sort through the information we found to get the most relevant part for our use. So, the example of a search engine is relevant.

30 Making a request of a server
Browsers display pages by sending a request to a web server and receiving the coded page as a response. Protocol: HTTP
–http://abc.com/filea.html … means use the http protocol to communicate with the server at the location abc.com and fetch the file named filea.html
–the html extension tells the browser to interpret the file contents as html code and display it.

31 Programming Language Help
Programming languages influence the kinds of problems that can be addressed easily. Most languages can be used to solve a broad category of problems, but are more closely attuned to some kinds of problems.
An example: Python is very well suited to text analysis and has features useful in web crawling.

32 Python module for web access
urllib2 – note, this is for Python 2.x, not Python 3. Python 3 splits the urllib2 materials over several modules.
import urllib2
urllib2.urlopen(url [, data][, timeout])
–Establish a link with the server identified in the url and send either a GET or POST request to retrieve the page.
–The optional data field provides data to send to the server as part of the request. If the data field is present, the HTTP request used is POST instead of GET. Use it to fetch content that is behind a form, perhaps a login page. If used, the data must be encoded properly for including in an HTTP request. See http://www.w3.org/TR/html4/interact/forms.html#h-17.13.4.1
–timeout defines the time in seconds to be used for blocking operations such as the connection attempt. If it is not provided, the system-wide default value is used.
http://docs.python.org/library/urllib2.html
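
A minimal sketch of the data and timeout parameters described above; the form URL and field names are invented for illustration, and a real form would dictate its own field names.

import urllib
import urllib2

form_data = urllib.urlencode({"user": "student", "query": "web crawling"})  # encode as required for HTTP
try:
    # because data is supplied, urllib2 sends a POST; 10 is the timeout in seconds
    page = urllib2.urlopen("http://example.com/search", form_data, 10)
    print page.getcode()
except urllib2.URLError as e:
    print "Request failed:", e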

33 URL fetch and use
urlopen returns a file-like object with methods:
–Same as for files: read(), readline(), fileno(), close()
–New for this class:
info() – returns meta information about the document at the URL
getcode() – returns the HTTP status code sent with the response (ex: 200, 404)
geturl() – returns the URL of the page, which may be different from the URL requested if the server redirected the request

34 URL info
info() provides the header information that http returns when the HEAD request is used. Example:
>>> print mypage.info()
Date: Mon, 12 Sep 2011 14:23:44 GMT
Server: Apache/1.3.27 (Unix)
Last-Modified: Tue, 02 Sep 2008 21:12:03 GMT
ETag: "2f0d4-215f-48bdac23"
Accept-Ranges: bytes
Content-Length: 8543
Connection: close
Content-Type: text/html

35 URL status and code
>>> print mypage.getcode()
200
>>> print mypage.geturl()
http://www.csc.villanova.edu/~cassel/

36 Python crawl example

import urllib2

url = raw_input("Enter the URL of the page to fetch: ")
try:
    linecount = 0
    page = urllib2.urlopen(url)           # open the URL; raises an error if it cannot be reached
    result = page.getcode()
    if result == 200:                     # 200 means the request succeeded
        for line in page:                 # print and count every line of the page
            print line
            linecount += 1
        print "Page Information \n ", page.info()
        print "Result code = ", page.getcode()
        print "Page contains ", linecount, " lines."
except:
    print "\nBad URL: ", url, "Did you include http:// ?"

file: url-fetch-try.py in pythonwork/classexamples
You almost certainly have a python interpreter on your machine. Copy and paste this and run it. Give it any url you want. Look at the results.

37 Basic Crawl Architecture
[Diagram: the URL Frontier supplies URLs to a Fetch module, which uses DNS to reach the WWW; fetched pages go to Parse, then a "Content seen?" test against document fingerprints (Doc FP's), a URL filter with robots filters, and duplicate URL elimination against the URL set; surviving URLs are added back to the URL Frontier.]
Ref: Manning, Introduction to Information Retrieval

38 Crawler Architecture
Modules:
–The URL frontier (the queue of URLs still to be fetched, or fetched again)
–A DNS resolution module (the translation from a URL to a web server to talk to)
–A fetch module (use http to retrieve the page)
–A parsing module to extract text and links from the page
–A duplicate elimination module to recognize links already seen
Ref: Manning, Introduction to Information Retrieval

39 Crawling threads
With so much space to explore, so many pages to process, a crawler will often consist of many threads, each of which cycles through the same set of steps we just saw. There may be multiple threads on one processor, or threads may be distributed over many nodes in a distributed system.
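
A minimal sketch of multi-threaded fetching, assuming Python 2 (Queue, urllib2): several worker threads share one queue of seed URLs. The seeds are placeholders, and a real crawler would also parse the pages and refill the queue.

import threading
import Queue
import urllib2

frontier = Queue.Queue()                  # thread-safe queue standing in for the URL frontier
for url in ["http://example.com/", "http://example.org/"]:
    frontier.put(url)

def worker():
    while True:
        try:
            url = frontier.get_nowait()
        except Queue.Empty:
            return                        # no work left for this thread
        try:
            page = urllib2.urlopen(url, timeout=10)
            print url, page.getcode(), len(page.read()), "bytes"
        except Exception as e:
            print url, "failed:", e

threads = [threading.Thread(target=worker) for _ in range(4)]   # each thread cycles through the same steps
for t in threads:
    t.start()
for t in threads:
    t.join()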

40 Politeness
Not optional.
Explicit
–Specified by the web site owner
–What portions of the site may be crawled and what portions may not be crawled: the robots.txt file
Implicit
–If no restrictions are specified, still restrict how often you hit a single site.
–You may have many URLs from the same site. Too much traffic can interfere with the site's operation. Crawler hits are much faster than ordinary traffic and could overtax the server (this constitutes a denial of service attack).
Good web crawlers do not fetch multiple pages from the same server at one time.

41 Robots.txt
Protocol nearly as old as the web. See www.robotstxt.org/robotstxt.html
File: URL/robots.txt contains the access restrictions.
Example:
User-agent: *              # all robots (spiders/crawlers)
Disallow: /yoursite/temp/
User-agent: searchengine   # the robot named searchengine only
Disallow:                  # nothing disallowed
Source: www.robotstxt.org/wc/norobots.html

42 Another example
User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /~joe/

43 Processing robots.txt
First line:
–User-agent – identifies to whom the instruction applies. * = everyone; otherwise, a specific crawler name.
–Disallow: or Allow: provides a path to exclude from or include in robot access.
Once the robots.txt file is fetched from a site, it does not have to be fetched every time you return to the site.
–Re-fetching just takes time and uses up hits on the server.
–Cache the robots.txt file for repeated reference.
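
Python 2 ships a robotparser module that does this processing; here is a minimal sketch (the site URL is only an example), fetching robots.txt once and reusing the parsed result as the slide suggests.

import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("http://www.acm.org/robots.txt")    # example site
rp.read()                                      # fetch and parse once, then keep rp around (the cache)
print rp.can_fetch("*", "http://www.acm.org/some/page.html")   # True if crawling this URL is allowed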

44 Robots tag
robots.txt provides information about access to a directory. A given file may have an html meta tag that directs robot behavior.
A responsible crawler will check for that tag and obey its direction.
Ex: <meta name="robots" content="noindex, nofollow">
–Options: INDEX, NOINDEX, FOLLOW, NOFOLLOW
See http://www.w3.org/TR/html401/appendix/notes.html#h-B.4.1.2 and http://www.robotstxt.org/meta.html

45 Crawling
Pick a URL from the frontier (which one?)
Fetch the document at the URL
Parse the fetched document
–Extract links from it to other docs (URLs)
Check if the URL has content already seen
–If not, add to indices
For each extracted URL
–Ensure it passes certain URL filter tests (e.g., only crawl .edu, obey robots.txt, etc.)
–Check if it is already in the frontier (duplicate URL elimination)
Ref: Manning, Introduction to Information Retrieval
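
A single-threaded sketch of this loop, assuming Python 2. It uses a regular expression to pull out href links, which is a simplification; a real crawler would use a proper HTML parser plus the robots.txt and politeness checks discussed earlier. The seed URL and the 50-page cap are arbitrary.

import re
import urllib2
from urlparse import urljoin

seed = "http://www.csc.villanova.edu/"        # arbitrary seed
frontier = [seed]                             # URLs still to be fetched
seen = set([seed])                            # URLs already added (duplicate URL elimination)

while frontier and len(seen) < 50:            # small cap so the sketch terminates quickly
    url = frontier.pop(0)                     # pick a URL from the frontier
    try:
        page = urllib2.urlopen(url, timeout=10).read()   # fetch the document
    except Exception:
        continue
    for link in re.findall(r'href="([^"#]+)"', page):    # parse: extract links
        absolute = urljoin(url, link)                    # normalize relative links
        if absolute.startswith("http") and absolute not in seen:   # simple URL filter
            seen.add(absolute)
            frontier.append(absolute)

print len(seen), "URLs discovered"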

46 Recall: Basic Crawl Architecture
[Same diagram as slide 37: WWW, DNS, Fetch, Parse, Content seen? (Doc FP's), URL filter (robots filters), Dup URL elim (URL set), URL Frontier.]
Ref: Manning, Introduction to Information Retrieval

47 DNS – Domain Name Server
Internet service to resolve URLs into IP addresses
Distributed servers, some significant latency possible
OS implementations – DNS lookup is blocking – only one outstanding request at a time
Solutions
–DNS caching
–Batch DNS resolver – collects requests and sends them out together
Ref: Manning, Introduction to Information Retrieval
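
A sketch of the DNS-caching idea in Python 2: resolve each host once with the blocking OS lookup and remember the answer.

import socket

dns_cache = {}

def resolve(host):
    if host not in dns_cache:
        dns_cache[host] = socket.gethostbyname(host)   # blocking lookup happens only once per host
    return dns_cache[host]

print resolve("www.acm.org")
print resolve("www.acm.org")   # answered from the cache, no second lookup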

48 Parsing
A fetched page contains
–Embedded links to more pages
–Actual content for use in the application
Extract the links
–Relative link? Expand (normalize)
–Seen before? Discard
–New? Meets criteria? Append to the URL frontier. Does not meet criteria? Discard
Examine content
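
Expanding (normalizing) a relative link can be done with urljoin from Python 2's urlparse module; the base page and links below are illustrative.

from urlparse import urljoin

base = "http://www.csc.villanova.edu/~cassel/index.html"
print urljoin(base, "courses/crawling.html")    # relative to the page's directory
print urljoin(base, "/dept/about.html")         # relative to the site root
print urljoin(base, "http://other.org/x.html")  # already absolute: passes through unchanged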

49 Content
Seen before?
–How to tell? Fingerprints, shingles
–Documents identical, or similar
–If already in the index, do not process it again
Ref: Manning, Introduction to Information Retrieval
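
A toy illustration of the two ideas, assuming Python 2: an md5 fingerprint catches exact duplicates, and word shingles give a similarity score that catches near-duplicates. The two sentences are invented examples.

import hashlib

def fingerprint(text):
    return hashlib.md5(text).hexdigest()          # identical documents get identical digests

def shingles(text, w=4):
    words = text.split()
    return set(tuple(words[i:i + w]) for i in range(len(words) - w + 1))

a = "the quick brown fox jumps over the lazy dog"
b = "the quick brown fox jumped over the lazy dog"
print fingerprint(a) == fingerprint(b)            # False: not byte-identical
sa, sb = shingles(a), shingles(b)
print float(len(sa & sb)) / len(sa | sb)          # fraction of shared shingles; closer to 1 means more similar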

50 Distributed crawler
For big crawls:
–Many processes, each doing part of the job. Possibly on different nodes, geographically distributed.
–How to distribute: give each node a set of hosts to crawl. Use a hashing function to partition the set of hosts.
–How do these nodes communicate? Need to have a common index.
Ref: Manning, Introduction to Information Retrieval
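
A sketch of partitioning hosts with a hash function, assuming Python 2; the node count and URLs are arbitrary. Every node computes the same assignment, so all URLs from one host end up at one node.

from urlparse import urlparse

NUM_NODES = 4                                   # arbitrary number of crawler nodes

def node_for(url):
    host = urlparse(url).netloc                 # partition by host, not by full URL
    return hash(host) % NUM_NODES

for u in ["http://www.acm.org/a", "http://www.acm.org/b", "http://lsst.org/x"]:
    print u, "-> node", node_for(u)             # both acm.org URLs map to the same node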

51 Communication between nodes
[Diagram: the basic crawl architecture with a host splitter added between the URL filter and duplicate URL elimination; URLs belonging to hosts assigned to other nodes are sent to those hosts, and URLs arrive from other hosts at the same point.]
The output of the URL filter at each node is sent to the Duplicate URL Eliminator at all nodes.
Ref: Manning, Introduction to Information Retrieval

52 URL Frontier
Two requirements
–Politeness: do not go too often to the same site
–Freshness: keep pages up to date. News sites, for example, change frequently.
Conflict – the two requirements may be directly in conflict with each other.
Complication
–Fetching URLs embedded in a page will yield many URLs located on the same server. Delay fetching those.
Ref: Manning, Introduction to Information Retrieval
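
A sketch of the politeness side of the frontier in Python 2: record when each host was last visited and wait before hitting it again. The 5-second delay is an arbitrary policy choice.

import time
from urlparse import urlparse

DELAY = 5.0                 # seconds between requests to any one host (policy choice)
last_access = {}            # host -> time of last request

def polite_wait(url):
    host = urlparse(url).netloc
    elapsed = time.time() - last_access.get(host, 0)
    if elapsed < DELAY:
        time.sleep(DELAY - elapsed)              # delay fetching from a recently visited host
    last_access[host] = time.time()

# call polite_wait(url) just before each fetch of url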

53 Some tools
WebSphinx
–Visualize a crawl
–Do some extraction of content from crawled pages
See http://www.cs.cmu.edu/~rcm/websphinx/ and http://sourceforge.net/projects/websphinx/
Short demonstration, if possible; screen shots as backup

54 WebSphinx
Do a simple crawl:
–Crawl: the subtree
–Starting URLs: pick a favorite spot. Don't all use the same one (politeness)
–Action: none
–Press Start
–Watch the pattern of links emerging
–When the crawl stops, click on the statistics tab. How many threads? How many links tested? Links in the queue? How many pages visited? Pages/second? Note memory use.

55 Advanced WebSphinx
The default is a depth-first crawl. Now do an advanced crawl:
–Advanced: change Depth First to Breadth First
–Compare statistics. Why is Breadth First memory intensive?
–Still in Advanced, choose the Pages tab. Action: Highlight, choose a color, URL *new*

56 Just in case … Crawl site: http://www.acm.org

57 From acm crawl of “new”

58 Using WebSphinx to capture
Action: extract, HTML tag expression, as HTML, to a file
–Give the file name the extension html, as this does not happen automatically
–Click the button with … to choose where to save the file
Pages: All Pages
Press Start
Example results: acm-images.html

59 What comes next
After crawling:
–A collection of materials – possibly hundreds of thousands of documents or more
–How to find what you want when you want it?
Now we have a traditional Information Retrieval problem
–Build an index
–Search
–Evaluate for precision and recall
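
A toy sketch of the index-building step, assuming Python 2: an inverted index maps each term to the set of documents that contain it, so search becomes a lookup instead of a crawl. The two documents are invented.

from collections import defaultdict

docs = {1: "web crawling gathers pages",
        2: "an index makes search over pages fast"}

index = defaultdict(set)                  # term -> set of document ids
for doc_id, text in docs.items():
    for term in text.lower().split():
        index[term].add(doc_id)

print sorted(index["pages"])              # [1, 2] – both documents mention "pages"
print sorted(index["search"])             # [2]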

60 Major source: Manning, Christopher, et al. Introduction to Information Retrieval. Version available at http://nlp.stanford.edu/IR-book/
Many other web sites as cited in the slides.

