Web Crawling and Basic Text Analysis Hongning Wang CS4501: Information Retrieval1.

Slides:



Advertisements
Similar presentations
Chapter 5: Introduction to Information Retrieval
Advertisements

Crawling, Ranking and Indexing. Organizing the Web The Web is big. Really big. –Over 3 billion pages, just in the indexable Web The Web is dynamic Problems:
Vector Space Model Hongning Wang Recap: what is text mining 6501: Text Mining2 “Text mining, also referred to as text data mining, roughly.
Inverted Index Hongning Wang
Information Retrieval in Practice
Search Engines and Information Retrieval
Information Retrieval Ling573 NLP Systems and Applications April 26, 2011.
Basic IR: Queries Query is statement of user’s information need. Index is designed to map queries to likely to be relevant documents. Query type, content,
Anatomy of a Large-Scale Hypertextual Web Search Engine (e.g. Google)
6/16/20151 Recent Results in Automatic Web Resource Discovery Soumen Chakrabartiv Presentation by Cui Tao.
Web Search – Summer Term 2006 III. Web Search - Introduction (Cont.) - Jeff Dean, Google's Systems Lab:
Web Search – Summer Term 2006 IV. Web Search - Crawling (c) Wolfgang Hürst, Albert-Ludwigs-University.
A Topic Specific Web Crawler and WIE*: An Automatic Web Information Extraction Technique using HPS Algorithm Dongwon Lee Database Systems Lab.
Properties of Text CS336 Lecture 3:. 2 Information Retrieval Searching unstructured documents Typically text –Newspaper articles –Web pages Other documents.
ISP 433/633 Week 7 Web IR. Web is a unique collection Largest repository of data Unedited Can be anything –Information type –Sources Changing –Growing.
WEB CRAWLERs Ms. Poonam Sinai Kenkre.
1 Intelligent Crawling Junghoo Cho Hector Garcia-Molina Stanford InfoLab.
Chapter 5: Information Retrieval and Web Search
Overview of Search Engines
Web Crawlers.
Search Engines and Information Retrieval Chapter 1.
Chapter 7 Web Content Mining Xxxxxx. Introduction Web-content mining techniques are used to discover useful information from content on the web – textual.
Inverted Index Hongning Wang Abstraction of search engine architecture User Ranker Indexer Doc Analyzer Index results Crawler Doc Representation.
Anatomy of a search engine Design criteria of a search engine Architecture Data structures.
A Simple Unsupervised Query Categorizer for Web Search Engines Prashant Ullegaddi and Vasudeva Varma Search and Information Extraction Lab Language Technologies.
Crawling Slides adapted from
Chapter 2 Architecture of a Search Engine. Search Engine Architecture n A software architecture consists of software components, the interfaces provided.
CROSSMARC Web Pages Collection: Crawling and Spidering Components Vangelis Karkaletsis Institute of Informatics & Telecommunications NCSR “Demokritos”
Autumn Web Information retrieval (Web IR) Handout #0: Introduction Ali Mohammad Zareh Bidoki ECE Department, Yazd University
Latent Semantic Analysis Hongning Wang Recap: vector space model Represent both doc and query by concept vectors – Each concept defines one dimension.
Implicit User Feedback Hongning Wang Explicit relevance feedback 2 Updated query Feedback Judgments: d 1 + d 2 - d 3 + … d k -... Query User judgment.
Search Result Interface Hongning Wang Abstraction of search engine architecture User Ranker Indexer Doc Analyzer Index results Crawler Doc Representation.
XP New Perspectives on The Internet, Sixth Edition— Comprehensive Tutorial 3 1 Searching the Web Using Search Engines and Directories Effectively Tutorial.
The Internet 8th Edition Tutorial 4 Searching the Web.
Chapter 6: Information Retrieval and Web Search
Search Engines. Search Strategies Define the search topic(s) and break it down into its component parts What terms, words or phrases do you use to describe.
Introduction to Digital Libraries hussein suleman uct cs honours 2003.
Mining Topic-Specific Concepts and Definitions on the Web Bing Liu, etc KDD03 CS591CXZ CS591CXZ Web mining: Lexical relationship mining.
Search Engine Architecture
Search Engines1 Searching the Web Web is vast. Information is scattered around and changing fast. Anyone can publish on the web. Two issues web users have.
Searching the web Enormous amount of information –In 1994, 100 thousand pages indexed –In 1997, 100 million pages indexed –In June, 2000, 500 million pages.
Intelligent Web Topics Search Using Early Detection and Data Analysis by Yixin Yang Presented by Yixin Yang (Advisor Dr. C.C. Lee) Presented by Yixin Yang.
LANGUAGE MODELS FOR RELEVANCE FEEDBACK Lee Won Hee.
Hypersearching the Web, Chakrabarti, Soumen Presented By Ray Yamada.
Introduction to Information Retrieval Aj. Khuanlux MitsophonsiriCS.426 INFORMATION RETRIEVAL.
1 University of Qom Information Retrieval Course Web Search (Spidering) Based on:
Search Result Interface Hongning Wang Abstraction of search engine architecture User Ranker Indexer Doc Analyzer Index results Crawler Doc Representation.
Information Retrieval
What is Web Information retrieval from web Search Engine Web Crawler Web crawler policies Conclusion How does a web crawler work Synchronization Algorithms.
Web Search – Summer Term 2006 VII. Web Search - Indexing: Structure Index (c) Wolfgang Hürst, Albert-Ludwigs-University.
Relevance Feedback Hongning Wang
Text Information Management ChengXiang Zhai, Tao Tao, Xuehua Shen, Hui Fang, Azadeh Shakery, Jing Jiang.
Selecting Relevant Documents Assume: –we already have a corpus of documents defined. –goal is to return a subset of those documents. –Individual documents.
1 Crawling Slides adapted from – Information Retrieval and Web Search, Stanford University, Christopher Manning and Prabhakar Raghavan.
© Prentice Hall1 DATA MINING Web Mining Margaret H. Dunham Department of Computer Science and Engineering Southern Methodist University Companion slides.
Search Engine and Optimization 1. Introduction to Web Search Engines 2.
WEB STRUCTURE MINING SUBMITTED BY: BLESSY JOHN R7A ROLL NO:18.
1 Efficient Crawling Through URL Ordering Junghoo Cho Hector Garcia-Molina Lawrence Page Stanford InfoLab.
Data mining in web applications
Information Retrieval in Practice
Search Engine Architecture
Web Crawling and Basic Text Analysis
Text Based Information Retrieval
Inverted Index Hongning Wang
Search Engine Architecture
Vector Space Model Hongning Wang
Multimedia Information Retrieval
CSE 635 Multimedia Information Retrieval
Chapter 5: Information Retrieval and Web Search
Search Engine Architecture
Presentation transcript:

Web Crawling and Basic Text Analysis Hongning Wang CS4501: Information Retrieval1

Recap: core IR concepts Information need – An IR system is to satisfy users’ information need Query – A designed representation of users’ information need Document – A representation of information that potentially satisfies users’ information need Relevance – Relatedness between documents and users’ information need Information Retrieval2

Recap: Browsing v.s. Querying Browsing – Works well when the user wants to explore information or doesn’t know what keywords to use, or cannot conveniently enter a query Querying – Works well when the user knows exactly what query to use for expressing her information need Information Retrieval3

Recap: Pull v.s. Push in IR Pull mode – with query – User takes the initiative – Works well when a user has an ad hoc information need Push mode – without query – System takes the initiative – Works well when a user has a stable information need or the system has good knowledge about a user’s need Information Retrieval4

Abstraction of search engine architecture User Ranker Indexer Doc Analyzer Index results Crawler Doc Representation Query Rep (Query) Evaluation Feedback Information Retrieval5 Indexed corpus Ranking procedure

Web crawler A automatic program that systematically browses the web for the purpose of Web content indexing and updating Synonyms: spider, robot, bot Information Retrieval6

How does it work In pseudo code Def Crawler(entry_point) { URL_list = [entry_point] while (len(URL_list)>0) { URL = URL_list.pop(); if (isVisited(URL) or !isLegal(URL) or !checkRobotsTxt(URL)) continue; HTML = URL.open(); for (anchor in HTML.listOfAnchors()) { URL_list.append(anchor); } setVisited(URL); insertToIndex(HTML); } Information Retrieval7 Which page to visit next? Is the access granted? Is it visited already? Or shall we visit it again?

Visiting strategy Breadth first – Uniformly explore from the entry page – Memorize all nodes on the previous level – As shown in pseudo code Depth first – Explore the web by branch – Biased crawling given the web is not a tree structure Focused crawling – Prioritize the new links by predefined strategies Information Retrieval8

Focused crawling Prioritize the visiting sequence of the web – The size of Web is too large for a crawler (even Google) to completely cover – Not all documents are equally important – Emphasize more on the high-quality documents Maximize weighted coverage Information Retrieval9 Weighted coverage till time t Importance of page p Pages crawled till time t In 1999, no search engine indexed more than 16% of the Web In 2005, large-scale search engines index no more than 40-70% of the indexable Web

Focused crawling Prioritize by in-degree [Cho et al. WWW’98] – The page with the highest number of incoming hyperlinks from previously downloaded pages is downloaded next Prioritize by PageRank [Abiteboul et al. WWW’07, Cho and Uri VLDB’07] – Breadth-first in early stage, then compute/approximate PageRank periodically – More consistent with search relevance [Fetterly et al. SIGIR’09] Information Retrieval10

Focused crawling Prioritize by topical relevance – In vertical search, only crawl relevant pages [De et al. WWW’94] E.g., restaurant search engine should only crawl restaurant pages – Estimate the similarity to current page by anchortext or text near anchor [Hersovici et al. WWW’98] – User given taxonomy or topical classifier [Chakrabarti et al. WWW’98] Information Retrieval11

Avoid duplicate visit Given web is a graph rather than a tree, avoid loop in crawling is important What to check – URL: must be normalized, not necessarily can avoid all duplication FTOKEN= FTOKEN= – Page: minor change might cause misfire Timestamp, data center ID change in HTML How to check – trie or hash table Information Retrieval12

Politeness policy Crawlers can retrieve data much quicker and in greater depth than human searchers Costs of using Web crawlers – Network resources – Server overload Robots exclusion protocol – Examples: CNN, UVaCNNUVa Information Retrieval13

Robot exclusion protocol examples Exclude specific directories: User-agent: * Disallow: /tmp/ Disallow: /cgi-bin/ Disallow: /users/paranoid/ Exclude a specific robot: User-agent: GoogleBot Disallow: / Allow a specific robot: User-agent: GoogleBot Disallow: User-agent: * Disallow: / Information Retrieval14

Re-visit policy The Web is very dynamic; by the time a Web crawler has finished its crawling, many events could have happened, including creations, updates and deletions – Keep re-visiting the crawled pages – Maximize freshness and minimize age of documents in the collection Strategy – Uniform re-visiting – Proportional re-visiting Visiting frequency is proportional to the page’s update frequency Information Retrieval15

Analyze crawled web pages What you care from the crawled web pages Information Retrieval16

Analyze crawled web pages What machine knows from the crawled web pages Information Retrieval17

Basic text analysis techniques Needs to analyze and index the crawled web pages – Extract informative content from HTML – Build machine accessible data representation Information Retrieval18

HTML parsing Generally difficult due to the free style of HTML Solutions – Shallow parsing Remove all HTML tags Only keep text between and – Automatic wrapper generation [Crescenzi et al. VLDB’01] Wrapper: regular expression for HTML tags’ combination Inductive reasoning from examples – Visual parsing [Yang and Zhang DAR’01] Frequent pattern mining of visually similar HTML blocks Information Retrieval19

HTML parsing jsoup – Java-based HTML parser scrape and parse HTML from a URL, file, or string to DOM tree Find and extract data, using DOM traversal or CSS selectors – children(), parent(), siblingElements() – getElementsByClass(), getElementsByAttributeValue() – Python version: Beautiful SoupBeautiful Soup Information Retrieval20

How to represent a document Represent by a string? – No semantic meaning Represent by a list of sentences? – Sentence is just like a short document (recursive definition) Represent by a list of words? – Tokenize it first – Bag-of-Words representation! Information Retrieval21

Tokenization Break a stream of text into meaningful units – Tokens: words, phrases, symbols – Definition depends on language, corpus, or even context Input: It’s not straight-forward to perform so-called “tokenization.” Output(1): 'It’s', 'not', 'straight-forward', 'to', 'perform', 'so-called', '“tokenization.”' Output(2): 'It', '’', 's', 'not', 'straight', '-', 'forward, 'to', 'perform', 'so', '-', 'called', ‘“', 'tokenization', '.', '”‘ Information Retrieval22

Tokenization Solutions – Regular expression [\w]+: so-called -> ‘so’, ‘called’ [\S]+: It’s -> ‘It’s’ instead of ‘It’, ‘’s’ – Statistical methods Explore rich features to decide where is the boundary of a word – Apache OpenNLP ( – Stanford NLP Parser ( parser.shtml) parser.shtml Online Demo – Stanford ( – UIUC ( Information Retrieval23

Full text indexing Bag-of-Words representation – Doc1: Information retrieval is helpful for everyone. – Doc2: Helpful information is retrieved for you. informationretrievalretrievedishelpfulforyoueveryone Doc Doc Word-document adjacency matrix Information Retrieval24

Full text indexing Bag-of-Words representation – Assumption: word is independent from each other – Pros: simple – Cons: grammar and order are missing – The most frequently used document representation Image, speech, gene sequence Information Retrieval25

Full text indexing Information Retrieval26

Full text indexing Information Retrieval27

Recap: web crawling In pseudo code Def Crawler(entry_point) { URL_list = [entry_point] while (len(URL_list)>0) { URL = URL_list.pop(); if (isVisited(URL) or !isLegal(URL) or !checkRobotsTxt(URL)) continue; HTML = URL.open(); for (anchor in HTML.listOfAnchors()) { URL_list.append(anchor); } setVisited(URL); insertToIndex(HTML); } Information Retrieval28 Which page to visit next? Is the access granted? Is it visited already? Or shall we visit it again?

Recap: crawling strategy Breadth first – Uniformly explore from the entry page Depth first – Biased crawling given the web is not a tree structure Focused crawling – Prioritize by in-degree [Cho et al. WWW’98] – Prioritize by PageRank [Abiteboul et al. WWW’07, Cho and Uri VLDB’07] – Prioritize by topical relevance Information Retrieval29

Recap: challenges in web crawling Avoid duplicate visit – Recognize URLs pointing to the same content Re-visit policy – Maximize freshness and minimize age of documents in the collection Information Retrieval30

Recap: HTML parsing Shallow parsing – Only keep text between and Automatic wrapper generation [Crescenzi et al. VLDB’01] – Wrapper: regular expression for HTML tags’ combination Visual parsing [Yang and Zhang DAR’01] – Frequent pattern mining of visually similar HTML blocks Information Retrieval31

Recap: full text indexing Bag-of-Words representation – Doc1: Information retrieval is helpful for everyone. – Doc2: Helpful information is retrieved for you. informationretrievalretrievedishelpfulforyoueveryone Doc Doc Word-document adjacency matrix Information Retrieval32

Recap: tokenization Break a stream of text into meaningful units – Rule-based solution: regular expressions – Statistical methods: learning-based solution to predict word boundaries Information Retrieval33

Recap: full text indexing Information Retrieval34

Statistical property of language A plot of word frequency in Wikipedia (Nov 27, 2006) Word frequency Word rank by frequency Information Retrieval35 discrete version of power law In the Brown Corpus of American English text, the word "the" is the most frequently occurring word, and by itself accounts for nearly 7% of all word occurrences; the second-place word "of" accounts for slightly over 3.5% of words.

Zipf’s law tells us Head words may take large portion of occurrence, but they are semantically meaningless – E.g., the, a, an, we, do, to Tail words take major portion of vocabulary, but they rarely occur in documents – E.g., dextrosinistral The rest is most representative – To be included in the controlled vocabulary Information Retrieval36

Automatic text indexing Remove non-informative words Remove rare words Information Retrieval37 Remove 1s Remove 0s

Normalization Convert different forms of a word to normalized form in the vocabulary – U.S.A -> USA, St. Louis -> Saint Louis Solution – Rule-based Delete periods and hyphens All in lower case – Dictionary-based Construct equivalent class – Car -> “automobile, vehicle” – Mobile phone -> “cellphone” Information Retrieval38

Stemming Reduce inflected or derived words to their root form – Plurals, adverbs, inflected word forms E.g., ladies -> lady, referring -> refer, forgotten -> forget – Bridge the vocabulary gap – Risk: lose precise meaning of the word E.g., lay -> lie (a false statement? or be in a horizontal position?) – Solutions (for English) Porter stemmer: pattern of vowel-consonant sequence Krovetz Stemmer: morphological rules Information Retrieval39

Stopwords Useless words for query/document analysis – Not all words are informative – Remove such words to reduce vocabulary size – No universal definition – Risk: break the original meaning and structure of text E.g., this is not a good option -> option to be or not to be -> null The OEC: Facts about the language Information Retrieval40

Abstraction of search engine architecture Doc Analyzer Crawler Doc Representation Information Retrieval41 Indexed corpus 1.Visiting strategy 2.Avoid duplicated visit 3.Re-visit policy 1.HTML parsing 2.Tokenization 3.Stemming/normalization 4.Stopword/controlled vocabulary filter BagOfWord representation!

Automatic text indexing In modern search engine – No stemming or stopword removal, since computation and storage are no longer the major concern – More advanced NLP techniques are applied Named entity recognition – E.g., people, location and organization Dependency parsing Information Retrieval42 Query: “to be or not to be”

What you should know Basic techniques for crawling Zipf’s law Procedures for automatic text indexing Bag-of-Words document representation Information Retrieval43

Today’s reading Introduction to Information Retrieval – Chapter 20: Web crawling and indexes Section 20.1, Overview Section 20.2, Crawling – Chapter 2: The term vocabulary and postings lists Section 2.2, Determining the vocabulary of terms – Chapter 5: Index compression Section 5.1, Statistical properties of terms in information retrieval Information Retrieval44

Reference I Cho, Junghoo, Hector Garcia-Molina, and Lawrence Page. "Efficient crawling through URL ordering." Computer Networks and ISDN Systems 30.1 (1998): Abiteboul, Serge, Mihai Preda, and Gregory Cobena. "Adaptive on-line page importance computation." Proceedings of the 12th international conference on World Wide Web. ACM, Cho, Junghoo, and Uri Schonfeld. "RankMass crawler: a crawler with high personalized pagerank coverage guarantee." Proceedings of the 33rd international conference on Very large data bases. VLDB Endowment, Fetterly, Dennis, Nick Craswell, and Vishwa Vinay. "The impact of crawl policy on web search effectiveness." Proceedings of the 32nd international ACM SIGIR conference on Research and development in information retrieval. ACM, De Bra, Paul ME, and R. D. J. Post. "Information retrieval in the World-Wide Web: making client-based searching feasible." Computer Networks and ISDN Systems 27.2 (1994): Hersovici, Michael, et al. "The shark-search algorithm. An application: tailored Web site mapping." Computer Networks and ISDN Systems 30.1 (1998): Information Retrieval45

Reference II Chakrabarti, Soumen, Byron Dom, Prabhakar Raghavan, Sridhar Rajagopalan, David Gibson, and Jon Kleinberg. "Automatic resource compilation by analyzing hyperlink structure and associated text." Computer Networks and ISDN Systems 30, no. 1 (1998): Crescenzi, Valter, Giansalvatore Mecca, and Paolo Merialdo. "Roadrunner: Towards automatic data extraction from large web sites." VLDB. Vol Yang, Yudong, and HongJiang Zhang. "HTML page analysis based on visual cues." Document Analysis and Recognition, Proceedings. Sixth International Conference on. IEEE, Information Retrieval46