ITEC547 Text Mining Web Technologies Search Engines.

Outline of Presentation
1 Early Search Engines
2 Indexing Text for Search
3 Indexing Multimedia
4 Queries
5 Searching an Index

1 Early Search Engines: History, Problems, Solutions …

Rest In Peace
– Open Text ( )
– Magellan ( )
– Infoseek (Go) ( )
– Snap (NBCi) ( )
– Direct Hit ( )

Changing
– Lycos (1994; reborn 1999)
– WebCrawler (1994; reborn 2001)
– Yahoo (1994; reborn 2002)
– Excite (1995; reborn 2001)
– HotBot (1996; reborn 2002)
– Ask Jeeves (1998; reborn 2002)

Same As They Ever Were
– AltaVista (1995- )
– LookSmart (1996- )
– Overture (1998- )

The New Breed
– Google (1998- )
– AllTheWeb (1999- )
– Teoma (2000- )
– WiseNut (2001- )

Information Retrieval
– The indexing and retrieval of textual documents.
– Searching for pages on the World Wide Web is the most recent and perhaps most widely used IR application.
– Concerned firstly with retrieving documents that are relevant to a query.
– Concerned secondly with retrieving efficiently from large sets of documents.

Typical IR Task
Given:
– A corpus of textual natural-language documents.
– A user query in the form of a textual string.
Find:
– A ranked set of documents that are relevant to the query.

Typical IR System Architecture
[Figure: a query string and a document corpus are the inputs to the IR system, which outputs a ranked list of documents (1. Doc1, 2. Doc2, 3. Doc3).]

EARLY SEARCH ENGINES
Initially used in academic or specialized domains.
– Legal and specialized domains consume a large amount of textual information
Use of expensive proprietary hardware and software
– High computational and storage requirements
Boolean query model
Iterative search model
– Fetch documents in many steps

Medline of the National Library of Medicine
Developed in the late 1960s and made available in 1971
Based on inverted file organization
Boolean query language
– Queries broken down into numbered segments
– Results of one query segment fed into the next query segment
Each user assigned a time slot
– If the cycle is not completed within the time slot, the most recent results are returned
Query and browse operations performed as separate steps
– Following a query, results are viewed
– Modifications start a new query-browse cycle

Dialog
Broader subject content
Specialized collections of data available for a fee
Boolean query
– Each term numbered and executed separately, then combined
– Word patterns
– Proximity operator W for multiword queries

2 Indexing Text for Search: reduce retrieval time, improve hit accuracy

Why Index
Simplest approach: search the text sequentially
– Only feasible when the collection is small
Static or semi-static index
Inverted index
– A mapping from content, such as words or numbers, to its locations in a database file, in a document, or in a set of documents.
Index entries may record documents, positions in documents, and weights
Fuzzy matching, stemming, stopwords

Example: Inverted Index
T0: "it is what it is"
T1: "what is it"
T2: "it is a banana"

"a": {2}
"banana": {2}
"is": {0, 1, 2}
"it": {0, 1, 2}
"what": {0, 1}

Example: Full Inverted Index
T0: "it is what it is"
T1: "what is it"
T2: "it is a banana"

Each entry is a (document number, word position) pair, both counted from 0.
"a": {(2, 2)}
"banana": {(2, 3)}
"is": {(0, 1), (0, 4), (1, 1), (2, 1)}
"it": {(0, 0), (0, 3), (1, 2), (2, 0)}
"what": {(0, 2), (1, 0)}
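The two example indexes can be reproduced with a few lines of Python. This is a minimal sketch, assuming whitespace tokenization and documents numbered from 0; the function names `build_inverted_index` and `build_full_inverted_index` are illustrative and do not come from the slides.

```python
from collections import defaultdict

# The three example texts, numbered 0, 1, 2.
texts = ["it is what it is", "what is it", "it is a banana"]


def build_inverted_index(docs):
    """Map each word to the set of document numbers that contain it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for word in text.split():
            index[word].add(doc_id)
    return dict(index)


def build_full_inverted_index(docs):
    """Map each word to a list of (document number, word position) pairs."""
    index = defaultdict(list)
    for doc_id, text in enumerate(docs):
        for position, word in enumerate(text.split()):
            index[word].append((doc_id, position))
    return dict(index)


inverted = build_inverted_index(texts)
full = build_full_inverted_index(texts)

for word in sorted(full):
    print(word, sorted(inverted[word]), full[word])
```

The plain index is enough for Boolean keyword queries; the full index stores positions as well, which phrase and proximity queries need.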

Inverted Index

Google Index
A unique DocId is associated with each URL
Hit: a word occurrence
– wordID: 24-bit number
– Word position
– Font size relative to the rest of the document
– Plain hit: in the body of the document
– Fancy hit: in the URL, title, anchor text, or meta tags
Word occurrences of a web page are distributed across a set of barrels

Architecture of the 1st Google Engine

3 Indexing Multimedia Broadcast and compress for seamless delivery

Indexing Multimedia
Forming an index for multimedia
– Use context: surrounding text
– Add a manual description
– Analyze automatically and attach a description

4 Queries

– Keywords
– Proximity
– Patterns
– Phrases
– Ranges
– Weights of keywords
– Spelling mistakes

Queries
Boolean query
– No relevance measure
– May be hard to understand
Multimedia query
– Find images of Everest
– Find x-rays showing the human rib cage
– Find companies whose stock prices have similar patterns

Relevance
Relevance is a subjective judgment and may include:
– Being on the proper subject.
– Being timely (recent information).
– Being authoritative (from a trusted source).
– Satisfying the goals of the user and his/her intended use of the information (information need).

Keyword Search
– Simplest notion of relevance is that the query string appears verbatim in the document.
– Slightly less strict notion is that the words in the query appear frequently in the document, in any order (bag of words).
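The two notions of keyword relevance can be contrasted in a short sketch. It assumes lowercase comparison and whitespace tokenization; `verbatim_match` and `bag_of_words_score` are hypothetical helper names, and the raw occurrence count stands in for whatever frequency measure a real system would use.

```python
from collections import Counter


def verbatim_match(query, document):
    """Strict notion: the query string appears verbatim in the document."""
    return query.lower() in document.lower()


def bag_of_words_score(query, document):
    """Looser notion: count how often the query words occur, in any order."""
    counts = Counter(document.lower().split())
    return sum(counts[term] for term in query.lower().split())


document = "it is what it is"
query = "what is it"

strict = verbatim_match(query, document)     # the exact phrase is absent
loose = bag_of_words_score(query, document)  # but every query word occurs

print(strict, loose)
```

The document fails the verbatim test, yet scores well under the bag-of-words test because word order is ignored.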

Problems with Keywords
May not retrieve relevant documents that include synonymous terms.
– “restaurant” vs. “café”
– “PRC” vs. “China”
May retrieve irrelevant documents that include ambiguous terms.
– “bat” (baseball vs. mammal)
– “Apple” (company vs. fruit)
– “bit” (unit of data vs. act of eating)

Relevance Feedback

5 Searching an Index

Searching an Inverted Index
1. Tokenize the query and search the index vocabulary for each query token
2. Get the list of documents associated with each token
3. Combine the lists of documents using the constraints specified in the query
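The three steps above can be sketched as follows. The sketch assumes a small hand-written index, whitespace tokenization, and only two kinds of constraint (all terms, or any term); the `search` function and its `mode` parameter are illustrative names.

```python
# A small inverted index: word -> set of document numbers (assumed data).
index = {
    "a": {2},
    "banana": {2},
    "is": {0, 1, 2},
    "it": {0, 1, 2},
    "what": {0, 1},
}


def search(inverted_index, query, mode="and"):
    """Tokenize the query, fetch each token's documents, then combine them."""
    tokens = query.lower().split()
    # Tokens missing from the vocabulary contribute an empty document set.
    postings = [inverted_index.get(token, set()) for token in tokens]
    if not postings:
        return set()
    if mode == "and":
        return set.intersection(*postings)  # documents containing every term
    return set.union(*postings)             # documents containing any term


both = search(index, "what is")               # AND constraint
either = search(index, "banana what", "or")   # OR constraint
neither = search(index, "banana what")        # no document has both words

print(both, either, neither)
```

A production engine would intersect sorted posting lists instead of Python sets, but the combination logic is the same.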

Google Search
1. Tokenize the query and remove stopwords
2. Translate the query words into wordIDs using the lexicon
3. For every wordID, get the list of documents from the short inverted barrels and build a composite set of documents
4. Scan the composite list of documents
   i. Skip to the next document if the current document does not match
   ii. Compute a rank using the query and the document features
   iii. If there are no more documents, go to step 3 and use the full inverted barrels to find more documents
   iv. If there is a sufficient number of documents, go to step 5
5. Sort the final document list by rank

How are results ranked?
Weight by hit type
– Location: title, URL, anchor, body
– Size: relative font size
– Capitalization
Count occurrences
Closeness (proximity)

Evaluation

Ranking Algorithms: Hyperlink Popularity Ranking
Rank “popular” documents higher among the set of documents with specific keywords.
Determining “popularity”
– Access rate? How to get accurate data?
– Bookmarks? Might be private?
– Links to related pages? Use a web crawler to analyze external links.

Popularity/Prestige
Transfer of prestige
– A link from a popular page x to a page y is treated as conferring more prestige on page y than a link from a not-so-popular page z.
Count of in-links/out-links

Hypertext Induced Topic Search (HITS)
The HITS algorithm:
– Computes popularity using a set of related pages only.
Important web pages: cited by other important web pages or by a large number of less-important pages
Initially all pages have the same importance

Hubs and Authorities
Hub: a page that stores links to many related pages
– May not in itself contain actual information on a topic
Authority: a page that contains actual information on a topic
– May not store links to many related pages
Each page gets a prestige value as a hub (hub-prestige), and another prestige value as an authority (authority-prestige).

Hubs and Authorities in Twitter

Hubs and Authorities algorithm
1. Locate and build the subgraph
2. Assign initial values to the hub and authority scores of each node
3. Run a loop till convergence
   i. Set the authority score of node x to the sum of the hub scores of all nodes y that link to x
   ii. Set the hub score of node x to the sum of the authority scores of all nodes y that x links to
   iii. Normalize the hub and authority scores of all nodes
   iv. Check for convergence: is the change in scores below a threshold?
4. Return the list of nodes sorted in descending order of hub and authority scores
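A compact sketch of this loop is shown below. It assumes the subgraph is already built and given as a dictionary from each node to the nodes it links to, uses Euclidean-length normalization, and computes hub scores from the freshly updated authority scores; these choices, the `hits` function name, and the tiny example graph are assumptions rather than details taken from the slides.

```python
import math


def hits(links, max_iter=100, tol=1e-8):
    """Return (hub, authority) score dictionaries for a link graph."""
    nodes = set(links) | {t for targets in links.values() for t in targets}
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(max_iter):
        # Authority of x: sum of hub scores of the nodes that link to x.
        new_auth = {n: 0.0 for n in nodes}
        for source, targets in links.items():
            for target in targets:
                new_auth[target] += hub[source]
        # Hub of x: sum of authority scores of the nodes x links to.
        new_hub = {n: sum(new_auth[t] for t in links.get(n, ())) for n in nodes}
        # Normalize both score vectors to unit length.
        a_norm = math.sqrt(sum(v * v for v in new_auth.values())) or 1.0
        h_norm = math.sqrt(sum(v * v for v in new_hub.values())) or 1.0
        new_auth = {n: v / a_norm for n, v in new_auth.items()}
        new_hub = {n: v / h_norm for n, v in new_hub.items()}
        # Convergence check: total change in scores since the last pass.
        change = sum(abs(new_auth[n] - auth[n]) + abs(new_hub[n] - hub[n])
                     for n in nodes)
        hub, auth = new_hub, new_auth
        if change < tol:
            break
    return hub, auth


# Hypothetical graph: pages a and b link out; c and d only receive links.
graph = {"a": ["c", "d"], "b": ["c"], "c": [], "d": []}
hub_scores, auth_scores = hits(graph)
best_hub = max(hub_scores, key=hub_scores.get)
best_authority = max(auth_scores, key=auth_scores.get)
print(best_hub, best_authority)
```

In this graph page c collects links from both a and b, so it ends up as the strongest authority, while a, which links to both c and d, ends up as the strongest hub.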

Page Rank Algorithm

PageRank Algorithm
1. Locate and build the subgraph
2. Save the number of out-links from every node in an array
3. Assign a default PageRank to all nodes
4. Run a loop till convergence
   i. Compute a new PageRank score for every node: sum, over every node that links to it, that node's PageRank score divided by its number of out-links, then add the default rank source
   ii. Check for convergence: is the difference between the new and old PageRank below a threshold?
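The loop above can be sketched in Python as follows. The damping factor of 0.85, the uniform default rank of 1/N, and the even redistribution of rank held by pages without out-links are common conventions assumed here, not values given in the slides; the `pagerank` function name and the example graph are also illustrative.

```python
def pagerank(links, damping=0.85, tol=1e-10, max_iter=200):
    """Return a PageRank score for every node of a link graph."""
    nodes = sorted(set(links) | {t for ts in links.values() for t in ts})
    n = len(nodes)
    # Step 2: number of out-links of every node.
    out_degree = {node: len(links.get(node, ())) for node in nodes}
    # Step 3: default PageRank for all nodes.
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(max_iter):
        # Rank held by pages with no out-links is shared by all pages.
        dangling = sum(rank[node] for node in nodes if out_degree[node] == 0)
        # The default rank source every node receives.
        base = (1.0 - damping) / n + damping * dangling / n
        new_rank = {node: base for node in nodes}
        # Each page passes its rank, split evenly, along its out-links.
        for source in nodes:
            for target in links.get(source, ()):
                new_rank[target] += damping * rank[source] / out_degree[source]
        # Convergence check: difference between new and old PageRank.
        change = sum(abs(new_rank[node] - rank[node]) for node in nodes)
        rank = new_rank
        if change < tol:
            break
    return rank


# Hypothetical three-page graph.
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(graph)
top_page = max(ranks, key=ranks.get)
print(top_page, ranks)
```

Page c receives links from both a and b, so it ranks highest; a follows closely because it receives all of c's rank.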

But wait… There’s Homework!
1- Explain web crawling and the general architecture of a web crawler.
2- What is the use of robots.txt?
3- Find a web crawler code and explain how it can be used to collect information on ?
4- Crawl the social media to collect emu related info.