Published by Bertram Warren, modified over 9 years ago
1
ITEC547 Text Mining Fall 2015-16 Overview of Search Engines
2
Outline of Presentation
1 Early Search Engines
2 Indexing Text for Search
3 Indexing Multimedia
4 Queries
5 Searching an Index
3
Search Engine Characteristics
– Unedited: anyone can enter content; quality issues, spam
– Varied information types: phone book, brochures, catalogs, dissertations, news reports, weather, all in one place!
– Different kinds of users:
  – Lexis-Nexis: paying, professional searchers
  – Online catalogs: scholars searching scholarly literature
  – Web: every type of person with every type of goal
– Scale: hundreds of millions of searches/day; billions of docs
4
Web Search Queries
Web search queries are short:
– ~2.4 words on average (Aug 2000)
– Has increased; was ~1.7 (~1997)
User expectations:
– Many say "The first item shown should be what I want to see!"
– This works if the user has the most popular/common notion in mind, not otherwise.
5
1 Background: History, Problems, Solutions …
6
Search Engines
– Open Text (1995-1997)
– Magellan (1995-2001)
– Infoseek (Go) (1995-2001)
– Snap (NBCi) (1997-2001)
– Direct Hit (1998-2002)
– Lycos (1994; reborn 1999)
– WebCrawler (1994; reborn 2001)
– Yahoo (1994; reborn 2002)
– Excite (1995; reborn 2001)
– HotBot (1996; reborn 2002)
– Ask Jeeves (1998; reborn 2002)
– Teoma (2000-2001)
– AltaVista (1995- )
– LookSmart (1996- )
– Overture (1998- )
7
Early Search Engines
– Initially used in academic or specialized domains; legal and other specialized domains consume a large amount of textual information
– Expensive proprietary hardware and software: high computational and storage requirements
– Boolean query model
– Iterative search model: fetch documents in many steps
8
Medline of the National Library of Medicine
– Developed in the late 1960s and made available in 1971
– Based on an inverted file organization
– Boolean query language: queries broken down into numbered segments; the results of one query segment are fed into the next
– Each user assigned a time slot; if a cycle is not completed in the time slot, the most recent results are returned
– Query and browse operations performed as separate steps: following a query, results are viewed; modifications start a new query-browse cycle
9
Dialog
– Broader subject content
– Specialized collections of data available for a fee
– Boolean queries: each term numbered and executed separately, then combined
– Word patterns
– For multiword queries, the proximity operator W
10
Information Retrieval
– The indexing and retrieval of textual documents.
– Searching for pages on the World Wide Web is perhaps the most widely used IR application.
– Concerned first with retrieving documents relevant to a query.
– Concerned second with retrieving from large sets of documents efficiently.
11
Web Search as a Typical IR Task
Given:
– A corpus of textual natural-language documents.
– A user query in the form of a textual string.
Find:
– A ranked set of documents that are relevant to the query.
12
Typical IR System Architecture (diagram): a query string and a document corpus feed into the IR system, which returns ranked documents (1. Doc1, 2. Doc2, 3. Doc3, …).
13
Architecture of a Search Engine (diagram): a gatherer/web spider crawls the Web; an indexer builds the indexes (including ad indexes); the user submits queries through the user interface to the searcher and evaluator.
14
How does it work?
– User interface: lets you type a query and displays the results.
– Searcher: searches the database for matches to your query.
– Evaluator: assigns scores/ranks to the retrieved information.
– Gatherer: the component that travels the Web and collects information.
– Indexer: categorizes the data collected by the gatherer.
15
User Interface
– Provides a mechanism for a user to submit queries to the search engine.
– Uses forms; very user-friendly.
– Displays the search results in a convenient way; a summary of each matched page is shown.
16
Searcher
– A program that uses the search engine's database to locate the matches for a specific query.
– The database of a search engine holds an extremely large number of indexed pages, so a highly efficient search algorithm is necessary.
– Computer scientists have spent years developing searching and sorting methods; see any algorithms textbook.
17
Evaluator
– The searcher returns a set of URLs that match your query, but not all hits match it equally well.
– The more references to a page, the higher its ranking.
– How is the relevancy score calculated? It varies from one engine to another:
  – How many times does the word appear?
  – Do the query words appear in the title?
  – Do the query words appear in the META tag?
18
Web Spider/Gatherer
– A program that traverses the Web and gathers information about Web documents.
– It runs at short, regular intervals.
– The information it returns is indexed into the database.
– Alternate names: bot, crawler, robot, spider, worm.
19
Example Crawling Algorithms
20
Indexer
– Organizes the data by creating a set of keys, or an index (e.g., libraries index by author, title, ISBN, etc.).
– Indexes need to be rebuilt frequently to ensure that returned URLs are not out of date.
– A search engine is very complex and needs to be broken down into different components.
21
Case Study: AltaVista
– Sends out crawlers (robot programs) that capture information from the web and bring it back.
– The main crawler, "Scooter", simultaneously sends out HTTP requests, like a blind user browsing the Web.
– All this information is stored in the indexing engine.
– Scooter's cousins help remove "dead" links.
– On a typical day, Scooter visits over 10 million pages.
– Web pages that no links reference will never be found, but you can also submit your URL to AltaVista directly.
22
2 Indexing Text for Search
Reduce retrieval time; improve hit accuracy.
23
Why Index?
– Simplest approach: search the text sequentially; only feasible when the text is small.
– Static and semi-static indexes
– Inverted index: a mapping from content, such as words or numbers, to its locations in a database file, a document, or a set of documents.
– Postings can store documents, positions within documents, and weights.
– Fuzzy matching / stemming / stopwords
24
What is an inverted index? (Example)
T0: "it is what it is"
T1: "what is it"
T2: "it is a banana"
"a": {2}
"banana": {2}
"is": {0, 1, 2}
"it": {0, 1, 2}
"what": {0, 1}
25
"a": {(2, 2)} "banana": {(2, 3)} "is": {(0, 1), (0, 4), (1, 1), (2, 1)} "it": {(0, 0), (0, 3), (1, 2), (2, 0)} "what": {(0, 2), (1, 0)} T0 : "it is what it is“ T1 : "what is it“ T2 : "it is a banana" Full Inverted Index What is an inverted index? Example
26
How Inverted Files Are Created
– Periodically rebuilt; static otherwise.
– Documents are parsed to extract tokens, which are saved with the document ID.
Doc 1: "Now is the time for all good men to come to the aid of their country"
Doc 2: "It was a dark and stormy night in the country manor. The time was past midnight"
27
How Inverted Files Are Created
– After all documents have been parsed, the inverted file is sorted alphabetically.
28
How Inverted Files Are Created
– Multiple term entries for a single document are merged.
– Within-document term frequency information is compiled.
29
How Inverted Files Are Created
Finally, the file can be split into:
– a dictionary or lexicon file: the set of all words in the text (the vocabulary), and
– a postings file: each occurrence of each word in the text.
30
How Inverted Files Are Created (diagram): the dictionary/lexicon file on one side, the postings file on the other.
31
Inverted Indexes
– Permit fast search for individual terms.
– For each term, you get a list consisting of:
  – document ID
  – frequency of the term in the doc (optional)
  – position of the term in the doc (optional)
– These lists can be used to solve Boolean queries:
  country -> d1, d2
  manor -> d2
  country AND manor -> d2
– Also used for statistical ranking algorithms.
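The Boolean AND above reduces to intersecting sorted posting lists; a minimal sketch (the `postings` dictionary mirrors the country/manor example, with documents numbered 1 and 2):

```python
# Answer a Boolean AND query by merge-intersecting two sorted
# lists of document IDs, as in "country AND manor -> d2".
def intersect(p1, p2):
    """Merge-intersect two sorted lists of doc IDs in O(len(p1) + len(p2))."""
    i = j = 0
    out = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            out.append(p1[i]); i += 1; j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return out

postings = {"country": [1, 2], "manor": [2]}
print(intersect(postings["country"], postings["manor"]))  # [2]
```

Because both lists are kept sorted, a single linear merge suffices; no hashing or repeated scans of the collection are needed.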
32
Inverted Indexes for Web Search Engines
– Inverted indexes are still used, even though the web is so huge.
– Some systems partition the indexes across different machines; each machine handles different parts of the data.
– Other systems duplicate the data across many machines; queries are distributed among the machines.
– Most do a combination of these.
33
Inverted Index: should we index all words?
34
How to index words?
35
Inverted Index: web features
36
Google Index
– A unique docID is associated with each URL.
– Hit: a word occurrence
  – wordID: a 24-bit number
  – word position
  – font size relative to the rest of the document
  – plain hit: in the document body
  – fancy hit: in the URL, title, anchor text, or meta tags
– The word occurrences of a web page are distributed across a set of "barrels".
37
Architecture of the 1st Google Engine
40
Most of Google is implemented in C or C++ for efficiency and can run on either Solaris or Linux.

In Google, web crawling (downloading of web pages) is done by several distributed crawlers. A URLserver sends lists of URLs to be fetched to the crawlers. The fetched web pages are then sent to the storeserver, which compresses and stores them in a repository. Every web page has an associated ID number called a docID, assigned whenever a new URL is parsed out of a web page.

The indexing function is performed by the indexer and the sorter. The indexer performs a number of functions. It reads the repository, uncompresses the documents, and parses them. Each document is converted into a set of word occurrences called hits.
– The hits record the word, its position in the document, an approximation of font size, and capitalization.
The indexer distributes these hits into a set of "barrels", creating a partially sorted forward index. The indexer performs another important function: it parses out all the links in every web page and stores important information about them in an anchors file.
– This file contains enough information to determine where each link points from and to, and the text of the link.

The URLresolver reads the anchors file and converts relative URLs into absolute URLs, and in turn into docIDs. It puts the anchor text into the forward index, associated with the docID that the anchor points to. It also generates a database of links, which are pairs of docIDs. The links database is used to compute PageRanks for all the documents.

The sorter takes the barrels, which are sorted by docID, and re-sorts them by wordID to generate the inverted index. This is done in place so that little temporary space is needed for the operation. The sorter also produces a list of wordIDs and offsets into the inverted index.
A program called DumpLexicon takes this list together with the lexicon produced by the indexer and generates a new lexicon to be used by the searcher. The searcher is run by a web server and uses the lexicon built by DumpLexicon together with the inverted index and the PageRanks to answer queries.
41
3 Indexing Multimedia Broadcast and compress for seamless delivery
42
Indexing Multimedia
Forming an index for multimedia:
– Use context: surrounding text
– Add a manual description
– Analyze automatically and attach a description
43
4 Queries
44
Query types:
– Keywords
– Proximity
– Patterns
– Phrases
– Ranges
– Weights of keywords
– Spelling mistakes
45
Queries
– Boolean query: no relevance measure; may be hard to understand
– Multimedia query:
  – Find images of Everest
  – Find x-rays showing the human rib cage
  – Find companies whose stock prices have similar patterns
46
Relevance
Relevance is a subjective judgment and may include:
– Being on the proper subject.
– Being timely (recent information).
– Being authoritative (from a trusted source).
– Satisfying the goals of the user and his/her intended use of the information (the information need).
47
Keyword Search
– The simplest notion of relevance: the query string appears verbatim in the document.
– A slightly less strict notion: the words in the query appear frequently in the document, in any order (bag of words).
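A minimal sketch of the bag-of-words notion: score each document by how often the query terms occur in it, ignoring word order (the sample documents and query are illustrative, not from the slides):

```python
# Bag-of-words relevance: sum the per-document frequencies of the
# query terms; word order in document and query is ignored.
from collections import Counter

def bow_score(query, doc):
    """Return the total count of query-term occurrences in doc."""
    terms = Counter(doc.lower().split())
    return sum(terms[q] for q in query.lower().split())

docs = ["the cat sat on the mat",
        "the dog chased the cat and the cat ran"]
scores = [bow_score("the cat", d) for d in docs]
# the second document repeats "the" and "cat" more often, so it scores higher
```

Real engines refine this raw count with weighting schemes (e.g., downweighting common words like "the"), which is one motivation for the stopword removal mentioned earlier.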
48
Problems with Keywords
– May not retrieve relevant documents that use synonymous terms:
  – "restaurant" vs. "café"
  – "PRC" vs. "China"
– May retrieve irrelevant documents that include ambiguous terms:
  – "bat" (baseball vs. mammal)
  – "Apple" (company vs. fruit)
  – "bit" (unit of data vs. act of eating)
49
Relevance Feedback
50
5 Searching an Index
51
Searching an Inverted Index
1. Tokenize the query; look up each query token in the index vocabulary.
2. Get the list of documents associated with each token.
3. Combine the lists of documents using the constraints specified in the query.
52
Google Search
1. Tokenize the query and remove stopwords.
2. Translate the query words into wordIDs using the lexicon.
3. For every wordID, get the list of documents from the short inverted barrel and build a composite set of documents.
4. Scan the composite list of documents:
   i. Skip to the next document if the current document does not match.
   ii. Compute a rank using the query and document features.
   iii. If there are no more documents, go to step 3 and use the full inverted barrels to find more docs.
   iv. If there is a sufficient number of docs, go to step 5.
5. Sort the final document list by rank.
53
How are results ranked?
– Weight type
– Location: title, URL, anchor, body
– Size: relative font size
– Capitalization
– Count of occurrences
– Closeness (proximity)
54
Evaluation
55
Ranking Algorithms: Hyperlink Popularity Ranking
– Rank "popular" documents higher among the set of documents containing specific keywords.
– Determining "popularity":
  – Access rate? How to get accurate data?
  – Bookmarks? Might be private.
  – Links from related pages? Use a web crawler to analyze external links.
56
Popularity/Prestige
– Transfer of prestige: a link from a popular page x to a page y is treated as conferring more prestige to y than a link from a not-so-popular page z.
– Count of in-links/out-links.
57
Hypertext Induced Topic Search (HITS)
– The HITS algorithm computes popularity using a set of related pages only.
– Important web pages are cited by other important web pages, or by a large number of less-important pages.
– Initially, all pages have the same importance.
58
Hubs and Authorities
– Hub: a page that stores links to many related pages; may not itself contain actual information on a topic.
– Authority: a page that contains actual information on a topic; may not store links to many related pages.
– Each page gets a prestige value as a hub (hub-prestige) and another prestige value as an authority (authority-prestige).
59
Hubs and Authorities in Twitter
60
Hubs and Authorities Algorithm
1. Locate and build the subgraph.
2. Assign initial values to the hub and authority scores of each node.
3. Loop until convergence:
   i. Set the authority score of node x to the sum of the hub scores of all nodes y that link to x.
   ii. Set the hub score of node x to the sum of the authority scores of all nodes y that x links to.
   iii. Normalize the hub and authority scores of all nodes.
   iv. Check for convergence: is the difference < threshold?
4. Return the list of nodes sorted in descending order of hub and authority scores.
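The loop above can be sketched in Python (the toy graph, the fixed iteration count standing in for a convergence test, and the Euclidean normalization are illustrative choices):

```python
# HITS sketch: alternate authority and hub updates, normalizing each pass.
import math

def hits(graph, iterations=50):
    """graph: {node: [nodes it links to]}. Returns (hub, authority) dicts."""
    nodes = set(graph) | {m for targets in graph.values() for m in targets}
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # authority(x) = sum of hub scores of pages linking to x
        auth = {x: sum(hub[y] for y in nodes if x in graph.get(y, []))
                for x in nodes}
        # hub(x) = sum of authority scores of pages x links to
        hub = {x: sum(auth[y] for y in graph.get(x, [])) for x in nodes}
        # normalize so the scores do not grow without bound
        a_norm = math.sqrt(sum(v * v for v in auth.values()))
        h_norm = math.sqrt(sum(v * v for v in hub.values()))
        auth = {n: v / a_norm for n, v in auth.items()}
        hub = {n: v / h_norm for n, v in hub.items()}
    return hub, auth

graph = {"h1": ["a1", "a2"], "h2": ["a1"], "a1": [], "a2": []}
hub, auth = hits(graph)
# a1 is linked from both hubs, so it earns the highest authority score
```

Note how the toy graph separates the two roles: h1 and h2 link out but receive no links (pure hubs), while a1 and a2 only receive links (pure authorities).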
61
PageRank Algorithm
62
PageRank Algorithm
1. Locate and build the subgraph.
2. Save the number of out-links from every node in an array.
3. Assign a default PageRank to all nodes.
4. Loop until convergence:
   i. Compute a new PageRank score for every node: sum, over every node that links to it, that node's PageRank divided by its number of out-links, then add the default rank source.
   ii. Check for convergence: is the difference between the new and old PageRank < threshold?
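A sketch of the loop above, assuming the common damped formulation: the "default rank source" appears as the (1 - d)/n term. The toy graph, the damping factor 0.85, and the fixed iteration count standing in for a convergence test are illustrative; the sketch also assumes every node has at least one out-link.

```python
# PageRank sketch: each node's new score is the damped sum of
# rank/out-degree over its in-neighbors, plus a uniform rank source.
def pagerank(graph, damping=0.85, iterations=100):
    """graph: {node: [nodes it links to]}. Returns {node: rank}."""
    nodes = list(graph)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    out_links = {v: len(graph[v]) for v in nodes}  # step 2: out-link counts
    for _ in range(iterations):
        new = {}
        for x in nodes:
            # sum of PageRank / out-degree over all nodes linking to x
            incoming = sum(rank[y] / out_links[y] for y in nodes
                           if x in graph[y])
            # add the default rank source (the teleportation term)
            new[x] = (1 - damping) / n + damping * incoming
        rank = new
    return rank

graph = {"a": ["b"], "b": ["c"], "c": ["a", "b"]}
ranks = pagerank(graph)
# b receives links from both a and c, so it ends up ranked highest
```

Because every node here has out-links, each iteration preserves the total rank mass of 1, so the scores can be read as a probability distribution over pages.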
63
But wait… there's homework!
1. Explain web crawling and the general architecture of a web crawler.
2. What is the use of robots.txt?
3. Find a web crawler's code and explain how it can be used to collect information on ?
4. Crawl social media to collect emu-related info (if you want a bonus).