1 How Search Engines Work? Ziv Bar-Yossef, Department of Electrical Engineering, Technion
2 What is the Internet? A global network of computers connected to each other. Computers “talk” to each other using standard protocols (TCP/IP).
3 What is the World-Wide Web (WWW)? A collection of pages available via the Internet. Internet users can view pages with web browsers. The WWW is only one application of the Internet. Other applications: email, messengers, VoIP, newsgroups, FTP.
4 Web Pages. Various formats: PDF, Word, Excel, images, MP3, video, text. Most popular format: HTML. HTML pages point to each other using hyperlinks. Users “surf the web” by clicking hyperlinks.
5 What are Search Engines? Users have “information needs”: Where can I find solutions to my math homework problem? Where can I find MP3s of Miri Messika’s latest album? What is the weather in Eilat during Hanukkah? Which other Sharons are famous besides our prime minister? Search engines enable us to find web pages that match our information needs.
6 Search Engines. [Diagram: the user turns an “information need” (“What other Sharons are famous, except for our prime minister?”) into the query sharon -ariel; the search engine, drawing on web pages from the Web, returns a ranked list of matching pages: 1. Sharon Creech, 2. Sharon Stone, 3. Sharon, Massachusetts.]
7 How Search Engines (don’t) Work? Common misconception: when a user submits a query, the search engine scans all web pages to find the relevant matches. [Diagram: the user sends the query sharon -ariel; the search engine scans the web pages of the Web and returns a ranked list of matching pages: 1. Sharon Creech, 2. Sharon Stone, 3. Sharon, Massachusetts.]
8 How Search Engines Work? What do you do when you look for a term in an encyclopedia? Use the index! [Diagram: the user sends the query sharon -ariel; the search engine looks it up in an index built from the web pages of the Web and returns a ranked list of matching pages: 1. Sharon Creech, 2. Sharon Stone, 3. Sharon, Massachusetts.]
9 Search Engine Architecture. [Diagram: a search engine made up of a Crawler, an Index, a Ranking Algorithm, and a Query Processor.]
10 Web Crawler (a.k.a. Spider). Fetches web pages and stores them in a local repository. Tries to get as many web pages as possible. Follows hyperlinks to learn about new pages. Refetches pages that change frequently.
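The crawler's "follow hyperlinks to learn about new pages" loop can be sketched as a breadth-first traversal. This is only an illustration under stated assumptions: the `TOY_WEB` dictionary, the `.example` URLs, and the `crawl` function are hypothetical stand-ins, and no real network fetching or refetch scheduling is shown.

```python
from collections import deque

# Hypothetical in-memory "web": URL -> list of URLs that page links to.
# A real crawler would fetch each page over HTTP and parse out its hyperlinks.
TOY_WEB = {
    "a.example": ["b.example", "c.example"],
    "b.example": ["c.example"],
    "c.example": ["a.example", "d.example"],
    "d.example": [],
}

def crawl(seed, fetch_links, max_pages=100):
    """Breadth-first crawl from `seed`; returns URLs in the order they were fetched."""
    repository = []              # stands in for the local page repository
    seen = {seed}                # URLs already discovered, to avoid refetching
    frontier = deque([seed])     # URLs discovered but not yet fetched
    while frontier and len(repository) < max_pages:
        url = frontier.popleft()
        repository.append(url)           # "fetch and store" the page
        for link in fetch_links(url):    # follow hyperlinks to discover new pages
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return repository

pages = crawl("a.example", lambda url: TOY_WEB.get(url, []))
```

The `max_pages` cap is one simple way to bound a crawl that otherwise tries to collect as many pages as possible.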
11 The Index
www.cnn.com: Ariel 1 Sharon 2, the 3 prime 4 minister 5 of 6 Israel 7 founded 8 a 9 new 10 political 11 party 12.
www.hollywood.com: Sharon 1 Stone 2 dressed 3 a 4 new 5 Jean 6 Paul 7 Gaultier 8 gown 9 at 10 the 11 Oscars 12 after 13 party 14.
Index:
ariel: (cnn.com,1)
dress: (hollywood.com,3)
found: (cnn.com,8)
gaultier: (hollywood.com,8)
gown: (hollywood.com,9)
israel: (cnn.com,7)
jean: (hollywood.com,6)
minister: (cnn.com,5)
new: (cnn.com,10), (hollywood.com,5)
oscar: (hollywood.com,12)
party: (cnn.com,12), (hollywood.com,14)
paul: (hollywood.com,7)
political: (cnn.com,11)
prime: (cnn.com,4)
sharon: (cnn.com,2), (hollywood.com,1)
stone: (hollywood.com,2)
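A positional index like the one on this slide can be built in a few lines. The sketch below is a minimal illustration, not the method of any particular search engine: the `build_index` function and the `STOP_WORDS` set are assumptions, and unlike the slide it does not stem words (so it stores "dressed" rather than "dress").

```python
import re
from collections import defaultdict

# Assumed stop-word list: the common words that the slide's index leaves out.
STOP_WORDS = {"the", "of", "a", "at", "after"}

def build_index(documents):
    """Map each term to a posting list of (document, word position) pairs."""
    index = defaultdict(list)
    for doc_id, text in documents.items():
        words = re.findall(r"[A-Za-z]+", text)
        # Positions are 1-based and count every word, including stop words.
        for position, word in enumerate(words, start=1):
            term = word.lower()
            if term not in STOP_WORDS:
                index[term].append((doc_id, position))
    return dict(index)

DOCS = {
    "cnn.com": "Ariel Sharon, the prime minister of Israel founded a new political party.",
    "hollywood.com": "Sharon Stone dressed a new Jean Paul Gaultier gown at the Oscars after party.",
}

index = build_index(DOCS)
print(index["sharon"])  # posting list for the term "sharon"
```

Because the index is keyed by term, answering a query means looking up a few posting lists instead of scanning every page.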
12 Index by “Anchor Text”. Anchor text: what’s written inside a link. Example: Ariel Sharon, the prime minister… (the link text is “Ariel Sharon”). Usually succinctly describes what’s written in the linked page. By which terms is a page listed in the index? Terms that appear in the page, and terms that appear in the anchor text of links to the page.
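Indexing by anchor text means a page is listed under the words other pages use when linking to it. A minimal sketch, assuming links are available as (source, target, anchor text) triples; the function name and the `.example` URLs are hypothetical:

```python
from collections import defaultdict

def index_anchor_text(links):
    """links: iterable of (source_url, target_url, anchor_text) triples."""
    anchor_index = defaultdict(set)
    for _source, target, anchor_text in links:
        for term in anchor_text.lower().split():
            # The term is credited to the page being linked TO,
            # whether or not that page contains the term itself.
            anchor_index[term].add(target)
    return anchor_index

links = [
    ("news.example", "sharon-bio.example", "Ariel Sharon"),
    ("blog.example", "sharon-bio.example", "prime minister"),
]
anchor_index = index_anchor_text(links)
```

In practice these anchor-text postings would be merged with the postings built from the page's own text.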
13 Query Processor. Gets a user query. Fetches relevant posting lists from the index. Extracts relevant matches from the lists. Example: Query = “sharon -ariel”. L1 = posting list of sharon: (cnn.com,2), (hollywood.com,1). L2 = posting list of ariel: (cnn.com,1). Return all pages in L1 that do not occur in L2: cnn.com is excluded, leaving hollywood.com.
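The “sharon -ariel” example can be sketched as set operations on posting lists. This is a simplified illustration: the `process_query` function and its query syntax handling (a leading `-` marks an excluded term, all other terms are required) are assumptions, not a description of any real engine's query language.

```python
# Posting lists taken from the index slide: term -> [(page, position), ...]
INDEX = {
    "sharon": [("cnn.com", 2), ("hollywood.com", 1)],
    "ariel": [("cnn.com", 1)],
}

def process_query(query, index):
    """Return pages containing every plain term and none of the '-' terms."""
    required, excluded = [], []
    for token in query.lower().split():
        if token.startswith("-"):
            excluded.append(token[1:])
        else:
            required.append(token)
    if not required:
        return []  # a query made only of exclusions matches nothing in this sketch

    matches = None
    for term in required:
        pages = {page for page, _pos in index.get(term, [])}
        matches = pages if matches is None else matches & pages
    for term in excluded:
        matches -= {page for page, _pos in index.get(term, [])}
    return sorted(matches)

result = process_query("sharon -ariel", INDEX)
print(result)
```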
14 Ranking Algorithm. Many queries have many matching pages: 472 million matches for “London” in Google. Cannot return all of them to the user; the user needs the most relevant results anyway. Need to order results by relevance, with the most relevant results at the top. Ranking algorithm: a method of ordering matches. The “heart” of a search engine. The reason why Google is the most preferred search engine today.
15 Google’s PageRank. Ranking Elections. Candidates: all web pages. Voters: all web pages. p votes for q if p has a hyperlink to q. Favorites(p) = all the pages p votes for. Fans(p) = all the pages that vote for p. PageRank(p) = 1 if p has no fans.
16 Google’s PageRank. Underlying principles: a page is “important” if it has important fans; a page splits its “importance” evenly among its favorite pages. [Figure: example link graph with page values 1, 1, 1, 1, 1.5, 2.5, 4.]
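The two principles can be turned into a small computation. The sketch below is a simplified reading of the slides, not Google's actual algorithm: it assumes a page with no fans scores 1, and every other page scores the sum, over its fans q, of score(q) divided by the number of q's favorites. The full PageRank algorithm also uses a damping factor, which is not covered here. The example graph is hypothetical, chosen so that the resulting values are 1, 1, 1, 1, 1.5, 2.5, and 4.

```python
def simple_pagerank(favorites, iterations=50):
    """favorites: page -> list of pages it links to (votes for)."""
    pages = set(favorites)
    for targets in favorites.values():
        pages.update(targets)

    # Invert the graph: fans[p] lists the pages that vote for p.
    fans = {p: [] for p in pages}
    for q, targets in favorites.items():
        for p in targets:
            fans[p].append(q)

    rank = {p: 1.0 for p in pages}
    for _ in range(iterations):
        new_rank = {}
        for p in pages:
            if not fans[p]:
                new_rank[p] = 1.0  # assumed base case: a page with no fans scores 1
            else:
                # each fan q splits its importance evenly among its favorites
                new_rank[p] = sum(rank[q] / len(favorites[q]) for q in fans[p])
        rank = new_rank
    return rank

# Hypothetical link graph (page -> pages it links to).
GRAPH = {
    "a": ["e"], "b": ["e", "f"], "c": ["f"], "d": ["f"],
    "e": ["g"], "f": ["g"], "g": [],
}
ranks = simple_pagerank(GRAPH)
```

On an acyclic graph like this one the values settle after a couple of passes; graphs with cycles are one reason the real algorithm needs more care than this sketch.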
17 Google’s PageRank. Ranking algorithm: find pages that match the given query; order them by their PageRank; return the top 10 matches.
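The three steps of this ranking algorithm fit in a single function. A minimal sketch, assuming the matching pages and a precomputed PageRank score for each page are already available; the function name, the `.example` URLs, and the scores are made up for illustration:

```python
def rank_results(matching_pages, pagerank, top_k=10):
    """Order the matching pages by PageRank, highest first, and keep the top_k."""
    ordered = sorted(
        matching_pages,
        key=lambda page: pagerank.get(page, 0.0),  # unknown pages sort last
        reverse=True,
    )
    return ordered[:top_k]

# Hypothetical scores for three pages that matched a query.
scores = {"x.example": 1.0, "y.example": 4.0, "z.example": 2.5}
top = rank_results(["x.example", "y.example", "z.example"], scores, top_k=2)
```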
18 But… PageRank Does Not Always Work: Spam
19 Conclusions. Search engines use an index to answer user queries. Ranking is the most important component. Spam is a problem.
20 Thank You