Searching the Web
Yoram Bachrach, Yiftah Ben-Aharon
Arvind Arasu, Junghoo Cho, Hector Garcia-Molina, Andreas Paepcke, Sriram Raghavan
11/19/2018 SDBI 2001

Goal
To better understand Web search engines:
  Fundamental concepts
  Main challenges
  Design issues
  Implementation techniques and algorithms

Schedule
Search engine requirements
Components overview
Specific modules
  Purpose
  Implementation
  Performance metrics
Conclusion

What does it do?
Processes users' queries
Finds pages with related information
Returns a list of resources
Is it really that simple?

How is a query represented?
Find pages with related information
Return a list of resources
Is it really that simple?

Find pages with related information
  How do we find pages?
  Where in the web do we look?
  How do we store the data?
Return a list of resources
Is it really that simple?

Find pages with related information
Return a list of resources
  In what order?
  How are the pages ranked?
Is it really that simple?

Find pages with related information
Return a list of resources
Is it really that simple?
  Limited resources
  Time/quality tradeoff

Search Engine Structure
General Design
Crawling
Storage
Indexing
Ranking

Motivation
The web is
  Used by millions
  Contains lots of information
  Link based
  Incoherent
  Changes rapidly
  Distributed
Traditional information retrieval was built with the exact opposite in mind

The Web’s Characteristics
Size
  Over a billion pages available
  5-10 KB per page => tens of terabytes
  Size doubles every 2 years
Change
  23% of pages change daily
  Half-life of about 10 days
  Poisson model for changes
Bowtie structure

[Figure: general search engine architecture. Components shown: Web, Crawlers, Crawl Control, Page Repository, Indexer, Collection Analysis, Indexes (Text, Structure, Utility), Query Engine, Ranking, Queries, Results]

Terms
Crawler
Crawler control
Indexes: text, structure, utility
Page repository
Indexer
Collection analysis module
Query engine
Ranking module

“Itsi Bitsi Spider crawling up the … web!”

Crawling web pages
What pages to download
When to refresh
Minimize load on web sites
How to parallelize the process

Page selection
Importance metric
Web crawler model
Crawler method for choosing which page to download

Importance Metrics
Given a page P, define how “good” that page is.
Several metric types:
  Interest driven
  Popularity driven
  Location driven
  Combined

Interest Driven
Define a driving query Q
Find the textual similarity between P and Q
  Define a word vocabulary W1…Wn
  Define a vector for each of P and Q: Vp, Vq = <w1,…,wn>
    wi = 0 if Wi does not appear in the document
    wi = IDF(Wi), the inverse document frequency, otherwise
    IDF(Wi) = 1 / number of appearances of Wi in the entire collection
  Importance: IS(P) = Vp · Vq (cosine product)
Finding IDF requires going over the entire web
Estimate IDF from the pages already visited, to calculate IS’
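The interest-driven metric above can be sketched as follows. This is a minimal illustration, assuming whitespace tokenization and treating the pages visited so far as the whole collection (so it computes the estimate IS’ rather than IS). The function names and sample pages are hypothetical.

```python
import math
from collections import Counter


def idf_from(pages):
    """IDF(W) = 1 / number of appearances of W in the pages seen so far."""
    counts = Counter(word for page in pages for word in page.lower().split())
    return {word: 1.0 / count for word, count in counts.items()}


def to_vector(text, idf):
    """Entry is IDF(W) if W appears in the text; absent words are implicitly 0."""
    return {word: idf[word] for word in set(text.lower().split()) if word in idf}


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(u[w] * v[w] for w in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def interest_score(page, query, visited_pages):
    """Estimated importance IS'(P) of a page with respect to a driving query Q."""
    idf = idf_from(visited_pages)
    return cosine(to_vector(page, idf), to_vector(query, idf))
```

A page sharing words with the driving query gets a positive score, while a page with no words in common scores 0.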

Popularity Driven
How popular a page is:
Backlink count
  IB(P): the number of pages containing a link to P
  Estimate from previous crawls: IB’(P)
More sophisticated metric, called PageRank: IR(P)

Location Driven
IL(P): a function of the URL of P
  Words appearing in the URL
  Number of “/” in the URL
Easily evaluated, requires no data from previous crawls

Combined Metrics
IC(P): a function of several other metrics
Allows using local metrics for the first stage and estimated metrics for the second stage
IC(P) = a*IS(P) + b*IB(P) + c*IL(P)

Crawler Models
A crawler
  Tries to visit more important pages first
  Only has estimates of importance metrics
  Can only download a limited number of pages
How well does a crawler perform?
  Crawl and Stop
  Crawl and Stop with Threshold

Crawl and Stop
A crawler stops after visiting K pages
A perfect crawler
  Visits the pages with ranks R1,…,Rk
  These are called hot pages
A real crawler
  Visits only M<K hot pages
Performance rate
For a random crawler

Crawl and Stop with Threshold
A crawler stops after visiting K pages
Hot pages are pages with a metric higher than G
A crawler visits V hot pages
Metric: percentage of hot pages visited
Perfect crawler
Random crawler
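The formulas for the two crawler models are not preserved in this transcript. The sketch below uses the definitions commonly given for these models in the crawl-ordering literature, so treat them as assumptions. The symbols T (total number of pages) and H (total number of hot pages) and all function names are introduced here for illustration.

```python
def crawl_and_stop(hot_visited, k):
    """Crawl and Stop: fraction of the K visited pages that are hot (M / K)."""
    return hot_visited / k


def crawl_and_stop_random(k, total_pages):
    """Expected Crawl and Stop performance of a purely random crawler (K / T)."""
    return k / total_pages


def threshold_performance(hot_visited, hot_total):
    """Crawl and Stop with Threshold: fraction of all hot pages visited (V / H)."""
    return hot_visited / hot_total


def threshold_perfect(k, hot_total):
    """A perfect crawler visits only hot pages until it runs out of them."""
    return min(k, hot_total) / hot_total
```

For example, a crawler that stops after 100 pages having visited 30 hot ones scores 0.3 under Crawl and Stop.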

Ordering Metrics
The crawler's queue is prioritized according to an ordering metric
The ordering metric is based on an importance metric
  Location metrics: used directly
  Popularity metrics: via estimates from previous crawls
  Similarity metrics: via estimates from anchor text
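A prioritized crawl queue can be sketched with a heap. This is a minimal illustration, assuming the ordering metric is the estimated backlink count IB’(P) observed so far during the crawl. The `links` dictionary stands in for downloading a page and extracting its links, and all names are hypothetical.

```python
import heapq
from collections import defaultdict


def crawl(seed, links, max_pages):
    """Visit up to max_pages pages, always taking the queued URL with the
    highest backlink count seen so far (ties broken alphabetically)."""
    backlinks = defaultdict(int)  # estimated IB'(P) from pages crawled so far
    heap = [(0, seed)]            # entries are (negated estimate, url)
    visited, visited_set = [], set()
    while heap and len(visited) < max_pages:
        neg_estimate, url = heapq.heappop(heap)
        if url in visited_set or -neg_estimate != backlinks[url]:
            continue  # stale entry: already visited or superseded by a newer one
        visited.append(url)
        visited_set.add(url)
        for target in links.get(url, []):
            backlinks[target] += 1
            if target not in visited_set:
                # Push a fresh entry; older entries for target become stale.
                heapq.heappush(heap, (-backlinks[target], target))
    return visited


toy_links = {"a": ["b", "c"], "b": ["d", "e"], "c": ["e"]}
order = crawl("a", toy_links, 4)
```

In this toy graph, page `e` is visited before `d` because two already-crawled pages link to it, whereas a breadth-first crawl would reach `d` first.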

Case Study - WebBase
Using Stanford’s 225,000 web pages as the entire collection
Use the popularity importance metric IB(P)
Assume the Crawl and Stop with Threshold model, with G = 100
Start at
Use PageRank, backlink count and BFS as ordering metrics

WebBase Results

Page Refresh
Make sure pages are up-to-date
Many possible strategies
  Uniform refresh
  Proportional to change frequency
Need to define a metric

Freshness Metric
Freshness
  1 if fresh, 0 otherwise
Age of pages
  Time since modified

Average Freshness
Freshness changes over time
Take the average freshness over a long period of time

Refresh Strategy
Crawlers can refresh only a certain number of pages in a period of time
The page download resource can be allocated in many ways
The proportional refresh policy allocates the resource in proportion to the pages’ change rates

Example
The collection contains 2 pages
  E1 changes 9 times a day
  E2 changes once a day
Simplified change model
  The day is split into 9 equal intervals, and E1 changes once in each interval
  E2 changes once during the entire day
  The only unknown is when the pages change within the intervals
The crawler can download one page a day
Our goal is to maximize the freshness

Example (2)

Example (3)
Which page do we refresh?
If we refresh E2 at midday
  If E2 changed in the first half of the day and we refresh at midday, it remains fresh for the remaining half of the day
  50% chance of a 0.5 day freshness increase
  50% chance of no increase
  Expected freshness increase: 0.25 day
If we refresh E1 at midday
  If E1 changed in the first half of its interval and we refresh at midday (which is the middle of that interval), it remains fresh for the remaining half of the interval = 1/18 of a day
  50% chance of a 1/18 day freshness increase
  Expected freshness increase: 1/36 day
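The arithmetic of this example can be checked with exact fractions. This is a small sketch under the slide's simplified change model: the page changes exactly once per interval, at a uniformly random time, and the refresh happens at the interval's midpoint. The function name is hypothetical.

```python
from fractions import Fraction


def expected_gain(interval_length):
    """Expected freshness gain (in days) from one refresh at the midpoint of an
    interval in which the page changes exactly once at a uniformly random time."""
    chance_changed_before_refresh = Fraction(1, 2)
    fresh_time_if_changed = interval_length / 2  # remaining half of the interval
    return chance_changed_before_refresh * fresh_time_if_changed


gain_e2 = expected_gain(Fraction(1))     # E2: one change per day
gain_e1 = expected_gain(Fraction(1, 9))  # E1: one change per 1/9 of a day
```

Under these assumptions, refreshing the slowly changing page E2 gains nine times as much expected freshness as refreshing E1.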

Example (4)
This gives a nice estimate
But things are more complex in real life
  It is not certain that a page will change within an interval
  We have to worry about age
Using a Poisson model shows that a uniform policy always performs better than a proportional one

Example (5)
Studies have found the best policy for a similar example
Assume page changes follow a Poisson process
Assume 5 pages, which change 1, 2, 3, 4 and 5 times a day

Repository “Hidden Treasures”

Storage
The page repository is a scalable storage system for web pages
Allows the Crawler to store pages
Allows the Indexer and Collection Analysis modules to retrieve them
Similar to other data storage systems (databases or file systems)
Does not have to provide some of those systems’ features: transactions, logging, directories

Storage Issues
Scalability and seamless load distribution
Dual access modes
  Random access (used by the query engine for cached pages)
  Streaming access (used by the Indexer and Collection Analysis)
Large bulk updates: reclaim old space, avoid access/update conflicts
Obsolete pages: remove pages no longer on the web

Designing a Distributed Web Repository
Repository designed to work over a cluster of interconnected nodes
Page distribution across nodes
Physical organization within a node
Update strategy

Page Distribution
How to choose a node to store a page
Uniform distribution: any page can be sent to any node
Hash distribution policy: hash the page ID space into the node ID space
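The hash distribution policy can be sketched as below. This is a minimal illustration, assuming MD5 as the hash function and a fixed number of nodes; neither choice comes from the slides, and the function name is hypothetical.

```python
import hashlib


def node_for_page(page_id, num_nodes):
    """Map a page ID to a node index in [0, num_nodes) by hashing the ID."""
    digest = hashlib.md5(page_id.encode("utf-8")).digest()
    # Use the first 8 bytes of the digest as an unsigned integer.
    return int.from_bytes(digest[:8], "big") % num_nodes


node = node_for_page("http://www.example.com/index.html", 4)
```

The mapping is deterministic, so any component can locate a page's node from its ID alone. A plain modulo scheme like this one remaps most pages when the number of nodes changes.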

Organization Within a Node
Several operations required
  Add / remove a page
  High speed streaming
  Random page access
Hashed organization
  Treat each disk as a hash bucket
  Assign pages according to their IDs
Log organization
  Treat the disk as one file, and add each page at the end
  Support random access using a B-tree
Hybrid
  Hash-map a page to an extent and use a log structure within the extent

Distribution Performance
                            Log    Hashed    Hashed Log
Streaming performance       ++     -         +
Random access performance   +-
Page addition

Update Strategies
Updates are generated by the crawler
Several characteristics
  The time at which the crawl occurs and the repository receives information
  Whether the crawl’s information replaces the entire database or modifies parts of it

Batch vs. Steady
Batch mode
  Periodically executed
  Allocated a certain amount of time
Steady mode
  Runs all the time
  Always sends results back to the repository

Partial vs. Complete Crawls
A batch mode crawler can
  Do a complete crawl on every run and replace the entire collection
  Recrawl only a specific subset and apply updates to the existing collection (partial crawl)
The repository can implement
  In-place update
    Quickly refreshes pages
  Shadowing: updates are applied as a separate stage
    Avoids refresh-access conflicts

Partial vs. Complete Crawls
Shadowing resolves the conflicts between updates and the reads done for queries
Batch mode is well suited to shadowing
A steady crawler is well suited to in-place updates

The WebBase Repository
Distributed storage that works with the Stanford WebCrawler
Uses a node manager for monitoring storage nodes and collecting status information
Each page is assigned a unique identifier, a signature of its normalized URL
URLs are normalized since the same resource can be reached through several URLs
The Stanford crawler runs in batch mode, so shadowing is used by the repository

The WebBase Repository

Indexing “Excuse me, where can I find …”

The Indexer Module
Creates two indexes:
  Text (content) index: uses “traditional” indexing methods such as inverted indexing
  Structure (links) index: uses a directed graph of pages and links. Sometimes also creates an inverted graph.
Notes: This is done with relatively standard indexing methods. Explain the inverted index. Two things (inverted graph, related information): the information we care about is neighborhood information, so an inverted graph must be built in order to know who points to a page, and not only whom the page points to. This graph is the basis for retrieving “related” information.

The Collection Analysis Module
Uses the two basic indexes created by the indexer module in order to assemble “utility indexes”, e.g. a site index.
Notes: For example, the option to search within a specific domain would use an index of the pages on that site. Lead-in question: how would you store the information (inverted lists of postings, additional information kept with the lists, lexicons)?

Inverted Index
A set of inverted lists, one per index term (word)
Inverted list of a term: a sorted list of the locations in which the term appears
Posting: a pair (w,l) where w is a word and l is one of its locations
Lexicon: holds all of the index’s terms, with statistics about each term (not about the posting)
Notes: A posting can also carry a few importance details, such as “was in <b>” or “was in <h1>”. A lexicon statistic is, for example, the number of occurrences of the term across all pages. So the process is: process the pages and extract postings, sort the postings, and write them to the index.
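The build process described above (extract postings, sort them, write the inverted lists) can be sketched in memory as follows. It assumes whitespace tokenization and locations of the form (page ID, word position). Both choices and all names are illustrative.

```python
from collections import defaultdict


def build_inverted_index(pages):
    """Build inverted lists and a lexicon from a {page_id: text} mapping."""
    # Step 1: process the pages and extract postings (word, location).
    postings = []
    for page_id, text in pages.items():
        for position, word in enumerate(text.lower().split()):
            postings.append((word, (page_id, position)))
    # Step 2: sort the postings by word, then by location.
    postings.sort()
    # Step 3: group the sorted postings into one inverted list per term.
    index = defaultdict(list)
    for word, location in postings:
        index[word].append(location)
    # Lexicon: per-term statistics, here the total number of occurrences.
    lexicon = {word: len(locations) for word, locations in index.items()}
    return dict(index), lexicon


index, lexicon = build_inverted_index({1: "web search engine", 2: "search the web"})
```

A real indexer would spill sorted runs to disk and merge them rather than sorting everything in memory.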

Challenges
Index build must be (unlike traditional index building):
  Fast
  Economical
Incremental indexing must be supported
Storage: compression vs. speed
Notes: The indexes are large, so the process must be efficient and must not consume much memory. Because the web changes constantly, freshness must be maintained. Not only compressing the data but also finding a compact representation is a problem; the file format must be thought through carefully. Lead-in question: how would you split the indexes across several machines?

Index Partitioning
Distributed text indexing can be done by:
Local inverted file (IFL)
  Each node holds a disjoint, random subset of the pages
  The query is broadcast to all nodes
  The result is the combination of the nodes’ answers
Global inverted file (IFG)
  Each node is responsible for only a subset of the terms in the collection
  The query is sent only to the appropriate node

The WebBase Indexer: Architecture
Distributors: store the pages that were detected by the crawler and need to be indexed
Indexers: perform the core indexing
Query servers: hold the inverted index, partitioned using IFL
Notes: This is Stanford's architecture. Ask what the statistician's role is.

The WebBase Indexer: Stages
Stage 1:
  Loading pages from the Distributor
  Processing pages
  Flushing results to disk
Stage 2:
  Pairs of (inverted file, lexicon) are created by merging stage 1’s files
  Each pair is transferred to a query server
Notes: Emphasize that there are three steps!

The WebBase Indexer: Parallelizing stage 1
Use a 3-step pipeline, one pipeline step per action in stage 1
Each action has a different orientation (I/O intensive or CPU intensive)
Notes: Each column shows a thread performing a step that the other two threads are not performing at that time. So, for example, while waiting for I/O interrupts, computation can proceed in the P (processing) step. L (loading) and F (flushing) use different resources (network, hard disks), so they do not conflict. Term-level statistics could also be computed, but that requires connectivity between the query servers and might slow down the pipeline.

The WebBase Indexer: Parallelizing results
Sequential index building is about 30-40% slower than pipelined index building
Notes: We see that five million pages take about 4.75 hours instead of 6.5!

The WebBase Indexer: Statistics Collection concept
Term-level statistics must be collected, e.g. IDF (inverse document frequency) = 1 / (number of appearances in the collection)
Statistics are computed as part of index creation (instead of at query time)
A special server, the “Statistician”, is dedicated to this goal
Notes: If global statistics were collected at query time, the query servers would have to exchange a lot of statistical information, which would hurt query performance. Explained on the next slide.

The WebBase Indexer: Statistics Collection process
Stage 1: Indexers pass local information to the statistician. The statistician processes it (globally) and returns the results to the indexers.
Stage 2: Global statistics are integrated into the lexicons.
Notes: The statistician receives all the information from the local indexers. It returns the results to the indexers, and the information is integrated into the lexicons.

The WebBase Indexer: Statistics Collection optimizations
Send data to the statistician while it is already in memory (avoids explicit I/O):
  FL: when flushing data to disk
  ME: when merging the flushed data
Local aggregation: use the partial order to send fewer messages, e.g. 1000 x “cat” vs. “cat, 1000”
Notes: In FL the terms are sorted within each indexer but not globally, so the statistician has to keep them. In ME they are sorted globally, so it has no such problem. For example, if an indexer has 1000 postings for “cat”, it can send a single message saying (cat, 1000) rather than 1000 messages saying “cat”. Centralized statistics collection was studied and found to be efficient. Both methods have a small cost (about 5% additional relative time on two million pages).
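Local aggregation can be sketched with a counter. This minimal illustration assumes an indexer holds its postings as (term, location) pairs and sends one (term, count) message per distinct term. The message format and function name are hypothetical.

```python
from collections import Counter


def aggregate_for_statistician(postings):
    """Collapse an indexer's postings into one (term, count) message per term."""
    counts = Counter(term for term, _location in postings)
    return sorted(counts.items())


postings = [("cat", position) for position in range(1000)] + [("dog", 0)]
messages = aggregate_for_statistician(postings)
```

Here 1001 postings shrink to two messages, which is the saving the slide refers to.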

Indexing, Conclusion
Indexing web pages is complicated due to its scale (millions of pages, hundreds of gigabytes)
Challenges: incremental indexing and personalization
Notes: In our case, size does matter … Personalization: making the process feasible for small groups without huge computing resources (for example, a university research group that wants an index for its own field of interest).

Ranking “Everybody wants to rule the world”

Traditional Ranking Faults
Many pages containing a term may be of poor quality or not relevant
Insufficient self-description vs. spamming
Not using link analysis
Notes: A huge, public network with no control mechanism. Not every page knows that it belongs to a topic, or its metadata is not good enough. There is flooding by overly clever site developers. There is a link structure, and it contains information.

PageRank
Tries to capture the notion of the “importance of a page”
Uses backlinks for ranking
Avoids trivial spamming: distributes a page’s “voting power” among the pages it links to
An “important” page linking to a page raises its rank more than a “not important” one
Notes: Uses the number of pages that point to you, with an improvement. Many links to a page from within a single page are equivalent to one link from that page. If Yahoo recommends a page, that is worth more than if I recommend it.

Simple PageRank
Given by: $r(i) = \sum_{j \in B(i)} \frac{r(j)}{N(j)}$
Where
  B(i): the set of pages linking to i
  N(j): the number of outgoing links from j
Well defined if the link graph is strongly connected
Based on the “Random Surfer Model”: the rank of a page equals the probability of being on that page
Notes: Each page receives not the full weight of the page preceding it, but that page's weight divided by its number of links. The definition is problematic because it is recursive. If not every page can be reached from every other page, then … The model is equivalent to a random walk on a directed graph: what is the probability that a drunkard reaches this page after a long time, when at every second he must take some step? Stop and give an example on the board. Make sure they see that this is the transpose matrix.

Computation Of Simple PageRank (1)
Given a matrix A, an eigenvalue c and its corresponding eigenvector v satisfy Av = cv
Hence r is an eigenvector of A^T for the eigenvalue 1
If G is strongly connected then r is unique
Notes: Two things. Note the circular definition again. If for the vector there exists a scalar that satisfies this (in our case 1), then the vector is the eigenvector for the eigenvalue equal to that scalar. The paper does not detail why.

Simple PageRank can be computed by:
Notes: According to tests they ran, it converges after about 1000 iterations.
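The algorithm on this slide is not preserved in the transcript. A common way to compute the principal eigenvector is power iteration, sketched below under the assumption that every page has at least one outgoing link and that the graph is strongly connected. The five-page graph and all names are hypothetical.

```python
def simple_pagerank(links, iterations=100):
    """Power iteration for r(i) = sum over j in B(i) of r(j) / N(j).

    links maps each page to the list of pages it links to; every page must
    appear as a key and have at least one outgoing link.
    """
    pages = list(links)
    rank = {page: 1.0 / len(pages) for page in pages}  # start uniform
    for _ in range(iterations):
        new_rank = {page: 0.0 for page in pages}
        for j, targets in links.items():
            share = rank[j] / len(targets)  # j splits its rank evenly
            for i in targets:
                new_rank[i] += share
        rank = new_rank
    return rank


# Hypothetical strongly connected graph with five pages.
graph = {1: [2], 2: [1, 3], 3: [1, 4], 4: [5], 5: [1, 4]}
ranks = simple_pagerank(graph)
```

Because each page hands out exactly its own rank, the total stays 1 in every iteration, so the ranks can be read as probabilities.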

Simple PageRank Example
Notes: Page 2 splits its power between pages 1 and 3. Page 1 receives from three pages (half from each of 2, 3 and 5). Page 5 receives only from page 4. Everything is normalized and the sum is 1 (the probabilities of being somewhere must sum to 1).

Practical PageRank: The Problem
The web is not a strongly connected graph. It contains:
  “Rank sinks”: a cluster of pages without links leading out of it. Pages outside the cluster will be ranked 0.
  “Rank leaks”: a page without outgoing links. All pages will be ranked 0.
Notes: With a rank sink, the surfer eventually gets stuck in the cycle, and the probability of being there becomes 1. With a rank leak, pages give the leak some power, but it does not pass it on; the rank it receives is lost and everything converges to 0, so the probability of being anywhere becomes zero.

Practical PageRank: The Solution
Remove all rank leaks
Add a decay factor d to Simple PageRank
Based on the “Bored Surfer Model”
Notes: We need not be so radical; one can instead assume that leaks link back to the pages that point to them. Pages reached through important pages will have a higher value than pages reached through unimportant pages. With d = 1 this is equivalent to the previous case. It is like a random walk on a graph with a probabilistic jump caused by boredom.
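The slide's formula is not preserved in the transcript. A commonly used form is r(i) = d · Σ_{j∈B(i)} r(j)/N(j) + (1 − d)/m, where m is the number of pages. The sketch below assumes that form and assumes rank leaks have already been removed. The graph and names are hypothetical.

```python
def practical_pagerank(links, d=0.8, iterations=100):
    """Power iteration with decay factor d: with probability d the surfer
    follows a link, otherwise jumps to a page chosen uniformly at random."""
    pages = list(links)
    m = len(pages)
    rank = {page: 1.0 / m for page in pages}
    for _ in range(iterations):
        new_rank = {page: (1.0 - d) / m for page in pages}  # random-jump share
        for j, targets in links.items():
            share = d * rank[j] / len(targets)
            for i in targets:
                new_rank[i] += share
        rank = new_rank
    return rank


# Hypothetical graph in which pages 4 and 5 form a rank sink.
sink_graph = {1: [2], 2: [1, 3], 3: [1, 4], 4: [5], 5: [4]}
sink_ranks = practical_pagerank(sink_graph, d=0.8)
```

With d < 1 the pages outside the sink keep a positive rank, because every page receives the random-jump share on each iteration.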

Practical PageRank: In practice ...
Google uses IR techniques combined with Practical PageRank to rank the results of a query.

HITS: Hypertext Induced Topic Search
A query-dependent technique
Produces two scores:
  Authority: a page most likely to be relevant to a given query
  Hub: a page that points to many authorities
Consists of two parts:
  Identifying the focused subgraph
  Link analysis
Notes: Unlike PageRank, we have to start from some particular set, so a query must come first, followed by an initial set of pages based on IR. Hub: an index of search engines. Authority: Yahoo. It is good to return both as the answer. They mutually reinforce each other. Lead-in to the next slide: how do we add many authorities and hubs to the set?

HITS: Identifying The Focused Subgraph
Subgraph creation from a t-sized page set:
(d reduces the influence of extremely popular pages like yahoo.com)
Notes: There is a limit of d pages when adding the pages that point to the current page, so that a popular site like Yahoo does not help parasites advance at its expense (they would point to it and benefit from that). The set S should be rich in authorities and hubs.

HITS: Link Analysis
Calculates authority and hub scores (ai and hi) for each page in S

HITS: Link Analysis Computation
Eigenvector computation can be used: $a = A^{T} h$, $h = A a$
Where
  a: vector of authority scores
  h: vector of hub scores
  A: adjacency matrix in which a_{i,j} = 1 if page i points to page j
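The link analysis step can be sketched as an iteration in which each authority score is the sum of the hub scores pointing to the page and each hub score is the sum of the authority scores the page points to, with normalization after every round. The normalization choice (Euclidean length), the toy graph, and all names are assumptions for illustration.

```python
import math


def hits(links, iterations=50):
    """Iteratively compute (authority, hub) scores for the pages in links."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    hub = {page: 1.0 for page in pages}
    auth = {page: 1.0 for page in pages}
    for _ in range(iterations):
        # Authority update: a_i = sum of h_j over pages j that point to i.
        auth = {page: 0.0 for page in pages}
        for j, targets in links.items():
            for i in targets:
                auth[i] += hub[j]
        # Hub update: h_i = sum of a_j over pages j that i points to.
        hub = {page: sum(auth[t] for t in links.get(page, [])) for page in pages}
        # Normalize both vectors to unit length so the scores stay bounded.
        for scores in (auth, hub):
            norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
            for page in scores:
                scores[page] /= norm
    return auth, hub


toy = {"h1": ["a1", "a2"], "h2": ["a1"]}
authority, hubs = hits(toy)
```

In the toy graph, `a1` ends up as the stronger authority because both hubs point to it, and `h1` as the stronger hub because it points to both authorities.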

Other Link Based Techniques
Identifying communities: sets of pages created and used by people sharing a common interest
Related pages: sibling pages may be related
Classification and resource compilation:
  Automatic vs. manual classification
  Identifying high quality pages for a topic
Notes: A community is, for example, a set of pages dedicated to DB research. Such sets have a core of hubs and authorities. Research has discovered about 100,000 such communities, in a process called trawling. HITS handles related pages: it gives similar scores, because an authority gets its score from hubs, so authorities that share the same hubs get the same score. Resource compilation is like the indexes of Yahoo and AltaVista; HITS has uses for this.

Ranking, Conclusion
The link structure of the web contains useful information
Ranking methods:
  PageRank: a global ranking scheme for ranking search results
  HITS: computes the authorities and hubs for a given query
Future directions: use of other information sources, sophisticated text analysis
Notes: For example, query logs and click streams.

Conclusion “What was it all about?”

The motivation
The web’s vast scale
Limited resources
The web is changing rapidly
An important and in-demand field
Notes: Bandwidth is limited and disk space is limited, but users want immediate results.

The Basic Architecture
Crawlers: travel the web, retrieving pages
Repositories: store pages locally
Indexers: index and analyze the pages stored in the repository
Ranking modules: return the most promising pages to the query engine

The End “Questions, anyone?”
