INFO 344 Web Tools And Development CK Wang University of Washington Spring 2014.


Announcements
PA3 due in 1 week! 5/19, 11pm PST
Anyone go to Startup Weekend?
If you have 0 late days left, please submit before 10:30pm PST, just in case…
Red errors = use IntelliSense!

PA3
Only crawl sites in the approved domains
Ignore non-HTML URLs -> only *.html, *.htm
Robots.txt
  – Sitemap: use this to initialize the URL queue
  – Disallow: remember these rules and filter out matching URLs
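The PA3 rules above can be sketched roughly as follows. This is a minimal illustration, not the assignment's reference solution: the domain `example.com`, the function names `parse_robots` and `should_crawl`, and the choice to ignore `User-agent` groups are all assumptions made for the example.

```python
from urllib.parse import urlparse

# Hypothetical value for illustration; the real approved domains come from the assignment.
APPROVED_DOMAINS = {"example.com"}


def parse_robots(text):
    """Return (sitemaps, disallowed_prefixes) from the body of a robots.txt file.

    Simplification: every Disallow line is honored regardless of User-agent group.
    """
    sitemaps, disallowed = [], []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("#") or ":" not in line:
            continue
        field, value = line.split(":", 1)
        field, value = field.strip().lower(), value.strip()
        if field == "sitemap" and value:
            sitemaps.append(value)      # seeds for the URL queue
        elif field == "disallow" and value:
            disallowed.append(value)    # path prefixes to filter out later
    return sitemaps, disallowed


def should_crawl(url, disallowed):
    """Apply the three filters: approved domain, HTML extension, not disallowed."""
    parts = urlparse(url)
    host = parts.hostname or ""
    in_domain = any(host == d or host.endswith("." + d) for d in APPROVED_DOMAINS)
    is_html = parts.path.lower().endswith((".html", ".htm"))
    blocked = any(parts.path.startswith(prefix) for prefix in disallowed)
    return in_domain and is_html and not blocked
```

The sitemap list would seed the queue once at startup, while `should_crawl` would run on every link discovered afterwards.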

Any Questions??? PA3

Web vs. Worker roles
Same hardware
Web role – VM running IIS on port 80 that serves websites
Worker role – VM that runs the code in Run() in WorkerRole.cs
Why distinguish the two?
  – Better information architecture! => more scalable
What happens if we did not distinguish the two? All web role?
  – No distinction, all web role: the web role forks a thread to start crawling?
  – What if I want 1000 machines to crawl? Send 1000 messages to the asmx to start? They all start from the same URL? Duplicate work
  – Spin off a thread to mimic a worker role? OK… that’s worker role work!
  – Load balancing across 1000 machines? Different machines = different # of URLs.
  – Scale web and worker separately & appropriately
With separate web and worker roles and a Queue to communicate => even 1000 nodes can work together efficiently!
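To see why a shared queue avoids duplicate work, here is a small in-process simulation. It is only a sketch: Python's `queue.Queue` stands in for the Azure Queue, a list stands in for table storage, threads stand in for worker role instances, and `run_crawl` is a made-up name.

```python
import queue
import threading


def run_crawl(seed_urls, num_workers):
    """Simulate one web role enqueueing work and several worker roles draining it."""
    url_queue = queue.Queue()   # stands in for the Azure Queue
    crawled = []                # stands in for the Azure Table
    lock = threading.Lock()

    # "Web role": enqueue each URL exactly once.
    for url in seed_urls:
        url_queue.put(url)

    def worker(worker_id):
        # "Worker role": keep pulling URLs until the queue is empty.
        while True:
            try:
                url = url_queue.get_nowait()
            except queue.Empty:
                return
            with lock:
                crawled.append((worker_id, url))

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return crawled
```

Each URL is handed to exactly one worker, and a faster worker simply takes more URLs, so load balancing falls out of the queue without any coordination between machines.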

History of Search Infrastructure

Yahoo
1995
List of URLs
Hierarchical organization of URLs (Categories)
Initially manual? Maybe even just a text file that each server loaded into memory
Probably became a database, expensive Oracle database machines

Lycos
By the way… Lycos is still alive, and you have a better search engine than Lycos: no query suggestions there even today!

Google
Changed everything
http://infolab.stanford.edu/~backrub/google.html
PageRank
  – Use links to find good sites
  – If a good page links to another page with the anchor text “Lebron James”, that’s probably a good indication that the linked page is about Lebron James and is a pretty good quality site
  – Infrastructure problem – crawl the entire web, fit it on 1 drive to calculate PageRank, multiple iterations! Propagate authority/rank.
A page has high PageRank when:
  – A lot of pages point to it
  – High-PageRank pages point to it

Infrastructure problem…
Calculate PageRank = needs the entire web, calculate links, iterate N times!
Internet is exploding…
Invented MapReduce and all these infrastructure services (queue, table, query suggest, etc.)
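The "iterate N times" step can be illustrated with a toy, single-machine PageRank. This is a simplified sketch, not Google's production algorithm: the damping factor of 0.85, the handling of pages with no outgoing links, and the function name `page_rank` are assumptions chosen for the example.

```python
def page_rank(links, damping=0.85, iterations=20):
    """Toy PageRank. `links` maps each page to the list of pages it links to."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}          # start with equal rank everywhere
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # Propagate this page's rank evenly across its outgoing links.
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Page with no outgoing links: spread its rank over all pages.
                for t in pages:
                    new_rank[t] += damping * rank[page] / n
        rank = new_rank
    return rank
```

Every iteration needs the whole link graph, which is why the computation stops fitting on one machine once the graph is the entire web.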

Infrastructure of a Search Engine

Anatomy
[Diagram: anatomy of a search engine. Red = Storage, Blue = Compute]
Compute: Web Role (Search.aspx, Dashboard.aspx, Admin.asmx), QuerySuggest, Worker Role (Crawler)
Storage: Azure Blob (QuerySuggest), Azure Queue (URLs to Crawl), Azure Table (Web Index), AWS RDS (Structured Data, NBA stats), Azure Table (Ranking), Azure Blob (User Logs)
Arrow labels: User, query, query suggestions, URLs, word + URLs, Wiki dataset, query stats
This is basically how Google works!

Anatomy
[Diagram: the same Anatomy diagram, annotated with the labels PA3, PA1, PA2]

Google
[Diagram labels: PA2, PA3, PA1]
Google pioneered the state of the art for web infrastructure

Generalizable Infrastructure

Amazon
[Diagram: the same architecture applied to Amazon. Red = Storage, Blue = Compute]
Compute: Web Role (Index.aspx, Dashboard.aspx, Admin.asmx), QuerySuggest, Worker Role (Crawler, Price Calc, Recs)
Storage: Azure Blob (QuerySuggest), Azure Queue (URLs to Crawl), Azure Table (Price Index, Reviews, Comments), AWS RDS (Product Data, User Data), Azure Table (Recommendation), Azure Blob (User Purchases)
Arrow labels: User, query, query suggestions, URLs, Product + price, Wiki dataset, query stats
This is basically how Amazon works!

Interesting Problems in Infrastructure

Structured Data (PA1)
In PA1, I gave you the CSV data
Structured data
  – Where to find this data? Wiki & Web
  – How to parse & understand Wiki?
  – How to parse & understand the Web?
  – How to understand relationships?
  – How to understand tables in HTML?
Where to store this huge data?
  – Probably Table Storage. What about the relationships?
Huge engineering effort, maybe 100 people at Google?

Query Suggestion (PA2)
Data = Wiki + User Logs
Fit into memory
  – Ours => A to C
  – Google => A to Z, digits, in all languages!
Popularity biased
  – Type in ‘a’ => popularity will return amazon, alaska air, aol, apple
  – Ours ‘a’ => returns boring results
  – Popularity returns more interesting results
  – How to implement popularity-biased traversal?
Also suggests corrections for misspellings!

Query Suggestion (PA2)
Fit into memory
  – Better data structure, hybrid! Trie + List
  – No need for a trie for the tail, ex: “a story a…”
  – Use a List until > 100 entries, then a Trie
  – Ex: “a story a story”: using a trie for the last “a story” is a waste of memory
  – A trie node with 1 child uses 9 bytes instead of 1 byte! 9x difference if only 1 child!
  – Compression/C++?
  – More machines (traffic and memory)
  – Our PA2, maybe 6 machines to fit all? Spin up 6 Azure instances
  – AJAX => if first char == a-c, AJAX call to machine 1, d-g => #2, etc. The client side decides which machine to ask for the results! => distributed service
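The client-side routing idea can be sketched as a small lookup. Only the first two ranges (a-c and d-g) come from the slide; the remaining boundaries in `SHARD_STARTS`, the fallback to machine 1, and the name `shard_for` are made up for illustration.

```python
import bisect

# First letter of each machine's range. "a" and "d" follow the slide (a-c, d-g);
# the other four boundaries are hypothetical.
SHARD_STARTS = ["a", "d", "h", "m", "q", "u"]


def shard_for(query, starts=SHARD_STARTS):
    """Return the 1-based machine number that should answer this query."""
    first = query[:1].lower()
    if not first or first < starts[0]:
        # Assumption: digits, symbols, and empty input fall back to machine 1.
        return 1
    # Number of range starts that are <= the first character.
    return bisect.bisect_right(starts, first)
```

In the browser the same decision would be made in JavaScript before issuing the AJAX call, so no central router sits between the client and the six instances.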

Query Suggestion (PA2) Hybrid Trie/List structure
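One way such a hybrid could look is sketched below. It is an illustration under assumptions rather than the structure used in lecture: each node keeps its remaining suffixes in a plain list and only converts itself into a real trie node once the list exceeds a threshold (100, following the slide). The class name `HybridNode` and its methods are hypothetical.

```python
class HybridNode:
    """Trie node that stores suffixes in a plain list until the list grows too large."""

    def __init__(self, threshold=100):
        self.threshold = threshold
        self.suffixes = []      # used while the node is in "list mode"
        self.children = None    # dict of char -> HybridNode once in "trie mode"
        self.is_word = False

    def insert(self, word):
        if word == "":
            self.is_word = True
        elif self.children is None:
            if word not in self.suffixes:
                self.suffixes.append(word)
            if len(self.suffixes) > self.threshold:
                self._burst()
        else:
            child = self.children.setdefault(word[0], HybridNode(self.threshold))
            child.insert(word[1:])

    def _burst(self):
        """Convert this node from list mode to trie mode."""
        pending, self.suffixes, self.children = self.suffixes, [], {}
        for suffix in pending:
            self.insert(suffix)

    def search_prefix(self, prefix, limit=10):
        results = []
        self._collect(prefix, "", results, limit)
        return results

    def _collect(self, remaining, built, results, limit):
        if len(results) >= limit:
            return
        if remaining == "" and self.is_word:
            results.append(built)
        if self.children is None:
            # List mode: scan the stored suffixes linearly.
            for suffix in self.suffixes:
                if len(results) >= limit:
                    break
                if suffix.startswith(remaining):
                    results.append(built + suffix)
        elif remaining:
            # Trie mode, still consuming the prefix.
            child = self.children.get(remaining[0])
            if child is not None:
                child._collect(remaining[1:], built + remaining[0], results, limit)
        else:
            # Trie mode, prefix consumed: walk every branch.
            for ch in sorted(self.children):
                self.children[ch]._collect("", built + ch, results, limit)
```

A long, rarely shared tail such as the second half of “a story a story” stays a single string in a list instead of becoming a chain of one-child nodes.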

Query Suggestion (PA2)
Popularity biased (red = default, green = popular)
  – Keep track of the popularity of each path in this trie
[Diagram labels: Popularity = 1000, Popularity = 10]
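A possible way to implement the popularity-biased traversal is a best-first search: every node remembers the highest popularity found anywhere below it, and a heap always expands the most promising path next. The sketch below assumes that design; the names `Node`, `insert`, and `suggest`, and the popularity numbers in the usage note, are illustrative only.

```python
import heapq
import itertools


class Node:
    def __init__(self):
        self.children = {}
        self.popularity = 0   # popularity of the word ending here (0 = not a word)
        self.best = 0         # highest popularity anywhere in this subtree


def insert(root, word, popularity):
    node = root
    node.best = max(node.best, popularity)
    for ch in word:
        node = node.children.setdefault(ch, Node())
        node.best = max(node.best, popularity)
    node.popularity = popularity


def suggest(root, prefix, k=3):
    """Return up to k completions of `prefix`, most popular first."""
    node = root
    for ch in prefix:
        node = node.children.get(ch)
        if node is None:
            return []
    tie = itertools.count()   # tie-breaker so the heap never compares Node objects
    # Heap entries: (-score, tie, text, node). node=None means "emit this word".
    heap = [(-node.best, next(tie), prefix, node)]
    results = []
    while heap and len(results) < k:
        _, _, text, current = heapq.heappop(heap)
        if current is None:
            results.append(text)
            continue
        if current.popularity > 0:
            heapq.heappush(heap, (-current.popularity, next(tie), text, None))
        for ch, child in current.children.items():
            heapq.heappush(heap, (-child.best, next(tie), text + ch, child))
    return results
```

With made-up popularities of amazon = 1000, apple = 900, alaska air = 500, aol = 400, and aardvark = 10, `suggest(root, "a", 3)` would walk the amazon path first and never touch the aardvark branch.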

Query Suggestion (PA2)
Misspellings
  – Part 1 of traversal (traverse what the user typed) => traverse a slightly different path!
  – Keep track of # edits => edit distance; spend an edit to take a more popular path instead
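The "slightly different path" idea can be sketched as a trie walk that carries one row of the Levenshtein edit-distance table along each path and abandons a branch once the edit budget is exhausted. This is one standard technique, not necessarily the one intended in lecture; the dict-based trie, the `"$"` end marker (which assumes no word contains `$`), and the names `build_trie` and `fuzzy_lookup` are assumptions.

```python
def build_trie(words):
    """`words` maps each word to its popularity. Nodes are dicts; "$" marks a word end."""
    root = {}
    for word, popularity in words.items():
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node["$"] = popularity
    return root


def fuzzy_lookup(root, typed, max_edits=1):
    """Return (word, edits, popularity) for every trie word within max_edits of `typed`."""
    matches = []
    first_row = list(range(len(typed) + 1))

    def walk(node, built, prev_row):
        if "$" in node and prev_row[-1] <= max_edits:
            matches.append((built, prev_row[-1], node["$"]))
        for ch, child in node.items():
            if ch == "$":
                continue
            # Extend the edit-distance table by one row for this trie character.
            row = [prev_row[0] + 1]
            for i in range(1, len(typed) + 1):
                cost = 0 if typed[i - 1] == ch else 1
                row.append(min(row[i - 1] + 1, prev_row[i] + 1, prev_row[i - 1] + cost))
            # Prune paths that can no longer end within the edit budget.
            if min(row) <= max_edits:
                walk(child, built + ch, row)

    walk(root, "", first_row)
    # Fewest edits first; among equal edits, the more popular word wins.
    matches.sort(key=lambda m: (m[1], -m[2]))
    return matches
```

This version corrects whole words only; combining it with prefix completion would mean accepting any node reached within the budget once the typed text is consumed.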

Crawler (PA3) TBD

Questions?
