INFO 344 Web Tools And Development CK Wang University of Washington Spring 2014.

Announcements
- PA3 due in 1 week! 5/19, 11pm PST
- Did anyone go to Startup Weekend?
- If you have 0 late days left, please submit before 10:30pm PST just in case…
- Red errors = use Intellisense!

PA3
- Only crawl sites in the approved domains
- Ignore non-HTML URLs: only *.html, *.htm
- Robots.txt
  - Sitemap: use this to initialize the URL queue
  - Disallow: remember this and filter out matching URLs
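The PA3 rules above can be sketched roughly as follows. This is a minimal illustration in Python rather than the course's own stack, not the assignment solution: `APPROVED_DOMAINS`, `parse_robots`, and `should_crawl` are hypothetical names, the domain is a placeholder, and the parser ignores User-agent groups for simplicity.

```python
from urllib.parse import urlparse

# Hypothetical placeholder; the real list of approved domains comes from the assignment.
APPROVED_DOMAINS = ("example.com",)


def parse_robots(text):
    """Return (sitemaps, disallowed_prefixes) found in a robots.txt body.

    Simplification: every Disallow line is honored, regardless of User-agent.
    """
    sitemaps, disallowed = [], []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments
        if ":" not in line:
            continue
        field, value = line.split(":", 1)
        field, value = field.strip().lower(), value.strip()
        if field == "sitemap" and value:
            sitemaps.append(value)      # used to initialize the URL queue
        elif field == "disallow" and value:
            disallowed.append(value)    # remembered to filter URLs later
    return sitemaps, disallowed


def should_crawl(url, disallowed):
    """Apply the three filters: approved domain, *.html/*.htm only, not disallowed."""
    parts = urlparse(url)
    host = parts.hostname or ""
    in_domain = any(host == d or host.endswith("." + d) for d in APPROVED_DOMAINS)
    path = parts.path or "/"
    is_html = path.lower().endswith((".html", ".htm"))
    blocked = any(path.startswith(prefix) for prefix in disallowed)
    return in_domain and is_html and not blocked
```

Note that a strict *.html / *.htm check also skips URLs such as a bare site root; whether that is desired depends on the assignment.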

Any questions about PA3?

Web vs. Worker roles
- Same hardware
- Web role: VM running IIS on port 80 that serves websites
- Worker role: VM that runs the code in Run() in WorkerRole.cs
- Why distinguish the two?
  - Better information architecture! => more scalable
- What happens if we do not distinguish the two? All web role?
  - No distinction, all web role: will the web role fork a thread to start crawling?
  - What if I want 1000 machines to crawl? Send 1000 messages to the asmx to start? They all start from the same URL? Duplicate work
  - Spin off a thread to mimic a worker role? OK… that's worker role work!
  - Load balancing across 1000 machines? Different machines = different # of URLs
  - Scale web and worker separately and appropriately
- With web role vs. worker role and a Queue to communicate => even 1000 nodes can work together efficiently!
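A rough in-process sketch of why a shared queue lets many workers cooperate without duplicate work. It is only an analogy: Python threads stand in for worker role instances, `queue.Queue` stands in for the Azure Queue, and `run_workers` and `fetch_links` are hypothetical names, not part of any Azure API.

```python
import queue
import threading


def run_workers(seed_urls, fetch_links, num_workers=4):
    """Crawl from seed_urls with num_workers threads sharing one URL queue.

    fetch_links(url) is a caller-supplied function returning the links on a page.
    """
    url_queue = queue.Queue()
    seeds = list(dict.fromkeys(seed_urls))  # de-duplicate, keep order
    seen = set(seeds)
    lock = threading.Lock()
    crawled = []
    for url in seeds:
        url_queue.put(url)

    def worker():
        while True:
            url = url_queue.get()
            if url is None:              # sentinel: no more work
                url_queue.task_done()
                return
            for link in fetch_links(url):
                with lock:
                    if link in seen:     # another worker already queued it
                        continue
                    seen.add(link)
                url_queue.put(link)
            with lock:
                crawled.append(url)
            url_queue.task_done()

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    url_queue.join()                     # wait until every queued URL is processed
    for _ in threads:
        url_queue.put(None)
    for t in threads:
        t.join()
    return crawled
```

Each worker pulls whatever URL is next, so load balances itself: a fast worker simply takes more URLs, and adding workers needs no change to whoever enqueues the seeds.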

History of Search Infrastructure

Yahoo
- 1995
- List of URLs
- Hierarchical organization of URLs (categories)
- Initially manual?
- Maybe even just a text file that each EC2 machine loaded into memory
- Probably became a database, on expensive Oracle database machines

Lycos
By the way… Lycos is still alive, and you have a better search engine than Lycos: it has no query suggestions even today!

Google
Changed everything
PageRank
- Use links to find good sites
- If a good page links to another good page with the anchor text "LeBron James", that is probably a good indication that the linked page is about LeBron James and is a pretty good quality site
- Infrastructure problem: crawl the entire web, fit it in 1 drive to calculate PageRank, multiple iterations!
Propagate authority/rank. A page has high PageRank when:
- A lot of pages point to it
- High-PageRank pages point to it

Infrastructure problem…
- Calculating PageRank needs the entire web: calculate links, iterate N times!
- The Internet is exploding…
- Google invented MapReduce and all these infrastructure services (queue, table, query suggest, etc.)
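To make "iterate N times" concrete, here is a small sketch of the iterative rank propagation on an in-memory link graph. The function name `page_rank`, the damping factor of 0.85, and the handling of pages with no outgoing links are assumptions for illustration; the slides do not specify them.

```python
def page_rank(links, iterations=20, damping=0.85):
    """Iteratively propagate rank along links.

    links maps each page to the list of pages it links to.
    damping=0.85 is a conventional choice, assumed here.
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}   # start with equal rank everywhere

    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its rank evenly among the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Assumption: a page with no outgoing links spreads rank evenly.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank
```

A page ends up with a high rank either because many pages link to it or because highly ranked pages link to it. Each iteration needs every link in the graph, which is why the whole crawl has to be available to the computation.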

Infrastructure of a Search Engine

Anatomy
[Architecture diagram. Red = storage, blue = compute.]
- Web Role: Search.aspx, Dashboard.aspx, Admin.asmx
- Worker Role: Crawler
- QuerySuggest
- Azure Blob: QuerySuggest
- Azure Queue: URLs to Crawl
- Azure Table: Web Index
- Azure Table: Ranking
- Azure Blob: User Logs
- AWS RDS: Structured Data (NBA stats)
- Arrow labels: User, query, query suggestions, URLs, (word, URLs), Wiki dataset, query stats
- A second version of the diagram marks the parts covered by PA1, PA2, and PA3
This is basically how Google works!

Google
[Diagram labels: PA1, PA2, PA3]
Google pioneered the state of the art for web infrastructure

Generalizable Infrastructure

Amazon
[Architecture diagram. Red = storage, blue = compute.]
- Web Role: Index.aspx, Dashboard.aspx, Admin.asmx
- Worker Role: Crawler, Price Calc, Recs
- QuerySuggest
- Azure Blob: QuerySuggest
- Azure Queue: URLs to Crawl
- Azure Table: Price Index, Reviews, Comments
- Azure Table: Recommendation
- Azure Blob: User Purchases
- AWS RDS: Product Data, User Data
- Arrow labels: User, query, query suggestions, URLs, (product, price), Wiki dataset, query stats
This is basically how Amazon works!

Interesting Problems in Infrastructure

Structured Data (PA1)
- In PA1, I gave you the CSV data
- Structured data: where to find this data? Wiki and Web
  - How to parse and understand Wiki?
  - How to parse and understand the Web?
  - How to understand relationships?
  - How to understand tables in HTML?
- Where to store this huge data?
  - Probably Table Storage. What about the relationships?
- Huge engineering effort, maybe 100 people at Google?

Query Suggestion (PA2)
- Data = Wiki + User Logs
- Fit into memory
  - Ours => A to C
  - Google => A to Z, digits, in all languages!
- Popularity biased
  - Type in 'a' => popularity will return amazon, alaska air, aol, apple
  - Ours: 'a' => returns boring results
  - Popularity returns more interesting results
  - How to implement popularity-biased traversal?
- Also suggests misspellings!

Query Suggestion (PA2)
Fit into memory
- Better data structure, hybrid! Trie + List
  - No trie needed for the tail, ex: "a story a…"
  - Use a List until > 100 entries, then a Trie
  - Ex: "a story a story": using a trie for the last "a story" is a waste of memory
  - A trie node with 1 child uses 9 bytes instead of 1 byte: a 9x difference if there is only 1 child!
- Compression/C++?
- More machines (traffic and memory)
  - Our PA2: maybe 6 machines to fit everything? Spin up 6 Azure instances
  - AJAX => if first char is a-c, call machine 1; d-g => machine 2; etc.
  - The client side decides which machine to ask for the results! => distributed service
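One possible reading of the hybrid Trie + List idea, sketched in Python: a node keeps its remaining suffixes in a plain list and only splits into per-character children once the list grows past a threshold (100 on the slide). The class name `HybridNode` and its methods are hypothetical, and this sketch makes no claim about actual byte-level savings.

```python
class HybridNode:
    """Trie node that stores suffixes in a list until it grows past a threshold."""

    def __init__(self, threshold=100):
        self.threshold = threshold
        self.suffixes = []     # used while the node is still small
        self.children = None   # dict of char -> HybridNode once the node has split
        self.is_word = False   # only meaningful after the split

    def insert(self, suffix):
        if self.children is None:
            if suffix not in self.suffixes:
                self.suffixes.append(suffix)
            if len(self.suffixes) > self.threshold:
                self._split()
            return
        if suffix == "":
            self.is_word = True
            return
        child = self.children.get(suffix[0])
        if child is None:
            child = self.children[suffix[0]] = HybridNode(self.threshold)
        child.insert(suffix[1:])

    def _split(self):
        # Convert the list into real trie children and re-insert every suffix.
        pending, self.suffixes, self.children = self.suffixes, [], {}
        for suffix in pending:
            self.insert(suffix)

    def complete(self, prefix, limit=10):
        """Return up to `limit` stored words starting with prefix, in sorted order."""
        results = []
        self._collect(prefix, "", results, limit)
        return results

    def _collect(self, remaining, built, results, limit):
        if len(results) >= limit:
            return
        if self.children is None:
            for suffix in sorted(self.suffixes):
                if suffix.startswith(remaining) and len(results) < limit:
                    results.append(built + suffix)
            return
        if remaining:
            child = self.children.get(remaining[0])
            if child is not None:
                child._collect(remaining[1:], built + remaining[0], results, limit)
            return
        if self.is_word:
            results.append(built)
        for ch in sorted(self.children):
            self.children[ch]._collect("", built + ch, results, limit)
```

Long, rarely shared tails such as the second half of "a story a story" stay as one list entry instead of a chain of single-child trie nodes.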

Query Suggestion (PA2) Hybrid Trie/List structure

Query Suggestion (PA2)
Popularity biased (red = default, green = popular)
- Keep track of the popularity of each path in this trie
[Figure labels: Popularity = 1000, Popularity = 10]
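One way to implement popularity-biased traversal, sketched under assumptions: each trie node remembers the highest popularity found anywhere below it, and a best-first search always follows the most promising path next. `PopNode`, `insert`, `suggest`, and the popularity numbers used with them are illustrative only.

```python
import heapq


class PopNode:
    def __init__(self):
        self.children = {}
        self.popularity = None  # set when a query ends at this node
        self.best = 0           # highest popularity anywhere in this subtree


def insert(root, query, popularity):
    """Add a query and update the best-popularity marker along its path."""
    node = root
    node.best = max(node.best, popularity)
    for ch in query:
        node = node.children.setdefault(ch, PopNode())
        node.best = max(node.best, popularity)
    node.popularity = popularity


def suggest(root, prefix, k=3):
    """Return the k most popular stored queries that start with prefix."""
    node = root
    for ch in prefix:
        node = node.children.get(ch)
        if node is None:
            return []
    # Heap entries: (-score, text, kind, node). kind 0 = finished query,
    # kind 1 = subtree still to expand. Ties never reach the node field.
    heap = [(-node.best, prefix, 1, node)]
    results = []
    while heap and len(results) < k:
        _, text, kind, current = heapq.heappop(heap)
        if kind == 0:
            results.append(text)
            continue
        if current.popularity is not None:
            heapq.heappush(heap, (-current.popularity, text, 0, None))
        for ch, child in current.children.items():
            heapq.heappush(heap, (-child.best, text + ch, 1, child))
    return results
```

Because a subtree's `best` is never lower than any query inside it, results come out in descending popularity and unpopular branches are never expanded once k results are found.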

Query Suggestion (PA2)
Misspellings
- Part 1 of traversal (traverse what the user typed) => traverse a slightly different path!
- Keep track of # edits => edit distance; spend an edit to take a more popular path instead
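A hedged sketch of traversing "a slightly different path": walk the trie while carrying one row of the edit-distance table per node, and abandon a branch once it cannot stay within the edit budget. `build_trie`, `fuzzy_lookup`, and the ranking rule (most popular first) are assumptions for illustration. This version matches whole words rather than prefixes.

```python
END = ""  # marker key; cannot collide with a real single character


def build_trie(words):
    """words maps each word to its popularity."""
    root = {}
    for word, popularity in words.items():
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[END] = popularity
    return root


def fuzzy_lookup(root, typed, max_edits=1):
    """Return (word, edits, popularity) for words within max_edits of typed."""
    results = []

    def walk(node, built, prev_row):
        for ch, child in node.items():
            if ch == END:
                continue
            # Next row of the Levenshtein table for the path built + ch.
            row = [prev_row[0] + 1]
            for i in range(1, len(typed) + 1):
                cost = 0 if typed[i - 1] == ch else 1
                row.append(min(row[i - 1] + 1,          # insertion
                               prev_row[i] + 1,         # deletion
                               prev_row[i - 1] + cost)) # match or substitution
            if END in child and row[-1] <= max_edits:
                results.append((built + ch, row[-1], child[END]))
            if min(row) <= max_edits:  # prune paths that are already too far off
                walk(child, built + ch, row)

    walk(root, "", list(range(len(typed) + 1)))
    # Assumed ranking: most popular first, then fewest edits.
    results.sort(key=lambda r: (-r[2], r[1], r[0]))
    return results
```

With this ranking a popular near-miss can outrank an exact but unpopular match; whether that is desirable is a product decision, not something the slides settle.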

Crawler (PA3) TBD

Questions?