INFO 344 Web Tools And Development CK Wang University of Washington Spring 2014.

Programming Assignment #3

- Web Crawler
- Offline processing
- Dashboard

Web Crawler
- Google: crawl websites & index for search engine
- Amazon: crawl web to price match w/ Amazon’s price
- Aggregate content
  - Shopping (Nextag)
  - News (finance.google.com)

Offline processing & Dashboard
- Offline/async processing
  - Facebook Lookback
  - Twitter fire hose and analyze sentiments
  - YouTube video compression (upload then compress)
  - Anything that takes > 5s to load => do offline!
- Dashboard
  - Easy way to see status of offline processing

Final Product
Azure Cloud Service: Web Role + Worker Role
- Web Role
  - dashboard.aspx
    - status, #urls, last 10, etc
  - admin.asmx
    - ClearIndex (also stops current crawling)
    - StartCrawling
    - GetPageTitle
- Worker Role
  - Read URL from Queue
  - Crawl websites
  - Store title to Table
  - Add URLs found to Queue
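The Worker Role loop above can be sketched in a few lines. This is only an illustration in Python, not the Visual Studio/Azure project the assignment asks for: an in-memory deque stands in for the Queue, a dict stands in for the Table, and `fetch`, `PAGES`, `TitleAndLinkParser`, and `run_worker` are hypothetical names made up for this sketch.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class TitleAndLinkParser(HTMLParser):
    """Collects the <title> text and every <a href> found on a page."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.links = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def run_worker(url_queue, title_table, fetch):
    """Drain the queue: read a URL, crawl it, store its title, enqueue links."""
    while url_queue:
        url = url_queue.popleft()              # Read URL from Queue
        if url in title_table:
            continue                           # already crawled, skip
        html = fetch(url)                      # Crawl website
        if html is None:
            continue                           # fetch failed or page unknown
        parser = TitleAndLinkParser()
        parser.feed(html)
        title_table[url] = parser.title.strip()    # Store title to Table
        for href in parser.links:              # Add URLs found to Queue
            url_queue.append(urljoin(url, href))


# Hypothetical two-page site served from a dict instead of the network.
PAGES = {
    "http://example.com/": '<html><head><title>Home</title></head>'
                           '<body><a href="/a">A</a></body></html>',
    "http://example.com/a": '<title>Page A</title><a href="/">back</a>',
}
table = {}
run_worker(deque(["http://example.com/"]), table, PAGES.get)
print(table)
```

Passing `fetch` in as a parameter keeps the loop testable without network access; a real worker would swap in an HTTP client and a durable queue.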

Great User Experience
- Refresh dashboard => gets me new data
- ASMX admin page should return relevant status such as “Index Cleared” instead of void/empty string; consider other cases.
- Remove duplicates
- Only crawl websites in the same domain as your seed URL.
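The last two points (remove duplicates, stay in the seed's domain) might look like the following Python sketch. It assumes that "same domain" includes subdomains of the seed and that a leading "www." on the seed is ignored; the slides do not specify either, and the function names are hypothetical.

```python
from urllib.parse import urldefrag, urlparse


def normalize(url):
    """Drop the #fragment and lowercase the host so duplicates compare equal."""
    url, _fragment = urldefrag(url)
    parts = urlparse(url)
    return parts._replace(netloc=parts.netloc.lower()).geturl()


def same_domain(url, seed_url):
    """True when url's host is the seed's host or one of its subdomains."""
    host = urlparse(url).hostname or ""
    seed_host = urlparse(seed_url).hostname or ""
    if seed_host.startswith("www."):
        seed_host = seed_host[4:]          # assumption: treat www as the root
    return host == seed_host or host.endswith("." + seed_host)


def filter_links(links, seed_url, visited):
    """Return only in-domain links not seen before, recording them in visited."""
    fresh = []
    for link in links:
        url = normalize(link)
        if same_domain(url, seed_url) and url not in visited:
            visited.add(url)
            fresh.append(url)
    return fresh


visited = set()
links = [
    "http://www.cnn.com/world",
    "http://WWW.cnn.com/world#top",   # duplicate after normalization
    "http://edition.cnn.com/x",       # subdomain, kept under our assumption
    "http://example.org/",            # different domain, dropped
]
print(filter_links(links, "http://www.cnn.com", visited))
```

Using a set for `visited` gives constant-time membership checks on average, which matters once the crawl reaches many thousands of URLs.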

Start Now! (ok… after PA2)

Deliverables
- Due on May 19, 11pm PST
- Submit on Canvas
- Please submit the following as a single zip file:
  - URL to your Azure instance hosting the dashboard (readme.txt); make sure crawling is complete!
  - URL to your GitHub repo (share your GitHub with me & TA) in readme.txt
  - Visual Studio 2013 project & source code
  - Screenshot of your Azure dashboard with instance running (azure-compute.jpg)
  - Write-up explaining how you implemented everything. Make sure to address each of the requirements (writeup.txt, ~500 words)
  - Extra credits: short paragraph in extracredits.txt for each extra credit (how to see/trigger/evaluate/run your extra credit feature and how you implemented it)

Hint
- Respect robots.txt (google it, it’s a simple format)
- Only need to crawl pages in the same domain
- Keep a list of already visited URLs; don’t re-crawl them; store in a fast lookup data structure
- Think about where to store stats
- Your code should handle 2+ worker threads. Think about concurrency in updating dashboard stats
- Local hosting/debugging = run as Admin
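Three of these hints (robots.txt, the visited set, and concurrent stat updates) can be illustrated together. The sketch below is Python rather than the assignment's own stack; it uses the standard library's `urllib.robotparser`, and the robots.txt content, the `CrawlStats` class, and the URLs are invented for the example.

```python
import threading
from urllib.robotparser import RobotFileParser

# Made-up robots.txt; a real crawler would download it from the site root.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Sitemap: http://example.com/sitemap.xml
"""

robots = RobotFileParser()
robots.parse(ROBOTS_TXT.splitlines())


class CrawlStats:
    """Visited set plus dashboard counters, guarded by one lock."""

    def __init__(self):
        self._lock = threading.Lock()
        self.visited = set()      # fast lookup of already-seen URLs
        self.crawled = 0
        self.last_ten = []

    def claim(self, url):
        """Return True only for the first thread that asks about url."""
        with self._lock:
            if url in self.visited:
                return False
            self.visited.add(url)
            return True

    def record(self, url):
        """Update the counters the dashboard would display."""
        with self._lock:
            self.crawled += 1
            self.last_ten = (self.last_ten + [url])[-10:]


def worker(urls, stats):
    for url in urls:
        if robots.can_fetch("*", url) and stats.claim(url):
            stats.record(url)     # a real worker would fetch and parse here


urls = [
    "http://example.com/",
    "http://example.com/news",
    "http://example.com/private/x",   # blocked by the Disallow rule
]
stats = CrawlStats()
threads = [threading.Thread(target=worker, args=(urls, stats)) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(stats.crawled, robots.site_maps())
```

Checking and inserting into the visited set inside one locked `claim` call is what prevents two threads from both deciding a URL is new.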

Sitemaps
- Start with these 2 robots.txt & sitemaps ( and )
- For the CNN.com sitemap, ignore URLs > 2 months old
- For the sportsillustrated sitemap, ignore non-NBA related URLs
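Both filters can be sketched in Python with the standard XML parser. The sample sitemap below is invented, "2 months" is approximated as 60 days, and treating "/nba/" in the URL path as the marker for NBA pages is an assumption; check the real sitemaps before relying on either. Sitemap index files, which point to further sitemaps, are not handled here.

```python
import xml.etree.ElementTree as ET
from datetime import date, timedelta
from urllib.parse import urlparse

# Namespace defined by the sitemaps.org protocol.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def recent_urls(sitemap_xml, today, max_age_days=60):
    """Return <loc> values whose <lastmod> is within max_age_days of today."""
    cutoff = today - timedelta(days=max_age_days)
    root = ET.fromstring(sitemap_xml)
    urls = []
    for entry in root.findall("sm:url", NS):
        loc = entry.findtext("sm:loc", namespaces=NS)
        lastmod = entry.findtext("sm:lastmod", namespaces=NS)
        if not loc or not lastmod:
            continue                      # skip entries we cannot date
        # lastmod may carry a time and zone; the first 10 chars are YYYY-MM-DD.
        modified = date.fromisoformat(lastmod.strip()[:10])
        if modified >= cutoff:
            urls.append(loc.strip())
    return urls


def nba_only(urls):
    """Keep URLs whose path contains /nba/ (assumed marker for NBA pages)."""
    return [u for u in urls if "/nba/" in urlparse(u).path]


SAMPLE = """\
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://example.com/new</loc><lastmod>2014-05-01T08:00:00Z</lastmod></url>
  <url><loc>http://example.com/old</loc><lastmod>2013-11-20</lastmod></url>
  <url><loc>http://example.com/nba/game</loc><lastmod>2014-04-15</lastmod></url>
</urlset>
"""
recent = recent_urls(SAMPLE, today=date(2014, 5, 10))
print(recent)
print(nba_only(recent))
```

Passing `today` explicitly instead of calling `date.today()` inside the function keeps the cutoff reproducible when testing.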

Extra Credit
- [10pts] Multi-threaded crawler
- [10pts] Crawl & index HTML body text (remove HTML tags)*
- [10pts] Graphical dashboard (shows stats over time)
- [5pts] Crawl more root domains (imdb, forbes, bbc, espn, Wikipedia; 1 pt per domain)
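For the body-text item, one possible approach is to run the page through an HTML parser and keep only the text nodes inside <body>, skipping script and style content. The Python sketch below uses the standard library's `html.parser`; the class and function names are hypothetical, and it joins adjacent text nodes with a space, which can split words that straddle an inline tag.

```python
from html.parser import HTMLParser


class BodyTextExtractor(HTMLParser):
    """Collects text inside <body>, ignoring <script> and <style> content."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._in_body = False
        self._skip_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "body":
            self._in_body = True
        elif tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "body":
            self._in_body = False
        elif tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._in_body and not self._skip_depth:
            self._chunks.append(data)

    def text(self):
        # Join the pieces and collapse runs of whitespace to single spaces.
        return " ".join(" ".join(self._chunks).split())


def body_text(html):
    """Return the visible body text of an HTML document, tags removed."""
    parser = BodyTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


PAGE = (
    "<html><head><title>T</title><style>p{}</style></head>"
    "<body><h1>NBA Finals</h1><script>var x = 1;</script>"
    "<p>Spurs beat Heat &amp; win.</p></body></html>"
)
print(body_text(PAGE))
```

A parser is generally safer than a regular expression for this job because it also decodes entities such as `&amp;` and copes with tags that span lines.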

Questions?