WEB SCIENCE: SEARCHING THE WEB
Basic Terms
- Search engine: Software that finds information on the Internet or World Wide Web
- Web crawler: An automated program that surfs the web and indexes and/or copies the website
  - Also known as bots, web spiders, web robots
- Meta-tag: Extra information that tags the HTML document
- Hyperlink or link: A reference/link to another web page
How do you evaluate a search engine?
- Time taken to return results
- Number of results
- Quality of results
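The three criteria above can be measured for a single query. The following is a minimal sketch under stated assumptions: `toy_search`, the example URLs, and the hand-labeled `relevant` set are all hypothetical, and "quality" is approximated here as precision (the fraction of returned results that are relevant), which is only one possible definition.

```python
import time

def evaluate(search, query, relevant):
    """Return (seconds taken, number of results, precision) for one query."""
    start = time.perf_counter()
    results = search(query)
    elapsed = time.perf_counter() - start
    # Assumption: quality is measured as precision against a hand-labeled set.
    hits = sum(1 for url in results if url in relevant)
    precision = hits / len(results) if results else 0.0
    return elapsed, len(results), precision

def toy_search(query):
    """Hypothetical engine: return every page whose text contains the query word."""
    pages = {
        "a.example/cats": "cats and kittens",
        "b.example/dogs": "dogs and cats",
        "c.example/cars": "cars",
    }
    return [url for url, text in pages.items() if query in text.split()]

elapsed, count, precision = evaluate(toy_search, "cats", relevant={"a.example/cats"})
print(count, precision)  # 2 0.5
```

Two pages match "cats" but only one is labeled relevant, so the precision is 0.5. A realistic evaluation would average over many queries.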
How does a web crawler work?
1. Start at a webpage
2. Download the HTML content
3. Search for the HTML link tags
4. Repeat steps 2-3 for each of the links
5. When a website has been completely indexed, load and crawl other websites
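The steps above can be sketched in a few lines of Python. This is an illustration, not a production crawler: `FAKE_WEB` is a hypothetical in-memory stand-in for the Internet so the example runs without network access, and `fetch`, `LinkParser`, and `crawl` are names made up for this sketch. A real crawler would download pages over HTTP and respect each site's crawling rules.

```python
from collections import deque
from html.parser import HTMLParser

# Hypothetical in-memory "web": URL -> HTML content.
FAKE_WEB = {
    "http://site.example/": '<a href="http://site.example/about">About</a> '
                            '<a href="http://site.example/news">News</a>',
    "http://site.example/about": '<a href="http://site.example/">Home</a>',
    "http://site.example/news": '<a href="http://other.example/">Other</a>',
    "http://other.example/": "<p>No links here</p>",
}

class LinkParser(HTMLParser):
    """Collects the href value of every <a> tag (step 3)."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def fetch(url):
    """Step 2: 'download' the HTML (here, a dictionary lookup)."""
    return FAKE_WEB.get(url, "")

def crawl(start_url):
    """Steps 1-5: visit pages breadth-first, following each link once."""
    visited = []
    seen = {start_url}
    queue = deque([start_url])
    while queue:
        url = queue.popleft()
        html = fetch(url)
        visited.append(url)
        parser = LinkParser()
        parser.feed(html)
        for link in parser.links:
            if link not in seen:  # avoid crawling the same page twice
                seen.add(link)
                queue.append(link)
    return visited

order = crawl("http://site.example/")
print(order)
```

The `seen` set matters: without it, the link from the About page back to the home page would send the crawler into an endless loop.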
Parallel Web Crawling
- Speed up your web crawling by running on multiple computers at the same time (i.e. parallel computing)
- How often should you crawl the entire Internet?
- How many copies of the Internet should you keep?

What are the different ways to index a webpage?
- Meta keywords
- Content
- Page rank (# links to page)
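The idea of crawling in parallel can be sketched on a single machine. Note the substitution: the slide describes multiple computers, while this sketch uses a thread pool as a stand-in so it stays runnable. `FAKE_WEB`, `fetch_links`, and `parallel_crawl` are hypothetical names, and the regular expression is a crude link extractor used only to keep the example short.

```python
import re
from concurrent.futures import ThreadPoolExecutor

# Hypothetical in-memory "web": URL -> HTML content.
FAKE_WEB = {
    "http://site.example/": '<a href="http://site.example/about">About</a> '
                            '<a href="http://site.example/news">News</a>',
    "http://site.example/about": '<a href="http://site.example/">Home</a>',
    "http://site.example/news": '<a href="http://other.example/">Other</a>',
    "http://other.example/": "<p>No links here</p>",
}

def fetch_links(url):
    """Download one page and return (url, links found on it)."""
    html = FAKE_WEB.get(url, "")
    return url, re.findall(r'href="([^"]+)"', html)

def parallel_crawl(start_url, workers=4):
    """Crawl level by level; all pages in one level are fetched concurrently."""
    seen = {start_url}
    frontier = [start_url]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        while frontier:
            next_frontier = []
            # pool.map runs fetch_links on the whole frontier in parallel.
            for _url, links in pool.map(fetch_links, frontier):
                for link in links:
                    if link not in seen:
                        seen.add(link)
                        next_frontier.append(link)
            frontier = next_frontier
    return seen

pages = parallel_crawl("http://site.example/")
print(len(pages))  # 4
```

Only the downloads run in parallel. The bookkeeping of which pages have been seen stays in one place, which is the part that becomes hard when the work is spread across many computers.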
Basic Search Engine Algorithm
1. Crawl the Internet
2. Save meta keywords for every page
3. Save the content and popular words on the page
4. When somebody needs to find something, search for matching keywords or content words

Problem: Nothing stops you from inserting your own keywords or content that do not relate to the page’s *actual* content
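A minimal sketch of this algorithm, assuming the crawl has already produced the hypothetical `PAGES` dictionary below (the site names, keywords, and content are invented for illustration). It also shows the problem: a page can list keywords that have nothing to do with its content.

```python
from collections import defaultdict

# Hypothetical crawl output: URL -> meta keywords and page content.
PAGES = {
    "cats.example": {"keywords": ["cats", "pets"], "content": "cats are popular pets"},
    "dogs.example": {"keywords": ["dogs", "pets"], "content": "dogs are loyal pets"},
    # This page stuffs unrelated keywords into its meta tags.
    "spam.example": {"keywords": ["cats", "dogs", "pets"], "content": "buy cheap watches"},
}

def build_index(pages):
    """Steps 2-3: map every keyword and content word to the pages containing it."""
    index = defaultdict(set)
    for url, page in pages.items():
        for word in page["keywords"] + page["content"].lower().split():
            index[word].add(url)
    return index

def search(index, query):
    """Step 4: return all pages that match the query word."""
    return sorted(index.get(query.lower(), set()))

index = build_index(PAGES)
print(search(index, "cats"))  # ['cats.example', 'spam.example']
```

The search for "cats" returns `spam.example` even though that page is about watches, because the index trusts whatever keywords the page declares.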
PageRank Algorithm
1. Crawl the Internet
2. Save the content and index the contents’ popular words
3. Identify the links on the page
4. Each link to an already indexed page increases the PageRank of that linked page
5. When somebody needs to find something, search for matching keywords or content words, BUT rank the search results according to PageRank

Problem: Someone can create a bunch of websites that all link to a single specific page
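The simplified version described on this slide, where each inbound link adds one to a page's score, can be sketched as follows. This is a link-count approximation, not the full PageRank algorithm, which also weights each link by the rank of the page it comes from. `LINKS`, `link_scores`, and `rank` are hypothetical names, and the `farm` pages illustrate the problem above (often called a link farm).

```python
from collections import Counter

# Hypothetical crawl output: page -> pages it links to.
LINKS = {
    "a.example": ["b.example", "c.example"],
    "b.example": ["c.example"],
    "c.example": ["a.example"],
    "d.example": ["c.example"],
}

def link_scores(links):
    """Step 4 (simplified): score = number of distinct pages linking to a page."""
    scores = Counter({page: 0 for page in links})
    for source, targets in links.items():
        for target in set(targets):  # count each linking page once
            if target != source:     # ignore self-links
                scores[target] += 1
    return scores

def rank(matches, scores):
    """Step 5: order matching pages by score, highest first (ties by name)."""
    return sorted(matches, key=lambda page: (-scores[page], page))

matches = ["a.example", "b.example", "c.example"]
scores = link_scores(LINKS)
print(rank(matches, scores))  # ['c.example', 'a.example', 'b.example']

# The problem: five throwaway sites that all link to b.example.
farm = {f"farm{i}.example": ["b.example"] for i in range(5)}
gamed = link_scores({**LINKS, **farm})
print(rank(matches, gamed))   # ['b.example', 'c.example', 'a.example']
```

With plain link counting, five cheap pages are enough to push `b.example` to the top. Weighting links by the importance of the linking page makes this harder, though not impossible.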
Shallow Web vs. Deep Web
- Shallow web
  - Websites and content that are easily visible to “dumb search engines”
  - Content publicly links to other content
  - Shallow web content tends to be static content (unchanging)
- Deep web
  - Websites and content that tend to be dynamic and/or unlinked
  - Private web sites
  - Unlinked content
  - Smarter search engines can crawl the deep web
Search Engine Optimization (SEO)
- Meta keywords
  - Words that relate to your content
- Human-readable URLs
  - i.e. avoid complicated, dynamically created URLs
- Links to your page on other websites
- Page visits
- Others?

White hat vs. black hat SEO
- White hats are the good guys. When would they be used?
- Black hats are the bad guys. When would they be used?
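One way to get human-readable URLs is to derive the URL path from the page title. The sketch below is one possible approach, not a standard: `slugify` is a hypothetical helper and `blog.example` is a placeholder domain.

```python
import re

def slugify(title):
    """Turn a page title into a lowercase, hyphen-separated URL path segment."""
    slug = title.lower()
    # Replace every run of characters that are not a-z or 0-9 with one hyphen.
    slug = re.sub(r"[^a-z0-9]+", "-", slug)
    return slug.strip("-")

slug = slugify("How Does a Web Crawler Work?")
print(f"https://blog.example/{slug}")  # https://blog.example/how-does-a-web-crawler-work
```

Compare this with a dynamically created URL such as `https://blog.example/page?id=4821&ref=x`, which tells neither a reader nor a search engine what the page is about.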
Search Engine Design
- Assumptions are key to design!
- Major problems in older search engines:
  - People gamed the search results
  - Results were not tailored to the user
- What assumptions does a typical search engine make now? (i.e. what factors influence search today?)