Week 12 - Wednesday CS 113.


Last time What did we talk about last time? Exam 2 post mortem Python mistakes How search engines work

Questions?

Project 4

Final Project

How any search engine works Gather information Keep copies Build an index Understand the query Determine the relevance of each result Rank the relevant results Present the results

Gather information First, a search engine has to get information about all the data it is going to index Most search engines use spiders (also called web crawlers) These programs constantly visit pages on the Internet Some domain-specific search engines only visit certain kinds of pages (like law or medicine) Even Google can't visit everything, and it can only visit pages so often Sites with logins usually cannot be visited A file called robots.txt can ask spiders not to visit a site
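The robots.txt check described above can be sketched with Python's standard `urllib.robotparser` module. The rules and URLs below are invented for illustration; a real spider would fetch robots.txt from the site it is about to crawl.

```python
from urllib.robotparser import RobotFileParser

# robots.txt is a plain-text file of rules; a spider parses it and
# checks each URL before fetching. These rules are a made-up example
# that asks all spiders to stay out of /private/.
rules = """
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

print(parser.can_fetch("ExampleBot", "http://example.com/private/page.html"))  # False
print(parser.can_fetch("ExampleBot", "http://example.com/public.html"))        # True
```

A polite crawler runs a check like this before every visit, which is how a site "asks spiders not to visit" without any enforcement mechanism.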

Keep copies Getting the information is the first step Google actually stores a copy of a huge amount of information on the Internet Called caching Mostly text But they provide image-based search tools too Search engines do this so they can do analysis on the pages (but also because they can only visit them so often) Caching allows users to see "deleted" web pages The Wayback Machine is devoted to such viewing

Build an index For searches to be fast, a search engine has to organize the data At Google, there are huge tables of keywords and of websites How many websites exist in the world? 644 million active websites in 2012, according to Business Insider; 50 billion pages are indexed by Google They're essentially organized in alphabetical order so that Google can jump to the right part of its index to find what is needed
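The table of keywords described above is usually called an inverted index: for each word, the set of pages that contain it. A minimal sketch, with invented page contents, using `bisect` to show the "jump to the right part of a sorted index" idea:

```python
import bisect

# Toy inverted index: map each word to the set of pages containing it.
# Page URLs and contents are invented for illustration.
pages = {
    "page1": "search engines index the web",
    "page2": "spiders crawl the web",
}

index = {}
for url, text in pages.items():
    for word in text.split():
        index.setdefault(word, set()).add(url)

# Keeping keywords in sorted order lets the engine binary-search to the
# right entry instead of scanning the whole index.
keywords = sorted(index)
pos = bisect.bisect_left(keywords, "web")
print(keywords[pos], sorted(index["web"]))  # web ['page1', 'page2']
```

A real index also stores word positions and counts per page, which the later relevance step uses.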

Understand the query The prior steps must happen before you can make a query Once you make a query, the servers at Google have to figure out what you're asking Most queries work statistically and don't depend on the rules for English sentences There are special symbols that can be used for Google searches Quotes for an exact phrase: "eggplant stew" A tilde will give you synonyms: ~hot A minus sign will exclude results: "mail order" -bride The site: specifier searches only a particular place: wombats site:etown.edu
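Two of the operators above, plain AND-ed terms and the minus-sign exclusion, can be illustrated with set operations over an inverted index. The index below is invented, and this is only a sketch of the idea, not how Google's query engine actually works:

```python
# Toy query evaluation: plain terms are ANDed (set intersection) and a
# leading "-" excludes a term (set difference). Index contents invented.
index = {
    "mail":  {"page1", "page2"},
    "order": {"page1", "page2"},
    "bride": {"page2"},
}

def run_query(query, index, all_pages):
    results = set(all_pages)
    for term in query.split():
        if term.startswith("-"):
            results -= index.get(term[1:], set())  # exclusion
        else:
            results &= index.get(term, set())      # must contain term
    return results

print(run_query("mail order -bride", index, {"page1", "page2"}))  # {'page1'}
```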

Determine the relevance A good search engine needs to figure out how relevant each page is to your query If the page contains many repetitions of your search terms, it is probably more relevant There are sophisticated methods that involve context and semantics A query for philadelphia eagles is associated with football in Google But what if you're searching for species of eagles that live around Philadelphia?
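The simplest relevance signal mentioned above, counting repetitions of the search terms, can be sketched in a few lines. The page texts are invented:

```python
# Score each page by how many times the query terms appear in it.
# Page contents are invented for illustration.
pages = {
    "page1": "eagles eagles football philadelphia eagles",
    "page2": "bald eagles live near philadelphia rivers",
}

def relevance(query, text):
    words = text.split()
    return sum(words.count(term) for term in query.split())

scores = {url: relevance("philadelphia eagles", text) for url, text in pages.items()}
print(scores)  # {'page1': 4, 'page2': 2}
```

Pure term counting is exactly why the eagles-the-birds query is hard: both pages score well, and only context and semantics can tell football apart from ornithology.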

Rank the results Google can find a long list of relevant pages, but which one goes first? The secret to Google's initial success was its PageRank algorithm Their rankings were significantly better than other rankings at the time It has evolved over time, but the heart of the PageRank algorithm is looking to see how many other pages link to a page Other metrics such as the relevance of pages linked to, quality of spelling, frequency of updates, and many more are useful
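The heart of PageRank, a page's score depending on how many (and which) pages link to it, can be sketched as an iteration over a tiny invented link graph. This is a minimal sketch of the classic formulation with a damping factor, not Google's production algorithm:

```python
# Invented link graph: each page maps to the pages it links to.
links = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
}

damping = 0.85
ranks = {page: 1 / len(links) for page in links}

for _ in range(50):  # iterate until the scores settle
    new_ranks = {}
    for page in links:
        # A page's incoming score is the sum of each linker's rank,
        # split evenly among that linker's outgoing links.
        incoming = sum(ranks[p] / len(links[p]) for p in links if page in links[p])
        new_ranks[page] = (1 - damping) / len(links) + damping * incoming
    ranks = new_ranks

print(max(ranks, key=ranks.get))  # C  (most incoming link weight)
```

C is linked to by both A and B, so it ends up ranked highest, which is the whole idea: links act as votes.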

Present the results This step is relatively boring compared to the others Most search engines present the data in a list They have been adding various bells and whistles Previews when you hover over results Image-based sorting results

Who pays? Users could pay: this model is used for subscriber databases like the journal indexes available at our library, and it was part of the model for providers like AOL and CompuServe Websites could pay: there are ethical problems here, and sponsored links are Google's solution Government could pay: but it usually doesn't pay directly Advertisers could pay: just like TV, ads pay for most of the Internet

Other issues There are things that Google doesn't index For technical reasons: the spiders can't gather information from certain systems or file types It is possible to ask spiders not to index your page Google chooses not to index some pages There is this idea of the "deep web" that is not as easy to search Google is a big popularity contest Countries ban sites: China censored the Chinese version of Google, though Google eventually stopped agreeing to be censored

Tracking searches Google tracks the most popular searches over time Twitter does something similar with trending tweets In terms of privacy, it is possible to track your searches and make inferences about you Even without knowing what computer you're at, a lot of personal information can be recovered from searches

Are there downsides to freely searchable information? You have grown up with the Internet Free speech and widely disseminated ideas are central both to democracy and scientific advancement But are there downsides to information being so easy to get? If you don't find it, you might believe it doesn't exist Lies and distortions spread as quickly as truth Whoever controls the searching controls information People may not try to think through the answer on their own before searching for it

Big Data Most of this lecture is taken from a Big Data talk by Aaron Gember at Marquette University in 2012

Big data There is a huge amount of data out there that computer scientists are trying to wrangle We'll look at: Example problems Challenges that must be overcome A hands-on approach to "big" data right in this room Paradigms for tackling big data

Example Problems

Problem: Internet search We discussed this problem last time How can a search provider like Google or Bing search through billions of webpages and produce a list of results in less than a second? Gather information Keep copies Build an index Understand the query Determine the relevance of each result Rank the relevant results Present the results

Problem: Climate and weather analysis Analyze current and historical weather data Sensor readings from thousands of locations Satellite and radar images Geographic features Visualize predictions for many audiences People watching the weather report Climate scientists

Problem: Netflix recommendations Recommend movies from Netflix’s collection Netflix has data from around 75 million subscribers Accuracy of predictions impacts subscriptions

Problem: Netflix recommendations Many factors can influence viewing behavior Movie characteristics: cast, year, genre, duration Personal history: movies watched, queue Social: ratings, reviews Recommendations include categories and movies, presented in a specific order

Challenges

Challenge: Collection Where does the data come from? Input can be from: Humans Instruments and sensors Historical data Data from different sources could have different levels of accuracy and different formats Some process is needed to move the data from the collection point to the repository

Challenge: Organization Now that you've got the data, how do you organize it? It may need to be labeled or categorized Either manually or automatically with programs There may be relationships between different pieces of data that can be found or noted Inaccurate or bad data should be thrown out

Challenge: Storage How do we store large amounts of data? Need space for hundreds of terabytes (TBs) of data Remember that a modern hard drive stores about 1 TB of data Data needs to be efficiently accessed by servers doing computation High speed networks must connect all the hard drives The system needs fault tolerance Single hard drives can fail without any data being lost As replacement drives are added, data is automatically backed up
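The fault-tolerance point above, no data lost when a single drive fails, comes from storing every block on more than one drive. A toy sketch with invented drive and block names:

```python
import random

# Toy replication: store every block on 2 of the 3 drives, so losing
# any single drive loses no data. Drive and block names are invented.
drives = {"drive1": set(), "drive2": set(), "drive3": set()}

def store(block, copies=2):
    # Pick which drives hold this block; a real system also balances load.
    for drive in random.sample(sorted(drives), copies):
        drives[drive].add(block)

for block in ["blockA", "blockB", "blockC"]:
    store(block)

# Simulate drive1 failing: every block still survives somewhere else.
surviving = set().union(*(blocks for name, blocks in drives.items() if name != "drive1"))
print({"blockA", "blockB", "blockC"} <= surviving)  # True
```

When a replacement drive is added, the system re-copies any blocks that are down to a single replica, which is the "automatically backed up" behavior described above.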

Example: Google data centers Google has data centers all over the world 8 in North America 1 in South America 2 in Asia 4 in Europe It invests hundreds of millions to over a billion dollars in each one It's estimated that they have over 1 million servers They have proprietary networking systems Of course, they are relatively secretive about the details It's incredibly loud inside!

Challenge: Computation How do we get the information we want out of the data? We design algorithms to process the data As you know, some problems are hard to solve, even with fast computers They may take too much time or space to run And it is not obvious how to come up with the best algorithm Netflix had a contest for the best recommendation algorithm

Challenge: Visualization Let's say you've solved all the other challenges and gotten some great answers from your data How do you visualize the answers? The most important data should be highlighted somehow Viewing relationships may be important Several different related visualizations may be useful

Big Data Activity

Big data activity Let's divide the class into four groups of 5 or 6 people each Count how many times each unique word occurs in Dr. Seuss's One Fish Two Fish Red Fish Blue Fish We're actually only doing about 1/7 of the book Ignore differences in case Try for speed and accuracy Go!

Fishy results Who held what data? How was data passed? What algorithm did each person execute? How was the final result obtained? How did you present the final result?

Big Data Paradigms

Paradigm: MapReduce Leverage parallelization Divide analysis into two parts Map task: Given a subset of the data, extract relevant data and obtain partial results Reduce task: Receive partial results from each map task and combine into a final result Your One Fish Two Fish counting probably followed the idea of MapReduce
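The One Fish Two Fish activity maps directly onto this structure. A minimal sketch in Python, with invented text chunks standing in for each group's portion of the book:

```python
from collections import Counter
from functools import reduce

# Each chunk stands in for one group's portion of the text.
chunks = [
    "one fish two fish",
    "red fish blue fish",
]

def map_task(chunk):
    # Each map task counts words in its own chunk (a partial result).
    return Counter(chunk.split())

def reduce_task(total, partial):
    # The reduce task merges partial counts; Counter addition sums them.
    return total + partial

partials = [map_task(chunk) for chunk in chunks]   # could run in parallel
counts = reduce(reduce_task, partials, Counter())
print(counts["fish"])  # 4
```

The map tasks are independent, which is exactly what lets a real MapReduce system spread them across many machines.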

Paradigm: MapReduce Used for Internet search Map task: given a part of the index, identify pages containing keywords and calculate relevance Reduce task: rank pages based on relevance Infrastructure requirements Many machines to run map tasks in parallel Ability to retrieve and store data Coordination of who does what MapReduce is at the heart of many of Google's services

Paradigm: Cloud Computing Large collections of processing and storage resources used on demand Sell resources such as processors and gigabytes of storage for some period of time Benefits for users Only pay for what you use 100 servers at $1/hour for 1 hour = $100 1 server at $1/hour for 100 hours = $100 Externally managed Benefits for cloud providers Economies of scale for space and equipment

Paradigm: Cloud Computing Several different models Infrastructure as a service Virtual machines can be accessed by the user The user has complete control of OS and applications Platform as a service An OS, programming language execution environment, database, and web server are provided The user can write any code for it Software as a service Software is made available to the user The user can use it without installing applications or storing data locally

Paradigm: Data mining Identify patterns and relationships in data Used to rank, categorize, and so on Commonly associated with artificial intelligence and machine learning We talked about data mining before, but it's important to mention it in the context of Big Data

Paradigm: Visualization Visualization might be one of the harder aspects of Big Data to deal with There's no "one size fits all" approach Conventional: line, bar, pie charts Alternative: bubble chart, tree map Text: tag cloud, word tree

Quiz

Upcoming

Next time… Objects in Python Lab 12

Reminders Read Python Chapter 10 Finish Project 4 Due tomorrow! Next class is Thursday, tomorrow