Week 12 - Wednesday CS 113
Last time What did we talk about last time? Exam 2 post mortem Python mistakes How search engines work
Questions?
Project 4
Final Project
How any search engine works Gather information Keep copies Build an index Understand the query Determine the relevance of each result Rank the relevant results Present the results
Gather information First, a search engine has to get information about all the data it is going to index Most search engines use spiders (also called web crawlers) These programs constantly visit pages on the Internet Some domain-specific search engines only visit certain kinds of pages (like law or medicine) Even Google can't visit everything, and it can only visit pages so often Sites with logins usually cannot be visited A file called robots.txt can ask spiders not to visit a site
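As a rough sketch of how a polite spider might honor robots.txt, the example below uses Python's standard urllib.robotparser module. The robots.txt contents, the spider name ExampleSpider, and the example.com URLs are made up for illustration; a real spider would download the file from the site it is visiting.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents, invented for this example.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def allowed(url, user_agent="ExampleSpider"):
    """Return True if the rules above let this spider visit the URL."""
    parser = RobotFileParser()
    # parse() takes the lines of a robots.txt file that was already read.
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)


print(allowed("http://example.com/index.html"))
print(allowed("http://example.com/private/grades.html"))
```

Note that robots.txt is only a request: nothing in the file physically stops a badly behaved spider from visiting the page anyway.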
Keep copies Getting the information is the first step Google actually stores a copy of a huge amount of information on the Internet Called caching Mostly text But they provide image-based search tools too Search engines do this so they can do analysis on the pages (but also because they can only visit them so often) Caching allows users to see "deleted" web pages The Wayback Machine is devoted to such viewing
Build an index For searches to be fast, a search engine has to organize the data At Google, there are huge tables of keywords and of websites How many websites exist in the world? 644 million active websites in 2012 according to Business Insider 50 billion pages are indexed by Google They're essentially organized in alphabetical order so that Google can jump to the right part of its index to find what is needed
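The idea of an alphabetical index can be sketched in a few lines of Python. This is a toy illustration, not how Google actually stores its index: the three pages and the function names build_index and lookup are assumptions made up for the example. Keeping the keywords sorted lets the lookup use binary search (the standard bisect module) to jump to the right spot instead of scanning every keyword.

```python
from bisect import bisect_left

# Hypothetical tiny "web": page name -> page text.
pages = {
    "a.html": "red fish blue fish",
    "b.html": "eggplant stew recipe",
    "c.html": "blue eggplant",
}


def build_index(pages):
    """Map each keyword to the sorted list of pages that contain it."""
    index = {}
    for name, text in pages.items():
        for word in set(text.lower().split()):
            index.setdefault(word, []).append(name)
    # Keep keywords in alphabetical order so lookups can use binary search.
    keywords = sorted(index)
    postings = [sorted(index[word]) for word in keywords]
    return keywords, postings


def lookup(keywords, postings, word):
    """Jump to the word's position with binary search instead of scanning."""
    i = bisect_left(keywords, word)
    if i < len(keywords) and keywords[i] == word:
        return postings[i]
    return []


keywords, postings = build_index(pages)
print(lookup(keywords, postings, "blue"))
```

With billions of pages, the index is built ahead of time so that only the fast lookup step happens when a user types a query.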
Understand the query The prior steps must happen before you can make a query Once you make a query, the servers at Google have to figure out what you're asking Most queries work statistically and don't depend on the rules for English sentences There are special symbols that can be used for Google searches Quotes for an exact phrase: "eggplant stew" A tilde will give you synonyms: ~hot A minus sign will exclude results: "mail order" -bride The site: specifier searches only a particular place: wombats site:etown.edu
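To make the special symbols concrete, here is a minimal sketch of how a program might pull a query apart. The function name parse_query and the dictionary layout are assumptions for this example, and it only handles quotes, the minus sign, and site:. A tilde term such as ~hot is simply kept as an ordinary word here, since finding synonyms needs much more than string handling.

```python
import shlex


def parse_query(query):
    """Split a query into included terms, excluded terms, and a site."""
    parsed = {"include": [], "exclude": [], "site": None}
    # shlex.split keeps a quoted phrase together as a single token.
    for token in shlex.split(query):
        if token.startswith("site:"):
            parsed["site"] = token[len("site:"):]
        elif token.startswith("-") and len(token) > 1:
            parsed["exclude"].append(token[1:])
        else:
            parsed["include"].append(token)
    return parsed


print(parse_query('"mail order" -bride'))
print(parse_query("wombats site:etown.edu"))
```

One known limitation of this sketch: shlex.split raises an error on an unmatched quote or apostrophe, so a real query parser would need to be more forgiving.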
Determine the relevance A good search engine needs to figure out how relevant each page is to your query If the page contains many repetitions of your search terms, it is probably more relevant There are sophisticated methods that involve context and semantics A query for philadelphia eagles is associated with football in Google But what if you're searching for species of eagles that live around Philadelphia?
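The simplest relevance measure mentioned above, counting repetitions of the search terms, might look like the sketch below. The function name relevance and the two sample sentences are made up for illustration. Real search engines use far more sophisticated scoring.

```python
def relevance(page_text, query):
    """Score a page by how many times the query terms appear in it."""
    words = page_text.lower().split()
    return sum(words.count(term) for term in query.lower().split())


# Hypothetical page contents.
football = "The Philadelphia Eagles won and Eagles fans cheered"
birds = "Bald eagles nest along the river near Philadelphia"

print(relevance(football, "philadelphia eagles"))
print(relevance(birds, "philadelphia eagles"))
```

Both pages get a nonzero score, and the football page scores higher. Counting words alone cannot tell which meaning of "eagles" the user had in mind, which is why context and semantics matter.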
Rank the results Google can find a long list of relevant pages, but which one goes first? The secret to Google's initial success was its PageRank algorithm Their rankings were significantly better than other rankings at the time It has evolved over time, but the heart of the PageRank algorithm is looking to see how many other pages link to a page Other metrics such as the relevance of pages linked to, quality of spelling, frequency of updates, and many more are useful
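The core idea of counting how many other pages link to a page can be sketched as follows. This is a deliberate simplification and not the actual PageRank algorithm, which also weights each link by the rank of the page it comes from. The link graph and the function name rank_by_inbound_links are assumptions for the example.

```python
from collections import Counter

# Hypothetical link graph: page -> pages it links to.
links = {
    "a": ["b", "c"],
    "b": ["c"],
    "c": ["a"],
    "d": ["c"],
}


def rank_by_inbound_links(links):
    """Order pages from most to fewest inbound links (ties alphabetical)."""
    counts = Counter({page: 0 for page in links})
    for source, targets in links.items():
        for target in set(targets):
            # A page linking to itself should not boost its own rank.
            if target != source:
                counts[target] += 1
    return sorted(counts, key=lambda page: (-counts[page], page))


print(rank_by_inbound_links(links))
```

Page c comes first because three other pages link to it, while d comes last because nothing links to it.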
Present the results This step is relatively boring compared to the others Most search engines present the data in a list They have been adding various bells and whistles Previews when you hover over results Image-based sorting of results
Who pays? Users could pay This model is used for subscriber databases like the journal indexes available at our library It was part of the model for providers like AOL and CompuServe Websites could pay There are ethical problems here Sponsored links are Google's solution Government could pay But usually doesn't pay directly Advertisers could pay Just like TV, ads pay for most of the Internet
Other issues There are things that Google doesn't index For technical reasons: The spiders can't gather information from certain systems or file types It is possible to ask spiders not to index your page Google chooses not to index some pages There is this idea of the "deep web" that is not as easy to search Google is a big popularity contest Countries ban sites China censored the Chinese version of Google Google eventually stopped agreeing to be censored
Tracking searches Google tracks the most popular searches over time Twitter does something similar with trending tweets In terms of privacy, it is possible to track your searches and make inferences about you Even without knowing what computer you're at, a lot of personal information can be recovered from searches
Are there downsides to freely searchable information? You have grown up with the Internet Free speech and widely disseminated ideas are central both to democracy and scientific advancement But are there downsides to information being so easy to get? If you don't find it, you might believe it doesn't exist Lies and distortions spread as quickly as truth Whoever controls the searching controls information People may not try to think through the answer on their own before searching for it
Big Data Most of this lecture is taken from a Big Data talk by Aaron Gember at Marquette University in 2012
Big data There is a huge amount of data out there that computer scientists are trying to wrangle We'll look at: Example problems Challenges that must be overcome A hands-on approach to "big" data right in this room Paradigms for tackling big data
Example Problems
Problem: Internet search We discussed this problem last time How can a search provider like Google or Bing search through billions of webpages and produce a list of results in less than a second? Gather information Keep copies Build an index Understand the query Determine the relevance of each result Rank the relevant results Present the results
Problem: Climate and weather analysis Analyze current and historical weather data Sensor readings from thousands of locations Satellite and radar images Geographic features Visualize predictions for many audiences People watching the weather report Climate scientists
Problem: Netflix recommendations Recommend movies from Netflix’s collection Netflix has data from around 75 million subscribers Accuracy of predictions impacts subscriptions
Problem: Netflix recommendations Many factors can influence viewing behavior Movie characteristics: cast, year, genre, duration Personal history: movies watched, queue Social: ratings, reviews Recommendations include categories and movies, presented in a specific order
Challenges
Challenge: Collection Where does the data come from? Input can be from: Humans Instruments and sensors Historical data Data from different sources could have different levels of accuracy and different formats Some process is needed to move the data from the collection point to the repository
Challenge: Organization Now that you've got the data, how do you organize it? It may need to be labeled or categorized Either manually or automatically with programs There may be relationships between different pieces of data that can be found or noted Inaccurate or bad data should be thrown out
Challenge: Storage How do we store large amounts of data? Need space for hundreds of terabytes (TBs) of data Remember that a modern hard drive stores about 1 TB of data Data needs to be efficiently accessed by servers doing computation High speed networks must connect all the hard drives The system needs fault tolerance Single hard drives can fail without any data being lost As replacement drives are added, data is automatically backed up
Example: Google data centers Google has data centers all over the world 8 in North America 1 in South America 2 in Asia 4 in Europe It invests hundreds of millions to over a billion dollars in each one It's estimated that they have over 1 million servers They have proprietary networking systems Of course, they are relatively secretive about the details It's incredibly loud inside!
Challenge: Computation How do we get the information we want out of the data? We design algorithms to process the data As you know, some problems are hard to solve, even with fast computers They may take too much time or space to run And it is not obvious how to come up with the best algorithm Netflix had a contest for the best recommendation algorithm
Challenge: Visualization Let's say you've solved all the other challenges and gotten some great answers from your data How do you visualize the answers? The most important data should be highlighted somehow Viewing relationships may be important Several different related visualizations may be useful
Big Data Activity
Big data activity Let's divide the class into four groups of 5 or 6 people each Count how many times each unique word occurs in Dr. Seuss's One Fish Two Fish Red Fish Blue Fish We're actually only doing about 1/7 of the book Ignore differences in case Try for speed and accuracy Go!
Fishy results Who held what data? How was data passed? What algorithm did each person execute? How was the final result obtained? How did you present the final result?
Big Data Paradigms
Paradigm: MapReduce Leverage parallelization Divide analysis into two parts Map task: Given a subset of the data, extract relevant data and obtain partial results Reduce task: Receive partial results from each map task and combine into a final result Your One Fish Two Fish counting probably followed the idea of MapReduce
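The word-counting activity maps naturally onto this idea. The sketch below is a minimal illustration in plain Python, not a real MapReduce framework: the names map_task and reduce_task and the two text chunks are assumptions for the example, and the map tasks run one after another here instead of in parallel on separate machines.

```python
from collections import Counter
from functools import reduce

# Each chunk stands for the part of the book one person was handed.
text_chunks = [
    "One fish two fish",
    "Red fish blue fish",
]


def map_task(chunk):
    """Count the words in one chunk, ignoring differences in case."""
    return Counter(chunk.lower().split())


def reduce_task(total, partial):
    """Combine one partial count into the running total."""
    return total + partial


# Map: every chunk is counted independently of the others.
partial_counts = [map_task(chunk) for chunk in text_chunks]
# Reduce: the partial results are merged into one final answer.
totals = reduce(reduce_task, partial_counts, Counter())

print(totals["fish"])
```

Because each map task only looks at its own chunk, the chunks could be handed to different people or different computers without changing the final result.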
Paradigm: MapReduce Used for Internet search Map task: given a part of the index, identify pages containing keywords and calculate relevance Reduce task: rank pages based on relevance Infrastructure requirements Many machines to run map tasks in parallel Ability to retrieve and store data Coordination of who does what MapReduce is at the heart of many of Google's services
Paradigm: Cloud Computing Large collections of processing and storage resources used on demand Sell resources such as processors and gigabytes of storage for some period of time Benefits for users Only pay for what you use 100 servers at $1/hour for 1 hour = $100 1 server at $1/hour for 100 hours = $100 Externally managed Benefits for cloud providers Economies of scale for space and equipment
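The pricing arithmetic above can be checked with a short calculation. The function name rental_cost is made up for this example, and it assumes every server is billed at the same flat hourly rate.

```python
def rental_cost(servers, dollars_per_hour, hours):
    """Total bill when every server is rented at the same hourly rate."""
    return servers * dollars_per_hour * hours


many_servers = rental_cost(servers=100, dollars_per_hour=1, hours=1)
one_server = rental_cost(servers=1, dollars_per_hour=1, hours=100)

print(many_servers, one_server)  # prints: 100 100
```

The bill is the same either way, but if the work can be split evenly across machines, 100 servers deliver the answer in about 1 hour instead of 100.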
Paradigm: Cloud Computing Several different models Infrastructure as a service Virtual machines can be accessed by the user The user has complete control of OS and applications Platform as a service An OS, programming language execution environment, database, and web server are provided The user can write any code for it Software as a service Software is made available to the user The user can use it without installing applications or storing data locally
Paradigm: Data mining Identify patterns and relationships in data Used to rank, categorize, and so on Commonly associated with artificial intelligence and machine learning We talked about data mining before, but it's important to mention it in the context of Big Data
Paradigm: Visualization Visualization might be one of the harder aspects of Big Data to deal with There's no "one size fits all" approach Conventional: line, bar, pie charts Alternative: bubble chart, tree map Text: tag cloud, word tree
Quiz
Upcoming
Next time… Objects in Python Lab 12
Reminders Read Python Chapter 10 Finish Project 4 Due tomorrow! Next class is Thursday, tomorrow