Download presentation
Presentation is loading. Please wait.
Published byWhitney Elliott Modified over 9 years ago
1
Data Structures & Algorithms and The Internet: A different way of thinking
2
Comparison Sort vs Bucket Sort What is bigger? The list of numbers or the alphabet of possible numbers? O(n log n) vs. O(n+m)
3
Password cracking Is it easier to try to crack one account by trying a million passwords, Or is it easier to try to crack a million accounts by trying the same password?
4
Histograms and large vectors How do search engines represent documents for searching/indexing? As a histogram of words, e.g. a vector of say 100000 entries for each document, containing word counts Term frequency – inverse document frequency: enhances rare words, diminishes common words Math done on vectors to determine similarity / relevance
5
Good Enough is OK Not the fastest algorithm? Put more computers on it. Cache results. Not the best results? Not a big deal if user can try again. The data may be wrong/ambiguous anyway. Britney Spears / Paris Hilton
6
Graphs are everywhere Web is a graph. Page rank algorithm says important/relevant pages are linked to by important/relevant pages. Reverse lookup problem with web links, basically building inverse edges to a directed graph Spam / porn links to each other… look for bad cliques? Social networking. Graph of friends. If you respond well to an ad, maybe your friends will too.
7
Liars, spam, hackers, etc. Algorithms are easier when you can trust your data Is this email spam? Does this website have useful content? Can you make it where if you do N work to detect spammers, they have to do N^2 work to avoid you?
8
Sample Server Architecture Load balancer front1 front2 front3 cache1 cache2 cache3 back1 back2 back3 cache4 back4 back5 back6
9
Denial of Service A little amount of work by a client causes a lot of work for the server, slowing down or blocking out other users Either malicious or accidental Example: fast queries to a search engine that cause cache misses How to catch?
10
Log IP addresses Block addresses causing too many cache misses within 10 sec, 1 min, 1 hr They may be hitting different front ends HashMap of IP addresses to counters? But how many IP addresses?
11
Data structures are too large for one computer Store pieces of data structures across many computers How do you know what is where? One solution is a master computer serving as an index, but this is a single point of failure
12
Partitioning How to divide 4 billion IP addresses among several machines? We need to know deterministically, given an IP, what machine to look it up on? Use a hash function that knocks it down to the four machines, i.e. hash has good entropy on AAA.BBB.CCC.DDD Sometimes can also use a small set of sorted data to divide into ranges that should evenly distribute the larger data
13
Captchkas Computation easy for person but hard for computer, vision-based problems How to stop computers from getting fake email addresses? Make it hard to automatically solve a puzzle required for registration How do spammers bypass? Pay 3 rd -world folks to solve puzzles all day, or offer free porn for solving puzzles
14
Databases Persistent storage for your data Let’s you do complex queries using a query language Can add a layer of abstraction over sets, lists, hashes, etc. Boolean/set logic very important
15
Redundancy and robustness Is your data fine if a computer goes down Is your server fine if a computer goes down Can you undo/revert changes? If something changes somewhere, how does everyone know about it? Average runtime vs guaranteed runtime
16
Scaling to more users/data Is growth as simple as adding another machine? (e.g. linear time / space) If you get dugg, can you handle a burst of users? Can you fail softly?
Similar presentations
© 2025 SlidePlayer.com. Inc.
All rights reserved.