Finding Needles in the Internet Haystack Ron K. Cytron Washington University in Saint Louis Department of Computer Science


1 Finding Needles in the Internet Haystack Ron K. Cytron Washington University in Saint Louis Department of Computer Science http://www.cs.wustl.edu/~cytron/ Century Club May 2002 Roger Chamberlain, Mark Franklin, Ron Indeck, John Lockwood, George Varghese (UCSD) Mahesh Jayaram Thanks: Ben Brodie Center for Distributed Object Computing Department of Computer Science Washington University

2 Outline Computers have come a long way

3 Outline Computers have come a long way Today’s computers are never lonely

4 Outline Computers have come a long way Today’s computers are never lonely Volumes and volumes of data

5 Outline Computers have come a long way Today’s computers are never lonely Volumes and volumes of data Fast searching of magnetic media [Figure labels: needle, needel]

6 Outline Computers have come a long way Today’s computers are never lonely Volumes and volumes of data Fast searching of magnetic media Internet packet filtering

7 Outline Computers have come a long way Today’s computers are never lonely Volumes and volumes of data Fast searching of magnetic media Internet packet filtering Conclusion

8 A Grandchild’s Gift
1966: Cost: $60; Memory: ½ char; Speed: 1 cycle/s; Fails: 10 seconds
1999: Cost: $35; Memory: 16 M chars; Speed: 16 M cycles/s; Fails: 5 years

9 If cars improved that much in 30 years …: $4000; 60,000 miles per hour; seats 10,000 people; gets 20,000 miles per gallon; breaks every 70 years

10 The Haystack The Internet is large and growing Content on the Internet is growing even faster A haystack sits still, but the Internet….

11 Growth of the Internet (why computers aren’t lonely anymore) Y2K Problem (?): More computers sold than TVs

12 Growth of Internet Content (volumes and volumes of data) Anybody can publish Problem is how to find what you want

13 Page 6B. What can tech companies do? Some say they're at a loss, but others offer budding solutions. By Kevin Maney. On July 7, 1940, as the nation edged toward World War II, IBM put out a statement that made headlines. The company offered all its facilities for national defense, ready to convert to making anything the government needed. Other leaders in the electro-mechanical technology of the day -- Ford Motor, General Motors, General Electric -- also threw their weight into defense efforts. They switched from making cars and washing machines to building tanks, aircraft engines and machine guns. So here we are in 2001, readying for another war. The U.S. technology industry is the best and most innovative in the world. It is the nation's pride and joy. Shouldn't it do something? (9/17/2001)

14 ... One possibility is in data-mining technology. Data mining is a way to collect millions of pieces of information in a computer system, sift through that data, make sense of them and come up with something useful. "We (the U.S. tech industry) are experts at data mining and have vast resources of data to mine," says Tom Evslin, CEO of Internet communications company ITXC. "We have used it to target advertising. We can probably use it to identify suspicious activity or potential terrorists." ...

15 Fast searching of magnetic media with Roger Chamberlain, Mark Franklin, Ron Indeck, John Lockwood

16 Enabling Technology: Disk Drives Magnetic disk storage areal density vs. year of IBM product introduction (From D. A. Thompson) Almost 10,000,000x increase in 45 years!

17 Cost per Megabyte Price history of hard disk product vs. year of product introduction (From D. A. Thompson) Cost decreasing 3% per week!
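The two headline figures on slides 16 and 17 can be converted into per-year rates with a little compound-growth arithmetic. The sketch below assumes constant compounding and 52 weeks per year; the function names are illustrative and do not come from the talk.

```python
# Convert the slide figures into per-year rates, assuming constant compounding.

def annual_factor_from_total(total_factor, years):
    """Yearly multiplier that compounds to total_factor over the given years."""
    return total_factor ** (1.0 / years)

def annual_factor_from_weekly(weekly_change, weeks_per_year=52):
    """Yearly multiplier implied by a constant fractional change per week."""
    return (1.0 + weekly_change) ** weeks_per_year

# Slide 16: almost 10,000,000x increase in areal density in 45 years.
density_growth = annual_factor_from_total(10_000_000, 45)
# Slide 17: cost decreasing 3% per week.
cost_factor = annual_factor_from_weekly(-0.03)

print(f"Areal density grows about {100 * (density_growth - 1):.0f}% per year")
print(f"Cost per megabyte falls about {100 * (1 - cost_factor):.0f}% per year")
```

Under these assumptions the density figure works out to a little over 40% growth per year, and the weekly 3% price drop compounds to a fall of roughly four fifths per year.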

18 Massive Storage & Data: The storage industry will ship 4,000,000,000,000,000,000 bytes this year. FedEx generated 14 terabytes of data last year. US intelligence collects data equaling the printed collection of the US library every day!

19 Massive Data Sets: employee records; consumer information; maps/mission/intelligence data; genome maps. Data sets are now measured in terabytes, and are dynamic!

20 Genome Application: Genome maps are expanded daily (Wash U sequencing center; each of us has 80,000 genes found among 3 billion characters of DNA: A, C, G, T). Look for matches: identify function; disease (understand, diagnose, detect, medicine, therapy); biofuels, warfare, toxic waste; understand evolution; forensics, organ donors, authentication; more effective crops, disease resistance

21 DNA String Matching Looking for CACGTTAGT…TAGC Interested in matches and near matches Search human genome and other gene oceans –Need to search entire data sets

22 Bio Computation Problem: *BIG* genome databases [Figure: a DNA pattern is compared against a DNA sequence; match?] Approximate matches are just as useful
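A minimal software sketch of the matching problem on slides 21 and 22: slide a DNA pattern along a sequence and report every position where the two differ in at most a given number of characters. Counting mismatched positions (Hamming distance) is an assumption made here for simplicity; the slides do not say which notion of "near match" is used, and the sequence and pattern below are made up for illustration.

```python
def approximate_matches(sequence, pattern, max_mismatches):
    """Yield (position, mismatches) for each window of sequence that differs
    from pattern in at most max_mismatches positions."""
    m = len(pattern)
    for start in range(len(sequence) - m + 1):
        mismatches = 0
        for a, b in zip(sequence[start:start + m], pattern):
            if a != b:
                mismatches += 1
                if mismatches > max_mismatches:
                    break  # too many differences, give up on this window
        else:
            # The inner loop finished without exceeding the mismatch budget.
            yield start, mismatches

# Hypothetical data, not taken from the slides.
sequence = "ACGTGTACAG"
pattern = "GTAC"
print(list(approximate_matches(sequence, pattern, 0)))
print(list(approximate_matches(sequence, pattern, 2)))
```

Every window of the sequence is examined, which mirrors the slide's point that the entire data set has to be searched.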

23 Finding a needel in a heystuck: DNA and live text can contain errors. We often seek an approximate match, for example for "needle". No match? Try 2-transpositions: enedle, needle, nedele, neelde, needel. No match? Try 1-deletions: eedle, nedle, nedle, neele, neede, needl. No match? Try insertions, larger edits, … An exponential number of possibilities
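The variant lists on slide 23 can be generated mechanically. The sketch below reproduces the transpositions and deletions of "needle" and then counts how many distinct strings lie within one and two edits, to illustrate how fast the candidate set grows. The lowercase alphabet used for insertions and the function names are assumptions made for this example.

```python
import string

def transpositions(word):
    """Swap each pair of adjacent characters in turn."""
    return [word[:i] + word[i + 1] + word[i] + word[i + 2:]
            for i in range(len(word) - 1)]

def deletions(word):
    """Drop each character in turn."""
    return [word[:i] + word[i + 1:] for i in range(len(word))]

def insertions(word, alphabet=string.ascii_lowercase):
    """Insert each alphabet character at each position."""
    return [word[:i] + c + word[i:]
            for i in range(len(word) + 1) for c in alphabet]

def variants_within(word, k, alphabet=string.ascii_lowercase):
    """Set of strings reachable from word by at most k of the edits above."""
    seen = {word}
    frontier = {word}
    for _ in range(k):
        frontier = {v for w in frontier
                    for v in transpositions(w) + deletions(w) + insertions(w, alphabet)} - seen
        seen |= frontier
    return seen

print(transpositions("needle"))
print(deletions("needle"))
for k in (1, 2):
    print(k, "edit(s):", len(variants_within("needle", k)), "candidate strings")
```

Each candidate would have to be presented for an exact match under the conventional approach, which is why the growth in the candidate count matters.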

24 How is this done today? Think of every way a word can be misspelled. Present each misspelling (enedle, needle, nedele, neelde, needel) to the computer for an exact match [Figure: each candidate is answered Yes or No]

25 How can we do better? Data is present on magnetic media. Hardware at the disk is already fault tolerant (more on this later) and distributed across all surfaces [Figure labels: needel, needle]. We win if the number of misspellings is large and the number of false hits is small

26 Another Application: Intelligence Data. Lots of data; changing constantly; many perturbations (Tzar, tsar, czar, ...); don’t know what we want to look for beforehand

27 Google Search Engine: crawls the web once per month; caches web pages; fast, exact text-based search (see how soon) [Figure labels: needle, needel]

28 Image Database Applications Challenging database Unstructured Massive data sets Don’t know what we need to look for in each picture

29 Satellite Data Low-orbit fly-over every 90 minutes Look for differences in images –Large objects –Troops –Changes to landscape Flag, transmit these differences immediately National Reconnaissance Office City assessors...

30 Washington University Hilltop Campus

31 How do we find what we’re looking for?!

32 Conventional Structured Database. Documents (ids 1 to 4): "Agent James Bond", "Agent mobile computer", "James Madison movie", "James Bond movie". Words: James, computer, agent, Bond, Madison, mobile, movie. Inverted list: pointers from each word to the documents that contain it
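The inverted list on slide 32 can be sketched in a few lines: map every word to the set of document ids that contain it, then answer a query by intersecting those sets. The four document texts come from the slide; the assignment of ids to documents is an assumption, because the transcript does not preserve which id belongs to which document.

```python
from collections import defaultdict

def build_inverted_index(documents):
    """Map each lowercased word to the set of ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

def search(index, query):
    """Return the ids of documents that contain every word of the query."""
    words = query.lower().split()
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result

# Document texts from the slide; the id assignment is assumed.
documents = {
    1: "Agent James Bond",
    2: "Agent mobile computer",
    3: "James Madison movie",
    4: "James Bond movie",
}
index = build_inverted_index(documents)
print(sorted(search(index, "James Bond")))
print(sorted(search(index, "movie")))
```

The lookup is fast, but only for words that were indexed ahead of time and only for exact matches, which is the limitation the following slides turn to.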

33 Challenges in Searching Massive Databases. If you know what to search for: need to build an index beforehand and maintain the index as it changes. If you do not know what to search for: need to search the whole database!

34 Conventional Search [Diagram: hard drive, I/O bus, memory, memory bus, processor]

35 Conventional Search [Same diagram, with a "find …." request]

36 Conventional Search [Same diagram, labeled "contents" and "yes, no, no, yes, yes …."]

37 Conventional Approach

38 WUSTL’s Approach

39 Streaming Approach [Diagram: hard drive, I/O bus, memory, memory bus, processor, plus reconfigurable hardware with memory/processing]

40 Streaming Approach [Same diagram, with a "find" request]

41 Streaming Approach [Same diagram, with a "find" request]

42 Streaming Approach: parallelism through each transducer and drive [Same diagram, with a "find" request and results "yes, no, no, yes, yes"]

43 Magnetic Recording Channel Schematic [Figure labels: Input User Data, Encoder, Channel Bits, Head, Disk, Analog Readback, Detector, Decoder, Decoded User Data, To Bus or Cache; points A, B, C]

44 Key streaming over Data

45 Disk Level Implementation: 100-bit-key matching through a pseudo-random binary series [Plot labels: score, matches]

46 Status: Prototype in progress [Diagram labels: Host, ATAPI Controller, IDE bus, Tap (16-bit data, 15-bit CTRL), Hard drive; custom PCB for electrical termination and 5 V to 3.3 V conversion; FPX with NID and RAD; 32 RAD test pins; Loopback module; IDE_to_ATM module; setup reused from FPX]

47 Internet Packet Filtering with Mahesh Jayaram and George Varghese

48 Finding Needles in a Moving Haystack

49 As technology improves, transmission time decreases but latency stays the same [Plot: cost of an Internet request by year, divided into latency and transmission time]

50 Example: Garden Hose Water Supply Latency (first drop) ~ distance Bandwidth ~ hose diameter Fire department and gardener suffer the same wait
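The garden-hose analogy corresponds to a simple cost model: the time for a request is a fixed latency plus the payload size divided by the bandwidth. The sketch below uses made-up numbers (an 8,000,000-bit payload and 50 ms of latency) purely to show how latency comes to dominate as bandwidth improves; none of the values are from the talk.

```python
def request_time(size_bits, bandwidth_bps, latency_s):
    """Total time = fixed latency (first drop) + transmission time."""
    return latency_s + size_bits / bandwidth_bps

# Hypothetical values chosen for illustration only.
SIZE_BITS = 8_000_000
LATENCY_S = 0.05

for bandwidth_bps in (1e6, 1e7, 1e8, 1e9):
    total = request_time(SIZE_BITS, bandwidth_bps, LATENCY_S)
    share = LATENCY_S / total
    print(f"{bandwidth_bps:>13,.0f} bit/s: total {total:.3f} s, latency share {share:.0%}")
```

A wider hose shortens only the second term, so past some point further bandwidth barely changes the total wait.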

51 Example: Hot Shower You want this water Latency (time to get hot water) ~ distance

52 Convection circuit continuously circulates hot water Latency ~ 0 Latency-Free Hot Shower

53 Better to receive than to give Cable broadcast Radio broadcast TV guide channel Gate connection announcements in flight Winning lottery number Modern name: push technology

54 Better to receive than to give

55 How do you get what you want?

56 Packet Filters Filter F (Weather)

57 Packet Filters Filter F (Weather)

58 Existing Approach [Diagram: separate filters for IBM Quote, Weather, Flight Schedule]

59 Our approach [Diagram: IBM Quote, Weather, Flight Schedule combined] Composite filter makes just one pass

60 How we do it [Diagram: IBM Quote, Weather, Flight Schedule; Grammar 1, Grammar 2, Grammar 3; Parsing Engine]
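To make the contrast between slides 58 and 59 concrete, the sketch below compares testing each filter separately with a single pass through a merged structure. It swaps in a byte-prefix trie for the grammars and parsing engine of slide 60, so it illustrates the one-pass idea rather than the talk's actual mechanism. The filter patterns and the packet are hypothetical.

```python
def sequential_match(packet, filters):
    """Existing approach: test every filter in turn against the packet."""
    return [name for name, prefix in filters.items() if packet.startswith(prefix)]

def build_composite(filters):
    """Merge all filter prefixes into one trie.
    Each node is {'next': {byte: node}, 'names': [filters ending here]}."""
    root = {"next": {}, "names": []}
    for name, prefix in filters.items():
        node = root
        for byte in prefix:
            node = node["next"].setdefault(byte, {"next": {}, "names": []})
        node["names"].append(name)
    return root

def composite_match(packet, root):
    """Composite approach: walk the packet once, collecting every filter hit."""
    matched = list(root["names"])
    node = root
    for byte in packet:
        node = node["next"].get(byte)
        if node is None:
            break  # no filter can match beyond this byte
        matched.extend(node["names"])
    return matched

# Hypothetical filters, named after the labels on the slides.
filters = {
    "IBM Quote": b"QUOTE IBM",
    "Weather": b"WEATHER",
    "Flight Schedule": b"FLIGHT",
}
packet = b"WEATHER St. Louis: sunny"
print(sequential_match(packet, filters))
print(composite_match(packet, build_composite(filters)))
```

The sequential version does work proportional to the number of filters for every packet, while the merged walk reads each packet byte at most once no matter how many filters were merged.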

61 Sample grammar for TCP packet
TCPConnHeader : EtherType IPHeader TCPPortPair
EtherType : #IP_TYPE
IPHeader : Vers HlenPlusRest
Vers : HalfByte
HlenPlusRest : 0 1 0 1 FixedRest
    | 0 1 1 0 FixedRest OneIPOption
    | 0 1 1 1 FixedRest TwoIPOption
    | 1 0 0 0 FixedRest ThreeIPOption
    | 1 0 0 1 FixedRest FourIPOption
    | 1 0 1 0 FixedRest FiveIPOption
    | 1 0 1 1 FixedRest FiveIPOption OneIPOption
    | 1 1 0 0 FixedRest FiveIPOption TwoIPOption
    | 1 1 0 1 FixedRest FiveIPOption ThreeIPOption
    | 1 1 1 0 FixedRest FiveIPOption FourIPOption
    | 1 1 1 1 FixedRest FiveIPOption FiveIPOption
FixedRest : ServiceType TotalLength Identification Flags FragmentOffset TimeToLive Protocol HeaderChecksum IPAddrPair
ServiceType : Byte
TotalLength : TwoByte
Identification : TwoByte
Flags : bit bit bit
FragmentOffset : bit Byte HalfByte
TimeToLive : Byte
Protocol : #TCP_PROTOCOL
HeaderChecksum : TwoByte
IPAddrPair : #IP_SRC_DST_PAIR
FiveIPOption : ThreeIPOption TwoIPOption
FourIPOption : TwoIPOption TwoIPOption
ThreeIPOption : TwoIPOption OneIPOption
TwoIPOption : OneIPOption OneIPOption
OneIPOption : Option Padding
Option : ThreeByte
Padding : Byte
TCPPortPair : #TCP_PORT_PAIR
FourByte : TwoByte TwoByte
ThreeByte : TwoByte Byte
TwoByte : Byte Byte
Byte : HalfByte HalfByte
HalfByte : bit bit bit bit
bit : 0 | 1
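Read as a recognizer, the sample grammar accepts a frame whose EtherType is IP, whose header-length nibble is 5 to 15 (selecting zero to ten 4-byte options), whose protocol is TCP, and whose address and port pairs equal the bound terminals. The sketch below is a hand-written byte-level check of those same conditions, not the parsing engine from the talk. The values 0x0800 (IPv4 EtherType) and 6 (TCP protocol number) are the standard constants; the function names and the test addresses are assumptions.

```python
IP_TYPE = b"\x08\x00"   # standard EtherType for IPv4 (stands in for #IP_TYPE)
TCP_PROTOCOL = 6        # standard IP protocol number for TCP (#TCP_PROTOCOL)

def match_tcp_conn_header(data, src_dst_pair, port_pair):
    """Return True if data, starting at the EtherType field, matches
    TCPConnHeader with the given 8-byte address pair and 4-byte port pair."""
    if data[:2] != IP_TYPE:
        return False
    ip = data[2:]
    if len(ip) < 20:                 # FixedRest plus the first byte is 20 bytes
        return False
    hlen = ip[0] & 0x0F              # low nibble: header length in 32-bit words
    if hlen < 5:                     # the grammar has no alternative below 0101
        return False
    header_bytes = 4 * hlen          # each option adds Option (3) + Padding (1)
    if len(ip) < header_bytes + 4:
        return False
    if ip[9] != TCP_PROTOCOL:
        return False
    if ip[12:20] != src_dst_pair:
        return False
    return ip[header_bytes:header_bytes + 4] == port_pair

def make_frame(hlen, protocol, src_dst_pair, port_pair):
    """Build a test frame (hypothetical helper): EtherType, IP header, ports."""
    fixed = (bytes([0x40 | hlen, 0]) + b"\x00\x28" + b"\x00\x00" + b"\x00\x00"
             + bytes([64, protocol]) + b"\x00\x00" + src_dst_pair)
    options = bytes(4 * (hlen - 5))
    return IP_TYPE + fixed + options + port_pair

pair = bytes([10, 0, 0, 1, 10, 0, 0, 2])        # made-up source and destination
ports = (1234).to_bytes(2, "big") + (80).to_bytes(2, "big")
print(match_tcp_conn_header(make_frame(5, 6, pair, ports), pair, ports))
print(match_tcp_conn_header(make_frame(7, 6, pair, ports), pair, ports))
print(match_tcp_conn_header(make_frame(5, 17, pair, ports), pair, ports))
```

The eleven HlenPlusRest alternatives in the grammar play the role of the `4 * hlen` computation here: each alternative fixes how many option words must be skipped before the port pair.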

62 Results The more things you want, the slower existing approaches get Our performance doesn’t degrade

63 Conclusions The Internet and its content are growing explosively Disk storage is abundant, cheap, reliable Technology must provide fast, inexact searching of text and images As more data is hurled at and past us, fast filtering of Internet traffic is a must

64 Questions?

