Finding Needles in the Internet Haystack Ron K. Cytron Washington University in Saint Louis Department of Computer Science

Presentation transcript:

Finding Needles in the Internet Haystack. Ron K. Cytron, Washington University in Saint Louis, Department of Computer Science. Century Club, May 2002. Roger Chamberlain, Mark Franklin, Ron Indeck, John Lockwood, George Varghese (UCSD), Mahesh Jayaram. Thanks: Ben Brodie. Center for Distributed Object Computing, Department of Computer Science, Washington University

Outline Computers have come a long way Today’s computers are never lonely Volumes and volumes of data Fast searching of magnetic media Internet packet filtering Conclusion

A Grandchild’s Gift (two columns) Cost: $60 | Cost: $35; Memory: ½ char | Memory: 16 M chars; Speed: 1 cycle/s | Speed: 16 M cycles/s; Fails: 10 seconds | Fails: 5 years

If cars improved that much in 30 years … $ ,000 miles per hour Seats 10,000 people Gets 20,000 miles per gallon Breaks every 70 years

The Haystack The Internet is large and growing Content on the Internet is growing even faster A haystack sits still, but the Internet….

Growth of the Internet (why computers aren’t lonely anymore) Y2K Problem (?): More computers sold than TVs

Growth of Internet Content (volumes and volumes of data) Anybody can publish Problem is how to find what you want

Page 6B What can tech companies do? Some say they're at a loss, but others offer budding solutions By Kevin Maney On July 7, 1940, as the nation edged toward World War II, IBM put out a statement that made headlines. The company offered all its facilities for national defense, ready to convert to making anything the government needed. Other leaders in the electro-mechanical technology of the day -- Ford Motor, General Motors, General Electric -- also threw their weight into defense efforts. They switched from making cars and washing machines to building tanks, aircraft engines and machine guns. So here we are in 2001, readying for another war. The U.S. technology industry is the best and most innovative in the world. It is the nation's pride and joy. Shouldn't it do something? 9/17/2001

... One possibility is in data-mining technology. Data mining is a way to collect millions of pieces of information in a computer system, sift through that data, make sense of them and come up with something useful. ''We (the U.S. tech industry) are experts at data mining and have vast resources of data to mine,'' says Tom Evslin, CEO of Internet communications company ITXC. ''We have used it to target advertising. We can probably use it to identify suspicious activity or potential terrorists.''...

Fast searching of magnetic media with Roger Chamberlain, Mark Franklin, Ron Indeck, John Lockwood

Enabling Technology: Disk Drives Magnetic disk storage areal density vs. year of IBM product introduction (From D. A. Thompson) Almost 10,000,000x increase in 45 years!

Cost per Megabyte Price history of hard disk product vs. year of product introduction (From D. A. Thompson) Cost decreasing 3% per week!

Massive Storage & Data Storage industry will ship 4,000,000,000,000,000,000 Bytes this year FedEx generated 14 Terabytes of data last year US intelligence collects data equaling the printed collection of the US library every day!

Massive Data Sets Employee records Consumer information Maps/mission/intelligence data Genome maps  Data sets now measured in Terabytes, and are dynamic!

Genome Application Genome maps growing, expanded daily –Wash U sequencing center –Each of us has 80,000 genes found among 3 billion characters of DNA (A,C,G,T) Look for matches –Identify function –Disease: understand, diagnose, detect, medicine, therapy –Biofuels, warfare, toxic waste –Understand evolution –Forensics, organ donors, authentication –More effective crops, disease resistance

DNA String Matching Looking for CACGTTAGT…TAGC Interested in matches and near matches Search human genome and other gene oceans –Need to search entire data sets
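A minimal sketch of what "matches and near matches" could mean in software, assuming a near match is a same-length window that differs in at most a few characters (Hamming distance). The function name, the sample strings, and the mismatch budget are illustrative assumptions, not taken from the talk.

```python
def near_matches(pattern, sequence, max_mismatches):
    """Return (start, mismatches) for every window of `sequence` that
    differs from `pattern` in at most `max_mismatches` positions."""
    hits = []
    width = len(pattern)
    # The whole data set is scanned: every window is compared.
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        mismatches = sum(a != b for a, b in zip(pattern, window))
        if mismatches <= max_mismatches:
            hits.append((start, mismatches))
    return hits


if __name__ == "__main__":
    # Hypothetical pattern and sequence, chosen only for illustration.
    print(near_matches("CACGT", "TTCACGTAACAGGTCC", 1))
```

The scan touches every position, which is why the slide stresses that entire data sets must be searched.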

Bio Computation Problem *BIG* Genome Databases (diagram: a DNA pattern and a DNA sequence drawn as strings of A, C, G, T; Match?) Approximate matches are just as useful

Finding a needel in a heystuck DNA and live text can contain errors We often seek an approximate match, for example needle No match? Try 2-transpositions enedle, needle, nedele, neelde, needel No match? Try 1-deletions eedle, nedle, nedle, neele, neede, needl No match? Try insertions, larger edits, … An exponential number of possibilities
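The two lists of misspellings on this slide can be generated mechanically. The sketch below assumes "2-transpositions" means swapping two adjacent characters and "1-deletions" means dropping one character; the function names are illustrative.

```python
def transpositions(word):
    """Swap each pair of adjacent characters, left to right."""
    return [word[:i] + word[i + 1] + word[i] + word[i + 2:]
            for i in range(len(word) - 1)]


def deletions(word):
    """Drop each character in turn, left to right."""
    return [word[:i] + word[i + 1:] for i in range(len(word))]


if __name__ == "__main__":
    print(transpositions("needle"))
    print(deletions("needle"))
```

Each further kind of edit (insertions, larger edits) multiplies the candidates, which is the growth the slide warns about.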

How is this done today? Think of every way a word can be misspelled Present each misspelling to the computer for an exact match: enedle, needle, nedele, neelde, needel (answers: No / Yes)
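As a rough sketch of the approach described on this slide, each candidate spelling is presented for its own exact-match search. The variant list is copied from the slide; the function name and the sample text are assumptions.

```python
# Misspellings listed on the slide.
VARIANTS = ["enedle", "needle", "nedele", "neelde", "needel"]


def search_each_variant(variants, text):
    """Run one exact-match search per variant.

    The text is scanned once for every variant, so the work grows
    with the number of misspellings considered.
    """
    return {variant: (variant in text) for variant in variants}


if __name__ == "__main__":
    answers = search_each_variant(VARIANTS, "finding a needel in a heystuck")
    for variant, found in answers.items():
        print(variant, "yes" if found else "no")
```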

How can we do better? Data is present on magnetic media Hardware at the disk is –Already fault tolerant (more on this later): needel → needle –Distributed across all surfaces We win if number of misspellings is large, and the number of false hits is small
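The talk proposes doing tolerant matching in hardware at the disk. As a software stand-in only (not the hardware method), the sketch below makes a single pass over the text and reports where some substring lies within a given edit distance of the pattern, using the standard dynamic-programming recurrence for approximate substring search. The function name and the edit budget are assumptions.

```python
def approx_find(pattern, text, max_edits):
    """Return 1-based end positions in `text` where some substring is
    within `max_edits` insertions, deletions, or substitutions of `pattern`."""
    m = len(pattern)
    # col[i] = fewest edits turning pattern[:i] into some substring
    # ending at the current text position; col[0] is always 0 because
    # a match may start anywhere.
    col = list(range(m + 1))
    hits = []
    for j, ch in enumerate(text):
        prev_diag = col[0]
        for i in range(1, m + 1):
            old = col[i]
            cost = 0 if pattern[i - 1] == ch else 1
            col[i] = min(prev_diag + cost,  # match or substitute
                         old + 1,           # extra character in text
                         col[i - 1] + 1)    # missing pattern character
            prev_diag = old
        if col[m] <= max_edits:
            hits.append(j + 1)
    return hits


if __name__ == "__main__":
    print(approx_find("needle", "a needel in a heystuck", 1))
```

One pass replaces the many exact searches needed when every misspelling is enumerated, at the price of some false hits when the edit budget is generous.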

Another Application:Intelligence Data Lots of data Changing constantly Many perturbations –Tzar, tsar, czar,... Don’t know what we want to look for beforehand

Google Search Engine Crawls the web once per month Caches web pages Fast, exact text-based search (see how soon) needle needel

Image Database Applications Challenging database Unstructured Massive data sets Don’t know what we need to look for in each picture

Satellite Data Low-orbit fly-over every 90 minutes Look for differences in images –Large objects –Troops –Changes to landscape Flag, transmit these differences immediately National Reconnaissance Office City assessors...

Washington University Hilltop Campus

How do we find what we’re looking for?!

Conventional Structured Database Documents (D id, Document): Agent James Bond; Agent mobile computer; James Madison movie; James Bond movie. Inverted list (Word, pointers): James, computer, agent, Bond, Madison, mobile, movie
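A small sketch of the inverted list on this slide, built from its four documents. The document ids 1 to 4 are an assumption (the ids did not survive in the transcript), and the function names are illustrative.

```python
from collections import defaultdict

# Documents from the slide; the numeric ids are assumed.
DOCS = {
    1: "Agent James Bond",
    2: "Agent mobile computer",
    3: "James Madison movie",
    4: "James Bond movie",
}


def build_inverted_index(docs):
    """Map each lower-cased word to the sorted ids of documents containing it."""
    index = defaultdict(list)
    for doc_id in sorted(docs):
        for word in set(docs[doc_id].lower().split()):
            index[word].append(doc_id)
    return dict(index)


def lookup(index, *words):
    """Return ids of documents that contain every one of `words`."""
    postings = [set(index.get(word.lower(), ())) for word in words]
    return sorted(set.intersection(*postings)) if postings else []


if __name__ == "__main__":
    index = build_inverted_index(DOCS)
    print(lookup(index, "James", "Bond"))
```

The index answers known-word queries quickly, but it has to be built beforehand and maintained as the documents change.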

Challenges in Searching Massive Databases  Know what to search for –need to build index beforehand –maintain index as it changes  Do not know what to search for –need to search the whole database!

Conventional Search (diagram: hard drive, I/O bus, memory, memory bus, processor)

Conventional Search (same diagram; request: find ….)

Conventional Search (same diagram; contents; results: yes, no, no, yes, yes ….)

Conventional Approach

WUSTL’s Approach

Streaming Approach (diagram: hard drive, I/O bus, memory, memory bus, processor, plus reconfigurable hardware with memory/processing)

Streaming Approach (same diagram; request: find)

Streaming Approach (same diagram; request: find; results: yes, no, no, yes, yes) Parallelism through each transducer and drive
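A toy model of the difference between the two diagrams, assuming the cost that matters is how many bytes cross the bus to the processor. In the conventional case every record travels to the processor; in the streaming case the test runs next to the drive and only one yes/no flag per record (modelled here as one byte) is returned. The record contents and function names are hypothetical.

```python
def conventional_search(records, predicate):
    """Ship every record across the bus, then test it at the processor."""
    bytes_over_bus = 0
    hits = []
    for position, record in enumerate(records):
        bytes_over_bus += len(record)  # the whole record is transferred
        if predicate(record):
            hits.append(position)
    return hits, bytes_over_bus


def streaming_search(records, predicate):
    """Test each record near the drive; return only one flag per record."""
    flags = [predicate(record) for record in records]
    hits = [position for position, flag in enumerate(flags) if flag]
    return hits, len(flags)  # one byte per yes/no answer (assumed)


if __name__ == "__main__":
    records = [b"needle in hay", b"just hay", b"more hay", b"a needle", b"needle!"]
    wants_needle = lambda record: b"needle" in record
    print(conventional_search(records, wants_needle))
    print(streaming_search(records, wants_needle))
```

Both functions return the same hits; only the traffic differs, and the gap widens as records grow.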

Magnetic Recording Channel Schematic (diagram labels: input user data, encoder, channel bits, head/disk, analog readback, detector, decoder, decoded user data, to bus or cache; points A, B, C)

Key streaming over Data

Disk Level Implementation 100-bit-key matching through a pseudo-random binary series (plot labels: score; matches)
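A software sketch of streaming a key over data, assuming the score at each offset is the number of bit positions where key and data agree. A 100-bit key is planted in a pseudo-random bit series with a few bits flipped, and the peak score marks the match. The seed, lengths, planted offset, and flip count are all illustrative assumptions; the talk's hardware works on the disk's signal, not on Python lists.

```python
import random


def scores(key, data):
    """Count agreeing bits between the key and each window of the data."""
    width = len(key)
    return [sum(k == d for k, d in zip(key, data[offset:offset + width]))
            for offset in range(len(data) - width + 1)]


def demo(seed=0, data_len=2000, key_len=100, planted_at=700, flipped=5):
    """Plant a noisy copy of a random key and locate it by peak score."""
    rng = random.Random(seed)
    data = [rng.getrandbits(1) for _ in range(data_len)]
    key = [rng.getrandbits(1) for _ in range(key_len)]
    data[planted_at:planted_at + key_len] = key
    # Flip a few distinct bits of the planted copy to imitate read errors.
    for i in rng.sample(range(key_len), flipped):
        data[planted_at + i] ^= 1
    all_scores = scores(key, data)
    best = max(range(len(all_scores)), key=all_scores.__getitem__)
    return best, all_scores[best]


if __name__ == "__main__":
    print(demo())
```

Unrelated windows agree with the key in roughly half their bits, so a score near 100 stands out even when a few bits are wrong.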

Status: Prototype in progress (block diagram labels: host, ATAPI controller, IDE bus, hard drive; tap with 16-bit data and 15-bit CTRL; custom PCB for electrical termination and 5V to 3.3V conversion; 32 RAD test pins; FPX with NID and RAD; loopback module; setup reused from FPX IDE_to_ATM module)

Internet Packet Filtering with Mahesh Jayaram and George Varghese

Finding Needles in a Moving Haystack

As technology improves, transmission time decreases but latency stays the same (plot: cost of Internet request by year, split into latency and transmission time)
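The claim can be illustrated with a simple model, assuming the cost of a request is a fixed latency plus the payload size divided by the bandwidth. The latency, payload, and bandwidth figures below are made-up numbers for illustration, not data from the talk.

```python
def request_time(latency_s, size_bits, bandwidth_bps):
    """Total time = fixed latency + transmission time (size / bandwidth)."""
    return latency_s + size_bits / bandwidth_bps


if __name__ == "__main__":
    latency = 0.05   # seconds, assumed constant across technology generations
    payload = 8e6    # bits (about one megabyte), assumed
    for bandwidth in (1e6, 1e8, 1e10):  # bits per second, assumed
        total = request_time(latency, payload, bandwidth)
        print(f"{bandwidth:.0e} bit/s: {total:.4f} s")
```

As bandwidth grows, the transmission term shrinks toward zero and the total flattens out at the latency.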

Example: Garden Hose Water Supply Latency (first drop) ~ distance Bandwidth ~ hose diameter Fire department and gardener suffer the same wait

Example: Hot Shower You want this water Latency (time to get hot water) ~ distance

Latency-Free Hot Shower Convection circuit continuously circulates hot water Latency ~ 0

Better to receive than to give Cable broadcast Radio broadcast TV guide channel Gate connection announcements in flight Winning lottery number Modern name: push technology

Better to receive than to give

How do you get what you want?

Packet Filters Filter F (Weather)

Existing Approach IBM Quote Weather Flight Schedule

Our approach IBM Quote Weather Flight Schedule Composite filter makes just one pass
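A toy comparison of the two approaches, assuming each filter is just a set of keywords and a packet is a list of tokens. The talk's system uses grammars and a parsing engine; the keyword table here is a deliberately simpler stand-in, and the filter contents are invented for illustration.

```python
# Hypothetical keyword filters named after the slide's three examples.
FILTERS = {
    "IBM Quote": {"IBM"},
    "Weather": {"forecast", "rain"},
    "Flight Schedule": {"flight", "gate"},
}


def match_separately(filters, tokens):
    """Existing approach: every filter scans the packet on its own."""
    matched, inspections = set(), 0
    for name, keywords in filters.items():
        for token in tokens:
            inspections += 1
            if token in keywords:
                matched.add(name)
    return matched, inspections


def match_composite(filters, tokens):
    """Composite filter: merge all filters into one table, scan once."""
    table = {}
    for name, keywords in filters.items():
        for keyword in keywords:
            table.setdefault(keyword, set()).add(name)
    matched, inspections = set(), 0
    for token in tokens:
        inspections += 1
        matched |= table.get(token, set())
    return matched, inspections


if __name__ == "__main__":
    packet = "rain delays flight from gate 12".split()
    print(match_separately(FILTERS, packet))
    print(match_composite(FILTERS, packet))
```

In this model the separate scans cost filters × tokens inspections, while the composite pass costs one inspection per token however many filters are merged.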

How we do it IBM Quote Weather Flight Schedule Grammar 1 Grammar 2 Grammar 3 Parsing Engine

Sample grammar for TCP packet
TCPConnHeader : EtherType IPHeader TCPPortPair
EtherType : #IP_TYPE
IPHeader : Vers HlenPlusRest
Vers : HalfByte
HlenPlusRest : FixedRest
    | FixedRest OneIPOption
    | FixedRest TwoIPOption
    | FixedRest ThreeIPOption
    | FixedRest FourIPOption
    | FixedRest FiveIPOption
    | FixedRest FiveIPOption OneIPOption
    | FixedRest FiveIPOption TwoIPOption
    | FixedRest FiveIPOption ThreeIPOption
    | FixedRest FiveIPOption FourIPOption
    | FixedRest FiveIPOption FiveIPOption
FixedRest : ServiceType TotalLength Identification Flags FragmentOffset TimeToLive Protocol HeaderChecksum IPAddrPair
ServiceType : Byte
TotalLength : TwoByte
Identification : TwoByte
Flags : bit bit bit
FragmentOffset : bit Byte HalfByte
TimeToLive : Byte
Protocol : #TCP_PROTOCOL
HeaderChecksum : TwoByte
IPAddrPair : #IP_SRC_DST_PAIR
FiveIPOption : ThreeIPOption TwoIPOption
FourIPOption : TwoIPOption TwoIPOption
ThreeIPOption : TwoIPOption OneIPOption
TwoIPOption : OneIPOption OneIPOption
OneIPOption : Option Padding
Option : ThreeByte
Padding : Byte
TCPPortPair : #TCP_PORT_PAIR
FourByte : TwoByte TwoByte
ThreeByte : TwoByte Byte
TwoByte : Byte Byte
Byte : HalfByte HalfByte
HalfByte : bit bit bit bit
bit : 0 | 1
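For readers who want to see the fields the grammar walks over, here is a hand-written check of the same layout using Python's `struct` module. It is not the talk's parsing engine: it assumes an untagged Ethernet frame carrying IPv4, reads the header length field to skip zero to ten 4-byte option words, and extracts the address and port pairs. `tcp_conn_header` and `build_frame` are hypothetical helper names.

```python
import struct

ETHERTYPE_IPV4 = 0x0800   # value matched by EtherType
PROTO_TCP = 6             # value matched by Protocol
ETH_LEN = 14              # destination MAC, source MAC, EtherType
IP_FIXED = "!BBHHHBBH4s4s"  # the 20 fixed bytes of an IPv4 header


def tcp_conn_header(frame):
    """Return (src_ip, dst_ip, src_port, dst_port), or None if the frame
    is not an IPv4 packet carrying TCP."""
    if len(frame) < ETH_LEN + 20:
        return None
    (ethertype,) = struct.unpack_from("!H", frame, 12)
    if ethertype != ETHERTYPE_IPV4:
        return None
    (ver_ihl, _tos, _total, _ident, _flags_frag,
     _ttl, proto, _checksum, src, dst) = struct.unpack_from(IP_FIXED, frame, ETH_LEN)
    version, ihl = ver_ihl >> 4, ver_ihl & 0x0F
    if version != 4 or ihl < 5 or proto != PROTO_TCP:
        return None
    tcp_start = ETH_LEN + ihl * 4  # skips (ihl - 5) option words
    if len(frame) < tcp_start + 4:
        return None
    src_port, dst_port = struct.unpack_from("!HH", frame, tcp_start)
    dotted = lambda addr: ".".join(str(octet) for octet in addr)
    return dotted(src), dotted(dst), src_port, dst_port


def build_frame(src, dst, sport, dport, option_words=0, proto=PROTO_TCP):
    """Build a minimal test frame (zeroed MACs, checksum, and options)."""
    ihl = 5 + option_words
    eth = bytes(12) + struct.pack("!H", ETHERTYPE_IPV4)
    ip = struct.pack(IP_FIXED, (4 << 4) | ihl, 0, ihl * 4 + 4, 0, 0,
                     64, proto, 0, bytes(src), bytes(dst))
    return eth + ip + bytes(4 * option_words) + struct.pack("!HH", sport, dport)


if __name__ == "__main__":
    frame = build_frame((10, 0, 0, 1), (10, 0, 0, 2), 1234, 80, option_words=2)
    print(tcp_conn_header(frame))
```

The eleven alternatives of HlenPlusRest correspond to the option counts 0 through 10 that the header length field can express.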

Results The more things you want, the slower existing approaches get Our performance doesn’t degrade

Conclusions The Internet and its content are growing explosively Disk storage is abundant, cheap, reliable Technology must provide fast, inexact searching of text and images As more data is hurled at and past us, fast filtering of Internet traffic is a must

Questions?