Federated Facts and Figures Joseph M. Hellerstein UC Berkeley

Road Map The Deep Web and the FFF An Overview of Telegraph Demo: Election 2000 From Tapping to Trawling A Taste of Policy and Countermeasures Delicious Snacks

Meet the “Deep Web” Available in your browser, but not via hyperlinks Accessed via forms (press the “submit” button) Typically runs some code to generate data E.g. call out to a database, or run some “servlet” Pretty-print results in HTML Dynamic HTML Estimated to be >400x larger than the “surface web” Not accessible in the search engines Typically crawl hyperlinks only

Federated Facts and Figures One part of the deep web: more full-text documents E.g. archived newspaper articles, legal documents, etc. Figure out how to fetch these, then add to search engine Various people working on this (e.g. CompletePlanet) Another part: Facts and Figures I.e. structured database data Fetch is only the first challenge Want to combine (“federate”) these databases Want to search by criteria other than keywords Want to analyze the data en masse I.e. want full query power, not just search Search was always easy Ranking not clearly appropriate here

Meet the FFF

Telegraph An adaptive dataflow system Dataflow siphon data from the deep web and other data pools harness data streaming from sensors and traces flow these data streams through code Adaptive sensor nets & wide area networks: volatile! like Telegraph Avenue needs to “be cool” with volatile mix from all over the world adaptive techniques route data to machines and code marriage of queries, app-level route/filter, machine learning First apps Facts and Figures Federation: Election 2000 Continuous queries on sensor nets Rich queries on Peer-to-Peer Joe Hellerstein, Mike Franklin, & co.

Dataflow Commonalities Dataflow at the heart of queries and networks Query engines move records through operators Networks move packets through routers Networked data-intensive apps an emerging middle ground Database Systems: High-function, high integrity, carefully administered. Compile intelligent query plans based on data models and statistical properties, query semantics. Networks: Low-function, high availability, federated administration. Adapt to performance variabilities, treat data and code as opaque for loose coupling.

Long-Running Dataflows on the FFF Not precomputed like web indexes Need online systems & apps for online performance goals Subject of prior work in CONTROL project Combo of query processing, sampling/estimation, HCI (Chart: % of final result vs. time, Online vs. Traditional.)
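
The online-performance idea behind CONTROL is to report a continuously refining answer instead of waiting for a complete scan. A minimal sketch of that behavior; the data stream, reporting interval, and confidence formula below are illustrative assumptions, not CONTROL's actual estimators:

```python
import math
import random

def online_average(stream, report_every=1000, z=1.96):
    """Running estimate of an average over a data stream.

    Rather than waiting for a full scan ("traditional"), emit an estimate
    plus a rough confidence half-width as tuples arrive ("online").
    """
    n, total, total_sq = 0, 0.0, 0.0
    for value in stream:
        n += 1
        total += value
        total_sq += value * value
        if n % report_every == 0:
            mean = total / n
            variance = max(total_sq / n - mean * mean, 0.0)
            half_width = z * math.sqrt(variance / n)  # CLT-style interval
            yield n, mean, half_width

# Toy usage: estimates tighten long before the "scan" would finish.
if __name__ == "__main__":
    data = (random.gauss(100.0, 15.0) for _ in range(50_000))
    for n, mean, hw in online_average(data, report_every=10_000):
        print(f"after {n:6d} tuples: {mean:7.2f} +/- {hw:.2f}")
```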

Telegraph Architecture Telegraph executes Dataflow Graphs Extensible set of operators With extensible optimization rules Data access operators TeSS: the Telegraph Screen Scraper Napster/Gnutella readers File readers Data processing operators Selections (filters), Joins, Drill-Down/Roll-Up, Aggregation Adaptivity Operators Eddies, STeMs, FLuX, etc.

Screen Scraping: TeSS Screen scrapers do two things: Fetch: emulate a web user clicking Parse: extract info from resulting HTML/XML Somebody has to train the screen scraper Need a separate wrapper for each site Some research work on making this process semi-automatic TeSS is an open-source screen-scraper Available at Written by a (superstar) sophomore! Simple scripting interface targeted today Moving towards GUI for non-technical users (“by example”)
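
A minimal sketch of the two screen-scraper jobs named on this slide: fetch (emulate the form submission) and parse (pull records out of the returned HTML). The URL, form fields, and row pattern are hypothetical placeholders for a white-pages-style site; this is not TeSS's actual scripting interface:

```python
import re
import urllib.parse
import urllib.request

def fetch(form_url, form_fields):
    """Fetch: emulate a user pressing "submit" by POSTing the form fields."""
    body = urllib.parse.urlencode(form_fields).encode("ascii")
    with urllib.request.urlopen(form_url, data=body, timeout=30) as resp:
        return resp.read().decode("utf-8", errors="replace")

def parse(html, row_pattern):
    """Parse: pull structured records out of the pretty-printed HTML."""
    return [m.groupdict() for m in re.finditer(row_pattern, html, re.S)]

# Hypothetical wrapper for one site; every site needs its own wrapper.
if __name__ == "__main__":
    html = fetch("https://example.com/people/search",     # placeholder URL
                 {"last_name": "Smith", "state": "DC"})
    rows = parse(html,
                 r"<tr>\s*<td>(?P<name>.*?)</td>\s*<td>(?P<address>.*?)</td>")
    for row in rows:
        print(row["name"], "->", row["address"])
```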

First Demo: Election 2000

From Tapping to Trawling Telegraph allows users to pose rich queries over the deep web But sometimes would like to be more aggressive: Preload a telegraph cache Access a variety of data for offline mining More (we’ll see soon!) Want something like a webcrawler for FFF But FFF is too big. Want to “trawl” for interesting stuff hidden there.

From Tapping to Trawling

(Diagram: an Eddy routes the name “Smith” through Infospace, AnyWho, and Yahoo Maps lookups, producing Name and Address — e.g. “1600 Pennsylvania Avenue, DC” — with duplicate elimination (DupElim).)

API Challenges in Trawling Load APIs on the web today: service and silence Various policies at the servers, hard to learn No analogy to robots.txt (which is too limiting anyhow) Feedback can be delayed, painful Solutions Be very conservative Make out-of-band (human) arrangements Both seem inefficient Finding new sites to trawl is hard Have to wrap them: fetch is easyish, parse hardish XML will help a little here Query? Or Update? Again, an API problem! Imagine we auto-trawled AnyWho and WeSpamYou.com
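
One way to read "be very conservative": pace requests to each site and back off hard on any sign of trouble, since deep-web sources publish no robots.txt-style load policy for their forms and feedback can be delayed. A hedged sketch with purely illustrative pacing constants:

```python
import time
import urllib.request

class PoliteFetcher:
    """Conservative pacing for trawling a single site: slow down sharply on
    any error or timeout, speed back up only gradually.  The constants are
    guesses, since the server's real load policy is unknown."""

    def __init__(self, min_pause=2.0, max_pause=600.0):
        self.pause = min_pause
        self.min_pause = min_pause
        self.max_pause = max_pause

    def get(self, url):
        time.sleep(self.pause)
        try:
            with urllib.request.urlopen(url, timeout=30) as resp:
                page = resp.read()
        except OSError:
            # Feedback is delayed and painful: back off hard on any failure.
            self.pause = min(self.pause * 2, self.max_pause)
            return None
        # Success: relax the pause slowly, never below the floor.
        self.pause = max(self.pause * 0.9, self.min_pause)
        return page
```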

Trawling “Domains” Can now collect lists of: Names (First, Last), Addresses, Companies, Cities, States, etc. etc. Can keep lists organized by site and in toto Allows for offline mining, etc. Q: Do webgraph mining techniques apply to facts and figures?

Exploiting Enumerated Domains I Can trawl any site on known domains! Suddenly the deep web is not so hidden. In essence, we expand our trawl Can use pre-existing domains to trawl further Or, can add new sites to the trawl process

Exploiting Enumerated Domains II Trawling gets a sample (signature) of a site’s content Analogous to a random walk, but needs to be characterized better Can identify that 2 sites have related subsets of domains Helps with the query composition problem Rich query interfaces tend to be non-trivial What sites to use? How to combine them? Imagine: Traditional search engine experience to pick some sites System suggests how to join the sites in a meaningful way As you build the query, you always see incremental results Refine query as the data pours in Berkeley CONTROL project has been exploring incremental queries Blends search, query, browse and mine
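
One way to flesh out "related subsets of domains": treat the set of values a trawl has seen for each domain as a site's signature and compare overlap between sites to suggest candidate joins. A toy sketch; the sites, domains, and values are made up:

```python
def signature_overlap(site_a, site_b):
    """Given trawled samples per domain (e.g. {"city": {...}, "name": {...}}),
    return Jaccard overlap per shared domain -- a crude hint of joinability."""
    overlaps = {}
    for domain in site_a.keys() & site_b.keys():
        a, b = site_a[domain], site_b[domain]
        if a or b:
            overlaps[domain] = len(a & b) / len(a | b)
    return overlaps

# Both hypothetical sites cover Bay Area cities, so a join on "city" looks sane.
anywho = {"city": {"Berkeley", "Oakland", "Albany"},
          "name": {"Smith", "Jones"}}
mapsite = {"city": {"Berkeley", "Oakland", "Emeryville"},
           "zip": {"94720", "94618"}}
print(signature_overlap(anywho, mapsite))   # {'city': 0.5}
```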

A Sampler of FFF Policy Issues Statistical DB Security Issues Facing the Power of the FFF “False” combinations Combination strength What is trawling? Copying? So what? Akamai for the deep web? Cracking?

Sampler of Countermeasures Trawl detection And Distributed Trawl Detection Metadata Watermarking Provenance, Lineage, Disclaimers Stockpiling Spam

Delicious Snacks "Concepts are delicious snacks with which we try to alleviate our amazement” -- A. J. Heschel, Man Is Not Alone

Technical Snacks Adaptive Dataflow Systems + Learning Incremental & continuous querying And online, bounded trawling Adds an HCI component to the above FFF APIs, standards The wrapper-writing bottleneck: XML? Backoff APIs? Search vs. Update Mining trawls

More Technical Snacks Tie-ins with Security Applications beyond FFF Sensors P2P Overlay Networks

Policy Questions Presenting & Interpreting Data Not just search Privacy: What is it, what’s it for? Leading Indicators from the FFF

More? Collaborators: Mike Franklin, Hal Varian -- UCB Lisa Hellerstein & Torsten Suel -- Polytechnic Sirish Chandrasekaran, Amol Deshpande, Sam Madden, Vijayshankar Raman, Fred Reiss, Mehul Shah -- UCB

Backup Slides

Telegraph: Adaptive Dataflow Mixed design philosophy: Tolerate loose coupling and partial failure Adapt online and provide best-effort results Learn statistical properties online Exploit knowledge of semantics via extensible optimization infrastructures Target new networked, data-intensive applications

Adaptive Systems: General Flavor Repeat: 1. Observe (model) environment 2. Use observation to choose behavior 3. Take action

Adaptive Dataflow in DBs: History Rich But Unacknowledged History Codd's data independence predicated on adaptivity! adapt opaquely to changing schema and storage Query optimization does it! statistics-driven optimization key differentiator between DBMSs and other systems

Adaptivity in Current DBs Limited & coarse grain Repeat: 1. Observe (model) environment – runstats (once per week!!): model changes in data 2. Use observation to choose behavior – query optimization: fixes a single static query plan 3. Take action – query execution: blindly follow plan

What’s So Hard Here? Volatile regime Data flows unpredictably from sources Code performs unpredictably along flows Continuous volatility due to many decentralized systems Lots of choices Choice of services Choice of machines Choice of info: sensor fusion, data reduction, etc. Order of operation Maintenance Federated world Partial failure is the common case Adaptivity required!

Adaptive Query Processing Work Late Binding: Dynamic, Parametric [HP88,GW89,IN+92,GC94,AC+96,LP97] Per Query: Mariposa [SA+96], ASE [CR94] Competition: RDB [AZ96] Inter-Op: [KD98], Tukwila [IF+99] Query Scrambling: [AF+96,UFA98] Survey: Hellerstein, Franklin, et al., DE Bulletin 2000 (Diagram: frequency of adaptivity increasing from System R and Ingres DECOMP through late binding, per-query, competition & sampling, inter-operator, and query scrambling to Eddies and future work.)

A Networking Problem!? Networks do dataflow! Significant history of adaptive techniques E.g. TCP congestion control E.g. routing But traditionally much lower function Ship bitstreams Minimal, fixed code Lately, moving up the foodchain? app-level routing active networks politics of growth assumption of complexity = assumption of liability

Networking Code as Dataflow? States & Events, Not Threads Asynchronous events natural to networks State machines in protocol specification and system code Low-overhead, spreading to big systems Totally different programming style remaining area of hacker machismo Eventflow optimization Can’t eventflow be adaptively optimized like dataflow? Why didn’t that happen years ago? Hold this thought

Query Plans are Dataflow Too Programming model: iterators old idea, widely used in DB query processing object with three methods: Init(), GetNext(), Close() input/output types query plan: graph of iterators pipelining: iterators that return results before children Close()
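
A minimal sketch of the iterator model on this slide: each operator exposes Init()/GetNext()/Close(), and a query plan is just a graph of such objects, with pipelined operators returning results before their children close. The Scan and Select operators below are illustrative stand-ins, not Telegraph code:

```python
class Scan:
    """Leaf iterator over an in-memory table (stands in for a file/TeSS reader)."""
    def __init__(self, rows): self.rows = rows
    def init(self): self.pos = 0
    def get_next(self):
        if self.pos >= len(self.rows): return None   # None signals end-of-stream
        row = self.rows[self.pos]; self.pos += 1
        return row
    def close(self): pass

class Select:
    """Pipelined filter: returns rows to its parent before the child closes."""
    def __init__(self, child, predicate): self.child, self.pred = child, predicate
    def init(self): self.child.init()
    def get_next(self):
        while True:
            row = self.child.get_next()
            if row is None or self.pred(row): return row
    def close(self): self.child.close()

# A tiny plan: Select over Scan.  A real plan is a graph of such iterators.
plan = Select(Scan([{"name": "Smith", "state": "DC"},
                    {"name": "Jones", "state": "CA"}]),
              lambda r: r["state"] == "DC")
plan.init()
row = plan.get_next()
while row is not None:
    print(row); row = plan.get_next()
plan.close()
```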

Clever Dataflow Tricks Volcano: “exchange” iterator [Graefe] encapsulate exchange logic in an iterator not in the dataflow system Box-and-arrow programming can ignore parallelism

Some Solutions We’re Focusing On Rivers Adaptive partitioning of work across machines Eddies Adaptive ordering of pipelined operations Quality of Service Online aggregation & data reduction: CONTROL MUST have app-semantics Often may want user interaction UI models of temporal interest Data Dissemination Adaptively choosing what to send, what to cache

River Berkeley built the world-record sorting machine On the NOW: 100 Sun workstations + SAN Only beat the record under ideal conditions No such thing in practice! (Arpaci-Dusseau ×2, with Culler, Hellerstein, Patterson) River: adaptive dataflow on clusters One main idea: Distributed Queues adaptive exchange operator Simplifies management and programming Remzi Arpaci-Dusseau, Eric Anderson, Noah Treuhaft w/Culler, Hellerstein, Patterson, Yelick
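
A toy sketch of the distributed-queue idea: the producer hands each tuple to whichever consumer currently has the shortest backlog, so a slow machine naturally receives less work. The two-consumer setup and timings are invented for illustration and are not River's implementation:

```python
import queue
import threading
import time

def producer(tuples, consumer_queues):
    """Distributed-queue put: route each tuple to the least-loaded consumer."""
    for t in tuples:
        target = min(consumer_queues, key=lambda q: q.qsize())
        target.put(t)
    for q in consumer_queues:
        q.put(None)                       # end-of-stream marker per consumer

def consumer(name, q, delay):
    done = 0
    while (item := q.get()) is not None:
        time.sleep(delay)                 # simulate a fast or slow machine
        done += 1
    print(f"{name} processed {done} tuples")

queues = [queue.Queue() for _ in range(2)]
workers = [threading.Thread(target=consumer, args=("fast", queues[0], 0.001)),
           threading.Thread(target=consumer, args=("slow", queues[1], 0.01))]
for w in workers: w.start()
producer(range(500), queues)
for w in workers: w.join()
```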

River

Multi-Operator Query Plans Deal with pipelines of commutative operators Adapt at finer granularity than current DBMSs

Continuous Adaptivity: Eddies A pipelining tuple-routing iterator just like join or sort or exchange Works best with other pipelining operators like Ripple Joins, online reordering, etc. Ron Avnur & Joe Hellerstein

Continuous Adaptivity: Eddies How to order and reorder operators over time based on performance, economic/admin feedback Vs. River: River optimizes each operator “horizontally” Eddies optimize a pipeline “vertically”

Continuous Adaptivity: Eddies Adjusts flow adaptively Tuples routed through ops in different orders Visit each op once before output Naïve routing policy: All ops fetch from eddy as fast as possible A la River Turns out, doesn’t quite work Only measures rate of work, not benefit Lottery-based routing Uses “lottery scheduling” to address a bandit problem Kris Hildrum, et al. looking at formalizing this Various AI students looking at Reinforcement Learning Competitive Eddies Throw in redundant data access and code modules!

An Aside: n-Arm Bandits A little machine learning problem: Each arm pays off differently Explore? Or Exploit? Sometimes want to randomly choose an arm Usually want to go with the best If probabilities are stationary, dampen exploration over time
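
A small sketch of the explore/exploit tradeoff described here: epsilon-greedy arm selection with exploration damped over time, assuming stationary payoffs. The payoff probabilities below are invented for illustration:

```python
import random

def bandit(pull, n_arms, n_rounds, eps0=0.2):
    """Epsilon-greedy n-armed bandit: usually pull the best-looking arm,
    occasionally explore a random one, damping exploration over time."""
    counts = [0] * n_arms
    means = [0.0] * n_arms
    total = 0.0
    for t in range(1, n_rounds + 1):
        eps = eps0 / t ** 0.5            # dampen exploration (stationary arms)
        if random.random() < eps:
            arm = random.randrange(n_arms)                     # explore
        else:
            arm = max(range(n_arms), key=lambda a: means[a])   # exploit
        reward = pull(arm)
        counts[arm] += 1
        means[arm] += (reward - means[arm]) / counts[arm]      # running mean
        total += reward
    return means, total

# Toy payoffs: arm 2 is best; estimates should converge toward [0.2, 0.5, 0.8].
payoffs = [0.2, 0.5, 0.8]
means, total = bandit(lambda a: 1.0 if random.random() < payoffs[a] else 0.0,
                      n_arms=3, n_rounds=5000)
print(means, total)
```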

Eddies with Lottery Scheduling Operator gets 1 ticket when it takes a tuple Favor operators that run fast (low cost) Operator loses a ticket when it returns a tuple Favor operators with high rejection rate Low selectivity Lottery Scheduling: When two ops vie for the same tuple, hold a lottery Never let any operator go to zero tickets Support occasional random “exploration” Set up “inflation” (forgetting) to adapt over time E.g. tix’ = α·oldtix + newtix
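
A toy sketch of the ticket bookkeeping described on this slide: credit an operator when it takes a tuple, debit it when it returns one, pick the next operator by lottery with a floor of one ticket, and periodically decay tickets so the policy keeps adapting. The two filters and their selectivities are invented, and this is not the Telegraph eddy implementation:

```python
import random

class LotteryEddy:
    """Toy eddy: route each tuple through every operator once, choosing the
    next operator by lottery over tickets (with a floor so no op starves)."""

    def __init__(self, operators, alpha=0.95):
        self.ops = operators              # name -> predicate (True = tuple survives)
        self.tickets = {name: 1.0 for name in operators}
        self.alpha = alpha                # forgetting: tix' = alpha*oldtix + newtix

    def _lottery(self, candidates):
        weights = [max(self.tickets[c], 1.0) for c in candidates]
        return random.choices(candidates, weights=weights, k=1)[0]

    def route(self, tup):
        remaining = list(self.ops)
        while remaining:
            op = self._lottery(remaining)
            remaining.remove(op)
            self.tickets[op] += 1                     # op "took" the tuple
            if self.ops[op](tup):
                self.tickets[op] -= 1                 # op returned it: debit
            else:
                return False                          # tuple rejected: drop it
        return True                                   # survived every operator

    def decay(self):
        for op in self.tickets:
            self.tickets[op] *= self.alpha

# "rare" rejects most tuples, so it accumulates tickets and tends to be
# scheduled first -- the low-selectivity-first behavior the policy rewards.
eddy = LotteryEddy({"rare":   lambda t: t % 100 == 0,
                    "common": lambda t: t % 2 == 0})
out = []
for t in range(10_000):
    if eddy.route(t):
        out.append(t)
    if t % 1000 == 0:
        eddy.decay()          # periodic forgetting keeps the policy adaptive
print(len(out), eddy.tickets)
```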

Promising! Initial performance results Ongoing work on proofs of convergence have analysis for constrained case

To Be Continued Tune & formalize policy Competitive eddies Source & Join selection Requires duplicate management Parallelism Eddies + Rivers? Reliability Long-running flows Rivers + RAID-style computation (Diagram: an Eddy routing tuples from sources R1–R3 and S1–S3 through hash, block, and index join operators.)

To Be Continued, cont. What about wide area? data reduction sensor fusion asynchronous communication Continuous queries events disconnected operation Lower-level eventflow? can eddies, rivers, etc. be brought to bear on programming?