Querying the Internet with PIER
CS294-4, Paul Burstein, 11/10/2003

Slide 2: Outline
- Motivation
- Architecture
- Join Algorithms
- Evaluation
- Discussion

Slide 3: Motivation
- Inject a degree of distribution into databases
- Internet-scale systems vs. hundred-node systems
- Large-scale applications requiring database functionality

Slide 4: Applications
- P2P databases
  - Highly distributed and available data
- Network monitoring
  - Intrusion detection
  - Fingerprint queries

Slide 5: Design Principles
- Relaxed consistency
  - Sacrifice consistency in favor of availability and partition tolerance
- Organic scaling
  - Growth with deployment
- Natural habitats for data
  - Data remains in its original format with a DB interface
- Standard schemas
  - Achieved through common software

Slide 7: PIER Architecture

Slide 8: DHT Design
- Implemented with CAN and Chord
- Routing layer: mapping for keys
- Storage manager: node data storage
- Provider: storage access interface for higher levels

Slide 9: Routing & Storage
- Routing layer
  - DHT-based API
  - locationMapChange: local key set change
- Storage manager
  - Easy-to-realize API
  - Efficient performance relative to the network
  - Main-memory storage manager used

Slide 10: Provider
- Couples the routing and storage layers
- namespace: relation
- resourceId: primary key
- namespace + resourceId → key
- instanceId: distinguishes objects with the same namespace and resourceId
- lifetime: item storage duration
- multicast: contacts a namespace's nodes
- lscan: iterates over a node's local data
- newData: application callback on data arrival
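
The Provider naming scheme above can be illustrated with a minimal single-node sketch. Everything here is an assumption made for illustration, not PIER's actual code: the class `LocalProvider`, the `put` and `get` method names, the use of SHA-1 to derive the key, and the injectable clock are all hypothetical. Only the concepts (namespace, resourceId, instanceId, lifetime, lscan, newData) come from the slide.

```python
import hashlib
import time
from dataclasses import dataclass
from typing import Any, Callable, Dict, Iterator, List, Tuple


def dht_key(namespace: str, resource_id: str) -> int:
    """Derive a key from namespace + resourceId (hash choice is an assumption)."""
    digest = hashlib.sha1(f"{namespace}\x00{resource_id}".encode()).digest()
    return int.from_bytes(digest[:8], "big")


@dataclass
class Item:
    namespace: str
    resource_id: str
    instance_id: int
    value: Any
    expires_at: float


class LocalProvider:
    """Hypothetical single-node stand-in for a Provider; no routing is done."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._items: Dict[Tuple[int, int], Item] = {}
        self._callbacks: Dict[str, List[Callable[[Item], None]]] = {}
        self._clock = clock

    def put(self, namespace: str, resource_id: str, instance_id: int,
            value: Any, lifetime: float) -> None:
        key = dht_key(namespace, resource_id)
        item = Item(namespace, resource_id, instance_id, value,
                    self._clock() + lifetime)
        # instanceId keeps items with the same key from overwriting each other.
        self._items[(key, instance_id)] = item
        for callback in self._callbacks.get(namespace, []):
            callback(item)  # newData-style notification on arrival

    def get(self, namespace: str, resource_id: str) -> List[Item]:
        key = dht_key(namespace, resource_id)
        now = self._clock()
        return [item for (k, _), item in self._items.items()
                if k == key and item.expires_at > now]

    def lscan(self, namespace: str) -> Iterator[Item]:
        """Iterate over the unexpired items of one namespace stored locally."""
        now = self._clock()
        for item in list(self._items.values()):
            if item.namespace == namespace and item.expires_at > now:
                yield item

    def new_data(self, namespace: str,
                 callback: Callable[[Item], None]) -> None:
        """Register a callback invoked when data arrives for the namespace."""
        self._callbacks.setdefault(namespace, []).append(callback)
```

In this sketch two tuples with the same namespace and resourceId land under the same key and are told apart only by instanceId, which is the role the slide assigns to that field.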

Slide 11: PIER Query Processor
- Query dataflow engine
- Operators: selection, projection, joins, grouping, aggregation
- Operators push and pull data
- Currently, data modification is through the DHT interface
- Relaxed consistency and reachable snapshot
  - Works only with nodes reachable at the time a query is issued

Slide 13: Join Algorithms
- Symmetric hash join
  - Rehashes the relations
  - Scan and copy
- Fetch matches
  - One relation already hashed on the join attribute
- R, S: relations
- Nr, Ns: relation namespaces
- Nq: DHT-based temporary table
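
The two core join strategies can be sketched on a single machine as follows. This is a minimal illustration under stated assumptions, not PIER's implementation: the in-memory dictionary `nq` stands in for the DHT-based temporary table Nq, the dictionary `s_index` stands in for a namespace Ns that is already hashed on the join attribute, and the function names are hypothetical.

```python
from collections import defaultdict
from typing import Callable, Dict, Hashable, Iterable, Iterator, List, Tuple

Row = Tuple
Pair = Tuple[Row, Row]


def symmetric_hash_join(
    arrivals: Iterable[Tuple[str, Row]],
    r_key: Callable[[Row], Hashable],
    s_key: Callable[[Row], Hashable],
) -> Iterator[Pair]:
    """Join R and S tuples as they arrive, in any interleaving.

    Each tuple is rehashed on its join value into nq; on arrival it probes
    the tuples of the other relation already stored under the same value.
    """
    nq: Dict[Hashable, Dict[str, List[Row]]] = defaultdict(
        lambda: {"R": [], "S": []}
    )
    for side, row in arrivals:
        key = r_key(row) if side == "R" else s_key(row)
        bucket = nq[key]
        other = "S" if side == "R" else "R"
        for match in bucket[other]:
            # Always emit pairs in (R tuple, S tuple) order.
            yield (row, match) if side == "R" else (match, row)
        bucket[side].append(row)


def fetch_matches_join(
    r_rows: Iterable[Row],
    s_index: Dict[Hashable, List[Row]],
    r_key: Callable[[Row], Hashable],
) -> Iterator[Pair]:
    """Join when S is already hashed on the join attribute.

    Only R is scanned; each R tuple issues one lookup for its join value.
    """
    for r in r_rows:
        for s in s_index.get(r_key(r), []):
            yield (r, s)
```

The contrast the sketch is meant to show: the symmetric hash join moves every tuple of both relations into the temporary table, while fetch matches moves nothing up front and instead pays one lookup per scanned tuple.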

Slide 14: Join Rewriting
- Aimed at lowering bandwidth utilization
- Symmetric semi-join
  - Local projections to join keys
  - Global fetch-matches join
- Bloom joins
  - Local Bloom filters are published into temporary namespaces
  - Filters are multicast to the opposite relation's nodes
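
The Bloom join rewrite can be sketched as a pre-filtering step that runs before any tuples are rehashed. This is an illustrative sketch under assumptions: the `BloomFilter` class, its sizes, the SHA-1 based hashing, and the helpers `summarize` and `bloom_prefilter` are hypothetical, and combining the per-node filters with a bitwise OR is the standard Bloom filter union rather than something stated on the slide. Each inner list passed to `bloom_prefilter` stands in for one node's local tuples.

```python
import hashlib
from typing import Callable, Hashable, Iterator, List, Sequence, Tuple

Row = Tuple


class BloomFilter:
    """Small Bloom filter: may report false positives, never false negatives."""

    def __init__(self, num_bits: int = 1024, num_hashes: int = 3) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # bit array stored as one Python integer

    def _positions(self, value: Hashable) -> Iterator[int]:
        for i in range(self.num_hashes):
            digest = hashlib.sha1(f"{i}:{value!r}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, value: Hashable) -> None:
        for pos in self._positions(value):
            self.bits |= 1 << pos

    def might_contain(self, value: Hashable) -> bool:
        return all((self.bits >> pos) & 1 for pos in self._positions(value))

    def union(self, other: "BloomFilter") -> "BloomFilter":
        merged = BloomFilter(self.num_bits, self.num_hashes)
        merged.bits = self.bits | other.bits
        return merged


def summarize(partitions: Sequence[Sequence[Row]],
              key: Callable[[Row], Hashable]) -> BloomFilter:
    """Build one local filter per node, then OR them into a relation filter."""
    combined = BloomFilter()
    for local_rows in partitions:
        local = BloomFilter()
        for row in local_rows:
            local.add(key(row))
        combined = combined.union(local)
    return combined


def bloom_prefilter(
    r_partitions: Sequence[Sequence[Row]],
    s_partitions: Sequence[Sequence[Row]],
    key: Callable[[Row], Hashable],
) -> Tuple[List[Row], List[Row]]:
    """Keep only tuples whose join key may appear in the opposite relation."""
    r_filter = summarize(r_partitions, key)
    s_filter = summarize(s_partitions, key)
    r_kept = [row for part in r_partitions for row in part
              if s_filter.might_contain(key(row))]
    s_kept = [row for part in s_partitions for row in part
              if r_filter.might_contain(key(row))]
    return r_kept, s_kept
```

The surviving tuples would then be fed to one of the core join algorithms. Because a Bloom filter can produce false positives, some non-matching tuples may still be shipped, but no matching tuple is dropped.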

Slide 15: How does this scale?

Slide 17: Workload Parameters
- CAN configuration: d = 4
- R is 10 times larger than S
- Constants provide 50% selectivity
- f(x,y) is evaluated after the join
- 90% of R tuples match a tuple in S
- Result tuples are 1KB each
- Symmetric hash join used

Slide 18: Simulation Setup
- Up to 10,000 nodes
- Network cross-traffic, CPU, and memory utilization ignored
- 1. Fully connected links, 100ms and 10Mbps
- 2. GT-ITM transit-stub topology

Slide 19: Scalability
- 1MB data per node
- Fully connected topology
- Variable number of computation nodes
- Network congestion is an issue with few computation nodes
- How is the computation workload distributed?

Slide 20: Join Algorithms (1/2)
- Infinite bandwidth
- 1024 data and computation nodes
- Core join algorithms
  - Perform faster
- Rewrites
  - Bloom filter: two multicasts
  - Semi-join: two CAN lookups

Slide 21: Join Algorithms (2/2)
- Limited bandwidth
- 10Mbps inbound capacity
- 25GB relations, 1024 nodes
- Symmetric hash join
  - Rehashes both tables
- Semi-join
  - Transfers only matching tuples
- At 40% selectivity, the bottleneck switches from computation nodes to query sites

Slide 22: Soft State
- Failure detection and recovery
- 15-second failure detection
- 4096 nodes
- Refresh period
  - Time to reinsert lost tuples
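
The soft-state idea behind the refresh period can be sketched as a store whose entries expire unless their publisher keeps renewing them. This is a minimal sketch under assumptions: the class `SoftStateStore`, the helper `refresh`, the explicit `now` argument (used instead of a real clock so the behavior is easy to follow), and the numbers in the usage note are all hypothetical and are not taken from the evaluation.

```python
from typing import Dict, Hashable, Set


class SoftStateStore:
    """Hypothetical per-node store in which every entry carries a lifetime."""

    def __init__(self) -> None:
        self._expires_at: Dict[Hashable, float] = {}

    def put(self, key: Hashable, now: float, lifetime: float) -> None:
        self._expires_at[key] = now + lifetime

    def renew(self, key: Hashable, now: float, lifetime: float) -> bool:
        """Extend a live entry; return False if it is missing or expired."""
        if key not in self.live_keys(now):
            return False
        self._expires_at[key] = now + lifetime
        return True

    def live_keys(self, now: float) -> Set[Hashable]:
        return {k for k, t in self._expires_at.items() if t > now}

    def fail(self) -> None:
        """Simulate a node failure: all stored state is lost."""
        self._expires_at.clear()


def refresh(store: SoftStateStore, key: Hashable,
            now: float, lifetime: float) -> None:
    """Publisher-side refresh: renew if possible, otherwise reinsert."""
    if not store.renew(key, now, lifetime):
        store.put(key, now, lifetime)
```

For example, if a publisher calls `refresh` every 20 time units with a lifetime of 30 and the storing node fails just after a refresh, the tuple stays missing until the next refresh reinserts it. In this sketch the refresh period therefore bounds the time to reinsert lost tuples.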

Slide 23: Transit Stub Topology
- GT-ITM
- 4 domains, 10 nodes per domain, 3 stubs per node
- 50ms, 10ms, 2ms latency
- 10Mbps inbound links
- Similar trends to the fully connected topology
- Slightly longer end-to-end delays

Slide 24: Experimental Results
- 64 PCs on a 1Gbps network
- All nodes are computation nodes

Slide 26: Discussion
- PIER presents a distributed query engine
- What remains to be done?
  - DB issues
  - Networking issues
