1
Querying The Internet With PIER Nitin Khandelwal
2
Motivation
Inject a degree of distribution into databases
Internet-scale systems vs. hundred-node systems
Large-scale applications requiring database functionality
3
Applications
P2P databases: highly distributed and available data
Network monitoring: intrusion detection, fingerprint queries
4
Design Principles
Relaxed consistency: sacrifice consistency in the face of availability and partition tolerance
Organic scaling: growth with deployment
Natural habitats for data: data remains in its original format with a DB interface
Standard schemas: achieved through common software
5
DHTs
Implemented with CAN (Content Addressable Network)
Each node owns a hyper-rectangle (zone) in a d-dimensional space
A key is hashed to a point and stored at the node whose zone contains that point
Each node maintains a routing table of O(d) neighbours
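For illustration, a minimal Python sketch of the CAN-style placement described above: a key is hashed to a point in a d-dimensional space and stored at the node whose zone contains it. The helper names (hash_to_point, Zone) and the choice of SHA-1 are assumptions made for the sketch, not PIER's actual implementation.

import hashlib

D = 4  # dimensionality of the CAN coordinate space

def hash_to_point(key: bytes, d: int = D) -> tuple:
    """Hash a key to a point in the d-dimensional unit hypercube."""
    digest = hashlib.sha1(key).digest()
    # Use d disjoint 4-byte slices of the digest as coordinates in [0, 1).
    return tuple(int.from_bytes(digest[4*i:4*i+4], "big") / 2**32 for i in range(d))

class Zone:
    """A node's hyper-rectangle: per-dimension [low, high) intervals."""
    def __init__(self, lows, highs):
        self.lows, self.highs = lows, highs

    def contains(self, point) -> bool:
        return all(lo <= x < hi for lo, hi, x in zip(self.lows, self.highs, point))

# The node whose zone contains hash_to_point(key) stores that key; routing
# greedily forwards toward the point through O(d) neighbours per hop.
node_zone = Zone([0.0] * D, [0.5] * D)
print(node_zone.contains(hash_to_point(b"namespace:resourceId")))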
6
DHT Design
Routing layer: maps keys to nodes (dynamic as nodes join and leave)
Storage manager: DHT-based data storage
Provider: storage access interface for the higher levels
7
Provider
Couples the routing and storage layers
namespace – relation name
resourceId – primary key
namespace + resourceId → DHT key
instanceId – distinguishes objects with the same namespace and resourceId
lifetime – how long an item is stored
Interface calls: LScan, Multicast, NewData
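A rough sketch of how the Provider's naming could compose a DHT key from namespace + resourceId, with instanceId and lifetime carried as item metadata; the function names and dictionary layout are illustrative assumptions, not PIER's actual interface.

import hashlib
import time

def dht_key(namespace: str, resource_id: str) -> bytes:
    """namespace + resourceId together determine the DHT key (routing target)."""
    return hashlib.sha1(f"{namespace}/{resource_id}".encode()).digest()

def make_item(namespace, resource_id, instance_id, payload, lifetime_s):
    """instanceId distinguishes objects sharing namespace + resourceId;
    lifetime (soft state) bounds how long the item is stored."""
    return {
        "key": dht_key(namespace, resource_id),
        "instanceId": instance_id,
        "payload": payload,
        "expires": time.time() + lifetime_s,
    }

item = make_item("R", "tuple-42", instance_id=1, payload={"a": 7}, lifetime_s=60)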
8
PIER Query Processor
Operators: selection, projection, join, grouping, aggregation
Operators push and pull data
Relaxed consistency and the reachable snapshot:
- a query nominally runs over the nodes reachable at the instant it is issued
- since that instant is not well defined network-wide, each node instead uses the arrival of the query multicast message
9
Join Algorithm
R, S – relations; Nr, Ns – their namespaces; Nq – DHT-based temporary table (namespace)
Symmetric hash join:
- rehashes both relations on the join attribute
- each node scans its local data and copies tuples into the new namespace Nq (see the sketch after this list)
Fetch matches:
- one relation (S) is already hashed on the join attribute
- selections on non-join attributes of S cannot be pushed into the DHT
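The symmetric hash join can be sketched as follows: every arriving tuple is inserted into a hash table for its own relation and immediately probed against the other relation's table, so join results stream out as data arrives. This is a simplified single-node sketch under assumed names; in PIER both relations are first rehashed into the temporary namespace Nq so that matching tuples meet on the same node.

from collections import defaultdict

def symmetric_hash_join(stream, key_r, key_s):
    """`stream` yields ("R", tuple) / ("S", tuple) in arrival order;
    key_r / key_s extract the join attribute. Yields joined pairs."""
    tables = {"R": defaultdict(list), "S": defaultdict(list)}
    keys = {"R": key_r, "S": key_s}
    for rel, tup in stream:
        k = keys[rel](tup)
        tables[rel][k].append(tup)               # insert into own table
        other = "S" if rel == "R" else "R"
        for match in tables[other].get(k, []):   # probe the other table
            yield (tup, match) if rel == "R" else (match, tup)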
10
Join Rewriting
Aimed at lowering bandwidth utilization
Symmetric semi-join:
- locally project each relation to its resourceId + join keys
- symmetric hash join on the two projections
- fetch-matches joins using the resourceIds of R and S retrieve the full tuples
Bloom join (hashed semi-join):
- a Bloom filter is a hash-based bit vector
- local Bloom filters are published into temporary namespaces
- filters are OR-ed together and multicast to the opposite relation's nodes
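As a sketch of the Bloom-join idea: each node summarizes its local join keys in a bit vector, the vectors are OR-ed, and only tuples whose keys may be in the combined filter are shipped. The filter size M, hash count K, and use of SHA-1 below are assumptions for illustration, not PIER's parameters.

import hashlib

M, K = 1 << 16, 3  # assumed filter size in bits and number of hash functions

def bloom_positions(value: str):
    for i in range(K):
        h = hashlib.sha1(f"{i}:{value}".encode()).digest()
        yield int.from_bytes(h[:8], "big") % M

def build_filter(join_keys):
    bits = 0
    for v in join_keys:
        for pos in bloom_positions(v):
            bits |= 1 << pos
    return bits

# Local filters are published, OR-ed together, and multicast to the other
# relation's nodes; a node ships only tuples whose key may be in the filter.
combined = build_filter(["k1", "k2"]) | build_filter(["k3"])
maybe_matches = all((combined >> p) & 1 for p in bloom_positions("k1"))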
11
Workload Parameters
CAN configuration: d = 4
R is 10 times larger than S
Constants provide 50% selectivity
f(x, y) is evaluated after the join
90% of R tuples match a tuple in S
Result tuples are 1 KB each
Symmetric hash join used
12
Simulation Setup
Up to 10,000 nodes
Network cross-traffic, CPU, and memory utilization ignored
Data shipped from source to computation node for every query operation
1. 100 ms, 10 Mbps fully connected links
2. GT-ITM transit-stub topology (similar results)
13
Join Algorithms
Infinite bandwidth (observe the impact of propagation delay alone)
1024 data and computation nodes
Core join algorithms perform faster
Rewrites add latency:
- Bloom filter: two multicasts
- Semi-join: two CAN lookups
14
Join Algorithms – 2
Limited bandwidth
Symmetric hash join: rehashes both tables
Semi-joins: transfer only matching tuples
At 40% selectivity, the bottleneck switches from the computation nodes to the query sites
15
Conclusions
PIER's scalability derives from its relaxed design principles:
- adoption of soft state
- dilated snapshot semantics
Limitation: equality predicates only
Directions:
- pushdown of selections into the DHT
- caching and replication of DHT data
- catalog manager – stringent consistency and availability requirements
16
Sophia: An Information Plane Nitin Khandelwal
17
Shared Information Plane
A distributed system running throughout the network
- Collects information about network elements: local state (load, memory usage), local perspective (reachability of other nodes)
- Evaluates statements (questions) about that state
- Reacts according to the conclusions, e.g. killing a misbehaving service
18
Challenges
Information is widely distributed and dynamic
Statements are formulated at run-time, not a priori
Centralized analysis is not practical
Push analysis to the nodes (push into the network)
19
Approach
Use a logic programming model
- the system is dynamic and distributed, so the logic is temporal and positional
Why logic programming?
- Expressivity: intuitive to make statements about the state of the system
- Performance: logic expressions can be transformed for efficient evaluation; partial results can be cached
20
Time and Position in the Language
Every term in the system has an environment containing time and location
Eval(bandwidth(env(at(node(Node)), time(Time), Time > 1032445465, BwVar), BwVar > 40000))
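A loose Python analogue (not Sophia's actual Prolog-style syntax) of evaluating a term under an environment that carries time and location; every name here, including read_bandwidth_sensor, is hypothetical and stands in for a real sensor read.

from dataclasses import dataclass

@dataclass
class Env:
    node: str     # where the term should be evaluated
    time: float   # timestamp the evaluation refers to

def read_bandwidth_sensor(node: str) -> float:
    return 50000.0  # placeholder sensor value for the sketch

def eval_bandwidth(env: Env, min_time: float, threshold: float) -> bool:
    """True if a bandwidth reading at env.node, taken after min_time,
    exceeds threshold (mirrors the slide's Time and BwVar constraints)."""
    if env.time <= min_time:
        return False
    return read_bandwidth_sensor(env.node) > threshold

print(eval_bandwidth(Env(node="Node", time=1032445466), min_time=1032445465, threshold=40000))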
21
Performance
Aggressive caching:
- evaluation results are cached
- sometimes latency is more important than freshness
- the time environment is used to control freshness
Scheduling:
- pre-schedule results to be available when and where they may be needed
- the cache can be refreshed with fresh values
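A small sketch of result caching where the caller's tolerance for staleness, derived from the time environment, decides whether a cached value may be reused; the class and method names are illustrative only.

import time

class FreshnessCache:
    def __init__(self):
        self._store = {}  # key -> (value, computed_at)

    def get(self, key, max_age_s):
        """Return a cached value if it is fresh enough, else None."""
        entry = self._store.get(key)
        if entry is None:
            return None
        value, computed_at = entry
        return value if time.time() - computed_at <= max_age_s else None

    def put(self, key, value):
        self._store[key] = (value, time.time())

cache = FreshnessCache()
cache.put(("bandwidth", "nodeA"), 38000)
# A latency-sensitive query may accept results up to 30 s old:
stale_ok = cache.get(("bandwidth", "nodeA"), max_age_s=30)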
22
Evaluation Planning
Given an expression, plan:
- where to evaluate (close to the data)
- when to evaluate (when dependencies are resolved)
- what to evaluate
Logic expressions can be transformed at runtime
23
Extensibility
Users can add new functionality at run-time
Capabilities: used to protect modules and to grant and revoke privileges
cap569354(Val) :- read sensor.
cap435456(Val) :- cap569354(Val).
bandwidth(Val) :- cap435456(Val).
Module protection: all predicates are transformed into capabilities, shared through a master key capability
Danger in caching – different interfaces
24
PIER and Sophia
Sophia: the location of code execution is explicit in the language and can itself be determined in the course of evaluation
PIER: the details of query execution are left to the underlying implementation to optimize
Consequence: Sophia queries are more sophisticated – both user and system participate in evaluation planning