1
Declarative Distributed Programming with Dedalus and Bloom Peter Alvaro, Neil Conway UC Berkeley
2
This Talk 1. Background – BOOM Analytics 2. Theory – Dedalus – CALM 3. Practice – Bloom – Lattices
3
Berkeley Orders of Magnitude Vision: Can we build small programs for large distributed systems? Approach: language → system design; system → language design
4
Initial Language: Overlog Data-centric programming – Uniform representation of system state High-level, declarative query language – Distributed variant of Datalog Express systems as queries
5
BOOM Analytics Goal: “Big Data” stack – API-compliant – Competitive performance System: [EuroSys’10] – Distributed file system (HDFS-compatible) – Hadoop job scheduler
6
What Worked Well Concise, declarative implementation – 10-20x more concise than Java (LOCs) – Similar performance (within 10-20%) Separation of policy and mechanism Ease of evolution 1. High availability (failover + Paxos) 2. Scalability (hash partitioned FS master) 3. Monitoring as an aspect
7
What Worked Poorly Unclear semantics – “Correct” semantics defined by interpreter behavior In particular, 1. change (e.g., state update) 2. uncertainty (e.g., async communication)
8
Temporal Ambiguity Goal: Increment a counter upon “request” message Send response message with value of counter counter(“hostname”,0). counter(To,X+1) :- counter(To,X), req(To,_). response(@From,X) :- counter(@To,X), req(@To,From). When is counter incremented? What does response contain?
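The ambiguity can be made concrete with a small sketch (not the Overlog interpreter — a hypothetical simulation): depending on whether the increment rule fires before or after the response rule reads the counter, the response carries a different value.

```python
# Sketch: two plausible evaluation orders for the Overlog counter rules.
# "increment" models counter(To,X+1) :- counter(To,X), req(To,_).
# "respond" models response(@From,X) :- counter(@To,X), req(@To,From).

def run(order):
    counter = 0
    response = None
    for step in order:
        if step == "increment":
            counter += 1          # increment fires
        elif step == "respond":
            response = counter    # response reads the current counter value
    return response

print(run(["respond", "increment"]))  # 0: response sees the pre-increment value
print(run(["increment", "respond"]))  # 1: response sees the post-increment value
```

Both orders are consistent with the rules as written, which is exactly the ambiguity the slide asks about.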
9
Implicit Communication Implicit communication was the wrong abstraction for systems programming. – Hard to reason about partial failure Example: we never used distributed joins in the file system! path(@S,D) :- link(@S,Z), path(@Z,D).
10
Received Wisdom We argue that objects that interact in a distributed system need to be dealt with in ways that are intrinsically different from objects that interact in a single address space. These differences are required because distributed systems require that the programmer be aware of latency, have a different model of memory access, and take into account issues of concurrency and partial failure. Jim Waldo et al., A Note on Distributed Computing (1994)
11
Dedalus (it’s about time) Explicitly represent logical time as an attribute of all knowledge “Time is a device that was invented to keep everything from happening at once.” (Unattributed)
12
Dedalus: Syntax Datalog + temporal modifiers 1. Instantaneous (deduction) 2. Deferred (sequencing) True at successor time 3. Asynchronous (communication) True at nondeterministic future time
13
Dedalus: Syntax (1) Deductive rule: (Plain Datalog) (2) Inductive rule: (Constraint across “next” timestep) (3) Async rule: (Constraint across arbitrary timesteps) p(A,B,S) :- q(A,B,T), T=S. p(A,B,S) :- q(A,B,T), S=T+1. p(A,B,S) :- q(A,B,T), time(S), choose((A,B,T), (S)). Logical time
14
Syntax Sugar (1) Deductive rule: (Plain Datalog) (2) Inductive rule: (Constraint across “next” timestep) (3) Async rule: (Constraint across arbitrary timesteps) p(A,B) :- q(A,B). p(A,B)@next :- q(A,B). p(A,B)@async :- q(A,B).
15
State Update p(A, B)@next :- p(A, B), notin p_del(A, B). Example Trace: p(1, 2)@101; p(1, 3)@102; p_del(1, 3)@300;

  Time   p(1,2)   p(1,3)   p_del(1,3)
  101      ✓
  102…     ✓        ✓
  300      ✓        ✓         ✓
  301      ✓
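The persistence rule can be simulated directly: a fact holds from the step it is asserted through the step its deletion fact appears, and is absent from the next step onward. A minimal sketch (the `trace` helper is hypothetical, not part of Dedalus):

```python
# Sketch: simulating p(A,B)@next :- p(A,B), notin p_del(A,B).
# A fact persists until a matching p_del fact appears; it is still true
# at the deletion step itself, and gone at the following step.

def trace(insertions, deletions, until):
    """insertions/deletions: dict fact -> timestep.
    Returns fact -> set of timesteps at which the fact holds."""
    holds = {}
    for fact, t_in in insertions.items():
        t_del = deletions.get(fact)
        end = t_del + 1 if t_del is not None else until
        holds[fact] = set(range(t_in, end))
    return holds

h = trace({(1, 2): 101, (1, 3): 102}, {(1, 3): 300}, until=305)
assert 300 in h[(1, 3)]      # still true at the step the deletion arrives
assert 301 not in h[(1, 3)]  # absent at the next step
assert 304 in h[(1, 2)]      # never deleted: persists indefinitely
```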
16
Logic and time Key relationships: Atomicity Mutual exclusion Sequentiality Overlog: Relationships among facts Dedalus: Also, relationships between states
17
Change and Asynchrony
Overlog:
  counter(“hostname”,0).
  counter(To,X+1) :- counter(To,X), req(To,_).
  response(@From,X) :- counter(@To,X), req(@To,From).
Dedalus:
  counter(“hostname”,0).
  counter(To,X+1)@next :- counter(To,X), req(To,_).
  counter(To,X)@next :- counter(To,X), notin req(To,_).
  response(@From,X)@async :- counter(@To,X), req(@To,From).
Increment is deferred; the pre-increment value is sent; delivery time is non-deterministic.
18
Dedalus: Semantics Goal: declarative semantics – Reason about a program’s meaning rather than its executions Approach: model theory
19
Minimal Models A negation-free (monotonic) Datalog program has a unique minimal model.
  Model: no “missing” facts
  Minimal: no “extra” facts
  Unique: program has a single meaning
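The unique minimal model can be computed by naive bottom-up evaluation: apply every rule until no new facts appear. A sketch using the deck's own reachability rules (path(S,D) :- link(S,D). path(S,D) :- link(S,Z), path(Z,D).):

```python
# Sketch: naive bottom-up fixpoint for a negation-free Datalog program.
#   path(S,D) :- link(S,D).
#   path(S,D) :- link(S,Z), path(Z,D).

def transitive_closure(links):
    path = set(links)                 # base rule: every link is a path
    changed = True
    while changed:                    # iterate to fixpoint
        changed = False
        for (s, z) in links:
            for (z2, d) in list(path):
                if z == z2 and (s, d) not in path:
                    path.add((s, d))  # recursive rule fires
                    changed = True
    return path

links = {(1, 2), (2, 3), (3, 4)}
model = transitive_closure(links)
# "model": no missing facts; "minimal": no extra facts; "unique": no choices made
assert model == {(1, 2), (2, 3), (3, 4), (1, 3), (2, 4), (1, 4)}
```

No rule application involves a choice, which is why the fixpoint is the program's single meaning.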
20
Stable Models The consequences of async rules hold at a nondeterministic future time – Captured by the choice construct Greco and Zaniolo (1998) – Each choice leads to a distinct model Intuition: A stable model is an execution trace
21
Traces and models counter(To, X+1)@next :- counter(To, X), request(To, _). counter(To, X)@next :- counter(To, X), notin request(To, _). response(@From, X)@async :- counter(@To, X), request(@To, From). response(From, X)@next :- response(From, X). Persistence rules lead to infinitely large models Async rules lead to infinitely many models
22
An Execution counter(To, X+1)@next :- counter(To, X), request(To, _). counter(To, X)@next :- counter(To, X), notin request(To, _). response(@From, X)@async :- counter(@To, X), request(@To, From). response(From, X)@next :- response(From, X). counter(“node1”, 0)@0. request(“node1”, “node2”)@0.
23
A Stable Model counter(“node1”, 0)@0. request(“node1”, “node2”)@0. counter(“node1”, 1)@1. counter(“node1”,1)@2. […] response(“node2”, 0)@100. counter(“node1”, 1)@101. counter(“node1”, 1)@102. response(“node2”, 0)@101. response(“node2”, 0)@102. […] A stable model for choice = 100 counter(To, X+1)@next :- counter(To, X), request(To, _). counter(To, X)@next :- counter(To, X), notin request(To, _). response(@From, X)@async :- counter(@To, X), request(@To, From). response(From, X)@next :- response(From, X).
24
Ultimate Models A stable model characterizes an execution – Many of these models are not “interestingly” different Wanted: a characterization of outcomes – An ultimate model contains exactly those facts that are “eventually always true”
25
Traces and models counter(To,X+1)@next :- counter(To,X), request(To,_). counter(To,X)@next :- counter(To,X), notin request(To,_). response(@From,X)@async :- counter(@To,X), request(@To,From). response(From,X)@next :- response(From,X). counter(“node1”, 0)@0. request(“node1”, “node2”)@0. counter(“node1”, 1)@1. counter(“node1”, 1)@2. […] response(“node2”, 0)@100. counter(“node1”, 1)@101. response(“node2”, 0)@101. […] counter(“node1”, 1)@102. response(“node2”, 0)@102.
26
Traces and models counter(To,X+1)@next :- counter(To,X), request(To,_). counter(To,X)@next :- counter(To,X), notin request(To,_). response(@From,X)@async :- counter(@To,X), request(@To,From). response(From,X)@next :- response(From,X). counter(“node1”, 1). response(“node2”, 0). Ultimate Model
27
Confluence This program has a unique ultimate model – In fact, all negation-free Dedalus programs have a unique ultimate model [DL2’12] We call such programs confluent: same program outcome, regardless of network non-determinism
28
The Bloom Programming Language
30
Lessons from Dedalus 1. Clear program semantics is essential 2. Avoid implicit communication 3. Confluence seems promising
31
Lessons From Building Systems 1. Syntax matters! – Datalog syntax is cryptic and foreign 2. Adopt, don’t reinvent – DSL > standalone language – Use host language’s type system (E. Meijer) 3. Modularity is important – Scoping – Encapsulation
32
Bloom Operational Model
33
Bloom Rule Syntax
  <=  now              (local computation)
  <+  next             (state update)
  <-  delete at next   (state update)
  <~  async            (asynchronous message passing)
Collections: table (persistent state), scratch (transient state), channel (network transient)
Methods: map, flat_map, reduce, group, join, outerjoin, empty?, include?
34
Example: Quorum Vote

  QUORUM_SIZE = 5
  RESULT_ADDR = "example.org"

  class QuorumVote
    include Bud                                 # Ruby class definition

    state do                                    # Bloom state
      channel :vote_chn, [:@addr, :voter_id]    # communication interfaces
      channel :result_chn, [:@addr]
      table :votes, [:voter_id]                 # coordinator state
      scratch :cnt, [] => [:cnt]
    end

    bloom do                                    # Bloom logic
      votes <= vote_chn {|v| [v.voter_id]}      # accumulate votes
      cnt <= votes.group(nil, count(:voter_id)) # count votes
      result_chn <~ cnt {|c| [RESULT_ADDR] if c.cnt >= QUORUM_SIZE}  # asynchronous messaging: send when quorum reached
    end
  end
35
Question: How does confluence relate to practical problems of distributed consistency?
36
Common Technique: Replicate state at multiple sites, for: Fault tolerance Reduced latency Read throughput
37
Problem: Different replicas might observe events in different orders … and then reach different conclusions!
38
Alternative #1: Enforce consistent event order at all nodes (“Strong Consistency”)
39
Alternative #1: Enforce consistent event order at all nodes (“Strong Consistency”) Problems: Availability CAP Theorem Latency
40
Alternative #2: Achieve correct results for any network order (“Weak Consistency”)
41
Alternative #2: Achieve correct results for any network order (“Weak Consistency”) Concerns: Writing order-independent programs is hard!
42
Challenge: How can we make it easier to write order-independent programs?
43
Order-Independent Programs Alternative #1: – Start with a conventional language – Reason about when order can be relaxed This is hard! Especially for large programs.
44
Taking Order For Granted Data: (ordered) array of bytes. Compute: (ordered) sequence of instructions. Writing order-sensitive programs is too easy!
45
Order-Independent Programs Alternative #1: – Start with a conventional language – Reason about when order can be relaxed This is hard! Especially for large programs. Alternative #2: – Start with an order-independent language – Add order explicitly, only when necessary – “Disorderly Programming”
46
(Leading) Question: So, where might we find a nice order-independent programming language? Recall: All monotone Dedalus programs are confluent.
47
Monotonic Logic As input set grows, output set does not shrink Order independent e.g., map, filter, join, union, intersection Non-Monotonic Logic New inputs might invalidate previous outputs Order sensitive e.g., aggregation, negation
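The contrast can be checked exhaustively on a tiny input: a monotone query (accumulate into a set, then filter) yields one outcome under every arrival order, while a query involving negation ("emit x if y has not arrived") can observe different things in different orders. A sketch with hypothetical tuples:

```python
# Sketch: order-independence of monotone logic vs. order-sensitivity
# of non-monotone logic, checked over all arrival orders.
from itertools import permutations

inputs = [("a", 1), ("b", 2), ("c", 3)]

# Monotone: final output = a filter over the accumulated set
finals = set()
for order in permutations(inputs):
    seen = set()
    for item in order:
        seen.add(item)
    finals.add(frozenset(x for x in seen if x[1] > 1))
assert len(finals) == 1  # one outcome, regardless of order

# Non-monotone: "emit item while ('b', 2) has NOT arrived" depends on timing
observations = set()
for order in permutations(inputs):
    seen, outputs = set(), []
    for item in order:
        seen.add(item)
        if ("b", 2) not in seen:
            outputs.append(item)  # a new input can invalidate this later
    observations.add(tuple(outputs))
assert len(observations) > 1  # different orders, different observations
```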
48
Consistency As Logical Monotonicity CALM Analysis [CIDR’11] 1. Monotone programs are deterministic (confluent) [Ameloot’11, Marczak’12] 2. Simple syntactic test for monotonicity Result: Whole-program static analysis for eventual convergence
49
Case Study
50
Scenario
54
Questions 1. Will cart replicas eventually converge? – “Eventual Consistency” 2. What will client observe on checkout? – Goal: checkout reflects all session actions 3. To achieve #1 and #2, how much additional coordination is required?
55
Design #1: Mutable State

  Add(item x, count c):
    if kvs[x] exists:
      old = kvs[x]
      kvs.delete(x)
    else:
      old = 0
    kvs[x] = old + c

  Remove(item x, count c):
    if kvs[x] exists:
      old = kvs[x]
      kvs.delete(x)
      if old > c:
        kvs[x] = old - c

Non-monotonic!
56
CALM Analysis Conclusion: Every operation might require coordination! Non-monotonic!
57
Subtle Bug

  Add(item x, count c):
    if kvs[x] exists:
      old = kvs[x]
      kvs.delete(x)
    else:
      old = 0
    kvs[x] = old + c

  Remove(item x, count c):
    if kvs[x] exists:
      old = kvs[x]
      kvs.delete(x)
      if old > c:
        kvs[x] = old - c

What if the remove arrives before the add?
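The bug is easy to reproduce in a direct transcription of the pseudocode (item name "beer" is just an example): a remove that arrives before its add finds no entry and is silently dropped, so two replicas that saw the same operations in different orders end up in different states.

```python
# Sketch: Design #1's order sensitivity. Same operations, different
# delivery orders, divergent replica states.

def add(kvs, x, c):
    old = kvs.pop(x, 0)   # read-and-delete, default 0 if absent
    kvs[x] = old + c

def remove(kvs, x, c):
    if x in kvs:
        old = kvs.pop(x)
        if old > c:
            kvs[x] = old - c
    # if x is absent (remove before add), the removal is silently lost

r1, r2 = {}, {}
add(r1, "beer", 2); remove(r1, "beer", 1)   # replica 1: add, then remove
remove(r2, "beer", 1); add(r2, "beer", 2)   # replica 2: remove arrives first
assert r1 == {"beer": 1}
assert r2 == {"beer": 2}                    # replicas diverge
```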
58
Design #2: “Disorderly” Add(item x, count c): Append x,c to add_log Remove(item x, count c): Append x,c to del_log Checkout(): Group add_log by item ID; sum counts. Group del_log by item ID; sum counts. For each item, subtract deletions from additions. Non-monotonic!
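Design #2 can be sketched directly: the logs accumulate as sets (monotone), and the only non-monotone step — summing and subtracting — is deferred to checkout. Every arrival order of the same log entries yields the same checkout result (item names are hypothetical):

```python
# Sketch: the "disorderly" cart. Log accumulation is monotone;
# only checkout is non-monotone, so only checkout may need coordination.
from collections import Counter
from itertools import permutations

log = [("add", "beer", 2), ("del", "beer", 1), ("add", "wine", 1)]

def checkout(entries):
    adds, dels = Counter(), Counter()
    for op, item, c in entries:           # group by item, sum counts
        (adds if op == "add" else dels)[item] += c
    # subtract deletions from additions per item
    return {i: adds[i] - dels[i] for i in adds if adds[i] - dels[i] > 0}

# Checkout gives the same answer under every delivery order of the log
results = {tuple(sorted(checkout(p).items())) for p in permutations(log)}
assert len(results) == 1
assert checkout(log) == {"beer": 1, "wine": 1}
```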
59
CALM Analysis Conclusion: Replication is safe; might need to coordinate on checkout Monotonic
60
Takeaways Major difference in coordination cost! – Coordinate once per operation vs. coordinate once per checkout Disorderly accumulation when possible – Monotone growth ⇒ confluence “Disorderly”: a common design in practice! – e.g., Amazon Dynamo
61
Generalizing Monotonicity Monotone logic: growing sets over time – Partial order: set containment In practice, other kinds of growth: – Version numbers, timestamps – “In-progress” → committed/aborted – Directories, sequences, …
62
Example: Quorum Vote — not (set-wise) monotonic!

  QUORUM_SIZE = 5
  RESULT_ADDR = "example.org"

  class QuorumVote
    include Bud

    state do
      channel :vote_chn, [:@addr, :voter_id]
      channel :result_chn, [:@addr]
      table :votes, [:voter_id]
      scratch :cnt, [] => [:cnt]
    end

    bloom do
      votes <= vote_chn {|v| [v.voter_id]}
      cnt <= votes.group(nil, count(:voter_id))
      result_chn <~ cnt {|c| [RESULT_ADDR] if c.cnt >= QUORUM_SIZE}
    end
  end
63
Challenge: Extend monotone logic to allow other kinds of “growth”
64
⟨S, ⊔, ⊥⟩ is a bounded join semilattice iff:
  – S is a set
  – ⊔ is a binary operator (“least upper bound”)
    Induces a partial order on S: x ≤S y if x ⊔ y = y
    Associative, Commutative, and Idempotent – “ACID 2.0”
    Informally, LUB is the “merge function” for S
  – ⊥ is the “least” element in S: ∀x ∈ S, ⊥ ⊔ x = x
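The three lattices used in the talk (lset, lmax, lbool) can be sketched as (bottom, LUB) pairs, and the ACI properties checked mechanically (the `check_aci` helper is hypothetical):

```python
# Sketch: three bounded join semilattices as (bottom, least-upper-bound).
lset  = (frozenset(), lambda a, b: a | b)   # sets under union
lmax  = (0,           max)                  # increasing ints under max
lbool = (False,       lambda a, b: a or b)  # booleans under "or"

def check_aci(lattice, samples):
    bot, lub = lattice
    for a in samples:
        assert lub(a, a) == a       # Idempotent
        assert lub(bot, a) == a     # bottom is the least element
        for b in samples:
            assert lub(a, b) == lub(b, a)                      # Commutative
            for c in samples:
                assert lub(lub(a, b), c) == lub(a, lub(b, c))  # Associative

check_aci(lmax, [0, 3, 7])
check_aci(lbool, [False, True])
check_aci(lset, [frozenset(), frozenset({1}), frozenset({1, 2})])
```

ACI is exactly why the merge can be applied to messages in any order, any number of times, without changing the result.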
65
Growth over time in three example lattices: Set (⊔ = union), Increasing Int (⊔ = max), Boolean (⊔ = or).
66
f : S → T is a monotone function iff: ∀a,b ∈ S : a ≤S b ⇒ f(a) ≤T f(b)
67
Growth over time: Set (⊔ = union), Increasing Int (⊔ = max), Boolean (⊔ = or). Monotone function from set to increase-int: size(). Monotone function from increase-int to boolean: size() >= 3.
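The chain can be sketched in a few lines (quorum size lowered to 3 purely for illustration): a monotone map from the set lattice to max (size), then from max to bool (threshold). Because each stage is monotone, growing the vote set can never retract the answer.

```python
# Sketch: the monotone function chain lset -> lmax -> lbool.
QUORUM_SIZE = 3  # illustrative; the talk uses 5

def vote_cnt(votes):        # set -> increasing int: monotone
    return len(votes)

def got_quorum(cnt):        # increasing int -> boolean: monotone
    return cnt >= QUORUM_SIZE

votes, history = set(), []
for voter in ["v1", "v2", "v3", "v4"]:
    votes.add(voter)        # set grows monotonically (union)
    history.append(got_quorum(vote_cnt(votes)))
assert history == [False, False, True, True]  # once true, stays true
```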
68
Quorum Vote with Lattices

  QUORUM_SIZE = 5
  RESULT_ADDR = "example.org"

  class QuorumVote
    include Bud

    state do                                     # program state: lattice declarations
      channel :vote_chn, [:@addr, :voter_id]
      channel :result_chn, [:@addr]
      lset :votes
      lmax :vote_cnt
      lbool :got_quorum
    end

    bloom do                                     # program logic
      votes <= vote_chn {|v| v.voter_id}         # accumulate votes: merge new votes with stored votes (set LUB)
      vote_cnt <= votes.size                     # monotone function: set → max (merge via lmax LUB)
      got_quorum <= vote_cnt.gt_eq(QUORUM_SIZE)  # monotone function: max → bool
      result_chn <~ got_quorum.when_true { [RESULT_ADDR] }  # threshold test on bool (monotone)
    end
  end
69
Conclusions Interplay between language and system design Key question: what should be explicit? – Initial answer: asynchrony, state update – Refined answer: order Disorderly programming for disorderly networks
70
Thank You! Queries welcome. gem install bud http://www.bloom-lang.net Emily Andrews Peter Bailis William Marczak David Maier Tyson Condie Joseph M. Hellerstein Rusty Sears Sriram Srinivasan Collaborators:
71
Extra slides
72
Ongoing Work 1. Lattices – Concurrent editing – Distributed garbage collection 2. Confluence and concurrency control – Support for “controlled non-determinism” – Program analysis for serializability? 3. Safe composition of monotone and non-monotone code
73
Overlog “Our intellectual powers are rather geared to master static relations and […] our powers to visualize processes evolving in time are relatively poorly developed. For that reason we should do (as wise programmers aware of our limitations) our utmost to shorten the conceptual gap between the static program and the dynamic process, to make the correspondence between the program (spread out in text space) and the process (spread out in time) as trivial as possible.” Edsger W. Dijkstra
74
(Themes) Disorderly / order-light programming (understanding, simplifying) the relation between program syntax and outcomes Determinism (in asynchronous executions) as a correctness criterion Coordination – theoretical basis and mechanisms – What programs require coordination? – How can we coordinate them efficiently?
75
Traces and models counter(X+1)@next :- counter(X), request(_, _). counter(X)@next :- counter(X), notin request(_, _). response(@From, X)@async :- counter(X), request(To, From). response(From, X)@next :- response(From, X). counter(0)@0. request(“node1”, “node2”)@0.
76
Traces and models -- 0 counter(0+1)@1 :- counter(0)@0, request(“node1”, “node2”)@0. counter(X)@next :- counter(X), notin request(_, _). response(“node2”, 0)@100 :- counter(0)@0, request(“node1”, “node2”)@0. response(From, X)@next :- response(From, X). counter(0)@0. request(“node1”, “node2”)@0. counter(1)@1. […] response(“node2”, 0)@100.
77
Traces and models -- 1 counter(X+1)@next :- counter(X), request(To, From). counter(1)@?+1 :- counter(1)@?, notin request(_, _)@?. response(From, X)@async :- counter(X), request(To, From). response(“node2”, 0)@101:- response(“node2”, 0)@100. counter(0)@0. request(“node1”, “node2”)@0. counter(1)@1. counter(1)@2. […]
78
Traces and models -- 100 counter(X+1)@next :- counter(X), request(To, From). counter(1)@101 :- counter(1)@100, notin request(_, _)@100. response(From, X)@async :- counter(X), request(To, From). response(“node2”, 0)@101:- response(“node2”, 0)@100. counter(0)@0. request(“node1”, “node2”)@0. counter(1)@1. counter(1)@2. […] response(“node2”, 0)@100. counter(1)@101. response(“node2”, 0)@101.
79
Traces and models – 101+ counter(X+1)@next :- counter(X), request(To, From). counter(1)@102 :- counter(1)@101, notin request(_, _)@101. response(From, X)@async :- counter(X), request(To, From). response(“node2”, 0)@102:- response(“node2”, 0)@101. counter(0)@0. request(“node1”, “node2”)@0. counter(1)@1. counter(1)@2. […] response(“node2”, 0)@100. counter(1)@101. counter(1)@102. response(“node2”, 0)@101. response(“node2”, 0)@102. […] A stable model for choice = 100
80
Traces and models counter(X+1)@next :- counter(X), request(_, _). counter(X)@next :- counter(X), notin request(_, _). response(@From, X)@async :- counter(X), request(To, From). response(From, X)@next :- response(From, X). counter(0)@0. request(“node1”, “node2”)@0. Stable models: { counter(0)@0, counter(1)@1, counter(1)@2, […] response(“node2”, 0)@k, response(“node2”, 0)@k+1, […] }
81
Studying confluence in Dedalus q(#L, X)@async <- e(X), replica(L). p(X) <- q(_, X). p(X)@next <- p(X). Facts (at Alice): e(1). e(2). replica(Bob). replica(Carol). Bob receives q(Bob,1)@1, q(Bob,2)@2; Carol receives q(Carol,2)@1, q(Carol,1)@2. Both replicas converge to p(1), p(2): a unique ultimate model (UUM).
82
Studying confluence in Dedalus q(#L, X)@async <- e(X), replica(L). r(#L, X)@async <- f(X), replica(L). p(X) <- q(_, X), r(_, X). p(X)@next <- p(X). Facts (at Alice): e(1). f(1). replica(Bob). replica(Carol). Bob receives q(Bob,1)@1 and r(Bob,1)@2 — they never coexist, so Bob derives { }; Carol receives q(Carol,1)@1 and r(Carol,1)@1 and derives p(1). Multiple ultimate models.
83
Studying confluence in Dedalus q(#L, X)@async <- e(X), replica(L). r(#L, X)@async <- f(X), replica(L). p(X) <- q(_, X), r(_, X). p(X)@next <- p(X). q(L, X)@next <- q(L, X). Facts (at Alice): e(1). f(1). replica(Bob). replica(Carol). Persisting q helps Bob (q(Bob,1)@1, r(Bob,1)@2 ⇒ p(1)), but Carol (r(Carol,1)@1, q(Carol,1)@2) still derives { } because r is not persisted. Multiple ultimate models.
84
Studying confluence in Dedalus q(#L, X)@async <- e(X), replica(L). r(#L, X)@async <- f(X), replica(L). p(X) <- q(_, X), r(_, X). p(X)@next <- p(X). q(L, X)@next <- q(L, X). r(L, X)@next <- r(L, X). Facts (at Alice): e(1). f(1). replica(Bob). replica(Carol). With both q and r persisted, the messages eventually coexist under any delivery order, so Bob (q@1, r@2) and Carol (r@1, q@2) both derive p(1): a unique ultimate model (UUM).
85
Studying confluence in Dedalus q(#L, X)@async <- e(X), replica(L). r(#L, X)@async <- f(X), replica(L). p(X) <- q(_, X), NOT r(_, X). p(X)@next <- p(X). q(L, X)@next <- q(L, X). r(L, X)@next <- r(L, X). Facts (at Alice): e(1). f(1). replica(Bob). replica(Carol). With negation, order matters again: Bob (q@1, r@2) derives p(1) before r arrives and persists it; Carol (r@1, q@2) never derives p. Multiple ultimate models.
86
CALM – Consistency as logical monotonicity Logically monotonic => confluent Consequence: a (conservative) static analysis for eventual consistency Practical implications: – Language support for weakly-consistent, coordination-free distributed systems!
87
Does CALM help? Is the monotonic subset of Dedalus sufficiently expressive / convenient to implement distributed systems?
88
Coordination CALM’s complement: – Nonmonotonic => order-sensitive – Ensuring deterministic outcomes may require controlling order. We could constrain the order of – Data E.g., via ordered delivery – Computation E.g., via evaluation barriers
89
Coordination mechanisms Approach 1: Deliver the q() and r() tuples in the same total order to all replicas. Facts (at Alice): e(1). f(1). replica(Bob). replica(Carol). If both Bob and Carol receive r@1 and then q@2, both derive { }: a single outcome.
90
Coordination mechanisms p(X) <- q(_, X), NOT r(_, X). Approach 2: Do not evaluate “NOT r(X)” until its contents are completely determined. Facts (at Alice): e(1). f(1). replica(Bob). replica(Carol). Bob (q@1, r@2) and Carol (r@1, q@2) both wait until r is complete, then both derive { }: a single outcome.
91
Ordered delivery vs. stratification (Differences) Stratified evaluation – Unique outcome across all executions – Finite inputs – Communication between producers and consumers Ordered delivery – Different outcomes in different runs – No restriction on inputs – Multiple producers and consumers => need distributed consensus
92
Ordered delivery vs. stratification (Similarities) Stratified evaluation – Control order of evaluation at a coarse grain, table by table – Order is given by program syntax Ordered delivery – Fine-grained order of evaluation, row by row – Order is nondeterministically chosen by an oracle (e.g., Paxos) – Analogy: assign a stratum to each tuple; ensure that all replicas see the same stratum assignments