NDB The new Python client library for the Google App Engine Datastore

Name: NDB The new Python client library for the Google App Engine Datastore
Uploaded: 2017-09-30T18:19:22+00:00
Duration: PTM14S11
Channel: Kari Hamley
Description: NDB The new Python client library for the Google App Engine Datastore

NDB The new Python client library for the Google App Engine Datastore
Guido van Rossum

Google App Engine in a nutshell
- Run your web apps in Google’s cloud
- Opinionated Platform-as-a-Service (PaaS)
- Automatically scales your app
- Python-only launch April 2008; Java in 2009
- NoSQL datastore
  - ORM is primary API
  - small subset of SQL (“GQL”) on top of ORM
- Original Python ORM called “db”

Google App Engine in numbers
- Attained 7.5 billion daily hits
- 1 million active applications
- 250,000 active developers (30-day actives)
- Half of all internet IP addresses touch Google App Engine servers per week
- 2 trillion datastore operations per month

NDB in a nutshell
- Fix design bugs in the old db API
- Implement cool new API ideas
- Asynchronous to the core
- 100% compatible on-disk representation
- Google App Engine Datastore only
- Python 2.5 and 2.7 (single- and multi-threaded)
- HRD and M/S datastore; US and EU datacenters

Development process
- Notice widespread frustration with old db
- Get management buy-in for a full rewrite
- Sit in a corner coding for a year :-)
- No, really:
  - Release open source version early and often
  - Beg users for feedback and contributions
  - Try to document, redesign what’s hard to explain
  - Rinse and repeat

What’s wrong with old db
- Hard to modify
  - any time we try to change internals, some user code that depends on those internals breaks
- Started out as a quick demo
  - “how to do Django-style models in App Engine”
  - made the official API only weeks before launch
- Has too many layers
  - data is copied too many times between layers

Layer cake (old)
- db
- datastore.py
- protocol buffers

Layer cake (new)
- db, ndb
- datastore.py
- datastore_{rpc,query}.py
- protocol buffers

Cool new API features
- Async core
- Auto-batching
- Integrated caching
- Pythonic query syntax
- Give entities nestable structure
- Make subclassing Property classes easy

Other nice things
- Use repeated=True instead of ListProperty
- Pre- and post-operation hooks
- Key and Query types are truly immutable
- All objects have useful repr()s
- Unified terminology (id instead of key_name)
- PickleProperty, JsonProperty
- ProtoRPC support: MessageProperty

The basics

Model (schema) definitions
- Model class and Property classes
  - similar to Django (or any Python ORM)
  - uses a simple metaclass
- Example:
    class Employee(ndb.Model):
        name = ndb.StringProperty(required=True)
        rank = ndb.IntegerProperty(default=3)
        phone = ndb.StringProperty()
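
The "simple metaclass" idea can be illustrated without App Engine. The following is a minimal sketch, not NDB's implementation: `Property`, `ModelMeta` and `Model` are toy stand-ins invented for this example, the sketch uses Python 3 class syntax, and it checks `required` at construction time purely to stay short.

```python
class Property(object):
    """Toy stand-in for a property descriptor (not NDB's real class)."""

    def __init__(self, required=False, default=None):
        self.required = required
        self.default = default
        self.name = None  # filled in by the metaclass

    def __get__(self, instance, owner):
        if instance is None:
            return self  # class-level access returns the Property itself
        return instance._values.get(self.name, self.default)

    def __set__(self, instance, value):
        instance._values[self.name] = value


class ModelMeta(type):
    """Collects Property attributes and tells each one its own name."""

    def __init__(cls, name, bases, namespace):
        super(ModelMeta, cls).__init__(name, bases, namespace)
        cls._properties = {}
        for attr, value in namespace.items():
            if isinstance(value, Property):
                value.name = attr
                cls._properties[attr] = value


class Model(metaclass=ModelMeta):
    def __init__(self, **kwargs):
        self._values = {}
        for key, value in kwargs.items():
            if key not in self._properties:
                raise AttributeError("unknown property: %s" % key)
            setattr(self, key, value)
        # Simplification of this sketch: validate at construction time.
        for name, prop in self._properties.items():
            if prop.required and name not in self._values:
                raise ValueError("missing required property: %s" % name)


class Employee(Model):
    name = Property(required=True)
    rank = Property(default=3)
    phone = Property()


emp = Employee(name="Guido")
print(emp.name, emp.rank, emp.phone)  # Guido 3 None
```

The metaclass runs once per class definition, which is why each property can learn its attribute name without the author repeating it as a string.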

Basic CRUD (Create, Read, Update, Delete)
    emp = Employee(name='Guido')
    key = emp.put()
    emp = key.get()
    emp.phone = ' '; emp.put()
    key.delete()

Queries
- Query for all entities:
    all_emps = Employee.query().fetch()
    for emp in Employee.query(): …
- Query for property values:
    Employee.query(Employee.rank > 3)
    Employee.query(Employee.phone == None)
- Query for multiple conditions:
    Employee.query(<cond1>, <cond2>, …)

Why repeat the class name?
- Limitations of Python as a DSL…
- Old db used string literals; error-prone:
    Employee.all().filter(' rank >', 3)  # extra space
- Protip: write queries as class methods:
    @classmethod
    def outranks(cls, rank):
        return cls.query(cls.rank > rank)

    Employee.outranks(3).fetch()
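
Why the class name has to be repeated becomes clearer with a toy version of the operator overloading involved. This is a sketch under assumptions: `FilterNode` and `Property` here are invented stand-ins, not NDB's classes. A comparison like `Employee.rank > 3` only works because `Employee.rank` is an object whose `__gt__` builds a filter description; a bare `rank > 3` would have no such object to call.

```python
class FilterNode(object):
    """Toy record of one comparison: (property name, operator, value)."""

    def __init__(self, name, op, value):
        self.name, self.op, self.value = name, op, value

    def __repr__(self):
        return "FilterNode(%r, %r, %r)" % (self.name, self.op, self.value)


class Property(object):
    """Toy property whose comparison operators build FilterNodes."""

    def __init__(self, name):
        self.name = name

    def __gt__(self, value):
        return FilterNode(self.name, ">", value)

    def __lt__(self, value):
        return FilterNode(self.name, "<", value)

    def __eq__(self, value):
        return FilterNode(self.name, "=", value)

    __hash__ = object.__hash__  # defining __eq__ would otherwise drop hashing


class Employee(object):
    rank = Property("rank")
    phone = Property("phone")

    @classmethod
    def outranks(cls, rank):
        # Inside a class method, cls stands in for the repeated class name.
        return cls.rank > rank


print(Employee.rank > 3)        # FilterNode('rank', '>', 3)
print(Employee.phone == None)   # FilterNode('phone', '=', None)
print(Employee.outranks(3))     # FilterNode('rank', '>', 3)
```

Because the property name comes from an attribute lookup rather than a string literal, a misspelled name fails immediately with an AttributeError instead of silently producing a bad query.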

Mapping a query over a callback
    # Pretend you don't see the async bits
    @ndb.tasklet
    def callback(ent):
        if not ent.name:
            ent.name = ent.first_name + ent.last_name
            yield ent.put_async()

    Employee.query().map(callback)
- Concurrency controlled by query batch size

StructuredProperty
- Example: list of tagged phone numbers
- In old db:
    class Contact(db.Model):
        name = db.StringProperty()
        # following two are parallel arrays
        phones = db.StringListProperty()
        tags = db.StringListProperty()

    def add_phone(contact, number, tag):
        contact.phones.append(number)
        contact.tags.append(tag)

StructuredProperty (2)
    class Phone(ndb.Model):
        number = ndb.StringProperty()
        tag = ndb.StringProperty()

    class Contact(ndb.Model):
        name = ndb.StringProperty()
        phones = ndb.StructuredProperty(Phone, repeated=True)

    def add_phone(contact, number, tag):
        contact.phones.append(Phone(number=number, tag=tag))

    Contact.query(Contact.phones.number == ' ')

Transactions
- Nothing really new or exciting
- Well integrated with contexts and caching
- To specify options:
    @ndb.transactional(retries=N, xg=True)
- Join current transaction if one is in progress:
    @ndb.transactional(propagation=ALLOWED)

Caching
- CRUD automatically caches in two places:
  - in memory (per-context; write-through)
  - in memcache (shared)
    - one memcache server for all instances of your app
    - write locks and clears, but doesn’t update memcache
- memcache algorithm ensures consistency even when using transactions
  - except maybe under extreme failure conditions
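
The two cache levels can be sketched with plain dictionaries. This is a toy model, not NDB's algorithm: `ToyContext` is a hypothetical name, the memcache locking step is left out entirely, and the shared memcache and datastore are ordinary dicts.

```python
class ToyContext(object):
    """One request's view: a private in-memory cache plus shared stores."""

    def __init__(self, memcache, datastore):
        self.memory = {}            # per-context cache
        self.memcache = memcache    # shared between contexts
        self.datastore = datastore  # shared source of truth

    def put(self, key, value):
        self.datastore[key] = value
        self.memory[key] = value      # write-through: memory sees the new value
        self.memcache.pop(key, None)  # cleared, not updated

    def get(self, key):
        if key in self.memory:
            return self.memory[key]
        if key in self.memcache:
            value = self.memcache[key]
        else:
            value = self.datastore.get(key)
            self.memcache[key] = value  # the reader repopulates memcache
        self.memory[key] = value
        return value


shared_memcache, shared_datastore = {}, {}
ctx1 = ToyContext(shared_memcache, shared_datastore)
ctx2 = ToyContext(shared_memcache, shared_datastore)

ctx1.put("emp1", "rank 3")
print(ctx2.get("emp1"))            # read falls through to the datastore
ctx1.put("emp1", "rank 4")
print("emp1" in shared_memcache)   # False: the write cleared the entry
```

Clearing rather than updating memcache on write means a later reader always refills it from the datastore, so a racing writer cannot leave a stale value behind in this simplified model.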

Caching (2)
- User can override caching policies
  - per call, per model class, per context
  - write your own policy function
  - can even turn off datastore writes completely!
- Query results are not cached
  - consistency is too hard to guarantee
  - however, this works for high cache hit rates:
      ndb.get_multi(q.fetch(keys_only=True))
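
The keys-only trick works because the query itself only has to return keys, while the entity values come from a cache-aware batch get. The sketch below simulates that with dictionaries; `query_keys_only`, `get_multi`, `DATASTORE` and `CACHE` are hypothetical stand-ins for this example, not the NDB functions.

```python
DATASTORE = {"emp1": {"name": "Ada"}, "emp2": {"name": "Guido"}}  # hypothetical store
CACHE = {}
datastore_reads = []  # records which keys actually hit the datastore


def query_keys_only():
    """Stand-in for q.fetch(keys_only=True): returns keys, never cached."""
    return sorted(DATASTORE)


def get_multi(keys):
    """Stand-in for a cache-aware batch get."""
    missing = [key for key in keys if key not in CACHE]
    if missing:
        datastore_reads.append(list(missing))  # one batch read for all misses
        for key in missing:
            CACHE[key] = DATASTORE[key]
    return [CACHE[key] for key in keys]


first = get_multi(query_keys_only())
second = get_multi(query_keys_only())  # served entirely from the cache
print(datastore_reads)  # [['emp1', 'emp2']]
```

The query still runs every time, but on a high cache hit rate the entity payloads are read from the datastore only once.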

The async API (a fairly deep dive)

Async basics
- Based on PEP 342: generators as coroutines
- Has its own event loop and Future class
- Constrained by App Engine async API
  - based on RPCs (“Futures” for server-side work)
  - only RPCs can be asynchronous (no select/poll)
  - can wait for multiple RPCs
  - in original (Python 2.5) runtime, no threads
- greenlets/gevent/etc. useless in this environment

Synchronous example code
    def get_or_insert(id):
        ent = Employee.get_by_id(id)
        if ent is None:
            ent = Employee(…, id=id)
            ent.put()
        return ent

Converted to async style
    @ndb.tasklet
    def get_or_insert_async(id):
        ent = yield Employee.get_by_id_async(id)
        if ent is None:
            ent = Employee(…, id=id)
            yield ent.put_async()
        raise ndb.Return(ent)
- “Look ma, no callbacks”

Writing async code
- The decorated function (tasklet) is async itself
- Really, async operations just return Futures
  - can separate call from yield: f = foo_async(); …; a = yield f
- yield takes any Future, or a list of Futures
  - yield <list> returns a list of results:
      f = f_async(); …; g = g_async(); …; a, b = yield f, g
  - yielding multiple Futures is key to running multiple tasklets concurrently

Futures
- NDB Futures are explicit Futures
  - must use an explicit API to wait for the result
- Three ways to wait:
  - call f.get_result()  # in synchronous context
  - yield f  # in a tasklet
  - f.add_callback(callback_function)  # internal
- Any number of waiters are supported
- An exception is also a result (i.e. is re-raised)
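
A stripped-down explicit Future makes these points concrete. This is a toy written for illustration, not NDB's Future class: there is no event loop behind it, so `get_result()` simply refuses to block, and the callback signature (no arguments) is an assumption of the sketch.

```python
class Future(object):
    """Toy explicit future: holds a result or an exception, plus callbacks."""

    def __init__(self):
        self._done = False
        self._result = None
        self._exception = None
        self._callbacks = []

    def add_callback(self, callback):
        if self._done:
            callback()  # already finished: notify immediately
        else:
            self._callbacks.append(callback)

    def _finish(self):
        self._done = True
        callbacks, self._callbacks = self._callbacks, []
        for callback in callbacks:
            callback()

    def set_result(self, result):
        self._result = result
        self._finish()

    def set_exception(self, exc):
        self._exception = exc
        self._finish()

    def get_result(self):
        # A real implementation would run the event loop here until done.
        if not self._done:
            raise RuntimeError("result is not ready yet")
        if self._exception is not None:
            raise self._exception  # an exception is also a result
        return self._result


f = Future()
seen = []
f.add_callback(lambda: seen.append("first waiter"))
f.add_callback(lambda: seen.append("second waiter"))
f.set_result(42)
print(f.get_result(), seen)

failed = Future()
failed.set_exception(ValueError("boom"))
try:
    failed.get_result()
except ValueError as exc:
    print("re-raised:", exc)
```

Any number of callbacks can be attached before the result arrives, and a stored exception surfaces at the point where the result is requested.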

Event loop
- Doesn’t know about Futures
- Knows about App Engine RPCs though…
- And knows about callback functions
- When you’re calling an async API or tasklet:
  - a helper to run the tasklet is queued
  - you’re given a Future right away
  - the helper will eventually set the Future’s result
  - use the Future to wait for the result

The magic yield
- How does yielding a Future wait for its result?
  1. “Trampoline” code calls g.next() or g.send() on the underlying generator object
  2. If this returns a Future, the trampoline adds a callback to the Future to restart the generator
  3. It’s up to whatever created the Future to make sure that its result is eventually set
  4. Go to #1, passing the result into g.send()

Edge cases
- If g.next() or g.send() raises StopIteration, we’re done (ndb.Return is a subclass thereof)
- If it raises another exception, we’re also done, and we pass the exception on
- If it returns an RPC instead of a Future, use the event loop’s native understanding of RPCs
- If it returns a non-Future, that’s an error
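
The trampoline steps and edge cases above can be sketched in a few dozen lines of plain Python. Everything here is a toy written for illustration: `Future`, `tasklet`, `step`, `run_loop` and `fake_rpc` are hypothetical names, not NDB's code, and RPC handling is omitted. The sketch targets Python 3, where a generator ends with `return value`; the `raise ndb.Return(x)` idiom on the slides belongs to Python 2, whose generators cannot return values.

```python
import collections


class Future(object):
    def __init__(self):
        self.done = False
        self.result = None
        self.exception = None
        self.callbacks = []

    def add_callback(self, callback):
        if self.done:
            callback()
        else:
            self.callbacks.append(callback)

    def _finish(self):
        self.done = True
        for callback in self.callbacks:
            callback()

    def set_result(self, result):
        self.result = result
        self._finish()

    def set_exception(self, exc):
        self.exception = exc
        self._finish()


ready = collections.deque()  # toy event loop: a queue of zero-argument callables


def run_loop():
    while ready:
        ready.popleft()()


def tasklet(func):
    """Turn a generator function into one that returns a Future right away."""
    def wrapper(*args, **kwargs):
        gen = func(*args, **kwargs)
        outer = Future()
        ready.append(lambda: step(gen, outer, None, None))  # queue the helper
        return outer
    return wrapper


def step(gen, outer, value, exc):
    """The trampoline: resume the generator and react to what it yields."""
    try:
        if exc is not None:
            yielded = gen.throw(exc)
        else:
            yielded = gen.send(value)  # send(None) starts a fresh generator
    except StopIteration as stop:
        outer.set_result(stop.value)  # normal completion
        return
    except Exception as err:
        outer.set_exception(err)  # any other exception is passed on
        return
    if not isinstance(yielded, Future):
        gen.close()
        outer.set_exception(TypeError("tasklet yielded a non-Future"))
        return
    # Restart the generator once the yielded Future has a result.
    yielded.add_callback(
        lambda: ready.append(
            lambda: step(gen, outer, yielded.result, yielded.exception)))


def fake_rpc(value):
    """Hypothetical async operation whose result is set by the event loop."""
    fut = Future()
    ready.append(lambda: fut.set_result(value))
    return fut


@tasklet
def add_async(a, b):
    x = yield fake_rpc(a)
    y = yield fake_rpc(b)
    return x + y


@tasklet
def main_async():
    total = yield add_async(1, 2)
    return total * 10


result = main_async()
run_loop()
print(result.result)  # 30
```

Note that the event loop itself only ever sees plain callables; all knowledge of Futures and generators lives in `step`, which matches the division of labor described on the event loop slide.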

You don’t have to understand this
- Just remember these rules:
  - use @ndb.tasklet on a generator function
  - yield *_async() operations
  - raise ndb.Return(x) instead of return x
- Use yield <list> to increase concurrency
- Don’t call synchronous APIs!
- Helpful convention: name tasklets *_async
- Exception passing is remarkably natural

Auto-batching
- Automatically combine operations in one RPC
- Only like operations can be combined
- Must use async API to benefit
- Example:
    e1.put(); e2.put()  # Two RPCs
    yield e1.put_async(), e2.put_async()  # One RPC!
- Implemented for datastore get, put, delete; and memcache operations (via Context)

Auto-batching (2)
- Biggest benefit is between multiple tasklets
- Each tasklet does some single ops
  - example: get_or_insert()
- Tasklets are run concurrently
- Each tasklet in turn runs until first blocking op
- Those ops are buffered, not sent out yet
- When no tasklets left to run, buffered ops are combined into one batch RPC

Auto-batching (3)
- Each original single op has its own Future
- When the RPC completes, its result is distributed back over those Futures
- And… the tasklets are back in the race!
- But… why not just manually batch operations?
  - restructuring your code to do that is often hard!
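
The buffering-and-distribution cycle described across the auto-batching slides can be simulated in plain Python. This is a toy under stated assumptions: `get_async`, `flush`, `run_all`, `double` and the `DATASTORE` dict are hypothetical names for this sketch, the "RPC" is a dictionary lookup, and the scheduler is far simpler than a real event loop.

```python
class Future(object):
    def __init__(self):
        self.done = False
        self.result = None

    def set_result(self, result):
        self.done = True
        self.result = result


DATASTORE = {"a": 1, "b": 2, "c": 3}  # hypothetical backing store
batch_rpcs = []                        # every batch RPC that was "sent"
pending = []                           # buffered (key, Future) pairs


def get_async(key):
    """Buffer a single get and hand back its own Future immediately."""
    fut = Future()
    pending.append((key, fut))
    return fut


def flush():
    """Combine all buffered gets into one batch RPC and distribute results."""
    global pending
    batch, pending = pending, []
    if not batch:
        return
    keys = [key for key, _ in batch]
    batch_rpcs.append(keys)                        # one RPC for the whole batch
    values = [DATASTORE.get(key) for key in keys]  # pretend server response
    for (_, fut), value in zip(batch, values):
        fut.set_result(value)


def double(key):
    """A tasklet-like generator doing one single-entity operation."""
    value = yield get_async(key)
    return value * 2


def run_all(generators):
    """Run each generator to its first blocking op, then flush, then resume."""
    results = [None] * len(generators)
    waiting = []
    for index, gen in enumerate(generators):
        waiting.append((index, gen, next(gen)))  # runs until the first yield
    while waiting:
        flush()  # nothing left to run: send the buffered ops as one batch
        still_waiting = []
        for index, gen, fut in waiting:
            try:
                still_waiting.append((index, gen, gen.send(fut.result)))
            except StopIteration as stop:
                results[index] = stop.value
        waiting = still_waiting
    return results


results = run_all([double("a"), double("b"), double("c")])
print(results)     # [2, 4, 6]
print(batch_rpcs)  # [['a', 'b', 'c']]
```

Each generator is written as if it fetched a single entity, yet the three gets travel together, which is the restructuring that would otherwise have to be done by hand.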

Conclusion: caveats
- Async coding has lots of newbie traps
- Careful when overlapping I/O and CPU work
  - auto-batch queues only flushed when blocking
- Mixing async and synchronous ops can be bad
  - in extreme cases can cause stack overflow
- Debugging async code is a challenge
  - too much state in suspended generators’ locals
  - can’t easily step over a yield in pdb
