NDB The new Python client library for the Google App Engine Datastore

Presentation transcript:

NDB The new Python client library for the Google App Engine Datastore Guido van Rossum guido@google.com

Google App Engine in a nutshell
- Run your web apps in Google’s cloud
- Opinionated Platform-as-a-Service (PaaS)
- Automatically scales your app
- Python-only launch April 2008; Java in 2009
- NoSQL datastore
  - ORM is primary API
  - small subset of SQL (“GQL”) on top of ORM
- Original Python ORM called “db”

Google App Engine in numbers
- Attained 7.5 billion daily hits
- 1 million active applications
- 250,000 active developers (30-day actives)
- Half of all internet IP addresses touch Google App Engine servers per week
- 2 trillion datastore operations per month

NDB in a nutshell
- Fix design bugs in the old db API
- Implement cool new API ideas
- Asynchronous to the core
- 100% compatible on-disk representation
- Google App Engine Datastore only
- Python 2.5 and 2.7 (single- and multi-threaded)
- HRD and M/S datastore; US and EU datacenters

Development process
- Notice widespread frustration with old db
- Get management buy-in for a full rewrite
- Sit in a corner coding for a year :-)
- No, really:
  - Release open source version early and often
  - Beg users for feedback and contributions
  - Try to document, redesign what’s hard to explain
  - Rinse and repeat

What’s wrong with old db
- Hard to modify
  - any time we try to change internals, some user code that depends on those internals breaks
- Started out as a quick demo
  - “how to do Django-style models in App Engine”
  - made the official API only weeks before launch
- Has too many layers
  - data is copied too many times between layers

Layer cake (old) db datastore.py protocol buffers

Layer cake (new) db ndb datastore.py datastore_{rpc,query}.py protocol buffers

Cool new API features
- Async core
- Auto-batching
- Integrated caching
- Pythonic query syntax
- Give entities nestable structure
- Make subclassing Property classes easy

Other nice things
- Use repeated=True instead of ListProperty
- Pre- and post-operation hooks
- Key and Query types are truly immutable
- All objects have useful repr()s
- Unified terminology (id instead of key_name)
- PickleProperty, JsonProperty
- ProtoRPC support: MessageProperty

The basics

Model (schema) definitions
- Model class and Property classes
  - similar to Django (or any Python ORM)
  - uses a simple metaclass
- Example:
    class Employee(ndb.Model):
        name = ndb.StringProperty(required=True)
        rank = ndb.IntegerProperty(default=3)
        phone = ndb.StringProperty()

Basic CRUD (Create, Read, Update, Delete)
    emp = Employee(name='Guido')
    key = emp.put()
    emp = key.get()
    emp.phone = '555-5555'; emp.put()
    key.delete()

Queries
- Query for all entities:
    all_emps = Employee.query().fetch()
    for emp in Employee.query(): …
- Query for property values:
    Employee.query(Employee.rank > 3)
    Employee.query(Employee.phone == None)
- Query for multiple conditions:
    Employee.query(<cond1>, <cond2>, …)

Why repeat the class name?
- Limitations of Python as a DSL…
- Old db used string literals; error-prone:
    Employee.all().filter(' rank >', 3)  # extra space
- Protip: write queries as class methods:
    @classmethod
    def outranks(cls, rank):
        return cls.query(cls.rank > rank)
    Employee.outranks(3).fetch()
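The "Python as a DSL" point can be illustrated with operator overloading. The sketch below is not NDB's code: `FilterNode`, `Property` and this toy `Employee` are hypothetical stand-ins that only show why the property has to be reached through the class. A bare `rank > 3` would be a NameError, while `Employee.rank > 3` can return an object that describes the comparison.

```python
class FilterNode:
    """Records one comparison, e.g. rank > 3 (hypothetical, not NDB's class)."""

    def __init__(self, name, op, value):
        self.name, self.op, self.value = name, op, value

    def __repr__(self):
        return "FilterNode(%r, %r, %r)" % (self.name, self.op, self.value)


class Property:
    """Class attribute whose comparison operators build filters, not booleans."""

    def __init__(self, name):
        self.name = name

    def __gt__(self, value):
        return FilterNode(self.name, ">", value)

    def __eq__(self, value):
        return FilterNode(self.name, "=", value)

    # Defining __eq__ would otherwise make instances unhashable in Python 3.
    __hash__ = object.__hash__


class Employee:
    rank = Property("rank")
    phone = Property("phone")


node = Employee.rank > 3
print(node)
```

Because the comparison is an ordinary Python expression, a typo in the property name fails immediately with an AttributeError instead of silently producing a query string with an extra space.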

Mapping a query over a callback
    # Pretend you don't see the async bits
    @ndb.tasklet
    def callback(ent):
        if not ent.name:
            ent.name = ent.first_name + ent.last_name
            yield ent.put_async()
    Employee.query().map(callback)
- Concurrency controlled by query batch size

StructuredProperty
- Example: list of tagged phone numbers
- In old db:
    class Contact(db.Model):
        name = db.StringProperty()
        # following two are parallel arrays
        phones = db.StringListProperty()
        tags = db.StringListProperty()
    def add_phone(contact, number, tag):
        contact.phones.append(number)
        contact.tags.append(tag)

StructuredProperty (2)
    class Phone(ndb.Model):
        number = ndb.StringProperty()
        tag = ndb.StringProperty()
    class Contact(ndb.Model):
        name = ndb.StringProperty()
        phones = ndb.StructuredProperty(Phone, repeated=True)
    def add_phone(contact, number, tag):
        contact.phones.append(Phone(number=number, tag=tag))
    Contact.query(Contact.phones.number == '555-1212')

Transactions
- Nothing really new or exciting
- Well integrated with contexts and caching
- Decorator @ndb.transactional
- To specify options:
    @ndb.transactional(retries=N, xg=True)
- Join current transaction if one is in progress:
    @ndb.transactional(propagation=ALLOWED)

Caching
- CRUD automatically caches in two places:
  - in memory (per-context; write-through)
  - in memcache (shared)
    - one memcache server for all instances of your app
    - write locks and clears, but doesn’t update memcache
- memcache algorithm ensures consistency
  - even when using transactions
  - except maybe under extreme failure conditions

Caching (2)
- User can override caching policies
  - per call, per model class, per context
  - write your own policy function
  - can even turn off datastore writes completely!
- Query results are not cached
  - consistency is too hard to guarantee
  - however, this works for high cache hit rates:
    ndb.get_multi(q.fetch(keys_only=True))
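The two caching slides can be pictured with a toy model. This is a sketch under stated assumptions, not NDB's algorithm: `CachedStore` and all of its methods are hypothetical, plain dicts stand in for the datastore and memcache, and the locking step is left out. It shows the "clear on write, repopulate on read" policy and why a keys-only query followed by a multi-get can still benefit from the cache even though query results themselves are not cached.

```python
class CachedStore:
    """Toy read-through cache in front of a dict 'datastore' (hypothetical)."""

    def __init__(self):
        self.datastore = {}        # stand-in for the real datastore
        self.cache = {}            # stand-in for memcache
        self.datastore_reads = 0   # counts entity reads that missed the cache

    def put(self, key, value):
        self.datastore[key] = value
        # Clear rather than update: the next reader repopulates the cache.
        self.cache.pop(key, None)

    def get(self, key):
        if key in self.cache:
            return self.cache[key]
        self.datastore_reads += 1
        value = self.datastore.get(key)
        if value is not None:
            self.cache[key] = value
        return value

    def query_keys(self, predicate):
        """Keys-only 'query': always scans the datastore, never the cache."""
        return [k for k, v in sorted(self.datastore.items()) if predicate(v)]

    def get_multi(self, keys):
        """Fetch entities by key; each key may be served from the cache."""
        return [self.get(k) for k in keys]


store = CachedStore()
store.put("a", 1)
store.put("b", 5)
store.get("a")      # cache miss, reads the datastore
store.get("a")      # cache hit
store.put("a", 7)   # clears the cached copy of "a"
keys = store.query_keys(lambda v: v > 3)
values = store.get_multi(keys)
```

Only the entity fetches go through the cache here; the query itself always runs against the datastore, which mirrors the pattern on the slide.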

The async API (a fairly deep dive)

Async basics
- Based on PEP 342: generators as coroutines
- Has its own event loop and Future class
- Constrained by App Engine async API
  - based on RPCs (“Futures” for server-side work)
  - only RPCs can be asynchronous (no select/poll)
  - can wait for multiple RPCs
  - in original (Python 2.5) runtime, no threads
- greenlets/gevent/etc. useless in this environment
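For readers who have not used PEP 342, the following sketch shows the two generator features that tasklets build on: `send()` passes a value into a paused `yield`, and `throw()` raises an exception at that same point. The `running_average` coroutine is an illustrative example only and has nothing to do with NDB itself.

```python
def running_average():
    """Coroutine: receives numbers via send() and yields the running average."""
    total = 0.0
    count = 0
    average = None
    while True:
        try:
            value = yield average   # pause here; send() supplies the value
        except ValueError:
            # throw() injected an exception at the paused yield: reset state
            total, count, average = 0.0, 0, None
            continue
        total += value
        count += 1
        average = total / count


gen = running_average()
next(gen)                                # run to the first yield
first = gen.send(10)                     # average of [10]
second = gen.send(20)                    # average of [10, 20]
reset = gen.throw(ValueError("reset"))   # handled inside the generator
third = gen.send(4)                      # average of [4] after the reset
```

A tasklet framework uses the same mechanism in reverse: the generator yields something that represents pending work, and the framework later calls `send()` with the result or `throw()` with the error.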

Synchronous example code
    def get_or_insert(id):
        ent = Employee.get_by_id(id)
        if ent is None:
            ent = Employee(…, id=id)
            ent.put()
        return ent

Converted to async style
    @ndb.tasklet
    def get_or_insert_async(id):
        ent = yield Employee.get_by_id_async(id)
        if ent is None:
            ent = Employee(…, id=id)
            yield ent.put_async()
        raise ndb.Return(ent)
- “Look ma, no callbacks”

Writing async code
- The decorated function (tasklet) is async itself
- Really, async operations just return Futures
  - can separate call from yield:
    f = foo_async(); …; a = yield f
- yield takes any Future, or a list of Futures
  - yield <list> returns a list of results:
    f = f_async(); …; g = g_async(); …; a, b = yield f, g
- yielding multiple futures is key to running multiple tasklets concurrently

Futures
- NDB Futures are explicit Futures
  - must use an explicit API to wait for the result
- Three ways to wait:
  - call f.get_result()  # in synchronous context
  - yield f  # in a tasklet
  - f.add_callback(callback_function)  # internal
- Any number of waiters are supported
- An exception is also a result (i.e. is re-raised)
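A minimal sketch of an explicit Future may help make these properties concrete. It is an assumption-laden illustration, not NDB's class: the method names mirror the slide, but here callbacks receive the Future as an argument and `get_result()` simply refuses to wait, whereas a real implementation would run the event loop until the result is set.

```python
class Future:
    """Holds a result (or exception) that is set later (hypothetical sketch)."""

    def __init__(self):
        self._done = False
        self._result = None
        self._exception = None
        self._callbacks = []

    def add_callback(self, callback):
        """Register a waiter; any number of waiters is supported."""
        if self._done:
            callback(self)
        else:
            self._callbacks.append(callback)

    def set_result(self, result):
        self._result = result
        self._finish()

    def set_exception(self, exception):
        self._exception = exception
        self._finish()

    def _finish(self):
        self._done = True
        callbacks, self._callbacks = self._callbacks, []
        for callback in callbacks:
            callback(self)

    def done(self):
        return self._done

    def get_result(self):
        if not self._done:
            # Assumption of this sketch: no event loop to run while waiting.
            raise RuntimeError("result is not ready yet")
        if self._exception is not None:
            raise self._exception   # an exception is also a result
        return self._result


fut = Future()
fut.add_callback(lambda f: print("waiter 1 got", f.get_result()))
fut.add_callback(lambda f: print("waiter 2 got", f.get_result()))
fut.set_result(42)
```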

Event loop
- Doesn’t know about Futures
- Knows about App Engine RPCs though…
- And knows about callback functions
- When you’re calling an async API or tasklet
  - a helper to run the tasklet is queued
  - you’re given a Future right away
  - the helper will eventually set the Future’s result
  - use the Future to wait for the result

The magic yield
- How does yielding a Future wait for its result?
  1. “Trampoline” code calls g.next() or g.send() on the underlying generator object
  2. If this returns a Future, the trampoline adds a callback to the Future to restart the generator
  3. It’s up to whatever created the Future to make sure that its result is eventually set
  4. Go to #1, passing the result into g.send()

Edge cases
- If g.next() or g.send() raises StopIteration, we’re done (ndb.Return is a subclass thereof)
- If it raises another exception, we’re also done, and we pass the exception on
- If it returns an RPC instead of a Future, use the event loop’s native understanding of RPCs
- If it returns a non-Future, that’s an error
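The trampoline and its edge cases can be sketched in a few dozen lines. This is a simplified illustration written for Python 3, not NDB's implementation: the event loop is a plain callback queue, there is no RPC case, and every name (`_queue`, `run_loop`, `tasklet`, `_step`, `sleep_async`, `add_async`) is hypothetical. In Python 3 a generator can `return x` directly, which raises StopIteration carrying `x`; that plays the role of `raise ndb.Return(x)` on the slides.

```python
import collections

_queue = collections.deque()   # the "event loop": a FIFO of (callback, args)


def run_loop():
    while _queue:
        callback, args = _queue.popleft()
        callback(*args)


class Future:
    def __init__(self):
        self.done = False
        self.result = None
        self.exception = None
        self.callbacks = []

    def add_callback(self, callback):
        if self.done:
            _queue.append((callback, (self,)))
        else:
            self.callbacks.append(callback)

    def _finish(self):
        self.done = True
        for callback in self.callbacks:
            _queue.append((callback, (self,)))
        self.callbacks = []

    def set_result(self, value):
        self.result = value
        self._finish()

    def set_exception(self, exc):
        self.exception = exc
        self._finish()

    def get_result(self):
        run_loop()   # synchronous wait: drain the event loop
        if not self.done:
            raise RuntimeError("nothing will ever set this Future")
        if self.exception is not None:
            raise self.exception
        return self.result


def tasklet(func):
    def wrapper(*args, **kwargs):
        fut = Future()                 # handed to the caller right away
        gen = func(*args, **kwargs)
        _queue.append((_step, (gen, fut, None, None)))   # queue the helper
        return fut
    return wrapper


def _step(gen, fut, value, exc):
    """One bounce of the trampoline: resume the generator once."""
    try:
        if exc is not None:
            yielded = gen.throw(exc)
        else:
            yielded = gen.send(value)   # send(None) on the first bounce
    except StopIteration as stop:
        fut.set_result(stop.value)      # the tasklet returned
        return
    except Exception as err:
        fut.set_exception(err)          # pass the exception on
        return
    if not isinstance(yielded, Future):
        gen.close()
        fut.set_exception(TypeError("tasklet yielded a non-Future"))
        return
    # Restart the generator once the yielded Future has a result.
    yielded.add_callback(lambda f: _step(gen, fut, f.result, f.exception))


def sleep_async(value):
    """Fake async operation: the event loop sets the result later."""
    fut = Future()
    _queue.append((fut.set_result, (value,)))
    return fut


@tasklet
def add_async(a, b):
    x = yield sleep_async(a)
    y = yield sleep_async(b)
    return x + y


total = add_async(2, 3).get_result()
```

Note that `get_result()` drains the shared queue, so calling it from inside a tasklet would re-enter the loop; that is one way to picture why mixing synchronous and async calls is listed among the caveats.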

You don’t have to understand this
- Just remember these rules:
  - use @ndb.tasklet on a generator function
  - yield *_async() operations
  - raise ndb.Return(x) instead of return x
- Use yield <list> to increase concurrency
- Don’t call synchronous APIs!
- Helpful convention: name tasklets *_async
- Exception passing is remarkably natural

Auto-batching
- Automatically combine operations in one RPC
- Only like operations can be combined
- Must use async API to benefit
- Example:
    e1.put(); e2.put()  # Two RPCs
    yield e1.put_async(), e2.put_async()  # One RPC!
- Implemented for datastore get, put, delete; and memcache operations (via Context)

Auto-batching (2)
- Biggest benefit is between multiple tasklets
- Each tasklet does some single ops
  - example: get_or_insert()
- Tasklets are run concurrently
  - Each tasklet in turn runs until first blocking op
  - Those ops are buffered, not sent out yet
  - When no tasklets left to run, buffered ops are combined into one batch RPC

Auto-batching (3)
- Each original single op has its own Future
- When the RPC completes, its result is distributed back over those Futures
- And… the tasklets are back in the race!
- But… why not just manually batch operations?
  - restructuring your code to do that is often hard!
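The buffering and distribution steps described on the auto-batching slides can be sketched without any tasklet machinery. The code below is an assumption-based toy, not NDB's batcher: `AutoBatcher`, `fake_batch_get` and `DATA` are hypothetical, and `flush()` is called by hand where a real event loop would call it once no tasklet can make progress.

```python
class Future:
    """Bare-bones result holder for this sketch."""

    def __init__(self):
        self.done = False
        self.result = None

    def set_result(self, value):
        self.result = value
        self.done = True


class AutoBatcher:
    """Buffers single 'get' operations and sends them as one batch call."""

    def __init__(self, batch_get):
        self._batch_get = batch_get   # callable: list of keys -> list of values
        self._pending = []            # (key, Future) pairs not yet sent

    def get_async(self, key):
        fut = Future()                # each single op has its own Future
        self._pending.append((key, fut))
        return fut

    def flush(self):
        if not self._pending:
            return
        pending, self._pending = self._pending, []
        keys = [key for key, _ in pending]
        values = self._batch_get(keys)          # one "RPC" for all buffered ops
        for (_, fut), value in zip(pending, values):
            fut.set_result(value)               # distribute results back


DATA = {"a": 1, "b": 2, "c": 3}
rpc_log = []


def fake_batch_get(keys):
    """Stand-in for a batch datastore RPC; logs each call it receives."""
    rpc_log.append(list(keys))
    return [DATA.get(key) for key in keys]


batcher = AutoBatcher(fake_batch_get)
f1 = batcher.get_async("a")
f2 = batcher.get_async("c")
f3 = batcher.get_async("missing")
batcher.flush()
print(len(rpc_log), f1.result, f2.result, f3.result)   # prints: 1 1 3 None
```

The callers never see the batch: each one holds only its own Future, which is why code written as single operations does not need to be restructured to benefit.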

Conclusion: caveats
- Async coding has lots of newbie traps
- Careful when overlapping I/O and CPU work
  - auto-batch queues only flushed when blocking
- Mixing async and synchronous ops can be bad
  - in extreme cases can cause stack overflow
- Debugging async code is a challenge
  - too much state in suspended generators’ locals
  - can’t easily step over a yield in pdb

Q & A