Stefan Tilkov's Random Stuff

QCon SF 2009: Max Ross, Mapping Relational Data Patterns to the App Engine Datastore

These are my unedited notes from Max Ross's talk about Mapping Relational Data Patterns to the App Engine Datastore at QCon SF 2009.

  • Datastore is transactional, natively partitioned, hierarchical, schema-less, based on BigTable – not a relational database
  • Goals: Simplify storage by simplifying development, management
  • Even though Datastore is based on the ridiculously scalable BigTable, you don't need to have scalability problems to benefit from it
  • Scale always matters - the problem is not in the second step, it's the first step
  • Free to get started (not only for the first 30 days), pay only for what you need
  • Let someone else manage upgrades, redundancy, connectivity
  • Let someone else handle problems
  • Detailed post-mortem of GAE downtime available somewhere
  • Scale automatically to any point on the scale curve
  • Trying to get people out of the business of managing their database in production
  • Basic entity: Kind, Entity group, key, age, + any number of properties
  • Datastore is schemaless - soft schema model. Much of the stuff available in the DB (constraints, type checking, schema) needs to move up to the app layer (but is usually replicated there anyway)
  • primary benefit of the schemaless datastore: much faster iterations
  • soft schemas can give you type safety despite using a simple key/value store underneath
  • JPA annotations provide soft schema - even though targeted at creating DB information, GAE can benefit from it
  • JPA annotations are a data definition language (proof: relational DB schema can be created from annotations)
  • Primary keys in the datastore contain the kind and are hierarchical, e.g. /Person:13/Pet:Ernie
  • Analogy: Hierarchical datastore keys are similar to composite primary keys
  • Surrogate keys are harder to move - dropping is often not an option. Mapping options: 1) make surrogate part of the key a property 2) make surrogate key primary key, put rest into property
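
To make the hierarchical key idea concrete, here is a minimal sketch in plain Java. KeyPath and its methods are hypothetical names invented for illustration and are not part of the App Engine API; the sketch only models a key as a path of Kind:name elements whose first element identifies the entity group.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical stand-in for a datastore key: a path of "Kind:name" elements. */
final class KeyPath {
    private final List<String> elements;

    private KeyPath(List<String> elements) {
        this.elements = Collections.unmodifiableList(elements);
    }

    /** Creates a root key such as /Person:13. */
    static KeyPath root(String kind, String name) {
        List<String> path = new ArrayList<>();
        path.add(kind + ":" + name);
        return new KeyPath(path);
    }

    /** Creates a child key by appending one element to this key's path. */
    KeyPath child(String kind, String name) {
        List<String> path = new ArrayList<>(elements);
        path.add(kind + ":" + name);
        return new KeyPath(path);
    }

    /** The root element of the path; keys with the same root share an entity group. */
    String entityGroup() {
        return "/" + elements.get(0);
    }

    @Override
    public String toString() {
        return "/" + String.join("/", elements);
    }

    public static void main(String[] args) {
        KeyPath ernie = KeyPath.root("Person", "13").child("Pet", "Ernie");
        System.out.println(ernie);               // /Person:13/Pet:Ernie
        System.out.println(ernie.entityGroup()); // /Person:13
    }
}
```

Read this way, the analogy to composite primary keys is visible: the parent's key is a prefix of the child's key, much like a foreign key column that is part of a composite primary key.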

Transactions

  • transactions in the Datastore apply to a single Entity Group
  • Entities in the same Entity Group share the same root part of the key
  • This makes Entity Group selection a critical design choice, with obvious effects on transactions
  • Too coarse hurts throughput, too fine limits usefulness of transactions
  • Datastore does optimistic concurrency checks at the Entity Group level
  • [Strong relationship between data modeling and transaction processing – reminds me of the old debate on EJB 2.0 pre-final entity beans and dependent objects]
  • Unreleased new feature: Transactional tasks can update multiple entity groups, a task in a queue can participate in a DB transaction
  • Example: Deferred, transactional, async balance update (eventual consistency) as well as synchronous
  • Two-phase commit protocol algorithm designed at Berkeley, implemented by a Google developer (Erick Armbrust)
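
The effect of optimistic concurrency checks at the Entity Group level can be sketched with a small in-memory model. Everything here (EntityGroupStore, Txn, the version counter) is a hypothetical illustration, not the datastore's actual implementation: each entity group carries one version number, and a commit succeeds only if that number has not changed since the transaction began.

```java
import java.util.ConcurrentModificationException;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical in-memory model: one version counter per entity group. */
final class EntityGroupStore {
    private final Map<String, Long> versions = new HashMap<>();   // group -> commit count
    private final Map<String, String> entities = new HashMap<>(); // full key -> value

    /** Starts a transaction bound to a single entity group. */
    synchronized Txn begin(String group) {
        return new Txn(group, versions.getOrDefault(group, 0L));
    }

    synchronized String get(String key) {
        return entities.get(key);
    }

    /** Buffers writes and remembers the group version it started from. */
    final class Txn {
        private final String group;
        private final long startVersion;
        private final Map<String, String> writes = new HashMap<>();

        private Txn(String group, long startVersion) {
            this.group = group;
            this.startVersion = startVersion;
        }

        void put(String key, String value) {
            // A transaction may only touch keys under its own entity group root.
            if (!key.equals(group) && !key.startsWith(group + "/")) {
                throw new IllegalArgumentException(key + " is outside entity group " + group);
            }
            writes.put(key, value);
        }

        /** Applies the writes only if nobody else committed to the group meanwhile. */
        void commit() {
            synchronized (EntityGroupStore.this) {
                long current = versions.getOrDefault(group, 0L);
                if (current != startVersion) {
                    throw new ConcurrentModificationException("entity group " + group + " changed");
                }
                entities.putAll(writes);
                versions.put(group, current + 1);
            }
        }
    }

    public static void main(String[] args) {
        EntityGroupStore store = new EntityGroupStore();
        Txn a = store.begin("/Person:13");
        Txn b = store.begin("/Person:13");
        a.put("/Person:13/Pet:Ernie", "cat");
        b.put("/Person:13/Pet:Bert", "dog"); // different entity, same group
        a.commit();
        try {
            b.commit();
        } catch (ConcurrentModificationException e) {
            System.out.println("b must retry: " + e.getMessage());
        }
    }
}
```

In this model the second transaction fails even though it wrote a different entity, which is why overly coarse entity groups hurt throughput. A transaction on /Person:14 would not be affected.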

Relationships

  • Letting a framework manage relationships can simplify code for RDBMS, but especially for App Engine Datastore
  • Goal: make handling relationships with JPA as easy as possible
  • Google's JPA implementation has some sensible defaults: Ownership implies entities are placed in the same Entity Group
  • E.g. Person with a @OneToMany to Pet (with a back reference of @ManyToOne) makes both part of the same Entity Group
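
The Person and Pet mapping mentioned above might look roughly like the following. This is a sketch under assumptions: the four annotations are declared locally as stand-ins so the example compiles without the JPA jar (in a real project they would come from javax.persistence), and the field names and id types are illustrative only. The App Engine JPA plugin may impose its own requirements on key fields.

```java
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.util.ArrayList;
import java.util.List;

// Local stand-ins so the sketch compiles without the JPA jar on the classpath.
// In a real project, delete these four declarations and import javax.persistence.* instead.
@Retention(RetentionPolicy.RUNTIME) @interface Entity {}
@Retention(RetentionPolicy.RUNTIME) @interface Id {}
@Retention(RetentionPolicy.RUNTIME) @interface OneToMany { String mappedBy() default ""; }
@Retention(RetentionPolicy.RUNTIME) @interface ManyToOne {}

@Entity
class Person {
    @Id Long id;

    // Person owns its pets; "owner" names the back-reference field on Pet.
    @OneToMany(mappedBy = "owner")
    List<Pet> pets = new ArrayList<>();

    void addPet(Pet pet) {
        pet.owner = this; // keep both sides of the relationship in sync
        pets.add(pet);
    }

    public static void main(String[] args) {
        Person person = new Person();
        person.id = 13L;
        Pet ernie = new Pet();
        ernie.name = "Ernie";
        person.addPet(ernie);
        System.out.println(ernie.owner == person); // true
    }
}

@Entity
class Pet {
    @Id String name;

    // Back reference to the owning Person.
    @ManyToOne
    Person owner;
}
```

The annotations only declare the ownership. Placing the owned Pet in the owner's Entity Group is the default of Google's JPA implementation described above, not something this sketch does.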

Queries

  • Testing set membership – requires a join table with an RDBMS, can use a multi-value property in the GAE datastore (select from User where hobbies = 'yoga')
  • Other than that, no joins supported
  • Conflict: Google promises that query performance scales linearly with the size of the result set; not possible when cross products are needed to fulfill queries
  • Making good progress on supporting a subset of joins; not released yet, nowhere near ready for production
  • RDBMS encourage cheap writes and expensive reads; the datastore encourages expensive writes and cheap reads. Denormalization encouraged where it makes sense
  • Obvious problems with denormalized data
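
The set membership query and the expensive-write, cheap-read trade-off can be illustrated with a toy index. MultiValueIndex is a hypothetical in-memory model, not the datastore's real index format: storing a multi-valued property writes one index row per value, so an equality filter such as hobbies = 'yoga' becomes a single lookup instead of a join.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical in-memory index: one index row per value of a multi-valued property. */
final class MultiValueIndex {
    // "property=value" -> keys of all entities holding that value
    private final Map<String, Set<String>> rows = new HashMap<>();

    /** Expensive write: every value of the property gets its own index row. */
    void put(String entityKey, String property, List<String> values) {
        for (String value : values) {
            rows.computeIfAbsent(property + "=" + value, k -> new TreeSet<>()).add(entityKey);
        }
    }

    /** Cheap read: an equality filter is a single lookup, with no join table. */
    Set<String> whereEquals(String property, String value) {
        Set<String> hits = rows.get(property + "=" + value);
        return hits == null ? Collections.<String>emptySet() : Collections.unmodifiableSet(hits);
    }

    public static void main(String[] args) {
        MultiValueIndex index = new MultiValueIndex();
        index.put("/User:alice", "hobbies", Arrays.asList("yoga", "chess"));
        index.put("/User:bob", "hobbies", Arrays.asList("chess"));
        System.out.println(index.whereEquals("hobbies", "yoga"));  // [/User:alice]
        System.out.println(index.whereEquals("hobbies", "chess")); // [/User:alice, /User:bob]
    }
}
```

The cost moves to write time: an entity with n hobbies produces n index rows, but the read touches only the rows that end up in the result, which fits the promise that query cost scales with the size of the result set.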

Taking code somewhere else

  • App Engine is in general more restrictive
  • Suggestion: Decide early whether or not portability matters to you
  • Shows examples of portable code - somewhat ugly
  • Congratulations, you have already sharded your data model

Key takeaways

  • App Engine datastore simplifies persistence
  • JPA adds typical RDBMS features to the datastore
  • Important to understand how the datastore is different
  • Easier to move apps off than on
  • If portability is important, plan for it
  • http://gae-java-persistence.blogspot.com

Q&A

  • Q. Does the shown transaction example really solve the problem? A. No, not to the full extent. A lot of Google's billing software is built without multi-row transactions
  • Q. Is JPA a good model when starting from scratch? A. Many people like the low-level API, then start building an ORM on top of it … possibly better to start using an existing one.
  • Q. What kind of apps are on GAE? A. Not really known, many backend applications for iPhone apps, Facebook, … Obama virtual town hall meeting peaked at 700 req/s
  • Q. Export features? A. Some bulk import/export, but there should be more
  • Q. Caching? A. No direct support for JPA caching using memcached, but should be pluggable
  • Q. Is Python going to be replaced by Java? A. Absolutely not, the Java team rather has to fight to be accepted as an equal citizen
  • Q. Restrictions on some JDK features relevant? A. No.
  • Q. Staging area? A. No, not yet.
  • Q. JDO? A. GAE supports both, DataNucleus supports both; JPA was chosen randomly for this talk today.
  • Q. Can apps be run offline? A. You can run the app with the SDK locally, but it won't scale; the stub implementations are pluggable, though, and could be replaced.