These are my unedited notes from Max Ross's talk about Mapping Relational Data Patterns to the App Engine Datastore at QCon SF 2009.
- Datastore is transactional, natively partitioned, hierarchical, schema-less, based on BigTable – not a relational database
- Goals: Simplify storage by simplifying development, management
- Even though Datastore is based on the ridiculously scalable BigTable, you don't need to have scalability problems to benefit from it
- Scale always matters - the problem is not in the second step, it's the first step
- Free to get started (not only for the first 30 days), pay only for what you need
- Let someone else manage upgrades, redundancy, connectivity
- Let someone else handle problems
- Detailed post-mortem of GAE downtime available somewhere
- Scale automatically to any point on the scale curve
- Trying to get people out of the business of managing their database in production
- Basic entity: Kind, Entity group, key, age, + any number of properties
- Datastore is schemaless - soft schema model. Much of the stuff available in the DB (constraints, type checking, schema) needs to move up to the app layer (but is usually replicated there anyway)
- primary benefit of the schemaless datastore: much faster iterations
- soft schemas can give you type safety despite using a simple key/value store underneath
- JPA annotations provide soft schema - even though targeted at creating DB information, GAE can benefit from it
- JPA annotations are a data definition language (proof: relational DB schema can be created from annotations)
- Primary keys in the datastore contain the kind and are hierarchical, e.g. /Person:13/Pet:Ernie
- Analogy: Hierarchical datastore keys are similar to composite primary keys
- Surrogate keys are harder to move - dropping is often not an option. Mapping options: 1) make surrogate part of the key a property 2) make surrogate key primary key, put rest into property
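To make the key path idea concrete, here is a minimal sketch in plain Java. It is not the App Engine API: the class PathKey and its methods of, child and root are hypothetical names chosen for illustration. It only models a key as a chain of (kind, id) pairs, which is also why the analogy to composite primary keys works.

```java
import java.util.Objects;

// Hypothetical, simplified model of a hierarchical key (not the App Engine Key class).
final class PathKey {
    private final PathKey parent; // null for a root key
    private final String kind;
    private final String id;

    private PathKey(PathKey parent, String kind, String id) {
        this.parent = parent;
        this.kind = Objects.requireNonNull(kind);
        this.id = Objects.requireNonNull(id);
    }

    // Creates a root key, e.g. /Person:13
    static PathKey of(String kind, String id) {
        return new PathKey(null, kind, id);
    }

    // Creates a key below this one, e.g. /Person:13/Pet:Ernie
    PathKey child(String kind, String id) {
        return new PathKey(this, kind, id);
    }

    // Walks up the parent chain to the first element of the path.
    PathKey root() {
        return parent == null ? this : parent.root();
    }

    @Override
    public String toString() {
        return (parent == null ? "" : parent.toString()) + "/" + kind + ":" + id;
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof PathKey)) return false;
        PathKey k = (PathKey) o;
        return Objects.equals(parent, k.parent) && kind.equals(k.kind) && id.equals(k.id);
    }

    @Override
    public int hashCode() {
        return Objects.hash(parent, kind, id);
    }
}
```

With this sketch, PathKey.of("Person", "13").child("Pet", "Ernie") prints as /Person:13/Pet:Ernie, and calling root() on it returns the key for /Person:13.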
Transactions
- transactions in the Datastore apply to a single Entity Group
- Entities in the same Entity Group share the same root part of the key
- This makes Entity Group selection a critical design choice, with obvious effects on transactions
- Too coarse hurts throughput, too fine limits usefulness of transactions
- Datastore does optimistic concurrency checks at the Entity Group level
- [Strong relationship between data modeling and transaction processing – reminds me of the old debate on EJB 2.0 pre-final entity beans and dependent objects]
- Unreleased new feature: Transactional tasks can update multiple entity groups, a task in a queue can participate in a DB transaction
- Example: Deferred, transactional, async balance update (eventual consistency) as well as synchronous
- Two-phase commit protocol algorithm implemented at Berkeley, implemented by a Google developer (Erick Armbrust)
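The trade-off in choosing Entity Groups can be illustrated with a small in-memory model of optimistic concurrency at the group level. This is a sketch under assumptions, not how the datastore is implemented and not its API: GroupStore, Txn, groupOf and the per-group version counter are all hypothetical. The Entity Group is taken to be the first element of the key path.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical in-memory model of per-entity-group optimistic concurrency.
final class GroupStore {
    private final Map<String, Long> versions = new HashMap<>(); // group -> commit counter
    private final Map<String, String> data = new HashMap<>();   // key path -> value

    // Entity group = first path element, e.g. "/Person:13" for "/Person:13/Pet:Ernie".
    static String groupOf(String keyPath) {
        int next = keyPath.indexOf('/', 1);
        return next < 0 ? keyPath : keyPath.substring(0, next);
    }

    // Starts a transaction bound to a single entity group.
    Txn begin(String group) {
        return new Txn(group, versions.getOrDefault(group, 0L));
    }

    String get(String keyPath) {
        return data.get(keyPath);
    }

    final class Txn {
        private final String group;
        private final long versionAtBegin;
        private final Map<String, String> writes = new HashMap<>();

        private Txn(String group, long versionAtBegin) {
            this.group = group;
            this.versionAtBegin = versionAtBegin;
        }

        // Buffers a write; keys outside the transaction's group are rejected.
        void put(String keyPath, String value) {
            if (!groupOf(keyPath).equals(group)) {
                throw new IllegalArgumentException("key outside entity group: " + keyPath);
            }
            writes.put(keyPath, value);
        }

        // Returns false if another transaction committed to the same group first.
        boolean commit() {
            long current = versions.getOrDefault(group, 0L);
            if (current != versionAtBegin) {
                return false;
            }
            data.putAll(writes);
            versions.put(group, current + 1);
            return true;
        }
    }
}
```

In this model, two transactions that write different pets under /Person:13 still conflict because they share a group, which is the "too coarse hurts throughput" case. Transactions on /Person:13 and /Person:14 never conflict, but neither can update both in one commit.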
Relationships
- Letting a framework manage relationships can simplify code for RDBMS, but especially for App Engine Datastore
- Goal: make handling relationships with JPA as easy as possible
- Google's JPA implementation has some sensible defaults: Ownership implies entities are placed in the same Entity Group
- E.g. Person with a @OneToMany to Pet (with a back reference of @ManyToOne) makes both part of the same Entity Group
Queries
- Testing set membership – requires a join table with an RDBMS, can use a multi-value property in the GAE datastore (select from User where hobbies = 'yoga')
- Other than that, no joins supported
- Conflict: Google promises that query performance scales linearly with the size of the result set; not possible when cross products are needed to fulfill queries
- Making good progress on a subset of join support; not released yet - nowhere near ready for production
- RDBMSs encourage cheap writes and expensive reads; the datastore encourages expensive writes and cheap reads. Denormalization encouraged where it makes sense
- Obvious problems with denormalized data
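The set-membership query and the "expensive writes, cheap reads" point can be sketched together. The notes do not describe how the datastore indexes multi-value properties, so the following is only an illustration in plain Java: MultiValueIndex and its methods are hypothetical, and the one-index-entry-per-value layout is an assumption made for the example.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical index for multi-value properties (illustration only).
final class MultiValueIndex {
    // "property=value" -> keys of the entities that hold this value
    private final Map<String, Set<String>> index = new HashMap<>();

    // Write path: one index entry per value, so writes do more work.
    void put(String entityKey, String property, List<String> values) {
        for (String value : values) {
            index.computeIfAbsent(property + "=" + value, k -> new TreeSet<>()).add(entityKey);
        }
    }

    // Read path: a single lookup, no join table needed.
    Set<String> query(String property, String value) {
        return index.getOrDefault(property + "=" + value, Collections.emptySet());
    }
}
```

Storing a user with hobbies yoga and chess writes two index entries, while the equivalent of select from User where hobbies = 'yoga' becomes one call to query("hobbies", "yoga").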
Taking code somewhere else
- App Engine is in general more restrictive
- Suggestion: Decide early whether or not portability matters to you
- Shows examples of portable code - somewhat ugly
- Congratulations, you have already sharded your data model
Key takeaways
- App Engine datastore simplifies persistence
- JPA adds typical RDBMS features to the datastore
- Important to understand how the datastore is different
- Easier to move apps off than on
- If portability is important, plan for it
- http://gae-java-persistence.blogspot.com
Q&A
- Q. Does the shown transaction example really solve the problem? A. No, not to the full extent. A lot of Google's billing software is built without multi-row transactions
- Q. Is JPA a good model when starting from scratch? A. Many people like the low-level API, then start building an ORM on top of it … possibly better to start using an existing one.
- Q. What kind of apps are on GAE? A. Not really known, many backend applications for iPhone apps, Facebook, … Obama virtual town hall meeting peaked at 700 req/s
- Q. Export features? A. Some bulk import/export, but there should be more
- Q. Caching? A. No direct support for JPA caching using memcached, but should be pluggable
- Q. Is Python going to be replaced by Java? A. Absolutely not, the Java team rather has to fight to be accepted as an equal citizen
- Q. Restrictions on some JDK features relevant? A. No.
- Q. Staging area? A. No, not yet.
- Q. JDO? A. GAE supports both, DataNucleus supports both; JPA was chosen randomly for this talk today.
- Q. Can apps be run offline? A. You can run the app SDK locally, but it won't scale; the stub implementations are pluggable, though, and could be replaced.