Notes from Werner Vogels's Keynote
In what has become a tradition when I'm in the mood, these are my unedited notes from Werner Vogels's keynote talk "Web-scale Computing: Compete on Ideas, not Resources" at IIR's (German) Web 2.0/SOA/EAM conference in Wiesbaden.
- everybody in the audience except two guys are Amazon customers
- when you put something in your shopping cart, you don't want to care about the technical details
- now: take off your Amazon customer hat, think of Amazon as a technology provider
- shows example - subscription model for toilet paper!
- "buy box" (the blue area) shows the best product for the customer - even if it's not sold by Amazon.com
- being a platform provider means you have to be absolutely neutral
- many other examples of websites powered by Amazon.com
- some statistical data - 80M customers, 1.3M active resellers ...
- retail, ecommerce (associates), infrastructure ws, enterprise customers
- shows Amazon.com from 1995 - key idea back then: do something on the Web that you couldn't do otherwise (have all of the world's books in stock)
- history: app server/database (1995-2001) --> service orientation --> massively scalable services
- for one year, Amazon ran a mainframe DB
- in 2001, the Web servers hit a performance/scalability wall
- 2001-2004: services
- now, everything is massive scale
- the secret sauce of Amazon.com: not its recommendations, but its capability to do anything at scale
- 1st step: modularization - co-locate data and the logic depending on it, no direct DB access anymore
- now: ~1000 different services
- a page will hit 250-350 services - even single lines, such as "sales rank", call a service
- large services at the bottom (customer, product, offer) serve as indices to additional services
- each service has a small team associated with it, responsible for building and running it - no separate operations dept
- no better motivation to fix a bug than your beeper going off at 4 in the morning
- software as bits as opposed to software as a service
- one bug/one fix approach
- the whole SaaS thing is a big lie! there's a big elephant in the room nobody's talking about
- reason: between test and operate, you need to handle all the non-functionals - load balancing, scaling, utilization, ...
- vendors have no idea how to handle these things b/c traditionally, the customers did it
- most of the engineers' time was spent configuring routers and managing load balancers - 70% of their time went to undifferentiated heavy lifting
- example: picture of AT&T data center built near a trailer park - which of course was destroyed by a tornado
- 365 Main in downtown SF runs 8 generators in their data center - three months ago 6 of 8 generators failed despite being tested --> most of Web 2.0 offline
- Google study: 10% of disks will fail per year - w/ 80000 disks in a typical data center, that means 8000 disks fail per year -> you'll have people employed who only change disks
- graph of target.com and walmart.com -> holiday peaks 2-3 times the rest of the year's average
- lessons learned - offer 1000 Wiis, 100000 people will show up
- pitch for 37signals' "Getting Real"
- Amazon.com web services: S3, SQS, EC2, SimpleDB, FPS
- was used internally for 2 years before it was offered externally
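The disk figure quoted above (10% annual failure rate, 80000 disks) is easy to check, and it gets more vivid when broken down per day. A minimal sketch; the two input numbers are from the talk, while spreading failures evenly over the year is my own simplifying assumption (and an optimistic one, given the later point that failures are correlated):

```python
# Back-of-the-envelope disk replacement arithmetic.
# Inputs are the figures quoted in the talk; the even spread over
# 365 days is a simplifying assumption, not something Vogels claimed.

def expected_disk_failures(disk_count: int, annual_failure_rate: float) -> float:
    """Expected number of disks failing in one year."""
    return disk_count * annual_failure_rate

failures_per_year = expected_disk_failures(80_000, 0.10)
failures_per_day = failures_per_year / 365

print(f"{failures_per_year:.0f} failures per year")  # 8000 failures per year
print(f"{failures_per_day:.1f} failures per day")    # 21.9 failures per day
```

Roughly 22 disk swaps a day is indeed a full-time job for somebody.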
Scalability
- growth by good customer experience -> traffic -> sellers -> selection -> lower prices -> customer experience
- incremental scalability is key
- being able to grow systems one step at a time
- infrastructure needs to move from capital investment to variable cost
- elastic: capable of growing and shrinking on demand
- minimal disruption to customer performance
- addresses: different growth paths, fault-tolerance, heterogeneity, operational efficiency
- you can't assume your infrastructure is homogeneous
Availability
- everything fails, all the time
- somebody cuts a cable in the Suez Canal - the rest of the world thinks India is gone
- failures are highly correlated
- things fail in groups
- things don't fail by stopping - instead, systems fail by sending out large amounts of garbage
- a load balancer sending to a machine returning very fast responses -- all 500s
- let go of control - take a probabilistic approach: determinism doesn't exist in real life
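The load balancer example can be sketched in a few lines. This is an illustration I made up, not Amazon's setup: the `Backend` class, the backend names, the numbers and the 5% error threshold are all hypothetical. It shows why a policy that only looks at response time will happily pick the one machine that is broken:

```python
from dataclasses import dataclass

@dataclass
class Backend:
    name: str
    avg_latency_ms: float  # observed average response time
    error_rate: float      # fraction of responses that were HTTP 5xx

def pick_by_latency(backends):
    """Naive policy: send traffic to whichever backend answers fastest."""
    return min(backends, key=lambda b: b.avg_latency_ms)

def pick_healthy(backends, max_error_rate=0.05):
    """Same policy, but only among backends with an acceptable error rate."""
    healthy = [b for b in backends if b.error_rate <= max_error_rate]
    if not healthy:
        raise RuntimeError("no healthy backends")
    return min(healthy, key=lambda b: b.avg_latency_ms)

# Hypothetical fleet: web-3 is broken and returns instant 500s.
backends = [
    Backend("web-1", avg_latency_ms=120.0, error_rate=0.01),
    Backend("web-2", avg_latency_ms=95.0, error_rate=0.02),
    Backend("web-3", avg_latency_ms=3.0, error_rate=1.0),
]

print(pick_by_latency(backends).name)  # web-3
print(pick_healthy(backends).name)     # web-2
```

The broken machine looks like the best one by the latency metric, which is exactly the "fails by sending garbage" case.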
Performance
- engineering for performance at 99.9% of requests
- averages are irrelevant
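To see why averages hide the problem, here is a small sketch with made-up latencies (990 fast requests and 10 very slow ones; the numbers and the nearest-rank percentile helper are my own assumptions, not from the talk):

```python
import math
import statistics

def percentile(values, p):
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, min(len(ordered), math.ceil(p / 100 * len(ordered))))
    return ordered[rank - 1]

# Hypothetical sample: 990 requests at 50 ms, 10 requests at 2000 ms.
latencies_ms = [50] * 990 + [2000] * 10

mean_ms = statistics.mean(latencies_ms)
p50_ms = percentile(latencies_ms, 50)
p999_ms = percentile(latencies_ms, 99.9)

print(mean_ms)  # 69.5
print(p50_ms)   # 50
print(p999_ms)  # 2000
```

The mean of 69.5 ms looks healthy, yet one in a hundred customers waits two seconds, and only the high percentile shows it.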
Cost effectiveness
- uncertainty
- acquire resources on demand - you can't predict anything
- release resources when no longer needed
- the new economy is all about much intensified competition
- don't rely on resources
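The capital-investment versus variable-cost argument can be put into numbers. A rough sketch with a hypothetical demand profile (eleven ordinary months plus one holiday month at three times the load, loosely following the target.com/walmart.com graph); it counts server-months only and deliberately ignores that owned and rented servers have different unit prices:

```python
# Hypothetical demand: servers needed in each month of one year.
monthly_demand = [100] * 11 + [300]

def fixed_capacity_server_months(demand):
    """Own enough servers for the peak month, all year long."""
    return max(demand) * len(demand)

def elastic_server_months(demand):
    """Acquire what each month needs and release the rest."""
    return sum(demand)

fixed = fixed_capacity_server_months(monthly_demand)
elastic = elastic_server_months(monthly_demand)
utilization = elastic / fixed  # share of the fixed fleet actually doing work

print(fixed)                 # 3600
print(elastic)               # 1400
print(f"{utilization:.0%}")  # 39%
```

With these made-up numbers, a fleet sized for the holiday peak sits idle more than 60% of the time.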
the power of your success is now no longer in your hands
these four non-functional properties of large systems are dominated by state management
categorization of data access patterns
- primary key access (high read volume, always writable)
- query-based access (relationless + relational)
- two large services: S3 and SimpleDB
- EC2 with persistent storage for dedicated purposes
- billions of objects in Amazon S3
- the traffic out of Amazon's web services is larger than the traffic of all retail properties combined
- availability zones
- explanation of persistent storage for EC2
- the big deal is: any type of legacy system can be run within the cloud
- the only thing needed to get started: a credit card and http://aws.amazon.com (no contract, negotiations, ...)
- (Question by yours truly: does Amazon.com use the services internally?)
- Yes, extensively, given it's . If S3 ever failed, you'd notice it in Amazon.com
- (Question: is Amazon impacted by the peak loads it has to handle?)
- Amazon.com scale is basically dwarfed by the platform it offers for others; it profits just as much.
Great talk, too bad it was this short.
Thanks for the writeup — looks like they’re doing even cooler stuff than their products imply!
Stefan, great summary. Are slides of his talk available somewhere?