This is a single archived entry from Stefan Tilkov’s blog. For more up-to-date content, check out my author page at INNOQ, which has more information about me and also contains a list of published talks, podcasts, and articles. Or you can check out the full archive.

Notes from Werner Vogels's Keynote

Stefan Tilkov, Apr 17, 2008

In what has become a tradition when I'm in the mood, these are my unedited notes from Werner Vogels's keynote talk "Web-scale Computing: Compete on Ideas, not Resources" at IIR's (German) Web 2.0/SOA/EAM conference in Wiesbaden.

everybody in the audience except two guys are Amazon customers
when you put something in your shopping cart, you don't want to care about the technical details
now: take off your Amazon customer hat, think of Amazon as a technology provider
shows example - subscription model for toilet paper!
"buy box" (the blue area) shows the best product for the customer - even if it's not sold by Amazon.com
being a platform provider means you have to be absolutely neutral
many other examples of websites powered by Amazon.com
some statistical data - 80M customers, 1.3M active resellers ...
retail, ecommerce (associates), infrastructure ws, enterprise customers
shows Amazon.com from 1995 - key idea back then: do something on the Web that you couldn't do otherwise (have all of the world's books in stock)
history: app server/database (1995-2001) --> service orientation --> massively scalable services
for one year, Amazon ran a mainframe DB
in 2001, the Web servers hit a performance/scalability wall
2001-2004: services
now, everything is massive scale
the secret sauce of Amazon.com: not its recommendations, but its capability to do anything at scale
1st step: modularization - co-locate data and the logic depending on it, no direct DB access anymore
now: ~1000 different services
a page will hit 250-350 services - even single lines, such as "sales rank", call a service
large services at the bottom (customer, product, offer) serve as indices to additional services
each service has a small team associated with it, responsible for building and running it - no separate operations dept
no better motivation to fix a bug than your beeper going off at 4 in the morning
software as bits as opposed to software as a service
one bug/one fix approach
the whole saas thing is a big lie! there's a big elephant in the room nobody's talking about
reason: between test and operate, you need to handle all the non-functionals - load balancing, scaling, utilization, ...
vendors have no idea how to handle these things b/c traditionally, the customers did it
most of the engineers' time was spent configuring routers and managing load balancers - 70% of their time went into undifferentiated heavy lifting
example: picture of AT&T data center built near a trailer park - which of course was destroyed by a tornado
365 Main in downtown SF runs 8 generators in their data center - three months ago 6 of 8 generators failed despite being tested --> most of Web 2.0 offline
Google study: 10% of disks will fail per year - w/ 80000 disks in a typical data center means 8000 disks fail per year -> you'll have people employed who only change disks
graph of target.com and walmart.com -> holiday peaks 2-3 times the rest of the year's average
lessons learned - offer 1000 Wiis, 100000 people will show up
pitch for 37signals' "Getting Real"
Amazon.com web services: S3, SQS, EC2, SimpleDB, FPS
was used internally for 2 years before it was offered externally
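The disk-failure figures quoted in these notes (10% annual failure rate, 80000 disks) can be checked with a few lines of arithmetic. This is only a rough sketch: the function names are made up for illustration, and spreading failures evenly over 365 days is an assumption, not something stated in the talk.

```python
# Figures as quoted in the notes above; the even spread over the year
# is a simplifying assumption for illustration only.
ANNUAL_FAILURE_RATE = 0.10
DISKS_IN_DATA_CENTER = 80_000


def expected_failures_per_year(disks: int, annual_rate: float) -> float:
    """Expected number of failed disks per year."""
    return disks * annual_rate


def expected_failures_per_day(disks: int, annual_rate: float,
                              days_per_year: int = 365) -> float:
    """Expected number of failed disks per day, assuming an even spread."""
    return expected_failures_per_year(disks, annual_rate) / days_per_year


if __name__ == "__main__":
    per_year = expected_failures_per_year(DISKS_IN_DATA_CENTER, ANNUAL_FAILURE_RATE)
    per_day = expected_failures_per_day(DISKS_IN_DATA_CENTER, ANNUAL_FAILURE_RATE)
    print(f"failures per year: {per_year:.0f}")
    print(f"failures per day:  {per_day:.1f}")
```

Under these assumptions that is roughly 22 disk swaps every day, which is why a data center of that size ends up with people who do nothing else.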

Scalability

growth by good customer experience -> traffic -> sellers -> selection -> lower prices -> customer experience
incremental scalability is key
being able to grow systems one step at a time
infrastructure needs to move from capital investment to variable cost
elastic: capable of growing and shrinking on demand
minimal disruption to customer performance
addresses: different growth paths, fault-tolerance, heterogeneity, operational efficiency
you can't assume your infrastructure is homogeneous
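To make the "capital investment vs. variable cost" point concrete, here is a minimal sketch comparing capacity sized for the peak with capacity that follows demand. The demand numbers are hypothetical (flat for ten months, then a 3x holiday peak, in line with the 2-3x peaks mentioned earlier in the notes), and the flat unit cost and function names are assumptions for illustration.

```python
from typing import Sequence


def fixed_capacity_cost(demand: Sequence[int], unit_cost: float) -> float:
    """Cost when capacity is sized for the highest demand in every period."""
    if not demand:
        raise ValueError("demand must not be empty")
    return max(demand) * len(demand) * unit_cost


def elastic_cost(demand: Sequence[int], unit_cost: float) -> float:
    """Cost when capacity grows and shrinks with demand in each period."""
    if not demand:
        raise ValueError("demand must not be empty")
    return sum(demand) * unit_cost


def utilization(demand: Sequence[int]) -> float:
    """Fraction of peak-sized capacity that is actually used."""
    if not demand:
        raise ValueError("demand must not be empty")
    return sum(demand) / (max(demand) * len(demand))


if __name__ == "__main__":
    # Hypothetical monthly demand in server units: ten normal months,
    # then two holiday months at three times the normal load.
    monthly_demand = [100] * 10 + [300] * 2
    print("fixed capacity cost:", fixed_capacity_cost(monthly_demand, 1.0))
    print("elastic cost:       ", elastic_cost(monthly_demand, 1.0))
    print(f"utilization of peak-sized capacity: {utilization(monthly_demand):.0%}")
```

With these made-up numbers, peak-sized capacity sits more than half idle over the year. Real pricing is rarely this simple, so treat the ratio as an illustration of the argument rather than a forecast.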

Availability

everything fails, all the time
somebody cuts a cable in the Suez canal - the rest of the world thinks India is gone
failures are highly correlated
things fail in groups
things don't fail by stopping - instead, systems fail by sending out large amounts of garbage
a load balancer sending to a machine returning very fast responses -- all 500s
let go of control - take a probabilistic approach: determinism doesn't exist in real life
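The load balancer example above (a machine answering very fast, but only with 500s) can be sketched as follows. This is a toy model, not how any particular load balancer works: `Backend`, `pick_fastest`, `pick_fastest_healthy`, the sliding window and the error-rate threshold are all hypothetical names and parameters chosen for illustration.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Sequence


@dataclass
class Backend:
    """Tracks the most recent responses observed from one machine."""
    name: str
    window: int = 100
    statuses: deque = field(default_factory=deque)
    latencies_ms: deque = field(default_factory=deque)

    def record(self, status: int, latency_ms: float) -> None:
        self.statuses.append(status)
        self.latencies_ms.append(latency_ms)
        if len(self.statuses) > self.window:
            self.statuses.popleft()
            self.latencies_ms.popleft()

    def error_rate(self) -> float:
        if not self.statuses:
            return 0.0
        errors = sum(1 for status in self.statuses if status >= 500)
        return errors / len(self.statuses)

    def mean_latency_ms(self) -> float:
        if not self.latencies_ms:
            return 0.0
        return sum(self.latencies_ms) / len(self.latencies_ms)


def pick_fastest(backends: Sequence[Backend]) -> Backend:
    """Naive policy: latency only. A machine spewing fast 500s wins."""
    return min(backends, key=lambda b: b.mean_latency_ms())


def pick_fastest_healthy(backends: Sequence[Backend],
                         max_error_rate: float = 0.5) -> Backend:
    """Ignore machines whose recent error rate is above the threshold."""
    healthy = [b for b in backends if b.error_rate() <= max_error_rate]
    candidates = healthy or list(backends)  # fall back if nothing looks healthy
    return min(candidates, key=lambda b: b.mean_latency_ms())


if __name__ == "__main__":
    good, broken = Backend("good"), Backend("broken")
    for _ in range(20):
        good.record(200, 80.0)    # slow but correct
        broken.record(500, 2.0)   # very fast, all errors
    print("latency only:  ", pick_fastest([good, broken]).name)
    print("health-aware:  ", pick_fastest_healthy([good, broken]).name)
```

The latency-only policy sends traffic to the broken machine precisely because it fails so quickly, which is the "failing by sending out garbage" behaviour described in the notes.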

Performance

engineering for performance for 99.9%
averages are irrelevant
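A small sketch of why averages hide what matters, assuming the 99.9% above refers to a latency percentile (the notes don't spell this out). The latency values are made up, and `percentile` uses the simple nearest-rank method, which is one of several common definitions.

```python
import math
from typing import Sequence


def mean_latency(values: Sequence[float]) -> float:
    """Arithmetic mean of the observed latencies."""
    if not values:
        raise ValueError("values must not be empty")
    return sum(values) / len(values)


def percentile(values: Sequence[float], p: float) -> float:
    """Nearest-rank percentile: smallest value with at least p% at or below it."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[min(rank, len(ordered)) - 1]


if __name__ == "__main__":
    # Hypothetical sample: 990 requests at 50 ms, 10 requests at 5000 ms.
    latencies_ms = [50] * 990 + [5000] * 10
    print("mean:   ", mean_latency(latencies_ms))      # 99.5
    print("median: ", percentile(latencies_ms, 50))    # 50
    print("p99.9:  ", percentile(latencies_ms, 99.9))  # 5000
```

The mean of 99.5 ms describes none of the requests in this sample: most see 50 ms, and 1% of them wait five seconds, which only the high percentile reveals.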

Cost effectiveness

uncertainty
acquire resources on demand - you can't predict anything
release resources when no longer needed
the new economy is all about much intensified competition
don't rely on resources
the power of your success is now no longer in your hands
these four non-functional properties of large systems are dominated by state management
categorization of data access patterns
- primary key access (high read volume, always writable)
- query-based access (relationless + relational)
two large services: S3 and SimpleDB
EC2 with persistent storage for dedicated purposes
billions of objects in Amazon S3
the traffic out of Amazon's web services is larger than the traffic of all retail properties combined
availability zones
explanation of persistent storage for EC2
the big deal is: any type of legacy system can be run within the cloud
the only thing needed to get started: a credit card and http://aws.amazon.com (no contract, negotiations, ...)
(Question by yours truly: does Amazon.com use the services internally?)
Yes, extensively, given it's . If S3 ever failed, you'd notice it in Amazon.com.
(Question: is Amazon impacted by the peak loads it has to handle?)
Amazon.com's scale is basically dwarfed by the platform it offers for others; it profits just as much.

Great talk, too bad it was this short.

On April 17, 2008 7:46 PM, mnot.net said:

Thanks for the writeup — looks like they’re doing even cooler stuff than their products imply!

On April 23, 2008 10:32 PM, Armin Auth said:

Stefan, great summary. Are slides of his talk available somewhere?