A cloud is a large distributed system whose design requires trade-offs among competing goals. Notably, the CAP theorem, conjectured by Eric Brewer (a professor at UC Berkeley and the founder of Inktomi) at a PODC keynote talk, governs one such trade-off. The CAP theorem states that a distributed system can achieve at most two of three desirable properties: Consistency, Availability and Partition tolerance. Since distributed systems, by definition, run on a cluster of machines, and since the network connecting those machines can fail, these systems have to tolerate network partitions. So, in practice, the trade-off is usually between Availability and Consistency.
However, choosing availability over consistency is not necessarily good for enterprise applications, a major target audience of the Amazon Web Services offerings. Most enterprise applications require the data to be correct. It does not matter that the system is available: if the result is wrong, you cannot make progress on your application. As a concrete example, we recently worked on a project at Labs, called Cloud MapReduce, which implements Google's MapReduce programming model on the Amazon cloud. We have seen many manifestations of eventual consistency at work, and they created many problems for our implementation. Cloud MapReduce cannot progress correctly when it reads wrong results from the cloud; instead, it has to spend a lot of effort detecting and correcting the wrong results whenever consistency problems arise.
So, the natural question is: if you are not running an e-commerce business, why not choose an eventually available system over an eventually consistent one? "Eventually available" is a term I coined. It describes a system design that guarantees strong consistency at the expense of availability. The downside is that such a system may be unavailable for a brief period of time, but it guarantees that it will eventually become available again.
Why choose eventually available? The key reason is that it is much easier to deal with than an eventually consistent system. In Cloud MapReduce, we had to invent all sorts of techniques to detect and work around consistency problems, and none of them was easy. In an eventually available system, however, it is very easy both to detect that the system is unavailable (check the error code) and to work around it (just retry).
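The "check the error code, then retry" pattern can be sketched as follows. This is only a minimal illustration, not code from Cloud MapReduce: `ServiceUnavailable`, `call_with_retry` and `FlakyStore` are hypothetical names, the toy store stands in for a real cloud service, and the exponential back-off between attempts is an assumption rather than anything the service requires.

```python
import time


class ServiceUnavailable(Exception):
    """Hypothetical error raised while the store is temporarily unavailable."""


def call_with_retry(operation, max_attempts=5, base_delay=0.1, sleep=time.sleep):
    """Run operation(); on ServiceUnavailable, wait and try again.

    The delay doubles after each failed attempt (an assumed policy).
    The last failure is re-raised so the caller still sees the error.
    """
    for attempt in range(max_attempts):
        try:
            return operation()
        except ServiceUnavailable:
            if attempt == max_attempts - 1:
                raise
            sleep(base_delay * (2 ** attempt))


class FlakyStore:
    """Toy store: unavailable for the first `outage` calls, then always correct."""

    def __init__(self, outage):
        self.outage = outage
        self.calls = 0

    def read(self):
        self.calls += 1
        if self.calls <= self.outage:
            raise ServiceUnavailable("try again later")
        return "correct value"


if __name__ == "__main__":
    store = FlakyStore(outage=2)
    print(call_with_retry(store.read))
```

The point of the sketch is that the caller never has to guess whether an answer is stale: it either gets an explicit failure, which it retries, or a correct value.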
Fortunately, the Amazon cloud is moving towards offering eventual availability (even though they do not call it that yet) as an option, which makes it friendlier to enterprise applications. A couple of weeks ago, Amazon announced the Consistent Read and Conditional Put & Delete features for SimpleDB (see Werner's and Jeff Barr's posts). Both of these new features guarantee strong consistency. We have done an extensive performance study on the cost of eventual consistency, and surprisingly, our study found that strong consistency introduces no additional overhead in terms of either latency or throughput. In addition, during the 3 days of testing, we were not able to observe any system unavailability. Maybe eventually available is even good enough for an e-commerce application?
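To illustrate why a consistent read paired with a conditional put is useful, here is a minimal sketch of a safe counter increment. It does not call SimpleDB: `ToyStore`, `consistent_read`, `conditional_put` and `ConditionFailed` are hypothetical names for an in-memory stand-in, and the semantics shown (a write that succeeds only if the stored value still equals an expected value) are an assumed simplification of the announced features.

```python
class ConditionFailed(Exception):
    """Raised when the stored value does not match the expected value."""


class ToyStore:
    """In-memory stand-in for a store with consistent reads and conditional puts."""

    def __init__(self):
        self._data = {}

    def consistent_read(self, key):
        # Always returns the latest written value (None if the key is absent).
        return self._data.get(key)

    def conditional_put(self, key, new_value, expected):
        # Write only if the current value is still what the caller last read.
        if self._data.get(key) != expected:
            raise ConditionFailed(key)
        self._data[key] = new_value


def increment(store, key):
    """Read-modify-write loop: retry whenever another writer got there first."""
    while True:
        current = store.consistent_read(key)
        new_value = (current or 0) + 1
        try:
            store.conditional_put(key, new_value, expected=current)
            return new_value
        except ConditionFailed:
            continue  # value changed since our read; read again and retry


if __name__ == "__main__":
    store = ToyStore()
    increment(store, "jobs_done")
    print(increment(store, "jobs_done"))
```

As with unavailability, a lost race shows up as an explicit error that the caller can retry, rather than as a silently wrong result.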