A cloud is a large distributed system whose design requires trade-offs among competing goals. Notably, the CAP theorem, conjectured by Eric Brewer -- a professor at UC Berkeley and a co-founder of Inktomi -- in a PODC keynote talk, governs the trade-off. The CAP theorem states that a system can achieve only two out of three desirable properties: Consistency, Availability and Partition tolerance. Since distributed systems, by definition, use a cluster of machines, and since the network connecting them could fail, these systems have to tolerate network partitions. So, in reality, the trade-off is usually between Availability and Consistency.
However, giving up consistency in favor of availability is not necessarily good for enterprise applications -- a big target audience of the Amazon Web Services offerings. Most enterprise applications require the data to be correct. It does not matter that the system is available: if the result is wrong, you cannot make progress on your application. As a concrete example, we recently worked on a project at Labs, called Cloud MapReduce, which implements Google's MapReduce programming model on the Amazon cloud. We have seen many manifestations of eventual consistency at work, which created many problems for our implementation. Cloud MapReduce cannot progress correctly when it reads wrong results from the cloud; instead, it has to spend a lot of effort detecting and correcting the wrong results when consistency problems arise.
So, the natural question is: if you are not running an e-commerce business, why not choose an eventually available system over an eventually consistent system? "Eventually available" is a term I coined. It is a system design that guarantees strong consistency at the cost of some availability. The downside of such a system is that it may be unavailable for a brief period of time, but it guarantees that the system will eventually be available.
Why choose eventually available? The key reason is that it is much easier to deal with than an eventually consistent system. In Cloud MapReduce, we had to invent all sorts of techniques to detect and get around consistency problems, none of which were easy. In an eventually available system, however, it is very easy both to detect that the system is unavailable (check the error code) and to get around it (just retry).
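The "check the error code, then retry" pattern can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `ServiceUnavailable`, `call_with_retry` and `FlakyStore` are hypothetical names invented for this example, not part of any Amazon API, and the exponential backoff is one common choice among several.

```python
import time


class ServiceUnavailable(Exception):
    """Hypothetical error a strongly consistent service returns while it is down."""


def call_with_retry(operation, max_attempts=5, base_delay=0.1, sleep=time.sleep):
    """Run operation(), retrying with exponential backoff while it is unavailable."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except ServiceUnavailable:
            if attempt == max_attempts - 1:
                raise  # still unavailable after the last attempt: give up
            sleep(base_delay * (2 ** attempt))


class FlakyStore:
    """Toy store: unavailable for the first few calls, then always correct."""

    def __init__(self, failures_before_success):
        self.remaining_failures = failures_before_success
        self.calls = 0

    def read(self):
        self.calls += 1
        if self.remaining_failures > 0:
            self.remaining_failures -= 1
            raise ServiceUnavailable("try again later")
        return "committed-value"


if __name__ == "__main__":
    store = FlakyStore(failures_before_success=2)
    print(call_with_retry(store.read))
```

The caller never has to guess whether a value it read was stale: a call either fails loudly or returns the committed value.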
Fortunately, the Amazon cloud is moving towards offering eventually available (even though they do not call it that yet) as an option, to be friendlier to enterprise applications. A couple of weeks ago, Amazon announced the Consistent Read and Conditional Put & Delete features for SimpleDB (see Werner's and Jeff Barr's posts). Both of these new features guarantee strong consistency. We have done an extensive performance study on the cost of eventual consistency, and surprisingly, our study has found that strong consistency introduces no additional overhead in terms of either latency or throughput. In addition, during the 3 days of testing, we were not able to observe any system unavailability. Maybe eventually available is even good enough for an e-commerce application?
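To see why a consistent read paired with a conditional put is enough to build correct updates, here is a minimal in-memory sketch. It does not use the SimpleDB API: `ToyStore`, `consistent_read`, `conditional_put` and `ConditionFailed` are hypothetical names that only mimic the semantics described above (a read that returns the latest write, and a write that succeeds only if the stored value still matches what the caller expects).

```python
import threading


class ConditionFailed(Exception):
    """Hypothetical error: the stored value did not match the expected one."""


class ToyStore:
    """In-memory stand-in for a strongly consistent key-value service."""

    def __init__(self):
        self._data = {}
        self._lock = threading.Lock()

    def consistent_read(self, key):
        # Always returns the most recent successful write (None if absent).
        with self._lock:
            return self._data.get(key)

    def conditional_put(self, key, new_value, expected):
        # Write only if the current value equals `expected`.
        with self._lock:
            if self._data.get(key) != expected:
                raise ConditionFailed(key)
            self._data[key] = new_value


def increment(store, key):
    """Optimistic-concurrency counter: read, then write only if unchanged."""
    while True:
        current = store.consistent_read(key)
        new_value = (current or 0) + 1
        try:
            store.conditional_put(key, new_value, expected=current)
            return new_value
        except ConditionFailed:
            continue  # another writer won the race; re-read and retry
```

With only eventually consistent reads, the `increment` loop could read a stale counter and silently lose updates; with the two primitives above, a lost race surfaces as an explicit failure that is simply retried.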
In the year 2015, you can buy a cloud Operating System (OS) from SkySoft (a fictitious business), install it on your cluster of PCs and have a private cloud up and running in minutes. Are you wondering what is included in that software box pictured above? To imagine what it would include, it helps to draw an analogy with a server OS.
A traditional OS serves two purposes. First, it hides the hardware details from the applications to simplify application design. For example, an application programmer sees a file system, rather than a hard disk interface and its control registers. Second, it provides a set of high level services to applications, again to simplify an application programmer's job. For example, an application programmer makes socket calls to connect to remote machines.
A cloud OS is similar to a server OS. First, it hides the hardware details. You do not have to know that a cloud is running on many clusters of hundreds of thousands of machines. Second, it provides a set of high-level services to application programmers. For example, Amazon provides a queue service called SQS, which is used for communication between applications. Since I have written about the definition of a cloud OS before, I will not bother you with more details here.
Today, if you want to buy something similar to the box above, you can buy VMware vCloud (essentially the vSphere and vCenter products). Alternatively, you can use open source software, such as Eucalyptus. Keep in mind that these packages are still in the early stages of development, so many features are missing. Even if you can live with today's feature set, they are still limited to the compute service, i.e., they only provide Virtual Machines (VMs) on demand. So if you are looking for scalable storage in the cloud, or for services that help your VMs communicate, you are out of luck.
Although the choices are limited today, I believe you will see a full-featured cloud OS in the near future, which will consist of at least the following services:
- Compute service: A VM on demand service that is scalable beyond a small cluster of tens of machines. The service should also come with a full metering and charge-back mechanism to enable pay-per-use.
- Storage service: A scalable, fully distributed data storage service. I believe it will most likely be a blob storage service, because that is much easier to scale than a block device abstraction. There are already various open source projects working on highly scalable key-value pair storage systems. I will cover them in detail in a later post.
- Queue service: A queue interface is a simple message passing abstraction critical for machine-to-machine communication. Unlike MPI, it decouples the sender and receiver, which is important for cloud VMs due to their ephemeral nature.
- Structured data storage service: If you need to query or index your data, blob storage is not a good fit. Instead, we need some form of semi-structured data storage that enables more efficient querying.
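The decoupling mentioned under the queue service can be illustrated with a small in-process sketch. It uses Python's standard `queue.Queue` as a stand-in for a cloud queue such as SQS (it is not the SQS API), and the function names are made up for this example. The point is that the sender finishes and disappears before the receiver even starts, yet no message is lost.

```python
import queue
import threading


def producer(q, items):
    """Enqueue every item, then a None sentinel marking the end of the stream."""
    for item in items:
        q.put(item)
    q.put(None)


def consumer(q, results):
    """Drain the queue until the sentinel arrives, squaring each item."""
    while True:
        item = q.get()
        if item is None:
            break
        results.append(item * item)


def run_demo(items):
    q = queue.Queue()
    results = []

    sender = threading.Thread(target=producer, args=(q, items))
    sender.start()
    sender.join()  # the sender is gone before any receiver exists

    receiver = threading.Thread(target=consumer, args=(q, results))
    receiver.start()
    receiver.join()
    return results


if __name__ == "__main__":
    print(run_demo([1, 2, 3]))
```

With a direct, MPI-style send, both endpoints would need to be alive at the same time; here the queue holds the messages, so a short-lived VM can hand off its work and terminate.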
Other scalable cloud services are likely to emerge as well, although it is unclear what they will be. This is where the server OS analogy is most helpful. We can look at the services that are most needed in a server OS and project whether they should be implemented in a cloud OS.
Given the rapidly evolving ecosystem, a full-featured cloud OS may come sooner than 2015. Who knows, maybe SkySoft does exist and it may even be a spin-off from Accenture. One can only dream.