This paper is from SoCC 2014 (Symposium on Cloud Computing), for which IBM was the platinum sponsor that year. It's a relatively easy read. In a nutshell, the authors explore the issues reported in the repositories of popular open source cloud systems such as Hadoop MapReduce, HDFS, HBase, Cassandra, ZooKeeper and Flume. They reviewed around 22K issues, but the paper focuses only on the 3655 issues they consider 'vital', that is, issues 'which affect deployment'. It's a relatively long paper (14 pages), so I'll only cover their findings in this blog post. Feel free to refer to the original paper for more in-depth information.
The paper classifies the bugs along several dimensions. For the aspect, the classes are reliability, performance, availability, security, consistency, scalability, topology and QoS. For the scope of the bug, the classes are single machine, multiple machines and cluster. For software bugs, the classes are logic, error handling, optimization, config, race, hang, space and load.
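To make the taxonomy a bit more concrete, here is a minimal Python sketch of how such labels could be represented and tallied. This is my own illustration, not code from the paper: the class names, the `IssueLabel` record, the `aspect_shares` helper and the `EXAMPLE-n` issue IDs are all hypothetical, and I assume one label per dimension for simplicity, which may not match how the authors tagged issues.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum


class Aspect(Enum):
    RELIABILITY = "reliability"
    PERFORMANCE = "performance"
    AVAILABILITY = "availability"
    SECURITY = "security"
    CONSISTENCY = "consistency"
    SCALABILITY = "scalability"
    TOPOLOGY = "topology"
    QOS = "QoS"


class Scope(Enum):
    SINGLE_MACHINE = "single machine"
    MULTIPLE_MACHINES = "multiple machines"
    CLUSTER = "cluster"


class SoftwareBug(Enum):
    LOGIC = "logic"
    ERROR_HANDLING = "error handling"
    OPTIMIZATION = "optimization"
    CONFIG = "config"
    RACE = "race"
    HANG = "hang"
    SPACE = "space"
    LOAD = "load"


@dataclass(frozen=True)
class IssueLabel:
    """One classified issue (hypothetical record, one label per dimension)."""
    issue_id: str
    aspect: Aspect
    scope: Scope
    bug_type: SoftwareBug


def aspect_shares(labels):
    """Return the fraction of issues per aspect; empty input gives {}."""
    counts = Counter(label.aspect for label in labels)
    total = sum(counts.values())
    return {aspect: count / total for aspect, count in counts.items()}


# Made-up labels purely for illustration; these are NOT data from the paper.
labels = [
    IssueLabel("EXAMPLE-1", Aspect.RELIABILITY, Scope.CLUSTER, SoftwareBug.ERROR_HANDLING),
    IssueLabel("EXAMPLE-2", Aspect.RELIABILITY, Scope.SINGLE_MACHINE, SoftwareBug.LOGIC),
    IssueLabel("EXAMPLE-3", Aspect.PERFORMANCE, Scope.MULTIPLE_MACHINES, SoftwareBug.OPTIMIZATION),
    IssueLabel("EXAMPLE-4", Aspect.AVAILABILITY, Scope.CLUSTER, SoftwareBug.HANG),
]

shares = aspect_shares(labels)
for aspect, share in shares.items():
    print(f"{aspect.value}: {share:.0%}")
```

Using enums rather than free-form strings keeps the label set closed, so a typo in a category name fails loudly instead of silently creating a new class.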
Once the bugs are classified by aspect, it turns out that reliability (45%), performance (22%) and availability (16%) are the dominant categories. However, data consistency (5%), scalability (2%) and topology (1%) issues are also becoming more prevalent as a result of cloud architectures.
Although people strive to design systems with no single point of failure (no-SPoF), according to the paper, 139 bugs, which the authors call `killer bugs`, describe situations where a cascade of failures happens in subtle ways, so the no-SPoF principle may not always hold.
Another important finding of the paper is that hardware failures (13% of the issues) may not be easy to handle. Although the cloud community has long preached handling hardware failures in software, the authors point out that hardware can fail in different ways ('stop', 'corrupt' or 'limp') at any time and, worse, recovery itself can run into another failure.
Another interesting but not surprising finding is the 'availability first, correctness second' paradigm. The authors found cases where inconsistency and corruption in data were reported but ignored, which they think is because availability and uptime are considered more important than consistent data. In addition, cloud systems are generally evaluated by their uptime and reliability, which are easy to quantify, so availability tends to win over consistency.










