If you try to show a UIActionSheet from a child of a UITabBarController, your "cancel" button will only partially work. This is due to the tab bar's implementation. You can easily avoid it by giving the window to the action sheet's showInView: method:
[actionSheet showInView:[self.view window]];
Thursday, January 27, 2011
Sunday, January 23, 2011
Lessons from Giant-Scale Services by Eric A. Brewer
This is an "experience" paper from UC Berkeley Professor Eric A. Brewer on giant-scale web services. Nearly 10 years have passed since it was published (2001). Its main contribution is a metric called DQ that addresses the challenges of giant-scale web services such as high availability, evolution, and growth. The paper consists of three parts: the "Basic Model" of giant-scale services, the "High Availability" of these services, and finally their "Online Evolution and Growth".
The paper focuses on "Internet-based systems," and the discussion is limited to single-site, single-owner, well-connected clusters, which may be part of a larger service. Most service issues related to network partitioning, multiple administrative domains, etc. are not covered; the paper specifically focuses on the basic building block of giant-scale web services. This section starts with a set of advantages of this basic block: "access anywhere, anytime"; "availability via multiple devices," including smartphones, tablets, etc.; "groupware support," meaning the possibility of exploiting group-based applications; "lower overall cost" in the sense of utilization; and finally "simplified service updates".
Basic Model
After introducing the advantages of giant-scale services, the paper explains the components of the system: clients, load manager, servers, persistent data store, and backplane. The basic duty of the load manager is to balance load and hide faults from the external world. The original load-management approach is round-robin DNS, in which load is distributed among different nodes in round-robin fashion. Its main disadvantage is that it doesn't hide inactive servers. However, as the author explains, most services now include "layer-4" switches. These switches understand TCP and port numbers and can decide whether a node is down. The author also examines two other load-management approaches: the first uses custom "front-end" nodes that act as service-specific layer-7 routers by tracking session information; the other is to use smart clients.
High Availability
The backbone and main aim of giant-scale web services is "high availability," very close to 100% of the time. I can't imagine Facebook being down for an hour a week; that would be a disaster for them, since users do not like services that disappear for a while, and the economic impact would be huge. Hence high availability is the major requirement of giant-scale services. To evaluate such systems we need metrics, and the traditional metric for availability is uptime, which can be defined as follows:
uptime = (MTBF-MTTR)/MTBF
Hence, uptime is the fraction of time the service is up. Although this is the traditional approach, one can easily see that it is not a good availability metric: the service may be down at a time when no one is using it, in which case there is no real impact. On the other hand, if the downtime coincides with the system's peak usage, it can cause a disaster. Therefore the author suggests two more metrics:
yield = queries completed/queries offered
harvest = data available/complete data
These two metrics capture availability in a more meaningful way; a perfect system would have 100% yield and 100% harvest all the time. However, this is unrealistic given current technology and the demand on such giant-scale services. At this point the author introduces the DQ principle, which is basically:
data per query x queries per second -> constant
The intuition comes from the fact that a system's overall capacity tends to have a particular physical bottleneck, such as total I/O bandwidth. This is a valid assumption, and giant-scale web services are mostly network-bound rather than disk-bound. DQ is the main contribution of this paper, and, as the formula states, it is measurable and tunable. It scales roughly linearly: up as new nodes are added, down as nodes fail. Hence DQ is very valuable for predicting future traffic and for planning hardware and software improvements. One important caveat is that these measurements are for data-intensive sites, so it is not suitable to apply these principles to computation-bound sites, for which yield and harvest would probably be defined differently.
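To make the metrics concrete, here is a minimal sketch (with hypothetical numbers of my own, not from the paper) of uptime, yield, harvest, and DQ:

```python
# The paper's availability metrics as tiny functions. All inputs are
# illustrative; the paper gives definitions, not these numbers.

def uptime(mtbf, mttr):
    """Fraction of time the service is up: (MTBF - MTTR) / MTBF."""
    return (mtbf - mttr) / mtbf

def query_yield(completed, offered):
    """yield = queries completed / queries offered."""
    return completed / offered

def harvest(available, total):
    """harvest = data available / complete data."""
    return available / total

def dq(data_per_query, queries_per_second):
    """DQ = data per query x queries per second (bounded by a constant)."""
    return data_per_query * queries_per_second

# A service that fails once a month (720 hours) and takes 1 hour to repair:
print(uptime(720, 1))  # roughly 0.9986

# Under the DQ principle the product stays constant: if the data moved per
# query doubles, the sustainable query rate halves.
DQ_LIMIT = 1_000_000   # hypothetical capacity, e.g. bytes/second
assert dq(1000, 1000) == DQ_LIMIT
assert dq(2000, 500) == DQ_LIMIT
```

The point of writing them out is that yield and harvest are directly measurable per time window, which is what makes DQ usable for capacity planning.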
Replication vs. Partitioning
Replication is a traditional technique for increasing availability, and this part of the paper compares replication and partitioning under faults with respect to DQ, yield, and harvest. The example in the paper is a two-node cluster in which one node is down due to a fault. For the replicated system, data availability is unchanged because the data is replicated, so harvest is unaffected; however, yield drops by 50%, because all queries are now directed to one node instead of two. For the partitioned system, half of the data is now unavailable, so harvest drops by 50%, while yield is unaffected. As a result, the DQ change is the same in both cases: down by 50%. Note that the real bottleneck is the DQ value, not the replicated data; even with replication, under a fault the surviving nodes carry a higher load than before, which affects the system. Assuming there is enough excess capacity to redirect queries (the load redirection problem) is not realistic under the heavy load of giant-scale services.
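The two-node example generalizes to n nodes; a small sketch (function names mine, not the paper's) of losing nodes under each scheme:

```python
# Losing n_failed of n_nodes: replication preserves harvest and loses
# yield; partitioning preserves yield and loses harvest. Either way the
# available DQ (harvest x yield here) shrinks by the same fraction.

def replicated_loss(n_nodes, n_failed):
    """Full replication: every surviving node holds all the data."""
    alive = n_nodes - n_failed
    harvest = 1.0
    yld = alive / n_nodes   # survivors can only absorb their share of DQ
    return harvest, yld

def partitioned_loss(n_nodes, n_failed):
    """Partitioning: lost partitions are simply missing from results."""
    alive = n_nodes - n_failed
    harvest = alive / n_nodes
    yld = 1.0               # remaining queries still complete
    return harvest, yld

# The paper's two-node, one-fault case:
print(replicated_loss(2, 1))   # (1.0, 0.5)
print(partitioned_loss(2, 1))  # (0.5, 1.0)
```

In both cases the product drops to 0.5, which is exactly the paper's point: replication does not buy you DQ, it only lets you choose where to take the loss.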
Another alternative presented in the paper is to replicate only key data, so that if the main node fails you can use the replica. The last approach presented is random partitioning, in which data is partitioned using a hash function. In this way, the worst and best cases are smoothed out and we obtain average-case losses.
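Random partitioning is easy to sketch; the following (an illustration of the general technique, not code from the paper) shows how a hash spreads keys evenly, so losing any one node loses roughly an average-case 1/n slice of the data rather than a worst-case hot partition:

```python
# Map each key to a node via a stable hash. With a decent hash, all
# partitions end up roughly the same size.
import hashlib

def partition_for(key, n_nodes):
    """Stable key -> node mapping (illustrative; md5 chosen arbitrarily)."""
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % n_nodes

counts = [0] * 4
for i in range(10_000):
    counts[partition_for(f"user-{i}", 4)] += 1

# Each of the 4 partitions holds roughly 2500 of the 10,000 keys.
print(counts)
```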
Graceful Degradation
Graceful degradation is the process of effectively managing saturation by controlling yield, harvest, and DQ. It can be achieved either through admission control (AC), which reduces Q, or through dynamic database reduction, which reduces D. The paper also describes more sophisticated techniques for graceful degradation, such as cost-based AC, which admits queries based on their DQ cost: at the cost of denying service to expensive queries, we can serve more cheap ones, which increases Q. Another example is priority- or value-based AC, where requests are treated differently according to their priority. Finally, reducing data freshness by increasing expiration times increases yield but reduces harvest (due to stale cached data).
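Cost-based AC can be sketched in a few lines; this is my own toy illustration (query IDs, costs, and the greedy cheapest-first policy are assumptions, not the paper's algorithm):

```python
# Given a DQ budget during saturation, admit cheap queries first so more
# queries complete (higher yield) at the expense of the most expensive ones.

def admit(queries, dq_budget):
    """queries: list of (query_id, dq_cost). Returns ids admitted in budget."""
    admitted, spent = [], 0
    for qid, cost in sorted(queries, key=lambda q: q[1]):  # cheapest first
        if spent + cost <= dq_budget:
            admitted.append(qid)
            spent += cost
    return admitted

offered = [("a", 50), ("b", 10), ("c", 10), ("d", 40), ("e", 10)]
done = admit(offered, dq_budget=70)
print(len(done) / len(offered))  # yield: 4 of 5 queries complete
```

Without cost awareness, admitting in arrival order would spend 50 of the 70 budget on query "a" alone and complete fewer requests overall.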
Online Evolution and Growth
Updates are an inevitable fact of giant-scale web services. Although the traditional approach dictates minimal changes to a running system, giant-scale services regularly need changes for upgrades, maintenance, etc. The paper states three main approaches for online evolution:
Fast Reboot: simply reboot all nodes into the new version at once. This guarantees some downtime, and yield suffers accordingly. One can reduce the impact by rebooting at a convenient time when few people are using the system.
Rolling Upgrade: maybe the most convenient. Nodes are updated one by one in a rolling fashion. Assuming there is enough capacity, this causes no reduction in yield or, if the data is replicated, in harvest.
Big Flip: the most complicated. The first half of the nodes is updated while the layer-4 switches direct traffic to the other half; then the halves swap and the second half is updated. In this scenario we have a 50% reduction in DQ (see the two-node cluster example above).
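The capacity trade-off among the three strategies can be summarized in a toy function (a simplification of mine; it ignores the duration of each window, which also differs between strategies):

```python
# Fraction of full DQ capacity available *during* each upgrade strategy,
# for a cluster of n nodes.

def capacity_during(strategy, n):
    if strategy == "fast_reboot":
        return 0.0           # all nodes reboot at once: total outage
    if strategy == "rolling_upgrade":
        return (n - 1) / n   # only one node is out at any moment
    if strategy == "big_flip":
        return 0.5           # half the cluster is out at a time
    raise ValueError(strategy)

for s in ("fast_reboot", "rolling_upgrade", "big_flip"):
    print(s, capacity_during(s, 10))
```

The rolling upgrade keeps the most capacity online, which matches the paper's observation that the total DQ loss is what you should budget for, however you choose to spread it out.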
In conclusion, here are verbatim copies of the summary points from the paper:
- Get the basics right. Start with a professional data center and layer-7 switches, and use symmetry to simplify analysis and management.
- Decide on your availability metrics. Everyone should agree on the goals and how to measure them daily. Remember that harvest and yield are more useful than just uptime.
- Focus on MTTR at least as much as MTBF. Repair time is easier to affect for an evolving system and has just as much impact.
- Understand load redirection during faults. Data replication is insufficient for preserving uptime under faults; you also need excess DQ.
- Graceful degradation is a critical part of a high-availability strategy. Intelligent admission control and dynamic database reduction are the key tools for implementing the strategy.
- Use DQ analysis on all upgrades. Evaluate all proposed upgrades ahead of time, and do capacity planning.
- Automate upgrades as much as possible. Develop a mostly automatic upgrade method, such as rolling upgrades. Using a staging area will reduce downtime, but be sure to have a fast, simple way to revert to the old version.
- Is the assumption that queries outnumber writes or updates valid for a web site like youtube.com or an application like Picasa by Google? In other words, is it safe to consider only the query side when designing the giant-scale infrastructure for such sites?
Yes, it is valid, but these kinds of sites probably have heavier write/update traffic than the sites mentioned in the paper, so the paper's discussion may not hold for them. For example, considering the DQ value for write/update-heavy sites, replicated sites will have a higher DQ value than partitioned ones.
- If uploading or updates comprise a significant part of some giant-scale application or web site, are the metrics yield, harvest, and DQ enough to capture the design issues, or do additional metrics need to be introduced?
- DQ Principle (p. 6): How do we conclude here that scaling is linear for any given system?
- Rolling Upgrades (p. 9): How are restart delays related to interdependent services accounted for? This should lead to more downtime in a rolling upgrade compared to an ideal upgrade.
- Have any new technologies, techniques, or approaches been developed since this article was written to increase MTBF in a reasonable amount of time? Note that the author describes uptime as (MTBF-MTTR)/MTBF and claims it is easier to reduce the time it takes to fix failures than to reduce the frequency of failures; it appears more effort goes into reducing MTTR than into increasing MTBF (and rightly so). Also, we know that data replication is insufficient for preserving uptime under faults, since this technique reduces yield in terms of availability. Have advancements been made that use data replication to preserve uptime with little effect on yield?
- How does CAP theorem relate to harvest & yield?
- How does a node failure affect DQ limit? How does replication affect DQ?
- Even with replication, can't we reduce harvest, and keep the same yield?
- The DQ principle is the key factor in this paper, but as I understand it, DQ is only relevant to the hardware side. For example: "behind this principle is that the system's overall capacity tends to have a particular physical bottleneck. The DQ value is the total amount of data that has to be moved per second on average, and it is thus bounded by the underlying physical limitation." In the paper's basic model, giant-scale services include a load manager and servers, which obviously involve software. So I don't understand why the whole paper treats the DQ principle as the most significant factor, given that replication vs. partitioning, graceful degradation, disaster tolerance, and online evolution and growth are all evaluated by DQ. Software such as the load manager seems to contribute nothing to the performance of giant-scale services under this metric.
- The paper mentions: "the small test cluster is a good predictor for DQ changes on the production system since DQ normally scales linearly with the number of nodes; it is easy to measure the DQ impact of faults given a metric and a load generator." How can we deduce this conclusion?
Tuesday, January 18, 2011
How to add custom UITableViewCell
Well, my first impression of iPhone development is that there is not as much documentation available for common problems as we have for Java or C. Therefore, I decided to write up some of the basic things I was looking for while developing an app.
This is something you need in almost every iPhone application you build: in one way or another, you end up writing custom table view cell classes. For example, if you want a table view with an image on the left and text on the right, you need to subclass UITableViewCell. The good news is that it is not that difficult. The custom cell is returned from the table view data source method:
- (UITableViewCell *)tableView:(UITableView *)tableView cellForRowAtIndexPath:(NSIndexPath *)indexPath {
Here are the steps:
- First, create a new View XIB file in Interface Builder and name it X.
- Then remove the default view and add a UITableViewCell to your UI.
- Now click on your UITableViewCell element and change its class to X in Interface Builder.
- Now double-click your custom table view cell and add labels, image views, etc. as you need.
- Now wire up all the elements by defining them in your class and connecting them as outlets in Interface Builder.
- You are ready to go. In your table view, you just need to replace the standard UITableViewCell object with your custom table view cell X using the following code:
static NSString *CellIdentifier = @"X";
X *cell = (X *)[tableView dequeueReusableCellWithIdentifier:CellIdentifier];
if (cell == nil) {
    // No reusable cell available: load one from the X.xib created above
    NSArray *topLevelObjects = [[NSBundle mainBundle] loadNibNamed:@"X" owner:self options:nil];
    for (id currentObject in topLevelObjects) {
        if ([currentObject isKindOfClass:[X class]]) {
            cell = (X *)currentObject;
            break;
        }
    }
}
// configure your cell here
return cell;
Monday, January 17, 2011
Crash-Only Software
This paper is from USENIX HotOS IX, 2003. It explores the possibility of designing and implementing crash-only software, which can be defined as software that can safely crash and recover. Since it crashes safely, there is no need to ever shut it down. The paper explores this in the domain of Internet applications.
Crash-only design is depicted as a generalization of the transactional model we already have in database systems. The application is divided into crash-only components, and similar components can be grouped into bigger crash-only components (recursively).
All important non-volatile state is managed by dedicated state stores, and these must themselves be crash-safe; otherwise the system would crash unsafely just one step later. For this purpose, crash-safe state stores such as databases are chosen. The paper also describes how inter-component communication is done, exemplified by a timeout mechanism, and finally explains the restart/recovery mechanism.
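The timeout idea can be sketched as follows; this is my own toy illustration of the general pattern (the class and function names, and the retry policy, are assumptions, not the paper's design):

```python
# Crash-only calling convention: if a component gives no reply, the caller
# assumes it crashed and retries after it has been "recovered" (restarted).

class CrashOnlyComponent:
    """A component whose only stop is a crash and whose only start is recovery."""
    def __init__(self):
        self.alive = True
    def crash(self):
        self.alive = False    # abrupt stop: there is no clean-shutdown path
    def recover(self):
        self.alive = True     # recovery is the same code path as startup
    def handle(self, request):
        if not self.alive:
            return None       # models a dropped request (no response)
        return f"ok:{request}"

def call_with_retry(component, request, retries=3):
    """Treat a missing reply as a crash; retry after recovery."""
    for _ in range(retries):
        reply = component.handle(request)
        if reply is not None:
            return reply
        component.recover()   # stand-in for the platform restarting it
    raise TimeoutError("component did not recover in time")

c = CrashOnlyComponent()
c.crash()
print(call_with_retry(c, "ping"))  # recovers once, then answers "ok:ping"
```

The point is that callers never distinguish "crashed" from "slow": both are handled by the same timeout-and-retry path, which is what makes recovery the normal case rather than the exceptional one.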
References
Hints for Computer System Design by Butler W. Lampson
This is a paper from 1983, published at ACM SOSP '83. It can be regarded as a blueprint paper for computer system design, with lots of hints and proposals on how to build and improve the design of typical computer systems. Although the author suggests taking it "in small doses at bedtime," I suspect it should be read at once and kept as a reference publication by anyone who wants to design a computer system.
The paper rests on three important features: functionality, speed, and fault tolerance. Each feature is examined further under three topics: completeness, interface, and implementation. The sections are decorated with beautiful quotes from different authors, mostly from computer science. Here is one: "Algol 60 was not only an improvement on its predecessors but also on nearly all its successors" (C. Hoare), who is, I suppose, the inventor of Quicksort.
There are many points in the paper worth stating, but since it is verbose and needs no explanation in most parts, I suggest reading the paper from the link I provided below. Still, some points should be noted: for example, a designer should keep in mind that neither abstraction nor simplicity is a substitute for "getting it right" (functionality). Another important point is the constant tension between the desire to improve a design and the need for stability. I have often encountered this tension during my own designs, and it can also be considered together with fault tolerance. One should keep the basic, essential parts stable while constantly trying to improve the other parts.
References: