Thursday, October 1, 2015

Paper Summary: What Bugs Live in the Cloud? A Study of 3000+ Issues in Cloud Systems

This paper is from SoCC 2014 (the ACM Symposium on Cloud Computing), for which IBM was the platinum sponsor that year. It's a relatively easy read. In a nutshell, the authors explore the issues reported in the repositories of popular open-source cloud systems such as Hadoop MapReduce, HDFS, HBase, Cassandra, ZooKeeper, and Flume. They reviewed around 22K issues, but the paper focuses on only 3655 of them, the ones they consider 'vital' issues -- those 'which affect deployment'. It's a relatively long paper (14 pages), so I'll only cover their findings in this blog post. Feel free to refer to the original paper for more in-depth information.

The paper classifies the bugs along several dimensions. By aspect, the classes are reliability, performance, availability, security, consistency, scalability, topology, and QoS. By scope, a bug can affect a single machine, multiple machines, or the whole cluster. By software bug type, the classes are logic, error handling, optimization, configuration, race, hang, space, and load.

When the bugs are classified by aspect, it turns out that reliability (45%), performance (22%), and availability (16%) are the dominant categories. However, data consistency (5%), scalability (2%), and topology (1%) bugs are also becoming more prevalent as a result of cloud architectures.

Although people strive to design systems with no single point of failure (no-SPoF), the paper identifies 139 bugs, which the authors call 'killer bugs', where cascades of failures happen in subtle ways, indicating that the no-SPoF principle may not always hold.

Another important finding of the paper is that hardware failures (13% of the issues) may not be easy to handle. Although the cloud community has long preached handling hardware failures in software, as the authors point out, hardware can fail in many ways -- 'stop', 'corrupt', or 'limp' -- at any time, and worse, the recovery itself can hit another failure.

Another interesting, though not surprising, finding from the paper is the 'availability first, correctness second' paradigm. The authors found cases where data inconsistency and corruption were reported but ignored, which they attribute to availability and uptime being valued more than data consistency. Cloud systems are also generally evaluated on their uptime and reliability, which are easy to quantify, and so availability ends up mattering more than consistency.

Wednesday, April 15, 2015

How To Access Device's Location Through Browser

In this blog post, I'll explore different mechanisms for accessing a device's location through the browser. By device, I mean laptops, PCs, smartphones, and tablets. I'll also present the tradeoffs of the different mechanisms.

First, the HTML5 Geolocation API. This is the standard way of accessing the location, and almost every browser supports it. Moreover, the returned location is more accurate when the device has GPS capability, such as a smartphone. To access the location, use the navigator.geolocation.getCurrentPosition(showPosition) method, passing a callback (showPosition here) as a parameter. Once the location is known, the callback is invoked with the position. Surprisingly, this method is pretty successful in determining the device's location. Apparently, the high accuracy is due to the various signals the API uses, such as the IP address, access point BSSID, and cell tower ID (if available). Once such information is collected from the device, it's fed into a location service such as Google, Windows, or Apple Location Services. Since these services have already mapped many access points to locations, they usually return a good estimate of the device's position. There are also watchPosition() and clearWatch() methods, which enable continuous updates of the device's location. Finally, the only drawback of this method is that it asks the user for permission, so you might lose some privacy-concerned people if you use this API.
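A minimal sketch of the call looks like the following. The showPosition callback name comes from the text above; the exact fields it extracts are my choice for illustration, and the browser call is guarded so the snippet is also safe to load outside a browser:

```javascript
// Hypothetical callback: pull out the fields we care about from the
// Position object the Geolocation API hands us.
function showPosition(position) {
  return {
    lat: position.coords.latitude,
    lon: position.coords.longitude,
    accuracy: position.coords.accuracy // estimated error radius, in meters
  };
}

// In a browser, the call looks like this (the guard keeps the snippet
// from throwing when there is no navigator object):
if (typeof navigator !== "undefined" && navigator.geolocation) {
  navigator.geolocation.getCurrentPosition(
    function (pos) { console.log(showPosition(pos)); },
    function (err) { console.log("Location unavailable or denied: " + err.message); }
  );
}
```

The error callback is optional but worth wiring up, since the user can always deny the permission prompt.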


The second method of learning the location is to use the device's IP address together with a mapping service. Although it sounds simple, quite a bit of machinery is needed to learn the device's location through its IP address. First, you need a server-side script to learn the device's IP address, since it's not possible to learn the external IP address from client-side code alone. The common way is to have a server-side script that returns the IP address of an incoming request; there are also third-party services that can do this for you. Second, once you have the IP address, you need to map it to a geolocation, which requires yet another third-party service. Such services usually resolve an IP correctly at the state level, and if you're lucky you may get the city right too. To me, that's a lot of dependencies for such basic functionality, and the result will likely not be as accurate as you'd want. The one upside of this approach is that the browser never shows a permission prompt. If you're curious, you can learn more about Internet geolocation from this paper.
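The server-side piece can be sketched as follows, assuming an Express-style request object (the clientIp helper is my own illustrative name). It prefers the X-Forwarded-For header, which proxies and load balancers set, and falls back to the socket's remote address:

```javascript
// Hypothetical helper: extract the client's IP address from an incoming
// HTTP request (Express/Node-style request shape assumed).
function clientIp(req) {
  var fwd = req.headers && req.headers["x-forwarded-for"];
  if (fwd) return fwd.split(",")[0].trim(); // first entry is the original client
  return req.socket ? req.socket.remoteAddress : undefined;
}
```

The resulting IP address is what you would then hand to a third-party IP-to-geolocation service in the second step.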

The third option is not really an option, but it's something I considered: use the BSSID of the access point the device is connected to. However, it appears there is no way to learn information about the AP through client-side scripting, so unfortunately this is not viable.

Friday, April 10, 2015

Real Time Analysis of Images Posted on Twitter Using Bluemix

In this blog post, I'll share my experience developing a real-time image analysis application using Bluemix. The best thing is, you don't need any understanding of computer vision or image processing to develop such an application on Bluemix (thumbs up!).

Let's start! Text analysis on Twitter is a widely known and widely performed activity nowadays. Sentiment analysis and event detection are just a few of the things that researchers and companies do. However, analysis of images posted on Twitter has not been explored much in the past (although it's starting to pick up recently...). In this post, I'll show you how to build such an application with about an hour's worth of development on IBM's Bluemix platform. I assume you have some familiarity with Bluemix before reading further. If not, you can read my introductory Bluemix blog post.

Please note that although Bluemix provides a Twitter Insight service in the catalog, it's still in beta and therefore has limited capabilities. Hence, I opted out of the Bluemix Twitter service and used the Twitter API directly. Stay tuned for future updates to the Bluemix service, though.

Twitter provides a nice, well-documented API on its website. For my purposes, I need real-time tweets about a topic (such as 'NYC' or 'Obama'), so the Public Streaming API looks like the way to go. In addition, I'd like to get only the posts that contain images; however, the current API doesn't support that kind of query. As a result, I request all tweets matching a specific query term and process only the ones that contain images.

To use the Twitter API, first create a Twitter application on the Twitter website, obtain credentials for accessing the API, and note your consumer_key, consumer_secret_key, access_token and access_token_secret. Then choose a Twitter library that can make calls to Twitter's streaming API. Since I plan to build a Node.js app, I use this one, but feel free to use another library. Once that's set up, go to the Bluemix console and sign in.
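As a rough sketch of the streaming-plus-filtering approach, here is one way it could look using the `twit` npm package (one of several Twitter client libraries; not necessarily the one linked above). The hasPhoto helper is my own illustrative name, and it checks the entities.media field that the Twitter API attaches to tweets with attached photos; the streaming calls are shown as comments since they need live credentials:

```javascript
// Hypothetical helper: keep only tweets carrying at least one photo,
// based on the entities.media field of Twitter API tweet objects.
function hasPhoto(tweet) {
  var media = (tweet.entities && tweet.entities.media) || [];
  return media.some(function (m) { return m.type === "photo"; });
}

// Streaming sketch, assuming the `twit` npm package and the four
// credentials noted above:
//
//   var Twit = require("twit");
//   var T = new Twit({
//     consumer_key: "...", consumer_secret: "...",
//     access_token: "...", access_token_secret: "..."
//   });
//   var stream = T.stream("statuses/filter", { track: "NYC" });
//   stream.on("tweet", function (tweet) {
//     if (hasPhoto(tweet)) {
//       // hand the image URL(s) off to the image analysis service
//     }
//   });
```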

On the dashboard click '+' for creating an application. Choose a web application.
Next, select a runtime; I chose Node.js, but feel free to choose another one if you like. Once that's done, name your application.
Your app should then be created and deployed on Cloud Foundry within a few seconds. Click on it; you should now be given the option to download the starter code. Please do so.
Next, go back to your application's Overview page (left menu) and click on 'Add a Service'. Select the Watson category and choose 'Visual Recognition'.
Once you name the service and bind it to the application you created, you'll see the service on your application's Overview page. Click on "Show Credentials" and note the username and password, which you'll need to make calls to the service. Please see this GitHub page to see how to make calls to the service.
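The same credentials can also be read programmatically: on Cloud Foundry (and hence Bluemix), bound service credentials are exposed to the app as JSON in the VCAP_SERVICES environment variable. Here is a minimal sketch; the getCredentials helper and the "visual_recognition" key name are my own illustrative assumptions, so check the actual key in your own VCAP_SERVICES:

```javascript
// Sketch: dig the first bound instance's credentials for a given service
// name out of the VCAP_SERVICES JSON string.
function getCredentials(vcapJson, serviceName) {
  var vcap = JSON.parse(vcapJson || "{}");
  var instances = vcap[serviceName];
  return instances && instances.length ? instances[0].credentials : undefined;
}

// Typical use inside the app (key name "visual_recognition" is assumed):
//   var creds = getCredentials(process.env.VCAP_SERVICES, "visual_recognition");
//   // creds.username / creds.password then authenticate calls to the service
```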
Your environment is now set up, and you can start coding. I won't go into the details of the code, but feel free to look at my quick implementation on GitHub using Express, Passport, and Jade. The result of the app can be seen on this website. It basically shows a pie chart of the labels of all the images posted on Twitter for NYC over the last 2 hours. I group a label into the "Others" category if it accounts for less than 3% of all images. As can be seen from the chart, "Others" takes the majority, due to the long-tail distribution of images posted on Twitter.
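The "Others" bucketing described above can be sketched as a small helper (bucketLabels is my own illustrative name): labels whose share of all images falls below a threshold (3% in the post) are folded into a single "Others" slice.

```javascript
// Fold low-frequency labels into an "Others" slice for the pie chart.
// counts: { label: imageCount }, threshold: fraction of total (e.g. 0.03).
function bucketLabels(counts, threshold) {
  var labels = Object.keys(counts);
  var total = labels.reduce(function (sum, l) { return sum + counts[l]; }, 0);
  var out = {};
  labels.forEach(function (label) {
    if (counts[label] / total < threshold) {
      out["Others"] = (out["Others"] || 0) + counts[label];
    } else {
      out[label] = counts[label];
    }
  });
  return out;
}
```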


Bluemix: Platform as a Service (PaaS) Offering of IBM

I believe by now most of you have already heard about Bluemix, IBM's Platform as a Service (PaaS) solution for cloud computing. It's based on the open-source Cloud Foundry platform. Bluemix adds value on top of Cloud Foundry by providing IBM-specific services such as DB2 and the Watson services.

Bluemix lowers the barrier to app development. As a developer, you no longer need to maintain or worry about infrastructure. Plus, you have the option to select from the several different languages and services that are currently available on Bluemix.
Bluemix is structured around runtimes and services. Runtimes are the environments your code runs in. There are currently several runtimes available on Bluemix, such as Java, Node.js, Go, PHP, Python, and Ruby. You can even bring your own buildpack and have your own runtime.
Bluemix is also rich in services. There are four levels of support for the services: IBM services, which are developed, maintained, and supported by IBM; third-party services, which are maintained by a third party such as Twilio; community services, which are maintained by the Cloud Foundry community; and, finally, experimental and beta services.

Services in Bluemix are categorized into various types, and the variety of services is one of the most compelling reasons to use it. Some services are available only on Bluemix; these are usually IBM-owned, such as the Watson services -- a unique capability derived from IBM Watson. Other services are widely used by developers and therefore also available in Bluemix, such as MySQL or PostgreSQL databases. This post is too short to cover all of the services in Bluemix; please visit the Bluemix website to learn more about them.
Another great feature of Bluemix is Boilerplates, which combine the most frequently used runtimes and services. This way, you no longer need to create your runtime and add services one by one; instead, it's all created for you with just a few clicks. There is a variety of Boilerplates available in Bluemix, such as the Mobile Cloud Boilerplate, which combines a Node.js runtime with the Mobile Application Security, Push, Mobile Data, and Mobile Quality Assurance services. Isn't it nice?
Last but not least, Bluemix has recently evolved beyond just a PaaS. It now includes Docker containers and virtual machines enabled by OpenStack. With these two, Bluemix offers a full cloud development environment where everyone can find a way to fulfill their needs.

Finally, you can sign up for Bluemix with a one-month free trial (no credit card required!). After that, you're charged for the runtimes and services you use. You can find more information on pricing at this link. To learn more about Bluemix, please visit the nice Bluemix documentation or the developer community webpage.