Saturday, February 6, 2016

Architecture of Docker



- Docker client
It is the external face of Docker. Users communicate with Docker through the client's command-line interface. The Docker client can run on the same host as the Docker daemon, or on a different host; it connects to the daemon through sockets or a RESTful API.

- Docker daemon
The backbone of Docker containers. It is responsible for building, running and distributing Docker containers. Users communicate with the Docker daemon through the client.

- Docker registry
The Docker registry holds Docker images, which are basically read-only templates for containers. For example, an Ubuntu operating system with Tomcat and your web application can make up a Docker image. A registry can be public or private. Docker Hub (https://hub.docker.com) is a public registry provided by Docker that enables people to share their images with others.

To understand how Docker works, one first needs to understand how a Docker image works. In a nutshell, a Docker image consists of several layers. Union file systems enable the layers to be combined into a single coherent image. As a result of this layering structure, Docker images are lightweight, and an update to an image can be applied to a single layer. Therefore, Docker images can be pulled and pushed far more easily than virtual machine images.

Every Docker image starts with a base operating system image such as Ubuntu or Fedora. A user-provided "Dockerfile" contains instructions that specify which additional layers will be added. Each instruction is an action: running a command, adding a file or directory, setting environment variables, or specifying which process to launch when the container runs. Each instruction adds a layer on top of the base image, and after all instructions have completed, Docker generates the final image.
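As a sketch of how these layers come together, here is a hypothetical Dockerfile for the Ubuntu-plus-Tomcat image mentioned earlier (the image name, package names and paths are illustrative, not from any real project):

```dockerfile
# Base layer: a stock operating system image
FROM ubuntu:14.04

# Each instruction below adds one layer on top of the base
RUN apt-get update && apt-get install -y tomcat7

# Add the application's files as another layer
COPY mywebapp.war /var/lib/tomcat7/webapps/

# Environment variables become part of the image configuration
ENV CATALINA_HOME /usr/share/tomcat7

# The process to launch when a container is started from this image
CMD ["/usr/share/tomcat7/bin/catalina.sh", "run"]
```

Running `docker build -t mywebapp .` in the directory containing this file produces the final image; `docker run -d mywebapp` then starts a container from it.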

Docker images are read-only; containers are built from them. When Docker runs a container from an image, it adds a read-write layer on top of the image, in which the application runs.

Docker takes advantage of several Linux kernel features to work seamlessly. "Namespaces" are used to create the isolated workspace that Docker calls a container. Docker creates several namespaces, including:

  • pid: process isolation
  • net: network isolation
  • ipc: inter-process communication
  • mnt: mount points
  • uts: kernel and version identifiers

Control groups (cgroups) are used to limit the resources (e.g. CPU, memory) that a container may use. Union file systems are utilized to combine multiple layers in a coherent way.

Docker combines all of these components into a wrapper called a container format; the default one is libcontainer. Docker also supports traditional Linux containers (LXC).

Docker provides a couple of tools to optimize and ease the deployment of containers into clusters. These components are:

  • Docker Machine: creates and manages hosts running the Docker daemon.
  • Docker Swarm: native clustering capability, turning several Docker Engines into a single virtual Docker Engine.
  • Docker Compose: defines multi-container applications in a single Compose file.
  • Docker Registry: storage and distribution for Docker images.
  • Docker Engine: builds and runs Docker containers.
  • Docker Kitematic: a GUI for managing Docker engines, images and containers.
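To illustrate the Compose item above, a minimal, hypothetical docker-compose.yml describing a two-container application (a web service linked to a database; the service names and port are made up) might look like:

```yaml
web:
  build: .
  ports:
    - "8080:8080"
  links:
    - db
db:
  image: postgres
```

Running `docker-compose up` would then start both containers together.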

- Additional Notes
  • Users are not namespaced in containers. This means that if you run an application as root inside a container, it has root privileges on the host as well; there is no isolation for users.

- Suggestions for how to run Docker, from the above article
  • running minimal images: containing only the minimum number of services and applications, to reduce the attack surface.
  • using a read-only file system: so that no malicious scripts can be downloaded and written.
  • limiting kernel calls: with SELinux.
  • restricting networking: allowing only linked-container communication.
  • limiting memory and CPU: with cgroups, to prevent denial-of-service attacks.

Is Container a new -aaS? Container Technology in a nutshell

Introduction
---
Three cloud computing models have emerged so far: SaaS, PaaS and IaaS. For the sake of brevity, let's focus on the latter two. When cloud computing first emerged, Infrastructure as a Service (IaaS) was the main driver of the cloud computing business. Amazon Web Services (AWS) started its business by selling virtual machines. The problem with IaaS is that developers have to manage all the details, including which operating system to install and keeping installed software up to date, in addition to developing and deploying applications. Too much burden... As a result, Platform as a Service (PaaS) emerged. In PaaS, developers only need to care about developing their applications, without worrying about infrastructure-level issues such as the latest security patches for the operating system or figuring out the right firewall rules. The PaaS model has made all of this easier for developers. Cloud Foundry (CF) has emerged as one of the most widely deployed PaaS platforms so far. IBM, HP, Pivotal and others have already embraced CF and are using it in their offerings.

So far so good, except that we have started to see a kind of new model in application development, i.e. containers. So the question arises: wasn't everything great already, and what the heck are containers? Although there is no simple answer to that question, it looks like not everything was perfect, at least from the developers' perspective. So what's a container? Container technology is basically an extension of Linux containers (LXC), which were first released in 2008 (https://en.wikipedia.org/wiki/LXC). Containers provide isolation at the operating-system level. At this moment you are probably wondering what containers offer that PaaS or IaaS don't. There are several aspects, which I will try to examine below:

Portability
---
I think portability is the most compelling reason for the emergence of containers. Virtual machines are typically not considered portable, given their size and all the other burdens. Moreover, consider typical application development in PaaS. Take Cloud Foundry as an example... You choose a runtime (Node.js) and a couple of services (which run independently, inside or outside of the platform) and run your application. If you later change your mind and try to move to IaaS, you will need to do a lot of work, including installing your own Node.js runtime, database services, etc. Moving to another PaaS solution is another headache, with all the different capabilities and settings, unless they are all compatible, which sounds like a myth to me. Containers, on the other hand, promise portability: you package your application by specifying which software to install and how to initiate/start it. The rest is taken care of by the container engine (for Docker, the Docker Engine), which runs it as a Linux container inside a Linux machine. If you later decide to move away from your original machine and go somewhere else, you will still be able to do so easily, without worrying about size and compatibility.

Microservice Model
---
In software development, especially in recent years, there has been an obvious tendency to develop applications as microservices instead of as a monolith. Especially with multiple teams working on different parts of an application, microservices offer the right model, as the development and deployment of each microservice is independent of the others. As a result, there is less risk of failure during development and deployment. In the microservice model, everyone is responsible for their own service. If you own the database service, then you need to make that database service as robust and solid as possible. By reducing the dependencies between teams, deployment becomes faster, easier and less prone to errors.

Now let's think about containers for a second. In a nutshell, if you develop your service as a container, it gets the benefit of being isolated from the other services. Each service of the application is deployed as an independent container, and as a result each can run and be deployed independently. If you try to do the same in Cloud Foundry, you need to create a separate application for each microservice you want to develop (aside from the available services in the catalog that can be directly bound to the application). That does not sound like a good model in terms of either portability or maintainability.

Flexibility
---
This might be a little biased and subjective, but if you ask me, developers love hacking things and then fixing them up. Like it or not, this is a given characteristic of today's developers. As a result, quarantined infrastructure environments don't work for them. They want more power and flexibility in controlling the environment, and I believe the PaaS models lack this flexibility. Containers provide a tier above VMs and therefore give developers a feeling of control without all the burden of managing a VM.

Thursday, October 1, 2015

Paper Summary: What Bugs Live in the Cloud? A Study of 3000+ Issues in Cloud Systems

This paper is from SoCC 2014 (the Symposium on Cloud Computing), for which IBM was the platinum sponsor that year. It's a relatively easy read. In a nutshell, the authors explore the issues reported in the repositories of popular open-source cloud systems such as Hadoop MapReduce, HDFS, HBase, Cassandra, ZooKeeper and Flume. They reviewed around 22K issues, but the paper focuses only on the 3,655 issues they consider 'vital' -- those 'which affect deployment'. It's a relatively long paper (14 pages), so I'll only refer to their findings in this blog post. Feel free to refer to the original paper for more in-depth information.

The paper classifies the bugs along several dimensions. For aspect, the classes are reliability, performance, availability, security, consistency, scalability, topology and QoS. For the scope of a bug, single machine, multiple machines and cluster are defined. For software bug types, the classes are logic, error handling, optimization, config, race, hang, space and load.

When the bugs are classified by aspect, it turns out that reliability (45%), performance (22%) and availability (16%) are the dominant categories. However, data consistency (5%), scalability (2%) and topology (1%) issues are also becoming more prevalent as a result of cloud architectures.

Although people strive to design systems with no single point of failure (no-SPoF), according to the paper, 139 bugs, which the authors define as 'killer bugs', indicate situations where cascades of failures happen in subtle ways, and the no-SPoF principle may not always hold.

Another important finding of the paper is that hardware failures (13% of the issues) may not be easy to handle. Although the cloud community has long preached handling hardware failures in software, as the authors point out, hardware can fail in any mode ('stop', 'corrupt' or 'limp') at any time, and worse, recovery itself can run into another failure.

Another interesting, but not surprising, finding from the paper is the 'availability first, correctness second' paradigm. The authors found cases where data inconsistency and corruption were reported but ignored, which they attribute to availability and uptime being valued more than consistent data. Cloud systems are also generally evaluated on their uptime and reliability, which are easy to quantify, so availability ends up mattering more than consistency.

Wednesday, April 15, 2015

How To Access Device's Location Through Browser

In this blog post, I'll explore different mechanisms for accessing a device's location through the browser. By device, I mean laptops, PCs, smartphones and tablets. I'll try to present the tradeoffs of the different mechanisms.

First, the HTML5 Geolocation API. This is the standard way of accessing the location, powered by HTML5, and almost every browser supports it. Moreover, the returned location is more accurate if it's run on a device with GPS capability, such as a smartphone. To access the location, use the "navigator.geolocation.getCurrentPosition(showPosition)" method. You need to pass a callback (showPosition) as a parameter; once the location is known, the callback is called with the position. Surprisingly, this method is pretty successful in determining the device's location. Apparently, the high accuracy is due to the various pieces of information that the API uses, such as the IP address, access point BSSID, cell tower ID (if available), etc. Once this information is collected from the device, it's fed into a location service such as Google, Windows or Apple Location Services. Since these services have already mapped certain access points to locations, they usually return a good estimate of the device's location. There are also watchPosition() and clearWatch() methods, which enable continuous updates of the device's location. Finally, the only drawback of this method is that it asks the user for permission. Hence, you might lose some privacy-conscious people if you use this API.
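A minimal sketch of the API described above; the formatting helper is my own, and the permission prompt appears when getCurrentPosition is called in a browser:

```javascript
// Pure helper to format a Position object returned by the Geolocation API.
function formatPosition(pos) {
  return "lat=" + pos.coords.latitude + ", lon=" + pos.coords.longitude +
         " (accuracy: " + pos.coords.accuracy + "m)";
}

// In a browser, this triggers the permission prompt and reports the location.
if (typeof navigator !== "undefined" && navigator.geolocation) {
  navigator.geolocation.getCurrentPosition(
    function (pos) { console.log(formatPosition(pos)); },
    function (err) { console.error("geolocation failed: " + err.message); }
  );
}
```

Replacing getCurrentPosition with watchPosition in the same snippet would invoke the success callback on every location change instead of once.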


The second method of learning the location is to use the device's IP address together with a mapping service. Although it sounds simple, it actually takes quite a bit of work. First, you need a server-side script to learn the device's IP address, as it's not possible to learn the external IP address on the client side. The common way is to have a server-side script that returns the IP address of a request; there are also some third-party services which can do this for you. Second, once you have the IP address, you need to map it to a geolocation, which requires another third-party service. Usually such services map an IP to the right state, and if you're lucky, maybe you get the city right too. To me, that looks like a lot of dependencies for such basic functionality, and the result will not be anywhere near the accuracy you might want. The only positive is that with this approach you won't get a browser popup asking for permission. If you're curious, you can learn more about Internet geolocation from this paper.
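A sketch of the server-side piece, assuming a Node.js app that may sit behind a proxy or load balancer (the helper name is mine): when present, the X-Forwarded-For header carries the original client IP; otherwise the socket's remote address is used.

```javascript
// Determine the client's external IP address from an incoming HTTP request.
// Behind a proxy/load balancer, the original client IP is the first entry
// of the X-Forwarded-For header; otherwise fall back to the socket address.
function clientIp(headers, remoteAddress) {
  var fwd = headers["x-forwarded-for"];
  if (fwd) {
    return fwd.split(",")[0].trim();
  }
  return remoteAddress;
}

// In an Express handler this would be used roughly as:
//   app.get("/ip", function (req, res) {
//     res.send(clientIp(req.headers, req.connection.remoteAddress));
//   });
```

The returned address would then be passed to whichever third-party IP-to-geolocation service you choose.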

The third option is not actually an option, but it's something I thought you could use: the idea is to use the BSSID of the access point (AP) that the device is connected to. However, it looks like it's not possible to learn information about the AP using client-side scripting. So unfortunately, this is not an option.

Friday, April 10, 2015

Real Time Analysis of Images Posted on Twitter Using Bluemix

In this blog post, I'll share my experience of developing a real-time image analysis application using Bluemix. The best thing is, you don't need any understanding of computer vision or image processing to develop such an application on Bluemix (thumbs up!).

Let's start! Text analysis on Twitter is a widely known and widely performed activity nowadays. Sentiment analysis and event detection are just a few of the things that researchers and companies do. However, analysis of images posted on Twitter has not been explored much in the past (although it's starting to pick up recently...). In this post, I'll show you how to build such an application within an hour's worth of development using IBM's Bluemix platform. I assume you have some knowledge of Bluemix before reading further. If not, you can read my introductory Bluemix blog post.

Please note that although Bluemix provides a Twitter Insight service in the catalog, it's still in beta and therefore has limited capabilities. Hence, I opted out of the Bluemix Twitter service and used the Twitter API directly. Stay tuned, however, for future updates on the Bluemix service.

Twitter provides a nice, well-documented API on its website. In terms of capabilities, I'm interested in getting real-time tweets about a topic (such as 'NYC' or 'Obama'), so for my case the Public Streaming API looks like the way to go. In addition, I'd like to get only the posts that have images in them; however, the current API doesn't support that kind of query. As a result, I request all tweets about a specific query term and process only the ones that contain images.
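Since the Streaming API cannot filter on media directly, the filtering happens on my side. A sketch in Node.js, assuming the standard tweet JSON shape where attached photos appear under entities.media (the helper name is mine):

```javascript
// Return true if a tweet object from the Twitter Streaming API carries
// at least one photo attachment under entities.media.
function hasPhoto(tweet) {
  return !!(tweet.entities &&
            Array.isArray(tweet.entities.media) &&
            tweet.entities.media.some(function (m) { return m.type === "photo"; }));
}
```

In the streaming callback, tweets for which hasPhoto returns false are simply skipped; the rest are sent on for image analysis.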

To use the Twitter API, first create a Twitter application on the Twitter website, obtain credentials for accessing the API, and note your consumer_key, consumer_secret_key, access_token and access_token_secret. Then choose a Twitter library that can make calls to Twitter's Streaming API. Since I plan my app to be a Node.js one, I use this one, but feel free to use another library. Once that's set up, go to the Bluemix console and sign in.

On the dashboard, click '+' to create an application and choose a web application.
Next, select a runtime; I chose Node.js, but feel free to choose another one if you like. Once that's done, you need to name it.
Your app should then be created and deployed on Cloud Foundry within a few seconds. Click on it. You should now be given an option to download the starter code; please do so.
Next, go back to your application's Overview page (left menu) and click on 'Add a service'. Select the Watson category and choose 'Visual Recognition'.
Once you name the service and bind it to the application you created, you'll see the service on the Overview page of your application. Click on "Show Credentials" and note the username and password, which you'll need to make calls to the service. Please see this GitHub page for how to make calls to the service.
Your environment is now set up and you can start coding. I won't go into the details of the code, but feel free to look at my quick implementation on GitHub using Express, Passport and Jade. The result of the app can be seen on this website. It basically shows a pie chart of the labels of all the images posted on Twitter for NYC over the last 2 hours. I put a label into the "Others" category if it accounts for less than 3% of all images. As can be seen from the chart, "Others" makes up the majority, due to the long-tail distribution of images posted on Twitter.
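The "Others" bucketing described above can be sketched as a small helper (the function name and data shape are mine): labels whose share of the total falls below a threshold are merged into a single category.

```javascript
// Merge labels whose share of the total count is below `threshold`
// (e.g. 0.03 for 3%) into a single "Others" category.
function groupSmallLabels(counts, threshold) {
  var total = 0;
  for (var label in counts) {
    total += counts[label];
  }

  var grouped = {};
  for (var name in counts) {
    if (counts[name] / total < threshold) {
      grouped["Others"] = (grouped["Others"] || 0) + counts[name];
    } else {
      grouped[name] = counts[name];
    }
  }
  return grouped;
}
```

The resulting object maps directly onto the slices of the pie chart.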


Bluemix: Platform as a Service (PaaS) Offering of IBM

I believe that by now most of you have already heard about Bluemix, IBM's Platform as a Service (PaaS) solution for cloud computing. It's based on the open-source Cloud Foundry platform. Bluemix adds value to Cloud Foundry by providing IBM-specific services such as DB2 or the Watson services.

Bluemix lowers the barrier to app development. As a developer, you no longer need to maintain or worry about infrastructure. Plus, you have the option to select from the several different languages and services that are currently available on Bluemix.
Bluemix is structured around runtimes and services. Runtimes are the environments your code runs in. There are currently several runtimes available on Bluemix, such as Java, Node.js, Go, PHP, Python and Ruby. You can even bring your own buildpack and have your own runtime.
Bluemix is also rich in terms of services. There are four levels of support: IBM services, which are developed, maintained and supported by IBM; third-party services, which are maintained by a third party such as Twilio; community services, which are maintained by the Cloud Foundry community; and finally experimental and beta services.

Services in Bluemix are categorized into various types. One of the most compelling reasons to use Bluemix is the variety of services available. Some of them can only be found on Bluemix; these are usually IBM-owned, such as Watson -- a unique capability derived from IBM Watson. Others are services widely used by developers and therefore also available on Bluemix, such as MySQL or PostgreSQL databases. This post is too short to cover all of the services in Bluemix; please visit the Bluemix website to learn more about them.
Another great feature of Bluemix is that it combines the most frequently used runtimes and services and makes them available as Boilerplates. This way, you no longer need to create your runtime and add services one by one; instead, everything is created for you with just a few clicks. There is a variety of Boilerplates available in Bluemix, such as the Mobile Cloud Boilerplate, which combines a Node.js runtime with the Mobile Application Security, Push, Mobile Data and Mobile Quality Assurance services. Isn't it nice?
Last but not least, Bluemix has recently evolved beyond just PaaS. It now includes Docker containers and virtual machines enabled by OpenStack. With these two, Bluemix now offers a full cloud development environment where everyone can find a way to fulfill their needs.

Finally, you can sign up for Bluemix with a one-month free trial (no credit card required!). After that, you'll be charged as you use the runtimes and services. You can find more information on pricing at this link. To learn more about Bluemix, please visit the nice Bluemix documentation or the developer community webpage.