Sunday, December 11

Containers & Microservices

To successfully host internet-scale applications, organizations are seriously considering migrating to containers and microservices. Few companies had containers on their product roadmaps in 2014; by 2015 almost every organisation was testing or evaluating them.
With compute instances running into the hundreds, if not thousands, and the need to cater to any surge in user traffic, organizations see the solution in a containers-and-microservices approach. Soon containers and microservices will be table stakes.

Agility

Being slow is the new dead when it comes to internet scale applications. The ability of an organization to rapidly adapt to market and technology changes is extremely important to the success of the organization.
The microservices architecture lets one break an application into smaller services, which makes it much faster to respond to a bug or a new feature request. When these services are developed in containers, they are quick and easy to deploy to QA or production systems.
Environment deployment times have dropped from 4-6 hours to minutes thanks to containers. Containers are created from an image, and image creation is fast and easy. The whole cycle from development to deployment is drastically shortened, giving companies an edge by letting them release features at a faster pace.

Operability

Most organizations are scaling up quite fast and need technologies that are easy to operate at such a scale. Containers hold the promise of easing the whole task of operability as applications scale out horizontally and elastically. The container is slowly becoming the fundamental, standard unit of deployment.

Portability

Even with VMs, shipping code from the development environment to production is a problem: VM images have to be converted and transferred, both of which are slow operations. Unlike VMs, containers are not restricted to a specific cloud provider.
Containers provide a write-once, run-anywhere design that improves portability in a big way. Containers are easy to move because container images are broken into layers; when updating and distributing an image, only the changed layers are shipped. One can build, ship, and run any app, anywhere.

Cost Savings

With containers, businesses can optimize infrastructure resources, standardize environments, and reduce time to market. Running containers is less resource-intensive than running VMs. Docker lets users run their containers wherever they please, in the cloud or on-premise, thus avoiding vendor or cloud lock-in. One can move containers to any Docker host on any infrastructure, whether Amazon, Google, or Azure, whichever best fits the budget.

Scalability

Elastic scaling is the new normal. Monolithic applications are difficult to scale, let alone scale elastically. The solution is to migrate to a microservices architecture. Scaling a particular microservice is far easier: if it becomes a bottleneck due to slow execution, multiple instances of it can run on different machines to process data in parallel.
With monolithic systems you would have to run a copy of the complete system on a different machine, which is far more difficult. With containers and microservices, organizations can elastically and horizontally scale their applications dynamically in response to spikes in user traffic.

Resilience

With a microservices architecture, even if a particular service goes down, it does not crash the whole application. Services are designed up front for failure and for the unavailability of other services.
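A minimal sketch of designing for the unavailability of another service: the caller wraps the dependency and degrades gracefully rather than crashing. The service names and fallback behaviour here are hypothetical, not a prescribed pattern from any specific framework.

```python
def call_with_fallback(primary, fallback):
    """Design-for-failure sketch: if a dependency raises, serve a
    degraded response (e.g. cached or empty data) instead of letting
    the failure cascade through the whole application."""
    try:
        return primary()
    except Exception:
        return fallback()


def broken_recommendations():
    # Stand-in for an unavailable downstream service.
    raise ConnectionError("recommendation service down")


# The page still renders, just without recommendations.
items = call_with_fallback(broken_recommendations, lambda: [])
```

Patterns like circuit breakers build on the same idea, adding a cooldown so a failing dependency is not hammered with retries.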

Developer Experience

The developer experience with Docker has improved many-fold in the recent past. Developers love working with containers because they can package an application with all the parts it needs (such as libraries), ship it, and run it. An application that runs on a developer's machine works in QA and production without hassle. A developer machine can run tons of containers, which may not be possible with VMs. The same containers built for development are also deployed to production.

Summary

While containers and microservices have existed independently for a long time, they work best together. The benefits of microservice architectures are amplified when used in combination with containers. Container-based development is a great option for microservices, DevOps, and continuous deployment, all of which are critical to an organization's success.
Containers can share more context than VMs, making them a better partner for microservices as they help decouple complexity. Docker is available on all modern Linux variants, and many IaaS providers offer server images with Docker preinstalled. There are multiple mature container orchestration engines, such as Kubernetes, Docker Swarm, and Apache Mesos.

Thursday, December 8

Stream Processing At Scale: Kafka & Samza

Businesses today generate millions of events as part of their daily operations. Uber is one example: opening the Uber app to see how many cars are nearby is an eyeball event, booking a cab is an event, the driver accepting your request is another, and there are many more.
Unbounded, unordered, large-scale data sets have become increasingly common and come from varied sources such as satellite data, scientific instruments, stock data, and traffic control; in essence, data that arrives continuously, is large in scale, and never ends. An entire business can be represented as streams of data, and turning those streams into valuable, actionable information in real time is critical to the success of any organization.

Challenges

These requirements lead to applications whose primary job is to consume this never-ending, continuous stream of events and process it successfully in near real time. The number of events from such a business is extremely high, and each event needs to be sent somewhere; in most cases multiple applications will want to process a single event.

Stream Processing

Stream processing can be looked at in two parts: how the application gets its input, and how it produces its output. Getting the stream input can be owned by a message broker, while producing the output can be owned by a processing framework.

Message Broker

A message broker needs to deal with a never-ending fire hose of events (100k+/sec) and hence must be scalable and fault-tolerant, with high throughput. It should also support multiple subscribers, as more than one application may process the messages, and it should be able to persist messages. Performance remains the overriding requirement throughout.

Apache Kafka

Currently Apache Kafka is the near-unanimous choice of message broker for stream processing applications. Performance-wise, Kafka blows away the competition, handling tens of millions of reads and writes per second from thousands of clients on modest hardware. In Kafka, messages belonging to a topic are distributed among partitions.

This ability to divide a topic into partitions lets Kafka score high on scalability. Kafka is distributed from the ground up: it runs as a cluster of one or more servers, each called a broker. Kafka's message delivery is durable, as it writes everything to disk while maintaining performance; Kafka can process 8 million messages per second at peak load.
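As a toy illustration of how key-based partitioning spreads a topic's messages while keeping per-key ordering, here is a sketch in plain Python. This is not Kafka's actual client code: real producers hash keys with murmur2, and CRC32 below is just a deterministic stand-in.

```python
import zlib

def assign_partition(key: bytes, num_partitions: int) -> int:
    # Deterministic key-to-partition mapping; CRC32 stands in for
    # the murmur2 hash real Kafka clients use.
    return zlib.crc32(key) % num_partitions

class Topic:
    """Toy model of a topic split into partitions, each an append-only log."""
    def __init__(self, num_partitions: int):
        self.partitions = [[] for _ in range(num_partitions)]

    def produce(self, key: bytes, value: str) -> int:
        p = assign_partition(key, len(self.partitions))
        self.partitions[p].append(value)  # order preserved per partition
        return p

topic = Topic(4)
p1 = topic.produce(b"user-42", "opened app")
p2 = topic.produce(b"user-42", "booked cab")
```

Because the same key always maps to the same partition, all of one user's events stay ordered on a single partition, while different keys spread across the cluster.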

Processing Framework

The other half of stream processing is the processing framework, which consumes a message, processes it, and produces an output message. A stream processing framework needs a one-at-a-time processing model, with data processed immediately upon arrival. It also needs low, sub-second latency, as it is critical to keep the data moving. The requirement to produce a result for every event means the processing framework must be scalable, highly available, and fault-tolerant.

[Figure: Stream Processing Frameworks Comparison. Source: http://www.cakesolutions.net]


Apache Samza

The stream processing framework space is crowded with players like Storm, Flink, Spark Streaming, Samza, Kafka Streams, and Dataflow. Apache Samza is probably the least well-known stream processing framework trying to make a space for itself. Data-intensive organizations like Uber, Netflix, and LinkedIn, which process millions of events every second, have Samza in their stream processing architectures.
Samza has an advantage in performance, stability, and support for a variety of input sources. Since Samza and Kafka were developed at LinkedIn around the same time, Samza is very Kafka-centric and has excellent integration with Kafka.
The key difference between Samza and other streaming technologies is its stateful stream processing capability: Samza tasks have a dedicated key/value store co-located on the same machine as the task. This approach delivers better read/write performance than other stream processing software. The Kafka-and-Samza stream processing architecture has proved its mettle in data-intensive, high-frequency, unbounded stream processing use cases.

Sunday, December 4

The Reactive Manifesto

Today’s requirements demand new technologies, as applications are deployed on multiple devices (mobile, tablets, cloud), with machines having thousands of multicore processors. Users expect response times in milliseconds, or even microseconds. Systems are expected to be up 100% of the time and to churn through big data (petabytes).

In summary, today’s problems are far bigger in scale and complexity than those of the recent past. This calls for design principles to be applied well in advance. One such approach is The Reactive Manifesto, which summarises the key principles of how to design highly scalable and reliable applications.

The Reactive Manifesto

In 2013 a group of individuals came up with a collection of design principles to help build systems that are responsive, maintainable, elastic, and scalable from the outset. This collection was published as The Reactive Manifesto. The manifesto attempts to summarize how to design highly scalable and reliable applications. The principles are listed as architecture best practices and also define a common vocabulary that eases communication on topics like architecture and scalability among stakeholders such as engineering managers, developers, architects, and CTOs.

High Level Traits

Reactive systems exhibit four high-level traits: they are responsive, resilient, elastic, and message-driven.
A reactive system is responsive, meaning it reacts to users in a timely manner. For users, when the response time exceeds their expectation, the system is down.

A resilient system keeps processing transactions, even when there are transient impulses, persistent stresses, or component failures disrupting normal processing. This is what most people mean when they just say stability.

Resilient systems are loosely coupled in order to achieve a high degree of resilience. This is achieved with a shared-nothing architecture, clear boundaries, and the use of microservices, which lead to single-responsibility, independent components.

Elasticity plays a major role in scalability. A reactive system scales up in response to an increase in the number of users and scales down to save cost as the number of users decreases.
Reactive Systems are more flexible, loosely-coupled and scalable. They are easier to develop and open to change. They are significantly more tolerant of failure. Reactive Systems are highly responsive, giving users effective interactive feedback.

The manifesto is aimed at end-user projects as well as for reusable libraries and frameworks. One can look at the manifesto as a dictionary of best practices.

The long-term advantage of the manifesto is a shared set of principles that avoids confusion and facilitates ongoing dialogue and improvement for scalable, resilient, and responsive systems.

The Reactive Manifesto is about effectively modelling the use cases in our problem domain, writing solutions that will scale and live up to the customers’ expectations. The reactive characteristics can be considered as a list of service requirements.

The requirements of a system need to be captured and factored into a design right at the onset and the reactive manifesto allows one to have the characteristics in one place. This allows one to check if the particular characteristic is applicable to their system.

Sunday, September 25

Building Scalable Applications



In this blog post Rahul (rahulamodkar at gmail dot com) looks at key strategies to consider while building highly scalable applications.

Introduction

Everyone wants to build the next unicorn, whether it is a cab-aggregating company like Uber or a room-stay platform like Airbnb. What these have in common is the need for highly scalable web applications ready to serve millions of users. We will look at some strategies to consider while designing applications that are expected to scale.


High Scalability Strategies

Without going into too much detail, and without discussing specific tools, we will look at strategies that can help one develop a scalable application.


1. Cloud

Consider deploying your application on the cloud. The cloud comes with the promise of limitless resources, which can help you scale up or down in response to traffic to your site. Theoretically the cloud can provide as many resources as you request. Auto scaling also lets you pay for only the resources you actually use.


2. Load Balancing

One can deploy a load balancer to distribute the load on your site across different servers, ensuring that no one server is overloaded to the point of going down. Think of it as a traffic cop that routes client requests across all servers in such a way that no single server is overworked.
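Round-robin rotation is the simplest version of this traffic cop. The sketch below is purely illustrative, with made-up server names; real balancers add health checks, weighting, and connection counting on top of this idea.

```python
import itertools

class RoundRobinBalancer:
    """Traffic-cop sketch: rotate incoming requests across a pool of
    servers so no single server is overworked."""
    def __init__(self, servers):
        self._cycle = itertools.cycle(servers)

    def route(self, request):
        # Each request goes to the next server in the rotation;
        # the request itself is ignored in this toy version.
        return next(self._cycle)

lb = RoundRobinBalancer(["web-1", "web-2"])
targets = [lb.route(f"req-{i}") for i in range(4)]
```

With two servers, four requests alternate between them, so each handles exactly half the load.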


3. Content Delivery Network

Deploy the services of a Content Delivery Network (CDN). A CDN is a global cluster of caches that serves as a local cache for static files (objects). Static content is served from the CDN caching node closest to the visitor, because the CDN takes the user's geo-location into account.


4. Design for High Availability

It is very likely that the server on which your application is deployed will go down, or, in an extreme scenario, that the entire data centre will go down. To tackle this, deploy your application in more than one geographical zone or data centre; then, when a server or an entire data centre goes down, the architecture simply stops routing traffic to the affected data centre. One also needs to handle hardware and software failures by adding redundancy to the application design and eliminating any single point of failure.

5. Microservices Architecture

In a microservices-based architecture, complex applications are composed of small, independent processes communicating with each other using language-agnostic APIs. These services are small, highly decoupled, and focused on doing one small task. Scaling a particular microservice is far easier than scaling a monolithic application: if a microservice becomes a bottleneck due to slow execution, multiple instances of just that service can run on different machines to process data in parallel. With monolithic systems, scaling is difficult, as you would have to run a copy of the complete system on a different machine.

6. Distributed Cache

An effective distributed caching strategy improves performance in most scalable applications. Especially for read-intensive applications, caching can boost performance, as application processing time and database access are reduced.
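One common shape for this is the cache-aside pattern, sketched below with an in-process dict standing in for a distributed cache such as Redis or Memcached; the lookup function is a hypothetical slow database call.

```python
class CacheAside:
    """Cache-aside sketch: check the cache first and fall back to the
    slower backing store only on a miss, populating the cache on the
    way out so repeat reads are cheap."""
    def __init__(self, db_lookup):
        self.cache = {}            # stand-in for a distributed cache
        self.db_lookup = db_lookup # the slow backing lookup
        self.db_hits = 0           # how many times we reached the DB

    def get(self, key):
        if key in self.cache:
            return self.cache[key]
        self.db_hits += 1
        value = self.db_lookup(key)
        self.cache[key] = value
        return value

store = CacheAside(lambda k: k.upper())  # pretend uppercasing is expensive
first = store.get("uber")
second = store.get("uber")
```

Two reads of the same key hit the backing store only once; in a read-intensive application that ratio is what saves database capacity. A real implementation also needs expiry and invalidation, which are omitted here.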

7. Database Master Slave

Deploy a master-slave database strategy: a dedicated, powerful server handles writes, while less powerful database servers are dedicated to reads. Deployed in the right scenario, such master-slave replication lets an application distribute its queries efficiently.
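The routing decision can be sketched as simple read/write splitting; the host names below are made up, and real routers also have to handle replication lag and read-your-own-writes concerns that this toy ignores.

```python
import itertools

class ReplicatedDatabase:
    """Read/write-splitting sketch: writes go to the master, reads are
    spread round-robin across the read replicas."""
    WRITE_VERBS = ("INSERT", "UPDATE", "DELETE")

    def __init__(self, master, replicas):
        self.master = master
        self._readers = itertools.cycle(replicas)

    def route(self, sql: str) -> str:
        # Classify the statement by its leading verb.
        verb = sql.lstrip().split(None, 1)[0].upper()
        return self.master if verb in self.WRITE_VERBS else next(self._readers)

db = ReplicatedDatabase("db-master", ["db-read-1", "db-read-2"])
write_target = db.route("INSERT INTO users VALUES (1)")
read_target = db.route("SELECT * FROM users")
```

All writes concentrate on the one powerful master, while read traffic, usually the bulk of the load, fans out across cheaper replicas.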

8. NoSQL Databases

Consider using NoSQL databases. They can handle large volumes of structured, semi-structured, and unstructured data, and their built-in scale-out architecture works seamlessly with a scalable application architecture.

9. Monitor

It is important to monitor your application on various parameters so that you can take corrective action, either manually or automatically. The parameters typically monitored include applications, services, operating systems, network protocols, system metrics, and infrastructure components. For example, monitoring helps you identify the health of server instances and automatically terminate and re-launch unhealthy ones. Monitor the application log files and put a process in place so that engineers act on the issues identified.

10. Database Sharding

As the data grows, one cannot keep buying ever bigger, faster, more expensive machines. Sharding breaks the application database into smaller chunks called "shards" and spreads them across a number of distributed servers. This approach allows the application to scale linearly at low cost, as the data is distributed across multiple physical nodes and can be accessed in parallel.
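Hash-based sharding, one common way to pick a shard, can be sketched in a few lines. The shard names are hypothetical, and CRC32 is only an illustrative hash; real systems often prefer consistent hashing so that adding a shard moves less data.

```python
import zlib

def shard_for(shard_key: str, shards):
    """Map a shard key (say, a user id) to the node that holds its rows.
    The same key always lands on the same shard."""
    return shards[zlib.crc32(shard_key.encode()) % len(shards)]

shards = ["shard-a", "shard-b", "shard-c"]
home = shard_for("user-42", shards)
```

Every lookup for "user-42" computes the same shard, so queries touch one small node instead of scanning one giant database.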


11. Horizontal Scaling

In this linear-scalability approach, the processing job is split into multiple pieces and distributed among multiple compute nodes. Nodes can be added or removed in response to surges or dips in traffic, auto scaling the application to the workload. If a node fails, the other nodes in the cluster take up its workload, adding fault tolerance to the application.

12. Stateless

It is important that the application be stateless for it to auto scale, especially in a horizontal scaling approach. Stateless services can be easily scaled horizontally, improving availability and absorbing surges in traffic by scaling automatically.

13. DNS Lookups

Reduce the number of DNS lookups needed to reach the application pages. This is especially true for pages where users expect high performance: the fewer DNS lookups on your pages, the better your page download performance.

14. Commodity Systems

Use inexpensive commodity-grade systems where possible, so that if one system goes down it is easy to replace with another. This approach goes hand in hand with scale-out (horizontal scaling) and is most effective in an environment with hyper-growth in the number of users.

15. Asynchronous Communication

Asynchronous communication is an integral ingredient in the recipe for scalable applications; the application itself also needs to behave asynchronously. The problem with synchronous calls is that they stall the entire application’s execution while waiting for a response, binding all the services and tiers together and resulting in cascading failures.
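A small asyncio sketch of the difference: the two hypothetical downstream calls below run concurrently instead of one stalling behind the other, so total latency is roughly the slowest call rather than the sum of both.

```python
import asyncio

async def call_service(name: str, delay: float) -> str:
    # Stand-in for a slow downstream call; sleeping instead of real I/O.
    await asyncio.sleep(delay)
    return name

async def fan_out():
    # Both calls are in flight at the same time.
    return await asyncio.gather(call_service("inventory", 0.01),
                                call_service("pricing", 0.01))

results = asyncio.run(fan_out())
```

In a production system the same idea is usually realised with message queues between services, so a slow consumer backs up its queue instead of blocking its callers.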

16. Containers

Running containers is less resource-intensive than running virtual machines, allowing you to pack more computing workload onto the same server. Provisioning a new container takes a few seconds or less, so the data center can react quickly to a spike in user activity. Containers are also cost-effective: they can decrease your operating cost (fewer servers, less staff) and your development cost (one consistent runtime environment to develop for), which can be a big factor for scalable applications.

Monday, May 30

Serverless Computing

Introduction

Serverless is the latest buzzword in the software architecture world. It is an approach to development that removes the need to create and maintain servers, whether physical machines, VMs, or cloud instances. This usually means the architecture is some form of application that interacts with multiple third-party APIs/services and self-created non-server-based APIs to deliver its functionality.
Serverless computing can also mean that while the server-side logic is written by the application developer, it runs in stateless compute containers that are event-triggered and fully managed by a third party. Going by the amount of attention the major cloud vendors, Amazon, Microsoft, Google, and IBM, are giving it, serverless technologies could well be the future of the cloud.


Existing Platforms

Naturally, not all applications can be implemented in a serverless way. There are limitations, especially with legacy systems and public clouds. However, adoption of existing serverless frameworks is only growing by the day. Currently the following serverless computing frameworks are available:

  • AWS Lambda
  • Google Cloud Functions
  • Iron.io
  • IBM OpenWhisk
  • Microsoft Azure WebJobs

Function As A Service

Function-as-a-Service lets you run code without provisioning or managing servers. The serverless computing paradigm is based on the microservices architecture: serverless frameworks invoke autonomous code snippets when triggered by external events.
These snippets are loosely coupled with each other and essentially designed to perform one task at a time; the framework is responsible for orchestrating them at runtime. This way one can deploy an application as independent functions that respond to events, are charged for only when they run, and scale automatically.
Each snippet is versioned and maintained independently of the other snippets, marrying the microservices concept with serverless computing.
Serverless computing lets developers skip provisioning resources based on current or anticipated loads, and avoid putting a lot of effort into planning for new projects. Just as virtual machines made it easy to spin up servers for new applications, serverless computing services make it simple to grow.
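The unit of deployment in this model is just a function. Below is a Lambda-style handler sketch; the event fields and response shape are hypothetical, chosen to resemble an HTTP-triggered function rather than any one provider's exact contract.

```python
def handler(event, context=None):
    """FaaS sketch: a stateless function invoked once per event by the
    platform. It holds no server state between invocations; everything
    it needs arrives in the event."""
    name = event.get("name", "world")
    return {"statusCode": 200, "body": f"Hello, {name}"}

response = handler({"name": "serverless"})
```

Because the function is stateless and event-triggered, the platform can run zero instances when idle and thousands in parallel under load, which is exactly the pay-per-invocation, auto-scaling behaviour described above.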

Conclusion

The rise of containers and microservices is the driving force behind serverless computing. Serverless computing turns out to be an excellent choice for applications like event-driven systems, mobile backends, IoT applications, ETL, and APIs. For certain use cases the serverless approach can significantly reduce operational cost.


Friday, April 8

Apache Sentry : Hadoop Security Authorization

Hadoop has gained significant momentum in the past few years as a platform for scalable data storage and processing. More and more organisations are adopting Hadoop as an integral part of their data strategy. With the concepts of data lakes and democratization of data becoming more popular, more users are getting access to data that was previously available only to a select few.

Without a robust security model in place, this approach poses the risk of sensitive data reaching an unintended audience. With government regulations and compliance requirements such as HIPAA and PII protection, it is also becoming more important for organizations to implement a meaningful security model around the data lake. The topmost concerns for anyone building a data lake today are security and governance. The problem becomes all the more complex because multiple components within Hadoop access the data, and there are multiple security mechanisms one can use.

While implementing security means taking care of authentication, authorization, and auditing, today's blog focuses on the authorization aspect and particularly on one solution: Apache Sentry. Apache Sentry, an Apache Top-Level Project, is an authorization module for Hive, Search, Impala, and others that helps define what users and applications can do with data.

One current option is to let users grant themselves permissions, which prevents accidental deletion of data; the problem with this approach is that it does not guard against malicious users. The other option is HDFS impersonation, where data is protected at the file level by HDFS permissions. But file-level permissions are not granular enough, and they do not support role-based access control.

Apache Sentry overcomes most of these problems by providing secure role-based authorization. Sentry offers fine-grained authorization, a key requirement for any authorization service, with granularity down to servers, databases, tables, views, indexes, and collections. Sentry supports multi-tenant administration, with separate policies for each database/schema that can be maintained by separate administrators. To use Sentry one needs CDH 4.3.0 or later and HiveServer2 secured with strong authentication (Kerberos or LDAP).

In short, Apache Sentry makes it possible to store sensitive data in Hadoop, thereby extending Hadoop to more users and helping organizations comply with regulations. The fact that Sentry recently graduated out of the Incubator and is now an Apache Top-Level Project says a lot about its ability to provide robust authorization for the Hadoop ecosystem.

Wednesday, April 6

Data Lake

A data lake can be defined as an unstructured data warehouse where you pull all of your different sources into one large pool of data. The concept can apply broadly to any type of system, but in today's big data world it most likely means storing data in the Hadoop Distributed File System (HDFS) across a set of clustered compute nodes.
More and more organisations are moving away from traditional data warehouses and adopting Hadoop data lakes, which provide a less expensive central location for analytics data. The reasons are many, but primary among them are the ability to build the data lake on cheap commodity hardware and the use of open source technology for Hadoop development.
James Dixon, the founder and CTO of Pentaho, who has been credited with coming up with the term "Data Lake" describes it as follows:
“If you think of a datamart as a store of bottled water – cleansed and packaged and structured for easy consumption – the data lake is a large body of water in a more natural state. The contents of the data lake stream in from a source to fill the lake, and various users of the lake can come to examine, dive in, or take samples.”
The focus of a model data lake is on storing disparate data, deferring questions of how and why the data is used, governed, and secured. The data lake solves the problem of isolated pieces of information: it removes the need for lots of independently managed collections of data by combining those sources in a single lake. This increases information use and sharing while reducing costs through server and licence reduction.
In today's scenario the security model within data lakes (which in all probability are Hadoop-based) is still immature and not comprehensive. One reason could be that Hadoop is not a single technology stack but a collection of projects.
Anybody defining the architecture of a Hadoop-based data lake needs to think hard about how to provide robust authentication, authorization, and auditing. There are several open source security efforts in the Apache community, such as Knox, Ranger, Sentry, and Falcon. One needs to understand that dumping data into the data lake with no processes, procedures, or data governance will only lead to a mess.
To summarize, the "data lake" as a concept is increasingly recognized as an important part of data strategy. Multiple use cases for the data lake exist within businesses. The key challenges in building one are security and governance.

Sunday, March 20

Schema On Read

According to a study, 90% of today's data warehouses process just 20% of the enterprise data. One of the primary reasons for such a low percentage is that traditional data warehouses are schema-on-write, which requires schemas, partitions, and indexes to be built before the data can be loaded.

For a long time the data world has followed the schema-on-write approach, where systems require users to create a schema before loading any data. The flow is: define the schema, then write the data, then read the data back in the schema you defined at the start. This lets such systems tightly control the placement of the data during load time, enabling them to answer interactive queries very fast; however, it comes at the cost of agility.

But there is an alternative: schema-on-read. Hadoop's schema-on-read capability allows data to flow into the system in its original form; the schema is then parsed at read time, where each user can apply their own view to interpret the data.

In the "schema on read" world, the flow is to load the data in its current form and apply your own lens to it when you read it back out. The reason to do it this way is that data is now shared with a large and varied set of users, each category of user having a unique way of looking at the data and deriving an insight specific to them. With schema-on-read, the data is not restricted to a one-size-fits-all format. This approach lets one present the data back in the schema most relevant to the task at hand and allows for extreme agility when dealing with complex, evolving data structures.
The schema-on-read approach lets one load the data as-is and start getting value from it right away. This is equally true for structured, semi-structured, and unstructured data. It is also easier to create multiple different views of the same data with schema on read.
The popularity of Hadoop and related technologies in today's enterprise can in part be credited to their support for the schema-on-read strategy. Organisations have large amounts of raw data that powers all sorts of business processes through transformation systems involving corporate data warehouses and other large data assets.
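The idea can be sketched in a few lines of Python: raw records land as-is, and each consumer applies its own "schema" (here just a field projection) at read time. The event fields are invented for illustration.

```python
import json

RAW_EVENTS = [  # raw records stored in their original form, no upfront schema
    '{"user": "u1", "action": "open_app", "ts": 1}',
    '{"user": "u2", "action": "book_cab", "ts": 2}',
]

def read_with_schema(lines, fields):
    """Apply a schema at read time: each consumer projects only the
    fields its own view of the data cares about, leaving the raw
    records untouched for everyone else."""
    for line in lines:
        record = json.loads(line)
        yield {f: record.get(f) for f in fields}

# Two different consumers, two different read-time schemas, same raw data.
audit_view = list(read_with_schema(RAW_EVENTS, ["user", "ts"]))
activity_view = list(read_with_schema(RAW_EVENTS, ["action"]))
```

Because no schema was baked in at write time, adding a third view later requires no reload or migration of the stored data, which is exactly the agility argument made above.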