
Thursday, December 8

Stream Processing At Scale : Kafka & Samza

Businesses today generate millions of events as part of their daily operations. Uber is one example: it generates thousands of events, such as when you open the Uber app to see how many cars are nearby (an eyeball event), when you book a cab, when the driver accepts your request, and many more.
Unbounded, unordered, large-scale data sets have become increasingly common and come from varied sources such as satellites, scientific instruments, stock markets and traffic control. This is data that arrives continuously, is large in scale and never ends. In effect, an entire business can be represented as streams of data. Turning these streams into valuable, actionable information in real time is critical to the success of any organization.

Challenges

These requirements lead to the development of applications whose primary job is to consume these never-ending, continuous streams of events and process them successfully in near real time. The number of events such a business generates is extremely high. Each event needs to be sent somewhere, and in most cases multiple applications will want to process a single event.

Stream Processing

Stream processing can be looked at in two parts: how the application gets its input, and how it produces its output. Receiving the stream input can be owned by a message broker, while producing the output is the job of the processing framework.

Message Broker

A message broker has to deal with a never-ending firehose of events (100k+/sec), and hence needs to be scalable, fault-tolerant and deliver high throughput. It should also support multiple subscribers, as more than one application may want to process the same messages, and it should be able to persist messages so that consumers can fall behind or replay them. On top of all of this, performance remains a key requirement.

Apache Kafka

Currently Apache Kafka is the near-unanimous choice of message broker in stream processing applications. Performance-wise, Kafka blows away the competition: it handles millions of reads and writes per second from thousands of clients, all day long, on modest hardware. In Kafka, messages belonging to a topic are distributed among partitions.

This ability of a Kafka topic to be divided into partitions is what lets Kafka score so high on scalability. Kafka is distributed from the ground up: it runs as a cluster of one or more servers, each of which is called a broker. Kafka's message delivery is durable because it writes everything to disk while still maintaining performance; benchmarks have shown Kafka handling millions of messages per second at peak load.
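To see why partitioning helps, here is a minimal in-memory sketch of a partitioned log (all names here are illustrative, not Kafka's actual API; a real client would use a library such as kafka-python). Hashing the message key picks the partition, so all events for one key stay ordered within that partition, and partitions can be spread across brokers:

```python
import hashlib

class PartitionedLog:
    """A toy in-memory log split into partitions, Kafka-style."""

    def __init__(self, num_partitions):
        self.partitions = [[] for _ in range(num_partitions)]

    def partition_for(self, key):
        # Hash the key so the same key always lands on the same partition,
        # which preserves per-key ordering.
        digest = hashlib.md5(key.encode()).hexdigest()
        return int(digest, 16) % len(self.partitions)

    def append(self, key, value):
        p = self.partition_for(key)
        offset = len(self.partitions[p])
        self.partitions[p].append((key, value))
        return p, offset

log = PartitionedLog(num_partitions=3)
p1, o1 = log.append("rider-42", "app_open")
p2, o2 = log.append("rider-42", "book_cab")
assert p1 == p2      # same key -> same partition, so events stay ordered
assert o2 == o1 + 1  # offsets grow monotonically within a partition
```

Because each partition is an independent append-only sequence, adding brokers (and partitions) scales writes and reads horizontally without giving up per-key ordering.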

Processing Framework

The other half of stream processing is the processing framework, which consumes a message, processes it and produces an output message. A stream processing framework needs a one-at-a-time processing model in which data is processed immediately upon arrival. It also needs to achieve low, sub-second latency, as it is critical to keep the data moving. The requirement to produce a result for every event means the processing framework must be scalable, highly available and fault tolerant.

Stream Processing Frameworks Comparison. Source: http://www.cakesolutions.net


Apache Samza

The stream processing framework space is crowded with players like Storm, Flink, Spark Streaming, Samza, Kafka Streams and Dataflow. Apache Samza is probably the least well known of these, still trying to make a space for itself. Yet data-intensive organizations like Uber, Netflix and LinkedIn, which process millions of events every second, have Samza in their stream processing architectures.
Samza has an advantage when it comes to performance, stability and support for a variety of input sources. Since Samza and Kafka were developed at LinkedIn around the same time, Samza is very Kafka-centric and has excellent integration with Kafka.
The key difference between Samza and other streaming technologies is its stateful stream processing capability. Samza tasks have a dedicated key/value store co-located on the same machine as the task. This approach delivers far better read/write performance than keeping state in a remote store. The Kafka and Samza based stream processing architecture has proved its mettle in data-intensive, high-frequency, unbounded stream processing use cases.
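The co-located state idea can be sketched as follows (a deliberately simplified illustration, not Samza's Java API; the dict stands in for the local RocksDB-backed store that Samza keeps next to each task):

```python
class StatefulTask:
    """Sketch of a Samza-style task: the key/value store lives with the
    task, so state reads and writes never cross the network."""

    def __init__(self):
        self.store = {}  # stands in for the task's co-located key/value store

    def process(self, event):
        # Count events per user, updating local state on every message.
        key = event["user"]
        self.store[key] = self.store.get(key, 0) + 1
        return {"user": key, "count": self.store[key]}

task = StatefulTask()
out = None
for e in [{"user": "a"}, {"user": "b"}, {"user": "a"}]:
    out = task.process(e)
assert out == {"user": "a", "count": 2}
```

Because every state lookup is a local read rather than a network round trip, per-message latency stays low even when every event touches state; Samza additionally replicates state changes to a Kafka changelog so a restarted task can rebuild its local store.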

Sunday, December 4

The Reactive Manifesto

Today’s requirements demand new technologies: applications are deployed across many devices (mobile, tablets, cloud), on machines with many multicore processors. Users expect response times in milliseconds, or even microseconds. Systems are expected to be up 100% of the time and to churn through big data (petabytes).

In summary, today’s problems are far bigger in scale as well as in complexity compared to those of the recent past. This calls for design principles to be applied well in advance. One such approach is The Reactive Manifesto, which summarises the key principles of how to design highly scalable and reliable applications.

The Reactive Manifesto

In 2013 a group of individuals came up with a collection of design principles to help build systems that are responsive, maintainable, elastic and scalable from the outset. This collection was published as a manifesto called The Reactive Manifesto. The manifesto attempts to summarize how to design highly scalable and reliable applications. The principles are listed as architecture best practices, and they also define a common vocabulary to ease communication about topics like architecture and scalability among stakeholders such as engineering managers, developers, architects and CTOs.

High Level Traits

Reactive systems exhibit four high-level traits: Responsive, Resilient, Elastic and Message-Driven.
A reactive system is responsive, meaning it reacts to its users in a timely manner. For users, when the response time exceeds their expectation, the system is down.

A resilient system keeps processing transactions, even when there are transient impulses, persistent stresses, or component failures disrupting normal processing. This is what most people mean when they just say stability.

Resilient systems are loosely coupled in order to achieve a high degree of resilience. This is achieved with a shared-nothing architecture, clear boundaries and the use of microservices, which lead to single-responsibility, independent components.

Elasticity plays a major role in scalability. A reactive system scales up to respond to an increase in the number of users and scales down to save cost as the number of users decreases.
Reactive systems are more flexible, loosely coupled and scalable. They are easier to develop and open to change. They are significantly more tolerant of failure. Reactive systems are highly responsive, giving users effective interactive feedback.

The manifesto is aimed at end-user projects as well as reusable libraries and frameworks. One can look at it as a dictionary of best practices.

The long-term advantage the manifesto provides is a shared set of principles that avoids confusion and facilitates ongoing dialog and improvement for scalable, resilient and responsive systems.

The Reactive Manifesto is about effectively modelling the use cases in our problem domain, writing solutions that will scale and live up to the customers’ expectations. The reactive characteristics can be considered as a list of service requirements.

The requirements of a system need to be captured and factored into the design right at the outset, and the Reactive Manifesto gathers the relevant characteristics in one place. This lets one check whether a particular characteristic is applicable to their system.

Monday, May 30

Serverless Computing

Introduction

Serverless is the latest buzzword in the software architecture world. It is an approach to development that removes the need for creating and maintaining servers, whether physical machines, VMs or cloud instances. This usually means the architecture is some form of application that interacts with multiple third-party APIs/services and self-created, non-server-based APIs to deliver its functionality.
Serverless computing can mean that while the server-side logic is written by the application developer, it runs in stateless compute containers that are event-triggered and fully managed by a third party. Going by the amount of attention the major cloud vendors (Amazon, Microsoft, Google and IBM) are giving it, serverless technologies could well be the future of cloud.


Existing Platforms

Naturally, not all applications can be implemented in the serverless way. There are limitations, especially when it comes to legacy systems and using a public cloud. However, adoption of existing serverless frameworks is only growing by the day. Currently the following serverless computing frameworks are available:

  • AWS Lambda
  • Google Cloud Functions
  • Iron.io
  • IBM OpenWhisk
  • Microsoft Azure WebJobs

Function As A Service

Function-as-a-Service lets you run code without provisioning or managing servers. The paradigm of serverless computing is based on the microservices architecture. Serverless computing frameworks invoke autonomous code snippets when triggered by external events.
These snippets are loosely coupled with each other and essentially designed to perform one task at a time. Serverless frameworks are responsible for orchestrating the code snippets at runtime. This way one can deploy applications as independent functions that respond to events, are charged for only when they run, and scale automatically.
Each snippet is versioned and maintained independently of the others. This approach marries the microservices concept with serverless computing.
Serverless computing also lets developers skip provisioning resources based on current or anticipated loads, and put far less effort into capacity planning for new projects. Just as virtual machines made it easy to spin up servers for new applications, serverless computing services make it simple to grow.
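The event-triggered model can be sketched with a toy dispatcher (a purely hypothetical illustration, not the API of any real platform such as AWS Lambda): handlers register for an event type, and the "platform" invokes them only when a matching event arrives.

```python
# Toy function-as-a-service runtime: handlers register for event types and
# are invoked only when a matching event arrives.
HANDLERS = {}

def on(event_type):
    """Decorator that registers a function as a handler for an event type."""
    def register(fn):
        HANDLERS.setdefault(event_type, []).append(fn)
        return fn
    return register

@on("image.uploaded")
def make_thumbnail(event):
    # A stateless snippet doing one task; it holds no server or session state.
    return f"thumbnail for {event['name']}"

def dispatch(event):
    # The platform, not the developer, decides when code runs and how many
    # copies to run; the developer only supplies the functions.
    return [fn(event) for fn in HANDLERS.get(event["type"], [])]

results = dispatch({"type": "image.uploaded", "name": "cat.png"})
assert results == ["thumbnail for cat.png"]
```

In a real platform the dispatcher is the provider's infrastructure, which also handles scaling each function independently and billing only for the invocations that actually run.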

Conclusion

The rise of containers and microservices is the driving force behind serverless computing. Serverless computing turns out to be an excellent choice for applications like event-driven systems, mobile backends, IoT applications, ETL and APIs. For certain use cases the serverless approach can significantly reduce operational cost.
