New Ecosystem of Big Data Pipelines Push the Limits of Modern Data Streaming
Posted: September 28, 2017
Data is being generated in truly massive amounts every day. Recent estimates claim that 2.7 zettabytes of data exist in the digital universe today (big-data-interesting-facts). This data is produced by a whole host of emerging mobile technologies, Internet of Things (IoT) sensors, social media, and web activity histories. Making sense of and using this data has become a key objective for even the most modest business looking to make critical decisions. In today's business world, what was once secondary data has become the primary data driving business growth.

But the real challenge of big data is not storing the data or analyzing it with machine learning; it is channeling all of this unstructured data from its sources to multiple downstream analysis engines. To handle the large data throughput, big data software pipelines must run on distributed cluster platforms, adding complexity to an already difficult problem. Traditional ETL (extract, transform, and load) database solutions do not scale easily and are not well suited to unstructured, real-time streaming data. This brief article looks at the key challenges that limit data streaming and at how modern tools were designed to tackle them.

Key Challenges of Modern Big Data

There are three fundamental challenges that limit big data streaming:
Several data streaming solutions, such as Google Cloud Dataflow, Storm, Samza, and Spark, have emerged to handle these problems. They are loosely based on an old idea: asynchronous event messaging.

Messaging-Based Systems for Stream Processing

Event messaging is a natural way to deal with sequential stream processing. Traditional message queues, such as ActiveMQ or RabbitMQ, can provide reliability and scalability; however, they lack temporal persistence and are not easily maintained. Log aggregation systems (Flume, Scribe) have commonly been used for temporal persistence of events. Inspired by both messaging and logging systems, Kafka is a low-latency distributed messaging system that acts as an event ledger and is designed specifically for distributed platforms. It is based on a publish/subscribe metaphor and can handle near-real-time asynchronous data streaming. Because it is a pub/sub system, it can support back-pressure and reactive programming: it essentially acts as a buffer for incoming data streams, solving the temporal persistence problem by retaining events until consumers are ready to process them downstream.

A typical use case for a big data streaming service is processing website activity. For example, whenever a page is loaded, a view event is saved and sent to the messaging system, which feeds multiple downstream channels: storing the message for future analysis, triggering alerts, sending email notifications, updating user profile information, and so on. A practical example can be found here.

The Emerging Ecosystem

The traditional boundaries between messaging, log aggregation, and streaming data platforms are increasingly blurred. A mature and exciting ecosystem of scalable distributed software platforms has emerged that will push big data into the future.
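The publish/subscribe log metaphor described above can be illustrated with a small, self-contained Python sketch. This is not Kafka's actual API; `Topic`, `publish`, and `poll` are hypothetical names for a minimal in-memory stand-in that shows how an append-only event log with independent per-consumer offsets provides temporal persistence and fan-out to multiple downstream consumers, unlike a traditional queue that deletes each message after one delivery.

```python
from dataclasses import dataclass, field

@dataclass
class Topic:
    """A minimal append-only event log, loosely mimicking a Kafka topic.

    Events remain in the log after delivery, so each consumer reads at
    its own pace through an independent offset -- the temporal
    persistence that a traditional message queue lacks.
    """
    log: list = field(default_factory=list)
    offsets: dict = field(default_factory=dict)  # consumer name -> next index

    def publish(self, event):
        """Append an event to the shared log."""
        self.log.append(event)

    def poll(self, consumer):
        """Return every event this consumer has not yet seen, then advance
        its offset. Slow consumers simply catch up on a later poll."""
        start = self.offsets.get(consumer, 0)
        events = self.log[start:]
        self.offsets[consumer] = len(self.log)
        return events

# Hypothetical page-view scenario: one producer, several downstream consumers.
views = Topic()
views.publish({"page": "/home", "user": "alice"})
views.publish({"page": "/pricing", "user": "bob"})

analytics = views.poll("analytics")  # e.g. store events for offline analysis
alerts = views.poll("alerts")        # e.g. trigger alert rules

# New events keep arriving; earlier consumers see only what is new,
# while a late subscriber replays the whole log from the beginning.
views.publish({"page": "/docs", "user": "alice"})
assert len(views.poll("analytics")) == 1  # only the new event
assert len(views.poll("emailer")) == 3    # late subscriber sees everything
```

Because delivery is decoupled from production, the log also throttles naturally: producers append at their own rate while each consumer pulls only when ready, which is the buffering behavior the article attributes to pub/sub streaming systems.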