In this article I will show how a typical big data/machine learning project progresses, taking into account the stream processing and analysis part. I will try to functionally position a wide variety of Apache projects, such as Sqoop, Flume, Kafka, Hadoop, Storm, Spark, Samza, Flink, Beam, Oozie, and Thrift.
The first step of any project is to gather data from various sources. Data comes in three forms: structured, semi-structured, and unstructured.

Structured data (Sqoop)

Sqoop is a tool designed for efficiently transferring bulk data between Hadoop and structured data stores such as relational databases.
Unstructured/Semi-structured data (Kafka vs Flume)
Many people use Kafka as a replacement for a log aggregation solution. Log aggregation typically collects physical log files off servers and puts them in a central place (a file server or HDFS perhaps) for processing. Kafka abstracts away the details of files and gives a cleaner abstraction of log or event data as a stream of messages. This allows for lower-latency processing and easier support for multiple data sources and distributed data consumption. In comparison to log-centric systems like Scribe or Flume, Kafka offers equally good performance, stronger durability guarantees due to replication, and much lower end-to-end latency.
Kafka is a good choice if you code your own producers and consumers. Otherwise Flume is the alternative, if the existing Flume sources and sinks match your requirements and you need a system that can be set up without any development.
Kafka is very much a general-purpose system. You can have many producers and many consumers sharing multiple topics. In contrast, Flume is a special-purpose tool designed to send data to HDFS and HBase. It has specific optimizations for HDFS and it integrates with Hadoop’s security. As a result, Cloudera recommends using Kafka if the data will be consumed by multiple applications, and Flume if the data is designated for Hadoop.
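Kafka's general-purpose model can be captured in a few lines: named topics holding append-only streams of messages that many producers write to and many consumers read independently. The sketch below is a hypothetical in-memory toy, not the Kafka API, meant only to illustrate the abstraction:

```python
from collections import defaultdict

class Broker:
    """Toy model of Kafka's abstraction: topics are append-only
    message streams shared by many producers and consumers."""
    def __init__(self):
        self.topics = defaultdict(list)

    def produce(self, topic, message):
        self.topics[topic].append(message)

    def consume(self, topic, offset):
        # Each consumer tracks its own read position (offset), so
        # consumption by one application does not affect another.
        return self.topics[topic][offset:]

broker = Broker()
broker.produce("logs", "GET /index")
broker.produce("logs", "POST /login")
broker.produce("metrics", "cpu=0.7")

# Two independent consumers of the same topic, at different offsets:
assert broker.consume("logs", 0) == ["GET /index", "POST /login"]
assert broker.consume("logs", 1) == ["POST /login"]
```

Because consumers only hold an offset, adding a new consuming application never disturbs existing ones, which is exactly why Kafka suits data consumed by multiple applications.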
Data gathered can be stored in various platforms like variety of NoSQL databases and file systems, including HBase, MongoDB, MapR-DB, HDFS, MapR-FS, Amazon S3, Azure Blob Storage, Google Cloud Storage, OpenStack Swift, NAS and local files.
There are various distributed computation frameworks that can process big data. After you store the data, you need to perform various actions on it, which is done by stream processing engines, most commonly Apache Storm, Apache Spark, and Apache Samza.
Apache Storm does certain things exceptionally well and is less suitable for others. Storm's primitives are tuples and streams. Contrasting Storm with Spark Streaming, one of the biggest differences between the two is that Storm operates on individual events, as Samza does, while Spark Streaming works on micro-batches.
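The per-event versus micro-batch distinction can be made concrete with a small pure-Python sketch (hypothetical functions, not any framework's API): the first style hands each record to the handler as it arrives, the second groups records into small batches first.

```python
# Per-event processing (Storm/Samza style): each record is handled
# individually as soon as it arrives.
def process_per_event(stream, handler):
    return [handler(event) for event in stream]

# Micro-batch processing (Spark Streaming style): records are
# grouped into small batches and each batch is processed at once.
def process_micro_batches(stream, handler, batch_size):
    results = []
    for i in range(0, len(stream), batch_size):
        batch = stream[i:i + batch_size]
        results.append([handler(e) for e in batch])
    return results

events = [1, 2, 3, 4, 5]
assert process_per_event(events, lambda e: e * 2) == [2, 4, 6, 8, 10]
assert process_micro_batches(events, lambda e: e * 2, 2) == [[2, 4], [6, 8], [10]]
```

The trade-off follows directly: per-event processing minimizes latency, while batching amortizes scheduling overhead and raises throughput.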
Spark is a framework for writing fast, distributed programs. Spark solves similar problems as Hadoop MapReduce does but with a fast in-memory approach and a clean functional style API. With its ability to integrate with Hadoop and inbuilt tools for interactive query analysis (Spark SQL), large-scale graph processing and analysis (Bagel), and real-time analysis (Spark Streaming), it can be interactively used to quickly process and query big data sets. If you’re leveraging an existing Hadoop or Mesos cluster and/or if your processing needs involve substantial requirements for graph processing, SQL access, or batch processing, you might want to look at Spark first.
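Spark's "clean functional style API" chains transformations such as flatMap, map, and reduceByKey. The word count below mimics that style with plain Python functions (it is a sketch of the programming model, not PySpark itself):

```python
from functools import reduce
from itertools import chain

lines = ["spark makes fast jobs", "fast jobs in memory"]

# flatMap: split every line into words, flattening the result
words = chain.from_iterable(line.split() for line in lines)
# map: pair each word with a count of 1
pairs = ((w, 1) for w in words)

# reduceByKey: sum the counts per word
def reduce_by_key(acc, pair):
    word, count = pair
    acc[word] = acc.get(word, 0) + count
    return acc

counts = reduce(reduce_by_key, pairs, {})
assert counts["fast"] == 2 and counts["spark"] == 1
```

In Spark the same chain would run partitioned across a cluster, with intermediate data kept in memory between stages.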
Samza's approach to streaming is to process messages: its stream primitive is neither a tuple nor a DStream, but a message. Streams are divided into partitions, and each partition is an ordered sequence of read-only messages, each with a unique ID (the offset). The framework also supports batching.
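The partition/offset model can be sketched in a few lines of Python (a toy in-memory model with hypothetical names, not the Samza API): messages are routed to a partition by key, appended in order, and addressed by their offset.

```python
class PartitionedStream:
    """Toy Samza-style stream: partitions of ordered, immutable
    messages, each identified by its offset within the partition."""
    def __init__(self, num_partitions):
        self.partitions = [[] for _ in range(num_partitions)]

    def send(self, key, message):
        # The key determines the partition, so messages with the
        # same key keep their relative order.
        p = hash(key) % len(self.partitions)
        self.partitions[p].append(message)
        return p, len(self.partitions[p]) - 1   # (partition, offset)

    def read(self, partition, offset):
        return self.partitions[partition][offset]

stream = PartitionedStream(num_partitions=4)
p, off = stream.send("user-42", "clicked")
assert stream.read(p, off) == "clicked"
assert off == 0
```

Because a partition is append-only and offsets are stable, a consumer can checkpoint its offset and resume exactly where it left off after a failure.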
| | Storm | Spark Streaming | Samza |
| --- | --- | --- | --- |
| Language support | Most (Java, Clojure, and any non-JVM language) | Medium (Scala, Java, Python) | Least (Java, Scala; JVM languages only) |
| Throughput | 10K+ records per node | 100K+ records per node | 10K+ records per node |
If you need further information about real-time streaming with Storm and Spark, please read this presentation.
You can use any of them, but each has areas where it holds an edge over the others. In particular:
- CACHE INVALIDATION
- DISTRIBUTED RPC
New interesting trends: near and far future
A newer technology has also come to market that aims to improve on the existing engines: Apache Flink.
Flink’s core is a streaming dataflow engine that provides data distribution, communication, and fault tolerance for distributed computations over data streams.
Flink includes several APIs for creating applications that use the Flink engine:
- DataStream API for unbounded streams, embedded in Java and Scala,
- DataSet API for static data, embedded in Java, Scala, and Python, and
- Table API with a SQL-like expression language, embedded in Java and Scala.
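A typical operation the DataStream API is built for is windowed aggregation over a timestamped, unbounded stream. Below is a pure-Python sketch of a tumbling-window sum (an illustration of the concept, not the Flink API):

```python
# Tumbling windows partition time into fixed, non-overlapping
# intervals; each event's timestamp assigns it to exactly one window.
def tumbling_window_sums(events, window_size):
    """events: iterable of (timestamp, value) pairs.
    Returns {window_start: sum_of_values_in_window}."""
    windows = {}
    for ts, value in events:
        window_start = (ts // window_size) * window_size
        windows[window_start] = windows.get(window_start, 0) + value
    return windows

events = [(0, 1), (3, 2), (5, 10), (9, 4), (12, 7)]
# Windows of 5 time units: [0,5) -> 1+2, [5,10) -> 10+4, [10,15) -> 7
assert tumbling_window_sums(events, 5) == {0: 3, 5: 14, 10: 7}
```

In a real streaming engine the windows would be emitted incrementally as event time advances, rather than collected in a dictionary at the end.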
Apache Spark is a framework that likewise supports batch and stream processing. Flink's batch API looks quite similar to Spark's and addresses similar use cases, but the two differ in their internals. For streaming, the frameworks follow very different approaches (micro-batches versus native streaming), which makes them suitable for different kinds of applications. Flink is substantial and useful, but Spark is not the stream processing engine most similar to Flink.
Beam is the future of streaming and batch data processing. Apache Beam is an open source, unified model for defining and executing data-parallel processing pipelines, as well as a set of language-specific SDKs for constructing pipelines and runtime-specific Runners for executing them.
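Beam's core idea is that a pipeline is a chain of transforms applied to collections, with the same pipeline running on different batch or streaming runners. The toy below sketches that composition model in pure Python (hypothetical names, not the Beam SDK, which uses `|`-chained PTransforms):

```python
class Pipeline:
    """Toy pipeline: a collection threaded through a chain of
    transforms, each a plain function from list to list."""
    def __init__(self, data):
        self.data = list(data)

    def apply(self, transform):
        self.data = transform(self.data)
        return self  # allow chaining, Beam-style

# Transforms are ordinary functions, so a "runner" only needs to
# know how to apply them -- locally, on a cluster, or on a stream.
to_upper = lambda pc: [x.upper() for x in pc]
keep_short = lambda pc: [x for x in pc if len(x) <= 4]

result = Pipeline(["beam", "spark", "flink"]).apply(to_upper).apply(keep_short).data
assert result == ["BEAM"]
```

Separating the pipeline description from its execution is what lets one Beam program target Dataflow, Spark, or Flink runners unchanged.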
Different tools for analysis
There are 3 types of analysis:
- Exploratory analysis. Use Pig/Hive, etc.
- Predictive analysis and machine learning. Use Mahout, WEKA, Oryx, H2O, etc. (Oryx is a realization of the Lambda Architecture built on Spark and Kafka, specialized for real-time, large-scale machine learning; the Lambda Architecture enables real-time responses to queries over petabytes of data.)
- Prescriptive analysis (exploratory + predictive analysis)
Monitoring all the data flow
Here Oozie is a good choice. It is a workflow scheduler system for MapReduce jobs modeled as DAGs (Directed Acyclic Graphs). The Oozie Coordinator can trigger jobs by time (frequency) and data availability.
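The DAG scheduling idea behind Oozie reduces to this: each job lists its dependencies, and a job runs only after all of its dependencies have finished. A minimal pure-Python sketch of such a scheduler (hypothetical job names, not the Oozie XML workflow language):

```python
# A workflow DAG: {job: [jobs it depends on]}. Jobs are "run"
# (here, just appended to an order) depth-first so that every
# dependency finishes before its dependents start.
def run_workflow(dependencies):
    order, done = [], set()

    def run(job):
        if job in done:
            return
        for dep in dependencies.get(job, []):
            run(dep)
        done.add(job)
        order.append(job)

    for job in dependencies:
        run(job)
    return order

deps = {"import": [], "clean": ["import"], "train": ["clean"], "report": ["clean"]}
order = run_workflow(deps)
assert order.index("import") < order.index("clean") < order.index("train")
assert order.index("clean") < order.index("report")
```

A production scheduler adds what this sketch omits: cycle detection, retries, and the time- and data-availability triggers that the Oozie Coordinator provides.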
Apache ZooKeeper – ZooKeeper is an application library with two principal implementations of the APIs (Java and C) and a service component, implemented in Java, that runs on an ensemble of dedicated servers. ZooKeeper is for building distributed systems: it simplifies the development process, making it more agile and enabling more robust implementations.
Apache Thrift – Thrift provides a framework for developing and accessing remote services. It allows developers to create services that can be consumed by any application written in a language for which Thrift bindings exist. Thrift manages serialization of data to and from a service, as well as the protocol that describes a method invocation, response, etc. Instead of writing all the RPC code yourself, you can get straight to your service logic.
Highly Recommended Reading
- New to Spark? Start with Apache Spark Key Terms, Explained
- Very useful two-part series about the evolution of data processing, with a focus on streaming systems, unbounded data sets, and the future of big data: First part, Second part
- Google Cloud Dataflow is a fully managed cloud service for creating and evaluating data processing pipelines at scale. A Dataflow pipeline can operate in both batch and streaming mode. Google Cloud Dataflow is part of the Google Cloud Platform, but the Google Cloud Dataflow SDK is also part of the Apache Beam incubator project, and this could be an important step toward a powerful model for data-parallel processing, both streaming and batch, portable across a variety of runtime platforms. Two advisable readings here are: Why Apache Beam? A Google perspective and Dataflow/Beam & Spark: A programming model comparison
- Are you looking for a comparison between Google Cloud Platform, Microsoft Azure and Amazon AWS? Here a very good resource: AWS vs Azure vs Google Cloud Platform – Analytics & Big Data
- Big Data Landscape and the big picture
- Some areas where Storm excels include near-real-time analytics, natural language processing, data normalization, and ETL transformations. It also stands apart from traditional MapReduce and other coarse-grained technologies by providing fine-grained transformations that allow very flexible processing topologies.
- Storm is not capable of stateful operations, which are essential in making real-time decisions. The analysis of stateful data is at the heart of fast data applications such as personalization, customer engagement, recommendation engines, alerting, authorization, and policy enforcement. Without additional components such as ZooKeeper and Cassandra, Storm is unable to look up dimension data, update an aggregate, or act directly on an event (that is, make real-time decisions).
- State of the art in techniques used in the development of Big Data applications, and related technology offerings available on the market