Big data is ubiquitous nowadays. Among the tools that process all that information, Apache Spark and Hadoop MapReduce get the most attention. When people mention them together, they usually mean to compare Spark vs. MapReduce. Amazon, eBay, Yahoo!, and other major players have adopted Spark, and it seems to have seized the initiative in the big data analytics space.
If you’d like to know why, or aren’t sure which big data framework is right for your business, this article covers the important differences.
Apache Hadoop got its start in 2006 and effectively triggered the evolution of big data. The framework lets users process large amounts of information across highly scalable computer clusters using simple programming models. Its primary components are the Hadoop Distributed File System (HDFS) for storage, YARN for resource management, and MapReduce for processing.
MapReduce processes structured and unstructured data in a parallel and distributed setting. The Mapper task transforms the input into key-value pairs, which the framework sorts and groups by key; the Reducer task then combines each group into a smaller, aggregated result.
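The division of labor can be illustrated with a toy, in-process sketch in plain Python (a conceptual model, not the actual Hadoop API): a map phase emits key-value pairs, the framework shuffles and groups them by key, and a reduce phase aggregates each group.

```python
from collections import defaultdict

def map_phase(documents):
    """Mapper: emit a (word, 1) pair for every word in the input."""
    for doc in documents:
        for word in doc.split():
            yield (word.lower(), 1)

def shuffle_phase(pairs):
    """Shuffle/sort: group all emitted values by key, as the
    framework does between the map and reduce phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reducer: combine each key's values into a single result."""
    return {word: sum(counts) for word, counts in groups.items()}

docs = ["the quick brown fox", "the lazy dog"]
counts = reduce_phase(shuffle_phase(map_phase(docs)))
# counts["the"] == 2, counts["fox"] == 1
```

In real Hadoop, each phase runs on many machines and the shuffle moves data over the network; the toy keeps everything in one process to show the data flow only.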
Spark originated at UC Berkeley in 2009. It’s another Apache project focused on parallel processing across a cluster. Unlike Hadoop MapReduce, which reads and writes files to HDFS at every step, it works in-memory. The information is processed using Resilient Distributed Datasets (RDDs). An RDD is an immutable distributed collection of objects that can be operated on in parallel. The tool can also spill to disk for volumes that don’t entirely fit into memory. Spark Core drives the scheduling, optimizations, and the RDD abstraction. On top of it run MLlib for machine learning and GraphX for graph processing.
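The core idea of an RDD — an immutable, partitioned collection whose transformations produce new collections rather than mutating the old one — can be sketched in plain Python (a conceptual toy, not the real Spark API):

```python
class ToyRDD:
    """Toy stand-in for an RDD: an immutable, partitioned collection.
    Transformations return a new collection; the original is untouched."""
    def __init__(self, partitions):
        self.partitions = tuple(tuple(p) for p in partitions)  # frozen

    def map(self, fn):
        return ToyRDD([[fn(x) for x in part] for part in self.partitions])

    def filter(self, pred):
        return ToyRDD([[x for x in part if pred(x)] for part in self.partitions])

    def collect(self):
        return [x for part in self.partitions for x in part]

rdd = ToyRDD([[1, 2], [3, 4]])       # two partitions
squares = rdd.map(lambda x: x * x).filter(lambda x: x > 4)
# squares.collect() == [9, 16]; the original rdd is unchanged
```

In real Spark, each partition lives on a different executor and transformations are lazy; the toy only shows immutability and the per-partition model.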
Spark can run either independently or on the Hadoop YARN cluster manager and can read existing Hadoop data. Having no file management system of its own, it has to rely on the Hadoop Distributed File System or another storage solution. It’s even listed as a module on Hadoop’s project page. Hence, the popular Google query “Hadoop vs. Spark” is imprecise: it’s Hadoop MapReduce and Spark that are comparable. Now we’ll take a look at the attributes that matter most for businesses.
Spark is known for its speed. It has been found to run up to ten times faster on disk and up to 100 times faster in-memory than MapReduce. Such processing delivers near real-time analytics, making the tool suitable for credit card processing systems, IoT sensors, security analytics, marketing campaigns, social media sites, machine learning, and log monitoring.
For batch processing, huge volumes of information are collected over a period of time. Then, all of it is processed at once. That approach works well with large, static data sets, e.g., for the calculation of a country’s average income, but does not help businesses react to changes in real time. It’s the stream processing approach that is optimal for real-time predictive analytics or machine learning tasks that require immediate output. The input for it is generated by real-time event streams, e.g., Facebook with millions of events occurring per second.
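The contrast between the two approaches can be sketched with the average-income example above (plain Python; the function names are illustrative):

```python
# Batch: collect everything over a period, then process it all at once.
def batch_average(events):
    return sum(events) / len(events)

# Stream: maintain a running result that is updated as each event arrives,
# so an up-to-date answer is available immediately.
def stream_averages(events):
    total, count = 0, 0
    for value in events:
        total += value
        count += 1
        yield total / count

incomes = [30_000, 50_000, 40_000]
batch_result = batch_average(incomes)            # one answer at the end
stream_results = list(stream_averages(incomes))  # an answer after every event
# batch_result == 40000.0; stream_results == [30000.0, 40000.0, 40000.0]
```

Both end at the same figure, but only the streaming version could have reacted after the first event — the property that matters for real-time analytics.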
Spark is suitable for both approaches. It speeds up batch processing via in-memory computation and processing optimization. With GraphX, users can view the same data as both graphs and collections, and can transform and join graphs with RDDs. Its real-time processing capability also makes Spark a good choice for big data analytics, and its MLlib supports iterative machine learning applications in-memory.
MapReduce stores intermediate results on disk and processes the data in sequential stages across various nodes (machines) and even collections of nodes. The separate processing results are then combined to deliver the final output. MapReduce is thus scalable and has proved efficient with very large data sets.
Ease of use is one of Spark’s hallmarks. There are user-friendly APIs for its native language, Scala, as well as for Java and Python, plus Spark SQL. The latter is similar to SQL, which makes it easier for SQL developers to pick up. Since Spark allows for streaming, batch processing, and machine learning in the same cluster, it’s easy to simplify the data processing infrastructure. An interactive shell gives users prompt feedback for queries and other actions.
Hadoop MapReduce is written in Java, is difficult to program directly, and requires additional abstractions. There’s no interactive mode in MapReduce, though add-on tools such as Hive and Pig can facilitate the work.
Both frameworks are tolerant to failures within a cluster. Hadoop was designed to run on thousands of machines for distributed processing, and at that scale some of them are statistically bound to be down at any given time. If that happens, the system can rebuild the affected files from replicated blocks elsewhere. MapReduce’s TaskTrackers also provide an effective method of fault tolerance, but recovery can slow down operations that experience a failure.
RDDs and various data storage models ensure Spark’s fault tolerance. Initially, data-at-rest is stored in the fault-tolerant HDFS. As an RDD is built, so is a lineage that remembers how the data set was constructed, and RDDs can persist a data set in memory across operations. If any partition of an RDD is lost, it can be recomputed from the source using that lineage. However, data replicated across executor nodes may still be lost if a node fails or driver-to-executor communication breaks down.
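Lineage-based recovery can be sketched in plain Python (a conceptual toy, not Spark’s implementation): each derived data set records its durable source and the transformations applied, so a lost in-memory copy can simply be recomputed.

```python
class LineageRDD:
    """Toy sketch: a data set that remembers its source and the
    transformations applied to it (its lineage)."""
    def __init__(self, source, transforms=()):
        self.source = source          # durable input (HDFS in real Spark)
        self.transforms = transforms  # recorded lineage
        self.cache = None             # in-memory copy, may be lost

    def map(self, fn):
        # A transformation extends the lineage; nothing is computed yet.
        return LineageRDD(self.source, self.transforms + (fn,))

    def compute(self):
        # Replay the full lineage from the durable source.
        data = list(self.source)
        for fn in self.transforms:
            data = [fn(x) for x in data]
        return data

    def collect(self):
        if self.cache is None:          # cache lost (or never built):
            self.cache = self.compute() # rebuild from lineage
        return self.cache

rdd = LineageRDD([1, 2, 3]).map(lambda x: x + 1).map(lambda x: x * 10)
first = rdd.collect()      # [20, 30, 40]
rdd.cache = None           # simulate losing the in-memory partition
recovered = rdd.collect()  # rebuilt from lineage, identical result
```

Because the lineage is small metadata rather than a full data copy, this recovery is cheap to record — which is why Spark can afford fault tolerance without replicating every intermediate result.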
MapReduce has enjoyed better security than Spark so far. HDFS supports access control lists (ACLs) and a traditional file permissions model. For user control in job submission, Hadoop provides Service Level Authorization, which ensures that clients have the right permissions. Hadoop supports Kerberos, a trusted authentication protocol, and third-party solutions such as LDAP for authentication. Those solutions also offer encryption for data in flight and data at rest. MapReduce can also integrate with Hadoop security projects such as Knox Gateway or Sentry.
Spark’s security model is evolving. However, organizations can run it on HDFS, taking advantage of its ACLs and file-level permissions, or on YARN leveraging Kerberos authentication.
As big data keeps growing, cluster sizes are expected to increase to maintain throughput expectations. Both MapReduce and Spark were built with that idea and are scalable using HDFS. However, Spark’s optimal performance setup requires plenty of random-access memory (RAM).
Both Hadoop and Spark are open-source and come for free. The major investment will be in hardware, personnel, or outsourcing the development. The total cost includes hiring a team that understands cluster administration, hardware and software purchases, and maintenance.
When it comes to costs, look at the business’s requirements. If it’s about batch-processing very large amounts of information, MapReduce is preferable: hard disk space is cheaper than memory. You would, however, need a lot of disk space, fast disks, and more systems to distribute the disk input/output.
If real-time data processing is required, Spark is preferable. It takes large amounts of RAM to run everything in-memory, but it can work with standard disks running at standard speeds. Still, by requiring significantly fewer systems, the technology may reduce the cost per unit of computation even with the extra RAM requirement.
A comparison of Apache Spark vs. Hadoop MapReduce shows that both are good in their own sense. Both are driven by the goal of enabling faster, scalable, and more reliable enterprise data processing.
The choice of a big data technology stack should suit your business goals. MapReduce is attractive to businesses that need vast data sets brought under control by commodity systems. However, as the industry’s needs evolve, Spark stands out for machine learning scenarios and is far better suited for real-time analytics. Thanks to its high compatibility with the Hadoop ecosystem, it’s the favorite in data science and seems to be replacing MapReduce rapidly.
Despite all the comparisons of MapReduce vs. Spark, businesses can benefit from their synergy in many ways. Spark’s speed, agility, and ease of use complement MapReduce’s lower cost of operation. Most importantly, together they can deliver a mix of real-time and batch processing capabilities.