What is Apache Spark? And Why Use It?

Born out of frustration with the only open-source distributed programming implementation of the time, Apache Spark was created in the UC Berkeley AMPLab (becoming a top-level Apache project in 2014) to replace its predecessor, Hadoop MapReduce.

MapReduce was robust but burdened by excessive boilerplate, serialization, and deserialization. It was designed to anticipate node failure at every step, and as computation became more reliable, that repetitive serialization and deserialization became less essential.

Correspondingly, MapReduce spent most of its time conducting inconsequential operations: as a result of exorbitant serialization, deserialization, and disk I/O, it could devote more than 90% of its time to HDFS reads and writes. Noticing that advancements in MapReduce were not keeping pace with hardware innovations, researchers at UC Berkeley created Apache Spark to address the aforementioned issues in the MapReduce programming model.

In this article we will outline exactly how Spark functions and why it has remained the primary big data processor for so long. We will also contrast Spark with Dask, an emergent parallel/distributed processor praised for its intuitive API.


What Is Spark?

Two major improvements featured in Apache Spark were:

  • Fast data sharing
  • Directed acyclic graphs (DAGs)

Additionally, Spark contains a large set of API hooks supporting various programming languages, including Python, Scala, R, and Java. However, it should be noted that Spark itself runs on the JVM, and the Python and R APIs function merely as wrappers. As a result, there is a slight time cost associated with using these hooks.
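
As a quick illustration, here is a minimal PySpark sketch (assuming the pyspark package is installed); the Python calls below are relayed to the JVM behind the scenes:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("example").getOrCreate()   # starts (or reuses) a JVM-backed session
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "label"])
df.show()   # the Python call is forwarded to the JVM via Py4J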

Resilient Distributed Datasets (RDD)

Recognizing the problems caused by superfluous serialization and deserialization in Hadoop MapReduce, Spark's researchers created Resilient Distributed Datasets (RDDs). These datasets use in-memory data sharing, as opposed to the sluggish network and disk methods implemented in MapReduce. Iterative operations benefit the most, since data between iterations is held in distributed memory instead of being written to disk.

In addition to preventing outrageous amounts of serialization and deserialization, RDDs are strongly typed, immutable JVM objects. This immutability reflects a key aspect of functional programming: Spark does not mutate RDDs directly, it generates new RDDs to reflect changes. Holding RDDs constant also promotes fault tolerance, which we will discuss later in this article. Two general types of operations can be applied to RDDs: transformations and actions.
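
As a small PySpark sketch of the immutability just described (assuming the SparkSession spark from the earlier example), a transformation returns a new RDD and leaves the original untouched:

sc = spark.sparkContext
numbers = sc.parallelize([1, 2, 3, 4, 5])
doubled = numbers.map(lambda x: x * 2)   # produces a brand-new RDD
print(numbers.collect())                 # [1, 2, 3, 4, 5] -- the original is unchanged
print(doubled.collect())                 # [2, 4, 6, 8, 10]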


Transformations

Transformations are lazily evaluated, meaning they are not executed at the exact point where they appear in the code; Spark merely records them. This gives Spark room to optimize the distributed process before anything runs.
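
For instance, continuing the sketch above (assuming the numbers RDD from the previous example), none of the following lines triggers any computation; Spark only records them in the RDD's lineage:

evens = numbers.filter(lambda x: x % 2 == 0)   # lazily recorded, nothing runs yet
squared = evens.map(lambda x: x * x)           # also lazy; still no cluster activity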


Actions

Actions are eagerly evaluated: the computation cannot be delayed, and every transformation leading up to the action must be computed before it can complete. In the interest of reducing runtime, actions should be applied sparingly.
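
Continuing the same sketch (assuming the squared RDD defined above), calling an action is what finally forces the pending transformations to run:

result = squared.collect()   # action: executes the filter and map across the partitions
total = squared.sum()        # another action; the lineage is re-evaluated unless the RDD is cached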

Another key feature of RDDs is that they facilitate parallel execution on each partition. Operations are carried out independently by the worker nodes that store each partition's data, and applying them in a distributed fashion across a single dataset tremendously reduces computation time. Spark also allows the size and number of partitions to be specified, as sketched below.
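
A rough sketch of controlling partitioning in PySpark (again assuming a SparkContext sc):

data = sc.parallelize(range(1000), numSlices=8)   # request 8 partitions up front
print(data.getNumPartitions())                    # 8
coarser = data.repartition(4)                     # reshuffle the data into 4 partitions
partial_sums = data.mapPartitions(lambda part: [sum(part)])   # one task per partition
print(partial_sums.collect())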

Directed acyclic graph (DAG)

As previously stated, the directed acyclic graph is one of Spark's key improvements. A DAG is a graph depicting the sequence of operations to be performed on the target RDD, which allows Spark to optimize computation by maximizing parallel and distributed operations. For example, the following operation produces a DAG that splits into two stages:

sc.parallelize(1 to 100).toDF.count()

The DAG scheduler segments operations into stages; the example above, for instance, splits into two stages. Stages are sets of tasks based on the partitions of the data, and they are relayed to the task scheduler, which launches tasks via a cluster manager. Available cluster managers include Spark's built-in standalone manager, Hadoop YARN, Apache Mesos, and Kubernetes.
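
For reference, here is a rough PySpark analogue of the Scala one-liner above (assuming an active SparkSession spark); explain() prints the plan Spark derives from the DAG:

df = spark.range(1, 101).toDF("value")   # distributed dataset of the numbers 1 to 100
print(df.count())                        # the count() action triggers the two-stage job
df.explain()                             # shows the physical plan built from the DAG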

Fault Tolerance

The combination of immutable RDDs and DAGs permits fault tolerance. If a node crashes in the middle of an operation, the lost data can be recomputed from the RDD lineage, and the cluster manager can assign another node to continue processing where the dead node left off, without data loss.
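
Because every RDD carries its lineage, the chain of operations that would be replayed after a failure can be inspected directly (continuing the earlier sketch with the squared RDD):

lineage = squared.toDebugString()   # PySpark typically returns this as bytes
print(lineage.decode() if isinstance(lineage, bytes) else lineage)   # parallelize -> filter -> map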


Dask

Branching off from Spark's proficient distributed computing, Dask was first released in 2015. Unfortunately for those who love R or the JVM, Dask is exclusive to Python. Dask is considered a lighter-weight version of Spark, but it is very tightly integrated with the SciPy ecosystem of NumPy, Pandas, Scikit-Learn, Matplotlib, Bokeh, and RAPIDS. If your problems go beyond typical ETL and SQL, Dask will likely be the more viable option, especially if you plan to use the SciPy stack. As a result of this affiliation, Dask is well adapted to complex algorithms and machine learning modeling.

Dask expanded on the idea of task-dependency graphs, parallelizing work at a much more granular level.

Taking advantage of as many parallel processes as possible greatly reduces computational time. All Dask operations are lazy: once the task graph is built, it is sent to a scheduler. Scheduler options include synchronous, threaded, multiprocessing, and distributed.
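
A minimal sketch of this laziness with dask.delayed (assuming the dask package is installed); the task graph only executes when compute() is called, on whichever scheduler you pick:

import dask

@dask.delayed
def double(x):
    return 2 * x

@dask.delayed
def add(a, b):
    return a + b

graph = add(double(10), double(20))        # builds a task graph; nothing has run yet
print(graph.compute(scheduler="threads"))  # 60 -- "synchronous", "processes", or a distributed client also work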


Why is Dask gaining popularity?

Dask's seamless use is its most attractive feature. Dask provides near-mirrored versions of packages like NumPy, Pandas, and Scikit-Learn, meaning you can use most of the features you are used to while changing little more than the import. With respect to intuitiveness and accessibility, this makes Dask far more user-friendly than Spark.
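
A small sketch of that mirrored API (the file name and column names here are hypothetical); compared with pandas, often only the import changes:

import dask.dataframe as dd   # instead of: import pandas as pd

df = dd.read_csv("events.csv")                    # hypothetical file; read lazily into partitions
summary = df.groupby("user_id")["amount"].sum()   # familiar pandas-style syntax
print(summary.compute())                          # compute() triggers the parallel execution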


When you should use Spark:

Spark is older than Dask and has long been the predominant big data tool. Thanks to that long-standing prevalence, Spark has become a trusted solution for businesses.

If the objective of your application fits the Map-Shuffle-Reduce paradigm with only simple machine learning, then Spark is the better choice; it is decidedly optimized for uniform, straightforward transformations. Additionally, if the majority of your infrastructure is JVM-based, Spark will harmonize better with your system.
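
For example, a typical map-shuffle-reduce style aggregation looks like this in Spark SQL (the dataset and column names are hypothetical, and an active SparkSession spark is assumed):

sales = spark.read.parquet("sales.parquet")   # hypothetical dataset
by_region = (
    sales.select("region", "revenue")   # map: project only the needed columns
         .groupBy("region")             # shuffle: co-locate rows by key
         .sum("revenue")                # reduce: aggregate per region
)
by_region.show()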


When you should use Dask:

If you plan to conduct further analysis within the Python ecosystem, Dask gains a substantial advantage over Spark. While Spark is geared primarily toward routine business operational tasks, Dask is more proficient when complex algorithms need to be deployed and is better suited to more involved machine learning operations.


Conclusion

Both Dask and Spark parallelize and distribute computation by building directed acyclic graphs (DAGs). Dask is a lighter-weight version of Spark, and it should be considered if you plan to work in Python, especially if your project requires computation outside of the vanilla ETL and SQL combination. Spark, however, has been the largest established big data processor for a long time, and with that prominence comes an exhaustive feature set. For additional delineation, Dask has dedicated an entire page of its documentation to comparing Dask and Spark.

Thanks for reading! If you want to read more about data consulting, big data, and data science, then click below.

5 Data Engineering Project Ideas To Add To Your Resume

Greylock VC and 5 Data Analytics Companies It Invests In

5 SQL Concepts You Need To Know Before Your Next Data Science Or Data Engineering Interview

How To Improve Your Data-Driven Strategy

What In The World Is Dremio And Why Is It Valued At 1 Billion Dollars?

Mistakes That Are Ruining Your Data-Driven Strategy

5 Great Libraries To Manage Big Data With Python