Parallelism techniques in Apache Spark

Hi Readers,

In this post you will learn the various parallelism techniques that can be used in Apache Spark.

We need parallelism techniques to fully utilize the cluster's capacity. When reading from HDFS, the number of partitions equals the number of input splits, which is usually the same as the number of HDFS blocks.

Example:

If you want to check the default parallelism value in your cluster, use sc.defaultParallelism in your Spark shell.

scala> sc.defaultParallelism
res3: Int = 2

In my cluster the defaultParallelism value is 2.

In the Spark shell we can increase parallelism by passing the value 10 as the second argument to sc.textFile(), which means the resulting RDD is created with 10 partitions.

scala> sc.textFile("/user/edureka_162051/HackerRank-Developer-Survey-2018-Values.csv", 10)
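
You can verify the partition count by calling getNumPartitions on the returned RDD. A minimal sketch in the Spark shell, assuming the same CSV path as above exists in HDFS (note that the second argument to sc.textFile() is a minimum, so the actual count can be higher for very large files):

scala> val surveyRdd = sc.textFile("/user/edureka_162051/HackerRank-Developer-Survey-2018-Values.csv", 10)
scala> surveyRdd.getNumPartitions   // at least 10 partitions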

Important:
To maximize parallelism, the number of partitions should be two to three times the number of cores in your cluster.
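
As a rough guide, you can derive a target partition count from the cluster's default parallelism. A minimal sketch, assuming sc.defaultParallelism reflects the total cores available to your application (targetPartitions is just an illustrative name):

scala> val targetPartitions = sc.defaultParallelism * 3
scala> val rdd = sc.textFile("/user/edureka_162051/HackerRank-Developer-Survey-2018-Values.csv", targetPartitions)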

 

coalesce(numOfPartitions):

Reduce the number of partitions using an RDD method called coalesce(numOfPartitions), where numOfPartitions is the final number of partitions. Because coalesce merges existing partitions without a full shuffle, it is the cheaper option when you only need to reduce the partition count.
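
A minimal sketch in the Spark shell, assuming an RDD loaded with 10 partitions as in the earlier example:

scala> val rdd = sc.textFile("/user/edureka_162051/HackerRank-Developer-Survey-2018-Values.csv", 10)
scala> val reduced = rdd.coalesce(2)   // merges existing partitions, avoids a full shuffle
scala> reduced.getNumPartitions
res4: Int = 2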

repartition(numOfPartitions):

If your data has to be reshuffled over the network, for example to increase the number of partitions, use the RDD method called repartition(numOfPartitions), where numOfPartitions is the final number of partitions. Unlike coalesce, repartition always performs a full shuffle.
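
A minimal sketch in the Spark shell, assuming you want to spread a small number of partitions across more cores (the names are illustrative):

scala> val rdd = sc.textFile("/user/edureka_162051/HackerRank-Developer-Survey-2018-Values.csv", 2)
scala> val rebalanced = rdd.repartition(10)   // full shuffle: data is redistributed evenly over the network
scala> rebalanced.getNumPartitions
res5: Int = 10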

You can also bookmark this page for future reference.

You can share this page with your friends.

Follow me, Jose Praveen, for future notifications.
