In this post, you will learn the various parallelism techniques that can be used in Apache Spark.
Parallelism techniques are needed to fully utilize the cluster's capacity. When reading from HDFS, the default number of partitions equals the number of input splits, which is usually the same as the number of HDFS blocks.
To check the default parallelism value in your cluster, run sc.defaultParallelism in the Spark shell:

scala> sc.defaultParallelism
res3: Int = 2

In my cluster, the default parallelism value is 2.
In the Spark shell, you can increase parallelism by passing a second argument to sc.textFile(), e.g. sc.textFile(path, 10), which asks Spark to create the RDD with at least 10 partitions.
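To get a feel for how a requested partition count spreads records out, here is a minimal plain-Scala sketch that needs no cluster; the 100-line input and the partition count of 10 are illustrative, and in spark-shell the equivalent check would be sc.textFile(path, 10) followed by rdd.getNumPartitions.

```scala
// Plain-Scala sketch (no cluster needed): how a 100-line input
// spreads across a requested minimum of 10 partitions.
val lines = (1 to 100).map(i => s"line $i").toVector
val minPartitions = 10

// Ceiling division so every line lands in exactly one partition.
val perPartition = (lines.size + minPartitions - 1) / minPartitions
val partitions = lines.grouped(perPartition).toVector
```

With 100 lines and 10 requested partitions, each partition ends up with 10 lines; Spark's actual split boundaries additionally respect HDFS block and record boundaries.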
To maximize parallelism, the number of partitions should be two to three times the number of cores present in your cluster.
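That rule of thumb can be computed directly. A small sketch, assuming the local JVM's core count stands in for the cluster's total executor cores (on a real cluster you would use the latter):

```scala
// Rule of thumb from the text: target 2-3x the number of cores.
// availableProcessors reports this JVM's cores; substitute the
// cluster's total executor cores in a real deployment.
val cores = Runtime.getRuntime.availableProcessors
val lowTarget  = cores * 2  // lower bound of the recommended range
val highTarget = cores * 3  // upper bound of the recommended range
```

Any partition count in [lowTarget, highTarget] keeps every core busy while leaving slack for uneven task durations.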
To reduce the number of partitions without a full shuffle, use the RDD method coalesce(numPartitions), where numPartitions is the desired final number of partitions.
If your data has to be reshuffled over the network, for example to increase the number of partitions or to rebalance them evenly, use the RDD method repartition(numPartitions), where numPartitions is the desired final number of partitions.
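The practical difference between the two can be illustrated without a cluster. A plain-Scala sketch, where each inner Vector stands for one partition and the helper functions only mimic the behaviour of the real RDD methods:

```scala
// Illustrative partition layout: 4 partitions of uneven size.
val original: Vector[Vector[Int]] =
  Vector(Vector(1, 2), Vector(3), Vector(4, 5), Vector(6))

// coalesce(n): glue neighbouring partitions together. Records never
// leave their original grouping, so no shuffle, but the resulting
// partitions can stay uneven.
def coalesce(parts: Vector[Vector[Int]], n: Int): Vector[Vector[Int]] = {
  val per = (parts.size + n - 1) / n
  parts.grouped(per).map(_.flatten).toVector
}

// repartition(n): deal every record out round-robin. Sizes become
// balanced, but data moves between partitions -- the full shuffle.
def repartition(parts: Vector[Vector[Int]], n: Int): Vector[Vector[Int]] = {
  val all = parts.flatten
  (0 until n).map { i =>
    all.zipWithIndex.collect { case (v, j) if j % n == i => v }
  }.map(_.toVector).toVector
}

val coalesced   = coalesce(original, 2)
val shuffled    = repartition(original, 2)
```

Both end up with two partitions holding the same records, but coalesce preserves locality while repartition trades network traffic for balance, which is why the guideline is coalesce to shrink and repartition when data must move anyway.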