In this post, you will learn about the various optimization techniques used in Apache Spark.
We can optimize our Spark applications by using techniques such as data serialization, broadcast variables, etc.
Spark provides two options for data serialization:
1 Java serialization
2 Kryo serialization
Kryo serialization is much faster and more compact than Java serialization.
You can start using Kryo by initializing your Spark job with a SparkConf as shown below:
scala> import org.apache.spark._
scala> import org.apache.spark.rdd.RDD
scala> val conf = new SparkConf().setAppName("My App").setMaster("local[*]")
conf: org.apache.spark.SparkConf = org.apache.spark.SparkConf@54c15e
scala> conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
res1: org.apache.spark.SparkConf = org.apache.spark.SparkConf@54c15e
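Kryo works best when you register the classes your job will serialize up front; otherwise Spark has to store the full class name with each object. A minimal sketch (the Point class here is a hypothetical placeholder for your own classes):

scala> case class Point(x: Double, y: Double)
defined class Point

scala> conf.registerKryoClasses(Array(classOf[Point]))
res2: org.apache.spark.SparkConf = org.apache.spark.SparkConf@54c15e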
A broadcast variable keeps a read-only copy of a value from the driver program cached on each executor, rather than shipping a copy of it with every task.
When to use:
If you have a certain task in your Spark job that uses large objects from the driver program, you should turn it into a broadcast variable.
How to use:
You can create one using SparkContext.broadcast:
scala> val m1 = 20
m1: Int = 20
scala> val m2 = sc.broadcast(m1)
m2: org.apache.spark.broadcast.Broadcast[Int] = Broadcast(0)
scala> m2.value
res7: Int = 20
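To see why this helps, imagine a large lookup table that every task needs; broadcasting it ships one copy per executor instead of one per task. A small illustrative stand-in for such a table:

scala> val lookup = sc.broadcast(Map(1 -> "a", 2 -> "b"))
lookup: org.apache.spark.broadcast.Broadcast[scala.collection.immutable.Map[Int,String]] = Broadcast(1)
scala> val rdd = sc.parallelize(Seq(1, 2, 1))
scala> rdd.map(id => lookup.value(id)).collect()
res8: Array[String] = Array(a, b, a)

Note that tasks read the value through lookup.value, never by capturing the original driver-side map directly.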
Spark uses the SparkContext to create broadcast variables; the BroadcastManager and ContextCleaner then control their life cycle.
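You can also end a broadcast variable's life cycle yourself once it is no longer needed, continuing the m2 example from above:

scala> m2.unpersist()   // removes the cached copies from the executors; the value can be re-broadcast lazily if used again
scala> m2.destroy()     // releases all resources; the variable cannot be used after this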
Other optimization techniques include memory tuning, data structure tuning, and garbage collection (GC) tuning.
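These are typically adjusted through SparkConf settings as well. A sketch of what that looks like (the values below are illustrative defaults, not tuned recommendations; the right values depend on your workload):

scala> conf.set("spark.memory.fraction", "0.6")
scala> conf.set("spark.memory.storageFraction", "0.5")
scala> conf.set("spark.executor.extraJavaOptions", "-XX:+UseG1GC")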