Optimizing join techniques in Apache Spark

Hi Readers, in this post you will learn the various join optimization techniques that can be used in Apache Spark. Spark has three types of joins. Shuffle hash join (default): a map-reduce-style join. It shuffles the datasets based on the output key; during the reduce phase, it joins the datasets for the same output …
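
As a rough illustration, here is a minimal sketch in the DataFrame API (the table names and rows below are made up): a plain join shuffles both datasets by the join key, while a broadcast() hint ships the small side to every executor and avoids the shuffle entirely.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

object JoinExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("JoinExample").getOrCreate()
    import spark.implicits._

    // Hypothetical datasets: a large fact table and a small lookup table
    val orders = Seq((1, "laptop", 2), (2, "phone", 1))
      .toDF("order_id", "product", "customer_id")
    val customers = Seq((1, "Alice"), (2, "Bob")).toDF("customer_id", "name")

    // Plain join: Spark shuffles both sides by the join key
    val shuffled = orders.join(customers, "customer_id")

    // Broadcast hint: ships the small table to every executor, no shuffle
    val broadcasted = orders.join(broadcast(customers), "customer_id")

    shuffled.show()
    broadcasted.show()
    spark.stop()
  }
}
```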

Continue reading Optimizing join techniques in Apache Spark

Parallelism techniques in Apache Spark

Hi Readers, in this post you will learn the various parallelism techniques that can be used in Apache Spark. We need parallelism techniques to achieve full utilization of the cluster's capacity. In HDFS, the number of partitions is the same as the number of input splits, which is mostly the same …
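
For example, a minimal sketch (the HDFS path and the partition counts are placeholders) of how to inspect and change the number of partitions of an RDD:

```scala
import org.apache.spark.sql.SparkSession

object ParallelismExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("ParallelismExample").getOrCreate()
    val sc = spark.sparkContext

    // Reading from HDFS: by default, one partition per input split
    val lines = sc.textFile("hdfs:///data/input")
    println(s"Initial partitions: ${lines.getNumPartitions}")

    // Increase parallelism with repartition (triggers a full shuffle) ...
    val widened = lines.repartition(100)
    // ... or reduce it cheaply with coalesce (no shuffle)
    val narrowed = widened.coalesce(10)

    println(s"After repartition: ${widened.getNumPartitions}")
    println(s"After coalesce: ${narrowed.getNumPartitions}")
    spark.stop()
  }
}
```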

Continue reading Parallelism techniques in Apache Spark

Optimization techniques in Apache Spark

Hi Readers, in this post you will learn the various optimization techniques used in Apache Spark. We can optimize our Spark applications by using techniques such as data serialization and broadcasting. Data serialization: Spark provides two options for data serialization, (1) Java serialization and (2) Kryo serialization. Compared to Java serialization, Kryo serialization is much faster than …
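
A minimal sketch of switching to Kryo (the Event case class is a made-up example; registering your own classes is optional, but it saves space because Kryo then avoids writing full class names):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

// Hypothetical domain class used in the job
case class Event(id: Long, name: String)

object KryoExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("KryoExample")
      // Switch from the default Java serializer to Kryo
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .registerKryoClasses(Array(classOf[Event]))

    val spark = SparkSession.builder.config(conf).getOrCreate()
    // ... job logic ...
    spark.stop()
  }
}
```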

Continue reading Optimization techniques in Apache Spark

Run a Scala program in Apache Spark

Hi Readers, in this post you will learn how to run a Scala program in Apache Spark. On your machine you need to install Scala, sbt (the build tool), Java 8, and a Spark cluster (on AWS, or use the Cloudera VM). Open a command prompt and type ‘sbt new scala/hello-world.g8’. Write your logic in the src/main/scala/Main.scala file. To compile, run, or package the Scala project, open the …
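
A minimal Main.scala sketch along those lines (it assumes build.sbt declares a spark-sql dependency; local[*] is just for a quick local test):

```scala
// src/main/scala/Main.scala
import org.apache.spark.sql.SparkSession

object Main {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .appName("HelloSpark")
      .master("local[*]") // run locally for a quick test; drop this when submitting to a cluster
      .getOrCreate()

    val numbers = spark.sparkContext.parallelize(1 to 5)
    println(s"Sum of 1..5 = ${numbers.reduce(_ + _)}")

    spark.stop()
  }
}
// From the project root: `sbt compile`, `sbt run`, or `sbt package`
// to build a jar that can be submitted with spark-submit.
```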

Continue reading Run a Scala program in Apache Spark

Machine Learning with Apache Spark

Hi Readers, in this post I will share my learning of 'Machine Learning with Apache Spark'. What is machine learning? Machine learning is a field of computer science that gives computer systems the ability to "learn" (i.e., progressively improve performance on a specific task) with data, without being explicitly programmed. Machine learning is closely related …

Continue reading Machine Learning with Apache Spark

Apache Spark – Convert Parquet files into Avro schema

Hi Readers, in this post I will explain how to convert Parquet files into an Avro schema. Read the Parquet file, then use the write.avro() method from the com.databricks:spark-avro_2.11:3.2.0 package. Apache Parquet is a columnar storage format, built to support very efficient compression and encoding schemes. Apache Avro is a data serialization system: a compact, fast, binary data …
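
A minimal sketch of the conversion (the HDFS paths are placeholders; the import assumes the job is submitted with --packages com.databricks:spark-avro_2.11:3.2.0):

```scala
import org.apache.spark.sql.SparkSession
import com.databricks.spark.avro._ // adds .avro() to the DataFrame reader/writer

object ParquetToAvro {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("ParquetToAvro").getOrCreate()

    // Read the Parquet input and write it back out as Avro
    val df = spark.read.parquet("hdfs:///data/input.parquet")
    df.write.avro("hdfs:///data/output.avro")

    spark.stop()
  }
}
```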

Continue reading Apache Spark – Convert Parquet files into Avro schema

Spark SQL with JSON to Avro schema

Hi Readers, in this post I will explain two things: how to convert a JSON file to Avro format, and how to read Avro data, query it with Spark SQL, and partition the Avro data using some condition. Apache Avro is a data serialization system. A compact, fast, binary data format. A container file to store persistent data. Avro relies on schemas. …
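
A minimal end-to-end sketch (the paths and the name/age/country columns are made up; it assumes the same com.databricks:spark-avro_2.11:3.2.0 package as the previous post):

```scala
import org.apache.spark.sql.SparkSession
import com.databricks.spark.avro._ // adds .avro() to the DataFrame reader/writer

object JsonToAvro {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("JsonToAvro").getOrCreate()

    // Convert the JSON input to Avro
    val json = spark.read.json("hdfs:///data/input.json")
    json.write.avro("hdfs:///data/people.avro")

    // Read the Avro data back and query it with Spark SQL
    val avro = spark.read.avro("hdfs:///data/people.avro")
    avro.createOrReplaceTempView("people")
    val adults = spark.sql("SELECT name, age, country FROM people WHERE age >= 18")

    // Partition the result on a column when writing it back out
    adults.write.partitionBy("country").avro("hdfs:///data/people_by_country")

    spark.stop()
  }
}
```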

Continue reading Spark SQL with JSON to Avro schema