Spark SQL with JSON to Avro schema

Hi Readers,

In this post I will explain two things. How to convert JSON file to avro format. Read avro data, use sparksql to query and partition avro data using some condition.

Apache Avro is a data serialization system. A compact, fast, binary data format. A container file, to store persistent data. Avro relies on schemas.

Step 1:

The JSON dataset is in my hdfs at ‘user/edureka_162051/reviews_Cell_Phones_and_Accessories_5.json’ then start the spark shell using “spark2-shell –packages com.databricks:spark-avro_2.11:3.2.0”

Step 2:

Import import com.databricks.spark.avro._ package in the spark-shell. Load the JSON data into reviewDF.

scala> val reviewDF = spark.read.json(“/user/edureka_162051/reviews_Cell_Phones_and_Accessories_5.json”)

Use printSchema() to know the fields and characteristics.

reviewDF.printSchema()

Step 3:

Convert the JSON data to avro format. I used coalesce(coalesce results in partitions with different amounts of data) So i have used 1. Only 1 avro file gets created in hdfs.

scala> reviewDF.filter(“overall < 4”).coalesce(1).write.avro(“/user/edureka_162051/avrodata”)

Step 4:

Read the avro data from hdfs.

scala> val reviewAvroDF = spark.read.avro(“/user/edureka_162051/avrodata/part-00000-eca1b21e-78dc-44fc-95ea-4ebd02897721.avro”)

Use printSchema() to know the fields and characteristics.

scala> reviewAvroDF.printSchema()

Use show(10) to get the top 10 data in tabular format.

scala> reviewAvroDF.select(“reviewerName”,”reviewText”,”reviewTime”).show(10)

avro_sparkscala2

Step 5:

Use uncompressed, snappy, and deflate compression to compress the avro data and partition using column. Here I have used deflate compression and set the deflate level to 5.

scala> spark.conf.set(“spark.sql.avro.compression.codec”, “deflate”)

scala> spark.conf.set(“spark.sql.avro.deflate.level”, “5”)

Partition using overall field ( has rating values 1,2,3 ).

scala> reviewAvroDF.write.partitionBy(“overall”).avro(“/user/edureka_162051/avrodata/partitioned”)

Once the partitioned has been done. Please check the hdfs folder 3 folders will be created as show below.

avro_sparkscala1

If you have any doubts / stuck with issues please comment. You can share this page with your friends.

Follow me Jose Praveen for future notifications.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

This site uses Akismet to reduce spam. Learn how your comment data is processed.