Spark SQL with JSON to Parquet files

Hi Readers,

In this post I will explain two things: how to convert a JSON file to Parquet files, and how to read the Parquet data back, query it with Spark SQL, and partition it by a column.

Apache Parquet is a columnar storage format. Parquet is built to support very efficient compression and encoding schemes.

Step 1:

The JSON dataset is in my HDFS at "/user/edureka_162051/reviews_Cell_Phones_and_Accessories_5.json". Start the Spark shell using "spark2-shell".

Step 2:

Load the JSON data into reviewDF.

scala> val reviewDF = spark.read.json("/user/edureka_162051/reviews_Cell_Phones_and_Accessories_5.json")

Use printSchema() to see the field names and their data types.

scala> reviewDF.printSchema()

Step 3:

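Convert the JSON data to a Parquet file. The filter keeps only the reviews with an overall rating below 4. coalesce reduces the number of partitions without a full shuffle, which can leave partitions holding different amounts of data. I used coalesce(1), so only one Parquet file gets created in HDFS.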

scala> reviewDF.filter("overall < 4").coalesce(1).write.parquet("/user/edureka_162051/parquetdata")
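To see what coalesce(1) changes, you can compare the partition counts before and after. This is only a small sketch: it assumes the reviewDF from Step 2 is still defined in the same spark2-shell session, and lowRatedDF and singlePartDF are names made up for illustration.

```scala
// Assumes reviewDF from Step 2 is already defined in this spark2-shell session.
// lowRatedDF and singlePartDF are illustrative names, not part of the original steps.
val lowRatedDF = reviewDF.filter("overall < 4")

// Number of partitions before coalescing (depends on the input file and the cluster).
println(lowRatedDF.rdd.getNumPartitions)

// coalesce(1) merges the data into a single partition,
// which is why the write produces a single part file.
val singlePartDF = lowRatedDF.coalesce(1)
println(singlePartDF.rdd.getNumPartitions)
```

A single output file is convenient for a small dataset like this one. For larger data, one partition means one task does all the writing, so a higher number may be the better choice.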

Step 4:

Read the Parquet data from HDFS.

scala> val reviewParquetDF = spark.read.parquet("/user/edureka_162051/parquetdata/part-00000-6e546050-c328-4cee-84cd-dd445ff9ac2c.snappy.parquet")
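The part file name contains an ID that Spark generates, so it will differ from run to run. As an alternative, spark.read.parquet also accepts the output directory. The sketch below assumes the Step 3 write has completed and that the directory holds only that output (no other subfolders yet); reviewParquetDirDF is an illustrative name.

```scala
// Assumes the Step 3 write has completed and the directory contains
// only that output (no other subfolders have been written into it yet).
// reviewParquetDirDF is an illustrative name.
val reviewParquetDirDF = spark.read.parquet("/user/edureka_162051/parquetdata")

// Should list the same fields as the single-file read.
reviewParquetDirDF.printSchema()
```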

Use printSchema() to confirm that the field names and data types were preserved.

scala> reviewParquetDF.printSchema()

scala> reviewParquetDF.createOrReplaceTempView("reviewsTable")

scala> val reviewDetailsDF = spark.sql("select reviewerName,reviewText,summary from reviewsTable")

scala> reviewDetailsDF.show(5)
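The same query can also be written with the DataFrame API instead of a SQL string, without registering a temp view. A minimal sketch, assuming reviewParquetDF from this step is defined; reviewDetailsApiDF is an illustrative name.

```scala
// Assumes reviewParquetDF from Step 4 is defined in this session.
// reviewDetailsApiDF is an illustrative name.
val reviewDetailsApiDF = reviewParquetDF.select("reviewerName", "reviewText", "summary")

// Shows the first 5 rows, like the SQL version.
reviewDetailsApiDF.show(5)
```

Both forms go through the same Spark SQL engine, so which one to use is mostly a matter of taste.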

Step 5:

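Use Snappy compression for the Parquet files and partition the output by a column. Snappy is already the default Parquet codec in Spark 2 (note the .snappy.parquet file name in Step 4), so setting it explicitly mainly documents the choice.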

scala> spark.conf.set("spark.sql.parquet.compression.codec", "snappy")

Partition by the overall field, which holds the rating values 1, 2, and 3 after the filter in Step 3.

scala> reviewParquetDF.write.partitionBy("overall").parquet("/user/edureka_162051/parquetdata/partitioned")

Once the partitioning is done, check the HDFS folder: three folders will have been created, one for each value of overall, named in the form overall=<value>.
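To get a benefit from the partitioning, read the partitioned folder back and filter on the partition column. This is a sketch that assumes the Step 5 write has completed; partitionedDF and lowestRatedDF are illustrative names.

```scala
// Assumes the Step 5 write has completed.
// partitionedDF and lowestRatedDF are illustrative names.
val partitionedDF = spark.read.parquet("/user/edureka_162051/parquetdata/partitioned")

// overall is recovered from the folder names as a partition column.
partitionedDF.printSchema()

// Filtering on the partition column lets Spark skip the folders for the other ratings.
val lowestRatedDF = partitionedDF.filter("overall = 1")
lowestRatedDF.select("reviewerName", "summary", "overall").show(5)
```

Queries that filter on overall only need to touch the matching folder, which is the main reason to partition by a column you filter on often.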

If you have any doubts or get stuck, please leave a comment. You can share this page with your friends.

Follow me, Jose Praveen, for future notifications.
