Apache Spark – Convert Parquet files into Avro schema

Hi Readers,

In this post I will explain how to convert Parquet files into Avro format. The approach is to read the Parquet file into a DataFrame and then write it out with the write.avro() method provided by the com.databricks:spark-avro_2.11:3.2.0 package.

Apache Parquet is a columnar storage format. Parquet is built to support very efficient compression and encoding schemes.

Apache Avro is a data serialization system. It provides a compact, fast, binary data format and a container file for storing persistent data, and it relies on schemas.

Step 1:

Start the Spark shell with the spark-avro package so that the Avro dependency is downloaded:

spark2-shell --packages com.databricks:spark-avro_2.11:3.2.0

Step 2:

Import the spark-avro package in the Spark shell:

scala> import com.databricks.spark.avro._

Step 3:

Read the Parquet file from an HDFS directory using the spark.read.parquet() method.

scala> val reviewParquetDF = spark.read.parquet("/user/edureka_162051/parquetdata/part-00000-6e546050-c328-4cee-84cd-dd445ff9ac2c.snappy.parquet")
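
As an optional check before writing anything out, you can inspect the DataFrame to confirm the Parquet data was read correctly. A minimal sketch, assuming the reviewParquetDF created above; the actual columns shown will depend on whatever your Parquet file contains.

scala> reviewParquetDF.printSchema()   // show the column names and types read from Parquet
scala> reviewParquetDF.show(5)         // display the first 5 rows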

Step 4:

Now use the write.avro() method of the com.databricks:spark-avro_2.11:3.2.0 package and pass an HDFS output directory. I chained coalesce(1) so that the data is merged into a single partition and only one Avro file is created in HDFS (coalesce reduces the number of partitions without a full shuffle, so the resulting partitions may hold uneven amounts of data).

scala> reviewParquetDF.coalesce(1).write.avro("/user/edureka_162051/parquettoavrodata")
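
To verify the conversion, you can read the Avro output back with the read.avro() method that the same spark-avro package adds, and compare the row count with the original DataFrame. This is just a suggested sanity check, reusing the paths from the steps above.

scala> val reviewAvroDF = spark.read.avro("/user/edureka_162051/parquettoavrodata")   // read the Avro files back into a DataFrame
scala> reviewAvroDF.count()   // should match reviewParquetDF.count()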


If you have any doubts or get stuck with issues, please comment. You can share this page with your friends.

Follow me, Jose Praveen, for future notifications.
