File Optimization and Compression techniques in Apache Hive

Hi Readers,

In this post, you will learn about the various file format optimization and compression techniques that can be used in Apache Hive.

Hive supports TEXTFILE, SEQUENCEFILE, RCFILE, ORC, and PARQUET file formats.

Optimization Techniques:

TEXTFILE:

  • This is the default file format for Hive.
  • Data is not compressed in the text file.
  • It can be compressed with compression tools such as GZip, Bzip2, and Snappy (see the sketch after this list).
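
For example, a minimal sketch of creating a TEXTFILE table and loading a GZip-compressed delimited file into it (the employee_txt table, its columns, and the file path are assumptions made for illustration):

CREATE TABLE employee_txt (
  id     INT,
  name   STRING,
  salary DOUBLE
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

-- A .gz file can be loaded as-is; Hive decompresses it at read time,
-- but a GZip-compressed text file is not splittable
LOAD DATA LOCAL INPATH '/tmp/employee.csv.gz' INTO TABLE employee_txt;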

SEQUENCEFILE:

  • This is a binary storage format for key/value pairs.
  • The sequence file is more compact than a text file and fits well with the MapReduce output format.
  • Sequence files can be compressed at the record or block level, where block-level compression gives a better compression ratio.

To enable block-level compression, apply the following settings:
hive> SET hive.exec.compress.output=true;

hive> SET io.seqfile.compression.type=BLOCK;
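
With the settings above in place, a SEQUENCEFILE table can be populated from an existing table. This is a sketch that reuses the hypothetical employee_txt table from the earlier example:

CREATE TABLE employee_seq (
  id     INT,
  name   STRING,
  salary DOUBLE
)
STORED AS SEQUENCEFILE;

-- With hive.exec.compress.output=true, the sequence files written here
-- are block-compressed
INSERT OVERWRITE TABLE employee_seq
SELECT * FROM employee_txt;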

RCFILE:

  • Record Columnar File (RCFile) is a flat file consisting of binary key/value pairs, similar in structure to a sequence file.
  • The RCFile splits data horizontally into row groups. Then, RCFile saves the row group data in a columnar format by saving the first column across all rows, then the second column across all rows, and so on.
  • This format is splittable and is faster for queries that read only a subset of columns (see the sketch below).
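
A sketch of storing the same hypothetical data as an RCFile table:

CREATE TABLE employee_rc (
  id     INT,
  name   STRING,
  salary DOUBLE
)
STORED AS RCFILE;

-- The INSERT rewrites the rows into row groups that are stored column by column
INSERT OVERWRITE TABLE employee_rc
SELECT * FROM employee_txt;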

ORC:

  • Optimized Row Columnar (ORC) has been available since Hive 0.11.0.
  • It provides a larger default block size of 256 MB (RCFILE has 4 MB and SEQUENCEFILE has 1 MB), optimized for large sequential reads on HDFS for higher throughput, and it produces fewer files, which reduces the load on the NameNode.
  • It also stores basic column statistics, such as MIN, MAX, SUM, and COUNT, as well as a lightweight index that can be used to skip blocks of rows a query does not need (see the sketch below).
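
A sketch of an ORC table; the orc.compress table property selects the codec (ZLIB is the default), and the table and column names are again assumptions:

CREATE TABLE employee_orc (
  id     INT,
  name   STRING,
  salary DOUBLE
)
STORED AS ORC
TBLPROPERTIES ('orc.compress' = 'SNAPPY');

INSERT OVERWRITE TABLE employee_orc
SELECT * FROM employee_txt;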

PARQUET:

  • Parquet is another columnar file format with a design similar to that of ORC.
  • Parquet is supported by a wider range of projects in the Hadoop ecosystem, whereas ORC support was initially limited mainly to Hive and Pig.
  • Hive has supported Parquet natively since 0.13.0 (see the sketch below).
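
A sketch of the equivalent Parquet table (Hive 0.13.0 or later is needed for the STORED AS PARQUET shorthand):

CREATE TABLE employee_parquet (
  id     INT,
  name   STRING,
  salary DOUBLE
)
STORED AS PARQUET;

INSERT OVERWRITE TABLE employee_parquet
SELECT * FROM employee_txt;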

 

Compression Techniques:

To reduce the amount of data transferred between mappers and reducers, enable intermediate output compression; to reduce the size of the output data written to HDFS, enable output compression. Intermediate compression is enabled as follows:

hive> SET hive.exec.compress.intermediate=true;

The compression codec can be specified in mapred-site.xml, hive-site.xml, or directly in the Hive CLI:

hive> SET hive.intermediate.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;

Intermediate compression only saves disk space for queries that are executed as multiple MapReduce jobs.

To save further disk space, the actual Hive output files can be compressed as well.

When the hive.exec.compress.output property is set to true, Hive uses the codec configured by the mapred.output.compression.codec property to compress the data stored in HDFS, as shown below.

These properties can be set in hive-site.xml or in the Hive CLI.

hive> SET hive.exec.compress.output=true;

hive> SET mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
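
Putting it together, a sketch of writing Snappy-compressed output; the export directory and the source table are assumptions for illustration:

SET hive.exec.compress.output=true;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;

-- The files this query writes to HDFS are Snappy-compressed
INSERT OVERWRITE DIRECTORY '/tmp/employee_export'
SELECT * FROM employee_txt;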

You can also bookmark this page for future reference.

You can share this page with your friends.

Follow me, Jose Praveen, for future notifications.
