In this post, you will learn the various file format optimization and compression techniques that can be used in Apache Hive.
Hive supports TEXTFILE, SEQUENCEFILE, RCFILE, ORC, and PARQUET file formats.
- TEXTFILE is the default file format for Hive.
- Data is not compressed in the text file.
- It can be compressed with compression tools such as Gzip, Bzip2, and Snappy.
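As a sketch, a TEXTFILE table can be created and its query output compressed with Gzip (the table and column names here are illustrative, not from the original post):

```sql
-- Illustrative table; 'employee' and its columns are assumed names.
CREATE TABLE employee (
  id INT,
  name STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

-- Compress query results written back to HDFS with Gzip.
SET hive.exec.compress.output=true;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;
```

Note that a Gzip-compressed text file must be read as a whole, so this trades query parallelism for storage savings.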
- SEQUENCEFILE is a binary storage format for key/value pairs.
- The sequence file is more compact than a text file and fits well with the MapReduce output format.
- Sequence files can be compressed at the record or block level, where block-level compression gives a better compression ratio.
To enable block-level compression, apply the following settings:
hive> SET hive.exec.compress.output=true;
hive> SET io.seqfile.compression.type=BLOCK;
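With those settings in place, an existing table can be rewritten as a block-compressed sequence file; a minimal sketch, assuming a source table named employee:

```sql
-- Enable block-level compression for the written output.
SET hive.exec.compress.output=true;
SET io.seqfile.compression.type=BLOCK;

-- 'employee' is an assumed source table; 'employee_seq' is an illustrative name.
CREATE TABLE employee_seq
STORED AS SEQUENCEFILE
AS SELECT * FROM employee;
```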
- RCFILE (Record Columnar File) is a flat file consisting of binary key/value pairs that shares many similarities with a sequence file.
- The RCFile splits data horizontally into row groups, then saves each row group in columnar format: the first column across all rows, then the second column across all rows, and so on.
- This format is splittable and faster to read, since queries can skip the columns they do not need.
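A sketch of storing data in the RCFile layout, again assuming an existing employee table:

```sql
-- 'employee' is an assumed source table; 'employee_rc' is an illustrative name.
CREATE TABLE employee_rc
STORED AS RCFILE
AS SELECT * FROM employee;
```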
- ORC (Optimized Row Columnar) has been available since Hive 0.11.0.
- It provides a larger default block size of 256 MB (RCFILE defaults to 4 MB and SEQUENCEFILE to 1 MB), optimized for large sequential reads on HDFS for higher throughput and fewer files, which reduces the load on the NameNode.
- It also stores basic statistics, such as MIN, MAX, SUM, and COUNT, on columns, as well as a lightweight index that can be used to skip blocks of rows that are not relevant to a query.
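An ORC table can also set its compression codec through table properties; a minimal sketch with illustrative names:

```sql
-- Illustrative ORC table; orc.compress selects the codec (SNAPPY here).
CREATE TABLE employee_orc (
  id INT,
  name STRING
)
STORED AS ORC
TBLPROPERTIES ("orc.compress" = "SNAPPY");
```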
- PARQUET is another columnar file format with a design similar to that of ORC.
- Parquet is supported by the majority of projects in the Hadoop ecosystem, whereas ORC is mainly supported by Hive and Pig.
- Hive has supported Parquet natively since 0.13.0.
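Since Hive 0.13.0, a Parquet table can be declared directly; a minimal sketch with illustrative names:

```sql
-- Illustrative Parquet table (native STORED AS PARQUET requires Hive 0.13.0+).
CREATE TABLE employee_parquet (
  id INT,
  name STRING
)
STORED AS PARQUET;
```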
To reduce the amount of data transferred between mappers and reducers, enable intermediate output compression; the size of the output data in HDFS can likewise be reduced with output compression.

hive> SET hive.exec.compress.intermediate=true;

The compression codec can be specified in mapred-site.xml, hive-site.xml, or the Hive CLI:

hive> SET hive.intermediate.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
Intermediate compression only saves disk space for jobs that require multiple map and reduce stages. To save disk space further, the actual Hive output files can be compressed. When the hive.exec.compress.output property is set to true, Hive uses the codec configured by the mapred.output.compression.codec property to compress the output stored in HDFS, as follows. These properties can be set in hive-site.xml or in the Hive CLI.
hive> SET hive.exec.compress.output=true;
hive> SET mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
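Putting the two together, a session might enable both intermediate and output compression before running a multi-stage query; the table and output path below are illustrative assumptions:

```sql
-- Compress data shuffled between mappers and reducers.
SET hive.exec.compress.intermediate=true;
SET hive.intermediate.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;

-- Compress the final output written to HDFS.
SET hive.exec.compress.output=true;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;

-- 'employee' is an assumed table; the directory path is illustrative.
INSERT OVERWRITE DIRECTORY '/tmp/compressed_output'
SELECT name, COUNT(*) FROM employee GROUP BY name;
```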