We do a lot of work using HDFS as our file system. It provides us with a fault-tolerant, distributed environment for storing our files. The two main file formats we use with HDFS are Avro and Parquet. Since Parquet is a compressed, columnar format, I wondered how the performance of Parquet writes compares to Avro writes. So I set up some tests to figure this out.
Since Impala cannot write to Avro tables, I need to perform these inserts in Hive (which will make the inserts slower). So I fire up my ole' Hive console and get started. First I need tables to insert into, so I create one that is Avro backed and one that is Parquet backed. Then I insert into each of these tables 8 times with an "insert overwrite table avro select * from insertTable". The insertTable has 913,304 rows and 63 columns, with a partition column on the month. The resulting sizes of the Avro and Parquet directories on HDFS are 673.6M and 76.8M respectively. A rough sketch of the setup follows, and the table after it shows the results.
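For reference, here is a minimal sketch of what the setup might look like in HiveQL. The table and column names are hypothetical, the real table has 63 columns rather than the two placeholders shown, and the warehouse paths assume the default location, so treat this as an illustration rather than the actual script I ran.

```sql
-- Hypothetical sketch of the test setup; column list is a stand-in for the
-- real 63 columns, and table names are made up for illustration.
CREATE TABLE avro_test (
  id   BIGINT,
  name STRING
)
PARTITIONED BY (month STRING)
STORED AS AVRO;

CREATE TABLE parquet_test (
  id   BIGINT,
  name STRING
)
PARTITIONED BY (month STRING)
STORED AS PARQUET;

-- Allow the month partition to be derived from the data itself.
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

-- Each insert was run 8 times, recording the time Hive reports after the query.
-- SELECT * on a partitioned source table puts its month partition column last,
-- which is what dynamic partitioning expects here.
INSERT OVERWRITE TABLE avro_test PARTITION (month)
SELECT * FROM insertTable;

INSERT OVERWRITE TABLE parquet_test PARTITION (month)
SELECT * FROM insertTable;

-- From the Hive shell, the resulting directory sizes can be checked with a
-- dfs command (paths here assume the default warehouse location):
dfs -du -s -h /user/hive/warehouse/avro_test /user/hive/warehouse/parquet_test;
```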
Avro write time (seconds) | Parquet write time (seconds) |
---|---|
128.121 | 114.786 |
99.132 | 112.18 |
97.644 | 125.045 |
97.627 | 97.107 |
97.17 | 96.553 |
98.832 | 104.718 |
97.14 | 108.456 |
97.104 | 98.2 |
Average: 101.59 | Average: 107.13 |
So I am seeing Parquet come in at roughly 1/9th the size of Avro (76.8M vs 673.6M) while taking only ~6% more time to write. If the application's usage pattern is write once, read many times, it makes sense to go with Parquet. But, depending on your data profile, even if you are writing many times, it may still make sense to use Parquet. There may be other data sets that widen the gap between these two formats. If anyone has insight into what kind of data would do that, I can rerun these tests with updated data. Just leave a comment.
The next step is to compare the write performance of Parquet vs. Kudu.