# Comparison of HDFS File Formats

Row-oriented: the data of one row is stored together, that is, contiguously. Examples are SequenceFile, MapFile, and Avro datafiles. With this layout, even if you only need a small portion of a row, the entire row has to be read into memory. Lazy deserialization can alleviate this to some extent, but the cost of reading the entire row from disk cannot be avoided. Row-oriented storage suits workloads that process entire rows at the same time.

Column-oriented: the file is split into columns, and the data of each column is stored together. Examples are Parquet, RCFile, and ORCFile. A column-oriented format lets you skip unneeded columns when reading, which suits queries that touch only a small portion of a row's fields. However, reading and writing this format requires more memory, because rows have to be buffered in memory (to gather one column across multiple rows). It is also a poor fit for streaming writes: if a write fails, the current file cannot be recovered, whereas a row-oriented file can be resynchronized to the last sync point. This is why Flume uses a row-oriented storage format.
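To make the difference concrete, here is a minimal Python sketch of the two layouts. The table contents and the helper names `read_column_from_rows` and `read_column_from_cols` are invented for illustration; real formats work on bytes on disk, not on Python lists.

```python
# Toy table: three rows with three fields each (hypothetical data).
rows = [
    (1, "alice", 30.5),
    (2, "bob", 12.0),
    (3, "carol", 99.9),
]

# Row-oriented layout: the fields of one row sit next to each other.
row_layout = [field for row in rows for field in row]

# Column-oriented layout: all values of one column sit next to each other.
col_layout = [list(col) for col in zip(*rows)]


def read_column_from_rows(layout, n_fields, col):
    """Stride across every row to pick out one field per row."""
    return [layout[i + col] for i in range(0, len(layout), n_fields)]


def read_column_from_cols(layout, col):
    """Only the requested column is touched; the others are skipped."""
    return layout[col]


amounts_row = read_column_from_rows(row_layout, 3, 2)
amounts_col = read_column_from_cols(col_layout, 2)
print(amounts_row == amounts_col)
```

Both calls return the same values, but the row-oriented read has to step through every row, while the column-oriented read goes straight to one contiguous run.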

The following file formats are widely used in the Hadoop ecosystem:

The file structure of SequenceFile is as follows:

Depending on whether compression is used, and whether it is record compression or block compression, the storage layout varies:

No compression:

Each record is stored as record length, key length, key, and value. The lengths are byte counts; the value length is not stored separately, because it follows from the record length minus the key length. Keys and values are serialized with the configured Serialization.
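A simplified Python sketch of such a length-prefixed record follows. It is only an illustration: `encode_record` and `decode_record` are hypothetical helpers, and the sketch leaves out the file header, the sync markers, and the Writable serialization that real SequenceFiles use, so it is not byte-compatible with Hadoop.

```python
import struct


def encode_record(key: bytes, value: bytes) -> bytes:
    """Pack one record as: record length, key length, key bytes, value bytes."""
    record_len = len(key) + len(value)
    # ">ii" means two big-endian 4-byte signed integers.
    return struct.pack(">ii", record_len, len(key)) + key + value


def decode_record(buf: bytes, offset: int = 0):
    """Return (key, value, next_offset) for the record starting at offset."""
    record_len, key_len = struct.unpack_from(">ii", buf, offset)
    start = offset + 8  # skip the two length fields
    key = buf[start:start + key_len]
    value = buf[start + key_len:start + record_len]
    return key, value, start + record_len


data = encode_record(b"k1", b"hello") + encode_record(b"k2", b"world!")
key1, value1, pos = decode_record(data)
key2, value2, _ = decode_record(data, pos)
print(key1, value1, key2, value2)
```

Because every record announces its own length, a reader can jump from record to record without inspecting the payload bytes.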

Record compression:

Only the value is compressed. The compression codec that was used is recorded in the header.

Block compression:

Multiple records are compressed together, which exploits the similarity between records to save space. Sync markers are added before and after each block. The minimum block size is set by the io.seqfile.compress.blocksize property.
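The space saving from compressing similar records together can be observed with a small Python experiment. This is a sketch under assumptions: `zlib` stands in for whatever codec Hadoop is configured with, and the records are made-up strings.

```python
import zlib

# Hypothetical records that resemble each other, as table rows often do.
records = [f"order-{i:06d},status=SHIPPED,region=EU".encode() for i in range(1000)]

# Record-style compression: every value is compressed on its own.
per_record_size = sum(len(zlib.compress(r)) for r in records)

# Block-style compression: many records are compressed together.
block_size = len(zlib.compress(b"".join(records)))

print(per_record_size, block_size)
```

Compressing each short record separately cannot reuse patterns seen in earlier records and pays the codec's fixed overhead every time, so the block-style total comes out much smaller.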

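Avro is a binary file format designed for data-intensive applications. Its format is more compact, and when large amounts of data have to be read, Avro offers better serialization and deserialization performance. Avro data files carry their own schema definition, so developers do not need to implement their own Writable objects at the API level. Several Hadoop subprojects now support the Avro data format, including Pig, Hive, Flume, Sqoop, and HCatalog.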

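RCFile is a column-oriented data format introduced by Hive. It follows the design principle of "partition horizontally first, then vertically". During a query, it skips the I/O for columns the query does not care about. Note, however, that in the map phase RCFile still copies the entire data block from the remote node. After the block has been copied to the local directory, RCFile does not jump directly over the unneeded columns to the ones it has to read; it gets there by scanning the header of each row group. The header at the HDFS block level does not record in which row group each column starts and in which it ends. As a result, when all columns are read, RCFile performs worse than SequenceFile.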

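Hive's Record Columnar File first partitions the data by row into row groups and then, inside each row group, stores the data column by column. Its structure is as follows:

Compared with purely row-oriented and purely column-oriented storage: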
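As a rough Python sketch of this layout (the helper names `to_row_groups` and `read_column` are made up for illustration, and real RCFile row groups are sized in bytes rather than in rows):

```python
def to_row_groups(rows, rows_per_group):
    """Split rows horizontally into groups, then store each group column by column."""
    groups = []
    for start in range(0, len(rows), rows_per_group):
        chunk = rows[start:start + rows_per_group]
        # zip(*chunk) transposes the chunk: one tuple per column.
        groups.append([list(col) for col in zip(*chunk)])
    return groups


def read_column(groups, col):
    """Visit every row group, but take only the requested column from each."""
    values = []
    for group in groups:
        values.extend(group[col])
    return values


rows = [(1, "a"), (2, "b"), (3, "c"), (4, "d"), (5, "e")]
groups = to_row_groups(rows, rows_per_group=2)
print(groups)
print(read_column(groups, 1))
```

All fields of a given row stay inside one row group, while a reader that needs a single column still has to visit every group to collect it.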

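ORCFile (Optimized Record Columnar File) provides a more efficient file format than RCFile. Internally it divides the data into stripes with a default size of 250 MB. Each stripe contains an index, the data, and a footer. The index stores the minimum and maximum value of each column, along with the position of each row within the column.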
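One way such min/max statistics can be used is to skip whole stripes that cannot contain a match. The Python sketch below illustrates the idea only; the `Stripe` class and the `scan_greater_than` function are hypothetical and do not mirror the ORC reader API.

```python
from dataclasses import dataclass


@dataclass
class Stripe:
    """Hypothetical stand-in for a stripe: column values plus min/max statistics."""
    values: list
    min_value: int
    max_value: int


def make_stripe(values):
    return Stripe(values=list(values), min_value=min(values), max_value=max(values))


def scan_greater_than(stripes, threshold):
    """Return matching values and the number of stripes actually read."""
    matches, stripes_read = [], 0
    for stripe in stripes:
        if stripe.max_value <= threshold:
            continue  # the statistics prove no value can match: skip the stripe
        stripes_read += 1
        matches.extend(v for v in stripe.values if v > threshold)
    return matches, stripes_read


stripes = [make_stripe([1, 5, 9]), make_stripe([10, 15, 20]), make_stripe([21, 30, 42])]
matches, stripes_read = scan_greater_than(stripes, 20)
print(matches, stripes_read)
```

Here only the last of the three stripes is read, because the maximum values of the first two already rule them out.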

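In Hive, ORC can be chosen when a table is created, applied to an existing table, or set as the default format:

`CREATE TABLE … STORED AS ORC`

`ALTER TABLE … SET FILEFORMAT ORC`

`SET hive.default.fileformat=ORC`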

Parquet is a general-purpose column-oriented storage format based on Google's Dremel. It is especially good at handling deeply nested data.

Parquet flattens nested structures into columns and records their shape with a repetition level and a definition level (R and D) for each value. When a whole record has to be reassembled, this metadata is used to rebuild the structure of the record. Here is an example of R and D:
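A minimal Python sketch of R and D for the simplest nested case follows. It assumes a schema with a single top-level repeated string field (called `tags` here purely for illustration), so the maximum repetition level and the maximum definition level are both 1. The functions `shred` and `assemble` are hypothetical names, and real Parquet keeps levels and values in separately encoded streams.

```python
def shred(records):
    """Flatten lists of strings into (value, repetition level, definition level)."""
    column = []
    for tags in records:
        if not tags:
            # Empty list: the field is not defined, so D = 0 and the value is null.
            column.append((None, 0, 0))
            continue
        for i, tag in enumerate(tags):
            # R = 0 starts a new record; R = 1 continues the current list.
            column.append((tag, 0 if i == 0 else 1, 1))
    return column


def assemble(column):
    """Rebuild the original lists from the flat column."""
    records = []
    for value, r, d in column:
        if r == 0:
            records.append([])  # a new record begins
        if d == 1:
            records[-1].append(value)  # the value is actually present
    return records


records = [["a", "b"], [], ["c"]]
column = shred(records)
print(column)
print(assemble(column) == records)
```

The repetition level tells the reader where a new record begins, and the definition level tells it whether a value is present at all, which is enough to rebuild the lists from one flat column.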

We chose a standard TPC-H data set to illustrate the storage overhead of the different file formats. Because this data is public, interested readers can repeat the experiment themselves. The Orders table is 1.62 GB in its original text format. We loaded it into Hadoop, used Hive to convert it to each of the formats above, and measured the size of the resulting files under the same LZO compression mode.

As the experimental results show, SequenceFile is larger than the original plain-text TextFile both with and without compression: 11% larger in uncompressed mode and 6.4% larger in compressed mode. This follows from how the SequenceFile format is defined. SequenceFile stores its metadata in the header, and the size of that metadata varies slightly with the compression mode. Compression is generally performed at the block level, where each block contains the key lengths and the value lengths, and a sync marker is written roughly every 4 KB. TextFile, by contrast, needs only a single delimiter between columns, so a TextFile is smaller than the equivalent SequenceFile. However, TextFile does not record the length of each column, so the reader has to check character by character whether each one is a delimiter or a line terminator. As a result, the deserialization overhead of TextFile can be dozens of times higher than that of the binary file formats.
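The character-by-character work that delimited text forces on a reader can be sketched in Python as follows. The function `parse_text_rows` is a hypothetical helper; it assumes `\x01` (Ctrl-A, Hive's default field delimiter) between fields and a newline after every row.

```python
def parse_text_rows(data: bytes, field_sep: int = 0x01, line_sep: int = 0x0A):
    """Split delimited text into rows by inspecting every single byte.

    Assumes every row, including the last one, ends with the line separator.
    """
    rows, row, field = [], [], bytearray()
    for byte in data:
        if byte == field_sep:
            row.append(bytes(field))  # end of a field
            field.clear()
        elif byte == line_sep:
            row.append(bytes(field))  # end of the last field and of the row
            field.clear()
            rows.append(row)
            row = []
        else:
            field.append(byte)
    return rows


text = b"1\x01alice\n2\x01bob\n"
rows = parse_text_rows(text)
print(rows)
```

Every byte has to be compared against both separators, whereas a length-prefixed binary format lets the reader slice out a field without examining its contents.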

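The RCFile format also stores the length of every field of every column, but it stores these lengths contiguously in a header metadata block, and it stores the actual data values contiguously as well. RCFile rewrites the header metadata block once per unit of a given size (called a row group, controlled by hive.io.rcfile.record.buffer.size, with a default of 4 MB). This is necessary for columns that appear for the first time but not for columns that repeat. RCFile would therefore be expected to be larger than SequenceFile, but because it compresses the field lengths in the header with run-length encoding, RCFile ends up somewhat smaller than SequenceFile. Run-length encoding is very effective for fixed-length data, such as the Integer, Double, and Long types. One special case is worth mentioning: the TimeStamp type introduced in Hive 0.8. Without fractional seconds it can be written as "YYYY-MM-DD HH:MM:SS" and has a fixed length of 8 bytes. With fractional seconds it is written as "YYYY-MM-DD HH:MM:SS.fffffffff", and the fractional part is variable in length.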
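The effect of run-length encoding on field lengths can be shown with a short Python sketch. This is a generic illustration with hypothetical helper names (`rle_encode`, `rle_decode`); the exact encoding RCFile uses differs in detail.

```python
def rle_encode(lengths):
    """Collapse consecutive equal values into [value, run_count] pairs."""
    runs = []
    for length in lengths:
        if runs and runs[-1][0] == length:
            runs[-1][1] += 1  # extend the current run
        else:
            runs.append([length, 1])  # start a new run
    return runs


def rle_decode(runs):
    """Expand [value, run_count] pairs back into the original sequence."""
    return [value for value, count in runs for _ in range(count)]


fixed = [8] * 1000          # field lengths of a fixed-width column
variable = [3, 7, 5, 7, 4]  # field lengths of a string column

print(rle_encode(fixed))
print(rle_encode(variable))
```

A thousand identical lengths collapse into a single pair, while the variable lengths gain nothing from the encoding, which matches the observation that fixed-length types compress best.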

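The Avro file format is also divided into groups. However, it defines the schema of the entire data set once in the header, unlike RCFile, which repeats the column type definitions in every row group. In addition, Avro encodes some values in fewer bytes when it can (a small integer, for example, takes only one or two bytes, comparable to a Short or Byte), so Avro's data blocks are smaller than RCFile's.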
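The Avro specification writes `int` and `long` values with variable-length zig-zag coding, which is one reason small numbers occupy so few bytes. The Python sketch below implements that coding for illustration; `encode_long` and `decode_long` are hypothetical helper names, not functions of the Avro library, and inputs are assumed to fit in a signed 64-bit integer.

```python
def encode_long(n: int) -> bytes:
    """Zig-zag, then variable-length encode a signed 64-bit integer."""
    z = (n << 1) ^ (n >> 63)  # zig-zag: small magnitudes become small unsigned values
    out = bytearray()
    while z & ~0x7F:
        out.append((z & 0x7F) | 0x80)  # low 7 bits with the continuation bit set
        z >>= 7
    out.append(z)
    return bytes(out)


def decode_long(data: bytes) -> int:
    """Inverse of encode_long for a single encoded value."""
    z, shift = 0, 0
    for byte in data:
        z |= (byte & 0x7F) << shift
        shift += 7
        if not byte & 0x80:
            break  # last byte of this value
    return (z >> 1) ^ -(z & 1)  # undo the zig-zag mapping


for n in (0, -1, 1, 64, 1_000_000):
    print(n, encode_long(n).hex(), len(encode_long(n)))
```

Values close to zero, positive or negative, fit in a single byte, while a fixed-width long would always take eight.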

We can use Java's profiling tools to inspect the CPU and memory overhead of Hadoop tasks at run time. The following are the settings on the Hive command line:

The most noticeable case is RCFile, which spends unnecessary effort moving arrays in order to construct rows. The main reason is that RCFile has to build a RowContainer to restore rows: it reads one row at a time, builds the RowContainer, and then assigns values to the corresponding columns. Because RCFile is compatible with SequenceFile, it can merge two blocks, and because RCFile does not know in which row group a column ends, it has to keep track of the current position in the array, similar to the following format definition:

Array>

This data format can instead be serialized and deserialized in a column-oriented way, for example:

Map,array,array….>

Deserializing this way avoids unnecessary array movement, provided we know in which row group each column starts and in which row group it ends. This improves the efficiency of the overall deserialization process.

