Hive reports "lzo Premature EOF from inputStream" error

Today, colleagues in the DW group sent an email about a problem that needed solving; they could not handle it themselves. The troubleshooting process follows:

1. The HQL

insert overwrite table mds_prod_silent_atten_user partition (dt=20141110)
select uid, host, atten_time
from (
    select uid, host, atten_time
    from (
        select case when t2.uid is null then t1.uid else t2.uid end uid,
               case when t2.uid is null and t2.host is null then t1.host else t2.host end host,
               case when t2.atten_time is null or t1.atten_time > t2.atten_time
                    then t1.atten_time else t2.atten_time end atten_time
        from (
            select uid, findid(extend,'uids') host, dt atten_time,
                   sum(case when (mode = '1' or mode = '3') then 1 else -1 end) num
            from ods_bhv_tblog
            where behavior = '14000076' and dt = '20141115'
              and (mode = '1' or mode = '3' or mode = '2') and status = '1'
            group by uid, findid(extend,'uids'), dt
        ) t1
        full outer join (
            select uid, attened_uid host, atten_time
            from mds_prod_silent_atten_user
            where dt = '20141114'
        ) t2
        on t1.uid = t2.uid and t1.host = t2.host
        where t1.uid is null or t1.num > 0
    ) t3
    union all
    select t5.uid, t5.host, t5.atten_time
    from (
        select uid, host, atten_time
        from (
            select uid, findid(extend,'uids') host, dt atten_time,
                   sum(case when (mode = '1' or mode = '3') then 1 else -1 end) num
            from ods_bhv_tblog
            where behavior = '14000076' and dt = '20141115'
              and (mode = '1' or mode = '3' or mode = '2') and status = '1'
            group by uid, findid(extend,'uids'), dt
        ) t4
        where num = 0
    ) t5
    join (
        select uid, attened_uid host, atten_time
        from mds_prod_silent_atten_user
        where dt = '20141114'
    ) t6
    on t6.uid = t5.uid and t6.host = t5.host
) t7

The above is the HQL that failed. It looks very complex, but the logic is actually quite simple: it only involves joining two tables.

2. Error log:

Error: java.io.IOException: java.lang.reflect.InvocationTargetException
    at org.apache.hadoop.hive.io.HiveIOExceptionHandlerChain.handleRecordReaderCreationException(HiveIOExceptionHandlerChain.java:97)
    at org.apache.hadoop.hive.io.HiveIOExceptionHandlerUtil.handleRecordReaderCreationException(HiveIOExceptionHandlerUtil.java:57)
    at org.apache.hadoop.hive.shims.HadoopShimsSecure$CombineFileRecordReader.initNextRecordReader(HadoopShimsSecure.java:302)
    at org.apache.hadoop.hive.shims.HadoopShimsSecure$CombineFileRecordReader.<init>(HadoopShimsSecure.java:249)
    at org.apache.hadoop.hive.shims.HadoopShimsSecure$CombineFileInputFormatShim.getRecordReader(HadoopShimsSecure.java:363)
    at org.apache.hadoop.hive.ql.io.CombineHiveInputFormat.getRecordReader(CombineHiveInputFormat.java:591)
    at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.<init>(MapTask.java:168)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:409)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:342)
    at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:167)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1550)
    at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:162)
Caused by: java.lang.reflect.InvocationTargetException
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
    at org.apache.hadoop.hive.shims.HadoopShimsSecure$CombineFileRecordReader.initNextRecordReader(HadoopShimsSecure.java:288)
    ... 11 more
Caused by: java.io.EOFException: Premature EOF from inputStream
    at com.hadoop.compression.lzo.LzopInputStream.readFully(LzopInputStream.java:75)
    at com.hadoop.compression.lzo.LzopInputStream.readHeader(LzopInputStream.java:114)
    at com.hadoop.compression.lzo.LzopInputStream.<init>(LzopInputStream.java:54)
    at com.hadoop.compression.lzo.LzopCodec.createInputStream(LzopCodec.java:83)
    at org.apache.hadoop.hive.ql.io.RCFile$ValueBuffer.<init>(RCFile.java:667)
    at org.apache.hadoop.hive.ql.io.RCFile$Reader.<init>(RCFile.java:1431)
    at org.apache.hadoop.hive.ql.io.RCFile$Reader.<init>(RCFile.java:1342)
    at org.apache.hadoop.hive.ql.io.rcfile.merge.RCFileBlockMergeRecordReader.<init>(RCFileBlockMergeRecordReader.java:46)
    at org.apache.hadoop.hive.ql.io.rcfile.merge.RCFileBlockMergeInputFormat.getRecordReader(RCFileBlockMergeInputFormat.java:38)
    at org.apache.hadoop.hive.ql.io.CombineHiveRecordReader.<init>(CombineHiveRecordReader.java:65)
    ... 16 more

The log shows that a Premature EOF from inputStream error occurred while reading LZO-compressed data, and that it was raised by Stage-3 of today's job.

3. The execution plan for Stage-3 is as follows:

Stage: Stage-3
    Map Reduce
      Map Operator Tree:
          TableScan
            Union
              Statistics: Num rows: 365 Data size: 146323 Basic stats: COMPLETE Column stats: NONE
              Select Operator
                expressions: _col0 (type: string), _col1 (type: string), _col2 (type: string)
                outputColumnNames: _col0, _col1, _col2
                Statistics: Num rows: 365 Data size: 146323 Basic stats: COMPLETE Column stats: NONE
                File Output Operator
                  compressed: false
                  Statistics: Num rows: 365 Data size: 146323 Basic stats: COMPLETE Column stats: NONE
                  table:
                      input format: org.apache.hadoop.hive.ql.io.RCFileInputFormat
                      output format: org.apache.hadoop.hive.ql.io.RCFileOutputFormat
                      serde: org.apache.hadoop.hive.serde2.columnar.LazyBinaryColumnarSerDe
                      name: default.mds_prod_silent_atten_user
          TableScan
            Union
              Statistics: Num rows: 365 Data size: 146323 Basic stats: COMPLETE Column stats: NONE
              Select Operator
                expressions: _col0 (type: string), _col1 (type: string), _col2 (type: string)
                outputColumnNames: _col0, _col1, _col2
                Statistics: Num rows: 365 Data size: 146323 Basic stats: COMPLETE Column stats: NONE
                File Output Operator
                  compressed: false
                  Statistics: Num rows: 365 Data size: 146323 Basic stats: COMPLETE Column stats: NONE
                  table:
                      input format: org.apache.hadoop.hive.ql.io.RCFileInputFormat
                      output format: org.apache.hadoop.hive.ql.io.RCFileOutputFormat
                      serde: org.apache.hadoop.hive.serde2.columnar.LazyBinaryColumnarSerDe
                      name: default.mds_prod_silent_atten_user

Stage-3 has only a map phase, no reduce, and the map phase simply performs the union; nothing about it looks special.

4. Troubleshooting

Googling the lzo Premature EOF from inputStream error message showed that, sure enough, someone had run into a similar problem, link:

The cause of the problem:

If the output format is TextOutputFormat and you want LZO compression, use LzopCodec; the matching format for reading that output back is LzoTextInputFormat.

If the output is a SequenceFile combined with LzopCodec, then expect to receive "java.io.EOFException: Premature EOF from inputStream" when reading that output with SequenceFileInputFormat.
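
Not from the original post: a minimal sketch of what the matching combination looks like for plain-text output in a Hive session, assuming the hadoop-lzo package (which provides com.hadoop.compression.lzo.LzopCodec and com.hadoop.mapred.DeprecatedLzoTextInputFormat) is installed; the table name atten_text_lzo is made up for illustration:

    -- Write LZO-compressed text output; LzopCodec produces .lzo files
    -- that only an LZO-aware input format can read back.
    SET hive.exec.compress.output=true;
    SET mapreduce.output.fileoutputformat.compress.codec=com.hadoop.compression.lzo.LzopCodec;

    -- A text table written this way should be declared with the matching
    -- LZO text input format so it can be read back:
    CREATE TABLE atten_text_lzo (uid STRING, host STRING, atten_time STRING)
    STORED AS INPUTFORMAT 'com.hadoop.mapred.DeprecatedLzoTextInputFormat'
    OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat';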

The description at that link matches our problem. Our table's output format is RCFileOutputFormat, which is not plain text, so the compression codec must not be LzopCodec; LzoCodec should be used instead, and the error message confirms this: the error is thrown while reading an RCFile that the previous job generated with LzopCodec compression.
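
A quick way to confirm which codec a session will use for job output is to print the properties in the Hive CLI, where SET with a bare property name shows its current value (a minimal check, assuming Hadoop 2.x property names):

    -- `SET <property>;` with no value prints the current setting.
    SET hive.exec.compress.output;
    SET mapreduce.output.fileoutputformat.compress.codec;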

Now that the cause is clear, the next step is to find the corresponding parameter, which should be the one that controls the compression codec of the reduce output, and replace the LzopCodec setting with LzoCodec, based on the configuration of the job in question:

Sure enough, the mapreduce.output.fileoutputformat.compress.codec option was set to LzopCodec. Changing the value of mapreduce.output.fileoutputformat.compress.codec to org.apache.hadoop.io.compress.DefaultCodec, with which LzoCodec is used by default, fixed the problem.
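
Applied per-session in Hive, the fix above amounts to a one-line setting (a sketch; a cluster-wide change would go into mapred-site.xml instead):

    -- Use DefaultCodec rather than LzopCodec for job output, so the RCFile
    -- written by the upstream job remains readable by RCFile$Reader.
    SET mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.DefaultCodec;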

Original: https://www.cnblogs.com/blfshiye/p/5424097.html
Author: Hongten