A Comprehensive Summary of Tips for Reading and Processing zip, gzip, Excel, and Other Files with Spark

I. When the extension is zip or gzip, Spark can read and handle the files automatically

1. Spark is quite smart: when you have a batch of compressed zip/gzip files whose contents are plain text files, you can read them, or inspect the resulting schema, as in the sketch below.
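A minimal sketch for the gzip case (the path is illustrative; Spark/Hadoop picks the codec from the .gz extension and decompresses transparently):

```scala
// Read gzip-compressed text files; decompression happens automatically.
val textDf = spark.read.text("/data/logs/*.gz")

// The resulting schema is a single string column named "value".
textDf.printSchema()
textDf.show(5, truncate = false)
```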

2. When the compressed text files contain JSON, you can also read them with read.json, which parses the JSON automatically and returns structured data.

Spark reads file contents line by line; if multiple lines in a file should be treated as a single record, set multiLine=true (the default is false), as in the sketch below.
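A sketch of reading compressed JSON files with read.json (the path is illustrative; multiLine=true is only needed when a single JSON document spans several physical lines):

```scala
// Each file may hold pretty-printed JSON spanning several lines,
// so multiLine=true treats a whole JSON document as one record.
val jsonDf = spark.read
  .option("multiLine", "true")
  .json("/data/events/*.json.gz")

jsonDf.printSchema()   // schema is inferred from the JSON structure
```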

3. When a zip or gzip file has no extension, or the wrong one, Spark cannot read it automatically; in that case it can still be read in a way similar to the sketch below.

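One common approach is to read the raw bytes yourself and decompress them manually; a minimal sketch, assuming gzip content and UTF-8 text (the path is illustrative):

```scala
import java.util.zip.GZIPInputStream
import scala.io.Source

// Read files as raw byte streams, ignoring their (missing or wrong) extension,
// then decompress each one manually and split it into lines.
val lines = spark.sparkContext
  .binaryFiles("/data/raw/no_extension_files/*")
  .flatMap { case (_, stream) =>
    val in = new GZIPInputStream(stream.open())
    try Source.fromInputStream(in, "UTF-8").getLines().toList
    finally in.close()
  }

lines.take(5).foreach(println)
```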

When Spark reads data into a DataFrame, the work is done by DataFrameReader.scala (https://github.com/apache/spark/blob/v3.1.2/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala). Its source comments show that option supports many settings besides multiLine; they are listed below.

  • sep: default ,
  • encoding: default UTF-8; decodes the CSV files using the given encoding type
  • quote: default "; sets a single character used for escaping quoted values where the separator can be part of the value. To turn off quotation handling, set an empty string rather than null. This behaviour differs from com.databricks.spark.csv
  • escape: default \; sets a single character used for escaping quotes inside an already quoted value
  • charToEscapeQuoteEscaping: default is the escape character or \0
  • comment: default empty string
  • header: default false
  • enforceSchema: default true
  • inferSchema: default false
  • samplingRatio: default 1.0
  • ignoreLeadingWhiteSpace: default false
  • ignoreTrailingWhiteSpace: default false
  • nullValue: default empty string
  • emptyValue: default empty string
  • nanValue: default NaN
  • positiveInf: default Inf
  • negativeInf: default -Inf
  • dateFormat: default yyyy-MM-dd
  • timestampFormat: default yyyy-MM-dd'T'HH:mm:ss[.SSS][XXX]
  • maxColumns: default 20480
  • maxCharsPerColumn: default -1
  • unescapedQuoteHandling: default STOP_AT_DELIMITER
  • mode: default PERMISSIVE
  • columnNameOfCorruptRecord: default is the value of spark.sql.columnNameOfCorruptRecord
  • multiLine: default false
  • locale: default en-US
  • lineSep: default covers \r, \r\n and \n
  • pathGlobFilter: an optional glob pattern to include only files whose paths match the pattern. The syntax follows org.apache.hadoop.fs.GlobFilter. It does not change the behavior of partition discovery
  • modifiedBefore (batch only): an optional timestamp to include only files with modification times before the specified time. The timestamp must be in the form YYYY-MM-DDTHH:mm:ss (e.g. 2020-06-01T13:00:00)
  • modifiedAfter (batch only): an optional timestamp to include only files with modification times after the specified time. The timestamp must be in the form YYYY-MM-DDTHH:mm:ss (e.g. 2020-06-01T13:00:00)
  • recursiveFileLookup: recursively scan a directory for files. Using this option disables partition discovery

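As a sketch of combining several of these options in one read (the path and option values are illustrative):

```scala
// Read CSV files with a handful of the options listed above.
val csvDf = spark.read
  .option("header", "true")          // first line holds column names
  .option("sep", ",")                // field separator
  .option("inferSchema", "true")     // infer column types from the data
  .option("multiLine", "true")       // allow quoted values to span lines
  .option("pathGlobFilter", "*.csv") // only pick up matching files
  .csv("/data/reports/")

csvDf.printSchema()
```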

II. Reading Excel files with Spark

Add the spark-excel dependency.
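For example, with sbt (the version below is illustrative; pick one that matches your Spark and Scala versions from the project's README):

```scala
libraryDependencies += "com.crealytics" %% "spark-excel" % "0.13.5"
```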

Spark 2.0+:
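A sketch of the basic DataFrame read, following the option names documented in the spark-excel README (the file path and data address are illustrative):

```scala
val excelDf = spark.read
  .format("com.crealytics.spark.excel")
  .option("dataAddress", "'My Sheet'!B3:C35") // optional, default "A1"
  .option("header", "true")                   // required
  .option("inferSchema", "false")             // optional, default false
  .load("/data/workbook.xlsx")
```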

For convenience, there is an implicit that wraps the DataFrameReader returned by spark.read and provides a .excel method which accepts all possible options and provides default values:
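A sketch of that implicit helper, using the same illustrative file:

```scala
import com.crealytics.spark.excel._

val excelDf = spark.read.excel(
  header = true,                      // required
  dataAddress = "'My Sheet'!B3:C35",  // optional, default "A1"
  inferSchema = false                 // optional, default false
).load("/data/workbook.xlsx")
```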

If the sheet name is unavailable, it is possible to pass in an index:
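For example, addressing the first sheet by index instead of by name (a sketch, same illustrative file):

```scala
val excelDf = spark.read.excel(
  header = true,
  dataAddress = "0!B3:C35"  // sheet index 0 instead of a sheet name
).load("/data/workbook.xlsx")
```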

or to read in the names dynamically:
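A sketch based on the WorkbookReader helper shown in the spark-excel README (the file name is illustrative):

```scala
import com.crealytics.spark.excel.WorkbookReader

// List the sheet names in the workbook, then read the first sheet.
val sheetNames = WorkbookReader(
  Map("path" -> "/data/workbook.xlsx"),
  spark.sparkContext.hadoopConfiguration
).sheetNames

val firstSheetDf = spark.read.excel(
  header = true,
  dataAddress = sheetNames.head
).load("/data/workbook.xlsx")
```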

As you can see in the examples above, the location of data to read or write can be specified with the dataAddress option. Currently the following address styles are supported:

  • B3: Start cell of the data. Reading will return all rows below and all columns to the right. Writing will start here and use as many columns and rows as required.

  • B3:F35: Cell range of data. Reading will return only rows and columns in the specified range. Writing will start in the first cell (B3 in this example) and use only the specified columns and rows. If there are more rows or columns in the DataFrame to write, they will be truncated. Make sure this is what you want.

  • 'My Sheet'!B3:F35: Same as above, but with a specific sheet.

  • MyTable[#All]: Table of data. Reading will return all rows and columns in this table. Writing will only write within the current range of the table. No growing of the table will be performed. PRs to change this are welcome.

For more details, see:

https://github.com/crealytics/spark-excel/

Original: https://www.cnblogs.com/laoqing/p/15516458.html
Author: 张永清
Title: A Comprehensive Summary of Tips for Reading and Processing zip, gzip, Excel, and Other Files with Spark
