sep (default ,)
encoding (default UTF-8): decodes the CSV files using the given encoding.
quote (default "): sets a single character used for escaping quoted values where the separator can be part of the value. To turn off quoting, set an empty string rather than null. This behaviour differs from com.databricks.spark.csv.
escape (default \): sets a single character used for escaping quotes inside an already quoted value.
charToEscapeQuoteEscaping (default: the escape character when the escape and quote characters differ, \0 otherwise)
comment (default empty string)
header (default false)
enforceSchema (default true)
inferSchema (default false)
samplingRatio (default 1.0)
ignoreLeadingWhiteSpace (default false)
ignoreTrailingWhiteSpace (default false)
nullValue (default empty string)
emptyValue (default empty string)
nanValue (default NaN)
positiveInf (default Inf)
negativeInf (default -Inf)
dateFormat (default yyyy-MM-dd)
timestampFormat (default yyyy-MM-dd'T'HH:mm:ss[.SSS][XXX])
maxColumns (default 20480)
maxCharsPerColumn (default -1)
unescapedQuoteHandling (default STOP_AT_DELIMITER)
mode (default PERMISSIVE)
columnNameOfCorruptRecord (default: the value of spark.sql.columnNameOfCorruptRecord)
multiLine (default false)
locale (default en-US)
lineSep (default: covers all of \r, \r\n and \n)
pathGlobFilter: an optional glob pattern to only include files with paths matching the pattern. The syntax follows org.apache.hadoop.fs.GlobFilter. It does not change the behavior of partition discovery.
modifiedBefore (batch only): an optional timestamp to only include files modified before the specified time. The timestamp must have the form YYYY-MM-DDTHH:mm:ss (e.g. 2020-06-01T13:00:00).
modifiedAfter (batch only): an optional timestamp to only include files modified after the specified time. The timestamp must have the form YYYY-MM-DDTHH:mm:ss (e.g. 2020-06-01T13:00:00).
recursiveFileLookup: recursively scans a directory for files. Using this option disables partition discovery.
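
To make the list above more concrete, here is a minimal PySpark sketch of how a few of these options might be combined. The helper name read_csv, the option values and the path are illustrative assumptions; only the option names come from the list. The file-selection options (pathGlobFilter, modifiedAfter) exist only in newer Spark 3.x releases, so drop them on older clusters.

```python
# Option names are taken from the list above; the values are only examples.
CSV_OPTIONS = {
    "sep": ";",
    "header": True,
    "inferSchema": True,
    "nullValue": "NA",
    "mode": "PERMISSIVE",
    "encoding": "UTF-8",
    "pathGlobFilter": "*.csv",
    "modifiedAfter": "2020-06-01T13:00:00",
}


def read_csv(spark, path, **overrides):
    """Read the CSV files under `path` with CSV_OPTIONS, optionally overridden.

    `spark` is an existing SparkSession. Nothing is imported here, so the
    helper can be defined without a running cluster.
    """
    options = {**CSV_OPTIONS, **overrides}
    return spark.read.options(**options).csv(path)
```

A call such as read_csv(spark, "/data/in", sep=",") keeps every default from the dictionary and replaces only the separator.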




Spark 2.0+:

For convenience, there is an implicit that wraps the DataFrameReader returned by spark.read and provides a .excel method which accepts all possible options and provides default values:
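
The implicit .excel method belongs to the Scala API. From PySpark, the same reader can presumably be reached through the generic format/option calls, as in the sketch below. It assumes that the spark-excel package is on the classpath, that the data source is registered as com.crealytics.spark.excel, and that the option names header, dataAddress and inferSchema match the installed release (older releases may use different names). The helper name read_excel and the file name in the usage note are placeholders.

```python
EXCEL_FORMAT = "com.crealytics.spark.excel"  # assumed data source name


def read_excel(spark, path, data_address="A1", header=True, infer_schema=False):
    """Load an Excel file through the generic DataFrameReader API.

    `spark` is an existing SparkSession; `data_address` selects the cells
    to read (see the address styles described in this section).
    """
    return (
        spark.read.format(EXCEL_FORMAT)
        .option("header", header)
        .option("dataAddress", data_address)
        .option("inferSchema", infer_schema)
        .load(path)
    )
```

For example, read_excel(spark, "Worktime.xlsx", data_address="'My Sheet'!B3:C35") would read only that range from the named sheet.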

If the sheet name is unavailable, it is possible to pass in an index:
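
A sketch of the index form, assuming the sheet index is zero-based and is written in place of the sheet name in front of the exclamation mark (for example 0!B3:C35). Both helper names are hypothetical, and the data source name is the same assumption as elsewhere in this section.

```python
def address_by_index(sheet_index, cells):
    """Build a dataAddress such as '0!B3:C35' from a sheet index."""
    if sheet_index < 0:
        raise ValueError("sheet index must be non-negative")
    return f"{sheet_index}!{cells}"


def read_sheet_by_index(spark, path, sheet_index, cells="A1"):
    """Read `cells` from the sheet at `sheet_index` (assumed zero-based)."""
    return (
        spark.read.format("com.crealytics.spark.excel")  # assumed source name
        .option("header", True)
        .option("dataAddress", address_by_index(sheet_index, cells))
        .load(path)
    )
```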

or to read in the names dynamically:
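
spark-excel has its own Scala helper for listing sheet names, which is not reproduced here. As a stand-in technique, the sketch below reads the names directly from xl/workbook.xml inside the .xlsx file (which is a zip archive) using only the Python standard library. It assumes a local file visible to the driver and the ordinary (non-strict) OOXML namespace. The helper names are hypothetical, and address_for_sheet does not escape quotes inside sheet names.

```python
import xml.etree.ElementTree as ET
import zipfile

# Namespace used by workbook.xml in ordinary (non-strict) .xlsx files.
MAIN_NS = "http://schemas.openxmlformats.org/spreadsheetml/2006/main"


def xlsx_sheet_names(path):
    """Return the sheet names of a local .xlsx file in workbook order."""
    with zipfile.ZipFile(path) as archive:
        root = ET.fromstring(archive.read("xl/workbook.xml"))
    return [sheet.attrib["name"] for sheet in root.iter(f"{{{MAIN_NS}}}sheet")]


def address_for_sheet(sheet_name, cells="A1"):
    """Quote a sheet name and append a cell or range, e.g. 'My Sheet'!A1."""
    return f"'{sheet_name}'!{cells}"
```

The first name could then be turned into an address with address_for_sheet(xlsx_sheet_names("Worktime.xlsx")[0], "B3:C35") and passed as the dataAddress option.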

As you can see in the examples above, the location of data to read or write can be specified with the dataAddress option. Currently the following address styles are supported:

  • B3: Start cell of the data. Reading will return all rows below and all columns to the right. Writing will start here and use as many columns and rows as required.

  • B3:F35: Cell range of data. Reading will return only rows and columns in the specified range. Writing will start in the first cell (B3 in this example) and use only the specified columns and rows. If there are more rows or columns in the DataFrame to write, they will be truncated. Make sure this is what you want.

  • 'My Sheet'!B3:F35: Same as above, but with a specific sheet.

  • MyTable[#All]: Table of data. Reading will return all rows and columns in this table. Writing will only write within the current range of the table. No growing of the table will be performed. PRs to change this are welcome.
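
The four address styles above can be told apart with a few regular expressions. The sketch below is not the library's own parser; it is a hypothetical pre-check that might catch a mistyped address before a write, which matters because a range address silently truncates extra rows and columns. The function names and the data source name are assumptions.

```python
import re

_CELL = r"[A-Za-z]{1,3}[0-9]+"  # a cell reference such as B3 or AA120
_STYLES = [
    ("table", re.compile(r"^[^!\[\]]+\[#All\]$")),               # MyTable[#All]
    ("range", re.compile(rf"^(?:.+!)?{_CELL}:{_CELL}$")),        # B3:F35
    ("start_cell", re.compile(rf"^(?:.+!)?{_CELL}$")),           # B3
]


def address_style(data_address):
    """Classify a dataAddress as 'table', 'range' or 'start_cell'."""
    for style, pattern in _STYLES:
        if pattern.match(data_address):
            return style
    raise ValueError(f"unrecognised dataAddress: {data_address!r}")


def write_excel(df, path, data_address, mode="overwrite"):
    """Write `df` to an Excel file after checking the address shape."""
    address_style(data_address)  # raises early on a malformed address
    (
        df.write.format("com.crealytics.spark.excel")  # assumed source name
        .option("header", True)
        .option("dataAddress", data_address)
        .mode(mode)
        .save(path)
    )
```

An optional sheet prefix such as 'My Sheet'! is accepted in front of a start cell or a range, so 'My Sheet'!B3:F35 is classified as a range.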


Author: 张永清
Title: The most complete summary of techniques for reading and processing zip, gzip, excel and other kinds of files with Spark





