Explanation of the tf.data.experimental.make_csv_dataset parameters

tf.data.experimental.make_csv_dataset(
    file_pattern, batch_size, column_names=None, column_defaults=None,
    label_name=None, select_columns=None, field_delim=',',
    use_quote_delim=True, na_value='', header=True, num_epochs=None,
    shuffle=True, shuffle_buffer_size=10000, shuffle_seed=None,
    prefetch_buffer_size=None, num_parallel_reads=None, sloppy=False,
    num_rows_for_inference=100, compression_type=None, ignore_errors=False
)

Reads CSV files into a dataset, where each element of the dataset is a (features, labels) tuple that corresponds to a batch of CSV rows. The features dictionary maps feature column names to Tensors containing the corresponding feature data, and labels is a Tensor containing the batch’s label data.
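Below is a minimal sketch of that element structure; the file name titanic.csv and the label column survived are hypothetical placeholders for your own data.

import tensorflow as tf

# Hypothetical CSV file with a header row and a "survived" label column.
dataset = tf.data.experimental.make_csv_dataset(
    "titanic.csv",
    batch_size=32,
    label_name="survived",
)

# Each element is a (features, labels) tuple for one batch of rows.
features, labels = next(iter(dataset))
for name, value in features.items():
    print(name, value.shape, value.dtype)   # one Tensor of shape (32,) per column
print("labels:", labels.shape)              # (32,)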

By default, the first rows of the CSV files are expected to be headers listing the column names. If the first rows are not headers, set header=False and provide the column names with the column_names argument.

By default, the dataset is repeated indefinitely, reshuffling the order each time. This behavior can be modified by setting the num_epochs and shuffle arguments.

This default is worth keeping in mind: if your model seems to have finished fitting within what you thought was a single epoch, check whether the dataset is in fact repeating indefinitely.
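A small sketch of how num_epochs and shuffle change this behavior; the file name train.csv is a hypothetical placeholder.

import tensorflow as tf

# Default: repeats forever with a fresh shuffle each pass, so iteration never ends.
endless = tf.data.experimental.make_csv_dataset("train.csv", batch_size=32)

# A single ordered pass over the file, e.g. for evaluation.
one_pass = tf.data.experimental.make_csv_dataset(
    "train.csv", batch_size=32, num_epochs=1, shuffle=False)

num_batches = sum(1 for _ in one_pass)   # finite: terminates after one epoch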

Args

file_pattern
List of files or patterns of file paths containing CSV records. See tf.io.gfile.glob for pattern rules.

batch_size
An int representing the number of records to combine in a single batch.

column_names
An optional list of strings that corresponds to the CSV columns, in order. One per column of the input record. If this is not provided, infers the column names from the first row of the records. These names will be the keys of the features dict of each dataset element.

column_defaults
An optional list of default values for the CSV fields. One item per selected column of the input record. Each item in the list is either a valid CSV dtype (float32, float64, int32, int64, or string), or a Tensor with one of the aforementioned types. The tensor can either be a scalar default value (if the column is optional) or an empty tensor (if the column is required). If a dtype is provided instead of a tensor, the column is also treated as required. If this list is not provided, tries to infer types based on reading the first num_rows_for_inference rows of the files specified, and assumes all columns are optional, defaulting to 0 for numeric values and "" for string values. If both this and select_columns are specified, these must have the same lengths, and column_defaults is assumed to be sorted in order of increasing column index.
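A sketch of supplying explicit per-column types and defaults; the file products.csv and its three-column layout (id, price, category) are hypothetical, and Python scalars are assumed to be accepted in place of scalar tensors.

import tensorflow as tf

# One entry per column, in column order: a dtype marks a required column,
# a scalar value marks an optional column with that default.
defaults = [
    tf.int32,      # "id": required, every row must supply it
    0.0,           # "price": optional, missing values become 0.0
    "unknown",     # "category": optional, missing values become "unknown"
]

dataset = tf.data.experimental.make_csv_dataset(
    "products.csv",
    batch_size=8,
    column_defaults=defaults,
    num_epochs=1,
)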

label_name
An optional string corresponding to the label column. If provided, the data for this column is returned as a separate Tensor from the features dictionary, so that the dataset complies with the format expected by a tf.estimator.Estimator train or evaluate input function.

select_columns
An optional list of integer indices or string column names that specifies a subset of columns of CSV data to select. If column names are provided, these must correspond to names provided in column_names or inferred from the file header lines. When this argument is specified, only a subset of CSV columns will be parsed and returned, corresponding to the columns specified. Using this results in faster parsing and lower memory usage. If both this and column_defaults are specified, these must have the same lengths, and column_defaults is assumed to be sorted in order of increasing column index.
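A sketch of selecting a column subset; the file ratings.csv and its column names are hypothetical.

import tensorflow as tf

# Only three columns are parsed and returned; the rest are skipped entirely,
# which speeds up parsing and lowers memory usage.
dataset = tf.data.experimental.make_csv_dataset(
    "ratings.csv",
    batch_size=64,
    select_columns=["user_id", "item_id", "rating"],
    label_name="rating",
    num_epochs=1,
)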

field_delim
An optional string. Defaults to ",". Char delimiter to separate fields in a record.

use_quote_delim
An optional bool. Defaults to True. If false, treats double quotation marks as regular characters inside of the string fields.

na_value
Additional string to recognize as NA/NaN.
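A sketch combining field_delim, use_quote_delim, and na_value for a tab-separated file; the file name measurements.tsv and the "?" missing-value marker are hypothetical.

import tensorflow as tf

# Tab-separated records where "?" marks missing values and double quotes
# are treated as ordinary characters.
dataset = tf.data.experimental.make_csv_dataset(
    "measurements.tsv",
    batch_size=32,
    field_delim="\t",
    use_quote_delim=False,
    na_value="?",     # "?" fields fall back to the column's default value
    num_epochs=1,
)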

header
A bool that indicates whether the first rows of the provided CSV files correspond to header lines with column names, and should not be included in the data.

num_epochs
An int specifying the number of times this dataset is repeated. If None, cycles through the dataset forever.

shuffle
A bool that indicates whether the input should be shuffled.

shuffle_buffer_size
Buffer size to use for shuffling. A large buffer size ensures better shuffling, but increases memory usage and startup time.

shuffle_seed
Randomization seed to use for shuffling.
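A sketch of the shuffle-related arguments used together; the file name is hypothetical.

import tensorflow as tf

# A larger buffer gives a better shuffle at the cost of memory and startup time;
# fixing the seed makes the shuffled order reproducible across runs.
dataset = tf.data.experimental.make_csv_dataset(
    "train.csv",
    batch_size=32,
    shuffle=True,
    shuffle_buffer_size=50000,
    shuffle_seed=42,
)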

prefetch_buffer_size
An int specifying the number of feature batches to prefetch for performance improvement. The recommended value is the number of batches consumed per training step. Defaults to auto-tune.

num_parallel_reads
Number of threads used to read CSV records from files. If >1, the results will be interleaved. Defaults to 1.

sloppy
If True, reading performance will be improved at the cost of non-deterministic ordering. If False, the order of elements produced is deterministic prior to shuffling (elements are still randomized if shuffle=True; note that if the seed is set, the order of elements after shuffling is deterministic). Defaults to False.
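A sketch of reading many shards in parallel; the glob pattern data/train-*.csv is hypothetical.

import tensorflow as tf

# Interleave reads from four files at once; sloppy=True trades deterministic
# ordering for throughput, and prefetching overlaps input with training.
dataset = tf.data.experimental.make_csv_dataset(
    "data/train-*.csv",
    batch_size=128,
    num_parallel_reads=4,
    sloppy=True,
    prefetch_buffer_size=2,   # prefetch about two batches; default is auto-tuned
)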

num_rows_for_inference
Number of rows of a file to use for type inference if record_defaults is not provided. If None, reads all the rows of all the files. Defaults to 100.

compression_type
(Optional.) A tf.string scalar evaluating to one of "" (no compression), "ZLIB", or "GZIP". Defaults to no compression.

ignore_errors
(Optional.) If True, ignores errors with CSV file parsing, such as malformed data or empty lines, and moves on to the next valid CSV record. Otherwise, the dataset raises an error and stops processing when encountering any invalid records. Defaults to False.
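A sketch of reading compressed files while skipping malformed rows; the pattern logs-*.csv.gz is hypothetical.

import tensorflow as tf

# Gzip-compressed CSV shards; malformed or empty rows are skipped instead of
# raising an error.
dataset = tf.data.experimental.make_csv_dataset(
    "logs-*.csv.gz",
    batch_size=256,
    compression_type="GZIP",
    ignore_errors=True,
    num_epochs=1,
)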

Original: https://blog.csdn.net/Geek_/article/details/123493251
Author: BugII_
Title: Explanation of the tf.data.experimental.make_csv_dataset parameters
