Explanation of the tf.data.experimental.make_csv_dataset parameters

tf.data.experimental.make_csv_dataset(
    file_pattern, batch_size, column_names=None, column_defaults=None,
    label_name=None, select_columns=None, field_delim=',',
    use_quote_delim=True, na_value='', header=True, num_epochs=None,
    shuffle=True, shuffle_buffer_size=10000, shuffle_seed=None,
    prefetch_buffer_size=None, num_parallel_reads=None, sloppy=False,
    num_rows_for_inference=100, compression_type=None, ignore_errors=False
)

Reads CSV files into a dataset, where each element of the dataset is a (features, labels) tuple that corresponds to a batch of CSV rows. The features dictionary maps feature column names to Tensors containing the corresponding feature data, and labels is a Tensor containing the batch’s label data.
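Below is a minimal sketch of that element structure; the file name titanic.csv and the label column survived are hypothetical placeholders for your own data.

import tensorflow as tf

# Hypothetical CSV file with a header row and a "survived" label column.
dataset = tf.data.experimental.make_csv_dataset(
    "titanic.csv",
    batch_size=32,
    label_name="survived",
)

# Each element is a (features, labels) tuple for one batch of rows.
features, labels = next(iter(dataset))
for name, value in features.items():
    print(name, value.shape, value.dtype)   # one Tensor of shape (32,) per column
print("labels:", labels.shape)              # (32,)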

By default, the first rows of the CSV files are expected to be headers listing the column names. If the first rows are not headers, set header=False and provide the column names with the column_names argument.

By default, the dataset is repeated indefinitely, reshuffling the order each time. This behavior can be modified by setting the num_epochs and shuffle arguments.

This default is worth keeping in mind: if your model seems to have finished fitting within what you thought was a single epoch, check whether the dataset is in fact repeating indefinitely.
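A small sketch of how num_epochs and shuffle change this behavior; the file name train.csv is a hypothetical placeholder.

import tensorflow as tf

# Default: repeats forever with a fresh shuffle each pass, so iteration never ends.
endless = tf.data.experimental.make_csv_dataset("train.csv", batch_size=32)

# A single ordered pass over the file, e.g. for evaluation.
one_pass = tf.data.experimental.make_csv_dataset(
    "train.csv", batch_size=32, num_epochs=1, shuffle=False)

num_batches = sum(1 for _ in one_pass)   # finite: terminates after one epoch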

Args

file_pattern
List of files or patterns of file paths containing CSV records. See tf.io.gfile.glob for pattern rules.

batch_size
An int representing the number of records to combine in a single batch.

column_names
An optional list of strings that corresponds to the CSV columns, in order. One per column of the input record. If this is not provided, infers the column names from the first row of the records. These names will be the keys of the features dict of each dataset element.

column_defaults
An optional list of default values for the CSV fields. One item per selected column of the input record. Each item in the list is either a valid CSV dtype (float32, float64, int32, int64, or string), or a Tensor with one of the aforementioned types. The tensor can either be a scalar default value (if the column is optional) or an empty tensor (if the column is required). If a dtype is provided instead of a tensor, the column is also treated as required. If this list is not provided, tries to infer types based on reading the first num_rows_for_inference rows of the files specified, and assumes all columns are optional, defaulting to 0 for numeric values and "" for string values. If both this and select_columns are specified, these must have the same lengths, and column_defaults is assumed to be sorted in order of increasing column index.
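A sketch of supplying explicit per-column types and defaults; the file products.csv and its three-column layout (id, price, category) are hypothetical, and Python scalars are assumed to be accepted in place of scalar tensors.

import tensorflow as tf

# One entry per column, in column order: a dtype marks a required column,
# a scalar value marks an optional column with that default.
defaults = [
    tf.int32,      # "id": required, every row must supply it
    0.0,           # "price": optional, missing values become 0.0
    "unknown",     # "category": optional, missing values become "unknown"
]

dataset = tf.data.experimental.make_csv_dataset(
    "products.csv",
    batch_size=8,
    column_defaults=defaults,
    num_epochs=1,
)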

label_name
An optional string corresponding to the label column. If provided, the data for this column is returned as a separate Tensor from the features dictionary, so that the dataset complies with the format expected by a tf.estimator.Estimator train or evaluate input function.

select_columns
An optional list of integer indices or string column names that specifies a subset of columns of CSV data to select. If column names are provided, these must correspond to names provided in column_names or inferred from the file header lines. When this argument is specified, only a subset of CSV columns will be parsed and returned, corresponding to the columns specified. Using this results in faster parsing and lower memory usage. If both this and column_defaults are specified, these must have the same lengths, and column_defaults is assumed to be sorted in order of increasing column index.
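A sketch of selecting a column subset; the file ratings.csv and its column names are hypothetical.

import tensorflow as tf

# Only three columns are parsed and returned; the rest are skipped entirely,
# which speeds up parsing and lowers memory usage.
dataset = tf.data.experimental.make_csv_dataset(
    "ratings.csv",
    batch_size=64,
    select_columns=["user_id", "item_id", "rating"],
    label_name="rating",
    num_epochs=1,
)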

field_delim
An optional string. Defaults to ",". Char delimiter to separate fields in a record.

use_quote_delim
An optional bool. Defaults to True. If false, treats double quotation marks as regular characters inside of the string fields.

na_value
Additional string to recognize as NA/NaN.
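A sketch combining field_delim, use_quote_delim, and na_value for a tab-separated file; the file name measurements.tsv and the "?" missing-value marker are hypothetical.

import tensorflow as tf

# Tab-separated records where "?" marks missing values and double quotes
# are treated as ordinary characters.
dataset = tf.data.experimental.make_csv_dataset(
    "measurements.tsv",
    batch_size=32,
    field_delim="\t",
    use_quote_delim=False,
    na_value="?",     # "?" fields fall back to the column's default value
    num_epochs=1,
)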

header
A bool that indicates whether the first rows of the provided CSV files correspond to header lines with column names, and should not be included in the data.

num_epochs
An int specifying the number of times this dataset is repeated. If None, cycles through the dataset forever.

shuffle
A bool that indicates whether the input should be shuffled.

shuffle_buffer_size
Buffer size to use for shuffling. A large buffer size ensures better shuffling, but increases memory usage and startup time.

shuffle_seed
Randomization seed to use for shuffling.
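A sketch of the shuffle-related arguments used together; the file name is hypothetical.

import tensorflow as tf

# A larger buffer gives a better shuffle at the cost of memory and startup time;
# fixing the seed makes the shuffled order reproducible across runs.
dataset = tf.data.experimental.make_csv_dataset(
    "train.csv",
    batch_size=32,
    shuffle=True,
    shuffle_buffer_size=50000,
    shuffle_seed=42,
)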

prefetch_buffer_size
An int specifying the number of feature batches to prefetch for performance improvement. The recommended value is the number of batches consumed per training step. Defaults to auto-tune.

num_parallel_reads
Number of threads used to read CSV records from files. If >1, the results will be interleaved. Defaults to 1.

sloppy
If True, reading performance will be improved at the cost of non-deterministic ordering. If False, the order of elements produced is deterministic prior to shuffling (elements are still randomized if shuffle=True; note that if the seed is set, the order of elements after shuffling is deterministic). Defaults to False.
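A sketch of reading many shards in parallel; the glob pattern data/train-*.csv is hypothetical.

import tensorflow as tf

# Interleave reads from four files at once; sloppy=True trades deterministic
# ordering for throughput, and prefetching overlaps input with training.
dataset = tf.data.experimental.make_csv_dataset(
    "data/train-*.csv",
    batch_size=128,
    num_parallel_reads=4,
    sloppy=True,
    prefetch_buffer_size=2,   # prefetch about two batches; default is auto-tuned
)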

num_rows_for_inference
Number of rows of a file to use for type inference if record_defaults is not provided. If None, reads all the rows of all the files. Defaults to 100.

compression_type
(Optional.) A tf.string scalar evaluating to one of "" (no compression), "ZLIB", or "GZIP". Defaults to no compression.

ignore_errors
(Optional.) If True, ignores errors with CSV file parsing, such as malformed data or empty lines, and moves on to the next valid CSV record. Otherwise, the dataset raises an error and stops processing when encountering any invalid records. Defaults to False.
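A sketch of reading compressed files while skipping malformed rows; the pattern logs-*.csv.gz is hypothetical.

import tensorflow as tf

# Gzip-compressed CSV shards; malformed or empty rows are skipped instead of
# raising an error.
dataset = tf.data.experimental.make_csv_dataset(
    "logs-*.csv.gz",
    batch_size=256,
    compression_type="GZIP",
    ignore_errors=True,
    num_epochs=1,
)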

Original: https://blog.csdn.net/Geek_/article/details/123493251
Author: BugII_
Title: Explanation of the tf.data.experimental.make_csv_dataset parameters
