Dataset Sampling Methods in PyTorch

When training a neural network, the dataset is usually too large to feed into the network all at once, so the data has to be read and processed in batches. This raises the question of how indices are drawn from the dataset, and the PyTorch framework answers it with the Sampler base class plus several subclasses that implement different sampling strategies.
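Before looking at the individual samplers, it helps to see where a sampler sits in the pipeline: a DataLoader asks the sampler for indices and uses them to fetch items from the dataset. Below is a minimal sketch (the toy TensorDataset and shapes are illustrative assumptions, not from the original article):

import torch
from torch.utils.data import DataLoader, TensorDataset, SequentialSampler

# Toy dataset of 10 one-dimensional feature vectors (illustrative only)
dataset = TensorDataset(torch.arange(10, dtype=torch.float32).unsqueeze(1))

# The sampler decides WHICH indices are fetched; batch_size groups them
loader = DataLoader(dataset, batch_size=4, sampler=SequentialSampler(dataset))

for batch in loader:
    print(batch)  # batches of shape (4, 1); the last batch has shape (2, 1)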

The Base Class: Sampler

class Sampler(object):
    r"""Base class for all Samplers.

    Every Sampler subclass has to provide an :meth:`__iter__` method, providing a
    way to iterate over indices of dataset elements, and a :meth:`__len__` method
    that returns the length of the returned iterators.

    .. note:: The :meth:`__len__` method isn't strictly required by
              :class:`~torch.utils.data.DataLoader`, but is expected in any
              calculation involving the length of a :class:`~torch.utils.data.DataLoader`.
    """

    def __init__(self, data_source):
        pass

    def __iter__(self):
        raise NotImplementedError
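The base class thus fixes a simple contract: provide __iter__() yielding dataset indices, and (in practice) __len__(). As a minimal sketch of that contract, here is a hypothetical ReverseSampler (not part of PyTorch) that iterates over indices from last to first:

from torch.utils.data.sampler import Sampler

class ReverseSampler(Sampler):
    """Hypothetical sampler: yields dataset indices from last to first."""

    def __init__(self, data_source):
        self.data_source = data_source

    def __iter__(self):
        # __iter__ must yield integer indices into the dataset
        return iter(range(len(self.data_source) - 1, -1, -1))

    def __len__(self):
        return len(self.data_source)

print(list(ReverseSampler([10, 20, 30, 40, 50])))  # [4, 3, 2, 1, 0]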

Sequential Sampling: SequentialSampler

class SequentialSampler(Sampler):
    r"""Samples elements sequentially, always in the same order.

    Arguments:
        data_source (Dataset): dataset to sample from
"""

    def __init__(self, data_source):
        self.data_source = data_source

    def __iter__(self):
        return iter(range(len(self.data_source)))

    def __len__(self):
        return len(self.data_source)

The sequential sampler defines little beyond the essentials: its initializer takes only a data source (a Dataset) as its parameter.
__len__() simply returns the number of elements in the data source, while __iter__() returns an iterable, namely the ordered integer sequence produced by range(), so iteration proceeds strictly in order.
Each epoch consists of many iterations: __iter__() is called once per epoch to obtain a fresh iterator, and next() is called on that iterator once per iteration.

# Test
from torch.utils.data import sampler

data = list([1, 2, 3, 4, 5])
seq_sampler = sampler.SequentialSampler(data_source=data)

for index in seq_sampler:
    print("index: {}, data: {}".format(str(index), str(data[index])))

# Output
index: 0, data: 1
index: 1, data: 2
index: 2, data: 3
index: 3, data: 4
index: 4, data: 5
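To make the epoch/iteration relationship concrete, the sampler can also be driven by hand; a minimal sketch, reusing data and seq_sampler from the test above:

it = iter(seq_sampler)   # one fresh iterator per epoch
print(next(it))          # 0  -- one next() call per iteration
print(next(it))          # 1
# when the iterator is exhausted (StopIteration), the epoch ends,
# and the next epoch begins with a new iter() call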

Random Sampling: RandomSampler

class RandomSampler(Sampler):
    r"""Samples elements randomly. If without replacement, then sample from a shuffled dataset.

    If with replacement, then user can specify :attr:num_samples to draw.

    Arguments:
        data_source (Dataset): dataset to sample from
        replacement (bool): samples are drawn with replacement if ``True``, default=``False``
        num_samples (int): number of samples to draw, default=len(dataset). This argument
            is supposed to be specified only when replacement is ``True``.

        generator (Generator): Generator used in sampling.
    """

    def __init__(self, data_source, replacement=False, num_samples=None, generator=None):
        self.data_source = data_source

        self.replacement = replacement
        self._num_samples = num_samples
        self.generator = generator

        if not isinstance(self.replacement, bool):
            raise TypeError("replacement should be a boolean value, but got "
                            "replacement={}".format(self.replacement))

        if self._num_samples is not None and not replacement:
            raise ValueError("With replacement=False, num_samples should not be specified, "
                             "since a random permute will be performed.")

        if not isinstance(self.num_samples, int) or self.num_samples <= 0:
            raise ValueError("num_samples should be a positive integer "
                             "value, but got num_samples={}".format(self.num_samples))

    @property
    def num_samples(self):

        if self._num_samples is None:
            return len(self.data_source)
        return self._num_samples

    def __iter__(self):
        n = len(self.data_source)
        if self.replacement:
            rand_tensor = torch.randint(high=n, size=(self.num_samples,), dtype=torch.int64, generator=self.generator)
            return iter(rand_tensor.tolist())
        return iter(torch.randperm(n, generator=self.generator).tolist())

    def __len__(self):
        return self.num_samples

The most important part is the __iter__() method, which defines the core index-generation behavior. The if branch returns one of two kinds of random sequences, depending on whether replacement was requested at initialization: randint() produces a random sequence that may contain duplicate values, while randperm() produces a random permutation with no duplicates.
The tests below cover both cases, replacement=False and replacement=True:

ran_sampler = sampler.RandomSampler(data_source=data)
for index in ran_sampler:
    print("index: {}, data: {}".format(str(index), str(data[index])))

index: 3, data: 4
index: 4, data: 5
index: 2, data: 3
index: 1, data: 2
index: 0, data: 1

ran_sampler = sampler.RandomSampler(data_source=data, replacement=True)
for index in ran_sampler:
    print("index: {}, data: {}".format(str(index), str(data[index])))

index: 1, data: 2
index: 2, data: 3
index: 4, data: 5
index: 3, data: 4
index: 1, data: 2
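The difference between the two branches can be reproduced directly with the underlying torch functions; a minimal sketch:

import torch

g = torch.Generator().manual_seed(0)  # fixed seed, purely for reproducibility
# With replacement: duplicates are possible
print(torch.randint(high=5, size=(5,), generator=g).tolist())
# Without replacement: a permutation of 0..4, no duplicates
print(torch.randperm(5, generator=g).tolist())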

Subset Random Sampling: SubsetRandomSampler

class SubsetRandomSampler(Sampler):
    r"""Samples elements randomly from a given list of indices, without replacement.

    Arguments:
        indices (sequence): a sequence of indices
        generator (Generator): Generator used in sampling.

"""

    def __init__(self, indices, generator=None):

        self.indices = indices
        self.generator = generator

    def __iter__(self):

        return (self.indices[i] for i in torch.randperm(len(self.indices), generator=self.generator))

    def __len__(self):
        return len(self.indices)

In the code above, __len__() returns the number of indices, while __iter__() returns a generator that walks the given indices in the order of a random permutation; note that sampling is again without replacement, implemented via the randperm() function. This makes SubsetRandomSampler a natural fit for splitting a dataset into training, validation, and test subsets, as in the following example:

sub_sampler_train = sampler.SubsetRandomSampler(indices=data[0:2])
for index in sub_sampler_train:
    print("index: {}".format(str(index)))
print('------------')
sub_sampler_val = sampler.SubsetRandomSampler(indices=data[2:])
for index in sub_sampler_val:
    print("index: {}".format(str(index)))

index: 2
index: 1

index: 3
index: 4
index: 5
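In practice one usually passes ranges of positions rather than data values as indices; here is a minimal sketch of a train/validation split fed to DataLoader (the 80/20 ratio and toy dataset are illustrative assumptions):

import torch
from torch.utils.data import DataLoader, TensorDataset, SubsetRandomSampler

dataset = TensorDataset(torch.arange(10, dtype=torch.float32))
indices = list(range(len(dataset)))

split = int(0.8 * len(indices))  # 80/20 split (illustrative)
train_loader = DataLoader(dataset, batch_size=2,
                          sampler=SubsetRandomSampler(indices[:split]))
val_loader = DataLoader(dataset, batch_size=2,
                        sampler=SubsetRandomSampler(indices[split:]))

print(len(train_loader), len(val_loader))  # 4 1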

Weighted Random Sampling: WeightedRandomSampler

class WeightedRandomSampler(Sampler):
    r"""Samples elements from [0,..,len(weights)-1] with given probabilities (weights).

    Args:
        weights (sequence)   : a sequence of weights, not necessary summing up to one
        num_samples (int): number of samples to draw
        replacement (bool): if ``True``, samples are drawn with replacement.

            If not, they are drawn without replacement, which means that when a
            sample index is drawn for a row, it cannot be drawn again for that row.

        generator (Generator): Generator used in sampling.

    Example:
        >>> list(WeightedRandomSampler([0.1, 0.9, 0.4, 0.7, 3.0, 0.6], 5, replacement=True))
        [4, 4, 1, 4, 5]
        >>> list(WeightedRandomSampler([0.9, 0.4, 0.05, 0.2, 0.3, 0.1], 5, replacement=False))
        [0, 1, 4, 3, 2]
"""

    def __init__(self, weights, num_samples, replacement=True, generator=None):

        if not isinstance(num_samples, _int_classes) or isinstance(num_samples, bool) or \
                num_samples <= 0:
            raise ValueError("num_samples should be a positive integer "
                             "value, but got num_samples={}".format(num_samples))
        if not isinstance(replacement, bool):
            raise ValueError("replacement should be a boolean value, but got "
                             "replacement={}".format(replacement))

        self.weights = torch.as_tensor(weights, dtype=torch.double)
        self.num_samples = num_samples

        self.replacement = replacement
        self.generator = generator

    def __iter__(self):

        rand_tensor = torch.multinomial(self.weights, self.num_samples, self.replacement, generator=self.generator)
        return iter(rand_tensor.tolist())

    def __len__(self):
        return self.num_samples

The replacement parameter again controls whether sampling is done with or without replacement, and num_samples controls how many indices are generated. Note that the weights parameter gives per-sample weights, not per-class weights. The most important part is once more the __iter__() method, which returns a random index sequence drawn according to the probabilities given by weights.


data = [1, 2, 5, 78, 6, 56]

weights = [0.1, 0.2, 0.3, 0.4, 0.8, 0.3, 5]  # 7 weights, so indices 0..6 can be drawn
rsampler = sampler.WeightedRandomSampler(weights=weights, num_samples=10, replacement=True)

for index in rsampler:
    print("index: {}".format(str(index)))

# Output (truncated: num_samples=10 yields ten indices; index 6 carries the largest weight)
index: 5
index: 4
index: 6
index: 6
index: 6
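Because the weights are per-sample, a common pattern for class-imbalanced data is to derive each sample's weight from the frequency of its class, so that rare classes are drawn more often; a minimal sketch (the labels are illustrative assumptions):

import torch
from torch.utils.data import WeightedRandomSampler

labels = torch.tensor([0, 0, 0, 0, 0, 0, 1, 1, 2])  # class 0 over-represented
class_counts = torch.bincount(labels)          # tensor([6, 2, 1])
class_weights = 1.0 / class_counts.float()     # rarer class -> larger weight
sample_weights = class_weights[labels]         # one weight per sample

weighted_sampler = WeightedRandomSampler(sample_weights,
                                         num_samples=len(labels),
                                         replacement=True)
print(list(weighted_sampler))  # classes are now drawn roughly equally often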

Batch Sampling: BatchSampler

class BatchSampler(Sampler):
    r"""Wraps another sampler to yield a mini-batch of indices.

    Args:
        sampler (Sampler or Iterable): Base sampler. Can be any iterable object
            with __len__ implemented.

        batch_size (int): Size of mini-batch.

        drop_last (bool): If ``True``, the sampler will drop the last batch if
            its size would be less than ``batch_size``

    Example:
        >>> list(BatchSampler(SequentialSampler(range(10)), batch_size=3, drop_last=False))
        [[0, 1, 2], [3, 4, 5], [6, 7, 8], [9]]
        >>> list(BatchSampler(SequentialSampler(range(10)), batch_size=3, drop_last=True))
        [[0, 1, 2], [3, 4, 5], [6, 7, 8]]
"""

    def __init__(self, sampler, batch_size, drop_last):

        if not isinstance(batch_size, _int_classes) or isinstance(batch_size, bool) or \
                batch_size <= 0:
            raise ValueError("batch_size should be a positive integer value, "
                             "but got batch_size={}".format(batch_size))
        if not isinstance(drop_last, bool):
            raise ValueError("drop_last should be a boolean value, but got "
                             "drop_last={}".format(drop_last))

        self.sampler = sampler
        self.batch_size = batch_size

        self.drop_last = drop_last

    def __iter__(self):
        batch = []
        for idx in self.sampler:
            batch.append(idx)

            if len(batch) == self.batch_size:
                yield batch
                batch = []

        if len(batch) > 0 and not self.drop_last:
            yield batch

    def __len__(self):

        if self.drop_last:
            return len(self.sampler) // self.batch_size
        else:
            return (len(self.sampler) + self.batch_size - 1) // self.batch_size

Once the individual samplers are defined, batch sampling can be layered on top: BatchSampler wraps another sampler and groups its indices into mini-batches. When drop_last is True, a final batch with fewer than batch_size indices is discarded. In the example below, BatchSampler wraps a sequential sampler:

seq_sampler = sampler.SequentialSampler(data_source=data)
batch_sampler = sampler.BatchSampler(seq_sampler, 4, False)
print(list(batch_sampler))

# Output
[[0, 1, 2, 3], [4, 5]]
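A BatchSampler is handed to DataLoader through the batch_sampler argument, which is mutually exclusive with batch_size, shuffle, sampler, and drop_last; a minimal sketch reusing the toy data from above:

import torch
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.sampler import SequentialSampler, BatchSampler

dataset = TensorDataset(torch.tensor([1, 2, 5, 78, 6, 56]))
batch_sampler = BatchSampler(SequentialSampler(dataset), batch_size=4, drop_last=False)

loader = DataLoader(dataset, batch_sampler=batch_sampler)
for batch in loader:
    print(batch)  # [tensor([ 1,  2,  5, 78])], then [tensor([ 6, 56])]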

Original: https://blog.csdn.net/zhaogang12138/article/details/123237984
Author: @zhg12138
Title: 【几种数据集采样方式】 (Dataset Sampling Methods)
