Paper Notes (1): An Efficient Parallel Secure Machine Learning Framework on GPUs

July 13, 2023, 7:02 AM • Artificial Intelligence • 93 views

@article{DBLP:journals/tpds/ZhangCZZZD21,
author = {Feng Zhang and
Zheng Chen and
Chenyang Zhang and
Amelie Chi Zhou and
Jidong Zhai and
Xiaoyong Du},
title = {An Efficient Parallel Secure Machine Learning Framework on GPUs},
journal = {{IEEE} Trans. Parallel Distributed Syst.},
volume = {32},
number = {9},
pages = {2262--2276},
year = {2021},
url = {https://doi.org/10.1109/TPDS.2021.3059108},
doi = {10.1109/TPDS.2021.3059108},
timestamp = {Thu, 14 Oct 2021 09:20:51 +0200},
biburl = {https://dblp.org/rec/journals/tpds/ZhangCZZZD21.bib},
bibsource = {dblp computer science bibliography, https://dblp.org}
}

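I. Paper Overview

Background: privacy protection matters. Secure multi-party computation (MPC) is used in many applications, and in machine learning in particular it has specific advantages over differential privacy (DP).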

we find that the low performance problem exists even with two-party computation, which is mainly due to the following reasons.

SecureML [10], proposed by Mohassel and others, is the state-of-the-art machine learning framework based on two-party computation.

GPUs have been widely used as a powerful accelerator to machine learning algorithms [15], [16]. However, none of existing studies has focused on the acceleration of secure machine learning algorithms using GPUs.

Problem addressed: performance

Previous work on secure machine learning mostly focused on novel protocols or improving accuracy, while the performance metric has been ignored.

Proposed solution: ParSecureML, a GPU-based framework

Challenges:

complex computation patterns,
frequent intra-node data transmission between CPU and GPU,
complicated inter-node data dependence

Proposed design:

profiling-guided adaptive GPU utilization,
fine-grained double pipeline for intra-node CPU-GPU cooperation,
compressed transmission for inter-node communication,
integrate architecture-specific optimizations, such as Tensor Cores, into ParSecureML

Results:

the first GPU-based secure machine learning framework.
Compared to the state-of-the-art framework, ParSecureML achieves an average of 33.8X speedup.
ParSecureML can also be applied to inferences, which achieves 31.7X speedup on average.

ParSecureML contributions

Targeting three challenges: Building a GPU-based secure machine learning framework requires handling three challenges.

the complex triplet multiplication based computation patterns
how to handle the PCIe transmission overhead caused by frequent intra-node data transmission between CPU and GPU.
the complicated inter-node data dependence

Three new techniques

a profiling-guided adaptive GPU engine (profiling identifies the most compute-intensive parts)
a double pipeline design, which can overlap not only the GPU computation and PCIe data transmission, but also potential steps among different NN layers
a novel compression-based transmission method

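Deep optimizations for both CPU and GPU:

A thread-safe random number generation design.
The compute-intensive, complex parts are placed on the GPU (with cache optimizations used to parallelize these operations).
Architecture-specific optimizations are integrated, such as using Tensor Cores on GPUs.

ML algorithms evaluated (six):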

convolutional neural network (CNN) [19], multilayer perceptron (MLP) [20], linear regression [21], logistic regression [22], recurrent neural network (RNN) [23], and Support Vector Machine (SVM) [24],

Five datasets:

MNIST [25], VGGFace2 [26], NIST [27], CIFAR-10 [28], and a synthetic dataset.

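II. The ParSecureML Protocol

Overview:

Three components:

1) Profiling-guided adaptive GPU utilization (addresses challenge 1)

2) Double pipeline execution for overlapping intra-node data transmission and computation (addresses challenge 2). compute1 and communicate form the reconstruct phase executed on the CPU, and compute2 is the GPU part; together they form one pipeline. In addition, a single ML layer consists of several steps, and operations from different layers can overlap in this pipeline.
3) Compressed transmission for inter-node communication (addresses challenge 3)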


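Difficulties in integrating the techniques: GPU tasks must cooperate with pipelined execution and compressed transmission.

The double pipeline design becomes more complex (CPU-GPU transfer, computation, compressed transmission).

Data for compressed transmission can be stored on the GPU.

Workflow: each layer of an ML task has forward propagation and backward propagation, both with reconstruct and GPU operation phases.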
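To make the reconstruct and GPU operation phases concrete, here is a minimal sketch of triplet-based secure matrix multiplication between two parties, using the standard Beaver-triplet formula as in SecureML. The ring size, the helper names, and the plain-Python matrix routines are assumptions for illustration; in ParSecureML the matrix products would run on the GPU, and how the paper numbers these equations is not shown in these notes.

```python
import random

MOD = 2 ** 64  # ring Z_{2^64}; an illustrative choice, not taken from the notes


def rand_matrix(rows, cols, rng):
    return [[rng.randrange(MOD) for _ in range(cols)] for _ in range(rows)]


def add(x, y):
    return [[(a + b) % MOD for a, b in zip(rx, ry)] for rx, ry in zip(x, y)]


def sub(x, y):
    return [[(a - b) % MOD for a, b in zip(rx, ry)] for rx, ry in zip(x, y)]


def matmul(x, y):
    cols = list(zip(*y))
    return [[sum(a * b for a, b in zip(row, col)) % MOD for col in cols] for row in x]


def share(x, rng):
    """Split x into two additive shares so that x = x0 + x1 (mod MOD)."""
    x0 = rand_matrix(len(x), len(x[0]), rng)
    return x0, sub(x, x0)


def secure_matmul(a, b, rng):
    """Return shares (C0, C1) with C0 + C1 = A x B (mod MOD).

    Both parties are simulated in one process; `rng` is Python's
    non-cryptographic generator and is for demonstration only.
    """
    # Offline phase: triplet (U, V, Z) with Z = U x V, all secret-shared.
    u = rand_matrix(len(a), len(a[0]), rng)
    v = rand_matrix(len(b), len(b[0]), rng)
    z = matmul(u, v)
    a_sh, b_sh, u_sh, v_sh, z_sh = (share(m, rng) for m in (a, b, u, v, z))

    # Reconstruct phase: each party masks its shares locally, then the two
    # parties exchange the masked shares and sum them to get E and F.
    e = add(sub(a_sh[0], u_sh[0]), sub(a_sh[1], u_sh[1]))  # E = A - U
    f = add(sub(b_sh[0], v_sh[0]), sub(b_sh[1], v_sh[1]))  # F = B - V

    # GPU operation phase: party i computes
    #   C_i = -i * (E x F) + A_i x F + E x B_i + Z_i
    c_sh = []
    for i in (0, 1):
        c = add(add(matmul(a_sh[i], f), matmul(e, b_sh[i])), z_sh[i])
        if i == 1:
            c = sub(c, matmul(e, f))
        c_sh.append(c)
    return c_sh
```

Adding the two returned shares modulo `MOD` recovers the plain product, while neither share alone reveals it. The three matrix products in the last phase are the compute-heavy part that a GPU would take over.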

Profiling-guided adaptive GPU utilization
Offline: the matrix multiplication in the triplets can be accelerated on the GPU.
Online:

Activation function design:

Equation (9) is used to approximate the original nonlinear functions on GPUs.
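Equation (9) itself is not reproduced in these notes. As a rough illustration, the sketch below assumes it is the SecureML-style piecewise-linear replacement for the sigmoid (0 below -1/2, x + 1/2 in between, 1 above 1/2). It evaluates the function on plain values only; the secure comparison needed on secret-shared inputs is left out.

```python
import math


def piecewise_sigmoid(x):
    """Piecewise-linear stand-in for the sigmoid (assumed form of Equation (9))."""
    if x < -0.5:
        return 0.0
    if x > 0.5:
        return 1.0
    return x + 0.5


def sigmoid(x):
    """Reference logistic function, for comparison only."""
    return 1.0 / (1.0 + math.exp(-x))
```

The two functions agree at x = 0 and share the same limits at both extremes. Only comparisons and additions are involved, which is what makes this form attractive for secure computation.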

Double pipeline execution for overlapping intra-node data transmission and computation

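Rationale:

In each layer, the forward pass processes data and the backward pass updates parameters; every layer requires data transmission.

Consequence: a fine-grained pipeline design is needed instead of a coarse-grained pipeline [43][44].

Many steps span multiple layers.

Consequence: a second pipeline is needed to overlap the possible steps in different layers.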

Pipeline Design:

First Pipeline: overlap GPU computation and PCIe data transmission in Equation (8)
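The benefit of overlapping transfer and computation can be estimated with a small timing model. This is a sketch under stated assumptions, not the paper's scheduler: the work is split into equal chunks, one copy engine and one compute engine each handle a single chunk at a time, and the chunk count and per-chunk times are hypothetical inputs.

```python
def sequential_time(chunks, t_copy, t_compute):
    """Total time when every chunk is copied and then computed, with no overlap."""
    return chunks * (t_copy + t_compute)


def pipelined_time(chunks, t_copy, t_compute):
    """Total time when the copy of chunk k+1 overlaps the compute of chunk k."""
    copy_done = 0.0     # time at which the latest copy finished
    compute_done = 0.0  # time at which the latest compute finished
    for _ in range(chunks):
        copy_done += t_copy
        # A chunk starts computing once its data has arrived and the
        # compute engine is free.
        compute_done = max(compute_done, copy_done) + t_compute
    return compute_done
```

For example, with 4 chunks, a copy time of 2.0 and a compute time of 3.0 per chunk, the sequential schedule takes 20.0 time units and the pipelined one 14.0: only the first copy is exposed, and every later copy hides behind computation.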

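Second Pipeline:

Both forward and backward propagation in every layer need a reconstruct step and GPU operations.

Processing in a later layer depends on the current layer's forward propagation, so the reconstruct steps of adjacent layers cannot overlap in the forward pass. In the backward pass, reconstruct does not have to wait for the next layer and can overlap with the next layer's propagation, which saves one reconstruct time.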

Compressed Transmission for Inter-Node Communication

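Analysis: the matrices after iterations are usually sparse. Activation functions leave many zeros, and as the number of layers grows, the gradients of the loss function in the first few layers become very small.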
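These notes do not describe the actual encoding used for compressed transmission. The sketch below shows the general idea with an assumed coordinate-list format: send only the non-zero entries when that is smaller than the dense matrix, and fall back to the dense form otherwise. The packet layout and function names are hypothetical.

```python
def compress(matrix):
    """Encode a matrix as (kind, shape, payload), choosing the smaller form."""
    rows, cols = len(matrix), len(matrix[0])
    triples = [
        (r, c, v)
        for r, row in enumerate(matrix)
        for c, v in enumerate(row)
        if v != 0
    ]
    # Each non-zero costs three numbers (row, column, value).
    if 3 * len(triples) < rows * cols:
        return ("sparse", (rows, cols), triples)
    return ("dense", (rows, cols), [v for row in matrix for v in row])


def decompress(packet):
    """Rebuild the full matrix from a packet produced by compress()."""
    kind, (rows, cols), payload = packet
    if kind == "dense":
        return [payload[r * cols:(r + 1) * cols] for r in range(rows)]
    out = [[0] * cols for _ in range(rows)]
    for r, c, v in payload:
        out[r][c] = v
    return out
```

The size check matters because a coordinate list is larger than the dense form once more than a third of the entries are non-zero.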

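Optimizations

1) CPU acceleration

Faster random number generation: use a thread-safe random number generator, the Mersenne Twister 19937 generator (MT19937) [48], from the C++11 random library (1.06X relative to the running time of rand()). Another possible improvement is cuRAND on GPUs, but it only gives a good speedup for large matrices.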
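One common way to make random generation thread-safe is to give every thread its own generator instead of sharing global state. The sketch below illustrates that pattern in Python, whose `random.Random` class is also an MT19937 generator. It is not the paper's C++ code: the row partitioning, the fixed seeds, and the function names are assumptions, and a real deployment would need unpredictable seeds.

```python
import random
from concurrent.futures import ThreadPoolExecutor


def fill_rows(seed, n_rows, n_cols, modulus):
    """Generate n_rows random rows using a generator private to this task."""
    rng = random.Random(seed)  # MT19937 instance, no state shared across threads
    return [[rng.randrange(modulus) for _ in range(n_cols)] for _ in range(n_rows)]


def random_matrix(n_rows, n_cols, workers=4, base_seed=0, modulus=2 ** 32):
    """Build a random matrix by filling row blocks in separate threads."""
    # Spread the rows as evenly as possible across the workers.
    counts = [n_rows // workers + (1 if w < n_rows % workers else 0)
              for w in range(workers)]
    seeds = [base_seed + w for w in range(workers)]  # fixed seeds: demo only
    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = pool.map(fill_rows, seeds, counts,
                         [n_cols] * workers, [modulus] * workers)
        return [row for part in parts for row in part]
```

Because each block depends only on its own seed, the result is reproducible no matter how the threads are scheduled.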

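Matrix addition and subtraction: Equations (5) and (6) contain many additions and subtractions, which can be executed in parallel with a multi-threaded for-loop.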
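The partitioning behind such a multi-threaded loop can be sketched as follows: rows are split into contiguous blocks and each block is subtracted by a separate worker. The function names and the modulus are assumptions. In CPython the global interpreter lock prevents a real speedup for pure-Python arithmetic, so this shows only the structure; the paper's implementation is in C++.

```python
from concurrent.futures import ThreadPoolExecutor


def _sub_rows(task):
    """Subtract one block of rows element-wise, modulo the ring size."""
    rows_x, rows_y, modulus = task
    return [[(a - b) % modulus for a, b in zip(rx, ry)]
            for rx, ry in zip(rows_x, rows_y)]


def parallel_sub(x, y, workers=4, modulus=2 ** 64):
    """Compute x - y (mod modulus), one contiguous row block per task."""
    step = max(1, -(-len(x) // workers))  # ceiling division: rows per block
    tasks = [(x[i:i + step], y[i:i + step], modulus)
             for i in range(0, len(x), step)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves task order, so rows come back in place.
        return [row for block in pool.map(_sub_rows, tasks) for row in block]
```

The blocks are independent, so no locking is needed; addition follows the same pattern with the operator swapped.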

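2) GPU acceleration

Profiling GPU execution with nvprof shows three parts: host-to-device memory copies, general matrix multiplication (GEMM) operations (the optimization target), and device-to-device memory copies.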

Tensor Core Utilization.

Popular GPU machine learning frameworks, including TensorFlow [35], PyTorch [36], MXNet [51], and Caffe2 [52], all utilize Tensor Cores.

Original: https://blog.csdn.net/weixin_41839176/article/details/126611037
Author: BambooDoo
Title: Paper Notes (1): An Efficient Parallel Secure Machine Learning Framework on GPUs

