Converting between RDD, DataFrame, and Dataset

Conversions:

RDD, DataFrame, and Dataset have much in common, but each suits different scenarios, so you often need to convert between the three.

DataFrame/Dataset to RDD:

This conversion is straightforward:

val rdd1 = testDF.rdd
val rdd2 = testDS.rdd
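Note that the element types differ: a DataFrame's .rdd yields an RDD[Row], whose fields are read generically, while a Dataset's .rdd keeps its element type. A minimal sketch, assuming testDF and testDS carry the two columns col1: String and col2: Int used throughout this post (Coltest is the case class defined further below):

// from a DataFrame: RDD[Row], read fields by name or position
rdd1.map(row => (row.getAs[String]("col1"), row.getAs[Int]("col2")))
// from a Dataset[Coltest]: RDD[Coltest], fields stay typed
rdd2.map(c => (c.col1, c.col2))
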
RDD to DataFrame:

import spark.implicits._
val testDF = rdd.map { line =>
  (line._1, line._2)
}.toDF("col1", "col2")

Typically you pack a row's values into a tuple and then specify the column names in toDF.
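If the RDD already holds case-class instances, toDF can also infer the column names from the case-class fields, so you don't need to pass them explicitly. A minimal sketch, using the Coltest case class defined in the next snippet:

import spark.implicits._
// column names col1 and col2 are taken from the case class fields
val inferredDF = rdd.map(line => Coltest(line._1, line._2)).toDF()
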
RDD to Dataset:

import spark.implicits._
case class Coltest(col1: String, col2: Int) extends Serializable // define the column names and types
val testDS = rdd.map { line =>
  Coltest(line._1, line._2)
}.toDS

Notice that when you define the type of each row (the case class), the column names and types are already given; afterwards you only need to fill the case class with values.
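Equivalently, spark.createDataset does the same job as toDS; it just needs an Encoder for the element type in scope, which import spark.implicits._ provides for case classes. A quick sketch, assuming spark is the active SparkSession:

import spark.implicits._
val testDS2 = spark.createDataset(rdd.map(line => Coltest(line._1, line._2)))
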
Dataset to DataFrame:

This is also simple, because it just wraps the case class into Rows:

import spark.implicits._
val testDF = testDS.toDF
DataFrame to Dataset:

import spark.implicits._
case class Coltest(col1: String, col2: Int) extends Serializable // define the column names and types
val testDS = testDF.as[Coltest]

With this approach, once the type of each column has been given, the as method converts the DataFrame into a Dataset. This is extremely convenient when your data is a DataFrame but you need to process individual fields.
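For example, once the DataFrame is typed as Dataset[Coltest], each row can be handled as an ordinary object with named, typed fields (a minimal sketch; the filter condition is only an illustration):

testDS.filter(c => c.col2 > 10)          // col2 is an Int, checked at compile time
  .map(c => s"${c.col1}: ${c.col2}")     // work with fields by name
  .show()
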
Important note:

When using these conversions, be sure to add import spark.implicits._, otherwise toDF and toDS will not be available.
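The import hangs off the SparkSession instance, so it can only appear after that instance has been created, as in this minimal sketch (the app name is just a placeholder; the full example below does the same thing):

val spark = SparkSession.builder().appName("demo").master("local[*]").getOrCreate()
import spark.implicits._ // must come after the spark val; enables toDF, toDS and $"col"
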

package dataframe

import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}

//
// Explore interoperability between DataFrame and Dataset. Note that Dataset
// is covered in much greater detail in the 'dataset' directory.
//
object DatasetConversion {

  case class Cust(id: Integer, name: String, sales: Double, discount: Double, state: String)

  case class StateSales(state: String, sales: Double)

  def main(args: Array[String]) {
    val spark =
      SparkSession.builder()
        .appName("DataFrame-DatasetConversion")
        .master("local[4]")
        .getOrCreate()

    import spark.implicits._

    // create a sequence of case class objects
    // (we defined the case class above)
    val custs = Seq(
      Cust(1, "Widget Co", 120000.00, 0.00, "AZ"),
      Cust(2, "Acme Widgets", 410500.00, 500.00, "CA"),
      Cust(3, "Widgetry", 410500.00, 200.00, "CA"),
      Cust(4, "Widgets R Us", 410500.00, 0.0, "CA"),
      Cust(5, "Ye Olde Widgete", 500.00, 0.0, "MA")
    )

    // Create the DataFrame without passing through an RDD
    val customerDF : DataFrame = spark.createDataFrame(custs)

    // println("*** DataFrame schema")
    // customerDF.printSchema()

    // println("*** DataFrame contents")
    // customerDF.show()

    // +---+---------------+--------+--------+-----+
    // | id|           name|   sales|discount|state|
    // +---+---------------+--------+--------+-----+
    // |  1|      Widget Co|120000.0|     0.0|   AZ|
    // |  2|   Acme Widgets|410500.0|   500.0|   CA|
    // |  3|       Widgetry|410500.0|   200.0|   CA|
    // |  4|   Widgets R Us|410500.0|     0.0|   CA|
    // |  5|Ye Olde Widgete|   500.0|     0.0|   MA|
    // +---+---------------+--------+--------+-----+

    // println("*** Select and filter the DataFrame")
    val smallerDF =
      customerDF.select("sales", "state").filter($"state".equalTo("CA"))

    // smallerDF.show()

    // +--------+-----+
    // |   sales|state|
    // +--------+-----+
    // |410500.0|   CA|
    // |410500.0|   CA|
    // |410500.0|   CA|
    // +--------+-----+

///////////////////////////////////////////////////////////////////////////////////

    // Convert it to a Dataset by specifying the type of the rows -- use a case
    // class because we have one and it's most convenient to work with. Notice
    // you have to choose a case class that matches the remaining columns.

    // BUT also notice that the columns keep their order from the DataFrame --
    // later you'll see a Dataset[StateSales] of the same type where the
    // columns have the opposite order, because of the way it was created.

    val customerDS : Dataset[StateSales] = smallerDF.as[StateSales]

    // println("*** Dataset schema")
    // customerDS.printSchema()

    // println("*** Dataset contents")
    // customerDS.show()

    // Select and other operations can be performed directly on a Dataset too,
    // but be careful to read the documentation for Dataset -- there are
    // "typed transformations", which produce a Dataset, and
    // "untyped transformations", which produce a DataFrame. In particular,
    // you need to project using a TypedColumn to get a Dataset.

    // val verySmallDS : Dataset[Double] = customerDS.select($"sales".as[Double])

    // println("*** Dataset after projecting one column")
    // verySmallDS.show()

    // +--------+
    // |   sales|
    // +--------+
    // |410500.0|
    // |410500.0|
    // |410500.0|
    // +--------+

    // If you select multiple columns on a Dataset you end up with a Dataset
    // of tuple type, but the columns keep their names.
    val tupleDS : Dataset[(String, Double)] =
      customerDS.select($"state".as[String], $"sales".as[Double])

    // println("*** Dataset after projecting two columns -- tuple version")
    // tupleDS.show()

    // +-----+--------+
    // |state|   sales|
    // +-----+--------+
    // |   CA|410500.0|
    // |   CA|410500.0|
    // |   CA|410500.0|
    // +-----+--------+

    // You can also cast back to a Dataset of a case class. Notice this time
    // the columns have the opposite order than the last Dataset[StateSales]
    // val betterDS: Dataset[StateSales] = tupleDS.as[StateSales]

    // println("*** Dataset after projecting two columns -- case class version")
    // betterDS.show()

    // +-----+--------+
    // |state|   sales|
    // +-----+--------+
    // |   CA|410500.0|
    // |   CA|410500.0|
    // |   CA|410500.0|
    // +-----+--------+

    // Converting back to a DataFrame without making other changes is really easy
    // val backToDataFrame : DataFrame = tupleDS.toDF()

    // println("*** This time as a DataFrame")
    // backToDataFrame.show()

    // +-----+--------+
    // |state|   sales|
    // +-----+--------+
    // |   CA|410500.0|
    // |   CA|410500.0|
    // |   CA|410500.0|
    // +-----+--------+

    // While converting back to a DataFrame you can rename the columns
    val renamedDataFrame : DataFrame = tupleDS.toDF("MyState", "MySales")

    println("*** Again as a DataFrame but with renamed columns")

    renamedDataFrame.show()

    // +-------+--------+
    // |MyState| MySales|
    // +-------+--------+
    // |     CA|410500.0|
    // |     CA|410500.0|
    // |     CA|410500.0|
    // +-------+--------+
  }
}

Original: https://www.cnblogs.com/alamps/p/8333959.html
Author: alamps
Title: Converting between RDD, DataFrame, and Dataset
