1. Introduction
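Word2Vec is an Estimator that takes sequences of words representing documents and trains a Word2VecModel. The model maps each word to a unique fixed-size vector. The Word2VecModel transforms each document into a vector by averaging the vectors of all the words in the document; this vector can then be used as a feature for prediction, document similarity calculations, and similar tasks.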
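To make the averaging step concrete, here is a minimal plain-Scala sketch that does not depend on Spark. The object name `AverageWordVectors`, the helper `documentVector`, and the word vectors are all hypothetical and made up for illustration; a trained model would learn its own values, and Spark's handling of words missing from the vocabulary may differ from the simplification used here.

```scala
object AverageWordVectors {
  // Hypothetical 3-dimensional word vectors, invented for illustration only.
  val wordVectors: Map[String, Array[Double]] = Map(
    "hi"    -> Array(1.0, 0.0, 2.0),
    "spark" -> Array(3.0, 2.0, 0.0)
  )

  // Average the vectors of the known words in a document, element by element.
  // Assumption: unknown words are simply skipped (a simplification).
  def documentVector(words: Seq[String], size: Int): Array[Double] = {
    val known = words.flatMap(wordVectors.get)
    if (known.isEmpty) Array.fill(size)(0.0)
    else {
      val sum = known.foldLeft(Array.fill(size)(0.0)) { (acc, v) =>
        acc.zip(v).map { case (a, b) => a + b }
      }
      sum.map(_ / known.size)
    }
  }

  def main(args: Array[String]): Unit =
    // prints [2.0, 1.0, 1.0]
    println(documentVector(Seq("hi", "spark"), 3).mkString("[", ", ", "]"))
}
```

The result has the same size as each word vector, regardless of how many words the document contains, which is what makes it usable as a fixed-size feature.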
2. Example
In the following code snippet, we start with a set of documents, each represented as a sequence of words. Each document is transformed into a feature vector, which can then be passed to a learning algorithm.
import org.apache.spark.ml.feature.Word2Vec
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.Row

// Input data: Each row is a bag of words from a sentence or document.
val documentDF = spark.createDataFrame(Seq(
  "Hi I heard about Spark".split(" "),
  "I wish Java could use case classes".split(" "),
  "Logistic regression models are neat".split(" ")
).map(Tuple1.apply)).toDF("text")

// Learn a mapping from words to Vectors.
val word2Vec = new Word2Vec()
  .setInputCol("text")
  .setOutputCol("result")
  .setVectorSize(3)
  .setMinCount(0)
val model = word2Vec.fit(documentDF)

// Transform each document into the average of its word vectors and print it.
val result = model.transform(documentDF)
result.collect().foreach { case Row(text: Seq[_], features: Vector) =>
  println(s"Text: [${text.mkString(", ")}] => \nVector: $features\n")
}
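The introduction mentions document similarity as one use of these vectors. One common way to compare two document vectors is cosine similarity; the sketch below is plain Scala and is not part of the original example. The object name `CosineSimilarity`, the `cosine` helper, and the sample vectors are hypothetical. With the Spark output above, you could presumably obtain arrays by calling `toArray` on each `Vector` in the `result` column.

```scala
object CosineSimilarity {
  // Cosine of the angle between two vectors of the same size.
  // Returns 0.0 when either vector has zero length, to avoid dividing by zero.
  def cosine(a: Array[Double], b: Array[Double]): Double = {
    require(a.length == b.length, "vectors must have the same size")
    val dot = a.zip(b).map { case (x, y) => x * y }.sum
    val normA = math.sqrt(a.map(x => x * x).sum)
    val normB = math.sqrt(b.map(x => x * x).sum)
    if (normA == 0.0 || normB == 0.0) 0.0 else dot / (normA * normB)
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical document vectors, chosen so the results are easy to verify.
    val doc1 = Array(1.0, 0.0, 0.0)
    val doc2 = Array(0.0, 1.0, 0.0)
    println(cosine(doc1, doc1)) // prints 1.0 (identical direction)
    println(cosine(doc1, doc2)) // prints 0.0 (orthogonal)
  }
}
```

Values close to 1 indicate documents whose averaged vectors point in nearly the same direction; with a toy corpus of three sentences and a vector size of 3, such scores should not be read as meaningful.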
Original: https://www.cnblogs.com/yszd/p/13748359.html
Author: 云山之巅
Title: Spark ML 机器学习之Word2Vec (Spark ML Machine Learning: Word2Vec)