Scenario + Metrics + Samples: "Machine Learning Model"


Define the scenario in which the model will be used, that is, the business logic. A scenario can be framed in several ways, for example as binary or multi-class classification. Most of the time we are working on banking business whose goal is to increase profit, but that goal is not something a machine learning task can work with directly. Suppose the business is marketing: the system sends users text messages recommending wealth management products. We frame this as binary classification. To find out which wealth management products a user is likely to buy, we have the model predict, for a given product, the probability that the user buys it or does not buy it.
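As a rough illustration of turning the marketing scenario into binary-classification samples, the sketch below labels every (user, product) pair that received a text message with 1 if a purchase followed and 0 otherwise. The function name, the record layout, and the sample IDs are assumptions made for illustration; the original does not describe how its samples are built.

```python
def build_samples(sms_sent, purchases):
    """Label each (user_id, product_id) pair that received an SMS.

    Label 1 means the user bought the product, 0 means they did not.
    Both inputs are hypothetical lists of (user_id, product_id) tuples.
    """
    purchased = set(purchases)
    return [
        (user_id, product_id, 1 if (user_id, product_id) in purchased else 0)
        for user_id, product_id in sms_sent
    ]


# Hypothetical logs: three users were sent an SMS for the same product.
sms_sent = [("u1", "fund_a"), ("u2", "fund_a"), ("u3", "fund_a")]
purchases = [("u2", "fund_a")]

samples = build_samples(sms_sent, purchases)
for sample in samples:
    print(sample)
```

A model trained on such samples then outputs a purchase probability for each (user, product) pair, which is the prediction described above.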

Data cleaning. Data cleaning is essentially the same as in traditional big data processing: some features may need to be filled in or removed. The technologies used are MR (MapReduce) or Spark, together with domain knowledge from a knowledge graph.
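The "fill in or remove" step can be sketched in plain Python as follows. This is only a minimal single-machine illustration, not the MapReduce or Spark job the text refers to. The `clean` function, the 50% missing-value threshold, and the choice of the median as the fill value are all assumptions.

```python
from statistics import median


def clean(rows, numeric_fields, max_missing_ratio=0.5):
    """Drop fields that are mostly missing; fill the rest with the median.

    `rows` is a list of dicts in which a missing value is None.
    The threshold and the median fill are illustrative choices.
    """
    cleaned = [dict(row) for row in rows]  # do not mutate the input
    total = len(rows)
    for field in numeric_fields:
        present = [row[field] for row in rows if row.get(field) is not None]
        missing_ratio = (total - len(present)) / total if total else 1.0
        if not present or missing_ratio > max_missing_ratio:
            # Too sparse to be useful: remove the feature entirely.
            for row in cleaned:
                row.pop(field, None)
        else:
            # Sparse but usable: complete the missing values.
            fill_value = median(present)
            for row in cleaned:
                if row.get(field) is None:
                    row[field] = fill_value
    return cleaned


rows = [
    {"age": 30, "income": None},
    {"age": None, "income": None},
    {"age": 40, "income": 5000},
]
cleaned = clean(rows, ["age", "income"])
print(cleaned)
```

Here `age` is missing in one row out of three, so it is filled, while `income` is missing in two rows out of three and is removed.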

Feature extraction. We need to generate features from the data. Features are in effect fields of the data, but they exist only for machine learning. The data may contain, say, a person's gender and age, while the generated features can have hundreds of thousands or even millions of dimensions. With a linear model, for instance, we cannot feed the raw data in directly. How should feature extraction be done? That depends on the model framework used downstream, since we must produce a format the framework supports. In practice we defined a DSL for feature extraction, so that users can generate a Spark job from a simple description. We built an AST parser for the DSL, and it supports output formats such as libsvm or TensorFlow's TFRecord.
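To make the gap between raw fields and model features concrete, here is a small sketch that one-hot encodes gender and a bucketed age and writes the result in libsvm text format (`label index:value ...`, with indices in ascending order). It does not reproduce the DSL or the Spark job described above. The helper names, the ten-year age buckets, and the sample records are assumptions.

```python
def age_bucket(age):
    """Map an age to a coarse ten-year bucket, e.g. 34 -> '30s'."""
    return f"{(age // 10) * 10}s"


def build_index(records, fields):
    """Assign a 1-based feature index to each distinct (field, value) pair."""
    index = {}
    for record in records:
        for field in fields:
            key = (field, record[field])
            if key not in index:
                index[key] = len(index) + 1
    return index


def to_libsvm(label, record, fields, index):
    """Render one record as a libsvm line with one-hot features."""
    ids = sorted(
        index[(field, record[field])]
        for field in fields
        if (field, record[field]) in index  # skip values unseen at index time
    )
    return " ".join([str(label)] + [f"{i}:1" for i in ids])


# Hypothetical raw data: two fields plus the purchase label.
raw = [
    {"gender": "F", "age": 34, "bought": 1},
    {"gender": "M", "age": 38, "bought": 0},
]
fields = ["gender", "age_bucket"]
records = [{"gender": r["gender"], "age_bucket": age_bucket(r["age"])} for r in raw]
index = build_index(records, fields)
lines = [to_libsvm(r["bought"], rec, fields, index) for r, rec in zip(raw, records)]
print("\n".join(lines))
```

Each distinct categorical value becomes its own dimension, which is why a handful of raw fields can expand into a very large, sparse feature space once high-cardinality fields and feature crosses are included.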

Model training. There are many choices at the training stage. Machine learning algorithms that have been proven in industry include LR, GBDT, DNN and NB (Naive Bayes). There is also our in-house algorithm HETreeNet, which converts discrete values to continuous values, because tree models handle continuous values better. Different frameworks can be used; TensorFlow, for example, is a good framework for DNNs.
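As an illustration of the simplest algorithm in that list, the following is a minimal logistic regression (LR) trained with batch gradient descent in plain Python. It is a teaching sketch under assumed hyperparameters and toy data, not the training code used by the author, and it does not attempt to reproduce HETreeNet, whose internals the text does not describe.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(w, b, x):
    """Probability of the positive class for one feature vector."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)


def train_lr(X, y, lr=0.5, epochs=1000):
    """Fit weights and bias by batch gradient descent on the log loss."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = predict_proba(w, b, xi) - yi  # d(loss)/d(logit)
            for j in range(d):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


# Toy data: the label depends only on the first feature.
X = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
y = [0, 0, 1, 1]

w, b = train_lr(X, y)
print(predict_proba(w, b, [1.0, 0.0]), predict_proba(w, b, [0.0, 1.0]))
```

In production the same model would normally be trained with an established library or framework rather than hand-written loops.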

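Model deployment. Once deployed, the model is a service; we can deploy it as a microservice or as a process started on a single machine. We currently use a Thrift server. After going live we still have to solve problems such as load balancing and high availability, as well as authentication and authorization, for which we use the AK/SK (access key / secret key) method.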
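The text does not say how its AK/SK scheme works. One common pattern is for the client to send its access key together with an HMAC signature computed with the secret key, which the server recomputes and compares. The sketch below shows that pattern using Python's standard `hmac` and `hashlib` modules. The message layout, the `/predict` path, and the demo keys are assumptions, and this is request signing rather than encryption of the payload.

```python
import hashlib
import hmac


def sign(secret_key, method, path, timestamp, body):
    """Return a hex HMAC-SHA256 signature over the request (assumed layout)."""
    message = "\n".join(
        [method, path, timestamp, hashlib.sha256(body).hexdigest()]
    )
    return hmac.new(
        secret_key.encode(), message.encode(), hashlib.sha256
    ).hexdigest()


def verify(secrets, access_key, signature, method, path, timestamp, body):
    """Server side: look up the secret by access key and check the signature."""
    secret_key = secrets.get(access_key)
    if secret_key is None:
        return False
    expected = sign(secret_key, method, path, timestamp, body)
    # compare_digest avoids leaking information through timing differences.
    return hmac.compare_digest(expected, signature)


# Hypothetical key store and request.
SECRETS = {"AK_demo": "SK_demo_secret"}
body = b'{"user_id": "u1"}'
sig = sign("SK_demo_secret", "POST", "/predict", "1700000000", body)

ok = verify(SECRETS, "AK_demo", sig, "POST", "/predict", "1700000000", body)
tampered = verify(
    SECRETS, "AK_demo", sig, "POST", "/predict", "1700000000", b'{"user_id": "u2"}'
)
print(ok, tampered)
```

A real service would also reject requests whose timestamp is too old, so that a captured signature cannot be replayed.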

Self-learning. Unlike ordinary applications, most machine learning models are time-sensitive. Take recommendations in Toutiao: if everyone has been following entertainment over the past month, entertainment-related features may become important, so we need to keep training the model on incremental data. This requires some SDK functionality and support for different data sources. Model training may be offline, in which case we simply read the data from a database, whereas self-learning may need to consume data from Kafka or other streaming sources. Our model framework also supports online learning, that is, updating the model weights online.
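Updating model weights online can be sketched as a model that applies one gradient step per incoming event instead of retraining on the full dataset. The example below uses a sparse logistic regression and a generator standing in for a streaming source; it does not use a real Kafka client. The class name, the learning rate, and the synthetic events are assumptions.

```python
import math
from collections import defaultdict


class OnlineLR:
    """Logistic regression over sparse binary features, updated per event."""

    def __init__(self, lr=0.1):
        self.lr = lr
        self.w = defaultdict(float)  # feature id -> weight

    def predict(self, feature_ids):
        z = sum(self.w[i] for i in feature_ids)
        return 1.0 / (1.0 + math.exp(-z))

    def update(self, feature_ids, label):
        """One stochastic gradient step on the log loss for a single event."""
        err = self.predict(feature_ids) - label
        for i in feature_ids:
            self.w[i] -= self.lr * err


def event_stream():
    """Stand-in for a streaming consumer: yields (feature_ids, label)."""
    for _ in range(200):
        yield [1, 2], 1  # e.g. entertainment-related features, clicked
        yield [3], 0     # some other feature, not clicked


model = OnlineLR()
before = model.predict([1, 2])
for feature_ids, label in event_stream():
    model.update(feature_ids, label)
after = model.predict([1, 2])
print(before, after)
```

Because each event changes the weights immediately, features that have recently become important gain weight without waiting for the next offline training run.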

Original: https://www.cnblogs.com/linn/p/12165171.html
Author: 凌度
Title: 机器学习




