




Models built on machine learning and deep learning try to describe the “truth” behind the world through statistics over data, in terms of correlation and probability, but they are still a long way from real “artificial intelligence”. A “knowledge graph” is closer to machine intelligence that can analyze and reason the way humans do.



This article explains knowledge graph concepts in an accessible way and uses a financial risk control case to describe in detail how to build a knowledge graph from scratch, including the steps involved and the issues to consider at each stage.




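Knowledge Graph is a term introduced by Google in 2012; it is also called a semantic network. From an academic point of view, a knowledge graph can be defined as follows: “a knowledge graph is essentially a knowledge base in the form of a semantic network”. That is somewhat abstract. From a practical point of view, a knowledge graph can simply be understood as a multi-relational graph.

So a knowledge graph is, in essence, a semantic network that reveals the relationships between entities.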



In order to better understand the knowledge graph, we must first understand the difference between information and knowledge.

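  • Information refers to external objective facts. Example: here is a bottle of water, and it is currently at 10°.
  • Knowledge is the induction and summary of external objective laws. Example: water freezes at zero degrees.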






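Next, consider the concept of a knowledge graph from several perspectives.

  • From the Web perspective, a knowledge graph supports semantic search by establishing semantic links between data, much like hyperlinks between plain text documents.
  • From the natural language processing perspective, a knowledge graph is semantic, structured data extracted from text.
  • From the knowledge representation perspective, a knowledge graph is a method of representing and processing knowledge with computer symbols.
  • From the artificial intelligence perspective, a knowledge graph is a tool that uses a knowledge base to help understand human language.
  • From the database perspective, a knowledge graph is a method of storing knowledge as a graph.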




Much of today's data is interconnected, and if you want to analyze the value of these connections, a knowledge graph can be an effective tool. As the Internet of Everything arrives, the information carried by these connections is bound to become more valuable, which is the main reason knowledge graphs have developed so rapidly in recent years.




At present, academia has no unified definition of a knowledge graph. Documents released by Google describe a knowledge graph as a technique that uses a graph model to describe knowledge and to model the relationships among everything in the world.



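Specifically, a knowledge graph is a general formal framework for describing semantic knowledge: nodes represent semantic symbols and edges represent the relationships between them.

In a knowledge graph, we usually use “entities” to express the nodes of the graph and “relations” to express its “edges”. Entities are things in the real world, such as people, place names, concepts, drugs, and companies. Relations express some connection between different entities, for example a person “lives in” Shenzhen, Li Ming is Wang Hai's “boss”, or Zhang Duo's “phone number” is 138X….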




Many real-world scenarios are well suited to being expressed as a knowledge graph. For example, a social network graph can contain both “person” and “company” entities. The relationship between two people can be “friend” or “colleague”, and the relationship between a person and a company can be “current employee” or “former employee”. Similarly, a risk control knowledge graph can include “phone” and “company” entities, and the relationship between two phones can be a “call” relationship.




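Next, we look in detail at the three elements that make up a knowledge graph: *entities*, *relationships*, and *attributes*.

  • Entity (sometimes loosely called an ontology): a thing that exists objectively and can be distinguished from other things. It can be a concrete person, event, or object, or an abstract concept or connection. Entities are the most basic elements of a knowledge graph.
  • Relationship: in a knowledge graph, edges represent relationships, which express some connection between different entities.
  • Attribute: both entities and relationships in a knowledge graph can have their own attributes.

In the real world, entities and relationships also have their own attributes; for example, a person can have a “name” and an “age”. When a knowledge graph has attributes, it can be represented as a property graph.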




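The figure above is a simple property graph. It encodes the following facts:

  • Li Ming is Li Fei's father
  • Li Ming is 25 years old and his position is general manager
  • Li Ming and Zhang San are friends
  • Li Ming has a phone number starting with 138
  • The phone number was activated in 2018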
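The facts listed above can be sketched as a small in-memory property graph. This is only an illustration: the node ids, relation names, and property keys are hypothetical, and attaching the activation year to the `HAS_PHONE` edge (rather than to the phone node) is an assumption made to show that edges can carry attributes too.

```python
# Nodes: id -> properties. Ids and property names are hypothetical.
nodes = {
    "li_ming": {"label": "Person", "name": "Li Ming", "age": 25, "title": "general manager"},
    "li_fei": {"label": "Person", "name": "Li Fei"},
    "zhang_san": {"label": "Person", "name": "Zhang San"},
    "phone_1": {"label": "Phone", "prefix": "138"},
}

# Edges: (source id, relation, target id, edge properties).
edges = [
    ("li_ming", "FATHER_OF", "li_fei", {}),
    ("li_ming", "FRIEND_OF", "zhang_san", {}),
    ("li_ming", "HAS_PHONE", "phone_1", {"opened": 2018}),  # assumed edge attribute
]

def outgoing(node_id, relation):
    """Return (target node properties, edge properties) for matching edges."""
    return [(nodes[dst], props) for src, rel, dst, props in edges
            if src == node_id and rel == relation]
```

A graph database with a property graph model stores essentially this shape, with indexes and a query language on top.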


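A knowledge graph application presupposes that the knowledge graph has already been built; the graph can also be thought of as a knowledge base. This is why it can answer certain search questions. For example, if we type “Who is Jay Chou's wife” into the Baidu search engine, we directly get the answer “Kun Ling”. This is because, at the system level, a knowledge base already exists that contains the entities “Jay Chou” and “Kun Ling” and the relationship between them. So when we search, keyword extraction (“Jay Chou”, “Kun Ling”, “wife”) plus matching against the knowledge base yields the final answer directly. This differs from a traditional search engine, which returns web pages rather than the final answer and leaves users to sift and filter the information themselves.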
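A minimal sketch of this lookup idea follows. The single fact comes from the example above; the substring matching is a deliberately naive assumption that stands in for real keyword extraction and entity linking, and the function name `answer` is hypothetical.

```python
# Knowledge base keyed by (entity, relation). The matching logic below is
# a naive stand-in for real keyword extraction, used only for illustration.
knowledge_base = {("Jay Chou", "wife"): "Kun Ling"}

def answer(question):
    """Return the stored value whose entity and relation both occur in the question."""
    for (entity, relation), value in knowledge_base.items():
        if entity in question and relation in question:
            return value
    return None  # no matching fact: fall back to ordinary web search
```

The point of the sketch is the return type: a single answer value rather than a list of pages to read.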






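The architecture of a knowledge graph covers both its logical structure and its system architecture.

Logically, a knowledge graph can be divided into two layers: the schema layer and the data layer. The data layer consists mainly of a series of facts, and knowledge is stored with the fact as the unit. If facts are expressed as triples such as (entity 1, relation, entity 2) and (entity, attribute, attribute value), a graph database can be chosen as the storage medium, for example the open-source Neo4j, Twitter's FlockDB, or JanusGraph. The schema layer is built on top of the data layer and mainly uses an ontology library to regulate how facts are expressed in the data layer. An ontology is the conceptual template of a structured knowledge base; a knowledge base formed through an ontology library has a strong hierarchical structure and little redundancy.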
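To make the two layers concrete, here is a small sketch in which a schema layer constrains which facts the data layer may hold. The relation names, entity types, and the `is_valid` helper are hypothetical; a real ontology would be far richer than a dictionary of type pairs.

```python
# Schema layer: each relation declares the entity types it may connect.
schema = {
    "lives_in": ("Person", "City"),
    "boss_of": ("Person", "Person"),
}

# Entity types known to the graph (hypothetical sample data).
entity_types = {"Li Ming": "Person", "Wang Hai": "Person", "Shenzhen": "City"}

def is_valid(fact):
    """Check an (entity 1, relation, entity 2) triple against the schema."""
    head, relation, tail = fact
    if relation not in schema:
        return False
    head_type, tail_type = schema[relation]
    return entity_types.get(head) == head_type and entity_types.get(tail) == tail_type

# Data layer: only facts that satisfy the schema are stored.
candidates = [
    ("Li Ming", "boss_of", "Wang Hai"),
    ("Li Ming", "lives_in", "Shenzhen"),
    ("Shenzhen", "boss_of", "Li Ming"),  # rejected: a City cannot be a boss
]
facts = [f for f in candidates if is_valid(f)]
```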

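The system architecture of a knowledge graph refers to the structure of its construction pipeline, as shown in the figure below: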




Building the knowledge graph is the basis for every later application, and the prerequisite for building it is extracting data from different data sources. For vertical (domain-specific) knowledge graphs, the data comes mainly from two channels: the business's own data, which usually sits in the company's database tables in structured form, and data publicly crawled from the web, which is usually unstructured and exists as web pages.





The difficulty of information extraction lies in handling unstructured data. When building such graphs, natural language processing techniques are mainly involved in the following areas:

  • Named Entity Recognition
  • Relation Extraction
  • Entity Resolution
  • Coreference Resolution




Building and applying a large-scale knowledge base requires the support of a variety of natural language processing technologies. *Knowledge extraction* extracts knowledge elements such as entities, relationships, and attributes from open semi-structured and unstructured data. *Knowledge fusion* removes ambiguity among references to entities, relationships, attributes, and facts, producing a high-quality knowledge base. *Knowledge reasoning* mines hidden knowledge from the existing knowledge base in order to enrich and extend it. The combined vectors produced by distributed knowledge representation matter a great deal for building, reasoning over, fusing, and applying the knowledge base.




Knowledge extraction mainly targets open linked data and uses automated techniques to extract usable knowledge units. A knowledge unit comprises three kinds of knowledge elements: *entities* (extensions of concepts), *relationships*, and *attributes*. On this basis, a series of high-quality fact expressions is formed, which lays the foundation for building the schema layer above. Knowledge extraction has three main tasks:

  • Entity extraction: also known as named entity recognition, this means automatically recognizing named entities in the raw corpus. Because entities are the most basic elements of a knowledge graph, the completeness, precision, and recall of entity extraction directly affect the quality of the knowledge base. Entity extraction is therefore the most basic and most critical step in knowledge extraction.

  • Relationship extraction: the goal is to establish the semantic links between entities. Early relationship extraction relied mainly on manually constructed semantic rules and templates to identify entity relationships. Later, learned models of the relationships between entities gradually replaced hand-defined grammars and rules.

  • Attribute extraction: attribute extraction targets entities, and attributes together form a complete profile of an entity. Because an entity's attribute can be viewed as a nominal relationship between the entity and the attribute value, attribute extraction can be recast as a relationship extraction problem.
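The early, template-based style of relationship extraction described above can be sketched with regular expressions. The two templates, the relation names, and the capitalized-word heuristic for entity names are all hypothetical simplifications; they only illustrate the idea of hand-written rules, not a production extractor.

```python
import re

# A crude entity pattern: one or more capitalized words (an assumption that
# only works for simple English sentences like the sample below).
NAME = r"[A-Z]\w+(?: [A-Z]\w+)*"

# Hand-written templates, each mapped to a relation name.
TEMPLATES = [
    (re.compile(rf"(?P<head>{NAME}) works at (?P<tail>{NAME})"), "works_at"),
    (re.compile(rf"(?P<head>{NAME}) is the father of (?P<tail>{NAME})"), "father_of"),
]

def extract_triples(text):
    """Apply every template to the text and collect (head, relation, tail) triples."""
    triples = []
    for pattern, relation in TEMPLATES:
        for match in pattern.finditer(text):
            triples.append((match.group("head"), relation, match.group("tail")))
    return triples

sample = "Li Ming works at Acme Bank. Li Ming is the father of Li Fei."
triples = extract_triples(sample)
```

Templates like these are precise but brittle, which is why learned models later took over.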




In recent years, representation learning, typified by deep learning, has made important progress. It can express the semantic information of entities as dense, low-dimensional, real-valued vectors, so that entities, relationships, and the complex semantic associations between them can be computed efficiently in a low-dimensional space. This is of great significance for building, reasoning over, fusing, and applying knowledge bases.
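As one illustration of computing with such vectors, the sketch below uses the translation-style scoring idea from TransE, a well-known embedding model that the article itself does not name. The 2-dimensional vectors are hand-picked rather than learned, and all entity and relation names are hypothetical.

```python
import math

# Hand-picked 2-dimensional embeddings; a real system would learn them from data.
entity_vec = {
    "Li Ming": [0.1, 0.2],
    "Shenzhen": [0.6, 0.9],
    "Beijing": [-0.5, 0.3],
}
relation_vec = {"lives_in": [0.5, 0.7]}

def transe_score(head, relation, tail):
    """Distance between (head + relation) and tail; smaller means more plausible."""
    h, r, t = entity_vec[head], relation_vec[relation], entity_vec[tail]
    return math.sqrt(sum((hi + ri - ti) ** 2 for hi, ri, ti in zip(h, r, t)))

plausible = transe_score("Li Ming", "lives_in", "Shenzhen")
implausible = transe_score("Li Ming", "lives_in", "Beijing")
```

With these vectors, the triple ending in Shenzhen scores lower (better) than the one ending in Beijing, which is how such a model would rank candidate facts.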




Because the knowledge in a knowledge graph comes from a wide range of sources, problems arise: the quality of knowledge varies, knowledge from different data sources is duplicated, and the associations between pieces of knowledge are not clear enough. Knowledge fusion is therefore necessary. Knowledge fusion is a high-level form of knowledge organization in which knowledge from different sources goes through heterogeneous data integration, disambiguation, processing, reasoning and verification, updating, and other steps under a single framework. It integrates data, information, methods, experience, and human ideas to form a high-quality knowledge base.



Knowledge updating is an important part of this. Human cognitive ability, accumulated knowledge, and business needs all grow over time, so the content of a knowledge graph also needs to keep pace. Whether it is a general-purpose or an industry knowledge graph, it must be iterated and updated continuously to extend existing knowledge and add new knowledge.



III. Construction process and design of the knowledge graph



First, we need to fix a specific problem so that the whole design has a clear purpose. In this chapter we use *financial risk control* as the example for describing how a knowledge graph is built.



The construction of a complete knowledge graph includes the following steps:

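  1. Define the specific business problem
  2. Collect the data
  3. Preprocess the data
  4. Design the knowledge graph
  5. Store the knowledge graph
  6. Apply the knowledge graph
  7. Evaluate the system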




Let’s follow this process to talk about what needs to be done and what needs to be thought about in each step.





In this process, the core step is the design of the graph. Once designed, the graph becomes the “brain” of everything that follows, so a good or bad design directly affects later applications. This is similar to designing database tables: an unreasonable design causes many problems down the line. To a large extent, designing a knowledge graph depends on understanding the business and anticipating how it will evolve.



The problem to be solved here is how to assess an applicant's fraud risk by technical means.

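How do we judge a person's fraud risk? Traditional methods rely on individual features such as age, employer, and salary, but these look at only a single point. With a knowledge graph, we can also use relationship features such as the applicant's friends and phone numbers: for example, whether a friend has a record of default, or whether different people registered with the same phone number. This widens the scope of the query from a single point to a whole surface.

When is a knowledge graph needed? Before moving on to the next topic, one thing must be clear: whether your business problem actually needs the support of a knowledge graph system. In many practical scenarios, even when there is some need to analyze relationships, a traditional database can do the job. So, to avoid choosing a knowledge graph just for the sake of using one, here are a few points to consider.

  • Is there a strong need for visualization?
  • Are there scenarios that involve deep search?
  • Are there real-time requirements on query efficiency?
  • Is the data diverse, and do data silos need to be broken down?
  • Do you have the capability and budget to build a knowledge graph system?
  • Is there a need for knowledge reasoning?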




The next step is to identify the data source and do the necessary data preprocessing. For data sources, we need to consider the following:

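  1. What data do we already have?
  2. What data do we not have yet but could obtain?
  3. Which part of the data can be used to reduce risk?
  4. Which part of the data can be used to build the knowledge graph?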



Note that not all data related to anti-fraud must go into the knowledge graph; some principles for making this decision are introduced in more detail in a later section.



For anti-fraud, several data sources readily come to mind, including users' basic information, behavioral data, telecom operator data, e-commerce data, blacklists, and public information on the web. Assuming we already have a list of data sources, the next step is to see which data needs further processing. Unstructured data, for example, more or less requires natural language processing techniques. The basic information that users fill in is mostly stored in business tables; apart from a few fields that need further processing, many fields can be used directly for modeling or added to the knowledge graph system. Behavioral data needs some simple processing to extract useful information such as “how long the user stays on a page”. Openly available web page data requires information extraction techniques.



Take the user's basic information as an example. On the one hand, fields such as name, age, and education can be extracted from the structured database and used directly. On the other hand, the company name a user fills in may need further processing. For example, some users fill in “Beijing Greedy Technology Co., Ltd.” and others fill in “Beijing Wangjing Greedy Technology Co., Ltd.”, yet both refer to the same company. In this case we need to align the company names; for the technical details, see the entity resolution technique mentioned earlier.
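A rough sketch of company name alignment follows. It uses a token-subset heuristic chosen purely for illustration (it is not the article's method, and real entity resolution would need something far more robust); the suffix list and function names are hypothetical.

```python
import re

# Legal-form words that carry no identifying information (hypothetical list).
SUFFIXES = {"co", "ltd", "inc", "company", "limited"}

def name_tokens(name):
    """Lowercase the name, split it into words, and drop legal-form suffixes."""
    tokens = re.findall(r"[a-z0-9]+", name.lower())
    return {t for t in tokens if t not in SUFFIXES}

def same_company(a, b):
    """Treat two names as the same company if one token set contains the other."""
    ta, tb = name_tokens(a), name_tokens(b)
    return bool(ta) and bool(tb) and (ta <= tb or tb <= ta)

short_name = "Beijing greedy Technology Co., Ltd."
long_name = "Beijing Wangjing greedy Technology Co., Ltd."
other_name = "Shanghai greedy Technology Co., Ltd."
```

The heuristic over-merges when a name is very short (for example just a city and “Technology”), so in practice it would only produce candidates for a stricter check.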






When designing a knowledge graph, we will inevitably face the following common questions:

  1. Which entities, relationships, and attributes are needed?
  2. Which attributes can be modeled as entities, and which entities can be modeled as attributes?
  3. Which information does not need to go into the knowledge graph?



Based on these common questions, we abstract a set of design principles from past design experience. Like the normal forms of traditional database design, these principles guide practitioners toward a more reasonable knowledge graph design while keeping the system efficient.




The above is summed up from past experience and may not be completely accurate, but it at least reflects the pitfalls to avoid when designing a knowledge graph. Next, I will go through each point in turn.







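Using the above principles, we can design a graph for the financial anti-fraud domain. A simplified version is shown below. Of course, graphs used in practice are much more complicated than this!



There are two main ways to store a knowledge graph: RDF-based storage and graph-database-based storage. The differences between them are shown in the figure below. An important design principle of RDF is that data should be easy to publish and share, whereas graph databases focus on efficient graph queries and search. In addition, RDF stores data as triples and does not carry attribute information on nodes or edges directly, while graph databases generally use the property graph as their basic representation, so entities and relationships can carry attributes. This makes real business scenarios easier to express.


According to the latest statistics, graph databases are still the fastest-growing type of storage system, while growth in relational databases has stayed essentially flat. The statistics also list commonly used graph database systems ranked by recent usage. Neo4j is currently still the most widely used graph database; it has an active community and efficient queries, but its one shortcoming is that it does not support quasi-distributed deployment. By contrast, OrientDB and JanusGraph (formerly Titan) support distributed deployment, but these systems are relatively new and their communities are less active than Neo4j's, which means you will inevitably run into some thorny problems when using them. If you choose an RDF storage system, Jena may be a good choice.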





After building the knowledge graph, we use it to solve specific problems. For a financial risk control knowledge graph, the first task is to uncover fraud risk hidden in the relationship network. From an algorithmic point of view there are *two different approaches*: one is *rule-based* and the other is *probability-based*. The former relies on expert experience, while the latter is data-driven.


5.1 Rule-based applications



First, let's look at several rule-based applications: *inconsistency verification*, *rule-based feature extraction*, and *pattern-based judgment*.




A simple way to judge risk in the relationship network is inconsistency verification, that is, using rules to identify potential contradictions. These rules are defined manually in advance, so designing them requires some business knowledge. For example, in the figure below, Li Ming and Li Fei both list the same company phone number, but the database shows that they actually work at different companies, which is a contradiction. Many similar rules are possible; they are not all listed here.
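The company phone rule from this example can be sketched as a simple grouping check. The records, field names, and phone numbers below are made up for illustration.

```python
from collections import defaultdict

# Hypothetical applicant records; phone numbers and company names are invented.
records = [
    {"name": "Li Ming", "company": "Company A", "company_phone": "010-0001"},
    {"name": "Li Fei", "company": "Company B", "company_phone": "010-0001"},
    {"name": "Zhang San", "company": "Company C", "company_phone": "010-0002"},
]

def phone_conflicts(records):
    """Return company phones that are claimed by more than one distinct company."""
    companies_by_phone = defaultdict(set)
    for record in records:
        companies_by_phone[record["company_phone"]].add(record["company"])
    return {phone: sorted(companies)
            for phone, companies in companies_by_phone.items()
            if len(companies) > 1}

conflicts = phone_conflicts(records)
```

Each flagged phone number points back to the applicants whose declarations contradict each other.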





We can also use rules to extract features from the knowledge graph. These features are generally based on deep searches over 2-degree, 3-degree, or even higher-degree relationships. For example, we can ask: “What is the borrower's relationship to the other two people?” From the figure, we can easily see that the borrower is Li Fei's father and that Li Ming is Li Fei's friend. Once extracted, these features can be used as inputs to a risk model. To be clear, if the features do not involve deep relationships, a traditional relational database is sufficient.
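A depth-limited breadth-first search is one straightforward way to compute such n-degree features. In the sketch below the graph, the blacklist, and the feature “number of blacklisted nodes within k degrees” are all hypothetical choices made for illustration.

```python
from collections import deque

# Undirected relationship graph as adjacency lists (hypothetical data).
graph = {
    "borrower": ["li_fei", "phone_1"],
    "li_fei": ["borrower", "li_ming"],
    "li_ming": ["li_fei", "zhang_san"],
    "phone_1": ["borrower"],
    "zhang_san": ["li_ming"],
}
blacklist = {"li_ming", "zhang_san"}

def within_degrees(graph, start, max_degree):
    """Map every node reachable within max_degree hops to its hop distance."""
    distance = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if distance[node] == max_degree:
            continue  # do not expand beyond the requested depth
        for neighbor in graph.get(node, []):
            if neighbor not in distance:
                distance[neighbor] = distance[node] + 1
                queue.append(neighbor)
    del distance[start]
    return distance

def blacklisted_within(graph, start, max_degree, blacklist):
    """Feature: how many blacklisted nodes lie within max_degree hops of start."""
    return sum(1 for node in within_degrees(graph, start, max_degree)
               if node in blacklist)
```

The resulting counts can be fed into a risk model alongside the applicant's individual features.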



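This approach is well suited to uncovering group fraud. Its core idea is to use patterns to find groups or sub-graphs that may carry risk, and then to analyze those sub-graphs further. There are many such patterns, including multiple nodes sharing information, triangular relationships, strongly connected graphs, cliques, weakly connected graphs, and so on. Here are a few simple examples. In the figure below, three entities share a lot of other information; we can treat them as a group suspected of fraud and analyze them further.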
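The “multiple nodes sharing information” pattern can be sketched by comparing the attribute nodes attached to each person. The sample data and the threshold of two shared attributes are assumptions chosen for illustration.

```python
from itertools import combinations

# Each person maps to the attribute nodes it is linked to (hypothetical data).
attributes = {
    "person_a": {"phone:138-0001", "addr:Road 1", "device:D1"},
    "person_b": {"phone:138-0001", "addr:Road 1", "device:D2"},
    "person_c": {"phone:138-0001", "addr:Road 1", "device:D1"},
    "person_d": {"phone:139-0002", "addr:Road 9", "device:D7"},
}

def suspicious_pairs(attributes, min_shared=2):
    """Return (person, person, shared count) for pairs sharing enough attributes."""
    pairs = []
    for a, b in combinations(sorted(attributes), 2):
        shared = attributes[a] & attributes[b]
        if len(shared) >= min_shared:
            pairs.append((a, b, len(shared)))
    return pairs

pairs = suspicious_pairs(attributes)
```

Here persons a, b, and c are pairwise linked, so together they form a candidate group for further review.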




As another example, we can find strongly connected graphs within the knowledge graph, mark them, and then do further risk analysis. In a strongly connected graph, every node can reach every other node along some path, which indicates a strong relationship among these nodes.
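Strongly connected sub-graphs of a directed graph can be found with Tarjan's algorithm, sketched below. The choice of algorithm and the sample “calls” graph are assumptions; the recursive form is fine for small graphs but would need an iterative rewrite for very large ones.

```python
def strongly_connected_components(graph):
    """Tarjan's algorithm: return the strongly connected components as sets."""
    index, low = {}, {}
    stack, on_stack = [], set()
    components = []
    counter = 0

    def visit(node):
        nonlocal counter
        index[node] = low[node] = counter
        counter += 1
        stack.append(node)
        on_stack.add(node)
        for neighbor in graph.get(node, []):
            if neighbor not in index:
                visit(neighbor)
                low[node] = min(low[node], low[neighbor])
            elif neighbor in on_stack:
                low[node] = min(low[node], index[neighbor])
        if low[node] == index[node]:  # node is the root of a component
            component = set()
            while True:
                member = stack.pop()
                on_stack.discard(member)
                component.add(member)
                if member == node:
                    break
            components.append(component)

    for node in graph:
        if node not in index:
            visit(node)
    return components

# Directed "calls" edges (hypothetical): a -> b -> c -> a forms a cycle.
calls = {"a": ["b"], "b": ["c"], "c": ["a", "d"], "d": []}
groups = [c for c in strongly_connected_components(calls) if len(c) > 1]
```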


5.2 Methods based on probability and statistics



In addition to rule-based methods, probabilistic and statistical methods can also be used, such as *community mining*, *label propagation*, *clustering*, and so on.





Because community mining is based on probabilistic methods, its advantage is that no rules need to be defined by hand. For a large relationship network in particular, defining rules is itself very complicated.
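One simple community mining technique is label propagation: every node repeatedly adopts the label that is most common among its neighbors until nothing changes. The sketch below is a simplified variant that breaks ties deterministically (largest label wins) so the result is reproducible, whereas the usual formulation breaks ties at random; the sample graph is hypothetical.

```python
from collections import Counter

def label_propagation(graph, max_rounds=10):
    """Assign each node the most common label among its neighbors until stable."""
    labels = {node: node for node in graph}  # every node starts in its own community
    for _ in range(max_rounds):
        changed = False
        for node in sorted(graph):  # fixed order keeps the run reproducible
            if not graph[node]:
                continue
            counts = Counter(labels[neighbor] for neighbor in graph[node])
            top = max(counts.values())
            # Deterministic tie-break (an assumption of this sketch).
            new_label = max(label for label, count in counts.items() if count == top)
            if new_label != labels[node]:
                labels[node] = new_label
                changed = True
        if not changed:
            break
    return labels

# Two triangles joined by a single edge between "c" and "x".
graph = {
    "a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b", "x"],
    "x": ["c", "y", "z"], "y": ["x", "z"], "z": ["x", "y"],
}
labels = label_propagation(graph)
```

On this graph the two triangles end up with different labels, so the weak bridge between them does not merge the communities.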





Compared with rule-based methods, the disadvantage of probability-based methods is that they need enough data. If the amount of data is small and the whole graph is sparse, rule-based methods become the first choice. In the financial sector in particular, labeled data is scarce, which is the main reason rule-based methods are still more widely used there.

5.3 Analysis based on dynamic networks



All of the analyses above are based on a static relationship graph: we do not consider how the graph structure changes over time and focus only on its current structure. However, the structure of the graph does change over time, and these changes can themselves be associated with risk.
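A minimal way to look at such changes is to compare two snapshots of the edge set. The snapshots, the edge encoding as `(node, node)` tuples with a consistent ordering, and the “new edges per node” signal are assumptions made for illustration.

```python
def edge_changes(before, after):
    """Return the edges added and removed between two graph snapshots."""
    before, after = set(before), set(after)
    return {"added": after - before, "removed": before - after}

def new_edge_counts(before, after):
    """Count how many newly added edges touch each node."""
    counts = {}
    for a, b in edge_changes(before, after)["added"]:
        counts[a] = counts.get(a, 0) + 1
        counts[b] = counts.get(b, 0) + 1
    return counts

# Hypothetical snapshots at two points in time.
snapshot_t1 = [("li_ming", "phone_1")]
snapshot_t2 = [("li_ming", "phone_1"), ("li_ming", "phone_2"), ("li_ming", "phone_3")]
counts = new_edge_counts(snapshot_t1, snapshot_t2)
```

A node that suddenly gains many new links, such as a person registering several new phone numbers in a short period, is a natural candidate for closer inspection.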






Knowledge graphs are a challenging and interesting field. Given the right practical application scenario, their value is well worth looking forward to, and knowledge graph technology is likely to be widely used across many fields in the near future.



A knowledge graph is a relatively new tool whose main function is to analyze relationships, especially deep relationships. In business, first make sure it is actually necessary; many problems can be solved without a knowledge graph.



One of the most important capabilities in the knowledge graph field is knowledge reasoning, and knowledge reasoning is a necessary step on the road to strong artificial intelligence.

Finally, it should be emphasized that knowledge graph engineering is still business-focused and data-centered.


This article is the sixteenth in the series Learning NLP from Scratch. I hope you will all show your support and share your thoughts.







Original: https://blog.csdn.net/kobepaul123/article/details/120819406
Author: Yunlord
Title: 强人工智能必经之路?知识图谱超详细总结,快速入门KG首选(万字长文,值得收藏)




