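Knowledge graph products: ideas for a project proposal
The overall technical route is graph neural networks (GNN). The plan is to cut the cost of early-stage corpus crawling and to draw on the strength of the team. I am increasingly aware of how much a team matters: within one, members can build a product together, complement each other's skills, and build up their résumés step by step. What happened last year, though, must be avoided: the technical groups were so fragmented that technical barriers formed inside the team. Techniques need to stay interoperable across the team, and the project itself needs to be finished properly.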
1. Difficulty and legality of obtaining information sources
Natural language processing depends heavily on access to information sources. The team does not want to focus on early-stage corpus crawling. The corpus matters a great deal for the results, but the main goal in the early phase is to get the whole framework to run end to end and produce output.
How hard the sources are to obtain determines how much cost and effort the team must put into setting up the product framework at the start, and we do not want to spend too much manpower or money here. In the early stage we can pull data second-hand from aggregator sites such as Qichacha (企查查), aiming for convenience and speed, even if the crawled data is incomplete or contains errors: get something working first, then improve it. Once the product is ready for commercial use and needs to scale, and commercial agreements are being signed, we switch to obtaining data from the original news channels. The approach is imitation: whatever data sources existing secondary credit-lookup sites such as Qichacha and Tianyancha (天眼查) use, we crawl the same ones. If they cannot use a source, we certainly cannot either.
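1.1 Difficulty of obtaining information sources
See the link (detailed explanation to be added).
1.2 Legality of obtaining information sources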


We must build something first, then apply for intellectual property protection as soon as possible, and hand the funding application over to Wang Wen. For data, follow the crowd: whatever the existing credit-information products already use must be data sources that can be crawled and are allowed to be crawled. If we cannot use them, neither can they, and the whole category would collapse together.
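What are the differences among third-party company-information lookup services such as Tianyancha (天眼查), Qichacha (企查查), and Qixinbao (启信宝)? (Zhihu)
Data sources of the three services (Tianyancha, Qichacha, Qixinbao):
- The national enterprise information publicity system, China Judgments Online (中国裁判文书网), the China practice information disclosure site (中国执业信息公开网), the national intellectual property site, the trademark site, and the copyright office site
Expected product: a recommender system whose recommended items are companies. It should surface the people who can actually be contacted, make each company's business and investment amounts clear, highlight key news, and support quick screening.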

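Products of this kind all advertise themselves as nationwide company-information lookup systems. But if you run a B2B business and want to use them to find large accounts or the contact details of key persons (KP) at many companies, or if you are an owner or head of sales who wants to supply the sales team with target customers in bulk, the information in Tianyancha, Qichacha, and Qixinbao is not a good fit, even with a paid membership.
So where do they differ?
First, the data sources:
Relying on more advanced crawling technology, the mainstream customer-acquisition systems on the market cover more than 1,000 websites across the web. Their data sources include: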
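Public government data
For example, sites for business registration, qualifications, tenders, financing, tax, and legal information.
Business-oriented web data
For example, company websites, vertical sites, industry sites, recruitment sites, portals, map sites, and other data that sits closer to a company's actual business.
News and media website data
For example, Sina, Weibo, Toutiao, Ifeng, Tieba, mini programs, apps, and Maimai.
With at least three times as many mainstream data sources, these systems easily outclass the three products above.
More importantly, the breadth of data sources determines how accurate and complete the data is, especially the contact details of B2B customers.
By comparison, the three lookup products record only annual-report information. These systems collect from thousands of sources across at least eight main channels: 1. company websites 2. annual reports 3. B2B sites 4. maps 5. company information 6. industry sites 7. recruitment sites 8. tender information and other platforms.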

Dead-number detection and intelligent recommendation help users reach the people actually responsible at a company, not just the legal representative registered under the company name. The pain point being addressed: a direct search does not tell you how good a lead really is. After clicking through, users can see in more depth how the tool achieves this.
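The main B2B lead platforms today are of the following kinds:
1. Qichacha, Tianyancha, Qixinbao, and the like. These platforms are usually membership-based and affordable; practically everyone in the workplace has an account. They are positioned mainly as company-information lookup. A company record includes a contact field, but it comes from a single kind of source, mainly annual business-registration filings, and many of the contact details are invalid. Such a platform suits products aimed at a broad mass market, and can be combined with group calling or batch robot calls to screen customers.
2. Tanji (探迹), Soukebao (搜客宝), Xiaoke (销氪), and the like. These platforms integrate data better. Compared with platforms such as Qichacha, they differ mainly as follows: (1) The data comes from more dimensions, including company websites, Baidu Maps, job postings, information published on various platforms (B2B sites, local-service sites, classified portals, and so on), and annual business-registration filings. The filters are fine-grained, so the resulting data is relatively more precise. (2) Data is updated daily, so the newest phone numbers are available immediately. Of these, Tanji has the finest and richest filters and is also fairly expensive. Soukebao has iterated quickly of late and is broadly similar to Tanji in B2B data and filter dimensions, at a lower price and better value for money. (3) Number cleaning and tagging of bookkeeping-agency numbers filter out invalid numbers.
All of the platforms above provide unmasked phone numbers, and all except Tanji support data export.
3. Big-data platforms from telecom operators such as China Unicom. These can capture the users who visit a specified app or website, and usually charge per record. The overall cost is higher and the data is relatively precise. Typically, however, you are given a platform and must place calls through it, and the numbers are masked.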
2. Technical route of the product
Output is product-oriented, and the product is entered in competitions alongside other tracks. The most direct outputs are software copyrights, papers (no submissions to low-tier venues), and competitions (no entries in non-financial tracks), which lay a foundation for the résumé.
From the résumé angle, I should strengthen my applied experience and study in machine learning, deep learning, and neural networks. Starting from the goal I want to reach, I work backward to the project experience I need. The projects should ideally relate to one another, so that they form a recognizable profile, and should connect to the programs I apply to. So far my experience is too focused on data, specifically text data, and barely touches business data analysis (I should enter more data-science competitions) or building strategies on financial data. The next step is to consider a change of direction: how to steer the project toward something more commercial, such as building up a user flow, and to look more at what other people's internships and project experience actually involve.
Knowledge graphs and neural networks on their own are different fields, so we must think through how to bring the power of graph neural networks to knowledge graphs. Ways to research this: 1. I already have reports on knowledge graphs, but the technical routes they apply and explain are fairly limited. 2. Search WeChat and Zhihu for applications of graph neural networks and their general approach (a rough literature review must be finished during the proposal stage).



Note: given the size of the engineering effort, it is best to choose a technical route that is already very mature, with plenty of ready-made code and existing explanations online, where the only difference is the application scenario, object, or domain. For example, take the application of graph neural networks to social networks and turn it into a study of the "social" relationship network among companies, producing a recommender system.
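- Commercial value of the product
Application area 1 did not capture the relatedness between entities that a knowledge graph is meant to express, so it was essentially no different from a word cloud. I hope to build a product with a real market (sold through outsourcing, then shifting to in-house output, so that the lab gives something back to its senior members). I do not yet fully understand the effect and role of equity. Take care to put measures in place to protect this cohort's intellectual property, and also consider what further experience an internship would require.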
1.1 Feasibility of the project
This determines whether the project can produce output, and deliver its full value, before you apply. Consider the headcount the project needs (for example building the front-end pages, connecting front end and back end, crawling and cleaning data, and building the internal models) and the time it needs (set expectations, plan ahead, and produce output within half a year).
1.2 Intrinsic value of the project
The idea borrows from recommender systems in social media, which use the social network (a semantic network built from users' clicks, purchases, visits, and similar data) to provide intelligent recommendation and relationship search. We swap the objects: we study investment, supply, competition, cooperation, and other relationships between companies. Clicking on a company leads to a sentiment-analysis view that shows the company's top 10 key news items.
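As a rough illustration of swapping users for companies, the sketch below stores typed relationships between companies and recommends companies that share the most neighbors with a given one. The company names, the relation records, and the shared-neighbor scoring are all assumptions made for illustration; the post does not specify a scoring method.

```python
from collections import defaultdict

# Hypothetical relation records: (company_a, relation_type, company_b).
RELATIONS = [
    ("A Corp", "investment", "B Ltd"),
    ("A Corp", "supply", "C Inc"),
    ("D Group", "investment", "B Ltd"),
    ("D Group", "supply", "C Inc"),
    ("D Group", "cooperation", "E Co"),
    ("F LLC", "competition", "E Co"),
]


def build_neighbors(relations):
    """Map each company to the set of companies it is directly related to."""
    neighbors = defaultdict(set)
    for a, _relation, b in relations:
        neighbors[a].add(b)
        neighbors[b].add(a)
    return neighbors


def recommend(company, relations, top_k=3):
    """Rank companies not yet linked to `company` by number of shared neighbors."""
    neighbors = build_neighbors(relations)
    linked = neighbors[company]
    scores = {}
    for other, other_neighbors in neighbors.items():
        if other == company or other in linked:
            continue
        shared = len(linked & other_neighbors)
        if shared > 0:
            scores[other] = shared
    # Highest score first; ties broken alphabetically for stable output.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:top_k]


print(recommend("A Corp", RELATIONS))
```

Shared-neighbor counting is only the simplest baseline; a GNN-based recommender would replace the scoring step while keeping the same graph input.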
1.3 Dependent and continuing value of the project
Set up the lab under its own name and establish a formal organization to open up competition tracks externally. Define a clear feedback mechanism with both short-term and long-term forms; in the short term, competitions serve as the main medium and vehicle.
2. Reference: a practical application
Research on corporate credit-risk contagion based on graph neural networks (GNN)
When the city gate catches fire, the fish in the moat suffer. A deterioration in a related party's credit risk can spread to an entity whose own standing was acceptable. Research on how credit risk is transmitted from the related parties of bond issuers has therefore become both important and necessary.
- Discovering and confirming related parties, especially hidden ones
Discovering and confirming related parties depends on a company's relationship information. Which relationship information can reveal related parties, whether publicly disclosed relationships point to the true related parties, and whether deep relationship chains can in practice be mined efficiently through database queries are all questions that data analysis, model building, and product development must each consider.
- Quantifying risk contagion among related parties with a large relationship network and a small sample
Companies are linked by many kinds of relationships, but not every relationship transmits credit risk. Relationships of the same type vary in strength, and companies differ in their own resilience to risk. Both factors must be taken into account, or large numbers of healthy companies will wrongly be flagged as affected. In addition, data on contagion cases is scarce and risk contagion is highly abstract, which makes model training very challenging.
Using an enterprise knowledge graph to untangle relationships between companies
A knowledge graph is a graph-based data structure: a semantic network made up of nodes (entities) and edges (relationships), often used to analyze problems involving complex relationships. We therefore try to build a company-centered knowledge graph (hereafter the enterprise graph), that is, to present company relationship information as a graph backed by a graph database, in order to mine a company's related parties.
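To make the node-and-edge structure concrete, here is a minimal sketch that stores an enterprise graph as (head, relation, tail) triples and searches for a multi-hop relationship chain between two companies. It uses an in-memory dictionary rather than a graph database, and the company names and relation labels are hypothetical.

```python
from collections import deque

# Hypothetical triples: (head entity, relation, tail entity).
TRIPLES = [
    ("Company A", "holds_shares_in", "Company B"),
    ("Company B", "guarantees", "Company C"),
    ("Company C", "supplies", "Company D"),
    ("Company A", "supplies", "Company E"),
]


def build_adjacency(triples):
    """Index outgoing edges by head entity: head -> [(relation, tail), ...]."""
    adjacency = {}
    for head, relation, tail in triples:
        adjacency.setdefault(head, []).append((relation, tail))
    return adjacency


def find_path(triples, start, goal, max_hops=4):
    """Breadth-first search for a shortest directed relation path, or None."""
    adjacency = build_adjacency(triples)
    queue = deque([(start, [])])
    visited = {start}
    while queue:
        node, path = queue.popleft()
        if node == goal:
            return path
        if len(path) >= max_hops:
            continue  # do not expand beyond the hop limit
        for relation, tail in adjacency.get(node, []):
            if tail not in visited:
                visited.add(tail)
                queue.append((tail, path + [(node, relation, tail)]))
    return None


for step in find_path(TRIPLES, "Company A", "Company D"):
    print(step)
```

The `max_hops` limit matters in practice: deep chains grow quickly in a dense company graph, so bounding the search depth keeps the query cheap.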
When it comes to mining hidden related parties, concealment methods keep getting more sophisticated and mining keeps getting harder. Extensive research and analysis showed us that troubled companies, in order to disguise tunneling of benefits or to dress up their financial statements, generally have business or fund flows with hidden related parties. We therefore start from the target company's main supply chain and fund counterparties and, on the enterprise graph, consider association patterns and the clustering distance between the various attributes of associated entities, so as to find relationship paths in the graph that may carry a hidden association.
Not all relationship data helps in mining related parties, however. If we simply ran calculations on the enterprise graph described above, later modeling would inevitably be disturbed by useless relationship information. Our approach is to draw on expert experience and cross-check it against backtests on historical data, and so select the relationship types that matter for studying risk contagion.
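One way to picture this double check is sketched below: a relationship type is kept only if experts shortlisted it and historical cases show risk spreading along it often enough. The case records, the thresholds `min_rate` and `min_cases`, and the relation names are assumptions for illustration, not values from the source.

```python
# Hypothetical backtest records: (relation_type, did risk actually spread?).
HISTORY = [
    ("guarantee", True), ("guarantee", True), ("guarantee", False),
    ("supply", True), ("supply", False), ("supply", False), ("supply", False),
    ("same_address", True), ("same_address", True),
]


def screen_relation_types(history, expert_shortlist, min_rate=0.3, min_cases=2):
    """Keep relation types that pass both the expert filter and the backtest filter."""
    counts = {}
    for relation, transmitted in history:
        total, hits = counts.get(relation, (0, 0))
        counts[relation] = (total + 1, hits + (1 if transmitted else 0))
    kept = []
    for relation in sorted(counts):
        total, hits = counts[relation]
        passes_backtest = total >= min_cases and hits / total >= min_rate
        if relation in expert_shortlist and passes_backtest:
            kept.append(relation)
    return kept


print(screen_relation_types(HISTORY, {"guarantee", "supply"}))
```

Here "supply" is dropped because only one of four historical cases spread risk, and "same_address" is dropped because no expert shortlisted it, even though its backtest rate is high.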
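Combining business experience with artificial intelligence to explore quantified risk-contagion rules
Experts judge risk contagion differently, each from their own point of view, and relying only on accumulated expert judgment is too subjective. Many attempts based on pure AI algorithms such as graph embedding have also performed poorly in practice. So we combine business experience with AI: prior investment knowledge is built into the AI model, and the machine corrects the expert experience through learning, giving a method that fits business logic better and can quantify and predict the degree of contagion and the risks that may arise. For sample cases, positive and negative datasets are generated from the contagion successes and failures supplied by risk-control experts at Ping An Asset Management.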
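To capture this experience, we set up BI rules for early-stage model training, such as a ranking of relationship types by importance, attribute thresholds specific to each relationship type, and reinforcement effects when relationship types combine. To handle disagreement between experts, we use an ensemble strategy: each expert's weight is determined by how well that expert's rules perform when paths are backtested.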
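The ensemble idea can be sketched as follows: each expert's rule is scored against historical cases, and the accuracies are normalized into weights for a weighted vote. The two rules, their thresholds, the feature names, and the use of plain accuracy as the performance measure are all assumptions for illustration.

```python
# Hypothetical expert rules: each maps a feature dict to True (risk spreads) or False.
RULES = {
    "expert_a": lambda f: f["guarantee"] >= 2.0e9,   # large direct guarantee
    "expert_b": lambda f: f["share_ratio"] >= 0.5,   # controlling stake
}

# Hypothetical backtest cases: (features of a relationship path, did risk spread?).
CASES = [
    ({"share_ratio": 0.6, "guarantee": 2.5e9}, True),
    ({"share_ratio": 0.6, "guarantee": 0.0}, False),
    ({"share_ratio": 0.1, "guarantee": 3.0e9}, True),
    ({"share_ratio": 0.1, "guarantee": 0.0}, False),
]


def backtest_accuracy(rule, cases):
    """Share of historical cases where the rule's call matches the outcome."""
    hits = sum(1 for features, transmitted in cases if rule(features) == transmitted)
    return hits / len(cases)


def expert_weights(rules, cases):
    """Normalize backtest accuracies so the weights sum to 1."""
    accuracies = {name: backtest_accuracy(rule, cases) for name, rule in rules.items()}
    total = sum(accuracies.values())
    if total == 0:
        # No rule was ever right: fall back to equal weights.
        return {name: 1.0 / len(rules) for name in rules}
    return {name: accuracy / total for name, accuracy in accuracies.items()}


def ensemble_score(rules, weights, features):
    """Weighted vote in [0, 1]; higher means more weight behind 'risk spreads'."""
    return sum(weights[name] * (1.0 if rule(features) else 0.0)
               for name, rule in rules.items())


weights = expert_weights(RULES, CASES)
print(weights)
print(ensemble_score(RULES, weights, {"share_ratio": 0.6, "guarantee": 0.0}))
```

Accuracy is the simplest choice of backtest metric; with few positive cases, a metric such as recall or F1 might separate experts more usefully.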
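For model design, as shown in Figure 1, the AI probability model uses a probabilistic graph structure. Each company is abstracted as a node holding its main features, and each arrow points in the direction of credit-risk transmission. From the node features and the type and strength of the relationships between nodes, the model computes the corresponding conditional probabilities. This carries over the experts' prior knowledge and, as the model learns from cases, its parameters are tuned continually to correct the prior bias in expert experience.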
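The source does not give the form of the conditional probabilities. As one possible reading, the sketch below uses a noisy-OR combination: a company stays safe only if its own standalone risk does not materialize and no risky neighbor transmits risk along an incoming edge. The noisy-OR form, the strength values, and the company names are assumptions, not the model described in the post.

```python
# Hypothetical transmission strength per relation type (illustrative values only).
RELATION_STRENGTH = {"guarantee": 0.8, "shareholding": 0.5, "supply": 0.2}


def propagate_risk(base_risk, edges, order):
    """Noisy-OR risk propagation over a directed acyclic graph.

    base_risk: company -> standalone risk probability from its own features.
    edges: list of (source, relation_type, target); risk flows source -> target.
    order: companies listed so that every source precedes its targets.
    """
    incoming = {}
    for source, relation, target in edges:
        incoming.setdefault(target, []).append((source, relation))
    risk = {}
    for company in order:
        survive = 1.0 - base_risk[company]
        for source, relation in incoming.get(company, []):
            # Each risky neighbor independently fails to transmit with this probability.
            survive *= 1.0 - risk[source] * RELATION_STRENGTH[relation]
        risk[company] = 1.0 - survive
    return risk


BASE_RISK = {"Tourism Co": 0.9, "Developer": 0.1, "Supplier": 0.05}
# The developer guarantees the tourism company's debt, so risk flows to the developer.
EDGES = [("Tourism Co", "guarantee", "Developer"), ("Developer", "supply", "Supplier")]
ORDER = ["Tourism Co", "Developer", "Supplier"]

for name, value in propagate_risk(BASE_RISK, EDGES, ORDER).items():
    print(name, round(value, 4))
```

In a learned version, the per-relation strengths would be the parameters tuned on contagion cases rather than fixed by hand.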

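Advanced innovation: introducing graph neural networks (GNN) to improve contagion accuracy and coverage
In actual training we found that although a probabilistic graphical model can correct its own parameters through learning, it has difficulty adjusting for errors in the contagion paths themselves. Experts' field of view is also inherently limited and cannot cover every pattern of risk contagion. We therefore introduced the more flexible graph neural network (GNN) algorithm to model risk contagion quantitatively.
A graph neural network is a neural network that runs directly on a graph structure, which makes it well suited to learning from graph-structured data. Neural networks have made breakthroughs in image and speech, but highly abstract graph data in non-Euclidean space still leaves room to explore. In the GNN model, the many related companies are assembled into one graph, a neural network is designed on that graph, and the companies' state features and relationship features are learned together. Unlike the probabilistic graphical model, the GNN model's contagion paths and parameters come entirely from learning on training samples, so it is not bound by the limits of expert experience. For training, to cope with sparse learning cases, we borrow ideas from reinforcement learning and train jointly with the probabilistic graphical model: after initial training, the results of the probabilistic graphical model are carried into the GNN model; in later learning on real cases, the GNN model keeps picking up new contagion patterns and exploits its ability to attend to global information all at once.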
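To show what "a neural network that runs on the graph" means mechanically, here is a minimal sketch of one message-passing layer in plain Python: each company averages its own feature vector with those of its neighbors, then applies a linear map and ReLU. Mean aggregation is a simplification chosen for readability (GCN, for instance, uses a degree-normalized sum), and the features, edges, and weights are made up; the source does not describe its actual architecture.

```python
def mean_aggregate_layer(features, edges, weight):
    """One graph layer: new state = ReLU(mean of self and neighbor states, times W)."""
    n = len(features)
    neighbors = [{i} for i in range(n)]      # self-loop keeps each node's own state
    for i, j in edges:                       # relations treated as undirected here
        neighbors[i].add(j)
        neighbors[j].add(i)
    in_dim, out_dim = len(weight), len(weight[0])
    output = []
    for i in range(n):
        group = sorted(neighbors[i])
        mean = [sum(features[j][d] for j in group) / len(group) for d in range(in_dim)]
        row = [max(0.0, sum(mean[d] * weight[d][k] for d in range(in_dim)))
               for k in range(out_dim)]
        output.append(row)
    return output


# Three hypothetical companies with two features each, linked in a chain 0 - 1 - 2.
FEATURES = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
EDGES = [(0, 1), (1, 2)]
IDENTITY = [[1.0, 0.0], [0.0, 1.0]]          # stand-in for a learned weight matrix

for row in mean_aggregate_layer(FEATURES, EDGES, IDENTITY):
    print(row)
```

Stacking such layers lets information travel several hops, which is how a GNN can pick up contagion paths that no expert rule listed. In practice the weight matrix is learned, and a library such as PyTorch Geometric or DGL would replace the hand-written loops.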

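Source: A Comprehensive Survey on Graph Neural Networks. JOURNAL OF LATEX CLASS FILES, VOL. X, NO. X, DECEMBER 2018
Case study
In a certain month of 2020, an n-year bond issued by a large real-estate holding company could not be repaid in full and on time, principal and interest, which constituted a material default. In its announcement the issuer attributed the default to force majeure such as the COVID-19 epidemic, which hit its business hard, and to the effects of "deleveraging and the difficulty private companies face in financing and issuing bonds", with cash flowing out continuously and liquidity extremely tight. The capital market took a wait-and-see attitude toward this explanation. The model had identified in 2018 that risk was being transmitted to this issuer from a related party it had guaranteed, warning two years in advance that its credit risk had deteriorated severely and its default risk was extremely high.
KYZ risk-contagion model alerts:
In June 2018, a cultural-tourism company received a high-risk alert because of its weak asset base and poor profitability; at the end of October 2018 it was rated at the highest risk level, on the verge of default.
At the end of October 2018, a high-risk alert was issued for the real-estate company. The specific attribution was that it held as much as 60% of the cultural-tourism company and had a large direct guarantee relationship exceeding 2 billion.
The assets and operations of the real-estate company and its cultural-tourism company are independent and market-based, but in practice a controlling shareholder is often tied to a listed company in countless ways, such as business transactions and credit endorsement for financing. Trouble on one side is likely to affect the other. Some also hold that cultural-tourism real estate, with its large investment and long payback period, puts heavy strain on the parent holding company's overall cash flow.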

KYZ Viewpoint | Sensing danger in public opinion: how to foresee financial risk from public sentiment
https://mp.weixin.qq.com/s?__biz=MzAwODMwMDUxNQ==&mid=2649490668&idx=1&sn=6b294a6ed02db87f7faf42811d973b5f&chksm=83685660b41fdf76657dbfbe8323fe12fb13eda77edd1f0e5dc87ac3ce359b0fb0285052b044&mpshare=1&scene=23&srcid=1126PqtwI1aLS3xj1q6jJ7Ao&sharer_sharetime=1637900783245&sharer_shareid=6e98a966c082f84bfcbeab38fd2c28af#rd
Latest theoretical advances and applied explorations in graph neural networks (GNN) (report download included)
https://mp.weixin.qq.com/s?__biz=MzUxNjcxMjQxNg==&mid=2247491324&idx=4&sn=a93333fe228061ce3551403c3696419e&chksm=f9a26c73ced5e5654849ae0bd5569b0ba226758497eee2def3f26a0e4df640991b209ce22c42&mpshare=1&scene=23&srcid=1126EDpMzqHmkJqFuVSlsqmT&sharer_sharetime=1637901112190&sharer_shareid=6e98a966c082f84bfcbeab38fd2c28af#rd





Company information and graph neural networks: does it say that crawling the data sources is easy?

Are graph neural networks (GCN, GraphSage, GAT) used in companies' production recommender systems? (Zhihu)

Original: https://blog.csdn.net/weixin_51117061/article/details/121569855
Author: HIT_SunJiankun
Title: Knowledge graph products: ideas for a project proposal (Part 1)