[Knowledge Graph] 02 Data Preparation

June 10, 2023, 4:06 AM • Artificial Intelligence

Prerequisites

For installing and configuring MySQL, see:

https://blog.csdn.net/shankezh/article/details/115011889

The data can be imported directly from a SQL file, available at:

https://download.csdn.net/download/shankezh/15939127

Importing the data

Run the import from the command line:

mysql -u root -p < kg_movie.sql

The database contains five tables: actor, actor_to_movie, genre, movie, and movie_to_genre.

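Fixing the data

Why the data needs fixing

If you use this data as-is and follow other tutorials online, you will run into problems once you reach the SPARQL queries: the query results barely change, or the queries fail outright. The causes are:

1. The tables are not actually linked. The movie-to-genre and actor-to-movie tables do not really reference the movie, actor, and genre tables. You could write your own logic to compensate, but you would not get the results shown in other bloggers' tutorials; I still do not know how they obtained correct results.

2. The data in movie_to_genre and actor_to_movie is also incomplete.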
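
One way to check the first problem on your own copy of the data is to look for association rows whose ids have no match in the main tables. The sketch below uses Python's built-in sqlite3 with a tiny made-up dataset as a stand-in for MySQL. The columns movie_id and genre_id follow the statements used later in this post; movie_title, genre_name, and all sample rows are assumptions for illustration only.

```python
import sqlite3

# In-memory stand-in for the kg_movie database (all sample rows are made up).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table movie (movie_id integer primary key, movie_title text);
    create table genre (genre_id integer primary key, genre_name text);
    create table movie_to_genre (movie_id integer, genre_id integer);
    insert into movie values (1, 'Movie A'), (2, 'Movie B');
    insert into genre values (10, 'Drama'), (11, 'Comedy');
    -- (3, 10) and (1, 99) point at ids that do not exist
    insert into movie_to_genre values (1, 10), (3, 10), (1, 99);
""")

# Rows of movie_to_genre with no matching movie or no matching genre.
orphans = conn.execute("""
    select mg.movie_id, mg.genre_id
    from movie_to_genre as mg
    left join movie as m on m.movie_id = mg.movie_id
    left join genre as g on g.genre_id = mg.genre_id
    where m.movie_id is null or g.genre_id is null
    order by mg.movie_id, mg.genre_id
""").fetchall()

print(orphans)
conn.close()
```

The select statement itself is plain SQL, so it should also work in the mysql console against the real tables.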

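What to do

We therefore need to do the following:

1. Link the association tables: actor_to_movie must reference the actor and movie tables, and movie_to_genre must reference the movie and genre tables.

2. Regenerate the data in the two association tables.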

Steps

Linking the association tables

Open a console, log in to MySQL, and proceed as follows:

1. Drop the unused columns:

In actor_to_movie, drop the actor_movie_id column:

alter table actor_to_movie drop actor_movie_id;

In movie_to_genre, drop the movie_genre_id column:

alter table movie_to_genre drop movie_genre_id;

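After the change, the two tables no longer contain these columns (you can confirm this with describe actor_to_movie; and describe movie_to_genre;).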

2. Add the foreign key constraints:

alter table actor_to_movie add constraint actor_movie_id foreign key (actor_id) references actor (actor_id) on delete no action on update no action;

The statement above links to the actor table; the one below links to the movie table:

alter table actor_to_movie add constraint movie_actor_id foreign key (movie_id) references movie (movie_id) on delete no action on update no action;

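on delete no action on update no action sets the referential actions: a row in the parent table cannot be deleted, and its key cannot be updated, while rows in the child table still reference it.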
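
To see what this rule means in practice, here is a minimal sketch using Python's built-in sqlite3 as a stand-in for MySQL (an assumption made so that the example runs without a server). In MySQL's InnoDB engine, no action behaves like restrict, so the blocked delete shown here should fail in a similar way there. The table layout is reduced to the id columns and the sample rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite only enforces foreign keys when this pragma is switched on.
conn.execute("pragma foreign_keys = on")
conn.executescript("""
    create table actor (actor_id integer primary key);
    create table actor_to_movie (
        actor_id integer,
        movie_id integer,
        foreign key (actor_id) references actor (actor_id)
            on delete no action on update no action
    );
    insert into actor values (1), (2);
    insert into actor_to_movie values (1, 100);
""")


def try_delete(actor_id):
    """Return True if the actor row was deleted, False if the foreign key blocked it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute("delete from actor where actor_id = ?", (actor_id,))
        return True
    except sqlite3.IntegrityError:
        return False


blocked = not try_delete(1)  # actor 1 is still referenced by actor_to_movie
deleted = try_delete(2)      # actor 2 is not referenced anywhere
remaining = [row[0] for row in conn.execute("select actor_id from actor")]
print(blocked, deleted, remaining)
conn.close()
```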

Once the constraints are added, you can inspect the result with:

show create table actor_to_movie;

Likewise, link the movie_to_genre table:

alter table movie_to_genre add constraint movie_genre_id foreign key (movie_id) references movie (movie_id) on delete no action on update no action;

alter table movie_to_genre add constraint genre_movie_id foreign key (genre_id) references genre (genre_id) on delete no action on update no action;

Check the result:

show create table movie_to_genre;

You can also use Navicat to view the relationships in an ER diagram.

Regenerating the association data

First delete the existing rows from the actor_to_movie and movie_to_genre tables:

delete from actor_to_movie;

delete from movie_to_genre;

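Then rebuild the associations:

The idea is simply to look up each actor in the cast column of the movie table, and to match each movie's genre column against the genre table. My SQL is not good enough to do this in a single query, so I run the logic in Python and insert the results. The code is as follows: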


import MySQLdb


class SQL_OP:
    def __init__(self, ip, user, psd, db, charset):
        self.ip = ip
        self.user = user
        self.password = psd
        self.db_name = db  # database name, kept separate from the connection object
        self.charset = charset
        self.conn = None
        self.cursor = None

    def open(self):
        # Open the database connection
        self.conn = MySQLdb.connect(self.ip, self.user, self.password, self.db_name, charset=self.charset)
        # Get a cursor for executing statements
        self.cursor = self.conn.cursor()
        print("Database opened")

    def get_execute(self, sentence):
        # Run a query and return all rows
        self.cursor.execute(sentence)
        print("Data fetched")
        return self.cursor.fetchall()

    def set_execute(self, sentence, params=None):
        # Run a write statement; commit on success, roll back on failure
        try:
            self.cursor.execute(sentence, params)
            self.conn.commit()
        except MySQLdb.Error as err:
            self.conn.rollback()
            print("Statement failed:", err)
            return False
        return True

    def close(self):
        self.conn.close()
        print("Database closed")


def sql_generate_relation():
    data_actor_to_movie = []  # (actor_id, movie_id) pairs
    data_movie_to_genre = []  # (movie_id, genre_id) pairs

    sql_op = SQL_OP(ip="localhost", user="root", psd=".root", db="kg_movie", charset='utf8')
    sql_op.open()
    data_movie = sql_op.get_execute("select * from movie")
    data_actor = sql_op.get_execute("select * from actor")
    data_genre = sql_op.get_execute("select * from genre")

    # An actor is linked to a movie when actor[2] occurs in the movie's cast column (movie[9])
    for actor in data_actor:
        for movie in data_movie:
            if actor[2] in movie[9]:
                data_actor_to_movie.append((actor[0], movie[0]))

    # A genre is linked to a movie when genre[1] occurs in the movie's genre column (movie[8])
    for genre in data_genre:
        for movie in data_movie:
            if genre[1] in movie[8]:
                data_movie_to_genre.append((movie[0], genre[0]))

    # print(len(data_actor_to_movie))  # 9711
    # print(len(data_movie_to_genre))  # 14370

    # Insert the actor-to-movie pairs; the driver fills in the %s placeholders
    count = 0
    sql = "insert into actor_to_movie(actor_id, movie_id) values (%s, %s)"
    for actor_movie in data_actor_to_movie:
        sql_op.set_execute(sql, actor_movie)
        count += 1
        print("actor_movie progress :", count, '/', len(data_actor_to_movie))
    print("actor_movie finished")

    # Insert the movie-to-genre pairs
    count = 0
    sql = "insert into movie_to_genre(movie_id, genre_id) values (%s, %s)"
    for movie_genre in data_movie_to_genre:
        sql_op.set_execute(sql, movie_genre)
        count += 1
        print("movie_genre progress :", count, '/', len(data_movie_to_genre))
    print("movie_genre finished")
    print("All done")
    sql_op.close()


def sql_data_clean():
    sql_op = SQL_OP(ip="localhost", user="root", psd=".root", db="kg_movie", charset='utf8')
    sql_op.open()
    data_movie = sql_op.get_execute("select * from movie")
    # Strip double quotes (") from movie_bio, which is movie[1]
    count = 0
    for movie in data_movie:
        if movie[1] and '"' in movie[1]:
            count += 1
            print('find id :', movie[0], "... progress :", count, "/", len(data_movie))
            movie_bio = movie[1].replace('"', '')
            # Placeholders keep apostrophes in the text from breaking the statement
            sql = "update movie set movie_bio = %s where movie_id = %s"
            sql_op.set_execute(sql, (movie_bio, movie[0]))
    print("movie_bio quote replacement finished")
    sql_op.close()


if __name__ == '__main__':
    sql_generate_relation()
    # sql_data_clean()
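The code is basic, so I won't walk through it. Do not run it more than once, since each run would insert the same rows again.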
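
The matching step can be tried on toy data without a database. The sketch below reproduces the substring test from the script and, for comparison, a stricter variant that splits the cast string and compares whole names. The sample rows, the function names, and the "/" separator are all assumptions for illustration; check how names are actually separated in your movie table before relying on the stricter variant.

```python
# Made-up rows; only the fields used for matching are kept.
actors = [(1, "Li Na"), (2, "Li Nan")]               # (actor_id, actor_name)
movies = [(100, "Li Nan/Wang Wei"), (101, "Li Na")]  # (movie_id, cast string)


def substring_pairs(actors, movies):
    """Same idea as the script: link when the name is a substring of the cast string."""
    return [
        (actor_id, movie_id)
        for actor_id, name in actors
        for movie_id, cast in movies
        if name in cast
    ]


def exact_pairs(actors, movies, sep="/"):
    """Stricter variant: split the cast string on sep and compare whole names."""
    pairs = []
    for actor_id, name in actors:
        for movie_id, cast in movies:
            names = {part.strip() for part in cast.split(sep)}
            if name in names:
                pairs.append((actor_id, movie_id))
    return pairs


loose = substring_pairs(actors, movies)
strict = exact_pairs(actors, movies)
print(loose)   # includes (1, 100) because "Li Na" is a substring of "Li Nan"
print(strict)
```

The substring test is simple but can over-match when one name is contained in another, which is worth keeping in mind when checking the row counts.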

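Data cleaning

The data contains some special characters. If this cleaning step is skipped, problems will appear later when importing into neo4j for visualization. If you do not need visualization, you can skip this step.

Here we clean the data in code; the goal is to remove the double quotes (") it contains.

We use Python for a quick cleanup: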


# Same helper class as in the previous script; only sql_data_clean() is run here.
import MySQLdb


class SQL_OP:
    def __init__(self, ip, user, psd, db, charset):
        self.ip = ip
        self.user = user
        self.password = psd
        self.db_name = db  # database name, kept separate from the connection object
        self.charset = charset
        self.conn = None
        self.cursor = None

    def open(self):
        # Open the database connection
        self.conn = MySQLdb.connect(self.ip, self.user, self.password, self.db_name, charset=self.charset)
        # Get a cursor for executing statements
        self.cursor = self.conn.cursor()
        print("Database opened")

    def get_execute(self, sentence):
        # Run a query and return all rows
        self.cursor.execute(sentence)
        print("Data fetched")
        return self.cursor.fetchall()

    def set_execute(self, sentence, params=None):
        # Run a write statement; commit on success, roll back on failure
        try:
            self.cursor.execute(sentence, params)
            self.conn.commit()
        except MySQLdb.Error as err:
            self.conn.rollback()
            print("Statement failed:", err)
            return False
        return True

    def close(self):
        self.conn.close()
        print("Database closed")


def sql_data_clean():
    sql_op = SQL_OP(ip="localhost", user="root", psd=".root", db="kg_movie", charset='utf8')
    sql_op.open()
    data_movie = sql_op.get_execute("select * from movie")
    # Strip double quotes (") from movie_bio, which is movie[1]
    count = 0
    for movie in data_movie:
        if movie[1] and '"' in movie[1]:
            count += 1
            print('find id :', movie[0], "... progress :", count, "/", len(data_movie))
            movie_bio = movie[1].replace('"', '')
            # Placeholders keep apostrophes in the text from breaking the statement
            sql = "update movie set movie_bio = %s where movie_id = %s"
            sql_op.set_execute(sql, (movie_bio, movie[0]))
    print("movie_bio quote replacement finished")
    sql_op.close()


if __name__ == '__main__':
    sql_data_clean()
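
Text fields such as movie_bio may also contain apostrophes, which break an update statement that is assembled by string formatting. A common way around this is to let the database driver bind the values. The sketch below shows the cleanup on one made-up row, using Python's built-in sqlite3 as a stand-in for MySQL; note that sqlite3 uses ? as its placeholder while MySQLdb uses %s. The helper name strip_quotes and the sample text are assumptions for illustration.

```python
import sqlite3


def strip_quotes(text):
    """Remove ASCII double quotes from a string."""
    return text.replace('"', '')


conn = sqlite3.connect(":memory:")
conn.execute("create table movie (movie_id integer primary key, movie_bio text)")
conn.execute(
    "insert into movie values (?, ?)",
    (1, 'He said "it\'s over" and left'),
)

rows = conn.execute("select movie_id, movie_bio from movie").fetchall()
for movie_id, bio in rows:
    if bio and '"' in bio:
        # The driver binds the values, so the apostrophe cannot break the SQL.
        conn.execute(
            "update movie set movie_bio = ? where movie_id = ?",
            (strip_quotes(bio), movie_id),
        )
conn.commit()

cleaned = conn.execute("select movie_bio from movie where movie_id = 1").fetchone()[0]
print(cleaned)
conn.close()
```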

Done.

Original: https://blog.csdn.net/shankezh/article/details/115014660
Author: 飘散风中
Title: [Knowledge Graph] 02 Data Preparation

