图数据库hugegraph如何快速导入10亿+数据

2023年6月1日上午11:13 • 人工智能 • 阅读 92

随着社交、电商、金融、零售、物联网等行业的快速发展，现实社会织起了了一张庞大而复杂的关系网，亟需一种支持海量复杂数据关系运算的数据库即图数据库。本系列文章是学习知识图谱以及图数据库相关的知识梳理与总结

本文会包含如下内容：

如何快速导入10亿+数据到图数据库

本篇文章适合人群：架构师、技术专家、对知识图谱与图数据库感兴趣的高级工程师

1. hugegraph环境

hugegraph server版本 0.11.2，后端存储使用的是 RocksDB， 单机版本。服务器版本是： CPU: 2 * Intel(R) Xeon(R) CPU E5-2680 v4 @ 2.40GHz，内存：256GB，硬盘：SAS盘

loader版本 0.11.1

测试时loader在Server所在的服务器上执行。如果下载、安装部署Server,loader请参考：https://hugegraph.github.io/hugegraph-doc/quickstart/hugegraph-server.html

数据准备

2.1 原始数据说明

测试导入的数据是Stanford的公开数据集 Friendster ;，该数据约31G左右。

Friendster是一个在线游戏网络。在作为游戏网站重新推出之前，Friendster是一个用户可以建立友谊的社交网站，该数据就是用户之间的关系数据。该数据由Web Archive项目提供，其中有完整的图表。详见：http://snap.stanford.edu/data/com-Friendster.html。

此数据集的统计信息如下：共有65608366个顶点，1806067135条边，约18亿+的数据量。

请自行下载com-friendster.ungraph.txt.gz 压缩包，大小约8.7GB，经过gunzip解压后，大小约31GB

解压后读取前10行数据，如下图，文件结构很简单，每一行代表一条边，其包含两列，第一列是源顶点Id，第二列是目标顶点Id，两列之间以\t分隔。另外，文件最上面几行是一些概要信息，它说明了文件共有65608366个顶点，1806067135条边（行）.

因为是记录用户之间的关系数据，所以文件中的顶点数据有很多是重复的。为了能正确的导入顶点数据，需要针对顶点数据进行去重处理。

2.2 数据处理

因为顶点数据量很大，需要采用在处理海量数据领域颇为常见的处理手段：分而治之，具体步骤如下：

将原始文件中的全部顶点Id利用hash算法，分成较均匀的若干份，保证在每份之间没有重复的，在每份内部允许有重复的；
对每一份文件，定义一个内存的set容器用于判断某个Id是否存在，针对处理的每个顶点Id，如果不包含，则加入到容器，并写入到新文件中，否则跳过不处理
脚本运行命令如下： python friendster-process.py com-friendster.ungraph.txt com-friendster.vertex.txt，注意：在原始文件的所在的目录执行，第一个参数是原始文件名称，第二个参数是处理后的文件名，默认使用取余算法，分成10份，并将结果写到当前路径。

脚本运行完成后，可以看到merge file com-friendster.vertex.txt contain 65608366 lines，与文件描述相符的。

具体源代码如下：

#!/usr/bin/python
coding=utf-8
import os
import sys

class BigFileProcess:
    '''
    大文件处理类，处理策略是：将大文件按hash策略拆分为小文件，然后再将小文件合并
    '''
    shard_file_dict = {}
    shard_prefix = 'tmp_shard_'

    def __init__(self, raw_file_path, out_file_name, out_dir='./', shard_num=10):
        self.raw_file_path = raw_file_path
        self.out_dir = out_dir
        self.shard_num = shard_num
        self.out_file_name = out_file_name

    def create_shard_files(self):
        for i in range(self.shard_num):
            shard_file = open(os.path.join(self.out_dir, self.shard_prefix + str(i)), "w")
            self.shard_file_dict[i] = shard_file

    def close_shard_files(self):
        for file in self.shard_file_dict.values():
            file.close()

    def del_shard_files(self):
        for key in self.shard_file_dict.keys():
            shard_file = os.path.join(self.out_dir, self.shard_prefix + str(key))
            if os.path.exists(shard_file):
                os.remove(shard_file)
        print("删除{}个shard files完成".format(self.shard_num))

    def split_shard_file(self):
        print("创建并打开{}个shard files".format(self.shard_num))
        self.create_shard_files()
        line_count = 0
        with open(self.raw_file_path, "r") as lines:
            for line in lines:
                line_count += 1
                # Skip comment line
                if line.startswith('#'):
                    continue
                # 不设置分隔符means split according to any whitespace,包括空格、换行(\n)、制表符(\t)等
                parts = line.strip().split(maxsplit=1)
                if not len(parts) == 2:
                    print("line {} format error".format(parts))

                # Python3 整型是没有限制大小的，可以当作 Long 类型使用
                # 将源顶点、目标顶点写到对应的shard file中
                for i in range(2):
                    node_id = int(parts[i])
                    index = node_id % self.shard_num
                    self.shard_file_dict[index].write(str(node_id) + '\n')

                if line_count % 1000000:
                    print("已经处理{}行".format(line_count))
        print("共处理{}行".format(line_count))
        print("Split original file:{} info {} shard files".format(self.raw_file_path, self.shard_num))
        self.close_shard_files()
        print("关闭{}个shard files完成".format(self.shard_num))

    def merge_shard_file(self):
        output_file_path = os.path.join(self.out_dir, self.out_file_name)
        print("Prepare duplicate and merge {} shard files into %s".format(self.shard_num, output_file_path))
        merge_file = open(output_file_path, "w")
        line_count = 0

        # 基于单个文件去重，并合并到单个文件中
        for key in self.shard_file_dict.keys():
            shard_file = os.path.join(self.out_dir, self.shard_prefix + str(key))
            with open(shard_file, "r+") as lines:
                elems = {}
                # 处理每一行
                for line in lines:
                    # Filter duplicate id
                    if not (line in elems):
                        elems[line] = ""
                        merge_file.write(line)
                        line_count += 1
            print("Processed shard file {} completed".format(shard_file))
        merge_file.close()
        print("all {} shard files and merge into {} completed".format(self.shard_num, output_file_path));
        print("merge file {} contain {} lines".format(output_file_path, line_count))

if __name__ == '__main__':
    #file_process = BigFileProcess("com-friendster.ungraph.txt", "com-friendster.vertex.txt")
    print('参数个数为:{}个参数'.format(len(sys.argv)))
    print('参数列表:{}'.format(str(sys.argv)))
    file_process = BigFileProcess(sys.argv[1], sys.argv[2])

    file_process.split_shard_file()
    file_process.merge_shard_file()
    file_process.del_shard_files()

3. 导入准备

在安装loader程序的example目录下创建com-friendster目录，此目录中的文件说明如下：

com-friendster.ungraph.txt 是下载的friendster数据压缩包的解压后的文件

com-friendster.vertex.txt 是处理后只包含顶点的文件

schema.groovy 是图模型描述文件

struct.json 是数据源映射文件

3.1 编写图模型

这一步是建模的过程，用户需要对自己已有的数据和想要创建的图模型有一个清晰的构想，然后编写 schema 建立图模型

针对friendster数据，顶点和边除了Id外，都没有其他的属性，所以图的schema其实很简单

schema.propertyKey("id").asInt().ifNotExist().create();
schema.vertexLabel("person").useCustomizeNumberId().properties("id").ifNotExist().create();
schema.indexLabel("personByAge").onV("person").by("id").range().ifNotExist().create();
schema.indexLabel("personById").onV("person").by("id").unique().ifNotExist().create();
schema.edgeLabel("friend").sourceLabel("person").targetLabel("person").ifNotExist().create();

3.1 编写数据源映射文件

输入源的映射文件用于描述如何将输入源数据与图的顶点类型/边类型建立映射关系，以 JSON格式组织，由多个映射块组成，其中每一个映射块都负责将一个输入源映射为顶点和边。

针对friendster数据，只有一个顶点文件和边文件，且文件的分隔符都是\t，所以将input.format指定为TEXT，input.delimiter使用默认即可。

顶点有一个属性id，而顶点文件头没有指明列名，所以需要显式地指定input.header为[“id”]，input.header的作用是告诉loader文件的每一列的列名是什么，但要注意：列名并不一定就是顶点或边的属性名，描述文件中有一个field_mapping域用来将列名映射为属性名。

边没有任何属性，边文件中只有源顶点和目标顶点的Id，需要先将input.header指定为[“source_id”, “target_id”]，这样就给两个列取了不同的名字，然后通过field_mapping映射为源顶点和目标顶点的主键列即id列

{
  "vertices": [
    {
      "label": "person",
      "input": {
        "type": "file",
        "path": "example/com-friendster/com-friendster.vertex.txt",
        "format": "TEXT",
        "header": ["id"],
        "charset": "UTF-8",
        "skipped_line": {
          "regex": "(^#|^//).*"
        }
      },
      "null_values": ["NULL", "null", ""],
      "id": "id"
    }
  ],
  "edges": [
    {
      "label": "friend",
      "source": ["source_id"],
      "target": ["target_id"],
      "input": {
        "type": "file",
        "path": "example/com-friendster/com-friendster.ungraph.txt",
        "format": "TEXT",
        "header": ["source_id", "target_id"],
        "skipped_line": {
          "regex": "(^#|^//).*"
        }
      },
      "field_mapping": {
        "source_id": "id",
        "target_id": "id"
      }
    }
  ]
}

3.3 执行导入

执行命令后，loader就会开始导入数据，并会打印进度到控制台上，等所有顶点和边导入完成后，会看到以下统计信息。

导入数据约花费50分钟，其中顶点的导入速率是2.2万/秒，边的导入速率是：60.8万/秒，且导入的点、边数据量与friendster的描述一致

hugegraph的图数据占用磁盘空间约80GB，相较于原始数据大小，扩大了约2倍

并且hugegraph-hubble无法使用，因为图 testgraph 已占用磁盘 585,728MB 超过上限 102,400MB 【此空间比实际磁盘占用空间要大很多，不知道原因】，提示不能在hubble上面使用。

nohup bin/hugegraph-loader.sh -h 12.25.12.19 -p 8080 -g testgraph -f example/com-friendster/struct.json -s example/com-friendster/schema.groovy &

3. 简单查询

在hugegraph-studio中执行

3.1 统计点边的数量

g.V().count() 可以正确统计出数量为： 65608366

g.E().count() 可能因为边的数量太大，导致Script evaluation exceeded the configured ‘scriptEvaluationTimeout’ threshold of 30000 ms 执行超时而报错

3.2 查询点边数据

g.V().has(“id”,101) 或 g.V(101) 查询顶点ID是101的顶点

g.V(101).outE() 查询ID是101的顶点的OUT方向的所有邻接边

Original: https://blog.csdn.net/penriver/article/details/115124350
Author: java编程艺术
Title: 图数据库hugegraph如何快速导入10亿+数据

原创文章受到原创版权保护。转载请注明出处：https://www.johngo689.com/556541/

转载文章受原作者版权保护。转载请注明原作者出处！

人工智能

【自取】最近整理的，有需要可以领取学习：

Linux核心资料大放送~

全栈面试题汇总（持续更新&可下载）

一个提高学习100%效率的工具！

【超详细】深度学习面试题目！

LeetCode Python刷题答案下载！

LeetCode Java版刷题答案下载！

LeetCode C++ 版本，抓紧保存！

LeetCode GO语言刷题答案下载！

pandas 实现无关联key数据交叉连接（cross join）

有两个数据帧，分别有一列col1，col2，他们没有相同的key： left = pd.DataFrame({‘col1’ : [‘A’, ‘B’, ‘C’]}) right = …

人工智能 2023年7月8日
0060
CloudCompare点云配准基本操作

CloudCompare基本介绍官方网站https://cloudcompare.org/官方文档https://cloudcompare.org/doc/qCC/CloudCo…

人工智能 2023年7月27日
0060
入门cv必读的10篇baseline论文

1.NIPS-2012，Alexnet：深度学习CV领域划时代论文，具有里程碑意义《ImageNet Classification with Deep Convolutional…

人工智能 2023年7月13日
0046
opencv深度学习入门笔记整理

一、图像的基本操作（1）读取图像 Img = cv2.imread（"xx.jpg"） img的数据类型为ndarray的格式（2）图像显示可以多次调用，…

人工智能 2023年7月20日
0060
neo4j的Cypher语法进阶

删除操作需要先删除关系，才能删除关系关联的节点 删除关系 match ()-[r:nati…

人工智能 2023年6月1日
0077
ROS+ubuntu20.04+opencv4.5.5

环境:ubuntu20.04opencv4.5.5ros: noetic 参考链接:在ROS中使用OpenCV进行简单的图像处理ROS中使用opencv第一个链接注释比较多，第二个…

人工智能 2023年6月10日
0074
YOLOv5 Windows环境下的C++部署（GPU）

YOLOv5 Windows环境下的C++部署（GPU）文章目录 YOLOv5 Windows环境下的C++部署（GPU）前言 1、环境介绍 2、环境配置 3、.torchsc…

人工智能 2023年5月26日
0077
人工智能在医疗领域的应用

1.背景分析人工智能是研究开发用于模拟和延伸人的智能的理论，方法，技术和应用系统的一项新技术科学，它的结构类似金字塔结构：上层是算法，中间是芯片，第三层是各种软硬件平台，最下面是…

人工智能 2023年6月24日
0056
矩阵求导（本质、原理与推导）详解

矩阵求导是机器学习与深度学习的基础，它是高等数学、线性代数知识的综合，并推动了概率论与数理统计向多元统计的发展。在一般的线性代数的课程中，很少会提到矩阵导数的概念；而且在网上寻找矩…

人工智能 2023年6月15日
0064
minimum error threshold 原理浅谈及c++实现

参考论文来自1986kittler，原理论证可以自行参考其论文，这里只关心其具体实现步骤。看看如何从一副图像得到最小误差阈值法下的参数threshold。首先对一幅图像，通过cal…

人工智能 2023年7月20日
0057
【NeRF】在yenchenlin/nerf-pytorch上运行新的数据集

现在有的东西数据集：和yen给出测试数据集进行对比圈出来的文件是有的，不确定其他没有的文件影不影响运行先试一下再说。 ; 在yen上运行自己的数据集 yen 是这么说的也就是说，y…

人工智能 2023年7月21日
0090
新手入门深度学习 | 2-2：结构化数据建模流程示例

抵扣说明： 1.余额是钱包充值的虚拟货币，按照1:1的比例进行支付金额的抵扣。2.余额无法直接购买下载，可以购买VIP、C币套餐、付费专栏及课程。 Original: https:…

人工智能 2023年5月26日
0072
Opencv-霍夫圆检测-代码解析

目录目录前言一、霍夫圆检测代码二、函数解析 1.cv2.HoughCircles函数 2.双边滤波：bilateralFilter() 函数 3.形态学操作-开运算 4.c…

人工智能 2023年6月19日
0090
回归的误差服从正态分布吗_【判断题】多线线性回归的误差项服从正态分布

【判断题】多线线性回归的误差项服从正态分布更多相关问题【判断题】抽样、量化和编码的逆过程分别是利用低通滤波器、解量化和解码实现的。 A. 正确 B. 错误【单选题】关于周邦彦…

人工智能 2023年6月18日
0063
shape_based_matching代码解读0422

写作本系列文章旨在就个人学习该论文及其开源项目做一个学习分享和交流。原论文篇名：Gradient Response Maps for Real-TimeDetection of …

人工智能 2023年6月18日
0079
『Transformer/BERT』Transformer和BERT的位置编码

为什么要对位置进行编码？ Attention提取特征的时候，可以获取全局每个词对之间的关系，但是并没有显式保留时序信息，或者说位置信息。就算打乱序列中token的顺序，最后所得到的…

人工智能 2023年5月27日
0047

2024 年 4 月
一	二	三	四	五	六	日
1	2	3	4	5	6	7
8	9	10	11	12	13	14
15	16	17	18	19	20	21
22	23	24	25	26	27	28
29	30