[Machine Translation - Datasets] Batch-Downloading All WMT Data (A Preliminary Solution)


Preface

WMT is the main event for machine translation and machine translation research, held annually in connection with large conferences on natural language processing. In 2006, the first Workshop on Machine Translation took place at the annual meeting of the North American chapter of the Association for Computational Linguistics. In 2016, with the rise of neural machine translation, WMT became a conference in its own right, though the Conference on Machine Translation is still mainly known as WMT [1].
Some machine translation papers use the WMT datasets released over the years as their experimental data [2], as shown in the figure below:

[Figure: excerpt from a paper [2] listing the WMT datasets used in its experiments]

To reproduce such results, the first step is to assemble these datasets. Take WMT13 [3] as an example: as shown in the figure below, every published sub-dataset has to be downloaded by hand and then merged into the full WMT13 training, validation, and test sets. Since the sub-datasets come in different formats and there are quite a few of them, this is rather tedious overall.

[Figure: the WMT13 translation task page, where each sub-dataset must be downloaded manually]

It turns out that huggingface [4] has already collected the WMT data for some of the years and provides a download interface. Taking all of the wmt14 hi-en data as an example, the final download result looks like this:

[Figure: the downloaded wmt14 hi-en data in the local cache, stored as .arrow files]
(It dawned on the author only afterwards that simply finding a way to open the .arrow files would have yielded the data... oh well.)
This post summarizes a preliminary solution for fetching all WMT data in batch, implemented by modifying the source code of the huggingface datasets library.

Implementation

Step 1: install the datasets library with pip install datasets.
Step 2: clone the library with git clone https://github.com/huggingface/datasets. The datasets/datasets path inside the clone contains the loading scripts for every dataset the library provides:

[Figure: the datasets/datasets directory of the cloned repository, one folder per dataset script]
Step 3: create the main script (run.py) with the code below, where py_file_path is the datasets/datasets path mentioned above and save_dir is the local directory to save to:
from datasets import load_dataset
import os

# language pairs provided by the datasets library for each WMT year
wmt_dict = {
    "wmt14": [(lang, "en") for lang in ["cs", "de", "fr", "hi", "ru"]],
    "wmt15": [(lang, "en") for lang in ["cs", "de", "fi", "fr", "ru"]],
    "wmt16": [(lang, "en") for lang in ["cs", "de", "fi", "ro", "ru", "tr"]],
    "wmt17": [(lang, "en") for lang in ["cs", "de", "fi", "lv", "ru", "tr", "zh"]],
    "wmt18": [(lang, "en") for lang in ["cs", "de", "et", "fi", "kk", "ru", "tr", "zh"]],
    "wmt19": [(lang, "en") for lang in ["cs", "de", "fi", "gu", "kk", "lt", "ru", "zh"]] + [("fr", "de")],
}

# path to datasets/datasets inside the cloned repository
py_file_path = r"C:\Users\13359\PycharmProjects\for_fun\other\wmt_datasets\datasets\datasets"
# local directory the downloaded data is cached in
save_dir = r"D:\dataset\mt"

for wmt in wmt_dict:
    for lang_tuple in wmt_dict[wmt]:
        lang_pair = "-".join(lang_tuple)  # e.g. ("hi", "en") -> "hi-en"
        print(f"wmt: {wmt} | lang_pair: {lang_pair}")
        # load the local wmtXX script and download/cache this language pair
        load_dataset(os.path.join(py_file_path, wmt), name=lang_pair, cache_dir=save_dir)
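For reference, a single load_dataset call returns a DatasetDict whose records hold one "translation" field (the patch in step 5 relies on this layout). A hedged sanity check, once a download has finished:

from datasets import load_dataset
import os

py_file_path = r"C:\Users\13359\PycharmProjects\for_fun\other\wmt_datasets\datasets\datasets"
save_dir = r"D:\dataset\mt"

# load one already-cached configuration and peek at a record
ds = load_dataset(os.path.join(py_file_path, "wmt14"), name="hi-en", cache_dir=save_dir)
print(ds)              # DatasetDict with train/validation/test splits
print(ds["train"][0])  # e.g. {'translation': {'hi': '...', 'en': '...'}}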

Step 4: pick any wmt folder under the datasets/datasets path, e.g. wmt14, and copy the wmt_utils.py inside it into the same directory as run.py. (The author does not know exactly why this is needed; presumably the dynamically loaded wmtXX.py script imports wmt_utils, and in testing this copy avoided errors.) The directory layout is then as follows:

run.py's directory:
├── run.py
└── wmt_utils.py

Step 5: running run.py at this point already produces all language pairs for all WMT years, as described in the preface, but in arrow format. Instead of simply working out how to convert the .arrow files into a friendlier format, the author chose to modify the pip-installed datasets source code so that the saving step itself also writes plain text. Concretely, tracing the execution of load_dataset from run.py (Ctrl+B in the IDE) leads to the code that writes the data: load_dataset (run.py) -> builder_instance.download_and_prepare (load.py, line 1738) -> self._download_and_prepare (builder.py, line 638) -> self._prepare_split (builder.py, line 723). Since self here is a wmtxx object, and the concrete wmtxx.py (e.g. wmt14.py under datasets/datasets/wmt14/wmt14.py) shows the inheritance chain wmtxx -> Wmt (wmt_utils.py) -> GeneratorBasedBuilder (builder.py) -> DatasetBuilder (builder.py), the self._prepare_split call resolves to the _prepare_split method of GeneratorBasedBuilder.
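This chain can be confirmed programmatically; a small sketch (reusing py_file_path from run.py; class names in the printout depend on the installed datasets version):

import inspect
import os
from datasets import load_dataset_builder

py_file_path = r"C:\Users\13359\PycharmProjects\for_fun\other\wmt_datasets\datasets\datasets"

# instantiate the builder for one configuration and inspect its class hierarchy
builder = load_dataset_builder(os.path.join(py_file_path, "wmt14"), name="hi-en")
print(type(builder).__mro__)  # should run through Wmt, GeneratorBasedBuilder, DatasetBuilder
print(inspect.getsourcefile(type(builder)._prepare_split))  # should point into builder.py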
This method performs the creation of the .arrow data; the relevant code is shown below:

with ArrowWriter(
    features=self.info.features,
    path=fpath,
    writer_batch_size=self._writer_batch_size,
    hash_salt=split_info.name,
    check_duplicates=check_duplicate_keys,
) as writer:
    try:
        for key, record in logging.tqdm(
            generator,
            unit=" examples",
            total=split_info.num_examples,
            leave=False,
            disable=not logging.is_progress_bar_enabled(),
            desc=f"Generating {split_info.name} split",
        ):
            example = self.info.features.encode_example(record)
            writer.write(example, key)
    finally:
        num_examples, num_bytes = writer.finalize()

split_generator.split_info.num_examples = num_examples
split_generator.split_info.num_bytes = num_bytes

Here, generator yields all of the data as (key, record) pairs, where each record has the form {"translation": {source_lang: sentence, target_lang: sentence}} (see the assert below). The author therefore modified the code above so that, in addition to the .arrow file, the raw sentence pairs are written to plain text files:


generator = self._generate_examples(**split_generator.gen_kwargs)

# --- added: derive output paths for the plain-text files ---
user_name = "xushaoyang"
str_lst = fpath.split("\\")           # Windows-style cache path of the .arrow file
index = str_lst.index(self.name)      # position of the "wmtXX" component
lang_pair = str_lst[index + 1]        # e.g. "hi-en"
source, target = lang_pair.split("-")
path_lst = str_lst[:index + 2]
path_lst[index] += f"_{user_name}"    # write to a sibling "wmtXX_<user_name>" directory
dir_path = os.path.join(*path_lst)
os.makedirs(dir_path, exist_ok=True)
# file names follow the OPUS-100 convention, e.g. wmt14.hi-en-train.hi
source_file_name = f"{self.name}.{lang_pair}-{split_generator.name}.{source}"
source_path = os.path.join(dir_path, source_file_name)
target_file_name = f"{self.name}.{lang_pair}-{split_generator.name}.{target}"
target_path = os.path.join(dir_path, target_file_name)

with open(source_path, mode="w", encoding="utf-8") as source_f:
    with open(target_path, mode="w", encoding="utf-8") as target_f:
        with ArrowWriter(
            features=self.info.features,
            path=fpath,
            writer_batch_size=self._writer_batch_size,
            hash_salt=split_info.name,
            check_duplicates=check_duplicate_keys,
        ) as writer:
            try:
                for key, record in logging.tqdm(
                    generator,
                    unit=" examples",
                    total=split_info.num_examples,
                    leave=False,
                    disable=not logging.is_progress_bar_enabled(),
                    desc=f"Generating {split_info.name} split",
                ):
                    assert list(record.keys()) == ["translation"]
                    lang_keys = list(record["translation"].keys())
                    # some records label the source sentence with a wrong language key;
                    # rebuild the record so that the expected source key is present
                    if source not in lang_keys:
                        target_idx = lang_keys.index(target)
                        source_idx = (target_idx + 1) % 2
                        error_source_key = lang_keys[source_idx]
                        new_record = {"translation": {
                            source: record["translation"][error_source_key],
                            target: record["translation"][target]
                        }}
                        record = new_record
                        del new_record

                    example = self.info.features.encode_example(record)
                    writer.write(example, key)

                    # --- added: also write the pair as plain text, one sentence per line ---
                    source_sentence = record['translation'][source]
                    target_sentence = record['translation'][target]
                    source_sentence = source_sentence.replace("\r", "").replace("\n", "")
                    target_sentence = target_sentence.replace("\r", "").replace("\n", "")
                    source_f.write(source_sentence + "\n")
                    target_f.write(target_sentence + "\n")
            finally:
                num_examples, num_bytes = writer.finalize()
split_generator.split_info.num_examples = num_examples
split_generator.split_info.num_bytes = num_bytes
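Once the patch has run, the two plain-text files of a split should be line-aligned. A quick sanity check, assuming the wmt14 hi-en example and the save_dir from run.py (paths are illustrative):

# verify that the extracted source/target files are line-aligned
src_path = r"D:\dataset\mt\wmt14_xushaoyang\hi-en\wmt14.hi-en-train.hi"
tgt_path = r"D:\dataset\mt\wmt14_xushaoyang\hi-en\wmt14.hi-en-train.en"
with open(src_path, encoding="utf-8") as sf, open(tgt_path, encoding="utf-8") as tf:
    src_lines, tgt_lines = sf.readlines(), tf.readlines()
assert len(src_lines) == len(tgt_lines), "source/target files are misaligned"
print(src_lines[0].strip())  # first Hindi sentence
print(tgt_lines[0].strip())  # first English sentence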

After making these changes, run run.py. Taking the wmt14 hi-en data as an example, the resulting files are shown below; the naming follows the OPUS-100 convention [5]. Note that the patch strips \r and \n inside sentences, so each output line holds exactly one sentence:

[Figure: the extracted wmt14 hi-en plain-text files, e.g. wmt14.hi-en-train.hi / wmt14.hi-en-train.en]

Limitations

  1. Overall, this post is of limited value: directly reading the .arrow files should work just as well (see e.g. https://blog.csdn.net/wowotuo/article/details/110497489), in which case steps 1 to 4 plus a plain run of run.py are enough; a minimal sketch of reading an .arrow file is given right after this list.
  2. The huggingface datasets library only covers WMT14-19; other years still have to be downloaded by hand. Of course, the source could be extended further to add the missing years (TODO1).
  3. The downloaded data is not complete. wmt_utils.py defines every corpus that the WMT tasks may need as a SubDataset object, and any SubDataset that sets manual_dl_files (czeng_10, for example) has to be downloaded manually. The author's current workaround is to log, during the download, which corpora were not fetched automatically, and to add them by hand once everything finishes (together with the WMT years missing per point 2); the logging code is given under "Supplements".
     [Figure: the czeng_10 SubDataset definition in wmt_utils.py, which sets manual_dl_files]
  4. In addition, after a complete run of run.py, the author found that the wmt19 test set is missing.
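As promised in point 1, here is a minimal sketch of reading a cached .arrow file directly with the datasets API (the glob pattern and path are illustrative):

from pathlib import Path
from datasets import Dataset

# locate any cached training .arrow file under the save_dir used in run.py
arrow_file = next(Path(r"D:\dataset\mt").rglob("*train.arrow"))
ds = Dataset.from_file(str(arrow_file))  # memory-maps the Arrow table
print(ds[0])  # e.g. {'translation': {'hi': '...', 'en': '...'}}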

Supplements

Supplement 1: logging the corpora that can only be downloaded manually

As noted in point 3 of "Limitations", some files come without a URL for automatic download. The author's workaround is to record, during the download, which corpora were not fetched automatically, and to add them manually once everything else has finished. Concretely, inside the _split_generators function of wmt_utils.py, under the statement if dataset.get_manual_dl_files(source):, the author added the following:

with open(f"{self.name}_error_log", mode="a", encoding="utf-8") as file:
    file.write(f"lang: {'-'.join(self.config.language_pair)} | data_name: {dataset.name} | url: {str(dataset.get_manual_dl_files(source))}" + "\n")

Whenever a corpus turns out to be missing while run.py is running, such a record is appended to the wmtxxx_error_log file, as shown below:

[Figure: example lines from a wmtxxx_error_log file; see "Corpora that must be downloaded manually" below for the full list]

Supplement 2: adding the wmt19 test set

Open the wmt19 folder under datasets/datasets and modify wmt19.py and wmt_utils.py. In wmt19.py, register the new test subsets in the split mapping:

datasets.Split.TEST: ["newstest2019", "newstest2019_csen", "newstest2019_frde"],

Then, in wmt_utils.py, add the corresponding SubDataset definitions:

SubDataset(
    name="newstest2019",
    target="en",
    sources={"de", "fi", "gu", "kk", "lt", "ru", "zh"},
    url="http://data.statmt.org/wmt19/translation-task/test.tgz",
    path=("sgm/newstest2019-{src}en-src.{src}.sgm", "sgm/newstest2019-{src}en-ref.en.sgm"),
),

SubDataset(
    name="newstest2019_csen",
    target="en",
    sources={"cs"},
    url="http://data.statmt.org/wmt19/translation-task/test.tgz",
    path=("sgm/newstest2019-en{src}-src.en.sgm", "sgm/newstest2019-en{src}-ref.{src}.sgm"),
),

SubDataset(
    name="newstest2019_frde",
    target="de",
    sources={"fr"},
    url="http://data.statmt.org/wmt19/translation-task/test.tgz",
    path=("sgm/newstest2019-frde-ref.de.sgm", "sgm/newstest2019-frde-src.fr.sgm"),
),

Running run.py directly at this point will raise an error, because dataset_infos.json has not been updated yet: the program reads preset information from dataset_infos.json and checks it against the downloaded result (verify_checksums, verify_splits). Editing dataset_infos.json by hand is tedious, so the author chose to disable the verification instead:

# save_infos=True regenerates dataset_infos.json instead of verifying against it
load_dataset(os.path.join(py_file_path, wmt), name=lang_pair, cache_dir=save_dir, save_infos=True)

Even so, one small edit to dataset_infos.json is still needed: adding a "test" entry under splits (the num_bytes and num_examples values below look like placeholders; with verification disabled the exact numbers should not matter):

      "test": {
        "name": "test",
        "num_bytes": 3000,
        "num_examples": 3000,
        "dataset_name": "wmt19"
      }

Download results

Parallel corpus statistics

For every split, the source and target line counts are identical, so a single number is listed per split below.
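These counts can be reproduced by counting the lines of each extracted file; a sketch, assuming the directory layout produced by the patch above:

from pathlib import Path

# count lines of every extracted plain-text file under save_dir
save_dir = Path(r"D:\dataset\mt")
for f in sorted(save_dir.glob("wmt*_xushaoyang/*/wmt*")):
    with open(f, encoding="utf-8") as fh:
        print(f.name, sum(1 for _ in fh))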

wmt14

cs-en: train 953621 | validation 3000 | test 3003
de-en: train 4508785 | validation 3000 | test 3003
fr-en: train 40836715 | validation 3000 | test 3003
hi-en: train 32863 | validation 520 | test 2507
ru-en: train 1486965 | validation 3000 | test 3003

wmt15

cs-en: train 959768 | validation 3003 | test 2656
de-en: train 4522998 | validation 3003 | test 2169
fi-en: train 2073394 | validation 1500 | test 1370
fr-en: train 40853137 | validation 4503 | test 1500
ru-en: train 1495081 | validation 3003 | test 2818

wmt16

cs-en: train 997240 | validation 2656 | test 2999
de-en: train 4548885 | validation 2169 | test 2999
fi-en: train 2073394 | validation 1370 | test 6000
ro-en: train 610320 | validation 1999 | test 1999
ru-en: train 1516162 | validation 2818 | test 2998
tr-en: train 205756 | validation 1001 | test 3000

wmt17

cs-en: train 1018291 | validation 2999 | test 3005
de-en: train 5906184 | validation 2999 | test 3004
fi-en: train 2656542 | validation 6000 | test 6004
lv-en: train 3567528 | validation 2003 | test 2001
ru-en: train 24782720 | validation 2998 | test 3001
tr-en: train 205756 | validation 3000 | test 3007
zh-en: train 25134743 | validation 2002 | test 2001

wmt18

cs-en: train 11046024 | validation 3005 | test 2983
de-en: train 42271874 | validation 3004 | test 2998
et-en: train 2175873 | validation 2000 | test 2000
fi-en: train 3280600 | validation 6004 | test 3000
kk-en: train 0 | validation 0 | test 0
ru-en: train 36858512 | validation 3001 | test 3000
tr-en: train 205756 | validation 3007 | test 3000
zh-en: train 25160346 | validation 2001 | test 3981

wmt19

cs-en: train 7270695 | validation 2983 | test 1997
de-en: train 38690334 | validation 2998 | test 2000
fi-en: train 6587448 | validation 3000 | test 1996
gu-en: train 11670 | validation 1998 | test 1016
kk-en: train 126583 | validation 2066 | test 1000
lt-en: train 2344893 | validation 2000 | test 1000
ru-en: train 37492126 | validation 3000 | test 2000
zh-en: train 25984574 | validation 3981 | test 2000
fr-de: train 9824476 | validation 1512 | test 1707

Corpora that must be downloaded manually

  • wmt14:
      • lang: cs-en | data_name: czeng_10 | url: ['data-plaintext-format.0.tar', 'data-plaintext-format.1.tar', 'data-plaintext-format.2.tar', 'data-plaintext-format.3.tar', 'data-plaintext-format.4.tar', 'data-plaintext-format.5.tar', 'data-plaintext-format.6.tar', 'data-plaintext-format.7.tar', 'data-plaintext-format.8.tar', 'data-plaintext-format.9.tar']
      • lang: ru-en | data_name: yandexcorpus | url: ['1mcorpus.zip']
      • lang: hi-en | data_name: hindencorp_01 | url: ['hindencorp0.1.gz']
  • wmt15:
      • lang: cs-en | data_name: czeng_10 | url: ['data-plaintext-format.0.tar', 'data-plaintext-format.1.tar', 'data-plaintext-format.2.tar', 'data-plaintext-format.3.tar', 'data-plaintext-format.4.tar', 'data-plaintext-format.5.tar', 'data-plaintext-format.6.tar', 'data-plaintext-format.7.tar', 'data-plaintext-format.8.tar', 'data-plaintext-format.9.tar']
      • lang: ru-en | data_name: yandexcorpus | url: ['1mcorpus.zip']
  • wmt16:
      • lang: cs-en | data_name: czeng_16pre | url: ['czeng16pre.deduped-ignoring-sections.txt.gz']
      • lang: ru-en | data_name: yandexcorpus | url: ['1mcorpus.zip']
  • wmt17:
      • lang: cs-en | data_name: czeng_16 | url: ['data-plaintext-format.0.tar', 'data-plaintext-format.1.tar', 'data-plaintext-format.2.tar', 'data-plaintext-format.3.tar', 'data-plaintext-format.4.tar', 'data-plaintext-format.5.tar', 'data-plaintext-format.6.tar', 'data-plaintext-format.7.tar', 'data-plaintext-format.8.tar', 'data-plaintext-format.9.tar']
      • lang: ru-en | data_name: yandexcorpus | url: ['1mcorpus.zip']
  • wmt18:
      • lang: cs-en | data_name: czeng_17 | url: ['data-plaintext-format.0.tar', 'data-plaintext-format.1.tar', 'data-plaintext-format.2.tar', 'data-plaintext-format.3.tar', 'data-plaintext-format.4.tar', 'data-plaintext-format.5.tar', 'data-plaintext-format.6.tar', 'data-plaintext-format.7.tar', 'data-plaintext-format.8.tar', 'data-plaintext-format.9.tar']
      • lang: ru-en | data_name: yandexcorpus | url: ['1mcorpus.zip']
  • wmt19:
      • lang: cs-en | data_name: czeng_17 | url: ['data-plaintext-format.0.tar', 'data-plaintext-format.1.tar', 'data-plaintext-format.2.tar', 'data-plaintext-format.3.tar', 'data-plaintext-format.4.tar', 'data-plaintext-format.5.tar', 'data-plaintext-format.6.tar', 'data-plaintext-format.7.tar', 'data-plaintext-format.8.tar', 'data-plaintext-format.9.tar']
      • lang: ru-en | data_name: yandexcorpus | url: ['1mcorpus.zip']

As the list shows, there is a great deal of duplication. The download pages for these corpora are:

  • czeng_10: https://ufal.mff.cuni.cz/legacy/czeng/czeng10/
  • yandexcorpus: https://translate.yandex.ru/corpus?lang=en
  • hindencorp_01: http://ufallab.ms.mff.cuni.cz/~bojar/hindencorp/
  • czeng_16pre: https://ufal.mff.cuni.cz/czeng/czeng16pre
  • czeng_16: https://ufal.mff.cuni.cz/czeng/czeng16
  • czeng_17: https://ufal.mff.cuni.cz/czeng/czeng17

Almost all of them require submitting an access request.

References

[1] https://machinetranslate.org/wmt
[2] https://arxiv.org/pdf/2105.09259v1.pdf
[3] https://www.statmt.org/wmt13/translation-task.html
[4] https://github.com/huggingface/datasets
[5] https://github.com/EdinburghNLP/opus-100-corpus

Original: https://blog.csdn.net/jokerxsy/article/details/124958155
Author: Muasci
