【实测】Python 和 C++ 下字符串查找的速度对比

2023年6月13日上午4:51 • Linux • 阅读 117

最近在备战一场算法竞赛，语言误选了 Python ，无奈只能着手对常见场景进行语言迁移。而字符串查找的场景在算法竞赛中时有出现。本文即对此场景在 Python 和竞赛常用语言 C++ 下的速度进行对比，并提供相关参数和运行结果供他人参考。

硬件和操作系统

                   -                    root@
                  .o+                   ------------
                 ooo/                   OS: Arch Linux ARM aarch64
                +oooo:                  Host: Raspberry Pi 4 Model B
               +oooooo:                 Kernel: 5.16.12-1-aarch64-ARCH
               -+oooooo+:                Uptime: 3 hours, 32 mins
             /:-:++oooo+:               Packages: 378 (pacman)
            /++++/+++++++:              Shell: zsh 5.8.1
           /++++++++++++++:             Terminal: /dev/pts/0
          /+++ooooooooooooo/           CPU: (4) @ 1.500GHz
         ./ooosssso++osssssso+          Memory: 102MiB / 7797MiB
        .oossssso-/ossssss+
       -osssssso.      :ssssssso.

      :osssssss/        osssso+++.

     /ossssssss/        +ssssooo/-
   /ossssso+/:-        -:/+osssso+-
  +sso+:-                 .-/+oso:
 ++:.                           -/+/
 .                                 /

编译环境和解释环境

Python
解释器：Python 3.10.2 (main, Jan 23 2022, 21:20:14) [GCC 10.2.0] on linux
交互环境：IPython 8.0.1
C++
编译器：g++ (GCC) 11.2.0
编译命令： g++ test.cpp -Wall -O2 -g -std=c++11 -o test

本次实测设置两个场景：场景 1 的源串字符分布使用伪随机数生成器生成，表示字符串查找的平均情况；场景 2 的源串可连续分割成 20,000 个长度为 50 的字符片段，其中第 15,001 个即为模式串，形如”ab…b”（1 个”a”，49 个 “b”），其余的字符片段形如”ab…c”（1 个”a”，48 个”b”，1 个”c”）。

项目场景 1：平均情况场景 2：较坏情况字符集小写字母

字符分布

有较强规律性源串长度 1,000,000 1,000,000 模式串长度 1,000 50 模式串出现位置 250,000、500,000、750,000 750,000 模式串出现次数 1 1

测试方法

本次实测中，Python 语言使用内置类型 str 的 .find() 成员函数，C++ 语言分别使用 string 类的 .find() 成员函数、 strstr 标准库函数和用户实现的 KMP 算法。

测试对象核心代码 Python

C++ –

生成源串和模式串

import random

场景 1：
源串
s = "".join(chr(random.choice(range(ord("a"), ord("z") + 1))) for _ in range(1000000))
模式串列表，三个元素各对应一个模式串
p = [s[250000:251000], s[500000:501000], s[750000:751000]]

场景 2：
模式串
p = 'a' + 'b' * 49
其他字符片段
_s = "a" + "b" * 48 + "c"
源串
s = _s * 15000 + p + _s * 4999

存储到文件，便于 C++ 程序获取
with open('source.in', 'w') as f:
    f.write(s)
with open('pattern.in', 'w') as f:
    f.write(p[0])

测试代码

In []: %timeit s.find(p[0])

#include
#include
#include
#include
#define LOOP_COUNT (1000)
using namespace std;
using std::chrono::high_resolution_clock;
using std::chrono::duration_cast;
using std::chrono::duration;
using std::chrono::milliseconds;

double test(string s, string p, size_t* pos_ptr) {
    auto t1 = high_resolution_clock::now();
    *pos_ptr = s.find(p);
    auto t2 = high_resolution_clock::now();
    duration ms_double = t2 - t1;
    return ms_double.count();
}

int main() {
    string s, p;
    size_t pos;
    ifstream srcfile("source.in");
    ifstream patfile("pattern.in");
    srcfile >> s;
    patfile >> p;

    double tot_time = 0;
    for (int i = 0; i < LOOP_COUNT; ++i) {
        tot_time += test(s, p, &pos);
    }

    cout << "Loop count:            " << LOOP_COUNT << endl;
    cout << "Source string length:  " << s.length() << endl;
    cout << "Pattern string length: " << p.length() << endl;
    cout << "Search result:         " << pos << endl;
    cout << "Time:                  " << tot_time / LOOP_COUNT << " ms" << endl;

    return 0;
}

#include
#include
#include
#include
#define LOOP_COUNT (1000)
using namespace std;
using std::chrono::high_resolution_clock;
using std::chrono::duration_cast;
using std::chrono::duration;
using std::chrono::milliseconds;
char s[1000005], p[1005], *pos=NULL;

double test(char* s, char* p, char** pos_ptr) {
    auto t1 = high_resolution_clock::now();
    *pos_ptr = strstr(s, p);
    auto t2 = high_resolution_clock::now();
    duration ms_double = t2 - t1;
    return ms_double.count();
}

int main() {
    ifstream srcfile("source.in");
    ifstream patfile("pattern.in");
    srcfile >> s;
    patfile >> p;

    double tot_time = 0;
    for (int i = 0; i < LOOP_COUNT; ++i) {
        tot_time += test(s, p, &pos);
    }

    cout << "Loop count:            " << LOOP_COUNT << endl;
    cout << "Source string length:  " << strlen(s) << endl;
    cout << "Pattern string length: " << strlen(p) << endl;
    cout << "Search result:         " << pos - s << endl;
    cout << "Time:                  " << tot_time / LOOP_COUNT << " ms" << endl;

    return 0;
}

#include
#include
#include
#include
#include
#define LOOP_COUNT (1000)
using namespace std;
using std::chrono::high_resolution_clock;
using std::chrono::duration_cast;
using std::chrono::duration;
using std::chrono::milliseconds;
int dp[1005];

int KMP(string s, string p) {
    int m = s.length(), n = p.length();
    if (n == 0) return 0;
    if (m < n) return -1;
    memset(dp, 0, sizeof(int) * (n+1));
    for (int i = 1; i < n; ++i) {
        int j = dp[i+1];
        while (j > 0 && p[j] != p[i]) j = dp[j];
        if (j > 0 || p[j] == p[i]) dp[i+1] = j + 1;
    }
    for (int i = 0, j = 0; i < m; ++i)
        if (s[i] == p[j]) { if (++j == n) return i - j + 1; }
        else if (j > 0) {
            j = dp[j];
            --i;
        }
    return -1;
}

double test(string s, string p, int* pos_ptr) {
    auto t1 = high_resolution_clock::now();
    *pos_ptr = KMP(s, p);
    auto t2 = high_resolution_clock::now();
    duration ms_double = t2 - t1;
    return ms_double.count();
}

int main() {
    string s, p;
    int pos;
    ifstream srcfile("source.in");
    ifstream patfile("pattern.in");
    srcfile >> s;
    patfile >> p;

    double tot_time = 0;
    for (int i = 0; i < LOOP_COUNT; ++i) {
        tot_time += test(s, p, &pos);
    }

    cout << "Loop count:            " << LOOP_COUNT << endl;
    cout << "Source string length:  " << s.length() << endl;
    cout << "Pattern string length: " << p.length() << endl;
    cout << "Search result:         " << pos << endl;
    cout << "Time:                  " << tot_time / LOOP_COUNT << " ms" << endl;

    return 0;
}

IPython 的 %timeit 魔法命令可以输出代码多次执行的平均时间和标准差，在此取平均时间。C++ 的代码对每个模式串固定运行 1,000 次后取平均时间。

以下时间若无特别说明，均以微秒为单位，保留到整数位。

场景模式串出现位置 Python C++ –

C++ –

场景 1 250,000 105 523 155 2564 场景 1 500,000 183 1053 274 3711 场景 1 750,000 291 1589 447 4900 场景 2 750,000 2630* 618 353 3565

原输出为”2.63 ms”。IPython 的 %timeit 输出的均值保留 3 位有效数字，由于此时间已超过 1 毫秒，微秒位被舍弃。此处仍以微秒作单位，数值记为”2630″。

本次实测时使用的设备硬件上劣于算法竞赛中的标准配置机器，实测结果中的”绝对数值”参考性较低。

根据上表中的结果，在给定环境和相关参数条件下，场景 1 中 Python 的运行时间大约为 C++ 中 string::find 的五分之一，与 std:strstr 接近；而在场景 2 中 Python 的运行时间明显增长，但 C++ 的前两种测试方法的运行时间与先前接近甚至更短。四次测试中，C++ 的用户实现的 KMP 算法运行时间均较长，长于同条件下 Python 的情况。

值得关注的是：C++ 中自行实现的 KMP 算法的运行时间竟然远长于 C++ 标准库甚至 Python 中的算法。这也类似于常说的”自己设计汇编代码运行效率低于编译器”的情况。Stack Overflow 的一个问题 strstr faster than algorithms? 下有人回答如下：

Why do you think strstr should be slower than all the others? Do you know what algorithm strstr uses? I think it’s quite likely that strstr uses a fine-tuned, processor-specific, assembly-coded algorithm of the KMP type or better. In which case you don’t stand a chance of out-performing it in C for such small benchmarks.

KMP 算法并非是所有线性复杂度算法中最快的。在不同的环境（软硬件、测试数据等）下，KMP 与其变种乃至其他线性复杂度算法，孰优孰劣都无法判断。编译器在设计时考虑到诸多可能的因素，尽可能使不同环境下都能有相对较优的策略来得到结果。因而，在保证结果正确的情况下，与其根据算法原理自行编写，不如直接使用标准库中提供的函数。

同时本次实测也在运行时间角度再次印证 Python 并不适合在算法竞赛中取得高成绩的说法。

Original: https://www.cnblogs.com/littleye233/p/15977014.html
Author: 小叶Little_Ye
Title: 【实测】Python 和 C++ 下字符串查找的速度对比

原创文章受到原创版权保护。转载请注明出处：https://www.johngo689.com/606374/

转载文章受原作者版权保护。转载请注明原作者出处！

Linux

【自取】最近整理的，有需要可以领取学习：

Linux核心资料大放送~

全栈面试题汇总（持续更新&可下载）

一个提高学习100%效率的工具！

【超详细】深度学习面试题目！

LeetCode Python刷题答案下载！

LeetCode Java版刷题答案下载！

LeetCode C++ 版本，抓紧保存！

LeetCode GO语言刷题答案下载！

高速USB转4串口产品设计-TTL串口

基于480Mbps 高速USB转8路串口芯片CH344Q，可以为各类主机扩展出4个独立的串口。CH344芯片支持使用操作系统内置的CDC串口驱动，也支持使用厂商提供的VCP串口驱动…

Linux 2023年6月7日
00107
ThinkPHP5 远程命令执行漏洞

一、ThinkPHP介绍轻量级框架，内部OOP和面向过程代码都存在，是国人自己开发的框架。ThinkPHP是一个快速、兼容而且简单的轻量级国产PHP开发框架，诞生于2006年初，…

Linux 2023年6月14日
0078
python之pyautogui实现图片识别-办公自动化

环境 python 3.8everedit编辑器代码 from selenium import webdriver from selenium.webdriver.chrome….

Linux 2023年6月7日
00115
位图实现

位图就是用每个字节中的bit位代表一组资源的映射。例如：一个字节有8位，在操作系统中可以用一个bit位代表一个4K的页，那一个字节就可以代表8页32K内存。可以利用位图进行资源…

Linux 2023年6月7日
0087
Ubuntu 16.04 更改系统语言为简体中文 #####避坑指南

按照我的步骤一步一步走，就不会有问题了。 [En] Follow my steps step by step, and there will be no problem. 这里我想…

Linux 2023年5月27日
00112
linux 普通分区与lvm分区

安装linux系统时有时候会提示lvm分区与标准分区首先普及一下lvm分区：lvm是 logical volume manager (逻辑卷管理)，linux环境下对磁盘分区的…

Linux 2023年5月27日
00105
排序算法

内部排序这里先介绍一个概念，算法稳定性算法稳定性 — 假设在数列中存在a[i]=a[j]，若在排序之前，a[i]在a[j]前面；并且排序之后，a[i]仍然在a[j]…

Linux 2023年6月6日
00132
每天一个 HTTP 状态码 203

203 ‘Non-Authoritative Informative’ 直译过来是「非权威信息」的意思… 203 Non-Authoritati…

Linux 2023年6月7日
00102
redis 突然大量逐出导致读写请求block

redis作为缓存场景使用，内存耗尽时，突然出现大量的逐出，在这个逐出的过程中阻塞正常的读写请求，导致 redis 短时间不可用； redis 中的LRU是如何实现的？逐出qps…

Linux 2023年5月28日
0089
模糊测试基本概念FuzzTest

1. what is FUZZ TESTing? Fuzz Testing is an automated software testing technology, origina…

Linux 2023年6月7日
0076
笔记：Java集合框架(一)

Java集合框架(一) Collection接口继承结构 Iterator接口 Iterator接口定义了迭代器的基本方法： java;gutter:true; hasNext(…

Linux 2023年6月14日
0078
调度器简介

内核中用来安排进程执行的模块称为调度器（scheduler），它可以切换进程状态（process state）。例如执行、可中断睡眠、不可中断睡眠、退出、暂停等。调度器是CPU中…

Linux 2023年6月7日
0078
Linux Centos 打开和关闭防火墙

systemctl status firewalld.service # 查看防火墙状态 systemctl start firewalld.service # 开启防火墙 sys…

Linux 2023年6月13日
00113
我叫Mongo，干了「查询终结篇」，值得您拥有

这是mongo第三篇”查终结篇”，后续会连续更新5篇 mongodb的文章总结上会有一系列的文章，顺序是先学会怎么用，在学会怎么用好，戒急戒躁，循序渐进，跟…

Linux 2023年6月14日
00129
tcpreplay重放报文，tcpdump能抓到包，应用程序收不到包

现象：生产环境中有两台服务器A、B，A服务器实时有报文发往B服务器。为了在测试环境测试新功能，故在现网A服务器上tcpdump抓取发往B服务器的报文，然后在测试环境tcprewr…

Linux 2023年5月27日
00156
Linux——基础命令用法（上）

一、Linux基础命令 1、Linux命令的语法一条完整的Linux命令的组成部分：命令选项参数命令：是某个具体的功能选项：是对函数的修改(通常以-开头，-表示选项的短…

Linux 2023年5月27日
0086

2024 年 5 月
一	二	三	四	五	六	日
		1	2	3	4	5
6	7	8	9	10	11	12
13	14	15	16	17	18	19
20	21	22	23	24	25	26
27	28	29	30	31