Python3: reading large files and logs line by line in reverse (read_reverse_bigfile)

May 24, 2023, 2:07 AM • Python • 68 views

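Use cases:

Reading a large file in reverse is mostly needed to fetch the last few hundred lines of a very large log file. For a small file, lines = file.readlines()[-200:] gives the result directly.

For a large file, readlines() loads the entire content into memory. At GB scale there may not be enough memory, and building the full list only to slice off its tail is wasteful. Reading in reverse with a generator is a better fit.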
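The small-file approach can be sketched as below. The helper name `tail_small` and its parameters are illustrative assumptions, not part of the original post; it is only reasonable when the whole file comfortably fits in memory.

```python
def tail_small(path, n=200, encoding='utf-8'):
    """Return the last n lines of a small file as a list.

    readlines() builds a list of every line in the file, so memory
    use grows with file size. Use this only for small files.
    """
    with open(path, encoding=encoding) as f:
        return f.readlines()[-n:]
```

Each returned line keeps its trailing newline, exactly as readlines() produces it.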

Reading a large file in reverse relies mainly on:

the seek() method of the file object, which moves the file pointer
re.finditer from the re module, which searches for line terminators; the end() of each Match object gives the index just past the terminator
yield, so that a generator saves memory
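The three building blocks above can be tried in isolation before reading the full function. This is a minimal sketch: the `io.BytesIO` object stands in for a file opened in 'rb' mode, and the helper name `lines_reversed` is an assumption made for illustration.

```python
import io
import re


def lines_reversed(chunk, ends):
    """Yield the lines of one chunk, last line first.

    `ends` holds the index just past each separator in `chunk`.
    """
    stop = None
    for start in reversed(ends):
        piece = chunk[start:stop]
        if piece:
            yield piece
        stop = start
    # Whatever comes before the first line start in this chunk.
    if chunk[:stop]:
        yield chunk[:stop]


# In-memory stand-in for open(path, 'rb').
f = io.BytesIO(b'first\nsecond\nthird\n')

f.seek(0, 2)        # whence=2: offset is relative to the end of the file
size = f.tell()     # total size in bytes
f.seek(-13, 2)      # step back 13 bytes from the end
chunk = f.read(13)  # the last 13 bytes of the data

# Match.end() is the position right after each b'\n'.
ends = [m.end() for m in re.finditer(b'\n', chunk)]

for line in lines_reversed(chunk, ends):
    print(line)
```

A real reader also has to carry a partial line over from one chunk to the next, which is what the `line` variable in the full function does.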

Performance check:

A simple check was run on Windows 10 with 16 GB RAM, an 11th Gen Intel(R) Core(TM) i5-11400 @ 2.60GHz processor, and Python 3.7.11.

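import re


def read_reverse_bigfile(filepath, encoding='utf-8', separator=b'\n', single_size=1024 * 1024):
    """
    :param filepath: path of the file to read
    :param encoding: character encoding used to decode each line, default utf-8
    :param separator: line terminator as bytes (treated as a regex pattern), default b'\\n'
    :param single_size: number of bytes read per chunk, default 1024 * 1024
    :return: generator that yields lines from last to first
    """
    with open(filepath, 'rb') as f:
        try:
            # Position the pointer one chunk before the end of the file.
            f.seek(0, 2)
            position = f.tell()
            if position > single_size:
                f.seek(-single_size, 2)
            else:
                f.seek(0, 0)
        except OSError:
            return 'Blank file'
        line = b''  # partial line carried over from the chunk read before
        while True:
            chunk = f.read(single_size)
            # Index just past each separator, i.e. where each line starts.
            index_list = [match.end() for match in re.finditer(separator, chunk)]
            index = None
            while index_list:
                target = index_list.pop()
                if index is None:
                    line = chunk[target:] + line
                else:
                    line = chunk[target:index] + line
                if line:
                    yield line.decode(encoding=encoding)
                line = b''
                index = target
            # The head of the chunk may be the tail of a line that starts earlier.
            line = chunk[:index] + line
            position = f.tell()
            if position > 2 * single_size and single_size > 0:
                # Step back over this chunk and the one before it.
                f.seek(-2 * single_size, 1)
            else:
                # Less than one full chunk is left: read only the unread head.
                f.seek(0, 0)
                single_size = position - single_size
                if single_size <= 0:
                    if line:  # skip the empty result of an empty file
                        yield line.decode(encoding=encoding)
                    return 'End'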
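Because the function is a generator, a caller who only wants the tail of a log does not need to consume it fully. The sketch below shows one way to take the last n lines and restore file order. The helper name `last_lines` is an assumption, and a plain iterator stands in for `read_reverse_bigfile(path)` so that the block runs on its own.

```python
from itertools import islice


def last_lines(reverse_lines, n=200):
    """Take the first n items of a last-line-first iterator and
    return them in original file order."""
    newest_first = list(islice(reverse_lines, n))
    newest_first.reverse()
    return newest_first


# Stand-in for read_reverse_bigfile(path): yields the last line first.
sample = iter(['line 5\n', 'line 4\n', 'line 3\n', 'line 2\n', 'line 1\n'])

tail = last_lines(sample, n=2)
print(tail)
```

islice stops pulling from the generator after n items, so only the chunks needed for those lines are read from disk.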


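if __name__ == '__main__':
    import os
    import time

    import psutil  # third-party package

    pid = os.getpid()
    p = psutil.Process(pid)
    fp = r'./test30.txt'

    # Time the reverse generator; list() consumes it completely.
    start_time = time.time()
    rrb = list(read_reverse_bigfile(fp))
    print(len(rrb))
    print(f'Elapsed 0 (read_reverse_bigfile): {time.time() - start_time}')

    # Restart the timer so that readlines() is measured on its own.
    start_time = time.time()
    with open(fp, encoding='utf-8') as f:
        orl = f.readlines()
        print(len(orl))
    print(f'Elapsed 1 (readlines): {time.time() - start_time}')

    # USS of this process in MB; both lists are still alive at this point.
    info = p.memory_full_info().uss / (1024 * 1024)
    print(f'Memory: {info} MB')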
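If psutil is not installed, a rough comparison can also be made with the standard library. The sketch below swaps in `tracemalloc`, which reports peak Python-level allocations rather than the process USS used above, so the numbers are not directly comparable. The helper names and the generated 50000-line sample file are assumptions for illustration.

```python
import os
import tempfile
import tracemalloc


def peak_bytes(func, *args):
    """Run func(*args); return (result, peak traced allocation in bytes)."""
    tracemalloc.start()
    try:
        result = func(*args)
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, peak


def count_with_readlines(path):
    # Builds a list holding every line of the file.
    with open(path, 'rb') as f:
        return len(f.readlines())


def count_lazily(path):
    # Holds only one line at a time.
    with open(path, 'rb') as f:
        return sum(1 for _ in f)


# Generate a throwaway sample file.
with tempfile.NamedTemporaryFile('wb', delete=False) as tmp:
    for i in range(50000):
        tmp.write(b'log line %d\n' % i)

try:
    n_all, peak_all = peak_bytes(count_with_readlines, tmp.name)
    n_lazy, peak_lazy = peak_bytes(count_lazily, tmp.name)
finally:
    os.remove(tmp.name)

print(f'readlines: {n_all} lines, peak {peak_all / 1024:.0f} KiB')
print(f'lazy:      {n_lazy} lines, peak {peak_lazy / 1024:.0f} KiB')
```

The same `peak_bytes` helper could wrap a call that consumes only part of a reverse generator, to see how memory behaves when just the tail of the file is needed.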


Original: https://www.cnblogs.com/yougnen/p/16081877.html
Author: 阿伦来啦
Title: Python3: reading large files and logs line by line in reverse (read_reverse_bigfile)

Original articles are protected by copyright. When reposting, please credit the source: https://www.johngo689.com/500156/
