
How to Read Very Large Files in Python

2025-1-13 杜世伟 Python

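When a file is larger than available memory, loading it in one go will exhaust RAM. The usual remedy is to read it line by line or in chunks, so that only a small part is held in memory at any moment. Several common approaches follow.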

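1. Iterate line by line with open
This is the most common approach. A file object is its own iterator and yields one line per iteration.

with open('large_file.txt', 'r', encoding='utf-8') as file:
    for line in file:
        process_line(line)  # handle each line

This is efficient because the file is read through a buffer one line at a time instead of being loaded all at once. Memory use stays small as long as no single line is itself enormous.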
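To make the pattern concrete, here is a minimal sketch that counts the lines containing a keyword. The function name `count_matching_lines` and its parameters are illustrative assumptions, not part of the original post; it stands in for whatever `process_line` would do.

```python
def count_matching_lines(path, keyword, encoding="utf-8"):
    """Count the lines containing `keyword` without loading the whole file."""
    matches = 0
    with open(path, "r", encoding=encoding) as file:
        for line in file:  # the file object yields one line per iteration
            if keyword in line:
                matches += 1
    return matches
```

Only the current line and a running counter are kept, so memory use does not grow with file size.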

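2. Read the file in chunks
To read in larger units than a line, pass a size to read.

with open('large_file.txt', 'r', encoding='utf-8') as file:
    while True:
        chunk = file.read(1024 * 1024)  # up to 1 Mi characters per read (text mode counts characters, not bytes)
        if not chunk:
            break
        process_chunk(chunk)  # handle the chunk just read

This suits processing that works on blocks of content and does not depend on line boundaries.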
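Chunked reading fits byte-oriented work especially well. The sketch below computes a SHA-256 checksum by opening the file in binary mode and feeding fixed-size chunks to `hashlib`. The function name `sha256_of_file` is a hypothetical helper chosen for illustration. Note that in binary mode a chunk boundary may fall in the middle of a multi-byte character, so this style is meant for processing that does not need to decode text.

```python
import hashlib


def sha256_of_file(path, chunk_size=1024 * 1024):
    """Return the hex SHA-256 digest of a file, read in fixed-size binary chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as file:
        # iter() with a sentinel keeps calling file.read(chunk_size)
        # until it returns b"", which signals end of file.
        for chunk in iter(lambda: file.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

The `iter(callable, sentinel)` form is an alternative to the explicit `while True` / `break` loop and behaves the same way.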

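3. Wrap readlines in a generator
Called without an argument, readlines loads the entire file. Given a size hint, it returns only a batch of complete lines, so a generator can hand them out one by one:

def read_large_file(file_path, chunk_size=1024):
    with open(file_path, 'r', encoding='utf-8') as file:
        while True:
            # The hint is approximate: whole lines are read until their total size exceeds it
            lines = file.readlines(chunk_size)
            if not lines:
                break
            for line in lines:
                yield line

for line in read_large_file('large_file.txt'):
    process_line(line)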

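4. Use the mmap module
mmap maps a file into the process's address space. The operating system loads pages on demand instead of reading the whole file up front, and the mapping can be sliced and searched like a bytes object.

import mmap

with open('large_file.txt', 'rb') as file:  # read-only is enough for ACCESS_READ
    with mmap.mmap(file.fileno(), length=0, access=mmap.ACCESS_READ) as m:
        for line in iter(m.readline, b""):
            process_line(line.decode('utf-8'))  # decode bytes to str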
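Because a mapping behaves like a bytes object, it can also be searched directly without reading line by line. The following is a minimal sketch, assuming a non-empty file and a non-empty `pattern`; `find_all_offsets` is a hypothetical helper name. Mapping an empty file with `length=0` raises `ValueError`, so callers may want to check the file size first.

```python
import mmap


def find_all_offsets(path, pattern):
    """Return the byte offset of every occurrence of `pattern` (bytes) in the file."""
    offsets = []
    with open(path, "rb") as file:
        with mmap.mmap(file.fileno(), length=0, access=mmap.ACCESS_READ) as m:
            position = m.find(pattern)
            while position != -1:
                offsets.append(position)
                # Resume one byte later so overlapping matches are found too
                position = m.find(pattern, position + 1)
    return offsets
```

Offsets are in bytes, not characters, which matters for files containing multi-byte UTF-8 text.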

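5. Use pandas for very large structured files
If the file holds structured data such as CSV, pandas can load it in chunks. With chunksize set, read_csv returns an iterator of DataFrames rather than a single DataFrame.

import pandas as pd

chunk_size = 100000  # rows per chunk
for chunk in pd.read_csv('large_file.csv', chunksize=chunk_size):
    process_chunk(chunk)  # handle each DataFrame chunk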

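6. Use multiple threads or processes
In some cases, threads or processes can work on different parts of a large file in parallel. Threads mainly help when the per-chunk work is I/O-bound or releases the GIL; CPU-bound pure-Python work generally needs processes instead.

from concurrent.futures import ThreadPoolExecutor

def process_chunk(chunk):
    # custom processing logic
    pass

with open('large_file.txt', 'r', encoding='utf-8') as file:
    with ThreadPoolExecutor() as executor:
        futures = []
        while True:
            chunk = file.read(1024 * 1024)  # up to 1 Mi characters per read
            if not chunk:
                break
            futures.append(executor.submit(process_chunk, chunk))
        for future in futures:
            future.result()  # wait for each task and re-raise any worker exception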
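One caveat with submitting every chunk as soon as it is read: if the workers are slower than the reader, the queued chunks pile up and the whole file can end up in memory anyway. The sketch below caps the number of chunks in flight. The names `process_file_bounded`, `count_newlines`, and `max_pending` are assumptions introduced for illustration, and results come back in completion order rather than file order.

```python
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait


def count_newlines(chunk):
    """Hypothetical worker: count newline characters in one chunk."""
    return chunk.count("\n")


def process_file_bounded(path, worker, chunk_size=1024 * 1024, max_pending=8):
    """Apply `worker` to each chunk, keeping at most `max_pending` chunks in flight."""
    results = []
    pending = set()
    with open(path, "r", encoding="utf-8") as file, ThreadPoolExecutor() as executor:
        while True:
            chunk = file.read(chunk_size)
            if not chunk:
                break
            pending.add(executor.submit(worker, chunk))
            if len(pending) >= max_pending:
                # Block until at least one task finishes before reading more
                done, pending = wait(pending, return_when=FIRST_COMPLETED)
                results.extend(future.result() for future in done)
        # The file is exhausted; collect the tasks that are still outstanding
        results.extend(future.result() for future in pending)
    return results
```

For example, `sum(process_file_bounded('large_file.txt', count_newlines))` would give the total number of newline characters while holding only a handful of chunks in memory at a time.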

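Notes
File encoding: make sure the encoding passed to open (for example utf-8 or latin-1) matches the file's actual encoding, otherwise decoding will fail.
Error handling: add error handling when working with large files to cope with I/O errors or running out of memory.
Processing cost: whether reading by line or by chunk, keep process_line or process_chunk as cheap as possible, since it runs once per line or chunk and dominates total run time.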
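The first two notes can be sketched as follows. This is one possible arrangement, not a prescribed one: `iter_lines_tolerant` and `count_lines_safely` are hypothetical helper names, and replacing undecodable bytes is only appropriate when losing those bytes is acceptable.

```python
def iter_lines_tolerant(path, encoding="utf-8"):
    """Yield lines, replacing undecodable bytes with U+FFFD instead of raising."""
    with open(path, "r", encoding=encoding, errors="replace") as file:
        for line in file:
            yield line


def count_lines_safely(path):
    """Return the number of lines, or None if the file cannot be read."""
    try:
        return sum(1 for _ in iter_lines_tolerant(path))
    except OSError as exc:  # includes FileNotFoundError and PermissionError
        print(f"could not read {path}: {exc}")
        return None
```

Passing `errors="strict"` (the default) instead would surface encoding mismatches as `UnicodeDecodeError`, which is preferable when silent data loss is not acceptable.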

