Efficiently processing a large .txt file in Python

Date: 2022-11-26

Problem Description

I am quite new to Python and programming in general, but I am trying to run a "sliding window" calculation over a tab-delimited .txt file that contains about 7 million lines, using Python. What I mean by sliding window is that it will run a calculation over, say, 50,000 lines, report the number, then move up, say, 10,000 lines and perform the same calculation over another 50,000 lines. I have the calculation and the "sliding window" working correctly, and it runs well if I test it on a small subset of my data. However, if I try to run the program over my entire data set it is incredibly slow (it has now been running for about 40 hours). The math is quite simple, so I don't think it should be taking this long.
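
A minimal sketch of the window arithmetic being described here, just to make the overlap explicit (the numbers are taken from the question; nothing else is assumed):

WINDOW = 50_000   # rows per calculation
STEP = 10_000     # rows the window advances each time

# the first few window boundaries over the 7-million-line file
for start in range(0, 4 * STEP, STEP):
    print(f"rows {start} .. {start + WINDOW - 1}")
# rows 0 .. 49999
# rows 10000 .. 59999
# rows 20000 .. 69999
# rows 30000 .. 79999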

The way I am reading my .txt file right now is with csv.DictReader from the csv module. My code is as follows:

import csv

file1 = '/Users/Shared/SmallSetbee.txt'
newfile = open(file1, 'r', newline='')  # text mode, newline='' as the csv module recommends
# strip NUL bytes from each line before handing it to the csv reader
reader = csv.DictReader((line.replace('\0', '') for line in newfile), delimiter="\t")

I believe this builds a dictionary out of all 7 million lines at once, which I'm thinking could be why it slows down so much on the larger file.

Since I am only interested in running my calculation over "chunks" or "windows" of the data at a time, is there a more efficient way to read in only the specified lines, perform the calculation, and then repeat with the next "chunk" or "window" of lines?

Recommended Answer

A collections.deque is an ordered collection of items that can take a maximum size. When you add an item to one end, one falls off the other end. This means that to iterate over a "window" of your csv, you just need to keep appending rows to the deque and it will take care of discarding the oldest ones for you.
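
As a quick illustration of the maxlen behaviour (a minimal, standalone sketch): once the deque is full, every append silently evicts the oldest item.

from collections import deque

dq = deque(maxlen=3)
for i in range(5):
    dq.append(i)

print(dq)  # deque([2, 3, 4], maxlen=3) -- 0 and 1 were dropped automatically

Applied to the csv reader, the answer's code looks like this: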

import collections
import csv

dq = collections.deque(maxlen=50000)
with open(...) as csv_file:
    # strip NUL bytes from each line before handing it to the csv reader
    reader = csv.DictReader((line.replace("\0", "") for line in csv_file), delimiter="\t")

    # initial fill: read in the first window of 50,000 rows
    for _ in range(50000):
        dq.append(next(reader))

    # repeatedly compute, then slide the window forward by 10,000 rows;
    # the deque's maxlen discards the oldest rows automatically
    try:
        while True:
            compute(dq)
            for _ in range(10000):
                dq.append(next(reader))
    except StopIteration:
        compute(dq)
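
The sketch above assumes a compute() function is already defined. A hypothetical placeholder, just to make it runnable end to end (the column name "value" is an assumption and not part of the original question), could look like:

def compute(window):
    # hypothetical calculation: average of a numeric column named "value"
    values = [float(row["value"]) for row in window]
    print(sum(values) / len(values))

Define it above the loop so the name exists when the first window is processed.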
