Reading multiple files with multiprocessing

Date: 2023-03-14

Problem description

I need to read some very large text files (100+ MB), process every line with a regex, and store the data in a structure. My structure inherits from defaultdict and has a read(self) method that reads the file named by self.file_name.

Look at this very simple (but not real) example. I'm not using a regex here, just splitting lines:


import multiprocessing
from collections import defaultdict

def SingleContainer():
    return list()

class Container(defaultdict):
    """
    This class stores odd lines in self["odd"] and even lines in self["even"].
    It is a silly example; in the real case the class has additional
    methods that do computations on the data it has read.
    """
    def __init__(self, file_name):
        if not isinstance(file_name, str):
            # raise X(...) and print(...) below are valid on Python 2.6 and 3
            raise AttributeError("%s is not a string" % file_name)
        defaultdict.__init__(self, SingleContainer)
        self.file_name = file_name
        self.readen_lines = 0

    def read(self):
        print("start reading file %s" % self.file_name)
        # the with block closes the file even if parsing raises
        with open(self.file_name) as f:
            for line in f:
                self.readen_lines += 1
                values = line.split()
                key = {0: "even", 1: "odd"}[self.readen_lines % 2]
                self[key].append(values)
        print("read %d lines from file %s" % (self.readen_lines, self.file_name))

def do(file_name):
    container = Container(file_name)
    container.read()
    # list(...) keeps the return value picklable on Python 3 as well
    return list(container.items())

if __name__ == "__main__":
    file_names = ["r1_200909.log", "r1_200910.log"]
    pool = multiprocessing.Pool(len(file_names))
    result = pool.map(do, file_names)
    pool.close()
    pool.join()
    print("Finish")

In the end I need to merge all the results into a single Container. It is important that the order of the lines is preserved. My approach is too slow when returning values. Is there a better solution? I'm using Python 2.6 on Linux.
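For the merging step, here is a minimal sketch, not a measured fix. It relies on the fact that Pool.map returns its results in the same order as the input list, so extending the merged lists file by file keeps the line order. The helper names parse_file, merge_in_order and read_all are hypothetical and not part of the original code. Each worker returns a plain dict of lists, on the assumption that the slow part is pickling the results back to the parent process; whether this actually helps would need to be timed on the real files. The sketch sticks to syntax that should also be valid on Python 2.6.

```python
import multiprocessing
from collections import defaultdict


def parse_file(file_name):
    """Worker (hypothetical helper): read one file, return a plain dict of lists.

    Only built-in types are returned, so only the data itself is pickled
    back to the parent process.
    """
    result = {"odd": [], "even": []}
    with open(file_name) as f:
        # Line numbers start at 1, so the first line of each file is "odd".
        for line_number, line in enumerate(f, 1):
            key = "odd" if line_number % 2 else "even"
            result[key].append(line.split())
    return result


def merge_in_order(results):
    """Merge per-file results into one defaultdict, keeping file order."""
    merged = defaultdict(list)
    for result in results:  # results are expected in the order of the files
        for key, rows in result.items():
            merged[key].extend(rows)
    return merged


def read_all(file_names):
    """Parse each file in its own worker process and merge the results."""
    if not file_names:
        return defaultdict(list)
    pool = multiprocessing.Pool(len(file_names))
    try:
        # Pool.map returns results in input order, whichever worker
        # happens to finish first.
        results = pool.map(parse_file, file_names)
    finally:
        pool.close()
        pool.join()
    return merge_in_order(results)


if __name__ == "__main__":
    import sys
    # File names are taken from the command line in this sketch.
    merged = read_all(sys.argv[1:])
    for key in sorted(merged):
        print("%s: %d lines" % (key, len(merged[key])))
```

Called as read_all(["r1_200909.log", "r1_200910.log"]), this would give one defaultdict in which the rows of the first file come before the rows of the second under each key. If the real class needs its extra methods, the merged data could be copied into a single Container in the parent afterwards.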
