I am parsing two big files (on the order of gigabytes), each of which contains keys and corresponding values. Some keys are shared between the two files, but with differing corresponding values.
For each of the files, I want to write to a new file the keys* and corresponding values, with keys* representing the keys present in both file1 and file2. I don't care about the key order in the output, but it must absolutely be the same in the two output files.
File 1:
key1
value1-1
key2
value1-2
key3
value1-3
File 2:
key1
value2-1
key5
value2-5
key2
value2-2
A valid output would be:
Parsed file 1:
key1
value1-1
key2
value1-2
Parsed file 2:
key1
value2-1
key2
value2-2
Another valid output:
Parsed file 1:
key2
value1-2
key1
value1-1
Parsed file 2:
key2
value2-2
key1
value2-1
An invalid output (keys in differing order in file 1 and file 2):
Parsed file 1:
key2
value1-2
key1
value1-1
Parsed file 2:
key1
value2-1
key2
value2-2
One last detail: the values are far larger than the keys.
What I intend to do is:
For each input file, parse it and return a dict (let's call it file_index) whose keys are the keys in the file and whose values are the offsets at which each key was found in the input file.
Compute the intersection:
good_keys = file1_index.viewkeys() & file2_index.viewkeys()
Do something like (pseudo-code):
for each file:
    for good_key in good_keys:
        offset = file_index[good_key]
        go to offset in input_file
        get corresponding value
        write (key, value) to output file
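The three steps above can be sketched in runnable form. This is only an illustration under stated assumptions: the helper names `build_index`, `write_common` and `run_example` are hypothetical, each record is assumed to be a key line followed by a single value line (the question does not specify the real layout), and the code targets Python 3, where `dict.keys()` returns a view supporting `&` in the way `viewkeys()` does in Python 2.

```python
import os
import tempfile


def build_index(path):
    """Return {key: byte offset of the key line} for one input file.

    Assumed layout (not confirmed by the question): each record is a key
    line followed by a single value line.
    """
    index = {}
    with open(path, "rb") as f:
        while True:
            offset = f.tell()
            key_line = f.readline()
            if not key_line:
                break
            index[key_line.rstrip(b"\r\n")] = offset
            f.readline()  # skip the (large) value; only the offset is kept
    return index


def write_common(in_path, out_path, index, good_keys):
    """Copy the records for good_keys, in the iteration order of good_keys."""
    with open(in_path, "rb") as src, open(out_path, "wb") as dst:
        for key in good_keys:
            src.seek(index[key])
            src.readline()  # key line, already known
            value = src.readline().rstrip(b"\r\n")
            dst.write(key + b"\n" + value + b"\n")


def run_example():
    """Run the three steps on the sample data; return both outputs as line lists."""
    samples = {
        "file1": b"key1\nvalue1-1\nkey2\nvalue1-2\nkey3\nvalue1-3\n",
        "file2": b"key1\nvalue2-1\nkey5\nvalue2-5\nkey2\nvalue2-2\n",
    }
    with tempfile.TemporaryDirectory() as tmp:
        paths = {}
        for name, content in samples.items():
            paths[name] = os.path.join(tmp, name)
            with open(paths[name], "wb") as f:
                f.write(content)

        indexes = {name: build_index(p) for name, p in paths.items()}
        good_keys = indexes["file1"].keys() & indexes["file2"].keys()

        outputs = []
        for name in ("file1", "file2"):
            out_path = paths[name] + ".out"
            write_common(paths[name], out_path, indexes[name], good_keys)
            with open(out_path, "rb") as f:
                outputs.append(f.read().splitlines())
        return outputs
```

Only keys and offsets are held in memory, which fits the case where values are much larger than keys; each value is read back from disk only when it is written out.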
Does iterating over the same set twice guarantee the exact same order (provided that it is the same set, which I won't modify between the two iterations), or should I convert the set to a list first and iterate over the list?
Python's dicts and sets are stable, that is, if you iterate over them without changing them they are guaranteed to give you the same order. This is from the documentation on dicts:
Keys and values are iterated over in an arbitrary order which is non-random, varies across Python implementations, and depends on the dictionary’s history of insertions and deletions. If keys, values and items views are iterated over with no intervening modifications to the dictionary, the order of items will directly correspond.
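A small check of this behaviour, as a sketch rather than a proof: the key names below are made up for illustration. Within one interpreter run, two passes over an unmodified set should yield the same order. The order itself can still differ between runs (string hashing is randomized per process in recent Python 3 versions), so if a reproducible order is wanted, sorting the keys once is one option.

```python
# Hypothetical keys, standing in for the intersection of the two indexes.
good_keys = {"key1", "key2", "key7", "key42"}

# Two passes over the same, unmodified set.
first_pass = [k for k in good_keys]
second_pass = [k for k in good_keys]
same_order = first_pass == second_pass

# Optional: an explicit order that does not depend on hashing at all.
stable_keys = sorted(good_keys)

print(same_order)
print(stable_keys)
```

Converting to a list (or sorting) is not required for the two passes to agree, but it makes the intended order explicit and guards against accidentally modifying the set between passes.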