Efficient and faster iteration over 36 million items in a Python list of tuples

Date: 2023-10-19

This article walks through an approach to iterating efficiently and quickly over 36 million items in a Python list of tuples; it may be a useful reference for anyone facing the same problem.

Problem Description

Firstly, before anyone marks it as a duplicate, please read below. I am unsure if the delay in the iteration is due to the huge size or my logic. I have a use case where I have to iterate over 36 million items in a list of tuples. My main requirement is speed and efficiency. Sample list:

    [
        ('how are you', 'I am fine'),
        ('how are you', 'I am not fine'),
        ...36 million items...
    ]

What I have done so far:

    import numpy as np
    from ast import literal_eval
    from operator import itemgetter
    from nltk.tokenize import word_tokenize
    from scipy.spatial import distance

    # combined, data_list and word_vector_dict are assumed to be defined earlier
    store_question_score = []
    result_dict = {}
    count = 0

    for query_question in combined:
        query = "{}".format(word_tokenize(query_question[0]))
        question = "{}".format(word_tokenize(query_question[1]))

        # the function uses a naive doc2vec extension of GLOVE word vectors
        vec1 = np.mean([
            word_vector_dict[word]
            for word in literal_eval(query)
            if word in word_vector_dict
        ], axis=0)

        vec2 = np.mean([
            word_vector_dict[word]
            for word in literal_eval(question)
            if word in word_vector_dict
        ], axis=0)

        similarity_score = 1 - distance.cosine(vec1, vec2)
        # list.append() mutates in place and returns None,
        # so its result must not be assigned back
        store_question_score.append((query_question[1], similarity_score))
        count += 1

        if count == len(data_list):
            # list.sort() also returns None; sorted() returns a new list
            store_question_score_descending = sorted(
                store_question_score, key=itemgetter(1), reverse=True
            )
            result_dict[query_question[0]] = store_question_score_descending[:5]
            store_question_score = []
            count = 0

The above logic aims to calculate the similarity scores between questions and perform a text similarity algorithm. I'm suspecting the delay in the iteration could be the calculation of vec1 and vec2. If so, how can I do this better? I am looking for how to speed up the process.
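
One way to test that suspicion before optimizing is to time the vector computation alone on a small slice and extrapolate. A minimal sketch, assuming `combined`, `word_vector_dict`, and `word_tokenize` are defined as in the code above:

    import time
    import numpy as np
    from ast import literal_eval
    from nltk.tokenize import word_tokenize

    # Time only the tokenize + np.mean step on a 10k sample,
    # then extrapolate to the full 36M items.
    start = time.perf_counter()
    for query_question in combined[:10000]:
        query = "{}".format(word_tokenize(query_question[0]))
        vec1 = np.mean([
            word_vector_dict[word]
            for word in literal_eval(query)
            if word in word_vector_dict
        ], axis=0)
    elapsed = time.perf_counter() - start
    print(f"{elapsed:.2f}s per 10k -> ~{elapsed * 3600 / 60:.0f} min for 36M")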

There are plenty of other questions about iterating over huge lists, but I could not find any that solved my problem.

I really appreciate any help you can provide.

Recommended Answer

Try caching:

    import numpy as np
    from ast import literal_eval
    from functools import lru_cache

    # memoize: with 36M pairs but far fewer unique strings,
    # most calls become O(1) cache hits
    @lru_cache(maxsize=None)
    def compute_vector(s):
        return np.mean([
            word_vector_dict[word]
            for word in literal_eval(s)
            if word in word_vector_dict
        ], axis=0)

Then use this instead:

    vec1 = compute_vector(query)
    vec2 = compute_vector(question)

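Since `lru_cache` keeps hit/miss counters, you can check that the cache is actually paying off. A quick sanity check after the loop has run (the output shown is illustrative only):

    # With ~370,100 unique strings among 36M pairs, hits should
    # vastly outnumber misses once the loop has run.
    print(compute_vector.cache_info())
    # e.g. CacheInfo(hits=..., misses=370100, maxsize=None, currsize=370100)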

If the size of the vectors is fixed, you can do even better by caching to a numpy array of shape (num_unique_keys, len(vec1)), where in your case num_unique_keys = 370000 + 100:

    import numpy as np
    from ast import literal_eval

    class VectorCache:
        def __init__(self, func, num_keys, item_size):
            self.func = func
            # one preallocated row per unique key
            self.cache = np.empty((num_keys, item_size), dtype=float)
            self.keys = {}

        def __getitem__(self, key):
            if key in self.keys:
                return self.cache[self.keys[key]]
            # first time this key is seen: give it the next free row
            self.keys[key] = len(self.keys)
            item = self.func(key)
            self.cache[self.keys[key]] = item
            return item


    def compute_vector(s):
        return np.mean([
            word_vector_dict[word]
            for word in literal_eval(s)
            if word in word_vector_dict
        ], axis=0)


    # num_keys = 370000 + 100 unique strings; item_size = len(vec1)
    vector_cache = VectorCache(compute_vector, num_keys, item_size)

Then:

    vec1 = vector_cache[query]
    vec2 = vector_cache[question]

Using a similar technique, you can also cache the cosine distances:

    @lru_cache(maxsize=None)
    def cosine_distance(query, question):
        return distance.cosine(vector_cache[query], vector_cache[question])

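Putting the pieces together, here is a minimal sketch of how the original loop could look with the caches in place (assuming `combined`, `data_list`, `word_vector_dict`, and the cache definitions above are in scope):

    from operator import itemgetter
    from nltk.tokenize import word_tokenize

    store_question_score = []
    result_dict = {}
    count = 0

    for query_question in combined:
        query = "{}".format(word_tokenize(query_question[0]))
        question = "{}".format(word_tokenize(query_question[1]))

        # repeated strings now cost a cache lookup instead of np.mean
        similarity_score = 1 - cosine_distance(query, question)
        store_question_score.append((query_question[1], similarity_score))
        count += 1

        if count == len(data_list):
            top_five = sorted(
                store_question_score, key=itemgetter(1), reverse=True
            )[:5]
            result_dict[query_question[0]] = top_five
            store_question_score = []
            count = 0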