Creating a TimeseriesGenerator with multiple inputs

Date: 2023-01-23

Problem description

I'm trying to train an LSTM model on daily fundamental and price data from ~4000 stocks; due to memory limits, I cannot hold everything in memory after converting the data to sequences for the model.

This leads me to using a generator instead, like the TimeseriesGenerator from Keras / TensorFlow. The problem is that if I use the generator on all of my data stacked, it creates sequences of mixed stocks: with a sequence length of 5, for example, Sequence 3 would include the last 4 observations of "stock 1" and the first observation of "stock 2".
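To make the mixing concrete, here is a minimal NumPy sketch (toy arrays standing in for the real data) of the windows a generator produces when it slides over two stocks stacked vertically:

```python
import numpy as np

# Toy data: 5 daily observations for each of two stocks, stacked vertically.
stock1 = np.full((5, 1), 1.0)          # rows belonging to "stock 1"
stock2 = np.full((5, 1), 2.0)          # rows belonging to "stock 2"
stacked = np.vstack([stock1, stock2])  # shape (10, 1)

seq_len = 5
# Sliding windows over the stacked array, the way a single generator would slice it.
windows = [stacked[i:i + seq_len] for i in range(len(stacked) - seq_len + 1)]

# The second window already crosses the stock boundary:
# 4 observations of stock 1 followed by 1 observation of stock 2.
print(windows[1].ravel())  # [1. 1. 1. 1. 2.]
```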

Instead, what I want is for sequences to be built within each stock separately, so that no sequence spans two stocks.

Slightly similar question: Merge or append multiple Keras TimeseriesGenerator objects into one

I explored the option of combining the generators as this SO answer suggests: How do I combine two keras generator functions. However, this is not ideal in the case of ~4000 generators.

I hope my question makes sense.

Solution

So what I ended up doing was to do all the preprocessing manually and save an .npy file for each stock containing the preprocessed sequences; then, using a manually created generator, I make batches like this:

import numpy as np
import tensorflow as tf

class seq_generator():

  def __init__(self, list_of_filepaths):
    # Remember which sequence indices have already been served per file.
    self.usedDict = dict()
    for path in list_of_filepaths:
      self.usedDict[path] = []

  def generate(self):
    while True:
      # Pick a random stock file, then a random sequence index within it.
      path = np.random.choice(list(self.usedDict.keys()))
      stock_array = np.load(path)
      random_sequence = np.random.randint(stock_array.shape[0])
      if random_sequence not in self.usedDict[path]:
        self.usedDict[path].append(random_sequence)
        yield stock_array[random_sequence, :, :]

train_generator = seq_generator(list_of_filepaths)

# Pass the bound generator method itself (a callable), not the result of
# calling it; and since the generator yields a single array, output_types
# is a single dtype rather than a pair.
train_dataset = tf.data.Dataset.from_generator(train_generator.generate,
                                               output_types=tf.float32,
                                               output_shapes=(n_timesteps, n_features))

train_dataset = train_dataset.batch(batch_size)

Where list_of_filepaths is simply a list of paths to preprocessed .npy data.
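The answer does not show how those per-stock .npy files are produced, but one possible preprocessing step might look like this (the function name, `ticker`/`daily_data` parameters, and the example values are all hypothetical):

```python
import numpy as np
import tempfile

def save_stock_sequences(ticker, daily_data, n_timesteps, out_dir):
    """Slice one stock's daily observations into overlapping windows and save
    them as a single (n_sequences, n_timesteps, n_features) .npy file."""
    sequences = np.stack([daily_data[i:i + n_timesteps]
                          for i in range(len(daily_data) - n_timesteps + 1)])
    path = f"{out_dir}/{ticker}.npy"
    np.save(path, sequences)
    return path

# Example: 30 days of 4 hypothetical features for one stock.
path = save_stock_sequences("AAPL", np.random.rand(30, 4), n_timesteps=10,
                            out_dir=tempfile.gettempdir())
print(np.load(path).shape)  # (21, 10, 4)
```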


This will:

  • Load a random stock's preprocessed .npy data
  • Pick a sequence at random
  • Check if the index of the sequence has already been used in usedDict
  • If not:
    • Append the index of that sequence to usedDict to keep track as to not feed the same data twice to the model
    • Yield the sequence

This means that the generator will feed a single unique sequence from a random stock at each "call", enabling me to use the .from_generator() and .batch() methods of TensorFlow's Dataset type.
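The uniqueness bookkeeping can be sanity-checked without TensorFlow. The following is a standalone re-creation of the sampling loop (not part of the original answer; file names and array contents are made up), run against two tiny temporary .npy files to verify that no (file, sequence) pair is ever served twice:

```python
import numpy as np
import tempfile, os

# Two tiny stock files: 3 sequences each, 5 timesteps, 2 features.
tmp = tempfile.mkdtemp()
paths = []
for name in ("stock_a", "stock_b"):
    p = os.path.join(tmp, name + ".npy")
    np.save(p, np.random.rand(3, 5, 2))
    paths.append(p)

used = {p: [] for p in paths}   # same bookkeeping as usedDict above
seen = set()
rng = np.random.default_rng(0)
for _ in range(100):            # sample far more often than sequences exist
    path = rng.choice(paths)
    arr = np.load(path)
    idx = rng.integers(arr.shape[0])
    if idx not in used[path]:
        used[path].append(idx)
        assert (path, idx) not in seen   # each sequence served at most once
        seen.add((path, idx))

print(len(seen))  # number of unique sequences served (at most 6 here)
```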
