Bug description
```python
import random
import torch

# This function is saved in the d2lzh_pytorch package for later use
def data_iter_random(corpus_indices, batch_size, num_steps, device=None):
    # Subtract 1 because the label index Y is the corresponding input index X plus 1
    num_examples = (len(corpus_indices) - 1) // num_steps
    epoch_size = num_examples // batch_size
    example_indices = list(range(num_examples))
    random.shuffle(example_indices)

    # Return a sequence of length num_steps starting from pos
    def _data(pos):
        return corpus_indices[pos: pos + num_steps]

    if device is None:
        device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

    for i in range(epoch_size):
        # Read batch_size random examples each time
        i = i * batch_size
        batch_indices = example_indices[i: i + batch_size]
        X = [_data(j * num_steps) for j in batch_indices]
        Y = [_data(j * num_steps + 1) for j in batch_indices]
        yield torch.tensor(X, dtype=torch.float32, device=device), \
              torch.tensor(Y, dtype=torch.float32, device=device)
```
The above is the random-sampling implementation, but I think it has two problems. First, because of `for i in range(epoch_size)` (together with `example_indices = list(range(num_examples))`), sampling effectively always starts from index 0: the example starting positions are always the same multiples of `num_steps`. For the test given below, X can only draw batches from the values 0-23; in other words, X's batches never include 24, 25, 26, 27, 28.
```python
# Test
my_seq = list(range(30))
for X, Y in data_iter_random(my_seq, batch_size=2, num_steps=6):
    print('X: ', X, '\nY:', Y, '\n')
```

Output as given:

```
X:  tensor([[18., 19., 20., 21., 22., 23.],
        [12., 13., 14., 15., 16., 17.]])
Y: tensor([[19., 20., 21., 22., 23., 24.],
        [13., 14., 15., 16., 17., 18.]])

X:  tensor([[ 0.,  1.,  2.,  3.,  4.,  5.],
        [ 6.,  7.,  8.,  9., 10., 11.]])
Y: tensor([[ 1.,  2.,  3.,  4.,  5.,  6.],
        [ 7.,  8.,  9., 10., 11., 12.]])
```
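To confirm this concretely, here is a quick check (just a sketch, assuming the `data_iter_random` defined above is in scope) that collects every value ever appearing in X across many epochs:

```python
# Sketch: with len(my_seq) == 30 and num_steps == 6, num_examples = 29 // 6 = 4,
# so the only possible starting positions are 0, 6, 12 and 18, and the
# values 24-28 can never show up in X no matter how many epochs we run.
seen = set()
for _ in range(1000):
    for X, _ in data_iter_random(list(range(30)), batch_size=2, num_steps=6):
        seen.update(int(v) for v in X.flatten())
print(sorted(seen))  # expected: [0, 1, ..., 23] -- 24-28 are missing
```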
Q1: When implementing random sampling, shouldn't we guarantee that some epochs contain batches covering 24, 25, 26, 27, 28? (Please correct me if I've misunderstood.) The same issue also occurs in consecutive sampling.

Q2: In addition, the implementation above only ever yields batches of exactly `batch_size=2`; when leftover data smaller than a full batch remains, it is never yielded. But with fully connected networks and CNNs, the last mini-batch we read is often smaller than `batch_size`. So here, if the test above leaves enough data for at least a batch of size 1 (e.g. with `my_seq = list(range(32))`, in which case 8 values are never sampled), should we continue and yield one more batch with `batch_size=1`? I'd appreciate clarification.
The Q2 case:

```python
# Test
my_seq = list(range(32))
for X, Y in data_iter_random(my_seq, batch_size=2, num_steps=6):
    print('X: ', X, '\nY:', Y, '\n')
```

Result (this includes the case of a final batch with 0 < size <= 2):

```
X:  tensor([[18., 19., 20., 21., 22., 23.],
        [ 0.,  1.,  2.,  3.,  4.,  5.]])
Y: tensor([[19., 20., 21., 22., 23., 24.],
        [ 1.,  2.,  3.,  4.,  5.,  6.]])

X:  tensor([[12., 13., 14., 15., 16., 17.],
        [24., 25., 26., 27., 28., 29.]])
Y: tensor([[13., 14., 15., 16., 17., 18.],
        [25., 26., 27., 28., 29., 30.]])

X:  tensor([[ 6.,  7.,  8.,  9., 10., 11.]])
Y: tensor([[ 7.,  8.,  9., 10., 11., 12.]])
```
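For reference, here is the arithmetic behind Q2 as I understand it (a small sketch of my own, not code from the book), showing why the original implementation silently drops the fifth example when `my_seq = list(range(32))`:

```python
# Sketch: why the original implementation drops data for len(corpus) == 32.
corpus_len, num_steps, batch_size = 32, 6, 2
num_examples = (corpus_len - 1) // num_steps      # 31 // 6 = 5 examples available
epoch_size = num_examples // batch_size           # 5 // 2 = 2 full batches
dropped = num_examples - epoch_size * batch_size  # 1 example is never yielded
print(num_examples, epoch_size, dropped)          # 5 2 1
```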
Below is my own random-sampling implementation, which handles both situations described above:
```python
import numpy as np
import torch

def data_iter_random(corpus_indices, batch_size, num_steps, device=None):
    # Subtract 1 because the label index Y is the corresponding input index X plus 1
    num_examples = (len(corpus_indices) - 1) // num_steps
    # Random starting offset for sampling
    sample_start = np.random.randint((len(corpus_indices) - 1) % num_steps + 1)
    example_indices = np.arange(sample_start, len(corpus_indices), num_steps)[:num_examples]
    np.random.shuffle(example_indices)

    # Move to GPU if available
    if device is None:
        device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

    # Read up to batch_size random examples each time; the last batch may be smaller
    for idx in np.arange(0, len(example_indices), batch_size):
        batch_example = example_indices[idx:(idx + batch_size)]
        x = [corpus_indices[pos:(pos + num_steps)] for pos in batch_example]
        y = [corpus_indices[(pos + 1):(pos + 1 + num_steps)] for pos in batch_example]
        yield torch.tensor(x, dtype=torch.float32, device=device), \
              torch.tensor(y, dtype=torch.float32, device=device)
```
Test output:
```python
my_seq = list(range(30))
for X, Y in data_iter_random(my_seq, batch_size=2, num_steps=6):
    print('X: ', X, '\nY:', Y, '\n')
```

```
X:  tensor([[14., 15., 16., 17., 18., 19.],
        [ 8.,  9., 10., 11., 12., 13.]], device='cuda:0')
Y: tensor([[15., 16., 17., 18., 19., 20.],
        [ 9., 10., 11., 12., 13., 14.]], device='cuda:0')

X:  tensor([[ 2.,  3.,  4.,  5.,  6.,  7.],
        [20., 21., 22., 23., 24., 25.]], device='cuda:0')
Y: tensor([[ 3.,  4.,  5.,  6.,  7.,  8.],
        [21., 22., 23., 24., 25., 26.]], device='cuda:0')
```
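As a follow-up check (again just a sketch using the function above): since `sample_start` is re-drawn on every call, the reachable starting positions shift between epochs, so over many epochs every value 0-28 can eventually appear in X:

```python
# Sketch: verify that the rewritten sampler can reach all positions over time.
seen = set()
for _ in range(200):
    for X, _ in data_iter_random(list(range(30)), batch_size=2, num_steps=6):
        seen.update(int(v) for v in X.flatten())
print(sorted(seen))  # with high probability: [0, 1, ..., 28]
```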
Version info
pytorch: 1.6.0
torchvision: 0.7.0
torchtext:
...