Description
I get an out-of-memory (OOM) error during training:
Environment:
TensorFlow: 1.4
GPU: Titan X
Python: 2.7
OS: Ubuntu 16.04
Error:
2018-01-07 22:12:42.933166: W tensorflow/core/framework/op_kernel.cc:1192] Resource exhausted: OOM when allocating tensor with shape[34560,1]
Traceback (most recent call last):
File "train.py", line 61, in <module>
main()
File "/usr/local/lib/python2.7/dist-packages/click/core.py", line 722, in __call__
return self.main(*args, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/click/core.py", line 697, in main
rv = self.invoke(ctx)
File "/usr/local/lib/python2.7/dist-packages/click/core.py", line 895, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/usr/local/lib/python2.7/dist-packages/click/core.py", line 535, in invoke
return callback(*args, **kwargs)
File "train.py", line 57, in main
model.train(config, train_set, val_set, lr_schedule)
File "/home/hope/im2latex-1/model/base.py", line 160, in train
lr_schedule)
File "/home/hope/im2latex-1/model/img2seq.py", line 173, in _run_epoch
feed_dict=fd)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 889, in run
run_metadata_ptr)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1120, in _run
feed_dict_tensor, options, run_metadata)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1317, in _do_run
options, run_metadata)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1336, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[34560,1]
[[Node: attn_cell/rnn/while/rnn/att_mechanism/MatMul = MatMul[T=DT_FLOAT, transpose_a=false, transpose_b=false, _device="/job:localhost/replica:0/task:0/device:GPU:0"](attn_cell/rnn/while/rnn/att_mechanism/Reshape, attn_cell/rnn/while/rnn/att_mechanism/MatMul/Enter)]]
[[Node: Mean/_85 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_2674_Mean", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]
Caused by op u'attn_cell/rnn/while/rnn/att_mechanism/MatMul', defined at:
File "train.py", line 61, in <module>
main()
File "/usr/local/lib/python2.7/dist-packages/click/core.py", line 722, in __call__
return self.main(*args, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/click/core.py", line 697, in main
rv = self.invoke(ctx)
File "/usr/local/lib/python2.7/dist-packages/click/core.py", line 895, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/usr/local/lib/python2.7/dist-packages/click/core.py", line 535, in invoke
return callback(*args, **kwargs)
File "train.py", line 56, in main
model.build_train(config)
File "/home/hope/im2latex-1/model/img2seq.py", line 41, in build_train
self._add_pred_op()
File "/home/hope/im2latex-1/model/img2seq.py", line 119, in _add_pred_op
self.dropout)
File "/home/hope/im2latex-1/model/decoder.py", line 60, in __call__
initial_state=attn_cell.initial_state())
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/rnn.py", line 614, in dynamic_rnn
dtype=dtype)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/rnn.py", line 777, in _dynamic_rnn_loop
swap_memory=swap_memory)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/control_flow_ops.py", line 2816, in while_loop
result = loop_context.BuildLoop(cond, body, loop_vars, shape_invariants)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/control_flow_ops.py", line 2640, in BuildLoop
pred, body, original_loop_vars, loop_vars, shape_invariants)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/control_flow_ops.py", line 2590, in _BuildLoop
body_result = body(*packed_vars_for_body)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/rnn.py", line 762, in _time_step
(output, new_state) = call_cell()
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/rnn.py", line 748, in <lambda>
call_cell = lambda: cell(input_t, state)
File "/home/hope/im2latex-1/model/components/attention_cell.py", line 109, in __call__
new_output, new_state = self.step(inputs, state)
File "/home/hope/im2latex-1/model/components/attention_cell.py", line 79, in step
c = self._attention_mechanism.context(new_h)
File "/home/hope/im2latex-1/model/components/attention_mechanism.py", line 83, in context
e = tf.matmul(att_flat, att_beta)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/math_ops.py", line 1898, in matmul
a, b, transpose_a=transpose_a, transpose_b=transpose_b, name=name)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gen_math_ops.py", line 2437, in _mat_mul
name=name)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
op_def=op_def)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 2960, in create_op
op_def=op_def)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1473, in __init__
self._traceback = self._graph._extract_stack() # pylint: disable=protected-access
ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape[34560,1]
[[Node: attn_cell/rnn/while/rnn/att_mechanism/MatMul = MatMul[T=DT_FLOAT, transpose_a=false, transpose_b=false, _device="/job:localhost/replica:0/task:0/device:GPU:0"](attn_cell/rnn/while/rnn/att_mechanism/Reshape, attn_cell/rnn/while/rnn/att_mechanism/MatMul/Enter)]]
[[Node: Mean/_85 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_2674_Mean", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]
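For scale, here is a quick calculation of the size of the tensor named in the error. It is only a sketch: it assumes 4 bytes per element, since the failing node reports T=DT_FLOAT (32-bit float), and the helper name `tensor_bytes` is hypothetical, not part of the project or of TensorFlow.

```python
def tensor_bytes(shape, bytes_per_element=4):
    """Return the size in bytes of a dense tensor with the given shape.

    bytes_per_element=4 is an assumption matching DT_FLOAT (float32).
    """
    n = 1
    for dim in shape:
        n *= dim
    return n * bytes_per_element


# Shape taken from the error message: shape[34560,1]
size = tensor_bytes([34560, 1])
print("%d bytes = %.1f KiB" % (size, size / 1024.0))
# prints "138240 bytes = 135.0 KiB"
```

At about 135 KiB the failed allocation is small on its own, which suggests the GPU memory was already almost full from earlier allocations when this MatMul ran, rather than this one tensor being too large.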