Posted on 2021-11-25 22:12:14
The attached PDF contains the run output.




Posted on 2021-11-25 22:18:29
The PDF would not upload, so here is an excerpt of the output:
2021-11-25 10:12:25,795 [INFO] __main__: Number of images in the training dataset:          9614
2021-11-25 10:12:25,795 [INFO] __main__: Number of images in the validation dataset:          1068
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:986: The name tf.assign_add is deprecated. Please use tf.compat.v1.assign_add instead.

2021-11-25 10:12:26,985 [WARNING] tensorflow: From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:986: The name tf.assign_add is deprecated. Please use tf.compat.v1.assign_add instead.

WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:973: The name tf.assign is deprecated. Please use tf.compat.v1.assign instead.

2021-11-25 10:12:27,229 [WARNING] tensorflow: From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:973: The name tf.assign is deprecated. Please use tf.compat.v1.assign instead.

Epoch 1/80
   2/1202 [..............................] - ETA: 2:01:43 - loss: 7.1783/usr/local/lib/python3.6/dist-packages/keras/callbacks.py:122: UserWarning: Method on_batch_end() is slow compared to the batch update (2.221805). Check your callbacks.
  % delta_t_median)
1202/1202 [==============================] - 120s 100ms/step - loss: 5.2242
03a3f4eedd9d:48:76 [0] NCCL INFO Bootstrap : Using [0]lo:127.0.0.1<0> [1]eth0:172.17.0.5<0>
03a3f4eedd9d:48:76 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
03a3f4eedd9d:48:76 [0] NCCL INFO NET/IB : No device found.
03a3f4eedd9d:48:76 [0] NCCL INFO NET/Socket : Using [0]lo:127.0.0.1<0> [1]eth0:172.17.0.5<0>
03a3f4eedd9d:48:76 [0] NCCL INFO Using network Socket
NCCL version 2.7.8+cuda11.1
03a3f4eedd9d:48:76 [0] NCCL INFO Channel 00/32 :    0
[... identical "Channel NN/32 :    0" lines for channels 01-30 omitted ...]
03a3f4eedd9d:48:76 [0] NCCL INFO Channel 31/32 :    0
03a3f4eedd9d:48:76 [0] NCCL INFO Trees [0] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [1] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [2] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [3] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [4] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [5] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [6] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [7] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [8] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [9] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [10] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [11] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [12] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [13] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [14] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [15] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [16] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [17] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [18] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [19] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [20] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [21] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [22] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [23] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [24] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [25] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [26] -1/-1/-1->0->-1|-1->0->-1/-
03a3f4eedd9d:48:76 [0] NCCL INFO 32 coll channels, 32 p2p channels, 32 p2p channels per peer
03a3f4eedd9d:48:76 [0] NCCL INFO comm 0x7fbd08331fd0 rank 0 nranks 1 cudaDev 0 busId 1000 - Init COMPLETE

Epoch 00001: saving model to /workspace/tao-experiments/ssd_cn/experiment_dir_retrain/weights/ssd_resnet18_epoch_001.tlt
Epoch 2/80
  59/1202 [>.............................] - ETA: 1:48 - loss: 4.9380DALI daliShareOutput(&pipe_handle_) failed: Critical error in pipeline:
Error when executing GPU operator Paste encountered:
CUDA allocation failed
Current pipeline object is no longer valid.
Traceback (most recent call last):
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/ssd/scripts/train.py", line 366, in <module>
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/utils.py", line 494, in return_func
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/utils.py", line 482, in return_func
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/ssd/scripts/train.py", line 362, in main
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/ssd/scripts/train.py", line 274, in run_experiment
  File "/usr/local/lib/python3.6/dist-packages/keras/engine/training.py", line 1039, in fit
    validation_steps=validation_steps)
  File "/usr/local/lib/python3.6/dist-packages/keras/engine/training_arrays.py", line 154, in fit_loop
    outs = f(ins)
  File "/usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py", line 2715, in __call__
    return self._call(inputs)
  File "/usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py", line 2675, in _call
    fetched = self._callable_fn(*array_vals)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1472, in __call__
    run_metadata_ptr)
tensorflow.python.framework.errors_impl.InternalError: 2 root error(s) found.
  (0) Internal: DALI daliShareOutput(&pipe_handle_) failed: Critical error in pipeline:
Error when executing GPU operator Paste encountered:
CUDA allocation failed
Current pipeline object is no longer valid.
         [[{{node Dali}}]]
         [[cond_6/SliceReplace_4/range/_10809]]
  (1) Internal: DALI daliShareOutput(&pipe_handle_) failed: Critical error in pipeline:
Error when executing GPU operator Paste encountered:
CUDA allocation failed
Current pipeline object is no longer valid.
         [[{{node Dali}}]]
0 successful operations.
0 derived errors ignored.
2021-11-25 18:14:43,317 [INFO] tlt.components.docker_handler.docker_handler: Stopping container.
Posted on 2021-11-26 07:15:04
This is a GPU memory problem: the device ran out of memory, which is what "CUDA allocation failed" means here.
You can try reducing the batch_size.
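As a hedged sketch of where that setting lives: in the standard TAO/TLT SSD experiment spec, the batch size is set in `training_config`. The values below are illustrative, not taken from the poster's actual spec file:

```protobuf
training_config {
  # Halving batch_size_per_gpu roughly halves the per-step GPU memory
  # footprint of the DALI input pipeline and the model's activations.
  batch_size_per_gpu: 8    # e.g. reduced from 16
  num_epochs: 80
}
```

After editing the spec, rerun the same `tao ssd train` (or `tlt ssd train` on older releases) command. If the allocation failure persists, reduce the batch size further or lower the input resolution in `augmentation_config`.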