step=000 loss=6.9051 step_ms=927.02
step=007 loss=6.9025 step_ms=665.78
App: training_lab
Mode: data
Device: cuda
Batch size: 256
Input dim: 1024
Hidden dim: 2048
Num workers: 0
Sleep per sample (ms): 2.0
CPU transform depth: 6
Micro ops: 0
Pointwise depth: 0
torch.compile: False
AMP mode: none
Average step time (ms): 702.02
Steady-state step time (ms): 669.87
Step time p50 (ms): 665.72
Steady-state throughput (samples/s): 382.16
Final loss: 6.9025

Top operators by self CPU time:
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total CUDA time avg       CPU Mem  Self CPU Mem      CUDA Mem Self CUDA Mem    # of Calls
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
enumerate(DataLoader)#_SingleProcessDataLoaderIter._...        90.77%        3.677s        99.07%        4.013s     668.856ms       0.000us         0.00%       0.000us       0.000us           0 B    -155.98 MB           0 B           0 B             6
                                             aten::tanh         1.29%      52.122ms         1.29%      52.122ms       5.656us       0.000us         0.00%       0.000us       0.000us      35.96 MB      35.96 MB           0 B           0 B          9216
                                              aten::mul         1.04%      42.269ms         2.33%      94.273ms      10.229us       0.000us         0.00%       0.000us       0.000us      36.00 MB      35.95 MB           0 B           0 B          9216
                                         aten::_to_copy         0.88%      35.798ms         1.78%      72.266ms       3.918us       0.000us         0.00%     258.588us       0.014us      72.00 KB       4.44 KB       6.01 MB           0 B         18444
                                              aten::add         0.82%      33.072ms         1.62%      65.611ms       7.119us       0.000us         0.00%       0.000us       0.000us      36.00 MB      35.94 MB           0 B           0 B          9216
                                             aten::roll         0.68%      27.390ms         2.37%      95.907ms      10.407us       0.000us         0.00%       0.000us       0.000us      36.00 MB     108.00 KB           0 B           0 B          9216
                                            aten::slice         0.56%      22.775ms         0.74%      29.996ms       1.627us       0.000us         0.00%       0.000us       0.000us           0 B           0 B           0 B           0 B         18438
                                              aten::cat         0.54%      21.672ms         0.54%      21.692ms       2.351us       0.000us         0.00%       0.000us       0.000us      41.91 MB      41.91 MB           0 B           0 B          9228
                                            aten::copy_         0.53%      21.601ms         0.55%      22.305ms       1.116us     258.588us         3.50%     258.588us       0.013us           0 B           0 B           0 B           0 B         19992
                                    aten::empty_strided         0.52%      21.139ms         0.52%      21.139ms       1.058us       0.000us         0.00%       0.000us       0.000us       6.03 MB       6.03 MB       6.01 MB       6.01 MB         19986
                                           aten::narrow         0.47%      19.008ms         1.21%      49.004ms       2.658us       0.000us         0.00%       0.000us       0.000us           0 B           0 B           0 B           0 B         18438
                                               aten::to         0.33%      13.406ms         2.11%      85.671ms       4.643us       0.000us         0.00%     258.588us       0.014us     116.00 KB      44.00 KB       6.01 MB           0 B         18450
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
Self CPU time total: 4.051s
Self CUDA time total: 7.395ms

Top operators by CUDA time:
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total CUDA time avg       CPU Mem  Self CPU Mem      CUDA Mem Self CUDA Mem    # of Calls
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                          ProfilerStep*         0.06%       2.260ms        99.80%        4.043s     673.798ms       0.000us         0.00%       5.490ms     914.927us           0 B           0 B           0 B           0 B             6
                                                forward         0.00%       0.000us         0.00%       0.000us       0.000us       4.818ms        65.15%       4.818ms     802.946us           0 B           0 B           0 B           0 B             6
                              Optimizer.step#AdamW.step         0.00%       0.000us         0.00%       0.000us       0.000us       3.652ms        49.38%       3.652ms     608.635us           0 B           0 B           0 B           0 B             6
                                              optimizer         0.01%     252.050us         0.07%       2.753ms     458.851us       0.000us         0.00%       3.231ms     538.520us           0 B           0 B           0 B           0 B             6
                              Optimizer.step#AdamW.step         0.04%       1.586ms         0.06%       2.501ms     416.843us       0.000us         0.00%       3.231ms     538.520us           0 B           0 B           0 B           0 B             6
                                    aten::_fused_adamw_         0.00%     162.913us         0.01%     347.322us      57.887us       3.224ms        43.60%       3.224ms     537.300us           0 B           0 B           0 B           0 B             6
void at::native::(anonymous namespace)::multi_tensor...         0.00%       0.000us         0.00%       0.000us       0.000us       3.224ms        43.60%       3.224ms     537.300us           0 B           0 B           0 B           0 B             6
                                                forward         0.04%       1.582ms         0.18%       7.156ms       1.193ms       0.000us         0.00%       1.995ms     332.446us           0 B           0 B      54.99 MB      -5.86 MB             6
                                           aten::linear         0.00%     139.605us         0.09%       3.845ms     213.600us       0.000us         0.00%       1.921ms     106.713us           0 B           0 B      29.86 MB           0 B            18
                                            aten::addmm         0.04%       1.722ms         0.08%       3.372ms     187.313us       1.921ms        25.98%       1.921ms     106.713us           0 B           0 B      29.86 MB      29.86 MB            18
    autograd::engine::evaluate_function: AddmmBackward0         0.01%     413.638us         0.12%       4.887ms     271.485us       0.000us         0.00%       1.833ms     101.827us           0 B           0 B     158.88 MB     -56.11 MB            18
                                         AddmmBackward0         0.01%     257.074us         0.08%       3.434ms     190.801us       0.000us         0.00%       1.736ms      96.432us           0 B           0 B     214.88 MB           0 B            18
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
Self CPU time total: 4.051s
Self CUDA time total: 7.395ms

Torch profiler trace written to: /home/duoan/playground-cuda/traces/blog_data_profile
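
The summary figures in this log can be cross-checked with a little arithmetic. The sketch below is not part of the training_lab output. It assumes that throughput is batch size divided by steady-state step time, and that the per-sample sleep runs serially inside the single-process loader (Num workers: 0). The log states neither assumption, and all variable names are illustrative.

```python
# Values copied from the log output; nothing here is measured by this script.
batch_size = 256
steady_state_step_ms = 669.87
sleep_per_sample_ms = 2.0
dataloader_self_cpu_s = 3.677   # Self CPU of enumerate(DataLoader)#...
self_cpu_total_s = 4.051        # "Self CPU time total"

# Assumption: throughput = samples per step / seconds per step.
throughput = batch_size / (steady_state_step_ms / 1000.0)

# Assumption: with no worker processes, each sample's sleep is paid serially,
# so the sleep alone puts a lower bound on the time to assemble one batch.
sleep_floor_ms = batch_size * sleep_per_sample_ms
sleep_share_of_step = sleep_floor_ms / steady_state_step_ms

# Share of profiled self CPU time spent inside the DataLoader iterator.
dataloader_share = dataloader_self_cpu_s / self_cpu_total_s

print(f"throughput          : {throughput:.2f} samples/s")
print(f"sleep floor per step: {sleep_floor_ms:.1f} ms "
      f"({sleep_share_of_step:.1%} of steady-state step time)")
print(f"DataLoader self CPU : {dataloader_share:.2%}")
```

Under these assumptions the computed throughput and DataLoader share agree with the logged 382.16 samples/s and 90.77%. The sleep alone would account for roughly three quarters of each step, which is consistent with the input pipeline dominating the step rather than the GPU work (7.395ms of self CUDA time in total).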