step=000 loss=7.0117 step_ms=432.63
step=007 loss=7.7943 step_ms=12.18
App: training_lab
Mode: torch
Device: cuda
Batch size: 256
Input dim: 1024
Hidden dim: 1024
Num workers: 2
Sleep per sample (ms): 0.0
CPU transform depth: 1
Micro ops: 48
Pointwise depth: 0
torch.compile: False
AMP mode: none
Average step time (ms): 65.54
Steady-state step time (ms): 13.10
Step time p50 (ms): 12.05
Steady-state throughput (samples/s): 19546.50
Final loss: 7.7943
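
Note: the steady-state throughput above appears to be batch size divided by steady-state step time. The sketch below checks that under this assumed relationship, using only values copied from this summary. The variable names are illustrative and not taken from training_lab. A small gap is expected because the logged step time is rounded to two decimals.

```python
# Values copied from the run summary above.
batch_size = 256
steady_state_step_ms = 13.10
reported_throughput = 19546.50  # samples/s, as logged

# Assumed relationship (not confirmed by the log itself):
# throughput = batch size / step time in seconds.
derived = batch_size / (steady_state_step_ms / 1000.0)

# Relative gap between the derived and the logged figure.
rel_gap = abs(derived - reported_throughput) / reported_throughput

print(f"derived={derived:.2f} samples/s, relative gap={rel_gap:.4%}")
```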

Top operators by self CPU time:
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg       CPU Mem  Self CPU Mem      CUDA Mem  Self CUDA Mem    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                               backward        32.88%      35.191ms        33.18%      35.522ms       5.920ms       0.000us         0.00%       5.033us       0.839us      -4.50 KB      -4.50 KB    -540.00 MB    -542.01 MB             6  
                                       cudaLaunchKernel        20.59%      22.039ms        20.59%      22.039ms       6.752us       0.000us         0.00%       0.000us       0.000us           0 B           0 B           0 B           0 B          3264  
                                                forward         7.88%       8.438ms        33.80%      36.184ms       6.031ms       0.000us         0.00%       3.246ms     541.009us       4.50 KB       4.51 KB     587.86 MB      -1.13 GB             6  
                                              aten::mul         6.31%       6.752ms        13.37%      14.313ms      12.424us       1.459ms        17.56%       1.459ms       1.267us           0 B           0 B       1.12 GB       1.12 GB          1152  
                                              aten::add         4.09%       4.376ms         7.74%       8.283ms      14.380us     823.131us         9.90%     823.131us       1.429us          -8 B          -8 B     576.00 MB     576.00 MB           576  
      autograd::engine::evaluate_function: MulBackward0         1.93%       2.070ms         8.14%       8.716ms      15.133us       0.000us         0.00%     728.340us       1.264us      -4.50 KB      -4.50 KB     288.00 MB    -288.00 MB           576  
                                          aten::sigmoid         1.74%       1.863ms         3.47%       3.710ms      12.881us     481.009us         5.79%     481.009us       1.670us           0 B           0 B     288.00 MB     288.00 MB           288  
                                  cudaDeviceSynchronize         1.49%       1.590ms         1.49%       1.590ms     122.303us       1.303us         0.02%       1.303us       0.100us           0 B           0 B           0 B           0 B            13  
autograd::engine::evaluate_function: SigmoidBackward...         1.48%       1.585ms         7.42%       7.942ms      27.575us       0.000us         0.00%       1.111ms       3.858us           0 B           0 B    -576.00 MB    -864.00 MB           288  
                                          ProfilerStep*         1.44%       1.544ms        74.95%      80.227ms      13.371ms       0.000us         0.00%       5.253ms     875.574us           0 B           0 B           0 B           0 B             6  
                               aten::threshold_backward         1.27%       1.364ms         3.09%       3.307ms      11.482us     679.469us         8.18%     679.469us       2.359us           0 B           0 B     288.00 MB     288.00 MB           288  
                                        aten::clamp_min         1.24%       1.326ms         3.14%       3.365ms      11.685us     377.195us         4.54%     377.195us       1.310us           0 B           0 B     288.00 MB     288.00 MB           288  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 107.043ms
Self CUDA time total: 8.311ms


Top operators by CUDA time:
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg       CPU Mem  Self CPU Mem      CUDA Mem  Self CUDA Mem    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                forward         0.00%       0.000us         0.00%       0.000us       0.000us      33.661ms       405.00%      33.661ms       5.610ms           0 B           0 B           0 B           0 B             6  
                                          ProfilerStep*         1.44%       1.544ms        74.95%      80.227ms      13.371ms       0.000us         0.00%       5.253ms     875.574us           0 B           0 B           0 B           0 B             6  
                                                forward         7.88%       8.438ms        33.80%      36.184ms       6.031ms       0.000us         0.00%       3.246ms     541.009us       4.50 KB       4.51 KB     587.86 MB      -1.13 GB             6  
                              Optimizer.step#AdamW.step         0.00%       0.000us         0.00%       0.000us       0.000us       2.030ms        24.42%       2.030ms     338.320us           0 B           0 B           0 B           0 B             6  
                                              optimizer         0.33%     348.794us         2.00%       2.140ms     356.660us       0.000us         0.00%       1.626ms     271.000us           0 B           0 B           0 B           0 B             6  
                              Optimizer.step#AdamW.step         1.14%       1.225ms         1.67%       1.791ms     298.528us       0.000us         0.00%       1.626ms     271.000us           0 B           0 B           0 B           0 B             6  
                                    aten::_fused_adamw_         0.08%      89.806us         0.17%     179.172us      29.862us       1.619ms        19.47%       1.619ms     269.762us           0 B           0 B           0 B           0 B             6  
void at::native::(anonymous namespace)::multi_tensor...         0.00%       0.000us         0.00%       0.000us       0.000us       1.619ms        19.47%       1.619ms     269.762us           0 B           0 B           0 B           0 B             6  
                                              aten::mul         6.31%       6.752ms        13.37%      14.313ms      12.424us       1.459ms        17.56%       1.459ms       1.267us           0 B           0 B       1.12 GB       1.12 GB          1152  
void at::native::vectorized_elementwise_kernel<4, at...         0.00%       0.000us         0.00%       0.000us       0.000us       1.459ms        17.56%       1.459ms       1.267us           0 B           0 B           0 B           0 B          1152  
autograd::engine::evaluate_function: SigmoidBackward...         1.48%       1.585ms         7.42%       7.942ms      27.575us       0.000us         0.00%       1.111ms       3.858us           0 B           0 B    -576.00 MB    -864.00 MB           288  
void at::native::vectorized_elementwise_kernel<4, at...         0.00%       0.000us         0.00%       0.000us       0.000us     898.994us        10.82%     898.994us       1.561us           0 B           0 B           0 B           0 B           576  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 107.043ms
Self CUDA time total: 8.311ms

Torch profiler trace written to: /home/duoan/playground-cuda/traces/blog_torch_profile