The minimal repro profiles a toy model with `torch.profiler` and compares it against `benchmark.timeit`. The key fragments of the script are:

```python
import torch.utils.benchmark as benchmark
# ...
def __init__(self, in_features, nb_classes):
    # ...
    self.fc1 = nn.Linear(in_features, nb_classes)
# ...
on_trace_ready=_trace_handler('./log/minimal_issue'),
# ...
prof.step()  # Need to call this at the end of each step to notify profiler of steps' boundary.
```

(A runnable reconstruction of the full comparison is sketched at the end of this post.)

The profiler computes each step's time span as follows. The beginning of each step "ProfilerStep#i" is marked as steps_cpu[i].start and its end as steps_cpu[i].end. If run without a GPU, each step's interval is simply [steps_cpu[i].start, steps_cpu[i].end]. With a GPU:

1. For each step, get all GPU executions (kernel/memcpy/memset) which are launched by this step. Mark the earliest start time as steps_device[i].start and the latest ending time as steps_device[i].end.
2. Loop over the steps from smaller id to bigger id. Mark each step's end time as steps[i].end = max(steps_cpu[i].end, steps_device[i].end), and each step's start time as steps[i].start = max(steps[i-1].end, steps_cpu[i].start). For the first step, steps[0].start = max(prev_step_end_time, steps_cpu[0].start), where prev_step_end_time is the maximum of the latest ending time of any kernel/memcpy/memset launched before the first step and steps_cpu[0].start.

(A small Python sketch of this computation is also given at the end of this post.)

Sure, I ran the two versions of the toy code again and confirmed that they're reproducible for me with a Tesla T4. I'm running this on an EC2 instance g4dn.4xlarge. To be clear, I'm using PyTorch 1.9.0 and Kineto built from plugin/0.2 to fix #299. I don't always see this behavior with every model / run, but it's not uncommon.

Toy example with stack and input shape recording: (chart omitted)

Toy example without stack and input shape recording: (chart omitted)

The GRU variant of the toy model uses `u = nn.GRU(input_size=in_features, hidden_size=in_features)`.

Thanks for your valuable feedback! Let's analyze this interesting phenomenon of why a profiler step is slower than a step measured with `benchmark.timeit`. I found it is due to the overhead brought by the profiler's recording before and after each op.

The prerequisite knowledge for this problem is that PyTorch ops on the CPU side and kernels on the GPU side execute asynchronously. That is, a CPU-side op launches a kernel, and the kernel is queued to the GPU. No matter whether the kernel is scheduled to run on the GPU right away or has to wait for a previous kernel to finish, the CPU side immediately goes on executing the next ops and launching the following kernels. (A tiny timing experiment at the end of this post illustrates this behavior.)

When the batch size is as small as 32, most kernels finish quickly on the GPU, so the dominant bottleneck is the CPU side's launching of ops, and the overhead of instrumenting every op is explicitly visible. When the batch size is increased to 256, most kernels' execution time on the GPU is longer than the duration of the op that launched them on the CPU side, so GPU-side kernel time becomes the dominant bottleneck and the instrumentation overhead is "hidden behind" the overlapped kernel execution.

My experiment, reusing your two copies of the toy code (callstack and record_shapes disabled):

- 1st toy model: batch size 32: benchmark time 38 ms, profiler time 61 ms; batch size 256: benchmark time 103 ms, profiler time 102 ms.
- GRU toy model: batch size 32: benchmark time 161 ms, profiler time 231 ms; batch size 256: benchmark time 289 ms, profiler time 285 ms.

Let's further analyze it by checking the trace. When the batch size is 32, the trace shows that most red lines are "vertical", which means the kernels execute fast enough and don't overlap with the following CPU-side ops. (Trace screenshots omitted.)
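For completeness, here is a minimal reconstruction sketch of the benchmark-vs-profiler comparison, built around the `nn.Linear` layer from the excerpt above. It is not the exact script from the issue: the model sizes, batch size, step count, and the use of `tensorboard_trace_handler` in place of the truncated handler name are assumptions.

```python
# Reconstruction sketch (not the original script from the issue): it times one
# forward step with torch.utils.benchmark and then runs the same step under
# torch.profiler. Model sizes, batch size, and step count are assumptions.
import time

import torch
import torch.nn as nn
import torch.utils.benchmark as benchmark
from torch.profiler import ProfilerActivity, profile, schedule, tensorboard_trace_handler


class ToyModel(nn.Module):
    def __init__(self, in_features, nb_classes):
        super().__init__()
        self.fc1 = nn.Linear(in_features, nb_classes)

    def forward(self, x):
        return self.fc1(x)


device = "cuda" if torch.cuda.is_available() else "cpu"
in_features, nb_classes, batch_size, n_steps = 4096, 1000, 32, 100  # assumed values
model = ToyModel(in_features, nb_classes).to(device)
x = torch.randn(batch_size, in_features, device=device)

# 1) Plain timing: benchmark.Timer synchronizes the GPU around the timed statement.
timer = benchmark.Timer(stmt="model(x)", globals={"model": model, "x": x})
print("benchmark:", timer.timeit(n_steps))

# 2) The same step inside the profiler; prof.step() marks each step's boundary.
activities = [ProfilerActivity.CPU]
if device == "cuda":
    activities.append(ProfilerActivity.CUDA)

with profile(
    activities=activities,
    schedule=schedule(wait=1, warmup=1, active=n_steps - 2),
    on_trace_ready=tensorboard_trace_handler("./log/minimal_issue"),
    record_shapes=False,  # callstack and record_shapes disabled, as in the experiment above
    with_stack=False,
) as prof:
    start = time.time()
    for _ in range(n_steps):
        model(x)
        prof.step()  # need to call this at the end of each step
    if device == "cuda":
        torch.cuda.synchronize()
    elapsed = time.time() - start

print("profiler loop: %.1f ms per step" % (elapsed / n_steps * 1e3))
```

With a small batch size the two printed numbers should diverge noticeably, while a large batch size should bring them close together, matching the analysis above.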
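Here is a small Python sketch of the step-interval computation described earlier. It paraphrases the rules as stated rather than the tb_plugin's actual implementation; the `Interval` container, field names, and the fallback for steps that launch no GPU work are invented for illustration.

```python
# Sketch of the step-time computation described above; this paraphrases the rules
# as stated, not the plugin's real code. Times are plain floats (e.g. microseconds
# since the start of the trace), and all names here are invented for illustration.
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Interval:
    start: float
    end: float


def compute_step_intervals(
    steps_cpu: List[Interval],                  # "ProfilerStep#i" ranges on the CPU side
    gpu_per_step: List[List[Interval]],         # kernel/memcpy/memset launched by each step
    before_first_step: Optional[List[Interval]] = None,  # GPU work launched before step 0
) -> List[Interval]:
    # Run without GPU: each step's interval is just its CPU-side range.
    if not any(gpu_per_step):
        return list(steps_cpu)

    # 1) Per step: earliest start / latest end of the GPU work launched by that step
    #    (falling back to the CPU range when a step launched no GPU work).
    steps_device = [
        Interval(min(g.start for g in gpu), max(g.end for g in gpu))
        if gpu else Interval(steps_cpu[i].start, steps_cpu[i].end)
        for i, gpu in enumerate(gpu_per_step)
    ]

    # prev_step_end_time: latest end of GPU work launched before the first step,
    # never earlier than the first step's CPU-side start.
    prev_step_end_time = steps_cpu[0].start
    if before_first_step:
        prev_step_end_time = max(prev_step_end_time,
                                 max(g.end for g in before_first_step))

    # 2) Walk the steps from smaller id to bigger id, pushing boundaries forward.
    steps: List[Interval] = []
    for i in range(len(steps_cpu)):
        start = max(prev_step_end_time if i == 0 else steps[i - 1].end,
                    steps_cpu[i].start)
        end = max(steps_cpu[i].end, steps_device[i].end)
        steps.append(Interval(start, end))
    return steps
```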
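Finally, a tiny timing experiment that illustrates the asynchronous launch behavior discussed above (it assumes a CUDA device; the matrix size is arbitrary):

```python
# Tiny illustration of asynchronous kernel launch: the first timestamp captures
# only the CPU-side cost of queuing the kernel; the work itself finishes later.
import time

import torch

assert torch.cuda.is_available(), "this illustration needs a GPU"
x = torch.randn(8192, 8192, device="cuda")
torch.cuda.synchronize()                 # start from an idle GPU

t0 = time.time()
y = x @ x                                # returns as soon as the matmul is queued
t_launch = time.time() - t0

torch.cuda.synchronize()                 # wait for the kernel to actually finish
t_total = time.time() - t0

print(f"launch only: {t_launch * 1e3:.3f} ms, launch + execution: {t_total * 1e3:.3f} ms")
```

The first number stays small no matter how expensive the matmul is; that is exactly why CPU-side launch cost, including the profiler's per-op recording, dominates when the kernels themselves are short.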