Timeout detection and recovery (TDR) settings to support AI/ML in Windows

Posted in engineering by Christopher R. Wirz on Mon May 08 2017

Nvidia's CUDA API has been one of the most game-changing technologies I have worked with. It lets you command your graphics card directly and manage memory allocation and multi-threaded operations. While most CPUs have tens of cores if you're lucky, your graphics card / graphics processing unit (GPU) has hundreds, if not thousands. I have seen 100x improvements in model training times over out-of-the-box alternatives.

When I was getting started, pushing the envelope for genetic model topology development, I would lock up my Windows machine. At first I thought it had something to do with model size and cudaMalloc, but after reading Nvidia's documentation, I learned about TDR.

Timeout detection and recovery (TDR) enables the operating system to detect that the UI is not responsive and to reset the graphics driver. Since a long-running CUDA kernel takes control of the graphics card, the screen appears to be completely "frozen" while it is running. Windows expects this frozen appearance to occur when the GPU is busy processing intensive graphical operations, typically during game play, and hence cannot update the display screen. By default it waits only 2 seconds before stepping in, which is far too short for heavy training workloads.

While you can switch to a Linux system (but device passthrough will still freeze), you'd be missing out on Nsight integration with Visual Studio. The easiest approach is to just change the registry. Why navigate through various configuration menus? Just run regedit, open the key HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\GraphicsDrivers, and set its TdrDelay DWORD value to the hexadecimal value of the number of seconds you desire. Or import a registry file...


Windows Registry Editor Version 5.00

; Setting TdrDelay to 3c (hex) will set the value to 60 seconds
; 258 (hex) would be 600 seconds
[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\GraphicsDrivers]
"TdrDelay"=dword:0000003c

With this fix in place, Windows won't try to reclaim your GPU from your AI/ML training - at least not within the specified window.
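
If you'd rather not do the decimal-to-hex conversion by hand, the registry file above can be generated for any delay. Here is a minimal Python sketch; the function name tdr_reg_file is my own invention for illustration, and it only builds the text, so you still have to save and import the result yourself.

```python
# Hypothetical helper (not part of any Windows or Nvidia tool): builds the
# text of a .reg file that sets TdrDelay to the requested number of seconds.

def tdr_reg_file(seconds: int) -> str:
    """Return .reg file text setting TdrDelay to `seconds`."""
    # A REG_DWORD is an unsigned 32-bit value.
    if not 0 <= seconds <= 0xFFFFFFFF:
        raise ValueError("TdrDelay must fit in a 32-bit DWORD")
    lines = [
        "Windows Registry Editor Version 5.00",
        "",
        r"[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\GraphicsDrivers]",
        # dword values are written as 8 lowercase hex digits
        f'"TdrDelay"=dword:{seconds:08x}',
        "",
    ]
    return "\n".join(lines)


if __name__ == "__main__":
    # 60 seconds, matching the registry file shown earlier
    print(tdr_reg_file(60))
```

Calling tdr_reg_file(600) yields the 258 (hex) variant mentioned in the comments of the registry file.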

What's next? Update your TDR value and let your long-running AI/ML training tasks shine.