I just finished a blog post where I replaced almost the entire Microsoft build toolchain for our Windows software with open-source alternatives that better suit our needs. The one exception was the Visual Studio C runtime library, nowadays called Universal CRT (shortened to UCRT or just CRT).
The CRT had been performing without any trouble, and I didn’t expect this to change, considering that our software was mostly using modern C++ constructs. Things took an unexpected turn though when I witnessed a huge memory leak in one of our applications that was creating and terminating multiple std::threads in a row.
Tracking down that leak took a while, partly because it only occurred on Windows XP and earlier, not on my Windows 10 development machine. At first, I suspected some of my own code and the compatibility glue code from my previous post.
But it eventually turned out that code as simple as creating and terminating std::threads in a row already caused havoc under Windows XP. Not just with my toolchain, but also when compiled with Microsoft’s official v141_xp toolchain. The memory usage of the process grew rapidly in Task Manager, and soon enough Windows greeted me with a virtual memory exhaustion warning.
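For illustration, a minimal sketch of such a program could look like the following. It is not the original snippet: the function name spawn_threads_in_a_row and the trivial thread body are my own assumptions, chosen only to show threads being created and terminated one after another.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <thread>

// Hypothetical reproduction helper: starts `count` threads strictly one
// after another and waits for each to finish before creating the next.
// Returns how many thread bodies actually ran.
std::size_t spawn_threads_in_a_row(std::size_t count)
{
    std::atomic<std::size_t> completed{0};

    for (std::size_t i = 0; i < count; ++i)
    {
        // Each std::thread starts a new thread that exits right away.
        std::thread worker([&completed] { ++completed; });

        // A freshly started std::thread is always joinable.
        assert(worker.joinable());
        worker.join();
    }

    return completed.load();
}
```

Calling this function with a large count from main and watching the process in Task Manager should be enough to compare a statically linked and a dynamically linked CRT build.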
Interestingly, all of this only happened when the CRT was statically linked. A dynamically linked CRT caused no trouble, which gave me some more hints to track down the problem.
Every thread spawned by std::thread ends up in the CRT function _beginthreadex, no matter if the original Microsoft v141_xp toolchain is used or my libc++/winpthreads one. Fortunately, our Visual Studio installation has the CRT code in C:\Program Files (x86)\Windows Kits\10\Source\10.0.20348.0\ucrt, so it’s easy to track down what happens next.
The call ends up in the thread_start template function in startup\thread.cpp. This function calls __acrt_getptd() to get a per-thread data structure, implicitly creating it along the way. The data structure needs to be cleaned up when the thread exits.
Before any of that, the CRT has already called __acrt_initialize() from internal\initialization.cpp when starting my application. It calls a bunch of initialization routines, one of which is __acrt_initialize_ptd() from internal\per_thread_data.cpp. __acrt_initialize_ptd calls __acrt_FlsAlloc(destroy_fls) to allocate a fiber-local storage index that lets all subsequent threads store a per-thread data structure. The given destroy_fls callback is a function in the same file that is meant to be called automatically whenever a thread exits, to clean up its per-thread data structure.
__acrt_FlsAlloc is implemented in internal\winapi_thunks.cpp. It only passes the callback on operating systems that support FlsAlloc. On all others, it falls back to TlsAlloc, which provides no such callback parameter. Hence, the destroy_fls cleanup function is never called in that case.
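The effect of that fallback can be sketched with a small portable model. This is not the CRT source and it does not call the Windows API: Slot, allocate_slot, destroy_ptd and count_leaked_ptds are hypothetical names, and the "threads" are simulated by a plain loop. It only shows that a callback which is never registered can never clean anything up.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for the CRT's per-thread data structure.
struct PerThreadData
{
    bool released = false;
};

using ExitCallback = void (*)(PerThreadData*);

// Models a storage slot. A TlsAlloc-style slot has no exit callback.
struct Slot
{
    ExitCallback on_thread_exit = nullptr;
};

// Plays the role of destroy_fls: cleans up one thread's data.
void destroy_ptd(PerThreadData* ptd)
{
    ptd->released = true;
}

// Mirrors the shape of the thunk described above: the callback is only
// forwarded when the FlsAlloc-style API is available.
Slot allocate_slot(bool os_has_fls, ExitCallback callback)
{
    assert(callback != nullptr);
    Slot slot;
    if (os_has_fls)
    {
        slot.on_thread_exit = callback; // FlsAlloc path: callback registered
    }
    return slot;                        // TlsAlloc path: callback dropped
}

// Simulates `thread_count` thread lifetimes and returns how many
// per-thread data structures were never released.
std::size_t count_leaked_ptds(bool os_has_fls, std::size_t thread_count)
{
    const Slot slot = allocate_slot(os_has_fls, destroy_ptd);
    std::size_t leaked = 0;

    for (std::size_t i = 0; i < thread_count; ++i)
    {
        PerThreadData ptd;                  // thread start: data is created
        if (slot.on_thread_exit != nullptr)
        {
            slot.on_thread_exit(&ptd);      // thread exit: OS runs the callback
        }
        if (!ptd.released)
        {
            ++leaked;
        }
    }
    return leaked;
}
```

In this model, count_leaked_ptds(true, n) reports no leaks, while count_leaked_ptds(false, n) reports one leaked structure per simulated thread.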
FlsAlloc was introduced with Windows Server 2003 (NT 5.2), which explains why the problem occurs on Windows XP (NT 5.1) and earlier versions, but not on my current Windows 10. If that were everything, my application would always leak that per-thread data structure on Windows XP, no matter if it’s using a statically linked CRT or a DLL. I’m pretty sure that Microsoft would have caught this during testing. However, the problem is mitigated when the CRT is linked as a DLL:
The file dll\appcrt_dllmain.cpp implements __acrt_DllMain, which Windows calls on thread creation and thread exit when the CRT is linked as a DLL. __acrt_DllMain in turn calls DllMainDispatch, which implements a DLL_THREAD_DETACH handler that calls __acrt_thread_detach() from internal\initialization.cpp. Now __acrt_thread_detach() calls __acrt_freeptd() to explicitly clean up the per-thread data structure. This happens every time a thread exits and the CRT DLL is detached from the thread. This is why the memory does not leak when linking the CRT as a DLL.
When statically linking the CRT, no __acrt_DllMain handler is ever called and consequently, __acrt_freeptd() is not called either. The __acrt_freeptd function is even optimized out of a release build, because there is not a single call to it.
I have reported this bug to Microsoft along with a proposed solution. My fix is as simple as calling __acrt_freeptd() on all exit paths of the common_end_thread function in startup\thread.cpp. This explicitly frees the per-thread data structure without relying on callback magic. A look into startup\thread.cpp also confirms that common_end_thread is called in all situations when a CRT thread exits.
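The idea behind the fix can again be sketched as a portable model rather than as the actual patch. All names here (SimulatedThread, free_ptd, releases_for_one_thread) are hypothetical, and two details are assumptions of the model: that freeing clears the slot first, and that an FLS-style callback only fires for a slot that still holds a value. Under those assumptions, the explicit call releases the data exactly once on every combination of operating system and exit path.

```cpp
#include <cassert>

// Hypothetical stand-in for the CRT's per-thread data structure.
struct PerThreadData
{
    int release_count = 0;
};

// One simulated thread with a single storage slot.
struct SimulatedThread
{
    PerThreadData* slot_value = nullptr;
};

// Plays the role of __acrt_freeptd in this model.
void free_ptd(SimulatedThread& thread)
{
    PerThreadData* const ptd = thread.slot_value;
    if (ptd == nullptr)
    {
        return;                  // nothing stored, or already released
    }
    thread.slot_value = nullptr; // assumption: clear the slot before releasing
    ++ptd->release_count;
}

// Simulates one thread lifetime and returns how often its data was released.
int releases_for_one_thread(bool os_has_fls, bool explicit_free_on_exit)
{
    PerThreadData ptd;
    SimulatedThread thread;
    thread.slot_value = &ptd;    // thread start: data is created and stored

    if (explicit_free_on_exit)
    {
        free_ptd(thread);        // the proposed fix: free on every exit path
    }

    // OS side (assumption): the exit callback only runs for a non-null slot.
    if (os_has_fls && thread.slot_value != nullptr)
    {
        free_ptd(thread);
    }

    assert(ptd.release_count <= 1); // never released twice
    return ptd.release_count;
}
```

Without the explicit call, the model releases nothing when the callback is unavailable, which corresponds to the leak on Windows XP with a statically linked CRT.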
But having a code fix is only half of the solution. I may be able to look at the CRT source code, but Microsoft has long eliminated all official ways of rebuilding a modified CRT. Even just rebuilding the affected thread.cpp file with all headers of a Visual Studio installation fails due to a missing corecrt_internal_state_isolation.h.
The Practical Solution
I didn’t want to wait another 6 months for a fix to appear in Visual Studio (been there, done that). I needed a practical solution right now.
What I eventually did was to take only the affected thread.cpp and fix it, but otherwise start from a clean slate. I did not build it inside the CRT source tree along with all the other files. Heck, I even discarded the entire standard include path to not include any internal CRT headers that depend on further non-existing files.
thread.cpp includes two internal headers, corecrt_internal.h and process.h, and uses some structures from them. Fortunately, I could find all these structures in the Visual Studio CRT source code and port them into a glue.h file. I then created local files named corecrt_internal.h and process.h as drop-in replacements for their originals. They now just include glue.h.
By calling Microsoft’s cl compiler with the /c parameter, thread.cpp can be compiled into an object file, skipping any link steps. After a few rounds of carefully adjusting the include directories (/I parameter of the compiler), I finally got the desired thread.obj file. No additional preprocessor definitions (via the /D parameter) were needed for that.
Finally, I needed a way to make use of the fixed thread.obj file. For that, I examined the compiled UCRT from C:\Program Files (x86)\Windows Kits\10\Lib\10.0.20348.0\ucrt\x86\libucrt.lib in an x86 Native Tools Command Prompt by listing the object files it contains.
This revealed the internal name of thread.obj in the compiled UCRT library: d:\os\obj\x86fre\minkernel\crts\ucrt\src\appcrt\dll\mt....\startup\mt\objfre\i386\thread.obj
I created a new libucrt-removed.lib without that thread.obj file. Finally, combining it with my fixed thread.obj file produced libucrt-patched.lib, a patched version of libucrt.lib.
I then adjusted the Project Properties of my application in Visual Studio to exclude the default static UCRT (/NODEFAULTLIB:libucrt.lib) and added libucrt-patched.lib as an additional dependency. That’s it! I immediately tested my recompiled application, built against the patched UCRT, on Windows XP, and it no longer leaked any memory at thread destruction.
Note that all of this only applies to the Release build! The Debug build uses a Debug version of the CRT (libucrtd.lib), which I haven’t patched here.
With that positive outcome, I went on to publish my fix in a GitHub repo and automate all these instructions. You can find it at https://github.com/enlyze/ucrt-patch
Our Drone CI runner automatically builds it on every commit and pushes the patched UCRT to https://github.com/enlyze/ucrt-patched
Having that infrastructure in place, I can easily add the fixed UCRT to all of our applications. I’m also prepared if I ever need to apply another patch to the UCRT. Who knows if I stumble upon another bug soon?
I admit that this is a very specific solution to a very specific problem. But hopefully these instructions can be helpful to more people than just us.
Colin Finck is working as a Software Engineer at ENLYZE. He has been digging into Windows internals for over a decade as a core member of the ReactOS Project.