27 Aug 2014



PSEH Data Corruption

An extremely rare data corruption issue caused by the Portable Structured Exception Handling library was discovered recently, one that is rather interesting for the number of factors that had to come into play to bring it about. PSEH was a library created by KJK::Hyperion and massively overhauled by Timo Kreuzer for its third iteration, intended to bring Structured Exception Handling to GCC. SEH is natively supported by Microsoft's compilers and underpins a lot of error handling on Windows, and so was necessary for proper compatibility between ReactOS' components and Windows'. The bug in question was originally encountered when a test for IoCreateFile by Pierre Schweitzer returned invalid results. Jerome Gardou and Thomas Faber spent the better part of a day debugging and brainstorming on the cause, with input from a variety of other developers. Understanding the cause of the corruption requires a basic understanding of how program state works. The following section provides a basic explanation that can be skipped if one already knows how function scope and stacks work.

When a program function runs, it is given a chunk of memory on the stack to use for local variables and storing state information. That state information includes the contents of registers and where in the code the function currently is. When another function is called, that new function gets its own chunk of the stack to use. When it is done, it returns to its parent, the previous function, restoring the state of the processor based on the data on the stack. Note that this may seem similar to a context switch, but the amount of state that is preserved for a context switch is considerably more comprehensive than for function calls. The chunk of memory used by the returning function then goes out of scope and can be reclaimed for use by a future function. Out of scope memory on the stack should never be relied upon, as one never knows when it will be overwritten for use by something else. Only the stack frames of the currently executing function and its parents can be relied upon to be in scope and their contents guaranteed to be consistent. This system and its guarantees work for the most part, unless something tries to muck about behind the scenes.
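To make the reuse of out of scope stack memory concrete, here is a minimal sketch that models the stack as a plain byte array. Everything in it (the toy_ names, the sizes, the fact that this toy stack grows upward) is made up for illustration and has nothing to do with how a real compiler or PSEH lays out frames. Modeling the stack this way keeps the demonstration well-defined C, whereas reading a real out of scope local would be undefined behavior.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical toy stack: a byte array plus a top-of-stack offset.
   It grows upward for simplicity; real x86 stacks grow downward. */
enum { TOY_STACK_SIZE = 256 };

static unsigned char toy_stack[TOY_STACK_SIZE];
static size_t toy_top = 0;

/* "Call": reserve a frame of `size` bytes and return a pointer to it. */
static unsigned char *toy_push_frame(size_t size)
{
    assert(toy_top + size <= TOY_STACK_SIZE);
    unsigned char *frame = &toy_stack[toy_top];
    toy_top += size;
    return frame;
}

/* "Return": release the most recent frame. The bytes are not cleared,
   they simply become available to whoever is called next. */
static void toy_pop_frame(size_t size)
{
    assert(size <= toy_top);
    toy_top -= size;
}

/* Returns what a stale pointer into a finished frame reads after
   another "function" has reused the same region. */
static int toy_stale_read_demo(void)
{
    toy_top = 0;

    unsigned char *parent = toy_push_frame(16);
    parent[0] = 1;                      /* parent stays in scope throughout */

    unsigned char *child = toy_push_frame(16);
    child[0] = 42;
    toy_pop_frame(16);                  /* child returns: its frame is out of scope */

    unsigned char *other = toy_push_frame(16);  /* lands on the child's old frame */
    memset(other, 0xAA, 16);
    toy_pop_frame(16);

    int stale = child[0];               /* no longer 42 */

    toy_pop_frame(16);                  /* parent returns */
    return stale;
}
```

Calling toy_stale_read_demo() returns 0xAA rather than 42, because the second frame was handed the same bytes the first one had used.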

The way that exception handling works can be summed up as a giant hack that abuses program state mechanisms. Exception handlers, the bits of code that execute when exceptions occur, technically execute on the same stack as whatever function triggered the exception. They are, however, separate chunks of code, and the way control passes to them is not nearly as straightforward as a normal execution chain.

Code-wise, PSEH is composed of pairs of filters and handlers. Simplistically, these pairs are installed as frames on the stack by the _SEH3_TRY macro. When an exception occurs, these frames are walked by PSEH and each filter in turn determines whether to accept the exception or not. The interesting bit here is how the stack scope changes throughout all this.
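The walk over filter/handler pairs can be sketched roughly as follows. This is not PSEH's code: the toy_ types and functions are hypothetical, and the handler is reduced to an integer ID so the example stays small. The two filter return values mirror the Windows SEH constants EXCEPTION_CONTINUE_SEARCH (0) and EXCEPTION_EXECUTE_HANDLER (1), and 0xC0000005 is STATUS_ACCESS_VIOLATION.

```c
#include <stddef.h>

/* Hypothetical stand-ins for the SEH filter results. */
enum toy_disposition { TOY_CONTINUE_SEARCH = 0, TOY_EXECUTE_HANDLER = 1 };

typedef enum toy_disposition (*toy_filter_fn)(unsigned long code);

/* One filter/handler pair, linked to the next outer pair. */
struct toy_frame {
    const struct toy_frame *next;   /* next outer frame, NULL at the end */
    toy_filter_fn filter;
    int handler_id;                 /* stands in for the handler code */
};

/* Accepts only access violations. */
static enum toy_disposition filter_access_violation(unsigned long code)
{
    return code == 0xC0000005UL ? TOY_EXECUTE_HANDLER : TOY_CONTINUE_SEARCH;
}

/* Accepts any exception. */
static enum toy_disposition filter_everything(unsigned long code)
{
    (void)code;
    return TOY_EXECUTE_HANDLER;
}

/* Walk from the innermost frame outward and return the ID of the first
   handler whose filter accepts the exception, or -1 if none does. */
static int toy_dispatch(const struct toy_frame *innermost, unsigned long code)
{
    for (const struct toy_frame *f = innermost; f != NULL; f = f->next) {
        if (f->filter(code) == TOY_EXECUTE_HANDLER)
            return f->handler_id;
    }
    return -1;
}

/* Two nested "try" blocks: the inner one only handles access violations. */
static int toy_dispatch_demo(unsigned long code)
{
    struct toy_frame outer = { NULL, filter_everything, 2 };
    struct toy_frame inner = { &outer, filter_access_violation, 1 };
    return toy_dispatch(&inner, code);
}
```

With this setup an access violation is picked up by the inner pair, while anything else is declined by the inner filter and falls through to the outer one.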

There are multiple levels of scope here, which can cause a bit of confusion. Simplistically, there is function scope, which is the code that is attempting to execute; try scope, which has the code that might throw the exception; filter scope, which deals with checking exceptions; and handler scope, which has the code that actually responds to an exception. Originally the system is in function and try scope, as that is the code that was originally running. If an exception is triggered, execution is transferred to the filter, which is treated as a child of the function/try scope. In this case, both the filter and the function/try are still in scope. If the filter determines that it will not deal with the exception, execution is transferred back to the try scope and then on to the next filter, so scope goes back up the chain before going down again with no major problems. It is when a filter accepts an exception that things get a bit weird.

When a filter accepts an exception, execution is transferred to its associated handler. The handler, however, is not treated as a child of the filter. Execution is instead "returned" to the handler, so the handler is being treated as the parent of the filter. As a consequence, the handler's scope overlaps with that of the filter's parent, the try scope, and the filter itself goes out of scope once control has been "returned" to its supposed parent. This again would not be that much of a problem, except that the filter is supposed to provide the handler with the exception code that indicates what kind of exception occurred. The filter does this by storing the code in its own stack frame before "returning" to the handler.

As mentioned above, when a function returns and goes out of scope the chunk of memory it got on the stack can be reused and any data on it should not be considered reliable. Because the data provided by the filter is still on its stack, which is out of scope when the handler executes, there is nothing preventing that area of memory from being overwritten before the exception handler retrieves the data and preserves it within its own scope. The filter needs to be the one that provides this information because while the filter has full access to the exception record that details the exception, the only thing the handler can do is ask for the exception code, which entails looking for it in the place where the filter stuck it.
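The sequence of events can be imitated with a toy stack made of an array of words. To be clear, this is only a model of the failure described here, not PSEH's actual code or layout: every sim_ name is invented, the frame sizes are arbitrary, and the "debug print" is just a function that scribbles zeroes over its own frame.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical toy stack of machine words, growing upward. */
enum { SIM_WORDS = 32 };
static unsigned long sim_stack[SIM_WORDS];
static size_t sim_top = 0;

static unsigned long *sim_push(size_t words)
{
    assert(sim_top + words <= SIM_WORDS);
    unsigned long *frame = &sim_stack[sim_top];
    sim_top += words;
    return frame;
}

static void sim_pop(size_t words)
{
    assert(words <= sim_top);
    sim_top -= words;
}

/* Where the handler will later look for the exception code. */
static unsigned long *sim_code_slot;

/* The filter leaves the code in its OWN frame, then "returns". */
static void sim_filter(unsigned long code)
{
    unsigned long *frame = sim_push(1);
    frame[0] = code;
    sim_code_slot = frame;
    sim_pop(1);                 /* the slot is out of scope from here on */
}

/* Stand-in for a debug print: any call deep enough to reuse the slot. */
static void sim_debug_print(void)
{
    unsigned long *frame = sim_push(4);
    for (size_t i = 0; i < 4; i++)
        frame[i] = 0;
    sim_pop(4);
}

/* The handler asks for the exception code, optionally after printing. */
static unsigned long sim_handler(int print_first)
{
    if (print_first)
        sim_debug_print();
    return *sim_code_slot;
}

/* Runs the whole sequence and returns the code the handler observed. */
static unsigned long sim_run(int print_first)
{
    sim_top = 0;
    sim_push(4);                        /* function/try scope frame */
    sim_filter(0xC0000005UL);           /* filter accepts the exception */
    unsigned long seen = sim_handler(print_first);
    sim_pop(4);
    return seen;
}
```

sim_run(0) hands back 0xC0000005, because nothing touched the slot before the handler read it. sim_run(1) hands back 0, because the print reused the very words the filter had written to.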

This issue manifested so rarely because it is very unusual for the stack space to end up being recycled before the exception handler retrieves the data. Exception handlers are supposed to be very brief snippets of code that quickly handle whatever the fault is and then transfer execution back to the original program code. There are, however, exceptions to that rule, and certain complex I/O or memory functions do perform additional operations that end up stomping over the stack. Calling a debug print, for example, will certainly result in enough additional operations that the region in question will get overwritten, probably multiple times over.

The solution that the developers ultimately came up with was to store the associated data in the stack frame of the overall parent function. This frame will not be going anywhere until the entire exception chain unwinds or all attempts to handle the exception have failed.
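In terms of a toy stack model, the idea looks something like the sketch below. The exact shape of the real change in PSEH is not described here, so treat this purely as an illustration of the principle: the fix_ names and frame sizes are hypothetical, and the only point being made is that a slot in the parent's frame survives whatever the handler calls.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical toy stack of machine words, growing upward. */
enum { FIX_WORDS = 32 };
static unsigned long fix_stack[FIX_WORDS];
static size_t fix_top = 0;

static unsigned long *fix_push(size_t words)
{
    assert(fix_top + words <= FIX_WORDS);
    unsigned long *frame = &fix_stack[fix_top];
    fix_top += words;
    return frame;
}

static void fix_pop(size_t words)
{
    assert(words <= fix_top);
    fix_top -= words;
}

/* The filter still has a frame of its own, but the value the handler
   needs is written into the parent's frame instead. */
static void fix_filter(unsigned long *parent_frame, unsigned long code)
{
    unsigned long *scratch = fix_push(1);
    scratch[0] = code;          /* temporary, lost when the filter returns */
    parent_frame[0] = code;     /* survives: the parent is still in scope */
    fix_pop(1);
}

/* Stand-in for a debug print that reuses the filter's old frame. */
static void fix_debug_print(void)
{
    unsigned long *frame = fix_push(4);
    for (size_t i = 0; i < 4; i++)
        frame[i] = 0;
    fix_pop(4);
}

/* Returns the code the handler observes, optionally after printing. */
static unsigned long fix_run(int print_first)
{
    fix_top = 0;
    unsigned long *parent = fix_push(4);    /* parent function's frame */
    fix_filter(parent, 0xC0000005UL);
    if (print_first)
        fix_debug_print();
    unsigned long seen = parent[0];         /* handler reads the parent's slot */
    fix_pop(4);
    return seen;
}
```

Here fix_run() returns 0xC0000005 whether or not the print happens first, since the print can only reuse memory above the parent's frame.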

As an amusing aside, while discussing the details of this issue, the developers had a tendency to keep referring to things as "it" and variations thereof, often forgetting that other parties (me in this case) do not intuitively know what "it" is. And you people wonder why it takes so long to put these reports together.

Discussion: https://www.reactos.org/forum/viewtopic.php?f=2&t=13590

This blog post represents the personal opinion of the author and is not representative of the position of the ReactOS Project.