26 Jun 2013

36193

5

PITA bugs part 3

It is somewhat difficult to put into words just how much hate I have for this particular issue. Debugging it ended up taking an entire week and the result surprised pretty much everyone on the team, including the senior developers. The issue that was highlighted however has very deep root causes that touch upon the very foundation of the C ABI and its brittleness.

First, a slight aside. In C, there is a family of functions that take in a variable number of arguments, the printf functions. These take in a format string and then any number of strings, numbers, and characters that are to be placed into appropriate slots specified in the format string. The variable arguments are all placed on the stack right to left, so the first variable argument is actually placed last on the stack. When the code finally gets down to the bowels where this list of arguments is processed, they are popped off the stack. Because type checking is effectively impossible for something like this in C, this entire system only works if the established convention is followed. Specifically, when you use a %d in the format string, you have to pass in an int or equivalent type onto the stack, otherwise the lower level function will break in unpredictable ways.

At the same time, when using literal numbers like 10 or 11 in your code, the compiler silently interprets them in a very specific way. In the case of Microsoft's compiler, the de-facto compiler used for UEFI development, they will be interpreted as its native integer type, an int, which is 32bits regardless of whether you are compiling for 32 or 64bit code. As such, when you call a printf function with literal numbers, the value that gets put onto the stack will be sized as a 32bit int. Traditionally, this is generally not a problem, as most C standard library implementations will assume a %d indicates a native integer type was pushed onto the stack. The code that provides functionality akin to the C standard library in UEFI on the other hand, has a few exceptions to that rule.

Another aside, UEFI development generally uses two pieces of code as a foundation. The first is the open source EFI Development Kit (EDK) released by Intel and available at SourceForge as part of the TianoCore project. There have actually been two 'iterations' of EDK, EDK and EDKII, though additional intermediary updates are also sprinkled about. The second piece of code is proprietary and distributed by Intel to its various partners and are the real guts that allow UEFI to work on Intel processors. In this case, the abnormality that caused the week of headache for me exists in the EDK/EDKII source code.

When developing UEFI code, the EDK source provides explicit types such as UINT8, UINT16, CHAR16 and the like and one is supposed to outright avoid using things like int or char. This in theory is to prevent ambiguities in variable type sizes. There is also a UINTN type, which will be 32 or 64bit depending on which bitsize you compile your code for. The problem however is when literal numbers enter the picture. As mentioned before, a literal number like 10 will be implicitly treated by the Microsoft C compiler as an int. In the original EDK, the lower level printf support function would however attempt to pop off a %d value as a UINTN, so it would be right when the BIOS was in 32bit mode but wrong when it was in 64bit mode, where it would read in more than it was supposed to and potentially cause stack corruption or just plain wrong values to be printed. The EDK developers apparently realized this could happen, so for EDKII they changed the behavior so that %d values were treated as the native integer type. As such, using things like 10 would now work correctly. Whether they did the right thing or not is more a matter of opinion than anything else. After all, whether one needs to print literal values more than UINTN variables is a matter of personal experience. But the consequence is that anyone trying to print UINTN variables in 64bit mode must either cast their variable to be 32bit, assuming it is small enough to fit, or use %ld to indicate a long variable, which by UEFI convention is 64bit.

The problem, however, is that %d is not the only instance where a number needs to be passed in. The potentially really nasty one is when * is used in combination with another formatting flag, such as %*d. The * indicates that the caller desires padding of a certain number of characters, a number that is also passed in as part of the variable argument list. In both EDK and EDKII, this padding size is treated as a UINTN, meaning passing in a literal number like 10 risks the padding going completely ballistic depending on the the state of the stack around where the size value is pushed. If the chunk of memory after the padding size was already zeroed out, then there will not be any problems. If however there is garbage data, or even worse, another value in the variable argument list, then it will pop too many bytes off the stack and create a very, very long padding run. Keep in mind that 32bits can represent up to 4GB, so even having a 1 just in the 33rd bit would be asking the printf function to pad for over four billion characters, at which point, congratulations, you have likely filled up whatever output buffer the system was using. If the system is smart, it will degrade gracefully and just cut you off at that point. If it is not, it will crash because you are starting to write into other components' memory.

The especially nasty part about this issue was there was nothing obvious in the calling code to indicate the cause. Attempts to validate the usage of buffers before and after the printf call showed no buffer overflows and even breaking out the debugger to try and walk the code revealed no actual corruption of the stack. It really was only after I experimentally changed the number of bytes being popped off the stack inside the printf support function in the EDKII code that we realized the ramifications of what the EDK developers had done. The really unfortunate thing is there is no obvious or right fix for this. This is ultimately something that a programmer developing UEFI code simply must be aware of, otherwise they will eventually run into their printing code blowing up on them.

Comments (5)

  • anon

    You guys are way smarter than me. That is some gnarly low level C code. Nice find though!!!

    Jun 26, 2013
  • anon

    Hopefully there is some satisfaction to be gained from the fact that you found it. It is frustrating having to debug the tools you have to use rather than doing the actual hard work of creating Reactos.

    Jun 27, 2013
  • anon

    You must mean when the CPU is in 32 or 64 bit mode, not the BIOS. Anyway, shouldn't best practice be to use defined constants with 32/64 bit declarations instead of magic inline numbers? ;)

    Sep 19, 2013
  • anon

    We want more blogs like this - as much news about Reactos as possible.

    Nov 04, 2013
This blog post represents the personal opinion of the author and is not representative of the position of the ReactOS Project.