16 Oct 2014

PITA bugs part 5

I mentioned in my last rant about annoying bugs that I do firmware development for the LHC triggers. In addition to that, however, as one of the people on the team with experience in low-level embedded programming, I can get tapped to work on a variety of other systems. Recently I was asked to update an old AVR microcontroller codebase, written for a proprietary toolchain, to use AVR-GCC. None of us expected this to be terribly difficult, since it was believed the update would only entail swapping out the macros and decorators used for things like interrupts and memory placement of data. After a few iterations I finally got a version that compiled, so I went and updated one of the boards. The program promptly went into an infinite loop of rebooting itself.
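
To give a sense of what that kind of mechanical change looks like, here is a minimal sketch of an interrupt handler written the avr-gcc way. The vector name and register assume an ATmega-class part, and the old-toolchain decoration shown in the comment is only a representative example; none of this is the project's actual code.

    #include <avr/io.h>
    #include <avr/interrupt.h>

    /* Proprietary toolchains typically mark an ISR with something like
     *   #pragma vector = USART_RX_vect
     *   __interrupt void usart_rx_handler(void) { ... }
     * With avr-gcc the same thing is expressed via the ISR() macro.
     * The vector and register names below assume an ATmega-class device. */
    ISR(USART_RX_vect)
    {
        uint8_t byte = UDR0;    /* reading UDR0 clears the receive interrupt flag */
        (void)byte;             /* a real handler would buffer or process the byte */
    }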

The next step, of course, was to figure out why this infinite reboot loop was happening. Fortunately we have an AVR debugger here at work that I could plug into the microcontroller's JTAG interface to step through the code. This is where the results got very interesting. The program stubbornly refused to hit breakpoints in regions of code I knew it had to have reached, because I was seeing output on the serial port that could only have come from those regions. Then, after I modified the serial code to not use interrupts for transmitting data, the program made a bit more progress than it did with interrupts enabled. This might suggest that something weird was going on with the interrupts, which I initially suspected, but the somewhat random behavior as I tried various permutations suggested to me that it was memory corruption of some sort, even if it was being triggered by the interrupts.
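
For reference, "not using interrupts for transmitting" just means busy-waiting on the data register instead of letting a transmit interrupt drain a buffer. A rough sketch, again assuming an ATmega-style USART0 rather than the project's actual serial code:

    #include <avr/io.h>
    #include <stdint.h>

    /* Polled transmit: spin until the data register is empty, then write.
     * No ISR, so no extra call chain piling onto the stack at interrupt time.
     * The register and bit names (UCSR0A, UDRE0, UDR0) assume USART0 on an
     * ATmega-class device. */
    static void uart_put_polled(uint8_t c)
    {
        while (!(UCSR0A & (1 << UDRE0)))
            ;                   /* busy-wait for the transmit buffer to empty */
        UDR0 = c;
    }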

Recalling what the co-worker who wrote the original version had said, that this program was "big" and consumed almost all of the AVR's available memory, I took a peek at the output from AVR-GCC and saw that 99% of the data memory was being consumed. This meant that at program startup, 99% of the available data memory was already in use before any memory had been allocated dynamically or set aside for stack frames as functions are called. I think most readers with an inkling of how programs run on processors can see the problem already. With a sufficiently deep call chain, the program could easily consume all available memory and run out of room for new stack frames. On AVR microcontrollers this apparently triggers a system reset. The lack of memory also explained why enabling or disabling the serial transmit interrupt changed the behavior. The way the serial transmit interrupt handler was coded, it would call another function to do its work. Considering how finely balanced everything was, that was more than enough to trigger the out-of-memory condition and cause a reset/crash.
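
One cheap way to see how close you are to that edge at runtime is the classic avr-libc trick of measuring the gap between the end of the heap (or static data) and the current stack pointer. A minimal sketch, assuming avr-libc's __heap_start and __brkval symbols; the function name is my own:

    #include <stdint.h>

    extern char __heap_start;   /* linker symbol: end of static data / start of heap */
    extern char *__brkval;      /* set by malloc(); stays NULL if malloc is never used */

    /* Rough estimate of the free space between the heap and the stack.
     * When this number approaches zero, a deep call chain or an extra ISR
     * frame is enough to collide the two and corrupt memory. */
    static uint16_t free_ram(void)
    {
        char top;               /* a local: its address approximates the stack pointer */
        char *heap_end = __brkval ? __brkval : &__heap_start;
        return (uint16_t)(&top - heap_end);
    }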

In the old codebase, a significant amount of the constant data had been placed in the non-writeable program memory instead of data memory, and a few configuration values had also been moved to EEPROM to survive reboots. With the old toolchain, all one needed to do was decorate the variable declarations with an attribute indicating they should go into EEPROM or program memory. AVR-GCC works in a similar manner, but reading the data back from EEPROM or program memory is an explicit operation that requires calling functions. There was initially a belief that simply decorating a variable with the const keyword would have AVR-GCC put it into program memory and handle the reads transparently, but the 99% usage of data memory proved that assumption false. After I properly tagged all of the constant variables for placement in program memory, data memory usage dropped to ~60%. The program also ran smoothly without any crashes, so I would call that a marked improvement.
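
For anyone facing the same port, the avr-libc way looks roughly like this. The lookup table and configuration byte are made-up examples, but PROGMEM, EEMEM, pgm_read_byte() and eeprom_read_byte() are the actual avr-libc mechanisms:

    #include <avr/pgmspace.h>
    #include <avr/eeprom.h>
    #include <stdint.h>

    /* Constant lookup table kept in flash instead of eating SRAM.
     * The contents here are just an illustration. */
    static const uint8_t crc_table[4] PROGMEM = { 0x00, 0x1D, 0x3A, 0x27 };

    /* Configuration byte stored in EEPROM so it survives reboots. */
    static uint8_t EEMEM node_address = 0x42;

    uint8_t lookup_crc(uint8_t i)
    {
        /* Unlike the old toolchain, reads from flash are explicit with avr-gcc. */
        return pgm_read_byte(&crc_table[i]);
    }

    uint8_t read_node_address(void)
    {
        /* EEPROM access likewise goes through the avr-libc helper functions. */
        return eeprom_read_byte(&node_address);
    }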

A recurring theme of all of my bug rants, which should be pretty obvious at this point, is that every problem I've run into deals with memory in some way or another. None of the bugs that have caused me so much headache have been logic bugs. A sufficiently experienced programmer will spot those very quickly, either by just looking at the code or by walking through it with a debugger or even print statements. The behavior of such bugs is mostly consistent, in that the failures repeat themselves whenever the right conditions are fulfilled. Bugs caused by memory corruption, however, are major pains because they are inconsistent. They will not always corrupt the same chunk of memory, so the crashes and errors manifest in different places or in different ways. This often makes it difficult to even determine whether it is one bug or several different bugs, and the majority of the time spent investigating boils down to trying to isolate the root cause, once you've determined that the crashes and other errors are actually related.

The relative ease with which one can extract the core logic of a piece of code is a major reason why I am often rather blasé about the notion of code reuse. There is value to it when applied at the right level, but there is such a thing as overuse, or utterly inappropriate use. If your goal is to reduce your own workload, adding another potential source of bugs because you do not understand the code is somewhat counterproductive. But if what you truly need is to understand how to accomplish something, then it is often better in the long run to write something new using the existing code as a reference, instead of insisting out of some dogmatic determination on hacking someone else's code into your own program.

Discussion: https://www.reactos.org/forum/viewtopic.php?f=2&t=13702

This blog post represents the personal opinion of the author and is not representative of the position of the ReactOS Project.