PITA bugs part 4

by Z98 | June 26, 2014

Most of the developers already know this, but my current job is as an engineer working on firmware and software for the trigger system of the Large Hadron Collider (yes, the thing that discovered the Higgs boson). Triggers are basically filters that take in the raw data coming out of the LHC and whittle down the noise to figure out what data is worth keeping and what can be discarded. This is something of a necessity, as one of the physics professors I spoke to mentioned that the LHC generates something like a terabyte of data a second when doing collisions. The work can get fairly complicated due to our use of FPGAs to process and transfer data from the detectors. Anyone who has ever worked with FPGAs generally knows that the toolchains supporting them are flaky and not the most user-friendly. Quite frankly, debugging operating systems is considerably easier than doing the same with FPGA designs, since if something is busted on the FPGA and you can't find the issue in simulation, your only choice is generally to break out an oscilloscope and pray that the issue can be found by examining whatever exposed test pins your hardware has. This of course assumes that the FPGA vendor's tools are not the cause of the bugs, something that has happened to me on a fairly regular basis.

The particular family of FPGAs we work with is from Xilinx, and we've been working on putting together some fairly complicated systems. To ease some of the work, some of the control logic for the hardware is in the form of software that runs on a soft-core processor implemented on the FPGA. Soft-core processors are basically processors that are not ASICs and come in the form of logic that can be mapped to an FPGA's technology. Xilinx provides a soft-core in the form of the MicroBlaze processor, which we use as a kind of master system to help deal with communicating with and controlling the rest of the trigger logic on the FPGA. Programming the MicroBlaze can be done in C, and Xilinx also provides the necessary compilers and libraries for this. That's all well and good, but programming a MicroBlaze is very different from programming a desktop system. MicroBlaze programming falls firmly in embedded development and is more like programming a microcontroller. This means that certain assumptions break down and you need to be very careful about what you do, especially when it comes to memory usage. The amount of memory available to the MicroBlaze is very, very limited, which means attempts to allocate large chunks will fail. But because the MicroBlaze is basically a microcontroller, it does not really have the concept of memory protection either. If you write to a memory address, the processor will write to that address without a care in the world, even if that address is 0.

The particular issue I ran into came up while I was refactoring some code. The original code was slapped together by some physicists in order to get a system working as quickly as possible. Now the physicists I work with are smart, there is no doubt about that, but they are not programmers. As a consequence, the code they turn out often has issues, something that they readily admit. In this particular instance, their code was attempting to allocate large buffers to hold messages being passed back and forth between the FPGAs. Except the MicroBlaze did not actually have a big enough heap to fulfill the request, and so malloc was returning 0. When I was first debugging the code, I noticed that the program would go off on a strange tangent after a certain number of messages had been passed about. Specifically, the flow of program execution would suddenly jump to seemingly random parts of the program. This made no sense whatsoever, so I walked through the code, setting breakpoints to try and track down the exact point where things went wrong. This eventually led me to the block of code that was responsible for putting newly received messages into the receive buffer, which suggested some sort of memory corruption. It was about then that I recalled that the allocation of the buffers had no error checking whatsoever, so even if the malloc call failed, the program would proceed, treat the NULL value that was returned as a valid pointer, and use it. I went back in and added the NULL return check, having the program exit if the error condition arose, and reran the program. Not surprisingly, the program terminated immediately upon startup. Walking through with breakpoints revealed that malloc was indeed returning NULL.
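
To make the missing check concrete, here is a minimal sketch in C. None of it is the actual trigger code: tiny_malloc is a made-up stand-in for a small embedded heap (the 4 KiB size is an assumption, the real heap size is not something I have stated here), and msg_buffer, msg_buffer_init and msg_buffer_put are hypothetical names. The point is only that the allocation result is checked before anyone is allowed to write through the pointer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for a small embedded heap. The 4 KiB size is an
 * assumption chosen for illustration only. */
#define TINY_HEAP_SIZE 4096u

uint8_t tiny_heap[TINY_HEAP_SIZE];
size_t  tiny_heap_used = 0;

/* Bump allocator for byte buffers. Like malloc, it returns NULL when the
 * request does not fit in what is left of the heap. */
uint8_t *tiny_malloc(size_t size)
{
    if (size > TINY_HEAP_SIZE - tiny_heap_used)
        return NULL;
    uint8_t *p = &tiny_heap[tiny_heap_used];
    tiny_heap_used += size;
    return p;
}

/* Hypothetical message buffer. */
typedef struct {
    uint8_t *data;
    size_t   capacity;
    size_t   length;
} msg_buffer;

/* Returns 0 on success, -1 if the heap cannot satisfy the request.
 * On failure the buffer is left empty with data == NULL. */
int msg_buffer_init(msg_buffer *buf, size_t capacity)
{
    assert(buf != NULL);
    buf->data     = tiny_malloc(capacity);
    buf->capacity = 0;
    buf->length   = 0;
    if (buf->data == NULL)
        return -1;              /* the check the original code lacked */
    buf->capacity = capacity;
    return 0;
}

/* Appends a message. Refuses to write if the buffer was never allocated
 * or if the message does not fit. Returns 0 on success, -1 otherwise. */
int msg_buffer_put(msg_buffer *buf, const uint8_t *msg, size_t len)
{
    if (buf->data == NULL || len > buf->capacity - buf->length)
        return -1;
    memcpy(buf->data + buf->length, msg, len);
    buf->length += len;
    return 0;
}
```

With this sketch, asking msg_buffer_init for 64 KiB out of the 4 KiB heap returns -1 and leaves data as NULL, and a later msg_buffer_put on that buffer is rejected instead of writing to address 0. Whether the caller then exits, as I did, or falls back to a smaller buffer is a design decision for the application.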

Using NULL as a memory address meant that every time data was put into the buffer, it was literally being written into the .text region of the program, where the program instructions reside. Little wonder then that program execution went so awry once enough of the program had been corrupted. Fixing the issue was fairly straightforward, though it involved rewriting the entire message passing process. After that, the system worked reliably and I observed no performance degradation, so I called it a day. Lesson of the day: check return codes and error conditions. And to be frank, this isn't even the most frustrating issue I've dealt with so far.
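
You cannot reproduce this on a desktop, since the operating system will kill a program that writes through NULL, but the effect can be modelled. The following is a rough sketch in C that simulates a flat, unprotected address space as a byte array. The layout (64 bytes of memory, with the first 16 standing in for .text starting at address 0), the 0xAA fill pattern, and all of the sim_ names are assumptions made up for illustration, not details of the real hardware.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Simulated flat address space. Sizes and layout are illustrative
 * assumptions: "instructions" occupy the first bytes, from address 0. */
#define SIM_MEM_SIZE  64u
#define SIM_TEXT_SIZE 16u
#define SIM_TEXT_FILL 0xAAu

uint8_t sim_mem[SIM_MEM_SIZE];

/* Clears memory and fills the .text stand-in with a recognisable pattern. */
void sim_load_program(void)
{
    memset(sim_mem, 0, sizeof sim_mem);
    memset(sim_mem, SIM_TEXT_FILL, SIM_TEXT_SIZE);
}

/* Unchecked store, mimicking a core without memory protection: it writes
 * wherever it is told, including over the simulated instructions. */
void sim_store(size_t addr, const uint8_t *src, size_t len)
{
    /* This assert only keeps the simulation inside its own array. */
    assert(len <= SIM_MEM_SIZE && addr <= SIM_MEM_SIZE - len);
    memcpy(&sim_mem[addr], src, len);
}

/* Counts how many bytes of the .text stand-in no longer hold the pattern. */
size_t sim_corrupted_text_bytes(void)
{
    size_t n = 0;
    for (size_t i = 0; i < SIM_TEXT_SIZE; i++) {
        if (sim_mem[i] != SIM_TEXT_FILL)
            n++;
    }
    return n;
}
```

Storing a 4-byte message at address 0, which is where a buffer whose base pointer is NULL would start, clobbers 4 bytes of the simulated instructions, and each further message appended to that buffer eats a little more. The same store at an address past the simulated .text leaves the instructions intact. That gradual damage matches what I saw: the program ran for a while and only went off the rails after a certain number of messages.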

Discussion: https://www.reactos.org/forum/viewtopic.php?f=2&t=13451