Sunday, January 6, 2013

Heisenbug: Missing Event Problem, an Interesting Embedded System Problem


I was working in an embedded environment where several processors interacted with each other. Let's call them Master, Slave1, and Slave2.

The normal flow was this: based on its input, the Master processor would ask Slave1 and Slave2 to do work. The Master could queue up more than one job for either Slave1 or Slave2.

The Master keeps track of how many jobs it has requested from a particular slave and expects the same number of responses (RTOS events) from that slave.

When a slave completes a job, it sends an interrupt to the Master. The interrupt generates an RTOS event.
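The bookkeeping described above can be sketched roughly as follows. This is only a host-side model, not the real firmware: the post does not name the RTOS or its API, so `master_state_t`, `master_queue_job`, `slave_done_isr`, and `master_take_response` are all hypothetical names, and plain counters stand in for RTOS events.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical identifiers; the post does not name the RTOS or its API. */
enum { SLAVE1 = 0, SLAVE2 = 1, NUM_SLAVES = 2 };

typedef struct {
    unsigned jobs_outstanding[NUM_SLAVES]; /* jobs queued, not yet acknowledged */
    unsigned events_pending[NUM_SLAVES];   /* completion events posted, not yet consumed */
} master_state_t;

/* Master queues one job for a slave and remembers that a response is due. */
void master_queue_job(master_state_t *m, int slave)
{
    assert(slave >= 0 && slave < NUM_SLAVES);
    m->jobs_outstanding[slave]++;
}

/* Stand-in for the interrupt handler: a "job done" interrupt posts one event. */
void slave_done_isr(master_state_t *m, int slave)
{
    assert(slave >= 0 && slave < NUM_SLAVES);
    m->events_pending[slave]++;
}

/* Master consumes one event, if any. Returns true only when an event was
   matched against an outstanding job for that slave. */
bool master_take_response(master_state_t *m, int slave)
{
    assert(slave >= 0 && slave < NUM_SLAVES);
    if (m->events_pending[slave] == 0 || m->jobs_outstanding[slave] == 0)
        return false;
    m->events_pending[slave]--;
    m->jobs_outstanding[slave]--;
    return true;
}
```

The Master is finished with a slave once `jobs_outstanding` for that slave drops back to zero. Note that the scheme only works if every posted event is eventually consumed by the code that queued the matching job.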

Here was the problem: during hours-long stress testing, one of the tests was failing. I took the failed test case and ran it as a standalone test, and it failed.

Now it was time to set a breakpoint in the debugger and check. But, hey, with a breakpoint in place everything seemed to work fine. This is a classic example of a Heisenbug.

Well, I don't think there is any hard and fast rule for fixing a Heisenbug. You need a thorough understanding of the system.

In my case, an 'other' module, which ran before the problematic modules, was leaving behind an unhandled RTOS event from Slave2. So during stress testing, even before Slave2 had completed the job, the Master was getting a response and concluding that Slave2 had finished. In actuality, Slave2 was still processing the job.
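The stale-event scenario can be reproduced in a small host-side model. Everything here is an assumption made for illustration: `channel_t` and the function names are hypothetical, a counter stands in for the RTOS event, and the `drain_stale` guard is just one possible mitigation, not necessarily the fix that was applied in the real system.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the Master <-> Slave2 channel; names are illustrative. */
typedef struct {
    unsigned events_pending; /* completion events posted but not yet consumed */
    bool     slave_busy;     /* true while the slave is really working */
} channel_t;

/* Slave side: the job is finished and the interrupt posts one event. */
void slave_finish_job(channel_t *ch)
{
    ch->slave_busy = false;
    ch->events_pending++;
}

/* Master starts a job. With drain_stale set, leftover events are discarded
   first. Assumes no job is outstanding on this channel when it is called. */
void master_start_job(channel_t *ch, bool drain_stale)
{
    assert(!ch->slave_busy);
    if (drain_stale)
        ch->events_pending = 0;
    ch->slave_busy = true;
}

/* Master checks for a response; returns true if it believes the job is done. */
bool master_job_done(channel_t *ch)
{
    if (ch->events_pending == 0)
        return false;
    ch->events_pending--;
    return true;
}

/* Returns true if the Master declares the job done while the slave is busy. */
bool false_completion_occurs(bool drain_stale)
{
    channel_t ch = { 0, false };

    /* The 'other' module runs a job but never consumes its completion event. */
    master_start_job(&ch, false);
    slave_finish_job(&ch);

    /* The next module starts a job and checks for a response right away. */
    master_start_job(&ch, drain_stale);
    bool believed_done = master_job_done(&ch);
    return believed_done && ch.slave_busy;
}
```

Calling `false_completion_occurs(false)` models the failing stress test: the leftover event is mistaken for the new job's response. With `drain_stale` set, the leftover event is dropped before the new job starts, so the Master keeps waiting. Draining is only safe when no job is outstanding; the cleaner fix is for the earlier module to consume every event it causes.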

When I hooked up the debugger, by the time I had stepped through the program, Slave2 had had enough time to finish the job, and hence the problem was masked.

1 comment:

blewis999 said...

Very good posts, thank you.

Robert