Echo Test Data Inconsistencies.

Status:

Closed June 30, 2004

Participants:

Identified by Kael Hanson and John Jacobsen
Diagnosis by Kalle Sulanke, Thorsten Setzelberger, Arthur Jones, and John Jacobsen

Symptoms:

Sluggish communications on DOM B after DOM A was rebooted.
Very long time-outs needed to make scripted tests complete instead of timing out.
Longword zeros (4 contiguous, longword-aligned zero bytes) were found in data returned in a loop-around test.
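
The longword-zero symptom can be checked for mechanically. The following is only a minimal sketch, not the script used in the original tests: the function name find_zero_longwords and the sample payload are hypothetical, and it assumes the echoed data is available as a Python bytes object whose test pattern contains no legitimate zero longwords.

```python
def find_zero_longwords(data):
    """Return the byte offsets of every longword-aligned run of 4 zero bytes."""
    offsets = []
    # Step through the buffer one longword (4 bytes) at a time so that only
    # aligned runs are reported; a trailing partial longword is ignored.
    for offset in range(0, len(data) - 3, 4):
        if data[offset:offset + 4] == b"\x00\x00\x00\x00":
            offsets.append(offset)
    return offsets


# Hypothetical loop-around payload: 16 non-zero bytes, with the third
# longword (offsets 8..11) zeroed to imitate the observed corruption.
sent = bytes(range(1, 17))
received = sent[:8] + b"\x00" * 4 + sent[12:]

print(find_zero_longwords(sent))      # no zero longwords in the pattern sent
print(find_zero_longwords(received))  # offset of the corrupted longword
```

Four zero bytes that straddle a longword boundary are deliberately not reported, since the symptom described above was longword-aligned.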

Approach:

Schematics of the firmware were studied.
VHDL and other source files for the communications driver were studied.
State machine bubble diagrams were studied. 
Internal FPGA signals were routed out to test headers so they could be observed on an oscilloscope along with the communications waveforms.
The "Signal-Tap" logic analyzer block was embedded in the FPGA design and wired to various logical combinations of internal signals as a trigger. The results, extracted via JTAG, were studied on the computer screen in an effort to identify inconsistencies. For instance, the error exit from the byte loop in the data packet state machine was used as a trigger...

Solution:

A (positive) reset signal was used in two places in the packet pointer subcircuit: once correctly and once inverted. As a result, under some circumstances the wrong data was written to the data pointer.
Subsequently, during debugging, a state machine central to the accumulation of data bytes was rewritten in VHDL. In the process, a missing error recovery state was discovered. If a start-of-packet was detected by mistake and never matched by an end-of-packet, communications would be disrupted.
The erroneous start-of-packet events occurred at certain temperatures because the MSB to the communications DAC was pulled up by a highly variable 'internal' pull-up of the FPGA, and down by a 10K resistor. At a certain temperature the voltage divider would bias the digital input of the DAC to a metastable level. Statistically, the bit spent about half the time high and half low. Sometimes the emitted noise pattern would be decoded as a start-of-packet several times while the FPGA was being loaded. The data sheet for the FPGA says the internal weak pull-up is in the range of 10K to 50K over the temperature range. The temperature-dependent metastability was eliminated by changing the 10K pull-down to 2K.
Once the firmware was fixed, in the DOM and ultimately also in the DOR card,  the packet shift race problem could be found.
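
The effect of the resistor change can be illustrated with simple divider arithmetic. This is a rough sketch only: it treats the FPGA pull-up as an ideal resistor to the supply rail and uses the 10K to 50K data sheet range quoted above. The supply voltage and the DAC input threshold are not given in this report, so results are expressed as a fraction of the supply. The function names are hypothetical.

```python
def divider_fraction(r_pullup_ohms, r_pulldown_ohms):
    """Fraction of the supply voltage at the node between pull-up and pull-down."""
    return r_pulldown_ohms / (r_pullup_ohms + r_pulldown_ohms)


# Data sheet range for the FPGA internal weak pull-up, as quoted above.
PULLUP_RANGE_OHMS = (10_000, 50_000)


def fraction_range(r_pulldown_ohms):
    """Lowest and highest node level (as a fraction of supply) over the pull-up range."""
    fractions = [divider_fraction(r_up, r_pulldown_ohms) for r_up in PULLUP_RANGE_OHMS]
    return min(fractions), max(fractions)


for r_down in (10_000, 2_000):
    low, high = fraction_range(r_down)
    print(f"{r_down} ohm pull-down: {low:.3f} to {high:.3f} of supply")
```

With the 10K pull-down the node can sit as high as half the supply when the pull-up is at its 10K extreme, which is consistent with an input hovering between logic levels. With the 2K pull-down the node stays at or below about one sixth of the supply across the whole pull-up range.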

Summary:

Rev 4.x DOMs will not have their resistors changed because the communications firmware is now immune to start-of-packet events detected in the noise sometimes transmitted during reboot.  Rev 5 DOM MBs all have the resistor change.
Communications firmware for new packets is up-to-date in the CVS archive.
DOMApp did not need to be changed.
The DOM Hub Driver has been updated as of June 30, 2004.
DOR driver bugfix - V02-00-01 delivered. and 

June 30, 2004  LBNL for IceCube

Other remarks:

- John says
I sent a full report yesterday: http://icecube.wisc.edu/mailing-list-archives/icebug_archive/msg00042.shtml

The summary is that sometimes packets emerge from the DOR RX FIFO corrupted.  So this is not a DOR driver problem.
They are not corrupted when the DOM software writes them either.  So it is firmware, either the DOM TX side or (more likely, according to Arthur) the DOR RX side.  Kalle is working on it.  The problem is not in domapp, HAL, dor-driver or dom hub software.