Echo Test Data inconsistencies.
Status:
Closed June 30, 2004
Participants:
Identified by Kael Hanson and John Jacobsen
Diagnosis by Kalle Sulanke, Thorsten Setzelberger, Arthur Jones, and
John Jacobsen
Symptoms:
Sluggish communications on DOM B after
DOM A was rebooted.
Very long time-outs needed to make scripted tests complete instead of
timing out.
Longword zeros (4 contiguous, longword-aligned byte zeros) were found
in data returned in a loop-around test.
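The longword-zero symptom can be checked for mechanically. The following is a minimal sketch in Python, not the test script actually used; the function name and the sample payload are made up for illustration, and it assumes the echoed data is available as a byte string whose offset 0 is longword-aligned.

```python
def find_zero_longwords(data: bytes, word_size: int = 4) -> list:
    """Return byte offsets of aligned words that consist only of zero bytes."""
    zero = bytes(word_size)
    offsets = []
    # Step through the buffer one aligned word at a time; a trailing
    # partial word (fewer than word_size bytes) is ignored.
    for offset in range(0, len(data) - word_size + 1, word_size):
        if data[offset:offset + word_size] == zero:
            offsets.append(offset)
    return offsets


# Hypothetical loop-around payload: 16 non-zero bytes, with the third
# longword replaced by zeros on the way back.
sent = bytes(range(1, 17))
received = sent[:8] + bytes(4) + sent[12:]
print(find_zero_longwords(received))  # [8]
```

A test payload that contains no zero bytes of its own, as above, keeps any reported offset unambiguous.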
Approach:
Schematics of the firmware were studied.
VHDL and other source files for the communications driver were studied.
State machine bubble diagrams were studied.
Internal FPGA signals were routed out to test headers so they
could be observed on an oscilloscope along with the communications
waveforms.
The "Signal-Tap" logic analyzer block was embedded in the FPGA design,
and wired to various logical combinations of internal signals as a
trigger. The results, extracted via JTAG, were
studied on the computer screen, in an effort to identify
inconsistencies. For instance, the error exit from the byte loop
in the data packet state machine was used as a trigger...
Solution:
A (positive) reset signal was used in two places in the packet
pointer subcircuit: once correctly and once inverted. The
result was that, under some circumstances, the wrong data was written to
the data pointer.
Subsequently, during debugging, a state machine central to the
accumulation of data bytes was rewritten in VHDL. In the process,
a missing error recovery state was discovered. If a
start-of-packet was detected by mistake, and never complemented by an
end-of-packet, the communications would be disrupted.
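The effect of the missing recovery state can be illustrated with a toy model. This is a sketch in Python, not the VHDL state machine itself: the state names, the SOP/EOP/QUIET symbols, and the choice of a quiet-line timeout as the recovery trigger are all assumptions made for illustration, since the report does not describe the actual recovery mechanism.

```python
from enum import Enum, auto


class State(Enum):
    IDLE = auto()
    IN_PACKET = auto()


# Hypothetical line symbols; the real encoding is not described in this report.
SOP, EOP, QUIET = "SOP", "EOP", "QUIET"


class PacketReceiver:
    def __init__(self, quiet_limit=None):
        self.quiet_limit = quiet_limit  # None models the missing recovery state
        self.state = State.IDLE
        self.buffer = bytearray()
        self.quiet = 0
        self.packets = []

    def feed(self, symbol):
        if self.state is State.IDLE:
            if symbol == SOP:
                self.state = State.IN_PACKET
                self.buffer = bytearray()
                self.quiet = 0
            return  # anything else is ignored while idle
        # State.IN_PACKET from here on.
        if symbol == QUIET:
            self.quiet += 1
            if self.quiet_limit is not None and self.quiet >= self.quiet_limit:
                self.state = State.IDLE  # error recovery: drop the partial packet
            return
        self.quiet = 0
        if symbol == EOP:
            self.packets.append(bytes(self.buffer))
            self.state = State.IDLE
        elif symbol != SOP:
            self.buffer.append(symbol)  # data byte
        # A SOP seen inside a packet is ignored in this simplified model.


def run(stream, quiet_limit):
    rx = PacketReceiver(quiet_limit)
    for symbol in stream:
        rx.feed(symbol)
    return rx.packets


# A spurious SOP plus one garbage byte, a quiet line, then a real packet.
stream = [SOP, 0xFF] + [QUIET] * 5 + [SOP, 1, 2, EOP]
print(run(stream, quiet_limit=None))  # [b'\xff\x01\x02']
print(run(stream, quiet_limit=3))     # [b'\x01\x02']
```

Without a recovery path the model stays in the packet state after the false start and folds the garbage byte into the next real packet; with one, it returns to idle and the real packet arrives intact.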
The erroneous start-of-packet events occurred at certain temperatures
because the MSB to the communications DAC was pulled up by a highly
variable 'internal' pull-up of the FPGA, and down by a 10K
resistor. At a certain temperature the voltage divider
would bias the digital input of the DAC to a metastable level.
Statistically, the bit spent about half the time high, and half
low. Sometimes the noise pattern emitted would be decoded as
start-of-packet, several times, while the FPGA was being
loaded. The data sheet for the FPGA says the internal weak
pull-up is in the range of 10K to 50K over the temperature range.
The temperature-dependent metastability was eliminated by changing the
10K pull-down to 2K.
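The divider arithmetic behind the resistor change can be worked out from the figures quoted above. The sketch below, in Python, treats the pull-up and pull-down as an ideal two-resistor divider and reports the node voltage as a fraction of the supply; the supply voltage and the DAC's input thresholds are not given in this report, so no absolute levels are assumed.

```python
def divider_fraction(pull_up_ohms, pull_down_ohms):
    """Fraction of the supply voltage at the node between the two resistors."""
    return pull_down_ohms / (pull_up_ohms + pull_down_ohms)


# Internal weak pull-up range quoted from the FPGA data sheet (10K to 50K).
PULL_UP_RANGE = (10_000, 50_000)

for pull_down in (10_000, 2_000):
    highest = divider_fraction(PULL_UP_RANGE[0], pull_down)  # strongest pull-up
    lowest = divider_fraction(PULL_UP_RANGE[1], pull_down)   # weakest pull-up
    print(f"{pull_down} ohm pull-down: {lowest:.3f} to {highest:.3f} of supply")

# 10000 ohm pull-down: 0.167 to 0.500 of supply
# 2000 ohm pull-down: 0.038 to 0.167 of supply
```

With the 10K pull-down, a pull-up at the strong end of its range leaves the input at half the supply, midway between the rails. With 2K, the input stays at or below about one sixth of the supply across the whole quoted pull-up range.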
Once the firmware was fixed, in the DOM and ultimately also in the DOR
card, the packet shift race problem could be found.
Summary:
Rev 4.x DOMs will not have their resistors changed because the
communications firmware is now immune to start-of-packet events
detected in the noise sometimes transmitted during reboot. Rev 5
DOM MBs all have the resistor change.
Communications firmware for new packets is up-to-date in the CVS
archive.
DOMApp did not need to be changed.
The DOM Hub Driver has been updated as of June 30, 2004.
DOR driver bugfix - V02-00-01 delivered.
June 30, 2004 LBNL for IceCube
Other remarks:
- John says:
The summary is that sometimes
packets emerge from the DOR RX FIFO corrupted. So this is not a
DOR driver problem.
They are not corrupted when the DOM
software writes them either. So it is firmware, either the DOM TX
side or (more likely, according to Arthur) the DOR RX side. Kalle
is working on it. The problem is not in domapp, HAL, dor-driver
or dom hub software.