Config problems, Including Cold Config problems.
Status:
Closed June 16, 2004
Participants:
Thorsten Stezelberger, Arthur Jones, Gerald Przybylski
Symptoms:
Two DOM MBs could operate without trouble on the same twisted pair.
When one of the DOMs was commanded to execute a soft reboot, the other
DOM would suffer communications disruptions. The affected DOM would
become partly or completely uncommunicative, or respond sluggishly to
data requests from the DOM Hub. The communications disruption would
sometimes occur at a particular temperature. The symptom could be
alleviated by heating the FPGA of the DOM MB that was being rebooted.
Approach:
Install the "Signal-Tap" in-PLD logic analyzer, which reports waveforms
through the FPGA JTAG port when a trigger occurs. Trigger on error
conditions while examining various internal signals until the problem
is located.
Solution:
Item 1: Located an improperly used asynchronous reset line to the
counter and latch used to point to the data record in dual-port
memory. This bug occurred all the time, at all temperatures. A
positive-going reset pulse was applied correctly to the reset input of
a counter, but incorrectly to the not_reset input of a register used
to load the counter with a preset value.
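The polarity mismatch in Item 1 can be illustrated with a minimal
Python sketch. It is a behavioral model only, not the FPGA source; the
sample waveform and the function names are assumptions made for
illustration.

```python
# Behavioral sketch only; the waveform and names are hypothetical.

def active_high_asserted(level):
    """Reset input: asserted while the line is high."""
    return level == 1

def active_low_asserted(level):
    """not_reset input: asserted while the line is low."""
    return level == 0

# A positive-going reset pulse: idle low, high for one sample, low again.
reset_line = [0, 0, 1, 0, 0]

counter_in_reset = [active_high_asserted(v) for v in reset_line]
register_in_reset = [active_low_asserted(v) for v in reset_line]

print("counter in reset: ", counter_in_reset)
print("register in reset:", register_in_reset)
```

In this model the counter is reset only during the pulse, while the
register sees its reset asserted at every other time, the opposite of
what a shared reset pulse is meant to do.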
Item 2: Examination of the state machine that reads data bytes from
the line and stores them in dual-port memory showed that it appeared
to lack proper handling of the case where a start-of-frame is detected
but an end-of-frame is never detected. The bubble-diagram state
machine was first converted to VHDL, and the VHDL was then modified to
include error recovery in the byte loop.
Testing of the revised FPGA code demonstrated that soft reboots of the
other DOM no longer induced communications faults at any temperature.
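The kind of error recovery described in Item 2 can be sketched in
Python as a two-state receiver. This is an illustrative model, not the
revised VHDL: the SOF and EOF byte values, the MAX_FRAME_BYTES limit,
and the function name are all hypothetical, and a byte-count limit is
used here only as one possible recovery condition, since the report
does not say which condition the firmware uses.

```python
# Behavioral sketch only. SOF, EOF and MAX_FRAME_BYTES are hypothetical
# values chosen for illustration; they are not taken from the DOM firmware.
SOF = 0x01
EOF = 0x04
MAX_FRAME_BYTES = 8

def receive(byte_stream):
    """Return (frames, errors): completed frames and abandoned-frame count."""
    frames = []
    errors = 0
    state = "IDLE"
    buffer = []
    for b in byte_stream:
        if state == "IDLE":
            if b == SOF:
                state = "IN_FRAME"
                buffer = []
        else:  # IN_FRAME: the byte loop
            if b == EOF:
                frames.append(buffer)
                state = "IDLE"
            elif len(buffer) >= MAX_FRAME_BYTES:
                # Error recovery: end-of-frame never arrived, so drop the
                # partial record and go back to waiting for start-of-frame.
                errors += 1
                state = "IDLE"
            else:
                buffer.append(b)
    return frames, errors

# A false start-of-frame followed by noise, then one genuine frame.
stream = [SOF] + [0x55] * 20 + [SOF, 0x10, 0x11, EOF]
frames, errors = receive(stream)
print(frames, errors)
```

Without the recovery branch, the false start-of-frame would leave the
model stuck in the byte loop and the genuine frame that follows would
be swallowed as data.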
Item 3: The error state mentioned in Item 2 was used to trigger an
oscilloscope, which captured the waveform on the line during the 30 ms
immediately preceding the frame error.
At a certain temperature, triggers were generated when a soft reboot
was executed on the other DOM on the pair. The waveform on the twisted
pair contained random, medium-amplitude pulses. Further investigation
revealed that the output of the communications DAC of the DOM MB
undergoing the soft reboot was randomly flipping between %x7F and
%xFF, a transition of roughly half the DAC output range.
The %x80 bit of the DAC was pulled low by a 10 K ohm resistor and, at
the same time, pulled high by a "soft pull-up" in the FPGA. The FPGA
specifications state that the soft pull-up value will fall in the
range of 10 K ohm to 50 K ohm. Apparently, at some temperature, the
voltage division between the pull-up and pull-down biased the DAC
input into a range where the input register bit would flip between "0"
and "1". The resulting random pattern, delivered to the running DOM MB
by the rebooting DOM MB at the most susceptible temperature, mimicked
the start-of-frame sequence. Changing the resistor from 10 K ohm to
2.2 K ohm prevents the DAC digital input from ever being biased to a
level that could produce noisy DAC output.
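The voltage-division argument can be checked with a few lines of
Python. This is a sketch of the ideal divider arithmetic only: it
ignores any current drawn by the DAC input, and it expresses the node
voltage as a fraction of the supply because the report does not state
the supply voltage or the DAC input thresholds. The function name is
hypothetical.

```python
# Sketch of the divider arithmetic. Voltages are expressed as a fraction
# of the supply because the source does not give the supply voltage.

def divider_fraction(r_pull_up, r_pull_down):
    """Fraction of the supply seen at the node between pull-up and pull-down."""
    return r_pull_down / (r_pull_up + r_pull_down)

PULL_UP_RANGE = (10e3, 50e3)   # FPGA soft pull-up range quoted above, ohms

for r_down in (10e3, 2.2e3):   # original and replacement pull-down, ohms
    highest = divider_fraction(min(PULL_UP_RANGE), r_down)
    lowest = divider_fraction(max(PULL_UP_RANGE), r_down)
    print(f"pull-down {r_down/1e3:g} K: input between "
          f"{lowest:.2f} and {highest:.2f} of supply")
```

With the 10 K ohm pull-down, the input can sit as high as half the
supply when the soft pull-up is at the low end of its range. With the
2.2 K ohm pull-down, it stays at or below about 18 percent of the
supply across the whole specified pull-up range.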
Regrettably, the noise waveforms could not be retrieved from the floppy
disk.
Summary:
The firmware changes eliminate the communications disruption caused by
the faulty handling of an error state in the byte loop firmware.
The change makes the DOM Communications receiver firmware resistant to
the noise-triggered start-of-frame event. The explanation of the
underlying source of the noise-triggered errors makes it clear that
mitigation of the noise source itself need not be propagated into
earlier production DOM MBs.
June 16, 2004 LBNL for IceCube