Config problems, including Cold Config problems.

Status:

Closed June 16, 2004

Participants:

Thorsten Stezelberger, Arthur Jones, Gerald Przybylski

Symptoms:

Two DOM MBs could be operating, without trouble, on the same twisted pair.
When one of the DOMs was commanded to execute a soft reboot, the other DOM would suffer communications disruptions.  The affected DOM would become partly or completely uncommunicative, or be sluggish in its response to data requests from the DOM Hub.  The communications disruption would sometimes occur at a particular temperature.  The symptom could be alleviated by injecting heat into the FPGA of the DOM MB that was being rebooted.

Approach:

Install the "Signal-Tap" in-PLD logic analyzer, which reports waveforms through the FPGA JTAG port when a trigger occurs.  Trigger on error conditions, while examining various internal signals, until the problem is located.

Solution:

Item 1: Located an improperly used asynchronous reset line to the counter and latch used to point to the data record in dual-port memory. The bug occurred all the time at all temperatures.  A positive-going reset pulse was used properly at the reset input of a counter, but improperly at the not_reset input of a register used to load the counter with a preset value.
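
The polarity mismatch in Item 1 can be illustrated with a minimal sketch.  It is not the DOM MB firmware; the function names and the five-sample pulse are assumptions made only for illustration.  It shows how one positive-going pulse is interpreted by an active-high reset input and by an active-low (not_reset) input.

```python
# Sketch: the same positive-going reset pulse seen by an active-high reset
# input and by an active-low (not_reset) input.  All names are hypothetical.

def active_high_in_reset(reset: int) -> bool:
    """An active-high reset input asserts reset while the line is 1."""
    return reset == 1

def active_low_in_reset(not_reset: int) -> bool:
    """An active-low reset input asserts reset while the line is 0."""
    return not_reset == 0

# Positive-going pulse: idle low, high for one sample, then low again.
pulse = [0, 0, 1, 0, 0]

counter_in_reset = [active_high_in_reset(level) for level in pulse]
register_in_reset = [active_low_in_reset(level) for level in pulse]

print("counter  (reset input):    ", counter_in_reset)
print("register (not_reset input):", register_in_reset)
```

In this model the counter is reset only during the pulse, while the register sees reset asserted whenever the line is idle, independent of temperature.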

Item 2: Examination of the state machine that reads in data bytes from the line and stores them in dual-port memory showed that it appeared to lack proper handling of the case where a start-of-frame is detected, but an end-of-frame is never detected.  The bubble-diagram state machine was first converted to VHDL, and the VHDL was then modified to include error recovery in the byte loop.
Testing of the revised FPGA code demonstrated that soft reboots of the other DOM didn't induce communications faults at any temperature.
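
The kind of error recovery described in Item 2 can be sketched as follows.  This is a Python model, not the VHDL that was actually written.  The marker values SOF and EOF, the limit MAX_FRAME_BYTES, and the choice of a length bound as the recovery condition are all assumptions; the report does not state how the recovery was implemented.

```python
# Sketch of a byte-loop receiver with error recovery.
# SOF, EOF and MAX_FRAME_BYTES are hypothetical, not the DOM protocol's values.
from enum import Enum, auto

SOF = 0x7E              # hypothetical start-of-frame marker
EOF = 0x7D              # hypothetical end-of-frame marker
MAX_FRAME_BYTES = 8     # hypothetical upper bound on payload length

class State(Enum):
    IDLE = auto()       # waiting for start-of-frame
    BYTE_LOOP = auto()  # storing payload bytes until end-of-frame

class FrameReceiver:
    def __init__(self):
        self.state = State.IDLE
        self.buffer = []    # stands in for the record in dual-port memory
        self.frames = []    # completed frames
        self.errors = 0     # frames abandoned by the recovery path

    def on_byte(self, byte: int) -> None:
        if self.state is State.IDLE:
            if byte == SOF:
                self.buffer = []
                self.state = State.BYTE_LOOP
        elif byte == EOF:
            self.frames.append(bytes(self.buffer))
            self.state = State.IDLE
        elif len(self.buffer) >= MAX_FRAME_BYTES:
            # Error recovery: end-of-frame never arrived, so drop the
            # partial record and return to hunting for start-of-frame.
            self.errors += 1
            self.state = State.IDLE
        else:
            self.buffer.append(byte)

rx = FrameReceiver()
# A false start-of-frame followed by bytes that never include end-of-frame...
for b in [SOF] + [0x00] * 9:
    rx.on_byte(b)
# ...must not prevent the next genuine frame from being received.
for b in [SOF, 0x01, 0x02, 0x03, EOF]:
    rx.on_byte(b)
print(rx.errors, rx.frames)
```

Without the recovery branch, the model would stay in the byte loop after the false start-of-frame until an end-of-frame marker happened to arrive.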

Item 3: The error state mentioned in item 2 was used to trigger an oscilloscope, which captured the waveform on the line during the 30 ms immediately preceding the frame error.
At a certain temperature, the triggers were generated when a soft reboot was executed on the other DOM on the pair. The waveform on the twisted pair contained random, medium-amplitude pulses.  Further investigation revealed that the output of the communications DAC of the DOM MB that was undergoing the soft reboot was randomly flipping between %x7F and %xFF, a transition of roughly half the DAC output range.  The %x80 bit of the DAC was pulled low by a 10 K ohm resistor, and at the same time pulled high by a "soft pull-up" in the FPGA.  The FPGA specifications state that the soft pull-up value will fall in the range of 10 K ohm to 50 K ohm.  Apparently, at some temperature, the voltage division between the pull-up and pull-down biased the DAC input into a range where the input register bit would flip between "0" and "1".  The random pattern, delivered to the running DOM MB by a rebooting DOM MB at the most susceptible temperature, mimicked the start-of-frame sequence to the running DOM MB.  Changing the pull-down resistor from 10 K ohm to 2.2 K ohm prevents the DAC digital input from ever being biased to a level that could produce noisy DAC output.
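
The voltage division described in Item 3 can be worked through numerically.  The sketch below assumes an ideal resistive divider with no leakage or input current, and does not model temperature.  The resistor values come from the text above; the function names are illustrative.

```python
# Sketch: fraction of the supply voltage at the DAC %x80 input when the
# FPGA soft pull-up works against the external pull-down resistor.
# Assumption: ideal resistive divider, no leakage or input current.

def divider_fraction(pull_up_ohm: float, pull_down_ohm: float) -> float:
    """Return V_in / V_supply for a pull-up in series with a pull-down."""
    return pull_down_ohm / (pull_up_ohm + pull_down_ohm)

PULL_UP_RANGE_OHM = (10e3, 50e3)   # soft pull-up range from the FPGA spec
PULL_DOWNS_OHM = {"original 10 K": 10e3, "revised 2.2 K": 2.2e3}

def worst_case_fractions(pull_down_ohm: float) -> tuple:
    """Input fraction at the strongest and the weakest specified pull-up."""
    return tuple(divider_fraction(pu, pull_down_ohm) for pu in PULL_UP_RANGE_OHM)

for label, pull_down in PULL_DOWNS_OHM.items():
    highest, lowest = worst_case_fractions(pull_down)
    print(f"{label}: input between {lowest:.2f} and {highest:.2f} of the supply")
```

Under these assumptions, the 10 K ohm pull-down lets the input rise as high as half the supply when the pull-up is at its 10 K ohm minimum, while the 2.2 K ohm pull-down keeps it below about 0.18 of the supply across the specified pull-up range.  The DAC's actual logic thresholds are not given in this report.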

Regrettably, the noise waveforms could not be retrieved from the floppy disk.

Summary:

The firmware changes eliminate the communications disruption caused by the faulty handling of an error state in the byte loop firmware.  The change makes the DOM Communications receiver firmware resistant to the noise-triggered start-of-frame event.  The explanation of the underlying source of the noise-triggered errors makes it clear that the mitigation of that source need not be propagated into earlier production DOM MBs.

June 16, 2004  LBNL for IceCube