Root Cause Analysis/Failure Mode Analysis

Discussion on the justification for SJTAG in each of the identified Use Cases: Alternatives, cost benefits and penalties
Bradford Van Treuren
SJTAG Chair Emeritus
Posts: 103
Joined: Fri Nov 16, 2007 2:06 pm
Location: NOKIA / USA

Root Cause Analysis/Failure Mode Analysis

Post by Bradford Van Treuren » Fri Jul 25, 2008 8:20 pm

Meeting Minutes reference:
http://www.sjtag.org/minutes/minutes080519.html
http://www.sjtag.org/minutes/minutes080630.html


At first glance, this use case does not seem appropriate for 1149.1, since failures that occur in the field are usually cleared by replacing the field replaceable unit (FRU) and sending it to the repair depot. In reality, this process only adds to the problem of No Trouble Found/No Fault Found (NTF/NFF) cases at the repair depot. Some classes of faults manifest only in the environment in which they occur. Many of these are thermally related and occur under specific system conditions. It is therefore best to identify failures in the environment where they are experienced. Once a failure has been captured in the failing environment, an NTF at the repair depot can be associated with the previously recorded failing condition, narrowing down the problem space.

The reason for applying tests in the system when a failure condition occurs is to record failures with enough diagnostic granularity to pinpoint the location of the failure on the board, rather than just a PASS/FAIL result. Many functional tests can identify a failing functional block, but their granularity as to where in the circuit the failure lies is poor, because functional tests typically target functional features rather than structural features. An 1149.1-based test can target specific structural features, such as open pins, and can identify devices that are not operating properly due to some environmental condition, such as over-temperature or under-voltage, where a device does not respond properly to the scan operations. By keeping track of failures at the net and device pin level, designers can identify trends of similar failures in a circuit, which could indicate a design problem requiring rework to improve a product's reliability. For example, if the same device exhibits open pins over time, this could indicate a thermal problem at that location on the board, or insufficient heat dissipation features in the circuit. It could also indicate a mechanical clearance problem during installation and removal of the board.
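As a rough illustration of the trend tracking described above, here is a minimal Python sketch. The record layout, the serial numbers, the reference designators and the `fault_trends` helper are all made up for the example; they are not taken from any SJTAG document or tool.

```python
from collections import Counter

# Hypothetical failure records logged by in-system 1149.1 tests:
# (board serial, device reference designator, pin, fault type)
records = [
    ("SN001", "U12", "A3", "open"),
    ("SN002", "U12", "A3", "open"),
    ("SN003", "U7", "B1", "stuck-at-0"),
    ("SN004", "U12", "A4", "open"),
]

def fault_trends(records, min_count=2):
    """Count faults per (device, fault type); keep those seen at least min_count times."""
    counts = Counter((device, fault) for _serial, device, _pin, fault in records)
    return {key: n for key, n in counts.items() if n >= min_count}

print(fault_trends(records))  # {('U12', 'open'): 3}
```

Grouping by device and fault type rather than by individual pin is a choice made for this sketch: it surfaces the "same device keeps showing open pins" pattern even when the affected pin differs from board to board.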

Another way IEEE 1149.1 provides value for this use case is the SAMPLE instruction, available in all boundary-scan devices. I have found that most people I talk to do not realize the SAMPLE instruction is non-intrusive to the operation of a device. In other words, SAMPLE captures the state of the boundary-scan register (BSR) without applying changes to the device pins. This makes it possible to capture a snapshot of the state of the system signals at a point in time, which is especially useful when you need to identify which alarm signal(s) indicated a problem and the normal event reporting system in the architecture is no longer working. From this state snapshot, the data can be analyzed to understand what triggered the failure and changed the state of the board, causing it to go out of service. This information is important for software developers: it identifies events that could affect the state of their software model of the system, and thus gives insight into why the software responded the way it did.
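To make the snapshot idea concrete, the sketch below decodes a captured BSR bit string into the alarm signals that were asserted. It is only an illustration: the cell-to-signal map, the signal names and the `decode_snapshot` helper are hypothetical, and the snapshot is assumed to be already arranged so that character i of the string is the captured value of cell i. In practice the mapping would come from each device's BSDL file and the board netlist.

```python
# Hypothetical map of boundary-scan cell index -> board signal name.
ALARM_CELLS = {
    0: "PWR_FAIL_N",
    3: "OVERTEMP",
    5: "CLK_LOSS",
}
# Signals assumed to be asserted when low.
ACTIVE_LOW = {"PWR_FAIL_N"}

def decode_snapshot(bsr_bits, cells=ALARM_CELLS, active_low=ACTIVE_LOW):
    """Return the names of alarm signals asserted in a captured BSR bit string.

    bsr_bits[i] is the captured value ('0' or '1') of cell i.
    """
    asserted = []
    for index, name in sorted(cells.items()):
        level_high = bsr_bits[index] == "1"
        # Asserted when the level is high for active-high signals,
        # or low for active-low signals.
        if level_high != (name in active_low):
            asserted.append(name)
    return asserted

snapshot = "00010000"
print(decode_snapshot(snapshot))  # ['PWR_FAIL_N', 'OVERTEMP']
```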

It is clear that this is an advanced use case, and it is probably more useful for highly reliable systems than for others. However, some of these features may be useful during design prove-in and testing of prototype systems in other system types.

The thread of discussion in the meeting minutes for this use case begins with the 2008-05-19 meeting.
Last edited by Ian McIntosh on Thu Jun 11, 2009 7:05 am, edited 1 time in total.
Reason: Added links to meeting minutes
Bradford Van Treuren
Distinguished Member of Technical Staff
NOKIA MN

Jim Webster
SJTAG Member
Posts: 8
Joined: Mon Nov 12, 2007 2:28 pm
Location: Integellus, Bonnie Scotland

Post by Jim Webster » Tue Sep 23, 2008 7:17 pm

I am not convinced about the usefulness of the SAMPLE instruction.

so some of my thoughts.....

The SJTAG clock is often considerably slower than the real-time operation clock, and this causes a latency between issuing the instruction and collecting the sample. Good idea if the system is stuck in a loop or something, but not I fear for real time samples of pin activity. Also, if there is a significant difference in clock speeds, the pins being sampled may not yield the "instantaneous" data that one would look for in a sample.
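A back-of-the-envelope sketch of that latency, in Python. All figures are assumptions chosen for illustration (a 2000-cell chain, 1 MHz test clock, 100 MHz system clock, and a nominal allowance of 10 clocks for TAP state transitions); the helper name is made up.

```python
def scan_time_seconds(chain_bits, tck_hz, overhead_clocks=10):
    """Approximate time to shift a captured snapshot out of the scan chain.

    overhead_clocks is an assumed allowance for TAP state transitions.
    """
    return (chain_bits + overhead_clocks) / tck_hz

# Assumed figures: 2000 boundary-scan cells, 1 MHz test clock.
t = scan_time_seconds(2000, 1_000_000)
print(f"{t * 1e3:.2f} ms")  # 2.01 ms

# Number of system clock cycles that elapse in that time at an assumed 100 MHz.
system_hz = 100_000_000
cycles = (2000 + 10) * system_hz // 1_000_000
print(cycles)  # 201000
```

With these assumed numbers, a couple of hundred thousand system clock cycles pass while one snapshot is read out, which is why back-to-back samples cannot follow fast pin activity.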

It is also very difficult to arrange an "event based sample" even when SAMPLE is preloaded: the execution of SAMPLE is JTAG clock based and board conditions are not, at least not without some circuit modification to create a pseudo JTAG clock edge. Getting design engineers to add this type of circuitry is a nightmare, and the cost justification also has to be considered.

For software debug there may be a use, but again, getting the conditions of a non-looping state of pins will be difficult, as will the instant in the loop. (Unfortunately I tried this, but the softies like tools they control, not a test engineer.)

Apologies for the tardiness of the answer, but better late than never!!

Jim

Bradford Van Treuren
SJTAG Chair Emeritus
Posts: 103
Joined: Fri Nov 16, 2007 2:06 pm
Location: NOKIA / USA

Where SAMPLE is useful

Post by Bradford Van Treuren » Tue Sep 23, 2008 8:16 pm

Jim Webster wrote: "Good idea if the system is stuck in a loop or something, but not I fear for real time samples of pin activity."
The usefulness of the SAMPLE instruction depends heavily on the type of signals you are trying to snapshot. Yes, many of the signals in a circuit are too transient to be captured with any value using the SAMPLE instruction. However, there are quite a number of signal cases that are perfect for SAMPLE, which is the reason this instruction was included in the original 1149.1 standard. For signals that show the state of a circuit, especially error states, the signal stays stable for quite a long time, long enough for the SAMPLE instruction to capture it. There are many alarm condition signals on telecommunications boards that indicate a failure of a particular circuit. Normally, these alarms trigger CPU interrupts so that software can handle the error condition and try to recover from the problem. If the CPU is unresponsive (the hung-processor case), using the SAMPLE instruction to capture the state of the circuit's alarm signals would give valuable insight into the failure and possibly identify an interrupt handler that went wrong in trying to handle this circumstance. Hopefully, people have performed fault injection on these alarms to ensure the service routines are doing their jobs before you get to this point.
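One way to separate stable state and alarm signals from transient ones is to take several SAMPLE captures and keep only the cells whose value never changes. The Python sketch below assumes the captures are already available as equal-length bit strings; the `stable_cells` helper and the data are made up for illustration.

```python
def stable_cells(snapshots):
    """Return indices of cells whose captured value is identical in every snapshot."""
    first = snapshots[0]
    return [
        i for i in range(len(first))
        if all(s[i] == first[i] for s in snapshots)
    ]

# Three hypothetical captures of a 4-cell register.
captures = ["1010", "1001", "1011"]
print(stable_cells(captures))  # [0, 1]
```

Cells that stay constant across captures are the candidates worth interpreting as circuit state; the others were changing faster than the captures could follow.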
Bradford Van Treuren
Distinguished Member of Technical Staff
NOKIA MN
