Power-On Self Test (POST)

Discussion on the justification for SJTAG in each of the identified Use Cases: Alternatives, cost benefits and penalties
Bradford Van Treuren
SJTAG Chair Emeritus
Posts: 107
Joined: Fri Nov 16, 2007 2:06 pm
Location: NOKIA / USA

Power-On Self Test (POST)

Post by Bradford Van Treuren » Thu Feb 07, 2008 7:29 pm

Meeting Minutes reference:
http://www.sjtag.org/minutes/minutes080128.html
http://www.sjtag.org/minutes/minutes080204.html
http://www.sjtag.org/minutes/minutes080211.html
http://www.sjtag.org/minutes/minutes080220.html


Power-On Self Test (POST) has been a staple of system-level test since systems were first manufactured. POST has traditionally comprised a set of functional tests targeting key modules that can be functionally separated from the rest of the system, to verify that each circuit module still functions according to its written requirements. In most cases, a failed functional test says nothing more than that the module did not perform as expected, with little to no insight into what caused the failure (i.e., GO/NO-GO diagnostics). For field service calls, where a board has to be pulled when a failure occurs, the down time spent diagnosing where the failure occurred needs to be kept to a minimum. Unfortunately, functional testing was often unable to isolate a fault to a single board (i.e., a Field Replaceable Unit (FRU)); it could only isolate the fault to a set of boards. The craft operator would replace the entire set of boards (FRUs) to get the system back into operation. This led to No-Trouble-Found (NTF) or No-Fault-Found (NFF) results when those boards were tested at the Repair Depot. As boards become more complex, the cost of each board increases significantly. Pulling a board that is merely suspect is therefore a costly inventory proposition, and it wastes testing resources at the Repair Depot, which further increases the maintenance cost of a product.

To improve the granularity of test coverage, more detailed functional tests may be written that target smaller portions of the system. Unfortunately, not all systems lend themselves to this partitioning for test. Further, writing a functional test that targets specific failure models is time consuming and requires special expertise in the circuit targeted by the test. Therefore, writing detailed functional tests for circuit boards tends to be cost prohibitive as market windows shrink and the overall life span of a product diminishes. To aid in the development of tests that provide a finer granularity of diagnostics, Boundary-Scan technology is being introduced into the system as part of system test.

Boundary-Scan, or IEEE Std 1149.1, has been used since its inception in 1990 to provide a testing mechanism inside devices that aids in targeting structural or assembly-type defects in circuit boards. The 1149.1 techniques are used extensively in the manufacturing process to verify that board assemblies have been built according to their design. Since 1149.1 targets the structure of the circuit and not the functional behavior, tests may be automatically generated from the design CAD data. The ability to automatically generate these tests significantly reduces the cost of test generation for a board. Further, these tests provide very precise test coverage metrics as to which parts of the circuit are tested and which are not. More importantly, the 1149.1 based tests are able to diagnose a failure down to the device pin and board net level, giving the ability to isolate a failure to a single FRU instead of a set of FRUs. It is this last capability that may be leveraged to reduce the number of NTFs ending up at the Repair Depot for a system, by implementing Boundary-Scan-Enhanced (BSE) POST in a system. Once the infrastructure is in place on a board, tests that were designed for manufacturing may be migrated to the system test environment directly, or further constrained to eliminate the stimulation of signals that propagate off the board during testing. BSE POST would be run prior to, or as an initial part of, the boot process for the board. The diagnostic results provided by BSE POST could be as simple as a PASS/FAIL indication or as detailed as the failing device pin and net.
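As a rough illustration of the two levels of diagnostic detail mentioned above, the following Python sketch compares a captured scan vector against the expected one and maps each miscomparing bit to a device pin and net. Everything in it is an assumption made for illustration: the `CHAIN` table, the reference designators and the net names are hypothetical, and a real flow would derive this mapping from the test generation tooling rather than hand-write it.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ScanCell:
    device: str  # reference designator (hypothetical)
    pin: str     # device pin observed by this scan cell
    net: str     # board net attached to that pin


# Hypothetical mapping: scan bit position -> cell (index 0 = first bit of the vector).
CHAIN = [
    ScanCell("U1", "A3", "DATA0"),
    ScanCell("U1", "A4", "DATA1"),
    ScanCell("U2", "B7", "DATA0"),
    ScanCell("U2", "B8", "DATA1"),
]


def diagnose(expected: str, captured: str, care: str) -> List[ScanCell]:
    """Return the cells whose captured bit differs from the expected bit.

    All three arguments are strings of '0'/'1'; a '0' in `care` masks a bit
    out of the comparison (a "don't care" position).
    """
    if not (len(expected) == len(captured) == len(care) == len(CHAIN)):
        raise ValueError("vector length must match the scan chain length")
    return [
        cell
        for cell, exp, cap, mask in zip(CHAIN, expected, captured, care)
        if mask == "1" and exp != cap
    ]


def report(failures: List[ScanCell], detailed: bool = False) -> str:
    """Format a result as plain PASS/FAIL or as pin/net level detail."""
    if not failures:
        return "PASS"
    if not detailed:
        return "FAIL"
    return "\n".join(f"FAIL {c.device}.{c.pin} net {c.net}" for c in failures)
```

The same `diagnose` result feeds both report styles, so a system could emit only the GO/NO-GO line at power-on and keep the pin/net detail for later review.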

To achieve Boundary-Scan-Enhanced POST, the design must provide an 1149.1 test controller on the board to be tested that is either under software control by a hosting processor or incorporated into a test co-processor in an FPGA, CPLD, or dedicated IC. The test controller hardware interface may be a dedicated device or may leverage 4 or 5 spare general purpose input/output (GPIO) pins on the hosting microcontroller. The additional hardware cost to support BSE POST can range from $0 to $20 depending on the complexity of the test controller and the performance requirements. Generally, POST does not need the high performance interface that would be required for programming programmable logic devices.
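To make the software-controlled (GPIO) option a little more concrete, here is a minimal Python sketch of a bit-banged TAP access. It is not taken from any SJTAG material: `SimulatedTap` is a hypothetical stand-in for the real pins (on hardware, `clock()` would drive TMS and TDI, sample TDO and pulse TCK), the simulated device only models its identification register, and the IDCODE value used in the usage note is purely illustrative. The state table follows the TAP controller state diagram of IEEE Std 1149.1.

```python
# TAP controller transitions: state -> (next state if TMS=0, next state if TMS=1).
NEXT = {
    "Test-Logic-Reset": ("Run-Test/Idle", "Test-Logic-Reset"),
    "Run-Test/Idle": ("Run-Test/Idle", "Select-DR-Scan"),
    "Select-DR-Scan": ("Capture-DR", "Select-IR-Scan"),
    "Capture-DR": ("Shift-DR", "Exit1-DR"),
    "Shift-DR": ("Shift-DR", "Exit1-DR"),
    "Exit1-DR": ("Pause-DR", "Update-DR"),
    "Pause-DR": ("Pause-DR", "Exit2-DR"),
    "Exit2-DR": ("Shift-DR", "Update-DR"),
    "Update-DR": ("Run-Test/Idle", "Select-DR-Scan"),
    "Select-IR-Scan": ("Capture-IR", "Test-Logic-Reset"),
    "Capture-IR": ("Shift-IR", "Exit1-IR"),
    "Shift-IR": ("Shift-IR", "Exit1-IR"),
    "Exit1-IR": ("Pause-IR", "Update-IR"),
    "Pause-IR": ("Pause-IR", "Exit2-IR"),
    "Exit2-IR": ("Shift-IR", "Update-IR"),
    "Update-IR": ("Run-Test/Idle", "Select-DR-Scan"),
}


class SimulatedTap:
    """Hypothetical single-device model; its data register is always IDCODE."""

    def __init__(self, idcode: int):
        self.idcode = idcode & 0xFFFFFFFF
        self.state = "Test-Logic-Reset"
        self.shift_reg = 0

    def clock(self, tms: int, tdi: int = 0) -> int:
        """Apply one TCK cycle; return the TDO bit presented before the edge."""
        tdo = self.shift_reg & 1 if self.state == "Shift-DR" else 0
        if self.state == "Capture-DR":
            self.shift_reg = self.idcode          # parallel load on capture
        elif self.state == "Shift-DR":
            self.shift_reg = (self.shift_reg >> 1) | ((tdi & 1) << 31)
        self.state = NEXT[self.state][tms & 1]
        return tdo


def tap_reset(tap) -> None:
    """Five TCK cycles with TMS=1 reach Test-Logic-Reset from any state."""
    for _ in range(5):
        tap.clock(1)


def read_idcode(tap) -> int:
    """Reset the TAP, then shift 32 bits out of the data register, LSB first."""
    tap_reset(tap)
    for tms in (0, 1, 0, 0):   # Reset -> Idle -> Select-DR -> Capture-DR -> Shift-DR
        tap.clock(tms)
    value = 0
    for i in range(32):
        last = i == 31          # TMS=1 on the final bit leaves Shift-DR
        value |= tap.clock(1 if last else 0) << i
    tap.clock(1)                # Exit1-DR -> Update-DR
    tap.clock(0)                # Update-DR -> Run-Test/Idle
    return value
```

For example, `read_idcode(SimulatedTap(0x4BA00477))` returns the same value it was constructed with and leaves the simulated controller in Run-Test/Idle. On real hardware the optional fifth pin (TRST) would offer an alternative to the five-clock reset.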

Some issues that need to be addressed are:
  • The type of test actions that should be performed during POST
  • The interface boundary between external test tools and embedded tooling
  • Real-time diagnostic results vs. Off-line diagnostics from stored failure information
  • What diagnostic resolution is required for POST?
  • Formats of information passed from external test tools to embedded test tools
  • What information should be reported?
  • How does BSE POST integrate into the usual POST process?
  • What value does BSE POST bring to the POST process?
This forum is a place to keep track of new questions as they come up and of the responses to those questions. Please browse this forum often and add your own feedback.
Best Regards,
Brad

Moderator note: Edited list formatting
Last edited by Ian McIntosh on Thu Jun 11, 2009 6:49 am, edited 1 time in total.
Reason: Added links to meeting minutes
Bradford Van Treuren
Distinguished Member of Technical Staff
NOKIA MN

Bradford Van Treuren
SJTAG Chair Emeritus
Posts: 107
Joined: Fri Nov 16, 2007 2:06 pm
Location: NOKIA / USA

Questions that have been raised and need answers

Post by Bradford Van Treuren » Thu Feb 07, 2008 7:45 pm

  1. Why do we need to preserve detailed diagnostic data if the FRU fails?
  2. How many defects really occur at power-up compared to after the board runs and heats up?
  3. What is the process for performing BSE POST when the board must be reset following JTAG testing to recover core circuit sanity?
  4. How should one perform POST on a carrier with hot swappable mezzanine boards?
  5. How does one extract the diagnostic failure information from the POST result storage?
  6. What is the list of Boundary-Scan actions that may be performed as part of POST?
  7. Is real-time diagnostic reporting necessary for POST?
  8. Where is real-time diagnostics reporting useful?
  9. Are SVF and STAPL sufficient to represent test data between test generation tooling and the embedded tooling?
  10. What do we need to preserve and report as part of diagnostics (e.g., time/date, failing bit, failing pin/net)?
  11. Are there two forms of embedded boundary-scan required to support POST and on-demand testing/upgrades?
  12. Concerning diagnostics, how important is it to be able to generate diagnostics in the embedded environment vs. some form of representation that can be reviewed later for detailed diagnostics?
  13. How does hardware based JTAG deal with power management staging if it runs before the boot processor is active?
  14. Does BSE POST itself justify the added cost for implementing it on the board or does it only make sense for cases where Boundary-Scan is required for something else?
  15. Does BSE POST require different modes of operation – Quick Start (few tests) vs. Manufacturing (extensive tests)?
  16. How are we to deal with 3rd party mezzanine boards during POST?
  17. Is there a way to unify the results of the diagnostic data at the system level?
  18. Is there a need to unify the results of the diagnostic data at the system level?
  19. How should we manage additional testing required during EST or other life cycle stages of the product?
  20. What parts of the circuit should be tested by POST?
  21. What parts of the circuit should be only tested after boot level self test completes (OS is operational)?
  22. Where is the boundary between POST and System Diagnostics?
  23. How can/should BSE POST be complemented by other functional-type startup tests to optimize overall effectiveness (coverage and defect location) given startup time and overall POST development time constraints?
  24. Is POST focused strictly at the FRU module test?
  25. Are there multiple modes of BScan test required?
  26. Is there a difference between Guoqing's Manufacturing test and Initiated or On-demand test?
  27. Where is POST testing important?
  28. How is interoperability between tool vendors supposed to be achieved?
  29. How will interoperability between multiple board vendors be done?
  30. How do we unify the results?
  31. What information do we need to report back?
Moderator note: Edited list formatting
Last edited by Bradford Van Treuren on Sun Feb 17, 2008 11:48 pm, edited 4 times in total.
Bradford Van Treuren
Distinguished Member of Technical Staff
NOKIA MN

Ian McIntosh
SJTAG Chair
Posts: 429
Joined: Mon Nov 05, 2007 11:49 pm
Location: Leonardo, UK

Re: Questions that have been raised and need answers

Post by Ian McIntosh » Thu Feb 07, 2008 9:51 pm

Bradford Van Treuren wrote: Does BSE POST require different modes of operation – Quick Start (few tests) vs. Manufacturing (extensive tests)?
I'd say "probably not", but it may depend on your environment: if you're building boards with pre-programmed devices, then a BSE POST may be an effective way of commissioning the board. But we mainly build with unprogrammed devices (due to the relative volatility of the firmware), so I'd be inclined to use an externally driven BSCAN for first manufacture. On top of which, there are failure modes which would stop any POST running at all, and I'd rather do my tests and my diagnostics all in one place (high volume producers may well want a separate diagnostic facility, though).

I think your extended POST could be better described as an Initiated Self-Test (IBIT in my terminology, as opposed to PBIT)?

Bradford Van Treuren
SJTAG Chair Emeritus
Posts: 107
Joined: Fri Nov 16, 2007 2:06 pm
Location: NOKIA / USA

Post by Bradford Van Treuren » Thu Feb 07, 2008 10:11 pm

I would tend to agree that there are partitions of diagnostics (POST being one of them). It also comes down to how much may be tested in the time budget allotted for POST. Also, there may need to be some configuration of, say, a carrier board before mezzanine boards reach a state in which they may be tested (e.g., power sequencing). Thus, the test of these mezzanines may have to be deferred until later, indicating a need for a more extensive test procedure for the carrier assembly. I am not sure whether this constitutes an extended POST, but from the mezzanine perspective it is going through a POST state.

I also agree that manufacturing test may take advantage of on-demand testing or what you call IBIT.

The other question that comes up regarding POST is when it should run. If you must perform your interconnect test using configured FPGAs, then it must occur following the configuration sequence. If that sequence happens as a side effect of the software boot operation, with the FPGA image stored in the boot code space/FLASH, there is a sequencing issue that has to be resolved when restarting the board following an after-POST reset. Thus, this question is really not that straightforward and is more of an "it depends" response. Guoqing seems to feel very strongly that there is a need for different modes of operation. Perhaps this is another topic for discussion and another question to answer as part of the use case: what are the different board states where POST should be allowed to operate?
Bradford Van Treuren
Distinguished Member of Technical Staff
NOKIA MN

Bradford Van Treuren
SJTAG Chair Emeritus
Posts: 107
Joined: Fri Nov 16, 2007 2:06 pm
Location: NOKIA / USA

Re: Power-On Self Test (POST)

Post by Bradford Van Treuren » Fri Mar 27, 2009 8:59 pm

Within the scope of POST, I have been trying to analyze the areas of potential use of POST and what environments exist to support, define, and restrict the use of POST in a circuit's life cycle. I have been primarily looking at Manufacturing Test Process, System Integration, Field Support, and Repair Facility uses for POST. Much of what is discovered about these domains can also be applied to such test processes as EST and Design Verification.

POST is a critical part of system test or monitoring of complex systems because it identifies problems before a system becomes dependent on the affected resources. Alternative actions can then be taken to ensure the activation of the system, or to prevent the failure of a system function when that resource is required at some later point in time. The level of detail of the results of a POST test depends on the purpose of the test. For system validation, as described above, a simple GO/NO-GO result is sufficient to flag a resource as usable or not. However, to determine the root cause of the failure, more detailed diagnostics are required. POST is generally performed as part of an automated testing process in a system. As such, the operators are generally not continuously monitoring the output of the results. Therefore, the results need to be formatted in some persistent form so that they can be reviewed later by a test operator. How promptly these results are reviewed is critical because of the volume of data that could be generated during each power cycle: multiple attempts to restart a board could end in lost diagnostic data or data overload.
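One way to picture the trade-off between lost data and data overload is a bounded result log, sketched below in Python. This is only an assumption about how such storage might be organised: the limits `MAX_RUNS` and `MAX_FAILURES_PER_RUN`, the JSON layout and the field names are all hypothetical, and writing the returned string to non-volatile storage is left to the platform.

```python
import json
from typing import List

MAX_RUNS = 4                # hypothetical: power cycles kept in the log
MAX_FAILURES_PER_RUN = 8    # hypothetical: detailed failures kept per run


def append_run(stored_json: str, run_id: int, failures: List[str]) -> str:
    """Add one POST result to the stored log and return the new log text.

    `stored_json` is the text previously read back from persistent storage
    ("[]" for an empty log). The oldest runs are dropped once MAX_RUNS is
    exceeded, and the per-run detail is capped so that a badly failing
    board cannot exhaust the storage.
    """
    runs = json.loads(stored_json)
    runs.append(
        {
            "run": run_id,
            "status": "FAIL" if failures else "PASS",
            "failure_count": len(failures),                 # true total
            "failures": list(failures[:MAX_FAILURES_PER_RUN]),
        }
    )
    return json.dumps(runs[-MAX_RUNS:])
```

Keeping only the most recent runs is just one policy; pinning the first failing run so that repeated restarts cannot push it out would be an equally reasonable choice, depending on which data the repair process values most.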

The existence of POST is also useful as a process monitoring tool. POST provides instant feedback that represents a consistent operational model of a board that does not change over time or locality. Since it runs any time power is applied to a board, the sanity of a design can be tracked as the board travels from station to station. Thus, the failure of POST would indicate a point in the process where a handling or assembly operation caused a defect to occur in that board. This immediate indictment of a board is an important feedback tool for a continuous process improvement assembly operation. This monitoring is also available through system integration testing, Field Installation, Field Support, and Repair processing.

Since boundary-scan tests are able to perform identification of devices (e.g., IDCODE, USERCODE), POST is able to act as a monitor for allowable parts substitution at remote manufacturing sites. The validation of part codes could be reported as part of the scan path integrity test or as a separate test operation. It must be noted that success of this type of monitoring requires close interaction between the test engineering community, design community, and the parts procurement organization to ensure all known versions of the parts are accounted for.
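A small Python sketch of such a part-substitution check follows. The field layout (bit 0 fixed at 1, an 11-bit manufacturer code, a 16-bit part number and a 4-bit version) is the IEEE Std 1149.1 device identification register format; the `APPROVED` table, the reference designator and the raw values in the usage note are hypothetical and would in practice have to be maintained jointly by test engineering, design and procurement, as noted above.

```python
from typing import NamedTuple


class IdCode(NamedTuple):
    version: int        # bits 31..28
    part_number: int    # bits 27..12
    manufacturer: int   # bits 11..1 (compressed JEDEC manufacturer code)


def decode_idcode(raw: int) -> IdCode:
    """Split a 32-bit IDCODE value into its fields."""
    if raw & 1 == 0:
        # Bit 0 of the identification register is always 1.
        raise ValueError("bit 0 is 0: not a valid IDCODE value")
    return IdCode(
        version=(raw >> 28) & 0xF,
        part_number=(raw >> 12) & 0xFFFF,
        manufacturer=(raw >> 1) & 0x7FF,
    )


# Hypothetical approved-parts list:
# reference designator -> {(manufacturer, part number): allowed versions}
APPROVED = {
    "U1": {(0x23B, 0xBA00): {0, 4}},
}


def part_is_approved(refdes: str, raw: int) -> bool:
    """Check a device's IDCODE against the approved substitutions for its site."""
    code = decode_idcode(raw)
    allowed = APPROVED.get(refdes, {})
    versions = allowed.get((code.manufacturer, code.part_number), set())
    return code.version in versions
```

With this table, `part_is_approved("U1", 0x4BA00477)` is accepted while `part_is_approved("U1", 0x2BA00477)` is rejected, because only the version field differs and version 2 is not listed.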

For POST operation in a system, there is probably no need to support real-time diagnostics other than GO/NO-GO status, as this is an automated test. However, on-demand testing (Initiated Testing - IBIT) requires real-time diagnostics to aid in determining the source of the failure. The mechanism for reporting the results will differ depending on the environment POST is running in. For example, POST operating in a system needs to be able to inform the system diagnostics manager program of the outcome. When POST runs at a manufacturing or repair station, there needs to be a mechanism to inform the test operator that a POST test failed. Typically, some form of visual or audible indicator (e.g., red LED, BEEP tones) is used to indicate there is a problem with the board. Once a failure of a board is identified, the board is removed and sent to a repair station where the diagnostic results are extracted and analyzed. It should be noted that this process can be cost prohibitive or impossible, because it needs a special repair station, which might not exist at a facility, and because the extra time and handling of the product might introduce other problems or temporarily resolve the current problem. Thus, some implementations of POST provide real-time diagnostic information on the device pin and/or net locations of the failures detected, through a special diagnostics console or message reporting mechanism. From this information, test operators can divert the product directly to a rework area, eliminating the need for an additional debug station. Some operators prefer this type of diagnostics because they may be able to inspect the board and see whether dirt or foreign matter, which they can easily clear, is causing the defect.

Another consideration is whether the same test mechanism can be reused for on-demand testing as well as for automated periodic testing. The mechanism which triggers the POST operation is not the same as the one that triggers a test after a board is initialized. However, the same tests are probably needed in the latter case. The time it takes to execute the tests might limit the number of tests that can be applied during an initialization time window. Therefore, more tests may need to be applied during normal system diagnostics operations than during POST. If test resources and data can be reused for multiple objectives, the cost of supporting boundary-scan testing in the system can be reduced.
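The effect of an initialization time window on test selection can be sketched as follows. This is a simple greedy illustration in Python, not a recommended scheduling policy: the `Test` record, the priorities, the durations and the test names are all hypothetical, and the point is only that one shared test list can serve both POST and later on-demand diagnostics.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass(frozen=True)
class Test:
    name: str
    duration_ms: int  # estimated execution time (hypothetical figures)
    priority: int     # lower number = more important to run at power-on


def plan_tests(tests: Iterable[Test], budget_ms: int) -> Tuple[List[str], List[str]]:
    """Split one shared test list into a POST set and a deferred set.

    Tests are considered in priority order; each one that still fits in the
    remaining time budget runs during POST, the rest are deferred to
    on-demand or system diagnostics.
    """
    post: List[str] = []
    deferred: List[str] = []
    remaining = budget_ms
    for test in sorted(tests, key=lambda t: t.priority):
        if test.duration_ms <= remaining:
            post.append(test.name)
            remaining -= test.duration_ms
        else:
            deferred.append(test.name)
    return post, deferred
```

Given an infrastructure test (50 ms), an interconnect test (300 ms), a memory test (800 ms) and a cluster test (100 ms) in that priority order, a 500 ms budget would run all but the memory test during POST and defer the memory test.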

Lastly, the issue of circuit recovery as part of system test must be discussed. Input signals continue to stimulate the core of a device while vectors are being applied to the Boundary-Scan Register (BSR), and the vector patterns do not always present normal operating signal combinations at these inputs. As a result, the core logic of a device may enter a state from which it cannot recover without a forced reset or power cycle. Thus, applying tests in the system domain requires a recovery strategy which forces the board logic to reset itself to a known good state following some types of boundary-scan tests (e.g., interconnect tests, cluster tests). In some cases, a warm boot or cold boot will reset the logic to this state. In other cases, a power cycle is required. When the recovery process requires an operation that will retrigger the execution of POST, a persistent mechanism (a persistent flag) is required to bypass POST during this recovery phase.
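The persistent-flag idea can be sketched in a few lines of Python. This is an assumption about one possible boot flow rather than a description of any existing implementation: the dictionary `nv` stands in for whatever non-volatile storage the board offers, and the key names and return strings are hypothetical.

```python
from typing import Callable, MutableMapping


def boot(
    nv: MutableMapping[str, object],
    run_post: Callable[[], str],
    needs_reset_after_post: bool = True,
) -> str:
    """Decide what to do on one pass through the boot entry point.

    `nv` models storage that survives the reset or power cycle used for
    recovery. Returns "reset-requested" when the board must be restarted
    to recover its core logic, otherwise "normal-boot".
    """
    if nv.get("skip_post_once"):
        # This start-up is the recovery reset: clear the flag and boot
        # normally instead of running POST again.
        nv["skip_post_once"] = False
        return "normal-boot"

    nv["last_post_result"] = run_post()
    if needs_reset_after_post:
        # Set the flag *before* requesting the reset so it is already
        # persistent when the board restarts.
        nv["skip_post_once"] = True
        return "reset-requested"
    return "normal-boot"
```

Because the flag is cleared as soon as it is consumed, the next genuine power-on runs POST again; the stored `last_post_result` remains available to the normal boot that follows the recovery reset.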
Last edited by Ian McIntosh on Fri Mar 27, 2009 9:25 pm, edited 1 time in total.
Reason: Fixed typos
Bradford Van Treuren
Distinguished Member of Technical Staff
NOKIA MN

Ian McIntosh
SJTAG Chair
Posts: 429
Joined: Mon Nov 05, 2007 11:49 pm
Location: Leonardo, UK

Re: Power-On Self Test (POST)

Post by Ian McIntosh » Fri Mar 27, 2009 10:25 pm

Bradford Van Treuren wrote: Another consideration is whether the same test mechanism can be reused for on-demand testing as well as for automated periodic testing. The mechanism which triggers the POST operation is not the same as the one that triggers a test after a board is initialized. However, the same tests are probably needed in the latter case. The time it takes to execute the tests might limit the number of tests that can be applied during an initialization time window. Therefore, more tests may need to be applied during normal system diagnostics operations than during POST. If test resources and data can be reused for multiple objectives, the cost of supporting boundary-scan testing in the system can be reduced.
In the language of my organisation POST is generally referred to as PBIT (Power-on Built-in Test). We also tend to have Initiated BIT (IBIT, sometimes called Interruptive BIT) and Continuous BIT (CBIT). What each of these actually means, in terms of the constraints applied to each, tends to be rather application dependent, so it is difficult to describe each in a concise way, but here's an attempt:
  • PBIT - In virtually all of our applications this will have a contractually defined maximum execution time, but in many cases this could be quite long (several minutes) because within that there may be lengthy operations like self-calibration to conduct. I expect other industries may have the opposite case, with POST having maximum execution times in seconds. As Brad notes, POST usually has the luxury of being able to run tests prior to board configuration, so if a reasonably long execution time is allowed then quite thorough testing (and perhaps diagnostics) is possible.
  • CBIT - can be considered as a "background BIT" executed continually (or at least with regular periodicity) throughout (most) mission modes of operation, and must not degrade the execution of any mission function/capability. In the past, military systems tended to use an interrupt driven time-slicing scheme for allocating processor resources to multiple tasks. Today threading will do the same job. In either case, some limited proportion of processing resources will be allocated to CBIT. In some cases this will mean that either a very limited amount of time (maybe a few tens of milliseconds) will be allowed for CBIT or the test(s) must be spread over a relatively extended period. This, combined with the constraint on mission impact, probably precludes the use of JTAG in most cases.
  • IBIT - It is important to differentiate this from CBIT. IBIT is usually event driven: either an operator-requested test, or the result of some other indicator, such as a CBIT fault detection. IBIT is allowed to interrupt mission behaviour, but as it is often allowed a significant time for execution, IBIT may be "blocked" during critical mission modes, or at least deferred until exit from the critical mode (if you're in the middle of bringing an aircraft in to a landing by instruments would you rather have a radar that might have a fault or one that goes completely off-line for several seconds?). On completion of IBIT, the system is expected to return to a mission state, although depending on the IBIT result, the system may be "capability limited", e.g. if a suspected faulty module has been shut down. In such a case CBIT needs to be told that IBIT has shut the module down so that it doesn't then trigger multiple spurious alerts and a recurring cycle of IBIT requests.
As a summary (and again, this may not be the case in other industry sectors): POST/PBIT has the greatest potential to detect and to diagnose faults, followed by IBIT with CBIT generally being the least comprehensive form of Built-in Test.
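The interaction between CBIT and IBIT described above (deferral during critical mission modes, and CBIT being told about modules that IBIT has shut down) might be sketched like this in Python. It is purely illustrative: `BitManager`, its method names, the status strings and the module name in the usage note are all hypothetical and do not describe any particular system.

```python
from typing import Callable, Iterable, Optional, Set


class BitManager:
    """Hypothetical arbiter between CBIT fault reports and IBIT runs."""

    def __init__(self, ibit_test: Callable[[], Iterable[str]]):
        self._ibit_test = ibit_test      # returns names of modules found faulty
        self.critical_mode = False
        self.ibit_pending = False
        self.shut_down: Set[str] = set()
        self.ibit_runs = 0

    def request_ibit(self) -> str:
        """Run IBIT now, or defer it while a critical mission mode is active."""
        if self.critical_mode:
            self.ibit_pending = True
            return "deferred"
        return self._run_ibit()

    def set_critical_mode(self, active: bool) -> Optional[str]:
        """Enter or leave a critical mode; a deferred IBIT runs on exit."""
        self.critical_mode = active
        if not active and self.ibit_pending:
            return self._run_ibit()
        return None

    def cbit_report(self, module: str, healthy: bool) -> str:
        """Handle one CBIT observation.

        Faults on modules that IBIT already shut down are ignored, so CBIT
        does not raise repeated alerts or a recurring cycle of IBIT requests.
        """
        if healthy or module in self.shut_down:
            return "no-action"
        return self.request_ibit()

    def _run_ibit(self) -> str:
        self.ibit_pending = False
        self.ibit_runs += 1
        self.shut_down |= set(self._ibit_test())
        return "capability-limited" if self.shut_down else "full-capability"
```

For instance, with an IBIT routine that reports a module named `radar_rx` as faulty, a CBIT fault raised during a critical mode returns "deferred", leaving the critical mode then runs IBIT once and returns "capability-limited", and later CBIT faults on `radar_rx` return "no-action".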
Ian McIntosh
Testability Lead
Leonardo UK
