Reliability, Availability, and Serviceability (RAS), for A-profile architecture
Source: https://developer.arm.com/documentation/102105/latest/
1 Introduction to RAS
1.1 Faults, errors, and failures
How the three concepts differ:
• A failure is the event of deviation from correct service. This includes data corruption, data loss, and service loss.
• An error is the deviation from correct service. An incorrect value that has an error is corrupt.
• A fault is the cause of the error.
There are many sources of faults in a system, including both software and hardware faults:
• Hardware faults originate in, or affect, hardware.
• Software faults affect software, that is programs or data.
The RAS Extension and RAS System Architecture primarily address errors produced from hardware faults. These fall into two main areas:
• 1. Transient faults.
• 2. Non-transient or persistent faults.
1.2 General taxonomy of errors
1.2.1 Error detection
When a component accesses memory or other state, an error might be detected in that memory or state.
The error might be corrected or deferred by the component, or signaled to another component as either a deferred error or a detected error.
1.2.2 Error propagation
An error is propagated by deviations from correct service, including when any of the following occurs that would not have been permitted to occur had the fault not been activated:
1) Scenarios in which an error is propagated:
• 1. A corrupt value is passed from producer to consumer.
A corrupt value is handed from the producer to the consumer.
• 2. A transaction or other operation occurs that should not have occurred.
A transaction or other operation takes place that should not have.
• 3. A transaction or other operation that should have occurred does not occur.
A transaction or other operation that should have taken place does not.
• 4. A loss of uniprocessor semantics or any other loss of coherency in a multiprocessor coherent system is observed.
Uniprocessor semantics or coherency is lost in a coherent multiprocessor system.
• 5. Changing the timing and/or order of transactions or other operations such that the timing and/or order of those transactions or operations is incorrect. In this case, the service interface defines acceptable timings and/or orders for transactions and other operations.
The timing or order of transactions is changed so that it becomes incorrect.
An error is silently propagated by the producer of a transaction if the consumer of the transaction cannot detect the error and consumes an undetected error because of the transaction. This might be because of one of the following:
2) Reasons an error is silently propagated by the producer:
• 1. The error is present on the transaction, but was not detected by the producer. The error is silently propagated by the producer.
The transaction carries the error, but the producer never detected it, so the producer propagates it silently.
• 2. The error is present on the transaction, but was not signaled to the consumer as an error. For example, a corrupt value was passed in the transaction with no indication that it was corrupt. The error is silently propagated by the producer.
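The transaction carries the error, but nothing signals it to the consumer as an error. For example, a corrupt value is passed with no indication that it is corrupt, so the producer propagates it silently.
The difference between the two: in the first case the producer itself cannot detect the error, so the error propagates; in the second the producer does not mark the error for the consumer, so the error propagates.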
Errors might be propagated by components in a system until one of the following occurs:
3) Components may keep propagating an error until one of the following happens:
• They are masked and do not affect the outcome of the system.
The error might be masked because a corrupt value is discarded or overwritten, or the error is detected and removed.
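The error is masked and does not affect the system's outcome: the corrupt value may be discarded or overwritten, or the error may be detected and removed.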
• They affect the service interface of the system and possibly cause failure. If the error has been silently propagated to the service interface then:
– This is a Silent Data Corruption (SDC).
– The rate of such failures, measured as the number of failures per billion device-hours of operation, is called the SDC Failure-in-Time (FIT) rate.
Alternatively, the error might have been detected, causing the system to invoke error handling and recovery.
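The error affects the system's service interface and may cause a failure. If the error reached the service interface silently:
– This is Silent Data Corruption (SDC).
– The rate of such failures, counted as failures per billion device-hours of operation, is the SDC Failure-in-Time (FIT) rate.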
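As a worked illustration of the FIT definition above, the following minimal Python sketch converts an observed failure count into a FIT rate. The function name fit_rate and the fleet figures are hypothetical and exist only for this example; they do not come from the RAS specification.

```python
def fit_rate(failures: int, devices: int, hours_per_device: float) -> float:
    """Return failures per billion (1e9) device-hours of operation."""
    device_hours = devices * hours_per_device
    if device_hours <= 0:
        raise ValueError("device-hours must be positive")
    return failures / device_hours * 1e9


# Hypothetical fleet: 10,000 devices, each running for one year (8,760 hours),
# with 3 silent data corruptions observed in total.
sdc_fit = fit_rate(failures=3, devices=10_000, hours_per_device=8_760)
print(f"SDC FIT rate: {sdc_fit:.2f}")
```

With these made-up numbers the fleet accumulates 87.6 million device-hours, so 3 failures correspond to roughly 34.25 FIT. The same arithmetic applies to the DUE FIT rate when the failures counted are detected uncorrected errors.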
1.2.3 Infected and poisoned
The state of a component becomes infected when the component consumes an uncorrected error that updates
the state.
A component's state becomes infected when the component consumes an uncorrected error that updates that state.
A value is poisoned in the state of a component if it is marked as being in error, such that a subsequent access of
the state will detect the value is so marked and is treated as a detected error.
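A value is poisoned within a component's state if it is marked as being in error, so that any later access to the state detects the mark and treats the value as a detected error.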
Poison is used to defer an error.
Poison is a means of deferring an error.
1.2.4 Containable and uncontainable
An undetected error is uncontained at the component that failed to detect it.
An error that goes undetected is uncontained at the component that missed it.
A silently propagated error is uncontained at the component that silently propagated it.
A silently propagated error is uncontained at the component that propagated it.
A detected uncorrected error is uncontainable at the component if it might be uncontained at the component.
A detected uncorrected error is uncontainable at a component if there is any chance that it is uncontained there.
A detected uncorrected error is containable at the component if it is not uncontainable at the component. If
the component cannot determine whether a detected uncorrected error is uncontainable or containable at the
component, then the component treats the detected uncorrected error as uncontainable at the component.
An error that is uncontainable at a component might be containable at the system level.
An error that a component cannot contain may still be containable at the system level.
Note:
Reporting an error as containable allows software to contain the error. This does not mean that hardware has
contained the error.
Reporting an error as containable lets software contain it; it does not mean hardware has already contained it.
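The containability rules above reduce to a small decision with a conservative default. The sketch below is one way to express it in Python, assuming a hypothetical three-valued input where None stands for "the component cannot determine"; the names Containment and classify_detected_uncorrected are illustrative only.

```python
from enum import Enum
from typing import Optional


class Containment(Enum):
    CONTAINABLE = "containable"
    UNCONTAINABLE = "uncontainable"


def classify_detected_uncorrected(might_be_uncontained: Optional[bool]) -> Containment:
    """Classify a detected uncorrected error at one component.

    might_be_uncontained:
      True  -> the error might already be uncontained at the component
      False -> the component knows the error has not escaped
      None  -> the component cannot determine which case applies
    """
    # An unknown status is treated as uncontainable, the conservative choice.
    if might_be_uncontained is None or might_be_uncontained:
        return Containment.UNCONTAINABLE
    return Containment.CONTAINABLE


print(classify_detected_uncorrected(None).value)
```

The result describes the error at the component only. As the text notes, an error classified as uncontainable here might still be containable at the system level.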
1.3 Techniques for improving reliability, availability, and serviceability
1.3.1 Fault prevention and fault removal
Fault prevention and fault removal are two techniques for handling faults. Fault prevention and fault removal
mechanisms are IMPLEMENTATION DEFINED.
Fault prevention techniques are outside the scope of the architecture.
Fault prevention techniques fall outside the architecture's scope.
A fault that is removed is a corrected error and might be recorded and generate a fault handling interrupt, but it
is not propagated. This means that it is not consumed and does not cause service failure.
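Fault removal, for example: a corrected error may be recorded and raise a fault handling interrupt, but it is not propagated. It is therefore never consumed and does not cause a service failure.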
A common technique to detect and correct errors is the use of an Error Detection and Correction Code (EDAC),
more commonly referred to as simply an Error Correction Code (ECC). ECC schemes use mathematical codes
to detect and correct an error in a value in memory. The size of the value is the protection granule for the ECC
scheme.
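A common way to detect and correct errors is an Error Detection and Correction Code (EDAC), usually called simply an Error Correction Code (ECC). ECC schemes use mathematical codes to detect and correct an error in a value in memory. The size of that value is the protection granule of the ECC scheme.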
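The RAS Extension and RAS System Architecture do not require the implementation of any fault removal schemes,
including ECC.
Neither the RAS Extension nor the RAS System Architecture obliges an implementation to provide a fault removal scheme, ECC included.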
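To make the idea of an ECC concrete, here is a minimal Python sketch of a Hamming(7,4) code, which corrects any single-bit error in a 4-bit protection granule. This code is chosen purely for illustration; the RAS specification does not mandate it, and real memory ECC typically protects much larger granules with stronger codes. The function names are hypothetical.

```python
from typing import List, Tuple


def hamming74_encode(data: List[int]) -> List[int]:
    """Encode 4 data bits into a 7-bit codeword (positions 1..7)."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4  # parity over positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # parity over positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4  # parity over positions 4, 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]


def hamming74_decode(codeword: List[int]) -> Tuple[List[int], bool]:
    """Return (data bits, corrected flag), fixing at most one flipped bit."""
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s4  # 1-based position of the bad bit, 0 if none
    if syndrome:
        c[syndrome - 1] ^= 1  # flip the bit back: this is the corrected error
    return [c[2], c[4], c[5], c[6]], syndrome != 0


word = hamming74_encode([1, 0, 1, 1])
word[4] ^= 1  # inject a single-bit fault into the stored codeword
data, corrected = hamming74_decode(word)
print(data, corrected)
```

A decode that returns corrected as True corresponds to a corrected error: it could be recorded and reported, but the consumer still receives the right data. This particular code cannot tell a double-bit error from a single-bit one and would miscorrect it, which is why practical schemes add further check bits.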
1.3.2 Error handling and recovery
A fault that is not removed gives rise to an uncorrected error.
A fault that is not removed leads to an uncorrected error (for example, single-bit ECC errors accumulating into a double-bit ECC error).
Error recovery is the process by which software and hardware minimize the impact of an uncorrected error.
Error recovery is the process by which software and hardware keep the impact of an uncorrected error as small as possible.
Error recovery methods include:
• Deferring an error from a fault. An error is deferred by hardware if hardware can make forward progress
without consuming the error. Deferring the error means:
– The fault might become masked later (fault removal). For example, because the corrupt value is
overwritten before it is consumed.
The fault may be masked later (fault removal), for example because the corrupt value is overwritten before it is consumed.
– If the deferred error is later consumed, then the error is reported at the point of consumption. For
example, if the deferred error is consumed by a Processing Element (PE) then the consumer PE
generates an error exception. This can give better results in terms of error recovery in the case where
the original producer of the data is not known when the error was deferred. For example, because a
latent error was detected.
If the deferred error is consumed later, the error is reported at the point of consumption.
For example, if a Processing Element (PE) consumes the deferred error, the consuming PE generates an error exception.
This can give better error recovery results when the original producer of the data was not known at the time the error was deferred, for example because a latent error was detected.
A common technique to defer an error is to replace the corrupt value with a poisoned value, for example in
memory or in a transaction.
A common way to defer an error is to replace the corrupt value with a poisoned value, for example in memory or in a transaction.
• Preventing further propagation of the error, that is, containing the error. In particular, preventing silent
propagation of the error.
Prevent the error from propagating further, that is, contain it. In particular, prevent silent propagation.
• Reducing the severity of a failure by invoking a service failure mode:
– This is a Detected Uncorrected Error (DUE).
– The rate of such failures gives the DUE FIT rate.
– The type of service failure mode depends on what is acceptable to the service.
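The poison-based deferral described in the list above can be pictured with a toy memory model. The Python sketch below is a simplification under stated assumptions: the class and exception names are hypothetical, each address holds one value with one poison flag, and a later write is assumed to overwrite the whole value and clear the poison.

```python
class PoisonConsumedError(Exception):
    """Raised when a poisoned value is consumed, i.e. a detected error."""


class PoisonableMemory:
    """Toy memory in which each address holds a value and a poison flag."""

    def __init__(self) -> None:
        self._cells = {}  # address -> (value, poisoned)

    def write(self, address: int, value: int) -> None:
        # Assumption: a write replaces the whole value and clears any poison.
        self._cells[address] = (value, False)

    def poison(self, address: int) -> None:
        # Defer an uncorrected error: keep the location but mark it as in error.
        value, _ = self._cells.get(address, (0, False))
        self._cells[address] = (value, True)

    def read(self, address: int) -> int:
        value, poisoned = self._cells[address]
        if poisoned:
            # The deferred error is reported at the point of consumption.
            raise PoisonConsumedError(f"poisoned value consumed at {address:#x}")
        return value


mem = PoisonableMemory()
mem.write(0x10, 42)
mem.poison(0x10)     # error deferred instead of reported immediately
mem.write(0x10, 7)   # overwritten before consumption: the error is masked
print(mem.read(0x10))
```

If the read had happened before the second write, read would have raised PoisonConsumedError, which plays the role of the error exception generated by the consuming PE.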
A software error recovery agent is typically invoked when hardware detects an error it cannot correct, defer, or
remove.
A software error recovery agent is typically invoked when hardware detects an error that it cannot correct, defer, or remove.
An error recovery agent also provides information to the operator through error logs to improve serviceability,
for example to help with the identification of a Field Replaceable Unit (FRU).
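The error recovery agent also gives the operator information through error logs to improve serviceability, for example to help identify a Field Replaceable Unit (FRU).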
The RAS Extension and RAS System Architecture provide optional common programmers’ models to record
information about an error in an error record.
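The RAS Extension and RAS System Architecture provide optional common programmers' models for recording information about an error in an error record.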
The RAS Extension describes the behavior of a PE when an error is signaled to it by the system, including
invoking a service failure mode by taking an error exception, and optional mechanisms to limit propagation of
an error.
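The RAS Extension describes how a PE behaves when the system signals an error to it, including invoking a service failure mode by taking an error exception, and optional mechanisms that limit error propagation.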
The RAS Extension and RAS System Architecture do not require systems to implement error recovery
mechanisms, including poison, and do not require systems to limit the silent propagation of errors.
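The RAS Extension and RAS System Architecture do not require systems to implement error recovery mechanisms, poison included, nor to limit the silent propagation of errors.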
1.3.3 Fault handling
Fault handling by software is the process by which software diagnoses and responds to faults to improve
availability.
Software fault handling is the process by which software diagnoses faults and responds to them to improve availability.
Fault handling methods include:
• 1. Predictive Failure Analysis (PFA), using information recorded by hardware to trigger pre-emptive action.
Predictive Failure Analysis (PFA) uses information recorded by hardware to trigger pre-emptive action.
The RAS Extension and RAS System Architecture provide optional mechanisms to allow the reporting of errors
and warnings to a fault handling agent, and to record information about the fault in an error record. It is the
responsibility of the error recovery and fault handling processes to collate the error record data and write it to an
error log.
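The RAS Extension and RAS System Architecture provide optional mechanisms for reporting errors and warnings to a fault handling agent and for recording information about the fault in an error record. Collating the error record data and writing it to an error log is the responsibility of the error recovery and fault handling processes.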
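The detailed nature of the fault handling agent is outside the scope of this architecture. Fault handling and error
recovery might be independent agents.
What exactly the fault handling agent looks like is outside the scope of this architecture. Fault handling and error recovery may be separate agents.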
2 RAS Extension for A-profile
2.1 PE error handling
2.1.1 PE error detection
When a PE accesses memory or other state, an error might be detected in that memory or state, and corrected,
deferred, or signaled to the PE as a detected error with an in-band error response.
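When a PE accesses memory or other state, an error may be detected in that memory or state and then corrected, deferred, or signaled to the PE as a detected error through an in-band error response.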
When an error is detected by a component on a read or a cache maintenance operation from the PE:
1) When a component detects an error on a read or cache maintenance operation from the PE:
– 1. If the error can be corrected, it is corrected and corrected data is returned.
A correctable error is corrected, and the corrected data is returned.
– 2. If the error cannot be corrected and can be deferred, it is deferred. For example, on a load by poisoning
the PE state, if this is supported by the PE implementation.
An error that cannot be corrected but can be deferred is deferred, for example on a load by poisoning the PE state, if the PE implementation supports this.
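– 3. If the error cannot be corrected and if implemented and enabled at the component, the detected error
is signaled to the PE as an in-band error response.
If the error cannot be corrected, and if this is implemented and enabled at the component, the detected error is sent to the PE as an in-band error response.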
When an error is detected by a component consuming a write from the PE:
2) When a component consuming a write from the PE detects an error:
– If the error can be corrected, it is corrected.
A correctable error is corrected.
– If the error cannot be corrected and can be deferred, it is deferred to the consumer. For example, by
poisoning the location being written.
An error that cannot be corrected but can be deferred is deferred to the consumer, for example by poisoning the location being written.
– If the error cannot be corrected and if implemented and enabled at the component, the detected error
is signaled to the PE as an in-band error response.
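If the error cannot be corrected, and if this is implemented and enabled at the component, the detected error is sent to the PE as an in-band error response.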
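The read and write cases above share the same order of preference: correct, else defer, else signal in-band. The Python sketch below captures that order as a single function. It is an illustration under assumptions: the three boolean inputs and the Outcome names are hypothetical, the bullets are read as a strict priority order, and the NOT_SIGNALED_IN_BAND outcome is a placeholder for the case the quoted text does not describe.

```python
from enum import Enum


class Outcome(Enum):
    CORRECTED = "corrected"
    DEFERRED = "deferred"
    SIGNALED_IN_BAND = "signaled to the PE as an in-band error response"
    NOT_SIGNALED_IN_BAND = "not signaled in-band"  # placeholder, not from the text


def handle_detected_error(correctable: bool,
                          deferrable: bool,
                          in_band_enabled: bool) -> Outcome:
    """Choose what a component does with an error detected on a PE access."""
    if correctable:
        return Outcome.CORRECTED          # corrected data is used or returned
    if deferrable:
        return Outcome.DEFERRED           # for example by poisoning
    if in_band_enabled:
        return Outcome.SIGNALED_IN_BAND   # implemented and enabled at the component
    return Outcome.NOT_SIGNALED_IN_BAND


print(handle_detected_error(correctable=False, deferrable=True,
                            in_band_enabled=True).name)
```

What deferring means differs between the two cases: on a read the PE state may be poisoned, while on a write the location being written is poisoned so that the error is deferred to a later consumer.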
2.1.2 PE error propagation
The program-visible architectural state of the PE, referred to as the PE state, includes:
• General-purpose, SIMD&FP, and SVE registers.
• System registers.
• Special-purpose registers.
• PSTATE.
An error is consumed by the PE by any of the following:
1) The PE consumes an error in any of the following cases:
• 1. An instruction commits the corruption into the PE state.
An instruction commits the corruption into the PE state.
• 2. The error is on an instruction fetch and the corrupt instruction is committed for execution.
The error is on an instruction fetch, and the corrupt instruction is committed for execution.
• 3. The error is on a translation table walk for a committed load, store, or instruction fetch.
The error is on a translation table walk performed for a committed load, store, or instruction fetch.
An error is propagated by the PE by one or more of the following occurring that would not have been permitted
to occur had the fault not been activated:
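2) The PE propagates an error through one or more of the following events, which would not have been permitted had the fault not been activated: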
• Consumption of the corrupt value by any instruction, propagating the error to the target(s) of the instruction.
This includes:
Any instruction that consumes the corrupt value propagates the error to the instruction's target(s). This includes:
– A store of a corrupt value.
Storing a corrupt value.
– A write of a corrupt value to a System register, Special-purpose register, or PSTATE. Infecting a
System register state might mean that the PE generates transactions that would not otherwise be
permitted.
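Writing a corrupt value to a System register, Special-purpose register, or PSTATE. Infecting System register state may cause the PE to generate transactions that would otherwise not be permitted.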
• Any operation occurring that should not have occurred, including:
Any operation that occurs but should not have, including:
– 1. A load, translation table walk, or instruction fetch that would not have been permitted, including those
from hardware speculation or prefetching.
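A load, translation table walk, or instruction fetch that would not have been permitted, including ones caused by hardware speculation or prefetching.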
– 2. A store to an incorrect address, or a store that would not have been made or not permitted.
A store to the wrong address, or a store that would not have been made or would not have been permitted.
– 3. A direct or indirect write to a Special-purpose or System register that would not have been made or
not permitted.
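A direct or indirect write to a Special-purpose or System register that would not have been made or would not have been permitted.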
– 4. Assertion of any signal, such as an interrupt, that would not have been asserted.
Asserting any signal, such as an interrupt, that would not otherwise have been asserted.
• Any operation not occurring that should have occurred.
Any operation that should have occurred but did not.
• Causing the PE to take an imprecise exception, other than an error exception in response to the error itself.
See the section Definition of a precise exception in the Arm® Architecture Reference Manual, for A-profile
architecture.
The PE is caused to take an imprecise exception other than an error exception taken in response to the error itself.
• The PE discarding data that it holds in a modified state.
The PE discards data that it holds in a modified state.
• Any other loss of required uniprocessor semantics, ordering, or coherency.
Any other loss of the uniprocessor semantics, ordering, or coherency that is required.
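An error propagated by the PE is silently propagated by the PE only if all of the following are true:
A PE-propagated error counts as silently propagated only when every one of the following conditions holds:
• The propagation is not part of the required operation of the PE in taking an error exception generated by
the error.
The propagation is not part of what the PE is required to do when taking an error exception generated by the error.
• The propagation is not part of the required operation of the PE executing an ESB instruction that
synchronizes the error.
The propagation is not part of what the PE is required to do when executing an ESB instruction that synchronizes the error.
• The error is not signaled to the consumer as a detected error or deferred error.
The consumer is not told about the error, either as a detected error or as a deferred error.
• Any of the following are true:
– The corrupt value is held in other than the general-purpose, SIMD&FP, or SVE registers.
The corrupt value is held somewhere other than the general-purpose, SIMD&FP, or SVE registers.
– The error is propagated by an instruction in program order before either taking an error exception
generated by the error or executing an ESB instruction that synchronizes the error, and is propagated
to outside of the general-purpose, SIMD&FP, or SVE registers.
Before the PE either takes an error exception generated by the error or executes an ESB instruction that synchronizes it, an instruction in program order propagates the error to somewhere outside the general-purpose, SIMD&FP, or SVE registers.
– The error is propagated other than by an instruction that consumes the corrupt value as an input
operand but otherwise behaves correctly.
The error is propagated by some means other than an instruction that takes the corrupt value as an input operand and otherwise behaves correctly.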
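The conditions for silent propagation by the PE combine as "all of the first three, plus at least one of the last three". The Python sketch below encodes that structure as a predicate. The Propagation record and its field names are hypothetical labels for the conditions listed above; how a real implementation would derive these facts is not covered here.

```python
from dataclasses import dataclass


@dataclass
class Propagation:
    """Facts about one propagation of an error by the PE (illustrative only)."""
    part_of_error_exception_entry: bool     # required operation of taking the error exception
    part_of_esb_synchronization: bool       # required operation of an ESB that synchronizes the error
    signaled_to_consumer: bool              # signaled as a detected or deferred error
    held_outside_gp_simd_sve: bool          # corrupt value held outside GP, SIMD&FP, SVE registers
    escaped_registers_before_sync: bool     # propagated outside those registers before exception or ESB
    not_by_plain_operand_consumption: bool  # propagated other than by a correct instruction using it as input


def is_silently_propagated(p: Propagation) -> bool:
    """Return True only if all top-level conditions hold and any sub-condition holds."""
    any_sub_condition = (p.held_outside_gp_simd_sve
                         or p.escaped_registers_before_sync
                         or p.not_by_plain_operand_consumption)
    return (not p.part_of_error_exception_entry
            and not p.part_of_esb_synchronization
            and not p.signaled_to_consumer
            and any_sub_condition)


# A corrupt register value stored to memory before any synchronization,
# with nothing signaled to the consumer: silent propagation.
store_before_sync = Propagation(False, False, False, False, True, False)
print(is_silently_propagated(store_before_sync))
```

One consequence of the nested condition is that a corrupt value which stays within the general-purpose, SIMD&FP, or SVE registers and is only consumed as an ordinary input operand is propagated, but not silently propagated, under these rules.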
2.1.3 Other errors – 2024.03.17 Resume from here next week