6.7 Error control activities

Error control covers the processes involved in successful correction of Known Errors. The objective is to change IT components to remove Known Errors affecting the IT infrastructure and thus to prevent any recurrence of Incidents.

Many IT departments are concerned with error control, and it spans both the live and development environments. It directly interfaces with, and operates alongside, Change Management processes. Figure 6.3 shows the three phases of the error control process. The monitoring and tracking phase covers the entire Problem/error life-cycle.

Figure 6.3 - Error control

6.7.1 Error identification and recording

An error is identified when a faulty CI (a CI that causes, or may be likely to cause, Incidents) is detected. A Known Error status is assigned when the root cause of a Problem is found and a Work-around has been identified.

There are two sources of Known Error data that feed the error control system. One is the Problem control subsystem in the live environment; the other is its equivalent in the development environment. Errors found during live operations are identified and recorded as described under the Problem control activity of investigation and diagnosis. In this case, the Problem record forms the basis of the Known Error record (indeed, it really involves only a change of status).

The second source of Known Errors arises from development activity. For example, implementation of a new application or packaged Release is likely to include known, but unresolved, errors from the development phase. The data relating to Known Errors from development needs to be made available to the custodians of the live environment when an application or a Release package is implemented.

Many IT departments are involved in this sequence of events. The Problem Management system should provide a record of all resolution activity and provide monitoring and tracking facilities for support staff. It should also provide a complete audit trail navigable in either direction, from Incident to Problem to Known Error to Change request to Release or urgent Change implementation.
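
The audit trail described above can be pictured as a chain of records linked in both directions. The following is a minimal sketch only: the `Record` class, the `link` and `trace` helpers and the identifier formats used with them are hypothetical and are not part of any particular Problem Management tool.

```python
from dataclasses import dataclass, field


@dataclass(eq=False)
class Record:
    """One record in the chain (Incident, Problem, Known Error, RFC, Release)."""
    kind: str
    ident: str
    upstream: list["Record"] = field(default_factory=list, repr=False)
    downstream: list["Record"] = field(default_factory=list, repr=False)


def link(earlier: Record, later: Record) -> None:
    """Store the link on both records so that it can be followed either way."""
    earlier.downstream.append(later)
    later.upstream.append(earlier)


def trace(start: Record, direction: str) -> list[str]:
    """Walk the chain depth-first; direction is 'upstream' or 'downstream'."""
    seen: list[str] = []
    stack = [start]
    while stack:
        rec = stack.pop()
        label = f"{rec.kind} {rec.ident}"
        if label in seen:
            continue  # already visited via another link
        seen.append(label)
        stack.extend(reversed(getattr(rec, direction)))
    return seen
```

Calling `trace(incident, "downstream")` would list everything that followed from an Incident, while `trace(release, "upstream")` would list the records that a Release resolves.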

6.7.2 Error assessment

Problem Management staff perform an initial assessment of the means of resolving the error, in collaboration with specialist staff. If necessary, they then complete an RFC according to Change Management procedures. The priority of the RFC is determined by the urgency and impact of the error on the business. The RFC identifier should be included in the Known Error record and vice versa in order to maintain a full audit trail, or the two records should be linked.
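
As an illustration of the two points above, the sketch below derives an RFC priority from urgency and impact and records each identifier on the other record. The three-point scale, the additive priority matrix, and the names `KnownError`, `RFC` and `raise_rfc` are assumptions made for the example; an organisation's own Change Management procedures define the real values.

```python
from dataclasses import dataclass
from typing import Optional

# Assumed three-point scale; not prescribed by the text.
LEVELS = {"low": 1, "medium": 2, "high": 3}


def rfc_priority(urgency: str, impact: str) -> int:
    """Return 1 (highest) to 5 (lowest) from an assumed additive matrix."""
    return 7 - (LEVELS[urgency] + LEVELS[impact])


@dataclass
class KnownError:
    ident: str
    rfc_id: Optional[str] = None  # filled in once an RFC is raised


@dataclass
class RFC:
    ident: str
    priority: int
    known_error_id: str


def raise_rfc(error: KnownError, rfc_ident: str, urgency: str, impact: str) -> RFC:
    """Create the RFC and cross-reference it with the Known Error record."""
    rfc = RFC(rfc_ident, rfc_priority(urgency, impact), error.ident)
    error.rfc_id = rfc.ident
    return rfc
```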

The final stages of error resolution - impact analysis, detailed assessment of the resolution action to be carried out, amendment of the item in error, and testing of the Change - are under the control of Change Management. In extreme circumstances, authorisation and execution of an urgent resolution may be necessary.

Errors in third party products

Problems in vendor-maintained products may be identified by Problem Management or specialist support teams and should be reported to the person responsible for vendor support. Vendor support should be monitored to ensure that responses to Problem reports are received in a reasonable time.

Where software maintenance targets - e.g. mean and maximum time to repair and associated IT infrastructure reliability and serviceability - are specified in a contract or in licence conditions, remedial action should be initiated with the third party organisation in cases of non-compliance. The possibility of specifying maintenance targets should be borne in mind when procuring software, particularly when there is competition for the business. Note that Changes necessary to resolve software errors should be subject to the same Change Management procedures as for internal products.

Error control in the software environments

The processes of Problem and error control are essentially the same in the live and development environments. The support tools described earlier for Problem Management in the live environment are precisely those required in the development environment. Figure 6.4 shows how there is a cyclical relationship between error control in the live and development environments. Interworking and integrated Problem Management systems facilitate the handling of this situation.

Figure 6.4 - The error cycle in the live and development environments

Errors found during live operations result in an accumulation of RFCs. The Release strategy (see Chapter 9 - Release Management) allows for the eventual creation of a Release to incorporate authorised Changes for the amendment of system facilities. Development staff should be aware of all Known Errors and Problems that are associated with the package Release. They are required to delete Known Errors as these are corrected, and to add any errors newly introduced by the development activity itself, producing a revised errors database (or CMDB).

Upon implementation of a new Release, this revised errors database replaces the database of the previous Release as the live version. The cycle then repeats itself as new errors are discovered in live operation.
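
The cycle can be sketched in a few lines. The example assumes, purely for illustration, that the errors database is a mapping from a Known Error identifier to a description; the function name `revise_error_db` is hypothetical.

```python
def revise_error_db(
    live_db: dict[str, str],
    corrected: set[str],
    introduced: dict[str, str],
) -> dict[str, str]:
    """Build the errors database that accompanies the next Release.

    Known Errors corrected in the Release are dropped, errors introduced
    during development are added, and the live database is left untouched
    until the Release is implemented.
    """
    revised = {k: v for k, v in live_db.items() if k not in corrected}
    revised.update(introduced)
    return revised
```

On implementation of the Release, the returned mapping would take the place of `live_db`, and errors found in subsequent live operation would be added to it in turn.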

6.7.3 Error resolution recording

The resolution process for each Known Error should be recorded in the Problem Management system. It is vital that data on the CIs, symptoms, and resolution or circumvention actions relating to all Known Errors is held in the Known Error database. This data is then available for Incident matching, providing guidance during future investigations on resolving and circumventing Incidents, and for providing management information.
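
To show how such data might support Incident matching, the sketch below looks up Known Errors by CI and by overlapping symptoms. The record layout and the ranking rule (most shared symptoms first) are assumptions for the example, not a description of any specific Known Error database.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class KnownErrorEntry:
    ident: str
    ci: str                   # the CI in error
    symptoms: frozenset[str]  # symptom keywords recorded for the error
    workaround: str           # resolution or circumvention action


def match_incident(
    ci: str, symptoms: set[str], db: list[KnownErrorEntry]
) -> list[KnownErrorEntry]:
    """Return Known Errors on the same CI sharing at least one symptom,
    ordered by the number of shared symptoms (assumed ranking rule)."""
    candidates = [e for e in db if e.ci == ci and e.symptoms & symptoms]
    return sorted(candidates, key=lambda e: len(e.symptoms & symptoms), reverse=True)
```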

6.7.4 Error closure

Following successful implementation of Changes to resolve errors, the relevant Known Error records are closed, together with any associated Incident or Problem records. Consideration should be given to inserting into the process an interim status, on the Incident, Known Error and Problem records, of 'Closed pending PIR' to ensure that the fixes have actually worked. A Post-Implementation Review (PIR) can then confirm the effectiveness of the solution prior to final closure.

For Incidents, this may involve nothing more than a telephone call to the user to ensure that they are now content. For more serious Problems and Known Errors, a formal review may be required.
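
The interim status can be modelled as a small state machine. This is a sketch under stated assumptions: the status names other than 'Closed pending PIR' are illustrative, and reopening a record when the PIR finds that the fix has not worked is an assumption of the example rather than something the text prescribes.

```python
from enum import Enum


class Status(Enum):
    OPEN = "Open"
    CLOSED_PENDING_PIR = "Closed pending PIR"
    CLOSED = "Closed"


# Permitted transitions; reopening after a failed PIR is an assumption.
ALLOWED = {
    Status.OPEN: {Status.CLOSED_PENDING_PIR},
    Status.CLOSED_PENDING_PIR: {Status.CLOSED, Status.OPEN},
    Status.CLOSED: set(),
}


def transition(current: Status, target: Status) -> Status:
    """Return the new status, or raise if the move is not permitted."""
    if target not in ALLOWED[current]:
        raise ValueError(f"cannot move from {current.value} to {target.value}")
    return target


def after_pir(current: Status, fix_effective: bool) -> Status:
    """Apply the PIR outcome to a record held at 'Closed pending PIR'."""
    return transition(current, Status.CLOSED if fix_effective else Status.OPEN)
```

With this arrangement a record cannot move directly from Open to Closed, so final closure always follows a review, however informal.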

6.7.5 Problem/error resolution monitoring

Change Management is responsible for processing RFCs, whereas error control is responsible for monitoring progress with regard to resolving Known Errors. Throughout the resolution process, Problem Management should obtain regular reports from Change Management on progress in resolving Problems and errors.

Problem Management should monitor the continuing impact of Problems and Known Errors on User services. In the event that this impact becomes severe, Problem Management should escalate the Problem, perhaps referring to the Change Advisory Board to increase the priority of the RFC or to implement an urgent Change as appropriate.

The progress of Problem resolution should be monitored against SLAs. Typically, SLAs stipulate that there should not be more than a certain number of outstanding errors per severity level during each measurement interval (generally a rolling four-week period). If the number of Problems or errors at a severity level reaches a predefined threshold that looks likely to cause non-conformance to the SLAs, escalation should be invoked.
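
A threshold check of this kind might look as follows. The sketch assumes that an error counts as outstanding if it was open at any point in the rolling four-week window, that severities are integers, and that escalation is triggered when a count reaches its threshold; the function names and the threshold values used with them are hypothetical.

```python
from collections import Counter
from datetime import date, timedelta
from typing import Iterable, Optional

WINDOW = timedelta(weeks=4)  # rolling four-week measurement interval

# Each error is (severity, date opened, date closed or None if still open).
ErrorRow = tuple[int, date, Optional[date]]


def outstanding_by_severity(errors: Iterable[ErrorRow], today: date) -> Counter:
    """Count errors that were open at some point within the window."""
    start = today - WINDOW
    counts: Counter = Counter()
    for severity, opened, closed in errors:
        if opened <= today and (closed is None or closed >= start):
            counts[severity] += 1
    return counts


def needs_escalation(counts: Counter, thresholds: dict[int, int]) -> list[int]:
    """Return the severity levels whose count has reached its threshold."""
    return sorted(s for s, limit in thresholds.items() if counts[s] >= limit)
```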

6.7.6 Tips on error control

The following are points worth remembering in relation to error control:

Not all Known Errors need to be resolved. An organisation can decide to allow Known Errors to remain - for instance because the resolution is too expensive, technically impossible, or requires too much time to resolve. In practice, error control is concerned with selecting justifiable investments to resolve a Problem.
Preparing an RFC is one of the responsibilities of error control. Resolutions are often found in technical adjustments. Don't forget that these RFCs may also need to include amendments to procedures, working methods and/or organisational structures.
Consider creating standard error records, by specific device (CI) or by device category, for routine hardware failures. Use these to maintain a quick guide to the failure rate - although most information, such as mean time between failures (MTBF) and downtime, is produced from Incident data.
The rectification of many hardware faults is carried out under Incident control, and not via error control and Change Management. Any Changes to the specification of hardware should, however, be subject to the normal Change Management procedures.
Ideally, common tools should be used for Incident, Problem and error control in live and development environments. If this is not possible, because of the use of specific CASE tools in the development environment, it will be necessary to design and produce a viable transfer mechanism.
In practice, the level of detail usually required for development Configuration Management often precludes a viable shared system. The key thing is to share the data, especially in terms of passing to the live environment information on Problems, Known Errors and ongoing Changes that are being handed over with any new or changed software.