6.6 Problem control activities

The occurrence of some Incidents is, for all practical purposes, unavoidable. Computer equipment and telecommunications lines fail occasionally. Many other Incidents are caused, not by random failures, but by errors somewhere in organisations' increasingly complex IT infrastructures. Even anticipated failures of computing or telecommunications equipment can increase in impact to unacceptable levels because of an error in a vendor product.

Reactive Problem control (Figure 6.1) is concerned with identifying the real underlying causes of Incidents in order to prevent future recurrences. The three phases involved in the (reactive) Problem control process are:

Problem identification and recording
Problem classification - in terms of the impact on the business
Problem investigation and diagnosis.

Figure 6.1 - Problem control

When the root cause is detected the error control process begins.

6.6.1 Problem identification and recording

Problem identification takes place when:

the Incident initial support and classification stage fails to match the Incident to existing Problems and Known Errors
analysis of Incident data reveals recurrent Incidents
analysis of Incident data reveals Incidents that are not yet matched to existing Problems or Known Errors
analysis of the IT infrastructure indicates a Problem that could potentially lead to Incidents
a major or significant Incident (serious and adverse impact on services to the Customer) occurs for which a structural solution has to be found.
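The second and third triggers above (recurrent Incidents, and Incidents not yet matched to a Problem or Known Error) lend themselves to a simple periodic analysis of Incident data. The following is a minimal sketch only: the tuple layout, the identifiers and the `min_count` parameter are illustrative assumptions, not part of any particular Service Desk tool.

```python
from collections import Counter

# Hypothetical Incident rows: (incident_id, ci_id, category, matched_problem_id).
# matched_problem_id is None when no Problem or Known Error has been matched yet.
incidents = [
    ("INC-1", "SRV-01", "hardware", None),
    ("INC-2", "SRV-01", "hardware", None),
    ("INC-3", "APP-07", "software", "PRB-4"),
    ("INC-4", "SRV-01", "hardware", None),
    ("INC-5", "NET-02", "network", None),
]

def recurrent_unmatched(incidents, min_count=2):
    """Count unmatched Incidents per (CI, category) and keep the recurrent groups."""
    counts = Counter(
        (ci_id, category)
        for _incident_id, ci_id, category, problem_id in incidents
        if problem_id is None
    )
    return {key: n for key, n in counts.items() if n >= min_count}

candidates = recurrent_unmatched(incidents)
print(candidates)  # groups that may justify raising a new Problem record
```

Grouping by CI and category is only one possible choice; an organisation might equally group by symptom, location or time window.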

It should be noted that some Problems may be identified by personnel outside the Problem Management team, e.g. by Capacity Management. Regardless, all Problems should be notified and recorded via the Problem Management process. Much of the Availability Management process is concerned with the detection and avoidance of Problems and Incidents to the IT infrastructure; a synergy between the two areas is thus an invaluable aid to improving service quality.

Tips:

Problem Management requires effort and resources and can therefore be expensive. The organisation may decide that the effort and cost are not justifiable for certain types of unmatched Incidents - perhaps for Incidents with a quick resolution, low impact or low possibility of recurrence. In such cases, a dummy Problem record can be introduced in the CMDB, related to all connected Incidents, Known Errors, RFCs and CIs.

Problem records need to be recorded in a database (ideally the CMDB) and are very similar to Incident records. They usually exclude some of the standard Incident data (e.g. User data) that is inappropriate. However, Problem records should be linked to all associated Incident records. The solution and Work-arounds of Incidents should be recorded in the relevant Problem records for others to access should other related Incidents occur.
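As a rough illustration of the linkage described above, a Problem record might be modelled as follows. This is a sketch under stated assumptions: the class name, field names and the `link_incident` helper are hypothetical, and a real CMDB schema would hold considerably more data.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class ProblemRecord:
    """Hypothetical Problem record: no User data, but links to every related Incident."""
    problem_id: str
    description: str
    incident_ids: List[str] = field(default_factory=list)
    workarounds: List[str] = field(default_factory=list)

    def link_incident(self, incident_id: str, workaround: Optional[str] = None) -> None:
        """Link an Incident and record its Work-around (if any) for later reuse."""
        if incident_id not in self.incident_ids:
            self.incident_ids.append(incident_id)
        if workaround and workaround not in self.workarounds:
            self.workarounds.append(workaround)

record = ProblemRecord("PRB-1", "Print jobs stall intermittently")
record.link_incident("INC-1", workaround="Restart the print spooler")
record.link_incident("INC-2", workaround="Restart the print spooler")
print(record.incident_ids, record.workarounds)
```

Keeping Work-arounds on the Problem record, rather than only on individual Incident records, is what makes them available when further related Incidents occur.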

Figure 6.2 - Incident-matching process flow

The process of Problem identification, illustrated in Figure 6.2, includes the basic classification of Problems. Data on affected CIs should be accurately appended to this basic classification data. Ideally, these CIs are the lowest level of item capable of discrete amendment - for example, a module of application code or a hardware component. Identification of a Problem CI to this level is, however, often impossible at the Problem identification stage.

6.6.2 Problem classification

When a Problem is identified, the amount of effort required to detect and recover the failing CI(s) has to be determined. Therefore it is important to be aware of the impact of the Problem on existing service levels. This process is known as 'classification'. In practice, support effort is allocated to only a small proportion of Problems linked to a single Incident.

The steps involved in Problem classification are similar to the steps in classifying Incidents; they are to determine:

category
impact
urgency
priority.

Problems are categorised into related groups or domains (e.g. hardware, software, support software, whatever is appropriate). These groups could match the organisational responsibilities, or the User and Customer base, and are the basis for allocating Problems to support staff. Annex 6A gives an example of a simple but effective structure for categorising Problems.

Identification of a new Problem should be followed by an objective analysis of its impact (that is, its effect on the business). The relationships between components in the IT Infrastructure registered in the CMDB can be of great help when determining the impact of a Problem.

Organisations should design their own impact coding system in relation to their business needs. Impact coding is a most useful mechanism for the effective allocation of support effort. The further inclusion of a simple priority rating, subordinate to impact, provides a total control mechanism.

When determining the impact of a Problem, the relationships between components in the IT infrastructure registered in the CMDB can be of great help. By interrogating the CMDB, it is possible to identify CIs that are dependent on, part of, or identical to the CI in the IT infrastructure to which the Problem applies.
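Interrogating the CMDB for dependent CIs amounts to walking a relationship graph. The sketch below assumes a hypothetical `depends_on` mapping exported from the CMDB (each CI mapped to the CIs it depends on); the CI names are invented for illustration, and only the 'dependent on' relationship is modelled.

```python
from collections import deque

# Hypothetical CMDB extract: each CI maps to the CIs it depends on.
depends_on = {
    "order-entry-app": ["app-server-1", "orders-db"],
    "reporting-app": ["orders-db"],
    "orders-db": ["disk-array-A"],
    "app-server-1": [],
    "disk-array-A": [],
}

def affected_cis(failing_ci, depends_on):
    """Return every CI that depends, directly or indirectly, on failing_ci."""
    # Invert the mapping so we can look up "who depends on this CI?".
    dependants = {}
    for ci, deps in depends_on.items():
        for dep in deps:
            dependants.setdefault(dep, set()).add(ci)

    # Breadth-first walk over the inverted relationships.
    seen = set()
    queue = deque([failing_ci])
    while queue:
        current = queue.popleft()
        for ci in dependants.get(current, ()):
            if ci not in seen:
                seen.add(ci)
                queue.append(ci)
    return seen

print(sorted(affected_cis("disk-array-A", depends_on)))
```

The size and business importance of the resulting set gives an objective starting point for the impact code, which should still be reviewed by someone with knowledge of the business.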

Urgency is the extent to which resolution of a Problem or error can bear delay; it should not be confused with priority. Priority indicates the relative order in which a series of items - be they Incidents, Problems, Changes or errors - should be addressed. It is influenced by considerations of risk and resource availability but is primarily driven by a combination of urgency and impact. Despite a low business impact, something that requires urgent resolution will often be dealt with before something of very high potential business impact but lower urgency.

It sometimes helps to allocate numerical values to urgency and impact in order to derive a numerical priority from them; but, as with all Service Management, such numbers should be tempered by human common sense and business awareness. A useful and simple starting point is to assign numerical values from 1 to 4 to each of urgency and impact and sum these for any one Problem to give a relative priority. That done, an organisation should monitor and critically examine the resulting priorities and modify the function to reflect its requirements.

Both 'urgency' and 'priority' are listed in Appendix A (Glossary of Terms). Aspects influencing urgency are, for example:

the availability of a temporary fix
the existence of a Work-around
the possibility of planned delay of resolution
an awareness of future impact upon the business, e.g. equipment required to support month-end processes.

Every Incident, Problem and Change will have both an impact on the business services and an urgency:

impact describes the potential to which the business stands vulnerable
urgency illustrates the time that is available to avert, or at least reduce, this impact.
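The simple starting point described earlier in this section (values from 1 to 4 for each of impact and urgency, summed to give a relative priority) can be sketched as follows. The text does not say which end of the scale is 'high'; this example assumes that 1 is the most severe, so a lower sum is addressed first. The function name and the Problem identifiers are illustrative.

```python
def relative_priority(impact, urgency):
    """Sum impact and urgency (each 1-4) to give a relative priority.

    Assumption: 1 = highest impact/urgency, so a LOWER sum means 'address sooner'.
    """
    for name, value in (("impact", impact), ("urgency", urgency)):
        if value not in (1, 2, 3, 4):
            raise ValueError(f"{name} must be between 1 and 4, got {value!r}")
    return impact + urgency

# Hypothetical Problems as (impact, urgency) pairs.
problems = {
    "PRB-1": (1, 3),
    "PRB-2": (4, 1),
    "PRB-3": (2, 2),
    "PRB-4": (3, 4),
}

# Order Problems by relative priority; ties keep their original order.
ranked = sorted(problems, key=lambda pid: relative_priority(*problems[pid]))
print(ranked)
```

A plain sum treats impact and urgency as equally weighted. An organisation that finds the resulting order unsatisfactory can replace the function, for example with a weighted sum or a lookup table.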

Tips:

Assign an impact code to all Problems at the earliest opportunity. When this has been done, it is important to make all Problems subject to a managed staff-assignment process before detailed investigations begin. The person assigned assumes responsibility for the Problem and becomes the focal point for all communications and for coordinating resolution activity on that Problem. Schedule effort according to impact, with major Problems receiving immediate attention. Make certain this resource-control process allows for low-impact Problems that have exceeded their specified time threshold.
The process of impact analysis suffers from one major constraint: it reflects a snapshot view. Although a Problem may be correctly assigned a low impact code, the sheer number of subsequent Incidents later attributed to it may demand that the Problem receives immediate attention. Incident thresholds should be set to address this difficulty. As illustrated in Figure 6.2, the Problem Management process can be designed to maintain a count of matched Incidents in Problem (and Known Error) records. The Problem and error control systems periodically scan this count, comparing it with a predetermined threshold value. When the count equals or exceeds the threshold, such Problems/Known Errors should be escalated to receive immediate attention. However, beware that the number of Incidents is not always a measure of importance: a Problem that prevents the posting of only 0.5% of orders can suddenly and rightly be recognised as critical when you find that the affected orders are those with values exceeding £999,999.99, which cannot be entered!
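The periodic threshold scan described in the second tip can be sketched as below. The record identifiers, the counts and the threshold value of 10 are invented for illustration; in practice the counts would be read from the Problem and Known Error records.

```python
def problems_to_escalate(matched_incident_counts, threshold):
    """Return the records whose matched-Incident count equals or exceeds the threshold."""
    return sorted(
        record_id
        for record_id, count in matched_incident_counts.items()
        if count >= threshold
    )

# Hypothetical counts of matched Incidents per Problem / Known Error record.
counts = {"PRB-1": 2, "PRB-2": 10, "KE-7": 12, "PRB-3": 9}

escalate = problems_to_escalate(counts, threshold=10)
print(escalate)
```

As the tip warns, a count-based scan of this kind complements impact coding rather than replacing it: a Problem with few matched Incidents can still be critical.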

6.6.3 Problem investigation and diagnosis

The process of Problem investigation is similar to that of Incident investigation (see Chapter 5) - but the primary objective of each process is significantly different. Incident Management's aim is rapid restoration of service, whereas Problem Management's aim is diagnosis of the underlying cause. Investigation activities should include available Work-arounds for the Incidents related to the Problem, as registered in the Incident record database. Problem Management activities should include updating recommended Work-arounds in the Problem record, to support Incident control.

Diagnosis frequently reveals that the cause of a Problem is not an error in a registered CI (hardware, software item, documentation or procedure) but is procedural. Incorrect release of a version of a program is one example. These situations result in Problem closure with an appropriate categorisation code. Problems of this type do not automatically achieve the formal status of Known Error. To ensure that these Problems are followed up and that action is taken to address them, consider creating a dummy CI record for the offending procedure and re-classifying the Problem as a Known Error, or raise an RFC.

Diagnosis showing the cause to be a fault in a registered CI should automatically change the status of the Problem into a Known Error. At this point the error control system and procedures take over.
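The two diagnosis outcomes described above can be summarised as a small status rule. This is a simplified sketch: the `Status` values, the function name and the CI identifiers are assumptions made for illustration, and the option of raising an RFC is not modelled.

```python
from enum import Enum

class Status(Enum):
    PROBLEM = "problem"
    KNOWN_ERROR = "known error"
    CLOSED = "closed"

def status_after_diagnosis(cause_ci, registered_cis):
    """Decide the record status once diagnosis has identified the cause.

    cause_ci is the CI found to be at fault, or None when the cause is
    procedural and does not correspond to any registered CI.
    """
    if cause_ci is not None and cause_ci in registered_cis:
        # Fault in a registered CI: the Problem becomes a Known Error.
        return Status.KNOWN_ERROR
    # Procedural cause: closed with an appropriate categorisation code.
    return Status.CLOSED

registered = {"payroll-module-v2", "router-07"}
print(status_after_diagnosis("router-07", registered))
print(status_after_diagnosis(None, registered))

# Registering a dummy CI for an offending procedure lets the same rule
# promote a procedural Problem to a Known Error.
registered.add("release-procedure")
print(status_after_diagnosis("release-procedure", registered))
```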

As indicated earlier, the objectives of Problem investigation frequently conflict with those of Incident resolution. For example, Problem investigation may require detailed diagnostics data, which is available only when an Incident has occurred; its capture may significantly delay the restoration of normal services. Be sure to liaise closely with Incident control and the computer operations or network control functions to get a balanced view of the right time for such actions.

Methods of Problem analysis

Literature provides many methods for structural Problem analysis and diagnosis. Some available methods are:

Kepner and Tregoe (see Annex 6B)
Ishikawa diagrams (see Annex 6C)
brainstorming sessions
flowchart methods.

Problem Management should select methods that best fit the organisation's purposes.

6.6.4 Tips on Problem control

The following are points worth remembering in relation to Problem control:

The categorisation of Incidents can provide a first step towards Problem definition. Problem Management should therefore work closely with Incident Management to establish common Incident and Problem categories. Appropriate categories should be created both for recording reported Incidents, which should be in 'Customer terms', and for recording the finally detected causes, which are more likely to be expressed in 'IT terms'.
If possible, establish a multidisciplinary team with, for instance, Problem Management as coordinator, in order to bring as many different perspectives as possible to the investigation.
Ensure that support specialists involved have adequate tools and diagnostic aids in order to be able to carry out their tasks effectively.
If a Problem does not involve an error in a system component but is caused by, say, a general lack of User training, execute any resolution action and close the Problem record. Alternatively, a new CI record can be created - in this example for 'training Problems' - and the Problem can then be converted into a Known Error in the usual way. Ensure that the detected cause reflects the situation, e.g. lack of User knowledge or training.
Investigation procedures during the Incident or Problem control process require that documentation on all products in the IT infrastructure is available to the process and to support staff for reference purposes. This includes documentation on the following:
- application systems
- systems software
- in-house utility routines
- networking hardware and software
- overall configuration/network diagrams.
In addition to product information, it is also necessary to have effective procedures to collect diagnostic data for Problem resolution. It is particularly important that support staff are familiar with these procedures, as any inappropriate use during an Incident can delay the resumption of normal IT services. So you also need procedures that support and enforce your process requirements - and those procedures might include adequate training, qualifications etc.
Often, support specialists are involved in both the Incident Management process and the Problem Management process. Keeping in mind the different goals of these processes (quick resolution versus structural resolutions), it can prove useful to assign specialists to both processes for a fixed percentage of their time, perhaps 80% to Incident Management and 20% to Problem Management. This prevents support specialists becoming fully absorbed by reactive Incident Management.
During Incident and Problem investigations, Problem Management staff also require accurate records of recent Changes, because these may provide pointers to the cause.