5.6 Incident Management activities

This section discusses in more detail the six activities encapsulated in Section 5.3 (Basic concepts), namely:

Incident detection and recording
classification and initial support
investigation and diagnosis
resolution and recovery
Incident closure
ownership, monitoring, tracking and communication.

Each of these is discussed in more detail below.

5.6.1 Incident detection and recording

Incident details from Service Desk or event management systems are the inputs for Incident Management. Resultant actions are to:

record basic details of the Incident
alert specialist support group(s) as necessary
start procedures for handling the service request.

Outputs will be:

updated details of Incidents
the recognition of any errors on the CMDB
notice to Customers when an Incident has been resolved.

All Incidents should be recorded: automatic generation of 'skeleton Incident records' in an Incident database by a system-monitoring tool is the ideal solution to this requirement. Symptoms, basic diagnostic data, and information about the related Configuration Item should be included in Incident records during detection and recording. Annex 5C illustrates the scope of data to be captured in records during the entire Incident Management process. This data is required both for Incident resolution/recovery and for management information on Incident types and trends.
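As an illustration of the 'skeleton Incident record' idea, the following minimal sketch shows how a monitoring event might be turned into a partially filled record. The field names, the event layout and the `skeleton_from_event` helper are assumptions made for this example; they are not taken from Annex 5C or from any particular tool.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from itertools import count

_ids = count(1)  # hypothetical in-memory id source, for illustration only


@dataclass
class IncidentRecord:
    incident_id: int
    detected_at: datetime
    source: str          # e.g. "monitoring" or "service-desk"
    ci_id: str           # related Configuration Item
    symptoms: str
    diagnostic_data: dict = field(default_factory=dict)
    status: str = "new"


def skeleton_from_event(event: dict) -> IncidentRecord:
    """Build a skeleton record from a monitoring event (assumed event shape)."""
    return IncidentRecord(
        incident_id=next(_ids),
        detected_at=datetime.now(timezone.utc),
        source="monitoring",
        ci_id=event["ci_id"],
        symptoms=event["message"],
        diagnostic_data=dict(event.get("metrics", {})),
    )


record = skeleton_from_event(
    {"ci_id": "SRV-0042", "message": "disk usage above threshold",
     "metrics": {"disk_pct": 97}}
)
```

The remaining fields (classification, priority, resolution and so on) would be completed by the Service Desk and support groups in the later activities.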

In the past, it has been common practice for all Incidents to be reported to the Service Desk, where personnel manually created a record in the Incident database. Where this was not practical or possible, support groups were allowed to record Incidents manually; in this case the Service Desk received alerts so that it was informed about possible degradation of services. With modern technology, however, Incidents can be reported by various means, including the ability for Users to log Incidents directly to the system. The fundamental requirement remains that all these Incidents should still reach the Incident Management database and that the Service Desk should receive appropriate alerts and maintain overall control - Incident monitoring remains the responsibility of the Service Desk.

An alert to the Service Manager is required in the case of serious degradation of service levels, in case it is necessary to take special action.

An Incident should be handled in conformance with standard SLM procedures. These specific procedures do not fall within the scope of the Incident Management process.

5.6.2 Classification and initial support

Inputs:

recorded Incident details
configuration details from the CMDB
response from Incident matching against Problems and Known Errors.

Incident records raised in the previous activity are now analysed to discover the reason for the Incident. The Incident should also be classified, since classification is the basis for further resolution actions. Annex 5A provides some examples of classification codes.

Actions:

classifying Incidents
matching against Known Errors and Problems
informing Problem Management of the existence of new Problems and of unmatched or multiple Incidents
assigning impact and urgency, and thereby defining priority
assessing related configuration details
providing initial support (assess Incident details, find quick resolution)
closing the Incident or routing to a specialist support group, and informing the User(s).

Outputs:

RFC for Incident resolution
updated Incident details, and
Work-arounds for Incidents, or Incident routed to second- or third-line support.

Classification is the process of identifying the reason for the Incident and hence the corresponding resolution action. Many Incidents are regularly experienced and the appropriate resolution actions are well known. This is not always the case, however, and a procedure for matching Incident classification data against that for Problems and Known Errors is necessary. Successful matching gives access to proven resolution actions, which should require no further investigation effort.

Classification is one of the most important aspects of Incident Management (and often one of the most difficult to get right). The classification is used to:

specify the service with which the Incident is related
associate with an SLA where appropriate
select/define the best specialist or group to handle the Incident
identify the priority based upon the business impact
define what questions should be asked or information checked
determine a primary reporting matrix for management information
identify a relationship to match against Known Errors or solutions.

The final classification(s) may vary from the initially reported classification because end Users are only able to report symptoms of the Incident rather than the root Problem. The levels of classification will vary depending on the detail required. For example, a top-level classification of 'Word Processing', or 'Payroll Service' is adequate for an overview; however, it may then be necessary to obtain greater detail in areas such as:

version number (of application in use)
supplier
module (e.g. printing), or
grouping (e.g. business application).

As much information as possible should be provided when classifying Incidents. Classification data contributing to the matching process includes:

details of Incident symptoms
initial Incident categorisation
details of associated Configuration Items (CIs)
the business impact.

The process of classification and matching allows Incident Management to be carried out with more speed and minimum recourse to support. The classification-matching process is an ideal application area for the use of so-called expert software.
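The matching step can be sketched in a few lines. The example below is a deliberately simple keyword-overlap match, not expert software; the `KnownError` fields and the ranking rule are assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class KnownError:
    error_id: str
    category: str        # initial Incident categorisation
    ci_type: str         # type of the associated Configuration Item
    keywords: frozenset  # lower-case symptom keywords
    workaround: str


def match_known_errors(category, ci_type, symptoms, known_errors):
    """Return Known Errors with the same category and CI type,
    ranked by how many symptom keywords they share with the Incident."""
    words = set(symptoms.lower().split())
    scored = []
    for ke in known_errors:
        if ke.category != category or ke.ci_type != ci_type:
            continue
        overlap = len(words & ke.keywords)
        if overlap:
            scored.append((overlap, ke))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [ke for _, ke in scored]


known = [
    KnownError("KE-1", "printing", "printer",
               frozenset({"paper", "jam"}), "Use tray 2"),
    KnownError("KE-2", "printing", "printer",
               frozenset({"toner"}), "Replace cartridge"),
]
matches = match_known_errors("printing", "printer",
                             "Paper jam on every job", known)
```

An empty result corresponds to an unmatched Incident, which would be routed onwards for investigation and diagnosis.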

The Service Desk collects information about affected CIs and therefore should be able to detect inconsistencies in the CMDB when asking a User for configuration id numbers, serial numbers and so on. If inconsistencies are discovered, an exception report should be raised and the Configuration Management process informed. This can take place automatically via the Incident Management software or by reporting on a daily basis.
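A minimal sketch of such a consistency check follows. It assumes that both the details quoted by the User and the CMDB entry are available as simple attribute mappings; the attribute names and the `find_cmdb_inconsistencies` helper are hypothetical.

```python
def find_cmdb_inconsistencies(reported: dict, cmdb_entry: dict) -> list:
    """Compare attributes quoted by the User with the CMDB record for the
    same CI and return one exception item per mismatch."""
    exceptions = []
    for attribute, reported_value in reported.items():
        recorded_value = cmdb_entry.get(attribute)
        if recorded_value != reported_value:
            exceptions.append({
                "attribute": attribute,
                "reported": reported_value,
                "recorded": recorded_value,
            })
    return exceptions


cmdb_entry = {"config_id": "WS-117", "serial": "ABC123", "location": "Floor 2"}
reported = {"config_id": "WS-117", "serial": "ABC128"}
exception_report = find_cmdb_inconsistencies(reported, cmdb_entry)
```

The resulting list could be passed to Configuration Management immediately or collected into a daily report.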

One of the important aspects of managing an Incident is to define its priority: how important is it and what is the impact on the business. The responsibility for definition lies with Service Level Management within the parameters set in the SLA. The priority with which Incidents need to be resolved, and therefore the amount of effort put into the resolution of and recovery from Incidents, will depend upon:

the impact on the business
the urgency to the business
the size, scope and complexity of the Incident
the availability of resources for coping in the meantime and for correcting the fault.

'Impact' is a measure of the business criticality of an Incident or Problem, often equal to the extent to which an Incident leads to degradation of agreed service levels. Impact is often measured by the number of people or systems affected. Criteria for assigning impact should be set up in consultation with the business managers and formalised in SLAs.

When determining impact, information in the CMDB should be accessed to detect how many Users will suffer as a result of the technical failure of, for example, a hardware component. The Service Desk should have access to tools that enable it rapidly to:

assess the impact on Users of significant equipment failures
identify Users affected by equipment failure
establish contact to make them aware of the Problem
give a prognosis
alert second-line (specialist) support groups.
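To illustrate how CMDB relationships might be used to identify affected Users, the sketch below walks from a failed CI to everything that depends on it and collects the Users of those CIs. The two mappings (`dependents` and `users_of`) are an assumed, simplified view of CMDB data.

```python
from collections import deque


def affected_users(failed_ci, dependents, users_of):
    """Collect Users of the failed CI and of every CI that depends on it,
    directly or indirectly.

    dependents: CI -> CIs that rely on it directly (assumed CMDB view)
    users_of:   CI -> set of Users of that CI
    """
    seen = {failed_ci}
    queue = deque([failed_ci])
    users = set()
    while queue:
        ci = queue.popleft()
        users |= users_of.get(ci, set())
        for child in dependents.get(ci, ()):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return users


dependents = {
    "switch-1": ["print-server", "file-server"],
    "print-server": ["printer-3"],
}
users_of = {
    "printer-3": {"alice", "bob"},
    "file-server": {"carol"},
    "mail-server": {"dave"},
}
impacted = affected_users("switch-1", dependents, users_of)
```

The size of the returned set gives one possible input to the impact assessment, and the set itself is the contact list for informing Users.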

'Urgency' is about the necessary speed of solving an Incident of a certain impact. A high-impact Incident does not, by default, have to be solved immediately. For example, a User having operational difficulties with his workstation (impact 'high') can have the fault registered with urgency 'low' if he is leaving the office for a fortnight's holiday directly after reporting the Incident.

'Priority' follows from impact and urgency, and is also influenced by expected effort. An Incident with a low impact and average urgency that can be resolved with minor effort will be resolved immediately in most organisations (e.g. a password reset).
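The relationship between impact, urgency, expected effort and priority can be sketched as a simple lookup. The three-level scales, the priority codes and the ten-minute 'quick fix' limit below are illustrative assumptions; real values should be agreed with the business and formalised in SLAs.

```python
# Hypothetical priority codes: 1 = handle first, 5 = handle last.
PRIORITY_MATRIX = {
    ("high", "high"): 1, ("high", "medium"): 2, ("high", "low"): 3,
    ("medium", "high"): 2, ("medium", "medium"): 3, ("medium", "low"): 4,
    ("low", "high"): 3, ("low", "medium"): 4, ("low", "low"): 5,
}


def priority(impact: str, urgency: str) -> int:
    """Look up the priority code for an (impact, urgency) pair."""
    return PRIORITY_MATRIX[(impact, urgency)]


def resolve_now(impact, urgency, expected_effort_minutes, quick_fix_limit=10):
    """True when the Incident is top priority, or cheap enough to fix on
    the spot regardless of its priority code."""
    return (priority(impact, urgency) == 1
            or expected_effort_minutes <= quick_fix_limit)


workstation = priority("high", "low")               # User about to go on holiday
password_reset = resolve_now("low", "medium", 5)    # minor effort
rebuild = resolve_now("low", "medium", 120)         # substantial effort
```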

Initial support involves resolution of the Incident to the satisfaction of the Customer by the Service Desk. The resolution may be derived from several areas, including:

identification of a Known Error
Service Desk staff expertise
a knowledge search (with the help of expert software when possible).

After this, little further action is required by the Service Desk other than recording details of the resolution, the classification and Customer satisfaction.

Tip:

The number of requests resolved directly by the Service Desk is an essential service-monitoring component and leads to contented Users!

In the event that classification matching is unsuccessful, or the resolution process is complex, investigation and diagnosis by a support group is the next step.

Although responsibility for resolution is handed over to another support group, the Service Desk should retain ownership of the Incident, and manage it until it is resolved to the Customer's satisfaction.

5.6.3 Investigation and diagnosis

Inputs:

updated Incident details
configuration details from the CMDB.

Actions:

assessment of the Incident details
collection and analysis of all related information
resolution (including any Work-around) or a route to n-line support.

Outputs:

further updated Incident details
a specification of the solution or required Work-around.

Wherever possible, the relevant User should be provided with the means to continue business, perhaps via a degraded service. An example could be that faulty printers might necessitate printing taking place at another more distant location. The effect of such a Work-around is to minimise the impact of the Incident on the business and to provide more time to investigate and devise a structural resolution. Temporary Work-arounds may have to be advised to other Users too.

Once the Incident has been assigned to a support group, it should:

accept assignment of the Incident, specify the date and time (preferably automatically), ensuring:
- the Incident status and its history are regularly updated
- the Customer via the Service Desk is kept informed of progress towards resolution
- the current status of the Incident is reflected (e.g. work in progress, and so on)
advise the Service Desk/Customer of any identified Work-around, if it is possible to provide one immediately
review the Incident against Known Errors, Problems, solutions, planned Changes or knowledge bases
if necessary, ask the Service Desk to re-evaluate the assigned business impact and priority, adjusting them as required, based on agreed service levels
record all details applicable to this phase of the Incident life cycle:
- solution
- classification added/updated
- an update of all related Incidents
- time spent
reassign the Incident back to the Service Desk for closure action.

Investigation and diagnosis may become an iterative process, starting with a different specialist support group and following elimination of a previous possible cause. It may involve multisite support groups and support staff from different vendors. It may continue overnight with a new shift of support staff taking over the next day. All this demands a rigorous, disciplined approach and a comprehensive record of actions taken with corresponding results.

Tip:

If it is not clear which support group should investigate or resolve a User-related Incident, the Service Desk, as the owner of all Incidents, should coordinate the Incident Management process. If there are differences of opinion or there are any other issues arising, then the Service Desk should escalate the Incident to the Problem Management team.

Annex 5D shows a typical process of Incident investigation. Continual expansion of the Incident record should occur, with each progress point logging the action taken in a progress summary.
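One way to keep such a progress summary is an append-only log attached to the Incident. The sketch below is a minimal example; the class names, fields and status values are assumptions for illustration and do not reproduce Annex 5D.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class ProgressEntry:
    at: datetime
    group: str    # support group that performed the action
    action: str
    result: str


@dataclass
class IncidentHistory:
    incident_id: str
    status: str = "assigned"
    entries: list = field(default_factory=list)

    def log(self, group, action, result, status=None):
        """Append a timestamped entry; optionally update the current status."""
        self.entries.append(
            ProgressEntry(datetime.now(timezone.utc), group, action, result)
        )
        if status is not None:
            self.status = status

    def summary(self):
        """Return the progress summary as readable lines, oldest first."""
        return [f"{e.group}: {e.action} -> {e.result}" for e in self.entries]


history = IncidentHistory("INC-1001")
history.log("network", "checked switch port", "no fault found",
            status="work in progress")
history.log("server", "restarted print spooler", "printing restored",
            status="resolved")
```

Because entries are only ever appended, a new shift or a different support group can see what has already been tried and with what result.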

5.6.4 Resolution and recovery

Inputs:

updated Incident details
any response on an RFC to effect resolution for the Incident(s)
any derived Work-around or solution.

Actions:

resolve the Incident using the solution/Work-around or, alternatively, raise an RFC (including a check for resolution)
take recovery actions.

Outputs:

RFC for future Incident resolution
resolved Incident, including recovery details
updated Incident details.

After successful execution of the resolution or some circumvention activity, service recovery can be effected and recovery actions carried out, often by specialist staff (second- or third-level support). The Incident Management system should allow for the recording of events and actions during the resolution and recovery activity.

5.6.5 Incident closure

Inputs:

updated Incident details
resolved Incident.

Actions:

the confirmation of the resolution with the Customer or originator
'close' category Incident.

Outputs:

updated Incident details
closed Incident record.

When the Incident has been resolved, the Service Desk should ensure that:

details of the action taken to resolve the Incident are concise and readable
classification is complete and accurate according to root cause
resolution/action is agreed with the Customer - verbally or, preferably, by email or in writing
all details applicable to this phase of the Incident control are recorded, such that:
- the Customer is satisfied
- cost-centre project codes are allocated
- the time spent on the Incident is recorded
- the person, date and time of closure are recorded.
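The closure checks above lend themselves to a simple automated gate. The following sketch lists which required details are still missing before an Incident may be closed; the field names are hypothetical and would need to match the actual Incident record layout.

```python
REQUIRED_FOR_CLOSURE = (
    "resolution_summary",
    "classification",
    "customer_confirmed",
    "cost_centre_code",
    "time_spent_minutes",
    "closed_by",
    "closed_at",
)


def closure_blockers(incident: dict) -> list:
    """Return the names of required closure details that are missing,
    empty, or (for confirmations) explicitly False."""
    blockers = []
    for name in REQUIRED_FOR_CLOSURE:
        value = incident.get(name)
        if value is None or value == "" or value is False:
            blockers.append(name)
    return blockers


incident = {
    "resolution_summary": "Replaced faulty network cable",
    "classification": "hardware/network",
    "customer_confirmed": False,
    "cost_centre_code": "CC-204",
    "time_spent_minutes": 45,
    "closed_by": "service.desk",
}
missing = closure_blockers(incident)
```

An empty list means the record is complete; restricting who may run the closure routine remains a separate access-control matter.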

Tips:

This process is essential in resolving disputes between a service provider and a Customer over the validity of closure.
It is important that there should be restricted access to the Incident closure routine, and this should be controlled by the Service Desk Manager.
Incidents should be matched with the corresponding Problem/Known Error record, where one exists.
If a closed Incident is reopened, it is important to record the reason and adjust the workload values assigned if further work is required - if not, a new Incident should be raised and linked to the original one.

5.6.6 Ownership, monitoring, tracking and communication

Inputs:

Incident records.

Actions:

monitor Incidents
escalate Incidents
inform User.

Outputs:

management reports about Incident progress
escalated Incident details; and
Customer reports and communication.

The Service Desk is responsible for owning and overseeing the resolution of all outstanding Incidents, whatever the initial source. To this end, it should:

regularly monitor the status and progress towards resolution and against service levels of all open Incidents
particularly note Incidents that move between different specialist support groups, as this may be indicative of uncertainty and, possibly, a dispute between support staff (in excessive cases, Incidents may be referred to Problem Management)
give priority to monitoring high-impact Incidents
keep affected Users informed of progress
check for similar Incidents.

Following this procedure will help to guarantee that each individual Incident will be resolved within agreed timeframes or, at least, as soon as possible. Larger Service Desks should consider the establishment of a dedicated team for Incident monitoring and tracking.
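Spotting Incidents that move between support groups can be automated from the assignment history. The sketch below counts changes of assignment and flags Incidents that exceed a threshold; the threshold of two moves and the data layout are assumptions for illustration.

```python
def reassignment_count(assignment_history):
    """Number of times the Incident changed hands between support groups."""
    pairs = zip(assignment_history, assignment_history[1:])
    return sum(1 for previous, current in pairs if previous != current)


def flag_bouncing(incidents, max_moves=2):
    """Return ids of open Incidents reassigned more than max_moves times."""
    return [
        incident_id
        for incident_id, history in incidents.items()
        if reassignment_count(history) > max_moves
    ]


open_incidents = {
    "INC-1": ["network", "server", "network", "desktop"],
    "INC-2": ["desktop"],
    "INC-3": ["server", "server", "database"],
}
flagged = flag_bouncing(open_incidents)
```

Flagged Incidents are candidates for closer monitoring by the Service Desk or, where appropriate, referral to Problem Management.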

In the event that an Incident fails to achieve satisfactory progress, the Service Desk should act in accordance with well-defined escalation procedures. These procedures should be agreed on by all support groups. In practice, it is important to watch for support staff becoming too engrossed in an Incident, spending much time on gathering diagnostics, and consequently losing sight of the immediate User need. In all circumstances, when agreed escalation thresholds (which are defined in SLAs) have been exceeded, action should be taken to escalate the matter regardless of the views of support staff.

Tips:

Identify Incidents that are liable to breach agreed service level targets and inform the assigned solver.
Make individuals who are identified as escalation contacts aware of any Incidents that are likely to breach service levels.
Record in the Incident history any information regarding escalation of an Incident at the Customer end, and bring this to the attention of the escalation contacts.
Agree on escalation values and processes such as:
- when 75% of the agreed time for resolution has elapsed and the request is still unresolved, the Service Desk should consult with the assigned solver on progress
- when 90% of such time has elapsed and the request is still unresolved, the Service Desk should consult with the line manager of the assigned solver.
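The 75% and 90% thresholds above can be expressed as a small check that the Service Desk tooling might run periodically. This is a minimal sketch: the function name and return values are assumptions, and elapsed time is measured as simple clock time rather than agreed service hours.

```python
from datetime import datetime, timedelta


def escalation_contact(opened_at, now, agreed_resolution, resolved=False):
    """Return who the Service Desk should consult, or None if no
    escalation threshold has been reached yet."""
    if resolved:
        return None
    fraction = (now - opened_at) / agreed_resolution  # timedelta ratio -> float
    if fraction >= 0.90:
        return "line manager of assigned solver"
    if fraction >= 0.75:
        return "assigned solver"
    return None


opened = datetime(2024, 1, 1, 9, 0)
agreed = timedelta(hours=8)

at_noon = escalation_contact(opened, datetime(2024, 1, 1, 12, 0), agreed)
at_three = escalation_contact(opened, datetime(2024, 1, 1, 15, 0), agreed)
at_quarter_past_four = escalation_contact(
    opened, datetime(2024, 1, 1, 16, 15), agreed
)
```

With an agreed resolution time of eight hours, the first threshold is reached after six hours and the second after seven hours and twelve minutes.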