IT Service Continuity Management: 7.3 The Business Continuity Lifecycle

7.3 The Business Continuity Lifecycle

7.3.1 Stage 1 - Initiation
7.3.2 Stage 2 - Requirements Analysis and Strategy Definition
7.3.3 Stage 3 - Implementation
7.3.4 Stage 4 - Operational Management
7.3.5 Invocation

It is not possible to develop an effective ITSCM plan in isolation; it must fully support the requirements of the business. This Section considers the four stages of the Business Continuity lifecycle with particular emphasis on the IT aspects. A full understanding of the Business Continuity Process can be obtained through the OGC ITIL publications, 'An Introduction to Business Continuity Management' and 'A Guide to Business Continuity Management'.

The process is illustrated in Figure 7.1.

Figure 7.1 - Business Continuity Management Process Model

7.3.1 Stage 1 - Initiation

Figure 7.2 - Initiation

The activities to be considered during the initiation process (see Figure 7.2) depend on the extent to which contingency facilities have been applied within the organisation. Some parts of the business may have established individual Continuity Plans based around manual Work-arounds and IT may have developed contingency plans for systems perceived to be critical. This is good input to the process; however, effective ITSCM is dependent on supporting critical Business functions and ensuring that the available budget is applied in the most appropriate way.

KEY MESSAGE

The only way of implementing effective ITSCM is through the identification of critical Business processes and the analysis, and co-ordination, of the required Infrastructure and IT supporting Services.

The initiation process covers the whole of the organisation and consists of the following activities:

Policy setting - this should be established and communicated as soon as possible so that all members of the organisation involved in, or affected by, Business Continuity issues are aware of their responsibilities to comply with and support ITSCM. As a minimum the policy should set out management intention and objectives.
Specify terms of reference and scope - this includes defining the scope and responsibilities of managers and staff in the organisation, and the method of working. It covers such tasks as undertaking a Risk assessment and Business Impact analysis and determination of the command and control structure required to support a business interruption. There is also a need to take into account such issues as outstanding audit points, regulatory or client requirements and insurance organisation stipulations, and compliance with standards such as BS 7799, the British Standard on Information Security Management (which also addresses Service Continuity requirements).
Allocate Resources - the establishment of an effective Business Continuity Environment requires considerable resource both in terms of money and manpower. Depending on the maturity of the organisation, with respect to ITSCM, there may be a requirement to familiarise and/or train staff to accomplish the Stage 2 tasks. Alternatively, the use of experienced external consultants may assist in completing the analysis more quickly. However, it is important that the organisation can then maintain the process going forward without the need to rely totally on external support.
Define the project organisation and control structure - ITSCM and BCM projects are potentially complex and need to be well organised and controlled. It is advisable to use a standard project planning methodology such as PRINCE2 complemented with a project-planning tool such as Project Manager Workbench or Microsoft Project. The appointment of an experienced project manager reporting to a steering committee and guiding the working groups is key to success. As IT is an important component of the overall process, it may be that the project is best driven from the IT area reporting through to the highest levels of management.
Agree project and quality plans - plans enable the project to be controlled and variances addressed. Quality plans ensure that the deliverables are achieved and to an acceptable level of quality. They also provide a mechanism for communicating project resource requirements and deliverables, thereby obtaining 'buy-in' from all necessary parties.

KEY MESSAGE

A well planned project initiation enables ITSCM work to proceed smoothly with the necessary sponsorship, 'buy-in' and awareness, with all contributing members of the organisation aware of their responsibilities and commitment.

7.3.2 Stage 2 - Requirements Analysis and Strategy Definition

Figure 7.3 - Requirements and Strategy

This stage is depicted in Figure 7.3. It provides the foundation for ITSCM and is critical in determining how well an organisation will survive a business interruption or disaster and the costs that will be incurred.

KEY MESSAGE

If the requirements analysis is incorrect or key information has been missed, this could have serious consequences for the effectiveness of ITSCM mechanisms.

This stage can effectively be split into two sections:

Requirements - perform Business Impact Analysis and risk assessment
Strategy - determine and agree Risk reduction measures and recovery options to support the requirements.

Requirements

Business Impact Analysis

A key driver in determining ITSCM requirements is how much the organisation stands to lose as a result of a disaster or other service disruption and the speed of escalation of these losses. The purpose of a Business Impact Analysis (BIA) is to assess this through identifying:

critical business processes
the potential damage or loss that may be caused to the organisation as a result of a disruption to critical business processes.

The BIA also identifies:

the form that the damage or loss may take including lost income, additional costs, damaged reputation, loss of goodwill, loss of competitive advantage
how the degree of damage or loss is likely to escalate after a service disruption
the staffing, skills, facilities and services (including the IT Services) necessary to enable critical and essential business processes to continue operating at a minimum acceptable level
the time within which minimum levels of staffing, facilities and services should be recovered
the time within which all required business processes and supporting staff, facilities and services should be fully recovered.

The latter three items provide the drivers for the level of ITSCM mechanisms that need to be considered or deployed. Once presented with these options, the business may decide that lower levels of service or increased delays are more acceptable based upon a Cost/benefit analysis.
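The BIA outputs listed above can be captured as one record per business process. The following is a minimal sketch in Python, not a prescribed format: the class name, field names, example processes and hour figures are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class BiaRecord:
    """One Business Impact Analysis entry (field names are illustrative)."""
    process: str
    loss_forms: list[str]           # e.g. lost income, damaged reputation
    minimum_resources: list[str]    # staff, facilities and IT Services needed
    minimum_recovery_hours: float   # time to recover the minimum service level
    full_recovery_hours: float      # time to recover all staff and facilities

    def __post_init__(self):
        # Full recovery cannot be required sooner than minimum recovery.
        if self.full_recovery_hours < self.minimum_recovery_hours:
            raise ValueError("full recovery precedes minimum recovery")


def rank_by_urgency(records):
    """Order processes so the shortest minimum recovery time comes first."""
    return sorted(records, key=lambda r: r.minimum_recovery_hours)


# Hypothetical figures, for illustration only.
records = [
    BiaRecord("invoicing", ["lost income"], ["finance staff", "billing system"], 120, 240),
    BiaRecord("trade settlement", ["fines", "damaged reputation"], ["settlement system"], 24, 72),
    BiaRecord("dealing", ["lost income"], ["market data feed"], 1, 8),
]
ranked = [r.process for r in rank_by_urgency(records)]
```

Ranking on the minimum recovery time is only one possible ordering; the business may equally rank on the size of the potential loss.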

Hints and tips

Key inputs into the Business Impact Analysis include any service or application definitions for business areas or business processes.

These definitions and their components enable the mapping of critical service, application and Infrastructure components to critical business processes, thus helping to identify the ITSCM elements that need to be provided. The business requirements are ranked and the associated ITSCM elements confirmed and prioritised in terms of risk assessment/reduction and recovery planning.

Impacts are measured against particular scenarios for each business process such as an inability to settle trades in a money market dealing process, or an inability to invoice for a period of days.

The impact analysis concentrates on the scenarios where the impact on critical business processes is likely to be greatest.

Impacts are measured against the scenarios and typically fall into one or more of the following categories:

failure to achieve agreed internal service levels
financial loss
additional costs
immediate and long-term loss of market share
breach of law, regulations, or standards
risk to personal safety
political, corporate or personal embarrassment
breach of moral responsibility
loss of goodwill
loss of credibility
loss of image and reputation
loss of operational capability, for example in a command and control environment.

This process enables a business to understand at what point the Unavailability of a service would become untenable. This in turn allows the most appropriate types of ITSCM mechanisms to be determined to meet these business requirements.

Example

In a money market dealing environment, loss of market data information could mean that the organisation starts to lose money immediately as trading cannot continue. In addition, Customers may go to another organisation, which would mean a potential loss of core business. Loss of the settlement System does not prevent trading from taking place, but if trades already conducted cannot be settled within a specified period of time, the organisation may be in breach of regulatory rules or settlement periods and suffer fines and damaged reputation. This may actually be a more significant impact than the inability to trade because of an inability to satisfy Customer expectations.

It is also important to understand how impacts may change over time. For instance, it may be possible for a business to function without a particular process for a short period of time, for example invoicing, but over a longer period re-establishment will become critical, i.e. in order to maintain cash flow to pay bills and staff. This can be effectively represented using a graphical illustration of how business impacts vary with length of disruption as shown in Figure 7.4.

In a balanced scenario, impacts to the business will occur and become greater over time; however, not all organisations are affected in this way. In some organisations, impacts are not apparent immediately, such as a consultancy organisation where the need to issue reports can be deferred and impacts to the business do not begin to accrue for a period of time. In this case the 'contingency' line of Figure 7.4 is applicable. In other organisations, such as investment banks, a loss of service for even a short period of time will cause major impacts to accrue immediately and the 'preventative' line applies. At some point, however, for any organisation, the impacts will accrue to such a level that the business can no longer operate. ITSCM ensures that contingency options are identified so that the appropriate measure can be applied at the appropriate time to keep business impacts from service disruption to a minimum level.

Figure 7.4 - Graphical representation of business impacts
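The difference between the 'contingency' and 'preventative' profiles can be made concrete with a small calculation. The sketch below assumes, purely for illustration, that losses accrue at a constant hourly rate once impacts begin; real impact curves are rarely linear, and the function name, parameters and figures are hypothetical.

```python
def hours_until_untenable(impact_per_hour, onset_hours, tolerable_loss):
    """Return the disruption length at which accumulated loss reaches the limit.

    impact_per_hour: loss accrued per hour once impacts begin (assumed constant)
    onset_hours:     delay before impacts start to accrue
    tolerable_loss:  total loss the business can absorb
    """
    if impact_per_hour <= 0:
        raise ValueError("impact_per_hour must be positive")
    return onset_hours + tolerable_loss / impact_per_hour


# Illustrative figures only: a consultancy whose impacts are deferred
# ('contingency' profile) and an investment bank whose impacts accrue
# immediately ('preventative' profile).
consultancy = hours_until_untenable(impact_per_hour=1_000, onset_hours=48, tolerable_loss=100_000)
bank = hours_until_untenable(impact_per_hour=50_000, onset_hours=0, tolerable_loss=100_000)
```

With these assumed figures the consultancy has days in which to recover while the bank has only a couple of hours, which is why the two profiles lead to different ITSCM mechanisms.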

In the majority of cases, business processes can be re-established without a full complement of staff, systems and other facilities, and still maintain an acceptable level of service to clients and Customers. The Business recovery objectives should therefore be stated in terms of:

the time within which a pre-defined team of core staff and stated minimum facilities must be recovered
the timetable for recovery of remaining staff and facilities.

It may not always be possible to provide the recovery requirements to a detailed level. There is a need to balance the potential impact against the cost of recovery to ensure that the costs are acceptable. The recovery objectives do, however, provide a starting point from which different business recovery and ITSCM options can be evaluated.

KEY MESSAGE

The Business Impact Analysis identifies the minimum critical requirements to support the business.

Risk Assessment

The second driver in determining ITSCM requirements is the likelihood that a disaster or other serious service disruption will actually occur. This is an assessment of the level of threat and the extent to which an organisation is vulnerable to that threat. This is demonstrated in Figure 7.5.

Figure 7.5 - Risk Assessment Model

See Paragraph 8.9.3 for definitions of Risk Analysis and Risk Management. The top section of Figure 7.5 refers to the Risk Analysis - if an organisation's assets are highly valued and there is a high threat to those assets and the Vulnerability of those assets to those threats is high, there would be a high risk. The bottom section of Figure 7.5 shows Risk Management - where Countermeasures are applied to manage the business risks by protecting the assets.

As a minimum, the following risk assessment activities should be performed:

Identify risks - i.e. risks to particular IT Service components (assets) that support the business process and which could cause an interruption to service.
Assess threat and vulnerability levels - the threat is defined as 'how likely it is that a service disruption will occur' and the vulnerability is defined as 'whether, and to what extent, the organisation will be affected by the threat materialising'. A threat is dependent on such factors as:

likely motivation, capability and resources for deliberate service disruptions such as malicious damage to computer systems, commercial failure of a key technology provider, attack against an organisation's web servers and corruption of Internet sites
for accidental service disruptions, the organisation's location, environment, and quality of internal systems and procedures.

Business processes are vulnerable where there are single points of failure for the delivery of IT Services (for example, a travel agent relies on information feeds for flight bookings; if the link were to fail and no backup is available, flights cannot be sold).

Assess the levels of risk - the overall risk can then be measured. This may be done quantitatively, as a measurement, if quantitative data has been collected, or qualitatively, using a subjective assessment of, for example, low, medium or high. An example of a tabular format used to express the level of risk is illustrated in Figure 7.6. Each risk can be assessed in terms of the associated threat and vulnerability. Using the table in Figure 7.6 it is possible to determine the probability of specified risks occurring (e.g. a high threat and high vulnerability implies a high probability of occurrence).

Figure 7.6 - Risk Measurement table
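A qualitative risk table of this kind can be expressed as a simple lookup. The sketch below is one possible encoding and does not reproduce Figure 7.6: only the high threat and high vulnerability case comes from the text, and the scoring rule used for every other combination is an assumption made for illustration.

```python
LEVELS = ("low", "medium", "high")


def risk_level(threat: str, vulnerability: str) -> str:
    """Combine qualitative threat and vulnerability ratings into a risk level.

    Assumed rule: add the positions of the two ratings in LEVELS;
    a sum of 3 or 4 is high, 2 is medium, 0 or 1 is low.
    Unknown ratings raise ValueError via tuple.index.
    """
    score = LEVELS.index(threat) + LEVELS.index(vulnerability)
    if score >= 3:
        return "high"
    if score == 2:
        return "medium"
    return "low"


# The one combination stated in the text:
example = risk_level("high", "high")
```

An organisation using a formal method such as CRAMM would take its ratings and combination rules from that method instead.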

There are many tools and methodologies available to assist in the measurement of risks, of which the preferred solution is CRAMM (see Paragraph 8.9.3).

Hints and tips

ITSCM needs to consider and assess potential risks and reduction measures across the whole Infrastructure.

Following the Risk Analysis it is possible to determine appropriate countermeasures or risk reduction measures (ITSCM mechanisms) to manage the risks, i.e. reduce the risk to an acceptable minimum level or mitigate the risk.

In the context of ITSCM there are a number of risks that need to be taken into consideration. Table 7.1 provides a checklist of some of the risks and threats to be considered by the IT Manager:

Risk	Threat
Loss of internal IT systems/networks, PABXs, ACDs, etc.	Fire; Power failure; Arson and vandalism; Flood; Aircraft impact; Weather damage, e.g. hurricane; Environmental disaster; Terrorist attack; Sabotage; Catastrophic failure; Electrical damage, e.g. lightning; Accidental damage; Poor quality software
Loss of external IT systems/networks, e.g. e-commerce servers, cryptographic systems, etc.	All of the above; Excessive demand for services; Denial of service attack, e.g. against an Internet firewall; Technical failure, e.g. cryptographic systems
Loss of data	Technical failure; Human error; Viruses, malicious software, e.g. attack applets
Loss of network services	Damage or denial of access to network service providers' premises; Loss of service provider's IT systems/networks; Loss of service provider's data; Failure of the service providers
Unavailability of key technical and support staff	Industrial action; Denial of access to premises; Resignation; Sickness/Injury; Transport difficulties
Failure of service providers, e.g. outsourced IT	Commercial failure, e.g. insolvency; Denial of access to premises; Unavailability of service provider's staff; Failure to meet contractual service levels

Table 7.1 - Risks and Threats to be addressed by the IT Manager

Hints and tips

The risk assessment identifies specific risks to the organisation. Many of these risks concern the ability to provide Continuity of IT Services. Failure to assess all the relevant risks will result in an incomplete risk assessment leaving the business exposed to disruption.

Business Continuity Strategy

The information collated in the impact analysis and the risk assessment, and the associated ITSCM mechanisms chosen, enable an appropriate strategy for the organisation to be developed with an optimum balance of risk reduction and recovery or Continuity options. This includes consideration of the relative service recovery priorities and the changes in relative service priority for the time of day, day of the week, and monthly and annual variations.
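The way a service's recovery priority can vary with the calendar is easiest to see in a small example. The sketch below is purely illustrative: the service names, the month-end and trading-hours rules, and the priority numbers (1 meaning recover first) are all hypothetical assumptions, not guidance.

```python
from datetime import datetime


def recovery_priority(service: str, when: datetime) -> int:
    """Return an illustrative recovery priority (1 = recover first)."""
    if service == "payroll":
        # Assumed rule: payroll is most critical in the last week of the month.
        return 1 if when.day >= 24 else 3
    if service == "dealing":
        # Assumed rule: dealing is most critical on weekdays, 08:00 to 17:00.
        in_trading_hours = when.weekday() < 5 and 8 <= when.hour < 17
        return 1 if in_trading_hours else 2
    # Any service without a specific rule gets the lowest priority.
    return 3


# 3 January 2024 was a Wednesday; 6 January 2024 was a Saturday.
midweek = recovery_priority("dealing", datetime(2024, 1, 3, 10))
weekend = recovery_priority("dealing", datetime(2024, 1, 6, 10))
```

In practice such rules would be agreed with the business during the BIA and recorded alongside the recovery objectives.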

As businesses become more dependent on, and driven through, the use and Availability of technology (e.g. e-commerce developments), ITSCM elements become a more integral part of the overall Business Continuity Strategy. Referring back to Figure 7.4, an organisation that identifies high impacts in the short term will want to concentrate efforts on preventative risk reduction methods, e.g. through full resilience and fault tolerance, while an organisation that has low short-term impacts would be better suited to comprehensive recovery options.

Risk Reduction Measures

Most organisations have to adopt a balanced approach where risk reduction and recovery are complementary and both are required. This entails reducing, as far as possible, the risks to the continued provision of the IT Service and is usually achieved through Availability Management. However well planned, it is impossible to completely eliminate all risks - for example, a fire in a nearby building will probably result in damage, or at least denial of access, as a result of the implementation of a cordon. As a general rule, the Invocation of a recovery capability should only be taken as a last resort. Ideally, an organisation should assess all of the risks to reduce the potential requirement to recover the business and/or IT Services.

Example

A financial institution dealing in the equities market relies on high Availability of market information and computer systems to analyse that information. Failure of the market data feeds would mean that the business process fails with an immediate financial impact to the organisation. The failure may result from a failure of the information provider (in which case competitors may suffer the same loss so the impact is lessened) so, to prevent this, the organisation takes feeds from multiple providers. Alternatively there may be a technical failure of the equipment or damage to the location where the feeds enter the building (in which case competitors are unaffected and the impacts are greater) so the organisation establishes at least two entry points and alternative equipment on immediate Availability.

Typical risk reduction measures include:

a comprehensive backup and recovery strategy, including off-site storage
the elimination of single points of failure such as a single power supply into a building or power supply from a single utility organisation
Outsourcing services to more than one provider
resilient IT systems and networks constantly change-managed to ensure maximum performance in meeting the increasing business requirements
greater security controls such as a physical access control system using smartcards
better controls to detect local service disruptions such as fire detection systems coupled with suppression systems
improving procedures to reduce the likelihood of errors or failures such as Change control.

Hints and tips

Outsourcing mainframe processing to a third party who provides the service remotely means that a service disruption affecting the organisation's building will not necessarily affect the Availability of the Host system. Outsourcing to different third parties will have the benefit of reducing the risk of a major failure as component parts of the service will always be available. This can be likened to a bookmaker 'laying off' bets to reduce the exposure on a particular gamble. This does, of course, assume the Availability of resilient networks to maintain Continuity of service and the fact that the third party has itself an effective and tested Service Continuity Plan.

The above measures will not necessarily solve an ITSCM issue and remove the risk totally, but all or a combination of them may significantly reduce the risks associated with the way in which services are provided to the business. As with recovery options, it is important that the reduction of one risk does not increase another. The risk to the Availability of systems and data may be reduced by outsourcing to an off-site third party; however, this potentially increases the risk of compromise of confidential information unless rigorous security controls are applied.

It is important that organisations check that recovery and ITSCM options selected are capable of implementation and integration at the time they are required, and that the required service recovery can be achieved.

KEY MESSAGE

An organisation's ITSCM strategy is a balance between the cost of risk reduction measures and recovery options to support the recovery of critical business processes within agreed timescales.

Recovery Options

Recovery options need to be considered for:

People and accommodation - including alternative premises either owned, leased or through agreement with a third party; reciprocal arrangements with other organisations; and rapid procurement of alternative premises or refurbishment of existing premises. Consideration should also be given to the respective location of the proposed premises, the mobility of the staff who will be supporting the recovered business operations including IT staff and the total number of staff required to support the business process.
IT systems and networks - these options need to be identified and agreed by the IT Manager responsible for ITSCM and include recovery of IT systems, hardware, applications, software and networks, and the data used within these systems and facilities. This relies on the Availability of effective backups to enable restoration of the service and needs to be performed in collaboration with Availability Management. This strategy should also include the implementation of Continuity mechanisms to support local disruption/interruption of IT Services supporting critical business processes, such as disk mirroring, UPS or dual power supplies, dual communication links, etc.
Critical services such as power, telecommunications, water, couriers and post.
Critical assets such as paper records and reference material.

There may be a need to consider different options for short-term and long-term recovery. Where business processes are highly dependent on external service providers, there is a need to consider the options to address failure of, or peak contention for, the services.

The costs and benefits of each option need to be analysed. This involves a comparative assessment of the:

ability to meet the business recovery objectives
likely reduction in the potential impact
costs of establishing the option
costs of maintaining, testing and invoking the option
technical, organisational, cultural and administrative implications against the risk of disruption or disaster and the potential impact if no action is taken.
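The financial part of this comparison can be sketched as a simple shortlist calculation. The example below is a minimal illustration under stated assumptions: the class, the net benefit formula (impact reduction minus establishment and running costs) and all figures are hypothetical, and the technical, organisational, cultural and administrative implications are deliberately left out because they do not reduce to a single number.

```python
from dataclasses import dataclass


@dataclass
class RecoveryOption:
    name: str
    recovery_hours: float     # how quickly the option restores service
    impact_reduction: float   # likely reduction in the potential impact
    setup_cost: float         # cost of establishing the option
    running_cost: float       # cost of maintaining, testing and invoking it

    @property
    def net_benefit(self) -> float:
        # Assumed measure: benefit minus all costs, in the same currency unit.
        return self.impact_reduction - (self.setup_cost + self.running_cost)


def shortlist(options, required_hours):
    """Keep options that meet the recovery objective, best net benefit first."""
    viable = [o for o in options if o.recovery_hours <= required_hours]
    return sorted(viable, key=lambda o: o.net_benefit, reverse=True)


# Hypothetical figures, for illustration only.
options = [
    RecoveryOption("gradual", 96, 200_000, 20_000, 10_000),
    RecoveryOption("intermediate", 48, 400_000, 80_000, 40_000),
    RecoveryOption("immediate", 1, 450_000, 300_000, 120_000),
]
best = [o.name for o in shortlist(options, required_hours=72)]
```

Here the gradual option is excluded because it cannot meet the assumed 72-hour objective, and the intermediate option ranks first because the immediate option's extra cost outweighs its extra benefit.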

When undertaking the analysis there is a need to consider whether the introduction of an option will adversely affect other risks.

Hints and tips

Do not forget to check the organisation's insurance provision to determine whether adequate cover is provided.

IT Recovery Options

There are a number of options that can be considered by IT to provide contingency:

Do nothing

Few, if any, organisations can function effectively without IT Services. Even if there is a requirement for stand-alone PC processing, there is still a need for recovery to be supported.

Manual Work-arounds

IT facilities enable organisations to process information much more quickly and efficiently. Indeed the justification for much IT spend is made on the basis of a reduced headcount. In some organisations, such as the finance, banking and insurance industries, complex calculations are undertaken by applications which would be difficult to reproduce manually in a short period of time. They are dependent upon a succession of calculations by different systems with information fed between them or are dependent upon information being fed to them from external sources. However, manual Work-arounds can be an effective interim measure until the IT Service is resumed wherever they are practical and possible.

Reciprocal arrangements

Entering into an agreement with another organisation using similar technology used to be an effective contingency option when the computing workload was essentially batch processing. Today, the distributed computing environment means that there is a much greater requirement for individual processing power and high Availability, which suggests that this is not a practical solution and may not support an effective resumption of service. In addition, there are maintenance difficulties in keeping reciprocal arrangements in step and increased need for security. Benefits can exist, however, in maintaining some reciprocal arrangements, for example, in the off-site storage of backups and other critical information.

Gradual Recovery

This option (sometimes referred to as 'cold standby') is applicable to organisations that do not need immediate restoration of business processes and can function for a period of up to 72 hours, or longer, without a re-establishment of full IT facilities. This may include the provision of empty accommodation fully equipped with power, environmental controls and local network cabling Infrastructure, telecommunications connections, and available in a disaster situation for an organisation to install its own computer equipment.

The accommodation may be provided commercially by a third party, for a fee, or may be private, (established by the organisation itself) and provided as either a fixed or portable service. A fixed facility may be located at the premises of the third party that provides the service, or specially built at a location owned by the subscriber. There is a need to ensure that all services including telecommunications, market data feeds, etc. are established and adequate accommodation is available to house staff involved in the recovery process.

A portable facility is typically a prefabricated building provided by a third party and located when needed at a predetermined site agreed with the organisation. This may be in a car park or another location some distance from the home site, perhaps another owned building.

The organisation calls on contracts for the supply of required computer equipment including PCs, servers, and mini computers. The organisation or the contractor (whichever has been formally pre-agreed) then configures the equipment to the organisational requirements and loads all data before a service can be provided.

Third parties rarely guarantee to supply replacement equipment within a fixed deadline, but would normally do so on a best-efforts basis.

When opting for a gradual recovery, consideration must be given to highly customised items of hardware or equipment that will be difficult, if not impossible, to replace if no spares are kept securely by the organisation. Other contingency measures may be needed to cope with having to use different equipment. The same difficulties apply to items supplied by organisations that have since gone out of business and alternatives need to be identified, possibly putting the Service Delivery at risk due to delays or potential Problems.

Intermediate Recovery

This option (sometimes referred to as 'warm standby') is selected by organisations that need to recover IT facilities within a predetermined time to prevent impacts to the business process. This typically involves the re-establishment of the critical systems and services within a 24 to 72 hour period.

Most common is the use of commercial facilities, which are offered by third party recovery organisations to a number of subscribers, spreading the cost across those subscribers. Commercial facilities often include operation, system management and technical support. The cost varies depending on the facilities requested such as processors, peripherals, communications, and how quickly the services must be restored (invocation timescale).

The advantage of this service is that the Customer can have virtually instantaneous access to a site, housed in a secure building, in the event of disaster. It must be understood, however, that the restoration of services at the site may take some time as delays may be encountered while the site is re-configured for the organisation that invokes the service, and the organisation's applications and data will need to be restored from backups.

There is a disadvantage in that the site is almost certainly some distance from the home site, which presents a number of logistical problems. The positions are shared (usually up to 20 to 30 times) with other organisations so there can be no guarantee of Availability if a service disruption were to affect two organisations at the same time. There is a need to ensure that a recovery organisation is not providing the same services for firms within an immediate geographical area. This is well understood by the recovery organisations, who apply good Risk Management to the sale of the positions in order to reduce the risk of multiple invocations. It is also a fairly expensive option and can be likened to insurance. What is being paid for is peace of mind. In recent years the number of recovery centres has increased considerably and, together with the falling cost of computer hardware, good deals can be negotiated for 3, 5, or 7-year contracts.

If the site is invoked, there is often a daily fee for use of the service in an emergency, although this may be offset against additional cost of working insurance. Most commercial agreements limit invocation access to a pre-determined length of time, typically between 6 and 12 weeks, and therefore longer-term options are also required.

It is important that any arrangements of this sort include adequate opportunity for testing at the contingency site.

Commercial recovery services can be provided in portable form where an agreed system is delivered to a Customer's site, within a certain time, typically 24 hours. The computer equipment is contained in a trailer and transported to the site by truck. The trailer is fitted out as a computer environment with the necessary services and only needs power and telecommunications links from the site to the trailer for the service to be established. Special measures may need to be taken to make the site secure.

The service provider normally charges an annual fee for such a service, and there is often a 'call-out' charge if the service is invoked. However, in some circumstances, such as when there is damage to the site or when an exclusion zone is applied by the emergency services to the site, this option cannot be used.

An advantage of this approach is that the trailer can be installed close to the main site subject to the necessary parking consents having been obtained. Parking a trailer on a busy road in a city is likely to draw the unwelcome attention of the police who may insist on removal.

Organisations with alternative locations may opt for a mutual fallback arrangement where accommodation is provided through displacement of non-critical staff at the unaffected building and computer facilities provided via mobile recovery.

Immediate Recovery

This option (sometimes referred to as 'hot standby') provides for immediate restoration of services and is usually provided as an extension to the intermediate recovery provided by a third party recovery provider. The immediate recovery is supported by the recovery of other critical business and support areas during the first 24 hours following a service disruption. Instances where immediate recovery may be required are where loss of service has an immediate impact on the organisation's ability to make money, such as a Bank's dealing room.

Where there is a need for a fast restoration of a service, it is possible to 'rent' floorspace at the recovery site and install servers or systems with application systems and communications already available and data mirrored from the operational servers. In the event of a system failure, the Customers can immediately switch to the backup facility with little or no loss of service.

In the case of building loss or denial of access an organisation can pay for a limited number of exclusive positions at a recovery centre. This is a highly expensive option and is not appropriate for the majority of organisations. However, these positions are always available and ready for immediate occupation and use.

Some organisations may identify a need for their own exclusive immediate recovery facilities provided internally. This again is an expensive option but may be justified for a certain business process where non-Availability for a short period could result in a significant impact. The facility needs to be located separately and far enough away from the home site that it will not be affected by a disaster affecting that location.

For highly critical business processes, a mirrored service can be established at an alternative location, which is kept up to date with the live service, either by data transfer at regular intervals or by replication from the live service. Such a service could be used merely as a backup service, but it might also be used for enquiry access (such as reporting) without affecting the live processing performance. This is also useful if there are legal or legislative obligations to safeguard the completeness and integrity of all financial records. As this is essentially spare Capacity, it can be used under normal circumstances for development, training or testing, but could be made available immediately when a Service Continuity situation demands it.

The ultimate solution is to have a mirrored site with duplicate equipment as part of the live operation. However, these mirrored server and site options should be implemented in close liaison with Availability Management.

It is important to distinguish between the previous definition of 'hot standby' and 'immediate recovery'. Hot standby typically referred to Availability of services within a short timescale such as 2 or 4 hours, whereas immediate recovery implies the instant Availability of services. A recovery plan for an organisation will include a combination of some or all of these scenarios, for example: instant recovery for critical business processing, 4 hour recovery for additional business processes, 8 hour recovery for key support services, and the other business areas recovered as and when required.
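
The tiering described above can be illustrated with a minimal Python sketch. It assumes each business process has a tolerated outage in hours taken from the Business Impact Analysis. The tier boundaries come from the example in the text, while the function name, process names and figures are hypothetical.

```python
from typing import Optional

# Tier limits in hours, following the example in the text.
TIERS = [
    (0, "instant recovery"),
    (4, "4 hour recovery"),
    (8, "8 hour recovery"),
]
FALLBACK_TIER = "as and when required"


def recovery_tier(tolerated_outage_hours: Optional[float]) -> str:
    """Pick the slowest tier that still restores service within the tolerated outage.

    None means the business has stated no time requirement for the process.
    """
    if tolerated_outage_hours is None:
        return FALLBACK_TIER
    if tolerated_outage_hours < 0:
        raise ValueError("tolerated outage cannot be negative")
    chosen = TIERS[0][1]
    for limit_hours, name in TIERS:
        if limit_hours <= tolerated_outage_hours:
            chosen = name
    return chosen


# Hypothetical processes and tolerated outages (hours).
processes = {
    "dealing room": 0,
    "customer payments": 5,
    "service desk": 8,
    "staff training": None,
}
plan = {name: recovery_tier(hours) for name, hours in processes.items()}
```

A process that tolerates 5 hours is placed in the 4 hour tier, because the 8 hour tier would restore it too late.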

7.3.3 Stage 3 - Implementation

Figure 7.7 - Implementation

Once the strategy has been agreed the Business Continuity lifecycle moves into the implementation stage (see Figure 7.7), involving IT at a detailed level. The implementation stage consists of the following processes:

establish the organisation and develop implementation plans
implement Stand-by arrangements
implement risk reduction measures
develop IT recovery plans
develop procedures
undertake initial tests.

Each of the above is considered with respect to the specific responsibilities that IT must action.

Organisation planning

The IT function is responsible for the provision of IT Services to support the business requirements identified during the Business Impact Analysis and requirements definition. However, for recovery purposes, IT in itself only forms part of the overall Command, control and communications structure. The structure is based around three tiers:

Executive - including senior management / executive board with overall authority and control within the organisation and responsible for Crisis management and liaison with other departments, divisions, organisations, the media, regulators, emergency services etc.
Co-ordination - typically one level below the Executive group and responsible for co-ordinating the overall recovery effort within the organisation.
Recovery - a series of business and service recovery teams representing the critical business functions and the services that need to be established to support these functions. Each team is responsible for executing the plans within their own areas and for liaison with staff, Customers and third parties. Within IT the recovery teams should be grouped by IT Service and application, for example the Infrastructure team may have one or more people responsible for recovering external connections, voice services, local area networks, etc., the support teams may be split by platform, operating system or application. In addition, the recovery priorities for the service, application or its components identified during the Business Impact Analysis should be documented within the recovery plans and applied during their execution.

Implementation planning

Plan development is one of the most important parts of the implementation process and without workable plans the process will certainly fail. At the highest level there is a need for an overall co-ordination plan that includes:

Emergency Response Plan
Damage Assessment Plan
Salvage Plan
Vital Records Plan
Crisis Management and Public Relations Plan.

These plans are used to identify and respond to a service disruption, ensure the safety of all affected staff members and visitors and determine whether there is a need to implement the business recovery process. If so, then the next level of plans are invoked which include the key support functions such as:

Accommodation and Services Plan
Computer Systems and Network Plan
Telecommunication Plan
Security Plan
Personnel Plan
Finance and Administration Plan.

Finally, each critical business area is responsible for the development of a plan detailing the individuals who will comprise the recovery team and a detailed task list to be undertaken on invocation of recovery arrangements. The owners of each plan must ensure that they have identified and agreed support and services from other parties upon whom they rely for a service or resource.

The ITSC Plan must contain all the information needed to recover the computer systems, network and telecommunications in a disaster situation once a decision to invoke has been made and then to manage the business return to normal operation once the service disruption has been resolved. There is a need to consider the various sources of information that are required in the development of the plan and these include the minimum requirements identified through the Business Impact Analysis, Service Level Agreements, security requirements, operating instructions and procedures, and external contracts. This plan will be complemented by the other plans such as the Personnel plan that will address the need for transport and accommodation of key IT recovery personnel to the recovery site, or the use of overnight accommodation for critical staff. This is especially important if the recovery site has to be used for extended periods of time (e.g. weeks).

As part of the implementation planning process, it is vitally important to review key and critical contracts required to deliver business critical services. These contracts should be reviewed to ensure that, if appropriate, they provide a BCM service, there is a defined Service Level agreed and the contracts are still valid and in-force if operations have to switch to the recovery site (either wholly or partially). If contracts do not include these details, then the service criticality should be reviewed and the risks associated with the service not being provided should be assessed.

Hints and tips

Check all key and critical contracts to ensure that they provide a BCM service if required, have SLAs defined for business as usual, and remain valid if the recovery site has to be invoked.

Implement risk reduction measures

The risk reduction measures detailed in Paragraph 7.3.2 need to be implemented. This is often achieved in conjunction with Availability Management as many of these reduce the probability of failure affecting the Availability of service. Typical risk reduction measures include such things as:

installation of UPS and back-up power to the computer
fault tolerant systems for critical applications where even minimal downtime is unacceptable, for example, a bank dealing system
offsite storage and archiving
RAID arrays and disk mirroring for LAN servers to protect against data loss and to ensure continued Availability of data
spare equipment / components to be used in the event of equipment or component failure, for example, a spare LAN server already configured with the standard configuration and available to replace a faulty server with minimum build and configuration time.

Some of the ITSCM measures that can be implemented to maintain the Availability of services during a localised disruption are described in more detail in Chapter 8.

Implement stand-by arrangements

The recovery options were detailed in Paragraph 7.3.2. It is important to remember that the recovery is based around a series of stand-by arrangements including accommodation as well as systems and telecommunications. Certain actions are necessary to implement the stand-by arrangements, for example:

negotiating for third party recovery facilities and entering into a contractual arrangement
preparing and equipping the stand-by accommodation
purchasing and installing stand-by computer systems
negotiating with external service providers on their ITSC Plans and undertaking due diligence if necessary.

Example

In a manufacturing organisation, there may be a reliance on a critical component produced by a third party without which the product cannot be manufactured. Unavailability of this component would severely disrupt, if not stop, the business process, so the Organisation must satisfy itself that appropriate contingency arrangements are in place to maintain the manufacture and delivery of the component.

A call centre that relies on telecommunications services should ensure that they do not rely on a single provider. Failure of that provider would mean that there would be no capability to make or receive telephone calls, which would have a major impact on their reputation.

Training and new procedures may be required to operate, test and maintain the stand-by arrangements and to ensure that they can be initiated when required.

Develop ITSCM plans

ITSCM plans need to be developed to enable the necessary information for critical systems, services and facilities to either continue to be provided or to be reinstated within an acceptable period to the business. Generally the Business Continuity Plans rely on the Availability of IT systems and facilities. As a consequence of this ITSCM plans need to address all activities to ensure that the required systems and facilities are delivered in an acceptable operational state and are 'fit for purpose' when accepted by the business. This entails not only the restoration of systems and facilities, but also the understanding of dependencies between them, the testing required prior to delivery (performance, functional, operational and acceptance testing), and the validation of data integrity and consistency.

The criticality and priority of services, systems and facilities needs to be communicated by the Business Continuity planners for inclusion in the ITSCM plans. This ensures that disruptions are dealt with in the priority order required by the business and subject to system interdependencies. In addition, mutual agreement of the provision of the service Infrastructure and the completeness of the ITSCM planning is then possible.
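
As an illustration of recovering in business priority order subject to system interdependencies, the following Python sketch orders services so that nothing is recovered before the services it depends on, and picks the most critical service among those that are ready. The function name, service names and priority values are assumptions made for the example. It is one possible ordering rule, not a prescribed method.

```python
import heapq
from typing import Dict, List, Set


def recovery_order(priority: Dict[str, int],
                   depends_on: Dict[str, Set[str]]) -> List[str]:
    """Order services for recovery (1 = most critical).

    A service becomes eligible only once everything it depends on has
    been recovered; among eligible services the most critical goes first.
    """
    remaining = {name: set(depends_on.get(name, ())) for name in priority}
    dependants: Dict[str, List[str]] = {name: [] for name in priority}
    for name, deps in remaining.items():
        unknown = deps - priority.keys()
        if unknown:
            raise ValueError(f"{name} depends on unknown services: {sorted(unknown)}")
        for dep in deps:
            dependants[dep].append(name)

    ready = [(priority[n], n) for n, deps in remaining.items() if not deps]
    heapq.heapify(ready)
    order: List[str] = []
    while ready:
        _, name = heapq.heappop(ready)
        order.append(name)
        for dependant in dependants[name]:
            remaining[dependant].discard(name)
            if not remaining[dependant]:
                heapq.heappush(ready, (priority[dependant], dependant))

    if len(order) != len(priority):
        raise ValueError("circular dependency between services")
    return order


# Hypothetical priorities and dependencies.
priority = {"dealing system": 1, "email": 3, "network": 2, "database": 2}
depends_on = {"dealing system": {"network", "database"}, "email": {"network"}}
order = recovery_order(priority, depends_on)
```

Here the dealing system is the most critical service, but it is recovered only after the network and database on which it relies.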

Management of the distribution of the plans is important to ensure that copies are available to key staff at all times. The plans should be controlled documents (with formalised document control, Change control and distribution) to ensure that only the latest versions are in circulation and each recipient should ensure that a personal copy is maintained off-site.

In addition, ensure that:

there are sufficient details to enable a technical person not familiar with the system to be able to follow the procedures - involve people who are not familiar with the system to perform a recovery test
the recovery plans include key detail such as the data recovery point, a list of dependent systems, the nature of the Dependency and their data recovery points, system hardware and software requirements, configuration details and references to other relevant or essential information about the system
a check-list is included that covers specific actions required during all stages of recovery for the system, for example after the system has been restored to an operational state, connectivity checks, functionality checks or data consistency and integrity checks should be carried out prior to handing the system over to the business.
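
The key details listed above could be held in a simple structured record per system. The Python sketch below is one possible layout. All class and field names are hypothetical, and the three hand-over checks are those named in the check-list example.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Dependency:
    system: str                # the system relied upon
    nature: str                # nature of the dependency
    data_recovery_point: str   # recovery point of the dependent system


@dataclass
class SystemRecoveryRecord:
    system: str
    data_recovery_point: str
    hardware_requirements: List[str] = field(default_factory=list)
    software_requirements: List[str] = field(default_factory=list)
    configuration_details: str = ""
    dependencies: List[Dependency] = field(default_factory=list)
    references: List[str] = field(default_factory=list)
    # Checks to complete before handing the system over to the business.
    handover_checks: Dict[str, bool] = field(default_factory=lambda: {
        "connectivity": False,
        "functionality": False,
        "data consistency and integrity": False,
    })

    def outstanding_checks(self) -> List[str]:
        """Names of the checks that have not yet been completed."""
        return [name for name, done in self.handover_checks.items() if not done]

    def ready_for_handover(self) -> bool:
        return not self.outstanding_checks()


# Hypothetical example record.
record = SystemRecoveryRecord(
    system="order processing",
    data_recovery_point="close of business, previous day",
    dependencies=[Dependency("customer database", "nightly feed",
                             "close of business, previous day")],
)
record.handover_checks["connectivity"] = True
```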

Develop procedures

The ITSCM plan is dependent on specific technical tasks being undertaken. It is necessary that these are fully documented and comprehensive so that any literate IT person can undertake the recovery. Procedures need to be developed to include the:

installation and testing of replacement hardware and networks
restoration of software and data to a common reference point which is consistent across all business processes
different time zones in a multinational organisation
business cut-off points.

Many procedures may already exist, such as the procedures for restoring systems and data in the event of equipment failure, and these should be refined and attached to the ITSC Plan as an Appendix.

Initial testing

KEY MESSAGE

Testing is a critical part of the overall ITSCM process and is the only way of ensuring that the selected strategy, stand-by arrangements, logistics, Business recovery plans and procedures will work in practice.

IT is responsible for the provision of the technical components and testing that these function effectively. An initial technical test can usually be done without the need to involve the business. However, for subsequent tests it is prudent to get the business to be involved to 'prove' the capability and to aid mutual understanding of the activities and resources needed to achieve the common goal of business recovery.

A full test needs to replicate as far as possible the invocation of all stand-by arrangements, including the recovery of business processes and the involvement of external parties. This tests completeness of the plans and confirms:

time objectives, e.g. to recover the key server applications within a certain number of hours
staff preparedness and awareness
staff duplication and potential over commitment of key resources, e.g. a system administrator being required to support a number of modular plans (Service Desk, operations, networks and communications)
the responsiveness, effectiveness and awareness of external parties.
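
The check for staff duplication across plans can be sketched in a few lines of Python. The plan names follow the example in the list above. The staff names and the `over_committed` function are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def over_committed(plan_staff: Dict[str, Iterable[str]],
                   max_plans: int = 1) -> Dict[str, List[str]]:
    """Return each person named in more than max_plans plans, with those plans."""
    plans_by_person = defaultdict(set)
    for plan, staff in plan_staff.items():
        for person in staff:
            plans_by_person[person].add(plan)
    return {
        person: sorted(plans)
        for person, plans in plans_by_person.items()
        if len(plans) > max_plans
    }


# Hypothetical assignments of staff to modular plans.
plan_staff = {
    "Service Desk": ["J. Smith", "A. Patel"],
    "operations": ["J. Smith"],
    "networks and communications": ["J. Smith", "L. Chen"],
}
conflicts = over_committed(plan_staff)
```

Anyone returned by the function is a candidate for a deputy or for reassignment before the next test.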

Tests may be announced or unannounced. In the latter case, however, it is necessary to ensure that senior management approval is obtained in advance, otherwise it may be difficult to achieve commitment. All tests need to be undertaken against defined test scenarios, which are described as realistically as possible. It should be noted, however, that even the most comprehensive test does not cover everything. For example, in a service disruption where there has been injury or even death to colleagues, the reaction of staff to a crisis cannot be tested and the plans need to make allowance for this. In addition, tests must have clearly defined objectives and critical success factors which will be used to determine the success or otherwise of the exercise.

7.3.4 Stage 4 - Operational Management

Figure 7.8 - Operational Management

Once the implementation and planning has been completed there is a need to ensure that the process is maintained as part of business as usual. This is achieved through operational management (see Figure 7.8) and includes:

Education and awareness - this should cover the organisation and in particular, the IT organisation, for Service Continuity-specific items. This ensures that all staff are aware of the implications of Business Continuity and of Service Continuity and consider these as part of their normal working routine and budget.
Training - IT may be involved in training the non-IT literate Business recovery team members to ensure that they have the necessary level of competence to facilitate recovery.
Review - regular review of all of the deliverables from the ITSCM process needs to be undertaken to ensure that they remain current. With respect to IT this is required whenever there is a major Change to the IT Infrastructure, assets or dependencies such as new systems or networks or a change in service providers, as well as when there is a change in business direction, business strategy or IT strategy. As organisations typically have rapid change, it is necessary to invest in an ongoing review Programme and incorporate ITSCM into the organisational business justification processes. New requirements will be implemented in accordance with the Change control process.
Testing - following the initial testing it is necessary to establish a programme of regular testing to ensure that the critical components of the strategy are tested at least annually or as directed by senior management or audit. It is important that any changes to the IT Infrastructure are included in the strategy, implemented in an appropriate fashion and tested to ensure that they function correctly within the overall provision of IT Services.
Change control - following tests and reviews and in response to day to day Changes, there is a need for the ITSCM plans to be updated. ITSCM must be included as part of the Change Management process to ensure that any Changes in the Infrastructure are reflected in the contingency arrangements provided by IT or third parties. Inaccurate plans and inadequate recovery capabilities may result in the failure of ITSCM. Further guidance is provided in Chapter 8 in the Service Support book.
Assurance - the final process in the ITSCM lifecycle involves obtaining assurance that the quality of the ITSCM deliverables is acceptable to senior business management and that the operational management processes are working satisfactorily.

7.3.5 Invocation

Invocation is the ultimate test of the Business Continuity and ITSCM plans. If all the preparatory work has been successfully completed and plans developed and tested then an invocation of the Business Continuity Plans should be a straightforward process.

Invocation is a key component of the plans, which must include the invocation process and guidance. It should be remembered that the decision to invoke, especially if a third party recovery facility is to be used, should not be taken lightly. Costs will be involved and the process will involve disruption to the business. This decision is typically made by a 'crisis management team' comprising senior managers from the business and support departments (including IT) using information gathered through damage assessment and other sources.

A disruption could occur at any time of the day or night, so it is essential that guidance on the invocation process is readily available. Plans must be available both in the office and at home, and key members of the crisis management team should be issued with a short aide memoire, which they must keep with them at all times, detailing:

the locations of these plans
the associated key actions and decision points
contact details of the crisis management team.

The decision to invoke must be made quickly, as there may be a lead-time involved in establishing facilities at a recovery site. In the case of a building fire, the decision is fairly easy to make. In the case of a power failure, however, where a Resolution is expected within a short period, a deadline should be set by which time, if the Problem has not been resolved, invocation will take place. This deadline will be established by the crisis management team working back from the critical point by which the business processes must be re-established to prevent an unacceptable impact to the organisation.
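
Working back from the critical point is simple date arithmetic, as the following Python sketch shows. It assumes the lead-time needed to establish facilities at the recovery site is known. The function names, dates and durations are hypothetical.

```python
from datetime import datetime, timedelta


def invocation_deadline(critical_point: datetime,
                        recovery_lead_time: timedelta) -> datetime:
    """Latest time at which invocation can begin and still meet the critical point."""
    return critical_point - recovery_lead_time


def must_invoke(now: datetime, problem_resolved: bool, deadline: datetime) -> bool:
    """Invoke if the deadline has been reached and the Problem is unresolved."""
    return not problem_resolved and now >= deadline


# Hypothetical figures: processing must be re-established by 18:00 and the
# recovery site needs 6 hours to be made ready.
critical_point = datetime(2024, 3, 4, 18, 0)
lead_time = timedelta(hours=6)
deadline = invocation_deadline(critical_point, lead_time)
```

With these figures, the crisis management team would invoke at 12:00 if power had not been restored by then.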

Hints and tips

Whenever there is a situation where invocation may be required, put the recovery service provider on alert immediately so that facilities can be made available as quickly as possible if a decision to invoke is made.

The decision to invoke needs to take into account a number of factors:

the extent of the damage and scope of the potential invocation
the likely length of the disruption and Unavailability of premises and/or services
the time of day/month/year and the potential business impact. At year end the need to invoke may be more pressing to ensure that year-end processing is completed on time
specific requirements of the business depending on work being undertaken at the time.

Once the crisis management team has decided to invoke business recovery facilities, there is a need to communicate this within the organisation. This is typically done through the use of call trees, a mechanism for communicating quickly and efficiently with identified recovery personnel throughout the organisation. The crisis management plan should include details of key personnel to be contacted to initiate the business and ITSCM recovery plans. Within each of these plans, contact details for essential personnel (and their deputies) should be included to enable the plans to be initiated.

Hints and tips

It is vital to ensure that the message has been passed to all essential personnel involved in the recovery process. The last person to receive the message should 'close the loop' by contacting the initiator and confirming the action to be taken.
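
A call tree and the 'close the loop' check can be modelled as a simple cascade. The Python sketch below assumes each person calls a fixed list of contacts. The role names and helper functions are hypothetical, and a real tree would also need deputies for anyone who cannot be reached.

```python
from collections import deque
from typing import Dict, Iterable, List, Set, Tuple


def cascade(call_tree: Dict[str, List[str]], initiator: str) -> List[Tuple[str, str]]:
    """Return (caller, callee) pairs in the order the calls are made."""
    calls: List[Tuple[str, str]] = []
    reached = {initiator}
    queue = deque([initiator])
    while queue:
        caller = queue.popleft()
        for callee in call_tree.get(caller, []):
            if callee not in reached:
                reached.add(callee)
                calls.append((caller, callee))
                queue.append(callee)
    return calls


def not_reached(call_tree: Dict[str, List[str]], initiator: str,
                essential: Iterable[str]) -> Set[str]:
    """Essential personnel who do not appear anywhere in the call tree."""
    reached = {initiator} | {callee for _, callee in cascade(call_tree, initiator)}
    return set(essential) - reached


# Hypothetical call tree.
call_tree = {
    "crisis lead": ["IT recovery lead", "business recovery lead"],
    "IT recovery lead": ["network engineer", "database administrator"],
    "business recovery lead": ["dealing room supervisor"],
}
calls = cascade(call_tree, "crisis lead")
# The last person called closes the loop by contacting the initiator.
loop_closer = calls[-1][1]
gaps = not_reached(call_tree, "crisis lead",
                   ["network engineer", "facilities manager"])
```

Running the gap check whenever contact lists change is one way to confirm that every essential person is covered by the tree.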

The ITSCM plan should include details of activities that need to be undertaken including:

retrieval of backup tapes or use of data vaulting to retrieve data
retrieval of essential documentation, procedures, workstation images, etc. stored off-site
mobilisation of the appropriate technical personnel to go to the recovery site to commence the recovery of required systems and services
contacting and putting on alert telecommunications suppliers, support services, application vendors, etc. who may be required to undertake actions or provide assistance in the recovery process.

Throughout the initial recovery, it is important that all activities are recorded. These will be used following the service disruption to analyse what went well and identify areas for improvement. The plans should include blank logs that must be given to all personnel to record activities (such as telephone conversations, timings for activities, etc.) and issues experienced.

The invocation and initial recovery is likely to be a time of high activity involving long hours for many individuals. This must be recognised and managed by the recovery team leaders to ensure that breaks are provided and prevent 'burn-out'. Planning for shifts and handovers must be undertaken to ensure that the best use is made of the facilities available. The commitment of staff (especially technical staff who will typically spend in excess of 24 hours ensuring a successful recovery) must be recognised and potentially rewarded once the service disruption is over. It is also vitally important to ensure that the usual business and technology controls remain in place during invocation, recovery and return to normal to ensure that information security is maintained at the correct level and that Data Protection is preserved.

Hints and tips

It is vital to ensure that Information Security and Data Protection controls and mechanisms are maintained and enforced during the invocation, recovery and return to normal stages of Service Continuity.

Once the recovery has been completed, the business should be able to operate from the recovery site at the level determined and agreed in the Business Continuity strategy. The objective, however, will be to build up the business to normal levels and vacate the recovery site in the shortest possible time. The recovery period will depend on the original service disruption. In the case of a power failure, return to normal may be achieved fairly quickly, whereas in the case of a fire, reoccupation of the affected building may be impossible and alternative accommodation should be sought. Whatever the period, a return to normal must be carefully planned and undertaken in a controlled fashion. Typically this will be over a weekend and may include some necessary downtime in business hours. It is important that this is managed well and that all personnel involved are aware of their responsibilities to ensure a smooth transition.