
8.5  Availability Planning

8.5.1 Determining Availability requirements
8.5.2 Design activity
8.5.3 Designing for Availability
8.5.4 Designing for recovery
8.5.5 Security considerations
8.5.6 Managing planned downtime


8.5.1  Determining Availability requirements

Before any Service Level Requirement (SLR) is accepted, and ultimately before the SLR or SLA is agreed between the business and the IT organisation, it is essential that the Availability requirements of the business are analysed to assess whether and how the IT Infrastructure can deliver the required levels of Availability.

This applies not only to new IT Services that are being introduced but also any requested Changes to the Availability requirements of existing IT Services.

It is important that the business is consulted early in the development Lifecycle so that the business Availability needs of a new or enhanced IT Service can be costed and agreed. This is particularly important where stringent Availability requirements may require additional investment in Service Management processes, IT Service management tools, high Availability design and special solutions with full redundancy.

It is likely that the business need for IT Availability cannot be expressed in technical terms. Availability Management therefore plays an important Role in translating the business and User requirements into quantifiable Availability terms and conditions. This is an important input into the IT Infrastructure design and provides the basis for assessing the capability of the IT Infrastructure design and IT support organisation in meeting the Availability requirements of the business.

The business requirements for IT Availability should at least contain:

Hints and tips

The translation of User and business Availability requirements into quantifiable terms and conditions is crucial. To avoid any confusion or misunderstanding between the Business and IT the first step should be to document and agree a description or definition of the Availability terms and conditions that will be used.

What the business understands by downtime, Availability, reliability etc. may differ from the IT perspective. Agreeing these definitions first avoids misunderstandings and enables subsequent design activities to commence with a clear and unambiguous understanding of what is required.

Availability Management having assessed the combined capability of the IT Infrastructure design and IT support organisation is then in a position to confirm if the Availability requirements can be met. Where shortfalls are identified, dialogue with the business is required to present the Cost options that exist to enhance the proposed design to meet the Availability requirements. This enables the business to reassess if lower or higher levels of Availability are required and to understand the appropriate impact and costs associated with their decision.

Determining the Availability requirements is likely to be an iterative Process particularly where there is a need to balance the business Availability requirement against the associated costs. The necessary steps are:

Hints and tips

If costs are seen as prohibitive, either:

Reassess the IT Infrastructure design and provide options for reducing costs and assess the consequences on Availability.

Or:

Reassess the business use and reliance on the IT Service and renegotiate the Availability targets to be documented in the SLA.

The Service Level Management (SLM) function is normally responsible for communicating with the business on how their Availability requirements are to be met and ultimately negotiating the SLA for the IT Service. Availability Management therefore provides important support to the SLM function during this period.

While higher levels of Availability can often be provided by technology investment there is no justification for providing a higher level of Availability than that needed and afforded by the business. The reality is that satisfying Availability requirements is always a balance between cost and quality.

This is where Availability Management can play a key role in optimising Availability of the IT Infrastructure to meet increasing Availability demands while deferring an increase in costs. (See Paragraph 8.5.4 and Section 8.9 for additional guidance.)
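The balance between cost and the level of Availability needed can be illustrated with a minimal sketch. The design options, Availability figures, costs and the `cheapest_adequate_option` helper below are all hypothetical and exist only to show the reasoning: choose the lowest-cost design that still meets the requirement, and return to the business if none is affordable.

```python
from dataclasses import dataclass


@dataclass
class DesignOption:
    name: str
    expected_availability: float  # percentage, e.g. 99.5
    annual_cost: float            # illustrative currency units


def cheapest_adequate_option(options, required_availability, budget):
    """Return the lowest-cost option that meets the requirement within budget.

    Returns None when no option qualifies, which is the signal to reassess
    the design or renegotiate the Availability target with the business.
    """
    candidates = [
        o for o in options
        if o.expected_availability >= required_availability
        and o.annual_cost <= budget
    ]
    return min(candidates, key=lambda o: o.annual_cost) if candidates else None


# Hypothetical options; none of these figures come from the guidance itself.
OPTIONS = [
    DesignOption("base design", 99.0, 100_000),
    DesignOption("clustered servers", 99.5, 160_000),
    DesignOption("full redundancy", 99.95, 400_000),
]

choice = cheapest_adequate_option(OPTIONS, required_availability=99.5, budget=200_000)
print(choice.name if choice else "no affordable option: reassess design or targets")
```

A result of `None` corresponds to the situation where costs are seen as prohibitive and either the design or the Availability target has to be revisited.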

8.5.2  Design activity

Designing for Availability is a key activity driven by Availability Management. This ensures that the required level of Availability for an IT Service can be met.

Availability Management needs to ensure that the design activity for Availability looks at the task from two related but distinct perspectives:

DESIGNING FOR AVAILABILITY: This relates to the technical design of the IT Infrastructure and the alignment of the internal and external suppliers required to meet the Availability requirements for an IT Service.

KEY MESSAGE

Designing for Availability can be considered the proactive perspective aimed at avoiding loss of IT Service Availability.

DESIGNING FOR RECOVERY: This relates to the design points required to ensure that in the event of an IT Service failure, the service can be reinstated to enable normal business operations to resume as quickly as is possible.

KEY MESSAGE

Designing for Recovery can be considered the reactive perspective aimed at minimising the business and User impact from an IT Service failure.

Taking this two-phased approach to the design activity ensures that new IT Services do not suffer unnecessary and extended recovery when the first failure situation occurs.

Additionally, the ability to recover quickly may be a crucial factor. In simple terms it may not be possible or cost justified to build a design that is highly resilient to failure(s). The ability to meet the Availability requirements within the cost parameters may rely on the ability to recover in a timely and effective manner, consistently.

8.5.3  Designing for Availability

Availability should be considered in the design process at the earliest possible stage of the development lifecycle. This avoids the potential for:

Figure 8.8 illustrates a high level outline of how initial Availability requirements are progressed by Availability Management to ensure these can be met by the IT Infrastructure and IT support organisation. This simple framework can be applied to new IT Services or existing IT Services where a Change of Availability requirements has required major redesign.

Figure 8.8 - Progressing Availability requirements for an IT Service

The role of Availability Management within the design activities is to provide:

The framework that Availability Management should utilise to determine the appropriateness of a given design to meet the stated Availability requirements consists of the following:

Method of approach

IT Infrastructure analysis - Capability Review

The first stage is to understand the Vulnerability to failure of the proposed IT Infrastructure design.

Single Points of Failure

A Single Point of Failure (SPOF) is any component within the IT Infrastructure that has no back-up capability and can cause impact to the business and User when it fails.

It is important that no unrecognised single points of failure exist within the IT Infrastructure design. The use of Component Failure Impact Analysis (CFIA) as a technique to identify single points of failure is recommended.

Where these are identified, CFIA can be used to identify the business and User impact and help determine what alternatives can or should be considered to cater for this weakness in the design. See Paragraph 8.9.1 for additional guidance on the use of CFIA.
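The SPOF definition above lends itself to a simple check. The sketch below is not the CFIA technique itself (see Paragraph 8.9.1 for that); it is a minimal illustration, with a hypothetical `Component` record and invented component names, of flagging any component that has no back-up and supports at least one IT Service.

```python
from dataclasses import dataclass, field


@dataclass
class Component:
    name: str
    has_backup: bool
    services: list = field(default_factory=list)  # IT Services that depend on it


def single_points_of_failure(components):
    """Map each component without a back-up to the services it would impact."""
    return {
        c.name: sorted(c.services)
        for c in components
        if not c.has_backup and c.services
    }


# Hypothetical configuration used only for illustration.
COMPONENTS = [
    Component("core switch", has_backup=False, services=["e-mail", "order entry"]),
    Component("database server", has_backup=True, services=["order entry"]),
    Component("print server", has_backup=False, services=["invoicing"]),
]

for name, impacted in single_points_of_failure(COMPONENTS).items():
    print(f"SPOF: {name} -> impacts {', '.join(impacted)}")
```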

Risk Analysis and Management

To assess the vulnerability of failure within the configuration and capability of the IT support organisation it is recommended that the proposed IT Infrastructure, service configurations, service design and supporting organisation (internal and external suppliers) are subject to a formal Risk Analysis. CRAMM is a technique that can be used to identify justifiable countermeasures that can protect the Availability of IT Systems. (See Paragraph 8.9.3 for additional guidance on CRAMM).

Testing or Simulation

To assess if new components within the design can match the stated requirements it is important that the testing regime instigated ensures that the Availability expected can be delivered. Simulation tools to generate the expected User demand for the new IT Service should be seriously considered to ensure components continue to operate under volume and stress conditions.

Improving the design

The second stage is to re-evaluate the IT Infrastructure design if the Availability requirements cannot be met and identify cost justified design Changes.

Improvements in design to meet the Availability requirements can be achieved by reviewing the capability of the technology to be deployed in the proposed IT Infrastructure design, e.g.:

Chapter 7 of this book provides additional guidance on this aspect of IT Infrastructure design.

Hints and tips

Consider documenting the Availability design requirements and considerations for new IT Services and make available to the areas responsible for design and implementation. Longer term seek to mandate these requirements and integrate within the appropriate governance mechanisms that cover the introduction of new IT Services.

Considerations for high Availability

Where the business operation has a high Dependency on IT Availability and the cost of failure or loss of business reputation is considered not acceptable, the business may define stringent Availability requirements. These factors may be sufficient for the business to justify the additional costs required to meet these more demanding levels of Availability.

Achieving high levels of Availability begins with the procurement and/or development of good quality products and components. However, these in isolation are unlikely to deliver the sustained levels of Availability required.

Achieving a consistent and sustained level of high Availability requires investment in, and deployment of, effective Service Management processes, systems management tools, high Availability design and ultimately special solutions with full redundancy.

Figure 8.9 - The building blocks to meet the more stringent Availability requirements

Figure 8.9 illustrates, quite simply, that achieving higher levels of Availability requires investment in more than just the base product and components. This is a similar diagram to that presented in Figure 8.6, which emphasises that the higher the Availability demand, the higher the overall costs.

The above can therefore be viewed as a framework for what needs to be considered within the overall design for Availability where stringent Availability requirements are set.

This suggested framework is described as follows:

Base product and components

The procurement or development of the base product and components should be based on their capability to meet stringent Availability and reliability requirements. These should be considered as the cornerstone of the Availability design. The additional investment required to achieve even higher levels of Availability will be wasted and Availability levels not met if these base products and components are unreliable and prone to failure.

Service Management processes

Effective Service Management processes contribute to higher levels of Availability. Processes such as Availability Management, Incident Management, Problem Management, Change Management etc play a crucial role in the overall management of the IT Service.

Systems Management

Systems Management should provide the monitoring, diagnostic and automated error recovery to enable fast detection and Resolution of potential and actual IT failure.

High Availability design

The design for high Availability needs to consider the elimination of single points of failure and/or the provision of alternative components to provide minimal disruption to the business operation should an IT component failure occur.

The design also needs to eliminate or minimise the effects of planned downtime to the business operation normally required to accommodate maintenance activity, the implementation of Changes to the IT Infrastructure or business application.

Recovery criteria should define rapid recovery and IT Service reinstatement as a key objective within the designing for recovery phase of design.

Special solutions with full redundancy

Approaching continuous Availability in the range of 100% requires expensive solutions that incorporate full redundancy. Redundancy is the technique of improving Availability by using duplicate components. For stringent Availability requirements to be met these need to be working autonomously in parallel. These solutions are not just restricted to the IT components, but also the IT Environment, i.e. power supplies, air conditioning, telecommunications.
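The effect of duplicate components working in parallel can be shown with the standard reliability calculation. The sketch below assumes that component failures are independent, which real installations only approximate, and the 99% figure is illustrative rather than taken from this guidance.

```python
from math import prod


def parallel_availability(availabilities):
    """Availability of redundant components in parallel (any one suffices).

    Assumes independent failures: the group is down only when every
    component in it is down at the same time.
    """
    return 1.0 - prod(1.0 - a for a in availabilities)


def serial_availability(availabilities):
    """Availability of a chain in which every element must be working."""
    return prod(availabilities)


single = 0.99                                       # one component, illustrative
duplicated = parallel_availability([single, single])  # approximately 0.9999

# Redundant IT components still depend on the environment (e.g. power),
# so the environment needs the same treatment to preserve the gain.
with_single_power = serial_availability([duplicated, 0.99])
with_dual_power = serial_availability([duplicated, parallel_availability([0.99, 0.99])])

print(f"{duplicated:.4%} {with_single_power:.4%} {with_dual_power:.4%}")
```

The comparison of the last two figures shows why full redundancy has to extend to power supplies, air conditioning and telecommunications: a single non-redundant element in the chain caps the Availability of the whole.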

Helpful additional definitions when defining stringent Availability requirements.

As stated in Paragraph 8.5.1, the translation of the business and User Availability requirements into quantifiable Availability terms and conditions is crucial.

Where stringent levels of Availability are required additional definitions should be documented and agreed between the business and IT to ensure both parties understand the specific high Availability conditions.

The suggested additional definitions are:

HIGH AVAILABILITY

A characteristic of the IT Service that minimises or masks the effects of IT component failure to the User.

CONTINUOUS OPERATION

A characteristic of the IT Service that minimises or masks the effects of planned downtime to the User.

CONTINUOUS AVAILABILITY

A characteristic of the IT Service that minimises or masks the effects of ALL failures and planned downtime to the User.

These definitions help to define the often-used term of 'High Availability'. This provides a better structure for determining which areas of Availability design are most important to the business.

Industry view

Many suppliers commit to high Availability or continuous Availability solutions only if stringent environmental standards are used. They often only agree to such contracts after a site survey has been completed and additional, sometimes costly improvements have been made.

8.5.4  Designing for recovery

Designing for Availability is a key activity driven by Availability Management. This ensures that the stated Availability requirements for an IT Service can be met.

However, Availability Management should also ensure that within this design activity there is focus on the design elements required to ensure that when IT Services fail, the service can be reinstated to enable normal business operations to resume as quickly as is possible.

'Designing for Recovery' may at first sound negative. Clearly good Availability design is about avoiding failures and delivering where possible a Fail-Safe IT Infrastructure. However, with this focus, is too much reliance placed on technology, and has as much emphasis been placed on the Safe-Fail aspects of the IT Infrastructure? The reality is that failures will occur. The way the IT organisation manages failure situations can have the following positive outcomes:

KEY MESSAGE

Every failure is a 'moment of truth' - every failure is an opportunity to make or break your reputation with the business.

Providing focus on the 'designing for recovery' aspects of the overall Availability design can ensure that every failure is an opportunity to maintain and even enhance business and User satisfaction.

Designing for Recovery - needs

To provide an effective 'design for recovery' it is important to recognise that both the Business and the IT organisation have needs that must be satisfied to enable an effective recovery from IT failure.

BUSINESS NEEDS

These are informational needs which the business requires to help them manage the impact of failure on their business and set expectation within the business, User community and their business Customers.

IT NEEDS

These are the process, procedures and tools required to enable the technical recovery to be completed in an optimal time.

Hints and tips

Consider documenting the Recovery design requirements and considerations for new IT Services and make available to the areas responsible for design and implementation. Longer term seek to mandate these requirements and integrate within the appropriate governance mechanisms that cover the introduction of new IT Services.

Key elements in the Design for Recovery

THE ROLE OF INCIDENT MANAGEMENT AND THE SERVICE DESK

A key aim is to avoid small Incidents becoming major by ensuring the right people are involved early enough to avoid mistakes being made and to ensure the appropriate business and technical recovery procedures are invoked at the earliest opportunity.

This is the responsibility of the Incident Management process and role of the Service Desk.

To ensure business needs are met during major IT Service failures and to ensure the most optimal recovery, the Incident Management process and Service Desk need to have defined, and to execute:

KEY MESSAGE

The above are not the responsibilities of Availability Management. However, the effectiveness of the Incident Management process and Service Desk can strongly influence the overall recovery period. The use of Availability Management methods and techniques to further optimise IT recovery may be the stimulus for subsequent continuous improvement activities to the Incident Management process and Service Desk.

Understanding the Incident 'lifecycle'

It is important to recognise that every Incident passes through a number of stages. These are described as follows:

This 'lifecycle' view provides an important framework in determining, amongst others, systems management requirements for Incident detection, diagnostic data capture requirements and tools for diagnosis, recovery plans to aid speedy recovery and how to verify that the IT Service has been restored.

Chapter 5 of the Service Support book provides information on Incident Management and the Incident 'lifecycle' from the Incident handling perspective. To aid the designing for recovery this lifecycle has been expanded to reflect an IT Availability perspective of an Incident.

Paragraph 8.9.9 provides additional guidance on, and illustrates the use of, the expanded Incident lifecycle.

Systems Management

The provision of Systems Management tools positively influences the levels of Availability that can be delivered. Implementation and exploitation should have strong focus on achieving high Availability and enhanced recovery objectives.

In the context of recovery, such tools should be exploited to provide automated failure detection, assist failure diagnosis and support automated error recovery.

Diagnostic data capture procedures

When IT components fail it is important that the required level of diagnostics is captured, to enable Problem determination to identify the root cause. For certain failures the capture of diagnostics may extend service downtime. However, failing to capture the appropriate diagnostics exposes the service to repeat failures.

Where the time required to take diagnostics is considered excessive, a review should be instigated to identify if techniques and/or procedures can be streamlined to reduce the time required. Equally the scope of the diagnostic data available for capture can be assessed to ensure only the diagnostic data considered essential is taken.

The additional downtime required to capture diagnostics should be included in the recovery metrics documented for each IT component.

Determine backup and recovery requirements

The backup and recovery requirements for the components underpinning a new IT Service should be identified as early as possible within the development or selection cycle. These requirements should cover hardware, software and data. The outcome from this activity should be a documented set of recovery requirements that enable the development of appropriate recovery plans.

Develop and test a backup and recovery strategy and schedule

Anticipating and preparing for recovery, so that reinstatement of service is effective and efficient, requires the development and testing of appropriate recovery plans based on the documented recovery requirements.

The outcome from this activity should be clear, operable and accurate recovery plans that are available to the appropriate parties immediately the new IT Service is introduced.

Wherever possible, the operational activities within the recovery plan should be automated.

The testing of the recovery plans also delivers approximate timings for recovery. These recovery metrics can be used to support the communication of estimated recovery of service and validate or enhance the CFIA documentation.

Recovery metrics

The provision of a timely and accurate estimation of when service will be restored is the key informational need of the business. This information enables the business to make sensible decisions on how they are to manage the impact of failure on the business and on their Customers. Communicating this information to the business requires the creation and maintenance of recovery metrics for each IT component covering a variety of recovery scenarios.

Paragraph 8.9.1 provides guidance on how Component Failure Impact Analysis (CFIA) can be used to derive recovery metrics to support the communications element by providing estimated recovery timings.
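As an illustration of how recovery metrics might be held and used, the sketch below keeps a table of timings per component and recovery scenario, including the additional time needed to capture diagnostics, and derives an estimated restoration time. The component names, scenarios, timings and the `estimated_restoration` helper are hypothetical; real values would come from testing the recovery plans.

```python
from datetime import datetime, timedelta

# Hypothetical recovery metrics in minutes, keyed by (component, scenario).
RECOVERY_METRICS = {
    ("database server", "software restart"): {"diagnostics": 10, "recovery": 20},
    ("database server", "hardware replacement"): {"diagnostics": 10, "recovery": 180},
    ("core switch", "failover to spare"): {"diagnostics": 5, "recovery": 15},
}


def estimated_restoration(component, scenario, detected_at):
    """Estimate when service will be restored for a given failure scenario.

    Diagnostic capture time is included because it extends the downtime.
    """
    metric = RECOVERY_METRICS[(component, scenario)]
    return detected_at + timedelta(minutes=metric["diagnostics"] + metric["recovery"])


detected = datetime(2024, 1, 1, 9, 0)
eta = estimated_restoration("database server", "software restart", detected)
print(f"Estimated restoration: {eta:%H:%M}")
```

An estimate of this kind is only as good as the metrics behind it, so the table should be revised whenever recovery plan tests or real Incidents show different timings.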

Backup and recovery performance

Availability Management must continuously seek and promote faster methods of recovery for all potential Incidents. This can be achieved via a range of methods including automated failure detection, automated recovery, more stringent escalation procedures, and exploitation of new and faster recovery tools and techniques.

It is recommended that this aspect of Availability measurement is included in the basic set of IT Availability measures utilised to measure and report IT Availability.

Paragraph 8.9.9 describes how to use the expanded 'Incident lifecycle' as a model for metrics creation. These metrics could be used to drive the 'review backup and recovery performance' process.

Service restoration and verification

An Incident can only be considered 'closed' once service has been restored and normal business operation has resumed. It is important that the restored IT Service is verified as working correctly as soon as service restoration is completed and before any technical staff involved in the Incident are stood down. In the majority of cases this is simply a case of getting confirmation from the User. However, the User for some services may be a Customer of the business, e.g. for ATM services or Internet-based services.

For these types of services it is recommended that IT Service verification procedures are developed to enable the IT support organisation to verify that a restored IT Service is now working as expected. These could simply be visual checks of transaction throughput or User simulation scripts that validate the end-to-end service.

Hints and tips

There is potential for confusion in distinguishing the aspects of Incident Management appropriate to the Service Desk and those appropriate to Availability Management.

The goal of the Incident Management process and the aims of Availability Management in designing for recovery are completely complementary, i.e. to restore normal business operation as quickly as possible and to minimise the impact to the business and User.

The Incident Management process is used by the Service Desk to provide a structured and consistent approach to the handling, tracking and ultimate resolution of all Incidents. This is the management perspective best described by the Incident lifecycle (ITIL Service Support - Chapter 5).

Availability Management is concerned with the methods, tools and techniques employed by the IT support organisation within each stage of the Incident Management 'lifecycle'. This is the technical perspective best described by the expanded Incident 'lifecycle' (ITIL Service Delivery).

8.5.5  Security considerations

Availability Management is concerned with the Availability of all IT Service components, including data. Availability Management is therefore closely connected with Security Management. The importance of Availability is recognised by its being one third of the security 'CIA' tenet:

Example

An IT Service may not be available due to the erroneous deletion of data, this (security) Incident resulting in a breach of the Service Level Agreement.

The overall aim of IT security is 'balanced security in depth' with justifiable controls implemented to ensure continued IT Service within secure parameters (viz., Confidentiality, Integrity and Availability).

During the gathering of Availability requirements for new IT Services it is important that requirements that cover IT security are defined. These requirements need to be applied within the design phase for the supporting IT Infrastructure. The points made in the earlier Sections about designing Availability into the design at the earliest opportunity equally apply to security controls.

For many organisations the approach taken to IT security is covered by an IT security policy owned and maintained by Security Management. In the execution of security policy, Availability Management plays an important role in its operation for new IT Services.

Hints and tips

There is potential for confusion between the process owners for Security Management and Availability Management with regard to security requirements for new IT Services.

Security Management can be viewed as accountable for ensuring compliance to IT security policy for the implementation of new IT Services. Availability Management is responsible for ensuring security requirements are defined and incorporated within the overall Availability design.

Availability Management can gain guidance from the information contained within the organisation's IT security policy and associated procedures and methods. However, the following are typical security considerations that must, amongst others, be addressed:

Industry view

The Purpose of information security?

'Information Security protects information from a wide range of threats in order to ensure Business Continuity, minimise business damage and maximise return on investments and business opportunity'

Source: BS 7799 - The UK code of practice for Information Security.

For further reference to Information Security refer to ITIL Security Management, OGC, ISBN 011330014X.

8.5.6  Managing planned downtime

Maintenance

All IT components should be subject to a planned maintenance strategy. The frequency and levels of maintenance required varies from component to component taking into account the technologies involved, criticality and the potential business benefits that may be introduced.

Planned maintenance activities enable the IT support organisation to provide:

The requirement for planned downtime clearly influences the level of Availability that can be delivered for an IT Service, particularly those that have stringent Availability requirements.

In determining the Availability requirements for a new or enhanced IT Service, the amount of downtime required for planned maintenance, and the resultant loss of income, may not be acceptable to the business. This is becoming a growing issue in the area of E-commerce. In these instances it is essential that continuous operation is a core design feature to enable maintenance activity to be performed without impacting the full IT Service.

Where the required service hours for IT Services are less than 24 hrs per day and/or 7 days per week, it is likely that the majority of planned maintenance can be accommodated without impacting IT Service Availability.
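The effect of service hours on planned maintenance can be shown with the commonly used percentage calculation, in which Availability is measured against the agreed service time only. The weekly service hours and the four-hour maintenance slot below are illustrative assumptions.

```python
def availability_percent(agreed_service_hours, downtime_hours_within_service):
    """Percentage Availability measured against agreed service time.

    Only downtime that falls inside the agreed service hours is counted.
    """
    return (
        (agreed_service_hours - downtime_hours_within_service)
        / agreed_service_hours
        * 100.0
    )


# Service agreed for 10 hours a day, Monday to Friday: a 4-hour weekend
# maintenance slot falls outside service hours and costs no Availability.
office_hours_service = availability_percent(50, 0)

# Service agreed for 24 hours a day, 7 days a week: the same 4-hour slot
# is now downtime within the agreed service time.
round_the_clock_service = availability_percent(168, 4)

print(f"{office_hours_service:.2f}% {round_the_clock_service:.2f}%")
```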

However, where the business needs IT Services available on a 24 hour and 7 day basis, Availability Management needs to determine the most effective approach in balancing the requirements for planned maintenance against the loss of service to the business. Unless mechanisms exist to allow continuous operation, scheduled downtime for planned maintenance is essential if high levels of Availability are to be achieved and sustained. For all IT Services there should logically be a 'low impact' period for the implementation of maintenance.

Once the requirements for managing scheduled maintenance have been defined and agreed, these should be documented as a minimum in the following:

The areas responsible for implementing and managing Change, i.e. Service Desk, Network Management and Computer Operations, need to be made aware of the maintenance targets and any future revisions.

KEY MESSAGE

Availability Management should ensure that building in a low impact period for preventative maintenance is one of the prime design considerations for a '24 hours per day/7 days a week' IT Service.

Minimising business impact

Assessing Service Impact

The output from the Component Failure Impact Analysis (CFIA) indicates for a given component the impact on the User when the component is not available. The definition of IT Service downtime obtained when determining the Availability requirements establishes the level of business impact arising from the non-Availability of this component.

The CFIA also indicates if an alternative CI can continue to provide the service. Where an alternative CI is available, service impact is minimal dependent on how quickly the alternate CI is activated.

For components that have an alternative CI, the maintenance policy agreed with the internal or external supplier should ensure that planned maintenance to the component and to its alternative CI is not scheduled concurrently.
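A maintenance schedule can be checked for this condition mechanically. The sketch below assumes a hypothetical schedule of maintenance windows per CI and a list of CI and alternative CI pairs, and reports any pair whose windows overlap.

```python
from datetime import datetime


def windows_overlap(a_start, a_end, b_start, b_end):
    """True when two time windows share any period of time."""
    return a_start < b_end and b_start < a_end


def concurrent_maintenance_conflicts(windows, alternates):
    """Return (CI, alternative CI) pairs scheduled for maintenance concurrently.

    windows:    {ci_name: (start, end)} planned maintenance windows
    alternates: [(ci_name, alternative_ci_name), ...]
    """
    conflicts = []
    for ci, alt in alternates:
        if ci in windows and alt in windows:
            if windows_overlap(*windows[ci], *windows[alt]):
                conflicts.append((ci, alt))
    return conflicts


# Hypothetical schedule used only for illustration.
WINDOWS = {
    "router A": (datetime(2024, 3, 2, 1, 0), datetime(2024, 3, 2, 3, 0)),
    "router B": (datetime(2024, 3, 2, 2, 0), datetime(2024, 3, 2, 4, 0)),
    "server A": (datetime(2024, 3, 2, 1, 0), datetime(2024, 3, 2, 2, 0)),
    "server B": (datetime(2024, 3, 9, 1, 0), datetime(2024, 3, 9, 2, 0)),
}
ALTERNATES = [("router A", "router B"), ("server A", "server B")]

print(concurrent_maintenance_conflicts(WINDOWS, ALTERNATES))
```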

Scheduling downtime

The most appropriate time to schedule planned downtime is clearly when the impact on the business and its Customers is least. This information should be provided initially by the business when determining the Availability requirements.

For an existing IT Service or once the new service has been established, monitoring of business and Customer transactions helps establish the hours where IT Service usage is at its lowest. This should determine the most appropriate timing window for the component(s) to be removed for planned maintenance activity.
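Given monitored transaction volumes, the lowest-usage window can be found with a simple search. The sketch below assumes hourly transaction counts for a typical day (the figures are invented) and looks for the contiguous window of a given length with the fewest transactions, allowing the window to run past midnight.

```python
def lowest_usage_window(hourly_transactions, window_hours):
    """Return (start_hour, total_transactions) for the quietest window.

    The search wraps around the end of the list, so a window may start
    late in the evening and finish in the early morning.
    """
    n = len(hourly_transactions)
    best_start, best_total = None, None
    for start in range(n):
        total = sum(hourly_transactions[(start + i) % n] for i in range(window_hours))
        if best_total is None or total < best_total:
            best_start, best_total = start, total
    return best_start, best_total


# Hypothetical transaction counts for hours 00:00 to 23:00.
HOURLY = [120, 80, 40, 30, 35, 60, 150, 400, 900, 1200, 1300, 1250,
          1100, 1150, 1200, 1100, 950, 800, 700, 600, 500, 400, 300, 200]

start, total = lowest_usage_window(HOURLY, window_hours=2)
print(f"Quietest 2-hour window starts at {start:02d}:00 ({total} transactions)")
```

Transaction volume is only a proxy for business impact, so the result should be confirmed with the business before it is adopted as the maintenance window.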

Aggregation of maintenance activity

Accommodating the individual component requirements for planned downtime, while balancing the IT Service Availability requirements of the business, provides an opportunity to consider scheduling planned maintenance to multiple components concurrently.

The benefit of this approach is that the number of service disruptions required to meet the maintenance requirements is reduced.

While this approach has benefits, there are potential risks that need to be assessed, for example:

Service Maintenance Objectives

The effective management of planned downtime is an important contribution in meeting the required levels of Availability for an IT Service.

Where planned downtime is required on a cyclic basis to an IT component(s), the time that the component is unavailable to enable the planned maintenance activity to be undertaken should be defined and agreed with the internal or external supplier. This becomes a stated objective that can be formalised, measured and reported.

The Service Maintenance Objective (SMO) for a given planned maintenance activity should be the total time required for the IT component to be unavailable. The SMO should therefore be an aggregate of the following timeline events:

The benefits of defining Service Maintenance Objectives for cyclic planned maintenance activity are:

In addition they also provide an early warning during the maintenance activity of the time allocated to the planned outage duration being breached. This can enable an early decision to be made on whether the activity is allowed to complete with the potential to further impact service or to abort the activity and instigate the backout plan.

Planned downtime and performance against the stated SMO for each component should be recorded and used in service reporting.
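The SMO and its early-warning use can be sketched as follows. The phase names used here (closedown, application of the Change, restart) and their durations are assumptions made for illustration, as is the `projected_breach` helper.

```python
def service_maintenance_objective(phase_minutes):
    """Total planned unavailability: the sum of all maintenance phases."""
    return sum(phase_minutes.values())


def projected_breach(smo_minutes, elapsed_minutes, remaining_phase_minutes):
    """True when time already spent plus planned remaining work exceeds the SMO.

    A True result is the cue to decide whether to let the activity
    complete or to abort it and instigate the backout plan.
    """
    return elapsed_minutes + sum(remaining_phase_minutes) > smo_minutes


# Hypothetical phases and planned durations in minutes.
PHASES = {"closedown": 15, "apply change": 30, "restart": 15}
smo = service_maintenance_objective(PHASES)

# Closedown and the Change are finished; only the restart remains.
on_track = projected_breach(smo, elapsed_minutes=40, remaining_phase_minutes=[15])
overrun = projected_breach(smo, elapsed_minutes=50, remaining_phase_minutes=[15])

print(smo, on_track, overrun)
```

Recording the actual duration of each phase alongside the planned figure provides the data needed to report performance against the SMO and to tighten the objective over time.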

Validation

An IT organisation that supported a 24 hr IT Service looked at ways that they could reduce the amount of downtime required for scheduled maintenance. A review of the scheduled outages revealed wide ranges in the amount of downtime incurred. These were due to a combination of Change quality issues impacting the implementation and operational issues with system closedown and restart procedures. Responsibilities for the successful implementation of the Changes were split between the area of the IT support organisation supplying the Change and the operational area responsible for implementing the Change.

The first step taken was to get both parties to agree on what should be a realistic time to complete the Change implementation (closedown, application of Change, restart of system).

Once agreed this was documented as a formal agreement between both parties upon which there was a shared ownership for meeting the maintenance objective.

Formal reporting and regular reviews were held. The reasons for failing to meet the target were investigated and addressed. In addition both parties explored opportunities to reduce the target time allocated. Where improvements were made and the results showed a sustained improvement a new maintenance objective was set.

This process continued year on year until the implementation time was considered optimal. In this time the duration for planned mainframe maintenance had been reduced from extremes of 3 hrs to consistently less than one hour.
