Business Continuity Planning & Disaster Recovery Solutions Continuous
Business Continuity Planning & Disaster Recovery Solutions
Achieving Continuous Availability

Introduction
Today’s business continuity problem
Planning for disaster recovery and business continuity
Regulatory squeeze and ubiquitous deployment
The ideal enterprise continuity solution
Achieving continuous availability with Sybase
Conclusion
Appendix: Business Continuity Planning and Disaster Recovery in practice

First Published 2002
www.datamonitor.com

Datamonitor USA
1 Park Avenue, 14th Floor, New York, NY 10016-5802, USA
t: +1 212 686 7400 f: +1 212 686 2626 e: usinfo@datamonitor.com

Datamonitor Europe
Charles House, 108-110 Finchley Road, London NW3 5JJ, United Kingdom
t: +44 20 7675 7000 f: +44 20 7675 7500 e: eurinfo@datamonitor.com

Datamonitor Germany
Messe Turm, Box 23, 60308 Frankfurt, Deutschland
t: +49 69 9754 4517 f: +49 69 9754 4900 e: deinfo@datamonitor.com

Datamonitor Asia Pacific
Room 2413-18, 24/F Shui On Centre, 6-8 Harbour Road, Hong Kong
t: +852 2520 1177 f: +852 2520 1165 e: hkinfo@datamonitor.com

ABOUT DATAMONITOR
Datamonitor plc is a premium business information company specializing in industry analysis. We help our clients, 5000 of the world’s leading companies, to address complex strategic issues. Through our proprietary databases and wealth of expertise, we provide clients with unbiased expert analysis and in-depth forecasts for six industry sectors: Automotive, Consumer Markets, Energy, Financial Services, Healthcare, Technology. Datamonitor maintains its headquarters in London and has regional offices in New York, Frankfurt and Hong Kong.

All Rights Reserved. No part of this publication may be reproduced, stored in a retrieval system or transmitted in any form by any means, electronic, mechanical, photocopying, recording or otherwise, without the prior permission of the publisher, Datamonitor plc. The facts of this report are believed to be correct at the time of publication but cannot be guaranteed.
Please note that the findings, conclusions and recommendations that Datamonitor delivers will be based on information gathered in good faith from both primary and secondary sources, whose accuracy we are not always in a position to guarantee. As such, Datamonitor can accept no liability whatever for actions taken based on any information that may subsequently prove to be incorrect.

Business Continuity Planning and Disaster Recovery – Achieving Continuous Availability

Introduction

As a solution, disaster recovery has been around for some time: the need to back up IT assets in the event of incidents such as flooding, fire, power failure or other physical and electronic attack is clear. Business continuity extends this to ensure that processes and transactions can continue under all circumstances. Recent events, notably those of September 11th 2001, have made it clear that business continuity planning and disaster recovery (BCP and DR) are increasingly vital for all large enterprises.

The solutions themselves have also moved on. It has become apparent that classic back-up and restore, where tapes containing data assets are shipped to a remote location, is no longer feasible. The complexity of the modern IT environment exposes it to an increased internal technical risk of failure as well as to external risks, and a ‘real-time’ IT world demands continuous data access.

There has been massive investment over the past few years in enterprise applications such as CRM, ERP and SCM, to name but a few. This investment has greatly affected business continuity and disaster recovery, as many organisations are integrating their business processes with those of their customers, suppliers and business partners. Recovery times have shrunk to minutes and hours, and in some cases to zero, which means 24x7x52 continuous business process availability.
Furthermore, scenario plans have broadened to take on the new risks of eBusiness, for example, downtime due to:
• operational risk (such as the Microsoft.com three-day outage);
• security risk (denial-of-service attacks bringing down Yahoo);
• lack of capacity (the launch of the UK Government’s 1901 census site resulted in 1.2 million hits per hour, causing the site to crash and be withdrawn for several months);
• application failure (such as the full-day outage last year at the London Stock Exchange);
• partner / outsourcer unavailability (such as ISP network failure or links from a web site to those of partners that are unavailable);
• natural disasters (such as floods on the River Oder and the Rhine, earthquakes in Bulgaria and Turkey);
• terrorist attacks (such as those seen in the United Kingdom and Spain).

Any downtime risk must be a concern to every business, as any downtime today results in a press event which could impact the image and reputation of the enterprise. Therefore, the aim of this paper is to increase awareness and understanding of the benefits of Business Continuity and Disaster Recovery technologies, by:
• discussing the increasing need for BCP and DR solutions by organisations;
• highlighting the business benefits of business continuity solutions;
• outlining the various technology solutions that can be used to achieve BCP and DR;
• explaining the advantages of using a 3-tiered, continuously available software-based solution.

Today’s business continuity problem

The issue surrounding business continuity planning (BCP) is essentially the same for a traditional bricks and mortar company or an innovative ‘dot com’: business processes and information systems must be continually available, 24x7x52. What has changed over time is the speed at which recovery is required and the increased willingness of competitors to capitalise on a company’s downtime. With increasing reliance on electronic markets, and with natural disasters and criminal activity able to affect a company’s technology base, its supply chain and its customer base, companies are becoming more and more concerned about business continuity planning.

The main drivers and factors that have created the need for comprehensive business continuity solutions are:
• the global nature of modern business practices;
• corporate, internal and external supply chain interdependency;
• speed and timeliness;
• IT-dependent business processes;
• the increasing value of data;
• legislation / regulation.

Business continuity planning means formalising a company’s strategy for dealing with the unexpected and unknown by planning, training and testing for the recovery of critical business processes and IT systems in a timely fashion, to minimise the impact of any disruption on the business and the customer.

Typical natural disaster scenarios addressed by many businesses include fire, floods, tornadoes, hurricanes, earthquakes, snow / ice and extreme heat. However, many businesses are willing to accept the risk of these natural disasters because they perceive them as unlikely, and hence feel that business continuity planning is not necessary.
Many threats that have nothing to do with geographic location or threat of natural disaster are often overlooked, such as:
• work-place violence;
• terrorism;
• workforce unavailability;
• computer / Internet based crime (including denial of service attacks);
• geographical restrictions caused by events, such as chemical contamination;
• power outage;
• computer viruses;
• telecommunications failure;
• asset malfunction (including hardware and software failures);
• human error.

Business continuity planning is not simply ‘ticking a box’ by purchasing a hot-site contract or a business continuity planning software package. Similarly, having a documented plan is not enough.

In the beginning

As illustrated in Figure 1, the modern concepts of disaster recovery and business continuity were hatched around the beginning of the nineties, mainly in the form of legal requirements for banks and financial institutions to set up contingency plans in the event of a disaster.

Figure 1: The evolution of disaster recovery and business continuity
[Chart not reproduced. Vertical axis: recovery timescale (days, 24 hours, hours, minutes, seconds). Horizontal axis: 1990, 1995, 2000, 2000+. Labelled stages: disaster mitigation, disaster recovery, business continuity, application recovery & continuity.]
Source: Datamonitor

These solutions mainly mitigated the impact of a major disaster. The contingency arrangements were designed to enable the restoration of business and data over a period of a few days, protecting the business from a permanent or long-term loss. As such, the contingencies developed in this period did not allow for any immediate continuity in the running of the business.
Over the following half-decade, the concept of being able to recover the operations of the business after a disaster gained further ground, with enterprises establishing remote locations to which workers could be relocated while the business was restored. These facilities are to this day the ultimate protection against major disasters, but they are only available to enterprises whose wealth and data value make a facility of this nature possible.

The dawn of the millennium

At the end of the last millennium, the spread of the Internet and the advent of eCommerce saw the timeframes of business transactions shrink to levels never seen before. For example, people surfing the web will on average wait less than 10 seconds before they give up and move to another site, and they are unlikely to return to the site that did not work. At the same time, IT has been integrated into business processes, increasing the importance of maintaining the availability of applications and IT infrastructure at all times.

These solutions still mainly involved back-up and restore implementations, typically to tape, with the aim of minimising any data loss in case of a systems failure. A back-up and restore facility also made it easier to restore the last known working configuration of a system, thereby enabling higher levels of availability. This process was perfected with more frequent back-up windows and redundant application servers, enabling the fail-over of applications to occur in a matter of minutes. Over this period the price of back-up and restore solutions fell to levels that were affordable for most enterprises, so that the majority of companies with an IT department could afford a simple tape library for the back-up of critical data.

The birth of Business Continuity

In the scare relating to the Y2K bug, many enterprises performed a close audit of their business, including contingency plans in the event of downtime or disaster. The resulting solutions saw, for the first time, the widespread acceptance of the concept of business continuity, whereby business operations were to be maintained at all times in some form. Business continuity is meant to encompass the whole business, not just the web-site or key applications.

Today the main issue in business continuity and disaster recovery is no longer merely the attainment of higher rates of availability. The economics of moving from the highly expensive availability level of ‘four or five 9s’ of reliability (i.e. 99.99% or 99.999% reliable) to even higher levels rarely make sense, even for the most downtime-sensitive organization. Instead, the main focus should be on the level of deployment within the enterprise. Furthermore, in today’s world of interdependence and collaboration it is also important to think about partner companies, most notably suppliers, to ensure that agreed levels of performance can be maintained. Most organizations still have only quite rudimentary solutions in use for disaster recovery and business continuity.

Planning for disaster recovery and business continuity

Datamonitor research on issues relating to business continuity and disaster recovery indicated that enterprises had a very poor idea of what these solutions entail and that levels of deployment seem to be quite low. It is this low level of deployment that needs to be addressed, as with so many IT issues, through education.
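To put the ‘four or five 9s’ availability levels mentioned earlier into concrete terms, an availability percentage can be converted into permitted downtime per year. The following is a minimal sketch; the function name and the 365-day year are illustrative assumptions, not figures from this report.

```python
# Illustrative only: convert an availability percentage into allowed downtime.
# Assumes a 365-day year (leap years ignored).
MINUTES_PER_YEAR = 365 * 24 * 60


def annual_downtime_minutes(availability_percent: float) -> float:
    """Return the minutes of downtime per year permitted at this availability."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be between 0 and 100 percent")
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


for pct in (99.9, 99.99, 99.999):
    print(f"{pct}% available -> {annual_downtime_minutes(pct):.1f} minutes of downtime per year")
```

On these assumptions, ‘four 9s’ allows a little under an hour of downtime per year and ‘five 9s’ a little over five minutes, which illustrates why each further step becomes hard to justify economically.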
Despite the higher priority that business continuity and disaster recovery have been able to gain in overall IT budgets, there is still a clear shortfall in the level of spending in enterprises, resulting in inadequate cover against disruptive events. Furthermore, many BCP mistakes or assumptions are made by companies dependent on traditional technology infrastructures, as well as by those relying more heavily on the Internet. They include:
• over-reliance – relying on a business continuity plan can lead to a false sense of security and potential business failure if the plan is not updated regularly and fully tested. Formal mechanisms should be in place to force a plan update on a regular basis or whenever significant systems or business process change occurs. A comprehensive business continuity plan should include mechanisms to ensure periodic updates;
• segregated planning processes – companies often limit the scope of their efforts to systems recovery, or consider IT assets and business processes separately. BCP requires consideration of both business process and systems recovery together, given that technology often plays a critical role in the business processes. The plan must address those processes that coincide with corporate strategy and objectives;
• lack of planning prioritisation – prioritising key business processes is a critical step that often does not get the appropriate attention. Without prioritisation, a plan may recover less-than-critical business processes rather than the ones crucial for survival. Furthermore, vulnerability should also be considered and taken into account when prioritising which processes will be planned first.
Given that more and more business processes are interdependent, the order in which processes and systems are recovered is important;
• safety deficiency – at all stages of the BCP process, employee, business partner and customer safety must be taken into account. The plan should include safety controls to minimise casualties in the event of a disaster and ways to contain the situation to minimise or eliminate risk;
• inadequate communications – communication issues are often overlooked. Businesses often lack formal communications plans for contacting employees, business partners and clients in the event of problems. Strategies to address how these groups obtain recovery status updates are usually inadequate;
• poor security – physical and information systems security controls are often disregarded during plan development and implementation, resulting in greater risk exposure during recovery operations. Physical security over a disaster site is an important consideration, both to recover equipment quickly and without interference and to process insurance claims in a timely manner. The back-up information systems should also have logical security controls similar to those of the primary processing environment;
• ineffective response to insurance requirements – many business continuity plans fail to plan adequately for supporting the filing of insurance claims, resulting in delayed or reduced settlements;
• poor recovery services evaluation – many companies poorly evaluate recovery products (hot sites, cold sites, off site storage and planning software), relying on vendor information. Furthermore, some companies eliminate a product or service simply due to cost, without understanding how it could significantly affect the timely recovery of critical business processes.
Lack of foresight may lead to a solution that does not adequately address a company’s needs;
• regulatory / legislative compliance – adherence to legislative requirements is also often overlooked, with many companies unaware of their legal requirement to plan for the continuity of business processes. Companies must understand the local statutes and industry regulations governing business resumption and disaster recovery.

Regulatory squeeze and ubiquitous deployment

Until now the regulation relating to business continuity and disaster recovery has been quite limited, relating mostly to financial companies and financial records within an enterprise. Very little legislation is actually in place that forces businesses in normal circumstances to implement measures for business continuity. However, driven by the political will created following the events of September 11, regulatory bodies across the Western world will most likely step up their requirements on the business continuity features of businesses, especially publicly listed businesses. Primarily, this would be to ensure the going concern of the business, but it would also set down standards for business availability – in essence setting down a legally enforceable SLA.

However, there are two regulatory bodies that are providing guidance on the adoption of risk-based approaches to business operations: the Basel Committee for the financial services sector and the Turnbull Committee for companies in any industry. A discussion of both is given below.

Basel Committee

In 1988, the Basel committee of the Bank for International Settlements (BIS) classified the risk levels of different types of credit and the minimum amount of capital that should be held as a cushion against the risks of each.
The revised Accord (Basel II), conceived in June 1999, attempts to tackle the shortfalls of its predecessor but, more importantly, it encourages banks to develop internal models rather than have capital reserves imposed by regulators. Proposals focus on incentivizing banks to develop more sophisticated approaches to credit and operational risk based on the banks’ own internal ratings and measurement systems. To benefit from capital incentives, banks must be able to demonstrate high levels of data quality, accuracy and integrity in order to obtain approval for the internal ratings based approach, which gives institutions the ability to self-regulate risk management.

Basel II also recognises that risks other than credit and market risk can be substantial, and therefore also incorporates directives on operational risk, i.e. the risk of losses from inadequate or failed internal processes, people, systems or external events. These losses can be catastrophic, especially with the increased reliance on electronic payment processing, electronic trading and the automation of other middle and back office functions.

The increased complexity and granularity of Basel II over its predecessor forces banks to revise existing credit and operational risk management systems and data standards. Critical to this is the management of large volumes of high quality loss data and detailed documentation to enable audit trails and third party verification.

Figure 2: The Basel II data challenge
Data management:
• Data collection – assimilation of data from across the enterprise at a business unit and geographical level
• Data standardisation – imposing a uniform discipline on the captured data
• Data cleansing – quality control
• Data consolidation – developing and maintaining a dynamic database
Source: Datamonitor

A significant factor in a Basel II project will be the process for collecting, standardizing and consolidating data across the business units in order to model risk. Banks need to establish a workable definition of risk and apply this definition across all geographies and business units. While this presents a significant technological and business process challenge, possibly the single greatest factor will be cultural. Many banks are not culturally accustomed to amassing large volumes of historical operational loss data, given that operational risk has traditionally been regarded as a risk category immune to standard linear risk management methodologies.

The links between Basel II and Business Continuity are clear. Basel II integrates business risk with the need to collect and store information. Business Continuity as a concept is tightly integrated into the requirements for Basel II as a method of mitigating risk and ensuring data integrity against low-frequency, high-consequence and high-frequency, low-risk events.

Turnbull Committee Report

The Turnbull Report is the popular name given to the ICAEW's Guidance on Internal Control, which was produced in clarification of The Combined Code on Corporate Governance. The Turnbull guidance is linked via the Combined Code on Corporate Governance to the Listing Rule disclosure requirements of the London Stock Exchange.
The guidance is concerned with the adoption of a risk-based approach to establishing a system of internal control and reviewing its effectiveness, and can be instituted by any company in any industry. The guidance is intended to reflect sound business practice whereby internal control is embedded within the business processes themselves. This ensures that, rather than being merely an exercise to meet regulatory hurdles, it is incorporated into the company’s normal management and governance processes. Crucial to this is the system of internal control; sound internal control systems should ensure the enterprise will not be hindered in achieving its objectives.

Building on the internal control aspect, the Turnbull guidance mandates an ongoing internal audit stating how the enterprise identifies, evaluates and manages risks. As with Basel II, there is a requirement for transparency of the results, and the enterprise is required to state what actions are being taken to rectify or improve any major failings or weaknesses in the control system or risk management system. Although this requirement involves non-IT functions, it is clear that in any enterprise where information handling or processing is paramount, the institution of a business continuity capability is key. An effective system should not only mitigate risk, but also satisfy the IT internal control requirements for Turnbull.

Business Continuity for the masses

Traditionally, Business Continuity solutions have been associated with the financial services sector. However, with the evolution of eBusiness and the widespread use of mission critical applications, organisations from all verticals are investing in continuity and recovery solutions. In 2002, financial services institutions will account for 35% of a €2.9 billion market in Europe. Datamonitor expects the dominance of financial services end-users to remain, due to various initiatives such as GSTPA, T+1 and multi-channel banking.
The uptake of BC and DR solutions in other verticals is relatively similar to that in the security market, with the public sector, utilities and telecommunications organisations showing an increasing spend pattern. By 2005, Datamonitor expects European businesses to spend a total of €7.7 billion on achieving continuous availability of their operations.

The ideal enterprise continuity solution

It is of critical importance to today’s businesses to ensure the availability and continuous operation of business systems in spite of planned downtime and maintenance and of potential failures, ranging from disk crashes and CPU failures to catastrophic losses of computing facilities or communications networks. Typically, many companies also have geographically dispersed data centres. Information must be available across the data centres, and steps should be taken to ensure data continuity, particularly when a data centre has an outage, planned or unplanned.

Although solutions exist to provide tolerance to component failure, the issue of site loss is often overlooked, with potentially disastrous consequences: business interruption and loss of information. Traditionally, off-site tape dumps have satisfied the requirements for disaster recovery for batch systems, but they are typically inadequate for protecting the information stored in real-time eBusinesses and online transaction processing systems.

Asynchronous replication facilities can provide continuous duplication of critical eBusiness applications to off-site backup facilities, without the high latency inherent in tape backup strategies. Once established, such an environment can be automated to ensure that information is replicated in a timely manner and that the switch to backup systems is accomplished with minimal or no business interruption.
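The idea of asynchronous replication described above can be sketched in a few lines. This is a simplified illustration under stated assumptions, not a description of any vendor's product: the class name `AsyncReplicator`, the in-memory dictionaries standing in for the primary and off-site copies, and the `link_up` flag are all hypothetical.

```python
from collections import deque


class AsyncReplicator:
    """Toy model: writes commit locally at once and are shipped off-site later."""

    def __init__(self):
        self.primary = {}       # stands in for the production data store
        self.standby = {}       # stands in for the off-site backup copy
        self.pending = deque()  # changes committed locally but not yet shipped
        self.link_up = True     # hypothetical flag for the state of the link

    def write(self, key, value):
        # The application is not held up waiting for the remote site.
        self.primary[key] = value
        self.pending.append((key, value))

    def ship(self):
        """Send queued changes to the standby in their original order.

        Returns the number of changes shipped; nothing is sent, and nothing
        is lost, while the link is down.
        """
        shipped = 0
        while self.link_up and self.pending:
            key, value = self.pending.popleft()
            self.standby[key] = value
            shipped += 1
        return shipped


replicator = AsyncReplicator()
replicator.write("order-1", "placed")
replicator.ship()
print(replicator.standby)
```

The design choice illustrated is the trade-off inherent in asynchronous replication: the standby may lag the primary by whatever is still in the queue, in exchange for the primary never waiting on the remote site.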
Companies must develop a solution to provide continuous availability of their systems, or risk losing business to their competitors. It is essential that high availability solutions address the availability of data in three areas:
• manage planned downtime (e.g. routine maintenance downtime, software upgrades, etc.);
• protect during unplanned downtime (machine / network outages);
• provide disaster recovery.

Any system providing high availability (HA) should provide continuous availability of data in all three scenarios. The remainder of this section describes the various components needed to achieve continuously available systems. In combination, these components can provide a near-continuous service.

Hardware redundancy and physical replication

Hardware redundancy is often thought of as the first line of defence in continuous systems, and it protects against computer and disk failure. Although this can be an expensive option, it is the only way to protect against the failure of a machine or disk drive. Multiple processors in a single machine can protect against processor failure, while multiple machines in a cluster formation protect against machine failure; together these form the basis of hardware redundancy. Drive redundancy can be achieved by using RAID (redundant array of inexpensive disks) and disk mirroring solutions:
• RAID – provides redundancy within a single RAID cabinet that contains extra storage / disk drives;
• Disk mirroring – a subset of RAID, in which two separate disk storage devices are used to provide redundancy.

Both of these solutions protect against disk media failure as long as the redundant storage contains a valid copy of the data. This approach is also known as ‘cold standby’.
However, these solutions only secure data in its raw low-level format and do not address the actual meaning and usage of the data, or its interaction with applications. Today’s most critical eBusiness systems depend on data which has a complicated internal structure and is used in ways that depend on the state of key applications and business processes, rather than on a conventional set of data files. The delay and perception of unreliability involved in incidents affecting such data can represent millions of Euros in direct and indirect costs or loss of opportunity. Reputations are also at risk due to the external impression that the organisation is not in full control of its data, or that strict audit trails may have been compromised.

Logical replication

Logical replication is a process whereby logical operations / processes are replicated at the level of the database instead of at the disk level. Typically, database operations are replicated between two or more databases. It therefore replicates meaningful operations such as ‘Customer Y with account number 00192 has placed an order that needs to be delivered in 36 hours’. Using logical replication allows the customer information to be validated before the redundant copy is written to disk.

Redundant hardware is used in conjunction with logical replication, with two or more databases running on separate machines and writing to separate disk devices. This solution can be combined with physical replication, with physical replication providing a line of defence against loss of availability and logical replication providing coverage in the event of physical replication failure. This is also known as a ‘warm standby’ system and can be located anywhere in the world, technology and reliable bandwidth permitting.
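The difference between copying disk blocks and replicating a meaningful operation can be illustrated with a short sketch based on the ‘Customer Y’ example above. Everything here is hypothetical and for illustration only: the `PlaceOrder` and `Database` classes, the validation rules and the `replicate` function are assumptions, not part of any real database product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PlaceOrder:
    """A logical operation, as opposed to a raw disk block."""
    customer: str
    account: str
    deliver_within_hours: int


class Database:
    """Toy database that understands the operations it is asked to apply."""

    def __init__(self, known_accounts):
        self.known_accounts = set(known_accounts)
        self.orders = []

    def validate(self, op):
        # Illustrative checks only; a real system would apply its own rules.
        if op.account not in self.known_accounts:
            raise ValueError(f"unknown account {op.account}")
        if op.deliver_within_hours <= 0:
            raise ValueError("delivery window must be positive")

    def apply(self, op):
        self.validate(op)       # checked before anything is written
        self.orders.append(op)


def replicate(op, primary, standby):
    """Apply the same logical operation to both databases in turn."""
    primary.apply(op)
    standby.apply(op)  # validated again before the redundant copy is written


primary = Database(["00192"])
standby = Database(["00192"])
replicate(PlaceOrder("Customer Y", "00192", 36), primary, standby)
print(len(primary.orders), len(standby.orders))
```

Because the standby receives an operation it can interpret, a malformed request is rejected rather than copied, which is the property that block-level mirroring cannot offer.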

Surviving planned downtime

Generally, high availability is thought of in the context of server failures or disasters that are actually quite rare. Today’s administrators have to consider another measure of availability: keeping systems / applications available during normal planned maintenance. Although this downtime is planned, it accounts for the majority of the time that systems are unavailable. This becomes critical for businesses that operate a true 24x7 system serving customers globally, where taking any data offline can have severe consequences. Therefore, systems and software must be designed so that maintenance activities do not disrupt normal usage. This process is usually known as switchover.

Previous solutions involved minimising the amount of data affected by maintenance operations. Although this reduces the number of end-users affected, it still leaves some data unavailable, making it far from an ideal solution.

Application and business process continuity

Combining hardware redundancy, physical replication and logical replication reduces the amount of time it takes to recover in the event of an outage. However, it takes more than just fast recovery to provide continuous availability. Even if there is an immediately available copy of data when there is a disruption, users and applications must be switched over to the copy. This process of switching over to a good copy of data or another copy of an application is known as failover. The modes of failover include:
• active-passive – a fast failover mechanism using a cold standby copy of the data, either a genuine copy or switchable disks attached to both machines in a pair. However, it is not the most rapid or seamless approach: a second server has to be started, which involves its own overheads, and resources are wasted because the second server is not available for other work when there is no emergency.
Therefore, although failover is rapid, achieving true business continuity can be a slow procedure;
• active-active – this method solves the active-passive problem by running a second server alongside the main server. This server is available to perform other duties, which means that resources are not wasted during normal operation. Also, the time to take over from the main server during failure is significantly shorter;
• connection transparency – allows the various protection and failover solutions to work without any of the users or applications noticing that there has been a failure in the system.

Supporting continuous availability requires that failover is automatic and immediate, so that it is transparent to users and applications accessing the failed component. Also, once this component has been repaired or replaced and brought back online, users and applications will need to be switched back to the original component. Again, this process must be immediate and transparent. Achieving automatic, immediate and transparent failover requires appropriate failover support in both database management systems and application servers.

What about the network?

A common mistake in BCP and DR deployments is to ignore the vulnerability of the actual network, whether private or public. Typically, organisations will use a secure VPN between the onsite and offsite facilities to mirror / RAID their information assets. However, what happens if this link goes down during planned maintenance, or when network traffic becomes excessive, for example? Any loss of data, transactions or applications could prove very costly to the organisation. There is a way around this problem, however: businesses must accept that networks are unreliable and plan for this in their BCP and DR solution.
It is essential that storing and queuing techniques are used, where the most recent version of the replicated data is queued at the failed network node. Once the network is restored, this information can then be passed securely to the offsite facility.

Restoration – the weakest link?

Backup of information has to be fast enough to keep up with the size of the database, must scale with the database as it grows and must have minimal performance impact on active database operations. One method of minimising this impact is to offload as much of the backup processing as possible from the database itself to a backup server.

Restoring the database is the last step in recovering from an outage. However, this can be the limiting step in the whole recovery process. Typically, in day-to-day operations the key issue in a smoothly running system is backup performance, not restoration performance, and it is often the case that not enough attention is paid to restoration time. For example, a database management system that backs up at a rate of 500GB / hour and restores at a rate of 40GB / hour does little more than provide a false sense of security. Under these conditions, a one terabyte database can be backed up in two hours, but restoration would take 25 hours. These restoration time frames are now unacceptable for most businesses that wish to retain their customers, partners and suppliers without losing competitive edge. Therefore, it is essential that fast restoration systems are used.

Common misconceptions about BCP

Many businesses have spent millions on technology infrastructure and the resilience solutions to support it; however, many areas still exist where they are left exposed.
This is usually a result of misperceptions and a lack of education and understanding among the end-user base. The key points to understand include:

• throwing hardware at the problem will not solve it – disk mirroring, clustered machine pairs and offsite disk or tape backups may secure critical data, but are not enough to allow the continuous availability of applications that customers demand;

• successful failover is only the beginning – mirroring or taking snapshots preserves critical data, but unless there is a systematic, rapid and reliable way of going back to normality, organisations will be exposed to further failures;

• clustering is not sufficient – clustered pairs will not protect against all emergency situations, especially those that would be categorised as disasters. A pair of machines will not guarantee continuous availability when fires, floods or terrorist attacks occur;

• dual level resilience is not perfect – this becomes significant during planned maintenance. The applications and users are dependent on the second environment while the first is undergoing maintenance. What happens if the second system fails during the maintenance session? The answer is to have three layers (or more) of resilience, preferably combined with offsite secure copies of the data;

• messaging has its limitations – the use of messaging to some extent insulates one application from a failure in other applications feeding it. However, this does not guarantee that an entire chain of related systems will continue to function smoothly whatever the combination of emergencies.

Achieving continuous availability with Sybase

This paper has, so far, discussed numerous continuity planning solutions, such as warm standby, avoidance of low level database corruption, triple layer resilience and connection transparency.
Combinations of the above solutions are essential to make an enterprise truly resilient. Depending on the business need, Sybase can provide a variety of levels of availability that address the various degrees of severity in disasters and emergencies, in various combinations. Its solution offering is illustrated in Figure 3. It can be observed that Sybase offers a complete solutions architecture – the products, design and implementation templates to make them work. Each element of the architecture raises continuous availability to a higher level on the 4x9’s, 5x9’s scale. Its three key product offerings are:

• Adaptive Server Enterprise – High Availability (ASE HA);

• Replication Server;

• OpenSwitch.

Figure 3: Sybase Triple layer resilience solution
[Diagram labels: Database on dual ported disk array; Mirroring; Hardware Clustering; Cluster Control Software; ASE HA; Warm Standby; Replication Server; OpenSwitch; Offsite Standby ASE; Offsite Replicated Database; Hardware Failures; Software Failures; Database Corruption; Double Failures; Fire, Flood, Terrorism; Continuously Available Applications; 24x7x52 Business operation]
Source: Datamonitor

Adaptive Server Enterprise (High Availability)

Sybase ASE (HA) 12.5 is a data management platform for mission-critical, transaction-intensive enterprise applications. It can be used to provide continuous availability of database systems to customers, partners and suppliers, without creating headaches or confusion, or requiring them to reconnect.
It has the ability to manage each level of a BCP and DR solution, including hardware redundancy, cold standby (backup and restore), warm standby (replication) and active / active hot standby. In the event of a failure, ASE (HA) enables the movement of end-users from a primary system to a back-up system at any point in time, without causing any disruption to the applications or information that they are using. Essentially, it insulates the user from the complexities of back-end systems.

Sybase ASE (HA) leverages the cluster architecture discussed previously and provides failover to a backup server without losing any committed data or severing user connections. It also works in concert with existing hardware and software high availability solutions from third party vendors, to deliver maximum systems availability.

Furthermore, Sybase ASE (HA) incorporates a Companion Server that allows the configuration of two ASE (HA) servers as companions in either asymmetric (master-slave) or symmetric (active / active hot standby) mode, creating a hot standby capability that can further reduce unplanned downtime. This approach involves the deployment of a two-node hardware cluster with two ASE (HA) databases running as companion servers to each other. In this configuration, both servers run applications, so that if Server 1 fails, Server 2 opens up the devices on which the primary’s databases are built and performs a fast recovery to bring them online, while continuing to handle its own clients. The failover process is simplified through the use of ASE (HA), as it sends the end-user an error message indicating that failover has occurred and that the current transaction must be resubmitted.

Replication Server

Sybase’s replication server provides automatic server failover solutions that reach across LAN and WAN, providing simple replication of data over extensive geographic regions.
The replication server provides high availability and disaster recovery services, giving greater protection against failures through asynchronous, wide-area delivery of database transactions. It is also possible to share information across an organization by replicating data to and from heterogeneous hardware and data sources without losing transactional integrity of the data.

The replication server operates in a warm standby mode where databases are kept in close synchronisation using a database replication system. This provides simple configuration, database mirroring with a geographical reach of thousands of kilometres, the ability to switch the direction of replication rapidly and automatic switchover in the event of failure (when used in combination with OpenSwitch). This warm standby feature enables customers to more easily configure and manage a high-availability, distributed recovery environment at lower administrative cost than traditional replication products. Also, this feature simplifies the configuration and operation of a warm standby environment by eliminating the traditional requirement to define individual objects eligible for replication and explicit subscriptions to the objects.

With the replication server operating in the warm standby mode, it is possible to switch the direction of replication when the primary system is unavailable, with the applications being routed to the standby system. Activation of the switch results in the reverse flow of transaction traffic from the standby system to the active system (which may be currently unavailable). This enables the replication server to queue up any transactions applied to the standby system by the clients, until the primary system is brought online. Once online, the primary system can be re-synchronised from the queued data.
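The queue-and-resynchronise behaviour described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Replication Server's implementation: the names StoreAndForwardQueue and FlakyLink are hypothetical, and the link to the other system is simulated in memory. Transactions are held in arrival order while the destination is unreachable and are delivered in the same order once it returns.

```python
from collections import deque


class FlakyLink:
    """Hypothetical stand-in for the link to the other site (in memory only)."""

    def __init__(self):
        self.up = True
        self.received = []

    def send(self, txn):
        # Raise when the link is down, otherwise record the delivery.
        if not self.up:
            raise ConnectionError("link down")
        self.received.append(txn)


class StoreAndForwardQueue:
    """Hold transactions while the destination is unreachable (illustrative)."""

    def __init__(self, send):
        self._send = send        # callable delivering one transaction
        self._pending = deque()  # undelivered transactions, oldest first

    def submit(self, txn):
        """Queue a transaction, then try to deliver everything pending."""
        self._pending.append(txn)
        return self.flush()

    def flush(self):
        """Deliver queued transactions in order; stop at the first failure.

        Returns the number of transactions delivered by this call.
        """
        delivered = 0
        while self._pending:
            try:
                self._send(self._pending[0])
            except ConnectionError:
                break  # destination still unavailable: keep the rest queued
            self._pending.popleft()  # remove only after a successful send
            delivered += 1
        return delivered

    def __len__(self):
        return len(self._pending)
```

With the link marked down, submitted transactions accumulate in the queue; calling flush() after the link returns delivers them in their original order, which is the essence of the queue-and-resynchronise capability.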
This functionality is critical in situations where failback is as important as failover, for example when managing planned downtime, or unplanned downtime where the primary systems are recovered relatively quickly and need to be resynchronised rather than built from scratch.

Sybase Replication Server is fully interoperable with ASE (HA) and supports the high availability features in a cluster configuration. For example, in the event of a node failure (or process failure), the replication server process is migrated over to the companion node without administrator intervention, thereby presenting the organisation with cost and time benefits.

OpenSwitch – facilitating continuous availability

Adaptive Server Enterprise’s Companion Server option provides the fastest failover functionality and can also guarantee that users will not be disconnected in the event of a system failure. However, organisations that wish to provide continuous availability to a variety of clients over extended geographic regions will benefit from the enhanced “connection transparency” facilities of Sybase OpenSwitch.

Sybase OpenSwitch is a solution that provides transparent client connection failover as well as transparent client connection routing, load balancing and central management. OpenSwitch sits between client applications and the servers to which they connect and works either manually, in response to an administrative request, or automatically when it detects a server failure. It transparently transfers incoming connections to any Sybase server product, or another instance of OpenSwitch. The true beauty of OpenSwitch is that it monitors server availability on an ongoing basis and, when a server becomes unavailable, it transfers the client connection to another server without disturbing the connection.
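The monitoring and switching behaviour described above can be sketched as follows. This is an illustrative Python model only: FailoverRouter, is_alive and switch_to are hypothetical names, not OpenSwitch APIs, and the sketch does not attempt to preserve transaction or connection state. It shows the two triggers mentioned in the text: an automatic switch when the current server fails a health check, and a manual switch on administrative request.

```python
class FailoverRouter:
    """Pick the server that client connections should use (illustrative)."""

    def __init__(self, servers, is_alive):
        self._servers = list(servers)  # ordered by preference, primary first
        self._is_alive = is_alive      # callable(server_name) -> bool
        self.current = None

    def route(self):
        """Return the server to use now.

        The current server is kept while it is healthy; otherwise the first
        healthy server in preference order is selected (automatic failover).
        """
        if self.current is not None and self._is_alive(self.current):
            return self.current
        for server in self._servers:
            if self._is_alive(server):
                self.current = server
                return server
        raise RuntimeError("no server available")

    def switch_to(self, server):
        """Manual switch, for example failback after planned maintenance."""
        if server not in self._servers:
            raise ValueError(f"unknown server: {server}")
        if not self._is_alive(server):
            raise RuntimeError(f"server not available: {server}")
        self.current = server
```

In this sketch the router does not fail back on its own when the primary recovers; an explicit switch_to call returns connections to it. That is a design choice made for the example, so that switching back happens at a time chosen by the administrator.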
By monitoring and restoring the transaction state, communications state and connection state of each connection, OpenSwitch is able to transfer client connections without the need to disconnect and reconnect. From an end-user’s perspective, nothing has changed, while behind the scenes a great deal may be happening. The user may notice a pause or short delay, but the connection and the customer are not lost. This functionality is available to existing applications without requiring any programming changes and highlights the continuous availability offered by Sybase solutions.

Advantages of OpenSwitch include:

• load balancing and routing – allows organisations to transfer connections based on the type of users coming into pools of servers. For example, different users may have different performance requirements for their applications, ranging from the need for sub-second query responses to the need to run less time sensitive batch reports. This allocation process is performed on the fly without having to bring any servers down;

• central management – provides tools designed to make the end-user experience as simple as possible and to simplify the challenges IT faces behind the scenes. For example, consider an organization in which the administrator needs to perform a data load involving hundreds of gigabytes. Such loads can take hours and are normally done in the middle of the night. Using OpenSwitch, the load can be performed during the day and, once the load is completed, users can be moved over to the updated server with a simple flick of a switch.

Why Sybase?

• builds on existing investments – in mirroring, clustered hardware, cluster control software, remote DR recovery, etc.
but adds the database and application level resilience that enhances basic data protection and hardware duplications;

• active–active – configuration for the quickest failover, and a secondary server that is always running;

• secure, physical and logical warm standby – of database copies, providing insurance against lurking corruption and provision of an offsite copy that is secure against severe emergencies like fires, floods and terrorist attacks;

• triple level resilience – architected with active-active and warm standby combined. Provides integrated, automated protection against service interruptions, database corruption and severe emergencies;

• connection transparency – provides protection and failover without applications or users noticing. Provides real assurance that each part of an STP, reporting or customer service chain will continue smoothly. The whole value chain will continue to operate correctly, both in parts and as a whole;

• multiple benefits from each investment – remote DR sites also support failover operations if hardware failure occurs during planned maintenance and OpenSwitch ensures this is invisible to end-users, while also ensuring minimum disruption to applications during any normal high availability failover;

• ease of switching back to normal – well defined procedures are inherent in the Sybase architecture and allow for a return to the normal configuration. Also, the Sybase solution set works at the database and the application level, providing logical integrity of the system at all times.

The true value of the Sybase offering arises when the products work as one, to provide continuously available BCP and DR solutions.
The benefits of such an integrated solution include:

• fast failover – locally provides resilience against any hardware breakdowns;

• secure logical copy – provides a highly up to date logical copy of all data at the remote site, providing resilience against any site disasters, or inaccessibility of the main site to the key staff;

• invisibility – any maintenance activities to hardware or applications are invisible to the end-users;

• triple layer resilience – possible breakdowns during planned maintenance periods go unnoticed by users through the secure triple layer resilience;

• highly automated – procedures providing rapid failover and failback are provided through the automation of the solution;

• immune to corruptions – the solution provides a remote copy which is immune to any database corruptions originating in the original copy;

• intelligent – the data replication functionality is closely related to the individual databases supporting individual applications, meaning that the failback procedures are less error prone than if all the data is mirrored indiscriminately.

Conclusion

Today’s world of eBusiness means that it is essential for businesses to have some form of business continuity planning and disaster recovery solution in place. This is not limited to the financial services sector, which is dependent on high volume and high value transactions to run its business, but applies to businesses in all industry sectors. Now it is not only the internal employees that are dependent on system availability, but also business partners, customers and suppliers. Enterprise applications and collaborative initiatives have further highlighted the need for BCP and DR.
The cost of not having a robust continuity solution in place could be catastrophic – lost revenues, bad press coverage, loss of customers and competitive mindshare, to name but a few. The level of continuity needed is obviously dependent on the business need; however, BCP and DR demand a certain level of skill, expertise and experience that may be lacking in an enterprise. Using an outside vendor for the writing, testing and implementation stages of any plan may be an expensive option. On the other hand, a ‘build it yourself’ approach could prove complicated and time consuming, especially when multiple organizations and sites are involved.

Therefore, it is essential that businesses select a continuity vendor with a proven track record and a comprehensive solution offering that meets the business need. This is where Sybase comes in with its various continuity offerings. Together, the Adaptive Server Enterprise, Replication Server and OpenSwitch solutions provide a continuously available continuity solution for use across all areas of the business, on a 24x7x52 basis.

Appendix: Business Continuity Planning and Disaster Recovery in practice

Case study – large European clearing bank

The business challenge: A large European clearing bank was offering one of its largest customers an extranet web-based system allowing frequent and high volume transactions to be performed. These transactions would be aggregated, calculations would be performed based on external data feeds, and the aggregated transactional feeds would be sent to existing bank foreign exchange systems. As this business offering was to support the customers’ own web sites, 24x7x52 availability of the transactional receipt process was essential for customer satisfaction. In order to avoid backlogs, only minimal downtime in any part of the system could be tolerated.
Also, to meet regulatory and auditor requirements, the ability to switch processing to an alternate site was essential.

The business solution: This was a greenfield deployment and the bank opted for a three tier solution, using a well known application server for the middle tier and web clients at the top. The bank’s appreciation of Sybase’s high availability solution in ASE 12.5 led it to insist on this architecture, in preference to other database vendors’ offerings. Also, the bank’s long track record with Replication Server gave it faith in this as the most reliable way to transport data rapidly and safely offsite, while filtering out any low level physical database corruptions.

The Sybase High Availability Solution: This customer used all elements of the Sybase continuity offering bar one, Sybase OpenSwitch. The customer felt that connection transparency could be built into the J2EE based middle tier. No other parts of the system would require direct connection to the database. For a site with many existing 2 tier applications, the decision would have been made differently. This example indicates that, while comprehensive, the Sybase architecture is flexible and capable of working with 3rd party products where necessary.

To fulfil the business and technology requirements, the bank opted for the ASE 12.5 solution, which provides dynamic memory management features for administration without frequent restarts, in addition to high availability and fast failover support. Replication servers were used onsite and offsite to avoid physical low level data corruptions.

The benefits: The bank is able to provide the ‘always available’ solution demanded by the end customers using the web sites of the bank’s own customers, and comfortably meet the regulatory requirements and service level agreements (SLAs) in place.
Use of such a scalable solution will allow the bank to expand these operations easily to include the growing number of customers who are interested in this business offering.

Case study – European branch of a major Asian bank

The business challenge: A European branch of a major Asian bank was running a significant number of business critical applications and Sybase based systems that were supported by a variety of technical architectures. The day-to-day operational management and administrative workload generated by these various architectures / hosts was becoming an unacceptable cost to the business. Furthermore, the resilience strategy for a number of these systems and applications (notably Summit and Fidessa) was not going to meet the latest demands of the business user communities, both in terms of loss of data and time to recover in the event of a significant primary and / or host failure. The bank was facing a situation where the operational risks that its business systems infrastructure presented were now considered unacceptable and needed to be addressed as a matter of urgency.

The business solution: To address these problems, the bank chose a Sybase systems consolidation and resilience solution that addressed two key criteria:

• migration / consolidation of these disparate Sybase based systems to a single technical architecture based on two ASE servers;

• incorporation of the migrated solution into a single standardised high availability solution, based on proven / complementary Sybase technologies.

The project produced a new high resilience architecture for the bank’s operations.

The Sybase High Availability Solution: For each of the systems / applications that were migrated to the new architecture, exactly the same high availability solution was deployed.
Even if the SLA requirements for a given system did not require the level of resilience the solution provided, it was considered that taking a uniform approach would be most cost effective in terms of operational support, control and management.

The high availability solution deployed utilised two Sybase ASE (HA) servers, one on the primary site and the other on the DR site. Warm standby functionality was provided by the replication server and allowed near real time replication of all database data between the two ASE (HA) servers. This server was deployed at the DR site host.

This bank made use of the Sybase OpenSwitch technology, providing user and application transparency in the event of a systems failure. Management of the OpenSwitch technology is achieved through the OpenSwitch Coordination Module, which monitors all other Sybase components in the architecture and runs alongside the OpenSwitch server on the DR host. Its primary role is to detect significant failure in the primary database server environment and then coordinate the switching of user connections within OpenSwitch, with management activities being performed by the replication server.

The benefits: In addition to solving the critical business problem, the project aimed to maximise the return on investment, by ensuring that best use was made of the new host architecture. The migration of a given system / application to the new architecture / high availability solution would be transparent to business users and in house application development teams. Disruption to 3rd party systems was nonexistent, thereby allowing the data movement and interfaces to remain the same.

The resulting high availability solution is platform independent and therefore allows for flexibility in client choices, particularly when it comes to platform upgrades such as operating systems or disk sub-systems. This also allows for a scalable architecture, both in terms of hardware and software performance.
Using OpenSwitch has allowed the bank to perform essential maintenance tasks by seamlessly switching business users between the primary and disaster recovery database servers.
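Worked example – backup versus restoration time

As a check on the arithmetic in the section ‘Restoration – the weakest link?’, the short Python sketch below recomputes the backup and restoration windows for a one terabyte database. It assumes that one terabyte is taken as 1000 GB and that the rates are 500 GB / hour for backup and 40 GB / hour for restoration, which is the reading that reproduces the two-hour and 25-hour figures quoted in that section. The function name transfer_hours is hypothetical and used for illustration only.

```python
def transfer_hours(size_gb, rate_gb_per_hour):
    """Hours needed to move size_gb of data at a constant rate."""
    if rate_gb_per_hour <= 0:
        raise ValueError("rate must be positive")
    return size_gb / rate_gb_per_hour


DATABASE_GB = 1000  # assumption: one terabyte taken as 1000 GB

backup_hours = transfer_hours(DATABASE_GB, 500)   # assumed backup rate
restore_hours = transfer_hours(DATABASE_GB, 40)   # assumed restoration rate

print(f"backup:  {backup_hours:.1f} hours")
print(f"restore: {restore_hours:.1f} hours")
```

The restoration window, not the backup window, determines how long the business is without its data after an outage, which is why the restoration rate deserves at least as much attention as the backup rate.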