Business Continuity and Disaster Recovery

Purpose

Vision Database Systems aims to maintain services and support for our customers, and our employees’ ability to work, as far as possible. When a disaster affects our ability to deliver services, support customers, or do our work, or threatens basic safety, we need to be prepared to respond. The goal of this plan is to ensure the safety of our employees and restore services and operations to the greatest extent possible in the shortest possible time, while maintaining security and compliance. This plan provides guidance for responses to significant detrimental events but is not intended to document daily problem resolution procedures.

Scope

All business-critical IT Systems, especially those systems providing services to customers or facilitating customer support.

Any event that causes prolonged degradation of Vision Database Systems services, or the inability of Vision Database Systems employees to perform core business functions (customer support, operation of services, security of operations).

This policy applies to all employees of Vision Database Systems and to all relevant external parties, including but not limited to Vision Database Systems consultants and contractors.

Policy

In the event of a major disruption to production services, or a disaster affecting either the business-critical systems used for Vision Database Systems operations or the safety, security or ability to work of a significant number of Vision Database Systems’ employees, the Director of Operations shall diagnose the situation and direct mitigating actions.

Appendix: Diagnostic Steps provides guidance on determining what is affected by the disaster.

Where possible, mitigating actions are prepared and described in scenario-specific action plans in Appendix: Scenarios. The Director of Operations will follow these prepared action plans when appropriate. For situations that have not been preplanned, the Director of Operations will coordinate mitigating actions in consultation with the Principal Engineer.

Hard copies of this plan should be kept in each Vision Database Systems office, as well as in the home offices of all relevant employees.

The following factors are to be considered in planning mitigations:

  • Employee Safety

  • Continuity of information security

  • Continuity of compliance

  • Continuity of operations.

In the case of an information security event or incident, refer to the Incident Response Plan.

Review

This plan must be reviewed and tested annually and updated to address any issues identified. The plan must also be reviewed and updated after any activation of the plan to determine improvements for future disaster scenarios. The agenda for a testing exercise should be maintained in Appendix: Test Plan.

Activation

This Business Continuity and Disaster Recovery plan is to be activated when one or more of the following criteria are met:

  • An Amazon data center in which Vision Database Systems stores its data is unavailable or is in imminent danger of becoming unavailable for an extended period of time.

  • A system supporting a core Vision Database Systems business function is unavailable or is in imminent danger of becoming unavailable for an extended period of time.

  • A significant number of Vision Database Systems employees are unable to work or in imminent danger of being unable to work for an extended period of time.

Examples of situations that would cause the above criteria to be met are:

  • loss of utility service (water, power, heating fuel)

  • loss of internet connectivity

  • catastrophic events (weather, natural disaster, vandalism)

The person discovering the potential or actual disaster must notify the Director of Operations (contact details listed below). If the Director of Operations is unavailable, the Principal Engineer must be notified instead.

Communications Processes

Once notified of a potential disaster, the Director of Operations will consult with the principal department heads and follow appropriate diagnostic steps. Once the disaster has been diagnosed and the plan activated, the Director of Operations will direct the department heads and all relevant employees to convene an all-call remote meeting. This online meeting room will be used as the primary mechanism to coordinate action and internally communicate status updates.

If the remote meeting room is unavailable, the Director of Operations will arrange an alternate digital or physical meeting room and communicate the location to the department heads and all relevant employees.

The Customer Support team will handle proactive communications to customers and resellers and respond to questions from resellers and customers.

If employee safety is threatened, the first action should be to communicate with all employees to ensure their safety (see Appendix: Employee Safety Confirmation Process).

Alternate Work Facilities

If the Vision Database Systems office becomes unavailable due to a disaster, all staff should work remotely from their homes or any safe location. Similarly, if an employee’s home office becomes unavailable or unsafe, the employee should first seek safety, and once safe, work from the office or find a safe alternate work location.

If necessary, the Director of Operations should procure a temporary work location (e.g. coworking space membership) and accommodations in a location unaffected by the disaster so that affected employees can continue working.

All tools and processes used to conduct regular operations at Vision Database Systems should be conducive to remote work to the greatest extent possible. For example, the use of web applications over encrypted channels is preferred to private server applications that require users to be on the network. Security controls should assume and account for remote work.

Continuity of Critical Services

Procedures for maintaining continuity of critical services in a disaster can be found in Appendix: Scenarios.

Recovery Time Objectives (RTO) and Recovery Point Objectives (RPO) can be found in Appendix: Asset RPO and RTO.

Strategy for maintaining continuity of services:

| KEY BUSINESS PROCESS | CONTINUITY STRATEGY |
| --- | --- |
| Customer Service Delivery | Rely on AWS availability commitments and SLAs; use multi-site active-active, with cross-region backups where possible. |
| IT Operations | Use SaaS applications or AWS hosted applications to ensure operations do not depend on a single physical location and are conducive to remote work arrangements. |
| Email | Utilize Microsoft365 and its distributed nature; rely on Microsoft’s standard service level agreements. |
| Customer Support | All systems are vendor-hosted SaaS applications; use Microsoft365 as communications channel if FreshDesk is down. |
| Finance, Legal and HR | All systems are vendor-hosted SaaS applications. |
| Sales and Marketing | All systems are vendor-hosted SaaS applications. |

Roles and Responsibilities

| Person | Roles | Responsibilities |
| --- | --- | --- |
| Director of Operations | Coordination and Communication | Determine activation of plan; Coordinate employee response; Coordinate communication of status internally; Prioritize activities to ensure safety, security, and core services are maintained or restored as soon as possible; Work with Customer Support team to ensure employee safety; Monitor employee safety status; Review and Test plan annually |
| Principal Engineer | Technical Execution | Provide technical guidance on mitigating actions to the Director of Operations; Ensure all failovers complete smoothly; Deploy new infrastructure to replace failed infrastructure where necessary; Review and Test plan annually; Designate and brief alternate person in case of unavailability |
| Director of Sales | Alternate for Director of Operations | Provide support for Director of Operations; Coordinate external communication to resellers and customers |
| Customer Support Team | Communication | Communicate with employees to ensure safety; Communicate status to customers and resellers; Handle questions from customers and resellers |
| Engineering Team | Technical Execution | Support Principal Engineer as needed to recover services |

Revision History

| Version | Date | Description | Author | Approved by |
| --- | --- | --- | --- | --- |
| 1.0 (Business Continuity and Disaster Recovery Plan) | May 2024 | Initial Plan | Zack Walker | Andrew Moretti |

Appendices

Appendix: Disaster Recovery Strategies

AWS Multi-site Active-Active Strategy

Application load is distributed across multiple resources located in two or more physical locations (AWS Availability Zones). If one Availability Zone (AZ) becomes unavailable, resources are automatically or manually provisioned in the healthy Availability Zone to handle the load from the first zone.

Specific resources following these strategies:

  • Relational Database Service (RDS) - the core application database is hosted on Aurora Serverless. Aurora databases use a separate redundant storage layer independent of the servers that is distributed across all Availability Zones in a region. If the DB instance for an Aurora Serverless DB cluster becomes unavailable or the AZ it's in fails, Aurora automatically recreates the DB instance in a different AZ.

    • In addition to active data, Aurora backs up the database automatically and continuously for a 7-day period. Aurora also takes a daily snapshot for further redundancy beyond the continuous backups.

    • New DB clusters can be established from snapshots, typically in less than 30 minutes.

  • Elastic Compute Cloud (EC2) - the application servers run in an auto scaling group distributed across more than one availability zone. If an entire availability zone were to become unavailable, the auto scaling logic would provision more servers in the other availability zone(s) until the load from the users was met.

  • Simple Storage Service (S3) - redundantly stores objects on multiple devices across at least three Availability Zones in an AWS Region and is designed to sustain data in the event of the loss of an entire Amazon S3 Availability Zone.
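As a rough illustration of the EC2 point above, the sketch below summarizes how healthy application servers are spread across Availability Zones. It assumes a dictionary shaped like the response of the AWS Auto Scaling DescribeAutoScalingGroups API (for example, as returned by boto3's `describe_auto_scaling_groups`). The group name `app-asg`, the zone names, and the sample data are hypothetical and do not describe the actual Vision Database Systems configuration.

```python
def healthy_instances_by_az(response, group_name):
    """Count InService, Healthy instances per Availability Zone for one group."""
    for group in response["AutoScalingGroups"]:
        if group["AutoScalingGroupName"] == group_name:
            # Start every configured zone at zero so empty zones stay visible.
            counts = {az: 0 for az in group["AvailabilityZones"]}
            for inst in group["Instances"]:
                if (inst["LifecycleState"] == "InService"
                        and inst["HealthStatus"] == "Healthy"):
                    az = inst["AvailabilityZone"]
                    counts[az] = counts.get(az, 0) + 1
            return counts
    raise KeyError(f"Auto scaling group not found: {group_name}")


def zones_without_capacity(counts):
    """Return the zones that currently have no healthy instance."""
    return sorted(az for az, n in counts.items() if n == 0)


# Hypothetical sample data shaped like a DescribeAutoScalingGroups response.
SAMPLE_RESPONSE = {
    "AutoScalingGroups": [
        {
            "AutoScalingGroupName": "app-asg",
            "AvailabilityZones": ["us-east-1a", "us-east-1b"],
            "Instances": [
                {"InstanceId": "i-0001", "AvailabilityZone": "us-east-1b",
                 "LifecycleState": "InService", "HealthStatus": "Healthy"},
                {"InstanceId": "i-0002", "AvailabilityZone": "us-east-1b",
                 "LifecycleState": "InService", "HealthStatus": "Healthy"},
                {"InstanceId": "i-0003", "AvailabilityZone": "us-east-1a",
                 "LifecycleState": "Terminating", "HealthStatus": "Unhealthy"},
            ],
        }
    ]
}

counts = healthy_instances_by_az(SAMPLE_RESPONSE, "app-asg")
print(counts, zones_without_capacity(counts))
```

A zone with zero healthy instances while the other zones carry the load is the pattern expected during an Availability Zone outage; whether it is a problem depends on whether total capacity still meets demand.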

Appendix: Diagnostic Steps

  1. If the disaster affects the Southeast Florida region:

    1. Consult news sources to gather information on the impact of the disaster.

    2. If employee safety could be affected, immediately direct the Customer Support Team to follow the Appendix: Employee Safety Confirmation Process.

  2. Check the Vision Database Systems Status Dashboard (a site which includes information on Vision Database Systems system health).

    1. Determine if Vision Database Systems applications are experiencing downtime.

    2. If the problem is not in a Vision Database Systems application, determine if AWS has published any notices.

  3. Attempt to log into RapIDadmin, EliteID, and PerfectPass.

  4. Attempt to log into the AWS console.

  5. Observe the state of the database and application environments:

    1. Are the major components (autoscaling functionality, RDS cluster) still operational?

    2. Are autoscaling and failover functioning normally and recovering the services?

    3. Is service recovery trending towards normal within 15 minutes?

Based on the evidence gained from the above diagnostic steps, the Director of Operations will decide, in consultation with the Principal Engineer, if a disaster has occurred. If the disaster corresponds to one of the scenarios in the Appendix: Scenarios, the Director of Operations will direct the execution of the given checklist. If the disaster does not correspond to a prepared scenario, the Director of Operations will consult with the Principal Engineer and appropriate department heads to determine the appropriate plan of action.

If AWS has not acknowledged the disaster on their public site, consider submitting an AWS support ticket to notify AWS of the issue.
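Parts of the diagnostic steps above could be scripted as a first pass before anyone logs in by hand. The following is a minimal sketch, assuming each system exposes an HTTPS endpoint. Every URL except the AWS console address is a hypothetical placeholder, and a reachable login page does not prove that logging in works, so this complements rather than replaces the manual checks.

```python
import urllib.error
import urllib.request


def http_check(url, timeout=10):
    """Return True if the URL answers with an HTTP status below 500."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status < 500
    except urllib.error.HTTPError as err:
        # 4xx still means the service answered; 5xx counts as down.
        return err.code < 500
    except (urllib.error.URLError, OSError):
        # DNS failure, refused connection, timeout, and similar.
        return False


def run_diagnostics(endpoints, check=http_check):
    """Check each named endpoint and return {name: reachable}."""
    return {name: check(url) for name, url in endpoints.items()}


def summarize(results):
    """Produce a one-line summary for the coordination meeting."""
    failed = sorted(name for name, ok in results.items() if not ok)
    if not failed:
        return "All checked endpoints reachable."
    return "Unreachable: " + ", ".join(failed)


# Hypothetical placeholder URLs; replace with the real endpoints.
ENDPOINTS = {
    "status-dashboard": "https://status.example.com/",
    "RapIDadmin": "https://rapidadmin.example.com/login",
    "EliteID": "https://eliteid.example.com/login",
    "PerfectPass": "https://perfectpass.example.com/login",
    "AWS console": "https://console.aws.amazon.com/",
}

if __name__ == "__main__":
    print(summarize(run_diagnostics(ENDPOINTS)))
```

The `check` parameter is injectable so the script can be exercised in a plan test without touching real systems.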

Appendix: Employee Safety Confirmation Process

When an event could affect employee safety, Vision Database Systems will confirm the safety of employees using this process.

  1. The Director of Operations, or the Customer Support team at the direction of the Director of Operations, will post a message in a company-wide email and text group, describing the situation and asking each employee to respond with their status.

  2. Employees will report back their status.

  3. The Director of Operations will monitor the group to ensure all employees report back.

  4. The Customer Support team will follow up with any employee who does not respond quickly via alternative communications channels.

  5. If an employee is not safe, Vision Database Systems will attempt to provide that employee with resources (information or help) to assist in getting them to safety where possible, and continue to monitor the situation.
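The bookkeeping in steps 2 to 4 above can be sketched as a small roll-call helper. This is an illustration only: the employee names, the reply values `"safe"` and `"need help"`, and the 30-minute grace period are all assumptions, since the plan itself only says to follow up with employees who do not respond quickly.

```python
from datetime import datetime, timedelta


def roll_call(roster, responses, sent_at, now, grace=timedelta(minutes=30)):
    """Group the roster by reported status.

    responses maps employee name -> "safe" or "need help".
    Employees with no reply are listed as pending, and are flagged for
    follow-up only once the (assumed) grace period has elapsed.
    """
    safe = sorted(n for n in roster if responses.get(n) == "safe")
    need_help = sorted(n for n in roster if responses.get(n) == "need help")
    pending = sorted(n for n in roster if n not in responses)
    follow_up = pending if now - sent_at >= grace else []
    return {
        "safe": safe,
        "need_help": need_help,
        "pending": pending,
        "follow_up": follow_up,
    }


# Hypothetical roster and replies.
roster = ["Ana", "Ben", "Cy"]
responses = {"Ana": "safe", "Ben": "need help"}
sent_at = datetime(2024, 5, 1, 9, 0)

early = roll_call(roster, responses, sent_at, now=datetime(2024, 5, 1, 9, 10))
late = roll_call(roster, responses, sent_at, now=datetime(2024, 5, 1, 9, 45))
print(early["follow_up"], late["follow_up"])
```

Anyone listed under `need_help` corresponds to step 5, and anyone under `follow_up` to step 4.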

Appendix: Scenarios

Vision Database Systems’ production operations are hosted using AWS services with auto scaling and multi-site redundancy, and incremental continuous backups. Therefore our strategies for disasters in the cloud center around making sure failovers happen correctly, and creating new application environments from data backups when necessary.

Vision Database Systems uses SaaS / Cloud applications for most business critical functions so that all employees should be able to perform their work from any safe location that has power and a stable internet connection. Business continuity is therefore focused on (a) ensuring employee safety and (b) getting enough staff to a connected alternate work location to continue serving customers.

These scenarios also assume that there exists a safe travel channel for a minimal number of employees to take to reach a safe, internet-connected working location (if their home does not qualify), and that it is safe for the employee to leave their home. Employees should establish safety for themselves and their families / household prior to returning to work.

If a situation is so severe that it is not possible for even a minimal number of Vision Database Systems staff to safely relocate to an internet-connected work location, we assume that continued operation of Vision Database Systems is immaterial. For example, if a natural disaster destroys power and network infrastructure across the entire Eastern and Central United States, very few people will be connected to the internet at all, so Vision Database Systems’ employees should focus on finding safety and taking care of others until power or network infrastructure is restored to a sufficient extent that Vision Database Systems can resume continuity efforts according to one of the Scenarios below.

Disasters affecting AWS Availability Zone(s) or Individual Services

Plan of Action

  1. Assemble team in the appropriate meeting room. (Director of Operations)

  2. Monitor service failover and deploy backup infrastructure. (Principal Engineer)

    1. Ensure database failover occurs.

    2. Ensure auto scaling replaces lost services with new nodes.

      1. If the application scaling infrastructure is disabled, create a new application environment from backed up code artifacts.

    3. Determine if any data loss occurred, or if data needs to be corrected (e.g. to prevent stuck jobs). If so, restore or recover the data from backups.

    4. Determine if any secondary services are down and recover them.

  3. Determine service and data recovery timeframes. (Principal Engineer)

  4. If service is likely to be degraded for more than 15 minutes, direct the Customer Support Team to contact Customers and Resellers to make them aware of the situation. (Director of Operations)

  5. Update the service updates page and direct customers to review it for updates. (Director of Operations and Principal Engineer)

  6. Improve the process in case of a future disaster. (Director of Operations and Principal Engineer)
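For the "ensure database failover occurs" step in the plan above, one possible check is to confirm that the cluster is available and has exactly one writer after the failover. The sketch below assumes a dictionary shaped like the response of the Amazon RDS DescribeDBClusters API (for example, from boto3's `describe_db_clusters`) and a cluster whose instances appear under `DBClusterMembers`. The identifiers `core-db`, `core-db-1`, and `core-db-2` are hypothetical.

```python
def cluster_summary(response, cluster_id):
    """Summarize cluster status and writer membership for one DB cluster."""
    for cluster in response["DBClusters"]:
        if cluster["DBClusterIdentifier"] == cluster_id:
            writers = [
                member["DBInstanceIdentifier"]
                for member in cluster.get("DBClusterMembers", [])
                if member["IsClusterWriter"]
            ]
            return {
                "status": cluster["Status"],
                "writers": writers,
                # Assumption: healthy means available with exactly one writer.
                "healthy": cluster["Status"] == "available" and len(writers) == 1,
            }
    raise KeyError(f"DB cluster not found: {cluster_id}")


# Hypothetical sample data shaped like a DescribeDBClusters response.
SAMPLE_CLUSTERS = {
    "DBClusters": [
        {
            "DBClusterIdentifier": "core-db",
            "Status": "available",
            "DBClusterMembers": [
                {"DBInstanceIdentifier": "core-db-1", "IsClusterWriter": False},
                {"DBInstanceIdentifier": "core-db-2", "IsClusterWriter": True},
            ],
        }
    ]
}

print(cluster_summary(SAMPLE_CLUSTERS, "core-db"))
```

Comparing the writer before and after the event shows whether a failover actually took place; the check says nothing about data loss, which step 2.3 covers separately.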

Data / Infrastructure Sabotage or Human Error

In this scenario the assumption is that an attacker or an employee has intentionally or accidentally tampered with production resources to such an extent as to cause a major outage.

  1. Triage - determine the actor causing the sabotage

    1. If more appropriate, follow the Incident Response Plan.

  2. Perform containment to ensure further actor access or action is prevented.

  3. Follow the steps in Disasters affecting AWS Availability Zone(s) or Individual Services to restore services.

  4. Take appropriate legal, disciplinary or training action.

Outages of Business Critical Services

Microsoft365 Outlook Email unavailable:

  1. Review news from Microsoft to determine timeline for restoration of services.

  2. Update the Vision Database Systems status updates page and main website to indicate outage.

  3. Customer Support team communicates with customers via FreshDesk.

  4. As a last resort, if service is unlikely to be restored for a significant period, and other email providers are unaffected, set up temporary or permanent operations on a different email provider and repoint DNS records.

FreshDesk unavailable:

  1. Review news from FreshDesk to determine timeline for restoration of services.

  2. Update the Vision Database Systems status updates page and main website to indicate outage.

  3. Customer Support team communicates with customers via email.

  4. As a last resort, if service is unlikely to be restored for a significant period, and other help desk providers are unaffected, set up temporary or permanent operations on a different help desk provider.

Website unavailable:

  1. Review news from the website provider to determine timeline for restoration of services.

  2. Set up a simple static HTML site in S3 and repoint DNS for the website to the static site.

  3. Customer Support team monitors FreshDesk for customer questions.

  4. As a last resort, if service is unlikely to be restored for a significant period, and other website hosting providers are unaffected, set up temporary or permanent operations on a different website hosting provider, or rebuild the website on the S3 static site.

Sales / Accounting / Task Tracking unavailable:

These services do not affect Vision Database Systems’ immediate ability to serve customers. If they become unavailable, staff should use spreadsheets or manual systems to track information until the system comes back online. If the outage is likely to be prolonged, Vision Database Systems should seek another service provider.

Disasters affecting the Vision Database Systems Office

Assumptions:

  • Employee home offices are unaffected by the disaster, safe to work from, and connected to the internet.

Plan of Action:

  1. Ensure employee safety. Evacuate the building or area if necessary.

  2. If safe to do so, employees at the office relocate to home offices.

  3. Verify internet connectivity at home offices.

  4. Remotely resume normal operations.

Disasters affecting the greater Jupiter, FL area

Assumptions:

  • The Vision Database Systems office is unavailable

  • Most employees’ home offices are affected by the disaster

  • Some locations within 1 hour driving time of Jupiter are unaffected

  • At least some of the affected employees can safely commute to an unaffected location

Plan of Action:

  1. Ensure employee safety. Evacuate to a safe location if necessary.

  2. Delegate immediate tasks/operations to non-Florida based employees

  3. Director of Operations locates a coworking space or other working location within 1 hour drive of Jupiter that is safe to work from, has sufficient internet connectivity, and has a safe commute for affected employees. If necessary, multiple such locations could be established.

  4. Employees work from the coworking space until their home office or the Vision Database Systems office becomes available.

Disasters affecting most of Southeast Florida

Assumptions:

  • The Vision Database Systems office is unavailable

  • Most employees’ home offices are affected by the disaster

  • The entire region within at least 1 hour driving time of Jupiter is affected by the disaster.

  • Some locations within 8 hours driving time of Jupiter are unaffected

  • At least some employees can safely commute to an unaffected location.

Plan of Action:

  1. Ensure employee safety. Evacuate to a safe location if necessary.

  2. Delegate immediate tasks/operations to non-Florida based employees

  3. Director of Operations finds a coworking space or other working location within an 8-hour drive of Jupiter that is safe to work from, has sufficient internet connectivity, and has a safe commute for at least some employees.

  4. A minimal group of employees is coordinated to work at the coworking space or other working location.

    1. Where possible, a rotational model will be established so that employees can return to their families frequently and are not burned out.

Major Disasters affecting multiple states surrounding Florida

Assumptions:

  • The Vision Database Systems office is unavailable

  • Most employees’ home offices are affected by the disaster

  • The entire region within at least 12 hours driving time of Jupiter is affected by the disaster.

  • It is therefore impossible to find a location within 12 hours driving time of Jupiter that is safe to work from, connected to the internet, and that at least some employees can commute to

Response:

  • Continuity efforts are immaterial for Vision Database Systems at this time.

  • The Director of Operations should monitor the situation until a safe location becomes available

  • The Director of Operations should maintain regular communications with employees to whatever degree possible to ensure their safety and arrange help where possible.

Distributed Denial of Service (DDoS) Attacks

AWS provides base level protection against DDoS and similar attacks. If a situation becomes more severe than the built-in AWS protection provides, contact AWS support for assistance in dealing with the situation.

Death or incapacitation of key leader

Vision Database Systems’ key leaders are the Director of Operations and the Principal Engineer.

Vision Database Systems’ key leaders should each designate another employee as an alternate. The alternate should be briefed on the responsibilities of the given role and able to perform interim responsibilities in case of death or incapacitation, or planned absence of the key leader.

After a key leader takes time off from work, an after-action review should be performed to determine what gaps exist in the knowledge possessed by the alternate to perform interim key leader responsibilities.

All employee roles and responsibilities should be documented to enable other employees or new hires to assume responsibility.

Small Scale Events that are out of scope

The following are examples of events that are not large enough in scale to warrant activation of this plan.

  • Loss of connectivity for a single employee

  • Laptop failure for a single employee.

  • Loss of availability of a production application or service necessary to Vision Database Systems’ operations that either (a) does not affect all of Vision Database Systems’ core services, or (b) is short-lived (outage lasting less than 4 hours)
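The last exclusion above combines two conditions, which can be easy to misread. One reading, sketched below as an assumption rather than an official rule, is that a production application outage is in scope only when it affects all core services and is expected to last at least 4 hours. The function and parameter names are hypothetical, and the other activation criteria in the Activation section still apply independently.

```python
def outage_in_scope(affects_all_core_services, expected_outage_hours,
                    threshold_hours=4.0):
    """Apply the out-of-scope rule for production application outages.

    Out of scope if the outage (a) does not affect all core services, or
    (b) is short-lived (less than threshold_hours). In scope otherwise.
    """
    short_lived = expected_outage_hours < threshold_hours
    out_of_scope = (not affects_all_core_services) or short_lived
    return not out_of_scope


print(outage_in_scope(True, 6))    # all core services, long outage
print(outage_in_scope(True, 2))    # short-lived
print(outage_in_scope(False, 10))  # partial impact
```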

Appendix: Asset RPO and RTO

 

| Asset | Scenario | Recovery Strategy | Recovery Time Objective (RTO) | Recovery Point Objective (RPO) |
| --- | --- | --- | --- | --- |
| AWS Data and Services | Amazon data center failure or destruction | Autoscaling, failover, or restoration of backups | < 1 hour | < 1 hour |
| Main Office | Major utility outage | Alternate work location | < 1 hour | < 1 hour |
| Employee Home Offices | Major utility outage | Alternate work location | < 12 hours | < 12 hours |
| Microsoft Outlook | Major service outage | Rely on Microsoft SLAs | | |
| WebSite Provider | Major service outage | Use Outlook until service restored | | |

Appendix: Test Plan

The Director of Operations and Principal Engineer will meet with all other relevant employees for the following:

  1. Read through the plan and address any questions.

  2. Test the employee safety confirmation process.

  3. For each of the scenarios defined in Appendix: Scenarios, craft an example of that scenario, and walk through how the plan would be implemented in that scenario. Document the estimated time taken for each action.

    1. For technical actions that can be simulated, note those actions for later simulation and continue the walkthrough.

  4. Simulate the actions noted during the walkthrough, and add the actual RPO and RTO achieved during these simulations to the walk-through notes. These actions should include (but are not limited to):

    1. Test failover of database to another availability zone and adding a new read replica to the cluster.

    2. Test scaling up the application cluster to introduce new servers in a different availability zone to replace others lost in the outage. Ensure that all availability zones in the region can be used by the cluster.

    3. Test deploying a completely new database cluster from a database backup.

    4. Test deploying a completely new application cluster.

  5. Perform an after-action review - collect all suggestions from all those included in the test for review.

  6. Document the test results and after-action review notes.

  7. Update this Plan based on the results and suggestions.
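To record the recovery times achieved during the simulations in step 4, a small timing helper can log each action and compare the elapsed time with the RTO targets from Appendix: Asset RPO and RTO. This is a sketch under assumptions: the helper name `timed_action` and the log format are hypothetical, only assets with a stated RTO are included, and measuring RPO (which needs the timestamp of the last recovered data) is not covered.

```python
import time
from contextlib import contextmanager
from datetime import timedelta

# RTO targets copied from Appendix: Asset RPO and RTO (rows with a stated value).
RTO_TARGETS = {
    "AWS Data and Services": timedelta(hours=1),
    "Main Office": timedelta(hours=1),
    "Employee Home Offices": timedelta(hours=12),
}


@contextmanager
def timed_action(name, asset, log, clock=time.monotonic):
    """Time one simulated recovery action and append the result to log."""
    start = clock()
    outcome = "ok"
    try:
        yield
    except Exception:
        outcome = "failed"
        raise
    finally:
        elapsed = timedelta(seconds=clock() - start)
        log.append({
            "action": name,
            "asset": asset,
            "elapsed": elapsed,
            "outcome": outcome,
            # The appendix states targets as "< N hours", so compare strictly.
            "rto_met": outcome == "ok" and elapsed < RTO_TARGETS[asset],
        })


test_log = []
with timed_action("Deploy new database cluster from backup",
                  "AWS Data and Services", test_log):
    pass  # placeholder for the real simulation step

print(test_log)
```

The resulting log entries can be pasted into the walk-through notes alongside the estimated times from step 3.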

Appendix: Planned Improvements

  • Improve internal system status observability to include more data points for relevant employees.