This page is primarily for the cloud.gov team. It's public so that you can learn from it. For help using cloud.gov, see the user docs.
This Contingency Plan provides guidance for our team in the case of trouble delivering our essential mission and business functions because of disruption, compromise, or failure of any component of cloud.gov. As a general guideline, we consider “disruption” to mean more than 30 minutes of unexpected downtime or significantly reduced service for customer applications.
Scenarios where that could happen include unexpected downtime of key external services, data loss, or high-severity security incidents. In the case of a security incident, the team uses the Security Incident Response Guide as well.
Short-term disruptions lasting less than 30 minutes are outside the scope of this plan. (See Service Disruption Guide instead.)
More than 3 hours of cloud.gov being offline would be unacceptable. Our objective is to recover from any significant problem (disruption, compromise, or failure) within that span of time.
Contingency plan outline
When any cloud.gov team member identifies a potential contingency-plan-level problem, they should begin following this plan. That first person becomes the Incident Commander (communications coordinator) until recovery efforts are complete or they explicitly reassign the Incident Commander role to another team member.
- Security note: If the problem is identified as part of a security incident response situation (or becomes a security incident response situation), the same Incident Commander (IC) should handle the overall situation, since these response processes must be coordinated.
The IC’s first priority is to notify the following people. They are each authorized to decide that cloud.gov needs to activate the contingency plan. To contact them, use Slack chat, a Slack call, or call and text the phone numbers in their Slack profiles.
- cloud.gov Director (System Owner)
- cloud.gov Deputy Director
- cloud.gov Program Manager
- Cloud Operations team (@cg-operators) – any two members can together authorize the activation of the contingency plan.
If the IC receives confirmation that cloud.gov is in a contingency plan situation, the plan is activated, and the IC should continue following the rest of this plan.
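The activation rule above can be read as a small decision function. The following Python sketch only illustrates that rule and is not part of any cloud.gov tooling; the role labels and the function name `plan_is_activated` are hypothetical.

```python
# Roles that can each activate the plan on their own (from the list above).
# The label strings are hypothetical; only the roles come from this plan.
SOLE_AUTHORIZERS = {"director", "deputy_director", "program_manager"}

# Hypothetical label for a member of the Cloud Operations team (@cg-operators).
OPERATOR = "cloud_operations"


def plan_is_activated(confirmations):
    """Return True if the confirmations received are enough to activate the plan.

    `confirmations` maps each confirming person's name to their role label,
    so every entry represents a distinct person.
    """
    roles = list(confirmations.values())
    # Any one of the Director, Deputy Director, or Program Manager is enough.
    if any(role in SOLE_AUTHORIZERS for role in roles):
        return True
    # Otherwise, two Cloud Operations members must confirm together.
    return sum(1 for role in roles if role == OPERATOR) >= 2
```

For example, a single confirmation from a Cloud Operations member would not be enough under this reading, while confirmations from two different members would be.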
The IC should:
- Start a team videocall for coordination.
- Start a notes doc with timestamps in Google Drive.
If this is also a security incident, follow the security incident communications process as well. Delegate assistant ICs as necessary.
The IC should notify the following people within 15 minutes of identifying the service disruption:
- cloud.gov users (through StatusPage) – can be a brief note summarizing the observable symptoms and saying that we’re working on it.
The IC must notify the following people within 1 hour of activating the plan:
- TTS and 18F leadership – give a brief description of the situation in Slack or email to the TTS Director (or TTS Assistant Commissioner for Operations) and the 18F Director.
- GSA Information Security – email a link to the StatusPage and a technical description of the situation (including whether there are any known security aspects) to email@example.com, firstname.lastname@example.org, email@example.com (even if there is no security impact).
- FedRAMP JAB representatives – email a link to the StatusPage and a technical description of the situation (including whether there are any known security aspects) to our JAB representatives (contact information at the top of this doc).
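The two notification windows above start from different moments: the 15-minute window runs from when the disruption is identified, and the 1-hour window runs from when the plan is activated. As a minimal sketch of that arithmetic (the function name and dictionary keys are hypothetical, not part of any cloud.gov tooling):

```python
from datetime import datetime, timedelta


def notification_deadlines(identified_at, activated_at):
    """Return the latest time by which each group should be notified."""
    one_hour_after_activation = activated_at + timedelta(hours=1)
    return {
        # Within 15 minutes of identifying the service disruption.
        "cloud.gov users (StatusPage)": identified_at + timedelta(minutes=15),
        # Within 1 hour of activating the plan.
        "TTS and 18F leadership": one_hour_after_activation,
        "GSA Information Security": one_hour_after_activation,
        "FedRAMP JAB representatives": one_hour_after_activation,
    }
```

With a disruption identified at 09:00 and the plan activated at 09:20, this gives 09:15 for the StatusPage note and 10:20 for the other three notifications.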
The IC should also:
- Monitor #cg-support and cloud-gov-support email for questions from customers, and respond as needed.
- Coordinate with the Director or Deputy Director to identify any key customer System Owners whom we should proactively email.
- Coordinate with the Director or Deputy Director to identify any additional people to notify, such as TTS leadership, depending on the situation.
The Cloud Operations team assesses the situation and works to recover the system. See the list of external dependencies for procedures for recovery from problems with external services.
If the IC expects the overall response to last longer than 3 hours, the IC should organize shifts so that no responder works on the response for more than 3 hours at a time. This includes the IC, who should hand off the role to a new IC after 3 hours.
- At least once an hour:
  - Post a brief update to StatusPage for cloud.gov users.
  - Send email updates to GSA Information Security and FedRAMP JAB representatives.
  - Monitor #cg-support and cloud-gov-support email.
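The 3-hour shift limit described above can be sketched as a simple schedule calculation. This is an illustrative sketch only; the function name `plan_shifts` and the idea of an estimated end time are assumptions, since the plan does not prescribe how shifts are scheduled.

```python
from datetime import datetime, timedelta

# Maximum time any responder (including the IC) works before handing off.
MAX_SHIFT = timedelta(hours=3)


def plan_shifts(start, expected_end, max_shift=MAX_SHIFT):
    """Split the period from start to expected_end into consecutive shifts.

    Each shift is a (shift_start, shift_end) pair no longer than max_shift.
    """
    shifts = []
    shift_start = start
    while shift_start < expected_end:
        shift_end = min(shift_start + max_shift, expected_end)
        shifts.append((shift_start, shift_end))
        shift_start = shift_end
    return shifts
```

A response expected to run from 09:00 to 16:00 would be split into three shifts under this sketch: 09:00 to 12:00, 12:00 to 15:00, and 15:00 to 16:00.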
The Cloud Operations team tests and validates the system as operational.
The Incident Commander declares that recovery efforts are complete and notifies all relevant people. Typically, the IC should:
- Post to StatusPage for cloud.gov users.
- Email GSA Information Security and FedRAMP JAB representatives.
- Notify TTS and 18F leadership.
- As needed, email customers who contacted support with concerns.
Next, the team schedules an internal postmortem to discuss the event. This is the same as the security incident retrospective process. Then we write a public postmortem and post it on StatusPage for users. We should also discuss with our JAB representatives whether they need additional information.
cloud.gov depends on several external services. In the event one or more of these services has a long-term disruption, the team will mitigate impact by following this plan.
If GitHub becomes unavailable, the live cloud.gov will continue to operate in its current state. The disruption would only impact the team’s ability to update cloud.gov.
Disruption lasting less than 7 days
Cloud Operations will postpone any non-critical updates to the platform until the disruption is resolved. If a critical update must be deployed, Cloud Operations will:
- Locate a copy of the current version of the required git repository by comparing last commit times of all checked-out versions on Cloud Operations local systems and any copies held by cloud.gov systems (Concourse, BOSH director, etc.).
- Pair with another member of Cloud Operations to:
  - Perform the change on the local copy of the repository (if the update requires a change to git-managed source code), or use local copies of the repository instead of remote GitHub repository references (if the update depends on remote repositories but implies no change to those repositories).
  - Manually deploy the change by provisioning a Concourse jumpbox container, copying in the repository/repositories, and executing any required steps by hand based on the Concourse pipeline.
    - For example, in the case of a stemcell update, a Cloud Operations member would obtain the latest stemcell object, extract and build the light stemcell using the latest available checkouts of the associated GitHub repositories, then manually upload the build artifact to the S3 bucket specified in the pipeline.
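The first step above, comparing last commit times across checkouts, could be done with a short script. The following is a minimal sketch, assuming `git` is installed and each path is a readable checkout; the function names are hypothetical. It only inspects the commit currently checked out (`HEAD`) in each copy, so a checkout sitting on an old branch could be ranked lower than it deserves.

```python
import subprocess


def last_commit_time(repo_path):
    """Return the Unix timestamp of the most recent commit on HEAD in repo_path."""
    result = subprocess.run(
        # %ct prints the committer date as a Unix timestamp.
        ["git", "-C", repo_path, "log", "-1", "--format=%ct"],
        capture_output=True,
        text=True,
        check=True,
    )
    return int(result.stdout.strip())


def newest_checkout(repo_paths, commit_time=last_commit_time):
    """Return (path, timestamp) for the checkout with the latest commit.

    `commit_time` can be swapped out, which makes the comparison easy to
    try without real repositories.
    """
    times = {path: commit_time(path) for path in repo_paths}
    newest = max(times, key=times.get)
    return newest, times[newest]
```

Operators would pass in the paths of every copy they can find, for example local clones plus any copies retrieved from Concourse or the BOSH director, and then confirm the winner by hand before making changes.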
When the disruption is resolved, Cloud Operations will push any changes to the appropriate repositories in GitHub to restore them to the current known-good state. Cloud Operations will monitor Concourse to ensure it redeploys any changes pushed to GitHub. Then, Cloud Operations will verify the system is in the expected state after all automated deployments complete.
Disruption lasting more than 7 days
Cloud Operations will:
- Deploy and configure GitLab Community Edition to newly-provisioned instances
- Migrate repositories from local backups to GitLab
- Update all Concourse jobs to retrieve resources from the newly-provisioned GitLab instance
After these steps are complete, updates will be deployed per usual policy using GitLab in place of GitHub.
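The repository migration step above might look like the following sketch, which only builds the `git push` commands for review rather than running them. The GitLab base URL, the function name, and the assumption that a matching empty project already exists in GitLab for each repository are all hypothetical.

```python
def gitlab_push_commands(repo_paths, gitlab_base_url):
    """Build git commands that push each local repository to GitLab.

    Assumes the project name in GitLab matches the local directory name.
    """
    base = gitlab_base_url.rstrip("/")
    commands = []
    for path in repo_paths:
        name = path.rstrip("/").split("/")[-1]
        remote_url = f"{base}/{name}.git"
        # Push every local branch, then every tag.
        commands.append(["git", "-C", path, "push", "--all", remote_url])
        commands.append(["git", "-C", path, "push", "--tags", remote_url])
    return commands
```

Each returned list can be passed to `subprocess.run(command, check=True)` once two operators have reviewed it, in keeping with the pairing practice used elsewhere in this plan.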
All alerts are automatically delivered to the Cloud Operations team via GSA email. If GSA email is unavailable, Prometheus Alertmanager shows the current alerts.
There is no direct impact to the platform if a New Relic disruption occurs. When debugging any issues where New Relic would provide insight, the team will investigate manually, retrieving the same information directly from the affected system(s).
Cloud Operations will update the opslogin UAA instance to allow temporary access via password authentication for any accounts that require access during a disruption in service.
When the disruption in service is resolved, Cloud Operations will disable password authentication for all accounts.
In case of a significant disruption in our current AWS region, after receiving approval from our Authorizing Official, Cloud Operations will deploy a new instance of the entire system to a different region using the instructions stored in the cg-provision repository.
If all AWS regions are disrupted, Cloud Operations will deploy the system to another BOSH-supported IaaS provider (such as Microsoft Azure).
Major services (Cloud Foundry, BOSH, Concourse, and UAA) rely on databases provided by RDS. To restore a database from backup, see restoring RDS.
How this document works
This plan is most effective if all core cloud.gov team members know about it, remember that it exists, have the ongoing opportunity to give input based on their expertise, and keep it up to date. Here’s how we make that happen.
- Initial review and approval:
  - cloud.gov team leads propose this draft to the team, inviting all team members to review and comment on it. The team leads approve an initial version to be published on cloud.gov as our working document. Team leads coordinate with the 18F Director of Infrastructure to make sure our plan complies with all necessary requirements.
  - We publish this plan on https://cloud.gov/docs/ (which is sourced from https://github.com/18F/cg-site) in our Operations section, alongside our Security Incident Response Guide and other key team policies. Our Onboarding Checklist includes an item for all new team members to read this plan as well as other key team policies when they join.
- How we review and update this plan and communicate those updates (to address changes to our team or systems, along with any problems we find while testing or using this plan):
  - The cloud.gov team is responsible for maintaining this document and updating it as needed. Any change to it must be peer reviewed and approved by at least one other member of the team.
  - All changes to the plan should be communicated to the rest of the team.
  - At least once a year, and after major changes to our systems, we review and update the plan.
- How we protect this plan from unauthorized modification:
  - This plan is stored in a public GitHub repository (https://github.com/18F/cg-site) with authorization to modify it limited to 18F staff by GitHub access controls. cloud.gov team policy is that 18F staff may not directly make changes to this document; they must propose a change by making a pull request and ask another team member to review and merge the pull request.