
Ongoing platform maintenance

This page is primarily for the team. It's public so that you can learn from it. For help using the platform, see the user docs.

The platform requires regular support and maintenance activities to remain in a compliant state. If you are on support and can't complete an item yourself, you are responsible for ensuring that an appropriate person does. If you haven't already reached out on Slack, raise it during standup to get visibility with the people who might best help.

Is it your first day of support?

  • Update the #cg-platform topic to include your name as the support contact.
  • Update the support schedule by moving yourself to the end of the list in the slackbot auto response for atlas support schedule.
  • Join/unmute #cg-support and #cg-platform-news.
  • Meet with the previous support person and take responsibility for any open support items they are still working on. There is a standing support sync meeting (Tuesday 4:30-5 pm ET) that you should join if you are rolling on or off support that week.

At least once during your week of support

Daily maintenance checklist

The tasks on this checklist should be performed each day.

PR as you go

If you see a way to make this checklist better, just submit a PR to the cg-site repo for content/docs/ops/

Ensure all VMs are running the current stemcell

  • Check the latest stemcell version for AWS Xen-HVM Light at

  • From the jumpbox in each of our four environments (tooling, development, staging, and production), run bosh deployments and verify that the stemcell in use for each deployment is current. For example, the 3431.13 stemcells below are outdated:

    root@PRODUCTION:/tmp/build/8e72821d$ bosh deployments | grep go_agent
    admin-ui             	clamav/9                           	bosh-aws-xen-hvm-ubuntu-trusty-go_agent/3445   	-	outdated
    cf-production        	clamav/9                           	bosh-aws-xen-hvm-ubuntu-trusty-go_agent/3431.13	-	none
                     	awslogs/27                         	bosh-aws-xen-hvm-ubuntu-trusty-go_agent/3445
    cf-production-diego  	clamav/9                           	bosh-aws-xen-hvm-ubuntu-trusty-go_agent/3431.13	-	none
    concourse            	clamav/9                           	bosh-aws-xen-hvm-ubuntu-trusty-go_agent/3445   	-	latest
    kubernetes           	cron/17                            	bosh-aws-xen-hvm-ubuntu-trusty-go_agent/3445   	-	latest
    logsearch            	cron/17                            	bosh-aws-xen-hvm-ubuntu-trusty-go_agent/3445   	-	latest
    shibboleth-production	clamav/9                           	bosh-aws-xen-hvm-ubuntu-trusty-go_agent/3445   	-	latest
  • When the stemcells are out-of-date:

    • Review the release notes at
    • Trigger the appropriate deploy-... jobs in Concourse
    • Avoid triggering more than a handful of deploy jobs simultaneously, in case issues arise during the deployment or you're interrupted. How many you can comfortably monitor is a judgment call based on your experience and confidence in the deployment; if you're not sure, 3 is a good starting point.
  • Nessus warning: Before deploying an update that will recreate the Nessus VM, such as a stemcell or VM type update, be aware that we need to allow a 10-day waiting period between Nessus VM stemcell upgrades. The Nessus manager deployment requires the System Owner to reset the license key after a stemcell upgrade, and the license key can only be reset every 10 days. Coordinate with the System Owner to ensure the key is ready to be reset before deploying such an update. You should also read the Troubleshooting Nessus runbook.
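The per-environment check above can be scripted. Here is a minimal sketch in Python that scans captured bosh deployments output for stale stemcells; the "3445" version string in the usage comment is an assumption taken from the example above, so look up the actual latest version first:

```python
import re

def outdated_deployments(bosh_output: str, latest: str) -> list[str]:
    """Return the first column of any line whose go_agent stemcell
    version differs from `latest`.

    Note: on continuation lines (where the deployment column is blank)
    the first column is a release name, so treat the result as a
    starting point for manual review, not an authoritative list.
    """
    stale = []
    for line in bosh_output.splitlines():
        match = re.search(r"(\S+)\s+.*go_agent/(\S+)", line)
        if match and match.group(2) != latest:
            stale.append(match.group(1))
    return stale

# On a jumpbox, in each environment (hypothetical usage):
#   import subprocess
#   out = subprocess.run(["bosh", "deployments"],
#                        capture_output=True, text=True).stdout
#   print(outdated_deployments(out, "3445"))
```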

Review and respond to open alerts

Review all recent alerts and notifications delivered to cg-notifications and #cg-platform-news.

Are there no alerts or notifications?

Verify the monitoring system is functioning correctly and confirm that alerts are reaching their expected destinations.
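One way to confirm Prometheus itself is alive is to query its HTTP API directly. A minimal sketch that filters the standard /api/v1/alerts response body for firing alerts follows; the host in the usage comment is hypothetical, so substitute our actual endpoint:

```python
import json

def firing_alerts(alerts_body: str) -> list[str]:
    """Return the alertname of each alert in the 'firing' state from a
    Prometheus /api/v1/alerts response body."""
    data = json.loads(alerts_body)
    return [
        alert.get("labels", {}).get("alertname", "?")
        for alert in data.get("data", {}).get("alerts", [])
        if alert.get("state") == "firing"
    ]

# Usage (hypothetical host):
#   curl -s https://prometheus.example.gov/api/v1/alerts > alerts.json
#   firing_alerts(open("alerts.json").read())
```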

Investigate open alerts

  • Use our guides for reviewing alerts (prometheus, elastalert) for alert descriptions, links to the relevant rules, and starting points for reviewing each type of alert.
  • Was the alert caused by known maintenance or testing in dev environments? Check with other members of the team if you can’t determine the source.
  • Is this a recurring alert? Search the alert history to determine how frequently it is occurring and what event may have started its firing.
  • Should the underlying condition have caused an alert? Alerts should only be raised when they’re something we need to remediate.

Is the alert a real issue?

If the alert may indicate a security issue, follow the Security Incident Response Guide; otherwise, work to remediate its cause.

Is the alert a false-positive?

If the alert can be tuned to reduce the number of false-positives with less than one day’s work, do it. If more work is required to tune the alert, add a card to capture the work that needs to be done or +1 an existing card if one already exists for tuning the alert.

Be prepared to represent support needs at the next grooming meeting to ensure that cards to fix alerts are prioritized properly.

Review AWS CloudTrail events

Get familiar with the documentation for CloudTrail logs.

Use the AWS Console to review API activity history for the EventNames listed below. Or, use the AWS CLI with the appropriate $event_name, and parse the emitted JSON:

aws cloudtrail lookup-events --lookup-attributes AttributeKey=EventName,AttributeValue=$event_name
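When reviewing several event types, a short parser over the captured JSON helps; here is a minimal sketch (Username and EventName are standard fields in the lookup-events response):

```python
import json

def summarize_events(lookup_json: str) -> list[tuple[str, str]]:
    """Return (username, event name) pairs from the JSON emitted by
    `aws cloudtrail lookup-events`."""
    data = json.loads(lookup_json)
    return [
        (event.get("Username", "?"), event.get("EventName", "?"))
        for event in data.get("Events", [])
    ]

# Usage: redirect the command above to a file, then:
#   summarize_events(open("events.json").read())
```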

These EventNames should be attributed to human individuals on the team:

  • ConsoleLogin

All human-generated events should be mapped to named users (e.g. firstname.lastname), NOT to Administrator. Discuss the event(s) with the indicated operator(s).

All events in the following EventNames should be attributed to Terraform:

  • DeleteTrail
  • UpdateTrail
  • ModifyVpcAttribute
  • PutUserPolicy
  • PutRolePolicy
  • RevokeSecurityGroupEgress
  • RevokeSecurityGroupIngress
  • AuthorizeSecurityGroupEgress
  • AuthorizeSecurityGroupIngress
  • CreatePolicy
  • CreateSecurityGroup

Terraform runs on instances that use instance profile roles, so authorized events will include:

  • a user name like i-17deadbeef1234567
  • a source IP address within AWS
  • an AWS access key starting with ASIA
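Two of the three markers above can be checked mechanically when triaging a batch of events. This is a minimal sketch of that heuristic, not a definitive authorization check; source IP attribution is left to manual review:

```python
def looks_like_terraform(user_name: str, access_key: str) -> bool:
    """Heuristic from the markers above: an instance-profile user name
    (i-...) combined with a temporary STS access key (ASIA prefix)."""
    return user_name.startswith("i-") and access_key.startswith("ASIA")
```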

If you observe any non-Terraform activity, discuss the event(s) with the indicated operator(s) (see above).

If you’re unable to ascertain that an event was authorized, follow the Security Incident Response Guide.

Review vulnerability and compliance reports

If the reports contain any HIGH items, work to remediate them.

Is an update from our IaaS provider required to remediate?

Open a case with the IaaS provider and monitor the case until it has been resolved.

Is a stemcell update required to remediate?

Ask for a date when new stemcells will be delivered in #security in the CF Slack.

Is a bosh release update required to remediate?

Update the bosh release and file a PR for the changes. Once the PR is merged, ensure the updated release is deployed to all required VMs.

Review open support requests

Review the “new” (yellow) and “open” (red) Zendesk tickets. First-tier support (customer squad) has primary responsibility to do the work of answering these, and you serve as second-tier support providing technical expertise. You’re welcome to reply to the customer with answers if you like (choose “pending” when you submit the answer)*, but your main responsibility is to provide technical diagnoses/advice/details. The easiest way to do that is to write comments on the associated posts in #cg-supportstream. First-tier support may also ask you for pairing time to work out responses together.

* People with emails can’t receive email via Zendesk, so we have to email them via the cloud-gov-support Google Group instead.

See also: Detailed guidance on working with our support tools.

Review potential improvements in CloudCheckr

Review the Best Practices report in CloudCheckr and try to fix something near the top.