Diego for 18F operators
Diego is the new application runtime system for Cloud Foundry. Initially a rewrite of the legacy runtime (DEA) in Go, it has evolved into a container management solution that is separate from, but in practice integrated with, Cloud Foundry. The name “Diego” was originally “DEAgo”, a portmanteau of the legacy runtime, DEA, and the new implementation language, Go (golang).
For those familiar with the legacy runtime, understanding the differences between it and Diego may be helpful.
Rotate the data store encryption key often. See https://github.com/cloudfoundry/diego-release/blob/master/docs/data-store-encryption.md
Diego uses a network in each AZ, separate from the CF networks, and an ELB to service SSH traffic. The networks, ELB, and associated security group are created and maintained with Terraform as part of the terraform-provision pipeline.
Manifest Generation and Deployment
We make use of the standard, recommended manifest generation technique (cf-release/scripts/generate-deployment-manifest) and afterwards use spiff to add on 18F-specific releases and properties. This allows us to avoid closely tracking the diego-release repository’s stubs and generation script. We trust the diego-release script to generate a proper and sane manifest for any given release, taking into account any job additions or changes.
Deployment is via the deploy-diego pipeline in the cg-deploy-diego repository. Eventually, 18F should consider combining the
For the initial Diego deployment, we sized the cells to match existing DEAs in size and number. Future sizing and scaling should take into account a number of factors, including current and anticipated workloads, risk tolerance, and recovery performance when a cell or AZ goes offline and its containers come up on surviving cells. See scaling notes at https://docs.cloudfoundry.org/concepts/high-availability.html#processes
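The AZ-failure consideration above can be explored with a rough capacity check. The following is a minimal sketch, not an 18F sizing tool: the CellPool fields, the function names, and the 20% headroom default are all hypothetical, and it assumes cells are spread evenly across AZs and that memory is the limiting resource.

```python
from dataclasses import dataclass


@dataclass
class CellPool:
    """Hypothetical description of a Diego cell pool (illustrative fields only)."""
    cells_per_az: int
    az_count: int
    memory_per_cell_mb: int


def surviving_capacity_mb(pool: CellPool, failed_azs: int = 1) -> int:
    """Total cell memory remaining after `failed_azs` availability zones go offline."""
    surviving_azs = max(pool.az_count - failed_azs, 0)
    return surviving_azs * pool.cells_per_az * pool.memory_per_cell_mb


def can_absorb_az_loss(pool: CellPool, workload_mb: int, headroom: float = 0.2) -> bool:
    """True if the surviving cells can hold the workload while keeping `headroom` free.

    `headroom` is an assumed safety margin (fraction of capacity left unused).
    """
    usable_mb = surviving_capacity_mb(pool) * (1 - headroom)
    return workload_mb <= usable_mb
```

For example, calling can_absorb_az_loss(CellPool(cells_per_az=4, az_count=2, memory_per_cell_mb=32768), workload_mb=100000) asks whether the four cells in the surviving AZ could host roughly 100 GB of containers with a fifth of their memory held in reserve.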
Disk Quota Over-Enforcement during Container Setup
When copying a droplet, a buildpack, or other assets into a container, the Garden-Linux backend may end up over-reporting the amount of disk used in that container. If this disk usage exceeds the quota allocated to the container, the copying-in operation will fail, and the container will crash. If you see crash events for your CF app with the exit description, “Copying into the container failed”, this quota issue is likely the cause.
This erroneous reporting appears to be an interaction between how the backing filesystem that garden-linux uses for container images accounts for disk usage and how payloads are streamed into the container. View the Cloud Foundry documentation for more detail about this issue, including possible workarounds.
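When triaging crashes, it can help to separate likely quota-related failures from the rest. The sketch below assumes crash events have already been fetched and parsed into dictionaries carrying an exit_description key; that shape, and the function name, are assumptions for illustration rather than a documented API.

```python
# Exit description quoted in the section above.
QUOTA_EXIT_DESCRIPTION = "Copying into the container failed"


def likely_quota_crashes(events):
    """Return the crash events whose exit description matches the quota symptom.

    `events` is assumed to be an iterable of dicts with an optional
    "exit_description" string (hypothetical shape).
    """
    return [
        event
        for event in events
        if QUOTA_EXIT_DESCRIPTION in (event.get("exit_description") or "")
    ]
```

A match only suggests the quota issue as the likely cause; confirm against the container's disk quota before applying a workaround.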
Diego is a very high-performance runtime solution, with negligible overhead when scaling into the tens of thousands of containers. For more information, see this CF Summit talk.
Diego Acceptance Tests are performed by the CF Acceptance Tests (CATs) and are run by the 18F deploy-cf pipeline.
BBS Data Store
18F uses an RDS (PostgreSQL) data store for its BBS component rather than etcd. This means the etcd properties in the Diego manifest are null and there is no need for consensus (i.e., no need for an odd number of BBS instances).
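For context on why the odd-number rule exists at all: majority-based consensus stores such as etcd need more than half of their members available to make progress. The sketch below shows that arithmetic; the function names are illustrative, and none of this applies to the RDS-backed BBS described above.

```python
def quorum(members: int) -> int:
    """Smallest majority of a cluster with `members` nodes."""
    return members // 2 + 1


def tolerated_failures(members: int) -> int:
    """How many members can be lost while a majority remains."""
    return members - quorum(members)


if __name__ == "__main__":
    for n in range(1, 6):
        print(n, quorum(n), tolerated_failures(n))
```

Going from three members to four raises the quorum from two to three without tolerating any additional failure, which is why consensus clusters are usually sized with odd member counts.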
Diego has (beta) support for persistent volumes, which may be a good solution for some workloads. Volume support was not enabled or implemented as part of the initial Diego rollout at 18F. See this post for more detail.
Each Diego instance runs standard 18F BOSH releases in addition to the standard Diego components. Examples include BOSH releases for ClamAV, Tripwire, Nessus agent, and Snort.
Diego platform logs are sent to logstash using the syslog_daemon_config sink provided by the platform.
Metrics are ultimately sent to Grafana from the platform via a firehose nozzle and from each Diego VM via collocated collectd BOSH releases.
Legacy Runtime Removal
Once the Diego backend is determined to be stable and sufficient for 18F needs, remove the legacy backend (DEA) components and integrate