v12 Is Here
The end of the quarter brings a lot of new updates to cloud.gov!
As part of our commitment to communicating better, we feel it’s time to start highlighting some of what we do. cloud.gov runs entirely on open-source software, primarily from the Cloud Foundry open-source ecosystem.
The Cloud Foundry ecosystem publishes what is known as
cf-deployment, a canonical reference for how to deploy the Cloud Foundry platform. cloud.gov then applies government-specific security policies and procedures and deploys it, continuously. In the last quarter, we’ve deployed the platform about 150 times. Most of these fixes were patches, security updates, and configuration changes.
This month, there was a major release of the cf-deployment: v12.0.0. We’ve integrated this release into our deployment, applied our normal changes to it, tested it, and deployed it. As of this writing, v12.0.0 is available to customers. While many changes we highlight are aimed at improving availability, there were also 50 CVEs (Common Vulnerabilities and Exposures) patched, so this update is worth discussing.
You’ll find a human-friendly description of each major change and information on how they are relevant.
There were some discreet features designed for operators, but there were two high-scored security patches fixed. Customers are no longer vulnerable to these CVEs when authenticating with the platform.
This release bumps the Java buildpack from v4.20 to v4.21.
- Metric Writer Augments with Cloud Foundry-specific dimensions (via many people)
- Update Introscope Agent Version (via Dhruv Mevada)
- Shade Auto-reconfiguration Jackson Dependency (via @Agraham21 and @pborbas)
This section highlights a component called Diego, which is the underpinning container-orchestration technology of cloud.gov. Although most of these fixes are not customer-facing, they are important to the health and stability of the cloud.gov platform. For the interested, feel free to do a deep dive into the container orchestrator.
On startup, with TCP routing enabled, the route-emitter, which notifies the routing API about application domains, attempts to talk to UAA, the identity component, to grab a key for the configured client. This communication must happen for the route-emitter server to become healthy and be able to talk to the routing API.
There is a scenario where the route-emitter fails to start up because the platform’s internal DNS server is temporarily unavailable and a secondary DNS server takes an extended amount of time to fulfill a DNS probe. The route-emitter attempts to talk to UAA, is unable to resolve the address (and this takes a long time), fails to start, failing the container host VM’s deployment.
This version deploys a new release of the internal UAA client, which allows for configurable DNS timeouts, preventing this problem from occurring anymore.
Compute Host Capacity Reporting
There are multiple factors which could contribute to apps failing due to out-of-disk errors when they have been placed on a compute host successfully (and therefore should theoretically have enough reserved disk space). We can take a step toward mitigating the incidence rate of these failures by assuring Garden, the container runtime, subtracts the currently included “Reserved Space For Other Jobs in Mb” value from its available disk calculation so as not to give the scheduling representative the impression it has more disk space to use than it actually does in practice.
In v12, the container runtime added a new field to the metric which separates local disk space from schedulable disk space for the container orchestrator.
Local Route Emitters
The Cloud Foundry ecosystem has an internal goal of TLSEverywhere. This release adds mutual TLS communication between the route-emitter and the routing API to prevent Man-In-The-Middle attacks. This increases the cloud.gov’s security stance and allows us to regenerate certificates on-the-fly, should there be a need.
Removing Unused Volume Support
The cloud.gov team deprecated our experimental NFS support for the platform due to security concerns and the pending removal of functionality from Cloud Foundry. This release removes the last remaining available NFS functionality.
App Logging and Metrics
From previous investigations, it appears that when CPU usage is high enough on the compute host VM and measurements are frequent enough, the time skew between the scheduling representative’s measurement time and the container runtime’s measurement time can be significant enough to cause the scheduling representative to compute CPU metrics that are impossibly high (more than N * 100%, where N is the number of CPU cores on the VM). This was tested in a lab by the Cloud Foundry team, and resulted in a 4-core VM with 1000% CPU usage, where is shouldn’t be higher than 800% with hyper-threading.
Now that later versions of the container runtime report a server-side age in the container-metric payload for each Garden container, the scheduling representative can use this more accurate measurement time to compute its CPU-usage difference quotient.
Component Logging and Metrics
There were two core improvements, mostly around metrics, on the operations side, which are fine grained metrics we can use to troubleshoot the system, if needed.
- Fixed a bug where a multiple instance task with a value of 0 is scheduled.
- Exported scheduling metrics for easier dashboard visualization.
cflinuxfs3 is the base container filesystem all apps on cloud.gov run on top of. The new release of this root filesystem includes a combined 48 CVEs!
- CVEs Patched
We are committed to improving the user experience of government. If you have questions, please don’t hesitate to reach out at firstname.lastname@example.org. We recommend that you subscribe to service updates at the cloud.gov StatusPage.
The humans of cloud.gov