FinOps Retrospective Tagging for a Central Government Department
A Central Government client recently completed a large-scale, high-cadence migration to AWS. Because of a tight deadline, they made heavy use of ‘Rehosting’ and ‘Replatforming’ rather than ‘Rearchitecting’. Although the migration itself was a resounding success, the limited scope for cloud-focused application rearchitecting and AWS estate management meant that ongoing maintenance of the estate became the next concern. Cloud costs were rising and were estimated at between £1.2 million and £1.4 million a month in cloud hosting.
An organisation of this size, with multiple teams and projects in flight, needed visibility and control of its spend.
A key aspect of the ongoing management that Version 1 addressed was the level of tag compliance across the estate. The client had no defined tagging policy before migration; one was introduced only during the process, with no mechanism for applying or enforcing tags, or for exploiting them for cost categorisation. Because internal efforts to enforce and use tags for FinOps purposes were retrospective, compliance sat at around 17% across the organisation prior to Version 1’s engagement.
Version 1 were asked to improve tag compliance across the estate without disrupting existing production services, and to apply tags consistently across millions of resources and more than 10,000 code repositories. Version 1 produced a client-wide, unified retrospective tagging approach that met the needs of the entire engineering community, maximising the improvement in tag compliance with minimal operational disruption.
After rapidly prototyping and iterating a solution, alongside a limited manual rollout in controlled environments, Version 1 produced a powerful retrospective tag compliance solution. It empowered and motivated developers to fix tag compliance for their products in a universally consistent way, with no impact on production environments, and it can be rolled out at massive scale with very little operational overhead.
The main goal was to ensure that the custom policies in the solution produced compliance reporting that could be relied upon by engineers to accurately reflect their tag compliance levels. This also served as a guide to resources and automation solutions for addressing the tag compliance issues.
The core of the solution navigated all these issues, providing the following features:
- Identifies every Terraform workspace and module correctly, adjusting the analysis based on the source and assessing each one individually
- Dynamically loads compliant tag values from external systems
- Correctly identifies non-taggable resources from every version of the AWS Provider
- Addresses all edge case resource tagging scenarios encountered
- Runs global checks to identify Terraform and AWS Provider versions and to confirm that legacy tagging methodologies are not used
- Provides a bespoke reporting function that summarises compliance levels, the most common policy failures and the files with the most failures
- Offers human-friendly debugging output that explains the reasons for policy failures and references detailed documentation with specific remediation instructions
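To make the policy checking and reporting features more concrete, the following is a minimal Python sketch of a tag compliance check with a summarised report. It is not the client's implementation: the tag keys, the allowed values, the `NON_TAGGABLE` set and the resource dictionaries (with `type`, `file` and `tags` fields) are all hypothetical, and a real solution would derive them from parsed Terraform code, the AWS Provider version and external systems.

```python
from collections import Counter

# Hypothetical policy: required tag keys and, where restricted, allowed values.
REQUIRED_TAGS = {
    "business-unit": {"finance", "operations"},
    "environment": {"dev", "test", "prod"},
    "owner": None,  # any non-empty value is accepted
}

# Hypothetical set of resource types treated as non-taggable.
NON_TAGGABLE = {"aws_iam_role_policy_attachment", "aws_route"}


def check_resource(resource):
    """Return a list of human-readable failure reasons for one resource."""
    if resource["type"] in NON_TAGGABLE:
        return []
    tags = resource.get("tags") or {}
    failures = []
    for key, allowed in REQUIRED_TAGS.items():
        value = tags.get(key)
        if not value:
            failures.append(f"missing tag '{key}'")
        elif allowed is not None and value not in allowed:
            failures.append(f"tag '{key}' has non-compliant value '{value}'")
    return failures


def summarise(resources):
    """Summarise compliance level, most common failures and worst files."""
    taggable = [r for r in resources if r["type"] not in NON_TAGGABLE]
    failure_counts = Counter()
    file_counts = Counter()
    compliant = 0
    for resource in taggable:
        failures = check_resource(resource)
        if failures:
            failure_counts.update(failures)
            file_counts[resource["file"]] += len(failures)
        else:
            compliant += 1
    percent = 100.0 * compliant / len(taggable) if taggable else 100.0
    return {
        "compliance_percent": round(percent, 1),
        "common_failures": failure_counts.most_common(3),
        "worst_files": file_counts.most_common(3),
    }
```

Non-taggable resources are excluded from the denominator in this sketch, so they neither raise nor lower the reported compliance level.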
With the core of the solution established, a mechanism to deploy and run it at scale was required. This was developed in parallel with the core solution itself.
Small-scale testing allowed for non-disruptive validation, testing, and refinement of the tag reporting.
Once confidence was built that the solution covered all appropriate edge cases with accurate tag compliance reporting, the next stage was to roll out the solution at scale. This was done by integrating it into the client’s GitLab compliance framework, a set of mandatory CI/CD jobs that are enforced to run against all repositories matching certain conditions. In this instance, the conditions matched repositories that deployed AWS resources using Terraform and GitLab CI/CD, which amounted to around 10,000 repositories.
Initial deployment of the solution in the compliance framework was completed with the job allowed to fail. This gave developers a grace period to reach compliance, and allowed feedback to be gathered from the community, before ‘hard’ compliance was enabled and the job was required to succeed.
This was enabled in batches across the project, allowing for the best of both worlds: engineers have pre-warning of compliance requirements and are empowered to make fixes, while the FinOps team has a mechanism to lock down enforcement of the tag policy once compliance is in place.
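The staged move from a job that is allowed to fail to ‘hard’ compliance can be illustrated with a small Python sketch. The batch names, the `JobOutcome` type and the `evaluate` function are hypothetical and exist only for illustration. In GitLab CI this behaviour is commonly expressed with the `allow_failure` keyword, but whether the client's framework used it in exactly this way is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class JobOutcome:
    job_passed: bool       # did the compliance job itself succeed?
    blocks_pipeline: bool  # does a failure stop the pipeline?


def evaluate(failure_count, batch, enforced_batches):
    """Model soft versus hard enforcement for one repository.

    Soft mode (batch not yet enforced): the job may fail, which warns
    engineers, but the pipeline carries on.
    Hard mode (batch enforced): a failing job blocks the pipeline.
    """
    job_passed = failure_count == 0
    hard_mode = batch in enforced_batches
    return JobOutcome(
        job_passed=job_passed,
        blocks_pipeline=hard_mode and not job_passed,
    )
```

For example, with only a hypothetical `batch-1` enforced, a repository in `batch-2` with three policy failures sees a failed job but an unblocked pipeline, while the same failures in `batch-1` stop the pipeline.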
Real Differences, Delivered
With the solution fully up and running, left-shifted tag compliance enforcement for AWS/Terraform resources is now in place, and our customer is reaping the benefits.
With growing tag compliance, the FinOps team have far more complete data covering the AWS cloud estate, which allows for true insight with their analysis tools. They can categorise and compare cloud cost between business units, teams, applications, environments and more, identifying outliers and high spenders explicitly. They can visualise and present business-specific metrics that cloud service providers natively lack context for, giving the organisation a clear picture of how its cloud bill is distributed, and increasing accountability and cost awareness.
This adds a layer of visibility and control for our customer that didn’t exist previously.
The FinOps team can also more easily identify and terminate rogue resources outside of CI/CD or IaC that remain noncompliant, with teams confident that all the resources that matter to them are compliant with the tagging policy. Compliant resources can also be easily traced back to the specific repo, file, and lines of code that they were deployed from. This increases not just team accountability, but also allows SRE to rapidly identify failed resources, and lets engineers easily identify what has been deployed from their code.
As insight and awareness increase throughout the cloud estate, FinOps can pressure high-spending teams and business units with data-driven findings, pushing rightsizing and efficiency improvements. This also significantly decreases cloud costs alongside improving visibility (what do we have?), accountability (who owns it?) and traceability (where did it come from?).
A clearly defined and comprehensively enforced tagging policy is important to apply early for any growing cloud estate, but when one has already grown unmanageable without it, Version 1 provides retrospective solutions that can rapidly bring insight from the unknown and order from chaos.