Delivering an Improved Platform for a UK Government Department
The Customer's Challenge
The customer had a tenant go-live deadline of 31st March 2022, by which date the platform had to be production ready and the tenant onboarded. To meet it, the current platform had to be improved, privileged access management had to be implemented, monitoring had to be in place, and a secure gateway traffic flow over SMTP, HTTPS and FTPS was needed so that the customer could access the platform.
The Quality Assurance Testing (QAT) and health checks also had to pass by 31st March 2022.
This work would then enable future tenants to onboard smoothly into a production-ready environment on the wider platform.
Why Version 1?
Version 1 was involved in the first phase of this project and had experience and background knowledge of the design and implementation of the current platform. This meant that the work could start straight away, without additional training and with minimal knowledge transfer.
Version 1’s awareness of the processes and security standards that had to be adhered to was a driving factor in allowing this key milestone to be worked on.
The first phase involved getting the platform production ready, and this second phase involved getting the platform ready for tenants to onboard and use.
As this public sector project is funded by taxpayers’ money, bringing back Version 1 for the second phase was a feasible and cost-effective way to deliver this work.
Solutions Proposed & Delivered by Version 1
The following solution was proposed, and its objectives were refined while working closely with the tenant to ensure that the requirements were met:
- Privileged access management implementation using IAM, HashiCorp Vault and Active Directory, including Group Policy Objects to lock down specific instances
- Operational monitoring and logging using CloudWatch, Splunk and Nessus
- Secure gateway traffic flow via SMTP, HTTPS and FTPS
- HashiCorp Vault role creation to enable permissions for read/write access and policy creation to generate/issue certificates
- Networking and routing in place
- Assisting the testers and providing them with a test environment as part of the QAT and health checks
- Ensuring that the QAT/health checks pass and meet the security standards and that any bugs that are raised by QAT are addressed and fixed
- Jenkins updated with Splunk and Nessus for monitoring and vulnerability scanning, to help prevent malicious attacks
- Tenant infrastructure kept up to date
- Any tools and platform requirements from the customer deployed and ready in the tenant environment, e.g. Jenkins Controller and Agents and Windows Bastions
- Creation of documentation and runbook guides covering both internal DevOps team processes and external customer processes, shared via workshops with the wider platform team and the customer where required
- Ongoing support to ensure that the platform has high availability, is cost effective and adheres to the AWS Well-Architected Framework
The tenant was onboarded to a production-ready platform by the go-live deadline of 31st March 2022. This was achieved through the knowledge and expertise within the Version 1 team, which was still fresh from the first phase of the project.
This existing background knowledge of the project helped the team to make improvements and to shape and plan the work.
Onboarding guides were documented for both the internal processes within the team and the external processes for the tenant, as required.
This gives other tenants confidence that they can onboard onto the platform and be production ready, because the tried and tested solution is built with Infrastructure as Code using Terraform, Terragrunt and Python.
Reusable Terraform modules have been created, which ensure that future tenants can be onboarded via Terraform and Terragrunt.
Since the first tenant onboarded onto the platform and went live on 31st March 2022, three other tenants have onboarded successfully.
A lot of work on Jenkins was blocked, which Version 1 was able to help with straight away. Permanent fixes were applied in the Userdata to ensure that the current Jenkins instances would not be disrupted. The Jenkins version had to be downgraded to keep it compatible with the Jenkins plugin dependencies, and this fix was applied to all of the tenant Jenkins Controllers to prevent the issue from causing an outage.
Following a high priority incident, root cause analysis was conducted and meeting minutes were captured, so that a permanent fix could be applied and the chances of the same incident recurring reduced.
Workshops were held regularly to share knowledge in areas where a DevOps Engineer was a Subject Matter Expert. An internal Show and Tell session was also introduced to share knowledge within the team, as the great work that other DevOps Engineers are doing is not always visible.
Hard-coded variables were replaced with one of the following:
- A data lookup in Terraform to populate the value
- A new variable, passed through via Terragrunt, where the value is unique to each tenant
- A dependency on an output from another module, defined in the Terragrunt code
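Before hard-coded values can be replaced in one of these ways, they first have to be found. The following is a minimal sketch, not the tooling used on the project, of how Terraform source could be scanned for literals that tend to differ per tenant. The patterns, the function name `find_hard_coded_values` and the example module are all illustrative assumptions.

```python
import re

# Illustrative patterns only (assumptions, not taken from the project):
# quoted literals that typically differ per tenant or per AWS account.
HARD_CODED_PATTERNS = {
    "aws_account_id": re.compile(r'"\d{12}"'),
    "resource_id": re.compile(r'"(?:ami|vpc|subnet|sg)-[0-9a-f]{8,17}"'),
    "arn": re.compile(r'"arn:aws:[^"]+"'),
}


def find_hard_coded_values(terraform_source):
    """Return (line_number, kind, literal) for each suspicious quoted literal."""
    findings = []
    for line_number, line in enumerate(terraform_source.splitlines(), start=1):
        # Simplification: treat everything after '#' as a comment and skip it.
        code = line.split("#", 1)[0]
        for kind, pattern in HARD_CODED_PATTERNS.items():
            for match in pattern.finditer(code):
                findings.append((line_number, kind, match.group(0).strip('"')))
    return findings


# Hypothetical module snippet: the AMI is hard coded, the subnet is not.
EXAMPLE_MODULE = '''
resource "aws_instance" "bastion" {
  ami       = "ami-0123456789abcdef0"
  subnet_id = var.subnet_id
}
'''

if __name__ == "__main__":
    for finding in find_hard_coded_values(EXAMPLE_MODULE):
        print(finding)
```

Each finding is then a candidate for one of the three replacements listed above. In this example the AMI ID would most likely become a data lookup, while a value such as a tenant account ID would become a variable supplied through Terragrunt.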
One challenge was that tech debt was not being captured in Jira. Version 1 created a Tech Debt epic in Jira and linked the tech debt tickets to the feature tickets touching the same codebase, so that both could be worked on at the same time and the tech debt reduced.
Definition of done and definition of ready documents were created for the wider internal DevOps team. These ensure that blockers are identified and resolved during refinement sessions, so that when a ticket is picked up the DevOps Engineer’s time is spent on the feature itself.
Each feature ticket now lists the access required to complete the work (e.g. IAM groups) and defines its acceptance criteria. The QAT team has also been involved in testing that features meet the definition of done. As a result, feature tickets do not need to be reopened after being completed.
About Version 1
Version 1 proves that IT can make a real difference to our customers’ businesses. We are trusted by global brands to deliver IT services and solutions which drive customer success. Our team of over 2500 dedicated difference-makers works tirelessly to provide independent advice and deliver impactful changes to help our customers navigate the rapidly changing Digital-First world we live in. Our greatest strength is balance in our efforts to achieve Customer Success, Empowered People, and a Strong Organisation, underpinned by a commitment to our values. We believe this is what makes Version 1 different and more importantly, our customers agree.