Trailblazing to the Cloud with Opsview and Amazon Aurora - Version 1

Introduction

In this article, discover how Version 1 and Opsview reduced key inefficiencies through a complete redesign of Opsview Monitor’s architecture in line with modern principles.

Version 1 supports over 1300 databases across 6 dedicated teams. Opsview Monitor performs approx. 35,000 different service checks across their customers’ infrastructure, databases and applications. With 134 collectors monitoring over 2900 devices, and over 11 million polls per day reporting back to their master database, the Version 1 MSP service has enhanced visibility into every one of its customers’ environments.
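To put these volumes in perspective, the headline figures can be converted into average rates. The sketch below uses the counts quoted in this article (daily checks, collectors, hosts) and assumes, purely for illustration, that checks are spread evenly across the day and across collectors and hosts; real load is unlikely to be that uniform.

```python
# Figures quoted in this article.
DAILY_CHECKS = 11_808_720
COLLECTORS = 134
HOSTS = 2981

SECONDS_PER_DAY = 24 * 60 * 60


def average_rates(daily_checks: int, collectors: int, hosts: int) -> dict:
    """Return average rates, assuming an even spread (an illustrative simplification)."""
    return {
        "checks_per_second": daily_checks / SECONDS_PER_DAY,
        "checks_per_collector_per_day": daily_checks / collectors,
        "checks_per_host_per_day": daily_checks / hosts,
    }


if __name__ == "__main__":
    for name, value in average_rates(DAILY_CHECKS, COLLECTORS, HOSTS).items():
        print(f"{name}: {value:,.1f}")
```

On these numbers, the master database receives roughly 137 check results per second on average.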

Version 1 supports long-standing AWS customers, the longest-standing being an AWS database customer of over eight years. Version 1’s Opsview Monitor masters have run MySQL on Amazon RDS since 2017 and were migrated to Aurora MySQL on RDS in 2019.

Monitoring is a core operational discipline for managed service providers. Opsview Monitor allows MSPs to flexibly monitor their platforms quickly and easily, whilst securely providing visibility into diverse customer IT infrastructure and applications.

Challenge

Version 1’s traditional implementation of Opsview Monitor ran on ageing infrastructure that required heavy maintenance and delivered sub-optimal performance. The ageing hardware was causing issues with storage capacity, data transfer speeds, manual backups, failover, and disaster recovery.

There was also only a single ISP connecting both datacentres, creating a single point of failure for BCP and DR. Furthermore, no native reporting was integrated within this implementation, restricting reporting capabilities. To futureproof the infrastructure, Version 1 had to plan 3-5 years in advance and over-spec relative to current performance requirements.

Opsview – Core Monitoring System Stats

Daily Checks: 11,808,720

Customers: 80

Collectors: 134

Hosts: 2981

Because Opsview is a core monitoring system, Version 1 aimed to reduce these inefficiencies through a complete redesign of the architecture in line with modern principles. AWS public cloud was identified as the platform on which to digitally transform Opsview Monitor, and the migration from a private datacentre to AWS was carried out using infrastructure as code. To reduce the blast radius between production, test, and development, Version 1 created separate VPCs, isolating dev and test from any production workloads.

Terraform was used to build the Virtual Data Centres (VDC), network, security, and basic instances. Red Hat Ansible was subsequently used to configure these instances, applications, and servers. Ansible was also used to control the application itself, granting permissions for users and ensuring consistent configuration. The infrastructure-as-code is source-controlled through Git, providing review gates to help ensure that changes to the application are secure and error-free. Infrastructure-as-code also helps reduce security risks by making deployments consistent and removing the risk of error-prone manual deployment.
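The two-stage flow described above (Terraform provisions, then Ansible configures) can be sketched as a small wrapper script. This is a minimal illustration rather than Version 1's actual tooling: the file names (`<environment>.tfvars`, `inventories/<environment>`, `site.yml`) are hypothetical, and the sketch defaults to a dry run that only prints the commands.

```python
import subprocess
from typing import List


def build_commands(environment: str) -> List[List[str]]:
    """Return the provisioning commands in order: Terraform first, then Ansible.

    The file names used here are hypothetical placeholders.
    """
    return [
        ["terraform", "init", "-input=false"],
        ["terraform", "apply", "-input=false", "-auto-approve",
         f"-var-file={environment}.tfvars"],
        ["ansible-playbook", "-i", f"inventories/{environment}", "site.yml"],
    ]


def run_pipeline(environment: str, dry_run: bool = True) -> None:
    """Print each command and, unless dry_run is set, execute it."""
    for command in build_commands(environment):
        print("+", " ".join(command))
        if not dry_run:
            # check=True stops the pipeline at the first failing step.
            subprocess.run(command, check=True)


if __name__ == "__main__":
    run_pipeline("test")  # dry run: prints the commands without executing them
```

Keeping one variable file and one inventory per environment is one way to let the same code build prod, test, and dev while keeping them separate.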

Roles were split from a single monolithic server into smaller role-based servers. The pay-as-you-go model allowed Version 1 to spec the system for current requirements, with the ability to scale each of the individual services up (increasing single node capacity) or out (spreading load across multiple nodes) according to need, with no requirement to over-spec and overspend ahead of time.

These services can also auto-scale up and down as requirements dictate without adversely affecting performance, leading to greater control and stability of the application.

The old infrastructure’s failover process was also causing issues due to the complexity of the Opsview system, affecting business continuity. The disaster recovery process was unreliable and time-consuming. Running on AWS provides enough capacity to add new resources or fail over when needed, which leads to essentially no downtime for the new infrastructure. The new architecture also allows Version 1 to use Amazon Machine Images (AMIs) to take full images of entire Opsview instances, automating full system backups.
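As an illustration of how AMI-based backups might be automated, the sketch below requests an image for each instance it is given. It is not Version 1's implementation: the function name, the `opsview-backup` prefix, and the instance IDs are hypothetical. The client is passed in so the logic can be exercised without AWS access; in practice it would be a boto3 EC2 client, whose `create_image` call accepts the parameters used here.

```python
from datetime import datetime, timezone
from typing import Dict, Iterable


def image_instances(ec2_client, instance_ids: Iterable[str],
                    prefix: str = "opsview-backup") -> Dict[str, str]:
    """Request an AMI for each instance and return {instance_id: image_id}.

    `ec2_client` is expected to behave like boto3.client("ec2").
    """
    # AMI names may not contain colons, so use a compact UTC timestamp.
    stamp = datetime.now(timezone.utc).strftime("%Y%m%d-%H%M%S")
    images = {}
    for instance_id in instance_ids:
        response = ec2_client.create_image(
            InstanceId=instance_id,
            Name=f"{prefix}-{instance_id}-{stamp}",
            Description=f"Scheduled full image of {instance_id}",
            # Image without rebooting the instance; file system
            # consistency of the image is then not guaranteed.
            NoReboot=True,
        )
        images[instance_id] = response["ImageId"]
    return images
```

With boto3 installed and credentials configured, a call might look like `image_instances(boto3.client("ec2"), ["i-0123456789abcdef0"])`, run on a schedule.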

Overall, the new platform encompassing all three systems (prod, dev, test) can be built from scratch in approximately 2 hours for a greenfield site, and in under a day for complete disaster recovery. The traditional approach would have taken up to a week for a greenfield implementation, with a further few days for complete disaster recovery.

On the old infrastructure, database servers had to be manually managed to ensure each element of the system was appropriately configured and resourced. With Amazon RDS, none of the underlying components of the database have to be managed. This creates efficiencies and allows Version 1 to deal only with the database schema directly. Multi-AZ deployments provide enhanced availability and durability across multiple availability zones.

To prove that Opsview Monitor 5.4.1 can operate on Aurora MySQL 5.7, members of the Version 1 Infrastructure team and the Opsview engineering team conducted a battery of tests covering supportability, robustness, scalability, stability and ‘key-person’ dependencies (i.e., whether the product requires special training, dedicated personnel, and/or succession planning in the event that product specialist(s) leave the company).

To create an optimal testing environment, Version 1 provided a near-exact replica of the production Managed Services environment using a production dataset. Opsview provided an Opsview Monitor test platform, which can connect to a local or remote MySQL-compatible database. The Opsview Monitor master server does the monitoring (polling, receiving traps, NetFlow, querying via WMI, etc.) and returns the data, which is stored in the MySQL database.

Opsview Monitor’s web UI, custom dashboards and Business Service Monitoring visualizations, reports, notifications service, ticketing and service-desk integrations and other user-facing tools then analyse that data and deliver it to users. Over the course of a week, regression testing, non-functional testing, and other tests were completed successfully.
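The article does not describe the individual tests, but one common form of regression check when changing database back-ends is to run the same queries against both and compare the results. The sketch below assumes any two Python DB-API connections; the in-memory SQLite databases and the `service_checks` table are hypothetical stand-ins for a MySQL reference and an Aurora candidate.

```python
import sqlite3
from typing import Iterable, List, Tuple


def fetch_rows(connection, query: str) -> List[Tuple]:
    """Run a query on a DB-API connection and return its rows as tuples."""
    cursor = connection.cursor()
    try:
        cursor.execute(query)
        return [tuple(row) for row in cursor.fetchall()]
    finally:
        cursor.close()


def mismatched_queries(reference, candidate, queries: Iterable[str]) -> List[str]:
    """Return the queries whose results differ between the two back-ends.

    Queries should include ORDER BY so that row order is deterministic.
    """
    return [q for q in queries
            if fetch_rows(reference, q) != fetch_rows(candidate, q)]


def make_demo_db(rows):
    """Build a throwaway in-memory database (stand-in for a real back-end)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE service_checks (host TEXT, state INTEGER)")
    conn.executemany("INSERT INTO service_checks VALUES (?, ?)", rows)
    return conn


if __name__ == "__main__":
    old = make_demo_db([("web01", 0), ("db01", 2)])
    new = make_demo_db([("web01", 0), ("db01", 2)])
    checks = ["SELECT host, state FROM service_checks ORDER BY host"]
    print("mismatches:", mismatched_queries(old, new, checks))
```

An empty list means both back-ends returned identical rows for every query in the set.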

The result is that Opsview has been able to certify use of Amazon Aurora as a back-end database for Version 1, and for all other Opsview Monitor users.

…Delivered.

Since the migration to Aurora, the Opsview Monitor master database is replicated 15 times (compared to 5 on MySQL), with automated failover to these replicas. This enables AWS to guarantee greater than 99.99% service availability.

With Aurora, storage is fully managed, which means the database’s storage automatically scales as data grows, up to 64TB, without availability disruptions to resize or restripe data. In addition, Version 1’s database now has 5 times the throughput capacity.

This migration to AWS and Aurora has significantly improved reporting capabilities, downtime, infrastructure rebuilds, and front-end security.

Daily reports previously took over 4 hours when running on-premises. Following the migration from MySQL to Amazon Aurora, this has been reduced to approximately 4 minutes.

Graph data over time can now be rendered in milliseconds, compared to a wait of several seconds previously.

In the event of any failure on the master monitoring server, Version 1 can now fail over to a disaster recovery master server in less than 10 minutes. With the traditional on-premises approach, failover was very difficult and usually took hours to complete.

As Version 1 has now deployed Opsview using infrastructure-as-code, we can rebuild all the Opsview core infrastructure in under 2 hours, including data migration. By contrast, the traditional approach of rebuilding on new hardware was tedious, and the data migration would take days because of slow network speeds.

By moving to AWS, Version 1 could start using more services to improve the security of the Opsview front-end. We have implemented an AWS Web Application Firewall that blocks all access except from a pre-approved list of external IPs.
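The allowlist rule itself is simple to express. The sketch below shows only the underlying logic, using Python's standard `ipaddress` module; it is not AWS WAF's API (in AWS WAF this is configured through IP sets and rules rather than application code), and the address ranges are reserved documentation ranges used as placeholders.

```python
import ipaddress
from typing import Iterable, List


def parse_allowlist(cidrs: Iterable[str]) -> List:
    """Parse CIDR strings into network objects."""
    return [ipaddress.ip_network(cidr) for cidr in cidrs]


def is_allowed(client_ip: str, allowlist: List) -> bool:
    """Allow a request only if its source address is in an approved network."""
    address = ipaddress.ip_address(client_ip)
    return any(address in network for network in allowlist)


if __name__ == "__main__":
    # Placeholder ranges (reserved for documentation), not real customer IPs.
    approved = parse_allowlist(["203.0.113.0/24", "198.51.100.7/32"])
    for ip in ("203.0.113.42", "192.0.2.1"):
        print(ip, "allowed" if is_allowed(ip, approved) else "blocked")
```

Anything not explicitly approved is blocked by default, which is the behaviour described above.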