In order to effectively manage cloud infrastructure at any scale, enterprise organizations need to make use of infrastructure-as-code tools, treating their application infrastructure configuration as they would their standard application code.
Christina Harker, PhD
Marketing
Manually provisioning infrastructure simply isn't viable at the scale of modern cloud applications, and ClickOps will lead to spiraling costs and, likely, security vulnerabilities.
The alternative is infrastructure as code: treating application infrastructure configuration exactly as you would standard application code. This new standard of resource provisioning no longer involves invoices and work orders to rack new servers; now it's clicks and API calls.
Getting this right is critically important to the long-term scalability and security of application workloads. Bad implementations inevitably lead to spiraling cloud costs, technical debt, and security vulnerabilities. What options are available, and what is the right tool for an enterprise getting started on its cloud journey?
The previous generation of infrastructure-as-code (IaC) tooling was based on a concept known as configuration management (CM). CFEngine, Puppet, Chef, Ansible, and Salt are the most well-known and widely deployed CM systems. They enabled system administrators and infrastructure engineers to automate the configuration of large numbers of servers far more efficiently than the manual processes that preceded them.
This model of infrastructure management involved the user first defining a desired state in code. That state could be the presence of a specific configuration file, a set of user accounts, or the installation of versioned OS packages or application dependencies. The desired state was then assigned to servers based on their roles: web servers received a web configuration, backend servers their own configuration, and so forth.
Periodically, the tool would run to verify that the infrastructure matched the desired state. If a resource was out of compliance, the tool would take steps to converge it back; for example, if a server was missing a software package or dependency, the CM tool would install it.
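That declare-verify-converge loop is the essence of every CM tool. As a minimal conceptual sketch, assuming a Debian-style package manager and using hypothetical names (this is not any real tool's API), the logic looks roughly like this:

```python
# Conceptual sketch of a CM convergence loop -- hypothetical names,
# not the API of any real CM system.
import subprocess

# Desired state for servers holding the "web" role.
DESIRED_PACKAGES = {"nginx", "openssl"}

def installed_packages() -> set[str]:
    """Ask the OS package manager (Debian/Ubuntu here) what is installed."""
    out = subprocess.run(
        ["dpkg-query", "-W", "-f=${Package}\n"],
        capture_output=True, text=True, check=True,
    ).stdout
    return set(out.split())

def converge() -> None:
    """Compare actual state to desired state and install whatever is missing."""
    missing = DESIRED_PACKAGES - installed_packages()
    for pkg in sorted(missing):
        subprocess.run(["apt-get", "install", "-y", pkg], check=True)

if __name__ == "__main__":
    converge()  # a real agent would run this on a schedule to correct drift
```

Real CM systems layer roles, templating, secrets, and reporting on top of exactly this loop.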
If CM was such a leap forward in enabling large-scale infrastructure management, why has the tech industry largely moved on? One core issue is that nearly every major tool in the space was designed and released before the cloud became the first-choice destination for software infrastructure. The core design of these tools presupposed that the servers they managed would already be deployed out-of-band.
The primary mechanism of automation in a CM tool is either an installed agent or agentless interaction (via SSH) with the target servers; support for the API-driven provisioning model of cloud platforms had to be grafted on afterwards. Additionally, as the complexity of CM-managed systems grew, the configurations became brittle and difficult to manage.
The accumulation of changesets and drift over time meant that the infrastructure was not immutable, and this growing tech debt left admins afraid to make changes for fear of unintended outages or regressions. The newer generation of cloud-first tooling was built to address the needs of infrastructure-as-code at scale while solving the issues that plagued CM systems.
These tools still let users define a desired end-state; the crucial feature separating them from the previous generation is that resources can be both provisioned and configured within the same tooling. This is a significant advantage over CM and allows engineering teams to get closer to the ideal of immutable infrastructure.
Immutable infrastructure is the concept that infrastructure components are not modified after their initial deployment, but rather replaced and redeployed in order to make changes. Immutability helps ensure that the configuration of infrastructure in code is always as close as possible to an accurate representation of the state of the live infrastructure systems. With immutable infrastructure, scaling, security, and reliability are all easier to achieve.
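The difference is easy to see in miniature. The toy Python sketch below (all names are hypothetical, not a real deployment API) models an immutable update: rather than editing a running server, a replacement is built from code and the old instance is retired:

```python
# Toy model of an immutable update -- hypothetical names, not a real
# deployment API. The record is frozen: any change requires a replacement.
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class Server:
    image: str    # a baked machine image or container tag
    version: int

def immutable_update(current: Server, new_image: str) -> Server:
    """Build a replacement from code; the old server is destroyed, never edited."""
    return replace(current, image=new_image, version=current.version + 1)

old = Server(image="app:v1", version=1)
new = immutable_update(old, "app:v2")
# Attempting old.image = "app:v2" raises FrozenInstanceError -- no in-place edits.
```

Because live systems are only ever produced from code, the code remains an accurate record of what is actually running.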
In terms of which tool to use, enterprises have a few options. Terraform is generally considered the leader: it has broad industry adoption, has established itself as the de facto standard in IaC, and offers a robust ecosystem of third-party tools, community support, and add-ons, as well as a solid selection of supported platforms.
Its configuration language, HCL, is simple to read, and its declarative syntax means users do not have to define logical structure or execution order: the desired infrastructure state is declared, and the tool determines the best path to get there. For enterprise organizations deployed wholly on AWS, CloudFormation is also available. CloudFormation offers first-class interoperability with AWS services; however, it is proprietary and limited to the services AWS provides. In contrast to Terraform, it cannot, for instance, deploy PagerDuty schedules or Datadog dashboards.
With an IaC tool like Terraform, engineers can lint and test their code and enforce configuration rules with tools like OPA and Sentinel. Terraform also provides modules, giving users the ability to encapsulate and reuse blocks of resources. One of the core principles in software development is "Don't Repeat Yourself" (DRY): define a component or piece of logic once, then call on it as needed rather than writing it out over and over. This leads to cleaner, more efficient, and less buggy code. The same principle applies in IaC: using a shared module for, say, an application stack ensures consistent configuration and makes management and security much simpler.
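As a toy illustration of the principle in ordinary Python (the names here are purely illustrative), one canonical definition replaces copy-pasted configuration; a Terraform module plays the same role for blocks of resources:

```python
# DRY: define the canonical configuration once, reuse it everywhere.
# Illustrative names only -- a Terraform module does this for resources.

def bucket_config(name: str, env: str) -> dict:
    """One canonical, security-reviewed storage bucket configuration."""
    return {
        "bucket": f"{name}-{env}",
        "versioning": True,
        "encryption": "AES256",
        "tags": {"env": env, "managed_by": "iac"},
    }

# Every environment reuses the same definition: fix it once, fix it everywhere.
buckets = [bucket_config("app-assets", env) for env in ("dev", "staging", "prod")]
```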
Terraform is not without its downsides, however. First, there is cost: between the commercial offerings and the engineering investment required, Terraform can be overkill in many situations. Second, the declarative syntax limits the use of imperative logic such as for-loops, which can make larger resource configurations cumbersome and wordy. Finally, HCL is yet another language and syntax for developers to learn.
The latest generation of tools enables software engineers to deploy infrastructure using general-purpose programming languages and libraries. AWS CDK, CDKTF, and Pulumi are available as modules for several languages and provide the opportunity to leverage imperative logic in defining and deploying infrastructure configurations.
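For a flavor of the approach, here is a short Pulumi program in Python; it assumes the pulumi and pulumi_aws packages are installed and AWS credentials are configured, and the bucket names are illustrative. A plain for-loop, the kind of imperative logic that is awkward in declarative HCL, stamps out one bucket per environment:

```python
# Pulumi program: ordinary Python control flow driving cloud resources.
# Assumes `pulumi` and `pulumi_aws` are installed and AWS credentials are
# configured; resource names are illustrative.
import pulumi
import pulumi_aws as aws

# An ordinary for-loop creates one bucket per environment.
for env in ["dev", "staging", "prod"]:
    bucket = aws.s3.Bucket(
        f"app-assets-{env}",
        tags={"env": env, "managed_by": "pulumi"},
    )
    pulumi.export(f"bucket_name_{env}", bucket.id)
```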
Choosing the correct path forward for managing cloud resources depends on the resources available and the timeline for implementation. Before the hard work begins, it's essential for companies to take stock of what resources are needed to reach their objectives, any staffing or skills gaps that could be an issue, and the timeline on which the implementation must be ready to meet user SLAs.
If an enterprise has:
Ample access to DevOps or cloud-experienced personnel and resources OR
An 18-24 month timeline to hire, train, plan and implement the project
...then an in-house solution using Terraform is a good choice. As mentioned in the previous section, its broad support, the resources available, and its ubiquity in the industry make it a great choice, and they provide solid opportunities to either hire experienced engineers or train existing ones.
It's important to understand that enterprises coming from traditional, non-cloud environments need to invest considerably in building out process, skills, and staff, along with a nearly complete overhaul of the operational and engineering culture. Traditional environments often treat infrastructure and software as two very distinct silos with distinct modes of operation. Building on a cloud platform using DevOps methodologies means breaking down those silos and adopting agile software development practices for both.
For many teams, this will not be an easy or quick transformation. Trying to take shortcuts will lead to inefficiencies, cost overruns, and security vulnerabilities.
If an enterprise has:
Leaner/smaller teams
Shorter timelines (3-12 months)
Little stakeholder support for internal transformation
...then a batteries-included Platform-as-a-Service (PaaS) might be a better solution. Developers can package their code into a ubiquitous artifact, such as a Docker container, and ship it directly to the platform with few modifications to their existing development workflow. This approach keeps developers productive; feature iteration and customer experience don't suffer while engineering teams pivot to adopt cloud operations.
When faced with a lack of available capacity, resources, personnel, or enthusiasm for change, enterprise organizations should strongly consider not reinventing the wheel; otherwise, they'll end up with half-baked cloud deployments riddled with misconfigurations, inconsistencies, and security flaws.
When done correctly, infrastructure-as-code can unlock massive scaling and efficiency potential in any cloud deployment, and many of the most successful technology and software companies rely on it heavily. However, it is a major cultural, operational, and process transformation that may require significant investment and dependence on outside resources. Platform-as-a-Service offers a great way to take advantage of the scaling and performance of the cloud without the operational burden.