Atlantis
This step-by-step guide uses the Atlantis Fargate module to deploy Atlantis in GovCloud in a single account. Unsurprisingly, both the Atlantis module and Atlantis itself have evolved since we first deployed Atlantis at Truss. This step-by-step guide seeks to reflect how our implementations evolve along with the module. If you use this guide and you notice something is out of date, please submit a PR with improvements (no matter how small) to try and keep it up to date as a courtesy to those who come after you.
For accessibility, code links are sourced from the legendary-waddle repository as Atlantis is not currently deployed in legendary-waddle-gov. Inline code examples have been supplied for notable deviations.
Step 1: Know the Options
The first step in implementing Atlantis is to familiarize ourselves with what Atlantis is and what it does. Atlantis is "a simple Go app. It receives webhooks from our Git host and executes Terraform commands locally." It's high-level infrastructure middleware.
We don't have to make concrete decisions on all implementation facets before we begin, but we should make some decisions on
How to configure the Atlantis server, and
How to integrate Atlantis' resources into our existing structure.
Configuration Options
As of the time of this publication, there are 3 ways to configure Atlantis. Step one is looking at our project's configuration and deciding which way to implement our Atlantis server. The Atlantis docs summarize our options.
command line flags
environment variables
a config file, or
a mix of all three
As an example, Truss' legendary waddle repo uses a combination of environment variables and repo-level atlantis.yaml files. In general, we only want to use environment variables and a config file. However, it's good to know we can use command line flags for certain workarounds.
We'll want to decide which accounts to place Atlantis in before we begin, as well as have a general idea of which server configuration method(s) we want to use.
Integration Options
This tutorial covers using the Atlantis Fargate module to deploy Atlantis. This module leverages a variety of other modules, submodules, and direct resource calls in our code to create the following key components:
Virtual Private Cloud (VPC) and the accompanying EC2-VPC security group
Application Load Balancer (ALB) and submodules for https 443 and http 80
Domain name using AWS Route53 which points to ALB
AWS Parameter Store to keep secrets and access them in ECS task natively
Some of these resources probably already exist on your project. The module expects us to integrate them. If the key resources don't exist, the Atlantis module (and submodules) will create those resources for us. Figuring out which of our project's pre-existing resources to integrate, and then which resources we should leverage the Atlantis module's calls to create is the next step. Here's a rough whiteboarded visual of the process. Some call it art:

The next step is to figure out the order in which to create the resources we need so as to avoid/minimize interdependency conflicts with the pre-existing resources we must integrate. The order of operations for this is what makes using this Atlantis module a bit tricky, especially if we're not familiar with how to both troubleshoot the creation of the resources Atlantis requires and troubleshoot the interconnectedness of those resources.
However, as a general example, here are some tickets @rpdelaney shared from CMS's EASi project in planning/implementing Atlantis:
| Ticket Name | Type |
| --- | --- |
| GitHub user | Story |
| ACM Certificate | Story |
| IAM, provider, backend changes | Story |
| Set up network connectivity | Story |
| Image build pipeline | Story |
| Email for Atlantis user | Story |
| Create GitHub repos | Story |
| Create ECR repo | Story |
| Grant automation user push access to the ECR repo | Task |
| Create GitHub token & add to SSM | Task |
| Add Atlantis module in terraform | Task |
| Investigate accessibility via GitHub hook with ALB DNS record | Task |
| Add policies to automation role policy document | Task |
| Tune automation policy document to grant access to necessary SSM parameters | Task |
| Add IAM policy attachments for Execution role to Atlantis module | Task |
| Add yaml to infra repo | Story |
| Update Atlantis policy to allow cross-account role assumption to dev | Story |
| Configure webhook via new DNS record | Task |
| Add the Atlantis IAM roles to impl, prod, infra | Story |
| Pass ECR access from automation user to Atlantis user | Task |
| Investigate applicability of auto-planning | Task |
| Grant Atlantis service permission to assume the Atlantis role | Task |
After we've planned out our implementation, we're ready to begin.
Step 2: Prep Work
This section highlights specific resources we may want to have in place before we call the Atlantis module. It is not meant to be exhaustive.
Directory Setup
Create an atlantis-global directory in our desired account, in our case (exp):
mkdir -p exp/atlantis-global
Add the terraform state bucket, version, and provider files following the steps in the bootstrapping document from the Atlantis section.
While we're here, we can go ahead and set up our repo config and atlantis.yaml files. The official Atlantis documentation provides a lot of options for this configuration if you're feeling experimental, but there are some really great base models in legendary-waddle for both the yaml and config.
Use the Bot 🤖
If our project has a Robot user already, we can ride the coattails of those pre-existing GitHub user permissions for Atlantis as well. To do this, all you need to do is pass in the name of the robot user to the atlantis_github_user value in a later step when we call the Atlantis module.
Since we have a robot user, we can skip to "Store a Key in AWS SSM/Parameter Store."
Set Up a User, Email, and GitHub Deploy Key (optional) 📧
The official Atlantis docs recommend creating a dedicated user. At this point you may want to set up a robot user (you'll eventually need one anyway).
Although this decision is outside the scope of this tutorial, we can consult the ADR in legendary waddle on robot accounts and this ADR on key rotation consequences regarding creating a user to make our decision.
Make sure we've got an email we can associate with Atlantis to receive notifications, etc. On milmove we were able to associate our pre-existing email address and use [email protected] to associate with the Atlantis role created in the next step.
Generate a new SSH key per GitHub's documentation.
Next, add the key as a Deploy Key, following GitHub Docs. As we can see, the first step in the setup is to "Run the ssh-keygen procedure" on our server, which we did in the previous step. Note that the act of adding the deploy key is done in the GUI for the repo associated with the location we'd like to deploy Atlantis in. In our example we're deploying to the exp account first. Go to the GitHub repo, click on Settings, and then click on "Deploy keys". Check the box to "Allow write access."
We can cat our id_ed25519.pub to see our public key. It will look something like ssh-ed25519, followed by a long base64-encoded string and a comment (usually the email address supplied to ssh-keygen).
GitHub will send an email to our chosen email address to confirm we successfully added the key.
Store a Key in AWS SSM/Parameter Store 🗝️
If we're using a pre-existing robot user, we can repurpose that Robot's existing deploy key. The easiest way to find this information may be to just log into the account where the robot user exists and look for a key that corresponds in name to the robot user.
Assuming the robot user does not have access to our newly created account's /atlantis-global directory, we can simply copy/paste the key value into the parameter store for the account we want to use Atlantis in.
We can log into the console for the account we want to deploy Atlantis in (in our case, exp), and add our key to AWS Systems Manager > Parameter Store using the naming convention </directory/object_name> (ex. /atlantis-global/atlantis_key) as type SecureString. Use "My current account" as the KMS key source.
Create the Atlantis IAM role and policies 🎩
Whether or not you chose to create a dedicated IAM user for Atlantis, you will need to create an IAM role for Atlantis. Atlantis needs a role (and corresponding policy permissions) in order to perform two main functions for our module: assuming ECS task control and assuming Terraform/GitHub control in the account.
Create the Atlantis role and policy
Our role, policy, and policy attachment will look a lot like the example in legendary waddle. We'll want to alter the policy principals to match our project's role names, but otherwise the code is pretty standard. Depending on our logs setup, we may want to add access to service logs as an additional principal:
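As a rough sketch of what such a role, trust policy, and policy attachment might look like (this is not the legendary-waddle code): the role name, the atlantis-exp-ecs_task_execution principal, and the choice of the AdministratorAccess managed policy are all assumptions made for illustration. The partition is looked up rather than hard-coded so the same code works in GovCloud and commercial accounts.

```hcl
# Illustrative sketch only. All names below are hypothetical.
data "aws_partition" "current" {}
data "aws_caller_identity" "current" {}

locals {
  # Assumed name of the ECS task role the Atlantis module creates
  # when its name input is "atlantis-exp".
  atlantis_task_role_arn = "arn:${data.aws_partition.current.partition}:iam::${data.aws_caller_identity.current.account_id}:role/atlantis-exp-ecs_task_execution"
}

# Trust policy: who is allowed to assume the Atlantis role.
data "aws_iam_policy_document" "atlantis_assume_role" {
  statement {
    effect  = "Allow"
    actions = ["sts:AssumeRole"]

    principals {
      type = "AWS"
      # The referenced role must already exist, or IAM rejects the principal.
      identifiers = [local.atlantis_task_role_arn]
    }
  }
}

resource "aws_iam_role" "atlantis" {
  name               = "atlantis"
  assume_role_policy = data.aws_iam_policy_document.atlantis_assume_role.json
}

# Assumption: Atlantis runs terraform with broad permissions.
# Swap in a narrower policy if your project allows it.
resource "aws_iam_role_policy_attachment" "atlantis_admin" {
  role       = aws_iam_role.atlantis.name
  policy_arn = "arn:${data.aws_partition.current.partition}:iam::aws:policy/AdministratorAccess"
}
```

Because the trust policy names the task role as a principal, the order of operations matters: the module-created role has to exist before this trust policy can be applied.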
We'll also need to add a line of code to let the Atlantis role control terraform. In legendary-waddle this is done per account by adding a role_arn value to the s3 backend and an assume_role object to the account provider. We may want to begin by leaving this code commented out. See the IAM troubleshooting section for caveats regarding role assumption.
Create the ECS policy for the Auto-Created Role
The Atlantis module creates the ECS task's IAM role automatically (named <NAME>-ecs_task_execution, where <NAME> is your name variable) by passing in the terraform-aws-ecs module as a submodule, which in turn uses the aws_iam_instance_profile resource to create the IAM role for the task. As you can see, we're now a few layers deep, which makes any variance in the expected chain of events a bit tricky to troubleshoot.
For our example, we'll need to create:
a policy
a policy document, and
a policy attachment
Our final code will look something like this:
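A minimal sketch of those pieces, assuming the task only needs to read its secrets from Parameter Store under the /atlantis-global/ path used earlier. The region, the policy name, and the atlantis-exp role name are hypothetical; your project's policy may need more (or different) actions.

```hcl
# Illustrative sketch only. Names, region, and actions are assumptions.
data "aws_partition" "current" {}
data "aws_caller_identity" "current" {}

locals {
  region         = "us-gov-west-1"                   # assumed region
  ssm_prefix     = "atlantis-global"                 # matches /atlantis-global/atlantis_key
  task_role_name = "atlantis-exp-ecs_task_execution" # <NAME>-ecs_task_execution
}

# The policy document: allow reading parameters under our prefix.
data "aws_iam_policy_document" "atlantis_ecs_secrets" {
  statement {
    effect  = "Allow"
    actions = ["ssm:GetParameters"]

    resources = [
      "arn:${data.aws_partition.current.partition}:ssm:${local.region}:${data.aws_caller_identity.current.account_id}:parameter/${local.ssm_prefix}/*",
    ]
  }
}

# The policy itself.
resource "aws_iam_policy" "atlantis_ecs_secrets" {
  name   = "atlantis-ecs-secrets"
  policy = data.aws_iam_policy_document.atlantis_ecs_secrets.json
}

# The attachment to the role the module creates for us.
resource "aws_iam_role_policy_attachment" "atlantis_ecs_secrets" {
  role       = local.task_role_name
  policy_arn = aws_iam_policy.atlantis_ecs_secrets.arn
}
```

The attachment refers to the role by name, so it can only be applied after the module has created that role.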
For comparison, there is a slightly different setup in the legendary-waddle repository which places the policy attachments in the individual accounts. Pick your poison; the setup will vary based on your account structure.
Finally, because we're using GovCloud, we'll need to manually go in the console and change the ECS policy arn until this PR gets merged. This is because the Atlantis module's policies_arn variable passes through a default value for an AWS "managed" policy for ECS task execution using a partition that only works in commercial accounts. Tl;dr: instead of aws, we need the value aws-us-gov in our policy.
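To make the partition difference concrete, here is a small sketch that builds the managed policy ARN from the current partition instead of hard-coding aws. Whether the module release you pin lets you override policies_arn with this value is an assumption to verify against that release; if it does not, the manual console change described above still applies.

```hcl
# Illustrative sketch only.
data "aws_partition" "current" {}

locals {
  # Resolves to arn:aws-us-gov:... in GovCloud and arn:aws:... in commercial.
  ecs_task_execution_policy_arn = "arn:${data.aws_partition.current.partition}:iam::aws:policy/service-role/AmazonECSTaskExecutionRolePolicy"
}

output "ecs_task_execution_policy_arn" {
  description = "Partition-aware ARN of the AWS managed ECS task execution policy"
  value       = local.ecs_task_execution_policy_arn
}

# Assumption: if your module version accepts the override, pass it in
# the module "atlantis" block like so:
#   policies_arn = [local.ecs_task_execution_policy_arn]
```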
Add & validate a certificate for Atlantis
We'll need to add a new certificate to ACM, which manages our certificates for us. There are two steps: adding the certificate, and validating the certificate. We can't create and validate Route53 records in GovCloud. For a more thorough explanation, see our Engineering Playbook documentation on ACM in GovCloud. To reiterate briefly here, we'll need to:
Create the new certificate in GovCloud using the aws_acm_certificate resource.
Export the values as an output in the output.tf file. Once we've merged the PR, we'll have the cert validation output value to validate the certificate in the next step.
Validate the Route53 DNS records in the Commercial account using the aws_acm_certificate_validation resource.
We'll need to find the arn to plug into the atlantis module we'll call in the next step. After merging the PR to add the certificate, we can find this arn in the console's Certificate Manager under the "Details" section for the domain name (in our example, atlantis.exp.net).
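The first two steps above might be sketched as follows. The resource and output names are hypothetical; the domain is the example domain used in this guide.

```hcl
# Illustrative sketch only. Resource and output names are assumptions.
resource "aws_acm_certificate" "atlantis" {
  domain_name       = "atlantis.exp.net"
  validation_method = "DNS"

  lifecycle {
    # Lets a replacement certificate come up before the old one is removed.
    create_before_destroy = true
  }
}

# output.tf: the DNS records ACM wants to see. These are the values we
# carry over to the Commercial account to create the validation records.
output "atlantis_cert_validation_records" {
  description = "Record name, type, and value required to validate the certificate"
  value       = aws_acm_certificate.atlantis.domain_validation_options
}

# Handy for the module call in the next step.
output "atlantis_cert_arn" {
  description = "ARN of the Atlantis certificate"
  value       = aws_acm_certificate.atlantis.arn
}
```

Exporting the ARN as an output is an optional convenience; looking it up in the console as described above works just as well.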
Set up the Image for Fargate to Use 🐋
For any project not requiring a specific type of docker implementation, we can simply pass in the trussworks-atlantis-ecs-image image when we make the module call in Step 3.
If we do not set the atlantis_image variable, we'll find atlantis:latest is used by default. Because the tag name "latest" never changes, a newly released "latest" image is not recognized as an update. Therefore, we recommend passing in a numbered version of the Atlantis docker image to the atlantis_image var.
Step 3: Call the Atlantis module 🧜
We'll start with the most basic Atlantis module call we can. Assuming an existing VPC, Robot user, validated certificate, zone name, Atlantis Docker image, and GitHub connection, the code will look something like this
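A minimal sketch of such a call, assuming the flat input variables used by the module releases this guide was written against (later major releases restructure the inputs, so check the README for the version you pin). Every ID, ARN, image path, user, and org name below is a placeholder.

```hcl
# Illustrative sketch only. All values are placeholders.
module "atlantis" {
  source = "terraform-aws-modules/atlantis/aws"
  # Pin a version you have tested, and check its README for input names.

  name = "atlantis-exp"

  # Existing VPC and subnets
  vpc_id             = "vpc-0123456789abcdef0"
  private_subnet_ids = ["subnet-0123456789abcdef0", "subnet-0123456789abcdef1"]
  public_subnet_ids  = ["subnet-0123456789abcdef2", "subnet-0123456789abcdef3"]

  # Validated certificate and zone name
  certificate_arn   = "arn:aws-us-gov:acm:us-gov-west-1:123456789012:certificate/EXAMPLE"
  route53_zone_name = "exp.net"

  # A numbered image version rather than "latest"
  atlantis_image = "123456789012.dkr.ecr.us-gov-west-1.amazonaws.com/atlantis:1.2.3"

  # Robot user and the repos Atlantis may act on.
  # Older module releases name this input atlantis_repo_whitelist.
  atlantis_github_user    = "example-robot"
  atlantis_repo_allowlist = ["github.com/example-org/*"]
}
```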
Eventually we'll build out until our call looks a lot like the example in legendary-waddle. Note that the example above is GovCloud-based. Replace any references to us-gov with us if not using GovCloud.
Add custom environment secrets:
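A hedged sketch of that addition: the parameter path /atlantis-global/github_token is a hypothetical name for the SecureString holding the robot user's GitHub token, and ATLANTIS_GH_TOKEN is the environment variable Atlantis reads its GitHub token from. Merge the argument into the existing module call rather than declaring a second module block.

```hcl
# Illustrative fragment only: merge into the existing module "atlantis" call.
module "atlantis" {
  source = "terraform-aws-modules/atlantis/aws"

  # ...arguments from the basic call omitted...

  custom_environment_secrets = [
    {
      name = "ATLANTIS_GH_TOKEN"
      # Hypothetical Parameter Store path. ECS resolves it at task start,
      # so the secret value itself never passes through terraform.
      valueFrom = "/atlantis-global/github_token"
    },
  ]
}
```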
Notice we use custom_environment_secrets here instead of using the module's built-in atlantis_github_user_token variable. This avoids storing the SSM parameter in the state file and prevents us from having to troubleshoot the general bugginess around parameters in the module.
Submit a PR, get approval, and terraform apply the code.
Step 4: Configure Your Webhook
The Atlantis module conveniently creates a GitHub webhook for you, and the documentation on this is fairly good. Follow the Atlantis documentation on how to configure that webhook.
Step 5: Connect the ALB Logs Bucket 🪣
Of all the things this module does, it does not create a logs bucket. The module itself has the expectation we will either create a new bucket or connect an existing one. In our example, we'll connect an existing logs bucket and ensure our permissions are correct.
Locate the existing logs bucket, presumably created using the trussworks/logs/aws module. We're just going to put logs in the bucket that already exists for the account, in our case exp-aws-logs.
Once we find the code that creates the bucket, we need to give Atlantis permission to add our logs to the bucket. We do this by adding our chosen logs bucket prefix (we're following the established pattern and using alb/atlantis-exp here) to the alb_logs_prefixes and the nlb_logs_prefixes lists in our existing logs bucket.
After that, we seal the deal by returning to our code and adding the code like this inside our module call:
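As a sketch, assuming the module release in use exposes the ALB logging inputs shown here (verify the names against the README for your pinned version), the addition might look like this. The bucket and prefix are the ones from our example.

```hcl
# Illustrative fragment only: merge into the existing module "atlantis" call.
module "atlantis" {
  source = "terraform-aws-modules/atlantis/aws"

  # ...arguments from the basic call omitted...

  alb_logging_enabled     = true
  alb_log_bucket_name     = "exp-aws-logs"     # existing logs bucket
  alb_log_location_prefix = "alb/atlantis-exp" # must match the prefix allowed on the bucket
}
```

If the prefix here does not match one of the prefixes granted on the bucket, the ALB will not be able to write its access logs.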
Another example is available in the legendary-waddle repo
Submit a PR, get approval, and terraform apply the code.
Log into the console to make sure the bucket stores logs in our prefix path by viewing the auto-created ELBAccessLogTestFile in the path we chose: alb/atlantis-exp/
Step 6: Hide the UI (and a little backstory) 📰
Once we've confirmed ACM grants Atlantis access to GitHub, we should hide the UI. Previously, the only option we had for releasing a terraform lock held by Atlantis was through the UI, so it was necessary for both Infra and GitHub to access this UI. However, we still need to keep malicious outsiders out so they can't do nasty things like sneak into Atlantis, tap into its Administrator-level access, and run terraform destroy on all our precious code.
One method we've successfully used is to force federated login via Cognito, keeping the UI visible so that Infra could still access the UI and unlock plans as needed. However, the Atlantis module evolved. We can now simply run atlantis unlock as a command in the PR workflow. Humans no longer need access to the UI to resolve locks. As a result, we can now construct a WAF to restrict access and return a 403:
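One possible shape for such a WAF, offered as a sketch and not as the configuration we actually deployed: a regional WAFv2 web ACL that blocks every request whose path is not exactly /events (the webhook endpoint), attached to the Atlantis ALB. Blocked requests get WAF's default 403 response. The ACL name, metric names, and the atlantis_alb_arn variable are hypothetical.

```hcl
# Illustrative sketch only. Names and the input variable are assumptions.
variable "atlantis_alb_arn" {
  description = "ARN of the ALB in front of Atlantis (hypothetical input)"
  type        = string
}

resource "aws_wafv2_web_acl" "atlantis" {
  name  = "atlantis-exp"
  scope = "REGIONAL" # ALBs use regional web ACLs

  default_action {
    allow {}
  }

  rule {
    name     = "block-everything-except-webhook"
    priority = 1

    action {
      block {} # WAF answers blocked requests with a 403 by default
    }

    statement {
      # Match any request whose URI path is NOT exactly /events.
      not_statement {
        statement {
          byte_match_statement {
            search_string         = "/events"
            positional_constraint = "EXACTLY"

            field_to_match {
              uri_path {}
            }

            text_transformation {
              priority = 0
              type     = "NONE"
            }
          }
        }
      }
    }

    visibility_config {
      cloudwatch_metrics_enabled = true
      metric_name                = "atlantis-block-ui"
      sampled_requests_enabled   = true
    }
  }

  visibility_config {
    cloudwatch_metrics_enabled = true
    metric_name                = "atlantis-exp"
    sampled_requests_enabled   = true
  }
}

resource "aws_wafv2_web_acl_association" "atlantis" {
  resource_arn = var.atlantis_alb_arn
  web_acl_arn  = aws_wafv2_web_acl.atlantis.arn
}
```

With something like this in place, atlantis.exp.net should return a 403 while atlantis.exp.net/events remains reachable for GitHub's webhook deliveries.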

Another option is to simply tighten security groups to restrict access so that only GitHub IPs are allowed to access Atlantis. We combine two Atlantis module optional input settings to get the result we want:
Leave allow_unauthenticated_access at its default setting of false
In this way, we're able to restrict our ingress rules to allow only GitHub IPs. All other requests will return ERR_CONNECTION_TIMED_OUT.
Submit a PR, get approval, and terraform apply the code. We should check our urls (in our example, atlantis.exp.net and atlantis.exp.net/events) to ensure we receive our desired responses.
Troubleshooting 🔧
Due to some bugs in the module and the inherent complexity of integrating/setting up so many resources, some degree of troubleshooting will be necessary. Thus it's included here as a step.
General IAM Role Assumptions Troubleshooting
We added code to let the Atlantis role control terraform following the legendary-waddle examples for the s3 backend and the account provider. While we're making changes in terraform for various resources (such as the VPC, ALB, etc.), those resources do not necessarily also have permissions to control our code. As a result, terraform init (as well as any other terraform commands) may throw an "access denied" or "unauthorized" error. Temporarily commenting out the assumed role-related code allows us to continue.
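For orientation, the role-related code in question might look roughly like this, shown here in its temporarily commented-out state. The bucket name, state key, region, account ID, and role name are placeholders. Note that newer Terraform releases prefer a nested assume_role block in the s3 backend over the top-level role_arn argument.

```hcl
# Illustrative sketch only. All values are placeholders.
terraform {
  backend "s3" {
    bucket = "example-exp-terraform-state"
    key    = "atlantis-global/terraform.tfstate"
    region = "us-gov-west-1"

    # Re-enable once the Atlantis role exists and can be assumed:
    # role_arn = "arn:aws-us-gov:iam::123456789012:role/atlantis"
  }
}

provider "aws" {
  region = "us-gov-west-1"

  # Re-enable together with the backend role_arn above:
  # assume_role {
  #   role_arn = "arn:aws-us-gov:iam::123456789012:role/atlantis"
  # }
}
```

After changing backend settings, terraform init needs to be re-run so the new backend configuration takes effect.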
GitHub Troubleshooting
When adding access to directories within the same account, we may encounter a Host key verification failed error.
Atlantis is trying to download a [email protected]/ prefixed module via ssh and can't. This happens because our Atlantis docker container doesn't have GitHub permissions to clone the private module. It's possible to fix this using a Dockerfile ENTRYPOINT customization, but the simplest way is to pass in the --write-git-creds flag to your environment variables. Add the following code to the custom_environment_variables section of your Atlantis module call.
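A sketch of that addition: ATLANTIS_WRITE_GIT_CREDS is the environment-variable form of the --write-git-creds flag. As before, this is a fragment to merge into the existing module call, not a second module block.

```hcl
# Illustrative fragment only: merge into the existing module "atlantis" call.
module "atlantis" {
  source = "terraform-aws-modules/atlantis/aws"

  # ...arguments from the basic call omitted...

  custom_environment_variables = [
    {
      name  = "ATLANTIS_WRITE_GIT_CREDS"
      value = "true"
    },
  ]
}
```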
Our plan will show this updates the aws_ecs_service task definition and replaces the aws_ecs_task_definition environment, adding our variable.
ACM/Certificate Troubleshooting
We'll get a 503 connection refusal error if our certs aren't set up correctly. If we get a cert error when we look at our chosen url (atlantis.exp.net), check the records and corresponding IP addresses in the terminal using dig exp.net and host atlantis.exp.net to pull up our ACM associated values.
We can also check this in Route 53 > Hosted zones: find the record name and look at the Value/Route traffic to column. The value should point to a CNAME, not an IP address.
We troubleshoot by logging into the console, looking at our ALB, and checking what's associated with it via the path: EC2 > load balancer > atlantis-exp
We can click on our certificate and manually point to the load balancer.
VPC Troubleshooting
Anecdotally, the variables for private_subnets and public_subnets did not successfully connect to the VPC when plugging in the IP values directly. However, private_subnet_ids and public_subnet_ids work fine.
There are no open issues in the module for this discrepancy, so this mystery may be confined to the project (or author). This information is included here simply for posterity.
ALB Troubleshooting
We'll see two ALB listeners in the plan related to the redirect on the ALB. Here's a sample plan output:
Check out these ports because this is what's happening:

The ALB is being created with these two listeners (one https & one http). The http/80 port serves to redirect to https/443 and force use of our ACM certificate, setting up SSL termination on the load balancer. This keeps us from having to jump through the hoops of setting up docker and the client with certificates and dealing with SSL termination the TCP way (which is also how we would have to terminate the certificate with NLBs).
Check that the Fargate instance is behind the ALB. If not, we'll have to put it there manually in the console. We can do this by specifying the target group.
ECS Task and Role Permissions Troubleshooting
We only had GovCloud drama here. The module hard-codes aws as the partition in the policy arn, but GovCloud requires aws-us-gov to create the task policy attachment. 🤷‍♀️
Check task exists in console via ECS > Clusters > atlantis-exp

We can also see the task definition by clicking on the Task definition name

If the Fargate container is STOPPED, we'll see a section called "Stopped reason" including the reason with the error message included. Unfortunately AWS character-limits this field so the error message may be truncated and end in an ellipsis, looking something like this:
Typically this specific error is a policy/permissions error. We can see here the only permissions we have are the two policies needed to read our GitHub secrets.

To fix this in our case, we can mimic the ECS tasks in another stack. We create the policy in terraform and attach it to either the Atlantis IAM user we created or the Atlantis role.
SSM/Parameter Store Troubleshooting
When resetting a GitHub personal token in Parameter Store, we will have to redeploy the Atlantis instance. Otherwise, the new credential will not be updated in the instance.
Links and other reading
What Would Improve This Documentation
Switch Default Scenario to Commercial: Since commercial deployments will be the default case (and ACM, etc. still need to be handled in commercial), the GovCloud-specific cases should be in an appendix or in annotations - ex) "in GovCloud, due to X, you must do Y...". This also allows us to use the legendary-waddle deployment verbatim.
Upgrade/improve certain images to diagrams or remove altogether: Not everyone can be expected to understand my art. More seriously, images can reduce accessibility. UIs often change, so screenshots tend to be outdated quickly. Images should be used strategically and sparingly.