Introduction
Welcome to codingwithdrew.com, where I typically describe, in depth, how to conquer a new tool or language. Generally, these articles are written for my future self as notes on how to do things - they are my number one reference. Those articles are pretty timeless because they aren’t opinionated.
This book is a bit different and very opinion-based. I’ll probably read it in 2030 and think I was mentally unstable or something.
My guess is: I’ll cringe... maybe, probably, kind of, sometimes...
In this book, I draw on the languages I’ve learned, the courses I’ve taken, and the on-the-job experience I’ve gained to lay out the best tools for the job in Development and DevOps in 2024 (and beyond?).
The goal here is to guide you through the decisions necessary for your infrastructure and get you up to speed with the recommended technology and processes used in today’s DevOps industry.
My perspective
As a person who loves customer service and thinks it’s an essential life skill and as someone who hates repetitive tasks - I kind of fell into DevOps. I’ve had amazing mentors and am always looking for ways to improve my craft.
Anyone in DevOps without basic customer service skills is in the wrong industry. Yeah, I said that... but it’s true, this is an engineering support role.
DevOps Engineers (and SREs) are basically the same thing and are experts in their fields of intersystem solution architecture, automation engineering and developer support.
When I did customer support, I focused heavily on documenting answers to the questions customers would ask and then scripting them in a text expander called Alfred (which I highly recommend). This made me more effective and imprinted an automation mindset that thrives today.
If you type it more than once, automate it.
What do I need in order to follow along with Drew?
Local Setup
I’m assuming you are already familiar with git, repositories, and basic git operations. I’m not going to go into detail on how to use git; if you need to get set up, I have a few guides on my site, codingwithdrew.com, on how to do so.
Maybe one day I’ll cover A-Z devops including local computer setup.
GitOps and you
I assume you understand GitOps and why we care about it. The topic is a bit too theoretical for this conversation, but understanding it is paramount to discussing CI/CD with Kubernetes.
DevOps Dads
A DevOps journey through time
Does filling out a form, or performing manual UI operations - AKA “ClickOps” to create or deploy your infrastructure sound like a good idea to you?
If it does, who hurt you? Let’s talk. Wanna talk about it?
ClickOps is the way our dads did DevOps in the '90s (and, in some cases, still today). 🔪
ClickOps - the way our dads did DevOps in the '90s
Our dads started out with long “scripts” - not the kind you and I think of, but more like the scripts an actor follows, with calls to action and words... lots of words. They wrote them in Word documents.
Effectively, step-by-step guides they’d use to perform delicate operations for infrastructure changes. Those processes would take hours, days, or even months in many cases.
Initially, they wrote them in Word documents, until they realized that Word changes characters from time to time, so when they copy-pasted a command it wouldn’t work. So they moved to plain-text document-scripts.
Over time, no one knew for sure what the state of individual servers was, so they had to SSH into them and find out or fix issues. I guess, given enough mistakes, this was determined to be an unwanted and unreliable approach.
Dads thought it would be a good idea to document everything. We expected the documentation to be accurate when, in reality, it was only accurate at the moment it was written, leading to a false sense of confidence and to downtime.
Wiki pages and Confluence pages everywhere. Ugh, how I dislike Confluence, for so many reasons.
Eventually plain-text documents weren’t awesome enough (for obvious reasons), so hard-coded executable scripts were created. These were really nice until a configuration change happened and the scripts had to be manually updated - which is harder and more error-prone than doing it manually via SSH.
Every permutation of the script had to be accounted for and documented and everyone was very unhappy.
So... the scripts didn’t work with the new changes, so they created user interfaces that let people with no knowledge of scripts or commands provide inputs to the infrastructure scripts. Cool. But what happens when the infrastructure changes? How do you know? Does your script poll every few minutes to check the state of the infrastructure? Also, with UIs, you have to build a whole new UI, document how to navigate it, and train people. This sounds like the original solution with extra steps.
If a Web UI isn’t the right place to manage infrastructure - where is? Is it with a CLI?
Using a CLI might seem like a good idea at first (especially in contrast to the options already presented); however, the next time we want to upgrade the CLI, the code may change. The dependencies may change and so might the commands. Other limitations start to get in the way. Also, most CLIs cannot show us what changes will be applied before they happen (obviously some IaC CLI exceptions like Terraform exist).
There is no doubt a CLI is better than a UI in terms of automation and git operations, but CLIs are not the only option and definitely not the best one. Infrastructure and configuration as code are the torch our dads inadvertently left us, and we should run with it.
Dads everywhere eventually moved to configuration management tools - and no, I don’t mean ServiceNow! Gah... how is ServiceNow still a thing in 2022? I mean configuration management tools like Chef, CFEngine, Puppet, Ansible, and a few others.
These configuration management tools allowed us to describe state and then allow the infrastructure to try to meet that description. This was a game changer because we then had the ability to validate that what we provided as configuration is what would be on the infrastructure - at least to some degree of confidence.
Our dads thought they had a winner, and everyone tried to follow this process of converging the actual state into the desired state. These tools were predicated on the idea that services would live on bare-metal equipment, not in the cloud.
I’m a dad now...
When virtual machines came on the scene, they provided a lot of benefits. They could split individual servers into individual workloads. We could decide how much memory and cpu could be provided to each VM.
The real benefit of this wasn’t immediately clear to us then: we could create and delete machines with the touch of a button. Crucially, this enabled us to move to immutable infrastructure.
Instead of trying to converge the actual state with the desired state of our infrastructure by making changes to live machines, we could adopt the benefits of immutable infrastructure. Base images would be deployed to a server and treated as immutable - no one should be able to change them in any form. Whenever our dads needed to apply a change, they could build a new image, create a new VM from it, and destroy the old one.
We could then leverage Kubernetes in the cloud with rolling updates, where one image is taken out of service and a new base image is deployed in its place, ensuring reliability and stability of the entire system during updates.
Conceptually, that process is the same as how we should be operating today.
In 2022, if you do anything with infrastructure, it should be managed as code. You should not do anything as ClickOps unless you are in development for learning/testing reasons.
Disagree in general? I’d love to hear why in the comments section.
Ok, I hear you say “ClickOps = bad”, but why?
Humans are error-prone at repetitive tasks; computers shine at them.
ClickOps leads to undocumented and unreproducible processes, which should be avoided at all costs when dealing with enterprise systems.
You want your co-workers, bosses, and successors to be able to understand what is happening and to reproduce the processes reliably without errors. This leads to the concept of developer enablement.
ClickOps is also not fast. I mean, you can probably click through a form pretty quickly, right? But you can’t compete with a script and be as accurate. Maybe that could be an Olympic sport next to speed typing. Everyone, quick! Break out your QWERTY and let’s show those other countries our secretarial skills!
If you document your processes, you will quickly find out that this-or-that action would be better performed as an executable script with logic than it would be done in a UI. Over time, this executable script could grow into CLI or even IaC. Next time you are doing a manual process, pull the new guy in and have them record and document the process. You’ll be amazed at all the ideas and thoughts related to automation that can come from this documentation.
By contrast, ClickOps generally doesn’t leave a paper trail of who did what, on what date, with a breakdown of everything that changed. When you manage your infrastructure as code, however, the Git repo tells the full story.
What is developer enablement and why do I care?
As a DevOps engineer, you have three customers at the end of the day.
- The end user - if your system is down, they suffer.
- The business - if your system is down, the end user isn’t paying the business.
- The developers - if your system is down, they aren’t able to do their jobs.
All three of the customers are equal in importance but as a DevOps engineer, our greatest impact and potential for improvement is the developers.
Developers are expensive to the business, and they fix the problems the end user has. Typically, DevOps’ primary objective is, or at least should be, to enable the developers to do that.
When developers cannot commit code, when their pipelines are clogged, or when they are stuck in endless triages, they aren’t doing the thing that matters most... coding.
Developer⏱ > Devops⏱
A developer should have limited interactions with the underlying systems that put their code on the open internet. They don’t own infrastructure.
Their world should be limited to their IDE of choice, kanban/agile boards, a local environment (if applicable), and git. They should go to ceremonies and commit code, not fill out forms or play with CLIs (aside from git).
Git should be the trigger for anything and everything downstream of the developer. The developer should not be sifting through logs or managing autoscaling groups.
They shouldn’t be deploying code anywhere - it should be done for them.
They shouldn’t be running their code locally either but that’s a talk for another day.
Ok, ok, I get it... Devs should develop. So how do I enable them as a DevOps engineer?
The first step is GitOps as a way of life. The next part is embracing and managing infrastructure as code so that your git repository is the single source of truth.
To get there though, we have to know where we came from and how previous paradigms got us to where we are today.
I think it really helps to understand the benefits of the way we did stuff and why we moved away from it so we can learn from it.
So... Where are we today?
Today, infrastructure as code and immutability are closely tied together. Immutability brings uniformity and reliability, while IaC brings readability and a source of truth (with history) as code in a repository.
Servers are defined as code, and scripts are used to create VM images, not VMs themselves. Clusters are defined as code in the form of instructions on how to create and manage those VM images and how to tie them together through different services.
Traditional tools like Puppet and Ansible do allow us to manage infrastructure as code, but they were not designed for the cloud with immutability in mind. We can look at them as version 1 of IaC tools.
We could adopt cloud-specific tools like CloudFormation - and while I do like CloudFormation, for the enterprise, Terraform is better suited. Instead of changing tools, you can easily change providers and shop for better pricing, support, and communities. While I do see the value in tools like CloudFormation, and kind of even prefer it in smaller, single-cloud (AWS) businesses, Terraform is the clear and obvious go-to for larger-scale operations.
The benefit of Terraform becomes even more apparent when we realize that many vendors eventually abandon their own single-cloud tools in favor of Terraform.
In the next section we learn all things terraform. Hold on to your butts!
Dad's DevOps Torch
Our predecessors have done the legwork and gotten things right, at least as of writing this article. Now it’s our turn to carry that DevOps torch forward. IaC is a transformative approach to managing infrastructure, so we will be using the most popular tool for the job.
Terraform as IaC
It’s not by accident that Terraform is the de facto standard tool when it comes to IaC. My intention isn’t to create an in-depth side-by-side comparison of every IaC tool; in fact, I’m only going to focus on Terraform. Here is the sales-pitchy pitch:
Why Terraform?
Terraform’s ability to use different providers and manage their resources, combined with templating, makes it a pleasure to work with.
Its system of variables allows us to easily modify aspects of our infrastructure without changing the definitions of our resources.
Above all, when you apply a set of definitions, it converges the actual state into the desired state no matter what - even if that means deleting or creating services and resources.
The terraform plan command also shows us what will happen before we apply the changes. This is powerful stuff.
With our GitOps principles, we can trigger Terraform commands when we make a change to a repository. We can have a pipeline that outputs the plan when we create a pull request; this way we can easily review the changes that would be applied to our infrastructure and then decide, with a merge, whether we want them or not.
If we do merge, another pipeline can apply those changes, triggered by the merge.
This makes Terraform a perfect candidate for a simple yet extremely effective mix of infrastructure-as-code principles, GitOps, and, ultimately, CD (Continuous Delivery) tooling.
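To make that concrete, here is a minimal sketch of those two pipelines as plain shell steps. The Terraform commands are real; the CI wiring around them (triggers, checkout, credentials, how the plan file is carried along) depends on your CI tool and is assumed here.

# Pipeline 1 - runs on every pull request: show what would change
terraform init
terraform plan -out=pr.tfplan      # post this plan output on the PR for review

# Pipeline 2 - runs on merge to the main branch: apply the reviewed changes
terraform init
terraform apply -auto-approve      # or: terraform apply pr.tfplan, if the saved plan artifact is carried over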
Terraform stores the actual state of our infrastructure so that it can compare it against the desired state described by our definitions and the plan. Terraform even allows us to use different backends where that state can be stored: a network drive, a database, or practically any other storage.
Another really nice feature that makes Terraform stand out is its ability to untangle dependencies between different resources. It’s capable of figuring out on its own the order in which resources should be created, modified, or deleted.
Other tools have many of these features; they aren’t unique to Terraform.
The defining difference is the ease with which we can leverage those features and the robust ecosystem around the platform.
How to use Terraform
Go to https://codingwithdrew.com/category/cloud/terraform/ for my in-depth guide on setting up your local environment and using the tool itself. Here, I’m going to take a new, simplified chunk-and-chew approach and break down Terraform specifically for a GCP use case. I’m going to discuss how to craft and think about your Terraform implementation. It’s a very opinionated approach and you may even disagree. Even if you already know Terraform, it’s worth reading.
In the next section I discuss Terraform with Google Cloud Platform (GCP).
Creating and managing Clusters on GCP using Terraform
In this section on IaC, I’d like to create a fully operational Kubernetes cluster, defined in a way that is production ready. If this is your first contact with Terraform, you will come away with the base knowledge to work with it, as well as the process needed to make an informed decision on whether this tool is best for your specific needs.
Even if you are already familiar with Terraform, you might still discover some new things - for example, the proper way to manage its state.
Getting started
Generally speaking, entries in terraform definitions are split into four groups (there are other types).
- Provider
- Resource
- Output
- Variable entries
These 4 are the most important and most commonly used ones.
For now, we will focus on variables.
All the configuration files we’ll use are going to be broken down below; for now, just head over to https://github.com/DrewKarr/devopsdads and poke around.
Let’s look at these files one by one so that we can gradually get used to Terraform’s different capabilities.
Variables
Variables as a concept
I’d like the cluster to be scalable within some limits, so I set the variables min_node_count and max_node_count. I set these first to establish a set of requirements to work toward.
Variables are a way to blueprint what you’d like to accomplish. They define the information that you believe could change over time, and they drive the IaC we define later. They are the best starting point for any Terraform project.
Many of these variables will never change for the life of the cluster/project, but they are great for establishing a templating system for future purposes.
Let’s create our first file called “variables.tf”. (I know in the past I’ve called it “vars.tf”, but I prefer verbosity in naming over brevity.)
Variables.tf
Let’s open the variables.tf, it should look like this:
variable "region" {
type = string
default = "us-east1"
}
variable "project_id" {
type = string
default = "drewlearns-gke"
}
variable "cluster_name" {
type = string
default = "drewlearns-cluster"
}
variable "k8s_version" {
type = string
}
variable "min_node_count" {
type = number
default = 1
}
variable "max_node_count" {
type = number
default = 3
}
variable "machine_type" {
type = string
default = "e2-standard-2"
}
variable "preemptible" {
type = bool
default = true
}
variable "state_bucket" {
type = string
}
Most of the contents of this file are self-explanatory; we are using this syntax:
variable "key" {
type = string
default = "value"
}
We use this to set the variable definitions our Terraform files will utilize.
What matters is that each variable has a type (string, number, bool) and, in most cases, a default.
I’ve provided different types for you to see solid examples, but there are two exceptions: the k8s_version and state_bucket variables don’t have defaults. In the case of k8s_version, that’s because the supported versions will likely have changed between my writing this and your following along. Later on, we will see the effect of not having a default value.
We will also see in the next section how Terraform takes a locally exported environment variable and converts it into a variable in this file. Specifically, the project_id variable’s default will be overwritten.
Setting up our GCP environment and permissions
Now that we have our variables all set up, let’s deal with the prerequisites before we start running terraform.
Even though the goal is to use Terraform for all infrastructure-related tasks, we still need to take care of a few things that are specific to Google Cloud first.
GCP Account and SDK
Let’s make sure you have all the tools you need installed and are registered with GCP.
In your terminal run:
curl https://sdk.cloud.google.com | bash
Follow the terminal instructions.
After that is complete (it’ll take a few) you’ll need to restart your shell:
exec -l $SHELL
Now initialize gcloud:
gcloud init
Head over to https://console.cloud.google.com, make sure you have an account, if not create one now.
Next we need to authenticate into our gcloud account by running this command in our terminal:
gcloud auth login
Follow the on screen instructions until you are authenticated.
GCP Project Creation
We’ll need to create a project and a service account with sufficient permissions. Everything in Google Cloud is organized into “projects”. Time to create a new project. Run the following commands:
export PROJECT_ID=doc-drew-$(date +%Y%m%d%H%M%S)
export TF_VAR_project_id=$PROJECT_ID
gcloud projects create $PROJECT_ID
In the code above you will see the TF_VAR_-prefixed variable we export. This prefix is special: environment variables that start with TF_VAR_ are automatically converted into Terraform variables.
In this case, the value of the environment variable will be used as the value of the Terraform variable project_id, overwriting its default value.
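In other words, exporting a TF_VAR_-prefixed variable is just another way of passing the value on the command line. Both of the following set project_id (a small sketch):

# picked up automatically by every terraform command in this shell
export TF_VAR_project_id=$PROJECT_ID
terraform apply

# ...is equivalent to passing it explicitly:
terraform apply -var "project_id=$PROJECT_ID"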
To be on the safe side, let’s confirm the newly created project is present on our GCP Dashboard.
Run the command:
gcloud projects list | grep $PROJECT_ID
We should see an output:
doc-drew20220302064657 doc-drew20220302064657 329570104953
YAY! We have a project!
Service Accounts
We have a project and we are going to start using it soon but first we have to make sure our project has a service account.
Through a service account, we will be able to identify as a user with sufficient permissions.
Run the following commands:
export SERVICE_ACCOUNT=drewlearns
gcloud iam service-accounts create $SERVICE_ACCOUNT --project $PROJECT_ID --display-name SERVICE_ACCOUNT
You should see an output like this: Created service account [drewlearns].
Let’s confirm this service account was created.
gcloud iam service-accounts list --project $PROJECT_ID
You should see an output like this:
DISPLAY NAME      EMAIL                                                    DISABLED
SERVICE_ACCOUNT   drewlearns@<your-project-id>.iam.gserviceaccount.com    False
What really matters here is the email column: it is the unique identifier we will use when creating keys. We create a key for the service account with yet another gcloud IAM command, using that email.
gcloud iam service-accounts keys create account.json --iam-account $SERVICE_ACCOUNT@$PROJECT_ID.iam.gserviceaccount.com
Since I’m paranoid by nature, we will confirm that the key was actually created. The key is stored in the local file account.json. We should see an output like this:
created key [123456] of type [json] as [account.json] for [drewlearns@<your-project-id>.iam.gserviceaccount.com]
If you are curious to see what it created, feel free to use the following command to view the json output that was added to account.json:
gcloud iam service-accounts keys list --iam-account=$(gcloud iam service-accounts list --project $PROJECT_ID | grep -o "$SERVICE_ACCOUNT@$PROJECT_ID.iam.gserviceaccount.com") --project $PROJECT_ID --format=json
Key Bindings
Having a service account and a key to access it is not much use if we do not assign it sufficient permissions.
For this we use the add-iam-policy-binding command with our member and the role.
gcloud projects add-iam-policy-binding $PROJECT_ID --member serviceAccount:$SERVICE_ACCOUNT@$PROJECT_ID.iam.gserviceaccount.com --role roles/owner
We should never do what we are about to do outside of this exercise: we are manually assigning the owner role to a service account, and owner is far too permissive.
This service account should be allowed to create and manage the resources we need and nothing more. Being an owner makes life easier, but it adds a serious security flaw to our architecture.
Install terraform
In your terminal install terraform using the following command:
brew install terraform
Defining our provider (GCP)
Everything we have done so far with variables and service accounts really serves one purpose: it lets us use the Google provider in Terraform.
In practical terms, it allows us to configure the credentials used to authenticate with GCP, among other things.
Let’s take a look at our provider.tf file.
provider "google" {
credentials = file("account.json")
project = var.project_id
region = var.region
}
Remember how in the last section we set up a service account and generated an account.json file? You can see that we reference that file above; this is how we connect Terraform to our service account.
We never want to hardcode credentials into a .tf file that will be pushed to a repository, unless we just love risk (we don’t).
The project and region values reference the variables.tf file we created. It’s not a good idea to hardcode these either, because they may change. Even if they don’t change, someone else may take our configuration and reuse it, potentially creating resources in the wrong project or region. Therefore, it is a best practice to set these in the variables.tf file.
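Since account.json holds credentials (and, as we’ll see shortly, local state can hold sensitive values too), it’s worth keeping both out of the repository. A minimal .gitignore sketch, assuming these files live next to your .tf files:

# GCP service account key - never commit this
account.json
# local state and Terraform's working directory
terraform.tfstate
terraform.tfstate.backup
.terraform/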
Plugin installation (terraform init)
Most of Terraform’s abilities are provided by plugins, so we need to figure out which plugins suit our needs and download them.
Fortunately for us, Terraform will automatically figure out which plugins to install and where to find them, but to do so we have to run the terraform init command. Run that command now.
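As a side note, you can also pin which provider plugin terraform init downloads by adding a required_providers block. This is optional, and the version constraint below is only an example, not something this guide depends on:

terraform {
  required_providers {
    google = {
      source  = "hashicorp/google"
      version = "~> 4.0" # example constraint - pin whatever version you have tested against
    }
  }
}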
What does terraform apply do?
In Terraform terms, terraform apply means "create or update what’s different, and destroy or delete what isn’t needed anymore."
We are now ready to run the terraform apply command for the definitions we created so far (provider.tf and variables.tf). (When prompted for the variables that have no default, just press return for now.)
The output of the command would lead you to believe that the apply was complete, but we can see that it did not add, change, or destroy anything. That’s to be expected, because we have not yet specified any GCP resources.
State management storage
Terraform maintains internal information about the current state of our infrastructure, which allows it to work out what needs to be done to converge the actual state to the desired state described in our definition files. Currently, that state is stored locally in a file called terraform.tfstate, which was generated automatically when we ran terraform apply.
Currently, we haven’t created any resources for terraform to track so our state file is pretty boring:
{ "version": 4,
"terraform_version": "1.1.6",
"serial": 1,
"lineage": "a8b8278b-bc72-8893-04a8-17bee6dd1c0e",
"outputs": {},
"resources": []
}
Keeping our state file locally is a very bad idea. If it lives only on one machine, teammates cannot modify the state of the resources it tracks. We would need to send them the terraform.tfstate file via email or keep it on a network share or something similar. That isn’t practical. You may even be tempted to store it in git, but that would not be secure. Instead, we will tell Terraform to keep the state in a Google Cloud Storage bucket. This will be the first resource we create.
Since we are trying to define infrastructure as code, we will not do that by executing a shell command. We also won’t want to go to the GCP console - instead we will just tell terraform to create the bucket.
Before we proceed, you must ensure your Google account has billing enabled - otherwise, Google won’t allow you to create storage. New GCP accounts are provided a $300 sign-up credit and 91 days to use it, and you won’t be automatically charged afterwards without confirmation. If you haven’t already, please set that up now.
Run the command below in your terminal
open -a "Google Chrome" https://console.cloud.google.com/storage/browser?project=$PROJECT_ID
Make sure the page does not have a banner at the top asking you to set up billing. If it doesn’t, you are good to go!
Managing our first resource as IaC - Google Cloud Storage Buckets
Google Cloud Storage Buckets are just like AWS S3 buckets if you are familiar. It is an object storage solution that enables you to manage versions, access, encryption, and the like of objects. Objects in this case are files. Go to https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/storage_bucket.html and poke around to learn more about this resource.
When we want to create a new resource for Google Cloud Storage, we will need a file called “storage.tf”. In our current working directory with all our other terraform files, go ahead and create it.
resource "google_storage_bucket" "state" {
name = var.state_bucket
location = var.region
project = var.project_id
storage_class = "NEARLINE"
labels = {
environment = "development"
created-by = "terraform"
owner = "drewlearns"
}
}
You can see we are creating a resource of type "google_storage_bucket", which we have labeled "state". We then define its properties as Terraform key/value pairs.
Similar to before, the values of some of those fields come from variables, while others that are less likely to change are hardcoded.
If you recall, there were two variables in our variables.tf file that we created without default values. One of them was state_bucket, which you can see referenced here as the bucket name. This time, when we run terraform apply, we will want to provide a value at the prompt instead of just mashing the return key as before. Alternatively, you can set a default value in variables.tf, or export the value as an environment variable, as sketched below.
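If you’d rather not type the bucket name at the prompt every time, you can export it the same way we exported the project id. Bucket names must be globally unique; reusing $PROJECT_ID (which is what the example output later in this section appears to do) is one easy option:

# hand the bucket name to Terraform via the TF_VAR_ mechanism
export TF_VAR_state_bucket=$PROJECT_ID
terraform apply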
Run terraform apply and we can see what will be performed:
+ means something will be created
- means something will be destroyed
~ means something will be modified
Some of the values are known because we defined them, while others cannot be known until after Terraform creates the resource, such as IDs and links.
The Plan line tells us how many resources will be added, changed, or destroyed. This is an important block to take note of.
Finally, we will be asked whether we would like to perform these actions. The only response that proceeds is "yes"; anything else stops the run. Essentially, if we exit now, we have only run the equivalent of terraform plan.
After running terraform apply we should see the following:
Terraform used the selected providers to generate the following execution plan. Resource
actions are indicated with the following symbols:
+ create
Terraform will perform the following actions:
# google_storage_bucket.state will be created
+ resource "google_storage_bucket" "state" {
+ force_destroy = false
+ id = (known after apply)
+ labels = {
+ "created-by" = "terraform"
+ "environment" = "development"
+ "owner" = "drewlearns"
}
+ location = "US-EAST1"
+ name = "doc-drew-20220304174003"
+ project = "doc-drew-20220304174003"
+ self_link = (known after apply)
+ storage_class = "NEARLINE"
+ uniform_bucket_level_access = (known after apply)
+ url = (known after apply)
}
Plan: 1 to add, 0 to change, 0 to destroy.
Do you want to perform these actions?
Terraform will perform the actions described above.
Only 'yes' will be accepted to approve.
Enter a value: yes
google_storage_bucket.state: Creating...
google_storage_bucket.state: Creation complete after 1s [id=doc-drew-20220304174003]
Apply complete! Resources: 1 added, 0 changed, 0 destroyed.
Verifying what we have done
Let’s confirm that the bucket was created by listing all those available.
gsutil ls -p $PROJECT_ID
It should return a value like this:
gs://doc-drew-20220304174003/
Over time, we will gain confidence in terraform and we will not need to confirm everything worked correctly but since we just started it’s important we know how to confirm all the things.
Let’s imagine for a moment that someone else executed the terraform apply command and that we are not sure of the state of the resources.
In this case we can ask terraform to show us the state.
terraform show
It will return an output like this:
# google_storage_bucket.state:
resource "google_storage_bucket" "state" {
default_event_based_hold = false
force_destroy = false
id = "doc-drew-20220304174003"
labels = {
"created-by" = "terraform"
"environment" = "development"
"owner" = "drewlearns"
}
location = "US-EAST1"
name = "doc-drew-20220304174003"
project = "doc-drew-20220304174003"
requester_pays = false
self_link = "https://www.googleapis.com/storage/v1/b/doc-drew-20220304174003"
storage_class = "NEARLINE"
uniform_bucket_level_access = false
url = "gs://doc-drew-20220304174003"
}
As you can see, there is just one resource, and it reflects the state of the resources managed by Terraform. This is a human-readable rendering of the state Terraform currently stores.
We can also inspect the state by looking at the terraform.tfstate file, which was used to produce the output above. The main difference between the two is formatting: one is human-readable, the other not so much.
Even though currently, the state file doesn’t contain confidential information, it definitely could. It’s stored locally but we want to be able to store it in a more reliable location with better access controls. To move our state to the bucket, we have to create a GCS (Google Cloud Storage) backend.
To create this backend for terraform to store our state file, we just need to create a backend.tf file.
terraform {
  backend "gcs" {
    bucket      = "doc-drew-20220304174003"
    prefix      = "terraform/state"
    credentials = "account.json"
  }
}
In order for this backend to take effect, we have to re-initialize Terraform. The first time we ran init it was to install plugins; this time it is to migrate the state to the new backend.
Run:
terraform init
Confirm it with a “yes”.
From now on, Terraform will store terraform.tfstate in our GCS bucket instead of locally.
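If you share my paranoia, you can confirm that the state object actually landed in the bucket. With the gcs backend and the prefix we configured, the state for the default workspace should show up under terraform/state/ (substitute your own bucket name):

gsutil ls gs://doc-drew-20220304174003/terraform/state/
# expected something like: gs://doc-drew-20220304174003/terraform/state/default.tfstate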
We should now be able to apply the definitions.
terraform apply
No changes to any of the resources occurred because the only change happened during the terraform init.
Creating our Control Plane
A Kubernetes cluster will almost always consist of a control plane and one or more pools of worker nodes.
We will start with the creation of our control plane and then our nodes after.
We will want to enable the Kubernetes API on our project. Open the API page with the following command:
open -a "Google Chrome" "https://console.developers.google.com/apis/api/container.googleapis.com/overview?project=$PROJECT_ID"
Click the enable button. This is a one time prompt, you will not be required to do it again on a given project.
We need to create a new file called “k8s-control-plane.tf”.
resource "google_container_cluster" "primary" {
name = var.cluster_name
location = var.region
remove_default_node_pool = true
initial_node_count = 1
min_master_version = var.k8s_version
resource_labels = {
environment = "development"
created-by = "terraform"
owner = "drewlearns"
}
}
The meaning of most fields in this file is probably easy to guess, but it may be harder to understand what remove_default_node_pool = true and initial_node_count = 1 do.
If we removed them, the GKE cluster would be created with the default node pool for worker nodes, but we don’t want that. More often than not, we want these nodes created separately from the control plane - that way we have better control over them.
The problem, however, is that a default node pool is mandatory during the creation of the control plane. Given that we need to have it initially even though we don’t want it, the best option is to remove it, and that’s exactly what remove_default_node_pool = true does.
We set initial_node_count to just 1 node (really it’s three, one in each zone of the region).
In any case, we are setting it to the smallest possible value so that we can save a bit of time and keep the cost as low as possible.
There is one more thing we need to do.
This time we will not be able to just press return when prompted for a Kubernetes version; we won’t get away with ignoring that prompt now that we are creating a Kubernetes resource.
How do we know what the right version is?
Instead of guessing, we are going to ask google to output a list of all currently supported versions in our region.
gcloud container get-server-config --region us-east1 --project $PROJECT_ID
Select any “Valid Master Version” except the newest; my recommendation is to choose the second newest.
export K8S_VERSION=1.22.6-gke.300
Let’s apply our change with terraform apply --var k8s_version=$K8S_VERSION. You could update variables.tf and then just run terraform apply, but that would get in the way of the commands we run later, so let’s not do that.
Creating the control plane will take a while...
When it completes, we will see what terraform did. As expected another resource was added and no resources were changed or destroyed.
Terraform Outputs and automating our kubeconfig file
So far we have created variables.tf, provider.tf, backend.tf, storage.tf, and k8s-control-plane.tf, and we have applied all our changes. We created a storage bucket, and we enabled the Kubernetes API, which we used to create our control plane.
Now we will retrieve the nodes of the newly created cluster. Before that, though, we will create a kubeconfig file to give kubectl the information it needs to access the cluster.
We could probably do that right away with gcloud, but why would we make it easy?
Let’s “complicate” things by creating the kubeconfig file automatically. To create one, we need to know the name of the cluster, the region, and the project in which it is running.
We may know that information off the top of our heads, but let’s assume we forgot it or weren’t paying attention. This gives me the opportunity to introduce another Terraform feature.
We can define outputs with the information we need, as long as that information is available in Terraform’s state.
Let’s create output.tf like this:
output "cluster_name" {
value = var.cluster_name
}
output "region" {
value = var.region
}
output "project_id" {
value = var.project_id
}
What you see here is that we are specifying which data should be returned from Terraform. These outputs are printed at the end of the terraform apply process. We will see that in a moment; for now, we are only interested in the outputs so that we can use them to construct the kubeconfig file.
If you want to see all the outputs, we can simply run terraform refresh, passing in our Kubernetes version, using the following command:
terraform refresh --var k8s_version=$K8S_VERSION
The refresh will update the state file with the information about the physical resources that terraform is tracking and more importantly, it shows those outputs. The output of that command should look like this:
Outputs:
cluster_name = "drewlearns-cluster"
project_id = "doc-drew-20220304174003"
region = "us-east1"
We can clearly see the name of the cluster, the project id, and the region, but that’s not quite what we need.
We aren’t really interested in seeing that information so much as using it to construct the command that retrieves the credentials. We can accomplish that with terraform output commands like this:
terraform output cluster_name
This will return "drewlearns-cluster". Now that we know how to retrieve a single value from the outputs, we can automate the creation of the kubeconfig file.
Let’s construct the command to retrieve the credentials. First, we specify that the kubeconfig should be in our current directory by exporting it.
Then we will run a gcloud command to get credentials.
export KUBECONFIG=$PWD/kubeconfig
gcloud container clusters get-credentials $(terraform output cluster_name) \
--project $(terraform output project_id) \
--region $(terraform output region)
The command above will error because terraform output wraps values in quotes, which is not the format gcloud expects.
Add the -raw flag before each output name to avoid a not-so-obvious error saying your project_id isn’t valid.
export KUBECONFIG=$PWD/kubeconfig
gcloud container clusters get-credentials $(terraform output -raw cluster_name) \
--project $(terraform output -raw project_id) \
--region $(terraform output -raw region)
It should return something like this:
Fetching cluster endpoint and auth data.
kubeconfig entry generated for drewlearns-cluster.
Validate the kubeconfig was created by running cat kubeconfig
The only thing left is to give ourselves the proper admin permissions to the cluster.
kubectl create clusterrolebinding cluster-admin-binding \
--clusterrole cluster-admin \
--user $(gcloud config get-value account)
If we run that command we will see an output like this:
clusterrolebinding.rbac.authorization.k8s.io/cluster-admin-binding created
Now we should be able to check the cluster terraform created for us.
What happens if we run kubectl get nodes now? “No resources found”, which is expected: we asked the cluster for its nodes and got none. How could this be?
GKE does not allow us to access the nodes of the control plane.
On the other hand, we did not create worker nodes yet which leads us to our next section.
Creating Worker Nodes
In the last section we created our control plane and connected kubectl to our cluster, but now we need to create worker nodes. We can manage them through the google_container_node_pool resource.
Let’s create another terraform file called k8s-worker-nodes.tf.
resource "google_container_node_pool" "primary_nodes" {
name = var.cluster_name
location = var.region
cluster = "${google_container_cluster.primary.name}"
version = var.k8s_version
initial_node_count = var.min_node_count
node_config {
preemptible = var.preemptible
machine_type = var.machine_type
oauth_scopes = [
"https://www.googleapis.com/auth/cloud-platform"
]
}
autoscaling {
min_node_count = var.min_node_count
max_node_count = var.max_node_count
}
management {
auto_upgrade = false
}
timeouts {
create = "15m"
update = "1h"
}
}
This file is a bit more verbose than the previous Terraform files because we have a lot more to configure than just a name, region, cluster, and version. Aside from those obvious fields, we need to define how our nodes are configured, managed, and scaled, and how long operations may take before timing out.
First, we specify initial_node_count - we already defined this in our variables, so we know we need it.
Next, we configure the nodes: the node_config block defines the properties of the node pool we are about to create.
In our variables.tf file we also defined our min and max node counts; we reference those here in the autoscaling block.
In the management block, we specify that we do not want GKE to upgrade the cluster automatically - automatic upgrades would defy the idea that everything is defined as code.
Just in case the creation or updating of the cluster hangs for whatever reason, we don’t want to wait around forever, so we add a timeouts block.
Now we want to apply these changes.
terraform apply --var k8s_version=$K8S_VERSION
This will take some time to run, so grab a coffee.
The process should have finished with a familiar output - we created a new resource without changing or destroying anything and our terraform outputs are printed to the screen.
Let’s see what happens when we try to retrieve the nodes this time:
kubectl get nodes
We should see the following output:
NAME STATUS ROLES AGE VERSION
gke-drewlearns-clust-drewlearns-clust-62ede7dd-59p3 Ready <none> 14m v1.22.6-gke.300
gke-drewlearns-clust-drewlearns-clust-cba2742b-7mjz Ready <none> 14m v1.22.6-gke.300
gke-drewlearns-clust-drewlearns-clust-ef598e0d-j8t3 Ready <none> 14m v1.22.6-gke.300
What did we do?
We created a cluster using IaC principles with terraform on Google Cloud. This is the moment where we should push our changes to our git repository to make sure the changes we have made are available to our coworkers who might need to change our clusters or the surrounding infrastructure.
For now, pushing this repository to git is out of the scope of this context but we will get to that when we start talking about continuous delivery.
Upgrading our Cluster
Changing any aspect of the resources that we have already created is easy and straightforward.
All we have to do is modify the different definitions and apply the changes. It doesn’t matter if we added, removed, or modified them in any way.
Let’s demonstrate that we can update the kubernetes version. Before we do though, let’s see what version we are using.
kubectl version --output yaml
We can see in the output that I am running Kubernetes version v1.22.6-gke.300.
If you recall, we intentionally didn’t choose the most recent version - that’s because we were going to upgrade it here 🙂.
To upgrade the version, we need to see what versions are available:
gcloud container get-server-config \
--region us-east1 \
--project $PROJECT_ID
We see that 1.22.6-gke.1000 is newer than 1.22.6-gke.300 (yours may be different), so let’s update to it.
export K8S_VERSION=1.22.6-gke.1000
We should avoid passing Terraform configuration as --var arguments the way we have been doing, because it violates the principles of IaC.
Instead, we should declare the value in our variables.tf file, commit the change to git, and run terraform apply (a sketch of what that could look like follows). However, we will keep using the --var method here for simplicity.
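For completeness, doing it “properly” would just mean giving k8s_version a default in variables.tf and committing that change. The version shown below is the one used in this article; use whatever get-server-config returned for you:

variable "k8s_version" {
  type    = string
  default = "1.22.6-gke.1000" # pick a currently supported version for your region
}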
We are not creating resources this time, we are updating the properties of some of the resources.
terraform apply --var k8s_version=$K8S_VERSION
You will see that we are changing the min_master_version of our cluster and the version of our node pool from the former version to the newer one.
All the other properties remain intact. You will also see that there is nothing to add or destroy and that there are 2 resources that will change.
GKE will perform a rolling update: it creates a new node with the new version and destroys a node with the former version, confirming that the new node is healthy before moving on to the next one.
This is our immutable infrastructure in action! We haven’t made any changes to live nodes. We aren’t ssh’ing into anything and making updates, we are simply creating a new resource and removing the one it’s replacing. This process is a bit slower than the initial creation of the resource but it’s safer. This process also eliminates downtime.
It’ll take some time, about 10 minutes per node, so about 30 minutes since we are running one node per zone. Let’s confirm our changes when it’s complete.
kubectl version --output yaml
You should now see the newer version. Yay! We did it, and we did it in a safe way 🙂
Reorganizing Resources
Currently every resource is in a different file. Terraform doesn’t care if we have one file or a hundred; it concatenates all files with the .tf extension at runtime. Let’s demonstrate Terraform’s DGAF factor.
One of the things we mentioned as making Terraform unique is its dependency management. No matter how we organize the definitions of different resources, it will figure out the dependency tree and create or delete resources in the correct order. Because of this, we do not need to concern ourselves with planning what should be created first or with the order in which resources are defined.
Many people prefer to use only three files:
- Variables
- Providers and Resources
- outputs
Let’s try to reorganize things in this fashion.
This is how I’d do it:
cat backend.tf k8s-control-plane.tf k8s-worker-nodes.tf provider.tf storage.tf | tee main.tf
rm -f backend.tf k8s-control-plane.tf k8s-worker-nodes.tf provider.tf storage.tf
We can demonstrate that Terraform does not care how these definitions are organized by running terraform apply --var k8s_version=$K8S_VERSION again and confirming that nothing changes.
Destroying Resources
We are nearly finished with our exploration of Terraform with GCP using GKE as an example. The only thing missing is how to destroy resources.
If we would like to destroy some of the resources, all we would have to do is remove their definitions and run terraform apply again.
In some cases, we might want to destroy everything or almost everything and as you probably have guessed, there is a command for that as well.
terraform destroy --var k8s_version=$K8S_VERSION
This will take some time, and at the end you may see an output stating that Terraform couldn’t delete your storage bucket without force destroy being set. This is OK: we never specified that the bucket may be removed while it still contains files, and that behavior is off by default. That default is handy here, because if we wanted to recreate the cluster we could do so from the Terraform state files the bucket contains. (If you did want everything gone, see the sketch below.)
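If you ever do want terraform destroy to wipe everything, bucket included, the google_storage_bucket resource has a force_destroy attribute you could flip (and apply) before destroying. A sketch, with the labels omitted for brevity:

resource "google_storage_bucket" "state" {
  name          = var.state_bucket
  location      = var.region
  project       = var.project_id
  storage_class = "NEARLINE"
  force_destroy = true # allow deletion even while the bucket still contains objects
}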
The cluster and all the other resources we defined are now gone - except for that state bucket.
Deploying to Staging (non-production permanent environment)
Deploying to permanent environments should be much easier, since there are no temporary environments being created and destroyed, so there shouldn’t be any dynamic values involved.
All we have to do is define another set of values.
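As a rough sketch of what “another set of values” could look like, you might keep a per-environment .tfvars file and pass it at apply time. The file name and numbers below are hypothetical:

# staging.tfvars (hypothetical example)
cluster_name   = "drewlearns-staging"
k8s_version    = "1.22.6-gke.1000"
min_node_count = 1
max_node_count = 2
preemptible    = true

# then apply with:
# terraform apply -var-file=staging.tfvars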
For now, I’m going to skip diving deeper into this section as it’s getting long enough.
Packaging, managing, and deploying applications with Helm
Dad’s Deployments
The way our dads deployed code
In the not-too-distant past, we copied files to a server, mostly because our needs were simple compared to our needs today. We weren’t releasing frequently and we were using bare-metal servers, and not many of them. Things weren’t necessarily easier, though; we also didn’t worry about many of the things we consider a bare minimum now. If we were advanced enough, we might go crazy and create a zip file with all the files our users needed access to. We’d use SSH or SFTP to copy the files to the server and then uncompress them.
This process worked fine many years ago. Since then, we have developed many different ways and formats to package, deploy, and manage our applications.
Brew, Chocolatey, and many other package managers entered the scene because it wasn’t enough to just copy files. We also needed the ability to start, stop, and shut down processes. More specifically, we needed different ways to deploy applications on an operating-system-by-operating-system basis. This drastically increased the complexity of packaging applications.
We would need to build binaries for each operating system, package them in the different native formats for those systems, and then distribute them.
Containerization changed everything.
Docker’s effort to make containerization widely accessible made containers easy to use and raised their popularity. With that popularity, the community brought containerization to operating systems beyond Linux; containers can now run on virtually any operating system. We found one format to rule them all!
All we have to do is build a Docker container image and let people run it through the Docker engine.
It turns out that wasn’t enough. We quickly learned that a container image alone is only a small part of the pie when we need to run an application, especially at scale.
Images contain the processes that run inside containers, but those processes alone don’t do everything we need them to do!
Containers don’t scale by themselves, they don’t enable external communication by themselves, and they don’t attach external volumes by themselves. So how did we resolve this?
Thinking that all we need is a few docker commands to deploy an application is a huge misunderstanding at best.
Carrying Dad's Deployment Torch
Running applications successfully today requires many more capabilities than simply executing a binary the way our dads did.
We need to handle networking, communication, TLS certificates, replication of data sets and applications, scaling, and high availability. We may need to attach external storage, manage environment-specific configuration, inject secrets, and so on. The list of things we need to do for any given deployment can be daunting.
Today, the definition of an application and everything needed to deploy, run, and maintain it is complicated and far removed from a single image we can simply pick and run.
We can choose from a giant list of schedulers like Docker Swarm and a few others but they don’t matter anymore. In the battle of deployment tools, they lost.
The one that matters, at least in 2022, is Kubernetes.
The Kubernetes API is the de facto standard that allows us to define everything we need for an application in an easy-to-read format that supports comments: YAML.
The Kubernetes API alone is still not enough, though. We need to package Kubernetes YAML files in a way that lets them be retrieved and used easily.
We also need templating tools so that the differences between deployments can be defined and modified effortlessly instead of constantly editing endless lines of YAML. This is where Helm comes in.
If you haven’t read my exhaustive guide on kubernetes, I recommend it before taking a deeper dive into this section. You can find it here: https://codingwithdrew.com/drew-learns-kubernetes-an-introduction-to-devops/
In the next section we are going to talk about Helm in detail.
Helm Introduction
Introduction to helm
Every operating system has at least one packaging tool - RPM, Chocolatey, Yarn, Brew, and others. They are all ways to get applications onto your operating system.
We can think of the Kubernetes API as an operating system for clusters, while Linux, macOS, Windows, etc., are operating systems for individual machines.
The scheduler is only one of many features of Kubernetes, and saying that Kubernetes is an operating system for a cluster is a much more precise way to define it. It serves a similar purpose to, say, Linux, except that it operates over a group of servers. As such, it needs a packaging system designed around that purpose.
While there are a lot of tools we can use to package, install, and manage applications in kubernetes, none of them are as widely used as helm.
Helm is the de facto standard of the Kubernetes ecosystem, and for good reason. Helm uses charts to help us define, install, and manage our applications and their surrounding resources.
Helm simplifies the versioning, publishing, and sharing of applications, and it was donated to the Cloud Native Computing Foundation (CNCF), landing it in the same place as other projects that matter in the Kubernetes ecosystem.
Helm is very simple to use. At its core, it is a templating and packaging system for Kubernetes resources. Even though it is much more than that, templating and packaging are Helm’s primary purpose.
It organizes packages into charts. We can create charts from scratch using the Helm CLI with commands like helm create my-application, and we can also convert existing Kubernetes definitions into Helm templates.
Helm uses naming conventions which, once we learn them, simplify creating and maintaining packages as well as navigating packages created by others in the community.
In its simplest form, the bulk of Helm work is about defining variables in values.yaml and injecting them into Kubernetes YAML definitions using entries like {{ template "fullname" }}.
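To make that concrete, here is a minimal sketch of the idea. The value names and image are made up for illustration; a real chart generated by helm create will contain quite a bit more:

# values.yaml
replicaCount: 1
image:
  repository: drewlearns/demo   # made-up image for illustration
  tag: latest

# templates/deployment.yaml (fragment that consumes those values)
spec:
  replicas: {{ .Values.replicaCount }}
  template:
    spec:
      containers:
        - name: demo
          image: "{{ .Values.image.repository }}:{{ .Values.image.tag }}"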
Helm is not only about templating but also about adding dependencies as additional charts. Those charts can be developed, packaged, and maintained by others (usually third parties, such as MongoDB).
This new use case is compelling because it allows us to leverage a collective effort and knowledge base of a wider community.
Another advantage of Helm is that it empowers us to have variations of our applications without changing their definitions. It accomplishes this by letting us define different properties for the application in different environments: we can overwrite default values, which in turn modifies the definitions propagated to the Kubernetes API. This greatly simplifies deploying variations to different namespaces or environments.
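In practice, that overriding usually looks like layering an environment-specific values file (or individual --set flags) on top of the chart’s defaults. The chart path and file names here are assumptions:

# defaults from values.yaml, overridden by the staging-specific file
helm upgrade --install demo ./demo-chart -f values.yaml -f values-staging.yaml --namespace staging

# or override a single value inline
helm upgrade --install demo ./demo-chart --set replicaCount=3 --namespace production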
Once a chart is packaged, it can be stored in a registry such as Github and easily retrieved by others with credentials and permissions to access them.
Another benefit is the ability to roll back releases. This is key to an automated CI/CD pipeline because if we are able to deploy code automatically but need to manually roll back, we have a problem.
Let’s learn the most common applications of Helm by defining a few requirements and see if we can fulfill them with helm.
Considering a Scenario
I’d like to start this discussion with a list of objectives before diving into any specific goals.
This helps us to evaluate whether or not our proposed solution fits our specific needs or if we need to consider a different option.
There is a common theme for all of us: the differences between use cases tend to be small, and they are often details rather than substantial differentiators.
Most of us are deploying applications to multiple environments. We may have some personal development environments, staging, production, etc.
The number of environments may differ from one person’s use case to another, and we may even use different naming conventions for each environment. In any case, some are dynamic while others, like production, are more permanent. What matters is that the application has different needs in different environments.
It would be great if a release in each of the environments was exactly the same. If that were the case, we would have some assurance that what we test in one environment is exactly what runs in another, but as you have probably experienced in real life, this is never the case.
This was technically the promise of containers: every image should be the same no matter where it runs. The tragedy is that this isn’t pragmatic or cost effective. Any two environments are already at odds with that goal. We cannot use the same address to access an application in two different environments like staging and production, and we have to account for that in our deployments.
Even though this difference seems insignificant, it illustrates the problems we will run into with more than one environment, since no two environments will ever be exactly the same.
It would also be too expensive to run the same number of replicas in a lower environment such as development vs production.
So, we tend to keep our applications in different environments as similar as possible since we are being realistic and making tradeoffs that make sense. So what tradeoffs are worth making?
Let’s create a scenario with an application that may not fully fit your own needs but should be close enough to serve as a base for understanding, one that you should be able to extend on your own.
Let’s imagine...
The Plot
We are going to use a demo application that is an API that needs to be accessible from outside of the cluster.
The files can be downloaded from the github repository https://github.com/DrewKarr/devopsdads. It is a GoLang application in the “Demo” folder and it’s pretty basic.
It needs to be scalable even though we might not always run multiple replicas.
It will use a database, which needs to be scalable and is stateful. Since it is stateful, we will need storage.
The Scene
Imagine we have four different environments where the goal is to promote from one environment to another until it reaches production in a zero touch process. No gate keepers beyond PR/merge approvals. You commit your code to the lowest environment and as long as it passes all its automated tests, it will be in production.
The 4 typical environments:
- Preview - A personal development environment where every member of our team would have their own and they would all be temporary.
- QA environment (Also Preview) - A place where we will deploy applications as a result of pull requests. These would also need to be temporary just like the personal development environment. They would be created when Merge Requests (pull requests) are made and destroyed when branches are closed. This is really the Preview environment but the naming of the host will be based on the pull request name.
- Staging - A permanent environment we use to deploy our applications for integration tests.
- Production - This is our permanent environment.
We should be able to explore what the differences should be in how we run our application in each of these environments.
How it should behave in each environment
Preview Environment
When running in a development or “preview” environment the domain of the application needs to be dynamic. For example, if my github name is “drewkarr”, the domain I’d access my preview application would be called drewkarr.example.com this way it would be unique because no two developers have the same github user and it wouldn’t clash with existing applications.
There should be no need to run more than one replica of the application or the database. The preview database shouldn’t be the same as the production one, and I wouldn’t expect it to be the same size since we only need one replica. Given that, there probably isn’t a need for scaling of the database or the application.
There is also no need for persistent storage for the database in the preview environment.
That database would be shut down when I finish working on the application and it should be ok if I always start with the same default seed dataset.
QA Environment
Everything in QA should be the same as the preview environment with one difference: the domain name, which should be set to the name of the PR, like POPS-1201.example.com.
Staging and production environments
The domains should be fixed: staging.example.com for staging, and production should have no prefix (well, maybe www CNAMEd).
The application running in staging should be almost the same as the one running in production. This will give us a higher sense of confidence that what we test in staging is what we will run in production. There is no need for staging to have the same scale as production but should have all the same elements.
The API should have horizontal scaling enabled and the database should have persistent storage in both environments; however, staging should have a fraction of production’s replicas, or at least two.
This will enable us to ensure that horizontal scaling works in both as expected while at the same time we keep our costs low.
The database in production will run as two replicas and that means we should have an equal number in staging so that we can test scaling and database replication before we promote to production.
Overview
Feature | Preview | QA | Staging | Production |
---|---|---|---|---|
Ingress | github-user.example.com | PR-01.example.com | staging.example.com | example.com (www.example.com) |
HPA | False | False | True | True |
App Replicas | 1 | 1 | 2 | 3-6 |
DB persistence | False | False | True | True |
There may be other requirements for your specific use case but for our example, this should demonstrate how helm works and how to make applications behave in different environments.
Now let’s see if we can package, deploy, and manage our application in a way that fulfills those needs.
Setting up our Local Environment
In order to “helmify” our application, we first need to make sure helm is installed.
Run the command:
helm version
If the command is not found, install it with brew install helm. If your version is 2, please update. The Helm 2 CLI is terrible, you’ll tiller-namespace yourself to death.
I also assume you have cloned the repository https://github.com/DrewKarr/devopsdads. Run git pull just in case I’ve made changes to it since you downloaded it previously.
Helm packaging format
Helm uses a packaging format called charts. A chart is a collection of files that describe a related set of kubernetes resources.
Helm relies heavily on naming conventions so charts are created as files laid out in a particular directory tree. Some of the files are going to use predefined names.
Charts can be packaged into versioned archives that can be deployed, but that isn’t our current focus. We’ll explore packaging in depth in another section.
Let’s create a chart.
Creating our first chart
We can create a basic chart through the CLI, for example by running the command: helm create my-app.
Using Helm 3 Demo
Let’s use helm to create an application and take a peek under the hood:
helm create my-app
ls -l my-app
It will output something like this:
Chart.yaml
charts
templates
values.yaml
Look through the various files it created for an idea of how to construct a helm application. For now, this sample app is a bit too simple for our purposes.
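Before removing it, you can render the chart locally and see the kubernetes manifests helm would generate for it; this should work as-is for the chart we just created:
helm template my-app ./my-app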
Let’s remove that application with rm -frd my-app
and look at a better example.
The chart we will use is located in the https://github.com/DrewKarr/DevopsDad demo-app directory.
The files in the directories are the same as the original files that we created. There is however an additional requirements.yaml file but we will discuss that in a while.
For the most part, we want to keep charts close to our applications in their individual repositories. We are breaking that rule here to keep this tutorial simple.
Let’s take a look at our Demo-app/Chart.yaml
Chart.yaml
apiVersion: v1
description: A Helm Chart
name: demo-app
version: 0.0.1
appVersion: 0.0.1
This file contains meta information about the chart and is mostly for internal use. It does not define any kubernetes resources.
The API version is set to v1. We could set it to v2, which would mean our chart is only compatible with Helm 3; v1 keeps it backwards compatible. For simplicity, we will leave it at v1.
The version field is mandatory and it defines the version of the chart however the appVersion is optional.
What matters is that both should use semantic versioning.
/Templates
This directory contains
NOTES.txt
_helpers.tpl
deployment.yaml
hpa.yaml
ingress.yaml
service.yaml
NOTES.txt provides templated usage information that will be displayed when we deploy the chart.
_helpers.tpl defines template partials, reusable code that can be used in templates. Think EJS or PHP partials.
The rest of the files define templates that will be converted to kubernetes definitions. Unlike the other helm files, they can be named any way you like. Since you are already familiar with kubernetes, you should be able to guess what is inside those files from their names.
Let’s take a look at the deployment.yaml file.
Helm templating is a combination of the Go template language and the Sprig template library.
Most of the time, the definitions you use and define will rely on a small range of template syntax.
Most of the values inside double mustaches start with .Values. For example, a probe path entry might look like path: {{ .Values.probePath }}.
The value of probePath is defined in the values.yaml file by default, though it does not necessarily have to be defined there.
If we look in values.yaml and try to locate probePath, we see that it is set to /.
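For context, probePath is typically consumed by the liveness and readiness probes in the deployment template. A fragment along these lines illustrates the idea (the port is a placeholder, not taken from the chart):
livenessProbe:
  httpGet:
    path: {{ .Values.probePath }}
    port: 8080
readinessProbe:
  httpGet:
    path: {{ .Values.probePath }}
    port: 8080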
We already said that the ingress host should be different based on the environment of the deployment. Let’s see how we accomplish this in the templates/ingress.yaml file.
In the - host: {{ .Values.ingress.host }} line, we are referencing a Value. Let’s locate it in the root values.yaml file.
We can see that there is an ingress.host value set to demo-app.example.com.
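In values.yaml, that nesting looks like this:
ingress:
  host: demo-app.example.com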
This is the pattern most people prefer.
More often than not, the defaults in values.yaml represent the application as it runs in production. That gives me easy insight into the production state of the application while keeping the option to overwrite those values for other environments.
Another goal we set for ourselves is to enable or disable horizontal pod auto scaling depending on the environment we deploy to.
For control over scaling, we want to make sure that HPA can be switched on or off, and we said we need a minimum of 2 replicas in staging and between 3 and 6 in production.
We accomplish this with minReplicas and maxReplicas set to {{ .Values.hpa.minReplicas }} and {{ .Values.hpa.maxReplicas }}.
We can confirm those values are defined by looking at the values.yaml file. The values there should be set to what we expect to have in production and as we can see, if we do not overwrite the default values, HPA will be set to true and the min/maxReplicas will be set to 3/6.
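Based on that, the relevant defaults in values.yaml should look roughly like this (key names taken from the templates and overrides used throughout this chapter):
hpa:
  enabled: true
  minReplicas: 3
  maxReplicas: 6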
There is one more thing we need to explore before we deploy our application. We need to figure out how to add the database as a dependency.
Adding application dependencies
As you already saw, Helm allows us to define templates. This works great for our own applications, but it’s probably not the best idea for third-party applications. Our application requires a database; to be more precise, it needs MongoDB.
Running MongoDB is not only about creating a kubernetes stateful set or a service; it’s much more than that.
We might want to have a deployment when it is running as a single replica and a stateful set when it is running in a replica set mode.
We may need an operator that would join replicas into a replica set.
We might need to set up auto scaling.
We might need different storage options and we might need to be able to choose not to use storage options at all.
There are many other options that we might need to define.
We could even go on hard mode and create a custom operator that ties different resources and makes sure it is working as expected.
The real question is: “Is it a worthwhile investment to define everything we need to deploy and manage MongoDB ourselves?” The answer is typically no.
Whenever possible we should focus on things that bring differentiating value. In most cases, that means we should focus mostly on developing, releasing, and running our own applications and use community tools for everything else. MongoDB is not an exception.
Helm contains a massive library of charts maintained by vendors and the community, leverage them!
Requirements.yaml
Let’s look at the requirements.yaml file. This file is optional, but if we have it we can use it to define one or more dependencies.
dependencies:
- name: mongodb
  alias: demo-app-db
  version: 7.13.0
  repository: https://charts.bitnami.com/bitnami
This file alone might result in conflicts if multiple applications running in the same namespace use the same chart as a dependency. To avoid this issue, we specify the alias that is a unique identifier that should be used when deploying the chart instead of the name of the chart.
We also see the dependency is defined in a particular registry https://charts.bitnami.com/bitnami
This is a super easy way to add any 3rd party application as a dependency.
We could use the same mechanism throughout our entire application but how do we know which values to add?
Let’s take a few steps back and consider the process that led us to creating a requirements.yaml.
The first step is usually to search for the chart on google or through the helm search command, but in order to do that we have to add the repo to our helm CLI.
helm repo add stable https://charts.helm.sh/stable
Now we can search using helm search.
helm search repo mongodb
Unfortunately all of them are deprecated per the descriptions. If we read the information - we find that it’s actually maintained by bitnami and provides instructions on how to install it.
Let’s learn how to add and use additional repositories:
helm repo add bitnami https://charts.bitnami.com/bitnami
helm repo list
helm search repo mongodb
We can see that mongodb’s current version as of writing this is now 7.13.0 and this is the version we have set in our requirements.yaml.
Let’s explore which values we can use to customize our deployment of mongodb.
helm show values bitnami/mongodb
What we find is that replicaSet.enabled is set to false by default. All we have to do is overwrite that value and set it to true in our values.yaml file.
The values related to the mongodb dependency are prefixed with demo-app-db, which matches the alias we set before. Within that block, we set replicaSet.enabled to true.
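In our chart’s values.yaml, that override looks roughly like this:
demo-app-db:
  replicaSet:
    enabled: true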
As you will see soon, when we deploy the application, with the dependency, the database will be replicated.
We can check to make sure the production HPA (horizontal pod autoscaler) has the requisite 3-6 pods.
kubectl -n production get hpa
We can also check if the database is able to scale and if it shows persistent volumes.
kubectl get persistentvolumes
From the 2 persistent volumes, we can conclude that each database replica is attached to its own volume.
Let’s move to development and pull request environments and see whether we can deploy our application there with different requirements/properties.
Deploying Our Application to production
It makes more sense to deploy to Preview first doesn’t it? From the SDLC perspective it certainly does.
We will start with production because it is the default: it’s the easiest configuration, with no tweaks needed for preview, qa, or staging.
We want to keep costs low, so we will use namespaces to keep our deployments separated.
The only substantial difference would be in the kubeconfig file which would need to point to the desired kube API.
We will start by creating the production namespace.
kubectl create namespace production
Let’s assume we don’t know whether the application has any dependencies and look them up:
helm dependency list demo-app
We can see that mongodb version 7.13.0 from https://charts.bitnami.com/bitnami is needed but missing.
We have to download the missing dependency before we try to apply the chart.
We can do that by updating all the dependencies.
helm dependency update demo-app
We can then re-list the dependencies and confirm that it’s no longer missing.
This isn’t the ideal workflow for doing this.
Run this instead:
helm -n production upgrade --install demo-app demo-app --wait
We can list the ingresses by running
kubectl -n production get ingresses
We can also curl our application.
curl -H "Host: demo-app.example.com" "https://$INGRESS_HOST"
Deploying to “Preview”
Let’s learn how to overwrite our default template values for production.
Things get a little more complicated next.
The challenge is that the requirements for the application are not the same across our environments.
The host name should be dynamic, given that every developer should have a different environment and there may be any number of open pull requests. So, the host of the application also needs to be generated dynamically, and each name must be unique.
We also need to disable HPA (auto scaling) and have a single replica of both the application and the database dependency.
We also don’t want persistent storage for a temporary environment that will exist for anything from minutes to weeks.
The point of having different hosts is to allow multiple releases of the application to happen in parallel.
The rest of the requirements are focused on keeping our costs low.
There is no need to run a production-size application in development environments or we would blow our whole hosting budget on development. Can you imagine a world where you were told to reduce the number of development environments to cut costs? It happens, usually as a result of development environment budgets being digested by applications that are too large and a mismatch to their actual needs. We don’t want either.
Getting started with overwriting production values in our preview environment
We will start creating our preview environment by defining a variable with our github user name. Given that each user is unique, there should never be a case where multiple environments conflict. This will allow us to create a namespace without worrying about it clashing with someone else’s. We will use it to generate a unique host.
export GH_USER=drewkarr #obviously change this to your user
kubectl create namespace $GH_USER
cat demo-app/values.yaml
We are going to have to update a few variables in this file - or rather overwrite them.
We don’t want to use a specific tag of an image but rather the latest one, which we will be building whenever we want to see the results of our changes to the code.
To avoid collisions with others, we may also want to change the repository. We won’t do this because it will add unnecessary complexity to our demo but definitely something to consider when doing this for real.
We should use Always for our pullPolicy. Otherwise we would need to use a different tag every time we build a new image.
We need to define a unique host and disable HPA.
Finally, we need to disable database replica sets and persistence.
Let’s see how we overwrite it.
We have a separate values yaml file that can be used by anyone in need of personal development. This file will live inside the Charts/preview directory.
image:
  tag: latest
  pullPolicy: Always
hpa:
  enabled: false
demo-app-db:
  replicaSet:
    enabled: false
  persistence:
    enabled: false
We will use these values to overwrite the defaults in our Charts/values.yaml.
The only value that is missing is ingress.host. We cannot define it in the file since it will differ from person to person. Instead we will assign it to an environment variable and use it to set the value at runtime.
export ADDRESS=$GH_USER.demo-app.example.com
Now we are ready to create the resources in the newly created namespace and ingress.host value
helm --namespace $GH_USER \
upgrade \
--install \
--values preview/values.yaml \
--set ingress.host=$ADDRESS \
demo-app demo-app \
--wait
kubectl --namespace $GH_USER get ingresses
We use the namespace argument to ensure that the resources are created inside the correct place.
We also added new arguments to the mix.
The --values argument specifies the path to our preview values file, while --set overrides the ingress.host value with the dynamically generated address.
Now that we have our environment up and running, we should make sure it stays temporary by deleting the resources when we are done with them. Since it is so easy and quick to deploy anything we need, there is no reason to keep things running longer than necessary and drive up costs.
helm --namespace $GH_USER delete demo-app
kubectl delete namespace $GH_USER
Let’s move forward and see what we have to do to run the application in permanent non-production environments (staging).
Packaging and Deploying
In order for us to follow the gitops principles, each release should result in a changed tag and version of the chart.
We also need to package the chart in a way that it can be distributed to our team.
Let’s start by creating a new branch named after a fictitious Jira ticket “eng-001”.
git checkout -b eng-001-feature
Now imagine you spent some time writing code and tests, and validated that this new feature works as expected in the preview environment. Let’s also imagine we decided to make a release out of this feature.
The next thing we will want to do is change the version of the chart and the application. This information is stored in Chart.yaml.
Let’s take another look at our charts/demo-app/Chart.yaml file.
apiVersion: v1
description: A Helm chart
name: demo-app
version: 0.0.1
appVersion: 0.0.1
Since we are devops engineers, we don’t want to manually update this, instead we want to automate this process:
cat Charts/demo-app/Chart.yaml \
| sed -e "s@version: 0.0.1@version: 0.0.2@g" \
| sed -e "s@appVersion: 0.0.1@appVersion: 0.0.2@g" \
| tee Charts/demo-app/Chart.yaml
Now that we replaced the versions, we should focus on the tag that we want to use in the charts/demo-app/values.yaml file.
We will want to replace that value with yet another sed command:
cat Charts/demo-app/values.yaml \
| sed -e "s@tag: 0.0.1@tag: 0.0.2@g" \
| tee Charts/demo-app/values.yaml
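If you want to double check the result of those sed commands, a quick grep does the job:
grep -E "version|tag" Charts/demo-app/Chart.yaml Charts/demo-app/values.yaml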
That’s it. Those are all the modifications we have to do.
Normally we would have a bunch more steps in a production pipeline like rebuilding and deploying our image but that’s beyond the scope of this helm demo-app.
The application is currently built and ready for us to use, but before we commit to making a new release based on this chart, we should validate that the syntax we are using is correct.
helm lint demo-app
Based on the output of that command we can determine if our chart is without errors.
Now that the chart appears to be defined correctly, we can package it.
helm package demo-app # optionally add --sign to sign the chart
We can see from the output of the command that the package was saved successfully and the location of the packaged file.
~/dev/drewlearns/helm/demo-app-0.0.2.tgz
From now on we can use that file to apply the chart instead of pointing to a directory.
helm --namespace production upgrade --install \
demo-app demo-app-0.0.2.tgz \
--wait \
--timeout 10m
Right now you might be wondering what the advantage of packaging the chart is. It’s just as easy to reference a directory as it is a tgz file. That would be true if you only ever looked for charts locally, but in the real world we will be pulling charts from a shared artifact repository. These repositories can be purpose-built to store helm charts, for example https://chartmuseum.com, or a more generic repository like https://jfrog.com/artifactory/. There are many to choose from. You can explore these more later.
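As a rough sketch of that workflow, with a hypothetical repository URL, consuming a packaged chart from a shared repo is just another repo add and install:
helm repo add myrepo https://charts.example.com
helm repo update
helm --namespace production upgrade --install demo-app myrepo/demo-app --version 0.0.2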
Let’s check that the application was upgraded.
helm --namespace production list
We should see that the app version is 0.0.2 and that production is now running revision 2.
We can then test that our new release works by running a simple curl request.
curl -H "Host: demo-app.example.com" "http://$INGRESS_HOST"
The output should state the version.
Rollbacks
Let’s change the direction of releases. Let’s assume our 0.0.2 release broke things in production and we need to roll back to version 0.0.1. Can we easily roll back since there is a problem we couldn’t detect before applying?
Let’s look at our releases:
helm --namespace production history demo-app
We can see that we have 2 releases.
Revision number 1 was an install and revision 2 was an upgrade from the original code.
We can roll back to a specific revision or we can go to the previous release.
Let’s rollback to the previous release:
helm --namespace production rollback demo-app
The response should clearly state that the rollback was a success. Happy Helming!
Let’s take another look at the history.
helm --namespace production history demo-app
We now see that there is a new entry with the description “Rollback”.
Instead of driving backwards, helm continues forwards: the rollback is recorded as a new revision, keeping the history intact.
Things we didn’t do correctly
We could have done things differently in this example, but we focused on moving forward and not getting too stuck on tangential subjects like private keys.
Here we take a moment to make a few notes of things that will prove to be useful in the future when you use helm.
Upgrading the release directly in production
We went straight into a release without going through preview, staging, or any other environment. This isn’t best practice. We took a shortcut to stay on topic, but normally we would take the long route to ensure we develop within the rails of our SDLC.
When we are working on real projects, we will never commit directly to production. We should follow the SDLC process established within our organization, whatever it may be.
If the process isn’t right, make some suggestions. Change it, instead of skipping steps.
Changes to clusters directly
We should never apply changes to clusters directly.
Instead, we should be pushing them to git and applying all of those changes that were reviewed and merged into the main branch.
Charts
In terms of charts, we should keep the charts close to the applications, preferably in the same repository as the code.
Environment specific values and dependencies should be in the repositories that define those environments no matter their namespaces or clusters.
Helm Commands
We should not run helm commands at all. Except maybe during early development. They should all be automated through whichever continuous delivery tool (Argo CD for example) that you’re using.
The commands are simple and straightforward so you should not have any problems extending your pipelines to include helm.
Destroying Changes
We won’t need anything we created here in future branches.
Let’s stash all our changes and delete our namespaces.
git stash
git checkout main
git branch -d eng-001-feature
kubectl delete namespace staging
kubectl delete namespace preview
kubectl delete namespace production
Let’s destroy the cluster.
Alternatives
Kustomize
Kustomize is a template-free declarative approach to Kubernetes configuration management and customization.
Kustomize and Helm serve the same primary function. Both allow us to define applications in a more flexible way than using only Kubernetes manifests. However, the way Helm solves the problem is quite different from the approach adopted with Kustomize. Which one is better? Which one should you choose?
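To give a feel for the difference, a Kustomize overlay is plain yaml layered on top of a base, with no templating. This is a minimal, hypothetical example unrelated to our demo-app chart:
# overlays/staging/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
namespace: staging
resources:
- ../../base
images:
- name: demo-app
  newTag: 0.0.2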
Ketch
Ketch makes it easy to deploy and manage applications on Kubernetes through a CLI. It removes the need to deal with Kubernetes complexity. It even removes the need to write YAML files.
https://learn.theketch.io/docs
Portainer
Portainer is the definitive open source container management tool for Kubernetes, Docker, Docker Swarm and Azure ACI. It allows anyone to deploy and manage containers without the need to write code.
DevOps That Would Make Dad Proud
Introduction to Argo CD
Almost everything we do today ends up with the deployment of an application or a suite of applications. Today, that looks like a pretty stinkin easy task.
Virtual machines and cloud computing simplified the process of deployment, and kubernetes brought it into the present.
All we have to do is execute kubectl apply
and Whammy, the application is running together with all the associated resources.
If the application needs storage, it is mounted. If it needs to be used by other applications, it is discoverable. Metrics are exposed, logs are shipped to their appropriate endpoints, and applications have become scalable. They can roll back automatically if there are potential issues, and they are deployed without any downtime.
We needed some sort of template so we got helm and the like to customize our deployments based on their environments.
We have managed to get to the point of executing the deployment commands from continuous delivery pipelines.
We have advanced a lot from dad’s devops even in just the last few years. It’s rather exhausting to write about but the good news is, I think we are pretty much there. We now have the ability to build fully automated zero touch pipelines.
To our Dad’s, today’s processes and pipelines would look like PFM (Pure F***ing Magic).
I wish I could say this will be the end all be all guide on CD but the industry is still moving forward and in a constant state of continuous improvement. It’s changing the way we manage and deploy our applications at the speed of light.
I feel that the basis for this topic is somewhat specific to a tool but the principles are timeless... er... at least for now.
🤖 Reality check - Machines should be taking over
As DevOps Engineers, we should not be running commands ourselves, ever. Everything should be automated.
No one is supposed to ever deploy anything by executing kubectl apply
or upgrade or literally any other command. Our CD tools should be the ones doing that, especially in production. Production processes are supposed to be run by the machines, not by us plebs.
The above statements may lead you to think that we should define the commands and the steps in our continuous deployment and that those command should use definitions stored in git repositories...
That is better than the way our dad’s did it and it is compliant with gitops principles.
We are and should be trying to establish git as a boundary between tasks performed by humans and tasks performed by robots (computers).
So, if we are trying to establish “Git” as a boundary between the tasks performed by us and machines, then you should never run commands or even write them.
We are supposed to write code including declarative definitions (usually in the form of yaml files) of the state of our infrastructure and applications.
We define the desired state, and then we store it in git. That is where our job ends and machines take over.
Once we push something to git, notifications are sent to tools that initiate a set of processes executed by machines without our involvement.
Those notifications are generally sent to CD tools like Jenkins, Spinnaker, or Codefresh.
Who/what should have access to production clusters?
Letting git send web hooks to notify production that there is a change is not secure. It means that git needs to have access to production clusters. There is still room for improvements!
If a CD tool runs inside the production clusters and git can communicate with it, then others could do the same. Therefore, our cluster is exposed.
You could say that your CD tool isn’t running inside the production cluster, which would be a better option - it could be somewhere else. You might even be using a CD-as-a-service solution, but that is nearly the same as sending a webhook. Something still has to have access to the cluster or to an API of the application that runs inside of it.
What if I told you that you could configure clusters so that no one has access to them, and we could still deploy applications to them as frequently as we want?
What if neither git nor any platform nor any CD tools should be able to reach your production clusters?
What if I said that nobody, including you should be able to access it beyond, perhaps, the initial setup?
Wouldn’t that be ideal?
If you have been in this industry for a while, you probably remember or even work at a place currently where almost no one has access to production clusters aside from a privileged few (cursed?). You had to open Jira issues and write documentation and justifications and open service now requests and so on... just so that one of the cursed few would do it.
In those times or work places, saying “few people and tools should have access to the cluster” probably evokes nightmares that keep you up at night. It may even be the reason you say you are “living the dream” (nightmares are dreams too).
We should not go back to this archaic methodology. Instead we should be working towards a policy of NOTHING and NO-ONE should have access to production clusters.
Does that sound worse? It probably does. It definitely sounds like a terrible idea at first take. It even sounds like it’s not doable, yet.
And yet, that is precisely what we are supposed to do based on gitops principles. Deployments should be defined through a declarative format. We should be defining the desired state and store it in git but there is so much more to deployment than defining state.
We also need to be able to describe the state of all environments no matter whether those are namespaces, clusters, or the like. Environments are the backbone of deployments. So let’s spend a few minutes defining what they are and what we need to manage them effectively.
Defining Deployments and Environments
An environment can be a namespace, a whole cluster, or even a federation of clusters. That doesn’t really help your understanding though, does it?
An environment is a collection of application and their associated resources. Production can be considered an environment and it can be located inside a single namespace.
It could span the whole cluster or it could be distributed across a fleet of clusters.
The scope of what production is, differs from one organization to another.
What matters is that it’s a collection of applications. Preview, dev, staging, qa, test, cap, pre-prod, prod, etc. The name doesn’t matter; what does is that each is one or more logically grouped applications that we manage as an environment.
To manage environments we need to be able to define a few things:
- Permissions
- Restrictions
- Limitations & similar policies
- Constraints
We also need to be able to deploy and manage applications individually and as a group.
If we were to make a new release of a single independently deployable application, like a microservice, we need to be able to update the state of an environment to reflect the desire to have a new release of that application in it.
We also know that an environment is an entity by itself and has no connections with other “environments”.
It will fulfill expectations when all the pieces are put together.
This could be a frontend application that communicates with several APIs, databases, backends, and storage. It could be many different things.
What matters is that, within the environment, a whole infrastructure-application puzzle is assembled and fulfills its purpose.
WHY DOES THIS MATTER?
It matters because we need to be able to manage an environment as a whole as well.
Since we are focused on gitops principles, we also need to have different sources of truth. The most obvious one is an application: a repository that contains its code, build scripts, deployment definitions, etc.
This allows the team that is in charge of that application to be in full control and develop it effectively.
Environments also have a desired state that needs to be stored in git. Dad’s tend to use separate repositories for this. Those repositories usually contain manifests; however, they are not the manifests you are used to, since each application repository already contains the manifest of that application. It would be “not so smart” to copy those definitions into environment-specific repositories. Gross.
Instead environment repositories often contain references to all of the manifest of individual applications and environment specific parameters applied to those apps.
An application running in production will certainly not be accessible through the same address as the same app running in staging. At least I hope it isn’t. Gross.
Scaling patterns might be different for a given environment as well, we can observe lots of differentiators between environments.
Since everything is defined as code, and the goal is to have the desired state of everything - we must have a mechanism to define those differences.
As a result, people have environment specific repositories that contain the references to manifests of individual applications combined with all the things that are unique to those environments.
As a general rule, we can distinguish the source of truth being split into two groups which are
- Individual applications
- Environments
Each group represents different types of the desired state.
It is a logical separation.
How we apply that separation technically is an entirely different issue.
You may have 1 repo for each app and 1 repo for each environment.
We may split environments into branches of the same repo, or into directories inside the main branch. We may even make bad decisions like going with a mono-repo.
My preference (BEFORE LEARNING THIS)
I like to keep each environment in a branch of the same name. Each branch is in the same repository, and merges are only allowed from one branch to the next, in order from lowest environment to highest, for each application.
Each branch that is named after its environment is treated as a source of truth for that environment.
I believe this provides more decoupling and allows teams to work with more independence.
My personal preferences are just that though and shouldn’t prevent you from producing your gitops the way you think they would work best for your need... even if I still think my approach is best.
My mentor’s preference (and I agree)
We should have a repo for each environment and each application. I personally was very against this, though I had never seen it done. It just sounds like a headache and a cause for environment/application drift from one repo to the next. We will learn a lot more about application manifests and environment manifests in a bit.
In comparison to the branch approach, this actually provides more decoupling for teams to work with more independence than other approaches.
What really matters...
Truly though, the only thing that matters is that everything is code, because code in a repository has history and is a source of truth. It is our desired state. Our job is to push code changes to git, and it’s up to machines to figure out how to converge the actual state with the desired state.
Dad’s Dream CD tool
In this section we will try to figure out how to apply gitops principles in their purest form.
We will focus exclusively on defining the desired state in git and letting the processes inside the cluster figure out what to do.
We will do that without any communication from the outside world except for the initial setup. That is at least the intention.
We know how to deploy using commands, but as we have discussed in this section - we shouldn’t do that.
We will try to set up a system in which no one will ever execute kubectl apply, helm install, or any other similar command. At least not after initial setup.
When I say no one, I mean precisely that.
No one will be able to install or update an application directly.
So how do we do that?
I bet you are thinking we just define those same commands we said no one should run in our CD pipelines for a robot to run. Or maybe you are thinking that we would run one of those commands inside the cluster itself, or that we will let them control remote agents.
Welp, you are wrong.
We won’t run those commands ourselves or tell a dumb tool to do it.
I will introduce you to a new tool and a new way of looking at the world and You. Will. Love. It.
You will ask yourself (or your bosses), why didn’t I do it this way years ago!?
In the next section we will discuss Argo CD.
Argo CD has entered the chat
In an effort to prevent you from having false expectations, this may come off as a scathing review, but it’s honest.
Continuous Delivery is not what some people are trying to sell you.
Argo CD describes itself as a declarative, gitops continuous delivery tool for cloud native kubernetes.
Argo CD is not a CD tool.
This description is unfortunately dead wrong and slightly misleading, IMHO.
A lot more is needed for it to be able to claim that it enables continuous delivery for an SDLC.
Continuous Delivery is the ability to deploy changes to production as fast as possible without sacrificing quality or security.
You may think that any tool that can deploy changes to production is a CD tool, but you’d be wrong for thinking that.
Continuous delivery is about automating all the steps from a push of code to a repository all the way until your code is deployed into production.
Argo CD does not fit this description.
So what is Argo CD if not a CD tool?
Let me give you a very different explanation of what Argo CD is. Argo CD is a declarative gitops deployment tool for kubernetes.
Argo CD is a tool that helps us forget the existence of kubectl apply, helm install, and similar commands.
It is a mechanism that allows us to focus on defining the desired state of our environments and then pushing those definition into a git repository. It is up to Argo CD to figure out how to converge our desired state and actual state.
That sounds like the same thing with extra words...
Ok, you! Look, Argo CD is one of the best tools today to deploy applications inside of kubernetes clusters, hands down.
I still won’t say it’s a CD tool. Don’t let my initial comments turn you off from it, though.
It is based on gitops principles and it is a perfect fit when combined with other tools like Circle CI.
But it’s still not a CD tool.
It provides all the building blocks we might need if we would like to adopt gitops principles for deployments and inject them inside of our SDLC (Software Development Life Cycle - or application lifecycle management).
Let’s discuss it.
Installing and configuring Argo CD
As you can probably guess, we are going to run Argo CD inside a Kubernetes cluster. To do this, we will need a kubernetes cluster with an NGINX ingress controller. We also need the ingress address stored in an environment variable, INGRESS_HOST.
Er’y body be clusterin... clusterin 🎶
First let’s create a new project directory called “argo” and create our three terraform files and one yaml (https://github.com/DrewKarr/DevopsDad/blob/main/ingress-nginx.yaml) inside of it:
resource "random_string" "main" {
  length  = 16
  special = false
  upper   = false
}

data "google_billing_account" "main" {
  display_name = "My Billing Account"
  open         = true
}

resource "google_project" "main" {
  name            = "drewlearns"
  project_id      = var.project_id != "" ? var.project_id : "doc-${random_string.main.result}"
  billing_account = var.billing_account_id != "" ? var.billing_account_id : data.google_billing_account.main.id
}

resource "google_project_service" "container" {
  project = google_project.main.project_id
  service = "container.googleapis.com"
}

resource "google_container_cluster" "primary" {
  name                     = var.cluster_name
  project                  = google_project.main.project_id
  location                 = var.region
  min_master_version       = var.k8s_version
  remove_default_node_pool = true
  initial_node_count       = 1
  depends_on = [
    google_project_service.container
  ]
}

resource "google_container_node_pool" "primary_nodes" {
  name               = var.cluster_name
  project            = google_project.main.project_id
  location           = var.region
  cluster            = google_container_cluster.primary.name
  version            = var.k8s_version
  initial_node_count = var.min_node_count

  node_config {
    preemptible  = var.preemptible
    machine_type = var.machine_type
    image_type   = var.image_type
    oauth_scopes = [
      "https://www.googleapis.com/auth/cloud-platform"
    ]
  }

  autoscaling {
    min_node_count = var.min_node_count
    max_node_count = var.max_node_count
  }

  management {
    auto_upgrade = false
  }

  timeouts {
    create = "15m"
    update = "1h"
  }
}

resource "null_resource" "kubeconfig" {
  provisioner "local-exec" {
    command = "KUBECONFIG=$PWD/kubeconfig gcloud container clusters get-credentials ${var.cluster_name} --project ${google_project.main.project_id} --region ${var.region}"
  }
  depends_on = [
    google_container_cluster.primary,
  ]
}

resource "null_resource" "destroy-kubeconfig" {
  provisioner "local-exec" {
    when    = destroy
    command = "rm -f $PWD/kubeconfig"
  }
}

resource "null_resource" "ingress-nginx" {
  count = var.ingress_nginx == true ? 1 : 0
  provisioner "local-exec" {
    command = "KUBECONFIG=$PWD/kubeconfig kubectl apply --filename https://raw.githubusercontent.com/kubernetes/ingress-nginx/controller-v1.1.0/deploy/static/provider/cloud/deploy.yaml"
  }
  depends_on = [
    null_resource.kubeconfig,
  ]
}

output "cluster_name" {
  value = var.cluster_name
}

output "region" {
  value = var.region
}

output "project_id" {
  value = google_project.main.project_id
}

variable "region" {
  type    = string
  default = "us-east1"
}

variable "project_id" {
  type = string
}

variable "cluster_name" {
  type    = string
  default = "drewlearns-03-06-"
}

variable "min_node_count" {
  type    = number
  default = 1
}

variable "max_node_count" {
  type    = number
  default = 3
}

variable "machine_type" {
  type    = string
  default = "e2-standard-2"
}

variable "image_type" {
  type    = string
  default = "cos_containerd"
}

variable "preemptible" {
  type    = bool
  default = true
}

variable "billing_account_id" {
  type    = string
  default = ""
}

variable "k8s_version" {
  type = string
}

variable "ingress_nginx" {
  type    = bool
  default = false
}
Now we want to run the following commands to install argo cli locally and to build our cluster. This may be the last application you ever install on a cluster by executing commands from a terminal! You may even decide to remove helm from your laptop after we are done installing argo on our cluster.
#!/bin/bash
set -ve
# Install Argo CLI locally
brew install argo
brew tap argoproj/tap
brew install argoproj/tap/argocd
gcloud auth application-default login
# gcloud container get-server-config \
# --region us-east1 # produces list of k8s_versions (optional)
export K8S_VERSION=1.22.6-gke.1000
export KUBECONFIG=$PWD/kubeconfig
terraform init
terraform apply --var k8s_version=$K8S_VERSION --var ingress_nginx=true && \
while [ -z "$INGRESS_HOST" ]; do
export INGRESS_HOST=$(kubectl \
--namespace ingress-nginx \
get svc ingress-nginx-controller \
--output jsonpath="{.status.loadBalancer.ingress[0].ip}")
done;
kubectl get nodes
helm version # if this fails, you need helm 3
kubectl create namespace "argocd"
helm repo add argo \
https://argoproj.github.io/argo-helm
cat argo/argocd-values.yaml
In the cat argo/argocd-values.yaml output above, you will see that we are setting ingress to be enabled but allowing http connections. We should never do this in an environment that isn’t meant for learning.
We also set installCRDs to false.
Helm 3 removed the crd-install hook, so CRDs must be installed as if they were normal kubernetes resources - think of it as a workaround.
There is one thing missing from that yaml, it does not contain the server.ingress.hosts value.
I cannot know in advance the address at which your cluster will be accessible, so we will set that value as an argument.
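For reference, the relevant part of argo/argocd-values.yaml looks roughly like this; the exact keys depend on the argo-cd chart version, so treat it as a sketch rather than a copy of the file:
installCRDs: false
server:
  ingress:
    enabled: true
    https: false # http only - acceptable for learning, never for real environments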
Let’s continue installing argo.
#!/bin/bash
helm upgrade --install argocd argo/argo-cd --namespace argocd --set server.ingress.hosts="{argocd.$INGRESS_HOST.codingwithdrew.com}" --values argo/argocd-values.yaml --wait
export PASS=$(kubectl --namespace argocd \
get pods \
--selector app.kubernetes.io/name=argocd-server \
--output name \
| cut -d'/' -f 2)
# --insecure because we don't have TLS :(
argocd login \
--insecure \
--username admin \
--password $PASS \
--grpc-web \
argocd.$INGRESS_HOST.codingwithdrew.com
echo $PASS
argocd account update-password
open -a "Google Chrome" http://argocd.$INGRESS_HOST.codingwithdrew.com
kubectl --namespace argocd get pods
# We should see an application controller, dex server,
#redis, repo-server, and server pods running.
Deploying an application with Argo CD
Now that Argo is up and running, let’s deploy a simple application with a persistent volume attached.
Setting up a git repo with Argo CD
Instead of telling kubectl to apply the resources defined in the pvc.yaml file you will clone in a moment, we will inform argocd that there is a git repo it should use as the desired state of that application. Before we do that, we need to create the namespace we would like that application to live in.
Clone this repository: https://github.com/DrewKarr/pvc-app then run the following commands:
kubectl create namespace pvc-app
argocd app create pvc-app --repo https://github.com/DrewKarr/pvc-app --path . --dest-server https://kubernetes.default.svc --dest-namespace pvc-app
This was pretty uneventful huh?
If you retrieved the pods inside the pvc-app namespace you will see that there are none.
We didn’t deploy anything so far.
Synchronization
So far, all we have done is establish a relationship between the repository and Argo CD; the application is still only a definition.
We can confirm that by opening the Argo CD UI (your domain may be different than nip.io)
open -a "Google Chrome" http://argocd.$INGRESS_HOST.nip.io
As you can see, there is 1 application - or to be more precise - a definition of 1 application in the argocd records but it doesn’t exist in the cluster.
<< INSERT SCREENSHOT >>
The reason is that the current status is set to “OutOfSync” with ArgoCD. We told Argo CD about the existence of a repository where the desired state is defined but we never told it to sync that desired state with the actual state, so it never attempted it.
Silly computers and their malicious compliance...
Tempted to press the sync button aren’t ya?
Admit it.
Press it!
A new dialog window will appear.
<< INSERT SCREENSHOT >>
It will show a list of resources that can be synchronized. We can see that Argo CD was able to figure out that we have a service, a deployment, and an ingress resource defined in the associated repository.
Click the “SYNCHRONIZE” button.
You can see that the status changes to “Synced” and “Progressing”. After a while, it should switch to “Healthy” and “Synced”.
Destroying our application 😞
We could have accomplished the same result through the Argo CD CLI. Everything that can be done in the UI can be done in the CLI and vice versa. We could technically do everything we need with kubectl as well - but if you read my rant on gitops boundaries, you know better.
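For example, the CLI equivalents of the sync we just clicked through are:
argocd app sync pvc-app
argocd app get pvc-app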
Our pvc-app repository is now synced inside the cluster.
The actual state is now the same as the desired one and we can confirm this by listing all the resources in the pvc-app namespace. I hope that by now, you know how to do this but here it is:
kubectl -n pvc-app get all
If we wanted to stop here we could but our goal isn’t to have a clicky UI that we can manually click to sync environments. We are Devops professionals and we want to control entire environments using gitops principles. We may also want to use pull requests to decide what should or shouldn’t be deployed and when. We might want to control who can do what.
The problem is that we may have gone in the wrong direction so let’s destroy this project and start over.
argocd app delete pvc-app
With that command, we just destroyed our application.
We can confirm the deletion by using the UI where you will see the application is gone. This doesn’t necessarily mean that all the resources of the application are gone as well.
Let’s confirm.
kubectl -n pvc-app get all
We confirmed it was gone.
Let’s destroy our namespace now because we don’t need it.
kubectl delete namespace pvc-app
In the next section we will learn that Argo CD can work with more than just kubernetes yaml files and that it can be used with helm. The logic is exactly the same only the definitions differ.
Creating “The Real Deal” with Environments
Introduction to defining whole environments
In this section we are going to set up a lot of things. We are going to use helm with Argo CD. We are going to need at least 3 types of definitions and we need to define a manifest for each of the applications we are going to work on, and manifests of the whole environment (such as production or preview).
Environment manifests can be split into two groups. We need a way to define references to all the apps that are running in an environment, along with all of their environment-specific parameters.
We also need environment specific policies - For example namespaces, quotas, limits, allowed types of resources, IAM, etc.
Environment Manifests |
---|
Application-specific manifests |
Environment-specific manifests |
We have already learned about application specific manifests so we are going to focus on Environment specific manifests.
UIs are really only supposed to help us gain insights, not convert humans into button-clicky-clacky mindless machines. If you know anything about me, UIs and design matter. So for a tool to be a good option, it has to have a great UI, and Argo nails it. However, the UI is more for the end customer (developers) and for triage purposes, not for routine operations.
As such, we are going to automate everything instead of using clicky buttons to synchronize our environments. I know some of you are tickled pink about this.
Strategize our Manifests
Environments usually contain more than one application, so we cannot keep them in the repository of one of the applications.
It could be 1 repo for all environments with a branch per environment, or even a directory per environment. If we use directories, it’s basically a gross monorepo - let’s avoid that. Neither of these strategies is “bad”; they are both still valid.
I want the best one for developer experience, and since I get to pick, we are going to create a separate repository for each environment.
As you will see later, Argo makes it easy to use a different approach so don’t take this as the “ONLY” way, just the best way as Drew sees it for his use cases.
Getting started
Fork this repo: https://github.com/DrewKarr/environments. If you don’t know how to fork a repo, shame on you. Google it. lol. No seriously, here is a link to a solid resource on forking a project: https://docs.github.com/en/get-started/quickstart/fork-a-repo
Clone your newly forked repository. Run the following commands, setting GH_NAME to your GitHub account name (mine is drewkarr):
export GH_NAME=drewkarr
git clone https://github.com/$GH_NAME/environments.git
cd environments
We need to find a way to restrict the environment. As I previously mentioned we may also need to be able to define which namespaces can be used, which resources are allowed to be created, what the quota limits are, etc...
We can do some of those things using standard kubernetes resources.
For example: we can define resource limits and quotas for each namespace in a kubernetes yaml file. Unfortunately, some things we need are missing in kubernetes such as defining policies for the whole environment without restricting it to a single namespace. This is where Argo steps up to the plate.
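Before we get to Argo’s part, here is roughly what that plain-Kubernetes half looks like - a minimal, hypothetical ResourceQuota and LimitRange for a single namespace (the names and numbers below are just illustrative; they are not from the environments repo):
apiVersion: v1
kind: ResourceQuota
metadata:
  name: production-quota        # hypothetical name
  namespace: production
spec:
  hard:
    requests.cpu: "4"
    requests.memory: 8Gi
    limits.cpu: "8"
    limits.memory: 16Gi
    pods: "20"
---
apiVersion: v1
kind: LimitRange
metadata:
  name: production-limits       # hypothetical name
  namespace: production
spec:
  limits:
    - type: Container
      default:
        cpu: 200m
        memory: 256Mi
      defaultRequest:
        cpu: 100m
        memory: 128Mi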
We have already used an Argo CD project without even knowing it - the first application we ran through was placed into the default project, which is similar to a kubernetes namespace. If we don’t specify a project, applications go into the one called “default”.
This is OK initially, but as the number of applications managed by Argo increases, we need to start organizing them into projects.
What are Argo CD projects?
Argo CD projects provide a logical grouping of applications that is useful when Argo CD is used by multiple teams. This will become obvious when we look at the features they provide.
It can restrict what may be deployed through a concept called “trusted git repository”.
It can also be used to define which clusters and namespaces are allowed to deploy apps.
It also allows us to define the kinds of objects that are permitted, like deployments, certs, and RBAC (Role Based Access Control) rules.
There are a ton of options that projects let us set limits on, but we aren’t going to list them all out. Instead, let’s take a look at the definition in our repository.
cat project.yaml
apiVersion: argoproj.io/v1alpha1
kind: AppProject
metadata:
  name: production
  namespace: argocd
  finalizers:
    - resources-finalizer.argocd.argoproj.io
spec:
  description: Production project
  sourceRepos:
    - '*'
  destinations:
    - namespace: '*'
      server: https://kubernetes.default.svc
  clusterResourceWhitelist:
    - group: '*'
      kind: '*'
  namespaceResourceWhitelist:
    - group: '*'
      kind: '*'
We are defining that the production project can use any of the repositories listed under “sourceRepos” (we made it a wildcard).
We need to change it to specify that we can only use 2 destination namespaces:
  destinations:
    - namespace: 'production'
      server: https://kubernetes.default.svc
    - namespace: 'argocd'
      server: https://kubernetes.default.svc
We need to make sure that Namespace is the only cluster-wide resource that can be created.
  clusterResourceWhitelist:
    - group: '*'
      kind: 'Namespace'
Next we need to specify the namespaceResourceBlacklist - the kinds of namespaced resources that cannot be created or edited. Within this project, we will not be able to create any of these:
  namespaceResourceBlacklist:
    - group: ''
      kind: 'Namespace'
    - group: ''
      kind: 'ResourceQuota'
    - group: ''
      kind: 'LimitRange'
    - group: ''
      kind: 'NetworkPolicy'
Next we need to specify the namespaceResourceWhitelist - the kinds of namespaced resources that can be created and edited:
  namespaceResourceWhitelist:
    - group: 'apps'
      kind: 'Deployment'
    - group: 'apps'
      kind: 'StatefulSet'
    - group: 'extensions'
      kind: 'Ingress'
    - group: ''
      kind: 'Service'
In this context, these resources listed above are the only resources that can be created within the project namespaces “production” and “argocd”.
Let’s create the project and see what we will get:
kubectl apply --filename project.yaml
We can also (and preferably) accomplish this using the following command with argocd
argocd proj create -f project.yaml
I’m demonstrating the two commands so that you can see how similar they are and consider your options in your environment. The only advantage of the argocd version is that you aren’t using the kubectl CLI - and if you recall, it is my opinion that kubectl commands should never be run in production by humans except to observe and query resources.
As you can see in the definition, projects are AppProject resources. Knowing this, we can use kubectl to retrieve them.
Let’s list all the Argo CD projects and confirm that the newly created one exists.
kubectl --namespace argocd get appprojects
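If you prefer staying in the argocd CLI (assuming you are still logged in), the equivalent checks look like this:
# List all Argo CD projects
argocd proj list
# Show the details of the production project, including its whitelists and destinations
argocd proj get production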
You can also look in the UI (obviously your domain will differ)
open -a "Google Chrome" http://argocd.$INGRESS_HOST.example.com/settings/projects
We should see default and production
<< INSERT PICTURE >>
Now that we established that the apps in the Argo CD project can be deployed in the production namespace, we should probably create the kubernetes namespace “production”.
kubectl create namespace production
Application Manifests
Let’s look in the helm directory
ls -1 helm
We should see a Chart.yaml and templates directory.
The resources defined in the templates here are special. Let’s take a look...
ls -1 helm/templates
We can tell by looking at their names that there are 2 applications. They are identical, so let’s take a look at one.
cat helm/templates/application1.yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: environments
  namespace: argocd
  finalizers:
    - resources-finalizer.argocd.argoproj.io
spec:
  project: production
  source:
    path: helm
    repoURL: https://github.com/drewkarr/environments.git
    targetRevision: HEAD
    helm:
      values: |
        image:
          tag: latest
        ingress:
          host: codingwithdrew.com
      version: v3
  destination:
    namespace: production
    server: https://kubernetes.default.svc
  syncPolicy:
    automated:
      selfHeal: true
      prune: true
By doing this, we are complying with the most important principle of GitOps!
Let’s break this file down even further:
Understanding the values of our argo helm chart
This definition serves the same purpose as the argocd CLI commands we ran earlier (such as argocd proj create), with the main difference being that this is IaC. We aren’t executing ad hoc commands; instead we define our desired state and publish it to a git repository.
What matters is that the definition in this file defines the argocd application and not much more. It belongs to the project “production” and it contains a reference to our repository and the path inside it.
We are linking the application to a specific repository and as you will see later, any time that we change the content of that repository, it will update argocd and apply those changes.
We don’t change the definitions of our applications frequently, so this won’t happen very often - at least not for the ones that live in the application repositories.
We can see in the helm section of the file above that we are trying to replace the values of variables that are specific to the environment. Every time we want to release a new image to production, we can just change that value in that file and let Argo do what needs to be done for the state.
In the host value, we set it assuming that this application may run in environments other than production, like preview, development, staging, or integration. We need the ability to define it per environment, so instead of templating it, we hardcode it in each environment’s definition.
Please do NOT update spec.source.helm.ingress.host to your domain just yet - we will use it in a later section to demonstrate our CD pipeline in action.
The destination says that we want to run this application in the production namespace of the local server (https://kubernetes.default.svc) - in other words, in the same cluster where Argo CD itself is running.
We also have the syncPolicy set to “automated”, with “selfHeal: true” and “prune: true”. Automated sync means the application will be synchronized whenever there is a change to the manifests stored in git - its function is the same as the clicky SYNCHRONIZE button we pressed earlier.
selfHeal set to true will also synchronize the actual state (the one in the cluster) back to the desired state (the one in the git repo) if anything changes the actual state directly.
If we were to manually change what is running in the cluster, Argo would auto-correct it. This is something that is missing in many CD tools, like CloudFormation.
This means we cannot change anything related to the application directly - only Argo can do that. Argo will assume that any manual intervention on the cluster is unintentional and will undo it by syncing again. This may sound dangerous but it’s a good thing! It enforces GitOps principles and the rule that no one should be running kubectl commands in production to manually change anything, and it improves traceability because we will see every change in our git history.
Finally, when syncPolicy.automated.prune is set to true, Argo CD will also sync when we delete files from the repo. It will assume the deletion was intentional and an expression of our desire to remove the corresponding resources from the live cluster.
These values under syncPolicy are disabled by default as a security precaution. The Argo project decided that we need to be explicit about whether we want automation and at what level. We aren’t just testing the waters, though - we want full automation. Let’s be brave and automate all the things!
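As an aside, if you ever need to flip these switches on an application that was created ad hoc rather than from a definition stored in git, the argocd CLI can do it too - a minimal sketch, using a hypothetical application name:
# Turn on automated sync, pruning, and self-healing for an existing app (hypothetical name "my-app")
argocd app set my-app --sync-policy automated --auto-prune --self-heal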
This demonstration would not be complete if I didn’t show alternative solutions for other environments.
Other applications and environments
We could take a look at application2.yaml ...
cat helm/templates/application2.yaml
but there isn’t any difference other than the fact that it’s a separate application. The fact that you can have multiple applications in the same environment is the entire point of this.
We could use a command we already know, like helm upgrade --install, and Argo CD would learn about those two applications and start managing them - but we will not do that.
Applying the chart ourselves would not do what we are hoping for. Argo CD would monitor the application repositories, and those repositories do contain manifests for the applications, but not the overrides specific to production.
Basically, the two applications reference repos that would be synced on changes, but if we changed the yaml files in the environments repository itself, those changes would not be synchronized, since nothing would reference the Argo CD production (environments) repository where those files live.
An Environment as an Application of Applications
At this point, what we need is Argo CD’s “app of apps” pattern - we will create an application that references the two applications we already defined. Those applications, in turn, reference the base application manifests stored in the application repositories.
Pretty confusing huh?
Let’s look at another Argo CD application definition by executing
cat apps.yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: production
  namespace: argocd
  finalizers:
    - resources-finalizer.argocd.argoproj.io
spec:
  project: production
  source:
    repoURL: https://github.com/drewkarr/environments.git
    targetRevision: HEAD
    path: helm
  destination:
    server: https://kubernetes.default.svc
    namespace: production
  syncPolicy:
    automated:
      selfHeal: true
      prune: true
    syncOptions:
      - CreateNamespace=true
This is mostly the same as our previous section with the major difference being the reference to https://github.com/drewkarr/environments.git (environments repo) and syncOptions.
Note that we are using a separate repo for each environment (hypothetically).
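Because you forked the repository, the repoURL fields in apps.yaml and in helm/templates/ still point at my account. One way to point them at your fork - a sketch that assumes GH_NAME is still exported from the clone step and that you are on macOS/BSD sed, like the rest of this walkthrough:
# Replace my GitHub account with yours in every manifest that references the environments repo
sed -i '' "s@drewkarr/environments@$GH_NAME/environments@g" apps.yaml helm/templates/*.yaml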
We are finally ready to apply that definition.
Once you have updated the repoURL references to point at your fork, run git commit -am "changed org name" and then git push so that Argo will see your changes.
By defining the app of apps, we are able to tell Argo about the child applications without adding them directly, because Argo will follow the references set in the repoURL field.
kubectl --namespace argocd apply --filename apps.yaml
Open the UI and you can see that there are 3 applications.
We have production, which is the app of apps referencing the helm directory inside our environments (production) repository. Inside that repository is the chart that defines the two applications - application1 and application2.
If we click the production application in the UI, we can see that it consists of 2 resources. Expand them (they are hidden by default) and the relationship becomes clearer: production has 2 resources, application1 and application2.
Even though Argo CD sees production as a separate application, it realistically is the full production environment.
If you added any charts to the repo, they would also live directly in the production environment and would be updated automatically. Also any changes to the referenced repositories would be applied automatically to the cluster.
Let’s see what we got
kubectl -n production get all
We can see that everything we did was indeed created.
We can see our ingresses like this:
kubectl -n production get ingresses
The process at scale would be the same as it was for 2 applications, which makes Argo a powerhouse of simplicity.
The coolest thing about this is we did all that by executing a single command. All we did was deploy a single resource that defines a single application in Argo CD and it did the rest by recursively following references.
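If you want the same overview without the UI, the argocd CLI can confirm what got created (assuming you are still logged in):
# List every application Argo CD now manages - expect production, application1, and application2
argocd app list
# Drill into the app of apps itself
argocd app get production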
Our job from here
All we would do after the application is created is to push changes to manifests in their respective repositories and let argo do the rest.
Management of applications with Argo CD
What does the management of applications with Argo Cd look like beyond our initial setup?
Any change to the cluster should start with a change to the desired state in the kubernetes manifests or helm charts and end with the push of that change to the repository. The system we have created has the job of making the cluster configuration match the desired state.
We already have this set up so all we are missing is seeing it in action.
Remember in the section called Creating “The Real Deal” with Environments, we said we would use spec.source.helm.ingress.host later to demonstrate our CD pipeline in action.
Now is the time: update the host in the application1.yaml and application2.yaml files (located in the helm/templates/ directory) to your domain.
We will also need to change the “latest” image tag to a specific version. Let’s do this with a magic command (substitute the image tag you actually want to deploy):
cat helm/templates/application1.yaml \
    | sed -e "s@latest@<specific-image-tag>@g" \
    | sed -e "s@codingwithdrew.com@application1.$INGRESS_HOST.nip.io@g" \
    | tee helm/templates/application1.yaml
Commit your changes and push them to the repository.
So just to recap, we haven’t touched the live system, we just made changes to our IaC without breaking the principles of GitOps.
We can watch the magic happen in the Argo CD UI or with a fancy kubectl command to see what image tag is currently in production.
kubectl -n production get deployment application1-application1 --output jsonpath="{.spec.template.spec.containers[0].image}"
You should see the image tag update. If it still says “latest”, it is likely you were just a little impatient, give it a moment and run the command again.
Let’s check to see the update to the ingress host.
kubectl -n production get ingresses
Destroying resources
What happens when we delete application2.yaml from the helm/templates directory?
rm helm/templates/application2.yaml
Commit and push your code.
If you recall, we set the following syncPolicy:
syncPolicy:
automated:
selfHeal: true
prune: true
Head back to the Argo CD UI, initially we will see all 3 applications (production, application1, application2).
Since prune is set to true, application2 should disappear, momentarily.
We can also check by running:
kubectl -n production get pods
Cleaning up after ourselves
As always, we are done walking through our example and should destroy all the resources associated with it to avoid unnecessary charges.
Run the following commands:
kubectl delete namespace argocd
kubectl delete namespace production
cd ..
Argo CD Progressive Release Strategies
Argo CD Birbs
Let’s learn about progressive delivery with Argo CD.
Progressive delivery is a deployment strategy that allows us to roll out new features gradually while avoiding downtime.
This is an iterative approach to deployments, you may have heard it called “Canary Releases” in the past.
This is a pretty broad definition since it encompasses a lot more than just Canary Releases such as Rolling Updates, Blue-Green Deployments, and more.
All of these strategies have one thing in common: they all roll out new releases progressively. The duration of a progressive release varies from place to place, but the important part is that it is an iterative approach to deployment.
Progressive delivery is an advanced technique of deploying software that lowers the risks associated with a hard cutover and minimizes the blast radius of post deployment issues.
We will discuss the why, the who, and the reasons you might not want to leverage this strategy, in detail.
Why do we want it?
Gary likes doing things the hard and painful way. He is a traditionalist.
The traditional deployment (recreate) strategy Gary employs across the board consists of shutting down the old release and then creating the new one. This is a “rip the band-aid off” approach to updates and doesn’t ensure zero downtime - in fact, it guarantees a window where neither the old nor the new release is running, and your end users aren’t happy.
This “recreate strategy”, or “Gary strategy”, is one of the reasons many companies release less frequently - businesses and Garys everywhere hate downtime.
To be fair, if a release has built-in downtime, it does make sense to release infrequently. But that’s bad too, because end users and the business don’t like waiting for new features and bug fixes.
So why would anyone use Gary’s strategy?
- Your name is Gary...
- You may not know there is a better way - problem solved, continue reading.
- The main reason this strategy is used is that, in some cases, a release that produces downtime is the only option we have. Many resources and architectures don’t allow anything but shutting down first and then deploying - database migrations, for example.
- Another example is when an application cannot scale. It is impossible to deploy a new release with zero downtime without running at least two replicas of the application in parallel (which, as a bonus, lets us test the new release in production - a novel idea, right?).
If an application cannot scale horizontally (only 1 replica is allowed) and there can be no overlap in run time, the recreate strategy is the only boat you can sail on.
Horizontal scaling is easy to solve though, right? Right?!
hOrIzOnTaL ScAlInG Is eAsY! aLl i hAvE To dO Is sEt tHe rEpLiCaS FiElD In tHe mAnIfEsT To aCcOmMoDaTe tHe dEpLoYmEnT Or a sTaTeFuL SeT To a vAlUe hIgHeR ThAn oNe aNd tHe pRoBlEm iS SoLvEd!
Stateful applications that cannot replicate data between replicas cannot scale horizontally - the effects of scaling would likely be catastrophic to your end goals.
This means, we have to change our application to be stateless or figure out how to replicate data. Replicating data is a terrible idea and hard to do well. Consistent and reliable replication is hard to accomplish. It’s so complicated that even some databases can’t do it. Accomplishing this for our purposes is a rabbit hole we won’t dive into.
It’s much easier and better to use an external database for all the state of our applications and as a result they become stateless.
Scalable applications can leverage progressive delivery, right? Right!?
Let’s not be hasty. There is more to progressive delivery than just needing to be stateless. It means that not only do you need multiple replicas of your application to run in parallel, but also that two releases will run in parallel. This means that there is no guarantee which version the end user will get nor which release other applications are communicating with.
See the problem?
The solution: each release needs to be backward compatible. It doesn’t matter what the scope of the change is - if it is going to be deployed without downtime, two releases will run in parallel.
We have no way to tell which release a user/process will hit even if we use feature flags.
The requirements (boiled down)
We know that horizontal scaling and backward compatibility are requirements to avoid downtime for the use of progressive release strategies.
Optionally, but probably necessary for progressive delivery:
- You will need to have a firm grasp of traffic management.
- You will need a continuous delivery pipeline.
- You’ll likely need observability tools like sentry or datadog
- Alerting setup.
- ChatOps
While you don’t need the bulleted list above, your life will only be hard without them and that’s not a goal of mine or anyone I know, besides Gary, but we don’t have to talk about Gary. He isn’t a happy guy.
The most important bit to drive home
Progressive delivery is NOT a practice that is well suited for immature teams.
It requires a lot of experience. It may not look that way initially, but once it reaches production at scale, things can get really hairy really quickly if the processes and tools we use are not backed by extensive training and experience.
This is not a tool that you will want to deploy with 1 strong team member but team members come and go. This is a strategy that everyone on the team needs to be proficient at.
You’ve been warned.
What are we going to focus on?
We won’t really explore this recreate (A.K.A Gary) strategy further in this section as we kind of already covered this. If you are lucky enough to work in a company that doesn’t have legacy applications, you won’t even need to think about this again.
We also won’t explore the rolling update strategy since you are probably already using it even if you aren’t aware of it because it is the default deployment strategy in k8s. All of our examples so far have used it.
We also won’t explore blue-green deployments because they are a tool of Gary’s and of the past, when infrastructure was static and applications were mutable.
We will only explore the Canary release strategy here.
Why use Argo Rollouts to deploy applications?
Argo Rollouts provides advanced deployment capabilities. It supports all the flavors of deployment strategies. Argo Rollouts is a robust, comprehensive, and mature solution that leverages many different processes and tools. At the same time, it’s simple and easy to use.
Argo Rollouts integrates with ingress controllers like NGINX and AWS ALB (Application Load Balancer), and with service meshes like Istio and those supporting the SMI (Service Mesh Interface), like Linkerd.
Through them, Argo can control traffic, making sure that all the requests that are made that match specific criteria are reaching new releases. Take that, feature flags!
On top of all that, it can query metrics from various providers and decide whether to roll forward or roll back based on the results.
Those metrics can come from Kubernetes, Prometheus, Wavefront, or Kayenta.
I’d compare it to other tools but I think it stands alone.
Let’s look at some practical examples so that we can determine whether it’s a practical tool worth adopting.
Installing and configuring Argo Rollouts
Argo Rollouts can integrate with ingress controllers or with service meshes. Since service meshes allow more possibilities, and Istio is the most commonly used one, that’s what we will learn (and because it’s the one I’ve chosen to use 😝 in this article book).
Set up
We will need a kubernetes cluster as you can probably guess.
We will also need to install Istio.
You will need to create an environment variable:
export ISTIO_HOST="IP THAT ISTIO IS ACCESSIBLE"
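If you don’t already have Istio running, one way to install it and find that IP is sketched below - this assumes you use istioctl and that your cluster gives the Istio ingress gateway a LoadBalancer IP (on some local clusters you’ll need a node IP and port instead):
# Install Istio (the demo profile is fine for this walkthrough)
istioctl install --set profile=demo -y
# Grab the external IP of the Istio ingress gateway and use it for ISTIO_HOST
kubectl --namespace istio-system get service istio-ingressgateway \
    --output jsonpath='{.status.loadBalancer.ingress[0].ip}'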
brew install argoproj/tap/kubectl-argo-rollouts
kubectl create namespace argo-rollouts
kubectl --namespace argo-rollouts apply --filename https://raw.githubusercontent.com/argoproj/argo-rollouts/stable/manifests/install.yaml
helm create drewlearns
We can see all the new kubectl commands we just installed by running:
kubectl argo rollouts --help
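A few of the subcommands we will lean on later (worth skimming the help output for each):
kubectl argo rollouts list rollouts        # list rollouts in the current namespace
kubectl argo rollouts get rollout NAME     # inspect a rollout (add --watch to follow it live)
kubectl argo rollouts promote NAME         # move a paused rollout to its next step
kubectl argo rollouts abort NAME           # give up on a canary and go back to the stable release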
In your helm/templates directory, create the following files and paste in the contents shown below. They are grouped logically by file, and after each file there is a breakdown of the definitions inside.
../.gitignore
/.idea
/*.iml
/public
/drewlearns
/env
.waypoint
/creds
/shipyard.yaml
*account*.json*
/helm/values.yaml
image:
  repository: drewkarr/rollouts
  tag: latest
  pullPolicy: IfNotPresent
ingress:
  enabled: true
  host: codingwithdrew.com
resources:
  limits:
    cpu: 100m
    memory: 256Mi
  requests:
    cpu: 80m
    memory: 128Mi
readinessProbe:
  periodSeconds: 10
  successThreshold: 1
  timeoutSeconds: 1
hpa: false
knativeDeploy: false
istio:
  enabled: false
rollout:
  enabled: false
  steps:
    - setWeight: 10
    - pause: {duration: 2m}
    - setWeight: 30
    - pause: {duration: 30s}
    - setWeight: 50
    - pause: {duration: 30s}
  analysis:
    enabled: true
Most of this file doesn’t really need any explanation but we will use it as a blueprint for how we construct the rest of our files. These are basically default production ready values (even though this is just a lame demo).
We will use these values later when we have gained a better understanding of the fully automated rollout process.
helm/templates/base-rollout.yaml
{{- if .Values.rollout.enabled }}
---
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: {{ template "fullname" . }}
  labels:
    chart: "{{ .Chart.Name }}-{{ .Chart.Version | replace "+" "_" }}"
    app: {{ template "fullname" . }}
spec:
  selector:
    matchLabels:
      app: {{ template "fullname" . }}
  template:
    metadata:
      labels:
        app: {{ template "fullname" . }}
        istio-injection: enabled
    spec:
      containers:
        - name: {{ .Chart.Name }}
          image: "{{ .Values.image.repository }}:{{ .Values.image.tag }}"
          imagePullPolicy: {{ .Values.image.pullPolicy }}
          ports:
            - containerPort: 80
          livenessProbe:
            httpGet:
              path: /
              port: 80
            initialDelaySeconds: 60
            periodSeconds: 10
            successThreshold: 1
            timeoutSeconds: 1
          readinessProbe:
            httpGet:
              path: /
              port: 80
            periodSeconds: 10
            successThreshold: 1
            timeoutSeconds: 1
          resources:
{{ toYaml .Values.resources | indent 12 }}
  strategy:
    canary:
      canaryService: {{ template "fullname" . }}-canary
      stableService: {{ template "fullname" . }}
      trafficRouting:
        istio:
          virtualService:
            name: {{ template "fullname" . }}
            routes:
              - primary
      steps:
{{ toYaml .Values.rollout.steps | indent 6 }}
{{- if .Values.rollout.analysis.enabled }}
      analysis:
        templates:
          - templateName: {{ template "fullname" . }}
        startingStep: 2
        args:
          - name: service-name
            value: "{{ template "fullname" . }}-canary.{{ .Release.Namespace }}.svc.cluster.local"
{{- end }}
{{- end }}
These templates are almost the same as our kubernetes deployments - everything that we define in a deployment can be defined in a rollout, with a few small differences:
- apiVersion - should be argoproj.io/v1alpha1 (this may change over time, most likely).
- kind - should be Rollout, pretty obvious huh?
- New definitions we need a firm grasp on: instead of the typical recreate and rolling update strategies, we care about spec, ports, and strategy. The strategy is currently set to canary and, as we discussed, we won’t explore the other options because they are dated.
- .spec.strategy.canary - there are a number of fields here that provide information about the rollout and tell Argo how to perform the process:
  - canaryService - a reference to the service Argo will use to redirect some traffic to the new release.
  - stableService - where the default traffic goes; it is used for the release that is being decommissioned (hopefully).
  - trafficRouting - contains the reference to the istio.virtualService. We will see in a bit how traffic is controlled through it.
  - steps - set to the helm value rollout.steps, which we will also explore shortly.
  - analysis - this field is inside an if conditional based on the rollout.analysis.enabled value. We will disable this initially, so we will come back to it shortly.
helm/templates/rollout-analysis.yaml
{{- if .Values.rollout.enabled }}
---
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: {{ template "fullname" . }}
spec:
  args:
    - name: service-name
  metrics:
    - name: success-rate
      interval: 10s
      successCondition: result[0] >= 0.8
      failureCondition: result[0] < 0.8
      failureLimit: 3
      provider:
        prometheus:
          address: http://prometheus-server.monitoring
          query: |
            sum(irate(
              istio_requests_total{
                reporter="source",
                destination_service=~"{{ "{{args.service-name}}" }}",
                response_code=~"2.*"
              }[2m]
            )) /
            sum(irate(
              istio_requests_total{
                reporter="source",
                destination_service=~"{{ "{{args.service-name}}" }}"
              }[2m]
            ))
    - name: avg-req-duration
      interval: 10s
      successCondition: result[0] <= 1000
      failureCondition: result[0] > 1000
      failureLimit: 3
      provider:
        prometheus:
          address: http://prometheus-server.monitoring
          query: |
            sum(irate(
              istio_request_duration_milliseconds_sum{
                reporter="source",
                destination_service=~"{{ "{{args.service-name}}" }}"
              }[2m]
            )) /
            sum(irate(
              istio_request_duration_milliseconds_count{
                reporter="source",
                destination_service=~"{{ "{{args.service-name}}" }}"
              }[2m]
            ))
{{- end }}
This file defines our analysis which will be used to determine if a deployment is good or not.
helm/templates/rollout-service.yaml
{{- if .Values.rollout.enabled }}
---
apiVersion: v1
kind: Service
metadata:
  name: {{ template "fullname" . }}-canary
  labels:
    chart: "{{ .Chart.Name }}-{{ .Chart.Version | replace "+" "_" }}"
spec:
  type: ClusterIP
  ports:
    - port: 80
      targetPort: 80
      protocol: TCP
      name: http
  selector:
    app: {{ template "fullname" . }}
{{- end }}
The service defined here is the one referenced as the canaryService of the rollout. It is only used during the rollout process; in all other cases, the one defined in service.yaml is used.
The only real difference between this service and the one in service.yaml is the name.
You are already familiar with how kubernetes services work so I’m not going to dive in this file at all - if you do need a refresher I have a good resource here: https://codingwithdrew.com/drew-learns-kubernetes-an-introduction-to-devops/#45.
We may need to make other changes for a typical rollout. For example, the HorizontalPodAutoscaler needs to reference the rollout instead of the deployment (and so does any other resource that used to reference the deployment).
Let’s take a look at our hpa.yaml file:
helm/templates/hpa.yaml
{{- if .Values.hpa }}
---
apiVersion: autoscaling/v1
kind: HorizontalPodAutoscaler
metadata:
  name: {{ template "fullname" . }}
  labels:
    chart: "{{ .Chart.Name }}-{{ .Chart.Version | replace "+" "_" }}"
    app: {{ template "fullname" . }}
spec:
  minReplicas: 2
  maxReplicas: 6
  scaleTargetRef:
{{- if .Values.rollout.enabled }}
    apiVersion: argoproj.io/v1alpha1
    kind: Rollout
{{- else }}
    apiVersion: apps/v1
    kind: Deployment
{{- end }}
    name: {{ template "fullname" . }}
  targetCPUUtilizationPercentage: 80
{{- end }}
This is basically the same HPA we would use for any other type of application, with the exception of the apiVersion and the kind set inside the scaleTargetRef, which have different values depending on whether it targets a deployment or a rollout.
The next 2 definitions we are going to look into are specific to Istio.
helm/templates/istio-virtual-service.yaml
{{- if .Values.istio.enabled }}
---
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: {{ template "fullname" . }}
  labels:
    chart: "{{ .Chart.Name }}-{{ .Chart.Version | replace "+" "_" }}"
spec:
  gateways:
    - {{ template "fullname" . }}
  hosts:
    - {{ template "fullname" . }}.local
    - {{ .Values.ingress.host }}
  http:
    - name: primary
      route:
        - destination:
            host: {{ template "fullname" . }}
            port:
              number: 80
          weight: 100
        - destination:
            host: {{ template "fullname" . }}-canary
            port:
              number: 80
          weight: 0
{{- end }}
The most important part of the istio yaml files is the route, where we have 2 destinations nested inside - the first being the stable release and the second being the canary. Each one references one of the respective services, which is typical for istio.
One important callout: the weight of the primary destination is set to 100, while the weight of the canary is set to 0.
This means all (100%) of the traffic will go to the primary service which will be the release that is fully rolled out.
At runtime, Argo will actually manipulate this field - we will see it in flight shortly.
helm/templates/istio-gateway.yaml
{{- if .Values.istio.enabled }}
---
apiVersion: networking.istio.io/v1alpha3
kind: Gateway
metadata:
  name: {{ template "fullname" . }}
spec:
  selector:
    istio: ingressgateway # use istio default controller
  servers:
    - port:
        number: 80
        name: http
        protocol: HTTP
      hosts:
        - "*"
{{- end }}
We aren’t really going to break down the following files just yet (perhaps not at all, since we have discussed them in detail in previous sections), so you can just review the contents and copy-paste them into the helm/templates directory.
helm/templates/service.yaml
---
{{- if .Values.knativeDeploy }}
{{- else }}
apiVersion: v1
kind: Service
metadata:
  name: {{ template "fullname" . }}
  labels:
    chart: "{{ .Chart.Name }}-{{ .Chart.Version | replace "+" "_" }}"
spec:
  type: ClusterIP
  ports:
    - port: 80
      targetPort: 80
      protocol: TCP
      name: http
  selector:
    app: {{ template "fullname" . }}
{{- end }}
helm/templates/ing.yaml
{{- if .Values.ingress.enabled }}
---
{{- if .Capabilities.APIVersions.Has "networking.k8s.io/v1beta1" }}
apiVersion: networking.k8s.io/v1beta1
{{ else }}
apiVersion: extensions/v1beta1
{{ end -}}
kind: Ingress
metadata:
  name: {{ template "fullname" . }}
  labels:
    draft: {{ default "draft-app" .Values.draft }}
    chart: "{{ .Chart.Name }}-{{ .Chart.Version | replace "+" "_" }}"
  annotations:
    kubernetes.io/ingress.class: "nginx"
spec:
  rules:
    - http:
        paths:
          - backend:
              serviceName: {{ template "fullname" . }}
              servicePort: 80
      host: {{ .Values.ingress.host }}
{{- end}}
helm/templates/ksvc.yaml
---
{{- if .Values.knativeDeploy }}
apiVersion: serving.knative.dev/v1alpha1
kind: Service
metadata:
  name: {{ template "fullname" . }}
  labels:
    chart: "{{ .Chart.Name }}-{{ .Chart.Version | replace "+" "_" }}"
spec:
  runLatest:
    configuration:
      revisionTemplate:
        spec:
          container:
            image: "{{ .Values.image.repository }}:{{ .Values.image.tag }}"
            imagePullPolicy: {{ .Values.image.pullPolicy }}
            livenessProbe:
              httpGet:
                path: /
              initialDelaySeconds: 60
              periodSeconds: 10
              successThreshold: 1
              timeoutSeconds: 1
            readinessProbe:
              httpGet:
                path: /
              periodSeconds: 10
              successThreshold: 1
              timeoutSeconds: 1
            resources:
{{ toYaml .Values.resources | indent 14 }}
{{- end }}
As you can see, our whole application lives in the helm directory as templates and later we will see how the default values can be used as a framework for constructing the templates.
Deploying the first release
We have explored a lot of ground so far but now is the time we get to pull levers and do cool stuff.
Before we begin though, let’s consider some specific customizations we are going to use for the first release.
In the spirit of starting simple and building a more and more complex architecture, follow along!
Starting simply with Gary Ops
We are going to have manual approvals for rolling forward releases initially.
Let’s take a look at the values we will use to overwrite the default production values, and how they work.
2x-values.yaml
ingress:
  enabled: false
istio:
  enabled: true
hpa: true
rollout:
  enabled: true
  steps:
    - setWeight: 20
    - pause: {}
    - setWeight: 40
    - pause: {}
    - setWeight: 60
    - pause: {duration: 10}
    - setWeight: 80
    - pause: {duration: 10}
  analysis:
    enabled: false
We are disabling NGINX ingress because this is an example app and it isn’t really needed.
istio and hpa are set to enabled because we will need them for the rollout services. We are enabling rollout and setting setWeight values that increase over time.
The magic of the progressive canary rollout is the steps and pausing with updated weights. Steps define what should be done and in what order.
Basically it’s saying “send 20% of the traffic to the canary then pause indefinitely. When the next step comes around send 40% and pause indefinitely... and so on”.
Pausing indefinitely doesn’t necessarily mean forever - these steps basically mean “pause until we manually promote the release to the next step”. Remember, we are keeping things simple for now and using “Gary Ops”.
Until we perform a manual promotion, it will stay paused. When we approve the next step, that is the signal to end the pause and continue to the next action. The third and fourth pauses just wait 10 seconds and then promote automatically, which is much more aligned with continuous delivery pipelines, just limited to the deployment process.
This will give us a high-level overview of how to use time as a gate for deciding whether a rollout can progress.
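When we get there, the manual “approval” is a single command - the rollout in our example ends up named drewlearns-drewlearns, as you will see when we watch it:
# Resume a paused rollout, i.e. approve the next canary step
kubectl argo rollouts --namespace drewlearns promote drewlearns-drewlearns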
Finally, the last value sets analysis.enabled to false. We aren’t ready to explore this just yet, so it’s disabled. Rest assured, we will get there.
Let’s create this project!
helm upgrade --install drewlearns helm \
--namespace drewlearns \
--create-namespace \
--values 2x-values.yaml \
--set ingress.host=drewlearns.$ISTIO_HOST.nip.io \
--set image.tag=2.0.0 \
--wait
We have now deployed the first release of the application defined in our helm directory.
We used the values defined in the file 2x-values.yaml and we set the image tag to 2.0.0. We are using one of the older releases of the application so we can see how it behaves later on.
We can see the information of the rollout process through the command:
kubectl argo rollouts --namespace drewlearns get rollout drewlearns-drewlearns --watch
We can see that the application is healthy and that all eight of the steps of the canary strategy have executed.
The actual weight is set to 100 meaning all the requests made to our application are going to the release we just deployed. That means the old revision is set to 1.
You will also notice that there is a replica set with 2 pods even though we deployed a rollout resource - it behaved as a deployment except for the strategy.
The rollout created a replica set, which created a new pod. Later on, the horizontal pod autoscaler kicked in and scaled it to two pods, since we defined a minimum of 2.
More importantly, the rollout skipped right past the steps we defined - it ran through all 8 of them without waiting for our manual approval. What the heck, right?
We didn’t find a bug, we just deployed our first iteration of our application so argo knows to ignore those incremental steps and to just deploy it fully as normal. Pretty neat huh? Also how lame would it be to do a rolling deployment when there wasn’t already a deployment to roll over from?
The canary deployment strategy only applies to the subsequent releases. This means it’ll only apply the canary strategy if the application is already running.
Let’s ensure the application is running by opening it in our browser:
open -a "Google Chrome" http://drewlearns.$ISTIO_HOST.nip.io
Leveraging the Canary Strategy
So far, we haven’t seen any advantage to leveraging the Canary Rollout feature but that’s about to change.
Let’s deploy a second release and kick off the redeployment process.
helm upgrade drewlearns helm \
--namespace drewlearns \
--reuse-values \
--set image.tag=2.0.1
Let’s watch the rollout:
kubectl argo rollouts --namespace drewlearns get rollout drewlearns-drewlearns --watch
It processed the first step, which sets the setWeight value to 20%, and then the second step, which is set to pause indefinitely.
So currently it’s waiting for us to promote it to the next step, but how do we check to make sure that only 20% of our requests go to the new release?
Let’s run it and check for the header value of “Version: 2.0.1”
for i in {1..100}; do
curl -IL http://drewlearns.$ISTIO_HOST.nip.io \
| grep -i "VERSION: 2.0.1"
done | wc -l
When you run this command, you should see responses from your application with version 2.0.1 in the header approximately 20% of the time (likely a bit less, but not more). If you want a more precise number, send more requests for a larger sample.
We can edit the command to search for Version: 2.0.0 instead, and we would see that approximately 80% of the traffic goes to the original release.
We can also check the status of the new release by running:
kubectl --namespace drewlearns get virtualservice drewlearns-drewlearns \
--output yaml
We can see that Argo Rollouts has updated the VirtualService with a weight of 20 for the canary and 80 for the already-deployed release.
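If you want just the numbers instead of the whole YAML, a jsonpath query against the same VirtualService should do it (assuming the route layout from our istio-virtual-service.yaml template above):
# Print the primary and canary weights, e.g. "80 20" during this first step
kubectl --namespace drewlearns get virtualservice drewlearns-drewlearns \
    --output jsonpath='{.spec.http[0].route[*].weight}'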
Drew is a seasoned DevOps Engineer with a rich background that spans multiple industries and technologies. With foundational training as a Nuclear Engineer in the US Navy, Drew brings a meticulous approach to operational efficiency and reliability. His expertise lies in cloud migration strategies, CI/CD automation, and Kubernetes orchestration. Known for a keen focus on facts and correctness, Drew is proficient in a range of programming languages including Bash and JavaScript. His diverse experiences, from serving in the military to working in the corporate world, have equipped him with a comprehensive worldview and a knack for creative problem-solving. Drew advocates for streamlined, fact-based approaches in both code and business, making him a reliable authority in the tech industry.