20 May 2020

Spinnaker and Multi-Cluster GKE Introduction

This series covers the work and design decisions involved in building a multi-cluster, multi-region Google Cloud (GCP) and Google Kubernetes Engine (GKE) environment. Once the environment is created with Terraform, Spinnaker will be deployed to one of the GKE clusters and will manage our Kubernetes workloads from there.

Goals & Posts

This series has a bunch of objectives:

  • Goals and creating a basic repository and module structure (this post)
  • Create a project & folder layout
  • Create a common GKE module and deploy it
  • Deploy Spinnaker onto the core cluster
  • Create a secure ingress with authentication and automatic certificates for admin utilities
  • Create cross-project IAM roles
  • Configure Spinnaker to be able to connect to each cluster using the service accounts
  • Deploy Platform Workloads[1] with Spinnaker
  • Configure a pipeline to deploy Business Workloads[2] to each cluster

Why?

Why GKE?

While many workloads run well in Cloud Run or Lambda, some workloads and scales make a Kubernetes (k8s[3]) environment appropriate. Running a Kubernetes environment yourself takes significant work, so consuming a managed service[4] is going to save time and money. When it comes to managed Kubernetes, I think GKE is one of the nicest, along with DigitalOcean’s[5]. While I’ll be building on top of GKE, the later Kubernetes and Spinnaker posts can be adapted to other Kubernetes implementations.

Why Multi-Cluster?

Distance to end users

The round trip time for a packet from the West Coast of the USA to the east coast of Australia is typically 170 to 200ms. When cables fail, and the only way to your service is the long way around the globe, this can grow to over 500ms. This latency affects how your end users perceive your service, so having a cluster closer to them where possible is a quality-of-life improvement.

Life-cycle benefits

Multiple clusters let you take individual clusters offline for maintenance, add new clusters when you grow, and survive outages in a region. Building in this pattern from the beginning of your operation means you have a well-practised procedure for building new clusters and for migrating traffic between them.

Reduction of Blast Radius

Bad changes happen. They get pushed, and nobody realises until users are impacted. Having a small number of large clusters, rather than a single one, allows you to roll out changes gradually and reduces the impact of faulty changes. This isolation combines well with Spinnaker’s pipeline and post-deploy test infrastructure.

Why You Should Look at Spinnaker?

Spinnaker is a specialised deployment tool. It was created to fit the workload requirements and constraints at Netflix, Waze and Google, which gives it powerful capabilities, but it’s certainly not a lightweight service. For many teams and organisations, Spinnaker is more than you need. It is designed and optimised for those who need to manage deployments into multiple environments in multiple regions, and to support many high-velocity services and teams. For smaller organisations, other solutions are good enough.

Pipeline Reuse

The deployment pipeline for a stateless Go service running on a Kubernetes cluster is going to look similar to the pipeline for a stateless Java service running on the same cluster. You can manage this similarity with brute force when you have only a few services. A copy of the pipeline configuration in the repo to extend your CI process is cheap, fast and works. Until it doesn’t. As your environments grow, there are more clusters with more services on them, and the work of maintaining your pipeline configurations grows with them. Being able to template your pipeline and inherit from that template in each service helps reduce this maintenance work.

You might do this with a complex, internally maintained Groovy shared library in Jenkins or a bunch of custom plugins in Buildkite[6]. But I think that by constraining Spinnaker to only doing CD, you reduce the number of cases you need to support and allow a better template implementation. The Spinnaker team are also working on improvements to how pipelines work. The work on the managed delivery feature is quite interesting and has significant potential.

Deployment Audit Logs

In my opinion[7], most ITIL-style change management is less than useful. It’s better to shift this change responsibility to the team building and owning the service in question, allowing them to make their own decisions (informed by best practices, sure) and own this space. I base this on the work of Nicole Forsgren, Jez Humble, Gene Kim and more[8]. Start with these three if you want to go deeper into this topic.

However, having a central log of all the changes in your environments by the various teams is incredibly useful and something that all organisations should try and maintain. Spinnaker can output events onto an external audit log.

Versioning of Deployment Resources

Have you ever[9] gone to deploy a service from a few months ago and found that you’re not deploying the same service? Perhaps the CI build reran and consumed updated libraries. Perhaps your Helm chart is no longer available from your upstream source, or you’re unsure where you found it. Spinnaker versions and stores every deployment, giving you the ability to roll back or redeploy something months after you last did.

Multiple Provider Support

Not all teams run on the same infrastructure. You might be running on App Engine or raw AWS EC2 instances or AWS Fargate or k8s. A tool doesn’t need to support everything, but it needs to at least work with what you do. Spinnaker is a tool that requires some scale in your organisation before it becomes viable. In larger organisations, different parts of the organisation need different providers, and Spinnaker allows you to standardise while supporting many providers.

Canary & Rollback Implementations

Automated canary and rollback make deployments safer. Safer deploys can happen more often, and deploying frequently is a habit that leads to safer deploys. The additional information you get from running canaries, and the confidence that comes with automated rollbacks, deliver safer and faster deploys. Building confidence in your deployment process gives teams better engagement with the process and helps them build better products.

Why Spinnaker instead of …

It’s essential to ask this question. A SaaS[10] CI[11] solution is more than good enough for many organisations. In this space, there are many great providers: Buildkite, GitLab CI, Google Cloud Build, GitHub Actions and more will provide something that is good enough. Organisations with substantial running infrastructure, more than a few teams or a large enough traffic volume can hit scaling issues in these CI products.

Not GitOps?

Well, it depends what you mean by GitOps. If GitOps means having one or more source repositories that describe your environment, and that get consumed by automated systems that manage and validate your deployments, then it’s a great idea. If GitOps gets defined by the practices of tools that use git for multi-cluster coordination and promotion, not so much. Once you have more than a single cluster, automated coordination both across tiers and within them comes into play. How do you manage gradual roll-outs of workloads across multiple clusters when they all pull the same repo? Or manage an automated, coordinated rollback after a partial failure?

Why are you writing all of this?

Because.

Prerequisites

Disclaimers

  • The GKE clusters are not optimal for you. Please don’t blindly copy.
  • The terraform examples in the posts are excerpts. They do not contain all resources or attributes. Check the companion repository for more detail.
  • If you are working through this from start to finish, do not believe in my git log.
  • I am terrible at names.

Workspace Structure

This series has a companion repository at github.com/kcollasarundell/GKE-spinnaker-post. Each post has a branch that represents the state at the end of the post. I’ll walk you through creating most of it if you want to build it yourself, or you can work off a clone of my repository.

Terraform all the things

Create some files in the root of your workspace:

touch main.tf variables.tf local.auto.tfvars

Inside main.tf we’re going to set up our initial Terraform providers. These providers configure our connection to Google. We have both the stable google provider and the beta provider, as some of the features I want to demonstrate are behind the beta flag in GKE. Both get pinned to a version, so we don’t have them randomly upgrading.

main.tf

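# Stable Google provider, constrained to 3.16.x patch releases.
provider "google" {
  version = "~> 3.16.0"
}

# Beta provider, needed for GKE features that are still behind the beta flag.
provider "google-beta" {
  version = "~> 3.17.0"
}

# Look up the billing account so that projects can be attached to it later.
data "google_billing_account" "bills" {
  billing_account = var.billing
}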

We also create a data source to retrieve the billing account information. Billing accounts on GCP are how Google bills you.
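The provider blocks above set the version constraint inside each provider block, which was the usual form when this post was written. From Terraform 0.13 onwards that form is deprecated in favour of a required_providers block. A minimal sketch of the equivalent pinning, assuming Terraform 0.13 or later and reusing the same constraints (the required_version line is my assumption, not something from the companion repository):

```hcl
terraform {
  # Assumption: only needed if you are on Terraform 0.13 or newer.
  required_version = ">= 0.13"

  required_providers {
    # Same constraint as the provider "google" block in main.tf.
    google = {
      source  = "hashicorp/google"
      version = "~> 3.16.0"
    }

    # Same constraint as the provider "google-beta" block in main.tf.
    google-beta = {
      source  = "hashicorp/google-beta"
      version = "~> 3.17.0"
    }
  }
}

# The provider blocks stay, but no longer carry a version argument.
provider "google" {}

provider "google-beta" {}
```

If you stay on Terraform 0.12, the original form in main.tf works as written and none of this is needed.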

Our billing account consumes a variable through the var.billing reference. We need to declare that variable as well. Variables are commonly kept in a variables.tf file to make them easier to look up.

variables.tf

variable "billing" {
  description = "The billing account to tie all projects together"
}

At this point, we are almost ready to go. We still need to pass in our billing account. You can either pass this in on the command line every time or provide it via a tfvars file. In this case, we create a file called local.auto.tfvars. All files with a .auto.tfvars suffix get loaded automatically by Terraform.

local.auto.tfvars

billing = "billingAccounts/000000-FFFFFF-000000"

The value needs to have the billingAccounts/ prefix, followed by your billing account number.
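If you would like Terraform to catch a missing prefix before it talks to Google, a validation block on the variable is one option. This is a sketch only, assuming Terraform 0.13 or later (where variable validation is available); it would replace the plain declaration in variables.tf rather than sit beside it, and the description and error message wording are mine:

```hcl
variable "billing" {
  type        = string
  description = "Billing account, including the billingAccounts/ prefix."

  validation {
    # can() turns a failed regex match into false instead of an error.
    condition     = can(regex("^billingAccounts/", var.billing))
    error_message = "The billing value must start with billingAccounts/."
  }
}
```

The check only looks at the prefix. It does not try to validate the account number itself, so a mistyped number will still only surface when the provider looks the account up.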

We’re also going to need a home for our work. While we could leave it all in the root of the repository, we are going to wrap it in a module so we can isolate it. Our modules live under the modules folder. Our first module is our core module. This module is the entry point into the code that declares the rest of our modules.

mkdir -p modules/core
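Once the folder exists, the root module needs a module block pointing at it, and the module needs to declare any inputs it is given. A minimal sketch of that wiring, assuming a hypothetical input named billing_account (the companion repository may name and wire this differently):

```hcl
# main.tf (root of the workspace)
module "core" {
  # Local path to the folder created above.
  source = "./modules/core"

  # Hypothetical input: hand the looked-up billing account to the module.
  billing_account = data.google_billing_account.bills.id
}

# modules/core/variables.tf
variable "billing_account" {
  type        = string
  description = "Billing account handed down from the root module."
}
```

Keeping the data lookup in the root and passing the result down means the module itself never needs to know where the billing account came from.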

Your repository should look something like the companion repository on the 0-gke-and-spinnaker branch.

And we’re done. At least for this post.

This series is not yet complete and updates are coming.
Subscribe to RSS or follow @kcollasarundell on twitter.


  1. Platform Workloads are things that don’t form a direct part of your business plane. These could be monitoring tools, systems to handle logs or provide access to your support teams. ↩︎

  2. Business Workloads support business goals. These are your API services, web servers, databases and more. In this series, we’ll be using HTTPbin most of the time. ↩︎

  3. Kubernetes is commonly shrunk to k8s because there is a k, then 8 letters, and an s. This follows the same pattern as internationalisation being shrunk to i18n. You also see similar shrinking in authN and authZ to represent authentication and authorisation. ↩︎

  4. Yeah, well, that’s just, like, your opinion, man. ↩︎

  5. I haven’t used Azure K8s or more than a POC of EKS. ↩︎

  6. Buildkite is pretty awesome. It’s one of the least worst CI tools out there. ↩︎

  7. Again. My thoughts/opinion. ↩︎

  8. One day, I’ll collect all these resources and put them somewhere. Mostly so I can have one place to find them. ↩︎

  9. Have you ever, ever felt like this? Have strange things happened, Are you going round the twist? ↩︎

  10. Software As A Service. ↩︎

  11. Continuous Integration. ↩︎