22 May 2020

1597 words 8 mins read

Building GKE Clusters

Now that we have an environment to run GKE clusters in, we should probably build out a few clusters. Thankfully this post shouldn’t be as wordy as the last, but there is still a bunch to cover. We’re going to go over some of the features of GKE and the value they provide.

This is part of my Multi-Cluster GKE series. Check the other posts for more information.

Design

Building on the objectives we identified earlier, we want to consider the goals for our clusters and how we can build for them.

  • Runs without public IPs on nodes
  • Supports Spinnaker
  • VPC-Native mode

Both Spinnaker and Kubernetes carry a non-trivial operational overhead, and successful implementations need to optimise around that. To justify the overhead, we need to support multiple teams using the clusters as a shared platform. We also want to enable features that reduce the toil involved in operating and maintaining the clusters. There are also some small1 security issues we need to address when Terraform builds a GKE cluster. Finally, there is the cost of running the clusters. I’m covering the cost of all the infrastructure I use in this series, and I want to minimise how much it costs me. These choices do have side effects that need consideration.

  • Support multiple workloads with different access requirements
  • Support cross-project access through service accounts
  • Reduce management overhead
  • Quick security wins and clear mistake prevention
  • Reduce cost

Implementation

We’ll expand our gke module from the previous post. Thankfully we’re only going to be creating two resources, one data object and a handful of outputs in this post.

Our two resources are google_container_cluster and google_container_node_pool . Both have a fair few options, and this post won’t touch on even half of them, so I recommend checking out the Terraform documentation for these resources. The code blocks below are excerpts, so if you want the full resources, check the companion repository linked at the end.

modules/common/gke/main.tf

This first section contains a bunch of boilerplate. It sets up our loop, ensures we use the beta provider (for certain features later) and sets the name, location and project to run in.
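Before the resource itself, here’s a rough sketch of the clusters map the loop consumes. The real definition lives in the companion repository; the attribute names below are inferred from how the module uses them (each.value.region, control_network and node_size), so treat this as illustrative rather than the actual code.

variable "clusters" {
  # Hypothetical sketch: keyed by a short cluster name, with the attributes
  # this module reads further down.
  type = map(object({
    region          = string # zone or region used as the cluster location
    control_network = string # /28 CIDR for the control plane peering
    node_size       = string # machine type for the node pool
  }))
}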

resource "google_container_cluster" "clusters" {
  for_each = var.clusters
  provider = google-beta
  name     = "${var.name}-${each.key}"
  project  = var.project
  location = each.value.region

The location attribute has multiple purposes. If we specify a single zone, it creates a zonal cluster. A zonal cluster can run nodes in any Availability Zone (AZ) in the region, but its control plane runs in a single AZ. If the AZ containing your control plane has issues, your workloads keep running, but without the control plane the cluster is quite degraded. Specifying a region instead gives you a regional cluster, with the control plane replicated across the region’s AZs.
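As a quick illustration, the difference is only in what gets passed as location (the zone and region names here are hypothetical examples):

  # Zonal cluster: a single control plane replica in one AZ
  location = "australia-southeast1-a"

  # Regional cluster: control plane replicated across the region's AZs
  location = "australia-southeast1"

Next is our network configuration.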

We start by defining the network and subnetworks we need. Subnetworks are exclusive to each cluster, and we use a map lookup against our earlier subnetwork resources. ip_allocation_policy ensures we build a VPC-native cluster. You’re going to want a VPC-native cluster in most cases, as it lets you use higher-performing features like Network Endpoint Groups (NEGs), which give you better traffic distribution for services exposed via a load balancer.
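The network itself comes from a data lookup against the host VPC. The exact arguments live in the companion repository; a hedged sketch might look like the following, where var.host_project and var.host_vpc_name are assumed variable names rather than the module’s real ones.

data "google_compute_network" "host-vpc" {
  # Hedged sketch of the host VPC lookup; variable names are assumptions.
  project = var.host_project
  name    = var.host_vpc_name
}

With that lookup in place, the cluster references it along with the per-cluster subnetwork and secondary ranges: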

  network    = data.google_compute_network.host-vpc.self_link
  subnetwork = google_compute_subnetwork.subnets[each.key].id
  ip_allocation_policy {
    cluster_secondary_range_name  = google_compute_subnetwork.subnets[each.key].secondary_ip_range[0].range_name
    services_secondary_range_name = google_compute_subnetwork.subnets[each.key].secondary_ip_range[1].range_name
  }

  private_cluster_config {
    enable_private_endpoint = false
    enable_private_nodes    = true
    master_ipv4_cidr_block  = each.value.control_network
  }

The last block in this snippet contains the settings for a private cluster. enable_private_endpoint disables the public control plane API endpoint. It’s not the best name, as the private endpoint exists as soon as you set enable_private_nodes; that attribute is also what removes the public IPs from the nodes. The master_ipv4_cidr_block attribute controls the range used to peer with the GKE-hosted control plane. This CIDR block needs to be a /28 and also needs to be unique throughout the network.

Keeping a Kubernetes cluster up to date is a significant amount of work. The rapid release cycle, combined with our ever-increasing backlog, can make it challenging to keep up. To help reduce this toil, we’ll use the release_channel feature from the beta provider. These channels let you decide on your risk-versus-feature appetite, as they set both the upgrade and patch frequency. I’ve chosen the Rapid channel for this cluster so I can validate upcoming releases and use recently released features.

  release_channel { channel = "RAPID" }
  maintenance_policy {
    daily_maintenance_window {
      start_time = "04:00"
    }
  }

The release channel, combined with the maintenance_policy (which in this case just sets the start_time of a daily maintenance window) and the auto_upgrade and auto_repair attributes we’ll see on the node pool, can substantially reduce the work required to stay up to date. You may not want to run all your clusters on the rapid cycle, but some distribution across channels, so you can validate inbound changes before they reach everything, is advisable. Additionally, upgrade and repair settings can be configured per node_pool, which helps you stagger change across pools.
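One way to get that distribution could be to drive the channel and window from the clusters map rather than hard-coding them. This is a hedged sketch only; channel and maintenance_start are hypothetical attributes, not part of the module as written.

  # Hypothetical variation: per-cluster channel and window driven from var.clusters
  release_channel { channel = each.value.channel }
  maintenance_policy {
    daily_maintenance_window {
      start_time = each.value.maintenance_start
    }
  }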

Now for some of the security options. enable_shielded_nodes turns on a set of node verification and trust systems, including secure boot and firmware validation. It’s worth running if it’s available, but you should check the documentation first. We also configure some things in our master_auth block that really should be the defaults.

  enable_shielded_nodes = true
  master_auth {
    username = ""
    password = ""
    client_certificate_config {
      issue_client_certificate = false
    }
  }

Without the master_auth block, the GKE API (as called by Terraform) generates a basic-auth username and password, which is problematic.

I am not fond of this

But if we set the username and password fields to empty strings, we can guarantee that the GKE API does not create this user. To shut off another problematic authentication method, we set the client_certificate_config block and ensure client certificates are not issued. Client certificates are not rotatable or revocable, and if one leaks you should delete your cluster. Neither method is wise for day-to-day operation, nor is either required by services. It’s not great that these attributes have to be specified to prevent problems.

To allow different workloads to run with different permissions, we need a few blocks. In our cluster configuration we add a workload_identity_config block. identity_namespace is always going to be the same value, as GKE only supports a single identity namespace per project.

  workload_identity_config {
    identity_namespace = "${var.project}.svc.id.goog"
  }

Later, in our node pool, we need to make sure we configure the workload_metadata_config attribute, or the node pool gets tainted and destroyed on every run. Again, a mildly problematic implementation.

Our final bit of cluster code is all about the initial nodes. As we’re managing our node pool separately, we don’t want to keep the default node pool, so we set a small initial count and remove the default pool as soon as provisioning completes. As Terraform refreshes the state of the cluster every run, we need to configure the lifecycle block to ignore these changes and prevent unwanted replacement.

  remove_default_node_pool = true
  initial_node_count       = 1
  lifecycle {
    ignore_changes = [node_config, node_pool, initial_node_count]
  }
}

The cluster we have at this point isn’t overly useful. We need to add some compute nodes to give us somewhere to run our pods.

resource "google_container_node_pool" "clusters" {
  for_each   = var.clusters
  provider   = google-beta
  name       = "${each.key}-pool"
  project    = var.project
  location   = each.value.region
  cluster    = google_container_cluster.clusters[each.key].name
  autoscaling {
    min_node_count = 1
    max_node_count = 10
  }

Again we have a bit of boilerplate, and then we have the autoscaling block. Autoscaling allows the node pool to grow as utilisation increases and shrink again to reduce cost. We then get on to toil minimisation. Earlier we configured a release channel and maintenance window; the next two blocks build on those. auto_repair removes faulty nodes from your cluster and replaces them. auto_upgrade upgrades our node pool automatically.

  management {
    auto_repair  = true
    auto_upgrade = true
  }
  upgrade_settings {
    max_surge       = 1
    max_unavailable = 0
  }

The upgrade_settings block allows us to configure the velocity of our upgrades, much like the Kubernetes maxSurge and maxUnavailable options. max_surge allows us to add up to one node (in this case) to the pool during an upgrade. max_unavailable makes sure we don’t reduce capacity during an upgrade. The proportions here change over time and depend on your desired upgrade velocity and disruption budgets, as well as the behaviour of your workloads. Some workloads make a rolling upgrade unsuitable; in those cases a full blue-green swap to a new node pool may be required.

  node_config {
    preemptible  = true
    machine_type = each.value.node_size
    disk_size_gb = 10
    disk_type    = "pd-ssd"
    oauth_scopes = [
      "https://www.googleapis.com/auth/logging.write",
      "https://www.googleapis.com/auth/monitoring",
    ]

Small isolated Kubernetes clusters have similar operational support requirements to much larger shared clusters. This continual overhead is something to consider when designing your compute infrastructure; aiming for an environment whose overhead scales sublinearly as it grows enables many efficiencies. Unfortunately, most cloud provider IAM solutions grant identity per compute node rather than per workload, which results in all workloads on the same node having the same IAM privileges. As you are likely operating a shared cluster, this is not optimal. Workload Identity is how GKE connects a Kubernetes Service Account (KSA) with a Google Service Account (GSA). The workload identity service runs in your cluster and proxies connections to the GCP metadata service; replacing the metadata service is what lets it map one service account to the other. Amazon EKS has a similar solution, and tools like Kiam and Kube2iam fill the same role when using Kops or Kubespray.

    workload_metadata_config {
      node_metadata = "GKE_METADATA_SERVER"
    }
  }
}
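Workload Identity still needs a binding between a KSA and a GSA before pods can use it. That binding isn’t part of this module; a hedged sketch, using a hypothetical my-app service account in the default namespace, looks roughly like this. On the Kubernetes side, the KSA also needs the iam.gke.io/gcp-service-account annotation pointing at the GSA’s email.

# Hypothetical example: let the KSA default/my-app impersonate a matching GSA.
resource "google_service_account" "my_app" {
  project    = var.project
  account_id = "my-app"
}

resource "google_service_account_iam_member" "my_app_workload_identity" {
  service_account_id = google_service_account.my_app.name
  role               = "roles/iam.workloadIdentityUser"
  member             = "serviceAccount:${var.project}.svc.id.goog[default/my-app]"
}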

There are some additional outputs, but we’ll start using those in a later post.
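As a rough idea of the kind of thing they cover (the real outputs are in the companion repository), an output exposing each cluster’s endpoint might look like:

output "endpoints" {
  # Hypothetical sketch; the module's actual outputs live in the companion repository.
  value = { for name, cluster in google_container_cluster.clusters : name => cluster.endpoint }
}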

This series is not yet complete and updates are coming.
Subscribe to RSS or follow @kcollasarundell on twitter.


  1. Not small at all. ↩︎