# Kubernetes at LIFX

### Kevin Collas-Arundell

#### [@kcollasarundell](https://twitter.com/)

#### 09 Aug 2018
Presented to REA devops guild

!section

### Caveats

* Will probably not scale to larger teams
* Practices developed for a small team
* High adjacency to domain knowledge
* Mostly well-behaved services that follow most of the 12-factor pattern
* Ignoring stateful applications

Note:

!section

Speaker notes are incomplete; they are coming soon

!section

### Where we were

* Mesos/Marathon cluster
* Self-managed systems
* Internally developed tools to expose services
* Old versions of all the things
* Super small team

Note:
LIFX was already operating in a way similar to Kubernetes; organisationally we were already there.
Most of the changes in moving to Kubernetes revolved around the difference between Services and actual load balancers. But spoilers.

!section

## Why Kubernetes?

* Good primitives to build on
* Good momentum
* More integrated solutions
* Does out of the box what we needed custom code to do

Note:
We were looking at upgrading Mesos and found that this was going to be difficult.
That led us to look at the other options available.
Kubernetes seemed to have much more momentum when we first looked at this.
Better tooling allowed us to remove code from our deployment pipeline and from our services.

!section

### Why GKE?

* Limited ops time
* Better to spend time on more valuable tasks
* Smaller systems to manage
* No Kubernetes controllers to manage

Note:
When we were looking at solutions we compared:

* upgrading Mesos
* building a new DC/OS cluster
* running our own Kubernetes cluster

Going with GKE minimised the amount of day-to-day admin work and gave us automated management of the underlying systems.

!section

### Migration time

<video data-autoplay loop src="/talks/slow.mp4"></video>

!section

## First thing to do is make "all" the tools

### for a very small value of all

Really just a small Go app using [text/template](https://golang.org/pkg/text/template/) and [crypto/aes](https://golang.org/pkg/crypto/aes/), piping rendered templates to kubectl, with bash to glue it all together.

Note:
There was some concern that the options available when we started the migration didn't fit our use case overly well.
Combined with our simple workflow, some of those tools felt over-engineered.
So text/template and yaml files are how we work.
With the release of ksonnet and the upcoming Helm 3, this probably needs to be reviewed.

!section

### "Normal" migration process

* Create a build pipeline
* Create walls of yaml
* Test deploy
* Change dns
* Profit?
* Realise you missed several blobs of yaml or a broken lib<!-- .element: class="fragment fade-in" data-fragment-index="1" -->
* Revert dns change <!-- .element: class="fragment fade-in" data-fragment-index="2" -->
* Update yaml, fix libraries <!-- .element: class="fragment fade-in" data-fragment-index="2" -->
* Deploy <!-- .element: class="fragment fade-in" data-fragment-index="2" -->
* Change dns <!-- .element: class="fragment fade-in" data-fragment-index="2" -->

Note:
We eventually worked out a process that worked for most of our services.
Create a pipeline and the yaml, push it out, test it, and update dns. Super easy.
Except then you notice you missed some env vars, or a library is broken, and you roll back and fix and push, and roll back and fix and push.
In the end we got much better at this.
Several services were migrated in less than one person-day each. Annoying ones took longer, but still.
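To make the "walls of yaml" concrete: a minimal sketch of the kind of templated manifest a text/template-plus-kubectl workflow like ours might render. The variable names and fields here are hypothetical, not our actual templates.

```yaml
# deployment.yaml.tmpl: rendered with Go's text/template, then piped to kubectl apply -f -
apiVersion: apps/v1
kind: Deployment
metadata:
  name: {{ .Name }}
spec:
  replicas: {{ .Replicas }}
  selector:
    matchLabels:
      app: {{ .Name }}
  template:
    metadata:
      labels:
        app: {{ .Name }}
    spec:
      containers:
        - name: {{ .Name }}
          image: {{ .Image }}   # hypothetical per-service image reference
```

A few lines of bash to render each service's values through the template and apply the output is most of the glue.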
Since we now had this process, it seemed a great time for...

!section

<!-- .slide: class="center" -->

### Holiday

<video data-autoplay loop src="/talks/holiday.mp4"></video>

Note:
We had only migrated a handful of services when I escaped for a month.
When I came back, Nick, Stephen and Dan had migrated pretty much 75% of our services.
Pretty awesome tbh, much better than my first Christmas at LIFX.
The earlier pattern of deployments continued as we slowly ground out our services.
That is, until we pretty much ran out of "normal" migrations.

!section

#### Abnormal migrations

##### The LIFX Broker

* One tcp connection from every device
* Super long tcp connection lifetimes
* Many, many, many connections
* Perfect thundering herd

!section

#### Mesos deploys

![Not great](/talks/deployMesos.png)

Note:
Mesos deployments on the version we ran were not overly controllable and had restrictive grace periods.
This led to the most appropriate deployment being a rolling restart.
The problem is that the last broker to be restarted has gained a share of every earlier broker's connections and needs to shed them over the same period, so it's a bit choppy.

!section

#### Home-baked deploy system

![Shiny](/talks/deployCustom.png)

Note:

!section

#### Kubernetes built-in deployments

![dope](/talks/deployKube.png)

Note:
This was the deploy yesterday. In fact it was 3 deploys, with 2 existing.

!section

#### New connections

![Super Dope](/talks/deployNewConnections.png)

Note:
Unlike the other deployments, this is actually multiple deploys going out concurrently, pushing a bugfix we discovered during the deploy.

!section

#### Lessons

* Services and Deployments just worked

```yaml
maxSurge: "100%"
terminationGracePeriodSeconds: 3600
```

Note:
With two lines of yaml we replaced thousands of lines of code.
Two lines!
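For context, a hedged sketch of where those two fields sit in a full `apps/v1` Deployment; the name, image and replica count are made up, not our production config:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: broker                  # hypothetical name
spec:
  replicas: 3
  selector:
    matchLabels:
      app: broker
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: "100%"          # bring up a full replacement set before draining the old one
      maxUnavailable: 0
  template:
    metadata:
      labels:
        app: broker
    spec:
      terminationGracePeriodSeconds: 3600   # an hour for long-lived tcp connections to shed
      containers:
        - name: broker
          image: example/broker:1.2.3       # hypothetical image
```

`maxSurge: "100%"` starts a whole new set of brokers alongside the old ones, and the hour-long grace period lets each old broker shed its connections gradually instead of dumping them all at once.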
!section

### 11 months from initial commit to almost finished

![it is always dns](/images/itsdns.png)

Note:
It's always dns.

!section

<!-- .slide: class="center" -->

## Here be ~~dragons~~ conjecture

!section

<!-- .slide: class="center" -->

<video data-autoplay loop src="/talks/opinion.mp4"></video>

Opinions here are my own

!section

#### Kubernetes Suitability & Capability

<!-- .slide: class="center" -->

* Self-managed Kubernetes clusters and small teams

![Nope](/talks/nope.png) <!-- .element: class="fragment fade-in" data-fragment-index="1" -->

Note:
As a small team we just would not have the internal bandwidth to run and support a Kubernetes stack without the ability to leverage tools like GKE.
Even then, a team with less operational experience or more feature pressure is probably better aimed towards more managed solutions.

!section

### Where does Kubernetes fit

| Number of services -> | Few | Many | Many Many |
| --- | --- | --- | --- |
| 1 team | bash for loops | Probably* | Yes* |
| Several teams | Perhaps Nomad or Mesos? | Yes | Very yes |

\* Managed Kubernetes only

Note:
Kelsey Hightower had a tweet about when you should start looking at Kubernetes:

* 1 machine: just use ssh
* 2 machines: wrap it in a for loop
* 3 machines: Puppet or similar
* 5 machines: Kubernetes

I think that's accurate-ish, but I think it's really a services × teams relationship.
This applies to all the cluster management systems (Kubernetes, DC/OS, Nomad), though in different amounts.
Nomad fits the fewer-services-but-many-servers workflow really well, as it doesn't add much service abstraction, while DC/OS and Kubernetes work better with many teams and many services, as they provide nicer abstractions.

!section

### Blobs of yaml and bash

What could go wrong

!section

<!-- .slide: class="center" -->

### Config Drift

<video data-autoplay loop src="/talks/drifting.mp4"></video>

### So much config drift

Note:
Solution: :shrug:

!section

### Tooling

* Communication between teams is going to be an issue
* "Best practice" will evolve over time
* Tooling needs to support changing deployment patterns and config patterns over time

Note:
You probably don't want committed yaml in all the many repos.

!section

### Best Kubernetes features (IMNSHO)

* Nodes are just nodes
* Sidecars
* Webhooks make process automation easier

Note:
* We don't use much of Kubernetes
* Economy of scale: it gets better with size
* Large clusters are better than small ones, and per-team clusters are :|

We don't use much of Kubernetes. The more significant capabilities of Kubernetes require larger scale than we really run at.

Nodes in Kubernetes are just empty boxes with docker (for now) and a kubelet binary. This simplifies your ami builds and management requirements. You don't even need to grant access into the nodes for "local" debugging, as the kubelet gives you strong rbac and remote-attach capabilities.

Sidecars let you build composable services. This works well when you have specialised processes consumed across teams, or for service mesh solutions and other tooling.

Webhooks are basically compliance and integration hooks. Admission controllers let you proactively check the state of all the things as they are being created or changed, rather than reactively on an event. Mutating webhooks can add premade objects to resources to reduce the amount that individual teams have to manage.

!section

### Best Kubernetes features (continued)

* Operators and custom resources
* Abstracts infrastructure to allow common deployment systems and patterns
* Tooling being built for multiple clusters

Note:
Operators == codified operational knowledge: automation to provision or manage collections of services. You see these as ways to manage sets of pods for various services (the Elasticsearch operator, KubeDB, the etcd operator); they build and maintain cluster resources in repeatable ways.
Custom resources allow description of custom services, in the same way a Deployment maps to pods: "etcd of 5 nodes with version x", or handling the upgrade of your ES cluster.
This is easy to build for internal services as well: the metacontroller or the CoreOS operator framework give you systems to use in your own operators, for your own internal practices.
Most services aren't special; once you satisfy their package dependencies they can be abstracted away.
A lot of what Kubernetes provides is a smoother abstraction of the underlying infrastructure, though most applications don't really need that flexibility.
Building common tooling on top of Kubernetes constructs (to borrow a term from the Spinnaker project) a paved road. That means teams don't need to redevelop deployment processes or carry a large amount of boilerplate.
Tools like Spinnaker, with templated pipelines that can be reused many times, combined with the smaller contract of Kubernetes, allow you to reuse large amounts of work developed by others (either inside or outside the company).
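To make the custom-resource point concrete: this is roughly what the etcd operator's `EtcdCluster` resource looked like around this time (treat it as a sketch and check the operator's own docs for the exact schema). The whole "etcd of 5 nodes with version x" description is just:

```yaml
apiVersion: etcd.database.coreos.com/v1beta2
kind: EtcdCluster
metadata:
  name: example-etcd   # hypothetical cluster name
spec:
  size: 5              # the operator keeps five members running
  version: "3.2.13"    # and handles upgrades between versions
```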
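And for the webhook point on the previous slide: a hedged sketch of registering a mutating admission webhook against the `v1beta1` API of the day. The webhook name, service and path are hypothetical.

```yaml
apiVersion: admissionregistration.k8s.io/v1beta1
kind: MutatingWebhookConfiguration
metadata:
  name: sidecar-injector                # hypothetical
webhooks:
  - name: inject.sidecars.example.com   # hypothetical
    clientConfig:
      service:
        name: sidecar-injector          # in-cluster service that mutates pods
        namespace: kube-system
        path: /mutate
      caBundle: ""                      # base64 PEM bundle for the webhook's serving cert
    rules:
      - operations: ["CREATE"]
        apiGroups: [""]
        apiVersions: ["v1"]
        resources: ["pods"]
    failurePolicy: Ignore               # don't block pod creation if the webhook is down
```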
!section

## TL;DR

### Do I think you should strongly consider Kubernetes?

<video data-autoplay loop src="/talks/Yarp.mp4"></video>

!section

### Do I think it's going to be easy?

![Nope](/talks/nope.png)

!section

This migration would not have succeeded if it wasn't for the other team members. They built a lot of the tools, developed many of the processes and did most of the migrations.

* Daniel
* Stephen
* Nick
* Carl

!section

## Questions?

<video data-autoplay loop src="/talks/balancer/cat.mp4"></video>

Note:
The cat is there to distract you from asking questions.

!section

### References coming soon