
What the NSA and CISA Left Out of Their Kubernetes Hardening Guide

by Marc Boorshtein

On March 11, 2022, the NSA and CISA released a new edition of their Kubernetes Hardening Guide. This guide dives into "best practices" of how to configure your Kubernetes clusters from a security standpoint. While this guide contains great information about auditing, logging, firewall configurations, and best practices for running containers, it doesn't provide much depth on authentication and authorization in Kubernetes or guidance on how to implement them. This post will fill in that gap.

Before getting to the "what" of Kubernetes authentication and authorization that was missed by the NSA and CISA guide, let's focus on the "why". There is no shortage of data showing how vulnerable our IT infrastructure is to phishing and other social engineering attacks. Verizon's 2021 Data Breach Investigations Report found that 36% of breaches came from social engineering. With such a high volume of breaches coming from social engineering, it's no wonder that OWASP, the Open Web Application Security Project, made broken access controls the number one security risk to web applications, with Identification and Authentication Failures being the number seven risk. While the advice given in the hardening guide is incredibly important, it can all be for nothing if a developer gets phished and lets an attacker in the front door! With that in mind, let's explore what should be added to the hardening guide for both authentication and authorization.

Authentication

The hardening guide doesn't cover much about authentication, other than saying it should be set up. It's important to approach cluster authentication from the perspective of the use cases that need to authenticate:

  • Containers inside your cluster
  • Developers and Administrators
  • External Services and Pipelines

Each of these use cases has unique challenges that must be addressed and can easily be abused. Let's work through each of these use cases and provide some actionable configuration options and why you should use them, starting with your containers.

Container Identity

Why is it so important to be careful with your Pod's identity? The same Verizon report cited above found that nearly as many breaches were caused by basic web application attacks. This means that if an attacker is able to exploit a known vulnerability in a component of your web application, the attacker could take the token that provides your container's identity and use it against your cluster. Minimizing what your container's identity is capable of doing is an important part of mitigating this threat.

When talking about container identity, the hardening guide does a good job of explaining how a Pod gets an identity from a ServiceAccount. It also recommends disabling Pod identity if your Pod doesn't need it, which is good advice. That said, disabling a Pod's ServiceAccount token will lose its effectiveness as more applications rely on the Kubernetes API as a datacenter API, storing more custom configuration objects directly in the API server. This trend makes it more important to have a strong focus on least privilege access controls.
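
For Pods that genuinely don't need to talk to the API server, disabling the token mount is a one-line change in the Pod spec (it can also be set on the ServiceAccount itself). A minimal sketch:

apiVersion: v1
kind: Pod
metadata:
  name: no-api-access
spec:
  # No ServiceAccount token is mounted; this Pod can't authenticate to the API server
  automountServiceAccountToken: false
  containers:
  - name: myapp
    image: myimage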

The guide also points out that if these ServiceAccount tokens are lost, for instance due to a vulnerability in the Pod's application, they could be used to access the cluster and be abused. This is true, but starting in Kubernetes 1.21 a Pod's identity no longer comes from the static Secret generated when a ServiceAccount is created, but instead from the TokenRequest API. The TokenRequest API "projects" a time-constrained token that can be tied not just to a ServiceAccount, but to an individual Pod. A major benefit of these tokens is that the API server will no longer accept them once their Pod dies. For now, an expired TokenRequest API token is still accepted by the API server, but its use is logged so it can be tracked. The long term goal is to move Pod identity to short lived tokens that are useless when they expire, making Pod identity less risky and easier to use. It's important to point out that while a Pod no longer mounts the static token created alongside a ServiceAccount, that token does exist by default until 1.24, making it still important that you use RBAC to protect those Secrets and don't mount them manually into your Pods.
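
To see what the TokenRequest API produces, you can POST a TokenRequest to a ServiceAccount's token subresource (in 1.24 and later, kubectl create token does this for you). A minimal sketch of the request body; the exact audience depends on how your API server is configured:

apiVersion: authentication.k8s.io/v1
kind: TokenRequest
spec:
  # Who the token is intended for; the API server rejects tokens
  # whose audience doesn't match its own
  audiences:
  - https://kubernetes.default.svc
  # 600 seconds is the minimum lifetime the API server will issue
  expirationSeconds: 600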

Another important aspect of container identity to consider is how your containers interact with outside services. Few clusters exist in a vacuum, and they often need to interact with external APIs. Starting in Kubernetes 1.21, you can configure your external services to validate a projected token using the keys from the cluster's OpenID Connect discovery document. This gives an external system everything it needs to validate your Pod's identity without having to make an API call to your cluster. Using this identity for accessing external systems increases the overall security of your environment because there is no longer a static shared secret that an attacker can compromise and abuse without your knowledge.

To make a projected token available for an external API, add it to your Pod's definition:

apiVersion: v1
kind: Pod
metadata:
  name: connect-to-external-service
spec:
  containers:
  - name: myapp
    image: myimage
    volumeMounts:
    # Mount the projected token where the application expects it
    - name: token-vol
      mountPath: "/service-account-for-external"
      readOnly: true
  serviceAccountName: default
  volumes:
  - name: token-vol
    projected:
      sources:
      - serviceAccountToken:
          # The audience is the external service, NOT the API server
          audience: https://my-remote-service.dev/
          # The kubelet rotates the token before it expires
          expirationSeconds: 3600
          path: token

In this Pod definition there's a projected token for the default ServiceAccount with the audience https://my-remote-service.dev/. This token won't be accepted by the API server because it doesn't have the right audience, so a compromised external service can't replay it against your cluster.

(special thanks to Duffie Cooley and Jordan Liggitt for pointing this out to me)

Having explored how container identity can be secured using the TokenRequest API, next you will need to address how people interact with your clusters.

Developers and Administrators

At some point, a person needs to interact with every cluster. Whether it's to apply a manual configuration or to debug an issue, it's going to happen. The hardening guide mentions that each cluster should have authentication enabled, but doesn't provide any specific guidance on what should be enabled and why. When you explore the options for authenticating people into your clusters, there should really only be two options to consider:

  • OpenID Connect
  • Your cloud's integrated IAM

When configuring user access for your cluster, OpenID Connect is presently the most secure option when configured correctly (see the configuration sketch after this list). By correctly, I mean:

  1. Use multi-factor authentication - There is no shortage of research showing that any form of MFA will defeat most phishing attacks
  2. Use short lived tokens - Since OpenID Connect tokens can be easily abused, make your tokens useless quickly. I recommend lifetimes of one minute, relying on kubectl to refresh the user's token as needed. This way, if a user's token does leak, it's unlikely an attacker will be able to use it before it expires.
  3. Include groups in your tokens for authorization. This will be explained in more detail when we get to Authorization.
  4. Know how to revoke a user's session quickly. If someone needs to be revoked quickly, know how to do it with your identity provider.
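
For clusters where you run your own control plane, these choices translate into a handful of API server flags. Here's a minimal sketch using a kubeadm ClusterConfiguration; the issuer URL and client ID are hypothetical and come from your identity provider:

apiVersion: kubeadm.k8s.io/v1beta3
kind: ClusterConfiguration
apiServer:
  extraArgs:
    # Your identity provider's issuer URL; its discovery document must be served over TLS
    oidc-issuer-url: "https://idp.example.com/auth/idp/k8sIdp"
    oidc-client-id: "kubernetes"
    # Claims in the id_token that map to the user's name and groups
    oidc-username-claim: "sub"
    oidc-groups-claim: "groups"
    # Prefixes keep federated identities from colliding with built-in users and groups
    oidc-username-prefix: "oidc:"
    oidc-groups-prefix: "oidc:"

On managed clusters, where you can't always set these flags, your cloud's IAM or an impersonating proxy fills the same role.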

There are multiple OpenID Connect identity providers, both commercial and open source, and it would be outside the scope of the hardening guide to recommend one. That said, I will shamelessly plug our own OpenUnison! It follows all of the above configuration guidance out-of-the-box and works with LDAP, Active Directory, GitHub, OpenID Connect, and SAML. OpenUnison provides secure access to kubectl, the Kubernetes Dashboard, and your other management applications. Finally, it works with both your on-prem clusters and your cloud managed clusters.

In addition to OpenID Connect, your cloud's IAM option is also going to be a good choice from a security standpoint. Cloud providers spend considerable resources securing and integrating their IAM into Kubernetes. There are valid reasons to use OpenID Connect with cloud managed clusters via an impersonating reverse proxy (like kube-oidc-proxy with OpenUnison), but that's outside the scope of a hardening guide.

Certificates should be avoided, except for "break glass in case of emergency" scenarios. I cover this topic in detail in a previous blog post. I also did a lightning talk on the subject at KubeCon NA Security Day 2020. In short, there are three reasons to not use certificates for authenticating to your cluster:

  1. Kubernetes has no way to test if a certificate has been revoked. The only way to keep a lost key pair from being used against your cluster is to completely re-key your cluster.
  2. It's very hard to include groups, which may change, in a certificate. If your groups do change, see #1.
  3. For certificate authentication to maintain its security, the keys and certificates must be generated and signed properly. This process is easy to do wrong and in most cases is not done correctly.

If someone were to lose a keypair due to a social engineering attack, or just because they copied it to the wrong thumb drive, there's no way to disable the compromised keypair without re-keying the entire cluster. There are ways to bind a certificate to a hardware token, like a smart card, but kubectl doesn't know how to work with these technologies.

Most cluster deployments will generate a master certificate that bypasses RBAC (more on that in the Authorization section), and this should only be used when your identity provider is not available.

In addition to not using certificates for authentication, do not use ServiceAccount tokens for users, for multiple reasons:

  1. ServiceAccount tokens can't have groups, making authorization configuration difficult
  2. Static ServiceAccount tokens never expire; the only way to revoke one is to delete it
  3. ServiceAccount tokens were never designed to be used from outside of your cluster, leaving you open to unintended consequences
  4. Starting in Kubernetes 1.24 ServiceAccount tokens won't generate Secrets by default

Lastly, never write your own authentication. Writing your own authentication is similar to writing your own encryption. Every instance I've seen has either been a pale imitation of OpenID Connect or a scheme to "pass the password". Leverage existing standards with thousands of deployments and thousands of hours of review by experts.

To learn more about the details of how Kubernetes authenticates users, I will shamelessly plug the book I co-authored: Kubernetes: An Enterprise Guide 2nd Ed. You can get the chapter on authentication 100% free, no registration required! We also have a YouTube Channel with videos from the labs, including setting up user authentication.

Having explored how people should authenticate to your cluster, the last use case to explore is how external services and pipelines will authenticate to your cluster.

External Services and Pipelines

One of the hardest access patterns to properly secure is an external process, like a pipeline, reaching into a cluster. Pipelines, and other external services, are integral to the success of most Kubernetes deployments. There are three ways an external service should authenticate to a cluster:

  • OpenID Connect
  • Cloud IAM Identity
  • Using an impersonating reverse proxy

Similar to your users, pipelines and external services can leverage your OpenID Connect identity provider to get a short lived token without sharing a password or static access key with your API server. This token can be retrieved using a stronger credential than a password, such as a certificate. This gives the same benefits as integrating your users and can be leveraged from most of the Kubernetes SDKs. As an example, you can use Okta and OpenUnison to securely access your API server.
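
For example, a pipeline's kubeconfig can delegate token acquisition to an exec credential plugin so that no static token is ever stored in the file. A sketch of the user entry; the command name is hypothetical and stands in for whatever tooling your identity provider offers:

apiVersion: v1
kind: Config
# clusters and contexts omitted for brevity
users:
- name: pipeline
  user:
    exec:
      apiVersion: client.authentication.k8s.io/v1beta1
      # Hypothetical helper that authenticates to your identity provider
      # with a strong credential and prints a short-lived id_token
      command: get-oidc-token
      args:
      - --issuer
      - https://idp.example.com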

If your service is running in the same cloud as your cluster, it likely has an identity that your cluster will recognize. Use this identity, as it's the most likely to be securely provided to your service.

Finally, using a reverse proxy with OpenID Connect is a good option as well. In this scenario, a reverse proxy is configured to recognize the OpenID Connect tokens generated by your pipeline without needing an external identity provider. For instance, if your pipeline runs in one cluster but interacts with another cluster, the reverse proxy can be configured to trust the OIDC discovery URL of the pipeline cluster's API server, letting the pipeline Pod's identity be used without a shared secret. The reverse proxy then uses impersonation to provide the appropriate rights for each call. This configuration means there are no static credentials, but it requires careful authorization configuration to make sure access can't be escalated.
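
Impersonation itself is just an RBAC-governed verb, which is why that authorization configuration matters so much. A minimal sketch of the ClusterRole an impersonating proxy needs; bind it only to the proxy's own identity, since holding it is equivalent to holding the access of every user and group it can impersonate:

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: impersonator
rules:
# The "impersonate" verb on users and groups lets the proxy act
# on behalf of the authenticated caller
- apiGroups: [""]
  resources: ["users", "groups"]
  verbs: ["impersonate"]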

What you should not do is use certificates or static ServiceAccount tokens in your pipelines, for the same reasons they shouldn't be used for user accounts.

In this section we covered access for users, containers, and external services. Rely on short lived tokens or cloud IAM whenever possible, and avoid certificates and static ServiceAccount tokens. Having hardened your authentication, next we'll explore how the hardening guide could have addressed authorization.

Authorization

The hardening guide provides more details about authorization, with generally good advice:

  • Use Kubernetes' built-in Role Based Access Control for authorizing access to the API server
  • Use ClusterRoles for providing access to namespaced objects so you can re-use the same permission sets across your cluster
  • Use a least-privilege approach

This is important advice, but it misses some context that flows from the authentication section above. First, use groups for authorization whenever possible. Auditing access based on RoleBinding and ClusterRoleBinding objects is painful, requiring you to enumerate each one to identify who has access. For instance, if your bindings all referenced users directly and you wanted to answer the question, "Who is a cluster-admin?", you would need to go to the cluster-admin ClusterRoleBinding and look at each account. This becomes more difficult when you take into account that many clusters grant permissions that span multiple bindings. If you instead use a group, you can quickly query your identity provider, which is much better suited to this type of ad-hoc querying. Also, if you want to remove a user's access directly, you need to find that user across every binding, but if you're using groups you can simply remove the user from the group. It makes for a much easier to manage and audit environment.
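
For example, here's a RoleBinding that grants the built-in admin ClusterRole to a group asserted by your identity provider; the group and namespace names are hypothetical:

apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: team-alpha-admins
  namespace: team-alpha
subjects:
# The group comes from your identity provider via the groups claim
- kind: Group
  name: team-alpha-admins
  apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: ClusterRole
  name: admin
  apiGroup: rbac.authorization.k8s.io

Now "who administers team-alpha?" is a single query against your identity provider, and removing someone is just a group membership change.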

This is harder to do with ServiceAccounts, because you can't assign them to arbitrary groups. This is one reason why it's not a good idea to use them from outside of your cluster.

The other important step you should take in your clusters is to automate your RBAC object creation, especially in multi-tenant clusters. This helps to avoid misconfigurations and makes it easier to track down who has access and identify improperly created access controls. I talked about this and demoed it at KubeCon EU 2021 in I Can RBAC and So Can You! In that session I used OpenUnison to automate the provisioning of "Teams" that had access to multiple namespaces by way of Fairwinds' RBAC Manager. Using automation, we didn't need to create any new ClusterRoleBindings or individual RoleBindings in each new namespace. We created a CRD that we associated with a "team"; every namespace created for that "team" had access granted to the "team" members without any manual object creation. This leaves you a clear path for audits because now, if you find RBAC objects outside of that pattern, it raises a flag for investigation.
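
As a rough sketch of that pattern, here's the kind of object Fairwinds' RBAC Manager consumes; the team and label names are hypothetical:

apiVersion: rbacmanager.reactiveops.io/v1beta1
kind: RBACDefinition
metadata:
  name: team-alpha
rbacBindings:
- name: team-alpha-admins
  subjects:
  - kind: Group
    name: team-alpha-admins
  roleBindings:
  # Bind the admin ClusterRole in every namespace labeled for this team,
  # including namespaces created later
  - clusterRole: admin
    namespaceSelector:
      matchLabels:
        team: alpha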

Finally, as Rory McCune pointed out while reviewing this post, make sure to review the RBAC configuration that comes with the systems you install. It's really easy to write rules with stars (wildcards) that request too much, leaving your clusters open in unexpected ways. If you want RBAC permissions that can safely change over time without being edited directly, use an aggregated Role or ClusterRole to dynamically generate the correct permissions based on future requirements without resorting to stars in your RBAC definitions. For example, take the admin ClusterRole; it includes:

aggregationRule:
  clusterRoleSelectors:
  - matchLabels:
      rbac.authorization.k8s.io/aggregate-to-admin: "true"

This means that every ClusterRole with the label rbac.authorization.k8s.io/aggregate-to-admin: "true" will be included in the admin ClusterRole. When you define a new resource that you want namespace administrators to be able to create, you define a ClusterRole with this label and the RBAC aggregation controller will automatically update the admin ClusterRole for you.
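
For example, to let namespace administrators manage a hypothetical MyDatabase custom resource, you'd opt it in explicitly:

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: aggregate-mydatabases-to-admin
  labels:
    # The aggregation controller merges these rules into the admin ClusterRole
    rbac.authorization.k8s.io/aggregate-to-admin: "true"
rules:
- apiGroups: ["db.example.com"]
  resources: ["mydatabases"]
  verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]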

Why is it so important to avoid stars in your RBAC definitions? Here's a concrete example. Every cloud provider has an operator that lets you, as the user, define cloud resources via the Kubernetes API. You need an S3 bucket? Create an object in your namespace and the operator will see it and create the S3 bucket for you. What about something that incurs cost based on its existence though, like a database? If the admin ClusterRole just had stars instead of specifically listing every permission a namespace administrator should have, there would be no way to use RBAC to limit who can create a database, an IAM role, or any other type of AWS infrastructure you can think of. By using an aggregated ClusterRole for namespace admins, simply installing a custom resource doesn't mean a namespace administrator can create an instance of it; that access needs to be explicitly authorized. As Ian Coldwater says, "We are all made of stars, but your RBAC shouldn't be"!

Conclusions and Special Thanks

The hardening guide from the NSA and CISA is an important starting point for securing your clusters, but it's not a complete guide. It's important that you dive deeper into how you authenticate your users and authorize access to resources than what the hardening guide provides. Once again, I will shamelessly plug Kubernetes: An Enterprise Guide 2nd Ed as a great place to start. In addition to the free chapter on authentication, we dive into the specifics of implementing RBAC, building admission policies with GateKeeper, implementing node security policies with GateKeeper, Istio identity integration, and finally automating your GitOps infrastructure.

Special thanks to Rory McCune and Joshua Paul for reviewing this post and providing great feedback!
