By Merlin Carter
Co-author: Alencar Souza, Senior DevOps Engineer at Project A
It’s always a delicate task, connecting one system to another — especially when it comes to mapping access control features. For example, Kubernetes has its own role-based access control (RBAC) system, and the various cloud providers who host Kubernetes clusters all have their own separate Identity and Access Management (IAM) systems. The two need to be connected.
Kubernetes began supporting RBAC back in 2017. But understandably, customers who used managed Kubernetes solutions didn’t want to manage access control in two places. They wanted to connect the Kubernetes built-in RBAC with the IAM systems of the environment that hosted it.
Google’s GKE, Amazon’s EKS, and Microsoft’s AKS all tackled this requirement, albeit with varying degrees of success. In our experience, Google did the best job, and Amazon did the worst. I’ll describe why in a moment, but in fairness, I want to point out that Kubernetes was born at Google, so it’s not exactly a surprise that they’re the best at integrating it.
Why this is important
IAM and access control are crucial parts of every business system. But some access control systems are so mind-meltingly complex that very few employees actually understand how they work. This leads to misunderstandings, mistakes, or just plain old “CAFA,” also known as “Cluster Admin for All” (thanks go to Cat Cai, Production Engineering manager at Shopify, for that lovely turn of phrase).
The point is, if you don’t know what you’re doing, you can block major parts of the development process.
Indeed, we saw this play out at one of our portfolio companies. They were using Amazon’s Elastic Kubernetes Service (EKS). All of a sudden, due to a series of errors, a cluster became inaccessible to the entire company. In part, I would say that too much complexity contributed to these mistakes. Specifically, the complexity around how the AWS IAM system works with Kubernetes RBAC.
The moral of the story is (as always) to go for the simpler solution — at least if you’re a startup with a small tech team.
Here’s how it went down
One day, someone from this company’s business intelligence team was deploying a change to a staging cluster. Don’t ask me why someone from BI had such unfettered access to a cluster. I blame CAFA.
Anyway, something went wrong with the deployment, and it failed. After the failure, this person used kubectl to try to view some specific resources, but all of a sudden, they were locked out. When others tried to connect to the cluster with kubectl, they encountered the same problem.
My co-author, Alencar, was then called on to try and fix the issue. Even when he (a sysadmin) tried to access the cluster, he couldn’t. Eventually, he figured out what had happened and was able to grant fine-grained permissions to the people who really needed cluster access.
A simple but annoying cause
The heart of the problem was the aws-auth configmap — a key component of access control in EKS. Fair warning: I’m going to have to go on a little digression and explain the aws-auth configmap to those of you who have not yet had the pleasure of making its acquaintance — if only to make it clear why you don’t want to be messing with it.
Introducing the enigmatic “aws-auth configmap”
This specific configmap is used to map AWS IAM roles to rough equivalents in the Kubernetes RBAC system. You do this indirectly by mapping your IAM roles to “subjects”. A subject is then, in turn, bound to a Kubernetes ClusterRole. Yes, it sounds complicated, and it is. Why can’t you map an IAM role directly to a Kubernetes ClusterRole? It’s hard to explain. So I’m gonna leave it at “that’s just how it works”.
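To make this more concrete, here’s a minimal sketch of what a mapRoles entry in the aws-auth configmap might look like. The account ID, role, username, and group name are all hypothetical placeholders:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: aws-auth
  namespace: kube-system
data:
  mapRoles: |
    # Map a (hypothetical) IAM role to a Kubernetes "subject"
    - rolearn: arn:aws:iam::111122223333:role/EKS-read-only
      username: read-only-user
      groups:
        - EKS-Viewers   # the subject group; bound to a ClusterRole separately
```

Note that `EKS-Viewers` means nothing on its own: it only takes effect once a ClusterRoleBinding ties that group name to an actual Kubernetes role.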
How the BI person (let’s call him “Otto”) got admin permissions is perhaps better explained in this diagram.
The subject is the “glue” that connects an IAM role to a Kubernetes role. A subject can be a user group or a user ID. But you don’t create these groups or users in Kubernetes itself (Kubernetes does have its own built-in groups, but as far as I know, you can’t create new ones). Users and groups are passed to Kubernetes from the client (i.e. kubectl) when authenticating.
The “user” and “group” come from metadata in the certificate that you present when authenticating with the cluster (well, not “you” per se, the client does this automatically behind the scenes). That metadata, in turn, comes from your AWS account details.
How do you fix an empty aws-auth configmap?
As we learned, an empty or misconfigured aws-auth configmap means that no one has access. Except for the account that was used to create the cluster (no mapping is required for the cluster creator). When a cluster is provisioned, the creator’s ARN (Amazon Resource Name) is added to the Kubernetes built-in system:masters group. So they always have access…unless you’ve deleted their account.
Yeah, that’s right, the person who created our wayward cluster had long since left the company, and their account had been retired. However, according to this Stack Overflow question, others have had similar problems and have gotten around it by creating an IAM user with the same user name.
The new user will receive the same ARN as the old user. So their ARN will still match the original ARN listed in the Kubernetes system:masters group. And voila — you can access the cluster again. This is precisely what Alencar did, and he was able to restore the aws-auth configmap to its original settings.
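For illustration, a recovery along these lines might look something like the following AWS CLI sketch. The user name, region, cluster name, and backup file are hypothetical placeholders:

```shell
# Recreate an IAM user with the same name as the departed cluster creator.
# The new user receives the same ARN, which still matches the entry
# in the cluster's system:masters group.
aws iam create-user --user-name departed.engineer

# Give the recreated user credentials to authenticate with.
aws iam create-access-key --user-name departed.engineer

# Using those credentials, regenerate a kubeconfig for the locked-out cluster...
aws eks update-kubeconfig --region eu-central-1 --name wayward-cluster

# ...and restore the aws-auth configmap from a known-good copy.
kubectl apply -f aws-auth-backup.yaml
```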
How do you avoid this happening in the first place?
To answer this question, let’s take a quick look at the root causes.
Too much privilege
First, we have the problem of the BI person accidentally wiping the aws-auth configmap. They should never have had the power to do that in the first place. The problem is that people are often too generous with the IAM roles that they map to the built-in system:masters group.
This system:masters group is bound to the cluster administrator role and has more privileges than the average contributor needs — including the privilege to mess stuff up (ba-dum ching!).
On the other hand, it’s fiddly, delicate work to create a custom role in Kubernetes and bind it to one of your own user groups. Many people ain’t got time for that and just use one of the user-facing roles built into the Kubernetes RBAC system. So it’s understandable why the BI person might have been assigned system:masters rights. Nevertheless, they should have been assigned a more restrictive role that better matches their use case. In other words, apply the principle of least privilege.
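As a sketch of what “least privilege” could look like here, the following Role and RoleBinding would limit a hypothetical “BI-Deployers” group to managing deployments in a single namespace. All of the names are made up for illustration:

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: bi-deployer
  namespace: bi-staging          # permissions apply only in this namespace
rules:
  - apiGroups: ["apps"]
    resources: ["deployments"]
    verbs: ["get", "list", "watch", "create", "update", "patch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: bi-deployer-binding
  namespace: bi-staging
subjects:
  - kind: Group
    name: BI-Deployers           # mapped from an IAM role in aws-auth
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: bi-deployer
  apiGroup: rbac.authorization.k8s.io
```

With a binding like this, a failed deployment stays a failed deployment: the user has no verbs on configmaps in kube-system, so the aws-auth configmap is out of reach.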
Not enough automation
Next, we have the issue of being locked out because the cluster creator’s account no longer existed. To add a bit of context here, the former employee was a DevOps engineer who had used Terraform to provision the cluster.
However, he had run Terraform from his local machine, which was configured with his personal AWS credentials. He could have configured Terraform to use a different set of credentials instead of the default set… but it still wasn’t ideal that he’d run it from his local machine.
The process of provisioning clusters should be automated and run from a CI/CD environment rather than a developer’s workstation. When you talk about CI/CD, most people think of deploying applications. But you can also use the same paradigm to deploy infrastructure. That way, you can create a dedicated IAM user and role for the CI/CD process.
For example, you could create the IAM user “ProvisioningBot” and the role “ProvisionClusters”. Then create a policy that enables the “ProvisioningBot” to assume the role “ProvisionClusters”. Then you can set up a job in your build environment that uses Terraform to provision your clusters. This job will use ProvisioningBot’s AWS credentials for the provisioning and authenticating with the cluster.
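A hedged Terraform sketch of that setup, using the hypothetical names from above, might look like this (the actual provisioning permissions attached to the role are omitted for brevity):

```hcl
# Hypothetical dedicated identity for CI/CD cluster provisioning.
resource "aws_iam_user" "provisioning_bot" {
  name = "ProvisioningBot"
}

# Role that the bot assumes when provisioning clusters.
resource "aws_iam_role" "provision_clusters" {
  name = "ProvisionClusters"

  # Trust policy: only ProvisioningBot may assume this role.
  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect    = "Allow"
      Action    = "sts:AssumeRole"
      Principal = { AWS = aws_iam_user.provisioning_bot.arn }
    }]
  })
}
```

Because ProvisioningBot now “owns” every cluster it creates, the implicit system:masters backdoor belongs to a service identity rather than to whichever engineer happened to run Terraform that day.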
The next time you need a new cluster (for example, you want to add another testing cluster), you would define the cluster’s attributes in a variables file (region, permissions, and so on) and trigger the job with those variables. ProvisioningBot and Terraform will do the rest.
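The variables file for such a job is usually just a small .tfvars file, something like this hypothetical example (the variable names depend entirely on your own Terraform module):

```hcl
# staging-2.tfvars: attributes for the new testing cluster
cluster_name  = "staging-2"
region        = "eu-central-1"
node_count    = 3
viewer_groups = ["EKS-Viewers"]
```

The CI/CD job would then run something like `terraform apply -var-file=staging-2.tfvars` under ProvisioningBot’s credentials.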
That way, you’re less dependent on any single employee. You can have multiple people permitted to initiate a provisioning job.
So why do we think access control is easier in GKE?
Simply put, you have to do less Kubernetes configuration. Google’s Kubernetes Engine is better integrated with Google Cloud’s IAM — so it’s a lot easier to give new employees access. Let’s run through a basic scenario and compare how you would handle it in EKS vs. GKE.
A new employee starts and, for now, you just want them to have read-only access to your clusters and nodes.
You use a standard Google IAM role intended for this case:
- You add the user account to the relevant Google Cloud project.
- You give them the project-level role of “Compute Viewer”.
That’s it. No ConfigMap is needed. No risk if it’s wiped.
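As a rough sketch, the whole GKE setup could come down to a single command. The project ID and email here are placeholders, and you should check which viewer role fits your case:

```shell
# Grant a new employee project-level read-only access.
# "roles/compute.viewer" corresponds to the "Compute Viewer" role above.
gcloud projects add-iam-policy-binding my-project-id \
  --member="user:new.employee@example.com" \
  --role="roles/compute.viewer"
```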
On EKS, setting them up could be just as easy, provided the following prerequisites have been fulfilled.
- Your admin has created a suitable IAM group for this type of user — for example, “EKS-Viewers”.
- Your admin has created a suitable IAM role and policy for this type of user — for example, “EKS-read-only”.
- Your admin has mapped the role “EKS-read-only” to the group “EKS-Viewers” in the aws-auth configmap.
- Your admin has bound the group “EKS-Viewers” to the internal Kubernetes “view” ClusterRole (by creating a ClusterRoleBinding when provisioning the cluster).
Only if all of these steps have been done correctly can you:
- Add the user to a relevant group such as “EKS-Viewers”.
- Update your IAM rules to allow the user to assume a read-only role, i.e. “EKS-read-only”.
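For reference, the ClusterRoleBinding from the fourth prerequisite might look something like this. The group name matches the hypothetical example above; “view” is one of Kubernetes’ built-in user-facing ClusterRoles:

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: eks-viewers-view
subjects:
  - kind: Group
    name: EKS-Viewers            # the subject group from the aws-auth mapping
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: ClusterRole
  name: view                     # built-in read-only ClusterRole
  apiGroup: rbac.authorization.k8s.io
```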
In other words, it’s bloody complicated. Your admin had better know exactly what they’re doing because, in EKS, the RBAC configuration is effectively immutable once the cluster is provisioned: you can’t configure new RBAC roles or bindings without recreating the cluster. So if you want to give new people permissions, you’re stuck with the choice of giving them one of Kubernetes’ built-in roles like cluster-admin (by mapping them to the user group system:masters).
For an alternative comparison, consider the following representation:
In terms of admin permissions, you can see that there’s no difference between EKS and GKE. Naturally, whoever created the cluster automatically gets admin access to it.
But when it comes to updating or viewing permissions, we start to see big differences.
View or Update access to ANY Cluster
- On EKS: If you’re not an admin, your access to a cluster depends on how the cluster was configured when it was created. Your admin should have configured IAM and RBAC and mapped them properly. The process is the same regardless of whether you need access to all clusters or a single cluster.
- On GKE: the console offers a wide range of built-in roles that let you view or update resources on any cluster. If a new employee joins a project, you can give them read-only access to all clusters with a couple of clicks. This is extremely convenient as long as you’re not fussy about which clusters people can access.
View or Update access to a SINGLE Cluster
- On EKS: As mentioned previously, you have to configure each cluster individually anyway, so there’s not much difference in the process. Ideally, you’re using a Terraform module to provision your clusters. So, if you want to provision a cluster with a restricted set of users, you would change the variables that you use when you apply the module. BUT, as I said before, you’d better get it right. You can’t create new Kubernetes Roles or RoleBindings after the cluster has been provisioned.
- On GKE: This is one case where you would need to get to grips with the Kubernetes RBAC system. Google’s IAM roles provide “broad brush” permissions. You can’t pinpoint specific clusters or resources on clusters. To do that, you need more precise role definitions — which only Kubernetes RBAC supports. So, if you want to restrict new developers to the dev cluster, you would add them to the project but leave them with the most limited IAM roles. Instead, you’d use the Kubernetes RBAC configuration to extend their permissions — specifically just on the dev cluster.
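A hedged sketch of that GKE approach: applied with kubectl against the dev cluster only, the binding below would give a developer’s Google account the built-in “edit” ClusterRole on that one cluster. The account name is hypothetical:

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: new-devs-edit
subjects:
  - kind: User
    name: dev@example.com        # the developer's Google account
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: ClusterRole
  name: edit                     # built-in read/write (non-admin) ClusterRole
  apiGroup: rbac.authorization.k8s.io
```

Since this binding lives only in the dev cluster, the same developer remains a minimal-privilege IAM user everywhere else in the project.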
To sum up: first, we presented a real-life case where we lost access to a cluster because A) an inexperienced user had too much access and B) the admin account was bound to a single employee who had left the company. Then we compared how EKS and GKE integrate with Kubernetes RBAC. The point of this exercise was to illustrate how complexity can be a hindrance when managing access control.
Admins who are inexperienced or in a hurry will opt for the path of least resistance. Especially if your only RoleBinding is to system:masters, and you’re not able to change it after your cluster is created. If someone later asks for access to the cluster, the most tempting option is to give them a role bound to system:masters — the so-called “Cluster Admin For All”.
But, if you have a system like GKE that lets you change roles through a tight Kubernetes integration, it’s just as easy to give someone limited permissions. With limited permissions, our fabled BI user might not have wiped the ConfigMap file by mistake. And with less dependency on the authentication configmap, wiping it would have had a limited impact anyway — since most users inherit their permissions from the GKE IAM integration.
The point is GKE permissions are more flexible and easier to learn, which makes it less likely that you’d run into an incident like the one we’ve described. If you already have in-house AWS experts, then fine, stick with what you know. But if you want to move your app to managed Kubernetes, and your sysadmins don’t have much experience with either AWS or Google Cloud, then go with GKE. They’ll thank you for it later.