Slack recently posted a detailed description of the software architecture of its new role management system. Slack needed to build a system that was more flexible than the one it previously had. It created a custom containerized Go-based permission service that integrates with its existing systems over gRPC. As a result, its customers’ admins can now have granular control over what their users can do.
Slack engineers Jake Byman, Aish Raj Dahal, and Jose M. Medina explain the motivation for creating a new role management system:
The standard types of roles we offered to customers were too broad, and delegating a generic admin role can grant someone with too much power — what if you only want a specific user to be able to manage specific channels? When you make them an admin, they can perform a wide variety of actions beyond the scope of the intended purpose. (…) We needed to build a system that was more flexible and allowed for granular permissions.
Slack opted to create a Role-Based Access Control (RBAC) system in which admins define roles, each of which is a set of permissions for actions in the system. Admins then grant users one or more of these roles on a specific context or entity (an organization, a workspace, etc.). Whenever a user takes an action, Slack makes an authoritative check with the permissions service to ensure that the user is allowed to perform it.
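The RBAC model described above can be illustrated with a minimal in-memory sketch. This is not Slack's implementation (the real service is written in Go and exposed over gRPC). The class name, method names, role name, and permission strings are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Role:
    """A named set of permitted actions (hypothetical shape)."""
    name: str
    permissions: frozenset


@dataclass
class PermissionService:
    """Illustrative in-memory stand-in for a permissions service."""
    roles: dict = field(default_factory=dict)   # role name -> Role
    grants: dict = field(default_factory=dict)  # (user, entity) -> role names

    def define_role(self, name, permissions):
        self.roles[name] = Role(name, frozenset(permissions))

    def grant(self, user_id, entity_id, role_name):
        # A role is always granted on a specific entity, never globally.
        if role_name not in self.roles:
            raise KeyError(f"unknown role: {role_name}")
        self.grants.setdefault((user_id, entity_id), set()).add(role_name)

    def is_allowed(self, user_id, entity_id, action):
        # The user may act if any role they hold on this entity includes the action.
        role_names = self.grants.get((user_id, entity_id), set())
        return any(action in self.roles[r].permissions for r in role_names)


service = PermissionService()
service.define_role("channel_manager", {"channel.rename", "channel.archive"})
service.grant("U123", "workspace:T1", "channel_manager")
```

With this setup, `service.is_allowed("U123", "workspace:T1", "channel.archive")` returns `True`, while the same check against a different workspace, or for an action outside the role, returns `False`. That is the granularity a generic admin role cannot express.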
In contrast, the display in the Slack client is based on a non-authoritative copy of the permission set. The backend maintains this copy in Slack’s Flannel edge cache and updates it in near real time. The client reads permission information from this cache to keep latency low.
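The split between a fast, non-authoritative read for display and an authoritative check when an action is taken can be sketched as follows. `EdgeCache`, `should_show_button`, and `perform_action` are hypothetical names. The sketch does not reflect Flannel's actual interface and only shows why a stale cache entry affects what is displayed but not what is permitted.

```python
class EdgeCache:
    """Hypothetical stand-in for an edge cache holding a copy of permissions."""

    def __init__(self):
        self._permissions = {}

    def update(self, user_id, permissions):
        # Assumed to be called by the backend shortly after permissions change.
        self._permissions[user_id] = frozenset(permissions)

    def get(self, user_id):
        return self._permissions.get(user_id, frozenset())


def should_show_button(cache, user_id, action):
    # Display decision: low latency, but possibly stale.
    return action in cache.get(user_id)


def perform_action(authoritative_check, user_id, action):
    # Enforcement decision: always asks the authoritative source.
    if not authoritative_check(user_id, action):
        raise PermissionError(f"{user_id} may not perform {action}")
    return f"{action} performed"
```

If the cached copy lags behind a revocation, the client may briefly show a control the user can no longer use, but `perform_action` still rejects the request.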
Backward compatibility was a significant consideration for the Slack engineers. To reduce the risk of disrupting users while deploying the permission system, they opted for a staged deployment scheme. First, they created a loopback gRPC service that lived inside the Slack web app. They rolled out this change to their internal workspace and then to pilot customers.
Once they established that this change was safe, they started reading from the actual external service in “dark mode.” In this mode, they read the permission check result from both the new service and the existing web app, but did not act on the new service’s result. If the two results didn’t match, they raised an alert for further investigation. Once they were confident enough in the new service, they switched to reading from it in “light mode,” using its results as the sole source of truth.
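The dark-mode and light-mode behavior described above can be sketched as a small wrapper around the two checks. The function names, the `Mode` enum, and the `on_mismatch` callback are assumptions made for illustration, and the article does not describe how Slack's alerting is wired up.

```python
import logging
from enum import Enum

logger = logging.getLogger("permissions.rollout")


class Mode(Enum):
    DARK = "dark"    # compare both results, act on the legacy one
    LIGHT = "light"  # act on the new service only


def _log_mismatch(user_id, entity_id, action, legacy_result, new_result):
    # Default "alert": a warning log line (a real system would page or emit a metric).
    logger.warning(
        "permission mismatch user=%s entity=%s action=%s legacy=%s new=%s",
        user_id, entity_id, action, legacy_result, new_result,
    )


def check_permission(mode, legacy_check, new_check, user_id, entity_id, action,
                     on_mismatch=_log_mismatch):
    if mode is Mode.LIGHT:
        # The new service is the sole source of truth.
        return new_check(user_id, entity_id, action)

    # Dark mode: call both, report disagreements, return the legacy answer.
    legacy_result = legacy_check(user_id, entity_id, action)
    new_result = new_check(user_id, entity_id, action)
    if new_result != legacy_result:
        on_mismatch(user_id, entity_id, action, legacy_result, new_result)
    return legacy_result
```

In dark mode a bug in the new service surfaces as a mismatch report and never reaches users, because the returned value always comes from the existing path.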
Role information is persisted in a Vitess data store. Slack engineers state that they “opted to have our permissions service read and write from the same Vitess store used by our web app monolith.” They made this design decision to keep a single centralized data store and avoid data drift. The Vitess database acts as the source of truth for the entire system.