Containers

Wefox Italy's Journey to SaaS Multi-Tenancy on Amazon EKS

This blog was authored by Marco Sciatta, Principal Architect, WEFOX ITALY; Majid Shokrolahi, Sr. Solutions Architect, AWS; Tsahi Duek, Principal Solutions Architect, Containers, AWS.

Wefox Italy is a leading insurance company dedicated to innovative solutions in the insurance sector.

At Wefox Italy, various teams work on different products, many of which have recently required support for a multi-tenant model with robust data and resource isolation to meet the needs of our new clients. At the same time, the growing number of requests and instances to maintain has placed considerable pressure on our operations teams. We needed a solution that would allow our teams to release and maintain multiple application instances effortlessly, quickly, and within a standardized environment.

To address these challenges, we adopted a Software as a Service (SaaS) model. We developed a versatile platform that enables the consistent deployment of any application in a standardized environment, leveraging the GitOps approach to offer the highest level of customization.

This post details how we built a comprehensive SaaS solution using Amazon Elastic Kubernetes Service (Amazon EKS) and GitOps practices. By leveraging AWS services, we created a configurable and easy-to-maintain environment.

We designed the solution described in this post to support different personas in the organization and their needs. Below is a description of each persona and its requirements:

  • SaaS platform team: Manages both application and customer tenant onboarding. This requires a tenant management system that enables application teams to deploy a siloed model of their application for different customer tenants.
  • DevOps team: Responsible for building and managing system-wide capabilities such as API gateway configuration, service mesh deployments, authorization (AuthZ) and authentication (AuthN) with identity providers (IdPs), and similar services.
  • Application team: Responsible for building customer-facing applications and deploying them in silo mode per customer tenant, using the capabilities that the SaaS platform team and the DevOps team deliver.

Overview of solution

Our solution is primarily based on the Amazon EKS SaaS Factory Reference Architecture and comprises three distinct clusters: one SaaS Control Plane cluster and two Data Plane clusters.

The SaaS Control Plane cluster hosts all the essential services required for the SaaS to operate, including centralized management, tenant management services, and monitoring tools. It is managed by the SaaS platform team, and its services are exposed to application teams that need to onboard new applications for customer tenants.

The two Data Plane clusters are designated for deploying workloads, ensuring that application services run independently of the core operational services. The first is our Application Shared Services cluster, managed by the DevOps team (distinct from the SaaS control plane cluster; it includes services such as authorization and the Kong API gateway). The services deployed in this cluster form the foundation of the system-wide capabilities used by every application. The second is our Business Applications cluster, where application teams deploy their business services. Currently, we have a single business cluster, but as our applications and customer base grow, we plan to add more business application clusters while maintaining the shared services cluster as the foundation for all deployments. The original business cluster will remain the default deployment environment.

Both data plane clusters operate within the same Istio service mesh, which delivers an advanced security layer and comprehensive traffic control between them. Using Istio, we can effortlessly incorporate advanced capabilities as standard application features.
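For example, mutual TLS between workloads can be enforced mesh-wide with a single Istio resource. The following is a minimal sketch; the name and root namespace follow Istio's defaults:

```yaml
# Mesh-wide PeerAuthentication: placing this in the Istio root namespace
# (istio-system by default) enforces strict mTLS for every workload in the mesh.
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: istio-system
spec:
  mtls:
    mode: STRICT
```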

This dual-cluster approach ensures a clear separation of concerns, significantly enhancing security and optimizing scalability. The following diagram shows the architecture.

Figure 1: The high-level architecture


A core aspect of this solution is the adoption of GitOps methodologies, which streamline and automate our operations while granting a fully customizable environment. For infrastructure management, we use Terraform as our Infrastructure as Code (IaC) tool, orchestrated by our internal GitLab CI/CD pipelines. This approach allows us to define and provision infrastructure in a consistent, repeatable manner. Complementing this, we employ Argo CD to manage Kubernetes manifests, ensuring that the desired state of our applications is automatically maintained on the clusters. By integrating Terraform and Argo CD, we achieve a seamless, end-to-end solution for both infrastructure and application management. This powerful combination allows us to automate deployment processes and ensure consistent environments across our entire platform.

Key Benefits of the solution:

  • Control over automation: Adopting Terraform as our Infrastructure as Code (IaC) tool and Argo CD for application management allows us to automate every part of application deployment and lifecycle via CI/CD, Terraform modules, and Helm charts, while keeping everything fully configurable and customizable.
  • Security: EKS Pod Identity ensures pod-level access control to AWS services based on the principle of least privilege. Istio, with mutual TLS (mTLS) and authorization policies, along with policy-as-code via Open Policy Agent (OPA), provides granular permission management, enhancing the overall security posture.
  • Sensitive Value Management: Secrets and parameters are securely stored in AWS Secrets Manager or Parameter Store, a capability of AWS Systems Manager, and automatically injected into the appropriate pods via the Secrets Store CSI Driver (see the SecretProviderClass sketch after this list). This approach prevents sensitive data leakage and ensures secure storage and handling.
  • Tenant Isolation: The combination of the Amazon VPC CNI plugin with network policies and Istio authorization policies provides enhanced security through network-level isolation and granular access control, improved compliance with regulatory standards, and streamlined management of security configurations (see the namespace isolation sketch after this list).
  • Centralized Toolset: Applications can utilize different services that are provided as shared services, such as Keycloak as an identity provider, an OpenTelemetry collector for monitoring data, and more. Some functionality, such as policy-as-code, can also be activated via configuration.
  • Cost Efficiency: Using Amazon EKS enables cost-effective resource management through dynamic scaling with Karpenter. Additionally, this solution reduces maintenance requirements and allows our operations team to focus on infrastructure management rather than on individual applications, further improving overall efficiency.
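To illustrate the sensitive value management described above, the following is a minimal, hypothetical SecretProviderClass for the Secrets Store CSI Driver with the AWS provider; the names and the secret path are illustrative, not our actual naming scheme:

```yaml
# Hypothetical SecretProviderClass mapping a Secrets Manager entry into a pod.
apiVersion: secrets-store.csi.x-k8s.io/v1
kind: SecretProviderClass
metadata:
  name: orders-db-credentials   # illustrative name
  namespace: tenant-a           # illustrative tenant namespace
spec:
  provider: aws
  parameters:
    objects: |
      - objectName: "tenant-a/orders/db-credentials"   # illustrative path
        objectType: "secretsmanager"
# A pod consumes it through a CSI volume, for example:
#   csi:
#     driver: secrets-store.csi.k8s.io
#     readOnly: true
#     volumeAttributes:
#       secretProviderClass: orders-db-credentials
```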
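Similarly, as a sketch of the tenant isolation layers, the baseline for a tenant namespace could combine a Kubernetes NetworkPolicy (enforced by the Amazon VPC CNI) with an Istio authorization policy; all names are illustrative:

```yaml
# Restrict ingress to traffic originating from the tenant's own namespace.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-same-namespace-only
  namespace: tenant-a
spec:
  podSelector: {}        # applies to all pods in the namespace
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector: {}   # only pods in tenant-a
---
# Layer identity-based control on top of mTLS with Istio.
apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata:
  name: allow-same-namespace
  namespace: tenant-a
spec:
  action: ALLOW
  rules:
    - from:
        - source:
            namespaces: ["tenant-a"]
```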

The baseline infrastructure is provisioned via Terraform and orchestrated by our internal GitLab CI/CD pipeline. The VPC, networking constructs, the three Amazon EKS clusters, and all the necessary supporting AWS services, such as IAM roles, are deployed with Terraform code using standard AWS modules. These modules let us provision ready-to-use Amazon EKS clusters with useful Kubernetes objects already installed, such as the secrets-store-csi-driver and the aws-load-balancer-controller.

The primary Amazon EKS cluster, referred to as the control plane, hosts all the services necessary for the operation of the SaaS solution: the services used in tenant and application management, as well as the services that manage specific components such as the Kong API gateway control plane. One of these services is Argo CD, which is used to deploy and manage workloads across all the defined clusters.

The second cluster, known as the shared cluster, hosts services that are not tied to specific applications but are instead utilized universally across all applications.

The last cluster is the application cluster, which serves as the default deployment environment for applications unless otherwise specified.

As part of the base infrastructure, the Istio service mesh is installed in a multi-primary configuration across the shared and application clusters. During provisioning of the Amazon EKS clusters, Argo CD is automatically configured to deploy workloads on both data plane clusters.
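As a sketch of that configuration, Argo CD supports registering external clusters declaratively through labeled Secrets in its own namespace; the cluster name, endpoint, and role ARN below are placeholders:

```yaml
# Declarative Argo CD cluster registration for a data plane cluster.
apiVersion: v1
kind: Secret
metadata:
  name: business-applications-cluster
  namespace: argocd
  labels:
    argocd.argoproj.io/secret-type: cluster
type: Opaque
stringData:
  name: business-applications
  server: https://<business-cluster-api-endpoint>
  config: |
    {
      "awsAuthConfig": {
        "clusterName": "business-applications",
        "roleARN": "arn:aws:iam::<account-id>:role/<argocd-deployer-role>"
      },
      "tlsClientConfig": {
        "caData": "<base64-encoded-ca-certificate>"
      }
    }
```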

Application deployment and Stacks

One of the pillars of our SaaS solution is maintaining a GitOps operational model.

To achieve this, the application concept is structured around four distinct repositories, each designed to manage a different aspect of the application lifecycle: infrastructure, deployments, APIs and domains, and policy as code. Figure 2 shows the stacks and repositories.

Figure 2: The stacks and repositories


At the same time, to support different deployment models or tiers for the same application, we introduced the concept of stacks (silo, shared, and so on).

This approach provides the granularity needed to represent any kind of deployment model within a SaaS environment. However, we decided to support out of the box a silo model with namespace-per-tenant isolation on a shared cluster, providing four different repository templates that are used as the foundation of every application:

  • Stack: A standard Terraform repository coupled with a CI/CD pipeline. It contains the Terraform code to deploy the default “silo” stack, composed of different modules that provision infrastructure components based on the application configuration specified in a YAML file in the same repository, called app_config (a hypothetical sketch follows this list).
  • Manifests: Contains the Kubernetes manifests that are deployed through an Argo CD application. By default, it contains the manifests to install the baseline of a tenant namespace (the namespace definition, network policies, and so on), plus an opinionated, structured directory tree.
  • API: Contains a CI/CD pipeline that provisions OpenAPI definitions and configuration on the API gateway through a declarative format, following APIOps concepts.
  • Policy: Includes reusable templates of Rego policies that can be used with OPA when it is enabled in the application configuration.
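The app_config schema is internal to our platform, but a hypothetical sketch, based on the service settings described later (image tags, ports, environment variables, and AWS resources such as Amazon RDS), might look like this:

```yaml
# Hypothetical app_config; field names are illustrative, not the actual schema.
application:
  name: claims-portal
  default_stack: silo
services:
  - name: claims-api
    image_tag: "1.4.2"
    port: 8080
    env:
      LOG_LEVEL: info
    aws_resources:
      rds:
        engine: postgres
        instance_class: db.t4g.medium
policies:
  opa_enabled: true   # activates the policy-as-code templates
```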

We keep these repositories separate to maintain clear ownership between the operations and application teams.

Application Onboarding

We provide a UI to internal teams to simplify the application creation process. The UI allows teams to specify various application details, such as the name, domain names, and the AWS resources that need to be provisioned and made available to all the services comprising the application.



Figure 3: UI for internal teams – part 1

When the application is created, the application manager service (an in-house developed tool we call saasd) in the control plane cluster clones the four repository templates, initializes them with the specified configuration, and then uploads them to the appropriate application GitLab group.

Once the application and repositories are created, services can be added to the application.

Figure 4: UI for internal teams – part 2


Service configuration includes infrastructure and deployment settings, such as required AWS services like Amazon RDS databases, image tags, Kubernetes service ports to expose, and environment variables.

When a service is configured, saasd splits the information between infrastructure and deployment and commits the appropriate changes to both application repositories.

On the infrastructure repository, the app_config file is updated with the relevant information for the new service.

On the manifest repository, a new directory tree is added with a Helm chart and values file definition.
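As an illustration, the values file committed by saasd for a new service might resemble the following; all keys and values are hypothetical:

```yaml
# Hypothetical Helm values committed to the manifest repository.
image:
  repository: <account-id>.dkr.ecr.<region>.amazonaws.com/claims-api
  tag: "1.4.2"
service:
  port: 8080
env:
  - name: LOG_LEVEL
    value: info
serviceAccount:
  create: true   # later associated with an EKS Pod Identity by the stack
```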

Figure 5: App onboarding


Tenant onboarding

Creating an application in the SaaS solution doesn’t deploy anything into the cluster, as infrastructure is only provisioned when a tenant is onboarded. Tenant onboarding involves creating application infrastructure and service deployments based on the selected stack and application repositories. Tenants can be onboarded to any configured SaaS application through a one-click operation in the user interface. When a new tenant is created, the onboarding pipeline triggers automatically. This pipeline uses four microservices deployed on the control plane cluster to prepare the SaaS environment.

Figure 6: Tenant onboarding


The flow is as follows:

  • The tenant registration service calls the tenant manager to store the tenant data.
  • The tenant registration service calls the user manager service to create a dedicated realm on the IdP, register the first user, and register an OAuth application for the new tenant.
  • The tenant registration service calls the provisioner with the data collected by the previous services.
  • The provisioner calls the saasd service, retrieves the application details, and triggers the appropriate app stack repository CI/CD via API, passing parameters such as the tenant and the stack to deploy.

When the app stack repository CI/CD pipeline is triggered, the given parameters are used to dynamically configure a Terraform state backend and launch the appropriate stack's Terraform code.
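Assuming an Amazon S3 state backend and illustrative variable and bucket names, a GitLab CI job could wire up the per-tenant backend like this:

```yaml
# Hypothetical GitLab CI job: TENANT and STACK are passed by the provisioner
# when it triggers the pipeline; the bucket and key layout are illustrative.
deploy-stack:
  image: hashicorp/terraform:1.7
  script:
    - terraform init
      -backend-config="bucket=saas-terraform-states"
      -backend-config="key=apps/${APP_NAME}/${STACK}/${TENANT}.tfstate"
      -backend-config="region=eu-south-1"
    - terraform apply -auto-approve -var "tenant=${TENANT}" -var "stack=${STACK}"
```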

Figure 7: Tenant onboarding and application deployment on the cluster


This enables deploying the same application stack for different tenants, each with an independent Terraform state file. The Terraform code provisions resources as specified, with the final module bridging infrastructure and deployments. It gathers non-sensitive information and registers a Helm chart in the control plane cluster. This chart contains an Argo CD application pointing to the application manifest repository, configured to sync the baseline chart. The baseline chart installs manifests to create a tenant namespace, configure isolation policies, resource quotas, and Argo CD applications for each service, following the App of Apps pattern.
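As a sketch of the App of Apps pattern described above, the per-tenant root Argo CD application registered by the final Terraform module might look like the following; the repository URL, cluster name, and paths are placeholders:

```yaml
# Hypothetical per-tenant root application; it syncs the baseline chart,
# which in turn creates the tenant namespace, policies, quotas, and one
# Argo CD application per service.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: claims-portal-tenant-a
  namespace: argocd
spec:
  project: claims-portal
  source:
    repoURL: https://gitlab.example.com/claims-portal/manifests.git
    targetRevision: main
    path: baseline
    helm:
      values: |
        tenant: tenant-a
  destination:
    name: business-applications   # the registered data plane cluster
    namespace: tenant-a
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
```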

Application Releases

In a SaaS environment, maintaining a consistent infrastructure and codebase across all tenant deployments is crucial. The solution uses GitOps to manage both deployment and infrastructure updates. When teams release new service versions by pushing container images to Amazon Elastic Container Registry (Amazon ECR), they update the Kubernetes manifests in the repository, either automatically through their service CI/CD pipeline or manually. Because each tenant deployed with the default silo stack model has its own Argo CD application pointing to the same application manifest repository, Argo CD automatically reconciles the manifests for every tenant of the application in the cluster.

For infrastructure changes, DevOps teams modify the Terraform code in the stack repository and merge it, which triggers a CI/CD pipeline. The pipeline's steps are built dynamically by a script that queries the tenant manager to retrieve the list of tenants deployed with that application stack. For each tenant, it creates a GitLab CI pipeline step that applies the Terraform configuration using the appropriate state file backend for that tenant (a sketch of this pattern follows).
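One way to implement such dynamically built steps is GitLab's child pipeline mechanism; this sketch assumes an illustrative script that renders one Terraform job per tenant:

```yaml
stages: [prepare, deploy]

generate-tenant-jobs:
  stage: prepare
  script:
    # Hypothetical script: queries the tenant manager and emits one
    # Terraform apply job per tenant into a child pipeline definition.
    - ./scripts/render-tenant-jobs.sh > tenant-pipeline.yml
  artifacts:
    paths:
      - tenant-pipeline.yml

deploy-tenants:
  stage: deploy
  trigger:
    include:
      - artifact: tenant-pipeline.yml
        job: generate-tenant-jobs
    strategy: depend   # the parent pipeline waits for the child's result
```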

Figure 8: Application release workflow


As with the initial onboarding, the Terraform code finishes by installing a new release of the base Helm chart with the new infrastructure values, so Argo CD can reconcile the application state for that tenant. In this way, we ensure that the deployed infrastructure on the cluster is updated sequentially and in a controlled manner.

This approach is consistently applied across all the tools we developed to simplify application maintenance. For instance, the UI used to add new services to an application parses the configuration and automatically commits to the manifest or infrastructure repository, mimicking the manual process. The same process applies to shared services onboarding.

Conclusion

Building a SaaS solution presents challenges like tenant isolation, seamless onboarding, and noisy-neighbor issues. Our architecture transforms typical SaaS constructs into a versatile platform supporting diverse applications and infrastructure deployments. Using a silo resource model, we ensure security and isolation within a fully automated, shared cluster environment. The GitOps approach enables customization of the application lifecycle, allowing deployment across different AWS accounts, clusters, or VPCs without affecting other infrastructure components. We encourage you to implement this versatile architecture to streamline multi-tenant deployments and embrace GitOps for enhanced control. To learn more, refer to Amazon EKS SaaS Factory Reference Architecture and Amazon EKS SaaS guides.