AWS Public Sector Blog
Building machine learning operations framework with Amazon SageMaker: Technical Safety BC’s Journey
Technical Safety BC (TSBC) regulates the safe installation and operation of technical systems (electrical, gas, boiler, elevator, etc.) in British Columbia. This post showcases how the TSBC built a machine learning operations (MLOps) solution using Amazon Web Services (AWS) to streamline production model training and management to process public safety inquiries more efficiently.
Solution overview
TSBC developed an orchestrated MLOps framework with Bitbucket repositories. Approval workflows streamline model training and deployment processes, and AWS cross-account deployment enables resource allocation across organizational boundaries. The framework uses templates with Hugging Face text classification models, enabling rapid deployment of natural language processing capabilities.
The multi-account framework is set up with a shared services account and a production account. The shared services account is the MLOps hub, which orchestrates continuous integration and continuous delivery (CI/CD) pipeline management, model development, training workflows, and endpoint testing. It streamlines ML lifecycle management and stores model and pipeline artifacts.
The solution consists of three core templates:
- Model build and training template: This template automates the CI/CD pipeline for model building, manages data processing and model training, and implements model governance and approval workflows.
- Model serverless inference template: This template automates model deployment to serverless endpoints, enables cross-account model deployment, and implements automatic updates as model status changes.
- Model batch inference template: This template manages batch inference pipeline infrastructure and enables automated batch transforms.
The following diagram shows the solution architecture.
Prerequisites
To develop and deploy the solution, you need to have the following prerequisites in place:
- An AWS account for shared services
- An AWS account for production
- IAM roles for cross-account deployments (a minimal trust policy sketch follows this list)
- An on-premises Bitbucket connection to CodePipeline
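The role name and account ID below are placeholders, but the pattern is the standard one for cross-account deployment: the production account exposes a role that the shared services account can assume. This is a minimal sketch, not the framework's actual policy.

```python
import json
import boto3

iam = boto3.client("iam")

# Placeholder account ID for the shared services account; substitute your own.
SHARED_SERVICES_ACCOUNT_ID = "111111111111"

# Trust policy letting the shared services account assume this role in the
# production account for cross-account deployments.
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"AWS": f"arn:aws:iam::{SHARED_SERVICES_ACCOUNT_ID}:root"},
            "Action": "sts:AssumeRole",
        }
    ],
}

iam.create_role(
    RoleName="mlops-cross-account-deploy",  # hypothetical role name
    AssumeRolePolicyDocument=json.dumps(trust_policy),
    Description="Assumed by the shared services account to deploy ML resources",
)
```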
Operation workflows
The model build and training template creates and automates a CI/CD pipeline for building and training a model. Code commits trigger the Amazon SageMaker pipeline workflow. The pipeline retrieves data from the Amazon Simple Storage Service (Amazon S3) bucket and organizes it into training, validation, and test sets. The model then undergoes an iterative training and validation process, in which data is fed to the model and the results are examined so that the model meets performance standards. Successful models are packaged with an inference script and registered. Manual approval is required before a model can enter the serverless or batch inference workflow.
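As a rough illustration of that workflow, the sketch below defines a SageMaker pipeline with a preprocessing step, a Hugging Face training step, and a registration step that lands the model in the registry pending manual approval. Bucket paths, script names, framework versions, instance types, and the role ARN are placeholders rather than the template's actual values.

```python
from sagemaker.huggingface import HuggingFace
from sagemaker.inputs import TrainingInput
from sagemaker.processing import ProcessingInput, ProcessingOutput
from sagemaker.sklearn.processing import SKLearnProcessor
from sagemaker.workflow.pipeline import Pipeline
from sagemaker.workflow.step_collections import RegisterModel
from sagemaker.workflow.steps import ProcessingStep, TrainingStep

role = "arn:aws:iam::111111111111:role/sagemaker-execution-role"  # placeholder

# Split raw data from S3 into training, validation, and test sets.
processor = SKLearnProcessor(
    framework_version="1.2-1",
    role=role,
    instance_type="ml.m5.xlarge",
    instance_count=1,
)
preprocess = ProcessingStep(
    name="PreprocessData",
    processor=processor,
    code="preprocess.py",  # hypothetical script from the model-build repository
    inputs=[ProcessingInput(source="s3://example-bucket/raw/", destination="/opt/ml/processing/input")],
    outputs=[
        ProcessingOutput(output_name="train", source="/opt/ml/processing/train"),
        ProcessingOutput(output_name="validation", source="/opt/ml/processing/validation"),
        ProcessingOutput(output_name="test", source="/opt/ml/processing/test"),
    ],
)

# Fine-tune a Hugging Face text classification model on the training split.
estimator = HuggingFace(
    entry_point="train.py",        # hypothetical training script
    transformers_version="4.26",   # illustrative versions; use ones supported in your Region
    pytorch_version="1.13",
    py_version="py39",
    instance_type="ml.p3.2xlarge",
    instance_count=1,
    role=role,
)
train = TrainingStep(
    name="TrainModel",
    estimator=estimator,
    inputs={
        "train": TrainingInput(
            preprocess.properties.ProcessingOutputConfig.Outputs["train"].S3Output.S3Uri
        ),
        "validation": TrainingInput(
            preprocess.properties.ProcessingOutputConfig.Outputs["validation"].S3Output.S3Uri
        ),
    },
)

# Register the trained model; manual approval gates the inference workflows.
register = RegisterModel(
    name="RegisterModel",
    estimator=estimator,
    model_data=train.properties.ModelArtifacts.S3ModelArtifacts,
    content_types=["application/json"],
    response_types=["application/json"],
    inference_instances=["ml.m5.large"],
    transform_instances=["ml.m5.xlarge"],
    model_package_group_name="text-classification",  # placeholder group name
    approval_status="PendingManualApproval",
)

pipeline = Pipeline(name="model-build-pipeline", steps=[preprocess, train, register])
pipeline.upsert(role_arn=role)
```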
The model build and training template consists of the following elements:
- Model-build repository: An on-premises Bitbucket repository seeded with model training code and Hugging Face text classification inference scripts. Data scientists modify training parameters, evaluation methods, and data locations.
- AWS CodePipeline: Monitors changes in Amazon S3 and triggers AWS CodeBuild to execute SageMaker pipelines.
- AWS CodeBuild project: Handles the construction and execution of Amazon SageMaker pipelines.
- S3 bucket: Cross-account enabled S3 bucket storing artifacts from CodePipeline and SageMaker pipelines.
- Cross-account model package group: An Amazon SageMaker model package group with restricted access granted to the production account (a resource policy sketch follows this list).
- AWS Key Management Service (AWS KMS) key: Enables cross-account model encryption and decryption for deploying models across different AWS accounts.
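For the cross-account model package group, a resource policy on the group is what lets the production account read and deploy approved packages. The sketch below shows one way to attach such a policy with boto3; the account IDs, Region, and group name are placeholders, not the framework's actual values.

```python
import json
import boto3

sm = boto3.client("sagemaker")

# Placeholder identifiers; substitute your own account IDs, Region, and group name.
PROD_ACCOUNT_ID = "222222222222"
GROUP_NAME = "text-classification"
GROUP_ARN = f"arn:aws:sagemaker:ca-central-1:111111111111:model-package-group/{GROUP_NAME}"
PACKAGE_ARN = f"arn:aws:sagemaker:ca-central-1:111111111111:model-package/{GROUP_NAME}/*"

# Resource policy granting the production account read access to model packages in the group.
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"AWS": f"arn:aws:iam::{PROD_ACCOUNT_ID}:root"},
            "Action": [
                "sagemaker:DescribeModelPackageGroup",
                "sagemaker:DescribeModelPackage",
                "sagemaker:ListModelPackages",
            ],
            "Resource": [GROUP_ARN, PACKAGE_ARN],
        }
    ],
}

sm.put_model_package_group_policy(
    ModelPackageGroupName=GROUP_NAME,
    ResourcePolicy=json.dumps(policy),
)
```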
The following diagram shows the workflow of the model build and training template.
The model serverless inference template automates a CI/CD pipeline for deploying a model to serverless endpoints. The pipeline is triggered by either code changes or model updates. The latest approved model deploys to the staging endpoint. After the manual approval, the pipeline deploys the model to the production endpoint.
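In the template this provisioning is done with Terraform, but the equivalent boto3 calls below sketch what a serverless endpoint deployment looks like. The model, configuration, and endpoint names are placeholders, and the memory and concurrency settings are illustrative.

```python
import boto3

sm = boto3.client("sagemaker")

# Placeholder names; in practice these come from the repository's configuration files.
MODEL_NAME = "email-classifier-model"
CONFIG_NAME = "email-classifier-serverless-config"
ENDPOINT_NAME = "email-classifier-staging"

# Serverless endpoint configuration: SageMaker manages capacity, so no instances are specified.
sm.create_endpoint_config(
    EndpointConfigName=CONFIG_NAME,
    ProductionVariants=[
        {
            "VariantName": "AllTraffic",
            "ModelName": MODEL_NAME,
            "ServerlessConfig": {
                "MemorySizeInMB": 4096,
                "MaxConcurrency": 5,
            },
        }
    ],
)

sm.create_endpoint(EndpointName=ENDPOINT_NAME, EndpointConfigName=CONFIG_NAME)
```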
The model serverless inference template consists of the following elements:
- On-premises Bitbucket repository: This repository contains code to create and configure the serverless endpoint. Data scientists update configuration files to specify serverless endpoint settings for deployments in the current (shared services) account and the production account.
- CodePipeline pipeline: This pipeline detects changes from Amazon S3 and model status updates. It triggers a CodeBuild project to execute the Terraform code, creating a serverless endpoint with the latest approved model package. After a manual approval process, the pipeline runs another CodeBuild project to create the production account serverless endpoint.
- CodeBuild projects: This is a two-stage deployment in which the first CodeBuild project creates a staging endpoint using the approved model. The second CodeBuild project then deploys the model to a serverless endpoint in the production environment.
- Amazon EventBridge rule: The rule detects model package status changes, which trigger the pipeline to recreate the endpoint with the newly approved model or roll back (an event pattern sketch follows this list).
- S3 bucket: The S3 bucket stores the pipeline artifacts.
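The event pattern behind such an EventBridge rule can be sketched as follows. The rule name, pipeline ARN, role ARN, and model package group name are placeholders; the relevant part is matching the SageMaker Model Package State Change event and forwarding it to the deployment pipeline.

```python
import json
import boto3

events = boto3.client("events")

# Placeholder names and ARNs; the template wires the rule to the deployment pipeline.
RULE_NAME = "model-package-approval-rule"
PIPELINE_ARN = "arn:aws:codepipeline:ca-central-1:111111111111:serverless-deploy-pipeline"
EVENTS_ROLE_ARN = "arn:aws:iam::111111111111:role/eventbridge-start-pipeline"

# Fire whenever a package in the model package group changes approval status.
event_pattern = {
    "source": ["aws.sagemaker"],
    "detail-type": ["SageMaker Model Package State Change"],
    "detail": {
        "ModelPackageGroupName": ["text-classification"],
        "ModelApprovalStatus": ["Approved", "Rejected"],
    },
}

events.put_rule(Name=RULE_NAME, EventPattern=json.dumps(event_pattern), State="ENABLED")
events.put_targets(
    Rule=RULE_NAME,
    Targets=[{"Id": "start-deploy-pipeline", "Arn": PIPELINE_ARN, "RoleArn": EVENTS_ROLE_ARN}],
)
```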
The following diagram shows the model serverless inference template.
The model batch inference template automates a CI/CD pipeline to deploy batch inference pipeline infrastructure to the staging account and the production account. The pipeline is triggered when batch transform configuration changes are pushed to the model batch inference repository.
The Amazon SageMaker pipeline retrieves the latest approved model package and creates the model and a batch transform job. New files in Amazon S3 trigger the pipeline to perform batch transforms, and the output files are available for retrieval in the S3 bucket. This flow is shown in the following diagram.
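The sketch below approximates that step in boto3: it looks up the most recently approved package in the model package group, creates a model from it, and starts a batch transform job. The group, model, job, bucket, and role names are placeholders.

```python
import boto3

sm = boto3.client("sagemaker")

GROUP_NAME = "text-classification"  # placeholder model package group
ROLE_ARN = "arn:aws:iam::222222222222:role/sagemaker-execution-role"  # placeholder

# Fetch the most recently approved model package in the group.
packages = sm.list_model_packages(
    ModelPackageGroupName=GROUP_NAME,
    ModelApprovalStatus="Approved",
    SortBy="CreationTime",
    SortOrder="Descending",
    MaxResults=1,
)["ModelPackageSummaryList"]
latest_arn = packages[0]["ModelPackageArn"]

# Create a model from the approved package, then run a batch transform over the input prefix.
sm.create_model(
    ModelName="email-classifier-batch",
    PrimaryContainer={"ModelPackageName": latest_arn},
    ExecutionRoleArn=ROLE_ARN,
)
sm.create_transform_job(
    TransformJobName="email-classifier-batch-001",
    ModelName="email-classifier-batch",
    TransformInput={
        "DataSource": {"S3DataSource": {"S3DataType": "S3Prefix", "S3Uri": "s3://example-batch-bucket/input/"}},
        "ContentType": "application/json",
    },
    TransformOutput={"S3OutputPath": "s3://example-batch-bucket/output/"},
    TransformResources={"InstanceType": "ml.m5.xlarge", "InstanceCount": 1},
)
```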
The batch inference pipeline workflow consists of the following elements:
- Bitbucket repository: Each model batch inference pipeline has its own on-premises Bitbucket repository. The repository contains code to create and configure batch inference infrastructure. The data scientist can modify the batch transform configuration for their specific use case, such as updating the instance type, instance count, strategy, max concurrent transforms, or max payload. It includes Terraform code to provision the following resources:
- AWS Lambda: The staging and production accounts each have a Lambda function, which retrieves the latest approved model package.
- Amazon EventBridge rule: The staging and production accounts each have one EventBridge rule. The rule detects new objects in the S3 bucket and passes the bucket name and object key into the pipeline as batch transform inputs (a handler sketch follows this list).
- SageMaker pipeline: There is one SageMaker pipeline for batch transform in the production account. The pipeline is first created as the infrastructure is deployed to the staging account. Its definition is saved and modified for production, and Terraform uses that definition to create the batch transform pipeline in the production account.
- S3 bucket: The production account has one S3 bucket for batch transform input and output.
- CodePipeline pipeline: The workflow contains one CodePipeline pipeline. Changes in the S3 bucket trigger the CodeBuild project for batch transform setup. The data scientist reviews and approves transform configurations. After approval, a second CodeBuild project deploys infrastructure to production. The pipeline orchestrates the entire process from development to production deployment.
- CodeBuild projects: The workflow contains two CodeBuild projects. The first CodeBuild project creates a pipeline for batch transform and generates Terraform configuration files for the staging and production accounts. Initial infrastructure deployment occurs in the staging account. The second CodeBuild project uses artifacts from the first project and uses the production configuration to provision infrastructure in the production account.
- S3 bucket: The workflow has one S3 bucket for pipeline artifacts and batch transform input and output.
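As referenced in the EventBridge rule item above, one way to relay a new object's bucket and key into the SageMaker pipeline is a small Lambda handler like the sketch below. The pipeline name and the InputBucket/InputKey parameters are hypothetical; the template's actual wiring may differ.

```python
import boto3

sm = boto3.client("sagemaker")

PIPELINE_NAME = "batch-transform-pipeline"  # placeholder pipeline name


def lambda_handler(event, context):
    """Relay an S3 'Object Created' EventBridge event into the batch transform pipeline."""
    bucket = event["detail"]["bucket"]["name"]
    key = event["detail"]["object"]["key"]

    # Start the SageMaker pipeline, passing the new object as the transform input.
    response = sm.start_pipeline_execution(
        PipelineName=PIPELINE_NAME,
        PipelineParameters=[
            {"Name": "InputBucket", "Value": bucket},  # hypothetical pipeline parameters
            {"Name": "InputKey", "Value": key},
        ],
    )
    return {"executionArn": response["PipelineExecutionArn"]}
```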
Unlocking benefits with AWS technologies for MLOps
TSBC has successfully delivered solutions that directly benefit clients and enhance operational efficiencies. Two notable projects showcase the public sector impact of this MLOps approach:
- Automated email routing: The AI model automatically categorizes and routes client emails to the relevant service teams. As client emails arrive, the model classifies them based on the type of request and routes them to the teams responsible for resolving the issues (an invocation sketch follows this list). This streamlined workflow has improved response times for resolving public safety issues.
- Automated customer satisfaction (CSAT) analysis: Manual review of customer surveys has been replaced with AI models for sentiment and category classification. This operational efficiency enables faster customer feedback analysis and improved response times for client support.
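For illustration, classifying an incoming email against a deployed text classification endpoint can look like the following. The endpoint name is a placeholder, and the request and response shapes assume a Hugging Face text classification container.

```python
import json
import boto3

runtime = boto3.client("sagemaker-runtime")

ENDPOINT_NAME = "email-classifier-production"  # placeholder endpoint name

# Classify an incoming client email so it can be routed to the right service team.
payload = {"inputs": "The elevator in our building stopped between floors this morning."}

response = runtime.invoke_endpoint(
    EndpointName=ENDPOINT_NAME,
    ContentType="application/json",
    Body=json.dumps(payload),
)
prediction = json.loads(response["Body"].read())
print(prediction)  # e.g. [{"label": "elevating_devices", "score": 0.97}] for a text classifier
```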
With the successful adoption of these templates, TSBC has standardized the workflow for managing model lineage and streamlined the model training and inference operations for projects.
Conclusion
TSBC developed an open source framework on AWS that streamlines MLOps patterns through three templates. These templates standardize MLOps across projects, enforcing data and model governance practices for security and reliability. Other public sector organizations can use this framework to accelerate their MLOps processes on AWS Cloud infrastructure.