Part 4 - A staging environment

Introduction

The previous article improves delivery quality, adding a test and human validation before deploying the application. Despite precautions, a new version of an application may fail, and restoring the service is a priority. In order to restore the service more quickly, this article integrates a rollback capability.

Here is the code for this article.

Feel free to create an issue to comment on or help us improve here.

Current status

The previous article leads to these files:

function_app.py: with the application API.
train.py: running the model training.
test.py: testing the API.
requirements.txt: containing the application dependencies.
.github/workflows/main_az-fct-python-ml.yml: the GitHub Action workflow doing the model training, asking for deployment validation then deploying the API.

Blue-green deployment

Blue-green deployment is a pattern for instantaneously switching from one version of an application - called blue - to another - called green. The blue application runs at version N, and the green application runs at version N’. During deployment, requests sent to the blue application are redirected to the green application: they switch. But the blue application continues to run in the background, and in the event of failure, it receives back requests, putting offline the green application: they roll back.

Another advantage of this deployment model is that it’s native in Azure Function.

Blue-green deployment is a standard, well-documented deployment model. The reader can find more information here or on the Internet.

Azure Function deployment slot

A deployment slot is an independent live application with a dedicated endpoint and lifecycle hosted on an Azure Function. The default deployment slot is transparent: we use it to host the application. Deployment slots on the same Azure Function can swap, enabling the blue-green deployment feature.

The current consumption plan limits the number of deployment slots to two, but Prenium and Dedicated allow more.

Adding a slot

Slots are in the “Deployment slots” panel of Azure Function. Initially, only the “Production” slot is present. Add a new slot by clicking “Add slot”, name it staging, and confirm. Two slots are now present, but the new slot is empty and we need to push an application as we did on the production slot.

Deployment slot demo

Authentication

The swap operation between slots is performed via the Azure CLI, which needs an authentication different from the one we are currently using. This authentication between the GitHub Action and Azure can be performed with OpenID Connect (OIDC). OIDC is a tokenless authentication mechanism capable of replacing the current token-based authentication.

Using OIDC removes the necessity of token rotation.

Following GIFs drives the reader through the main steps enabling OIDC, but a complete documentation is available here.

First, create the Microsoft Entra application:

Then, allow the Microsoft Entra application to deploy on both Azure Function deployment slots: App permission

Then, allow GitHub Action to use the identity: Federated identity

Now, on GitHub, add the Microsoft Entra application information: GitHub secrets

Now, update the workflow to use the Microsoft Entra application as explained here. The workflow ‘Deploy to Azure Functions’ can not be updated yet and will still use the GitHub Secret to publish due to a technical reason. But, the secret needs to be replaced by the publish profile value of the staging slot because we are now publishing on it not on the Production slot.

Workflow

Updating the N version of the application to the N’ version with the blue-green deployment using the deployment slot and the GitHub Action leads to the following infrastructure interaction:

Update the GitHub Action workflow

The current workflow contains the build and deploy jobs. The swap operation can be placed in the deploy job, which does not allow rollback. A solution is to create another job focused on the swap operation. In the event of failure, this job can be manually re-triggered, rollbacking the application.

Rename the deploy workflow to deploy-on-staging to be more precise, then update the deployment step to point on the staging slot by replacing Production by staging:

deploy-on-staging:
  runs-on: ubuntu-latest
  needs: build
  environment:
    name: 'Production'
    url: ${{ steps.deploy-to-function.outputs.webapp-url }}

  steps:
    ...

    - name: 'Deploy to Azure Functions'
      uses: Azure/functions-action@v1
      id: deploy-to-function
      with:
        app-name: 'az-fct-python-ml'
        slot-name: 'staging'
        package: ${{ env.AZURE_FUNCTIONAPP_PACKAGE_PATH }}
        scm-do-build-during-deployment: true
        enable-oryx-build: true

Then, create a new job to swap slots:

swap-staging-and-production:
  runs-on: ubuntu-latest
  needs: deploy-on-staging
  environment:
    name: 'Production'
    url: ${{ steps.deploy-to-function.outputs.webapp-url }}

  steps:
    - name: 'Az CLI login'
      uses: azure/login@v1
      with:
        client-id: ${{ secrets.AZURE_CLIENT_ID }}
        tenant-id: ${{ secrets.AZURE_TENANT_ID }}
        subscription-id: ${{ secrets.AZURE_SUBSCRIPTION_ID }}
        enable-AzPSSession: true

    - name: Swap staging and production
      run: az functionapp deployment slot swap -g az-fct-python-ml_group -n az-fct-python-ml --slot staging --target-slot production

Where az-fct-python-ml_group and az-fct-python-ml are, respectively, the Ressource Group and the Azure Function created during the step 1 of this guide.

For simplicity, we only use one GitHub Environment and Azure pre-defined role for readability, which is not the best for security.

Rollback

HTTP 503

The reader is probably already aware of the cold start, but during a swap, API users may also encounter HTTP errors during a couple of seconds due to technical reasons. To limit this error, add in the app settings of all slots the variable WEBSITE_ADD_SITENAME_BINDINGS_IN_APPHOST_CONFIG set to 1.

A good article on this topic here.

App settings

Summary and next step

We can now quickly revert to the previous version of our application. But in fact, we’re deploying an API that doesn’t need updating: we’re updating its model without touching the API code. The following article focuses on remote model storage, limiting infrastructure updates to each version.

Any comments, feedback or problems? Please create an issue here.