ML Pipeline Templates: End-to-End Tutorial

ML Pipeline Templates provide step-by-step guidance on implementing typical machine learning scenarios.  Each template introduces a machine learning project structure that allows you to modularize data processing, model definition, model training, validation, and inference tasks.  Using distinct steps makes it possible to rerun only the steps you need as you tweak and test your workflow.  A well-defined, standard project structure helps all team members understand how a model was created.

ML pipeline templates are based on popular open source frameworks such as Kubeflow, Keras, and Seldon to implement end-to-end ML pipelines that can run on AWS, on-prem hardware, and at the edge.  Kubeflow is an open source, cloud-native solution for ML.

The following Kubeflow extensions are introduced to address common challenges:

  1. get_secret for managing Kubernetes secrets from Jupyter notebooks (see the sketch after this list)
  2. S3 filesystem for Kubeflow
  3. Kaniko KFP component for building Docker images
  4. Template magic for Notebooks
  5. Extensions for Notebooks: debugging, variable explorer
  6. Environment configuration for Notebooks based on configmaps and secrets
  7. Continuous deployment for Kubeflow Pipelines
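
As an illustration of the first extension, the sketch below reads a Kubernetes secret from inside a notebook pod using the official kubernetes Python client; the secret and key names are placeholders, and the get_secret helper presumably wraps similar logic.

# Minimal sketch: read a Kubernetes secret from a notebook pod.
# The secret name, namespace, and key below are hypothetical placeholders.
import base64
from kubernetes import client, config

config.load_incluster_config()          # use the notebook pod's service account
v1 = client.CoreV1Api()

secret = v1.read_namespaced_secret("aws-credentials", "kubeflow")  # placeholder names
access_key = base64.b64decode(secret.data["AWS_ACCESS_KEY_ID"]).decode("utf-8")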

Machine learning is increasingly used in real-world systems such as autonomous vehicles, voice recognition, language translation, and many others.  However, to generate a positive ROI, ML needs to be operationalized (deployed into production).  Since machine learning is still a new field, early adopters have run into obstacles deploying it into production due to friction with engineering and IT teams.  In some cases, work done by data scientists and machine learning engineers is wasted because it never escapes the lab due to technical constraints, or cannot be scaled to larger data.

To address these problems, machine learning can take a page from the DevOps playbook.  To deliver business value and build intelligent software products, data science teams need to focus on the following priorities:

  • Start by asking the right questions - define clear requirements for questions that machine learning models should answer/predict/estimate. 
  • Implement several experiments, then train and evaluate the model in the lab environment.  Ingesting, cleaning, and labeling training data sets are important steps in this process, because data scientists need as much quality data as possible to build and train their ML models.  The more high-quality data they get, the more accurate the model predictions become.
  • As soon as possible, deploy the model to a Test or Production environment in order to expose it to real-life data.
  • Based on user feedback and learnings from real-life data, iterate frequently to build even better and more intelligent products.

Overview

Machine learning pipelines play an important role in building production-ready AI/ML systems.  Using ML pipelines, data scientists, data engineers, and IT operations can collaborate on the steps involved in data preparation, model training, model validation, model deployment, and model testing.  Machine learning pipelines address two main problems of traditional machine learning model development: the long cycle time between training models and deploying them to production, which often includes manually converting the model to production-ready code, and the use of production models that were trained on stale data.  Many common steps in ML pipelines should be automated and tracked.  For example, model validation steps must be tracked to help data scientists evaluate model accuracy and pick the optimal hyperparameters.  Agile Stacks provides reusable machine learning pipelines that can be used as templates for your machine learning scenarios.

The following diagram shows a typical ML pipeline.  It provides a platform for running machine learning experiments, making it easy to try numerous ideas and techniques and to manage your various trials and experiments.  The arrows indicate that machine learning projects are highly iterative: as you progress through the pipeline steps, you will find yourself iterating on a step until you reach the desired model accuracy, then proceeding to the next step.  An excellent blog post by Jeremy Jordan discusses the machine learning workflow in more detail.  The workflow is not complete even after you deploy the model into production: you get feedback from real-world interactions and repeat the data preparation and training steps.

ML-pipeline

In this tutorial we will cover how to leverage Kubeflow Pipeline templates to get your ML experiments from the lab into the real world as quickly as possible.  Kubeflow Pipelines is a platform for building and deploying portable, scalable machine learning (ML) workflows based on Kubernetes.  We use Kubernetes for automating deployment, scaling, and management of containerized applications.  Kubernetes has evolved into the de facto industry standard for container management and for machine learning at scale.  There are many reasons why Kubernetes provides the best platform for machine learning at scale: repeatable experiments, portable and reproducible environments, efficient utilization of CPUs and GPUs, tracking and monitoring of metrics in production, and proven scalability.

Based on the Agile Stacks Kubeflow Pipeline template, we will implement a machine learning pipeline for training, monitoring, and deployment of deep learning models.  We will use popular open source frameworks such as Kubeflow, Keras, and Seldon to implement end-to-end ML pipelines.  The Kubeflow project is designed to simplify the deployment of machine learning frameworks like Keras and TensorFlow on Kubernetes.  These frameworks can leverage multiple GPUs in the Kubernetes cluster for machine learning tasks.  We will also cover how you can use Kubeflow Pipelines to continuously deploy models to production and retrain models on real-life data.

After completing this tutorial, you will have practical experience working with ML pipelines using Python and Jupyter notebooks as the primary tools.  No prior knowledge of Kubernetes is required.  We will cover the following topics during the tutorial:
  • Learn how to build pipelines for training, monitoring and deployment of deep learning models.
  • Prepare and store training data in S3 buckets or NFS volumes.
  • Build, train, and deploy models from Jupyter notebooks.
  • Train a sequence-to-sequence NLP model using multiple GPUs.
  • Deploy your machine learning models and experiments at scale on the AWS Kubernetes service.
  • Reduce cost with automation, node autoscaling, and AWS spot instances.
  • Deploy machine learning models on AWS, GCP, or on-prem hardware.
  • Compare results of experiments with monitoring and experiment tracking tools.
  • Automate experiment tracking, hyperparameter optimization, model versioning, and deployment to production.
  • Implement a simple web application and send prediction requests to the deployed models using Seldon.
  • Use the provided sample pipelines, algorithms, and training data sets to solve common problems such as data cleansing and training on a large number of samples.

Step 1: Define the Problem

Text summarization is the problem of creating a short and accurate summary of a longer text document.  We will use a sequence-to-sequence neural network to summarize the text found in GitHub issues and make predictions: extract meaning from the issue text and generate an issue title.  This type of model can be used for text translation or for free-form question answering (generating a natural language answer given a natural language question); it is applicable any time you need to generate text.  To train the model, we will gather many training samples (issue title, issue body) with the goal of teaching the model to summarize issues and generate titles.  The idea is that by seeing many examples of issue descriptions and titles, a model can learn how to summarize new issues.  We are going to implement a complete ML pipeline based on the GitHub issue summarization model by Hamel Husain.  To test the model on real-life data, we will implement a web application that requests inferences from the deployed model.

Step 2: Provision a Kubernetes Cluster

You can use the Agile Stacks Machine Learning Stack to create ML pipelines from several reusable templates.  When building pipelines with Agile Stacks ML Pipeline Templates, you can focus on machine learning, rather than on infrastructure. 

If you are working on this tutorial during a workshop, please follow the instructions provided at the start of the workshop to log in to the development environment.  If you are going through this tutorial outside of a workshop, you need to provision a development environment in your own AWS or GCP cloud account.  You can register for an Agile Stacks trial account or subscribe via AWS Marketplace.

Once you receive an activation email, log in to the Agile Stacks Console and create a cloud account by clicking Cloud > Cloud Accounts > Create.  Then create an environment: Cloud > Environments > Create.  For example, name your environment DEV and associate it with the cloud account you just added.  Now we can create a stack template: Templates > Create.  Give your stack template a unique name and select the following stack options from the catalog: Kubernetes, Dashboard, Harbor, Minio, Prometheus, ACM (or Let's Encrypt), and Kubeflow.

create_stack_template

Click Build Now: this will save the stack template in Git as a set of Terraform and CloudFormation scripts and deploy it into your cloud account.  On the next page, name your Kubernetes cluster, for example dev.demo05.superhub.io.  For this tutorial, you can use the default settings for the count and types of Kubernetes nodes.  Notice that if the "On Demand" check box is left unchecked, Kubernetes nodes are deployed on AWS spot instances, which offer a cost-effective way to run Kubernetes.  Later in this workshop, you can add a few GPU nodes to this cluster in order to train the model on a large number of training samples.  Refer to the following screenshot for recommended settings.  Now you need to wait 10-15 minutes for the Kubernetes cluster to be deployed.

This tutorial uses billable components of AWS or GCP, in the range of $1 to $2 per hour.  To minimize costs, clean up the deployed resources when you've finished with the tutorial.

create_stack_2

Note that in order to provide sufficient CPU and RAM for Jupyter notebooks, you need to select at least one Worker node of size r4.xlarge or larger.  Later on you can add a GPU node to train the model faster.

Step 3: Create ML Pipeline from Template

After your Kubernetes cluster is deployed, create a new ML pipeline from a template by clicking Stacks > Applications > Install.  Select the "Machine Learning" template, and then Kubeflow (Keras and Seldon).  Compose your machine learning application: select the previously created environment (DEV) and use the following options:

Kubeflow engine: Kubeflow
Artifact object storage: Minio
Source code repository: Github-token-dev
Docker registry: Harbor
Application Name: "kfp1"
Bucket Name: "kfp1"
Source code repository name: "kfp1"
Docker registry: "kfp1"

kubeflow_app1

Click the Install button.  After the application is created, you will receive three URLs:

Source code: Git repository URL where ML Pipeline source code is stored
Notebook: URL to the pipeline editor
Bucket: URL to the object storage bucket where all training data and model files are stored

Kubeflow_pipeline_template

Click the Notebook button to open the Jupyter notebook.

Step 4: Define Pipeline Parameters

You can click the Notebook button to open the Kubeflow Pipelines Workbench.  Alternatively, navigate to your stack instance by clicking Stacks > Instances > List, select your stack instance, and click the Kubeflow button.

Click Activity (the top left icon), and then click Notebooks.  You should see a list of Jupyter notebooks, one for each pipeline template that was created.  Select your notebook server name and click Connect.

jupyter1

A Jupyter notebook integrates code and its output into a single document that combines visualizations, narrative text, mathematical equations, and other rich media.  This intuitive workflow promotes iterative and rapid development, making notebooks a preferred tool for many data scientists.  You can review all steps of the machine learning pipeline by browsing the Python files in the workspace > src folder.

Step 5: Examine the Project Structure

A well-organized machine learning codebase should modularize data processing, model definition, model training, validation, and inference tasks.  Using distinct steps makes it possible to rerun only the steps you need, as you tweak and test your workflow. A step is a computational unit in the pipeline.  A well-defined, standard project structure helps all team members to understand how a model was created.

Data sources and intermediate data are reused across the pipeline, which saves compute time and resources.

Directory structure

data
├── training-latest                 <- Data volume (S3 bucket) where training data and the model are stored
workspace
├── components                      <- Project components used for pipeline steps such as training
├── nbextentions                    <- Notebook extension libraries that implement common tasks
├── 101-training.ipynb              <- Wrapper notebook for initial experimentation and fine tuning of the model
├── 201-distributed-training.ipynb  <- Wrapper notebook for training at scale using containers and a KF pipeline
├── 201-data-pipeline.ipynb         <- Example notebook to implement pipeline steps in Go
├── README.md                       <- The top-level README for developers using this project.
 

Step 6: Build the Keras Model

Next, open the training notebook: "workspace > 101-training.ipynb"
In the initial experimentation step you will use a Jupyter notebook running on a single node to define the model and evaluate it on a small data set.  The notebook performs three important steps.  First, it downloads a data set in CSV format (we call this operation data import) and splits it into training and validation data sets.  Next, it cleans the data by removing rows that contain invalid data, and finally it executes the model training step.  An S3 bucket is mounted as a file system under the data directory: this is the place for all training artifacts.  Keras requires training data to be accessible via a file system interface, but instead of downloading the data to a local file system, you can access it through the mounted object storage.  From the Jupyter notebook, navigate to the directory /data/training-latest to see all files you will use for model training and validation.
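
As an illustration of these data import, cleaning, and split steps, here is a minimal sketch using pandas and scikit-learn; the file paths and column names are assumptions, and the template notebook defines the actual values.

# Minimal sketch of data import, cleaning, and train/validation split.
# DATA_DIR and the column names are illustrative placeholders.
import pandas as pd
from sklearn.model_selection import train_test_split

DATA_DIR = "/home/jovyan/data/training-latest"
df = pd.read_csv(f"{DATA_DIR}/dataset.csv")            # data import

df = df.dropna(subset=["issue_title", "issue_body"])   # drop rows with missing fields
df = df[df["issue_body"].str.len() > 0]                # drop empty issue bodies

train_df, valid_df = train_test_split(df, test_size=0.1, random_state=42)
train_df.to_csv(f"{DATA_DIR}/train.csv", index=False)
valid_df.to_csv(f"{DATA_DIR}/valid.csv", index=False)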
 
Sections 1.1 and 1.2 define experiment parameters and download training data from an external S3 bucket.  You can use SAMPLE_DATASET (2 MB) for initial experimentation to test your data preparation steps, and then switch to FULL_DATASET (3 GB) to train your model to higher accuracy.  After downloading the training data set, you will see dataset.csv in the directory data/training-latest.  The same directory is also available from the notebook's terminal command line:
ls /home/jovyan/data/training-latest
 
In notebook section 2.2 you will use the ktext library to convert text into vectors for both the title and the body text.  The ktext library is a thin wrapper around Keras and spaCy text processing utilities: it cleans, tokenizes, and applies padding/truncation so that each document has a fixed length of 70 tokens.  You will also keep only the top 8,000 words in the vocabulary and set the remaining words to 1, which becomes a common index for rare words.
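
The snippet below sketches how this processing typically looks with ktext, following the GitHub issue summarization example; the exact arguments and column names used by the template notebook may differ.

# Sketch of the ktext preprocessing step (title lengths and vocab sizes are illustrative).
import pandas as pd
from ktext.preprocess import processor

train_df = pd.read_csv("/home/jovyan/data/training-latest/train.csv")  # from the sketch above
train_body_raw = train_df["issue_body"].tolist()
train_title_raw = train_df["issue_title"].tolist()

# Clean, tokenize, and pad/truncate issue bodies to a fixed length of 70,
# keeping only the 8,000 most frequent words (rare words share index 1).
body_pp = processor(keep_n=8000, padding_maxlen=70)
train_body_vecs = body_pp.fit_transform(train_body_raw)

# Titles get a shorter processor with start/end indicators for the decoder.
title_pp = processor(append_indicators=True, keep_n=4500, padding_maxlen=12, padding='post')
train_title_vecs = title_pp.fit_transform(train_title_raw)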
 
Note that TRAINING_DATA_SIZE is a parameter that limits the training data set to at most TRAINING_DATA_SIZE samples.  If it is set to 2,000, that is not sufficient to achieve good model accuracy; increase it to 100,000 to improve accuracy.  In Step 7 you will build a pipeline that can process significantly larger training data sets.
 
In section 3 (Training) the notebook calls training.py to define the model and train it.  Training is implemented as a Python file to keep it separate from the data preparation code and to let the same code be referenced both from a notebook and from a Kubeflow Pipeline.  The model uses a multilayered Long Short-Term Memory (LSTM) network to map the input sequence to a vector of fixed dimensionality, and then another deep LSTM to decode the target sequence from that vector.  The sequence-to-sequence model has two components: the encoder and the decoder.  The encoder extracts features from the text and presents this information to the decoder, and the decoder takes that information and attempts to generate a text summary.  Please visit this tutorial to learn more about sequence-to-sequence learning.
Seq2seq inference
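
For illustration, here is a highly simplified sketch of this encoder-decoder architecture using the Keras functional API; the real model is defined in training.py, and the layer sizes, names, and vocabulary sizes below are placeholders rather than the template's actual values.

# Simplified encoder-decoder sketch (not the template's actual training.py).
from tensorflow.keras.layers import Input, Embedding, LSTM, Dense
from tensorflow.keras.models import Model

VOCAB_BODY, VOCAB_TITLE, LATENT_DIM = 8002, 4502, 300   # illustrative sizes

# Encoder: embed the issue body and compress it into a fixed-size state vector.
encoder_inputs = Input(shape=(70,), name='encoder_input')
x = Embedding(VOCAB_BODY, LATENT_DIM, name='body_embedding')(encoder_inputs)
_, state_h, state_c = LSTM(LATENT_DIM, return_state=True, name='encoder_lstm')(x)

# Decoder: generate the title one token at a time, conditioned on the encoder state.
decoder_inputs = Input(shape=(None,), name='decoder_input')
y = Embedding(VOCAB_TITLE, LATENT_DIM, name='title_embedding')(decoder_inputs)
y, _, _ = LSTM(LATENT_DIM, return_sequences=True, return_state=True,
               name='decoder_lstm')(y, initial_state=[state_h, state_c])
decoder_outputs = Dense(VOCAB_TITLE, activation='softmax', name='title_output')(y)

seq2seq_Model = Model([encoder_inputs, decoder_inputs], decoder_outputs)
seq2seq_Model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')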
 
The code for training the model is in the training.py file and involves calling the fit method on the model object you defined: history = seq2seq_Model.fit(...).  You pass additional parameters such as callbacks for logging, the number of epochs, and the batch size.  These parameters are called model hyperparameters.  During initial experimentation with the training data and the model, you may need to change the hyperparameters to achieve better model accuracy.  The trained model and weights are saved to the file training.h5.
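
The following snippet illustrates where these hyperparameters are passed, reusing the names from the sketches above; the actual callbacks, epoch count, and batch size are defined in training.py and may differ.

# Illustrative fit call showing the hyperparameters mentioned above.
from tensorflow.keras.callbacks import CSVLogger, ModelCheckpoint

callbacks = [
    CSVLogger('training.log'),                             # log metrics per epoch
    ModelCheckpoint('training.h5', save_best_only=True),   # keep the best weights
]

history = seq2seq_Model.fit(
    [train_body_vecs, train_title_vecs[:, :-1]],  # decoder input: title shifted by one token
    train_title_vecs[:, 1:, None],                # decoder target: the next title token
    batch_size=1200,
    epochs=7,
    validation_split=0.12,
    callbacks=callbacks,
)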
 
Finally, you will see examples of model predictions on a test data set to evaluate whether model accuracy is improving.  It is useful to see examples of real predictions on a holdout set to get a sense of the performance of the model.  Seq2Seq_Inference allows you to run such prediction tests.
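
As a rough illustration only: the constructor and method names below follow the GitHub issue summarization example's helper utilities and are assumptions that may not match the template's code exactly.

# Spot-check predictions on the holdout set (argument names are assumptions).
from seq2seq_utils import Seq2Seq_Inference

seq2seq_inf = Seq2Seq_Inference(encoder_preprocessor=body_pp,
                                decoder_preprocessor=title_pp,
                                seq2seq_model=seq2seq_Model)

# Print a handful of (issue body, actual title, generated title) examples.
seq2seq_inf.demo_model_predictions(n=10, issue_df=valid_df)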

Step 7: Run the Training Pipeline

Next, open the pipeline definition notebook: "workspace > 201-distributed-training.ipynb"
 
You will define a pipeline that executes all data processing and training steps as part of an automated workflow that can run at scale on multiple CPUs or GPUs.  Using pipelines, you can train the model on 9 million samples instead of a few thousand in order to achieve the desired model accuracy.  The pipeline performs the same important steps as before: it downloads a data set in CSV format (data import) and splits it into training and validation data sets, cleans the data by removing rows that contain invalid data, and finally executes the model training step.  The training step runs the experiment and passes parameters such as the location of the training dataset files, the sample size, the learning rate, and the file name for the trained model.
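
For context, the sketch below shows how a pipeline of this shape is typically declared with the Kubeflow Pipelines (KFP v1) SDK; the image names, arguments, and default parameter values are placeholders, not the template's actual definitions.

# Condensed KFP v1 sketch of an import + train pipeline (placeholder images and args).
import kfp
from kfp import dsl

@dsl.pipeline(name='github-issue-summarization',
              description='Import, clean, and train on GitHub issue data')
def training_pipeline(data_url: str = 'https://example.com/dataset.csv',
                      sample_size: int = 100000,
                      learning_rate: float = 0.001,
                      model_file: str = 'training1.h5'):
    import_op = dsl.ContainerOp(
        name='import-data',
        image='registry.example.com/kfp1/import:latest',
        arguments=['--data-url', data_url, '--sample-size', sample_size])

    train_op = dsl.ContainerOp(
        name='train-model',
        image='registry.example.com/kfp1/train:latest',
        arguments=['--learning-rate', learning_rate, '--model-file', model_file])
    train_op.after(import_op)    # run training only after the data is imported

if __name__ == '__main__':
    kfp.compiler.Compiler().compile(training_pipeline, 'training_pipeline.tar.gz')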
 
jupyter2

 

Click the link that takes you to the Pipelines Workbench.  Here you can view all experiment runs, and for each experiment you can see its parameters and log files.
 
kubeflow_pipeline

Step 8: Deploy the Model Serving Application

After you train your model, you can deploy it to get predictions on test data or on real-life data that the model has never seen before.  We are going to package the model as a Docker image and deploy it as a REST API.  In addition, we are going to create a simple web application to send prediction requests to the model.

Seldon is a great framework for deploying and managing models in Kubernetes.  Seldon makes models available as REST APIs or gRPC endpoints and helps to deploy multiple versions of the same model for A/B testing or multi-armed bandits.  Seldon takes care of scaling your models and keeping them running behind a standard API.

Open the file "serving.py" to take a look at the code that uses the trained model files to generate a prediction.  We are going to package this code in a Docker container and deploy it as a Seldon API.  In notebook step [181] we create the Dockerfile, and in step [186] we deploy the model to the Kubernetes cluster by running a kubectl command from the Jupyter notebook.  Note that "templates/seldon.yaml" provides a template for the Kubernetes deployment file; all environment-specific details are injected automatically via parameters: model name, model version, path to model files, and Seldon authentication secrets.  The Keras model is referenced from the Seldon container via the parameter MODEL_FILE with the value "/mnt/s3/latest/training/training1.h5".  You can examine the seldon.yaml file, but you don't need to change any parameters.  The following screen shows the output from model deployment step 3.2.  You can test the model by sending it some test data for predictions using the seldon.prediction API, as shown in step 3.3.
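
As a rough sketch of the Seldon Python wrapper pattern serving.py follows: a class whose predict() method Seldon exposes over REST/gRPC.  The class name, default model path, and prediction logic below are illustrative assumptions, not the template's actual code.

# Illustrative Seldon Python model wrapper (not the template's actual serving.py).
import os
from tensorflow.keras.models import load_model

class IssueSummarization:
    def __init__(self):
        # Load the trained Keras model from the mounted S3 volume (path set via MODEL_FILE).
        self.model = load_model(os.environ.get('MODEL_FILE',
                                               '/mnt/s3/latest/training/training1.h5'))

    def predict(self, X, feature_names=None):
        # Seldon calls predict() for each request. The real serving code runs the
        # seq2seq generation loop to turn each issue body in X into a title;
        # shown here generically as a single model call.
        return self.model.predict(X)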

seldon_deployment

It takes a few minutes to deploy the model via Seldon; if you run validation too soon, you may get an error.  Just wait 2-3 minutes and try sending a test request to Seldon again.  Step 3.3 provides a code sample you can use to test a Keras model deployed via Seldon.  You can edit the value of test_payload to send several test payloads to your model.
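
If you prefer to test outside the notebook helper, the sketch below sends a request over plain HTTP following the Seldon Core REST convention; the endpoint URL is a placeholder, and the exact path and payload shape depend on your Seldon version and ingress setup.

# Send a test payload to the deployed model over HTTP (placeholder endpoint).
import requests

SELDON_URL = 'https://<your-seldon-endpoint>/api/v1.0/predictions'  # placeholder
test_payload = 'The app crashes when I click the save button on the settings page'

response = requests.post(
    SELDON_URL,
    json={'data': {'ndarray': [[test_payload]]}},
    timeout=30,
)
print(response.json())   # the generated issue title comes back in the response data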

model_testing

Step 9: Deploy the Web Application

Now that the model is deployed as a REST API, you will build a simple Python Flask application to provide a web UI for end users to interact with the model.  The test application pulls a random issue from GitHub when a user clicks the "Populate Random Issue" button.  When a user clicks the "Generate Title" button, the application sends the issue text to the model via the REST API to generate the issue summary.  In step 4.1 you will build the application container.  The source code for the web application is available in a Python file which you can view or edit from the Jupyter notebook: workspace / components / flask / app / src / app.py.
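
As a condensed illustration of this kind of application (route names and the MODEL_URL default are placeholders; see app.py for the real implementation):

# Minimal Flask sketch: serve a page and forward issue text to the model's REST API.
import os
import requests
from flask import Flask, jsonify, render_template, request

app = Flask(__name__)
MODEL_URL = os.environ.get('MODEL_URL', 'http://seldon-model:8000/api/v1.0/predictions')

@app.route('/')
def index():
    return render_template('index.html')

@app.route('/summarize', methods=['POST'])
def summarize():
    issue_text = request.form['issue_body']
    resp = requests.post(MODEL_URL, json={'data': {'ndarray': [[issue_text]]}}, timeout=30)
    return jsonify(title=resp.json())

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=5000)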


Once the application is deployed you can access it using the following URL:
https://kubeflow.svc.stack_id.account_id.superhub.io/webapp-github/

 

Step 10: Wrapping Up

We have completed an end-to-end ML pipeline that enables production ML lifecycle management.

 

For more information about Kubeflow, visit https://www.kubeflow.org.

 

The code for this tutorial is on GitHub (Kubeflow Extensions), and it’s available for one-click deployment as ML pipeline template “Kubeflow Pipeline” on Agile Stacks.  While you can run this tutorial on any Kubernetes cluster, the easiest way to create a lab environment is with Agile Stacks.  You can register for a free trial account using the following link:

Free Trial

 

In this tutorial we implemented a complete machine learning pipeline, from data preparation to model training and serving.  Feel free to experiment with it and adapt it to your needs.  The objective is to make your machine learning projects more agile, your iterations faster, and your models better.

Agile Stacks is a registered trademark of Agile Stacks, Inc. All product names and registered trademarks are property of their respective owners.