DevOps for Machine Learning

The Agile Stacks Machine Learning Stack allows data science teams to build, train, and deploy machine learning models efficiently and at scale on Kubernetes.  The Machine Learning Stack incorporates open, standard software for machine learning: Kubeflow, TensorFlow, Keras, PyTorch, Argo, and others.  Machine learning is a multi-step process, and Automation Hub connects all tools in the machine learning pipeline, automating the hand-offs between them and making machine learning initiatives easier to run.  This significantly reduces the effort to build, train, and deploy machine learning models.

For data scientists to be successful, data science and data engineering problems need to be solved in parallel.  Because the Machine Learning Stack is designed to automate the complexity of machine learning pipelines, data scientists have more time to focus on modeling tasks.

DevOps automation for ML accelerates the process by which an idea goes from development to production.  It helps achieve several key objectives:

  • Fastest time to train, with as much data and as accurately as possible
  • Fastest time to inference, with ability to rapidly retrain
  • Safe and reliable deployments to observe model behavior in the real world

Another area of automation addressed by the Machine Learning Stack is experiment tracking and model versioning.  Deploying machine learning systems to production typically requires the ability to run many models, and multiple versions of each model, at the same time.  Your code, data preparation workflows, and models can be versioned in Git, and data sets can be versioned through cloud storage (AWS S3, Minio, Ceph).  Version control of all code, models, and workflows makes it possible to run multiple versions of a model concurrently to optimize results, and to roll back to a previous version when needed.  Instead of ad-hoc scripts, you can use Git push/pull commands to move consistent packages of ML models, data, and code into Dev, Test, and Production environments.
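The versioning and rollback idea can be illustrated with a minimal sketch. It assumes a hypothetical in-memory `ModelRegistry` class and placeholder `s3://example-bucket/...` locations, none of which are part of the Agile Stacks product; a real setup would rely on Git and object storage as described above.

```python
from dataclasses import dataclass, field


@dataclass
class ModelRegistry:
    """Hypothetical in-memory registry, for illustration only."""
    versions: dict = field(default_factory=dict)  # version tag -> artifact location
    history: list = field(default_factory=list)   # promoted versions, oldest first

    def register(self, tag: str, artifact_uri: str) -> None:
        """Record where the artifact for a version lives; tags are immutable."""
        if tag in self.versions:
            raise ValueError(f"version {tag!r} is already registered")
        self.versions[tag] = artifact_uri

    def promote(self, tag: str) -> None:
        """Make a registered version the current production version."""
        if tag not in self.versions:
            raise KeyError(tag)
        self.history.append(tag)

    def rollback(self) -> str:
        """Drop the current version and return the one that preceded it."""
        if len(self.history) < 2:
            raise RuntimeError("no earlier version to roll back to")
        self.history.pop()
        return self.history[-1]

    @property
    def current(self):
        return self.history[-1] if self.history else None


registry = ModelRegistry()
registry.register("v1", "s3://example-bucket/models/v1")  # placeholder location
registry.register("v2", "s3://example-bucket/models/v2")  # placeholder location
registry.promote("v1")
registry.promote("v2")
restored = registry.rollback()
print(restored, registry.versions[restored])
```

Keeping every registered version addressable is what allows several versions to be served side by side, while the promotion history is what makes a rollback a single step.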

The Agile Stacks Control Plane provides a central access point for data scientists and software developers looking to work together on machine learning projects.  It streamlines the process of creating machine learning pipelines and data processing pipelines, and of integrating AI/machine learning with existing applications and business processes.

A typical Machine Learning pipeline includes a number of steps:  

  1. Data preparation / ETL
  2. Model training and testing
  3. Model evaluation and validation
  4. Deployment and versioning
  5. Production and monitoring
  6. Continuous training / reinforcement learning
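As a rough illustration of the first four steps, the sketch below chains plain Python functions that pass a shared context dictionary from stage to stage. The stage names, the toy data set, and the `max_mae` threshold are assumptions made for this example; monitoring and continuous training are left out to keep it short.

```python
def prepare_data(ctx):
    # Toy data set following y = 2x + 1, split into train and test parts.
    rows = [(float(x), 2.0 * x + 1.0) for x in range(10)]
    return {**ctx, "train": rows[:8], "test": rows[8:]}


def train_model(ctx):
    # Fit a line by ordinary least squares (closed form for one feature).
    xs = [x for x, _ in ctx["train"]]
    ys = [y for _, y in ctx["train"]]
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in ctx["train"])
             / sum((x - mean_x) ** 2 for x in xs))
    return {**ctx, "model": (slope, mean_y - slope * mean_x)}


def evaluate_model(ctx):
    # Mean absolute error on the held-out test rows.
    slope, intercept = ctx["model"]
    errors = [abs(slope * x + intercept - y) for x, y in ctx["test"]]
    return {**ctx, "mae": sum(errors) / len(errors)}


def deploy_model(ctx):
    # Gate deployment on the evaluation result.
    return {**ctx, "deployed": ctx["mae"] <= ctx["max_mae"]}


STAGES = [prepare_data, train_model, evaluate_model, deploy_model]


def run(stages, ctx):
    """Run each stage in order, feeding it the previous stage's output."""
    for stage in stages:
        ctx = stage(ctx)
    return ctx


result = run(STAGES, {"max_mae": 0.5})
print(result["model"], result["mae"], result["deployed"])
```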

At the core of the Machine Learning Stack is the open source Kubeflow platform, enhanced and automated using Agile Stacks’ own security, monitoring, CI/CD, workflow, and configuration management capabilities.  Kubeflow is a Google-led open source project designed to alleviate some of the more tedious tasks associated with machine learning.  It helps manage the deployment of machine learning apps through the full cycle of development, testing, and production, while allowing resources to scale as demand increases.

Machine Learning Template

With Agile Stacks, you can compose multiple best-of-breed frameworks and tools to build a stack template and essentially define your own reference architecture for machine learning.  Stack services are available via simple catalog selection and provide plug-and-play support for monitoring, logging, analytics, and testing tools.  A stack template can also be extended with additional services by importing custom automation scripts.

| Stack Service | Description | Available Implementations |
|---|---|---|
| ML Platform | The Kubeflow project is dedicated to making deployments of machine learning workflows on Kubernetes simple, portable, and scalable. | Kubeflow, Kubernetes |
| ML Frameworks | Supported machine learning and deep learning frameworks, toolkits, and libraries. | TensorFlow, Keras, Caffe, PyTorch |
| Storage Volume Management | Manage storage for data sets (structured, unstructured, audio, and video), automatically deploying required storage implementations and providing data backup and restore operations. | Ceph (block and object), Minio |
| Image Management | A private Docker registry secures and manages the distribution of container images. A container registry controls how container repositories and their images are created, stored, and accessed. | Amazon ECR, Harbor Registry |
| Workflow Engine | Specify, schedule, and coordinate the running of containerized workflows and jobs on Kubernetes, optimized for scale and performance. | |
| Model Training | Collaborative and interactive model training. | JupyterHub, TensorBoard, Argo workflow templates |
| Model Serving | Export and deploy trained models on Kubernetes. Expose ML models via REST and gRPC for easy integration into business apps that need predictions. | Seldon, tf-serving |
| Model Validation | Estimate model skill while tuning the model’s hyperparameters. Compare desired outputs with model predictions. | Argo workflow templates |
| Data Storage Services | Distributed data storage and database systems for structured and unstructured data. | Minio, S3, MongoDB, Cassandra, HDFS |
| Data Preparation and | Workflow application templates let you create data processing pipelines that automatically build container images, ingest data, run transformation code, and schedule workflows on data events or messages. | Argo, NATS, workflow application templates |
| Infrastructure Monitoring | Collect, visualize, and alert on performance metrics using pre-configured monitoring tools. Gain full visibility into your training and inference jobs. | Prometheus, Grafana |
| Model Monitoring | Continuously monitor model accuracy over time, and retrain or modify the model as needed. | Prometheus, Grafana, Istio |
| Load Balancing & Ingress | Expose cluster services and REST APIs to the Internet. Ingress acts as a “smart router” or entry point into your cluster. Service mesh brings reliability, security, and manageability to microservices communications. | ELB, Traefik, Ambassador |
| | Generate and manage SSL certificates, securely manage passwords and secrets, and implement SSO and RBAC across all clusters in a hybrid cloud environment. | Okta, Hashicorp Vault, AWS Certificate Manager |
| Log Management | Aggregate logs to track all errors and exceptions in your model creation pipeline. | Elastic stack: Elasticsearch, Fluentd, Kibana |
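The Model Monitoring service listed above calls for tracking accuracy over time and retraining when it degrades. A minimal sketch of that idea follows, assuming a hypothetical `AccuracyMonitor` class with an illustrative window size and threshold; in the stack itself, such a metric would more likely be exported to Prometheus and visualized in Grafana.

```python
from collections import deque


class AccuracyMonitor:
    """Hypothetical rolling-window accuracy tracker, for illustration only."""

    def __init__(self, window=100, threshold=0.9):
        self.outcomes = deque(maxlen=window)  # True where a prediction was correct
        self.threshold = threshold

    def record(self, prediction, actual):
        """Store whether the latest prediction matched the observed label."""
        self.outcomes.append(prediction == actual)

    @property
    def accuracy(self):
        """Share of correct predictions in the window, or None if it is empty."""
        if not self.outcomes:
            return None
        return sum(self.outcomes) / len(self.outcomes)

    def needs_retraining(self):
        """Signal retraining only once the window is full and accuracy is low."""
        window_full = len(self.outcomes) == self.outcomes.maxlen
        return window_full and self.accuracy < self.threshold


monitor = AccuracyMonitor(window=4, threshold=0.75)
for prediction, actual in [(1, 1), (0, 0), (1, 1), (0, 0)]:
    monitor.record(prediction, actual)
healthy = monitor.needs_retraining()

for prediction, actual in [(1, 0), (0, 1)]:  # two wrong predictions arrive
    monitor.record(prediction, actual)
degraded = monitor.needs_retraining()
print(healthy, degraded, monitor.accuracy)
```

Waiting for a full window before raising the signal avoids triggering a retrain from the first few observations after a deployment.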


Kubeflow Pipelines

Kubeflow Pipelines provides a workbench to compose, deploy, and manage machine learning workflows, and packages ML code so that it can be reused by other users across an organization.  These workflows orchestrate many components: a learner for generating models based on training data, modules for model validation, and infrastructure for serving models in production.  Data scientists can also test several ML techniques to see which one works best for their application.
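Testing several techniques to see which works best can be sketched in a few lines. The example below does not use the Kubeflow Pipelines API; it is a plain-Python illustration with two assumed toy techniques (a majority-class baseline and a 1-nearest-neighbour classifier) and made-up data, scored on a held-out validation set.

```python
from collections import Counter


def fit_majority(train):
    """Baseline: always predict the most common training label."""
    label = Counter(y for _, y in train).most_common(1)[0][0]
    return lambda x: label


def fit_nearest_neighbour(train):
    """Predict the label of the closest training point (1-NN on one feature)."""
    return lambda x: min(train, key=lambda pair: abs(pair[0] - x))[1]


def accuracy(model, data):
    """Fraction of rows in data that the model labels correctly."""
    return sum(model(x) == y for x, y in data) / len(data)


# Made-up one-dimensional data: (feature, label) pairs.
train = [(1.0, 0), (2.0, 0), (3.0, 0), (6.0, 1), (7.0, 1)]
validation = [(1.5, 0), (2.5, 0), (6.5, 1), (8.0, 1)]

candidates = {
    "majority": fit_majority,
    "nearest_neighbour": fit_nearest_neighbour,
}

# Train every candidate on the same data and score it on the same validation set.
scores = {name: accuracy(fit(train), validation) for name, fit in candidates.items()}
best = max(scores, key=scores.get)
print(scores, best)
```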

Machine Learning Pipeline Templates

Machine learning pipelines play an important role in building production-ready AI/ML systems.  Using ML pipelines, data scientists, data engineers, and IT operations can collaborate on the steps involved in data preparation, model training, model validation, model deployment, and model testing.  Agile Stacks Machine Learning pipeline templates provide out-of-the-box implementations for common ML problems, such as NLP with RNN sequence-to-sequence learning in Keras and model serving with Seldon.  The pipelines model multi-step workflows as a sequence of tasks, where each step in the workflow is a Python file.  Pipeline steps can be executed from a Jupyter notebook for initial experiments, or scaled across multiple GPUs for faster training on large amounts of data.  Data scientists can define data preparation tasks and other compute-intensive data processing jobs that auto-scale across multiple Kubernetes containers.  A highly automated approach to data ingestion and preparation helps prevent data errors, increase the velocity of iterating on new experiments, reduce technical debt, and improve model accuracy.
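To make the "each step is a Python file" idea concrete, here is a minimal sketch that writes two step scripts to a temporary directory and runs them in order as separate processes. The file names, the shared `state.json` hand-off, and the `run_pipeline` helper are assumptions for illustration; they are not the Agile Stacks template layout, where each step would typically run in its own container.

```python
import json
import subprocess
import sys
import tempfile
from pathlib import Path

# Source of two hypothetical step files; each reads and writes a shared state file.
PREPARE = """
import json, sys
with open(sys.argv[1], "w") as f:
    json.dump({"rows": [1, 2, 3]}, f)
"""

TRAIN = """
import json, sys
with open(sys.argv[1]) as f:
    state = json.load(f)
state["model"] = {"mean": sum(state["rows"]) / len(state["rows"])}
with open(sys.argv[1], "w") as f:
    json.dump(state, f)
"""


def run_pipeline(step_files, state_path):
    """Run each step file as its own process; stop at the first failure."""
    for step in step_files:
        subprocess.run([sys.executable, str(step), str(state_path)], check=True)


with tempfile.TemporaryDirectory() as workdir:
    work = Path(workdir)
    steps = []
    for name, source in [("prepare.py", PREPARE), ("train.py", TRAIN)]:
        path = work / name
        path.write_text(source)
        steps.append(path)
    state_file = work / "state.json"
    run_pipeline(steps, state_file)
    result = json.loads(state_file.read_text())

print(result["model"])
```

Because the steps only communicate through files, the same scripts can be run one at a time from a notebook during early experiments and later scheduled by a workflow engine without changes.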

Machine learning pipelines are used by data scientists to build, optimize, and manage their end-to-end machine learning workflows.  To help with experiment tracking, multiple workflows can be created from a single template.  With distributed training, data scientists can significantly reduce the time needed to train a deep learning model.  Agile Stacks pipeline templates provide complete DevOps automation for ML pipelines.  When data scientists are enabled with DevOps automation, the operations team no longer needs to provide configuration management and provisioning support for common requests such as cluster scale-up and scale-down, and the whole organization can become more agile.  Continuous deployment of new models in a highly automated and reliable way is key to building advanced machine learning systems that combine multiple models to deliver the best accuracy, while constantly monitoring model performance on real-life data.
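Creating multiple workflows from a single template might look like the following sketch. The `TEMPLATE` dictionary, its field names, the `example/trainer:latest` image, and the `instantiate` helper are all hypothetical; they only illustrate how one template can yield one workflow per hyperparameter combination for experiment tracking.

```python
import copy
import itertools

# Hypothetical workflow template; field names and image are illustrative only.
TEMPLATE = {
    "name": "train-{run_id}",
    "image": "example/trainer:latest",
    "params": {"learning_rate": None, "batch_size": None},
}


def instantiate(template, run_id, **params):
    """Return a new workflow built from the template, leaving the template intact."""
    workflow = copy.deepcopy(template)
    workflow["name"] = workflow["name"].format(run_id=run_id)
    unknown = set(params) - set(workflow["params"])
    if unknown:
        raise KeyError(f"parameters not declared in template: {sorted(unknown)}")
    workflow["params"].update(params)
    return workflow


# One workflow per combination of learning rate and batch size.
grid = itertools.product([0.01, 0.001], [32, 64])
workflows = [
    instantiate(TEMPLATE, run_id, learning_rate=lr, batch_size=bs)
    for run_id, (lr, bs) in enumerate(grid)
]

for workflow in workflows:
    print(workflow["name"], workflow["params"])
```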

Get in touch with our team to discuss your machine learning automation requirements and deployment approach.  Agile Stacks generates automation scripts that can be easily extended and customized to implement even the most complex use cases.




Agile Stacks is a registered trademark of Agile Stacks, Inc. All product names and registered trademarks are property of their respective owners.