DevOps for Machine Learning
The world is being reshaped by machine learning, but AI/ML is not an easy field to get into. Building any production-scale machine learning system involves building a data pipeline and dealing with many moving parts. Can machine learning itself be automated? While there is no way to automate every task a data scientist performs, many tasks in ML workflows can be: hyperparameter search, data preparation, data processing, deployment and integration, and model validation are all examples of tasks that can be efficiently automated. For data scientists to focus on algorithms and models, these other tasks need to be automated.
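Hyperparameter search is the clearest example of an automatable task. The sketch below shows the pattern with a plain grid search; the `train_and_validate` objective is a hypothetical stand-in for a real training run, not part of any Agile Stacks API.

```python
import itertools

def train_and_validate(lr, reg):
    # Stand-in for a real training run that returns a validation loss.
    # (Hypothetical objective whose best settings are lr=0.1, reg=0.01.)
    return (lr - 0.1) ** 2 + (reg - 0.01) ** 2

def grid_search(param_grid):
    """Exhaustively evaluate every combination and return the best one."""
    best_params, best_loss = None, float("inf")
    for values in itertools.product(*param_grid.values()):
        params = dict(zip(param_grid.keys(), values))
        loss = train_and_validate(**params)
        if loss < best_loss:
            best_params, best_loss = params, loss
    return best_params, best_loss

grid = {"lr": [0.01, 0.1, 1.0], "reg": [0.001, 0.01, 0.1]}
best, loss = grid_search(grid)
print(best)  # {'lr': 0.1, 'reg': 0.01}
```

In a real workflow each evaluation would be a containerized training job, and the loop itself is what tools like Katib run across a cluster.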
DevOps automation for ML speeds up the process by which an idea goes from development to production. It helps achieve several key objectives:
- Fastest time to train, with as much data and as accurately as possible
- Fastest time to inference, with ability to rapidly retrain
- Safe and reliable deployments to observe model behavior in the real world
The Agile Stacks Machine Learning Stack significantly reduces the complexity of implementing production-scale machine learning systems. DevOps automation and a CI/CD pipeline are essential for an efficient data science workflow, but manually writing custom DevOps scripts is not the answer: with custom-built scripts, even simple changes often require disproportionate investments of time and effort. Using the Agile Stacks Machine Learning Stack, you can generate DevOps automation scripts, spend less time on deployments, and spend more time working on your models and experiments.
The Machine Learning Stack incorporates open, standard software for general machine learning: Kubeflow, TensorFlow, Keras, PyTorch, and others. These are tightly integrated with infrastructure services for distributed training, monitoring, data processing, and simulation, letting data scientists seamlessly leverage distributed training. This significantly reduces the time it takes to train a model and achieve the desired results.
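The core idea behind distributed training can be sketched without any framework: each worker computes gradients on its own data shard, and the gradients are averaged before every update. This is the synchronous data-parallel pattern that distributed trainers implement across machines; the toy 1-D linear model below is illustrative only.

```python
# Minimal sketch of synchronous data-parallel training: each "worker"
# computes gradients on its shard, and gradients are averaged before
# every update -- the same pattern distributed trainers run across nodes.

def gradient(w, shard):
    # Gradient of mean squared error for a 1-D linear model y = w * x.
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)

def data_parallel_step(w, shards, lr=0.01):
    grads = [gradient(w, shard) for shard in shards]   # one per worker
    avg_grad = sum(grads) / len(grads)                 # all-reduce (average)
    return w - lr * avg_grad                           # same update everywhere

# Toy data generated from y = 3x, split across two workers.
data = [(x, 3.0 * x) for x in range(1, 9)]
shards = [data[:4], data[4:]]
w = 0.0
for _ in range(200):
    w = data_parallel_step(w, shards)
print(round(w, 2))  # 3.0
```

The speedup in practice comes from each worker touching only its shard of the data per step, so adding workers shortens each epoch.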
The Agile Stacks Control Plane provides a central access point for data scientists and software developers working together on machine learning projects. It streamlines the creation of machine learning pipelines and data processing pipelines, and the integration of AI/ML with existing applications and business processes. Using the Machine Learning Stack, you can implement an entire AI pipeline to build, train, and deploy machine learning solutions that are fully automated, scalable, and portable.
A typical AI pipeline includes a number of steps:
- Data Preparation / Processing
- Model Training and Testing
- Model Validation
- Deployment and Versioning
- Production and Monitoring
- Feedback / Reinforcement Learning
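The stages above can be sketched as plain Python steps chained together; in the Machine Learning Stack the same stages would run as containerized workflow tasks. The step names and the trivial "model" below are illustrative, not part of any Agile Stacks API.

```python
def prepare_data(raw):
    """Data preparation/processing: drop malformed records, normalize."""
    return [x / 100.0 for x in raw if x is not None]

def train_model(data):
    """Model training: a trivial 'model' that just learns the mean."""
    return sum(data) / len(data)

def validate_model(model, threshold=1.0):
    """Model validation: gate deployment on a quality check."""
    return abs(model) <= threshold

def deploy_model(model, registry):
    """Deployment and versioning: record the model under a new version."""
    version = len(registry) + 1
    registry[version] = model
    return version

registry = {}
data = prepare_data([10, None, 30, 50])
model = train_model(data)
if validate_model(model):
    version = deploy_model(model, registry)
    print(version, registry[version])
```

Monitoring and feedback close the loop: production metrics feed back into `prepare_data` and trigger the pipeline again.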
The Machine Learning Stack is based on Kubeflow, enhanced and automated with Agile Stacks' own security, monitoring, CI/CD, workflow automation, and configuration management capabilities. Kubeflow is a Google-led open source project designed to alleviate some of the more tedious tasks associated with machine learning. It helps manage the deployment of machine learning applications through the full cycle of development, testing, and production, while allowing resources to scale as demand increases.
Machine Learning Template
With Agile Stacks, you can compose multiple best-of-breed frameworks and tools into a stack template, essentially defining your own reference architecture for machine learning. Stack services are available via simple catalog selection and provide plug-and-play support for monitoring, logging, analytics, and testing tools. A stack template can also be extended with additional services by importing custom automation scripts.
The Machine Learning Stack's services, and the tools that implement them:
- Machine learning workflows: the Kubeflow project is dedicated to making deployments of machine learning workflows on Kubernetes simple, portable, and scalable. (Kubeflow)
- Machine learning and deep learning frameworks and libraries: TensorFlow, Keras, Caffe, PyTorch.
- Storage volume management: manage storage for data sets (structured, unstructured, audio, and video), automatically deploying the required storage implementations and providing data backup and restore operations. (Local FS, AWS EFS, AWS EBS, Ceph block and object, Minio)
- Container registry: a private Docker registry secures and manages the distribution of container images, controlling how container repositories and their images are created, stored, and accessed. (GitLab Docker Registry)
- Workflow scheduling: specify, schedule, and coordinate the running of containerized workflows and jobs on Kubernetes, optimized for scale and performance. (Argo workflow templates)
- Collaborative and interactive model training.
- Model serving: export and deploy trained models on Kubernetes, exposing ML models via REST and gRPC for easy integration into business apps that need predictions.
- Model validation: estimate model skill while tuning the model's hyperparameters, comparing desired outputs with model predictions. (Argo workflow templates)
- Data storage: distributed data storage and database systems for structured and unstructured data. (Minio, S3, MongoDB)
- Data ingest pipelines: workflow application templates create data ingest pipelines to automatically build container images, run transformation code, and schedule workflows based on data events or messages. (Argo, Jenkins)
- Monitoring: collect, visualize, and alert on all performance metric data using pre-configured monitoring tools; gain full visibility into your training, continuously monitor model accuracy over time, and retrain or modify the model as needed. (Prometheus, Grafana, Istio)
- Load balancing and ingress: expose cluster services and REST APIs to the Internet. Ingress acts as a "smart router" or entry point into your cluster, while a service mesh brings reliability, security, and manageability to microservices communications. (ELB, Traefik, Ambassador)
- Security: generate and manage SSL certificates, securely manage passwords and secrets, and implement SSO and RBAC across all clusters in a hybrid cloud environment. (Okta, HashiCorp Vault, AWS Certificate Manager)
- Logging: aggregate logs to track all errors and exceptions in your model creation pipeline. (Elasticsearch, Fluentd, Kibana)
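The model-serving pattern described above (a trained model behind a REST endpoint) can be sketched with nothing but the standard library. The `WEIGHT` parameter and the `/predict` route are illustrative assumptions; a production deployment would use a serving framework behind Kubernetes ingress rather than this hand-rolled server.

```python
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

WEIGHT = 3.0  # stand-in for a trained model parameter

class PredictHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Read the JSON request body and run "inference".
        body = json.loads(self.rfile.read(int(self.headers["Content-Length"])))
        payload = json.dumps({"prediction": WEIGHT * body["x"]}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(payload)

    def log_message(self, *args):  # keep request logging quiet
        pass

# Serve on an ephemeral port and issue one prediction request.
server = HTTPServer(("localhost", 0), PredictHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
url = f"http://localhost:{server.server_address[1]}/predict"
req = urllib.request.Request(url, data=json.dumps({"x": 2.0}).encode())
result = json.loads(urllib.request.urlopen(req).read())
print(result)  # {'prediction': 6.0}
server.shutdown()
```

Exposing the same handler via gRPC adds a typed contract for callers that need it; REST keeps integration with existing business apps trivial.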
Machine learning pipelines are code, and classic DevOps tools (Git, Jenkins, Docker, Argo) play an important role in machine learning automation. Docker makes sharing machine learning projects painless: when you share a project via a container, you share not only your code but your development environment as well, ensuring that your code can be reliably executed and your work reliably reproduced. And since your work is already containerized, you can easily deploy it for inference at scale using Kubernetes container orchestration.
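The containerization step can be as small as a short Dockerfile. The base image, `train.py` entrypoint, and `requirements.txt` below are illustrative assumptions, not part of any Agile Stacks template:

```dockerfile
# Hypothetical layout: training code in train.py, dependencies pinned
# in requirements.txt. Anyone who pulls this image gets the exact
# environment the model was developed in.
FROM python:3.10-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY train.py .
ENTRYPOINT ["python", "train.py"]
```

Pinning dependencies in the image is what makes a training run reproducible months later, on a laptop or on a cluster.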
Machine Learning Templates let you package machine learning code in Docker containers, create CI/CD pipelines, and deploy machine learning workflows to Kubernetes clusters for distributed training or large-scale inference. With distributed training, data scientists can significantly reduce the time it takes to train a deep learning model. When data scientists are enabled with DevOps automation, the operations team no longer needs to provide configuration management and provisioning support for common requests such as cluster scale-up and scale-down, and the whole organization can become more agile.
Get in touch with our team to discuss your machine learning automation requirements and deployment approach. Agile Stacks generates automation scripts that can be easily extended and customized to implement even the most complex use cases.