Build automated machine learning pipelines by applying CI/CD techniques to machine learning
Key Features
Create reproducible and automated machine learning pipelines using DVC and CML
Speed up your machine learning development and promote collaboration using CI/CD techniques
Ensure you stay ahead of the curve in the fiercely competitive machine learning market
Book Description
Deriving useful insights from machine learning can be an arduous, though rewarding, process, even for data science practitioners. It's worth investing in any tools or techniques that can assist with that process.
Open Source MLOps with DVC and CML will take you through two such tools, which will allow you to automate your machine learning pipelines and make them eminently reproducible.
You'll begin with an introduction to Data Version Control (DVC) and learn how it can help you keep track of your machine learning artifacts using a familiar Git-like approach. This will lead you on to building end-to-end machine learning pipelines, complete with visualizations of the results. You'll then move on to Continuous Machine Learning (CML), with which you can automate the training and testing of machine learning models so they can run alongside the rest of your CI/CD pipeline, ensuring stability and reproducibility.
By the end of this book, you will be able to develop reproducible pipelines as directed acyclic graphs and run those pipelines effortlessly in the cloud to speed up the development of your machine learning models.
What you will learn
Create an S3 bucket to act as a remote repository
Use remote storage and a GitHub repository to create a model registry
Construct pipelines in YAML format in the dvc.yaml file
Define for loops within the DVC pipeline to reduce repetition
Share experiments with a coworker
Access and save objects using DVC's Python API
Run CML workloads on AWS EC2 instances including GPU-equipped machines
Report results such as DVC metrics and plots to a GitHub pull request
Who this book is for
This book is primarily for people who want to learn how to use DVC and CML to build pipelines for deploying machine learning models. These readers are most likely to be data scientists, or possibly software engineers or students on PhD or MSc programs, who are developing machine learning models. The book may also be useful for those interested in the Data Version Control aspect who are not (or not currently) developing or deploying machine learning models.
A basic knowledge of data analytics, a concern for producing reproducible analysis, and an eagerness to learn are expected.
Table of Contents
A Brief Introduction to MLOps
First Steps with DVC
Using Remote Storage
Sharing Data with Registries
Troubleshooting Issues with DVC
Building Pipelines with DVC
Advanced Pipelines: Parameterization and foreach Stages
Creating Plots with DVC
Experiment Tracking
Deploying Models with DVC
Automating Your Pipelines with GitHub Actions
Running GPU and Compute-Heavy Workloads
Train, Test, and Deployment with CML
Matthew Upson is a Data Scientist and Founder of MantisNLP, experienced in Natural Language Processing and Machine Learning / Data Engineering problems. Previously, he was the Lead Data Scientist at Juro, a legal tech startup, where he used AI to make contracts faster, smarter, and more human. Prior to working at Juro, he worked as a Data Scientist in the UK Government, predominantly on Machine Learning services for Natural Language Processing. Version Control, Continuous integration, and Cloud Computing.