A recent search on Indeed.com turned up roughly 3,600 open job postings for data scientists. But being a data scientist shouldn't mean having to be an Ops engineer too. There is a difference:
- Data Scientist - as widely defined, an individual whose primary role is sourcing data for algorithmic and statistical analysis to identify trends, insights, and other relevant information
- Ops Engineer - aka Application Operations Engineer, pragmatically the technical staff charged with optimizing, maintaining, monitoring, and governing in-production business systems.
Yet these days both roles often fall to the same person, because the skills shortage is that debilitating.
As a data scientist, you are feeling overwhelmed! I get it, because I've been there too as an adjunct faculty member at an East Coast university. Not only do I write the syllabus and lesson plans for my software engineering course, but I also select and set up the tooling for the student labs and exercises. It's great the first time, because I learn so much, but every semester after that these operational tasks become a chore I should be able to hand off to my teaching assistant. To do that, I need to prepare.
I almost forgot - my software engineering course is titled CSC-649-Secure Software. The students learn the tricks of the trade to write better software than ever before. Then we use machine learning to find patterns of poorly written code in publicly available software.
So, without further ado, here are my thoughts about selecting a tool for my software engineering course.
Use A Proven Platform
Enter Kubeflow, an open-source machine learning (ML) platform with a comprehensive, flexible library of tools for organizing and powering your projects. It comes with plenty of public material to learn from and has extensive open-source community support.
As important as selecting the machine learning platform is selecting the hardware platform. I suggest not running a junior- or senior-level course on minikube. Remember, these students are preparing to join a highly competitive workforce, so providing real-world compute power should be a high priority. In other words, consider what a hiring manager will likely ask about their AI coursework. Examples could include, "Tell me about the systems you've worked on," or "How large were the data sets you ingested, and what did you learn about performance?" I find it unlikely they would learn much running limited ML pipelines on even a high-end MacBook. Access to public cloud providers such as Google, AWS, and Azure is a phone call to your campus CIO away for preferential pricing, and each provider offers educational grants for precisely this reason.
Work Within The UI, First
To start, definitely leverage the Kubeflow Pipeline templates so you can stay on top of the process without getting bogged down. Since an ML pipeline can also carry default parameters, it's a decent way to prepare a lab without building everything from scratch.
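To make the default-parameter idea concrete: in Kubeflow Pipelines, the pipeline function's keyword defaults become the pre-filled values students see in the UI. Here is a rough, stdlib-only sketch of that pattern, assuming hypothetical step names (the real SDK wraps steps with `kfp.dsl` decorators, which this deliberately omits):

```python
# A stdlib-only sketch of a parameterized lab pipeline. The step names and
# logic are hypothetical stand-ins, not part of any real Kubeflow lab; in the
# real SDK these would be kfp.dsl components.

def load_samples(sample_count: int) -> list:
    """Pretend to pull `sample_count` code samples for analysis."""
    return list(range(sample_count))

def train(samples: list, learning_rate: float) -> dict:
    """Pretend to train a model; return a summary of the run."""
    return {"samples": len(samples), "learning_rate": learning_rate}

def secure_code_pipeline(sample_count: int = 100, learning_rate: float = 0.01) -> dict:
    """Defaults let students run the lab unchanged, then override as they learn."""
    samples = load_samples(sample_count)
    return train(samples, learning_rate)

if __name__ == "__main__":
    print(secure_code_pipeline())                     # runs with the defaults
    print(secure_code_pipeline(sample_count=10_000))  # a student scaling up
```

The payoff is pedagogical: every student gets a known-good run on day one, and overriding a single parameter is their first safe experiment.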
I’ve always been diligent about preparing my labs a couple of weeks ahead of when I teach them. My teaching assistants aren’t always so good at testing my steps beforehand, however. With Kubeflow’s UI, there is far less chance my TAs will get into trouble when they finally get around to testing my lab work.
Bake In High Availability
In an educational lab environment, high availability and scaling are (sometimes) used interchangeably, but they are distinct concerns. Availability is making sure the students can reach their machine learning environment in the first place. Scale is making sure there is enough compute and performance for the students to run their models.
Maybe I will get the chance to write a longer post on this recommendation, but for today here is some hard-won advice on high availability and scaling: when you set up your lab compute environment for Kubeflow, take the time and effort to deploy it as multi-node and multi-tenant. Here is a quick guide to multi-tenancy: https://www.kubeflow.org/docs/components/multi-tenancy/overview/.
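Kubeflow's multi-tenancy hangs off Profile resources: each Profile gives one user an isolated namespace with its own quotas and access control. As a sketch of how you might stamp out one Profile per student, here is a small generator; the course roster, naming scheme, and email addresses are hypothetical:

```python
# Generate one Kubeflow Profile manifest (kubeflow.org/v1) per student.
# The roster entries and the "csc649-" naming convention are made up for
# illustration; substitute your own class list.

def student_profile(student_id: str, email: str) -> dict:
    """Build a Profile manifest granting `email` an isolated namespace."""
    return {
        "apiVersion": "kubeflow.org/v1",
        "kind": "Profile",
        "metadata": {"name": f"csc649-{student_id}"},
        "spec": {"owner": {"kind": "User", "name": email}},
    }

roster = [
    ("astudent", "astudent@example.edu"),
    ("bstudent", "bstudent@example.edu"),
]
profiles = [student_profile(sid, mail) for sid, mail in roster]
```

Serialized to YAML and applied with `kubectl apply`, each manifest becomes a per-student namespace, so one student's runaway notebook can't trample another's.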
Start With Scale In Mind
As educators, we always want our students to overachieve. And if all goes as hoped, students will tax their labs. Student data sets that are too large to run on a single node need an efficient way to balance the workload, and the beauty of Kubernetes running in the background is its native ability to do just that once your cluster is multi-node. If you want to start small, my recommendation is one master node and two worker nodes. This will go a long way toward not running out of computing resources when your students start throwing large data sets into training.
Don’t forget that Kubeflow offers a lot of resources for automating more of the training and deployment process, and the Kubeflow Pipelines UI surfaces many of these features along with helpful tips. What I learned from my first lab was that I had over-engineered for the worst-case scenario. My public cloud bill wasn’t astronomically high, but that was luck.
Embrace MLOps To Let Your Students Compete In The New Workforce
Whether you are a chair, faculty member, or staff member involved with the AI curriculum, it is critically important that we continue to drive the highest value for our students. This means delivering your best material and tutelage as a data science expert. My tips above can help relieve that recurring "here we go again" dread of every semester's lab setup. Using Agile Stacks, you can still be a data scientist and let automation run your backend operations. Last year our CEO, John Mathon, wrote a blog post, A case for machine-learning infrastructure automation to keep Higher Ed competitive, because it is important to give our next generation of data scientists the tooling and support they need to compete globally as the next data science workforce.