Sunset for the Wild West: Making ML disciplined by default
Session Abstract
Many novel machine learning techniques started as clever hacks that just happened to work, but the demands of building real systems can be at odds with this creative culture. Learn about our open-source stack to improve quality of life for ML researchers and infrastructure teams alike — and how their concerns aren’t as different as you might think.
Session Description
At first glance, MLOps teams face an unenviable challenge: they exist to bridge the gap between machine learning practitioners and infrastructure engineers, two groups who work at opposite ends of the application stack and have distinct vocabularies, skills, and goals. Practitioners often adopt an anything-goes creative approach, figuring out why a technique works only after it’s already getting results; this culture has led to many advances in applied machine learning but can be in tension with building reliable systems. Look closer, though, and the two groups have more in common than they appear to.
Infrastructure engineers care about security, observability, and predictable utilization, while ML practitioners care about reproducibility, understandability, and performance. This session will argue that these seemingly diverse concerns are often manifestations of the same underlying systems challenges, and that the same open-source tools can help both audiences address their pain points. We’ll draw on our experience helping researchers get experiments into production at scale and helping infrastructure teams deploy and manage enormous clusters. Most importantly, we’ve done this while meeting practitioners where they are: without requiring researchers to become release engineers or demanding that SRE teams start caring about gradients or manifolds.
You’ll come away from this talk with concrete tools and playbooks to:

- make machine learning systems safer and more predictable
- eliminate the error-prone manual work of getting code from an experimental environment ready for collaboration or production
- help researchers achieve reproducible results
- better understand the software your team wants to run and the infrastructure that supports it
- balance overhead and observability for demanding workloads
- know at a glance what’s actually running on your compute clusters, from project-specific Kubernetes configurations all the way down to device drivers and everything in between