Applying platform engineering principles to machine learning projects

MLOps is a relatively new engineering discipline whose objective is to make all operational aspects of machine learning-based systems predictable, reliable, and cost-effective. It draws on well-established principles and practices from software development, DevOps/release engineering, and Site Reliability Engineering (SRE). Applied correctly, the discipline makes machine learning applications and systems as dependable for end users as enterprise and large-scale web applications are today.

A good MLOps team partners with data scientists, data engineers, and solution architects to deliver:

  • Powerful (and expensive) GPU infrastructure that can be provisioned, made accessible to data scientists, scaled up and down, and deprovisioned on demand and via automation, which translates into tremendous agility and cost savings during model building and validation
  • Scalable Spark/Hadoop cluster provisioning, along with on-demand storage and streaming infrastructure, to meet the needs of data engineers
  • Infrastructure and automation for new-generation Kubernetes-based ML solutions such as Kubeflow
  • All infrastructure and resources for applications and platforms that use ML models
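
The scale-up/scale-down automation in the first bullet can be sketched as a simple sizing policy. This is an illustrative assumption, not any specific cloud API: the function name, the jobs-per-node heuristic, and the node caps are all hypothetical, and a real automation loop would feed its result into the provider's provisioning API.

```python
def desired_gpu_nodes(queued_jobs: int,
                      jobs_per_node: int = 4,
                      min_nodes: int = 0,
                      max_nodes: int = 8) -> int:
    """Return the GPU node count needed to drain the training-job queue.

    Scales to zero when the queue is empty, so idle (expensive) GPUs
    are deprovisioned, and caps growth at max_nodes to bound cost.
    """
    if queued_jobs == 0:
        return min_nodes
    # Ceiling division: enough nodes to cover every queued job.
    needed = -(-queued_jobs // jobs_per_node)
    return max(min_nodes, min(needed, max_nodes))
```

An automation loop would periodically compare this target with the current node count and add or remove GPU nodes accordingly, which is where the cost savings during model building and validation come from.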

The figure here shows infrastructure provisioning for a solution that uses machine-learned product recommendations for a high-traffic e-commerce site with millions of visitors per month.
HPC MLOps engineers are experts in partnering with solution architects, data engineers, and data scientists to automate the provisioning of such solutions, making them scalable, secure, and accessible worldwide, and in setting up traditional CI/CD pipelines for code promotion and new releases.
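
A model-release CI/CD pipeline typically includes a promotion gate before a new model goes live. The sketch below is a minimal, assumed example: the metric names, the regression budget, and the function itself are hypothetical, not part of any named pipeline tool.

```python
def passes_promotion_gate(candidate: dict, baseline: dict,
                          max_regression: float = 0.01) -> bool:
    """Decide whether a candidate model may be promoted to production.

    The candidate must match or beat the current production baseline on
    every tracked metric, within a small allowed regression budget.
    A metric missing from the candidate's report counts as a failure.
    """
    return all(
        candidate.get(name, float("-inf")) >= value - max_regression
        for name, value in baseline.items()
    )
```

In a pipeline, this check would run after offline evaluation; only candidates that pass are tagged for release, keeping promotions automated but safe.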