On-Prem vs Cloud-Based Kubernetes Cluster


This document assesses the total cost of ownership (TCO) of On-Prem vs Public Cloud-based Kubernetes clusters. Using multiple real-world scenarios (use cases), the High Plains Computing (HPC) team has put together this report that provides an estimate of overall compute, storage, and I/O needs along with a cost comparison for CapEx, OpEx, and TCO.

Here are a few TCO calculators our team used for this exercise:

We compiled this TCO after reviewing a large volume of sales data and comparing several on-prem Kubernetes distributions. Pricing for public-cloud Kubernetes is based on AWS EKS and EC2. (Google Cloud and Azure prices were found to be very similar.)

During the preparation of this assessment, our team used many managed services because they reduced the team’s development time as well as administrative overhead. This results in a lower TCO for the public cloud. For example, the TCO of the PostgreSQL-based AWS RDS service would be considerably lower than that of on-prem PostgreSQL.

For on-prem solutions, the TCO estimates are given as a wide range. This is because list prices for Kubernetes setup and installation, as well as licensing costs for the various components, vary greatly between on-prem vendors and come with discounts from the listed prices. Our cloud solution TCO should be fairly accurate, though.

In any case, all TCO calculations are ballpark figures, and this exercise only demonstrates which use cases are a better match for on-prem vs. public-cloud Kubernetes.
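As a rough illustration of how a ranged estimate of this kind can be assembled, the sketch below sums one-time CapEx and recurring OpEx over a planning horizon and carries a low and a high bound through the calculation. The `CostRange` helper, the `tco` function, the three-year horizon, and every dollar amount are hypothetical placeholders chosen for the example. None of them are figures from this assessment.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CostRange:
    """A low/high cost estimate in dollars (hypothetical helper)."""
    low: float
    high: float

    def __add__(self, other: "CostRange") -> "CostRange":
        return CostRange(self.low + other.low, self.high + other.high)

    def scale(self, factor: float) -> "CostRange":
        return CostRange(self.low * factor, self.high * factor)


def tco(capex: CostRange, annual_opex: CostRange, years: int) -> CostRange:
    """TCO = one-time CapEx + recurring OpEx over the planning horizon."""
    return capex + annual_opex.scale(years)


# Placeholder inputs only; replace them with your own vendor quotes and cloud bills.
on_prem = tco(CostRange(400_000, 900_000), CostRange(150_000, 300_000), years=3)
cloud = tco(CostRange(0, 0), CostRange(350_000, 380_000), years=3)
```

Keeping the low and high bounds separate makes the wide on-prem spread visible next to the narrower cloud estimate, instead of hiding it behind a single midpoint.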

Use Case 1: Product Recommender

This scenario is based on the TCO of a medium-scale product recommender service running on a Kubernetes cluster. The recommender service is used by a busy e-commerce site to rank product search results and to produce personalized product recommendations for users visiting the site.

This service is built as a cloud-native application consisting of the following set of microservices:

  • A set of scheduled ETL batch jobs to ingest clickstream/website analytics, rating and review, customer order history, and inventory data.
  • A scheduled job to calculate product and user similarity matrices and apply collaborative filtering to generate personalized recommendations for users.
  • A Redis cache that stores the generated recommendations.
  • A simple Flask application that serves the recommendations.
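The similarity and collaborative-filtering step in the list above can be sketched as follows. This is a minimal item-based variant using cosine similarity over a tiny in-memory ratings dictionary. The article does not say which variant or library the service actually uses, and the sample data and function names here are invented for illustration.

```python
import math
from collections import defaultdict

# Hypothetical toy data: user -> {product: rating}.
ratings = {
    "alice": {"laptop": 5.0, "mouse": 4.0},
    "bob": {"laptop": 4.0, "mouse": 5.0, "keyboard": 4.0},
    "carol": {"mouse": 3.0, "keyboard": 5.0},
}


def item_vectors(ratings):
    """Invert user -> item ratings into item -> user rating vectors."""
    vectors = defaultdict(dict)
    for user, items in ratings.items():
        for item, score in items.items():
            vectors[item][user] = score
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def recommend(user, ratings, top_n=3):
    """Score unseen items by similarity-weighted ratings of items the user rated."""
    vectors = item_vectors(ratings)
    seen = ratings[user]
    scores = {}
    for candidate in vectors:
        if candidate in seen:
            continue
        scores[candidate] = sum(
            cosine(vectors[candidate], vectors[item]) * score
            for item, score in seen.items()
        )
    return sorted(scores, key=scores.get, reverse=True)[:top_n]
```

In a pipeline like the one described, the output of a function such as `recommend` is what the scheduled job would write to the cache for the serving application to read.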

For this solution, a total vCPU need of 500 or less was estimated.

The following is a reference architecture and deployment diagram of the solution described above:

Here is the TCO table for this use case:

Use Case 2: High-Performance and Secure Kubernetes

In this scenario, Kubernetes is hosting an enterprise microservice architecture application for a small but very successful Fintech/crypto trading company.

This trading application is a set of collaborating microservices used for crypto-coin-based equity trading. The application uses the Istio service mesh for controlling traffic and managing dependencies between the various microservices. The application quickly calculates optimal strategies for trading, order splitting, spreading, and routing of orders and suborders.

This application is both compute- and I/O-intensive and is deployed on a large cluster of high-performance VMs with 96 or more vCPUs each. The cost also includes hundreds of terabytes of SSD storage with high provisioned IOPS.

The primary concerns of company stakeholders were the security and scalability of the Kubernetes cluster. Several technologies were therefore included in the calculated TCO, including a web application firewall, DDoS protection, content delivery network (CDN) based caching, various network scanning tools, and compliance monitoring tools that scan containers for vulnerabilities.

The total compute need for this solution was estimated to be less than 3000 vCPUs. This Kubernetes cluster is assumed to be deployed in multiple regions of the US and within multiple availability zones of each region.

The figure below shows the deployment architecture of this use case.

Use Case 3: A Website Analytics Provider

This use case is about a large website analytics provider that captures clickstream data from various busy websites and provides useful analytics for its customers. The analytics provider also transforms and streams selected clickstream data to select customers, so that they can use their own clickstream data to create smart ML-based solutions.

To cater to the needs of its analytics customers, this company provides several dashboard applications where customers can query as well as visualize their website visitors and traffic.

The solution uses a very large fleet of pods to run a non-blocking, asynchronous stream-processing REST service for the hundreds of thousands of JavaScript clients embedded in its customers’ websites. These JavaScript clients send user clickstream data to the analytics provider. The provider uses streaming infrastructure such as Apache Kafka for handling streamed data from multiple websites and uses a whole range of Apache tools for its analytics services.

These tools include Apache Spark, Spark Streaming, Flink, Presto, and Hive, along with thousands of ML workloads, to generate decision support and predictive analytics and to answer its customers’ data warehouse queries.
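The non-blocking ingestion path described above can be sketched with Python's standard `asyncio` module. This is only an illustration under stated assumptions: an in-process `asyncio.Queue` stands in for Kafka, a plain list stands in for the downstream consumer, and the event fields `site_id` and `url` are hypothetical. The provider's actual service and schema are not described in the source.

```python
import asyncio
import json


async def handle_event(raw: str, queue: asyncio.Queue) -> None:
    """Validate one clickstream event and hand it off without blocking the caller."""
    event = json.loads(raw)
    # Hypothetical schema check: drop events missing the assumed fields.
    if "site_id" in event and "url" in event:
        await queue.put(event)


async def forward(queue: asyncio.Queue, sink: list) -> None:
    """Drain the queue; a real service would publish to Kafka here instead."""
    while True:
        event = await queue.get()
        sink.append(event)
        queue.task_done()


async def main(raw_events):
    queue = asyncio.Queue(maxsize=1000)  # bounded, so slow consumers apply backpressure
    sink = []
    worker = asyncio.create_task(forward(queue, sink))
    # Handle all incoming requests concurrently on a single event loop.
    await asyncio.gather(*(handle_event(raw, queue) for raw in raw_events))
    await queue.join()  # wait until every accepted event has been forwarded
    worker.cancel()
    await asyncio.gather(worker, return_exceptions=True)
    return sink
```

The request handlers only validate and enqueue, so they return quickly even when downstream processing is slow. That separation is what lets a fleet of small pods absorb bursts from many clients.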

The TCO is based on an average use of 40K vCPUs. The cost of several petabytes of both object and block storage was added to the overall TCO. Once again, we used high-performance VMs with 96 or more vCPUs to build this Kubernetes cluster. We estimated that one-tenth of the nodes would be GPU-assisted nodes for ML workloads. From an operational and solution architecture perspective, such clusters are generally broken down into several smaller clusters, but for this exercise we assumed a single very large Kubernetes cluster.
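To give a sense of scale, the vCPU figures can be turned into approximate node counts. The sketch below assumes exactly 96 vCPUs per node (the text says 96 or more), rounds up, and ignores headroom for system pods or failover, so the results are an upper-end illustration and not numbers taken from the assessment.

```python
import math


def nodes_needed(total_vcpus: int, vcpus_per_node: int = 96) -> int:
    """Smallest number of nodes whose combined vCPUs cover the requirement."""
    return math.ceil(total_vcpus / vcpus_per_node)


total_nodes = nodes_needed(40_000)        # 40K vCPUs on 96-vCPU nodes
gpu_nodes = math.ceil(total_nodes / 10)   # one-tenth GPU-assisted, rounded up
cpu_only_nodes = total_nodes - gpu_nodes

print(total_nodes, gpu_nodes, cpu_only_nodes)  # 417 42 375
```

The same helper applied to the 3000-vCPU ceiling from Use Case 2 gives 32 nodes, which shows how much larger this scenario is.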

Here is the TCO table for this use case:

At the high end, with large and complex Kubernetes solutions, the public cloud simply becomes very pricey.

A Note about on-prem offerings

For on-prem Kubernetes, we looked at offerings from leading vendors and included the cost of Kubernetes add-ons such as CSI storage providers, high-performance networking and software-defined networking providers, high-performance ingress providers, and several other vendors for application and cluster security. Still, the TCO for on-prem computing depends on so many factors that the provided TCO is only a rough estimate.

You can check our application modernization services.

Delivering the Best

Thousands of AWS and CNCF-certified Kubernetes solution partners have unique expertise and focus areas. Our focus is on best practices in security, automation, and excellence in Cloud operations.

Please reach out to us if you have any questions.
