Kubernetes based Data Lake Platform for Smart Computing


Why Kubernetes for a data lake platform?

Kubernetes is an excellent platform for modern containerized applications that connect to Internet of Things (IoT) devices such as trackers, geo-fencing and motion sensors, security cameras, sound and heat/humidity sensors, and other remote sensing devices.

This document describes one such use case, in which a fully secure, auto-scaling, Kubernetes-based data lake solution was built for a modern global economy forecaster.

Use case

A modern global economy forecaster needs to deploy sensors at massive scale to measure housing, agriculture, foot traffic, and weather conditions, collect diverse information, and refine it into new-generation economic indicators. Achieving this goal requires a platform that can scale, perform high-performance computing, slice, dice, and aggregate data for analysis, and host ML/AI applications that leverage this wealth of data.

Reference Architecture

The following is a reference architecture for an IoT-sensor-based data lake platform that can handle terabytes of data streamed from sensors around the clock.

[Figure: Reference Architecture]

Key features of this architecture are:

  1. The solution handles real-time data from millions of sensors and data points concurrently.
  2. An MQTT broker decouples the sensors from the data warehouse and keeps operations asynchronous. Sensors in the fleet can be upgraded, added, or retired independently without any impact on the rest of the platform.
  3. Highly scalable and stable Apache Spark ( https://spark.apache.org/ ) manages unstructured data and transforms it into gold data. Gold data can be sliced and diced at very high speed or inserted into the analytic database. Gold and platinum data can be sold to or shared with affiliates. Apache Spark is a leading solution for distributed, high-performance computing.
  4. Apache Hive ( https://hive.apache.org/ ) provides a massively scalable data warehouse for managing raw, unstructured, and semi-structured data. This makes all data accessible through the familiar SQL interface. Hive is widely used for data lakes on public clouds.
  5. Greenplum ( https://greenplum.org/ ), an on-prem distributed database that scales with your data needs, is used to slice and dice terabytes of data.
  6. Data storage needs are met by scalable Kubernetes storage such as Rook ( https://rook.io/ ).
  7. Kubeflow lets you build and operate ML apps and models based on real-world data to solve real-world problems such as Covid-19 spread and drought forecasting.
  8. Everything runs on fail-safe and highly scalable Kubernetes ( https://landscape.cncf.io/ ), a leading platform for running containerized apps.
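
To make the flow in features 2 and 3 more concrete, here is a minimal sketch in plain Python. It is not the platform's actual code: an in-memory queue stands in for the MQTT broker, a small function stands in for the Spark job that turns raw readings into gold data, and the field names (sensor_id, kind, value) and the per-kind averaging rule are assumptions made only for illustration.

```python
import json
import queue
from collections import defaultdict
from statistics import mean

# Stand-in for the MQTT broker (assumption: a real deployment would use an
# actual broker). Sensors publish, the pipeline consumes, and neither side
# knows about the other.
broker = queue.Queue()

def publish(sensor_id, kind, value):
    """Hypothetical sensor-side publish: serialize one reading as JSON."""
    broker.put(json.dumps({"sensor_id": sensor_id, "kind": kind, "value": value}))

def drain_raw():
    """Consume everything currently queued and return the decoded raw readings."""
    raw = []
    while True:
        try:
            raw.append(json.loads(broker.get_nowait()))
        except queue.Empty:
            return raw

def to_gold(raw):
    """Drop readings without a numeric value, then average the rest per kind."""
    by_kind = defaultdict(list)
    for reading in raw:
        value = reading.get("value")
        if isinstance(value, (int, float)) and not isinstance(value, bool):
            by_kind[reading["kind"]].append(value)
    return {kind: round(mean(values), 2) for kind, values in by_kind.items()}

# Four hypothetical sensors publish independently of the consumer.
publish("s-1", "humidity", 40.0)
publish("s-2", "humidity", 44.0)
publish("s-3", "heat", None)   # faulty reading, filtered out by to_gold
publish("s-4", "heat", 21.5)

gold = to_gold(drain_raw())
print(gold)  # {'humidity': 42.0, 'heat': 21.5}
```

Because the sensors only ever talk to the queue, adding or retiring a sensor changes nothing on the consumer side, which is the decoupling the broker provides. In the architecture described here, the cleaning and aggregation step would run as a distributed Spark job rather than a single Python function.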

High Plains Computing Advantage

At High Plains Computing, we specialize in setting up and provisioning Kubernetes in your own data centers or on your preferred public clouds. Our Kubernetes solutions are designed for operational excellence, top-notch security, and efficient, cost-effective use of compute and other resources, and they meet or exceed reliability demands.

High Plains Computing follows the steps below to get you started with a custom IoT solution on Kubernetes.

[Figure: Custom IoT solution on Kubernetes]

Committed to delivering the best

Thousands of AWS and CNCF-certified Kubernetes solution partners have unique expertise and focus areas. Our focus is on best practices in security, automation, and excellence in Cloud operations.

Please reach out to us if you have any questions.
