Database Selection in Kubernetes

Written by admin

The databases section of the CNCF Kubernetes landscape page can be perplexing. It lists more than 50 products, and it is hard to tell from that page which one fits a specific need. This article is a small effort to make it easier to see what is what and which database matches your current need.

We will tackle the challenge by divide and conquer. The first table below lists some of the major players from the 50+ entries on the landscape page. Look at the “Primary Use / Purpose” column to find the category that matches your need, then go directly to the corresponding section of this article to learn more about each database in that category.

| Name | Website | Primary Use / Purpose | Works With | Open Source |
| --- | --- | --- | --- | --- |
| Apache CarbonData | carbondata.apache.org | Analytics | Big Data: Spark SQL | Yes |
| Apache Druid | druid.apache.org | Analytics | JDBC | Yes |
| Snowflake | snowflake.com | Analytics | SQL | No |
| Presto | prestodb.io | Analytics (Query Engine) | SQL | Yes |
| BigchainDB | bigchaindb.com | Blockchain | MongoDB queries | Yes |
| Cockroach Labs | cockroachlabs.com | Distributed Processing (Cloud Native) | SQL | Yes |
| Apache Ignite | ignite.apache.org | In-Memory | SQL and Key-Value Store | Yes |
| Hazelcast IMDG | hazelcast.com | In-Memory | Multiple Languages | Yes |
| Redis | redis.io | In-Memory | Redis Commands | Yes |
| VoltDB | voltdb.com | In-Memory | SQL | Yes |
| Apache Hadoop | hadoop.apache.org | Massively Parallel Processing | MapReduce | Yes |
| Crate.io | crate.io | Massively Parallel Processing | SQL | Yes |
| ArangoDB | arangodb.com | Multi Model | AQL (DML) | Yes |
| Crux | opencrux.com | Multi Model | Kafka | Yes |
| Dgraph | dgraph.io | Multi Model | GraphQL | Yes |
| FoundationDB | foundationdb.org | Multi Model | Multiple Languages | Yes |
| InterSystems IRIS Data Platform | intersystems.com | Multi Model | SQL | No |
| OrientDB | orientdb.org | Multi Model | SQL | Yes |
| Apache Cassandra | cassandra.apache.org | NoSQL | Using stored keys | Yes |
| Infinispan | infinispan.org | NoSQL | Java | Yes |
| MongoDB | mongodb.com | NoSQL | JSON | Yes |
| Couchbase | couchbase.com | NoSQL | N1QL | Yes |
| IBM DB2 | ibm.com/db2 | RDBMS | SQL | No |
| MariaDB | mariadb.org | RDBMS | SQL | Yes |
| Microsoft SQL Server | microsoft.com | RDBMS | SQL | No |
| MySQL | mysql.com | RDBMS | SQL | Yes |
| Oracle | oracle.com | RDBMS | SQL | No |
| PostgreSQL | postgresql.org | RDBMS | SQL | Yes |
| KubeDB | kubedb.com | Framework | - | Yes |

 

The sections below give a basic introduction to each of the databases listed above, grouped by the category shown in the “Primary Use / Purpose” column.

Analytics

Analytics databases, also known as On-Line Analytical Processing (OLAP) systems, are used to store and manage large volumes of structured data. These systems are optimized for fast queries and provide complex aggregate functions. They are widely used for analytics, business intelligence (BI), and reporting.

Name Salient Features
Apache CarbonData
  • Fully indexed, columnar, Hadoop-native data store for processing petabytes of data
  • Multi-level indexing, compression, and encoding techniques targeted at improving the performance of analytical queries
  • Multi-level indexing also reduces I/O scans and CPU processing
  • Can write to S3, OBS, HDFS, and Alluxio
  • Integrates with big data ecosystem tools such as Spark and Presto
Apache Druid
  • Distributed system for OLAP queries on streaming and time-series data
  • Commonly used for very high data volumes by companies such as Netflix, Airbnb, and Alibaba
  • Built for fast queries over very large data volumes
PrestoDB
  • High-performance, distributed SQL query engine for big data
  • Queries data where it lives, and can combine multiple sources within the same query: Hadoop, AWS S3, Alluxio, MySQL, Cassandra, Kafka, MongoDB, and Teradata
  • Targeted at analysts who expect response times ranging from sub-second to minutes
  • Facebook uses Presto for interactive queries against several internal data stores, including their 300PB data warehouse
Snowflake
  • An enterprise analytics database
  • Designed to run on public clouds such as AWS and Azure
  • Offered as a subscription service: store data in one place, pay for storage, and pay for compute only when you run queries
  • Users can use ANSI SQL for all database operations

 

Blockchain

Blockchain databases are designed to keep an immutable record of all transactions. When you need to store data that no one can change afterwards, pick a blockchain database. A blockchain used as a database can contain any information; however, blockchains are not well suited to storing vast amounts of data because of network limitations, cost, and similar constraints. In the case of the open-source cryptocurrency Bitcoin, only information such as ownership, a timestamp, and other small details is recorded in the ledger. (see details about Blockchain DBs here)
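To make the idea of an immutable record more concrete, here is a minimal sketch of a hash chain, the basic structure behind a blockchain ledger. It is an illustration only, not how any specific product is implemented: the function names (`block_hash`, `append_block`, `is_valid`) and the sample payloads are assumptions made for this example, and a real blockchain database adds consensus, signatures, and replication across nodes.

```python
import hashlib
import json


def block_hash(index, payload, prev_hash):
    """Hash a block's contents together with the previous block's hash."""
    body = json.dumps(
        {"index": index, "payload": payload, "prev_hash": prev_hash},
        sort_keys=True,
    )
    return hashlib.sha256(body.encode("utf-8")).hexdigest()


def append_block(chain, payload):
    """Append a new block that points at the hash of the current last block."""
    prev_hash = chain[-1]["hash"] if chain else "0" * 64
    index = len(chain)
    block = {
        "index": index,
        "payload": payload,
        "prev_hash": prev_hash,
        "hash": block_hash(index, payload, prev_hash),
    }
    chain.append(block)
    return block


def is_valid(chain):
    """Recompute every hash and check each link to the previous block."""
    prev_hash = "0" * 64
    for block in chain:
        if block["prev_hash"] != prev_hash:
            return False
        expected = block_hash(block["index"], block["payload"], block["prev_hash"])
        if block["hash"] != expected:
            return False
        prev_hash = block["hash"]
    return True


ledger = []
append_block(ledger, {"owner": "alice", "asset": "car-1"})
append_block(ledger, {"owner": "bob", "asset": "car-1"})
print(is_valid(ledger))  # True

ledger[0]["payload"]["owner"] = "mallory"  # tamper with history
print(is_valid(ledger))  # False
```

Because every block stores the hash of the one before it, changing an old record invalidates the chain from that point on, which is what makes tampering detectable.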

Name Salient Features
BigchainDB
  • Powered by MongoDB, so stored data can be queried with MongoDB's query language
  • Decentralized control via a federation of nodes
  • Data storage is immutable
  • Design your own private network with custom assets, transactions, permissions, and transparency
  • Transaction-level permissions

 

In-Memory

In-memory databases keep their working data set in main memory rather than on disk. They can still persist data to disk by storing each operation in a log or by taking snapshots. In-memory databases are ideal for applications that require microsecond response times and can see large traffic spikes at any time, such as gaming leaderboards, session stores, and real-time analytics. (see details here)
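The two persistence approaches mentioned above, an operation log and snapshots, can be sketched roughly as follows. `TinyStore`, its file names, and the sample keys are hypothetical and exist only for this illustration; production systems add fsync policies, atomic snapshot writes, and concurrency control that are left out here.

```python
import json
import os
import tempfile


class TinyStore:
    """Hypothetical in-memory key-value store with a log and snapshots."""

    def __init__(self, directory):
        self.log_path = os.path.join(directory, "operations.log")
        self.snapshot_path = os.path.join(directory, "snapshot.json")
        self.data = {}
        self._recover()

    def _recover(self):
        # Start from the last snapshot, if one exists.
        if os.path.exists(self.snapshot_path):
            with open(self.snapshot_path, "r", encoding="utf-8") as f:
                self.data = json.load(f)
        # Replay operations logged after that snapshot.
        if os.path.exists(self.log_path):
            with open(self.log_path, "r", encoding="utf-8") as f:
                for line in f:
                    op = json.loads(line)
                    self.data[op["key"]] = op["value"]

    def set(self, key, value):
        # Log the operation on disk first, then apply it in memory.
        with open(self.log_path, "a", encoding="utf-8") as f:
            f.write(json.dumps({"key": key, "value": value}) + "\n")
        self.data[key] = value

    def get(self, key):
        # Reads are served from memory only.
        return self.data.get(key)

    def snapshot(self):
        # Write the whole data set, then truncate the log it supersedes.
        with open(self.snapshot_path, "w", encoding="utf-8") as f:
            json.dump(self.data, f)
        open(self.log_path, "w", encoding="utf-8").close()


with tempfile.TemporaryDirectory() as workdir:
    store = TinyStore(workdir)
    store.set("player:1", 120)
    store.snapshot()
    store.set("player:2", 95)

    # A fresh instance simulates a restart: snapshot and log are replayed.
    restarted = TinyStore(workdir)
    print(restarted.get("player:1"), restarted.get("player:2"))  # 120 95
```

The trade-off is the usual one: logging every operation loses less data after a crash but makes recovery slower, while snapshots recover quickly but lose whatever happened since the last one. Combining both, as in this sketch, is a common compromise.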

Name Salient Features
Apache Ignite
  • Distributed in-memory data store that delivers in-memory speed and highly scalable reads and writes to applications
  • SQL and key-value store that supports structured, semi-structured, and unstructured data; data is stored as key-value pairs
  • Durable, consistent, and highly available
  • The Ignite cache keeps a subset of records in memory and loads more data into memory when required
Hazelcast IMDG
  • Provides central, predictable scaling of applications through in-memory access to frequently used data across an elastically scalable data grid
  • Offers a wide range of massively scalable data structures for applications written in multiple languages, including Python
  • Enables the largest data sets to run efficiently in an in-memory cluster across your most popular data APIs
Redis
  • In-memory data structure store, used as a database, cache, and message broker
  • Redis provides data structures such as strings, hashes, lists, sets, sorted sets with range queries, bitmaps, hyperloglogs, geospatial indexes, and streams
  • Keeps data in memory and looks values up by key, which makes reads and writes fast
  • Mostly used for key-value lookups rather than SQL-style joins
VoltDB
  • An in-memory database available as a community edition and under an enterprise license
  • Lets you access the data using SQL statements that are run as Java stored procedures
  • Optimized for a specific application by partitioning the database tables and the stored procedures that access those tables across multiple “sites” or partitions

 

Massively Parallel Processing (MPP)

Massively Parallel Processing (MPP) databases are optimized for analytical workloads: aggregating and processing large datasets. MPP databases tend to be columnar: rather than storing each row in a table as an object (a feature of transactional databases), they generally store each column as an object. This architecture allows complex analytical queries to be processed much more quickly and efficiently. These analytic databases distribute their datasets across many machines, or nodes, to process large volumes of data (hence the name). Each node has its own storage and compute capabilities, enabling it to execute a portion of the query. (see details here)
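The following is a small sketch of the two ideas in this paragraph: storing data column by column, and letting each node compute a partial result that a coordinator then merges. The sample data, the function names, and the round-robin split across "nodes" are assumptions for illustration; here the nodes are plain dictionaries processed one after another, whereas a real MPP database runs them in parallel on separate machines.

```python
# Row-oriented layout: one object per row.
rows = [
    {"region": "east", "amount": 10},
    {"region": "west", "amount": 20},
    {"region": "east", "amount": 5},
    {"region": "west", "amount": 7},
]


def to_columns(rows):
    """Convert rows into a column-oriented layout: one list per column."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


def split_across_nodes(columns, node_count):
    """Give each hypothetical node its own slice of every column."""
    nodes = []
    for n in range(node_count):
        nodes.append(
            {name: values[n::node_count] for name, values in columns.items()}
        )
    return nodes


def partial_sum_by_region(node):
    """Work done locally on one node, reading only the columns it needs."""
    totals = {}
    for region, amount in zip(node["region"], node["amount"]):
        totals[region] = totals.get(region, 0) + amount
    return totals


def merge(partials):
    """Coordinator step: combine the partial results from all nodes."""
    result = {}
    for partial in partials:
        for region, subtotal in partial.items():
            result[region] = result.get(region, 0) + subtotal
    return result


columns = to_columns(rows)
nodes = split_across_nodes(columns, node_count=2)
print(merge(partial_sum_by_region(node) for node in nodes))
# {'east': 15, 'west': 27}
```

An aggregate such as a sum touches only the columns it needs, which is why the columnar layout helps analytical queries in particular.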

Name Salient Features
Apache Hadoop
  • Allows for distributed storage and distributed computing (Massively Parallel Processing, MPP)
  • Efficiently stores and processes large datasets ranging in size from gigabytes to petabytes
  • Achieves fault tolerance by replicating data blocks across the cluster
  • Uses HDFS as its file system and MapReduce as its processing model
  • A Hadoop cluster includes a master node and multiple worker nodes
Crate.io
  • Database purpose-built for machine data, with an architecture designed for machine data use cases
  • Use SQL to process, aggregate, and join data
  • Distributed SQL query engine with columnar field caches and a more modern query planner, which make joins and aggregates fast
  • Automatic replication of data across the cluster eases recovery when a disaster hits
  • Real-time data ingestion, i.e. you can query large volumes of data while ingesting data at scale
  • Time-series analysis is made fast and easy with automatic table partitions
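The MapReduce processing model mentioned for Hadoop above can be sketched in plain Python as a word count. This is not Hadoop code and uses no Hadoop API: the function names and the two sample input splits are assumptions for illustration, and everything runs in one process, whereas Hadoop distributes the map and reduce tasks across worker nodes.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input split."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce: collapse each key's values into a single result."""
    return {key: sum(values) for key, values in grouped.items()}


splits = ["big data on hadoop", "hadoop stores big files"]
mapped = [pair for split in splits for pair in map_phase(split)]
counts = reduce_phase(shuffle_phase(mapped))
print(counts["hadoop"], counts["big"], counts["files"])  # 2 2 1
```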

 

Multi Model

Most database management systems are organized around a single data model that determines how data can be organized, stored, and manipulated. In contrast, a multi-model database is designed to support multiple data models against a single, integrated backend. Document, graph, relational, and key-value models are examples of data models that may be supported by a multi-model database. (see details here)
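As a rough illustration of several data models sharing one backend, the sketch below keeps all records in a single dictionary and exposes key-value, document, and graph access on top of it. `TinyMultiModel` and its methods are hypothetical names invented for this example and do not correspond to the API of any product listed below.

```python
class TinyMultiModel:
    """Hypothetical store: one backend, three ways to access it."""

    def __init__(self):
        self.records = {}  # key -> document (a dict)
        self.edges = []    # (from_key, label, to_key)

    # Key-value model: opaque lookup by key.
    def put(self, key, document):
        self.records[key] = document

    def get(self, key):
        return self.records.get(key)

    # Document model: filter on fields inside the stored documents.
    def find(self, field, value):
        return [k for k, doc in self.records.items() if doc.get(field) == value]

    # Graph model: records act as vertices, edges connect them.
    def link(self, from_key, label, to_key):
        self.edges.append((from_key, label, to_key))

    def neighbors(self, key, label):
        return [dst for src, lbl, dst in self.edges if src == key and lbl == label]


db = TinyMultiModel()
db.put("u1", {"name": "Ada", "city": "Denver"})
db.put("u2", {"name": "Lin", "city": "Denver"})
db.link("u1", "follows", "u2")

print(db.get("u1")["name"])           # Ada
print(db.find("city", "Denver"))      # ['u1', 'u2']
print(db.neighbors("u1", "follows"))  # ['u2']
```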

Name Salient Features
ArangoDB
  • Unlike many NoSQL databases, ArangoDB is a native multi-model database (key/value pairs, graphs or documents)
  • Use ArangoDB when your application design will grow over time and you want to stay flexible
  • Multiple teams can create objects as they need, e.g. Graph, key-value, or Document
  • Reduces the complexity of the technology stack for your application or usage
  • A native multi-model database allows you to have polyglot data without the complexity
Crux
  • Crux is bitemporal, document-centric, schemaless, and designed to work with Kafka as an “unbundled” database
  • Bitemporality is broadly useful for event-based architectures and is a critical requirement for systems in any industry with strong auditing regulation
  • Supports a Datalog query interface for traversing graph relationships across your documents.
  • Can run in distributed and non-distributed modes
Dgraph
  • Dgraph is an open source, fast, and distributed graph database written entirely in Go
  • Horizontally scalable transactional graph database with fast arbitrary-depth joins using a GraphQL-like query language.
  • Common uses of graph databases are master data management, recommendation engines, etc.
  • Fast data retrieval for connected data
FoundationDB
  • Distributed architecture that scales out gracefully and handles faults while acting like a single ACID database
  • Delivers high performance on commodity hardware, allowing you to support very heavy loads at low cost
  • Stores each piece of data on multiple machines according to a configurable replication factor
InterSystems IRIS Data Platform
  • All data in an InterSystems IRIS database is stored in efficient, tree-based sparse multidimensional arrays
  • Highly available, scalable, and resilient
OrientDB
  • Combines the power of graphs and the flexibility of documents into one scalable, high-performance operational database
  • Open source but enterprise version is also available
  • Written in Java, with a very small server distribution (about 2 MB)
  • It is fast: it can store up to 120,000 records per second

 

NoSQL

A NoSQL (originally referring to “non-SQL” or “non-relational”) database provides a mechanism for storage and retrieval of data that is modeled in means other than the tabular relations used in relational databases. Such databases have existed since the late 1960s, but the name “NoSQL” was only coined in the early 21st century, triggered by the needs of Web 2.0 companies. NoSQL databases are increasingly used in big data and real-time web applications. NoSQL systems are also sometimes called “Not only SQL” to emphasize that they may support SQL-like query languages or sit alongside SQL databases in polyglot-persistent architectures. (see details here)

Name Salient Features
Apache Cassandra
  • Distributed, wide-column store, NoSQL database management system
  • Can copy data to multiple sites for stronger disaster recovery and business continuity
  • Supports very heavy-load applications, such as those run by Facebook and Netflix
Couchbase
  • Cloud NoSQL database
  • JSON store that uses the N1QL language to retrieve data
  • Specialized to provide low-latency data management for large-scale interactive web, mobile, and IoT applications
  • Simple, uniform and powerful application development APIs across multiple programming languages
  • Couchbase documents are JSON, a self-describing format capable of representing rich structures and relationships
Infinispan
  • A distributed cache and key-value NoSQL data store developed by Red Hat
  • Access your data through multiple protocols and data formats
  • Keeps data available to meet demanding workloads
  • Clustered processing speeds up data processing in real time
MongoDB
  • MongoDB is a scalable, flexible NoSQL document database platform
  • Also available as a managed global cloud database service for modern applications
  • It provides developers with a number of useful out-of-the-box capabilities, whether you need to run privately on site or in the public cloud

 

RDBMS

Most of us are very familiar with Relational Database Management Systems (RDBMSs), as they have been around for decades. RDBMSs handle storage and retrieval of data for many companies and businesses. Some of the key players in this area are listed below.

Name Salient Features
IBM DB2
  • Transparently compresses data to decrease disk space and storage infrastructure requirements
  • Greatly reduces the cost and risk of moving legacy applications to Db2, so you can use your existing skills and assets for quicker, easier migrations
  • Offers in-memory columnar technology as well as parallel vector processing, data skipping, and data compression
MariaDB
  • Its speed is one of its most prominent features
  • MariaDB is remarkably scalable, and is able to handle tens of thousands of tables and billions of rows of data
  • It can also manage small amounts of data quickly and smoothly, making it convenient for small businesses or personal projects
Microsoft SQL Server
  • Gain insights from all your data by querying across your entire data estate – SQL Server, Azure SQL Database, Azure SQL Data Warehouse, Azure Cosmos DB, MySQL, PostgreSQL, MongoDB, Oracle, Teradata, HDFS, and others – without moving or replicating the data
  • Build a shared data lake by combining both structured and unstructured data in SQL Server and accessing the data using either T-SQL or Spark
  • Use SQL Server with Windows and Linux containers, plus deploy and manage your deployments using Kubernetes
MySQL
  • The most popular open source SQL database management system; developed, distributed, and supported by Oracle Corporation
  • Designed to be fully multithreaded using kernel threads, to easily use multiple CPUs if they are available
  • Executes very fast joins using an optimized nested-loop join. Implements in-memory hash tables, which are used as temporary tables
Oracle
  • One of the most widely used databases in the world, trusted by almost every major organization
  • Runs on major operating systems, including Linux, and supports ODBC connectivity
  • Includes performance optimizations for commonly used features such as LOBs, PL/SQL, and Index Organized Tables
PostgreSQL
  • PostgreSQL is a powerful, open source object-relational database system with over 30 years of active development
  • PostgreSQL runs on all major operating systems and has been ACID-compliant since 2001
  • Lets developers build applications and lets administrators protect data integrity, build fault-tolerant environments, and manage data no matter how big or small the dataset

 

Framework

Only one item is listed in this category at this point. Frameworks like this help you write code that makes it easy to deploy databases in Kubernetes.

Name Salient Features
KubeDB
  • KubeDB is a framework for writing operators for any database that supports the following operational requirements:
    • Create a database declaratively using a CRD
    • Take one-off or periodic backups to various cloud stores, e.g. S3, GCS, etc.
    • Restore from backup or clone any database
  • KubeDB currently includes support for the following datastores:
    • Postgres
    • Elasticsearch
    • MySQL
    • MongoDB
    • Redis
    • Memcached

 

Conclusion

You are very brave if you have made it this far. As you can see, choosing a database for a Kubernetes system is a task that requires a lot of due diligence and research. Once you have completed that research, you will find that implementing a specific technology is well documented by its vendor.

The High Plains Computing team is always ready to help with any Kubernetes-related product review, architecture design, or design review for your project. We have a team of seasoned CKA admins with many years of cloud native technology experience.

 

Sources: Wikipedia, CNCF, OpenSource, DevTeam, and all other respective product vendors (links above).