Over the past two decades, on-premises data warehouses have been the go-to solution for data management. These systems have been relied on by organizations to store and analyze large amounts of data, thanks to the emergence of ETL (Extract, Transform, Load) tools and the adoption of common stacks such as OBIEE, SQL Server Data Warehouse, and DB2/Cognos. However, as technology continues to evolve, it becomes increasingly important to consider migrating these on-premises data warehouses to the cloud. This article explores the reasons behind this migration and highlights the advantages of AWS Redshift as a cloud-based data warehousing solution.
Why Move to the Cloud?
Some of the good reasons to migrate your entire data warehouse setup to the cloud include
A Great Data Cloud Ecosystem
Amazon provides key technology enablers for modern data warehousing. The list below describes some of them and the diagram shows them at work.
These technology enablers include:
- Elastic Computing: Elastic serverless computing in Lambda functions, Glue ETL jobs, and Elastic Map Reduce service that can crunch any size data
- Data lakes: A data lake built on top of S3 object storage stores unstructured data in CSV or Parquet format. Glue Crawler crawls S3-stored files and updates metadata so Data can be queried as SQL without major ETL and ingestion into a data warehousing database. Data lakes can help build cost-effective data warehouses even for petabyte-scale datasets.
- Support for streaming data sources: Modern applications architecture favors ingesting real-time data vs batch mode applications. One such example is a product recommendation system ingests a stream of web analytic feed of user activities on busy brand sites and processes it to generate product recommendations. Batched mode operations can delay generating recommendations for hours or even days, depending upon schedule but stream mode can have the latest recommendations within minutes. Social media data and Twitter feeds are a few other examples.
- Support for on-demand data visualization and reporting services that can leverage AI/MLto provide much enhanced BI.
With most databases in the cloud, the process of Extracting, Transforming, and Loading becomes more streamlined. Cloud-based ETL tools offer enhanced flexibility and scalability.
Amazon offers an impressive array of modernized ETL tools that can work with streamed data sources, ingest data in diverse formats, and operate on large volumes of data.
The screenshot below shows how a serverless AWS Glue job can be configured to scale to any size.
Cloud-based data warehousing solutions often provide a more cost-effective approach, eliminating the need for expensive hardware investments and maintenance. Amazon (AWS), for example, offers diverse plans for its Cloud-based Columnar data warehouse database Redshift, including a free tier for tryout and on-demand pricing that allows users to stop service at any time and deeply discounted reserved instances. QA and Testing DB clusters can benefit from a stop-and-resume pricing model where the cluster can be stopped and billing stops.
Cloud platforms, such as AWS, offer unlimited scalability, allowing organizations to accommodate growing data volumes without disruptions. Amazon offers diverse node types to suit computing needs.
Here is an example of a starter Redshift DB cluster based on instance type DC2 where the underlying nodes that make the DB cluster have local SSD
The cluster consists of 4 DC2-large instances offering 640 GB of storage for data warehousing projects. Each node has 15 GB of memory and two vCPUs. Such a cluster would cost $700-$800 per month.
Once the POC for the data warehousing project is approved, the Customer may choose a 4 RA3 Node cluster to have an aggregate of
- 512 TB total data warehouse high-performance managed SSD storage that automatically expands to S3 when data size grows
- 192 vCPUs for Cluster
- 1.5 TB memory for cluster
On-demand pricing would cost $38,000 per month, but 50-90% discounted pricing can be obtained via a one or three-year reserved instance. Operating a 0.5 petabyte Data warehouse with a monthly TCO of less than $20K is only possible in the Cloud.
Accessibility & Collaboration
Cloud-based data warehouses enable remote access and collaboration, allowing teams to work together seamlessly from different locations.
Amazon Redshift is a business intelligence tool that uses an AI-powered architecture with massively parallel processing (MPP) to enable quick and cost-effective decision-making. With AWS’s zero-ETL approach, all your data is unified for powerful analytics, near-real-time use cases, and AI/ML applications. You can easily and securely set up, share and collaborate data within and across organizations, AWS regions, and even 3rd party data providers, supported by leading security capabilities and fine-grained governance.
The figure above shows Amazon Redshift architecture. It is a massive parallel processing architecture (MPP) with query processing distributed across multiple nodes. Nodes store data in compressed columnar format, which suits BI applications’ queries. Redshift spectrum adds the additional capability of using external S3 storage and processing semi-structured data in parquet format to be used as part of a data warehouse and extend storage capabilities to exabyte scale
Features of AWS Redshift
AWS Redshift is a powerful cloud-based data warehousing solution that offers numerous benefits for organizations considering migration. Its advantages include
A Massive Scale Data Warehouse
It is easy and cost-efficient to build an exabyte-scale data warehouse using Redshift. AWS provides rapid snapshot-based incremental backups and quick restores. Several node types with varying computing and storage are available, making it a very flexible data warehouse on the cloud.
A Unified Data-Driven Near zero-ETL Approach
Amazon Redshift’s zero-ETL approach enables interoperability and integration between various data sources, including Amazon S3 data lakes, operational and NoSQL databases, and streaming data services. It’s easy to ingest data into the warehouse or access it in-place ingestion, eliminating the need for complicated ETL pipelines. One such elimination of ETL example is when Redshift is configured to ingest data from transactional databases by processing redo log entries and then streaming it to S3. Redshift can leverage AWS Database Migration Service(DMS) to read redo logs and its S3 import/export feature to receive copies of transnational data. If this copied data needs some lightweight transformation, DMS can do that as a batched job, or data can be directly modified within Redshift and stored as L2/L3 tables optimized for BI applications.
In the above example, there is little or no traditional ETL pipeline setup, however, should a need for an ETL pipeline arise, the AWS Data ecosystem has a high-performance AWS Glue service that does Apache spark-like big data operations to slice and dice data from any datastore as well as high-performance streaming services like AWS Kinesis and Kinesis Firehose.
Fast and Secure Data Collaboration
Amazon Redshift allows secure data sharing across AWS regions, teams, and third-party data warehouses without data movement. Multiple teams can access shared data sets across regions and third-party data warehouses. AWS Lake Formation governs data sharing. Amazon Redshift provides fine-grained access controls and authentication with a single sign-on for your organizational identity, all included at no extra cost. Users and groups with access to shared data can query it using standard SQL and analytics tools, and join it with local data. New metadata views and modified JDBC/ODBC drivers ensure seamless integration with shared data. See the diagram for the data-sharing architecture.
Comprehensive Analytics & ML
Amazon Redshift simplifies data analysis, enabling you to execute SQL queries, build dashboards, and develop real-time AI applications. With Amazon Redshift, you can quickly spin up a Redshift Serverless endpoint and use the Query Editor to load, analyze, visualize, and collaborate on data from multiple sources. You can even submit plain English queries and receive custom SQL code recommendations with Amazon Q Generative SQL. With Amazon Redshift ML, you can build, train, and deploy machine learning models with familiar SQL, all from within the warehouse.
Amazon Redshift offers a pay-as-you-go pricing model, allowing organizations to pay only for the resources they consume. With automated workload management, Redshift optimizes resource allocation, further reducing cost. Amazon Redshift provides up to three times better price performance than other cloud data warehouses.
Security and Compliance
Amazon Redshift provides robust features, including encryption at rest and in transit.It complies with industry standards, such as PCI DSS, HIPAA, and SOC2, ensuring data protection and regulatory compliance.
As on-premises data warehouses started gaining mainstream popularity 20 years ago, organizations relied on tools like OBIEE, SQL Server Data Warehouse, and DB2/Cognos.
However, the evaluation of technology has led to the migration of these data warehouses to cloud-based solutions like AWS Redshift. With its high performance, scalability, cost-effectiveness, integration with other AWS services, and strong security features, Redshift is a top choice for organizations migrating their on-premises data warehouses to the cloud. By embracing cloud-based solutions, organizations can unlock the full potential of their data and drive innovation in the digital era. If your organization requires any assistance with AWS Redshift, please don’t hesitate to contact HighPlains Computing