Snowflake: beyond the cloud data warehouse
As part of growing our expertise in state-of-the-art Big Data tools, a research team of #GFTexperts: Piotr Mosiniak, Krzysztof Faliński and Daniel Grabowski, prepared a thorough exploration of Snowflake, one of the hottest Big Data solutions right now. Check out their practical walk-through of the core features and concepts, including practical use cases in generating real-time quotes and an Oracle DB migration.
The growing interest in processing massive volumes of data resulted in the demand to create a more user-friendly system for its storage and sharing. Snowflake, unlike other Big Data solutions, requires minimal configuration, works in a very user-friendly manner (with SQL as the main language to interact with data) and is easily scalable according to your needs (it is cloud native). One could call it a new type of data warehouse in the cloud that can ingest data from many sources, processes huge volumes of it and integrates with many external reporting tools such as Tableau, Qlik or Power BI with ease.
Snowflake is built as “multi cluster, shared data architecture”. When creating an account, you can choose which cloud provider you wish to use as a backend for your snowflake instance – you can choose among AWS, Azure and GCP.
Architecturally, compute layer is separated from storage layer and uses other cloud services, hence you can scale layers separately using specific settings.
Costs are expressed as Snowflake Credits which are used to pay for consumption of resources on Snowflake. A Snowflake credit is a unit of measure, and it is consumed only when a customer is using resources.
Benefits of using Snowflake
Snowflake is highly available because it is distributed across available zones of the platform on which it runs. It is designed to operate continuously and tolerate failures with minimal impact to customers.
Thanks to its cloud nature, it is also scalable vertically and horizontally and can perform as fast as you need. If you need to boost performance, you can scale computing power up and then, after the job is done, scale it down to save money.
Snowflake supports structured and semi structured data so as you don’t have to keep fixed relational schema. It is easy to manipulate with data, because we can use the well known SQL.
Data sharing is easy to achieve and manage. Snowflake’s architecture enables data sharing among any data consumer — whether they are a Snowflake customer or not. You can manage additional Snowflake accounts for external clients. Adding other client’s accounts doesn’t slow the system down because Snowflake supports HA access to data and uses separate computing power. Queries from one virtual warehouse never affect queries from another, and each virtual warehouse can scale up or down as required.
The common step of each approach is to properly configure the environment in Snowflake. The basic element is the creation of a Warehouse, where we declare the performance tier for processing unit and other features such as the number of instances in the cluster and the scaling policy. It was also very important to define auto suspend time for cost saving.
The next step was to create a system user to limit permissions to the appropriate part of the system.
TRADE LENS – Real-Time Quotes
The first goal was to process trade prices form stock exchange in “live” mode and then analyze them in the analytical management tool – Power BI. We used Azure Event Hub data from another pre-existing PoC project as input source. In order to transfer data, we used Kafka Connect deployed in a docker container on Azure, which was responsible for sending data to raw Snowflake tables. Then, from the raw tables, the data were processed to final tables with appropriate formats.
For visualization purposes, we used Power BI to query Snowflake database with Direct Query mode, which conducts SQL queries in real time.
ORACLE DB migration
For the purposes of replicating a real scenario of migration, we have created a virtual machine with Oracle SQL version 12 on Azure. It was populated with data generated by the database performance testing tool – TPC-DS. This tool has the ability to generate a given data size – in our case it was 1 GB of data with 24 tables.
Generating test data and configuration necessary for import
The tool allows you to automatically generate the desired data volumes. For this purpose, the scripts provided in TPC-DS had to be ran and the expected size had to be determined. The data was created in the form of text files with the symbols “|” separating successive values.
In this way, 24 files corresponding to the database schema were created.
Data import to Oracle DB
We performed the import using the sqlldr utility, which requires the previously described pair for each of the tables (data file and control file). Using a python script, we generated the “CONTROL_FILE” files needed to import data into the Oracle database. The configuration files describe how sqlldr parses the data from the source file and what table in the database it should go to. A sample configuration file “inventory.cfg”:
We performed the import process using the script:
The database was placed on a standard HDD with basic performance tier settings.
During the import, it was necessary to scale up the disk with the database to a faster version (SSD Premium), because the import using the standard settings took about 6 times longer (6h vs 1h).
Migration – first plan, one-time migration
The first way was to migrate data from files from the database export. For this purpose, we exported data to files using the Oracle SQL Developer tool. It turned out that importing to Snowflake requires the use of a dot as a decimal separator and this could not be configured in the end system itself – we had to prepare the data in this way from the very beginning.
Then the files were placed in Azure Blob Storage, where separate folders were created for each of the tables. Snowflake reads multiple files from each folder and thus does an additional migration as you add more data (not repeating previously imported data).
To import data to Snowflake, it was also necessary to create a SAS token that allows access to resources contained in blob storage. It is a means of protection against unauthorized access from the public network. In the token you can separately set what resources will be accessed, what actions will be allowed and what the token expiry date is.
Then we set up the data processing and storage system on the Snowflake side. First, the data downloaded from Azure went to the stage. Configuration on the Snowflake side is to create a temporary storage location for raw data downloaded unchanged from Blob Storage.
It was also necessary to create a table space into which this data will be transformed. To create the tables, we used the same SQL scripts that were used to create the schema in the source Oracle database.
The next step was to determine how to transform the raw data in CSV format into the tablespace. For this purpose, it was necessary to define file formats as below:
The next step was to configure the stage to copy data to tables using the COPY INTO instruction using the previously created format.
Migration – second plan, going incrementally
The second method used Kafka Connect with plugins to connect the database (using the JDBC protocol) and Snowflake. In this case, it is possible to migrate a running database without stopping it or taking a snapshot from one moment.
A great advantage of this solution is that there is no need to stop the application when data is being exported. An additional benefit is the ability to maintain two data sets (source database and data in Snowflake), so you can check their consistency before making a decision to switch to a migrated system.
In order to perform an incremental migration, it was necessary to ensure that the data set contained a field with a unique key. In the case of a complex primary key, it was required to add extra ID with the auto-incrementation function so that it was possible to bypass the already imported data and specify the appropriate offset.
To transfer data from Oracle DB to Snowflake, we used Kafka Connect with two connectors: jdbc-connector, snowflake-connector and the Azure Event Hubs.
The Event Hubs for Apache Kafka feature provides a protocol head on top of Azure Event Hubs that is compatible with Kafka clients built for server versions 1.0 and later and supports both reading from and writing to Event Hubs, which are equivalent to Kafka topics. Conceptually, Kafka and Event Hubs are very similar to each other: they’re both partitioned logs built for streaming data, whereby the client controls which part of the retained log it wants to read.
We have used distributed mode in order to run Kafka Connect with multiple workers and ease of configuration. Based on Event Hubs, the solution was scalable, fast and fault tolerant, so it is suitable for production deployment.
It has been deployed to Container instance from deployment template. We have guaranteed the highest performance thanks to internal network between all components: Oracle, Kafka Connect, Event Hubs and Azure Snowflake version and partitioning Event Hubs.
We had to create 4 groups of Event hubs not to exceed the limit of 10 Event Hubs per one group/namespace. Our data sources were using 24 topics, one for each data source and 12 topics for Kafka Connect’s configurations. Kafka connectors migrated data in parallel using separated topics as separated ELT pipelines.
A large amount of data was readily embraced by our solution, which turned out to be as effective as the boa snake in the book The Little Prince. This is clearly visible when compared to the picture drawn by the Little Prince.
Jdbc and snowflake connectors were configured through Kafka connect REST API. Configuration for both of them was a little bit tricky. There were some issues that we had to look after and to be honest, a huge chunk of our work was devoted to overcoming them.
When configuring the JDBC connector, the most important elements were:
- specifying data types by using the appropriate CAST, where the problem was reading the data through connector (without CAST the data was read as bytes in base64) and setting the mappingtype (best_fit / none / precision)
“query”:”SELECT CAST(WR_RETURNED_DATE_SK AS NUMBER(18,0))
- setting the migration method to iterative (incrementing / bulk) with specifying a unique column to set the offset
- increasing performance of the process thanks to the parallelization of the process by using multiple tasks and multiple partitions
When configuring the Snowflake connector, the most important element was the correct configuration of topic mapping to the target table.
Those messages are automatically consumed and transferred to sink raw snowflake tables. Under the hood it is a complex process that includes:
- consuming messages
- compressing json
- copying data to snowflake storage
- creating snowflake tables on top of it
Transforming data on snowflake side
Finally, after moving the data to Snowflake, there were two more steps left to do:
1. Create the final table with original oracle DB schema
2. Transform raw table in json format to final table identical to the source DB schema
Problems in our use cases
During the works, we encountered several important problems to solve. They resulted from the lack of knowledge of the system or the limitations of the tools used for migration.
- Snowflake does not enforce the unique constraint by design while some other systems might want this information – this way they won’t get lost while travelling through Snowflake. Due to that, we had to prepare custom solution to keep unique data from trade lens event hub (some items were duplicated).
- Snowflake needs a dot as a decimal separator, and for this reason, we had to export data from the database with dots rather than comas.
- Snowflake materialized view is not supported by Microsoft Power BI. What is very important is to check whether SELECT used in Connector configuration is proper, because in our use case, we encountered a problem with mismatched columns.
- Jdbc connector has troubles with the oracle type, especially with precision and scale in Number type. The workaround was to cast to Number(x,0), which introduced some complications in configuring the jdbc connector.
Snowflake is one of the leading cloud data warehouse technologies on the market today and with the recent advances in technology and the shift to cloud-based data warehousing, ELT approaches have become far more common.
- In our opinion, Snowflake is the easiest Big Data implementation to start analyzing and processing huge amounts of data without overhead in configuration or the need for extensive knowledge of Big Data
- Snowflake follows the trend of user-friendly solutions that do not require complicated configuration and tools. On the other hand, such a solution does not allow much freedom and extensibility due to being a closed product.
- In this solution, DevOps and DBA skills are even more important than skills related to Big Data systems, because assembling the system and creating subsequent elements needed to transfer data requires running many components and using SQL.
- Snowflake cooperates with a whole range of connectors, pipelines, reporting systems, business analytics applications and data analytics such as Spark.
Finally, it is a cloud-agnostic data warehouse. This is an aspect that is crucial for many customers who wish to prevent relying on any given provider exclusively.