Snowpark for Spark Users: What You Need to Know

From previous blog posts, we are already familiar with some basic concepts and features of Snowflake. In this post, we want to focus on a feature that will be of particular interest to anyone who already operates a data warehouse on Databricks or with Apache Spark: Snowpark. Before diving deeper, this article first answers the fundamental question of what Snowpark is and how it differs from Spark.

What is Snowpark?

Snowpark is an umbrella term used by Snowflake for the runtime environments and libraries that allow us to execute non-SQL code such as Python, Java, and Scala directly on Snowflake. For simplicity, the rest of this post focuses on Python as the non-SQL language. The most important library within Snowpark is the Snowpark API, a classic DataFrame API that translates Python code directly into SnowSQL, Snowflake's SQL dialect, and executes it on Snowflake as usual. Equally essential are the runtime environments for (vectorized) user-defined (table) functions (UDxFs) and stored procedures (SPROCs), which execute non-SQL code directly in Snowflake.

How do these three components, Snowpark API, UDxFs, and SPROCs, work together, and, most importantly, when should you use which?

  • For transformations that could also be implemented in SQL, use the Snowpark API. This is the same situation in which you would reach for PySpark in the Spark world.
  • Complex operations that cannot be expressed in SQL and that require libraries like PyTorch or Scikit-Learn are performed using UDxFs, just as you would turn to UDFs in Spark.
  • For the orchestration and automation of the transformations and operations described above, use SPROCs. In the Spark world, the closest equivalent is Databricks Workflows, which let you set up a notebook as a job that runs at predefined times. A sketch of all three components follows this list.
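
To make this more concrete, here is a minimal sketch of all three components in Python. It assumes an existing Snowpark Session named session (see the session setup in the Programming section below) and a hypothetical table ORDERS with the columns CUSTOMER_ID, STATUS, and AMOUNT; all names are illustrative, not prescriptive.

from snowflake.snowpark import Session
from snowflake.snowpark import functions as F
from snowflake.snowpark.types import FloatType

# 1) Snowpark API: a transformation that could also be written in SQL,
#    analogous to PySpark DataFrame code.
orders = session.table("ORDERS")
open_totals = (
    orders.filter(F.col("STATUS") == "OPEN")
          .group_by("CUSTOMER_ID")
          .agg(F.sum("AMOUNT").alias("TOTAL_AMOUNT"))
)
open_totals.show()

# 2) UDF: logic that goes beyond SQL, e.g. calling into a Python library.
@F.udf(return_type=FloatType(), input_types=[FloatType()])
def add_vat(amount: float) -> float:
    return amount * 1.19

orders.select(add_vat(F.col("AMOUNT")).alias("GROSS_AMOUNT")).show()

# 3) SPROC: orchestrates the steps above; the registered procedure can then
#    be scheduled, e.g. via a Snowflake task.
def refresh_report(session: Session) -> str:
    session.table("ORDERS").filter(F.col("STATUS") == "OPEN") \
           .write.save_as_table("OPEN_ORDERS_REPORT", mode="overwrite")
    return "done"

session.sproc.register(refresh_report, name="REFRESH_REPORT", replace=True,
                       packages=["snowflake-snowpark-python"])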

It is already apparent how similar Snowpark and Spark are from a user’s perspective. This is advantageous as existing Spark ETL pipelines can often be migrated to Snowpark quite easily.

In addition to the three components of Snowpark already mentioned, there is another library named Snowpark ML API. This consists of a Modelling module, which offers parallelized ML algorithms similar to Spark ML, and an Operations module for MLOps tasks. For organizing ML models, there is a Model Registry in the Operations module.
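
As a rough illustration, the following sketch shows what training a model with the Modelling module and logging it in the Model Registry could look like. It assumes the snowflake-ml-python package and an existing session; the table, column, and model names are hypothetical, and details may vary between package versions.

from snowflake.ml.modeling.linear_model import LinearRegression
from snowflake.ml.registry import Registry

# Training data is assumed to live in a (hypothetical) Snowflake table.
train_df = session.table("HOUSE_PRICES")

# Scikit-Learn-like estimator that trains inside Snowflake, similar in spirit to Spark ML.
model = LinearRegression(
    input_cols=["SQUARE_METERS", "ROOMS"],
    label_cols=["PRICE"],
    output_cols=["PRICE_PREDICTION"],
)
model.fit(train_df)
predictions = model.predict(train_df)

# Organize the trained model via the Model Registry (Operations module).
registry = Registry(session=session)
registry.log_model(model, model_name="house_price_model", version_name="v1")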

To add to the complexity of terms under the Snowpark umbrella, Container Services come in as yet another runtime environment. Container Services allow Docker containers to be deployed and executed directly within Snowflake and are used under the hood by Streamlit applications on Snowflake. The following diagram shows all the components mentioned.

Although Snowpark, by Snowflake's own definition, comprises many components, it is very common to refer to just the three components Snowpark API, UDxFs, and SPROCs as Snowpark. This also allows a direct comparison between Spark and Snowpark, very much in the spirit of the coinage S(now)park.

How does Snowpark differ from Spark under the hood?

So far, it sounds almost as if Snowpark were simply Spark on Snowflake. Why use a whole new library instead of simply using the Snowflake Connector for Spark?

To understand this, let's take a brief deep dive into Spark. Spark is a complete framework for distributed computing, in which the Spark API is just the "user interface". To execute a Spark query, the code is first converted into a logical execution plan that represents all necessary operations as a graph. From this logical plan, the Spark driver derives a physical plan whose tasks are executed on the individual executor nodes.

The Snowpark API is, so to speak, only the first step of this process: the conversion of non-SQL code into another representation that the system understands better. In Spark, this representation is the logical plan, whereas Snowpark transpiles the code to SnowSQL. Especially during Snowpark's preview phase, SQL error messages sometimes surprised users, even though they had defined the query using DataFrames in Python. That has improved a lot over time; error messages are much more meaningful nowadays.
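
You can observe this transpilation yourself: a Snowpark DataFrame exposes the SQL it will send to Snowflake, which is also a handy first step whenever an SQL error message seems surprising. A small sketch, reusing the session and the hypothetical ORDERS table from above:

from snowflake.snowpark import functions as F

df = (
    session.table("ORDERS")
           .filter(F.col("STATUS") == "OPEN")
           .group_by("CUSTOMER_ID")
           .agg(F.sum("AMOUNT").alias("TOTAL_AMOUNT"))
)

# The generated SnowSQL statements behind this DataFrame ...
print(df.queries["queries"])

# ... and the execution plan as Snowflake sees it.
df.explain()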

In Snowpark, unlike Spark, there is no distributed computing runtime underneath the Snowpark API, but our familiar Snowflake with all its advantages. There is thus almost no administrative overhead, and the mature, finely tuned Snowflake engine optimizes our Snowpark query. There is no need to tune memory, JVM heap space, garbage collection parameters, and much more, as in Spark. Instead, the focus can be fully on developing the actual application!

How does Snowpark differ from Spark in use?

Programming

Since the Snowpark Python API has adopted almost 100 % of the PySpark API, Spark users will feel immediately at home with Snowpark. For simple workflows, migration is limited to swapping the import statements from pyspark to snowflake.snowpark, i.e.,

import pyspark
from pyspark.sql import SparkSession, SQLContext
from pyspark.context import SparkContext
from pyspark.sql import functions as F
from pyspark.sql import types as T

becomes

import snowflake.snowpark
from snowflake.snowpark import Session  # no SQLContext or SparkContext needed
from snowflake.snowpark import functions as F
from snowflake.snowpark import types as T
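
The entry point changes in the same spirit: instead of a SparkSession, you create a Snowpark Session from connection parameters. A minimal sketch with placeholder values (in practice these come from a config file or a secrets store):

from snowflake.snowpark import Session

session = Session.builder.configs({
    "account": "<account>",
    "user": "<user>",
    "password": "<password>",
    "warehouse": "<warehouse>",
    "database": "<database>",
    "schema": "<schema>",
}).create()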

For a larger codebase, however, the freely available tool SnowConvert is recommended, which makes these changes automatically. SnowConvert also adds appropriate comments wherever there is no direct Snowpark replacement for PySpark functionality, e.g., for these PySpark methods:

  • DataFrame.coalesce has no equivalent, since the processing model is not based directly on partitions as in Spark,
  • DataFrame.foreach should be rephrased as a UDF and applied to the columns actually affected, not the entire row,
  • DataFrame.head becomes limit followed by collect,
  • DataFrame.cache becomes cache_result (both patterns are sketched below).
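
For the last two bullet points, the migration is mechanical. A small sketch of what the Snowpark code would look like, assuming an existing DataFrame df:

# PySpark: first_rows = df.head(5)
first_rows = df.limit(5).collect()

# PySpark: df = df.cache()
df = df.cache_result()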

Testing and Debugging

For checking the execution of a query, Spark offers the so-called Spark UI. Since Snowpark translates the query into SnowSQL, with Snowpark you use the familiar Query History, where you can also view the query profile with the execution plan. At this point, there are no fundamental differences between Spark and Snowpark; one just needs to get used to another user interface.

Testing is very important for any data pipeline and data product. Here Spark, especially the open-source version as compared to Databricks and Snowpark, has the advantage that tests can also be run locally. Until recently, Snowpark was somewhat (snow-)blind in this area, as the test code required an internet connection to Snowflake. Fortunately, Snowpark now offers a "Local Mode" as a public preview feature, which allows tests to be run without a connection to Snowflake, e.g., with pytest.
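
A minimal sketch of how such a local test could look with pytest; the table contents and column names are made up, and the feature requires a sufficiently recent version of snowflake-snowpark-python:

import pytest
from snowflake.snowpark import Session
from snowflake.snowpark import functions as F

@pytest.fixture
def session():
    # Local testing mode: no connection to a Snowflake account is opened.
    return Session.builder.config("local_testing", True).create()

def test_open_orders_filter(session):
    df = session.create_dataframe(
        [[1, "OPEN", 10.0], [2, "CLOSED", 5.0]],
        schema=["CUSTOMER_ID", "STATUS", "AMOUNT"],
    )
    result = df.filter(F.col("STATUS") == "OPEN").collect()
    assert len(result) == 1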

Availability of Python packages

One of the reasons for using a non-SQL language in the first place is, of course, the extended functionality and access to a rich package ecosystem like Python's. With Spark, there are almost no restrictions here, and any third-party packages can be used, whether installed from PyPI or provided directly. Snowpark, as in other cases, chooses a different path, the path of maximum security. By default, only packages from a special Snowflake channel on Anaconda can be installed, which Snowflake itself has checked for security beforehand and which are thus guaranteed not to contain any malicious code. Currently, more than 2000 Python packages are available on the channel, and more are added daily. This means that the most commonly used Python packages are also available for Snowpark. If a package is missing, there is still the option to access PyPI and other repositories by loosening the security settings via the session.custom_package_usage_config parameter. This method also allows providing your own Python packages and replaces the slightly older approach in which special Python zip packages had to be created and provided via a stage. However, one disadvantage of Snowpark should not be concealed: so far, only pure Python packages, i.e., without native code components, are supported unless they are available on the Snowflake Anaconda channel.
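
A short sketch of loosening this restriction and requesting packages for the current session; whether a package from PyPI actually works still depends on it being pure Python:

# Allow packages that are not on the Snowflake Anaconda channel (pure Python only).
session.custom_package_usage_config = {"enabled": True}

# Packages from the Snowflake Anaconda channel can be requested directly by name.
session.add_packages("scikit-learn", "pandas")

# Alternatively, pin the dependencies from a requirements file.
# session.add_requirements("requirements.txt")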

Loading from external data sources

For many Spark users, the large number of Spark extensions and plugins is of great value. Prominent examples are the Spark connectors for various data sources like Postgres or MongoDB. When using Snowpark, one has to rethink a bit here, as the Snowpark API, as described above, is only a kind of frontend for Snowflake. The responsibility for integrating additional data sources thus lies directly with Snowflake. External data sources must be made available on cloud storage, known as a stage, and then loaded into Snowflake, for example using Snowpipe. Additionally, external providers like Fivetran allow connecting almost 500 external data sources to Snowflake, and Snowflake itself provides connectors for Kafka and Spark. However, data does not always have to be loaded into Snowflake. A current preview feature of Snowflake is Iceberg Tables, based on Apache Iceberg, an open table format built on Parquet files. For the first time, this allows Snowflake to be used directly on your own data storage, almost without performance losses. Overall, the connectivity of Snowpark and Spark is certainly similarly powerful but implemented differently.
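
For data that already sits on a stage, the Snowpark DataFrameReader can read it directly. A small sketch, assuming a stage @my_stage with CSV files of a known schema:

from snowflake.snowpark.types import StructType, StructField, StringType, DoubleType

schema = StructType([
    StructField("CUSTOMER_ID", StringType()),
    StructField("AMOUNT", DoubleType()),
])

staged_orders = (
    session.read.schema(schema)
           .option("field_delimiter", ",")
           .csv("@my_stage/orders/")
)
staged_orders.show()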

Conclusion

We have closely examined Snowpark as an attractive alternative to Spark and highlighted both similarities and differences. In general, migration from Spark to Snowpark is often easy and benefits from the simplified maintenance and data governance of Snowflake.

We are happy to advise you on whether a migration to Snowpark is worthwhile for you and also support you in the implementation. More information about our services around Snowflake and how to contact us can be found here.
