Flowman — A Declarative ETL Framework powered by Apache Spark

Don’t reinvent the wheel by writing more boilerplate code. Focus on critical business logic and delegate the tricky details to a clever tool.

9 min read · Jun 2, 2023

Introduction

Apache Spark is a powerful framework for building flexible and scalable data processing applications. Its relatively simple API offers the capabilities of classic SQL SELECT statements in a clean and flexible form. At the same time, Spark can access all kinds of data sources, such as object stores in the cloud (S3, ABS, …), relational databases via JDBC, and much more via custom connectors (Kafka, HBase, MongoDB, Cassandra, …). This wide range of connectors, coupled with extensive support for data transformations, covers simple data ingestion, pure data transformation, and any combination of the two, such as ETL and ELT. Spark's flexibility supports very different types of workflows to fit your exact needs.

But in the end, a framework for a data processing application is not enough. You still need to build an application on top of Apache Spark that includes all the business logic and lots of boilerplate code. In addition, some important data-related topics fall outside the scope of Apache Spark, such as robust schema management, including schema evolution and migration strategies, or capturing relevant metrics to measure data quality.

In this situation, a tool at a higher level of abstraction can provide practical solutions, and Flowman is just such a tool. It is completely open source and free of charge, with pre-built packages for various Spark versions (2.4 through 3.3) and environments (pure Spark, Cloudera, AWS EMR, Azure Synapse).

Declarative Approach

Flowman uses a purely declarative approach with simple YAML files to specify all data sources, transformations, and data sinks. This makes the life of a data engineer much easier: no expertise in classical software engineering is required, so they can focus on the data.
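To make the idea of "declarative" a bit more concrete, here is a minimal toy sketch in plain Python. It is not Flowman's specification format and does not use Spark. Every key in it (`sources`, `transformations`, `sinks`, `kind`, `input`, and so on) and both functions are hypothetical names chosen only for illustration. The point is the split: the specification describes what data exists and how it relates, while a small generic engine decides how to execute it.

```python
# Toy illustration of a declarative pipeline. This is NOT Flowman's actual
# spec format; all keys and function names below are hypothetical.

SPEC = {
    "sources": {
        "measurements": {
            "kind": "inline",
            "rows": [
                {"station": "A", "temperature": 21.5},
                {"station": "B", "temperature": -3.0},
                {"station": "A", "temperature": 19.0},
            ],
        }
    },
    "transformations": {
        # Keep only rows whose column value is at least "min".
        "warm_only": {
            "kind": "filter",
            "input": "measurements",
            "column": "temperature",
            "min": 0.0,
        },
        # Keep only the listed columns.
        "stations": {
            "kind": "project",
            "input": "warm_only",
            "columns": ["station"],
        },
    },
    "sinks": {
        "result": {"kind": "memory", "input": "stations"},
    },
}


def resolve(name, spec, cache):
    """Return the rows of a named source or transformation, computing inputs on demand."""
    if name in cache:
        return cache[name]
    if name in spec["sources"]:
        rows = list(spec["sources"][name]["rows"])
    else:
        step = spec["transformations"][name]
        rows_in = resolve(step["input"], spec, cache)
        if step["kind"] == "filter":
            rows = [r for r in rows_in if r[step["column"]] >= step["min"]]
        elif step["kind"] == "project":
            rows = [{c: r[c] for c in step["columns"]} for r in rows_in]
        else:
            raise ValueError(f"unknown transformation kind: {step['kind']}")
    cache[name] = rows
    return rows


def run(spec):
    """Execute every sink in the spec and return its rows keyed by sink name."""
    cache = {}
    return {name: resolve(sink["input"], spec, cache) for name, sink in spec["sinks"].items()}


if __name__ == "__main__":
    print(run(SPEC))
```

Adding a new step in this sketch means adding an entry to the specification, not writing new control flow. That is the property the declarative approach is after, although a real tool has far more to handle than this toy engine does.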

A powerful set of command-line applications then parses these specification files and…

Written by Kaya Kupferschmidt
