At a high level, a typical enterprise data infrastructure comprises at least two relatively separate systems: a transactional system (OLTP) and an analytical system (OLAP, i.e., a data warehouse). An ETL pipeline then propagates data from the former to the latter on an ongoing basis. In the industry today, Apache Spark often plays a prominent role in this pipeline. Notably, its typical ETL use case, besides reading and transforming data, is to write data, either directly or via an intermediate step, to some kind of analytical system.
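To make that typical shape concrete, here is a minimal sketch of such a read-transform-write job. The connection details, table, columns, and warehouse path are all hypothetical, and the transformation is deliberately trivial:

```scala
import org.apache.spark.sql.SparkSession

object TypicalEtlJob {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("typical-etl").getOrCreate()

    // Read from the transactional system (OLTP) over JDBC.
    val orders = spark.read
      .format("jdbc")
      .option("url", "jdbc:mysql://oltp-host:3306/shop") // hypothetical source
      .option("dbtable", "orders")
      .option("user", "etl")
      .option("password", "secret")
      .load()

    // Transform: aggregate into a shape meant for analytical consumption.
    val dailyRevenue = orders
      .groupBy("order_date")
      .sum("amount")

    // Write to the analytical side, here as Parquet files under a warehouse path.
    dailyRevenue.write
      .mode("overwrite")
      .parquet("/warehouse/daily_revenue")

    spark.stop()
  }
}
```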
In the recent past, I was involved in building a pipeline where Spark writes data directly to a transactional system, specifically an OLTP MySQL database, rather than to an analytical one. This is a somewhat less common use case for Spark, and in my opinion it makes for a topic interesting enough to deserve a series of posts. Moreover, it is narrowly focused in scope and "straightforward" enough to hopefully be presentable in blog form. Thus, this series discusses a case study inspired by a real-world production system.
At a high level, the requirements for Spark batch jobs that write data to the database are the following:
- Support updating existing records in transactional tables.
- Support inserting new records into transactional tables.
- Support partial job success.
- Support a mechanism to report failed records.
- Support changes to the schema written to transactional tables between different runs.
These requirements demand fine-grained, "delicate" processing of relational records in the context of an OLTP database, and Spark's native operations can't meet all of them out of the box. Throughout this series, we'll dig into the details of this case study and arrive at a solution with working Scala code.
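To see the gap, consider what Spark's built-in JDBC sink offers. A sketch of the native write path follows (connection details are hypothetical); the save modes operate at table granularity, so none of them gives per-record updates, partial success, or per-record failure reporting:

```scala
import java.util.Properties
import org.apache.spark.sql.{DataFrame, SaveMode}

def writeNatively(df: DataFrame): Unit = {
  val props = new Properties()
  props.setProperty("user", "etl")        // hypothetical credentials
  props.setProperty("password", "secret")

  df.write
    // Append inserts every row as new; rows that already exist in the table
    // are not updated, so re-running the job duplicates or violates keys.
    .mode(SaveMode.Append)
    .jdbc("jdbc:mysql://oltp-host:3306/shop", "orders", props)

  // SaveMode.Overwrite is the other option, but it drops and recreates the
  // whole table, losing any rows not present in df. Neither mode resembles
  // the record-level upsert semantics the requirements above call for.
}
```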
In this study, we work with Spark 2.1.0, but the claims in this series should apply equally to the latest version, 2.3.2 at the time of writing.
Now, let’s start our fun ride together. In the next post, we’ll set the context for this study.