Our long and fun journey now comes to an end.
About Spark
Apache Spark, as an analytic processing framework, is designed to process data in bulk. In particular, when writing to an RDBMS, Spark’s native operations process all records in a batch within one single database transaction (*). An individual record does not necessarily receive the “first-class citizen” care it would receive when processed in a transactional system.
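For reference, here is a minimal sketch of that standard bulk write path. The table, columns, and connection details are illustrative, not from the series: each partition of the dataframe is written over one connection, and all of its rows succeed or fail together.

```scala
import org.apache.spark.sql.{SaveMode, SparkSession}

// Minimal sketch of Spark's standard JDBC bulk write; names are illustrative.
val spark = SparkSession.builder().appName("bulk-write-sketch").getOrCreate()
val df = spark.read.parquet("/path/to/input")   // hypothetical source data

df.write
  .mode(SaveMode.Append)
  .format("jdbc")
  .option("url", "jdbc:postgresql://db-host:5432/appdb")   // assumed connection details
  .option("dbtable", "orders")
  .option("user", "etl_user")
  .option("password", sys.env.getOrElse("DB_PASSWORD", ""))
  .save()
// Each partition is written over its own connection; its rows are committed in bulk.
```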
Our Challenge
Given that, this study started out with a somewhat unusual challenge: Spark, although designed for analytic processing, is used to write data in bulk to a transactional RDBMS. We need to implement Spark jobs that write data of varying sizes, from small to potentially “massive”, to a relational database that is always busy with many concurrent transactional user requests, a characteristic typical of an OLTP system. This poses requirements that Spark’s standard native operations can’t fulfill, and it set us marching on this journey in search of the answer.
Our Solution
We evaluated several alternatives of increasing sophistication (approaches I, II, and III). Finally, we found THE ONE.
Our solution processes each record, at a granular level, in its own isolated transaction. A Spark job thus effectively becomes a (potentially giant) collection of full-blown OLTP transactions. In a sense, we have the Spark job emulate many concurrent OLTP user sessions writing to the database. This is in contrast to the OLAP use case, Spark’s natural style, in which the entire dataset is treated as one (potentially giant) transaction and each individual record is, in some sense, insignificant (*).
Instead, we extract individual records from the Spark dataframe and elevate each to first-class citizen status. Doing so lets us give each record the meticulous, deliberate care this study requires. It is, again, as if one user were working in a session on a user-facing OLTP app, intentionally and explicitly writing a given record to the database (like placing an online order). And for a given Spark job run, we could in effect have millions of such user sessions executing in parallel, via many (e.g. hundreds of) concurrent database connections. By manipulating data at this fine-grained level, we can comfortably tackle all the requirements that Spark’s native operations are not fully equipped to handle.
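To make the idea concrete, here is a rough sketch of per-record transactions inside a Spark job, continuing from the `df` in the earlier sketch. This is not the series’ actual code; the table, columns, and connection details are hypothetical. Each row gets its own commit or rollback, so one bad record does not sink its whole partition.

```scala
import java.sql.DriverManager
import org.apache.spark.sql.Row

// Rough sketch: one connection per partition, one transaction per record.
df.rdd.foreachPartition { rows =>
  val conn = DriverManager.getConnection(
    "jdbc:postgresql://db-host:5432/appdb", "etl_user", sys.env("DB_PASSWORD"))
  conn.setAutoCommit(false)
  val stmt = conn.prepareStatement("INSERT INTO orders (id, amount) VALUES (?, ?)")
  try {
    rows.foreach { row =>
      try {
        stmt.setLong(1, row.getAs[Long]("id"))
        stmt.setDouble(2, row.getAs[Double]("amount"))
        stmt.executeUpdate()
        conn.commit()                 // this record is its own transaction
      } catch {
        case e: Exception =>
          conn.rollback()             // only this record is rolled back
          // per-record failure handling (e.g. reporting e) would go here
      }
    }
  } finally {
    stmt.close()
    conn.close()
  }
}
```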
Our Implementation
With this direction set, we moved forward, aiming at a production-grade implementation. After resolving some initial hiccups (here and here), we were ready to move on to the central development.
To start, we built a series of utilities and supporting facilities. First, we created a utility to retrieve data from SparkSQL and convert it for Jdbc. Second, we constructed a facility providing generic write capability to a Jdbc database. Then, on top of it, we built facilities handling the Update and the Insert individually. Finally, we provided the Failure Reporting capability.
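As an illustration only (the names and table below are invented for this sketch, not taken from the series’ code), the Update/Insert handling and Failure Reporting could fit together roughly like this: try an UPDATE first, fall back to an INSERT, and collect any failure for later reporting.

```scala
import java.sql.Connection
import scala.collection.mutable

// Hypothetical failure record captured for later reporting.
case class WriteFailure(recordId: Long, error: String)

// Try UPDATE first; if no row was touched, fall back to INSERT.
def upsertOne(conn: Connection, id: Long, amount: Double): Unit = {
  val upd = conn.prepareStatement("UPDATE orders SET amount = ? WHERE id = ?")
  try {
    upd.setDouble(1, amount)
    upd.setLong(2, id)
    if (upd.executeUpdate() == 0) {
      val ins = conn.prepareStatement("INSERT INTO orders (id, amount) VALUES (?, ?)")
      try { ins.setLong(1, id); ins.setDouble(2, amount); ins.executeUpdate() }
      finally ins.close()
    }
  } finally upd.close()
}

// One record, one transaction: commit on success, report and roll back on failure.
def writeRecord(conn: Connection, id: Long, amount: Double,
                failures: mutable.Buffer[WriteFailure]): Unit = {
  try {
    upsertOne(conn, id, amount)
    conn.commit()
  } catch {
    case e: Exception =>
      conn.rollback()
      failures += WriteFailure(id, e.getMessage)
  }
}
```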
All of these supporting pieces eventually culminated in the final, fully functioning app that meets all the requirements.
The next post will provide the full code listing for the series.
Parting Words
(parting for this series only, of course)
Hopefully, you’ve enjoyed the journey so far, and hopefully it has given you some ideas, techniques, or even answers to your own problems. If not, at least perhaps you’ve been entertained a bit? If you have any suggestions, comments, or feedback, let me know.
(*) To be more precise, when the dataframe consists of multiple partitions, each partition corresponds to one database connection and transaction. The point remains: the records of each partition are handled in bulk; they all either succeed or fail together.