Where Are We?
So where are we now? Let’s take a step back. We started with the set of requirements for the write operation from a Spark dataframe to a database table: mainly, we are required to support two specific operations (update and insert) that provide three capabilities (partial success, failure reporting, and dynamic schema). After the investigation, we identified approach IV as the direction forward. Fundamentally, it writes each record one at a time using a prepared statement over a manually managed jdbc connection.
For the update operation, which has been discussed fairly extensively so far, the prepared statement is constructed from a SQL string template of the following form:
update transaction_table set string_data_col = ?, number_data_col = ? where unique_key_col_1 = ? and unique_key_col_2 = ?
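To make the binding concrete, here is a minimal sketch of executing this template for one row over JDBC. The column types (Long keys, String and Double data) and the helper name bindAndExecuteUpdate are assumptions for illustration only. Placeholder indices are 1-based and follow the order in which the ?s appear, so the set columns are bound before the key columns of the where clause.

import java.sql.PreparedStatement

// Sketch only: binds one row's values to the update template above and executes it.
// The column types (String, Double, Long) are assumed for illustration.
def bindAndExecuteUpdate(stmt: PreparedStatement,
                         stringData: String,
                         numberData: Double,
                         key1: Long,
                         key2: Long): Unit = {
  stmt.setString(1, stringData) // string_data_col = ?
  stmt.setDouble(2, numberData) // number_data_col = ?
  stmt.setLong(3, key1)         // unique_key_col_1 = ?
  stmt.setLong(4, key2)         // unique_key_col_2 = ?
  stmt.executeUpdate()          // writes exactly this one record
}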
The SQL string template for the insert operation would have the following form:
insert into transaction_table (unique_key_col_1, unique_key_col_2, dataset_id, string_data_col, number_data_col) values (?, ?, ?, ?, ?)
Runtime Binding Values vs Compile Time Literal Values
In fact, the more general form of the above SQL templates contains a mixture of placeholders, represented by the question mark ? and bound to real values at runtime, and literal values known ahead of time. For example, the template used to insert dataset 199 into the transaction_table might look like the following:
insert into transaction_table (unique_key_col_1, unique_key_col_2, dataset_id, string_data_col, number_data_col) values (?, ?, 199, ?, ?)
The dataset id of 199 is known before the SparkSQL job carrying out the Insert operation starts executing; it comes from the event sent to the data pipeline 1 that triggers this job. For the duration of the job, this value is a constant shared by all rows in the resulting insertDF. That is, every row in this dataframe has the same value of 199 in its dataset_id column. Thus, we might gain some performance by baking (hard-coding) the value 199 into the SQL template as above, as opposed to treating it as a placeholder whose value is unknown at pre-compilation time.
This performance gain becomes more considerable when two conditions hold. First, the dataset size is non-trivial. Second, more columns similarly hold constant values (the same value across all rows of the dataframe) that are known at the time the SQL template is built.
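To see the difference concretely, here is a minimal sketch (with the same assumed column types as before) of binding one row against the dataset-199 template above. With the literal baked in, only four placeholders remain, so every row execution skips one bind call and ships one fewer parameter to the database.

import java.sql.PreparedStatement

// Sketch only: with dataset_id = 199 baked into the SQL text, the insert
// template has four placeholders instead of five.
def bindAndExecuteInsert(stmt: PreparedStatement,
                         key1: Long,
                         key2: Long,
                         stringData: String,
                         numberData: Double): Unit = {
  stmt.setLong(1, key1)         // unique_key_col_1
  stmt.setLong(2, key2)         // unique_key_col_2
  // dataset_id needs no binding: it is already the literal 199 in the template.
  stmt.setString(3, stringData) // string_data_col
  stmt.setDouble(4, numberData) // number_data_col
  stmt.executeUpdate()
}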
This reasoning holds for the update SQL template, too. So, in constructing the template, whether for the update or the insert operation, we need a mechanism that lets us selectively bake in a literal value or create a placeholder for runtime binding, as sketched below.
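One possible shape for such a mechanism, sketched below with hypothetical names (ColumnBinding, Placeholder, Literal, and insertTemplate are not from the actual implementation), is a small algebraic data type that marks each column as either a runtime placeholder or a compile-time literal, plus a renderer that turns a sequence of these marks into the SQL text.

// Sketch only; the names here are hypothetical, not from the actual implementation.
sealed trait ColumnBinding {
  def column: String       // column name as it appears in the table
  def sqlFragment: String  // what goes into the values (...) clause
}

// A column whose value is known only at runtime: rendered as a ? placeholder.
final case class Placeholder(column: String) extends ColumnBinding {
  val sqlFragment: String = "?"
}

// A column whose value is a constant known when the template is built.
final case class Literal(column: String, value: String) extends ColumnBinding {
  val sqlFragment: String = value
}

// Render an insert template from the chosen bindings.
def insertTemplate(table: String, bindings: Seq[ColumnBinding]): String = {
  val columns = bindings.map(_.column).mkString(", ")
  val values  = bindings.map(_.sqlFragment).mkString(", ")
  s"insert into $table ($columns) values ($values)"
}

// Example: bake in dataset_id = 199, leave every other column for runtime binding.
val template = insertTemplate(
  "transaction_table",
  Seq(
    Placeholder("unique_key_col_1"),
    Placeholder("unique_key_col_2"),
    Literal("dataset_id", "199"),
    Placeholder("string_data_col"),
    Placeholder("number_data_col")
  )
)
// template is exactly the dataset-199 insert statement shown earlier.

An analogous renderer can produce the set and where clauses of the update template from the same column marks.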
Summary and Next Step
In summary, both the Insert and Update operations share some commonalities:
- They are handed a jdbc connection to the database, managed at the partition level.
- Assuming the SQL string template (whether insert or update) has been constructed successfully, they use the jdbc connection to precompile it into a server-side prepared statement. This step gets everything ready for successive executions across all rows in the partition.
- In the prepared statement, they need to distinguish between placeholder columns to be bound to real values at runtime and columns containing literal values available at compile time.
- Then, during execution, they write each row in the partition (either inserting or updating it accordingly) into the corresponding target record in the table, immediately after binding the values supplied by the row to their correct placeholder columns in the prepared statement.
- After processing the entire partition, they close the connection, completing the job. A minimal sketch of this per-partition lifecycle follows the list.
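Putting the list together for the insert case, a minimal per-partition sketch might look like the following. The function name writeInsertPartitions, the jdbcUrl parameter, and the column types are assumptions for illustration; the real implementation also has to weave the partial-success and failure-reporting logic around each execution.

import java.sql.{Connection, DriverManager, PreparedStatement}
import org.apache.spark.sql.{DataFrame, Row}

// Sketch only: names, column types, and connection handling are simplified.
def writeInsertPartitions(insertDF: DataFrame, jdbcUrl: String, insertSql: String): Unit = {
  insertDF.rdd.foreachPartition { rows =>
    // One manually managed jdbc connection per partition.
    val conn: Connection = DriverManager.getConnection(jdbcUrl)
    try {
      // Precompile the template once into a server-side prepared statement.
      val stmt: PreparedStatement = conn.prepareStatement(insertSql)
      rows.foreach { row =>
        // Bind this row's values to the placeholders, then execute immediately.
        stmt.setLong(1, row.getAs[Long]("unique_key_col_1"))
        stmt.setLong(2, row.getAs[Long]("unique_key_col_2"))
        stmt.setLong(3, row.getAs[Long]("dataset_id"))
        stmt.setString(4, row.getAs[String]("string_data_col"))
        stmt.setDouble(5, row.getAs[Double]("number_data_col"))
        stmt.executeUpdate() // per-row failure reporting would wrap this call
      }
      stmt.close()
    } finally {
      // After processing the entire partition, release the connection.
      conn.close()
    }
  }
}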
This insight translates to several abstract classes shared by both the Update and Insert operations, presented in the next post.