This is a bonus post. It chronicles an activity that doesn’t produce a concrete outcome, yet it reflects the kinds of considerations that emerge at this point in an engineering project.
Motivation
By now, approach IV has been chosen. Approach III was eliminated mainly because it doesn’t meet the Partial Success Requirement and the Failure Reporting Requirement.
However, as mentioned, it might be possible to integrate validations into the TRANSFORM stage of the pipeline to minimize the chance of write failures to MySQL. In that case, approach III might still hold some merit.
This triggers some thoughts. If the updateDF write were somehow ensured to always succeed, that is, if no record ever failed, how would approach III fare against approach IV? Is there any meaningful comparison between the two? Under what circumstances might one perform “better” than the other? It may make sense to investigate whether certain situations, e.g. specific workloads, give one approach an edge over the other. If such situations exist, it might be beneficial to figure out what the characteristics of those workloads are. Then, perhaps, it would be reasonable to implement both approaches in the pipeline, so that for a given run, depending on the runtime conditions and workload characteristics, the appropriate implementation is activated.
This generates the motivation to conduct some tests on the happy case (no record write failure) to compare the two approaches on the issues of concern, which include, among others, Performance and Scalability.
Somewhat Unsatisfactory Test Results
In the real production pipeline I was involved with, we conducted a set of tests at this point. Although their results provided some insights, we deemed them somewhat unsatisfactory, because the tests were mistakenly based on the sub-optimal implementation of the execSQL() function. This mistake hurts the performance of approach IV but not approach III: at runtime, execSQL() is invoked repeatedly in approach IV, once per record, while it is invoked only once for the entire dataset in approach III. The results were therefore likely unfair.
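The original implementation of execSQL() isn’t shown here, so the exact inefficiency is left unstated; a common culprit, assumed here purely for illustration, is a fixed per-call setup cost (such as opening a fresh database connection on every invocation). A toy cost count shows why such a cost penalizes approach IV disproportionately:

```python
# Toy model (hypothetical): a sub-optimal execSQL() that pays a fixed setup
# cost, e.g. opening a fresh connection, on every single invocation.
setup_calls = {"count": 0}

def exec_sql_suboptimal(sql):
    setup_calls["count"] += 1   # per-call setup cost, paid every time
    # ... execute sql against the database ...

n_records = 1_000

# Approach IV: execSQL() is invoked once per record.
setup_calls["count"] = 0
for i in range(n_records):
    exec_sql_suboptimal(f"UPDATE ... WHERE id = {i}")
cost_iv = setup_calls["count"]   # n_records setup costs

# Approach III: execSQL() is invoked once for the single bulk statement.
setup_calls["count"] = 0
exec_sql_suboptimal("UPDATE target JOIN temp ...")
cost_iii = setup_calls["count"]  # a single setup cost
```

With a properly implemented execSQL(), this per-call overhead would shrink for both approaches, but approach IV benefits far more, which is why the original results were considered unfair to it.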
These tests deserve a revisit in the future with a better setup. Note that conducting them rigorously and meaningfully is non-trivial. The reason is that there is a large number of variables, parameters, and configurations (for the Spark cluster, the MySQL database, the test dataset, and so on) that need tuning to reflect real and reasonable working conditions. That’s not to mention the coordination required to make sure the tests are conducted in a relatively isolated and stable environment. For example, the MySQL database, which is usually accessed concurrently by OLTP apps, has to be relatively free, or at least under good control, during the tests. Moreover, each test in the set should be repeated multiple times to guard against outliers.
For this reason, the test results will not be presented here for now. Instead, a brief discussion of the test method follows.
Comparative Anatomy of Approaches III and IV
Given that updateDF typically has multiple partitions, the two approaches proceed differently, as follows.
Approach III
Step 1: Dump data from those multiple partitions in parallel to the temp table.
Step 2: In the Spark driver program, issue a single Update statement that joins the temp table with the target table and updates all the target records in the given dataset at once.
Step 1 is a parallel operation (inserting records into the temp table in parallel via multiple connections). Step 2, on the other hand, updates the entire dataset in one single big SQL statement as opposed to dividing the work into smaller chunks to be carried out in parallel.
So in this approach, a “sequential” operation follows a parallel operation.
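The two steps can be sketched in a minimal, self-contained way using sqlite3 as a stand-in for MySQL (sqlite lacks MySQL’s UPDATE ... JOIN form, so a correlated subquery is used instead; the table and column names, and the data, are hypothetical):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE target (id INTEGER PRIMARY KEY, value TEXT)")
conn.execute("CREATE TABLE temp_updates (id INTEGER PRIMARY KEY, value TEXT)")
conn.execute("INSERT INTO target VALUES (1, 'old'), (2, 'old'), (3, 'old')")

# Step 1: each partition of updateDF dumps its slice into the temp table.
# In the real pipeline this happens in parallel, one connection per partition.
partitions = [[(1, "new-a")], [(3, "new-c")]]
for part in partitions:
    conn.executemany("INSERT INTO temp_updates VALUES (?, ?)", part)

# Step 2: the driver issues ONE statement joining the temp table with the target.
# The MySQL form would be:
#   UPDATE target t JOIN temp_updates u ON t.id = u.id SET t.value = u.value
conn.execute("""
    UPDATE target
    SET value = (SELECT value FROM temp_updates u WHERE u.id = target.id)
    WHERE id IN (SELECT id FROM temp_updates)
""")

rows = dict(conn.execute("SELECT id, value FROM target ORDER BY id"))
# Records 1 and 3 are updated in the single statement; record 2 is untouched.
```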
Approach IV
Only 1 step: issue the Update statement one by one, for each record individually, from all the partitions.
There are multiple concurrent open database connections, one for each partition. Each of these connections executes the Update statements for its partition in parallel with other partitions.
Thus, this approach is a fully parallel operation throughout.
Discussion
Both approaches ultimately share the same bottleneck: the single target database server. While approach IV is fully parallel, it fundamentally executes many “small” statements separately, one record at a time. Approach III, on the other hand, carries out the main work in one single “large” bulk statement that updates all records, a use case the MySQL engine should generally be well optimized for. Thus, how the two approaches fare against each other may depend on various factors: workload characteristics, runtime conditions, and aspects specific to the environment and system configuration.
Tests’ Dimensions
A set of tests was conducted to get a sense of the Performance of the two approaches relative to each other. They were also designed to examine the approaches’ Scalability characteristics (and perhaps where their limits might lie). Because, operationally, the significant difference between these approaches is their level of parallelism, one dimension along which to contrast their Performance is the level of parallelism. Scalability, on the other hand, concerns the load, i.e. the size of the dataset. Thus, the tests have these 2 controlled dimensions:
- Level of parallelism: the updateDF’s number of partitions, which translates to the number of cores and nodes required in the Spark cluster. This dimension also dictates the number of open concurrent connections to the MySQL server (for step 1 of approach III, and the entirety of approach IV).
- Dataset’s size: the total number of records in the dataset to be updated.
Tests’ Method
First, determine a reasonable range for each of the 2 dimensions. Then, fix one dimension at a reasonable value within its range, vary the other dimension across its determined range, and conduct the tests along the way. Do this for both dimensions. Each test measures the elapsed time of the write operation, which is the measure most readily and conveniently available. For approach III, this means the elapsed times of steps 1 and 2 are measured separately.
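A small timing helper in the spirit of this method might look like the following; the helper itself is generic, and the commented-out calls refer to hypothetical write functions (dump_to_temp, run_bulk_update, write_per_record) that are not part of the original pipeline:

```python
import time

def timed(label, fn, *args):
    """Run fn(*args), print and return its wall-clock elapsed time."""
    start = time.perf_counter()
    result = fn(*args)
    elapsed = time.perf_counter() - start
    print(f"{label}: {elapsed:.3f}s")
    return result, elapsed

# Approach III is measured as two separate steps:
#   _, t_dump = timed("III step 1 (dump to temp)", dump_to_temp, update_df)
#   _, t_join = timed("III step 2 (bulk update)", run_bulk_update)
# Approach IV is a single measurement:
#   _, t_iv = timed("IV (per-record updates)", write_per_record, update_df)
```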
The range for the Level of Parallelism is set at concrete discrete points from a few partitions to more than 100. The range for the Dataset Size is set at concrete discrete points from tens of records to a few million.
The fixed value for the Level of Parallelism is chosen so that it is neither too small, which would obscure the Performance difference between the two approaches, nor too large, which would leave the cluster badly underutilized at the lower end of the Dataset Size range.
The fixed value for the Dataset Size is chosen so that it is neither too small, which would leave the higher end of the Level of Parallelism range underutilized, nor too large, which would overwhelm the lower end of that range.
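The selection logic above can be sketched as a test matrix. Every concrete number below is illustrative only, chosen merely to match the described ranges (a few to more than 100 partitions, tens to a few million records), not the values used in the real tests:

```python
# Hypothetical test matrix; all concrete values are illustrative.
PARALLELISM_RANGE = [4, 16, 64, 128]              # a few to over 100 partitions
DATASET_RANGE = [50, 10_000, 500_000, 2_000_000]  # tens to a few million records
FIXED_PARALLELISM = 16      # neither too small nor too large, per the criteria
FIXED_DATASET = 500_000

# Vary one dimension while the other is held at its fixed value.
configs = (
    [(FIXED_PARALLELISM, n_records) for n_records in DATASET_RANGE]
    + [(n_partitions, FIXED_DATASET) for n_partitions in PARALLELISM_RANGE]
)
for n_partitions, n_records in configs:
    # Run both approaches at this configuration, repeated several times
    # to guard against outliers.
    pass
```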
If these tests are to be re-conducted, we might need to revisit all these parameters.
Assessment of Tests’ Results
To have a sense of the relative Performance between the two approaches, compare their results in the same test configuration (same setting for the 2 dimensions).
For a given approach, to get a sense of its Scalability, compare its results across different test configurations. For instance, when fixing the Level of Parallelism and varying the Dataset Size, observe whether the elapsed time varies proportionally with the dataset size, and vice versa.
The tests’ results themselves, as explained earlier, were deemed not entirely satisfactory. If these tests are to be re-conducted, the new results might be presented later.