Tuesday, July 20, 2021

Upsert Parquet Data Incrementally

Incremental data loads are straightforward nowadays: Databricks Delta lets us upsert and delete records efficiently in data lakes. It is more tedious, however, to emulate a function that upserts a plain Parquet table incrementally the way Delta does. In this article I'm going to throw some light on the subject.

Hadoop follows the WORM model (write once, read many times), so we can't update or delete rows in a Parquet file in place. The question, then, is how to handle restated data: when last week's figures change in the current week, we need to replace the affected rows in the master table with the latest values.
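
Conceptually, the upsert boils down to: keep every master row whose key does not appear in the new feed, then append the feed. Here is a minimal local-R sketch of that logic using dplyr; the function name `upsert_df` is my own illustration, not a library function:

```r
library(dplyr)

# Sketch of an upsert: rows in `updates` replace master rows that share
# the same key(s); brand-new keys are simply appended.
upsert_df <- function(master, updates, keys) {
  master %>%
    anti_join(updates, by = keys) %>%  # drop master rows being restated
    bind_rows(updates)                 # append the latest versions
}
```

With sparklyr table references the same `anti_join` verb applies, though the append step would use `union_all()` instead of `bind_rows()`.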

In this example we'll process the historical and latest data together before overwriting the existing table. For large data sets, however, rewriting everything hurts performance, so the input should be segregated in such a way that only the partitions containing a change get rewritten. Careful data wrangling is the key to that.
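
One way to get partition-level rewrites in Spark is dynamic partition overwrite mode, which replaces only the partitions present in the incoming data. The sketch below uses sparklyr; the table path and the `week_id` partition column are assumptions for illustration:

```r
library(sparklyr)

config <- spark_config()
# Overwrite only the partitions present in the incoming data;
# untouched partitions of the Parquet table are left as-is.
config$spark.sql.sources.partitionOverwriteMode <- "dynamic"
sc <- spark_connect(master = "local", config = config)

# `changed_tbl` is assumed to contain only rows for partitions that changed.
spark_write_parquet(
  changed_tbl,
  path = "hdfs:///data/master_sales",  # hypothetical path
  mode = "overwrite",
  partition_by = "week_id"
)
```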

In the example Data Frames below, assume Table 1 is our previous week's data and Table 2 has been received in the current week. The significant point is that the sales value for week ID 260 changed in Table 2, and we have to carry that change into the master data.
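
A pair of toy Data Frames along those lines might look like this; the column names and sales figures are invented for illustration, and only the restated week 260 row matters:

```r
# Table 1: last week's master data.
table1 <- data.frame(
  week_id = c(258, 259, 260),
  sales   = c(1200, 1350, 1500)
)

# Table 2: current-week feed; week 260 is restated from 1500 to 1425.
table2 <- data.frame(
  week_id = c(260, 261),
  sales   = c(1425, 1600)
)
```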

Let's prepare three test Data Frames and apply the upsert function to the first two. Using the "testthat" R library, we'll compare the result with the third Data Frame.
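
Assuming the `upsert_df` sketch and the toy tables above, the third Data Frame is the expected master after the restatement, and the check could look something like this:

```r
library(testthat)

# Expected master after the upsert: week 260 carries the restated value.
expected <- data.frame(
  week_id = c(258, 259, 260, 261),
  sales   = c(1200, 1350, 1425, 1600)
)

result <- upsert_df(table1, table2, keys = "week_id")

# Sort both frames and reset row names so only the content is compared.
result   <- result[order(result$week_id), ];     rownames(result)   <- NULL
expected <- expected[order(expected$week_id), ]; rownames(expected) <- NULL

test_that("upsert keeps the restated week-260 sales value", {
  expect_equal(result, expected)
})
```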