A while back I needed to tap into third-party JSON data and convert it for use within my own application, ETL-style.
I naively expected that ingesting that data would just mean writing a few lines of JavaScript, but quickly realized that data in the real world is messy: Because the JSON is entered manually, without much in the way of automated validation, records don’t adhere to a strict schema. Consequently, their fields might show up inconsistently, meaning the conceptual model only reveals itself when analyzing the entire data set to detect commonalities. Similarly, values might be assigned arbitrarily (e.g. blank values could be null, empty strings or simply omitted). Because the data set continues to evolve, its shape might shift over time as entities and attributes are added, removed or even reinterpreted – often after months of inactivity.
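For illustration (both the field names and the values are made up), a handful of records from such a source might look like this:

```json
[
    { "title": "A Book", "author": "Jane Doe", "pages": 123 },
    { "title": "Another Book", "author": "", "pages": "123" },
    { "title": "Yet Another Book", "author": null }
]
```

Here the same conceptual field shows up as a number, a numeric string or not at all, and a blank author is sometimes an empty string, sometimes null.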
Thus my initial, naive implementation quickly ballooned into a complex web of imperative validation code. This prompted me to look into tools like joi, but that didn’t significantly improve complexity or maintainability. What I really want are declarative, easy-to-scan definitions of both the entities within that JSON data and the corresponding transformations into my own model.
Let’s assume we want to collate various book databases; a definition for one imaginary (and comparatively simple) third-party data source might then be sketched out like this:
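The following is a rough textual rendering, with the incoming data on the left and our target model on the right; the field names are merely illustrative.

```
incoming data                           target model

title     string             ---------> title    string
author    string | blank     - - - - -> authors  [string]
pages     number | "123"     - - - - -> pages    number | null
          | blank
editions  [object]                      (not carried over)

(dashed lines: field might be blank or missing)
```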
This lays out our expectations regarding incoming data on the left, both in terms of structure and value types, along with the corresponding target model on the right. Due to the aforementioned inconsistencies, we have to account for varying value representations and recognize certain fields as being optional (dashed lines) on the left-hand side, but can consolidate values within our target model: A number might also be wrapped in a string ("123"), blank (null or an empty string) or missing entirely, but always ends up either as a number or a null value on our end. Similarly, a blank author field ends up as an empty authors array.
Thus the target model is a simplified, normalized projection of the original data, which means we don’t have to worry about gross inconsistencies in further processing. Because we’re not currently interested in the editions field, we omit it from the target model. It might seem odd that we’re expressing expectations on data we end up discarding, but that’s on purpose here: It’s quite possible that future additions will contain information we’re interested in, so we want to be notified when that happens.
It would be nice if we could easily recognize those structures in our code as well – indeed, it’s kind of essential for the long-term viability of a project that we only return to a few times per year. This is how declepticon evolved from my initial implementation:
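The snippet below is only an illustrative sketch of the shape such a descriptor takes: apart from fields, slots and eager, which mirror the concepts discussed next, the helper functions are stand-ins defined for this example rather than declepticon’s actual API.

```js
// Illustrative sketch only; the validators below are local stand-ins,
// not part of declepticon.
let isString = value => typeof value === "string";
let isBlank = value => value === null || value === "";
let optional = validate => value => value === undefined ||
        isBlank(value) || validate(value);
let numeric = value => typeof value === "number" ||
        (isString(value) && /^[0-9]+$/.test(value));
let eager = convert => Object.assign(convert, { eager: true }); // mark slot for early population

let descriptor = {
    name: "book",
    fields: { // expectations regarding incoming data
        title: isString,
        author: optional(isString), // might be blank or missing
        pages: optional(numeric), // 123, "123", blank or missing
        editions: Array.isArray // validated, but not carried over
    },
    slots: { // transformations into our own model
        title: eager(record => record.title),
        authors: record => record.author === undefined || isBlank(record.author) ?
                [] : [record.author],
        pages: record => {
            let pages = parseInt(record.pages, 10);
            return Number.isNaN(pages) ? null : pages;
        }
    }
};
```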
Here fields describes the third-party JSON data and the assumptions we make about it, while slots defines the corresponding transformations into our own model. (eager ensures that a slot is populated early on so it can be used within the respective string representation; see the warning messages below.) We can then apply this descriptor to validate incoming data and turn it into model instances:
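The following is only a minimal stand-in for that step; the transform function here is made up to illustrate the flow, reusing the descriptor sketched above, and is not declepticon’s actual entry point.

```js
// Stand-in for the validation/transformation step, just to illustrate the
// flow; `eager` handling and string representations are omitted.
function transform(descriptor, record) {
    Object.entries(descriptor.fields).forEach(([field, validate]) => {
        if (!validate(record[field])) {
            console.warn(`${descriptor.name}: invalid value for \`${field}\`: ${record[field]}`);
        }
    });
    Object.keys(record).forEach(field => {
        if (!(field in descriptor.fields)) {
            console.warn(`${descriptor.name}: unexpected field \`${field}\``);
        }
    });
    return Object.entries(descriptor.slots).reduce((instance, [slot, convert]) => {
        instance[slot] = convert(record);
        return instance;
    }, {});
}

let book = transform(descriptor, {
    title: "A Book",
    author: "Jane Doe",
    pages: "123",
    editions: []
});
// → { title: "A Book", authors: ["Jane Doe"], pages: 123 }
```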
Any discrepancies will be reported:
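For instance, feeding the stand-in above a record that has drifted (a non-numeric pages value plus a newly added publisher field) produces warnings along these lines; the wording is that of the sketch, not declepticon’s.

```js
transform(descriptor, {
    title: "A Book",
    author: "Jane Doe",
    pages: "unknown",
    editions: [],
    publisher: "ACME Press" // newly added upstream
});
// console output:
//     book: invalid value for `pages`: unknown
//     book: unexpected field `publisher`
```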
This tells us that we should update our descriptor and perhaps even incorporate those newly added fields into our own model.
So declepticon essentially provides a primitive type system (or, alternatively, a rich data-driven specification language), enforced at runtime. I wish there were an existing solution I could have used, but the alternatives I’d found proved unsuitable for this particular objective.
If you happen to have use for it, please let me know how declepticon works for you – either down in the comments or on GitHub.