Monday 13 February 2023

Confusing words in Data Modeling

For a few weeks now, I have been helping out my friends at Hackolade with some really interesting work around their core product, the Hackolade Studio and how to make it even more successful in the marketplace. This means talking to a LOT of people - both current customers, active users, partners, friends in the industry - and more. It’s been fantastic to talk to so many interesting people, and to learn so much from their insights.

During these conversations, I have noticed that there’s quite a bit of work to be done to clarify and straighten out the meaning of the words that we use in these conversations. I have noticed that in the NOSQL data modeling space, we are not always very precise with our words - and that this imprecision can lead to all kinds of misunderstandings. Specifically, I have been struck by the confusion around 3 words: model, schema, and metadata.

This is not a confusion that is specific to my conversations of the past few weeks. A quick google search will show you that there have been literally millions of articles and conversations online asking the same question - and this has been going on for a very long time.


Again: I don’t think that this is a new conversation. However, it does seem to me that the confusion is kind of new - it does seem to have become more acute. In the old days of Relational data modeling, people seemed to have a much clearer view on what was meant with modeling terms. Specifically, people always talked about
  1. The conceptual data model: defines what the system contains, and is about collecting and agreeing on a shared vocabulary for the domain. Typically, data architects and business stakeholders are the users who create conceptual data models. These types are built with the intention to organize and define business concepts, the relationships and the attributes for these concepts.
  2. The logical data model: defines the business rules and how a system has to be implemented. It is not specific to a database. Users who generally create LDMs are data architects and business analysts. These models are used as a foundation for physical data models because they establish the structure for the data elements and identify the primary keys and foreign key relationships, .
  3. The physical data model: how the system will be implemented to a specific database management system. This is where the database designers and developers come in to create physical data models that are expansions of the logical and conceptual models. These models are designed for technical purposes that support the business objectives.
Note that in these three modeling terms, we never ever mentioned the word “schema”. That’s completely normal and intentional: models are very different from schema. Models are visual, documentation artefacts that are intended for humans, and that are specifically designed to facilitate conversations about data structure between stakeholders. Schema, on the contrary, are outputs of the physical data modeling exercise – and represent a machine readable contract with regards to the structure of the data as it will be guarded and enforced by the database management system.

Now here’s the issue: in the world of NOSQL, where polyglot persistence architectures are the rule and the norm and where many different data backends will be hosting many different types of data in many different data models, we seem to have forgotten some of the above terminology and we are getting confused by sloppy interpretations about these terms.

What about model / schema / metadata?

So what is the issue? Well first of all, I think that people are using these terms interchangeably way too easily. I have spoken to very smart people in our industry that sometimes say one thing but mean something else, and then in the next sentence get very specific around the precise meaning that they are trying to articulate. Let’s make up our mind, shall we!

In general, I do think these three terms articulate very different concepts in the data modeling profession, and that as such, we should take deliberate care when using them. I know I am really just a novice at this, but from my point of view
  • The database model is a graphical representation of the structure of the data that we are going to store in a database. As I outlined above, we would refer to “conceptual data models”, “logical data models” and “physical data models”, in traditional relational database management architectures. Models would typically be represented in some entity-relationship diagram, as the model is primarily a representation of how the software developer understands the business reality of the user. It’s the documentation of that conversation between the business and IT.
  • The database schema is the (set of) technical artefacts(s) that can be the result of the physical modeling work, and which administrators / devops staff will use as they are tasked with the day-to-day operation of the software system, to enforce the rules that we have formalized for this system. It’s a set of formalized rules that you will not be allowed to break easily inside the data model: things like uniqueness, optionality, etc. Typically, these rules will be formalized in a “data definition language” (DDL), which will be the output of the physical data model that a database management system will enforce.
  • The database metadata encompasses both the model, the schema and more. Specifically, it would also include some level of formal documentation of the model and the schema, which would be properly written and maintained on a regular basis. The metadata of a database model is the comprehensive set of everything that we would need to demonstrate the proper governance of our data modeling efforts to whoever wanted to know.
Positioned in a simple Venn diagram, I would articulate the three different words and their relation to one another like this:

Meaning: the database METADATA contains the database MODEL and the database SCHEMA, both of which are touching eachother as they are very much connected. In that touchpoint, database architecture would be very important, as it would ensure that the rift between model and schema does not become too wide.

That brings me to the end of this article. In writing it, I hope I have clarified some topics that have clearly been unclear in the NOSQL Data Modeling space, and that struck me during my first couple of ventures learning the important topics that the industry is struggling with. I am sure that this is not the end of the journey – I plan to explore these topics more in future article.

As always: let me know if you have any thoughts about this article – I would love to discuss.

All the best

Rik

No comments:

Post a Comment