The evolution of the modern data platform
This blog is a preview of our thought leadership paper ‘The evolution of the modern data platform’.
The history of the data warehouse dates back to the 1960s and 1970s, when Bill Inmon and others started discussing the practice of data modelling. It set out to address problems such as data redundancy, duplication, poor data integrity and the associated costs. Historically, data was stored and duplicated across multiple decision points within an organisation, which undermined the integrity of the information being shared. The data warehouse aimed to resolve these issues by consolidating and centralising data in a trusted, holistic storage layer.
As time moved on, different data warehouse modelling practices took shape to address challenges around performance, time to delivery and scalability. From Ralph Kimball's dimensional modelling to Data Vault, the different ways of storing data became the data debate of the time.
However, everything changed with the technology explosion that came with big data. Technologies could now ingest disparate types of data incredibly fast and return answers from specific, optimised queries more quickly than holistic data modelling allowed. New data capabilities and architectural methods continue to develop. This point of view walks through the data evolution to explore where we are today with the modern data platform.
The data evolution 1 – relational database management systems
Data warehouses sought to reduce duplication of data, improve performance and lower the associated costs. Before data modelling was a consideration, organisations moved data from applications into siloed stores to gain access to it. This practice is still seen today: business units that need to apply calculations to their data pull it into individual stores, apply those calculations and keep the results locally for future use.
Business teams such as actuarial analysts or finance accountants still store data in Excel spreadsheets and Access databases, building macros to output calculated metrics which they distribute back to consumers of data. The speed at which those teams can produce results leads them down this path.
However, these siloed data stores, with little to no governance, have historically caused many challenges: poor data quality, redundant data, duplication, little or no security, limited accessibility… the list goes on. Consolidating data within a central layer, the data warehouse, addresses these challenges. Governance can be applied across the estate, regardless of which data modelling method is chosen, and trust is instilled.
The data model
At a high level, data was extracted from the application database on a schedule and moved into a predefined data model. That data model was designed for a relational database management system (RDBMS) and was schema-bound on write: data had to conform to the defined structure before it could be loaded.
If an organisation decided to build its data warehouse using the Kimball method, it would take a “bottom-up” approach. Domain-level dimensional data marts would be built, and together they would form a central data warehouse through shared dimensions. If that organisation decided to build its data warehouse “top-down”, with the Inmon approach, the full data warehouse model would be designed first, usually as a highly normalised (3NF) model, and data marts would be created off that holistic model. A Data Vault model works in much the same way, with dimensional marts developed off a highly normalised hub / link / satellite model.
All these methods were designed around the optimisation capabilities of the RDBMS, which include various indexes, statistics, materialised views and partitioning. Technology vendors focused on creating tooling that covered the breadth of capabilities organisations needed, becoming “the jack of all trades, the master of none, but oftentimes better than a master of one.”
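To make the schema-on-write and shared-dimension ideas concrete, here is a minimal sketch using Python's built-in sqlite3 module as a stand-in for a warehouse RDBMS. The table and column names (dim_customer, fact_sales, fact_claims) and the helper functions are hypothetical and chosen only for illustration. The sketch shows two Kimball-style fact tables, one per data mart, sharing a single customer dimension, with the database rejecting rows that do not fit the predefined model.

```python
import sqlite3


def build_star_schema(conn):
    """Create a hypothetical customer dimension shared by two fact tables."""
    # SQLite only enforces foreign keys when this pragma is switched on.
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript("""
        CREATE TABLE dim_customer (
            customer_key  INTEGER PRIMARY KEY,
            customer_name TEXT NOT NULL
        );
        -- Sales mart fact table
        CREATE TABLE fact_sales (
            sale_id      INTEGER PRIMARY KEY,
            customer_key INTEGER NOT NULL REFERENCES dim_customer(customer_key),
            amount       REAL NOT NULL CHECK (amount >= 0)
        );
        -- Claims mart fact table, reusing the same dimension
        CREATE TABLE fact_claims (
            claim_id     INTEGER PRIMARY KEY,
            customer_key INTEGER NOT NULL REFERENCES dim_customer(customer_key),
            claim_amount REAL NOT NULL CHECK (claim_amount >= 0)
        );
    """)


def try_insert(conn, sql, params):
    """Attempt a write; return False if the schema's constraints reject it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(sql, params)
        return True
    except sqlite3.IntegrityError:
        return False


def totals_by_customer(conn):
    """Combine both marts through the shared customer dimension."""
    return conn.execute("""
        SELECT d.customer_name,
               (SELECT COALESCE(SUM(s.amount), 0) FROM fact_sales s
                 WHERE s.customer_key = d.customer_key),
               (SELECT COALESCE(SUM(c.claim_amount), 0) FROM fact_claims c
                 WHERE c.customer_key = d.customer_key)
        FROM dim_customer d
        ORDER BY d.customer_key
    """).fetchall()


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    build_star_schema(conn)
    try_insert(conn, "INSERT INTO dim_customer VALUES (?, ?)", (1, "Acme"))
    try_insert(conn, "INSERT INTO fact_sales VALUES (?, ?, ?)", (1, 1, 100.0))
    try_insert(conn, "INSERT INTO fact_claims VALUES (?, ?, ?)", (1, 1, 25.0))
    # A fact row pointing at an unknown customer is rejected on write.
    print(try_insert(conn, "INSERT INTO fact_sales VALUES (?, ?, ?)", (2, 99, 50.0)))
    print(totals_by_customer(conn))
```

The trade-off this illustrates is the one described above: the model guarantees consistency at load time, but every table and constraint has to be designed before any data can land in it.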
IT needed to build a data warehouse model that would be subject-orientated, time-variant, non-volatile and integrated (Inmon, 1995). Whichever modelling method was chosen, an integrated, holistic data model had to be developed, and that took time. The focus was on the full centralised platform rather than on the specific questions end consumers were asking. In the time it took to develop the full model, data silos were being built and shadow IT formed within domains.
Should the data warehouse project fail, not only had data silos been built in the meantime, but the failed project also carried large costs, which defeated the purpose of building the warehouse in the first place.