Disclaimer: Post was initially published on datacraze.pl domain. It was migrated, as previous domain is going to be driven in Polish.
10 important elements of data warehouse, elements relevant for you if you are working with data, you are part of data engineering team, or your are generally interested in this topic. Below list is not a comprehensive guide to Data Warehousing it is a subjective list of important components that should be considered in each data warehousing project.
Without further ado, enjoy the read, and let me know your thoughts.
1. Quality Screens
I’ve encountered this term in one of fundamental books for Data Warehousing – Kimball & Caserta -The Data Warehouse ETL Toolkit. To put it simply it is just a test or better a set of tests, that are integral part of your ETL (extract – transform – load) process.
Quality Screens can be divided into 3 groups:
- Columns Screens
- Structure Screens
- Business Rules Screens
Columns Screens, are to check for instance if values are not null, greater than X characters (yeah, I know you can force by on the CREATE TABLE level).
Structure Screens, validates relations between tables like for instance 1 to many, foreign / primary keys etc.
Business Rules Screens, all agreed with business stakeholders rules, for example business unit codes in certain format etc.
There are 2 expected results: success or failure – hello captain obvious 🙂
With failure you as a designer of whole process can either, push erroneous row to Error Tables (if you have them) or “kill” the whole process – this is “by the book” solution. There is also a 3rd option when it comes to failure execution, you will not find it in the books – continue with the process and insert wrong rows to final tables, but please don’t do it you’ve set quality screens for a reason.
Screens are useful and a very handy feature in every Data Engineer toolkit, but beware everything comes with its price. Quality screens are additional step in an overall transformation process, and by design they are meant to process each and every row that is being loaded.
Want to know more about quality screens, please check this Whitepaper by Kimball Group.
2. External Parameters File / Table
If you are keeping parameters / variables inside of you code base as a “hardcoded” fragments – shame on you! This is a joke, but frankly, such cases should be avoided, you can either use external configuration files, or keep variables as environment variables or have a separate table inside your data warehouse solely for the purpose of maintaining all the necessary configuration variables.
What are the benefits of such approach you may ask – to make life easier lets focus on a configuration table inside of data warehouse:
- single place for all necessary configuration – known by every Team member;
- ease of use – changing, adding or removing existing parameters is just one insert or delete away;
- ease of expansion – if you want to expand existing configuration set by new parameter (column) you can easily add it to existing table, without any need to change already available settings;
I’ve not mentioned here version control of configuration parameters, there is separate point about that below.
3. Team and Its responsibilities
Heart and soul of each data warehouse. Good team will save even the most terrible day. This point may sound trivial, but believe me you can have top-notch technologies under the hood, bulletproof ETL process, best in class analytics / visualization layer, but at the end there is a group of people behind it – developing, analyzing, designing, supporting and pushing the boundaries even further.
Team is important, no doubt about it, but when you take a closer look the most important factors around the Team is feeling of ownership, synergy (honesty) and clear roles and responsibilities.
Once those factors are established properly, you will most likely get following benefits:
- self-organizing (watch out buzzword detected) team, that can share tasks to be done between each other without a hassle;
- clear expectations as to what is the role and path for each person in Team;
- honesty between team members – it won’t be always sunshine and rainbows, thus honesty of opinion is important to stop think about the problem, solve it and move forward. Without synergy or honesty Team will work – for some time – until it cracks and all the hell will broke loose, at that time you may expected hidden grudges to pop out, blaming and lots of negative emotions;
- support between team members in case of problems, without explicit need to ask somebody out;
- group of people that are doing their best to create best in class data warehouse solution, that are not afraid to raise tough questions but also to work together to solve them.
You can find more about Team and its possible structure in This Article.
There is also a good presentation about “Essentials for Building and Leading Highly Effective Development Teams by Tugberk Ugurlu @ DevConf”.
More about Mission of a ETL Team in upcoming post of mine, please signup to newsletter if you feel interested.
4. Up to date data connectors to external sources
We are living in the world of interconnected stuff. There is a lot of products that are generating, storing and transforming data that you and your company can add into data warehouse to get a useful insights.
Starting from existing “old” players: ERP (Enterprise Resource Planning) system, CRM (Customer Relation Management), to “new” guys: Marketing data, Web Analytics (web page visits, mentions of your brand in the internet etc.), ending on data from sensors and different IoT (Internet Of Things) devices.
Not getting into technical details, as you may be using different technology stack, without a doubt most of us are using some sort of data connectors in our data ingestion process. Be it ODBC / JDBC drivers, 3rd party delivered connectors for examples for SAP, Salesforce products, REST API connectors etc. In most of the cases – at least from my experience – once installed on a server such connectors are rarely updates (unless something fails :)), that is a mistake.
What you may be losing, not updating you data source connectors:
- lack of support for new features introduced in the products that you are getting data from – new SQL structures, functions etc.;
- lack of support by product provider – it may happen that company responsible for product that you are getting data from, may not be supporting errors with outdated data connectors – although in most of the cases versions should have backward compatibility;
- slower – or maybe better – missing an opportunity to have better query / extract performance;
- potential errors during data extraction if some functions / features previously used were deprecated or removed from product;
- unexpected loss of data during extraction. It may happen that during an expected load of 20 000 rows only 10 000 will be loaded into your staging area of data warehouse. If in addition you don’t have any sort of Quality Screens (described in point 1), you may found this error someday in the future, not during a particular failed extraction.
Consider keeping data connectors up to date, but as with each potential upgrade, please remember about backing up existing configuration. make changes first on test / integration / development environments, validate if ETL process is working smoothly after upgrade and data is being populated and treated in the same fashion.
5. Consistent architecture between environments (development / uat (user – acceptance – testing / production)
Continuous Integration / Continuous Delivery / Version Control are standing in the forefront of “modern” development process, yet I can still hear that this table is different on UAT than on Production, yet it shouldn’t be. That “development” branch is not up to date with “production” version etc.
It shouldn’t make problems on production environment, although it definitely pisses off development team or release team (if it’s a separate Team in your setup).
In most of the cases this is a human or rather should be process error, that could be avoided.
If you happen to have such cases, review your release and development process, check for potential spots that can be misinterpreted and fix them. Besides there are multiple tools / products on the market that can help with maintaining consistent architecture between environments (apart of already mentioned standard CI/CD or Version Control).
Some of such tools:
- Flyway – https://flywaydb.org/
- LiquiBase – http://www.liquibase.org/
- Couchmove – https://github.com/differentway/couchmove
6. Repository of DDL’s and other script files (.SQL, Bash / Powershell)
This point is somewhat similar to point 5, but a bit more general. To start up if you can afford to have backups every single minute, and you can restore whole environment from that backup in a reasonable time even for developer environment, than you may argue if keeping all scripts in common version control repository is for you. Although it doesn’t cost a lot and doesn’t require a lot of space, so still please consider that 🙂
It’s not an article about benefits of having version control in processes, but you should definitely consider a single place where you will store all your data warehouse scripts. Just in case you should have a possibility to get back to any particular version of the scripts (and even if you don’t want to use version control SVN or GIT, and you will keep your files as XYZ_V1.sql / XYZ_V2.sql, that’s fine, but please keep them).
It may be useful for you developers to easily recreate environment on local machine, or for you continuous integration process for cycling rebuilds of environments and checking if everything is in order. There are many more positive sides of it, you can easily think about others don’t you?
One thing to bare in mind, it’s important to think about the structure of such repository. In time it will grow bigger, you may get files that you’ve not considered previously that you will have in your structure. You may want to use it as a base place for CI/CD processes. Creating best possible structure for your repository may be time consuming at the beginning but will save your time and stress later on.
Example of such framework (suited for Qlik Business Intelligence product, but can be well adjusted to your needs): QDF (https://community.qlik.com/groups/qlikview-deployment-framework).
7. Testing processes – unit tests, integration tests, regression tests
In the first point of this article – Quality Screens – we’ve discussed approach to test data during whole ETL process, this point on the other hand is dedicated to testing your code before even getting into data extraction layer.
I don’t think that introducing the concept of testing itself is needed, we all know about testing right? Before we jump into getting the data from source systems, loading to data warehouse, transforming, massaging, adjusting etc. It would be worth to spend some time on building even simple unit tests for the particular phases of the processes.
Let’s assume that you are going to extract 5 columns from a data source table XYZ into Data Warehouse staging / extraction layer, you can prepare a simple test, where you will test:
- creation / delete of a table in staging / extraction layer;
- insertion of records to newly created table;
- update of records;
- deletion of records;
- data types of inserted records;
You get my point right? If you have consciously decided not to implement Quality Screens into your Data Warehouse, not implementing even simple unit tests will be a mistake. You will most likely end up in a Production Outage Purgatory, where you won’t be handling errors during ETL process, and you will continuously struggle with incidents with production environment, that may lead to a trustless source for your business users and that you would want to avoid at all costs.
There are number of tools that you can use for testing purposes, just to name a few:
8. Audit tables, monitoring and alerting of audit tables
This point is suitable for whole bunch of separate articles, but to make a couple of short sentences here.
An article without a fancy quote is not a good article, so here it goes:
[blockquote align=”none”]“If you don’t know where you make your mistakes, that’s your worst mistake: not knowing where your mistakes are at.” Meek Mill [/blockquote]
Audit tables, are tables where you store “metadata” informations about everything that is happening during ETL process (or any other process that is happening in Data Warehouse). They can give you an information, how fast tables are growing, what is the average size of load (in terms of amount of records), abnormal records insertion, sky is the limit here.
Having audit tables in place is one thing, other thing is to make them work for you without much of a manual hassle from your side. I’m talking here about proactive (and reactive) monitoring and alerting. Reviewing all of the audit tables can be a tough and time-consuming job, but hey you can be smart.
Think about couple of critical points that you need to monitor, points that may cause most of the problems or that are the most critical in terms of quality for your business stakeholders (pro tip: keep in mind Pareto principle – 80/20 rule – in our case 20% of cases is causing 80% of problems – focus on these crucial 20%).
If you have come up with critical points to monitor, you can easily creating a home-grown solution to send automated e-mail notifications, or any other maybe 3rd party provided tools that can help you out with this task.
Audit Tables same as Quality Screens or Testing, are bringing an overhead to your overall process, there is no doubt about it, but same for everything in life, it all has its price. If you will be willing to pay this price, you will later on reap the benefits of having such solution in place – proactive monitoring and communication with business stakeholders, possible fraud detection etc.
More about audit tables and data quality architecture in following Whitepaper.
9. Health of tables, indexes, partitions etc.
This is a tricky point as you may not have an administrator access to your data warehouse. There might be a separate DBA (Database Administration) Team dedicated for this function.
However if you do have an access to tables, indexes, partitions etc. metadata, meaning you are able to check size of table, index fragmentation etc., than to keep the professional approach, you definitely should monitor health of your platform.
You can either come up with automated or semi-automated mechanism, or you can do it manually from time to time (definitely make it automated), but scripts for this processes done once, can be beneficial many times in the future.
Especially, when your data warehouse will grow, tables size will increase rapidly, indexes may be defragmented, and thoughtfully chosen partitions might be not relevant with existing volume or usage of data.
Some useful informations about database health based on SQL Server: StackOverflow.
10. Known and described data lineage
Data Lineage, do you know where a particular field from the source is being used? If not, you are screwed, maybe not yet, but you will be, believe me – been there done that.
There will be a day that someone – either from a Development Team or a Business Stakeholder – will come and ask, “From where exactly we are getting this data – pointing to a particular column”. If you’ve done your homework right, you will have your data lineage “tool” to help you with such questions, if not, then yeah go ahead CMD+F / CTRL + F and go through your version control (hopefully you have one) to look for specific field.
Below is a subset of benefits that comes with well maintained Data Lineage:
- ease of use when it comes to finding the usage of particular fields;
- time saving – for your team, for you and for your business partners (if you will make it public) searching for source places, end usage places etc will take less time;
- helpful for architects and analysts in your team – to find out whether something is already used, and from where; to check what sources and fields from that sources builds conformed dimensions etc.;
- helpful during audit procedures – especially relevant in enterprise organization, that are under audit regulations. Access to Data Lineage can be regulated, thus you can grant rights to read the lineage to different parties, even “trusted” 3rd parties if needed;
- critically important during acquisitions – if your company has been bought or you’ve bought another company, integration process can be accelerated with well maintained data lineage.
More about Data Lineage (a bit outdated but still relevant presentation): Presentation Link
Above points are just a subset of elements that you as a Data Engineer / Data Architect or in general person that works with Data Warehouses should keep an eye on.
From my personally perspective they are important, if not to say crucial, you may have a different opinion and that is great. As long as time was spent to design whole processes and all pros and cons are known and well discussed, that is all fine. In the end life is full of give-and-take situations, I wish you to make the best possible options at a point of time that you can in terms of designing and developing data warehouse architecture and processes.
Please let me know what do you think, are there any points far more relevant in your opinion?