Principles of a Well-Designed Data Architecture
Data architecture principles are the set of rules that govern how data is collected, stored, transformed, and consumed. These principles form the foundation of a data architecture framework and help build effective data strategies and support data-driven decisions.
Keep it Simple
Data is a shared asset; to make the most of its business value, it has to be understood by all stakeholders alike. From the business analytics team to the DataOps team, everyone needs to be on the same page. So we need to keep our data architecture simple enough; otherwise, we would just be creating another silo.
Simple doesn’t mean avoiding complexity but breaking it into small, manageable pieces. A simple design makes it easy to discover, prepare, build, and deploy data and its pipelines. It also helps make the data ecosystem more reliable, scalable, reusable, and maintainable.
Separate the Concerns
Just like software architecture, data architecture has several distinct concerns, and each needs to be dealt with separately. Doing so provides more degrees of freedom in every aspect of design, build, deployment, and usage. Examples include decoupling storage from compute and keeping separate layers for ingestion, derivation, and consumption.
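To make the layering concrete, here is a minimal sketch in Python; the dataset and function names are purely illustrative, and a real platform would back each layer with its own storage and compute:

```python
# A minimal sketch of separated pipeline layers. Names (Record,
# ingest, derive, consume) are illustrative, not a real framework.

from dataclasses import dataclass

@dataclass
class Record:
    order_id: str
    amount: float

def ingest(source_rows: list[dict]) -> list[Record]:
    """Ingestion layer: only parses and lands raw data; no business logic."""
    return [Record(r["order_id"], float(r["amount"])) for r in source_rows]

def derive(records: list[Record]) -> dict[str, float]:
    """Derivation layer: applies business rules, independent of storage."""
    totals: dict[str, float] = {}
    for rec in records:
        totals[rec.order_id] = totals.get(rec.order_id, 0.0) + rec.amount
    return totals

def consume(totals: dict[str, float]) -> None:
    """Consumption layer: serves results; knows nothing about ingestion."""
    for order_id, total in sorted(totals.items()):
        print(f"{order_id}: {total:.2f}")

if __name__ == "__main__":
    rows = [{"order_id": "A1", "amount": "10.5"},
            {"order_id": "A1", "amount": "4.5"}]
    consume(derive(ingest(rows)))
```

Because each layer only depends on the output of the previous one, we can swap the storage behind ingestion or scale the derivation compute without touching the consumers.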
Follow Best Practices
There are documented best practices for building and managing data ecosystems, covering data storage, data integration, master data, metadata, and data lineage.
While we need not apply them as-is, we need to:
- be aware of these best practices
- analyze which practices matter the most in our context
- apply them to a level that suits us
Note: Writing clean and modular code (with appropriate documentation) to build data pipelines is also recommended; modular pipelines also make data lineage easier to track, as sketched below.
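As one illustration of that note, here is a minimal sketch of a modular pipeline step with lineage recording attached via a decorator. The step name, dataset names, and in-memory lineage log are hypothetical; a real setup would write to a metadata store:

```python
# A minimal sketch, assuming a hypothetical lineage log: each pipeline
# step is a small, documented function, and a decorator records which
# step produced which output. Names here are illustrative, not a real API.

import functools
from datetime import datetime, timezone

LINEAGE: list[dict] = []  # in practice this would go to a metadata store

def tracked(step_name: str, inputs: list[str], output: str):
    """Wrap a pipeline step and record a lineage entry when it runs."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            result = fn(*args, **kwargs)
            LINEAGE.append({
                "step": step_name,
                "inputs": inputs,
                "output": output,
                "ran_at": datetime.now(timezone.utc).isoformat(),
            })
            return result
        return wrapper
    return decorator

@tracked("filter_active", inputs=["raw.users"], output="staged.active_users")
def filter_active(users: list[dict]) -> list[dict]:
    """Keep only active users (one small, testable responsibility)."""
    return [u for u in users if u.get("active")]

if __name__ == "__main__":
    filter_active([{"id": 1, "active": True}, {"id": 2, "active": False}])
    print(LINEAGE)  # shows which step produced which dataset, and when
```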
Validate with POCs
Every data ecosystem is different; what has worked for one may not work for another in the same way. It’s better to start with proofs of concept (POCs) to validate what works and whether some customization is required. POCs also help us revise our timeline and resourcing estimates.
Just Enough Approach
The whole data ecosystem can’t be migrated and/or optimized in one go without impacting day-to-day operations. We need to identify high-impact items that require reasonable effort.
This approach draws inspiration from the Pareto principle, which says that roughly 80% of consequences come from 20% of causes. Not all data points are equal in a data ecosystem; we need to identify the ~20% of data points (by generating data usage metrics) that account for ~80% of the overall impact.
We can then target those high-impact data points and related components first: building the data catalog, preparing data lineage, and identifying data quality issues and pipeline optimization opportunities.
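As a rough sketch of how such usage metrics can drive the selection, assume we already have per-dataset query counts (for example, from warehouse logs); the numbers below are made up:

```python
# A minimal sketch: pick the smallest set of datasets that covers
# ~80% of total usage, most-used first. Dataset names and counts
# are illustrative placeholders.

def pareto_set(usage: dict[str, int], coverage: float = 0.80) -> list[str]:
    """Return datasets, most-used first, until `coverage` of usage is met."""
    total = sum(usage.values())
    selected, covered = [], 0
    for name, count in sorted(usage.items(), key=lambda kv: kv[1], reverse=True):
        selected.append(name)
        covered += count
        if covered / total >= coverage:
            break
    return selected

if __name__ == "__main__":
    usage = {"orders": 900, "customers": 600, "clickstream": 300,
             "inventory": 120, "audit_log": 50, "legacy_tmp": 30}
    print(pareto_set(usage))  # the few datasets worth cataloging first
```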
No Free Lunch
While the NFL (No Free Lunch) theorem was proposed for ML/AI optimization, it holds true for data ecosystems and even life in general. We need to understand and accept that every approach or paradigm has its benefits and limitations. Our job is to evaluate what suits our requirements, which features are crucial to us, and which limitations we can live with, and to have a mechanism in place to evaluate different approaches against our data ecosystem requirements.
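One simple form such a mechanism can take is a weighted scorecard; the criteria, weights, and scores below are placeholders to illustrate the shape of the evaluation, not recommendations:

```python
# A minimal sketch of a weighted scorecard comparing candidate
# approaches against our requirements. All values are made up.

WEIGHTS = {"scalability": 0.4, "cost": 0.3, "team_familiarity": 0.3}

CANDIDATES = {
    # scores on a 1-5 scale per criterion
    "approach_a": {"scalability": 5, "cost": 2, "team_familiarity": 3},
    "approach_b": {"scalability": 3, "cost": 4, "team_familiarity": 5},
}

def weighted_score(scores: dict[str, int]) -> float:
    """Combine per-criterion scores using the agreed weights."""
    return sum(WEIGHTS[c] * scores[c] for c in WEIGHTS)

for name, scores in sorted(CANDIDATES.items(),
                           key=lambda kv: weighted_score(kv[1]),
                           reverse=True):
    print(f"{name}: {weighted_score(scores):.2f}")
```

The point is not the arithmetic but making the trade-offs explicit: the weights force us to state which features are crucial and which limitations we can live with.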
Be Adaptable
The data landscape is changing day by day; it’s highly unlikely that we will still be using the current data ecosystem as-is 5 or 10 years down the line. Just as we have shifted from Oracle DWH to Teradata to Hadoop to Spark in the past, a new paradigm/framework may suit our needs better in the future. The key is to be adaptable: be prepared to try new things, and have a mechanism to evaluate upcoming paradigms/approaches against our current approach and whether they ease our current way of working.
T-shaped Skill-set
From data requirements to deployment in production, a great deal goes into a data platform. It is essential for every member of the data team to have a cohesive view of the platform.
While it is almost impossible for every team member to develop expertise in each area, being concerned only with one’s own scope of work proves short-sighted and hinders collaboration.
In my view, each data professional (from business analyst to DataOps engineer) should strive for a T-shaped skill-set: deep expertise in one’s own area and just enough knowledge of the others. This way, we can anticipate what is expected of us to make other team members’ jobs easier.
Ankit Rathi is a Cloud Data Technologist, published author & well-known speaker. His interest lies primarily in building end-to-end data/AI applications/products following best practices of Data Engineering and Architecture.