
Data Maturity Model: Data Sanitization and Data Storage
Welcome to the next edition of our series on the Aionic 6-Stage Data Maturity and Readiness model. Whether you’re simply looking ahead to chart your evolutionary path, you’ve just graduated from Source Systems, or you’re already an advanced organization leveraging AI, we are excited to introduce our next two maturity levels.
And good news! The subsequent two levels are a bit more exciting, as you can begin to transform your data into the structure and format needed to activate advanced AI, Machine Learning, and Business Intelligence solutions. Let’s dive in…
Maturity Level 2. Data Engineering & Sanitization
So, you’ve identified your source systems, extracted relevant data through services, and designed a solid foundational data architecture. What’s next? This stage of data maturity centers on organizing data and setting up data pipelines and operational models in line with best practices. Here, we’re focused on transforming your data into an actionable state. And while they are often overlooked, data preparation and sanitization are critical to making data actionable.
If you’ve ever built LEGOs, this step should sound familiar. Sure, when you first open a new set, all of the pieces are organized into nice packets. But if you’ve ever rebuilt LEGO sets, the process is a little different, and requires organizing all pieces into discrete piles, separated by things like type, color, or shape, before you can truly start building. In a similar manner, organizations must take steps to organize their data before they can truly begin building intelligent solutions that can impact the business.
As part of this process, it’s important to focus on building repeatable patterns and frameworks that can continually support data organization efforts. This includes the setup of data pipelines that can manage the intake of new data and support the continued growth of your data set in a highly structured manner. In organizing your data, as well as in establishing principles and infrastructure that can maintain organization into the future, you are placing yourself and your team in the best possible position to reach your AI and Machine Learning goals.
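To make the idea of a repeatable sanitization pattern a little more concrete, here is a minimal sketch in Python. It is only an illustration: the field names (customer_id, email), the step functions, and the run_pipeline helper are all hypothetical, and a production pipeline would typically be built on a dedicated orchestration or data processing tool rather than plain functions.

```python
from typing import Callable, Optional

Record = dict
Step = Callable[[Record], Optional[Record]]  # a step returns None to reject a record


def trim_strings(record: Record) -> Record:
    """Strip stray whitespace from every string value."""
    return {k: v.strip() if isinstance(v, str) else v for k, v in record.items()}


def require_fields(*fields: str) -> Step:
    """Build a step that rejects records missing any required field."""
    def step(record: Record) -> Optional[Record]:
        for field in fields:
            if record.get(field) in (None, ""):
                return None
        return record
    return step


def run_pipeline(records: list, steps: list, key: str) -> list:
    """Apply each step in order, dropping rejected and duplicate records."""
    seen = set()
    clean = []
    for record in records:
        for step in steps:
            record = step(record)
            if record is None:
                break
        if record is None or record[key] in seen:
            continue
        seen.add(record[key])
        clean.append(record)
    return clean


# Hypothetical raw intake: one duplicate key and one record with no email.
raw = [
    {"customer_id": "C1", "email": " ann@example.com "},
    {"customer_id": "C1", "email": "ann@example.com"},
    {"customer_id": "C2", "email": ""},
    {"customer_id": "C3", "email": "bo@example.com"},
]
steps = [trim_strings, require_fields("customer_id", "email")]
clean = run_pipeline(raw, steps, key="customer_id")
```

Because each step is a small function, new rules can be added to the list without touching the rest of the pipeline, which is the kind of repeatability this stage is aiming for.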
With data effectively organized, you can now have confidence in the accuracy and validity of the data that will be used as the basis for insight generation. This unlocks a world of possibilities and means you’re ready to start building your AI masterpiece!
Maturity Level 3. Data Storage
But wait! There’s one more important piece of the puzzle. Activating a robust AI and/or Machine Learning model requires a strong data persistence tier. As AI and Machine Learning models are mathematical in nature, data often (but not always) needs to be transformed into numeric values and normalized so that features sit on comparable scales. While some models, like decision trees, do not require this level of data transformation, it is important to understand the specific requirements of each Machine Learning or AI model that you intend to deploy. Regardless of the model, this level of data maturity requires a strong data persistence tier that can support the data pipeline, as well as the organization and categorization of data that comes from heterogeneous sources and arrives in different formats.
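As a rough illustration of the numeric transformation described above, the sketch below encodes a categorical column as 0/1 indicators and rescales a numeric column in two common ways. The column names (plans, monthly_spend) and helper functions are hypothetical and use only the Python standard library; in practice a data processing or Machine Learning library would usually handle this.

```python
from statistics import mean, pstdev


def one_hot(values):
    """Map each category to a 0/1 indicator vector (columns sorted by name)."""
    categories = sorted(set(values))
    return [[1 if v == c else 0 for c in categories] for v in values]


def min_max(values):
    """Rescale numbers to the [0, 1] range."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant column: nothing to scale
    return [(v - lo) / (hi - lo) for v in values]


def z_score(values):
    """Center to mean 0 and scale to unit (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]  # constant column: nothing to scale
    return [(v - mu) / sigma for v in values]


# Hypothetical columns from a customer table.
plans = ["basic", "pro", "basic", "enterprise"]
monthly_spend = [20.0, 80.0, 20.0, 200.0]

encoded = one_hot(plans)               # indicator columns: basic, enterprise, pro
scaled = min_max(monthly_spend)        # values between 0 and 1
standardized = z_score(monthly_spend)  # mean 0, standard deviation 1
```

Which rescaling to use, if any, depends on the model: scale-sensitive models tend to benefit from it, while tree-based models generally do not need it.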
When we dig a bit deeper, some probabilistic models require that data meet a certain set of requirements. In some instances, data must be continuous (numeric). In others, data must follow a normal distribution. And in others, features must be conditionally independent of one another, as a Naive Bayes classifier assumes.
While requirements differ based on your specific objectives (for example, not all models require data to be continuous or to follow a normal distribution), it’s important to align your data storage with the objectives of your business. There are a few options for your data storage, including Data Warehouses, Data Lakes, and Data Lakehouses, each of which corresponds to a different layer of data processing.
Data Warehouses. Data Warehouses are highly structured and represent a more classic approach to data storage.
Data Lakehouses. Data Lakehouses can be viewed as a midpoint between a Data Lake and a Data Warehouse, holding a combination of raw and structured data.
Data Lakes. Data Lakes store vast amounts of raw and unstructured data.
It’s important to note that the way we view the distinction between Data Warehouses, Data Lakehouses, and Data Lakes is evolving, with industry leaders like Snowflake providing storage options for structured and unstructured data alike.
Having your data organized appropriately, and also stored and structured in a manner that supports the requirements of your desired AI model, is imperative to the success of your overarching initiative. Often, we recommend running normality tests and descriptive statistics in order to validate the data set prior to assembling and training your models.
Once your data storage is set up appropriately, you can begin to train your models, and use the resulting data set to activate Business Intelligence, Machine Learning, Generative AI, Predictive Analytics, and more!
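The validation step recommended above can be sketched with a handful of descriptive statistics and a simple check of whether a column looks normally distributed. The example below is a minimal illustration using only the Python standard library; the describe and jarque_bera helpers and the sample values are hypothetical. It uses the Jarque-Bera statistic, which is only one of several possible normality checks and is an asymptotic test, so results on small samples like these should be treated as indicative at best. A statistics library routine, such as SciPy’s scipy.stats.shapiro, would be a more typical choice in practice.

```python
import math
from statistics import mean, median, pstdev


def describe(values):
    """Basic descriptive statistics, including skewness and kurtosis."""
    n = len(values)
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        raise ValueError("constant data has no defined skewness or kurtosis")
    skewness = sum((v - mu) ** 3 for v in values) / (n * sigma ** 3)
    kurtosis = sum((v - mu) ** 4 for v in values) / (n * sigma ** 4)
    return {
        "n": n,
        "mean": mu,
        "median": median(values),
        "std": sigma,
        "min": min(values),
        "max": max(values),
        "skewness": skewness,  # 0 for a perfectly symmetric sample
        "kurtosis": kurtosis,  # 3 for a normal distribution
    }


def jarque_bera(values):
    """Return the Jarque-Bera statistic and its asymptotic p-value."""
    stats = describe(values)
    jb = stats["n"] / 6 * (
        stats["skewness"] ** 2 + (stats["kurtosis"] - 3) ** 2 / 4
    )
    # Under normality, JB is approximately chi-square with 2 degrees of
    # freedom, whose survival function is exp(-x / 2).
    p_value = math.exp(-jb / 2)
    return jb, p_value


# Hypothetical columns: one symmetric, one heavily right-skewed.
symmetric = [1, 2, 3, 4, 5, 6, 7, 8, 9]
skewed = [1, 1, 1, 1, 2, 2, 3, 5, 9, 30]

jb_symmetric, p_symmetric = jarque_bera(symmetric)
jb_skewed, p_skewed = jarque_bera(skewed)
```

A small p-value suggests the column departs from a normal distribution, which may call for a transformation or for a model that does not assume normality.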
Self-Examination
Here are a few more self-evaluation questions to see how you stack up against our latest data maturity levels.
Do you have a data pipeline in place?
Do you have a modern data storage solution in place? In what state does your data reside?
Do you have a Data Lake? A Data Lakehouse? Or a Data Warehouse? Have you explored or discussed these storage options within your organization?
Have you identified data requirements for the specific AI and Machine Learning models that you are aiming to activate? For example, do you require data normalization? Do you require conditional independence? Do you require a continuous numeric data set?
Does your current data set reflect these requirements?