When is a data set “ready” for data analytics? The correct answer is probably “never”, or at least “never completely”. There is always more and better data to collect. There is always more that could be done to improve the quality of the data. Waiting until everything is “just right” before we start means we will never start.
So we have to work with imperfect data: data that is “good enough” for basic data analytics. How good is good enough? Here are our thoughts, in 10 steps:
- All our source data has been properly defined: we know where all of the sources are and what each contributes, even if we do not use them all in our current analysis.
- Duplicates do not exist in the data being analyzed.
- No data are missing from the records. If data are missing and we need values in those fields, then we need to run a data imputation plan. Imputation is the art of “guessing” at the values that should be there and filling them in. It is important to be objective about the choices made here; we cannot create data that works for our current study but will negatively impact the results of another program.
- If we are building a more traditional data warehouse (as opposed to an unstructured data lake), then we probably want to normalize our data to some level. The normalization process is well documented elsewhere in multiple sources, as it is a fundamental principle of modern database design. Normalization helps eliminate redundancy within the records, and can make a database more efficient.
- Common terms, generally known as data unification. Field names and data types should match: if there is a telephone number field, the field name for all records is “telephone”. The same goes for values: if there is a gender field, the value for “male” should be “male”, not “M” or the integer 1. These transformations can be handled when the data are mapped from sources to the unified destination. This step might not be required for a data lake, but it could yield enough efficiencies to be worth doing anyway.
- The data is structured efficiently enough, and the compute capability is powerful enough, that the data analytics can complete in a reasonable amount of time.
- The data set encompasses the question. If not, can more data be obtained? Can we start with a different question?
- When multiple data sources feed into the system of record, the system should be automated as much as possible.
- The data set is organized to scale full automation from front to back with real-time analytics. Perchance to dream.
- Someone is accountable for the responsible use and security of the data, and is empowered to ensure good governance.
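The deduplication step above can be sketched in plain Python. This is a minimal illustration, assuming records arrive as dictionaries and that a single key field (here a hypothetical `email` column) is enough to identify a duplicate; in practice the matching key is often a combination of fields.

```python
def dedupe(records, key):
    """Keep the first record seen for each distinct value of `key`."""
    seen = set()
    unique = []
    for rec in records:
        k = rec.get(key)
        if k not in seen:
            seen.add(k)
            unique.append(rec)
    return unique

rows = [
    {"email": "a@example.com", "name": "Ann"},
    {"email": "b@example.com", "name": "Bob"},
    {"email": "a@example.com", "name": "Ann"},  # duplicate of the first row
]
print(len(dedupe(rows, "email")))  # 2
```

Keeping the first occurrence is only one policy; keeping the most recent, or merging the duplicates, may suit a given source better.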
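The imputation step can also be sketched briefly. Mean imputation is just one of many strategies (the article does not prescribe a method), and the `age` field used here is a made-up example; the point is that the fill rule is explicit and reproducible rather than ad hoc.

```python
def impute_mean(records, field):
    """Fill missing (None) values in `field` with the mean of the observed values."""
    observed = [r[field] for r in records if r.get(field) is not None]
    mean = sum(observed) / len(observed)
    for r in records:
        if r.get(field) is None:
            r[field] = mean
    return records

data = [{"age": 30}, {"age": None}, {"age": 50}]
impute_mean(data, "age")
print(data[1]["age"])  # 40.0
```

Because the rule is written down, another team reusing the data set can see exactly which values were measured and which were guessed.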
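The unification step amounts to a mapping applied as data moves from each source into the destination. A minimal sketch, with hypothetical source field names (`tel`, `phone_no`, `sex`) standing in for whatever the real feeds use:

```python
# Hypothetical per-source mappings: field renames and value translations.
FIELD_MAP = {"phone_no": "telephone", "tel": "telephone", "sex": "gender"}
VALUE_MAP = {"gender": {"M": "male", "F": "female", 1: "male", 2: "female"}}

def unify(record):
    """Rename fields to the unified schema and translate coded values."""
    out = {}
    for field, value in record.items():
        name = FIELD_MAP.get(field, field)          # e.g. "tel" -> "telephone"
        value = VALUE_MAP.get(name, {}).get(value, value)  # e.g. "M" -> "male"
        out[name] = value
    return out

print(unify({"tel": "555-0100", "sex": "M"}))
# {'telephone': '555-0100', 'gender': 'male'}
```

Keeping the mappings in data (dictionaries, or a configuration table) rather than in code makes it easy to add a new source without touching the transformation logic.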
The really important thing is to get started with the data available. Small successes early will create more demand from a wider internal audience, and may compel some parties to share previously unavailable data in exchange for fresh insight. With continuing diligence, the “good enough” data can become really good data. In time, the data may become good enough to be marketable on its own.