In the second phase of CRISP-DM, Data Understanding, it’s finally time to roll up our sleeves and get our hands dirty. We need to continue to build the foundation of knowledge that we started in phase one.
In this phase, we are keeping in mind the goals we teased out in phase one: Business Understanding. Now, we look closely at the data sources we will use to meet the business goals and drive value for the stakeholders.
The Business Understanding and Data Understanding phases go hand in hand. Often, there is some iteration back and forth during these phases. The data team should be asking a lot of questions to clarify their business understanding as they get to know the data. Unless they have previous experience with the datasets, they will also have a pile of questions about the data.
Table of Contents
- Who is involved in the Data Understanding Phase of CRISP-DM?
- What tasks are performed during the Data Understanding phase of CRISP-DM?
- Collect Initial Data
- Describe the Data
- Verify Data Quality
- Perform Exploratory Data Analysis (EDA)
- Clarify Legal, Regulatory, and Ethical Considerations
- Update Documentation
- Revisiting Data Understanding
- Conclusion
Who is involved in the Data Understanding Phase of CRISP-DM?
Some key players at this stage of CRISP-DM will be the subject matter experts (SMEs) who can answer questions about how data is generated and the meaning of each data element. The project manager will be key to facilitating requests for information (RFIs) between the data team and the stakeholders. Also, IT or Data Engineering support will be necessary at this phase to facilitate access to key datasets.
What tasks are performed during the Data Understanding phase of CRISP-DM?
Here’s a quick outline of the tasks in the second phase of CRISP-DM; each is covered in its own section below.
Collect Initial Data
Data resources come in all shapes and sizes. A data resource might be just a handful of files: CSVs, text files, PDFs, log files, and so on. This is incredibly common in businesses early in their Digital Transformation journey. Data resources can also be more sophisticated, such as tables in a relational database or an entire data warehouse.
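As a minimal sketch, collecting initial data from a flat file and a database might look like this in Python with pandas (the file, database, and table names here are hypothetical):

```python
import sqlite3

import pandas as pd

# Flat-file source: a CSV export from a line-of-business system
sensor_df = pd.read_csv("sensor_export.csv")

# Database source: a table in a SQLite database
conn = sqlite3.connect("plant.db")
maintenance_df = pd.read_sql("SELECT * FROM maintenance_log", conn)
conn.close()

# First look at what we actually received
print(sensor_df.shape)
print(maintenance_df.shape)
```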
Describe the Data
Consult with Subject Matter Experts (SMEs) and Data Matrices
It can be tempting to jump right into EDA and modeling at this point, but STOP! First, have a conversation with a subject matter expert on the data and ask questions about how the data is generated. Also ask about the semantic meaning of the values, the data’s intended use, whether a data matrix exists, and whether they have ideas for features that could be added that would be useful to your project.
What information can an SME provide?
SMEs are invaluable sources of information, and they might not even know it. They have a treasure trove of knowledge about where the data comes from, who or what generates it, how long it has been collected, sources of error, syntax quirks, semantic oddities, which values might be interesting, which values are useless, what transformations of values could be useful (feature engineering), and more.
What is a Data Matrix?
Another excellent source of information about a dataset is a data matrix. A data matrix is a document that contains metadata about the elements in the dataset. In traditional columnar datasets, this will include the meaning of each column, the expected range of values, datatype, any encoding or lookup table information, null values, units of measurement, location of reference documentation, and more. You may also have a data matrix that describes the rows: what is being sampled and how often?
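For illustration only, one entry of a hypothetical data matrix for an rpm column might capture this metadata, expressed here as a Python dict:

```python
# A single hypothetical data matrix entry for one column
rpm_entry = {
    "column": "rpm",
    "meaning": "rotational speed of the main shaft",
    "datatype": "float",
    "units": "revolutions per minute",
    "expected_range": (0, 3600),
    "null_codes": ["", "NA", -140],  # sentinel/initialization values to treat as null
    "reference_doc": "sensor vendor manual, section 4",  # hypothetical location
}
print(rpm_entry["units"])
```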
Some Data Professionals ARE the SMEs
Some data professionals were SMEs and are now running data projects. These data scientists have a huge advantage in being able to tap into their domain knowledge when getting started with a project. Their challenge will be not to skip this step but to take the time to document their knowledge for the sake of the whole data team.
Other data professionals will come to the project with data skills, but no specific domain knowledge. In this case, collaboration with SMEs will be incredibly valuable and necessary.
Verify Data Quality
Before performing any exploratory data analysis (EDA), we need to do some initial data quality checks and cleaning. Chances are good that you will encounter dirty data. This is data that is incorrect, incomplete, or corrupted. It can result from errors in data entry, data transmission, or data storage, and can cause problems such as inaccurate or misleading analysis, wasted resources, and lost opportunities.
Questions to ask during data quality checks
1. What are the null values?
2. Are there any “initialization values” or other junk data that don’t fit the semantic definition of a column? (e.g., values of -140 in a column whose semantic meaning is revolutions per minute)
3. Are there any values that don’t fit the syntactic definition of a column? (e.g., strings in a column of floats)
4. Are there duplicates?
5. Is there variation in the data?
6. Are there gaps in the data? (e.g., if the data has timestamps, are the values sampled at regular intervals?)
These data quality checks are useful in identifying potential issues in the data. A thorough check here will save time down the road!
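A minimal sketch of these checks with pandas, assuming a hypothetical dataset with rpm and timestamp columns:

```python
import pandas as pd

df = pd.read_csv("sensor_export.csv", parse_dates=["timestamp"])  # hypothetical file and columns

# 1. Null values per column
print(df.isna().sum())

# Coerce rpm to numeric so the semantic and syntactic checks are safe
rpm = pd.to_numeric(df["rpm"], errors="coerce")

# 2. Junk/initialization values that violate the semantic definition
print((rpm < 0).sum(), "negative rpm readings")

# 3. Values that violate the syntactic definition (non-numeric entries)
print((rpm.isna() & df["rpm"].notna()).sum(), "non-numeric rpm entries")

# 4. Duplicate rows
print(df.duplicated().sum(), "duplicate rows")

# 5. Variation: a column with a single unique value carries no signal
print(df.nunique())

# 6. Gaps: inspect the spacing between consecutive timestamps
print(df["timestamp"].sort_values().diff().describe())
```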
Dirty Data Checklist
Keep good documentation about what is removed from the dataset and what remains, as the removals themselves can be a source of bias or error. For example, if you remove all of the rows where “revolutions per minute (rpm)” is null, you may have removed exactly the rows that correspond to a specific event during which “rpm” isn’t recorded. Do a little EDA on the data you remove from a dataset to ensure you aren’t accidentally discarding something valuable.
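As a sketch of that idea, split off the rows you intend to drop and look at them before discarding anything (the column and file names are hypothetical):

```python
import pandas as pd

df = pd.read_csv("sensor_export.csv", parse_dates=["timestamp"])  # hypothetical file

# Split instead of silently dropping
removed = df[df["rpm"].isna()]
kept = df[df["rpm"].notna()]

# Quick EDA on the removed rows: do they cluster around one machine or time window?
print(removed["machine_id"].value_counts())
print(removed["timestamp"].min(), "to", removed["timestamp"].max())

# Record the decision alongside the cleaned dataset for later audit
removed.to_csv("removed_rows_rpm_null.csv", index=False)
```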
Perform Exploratory Data Analysis (EDA)
Okay, we can finally jump into EDA. Hadley Wickham, Chief Scientist at RStudio and author of R for Data Science, describes EDA as an iterative process by which you:
- Generate questions about your data.
- Search for answers by visualizing, transforming, and modeling your data.
- Use what you learn to refine your questions and/or generate new questions.
EDA is guided by the following questions:
- What type of variation occurs within my variables?
- What type of covariation occurs between my variables?
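A minimal sketch of both questions in Python, assuming a cleaned DataFrame with hypothetical rpm, temperature, and machine_id columns:

```python
import matplotlib.pyplot as plt
import pandas as pd

df = pd.read_csv("sensor_export_clean.csv")  # hypothetical cleaned dataset

# Variation within a single variable: look at its distribution
df["rpm"].plot.hist(bins=50, title="Distribution of rpm")
plt.show()

# Covariation between two continuous variables
df.plot.scatter(x="rpm", y="temperature")
plt.show()

# Covariation between a categorical and a continuous variable
print(df.groupby("machine_id")["rpm"].describe())
```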
Here are some ideas about what to look for in different types of data:
- Continuous data: the shape of the distribution, outliers, skew, and summary statistics such as the mean, median, and spread
- Categorical data: frequency counts, rare or unexpected levels, and class imbalance
- Ordinal data: the same checks as categorical data, plus whether the natural ordering of the levels is reflected in the data
Many data platforms have developed automation around EDA, and below I list a few Python and R packages that do the same. Remember to go beyond the automated tools and look for things such as patterns within groups and patterns over time.
Python Packages for Auto-EDA
- ydata-profiling (formerly pandas-profiling)
- Sweetviz
- D-Tale
- AutoViz
R Packages for Auto-EDA
- DataExplorer
- skimr
- GGally
- SmartEDA
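For example, on the Python side, a one-shot profile with ydata-profiling might look like this (the file names are hypothetical):

```python
import pandas as pd
from ydata_profiling import ProfileReport

df = pd.read_csv("sensor_export_clean.csv")  # hypothetical dataset
ProfileReport(df, title="Sensor Data Profile").to_file("sensor_profile.html")
```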
Clarify Legal, Regulatory, and Ethical Considerations
When working with data, there are a few things to keep in mind to make sure you’re doing it legally, ethically, and in compliance with regulations.
You’ll want to make sure you’re handling personal data in a way that follows laws like the GDPR and CCPA. If you are working with health or medical data, you may need to be aware of HIPAA. Oh, and don’t forget to check whether there are any regulations specific to the industry you’re working in, such as ITAR compliance regulations for United States defense-related technical data.
Also, keep data secure and set up a good system for maintaining datasets. In some cases, the location of certain types of data is auditable, so it’s important not to scatter intermediary datasets all over your hard drive. Additionally, you should have a clear policy in place for how you’ll use data and how long you’ll keep it, and make sure you’re clear with stakeholders and data owners about what you’re doing.
With ethics in mind, be aware of any potential biases that could cause problems. Consider biases introduced during the collection of the data. Also, evaluate what biases could be introduced by your method of sampling the data.
Here’s a checklist to run through with the data owner or SME:
- Does the data contain personal information covered by laws like the GDPR or CCPA?
- Does the data contain health or medical information covered by HIPAA?
- Do any industry-specific regulations (e.g., ITAR) apply?
- Where may the data be stored, and is its location auditable?
- How will the data be used, and how long will it be retained?
- What biases could have been introduced during collection or by your method of sampling?
Update Documentation
If your organization has a standard method of documentation, use that to track notes and findings from the data understanding phase. Try to find a method of documentation that is flexible because you will return to add notes again and again. By flexible, I just mean something that can be added to without getting too messy. A running Word document is okay, but probably not enough.
You will need code snippets, charts & graphs, notes from conversations, etc. You will likely have multiple documents:
- A Word document or spreadsheet containing a list of data sources, SME contact info, data owners, access info, high-level notes, and the location of data documentation
- Data Matrices
- Code notebook or script containing steps (with comments!) used to clean the data
- Word document containing dated notes from conversations with SMEs (this can be organized into an FAQ or Wiki)
- EDA Code notebook or dashboard with charts, graphs, and analysis of initial impressions
Two excellent standards for dataset documentation are Datasheets for Datasets and the Dataset Nutrition Label.
Revisiting Data Understanding
The Data Understanding phase of CRISP-DM is one we need to come back to again and again. Honestly, we will be constantly learning new quirks and details of the dataset as we go. We may uncover hidden bias during the model evaluation phase, or we may realize that the method of data generation limits how accurate our model can ever be.
CRISP-DM is meant to be an iterative, circular process. It’s important to communicate to the Project Manager and stakeholders that you will continue to come back and add to the documentation of this phase as you go.
Conclusion
Phase 2 of CRISP-DM, Data Understanding, is foundational to our success in achieving the business goals established in Phase 1. We need to document, clarify, revisit, and share our findings from the data understanding phase. The work we do here sets us up to deliver a high-quality model, API, dashboard, report, or whatever the expected deliverable may be.
Trust me when I emphasize that time spent in the data understanding phase is well worth it and will directly affect the result of this project and any projects built upon it. In the next phase, we will start prepping our data!