In the second phase of CRISP-DM, Data Understanding, it’s finally time to roll up our sleeves and get our hands dirty. We need to continue to build the foundation of knowledge that we started in phase one.

In this phase, we are keeping in mind the goals we teased out in phase one: Business Understanding. Now, we look closely at the data sources we will use to meet the business goals and drive value for the stakeholders.

The Business Understanding and Data Understanding phases go hand in hand. Often, there is some iteration back and forth during these phases. The data team should be asking a lot of questions to clarify their business understanding as they get to know the data. Unless they have previous experience with the datasets, they will also have a pile of questions about the data.


Who is involved in the Data Understanding Phase of CRISP-DM?

Some key players at this stage of CRISP-DM will be the subject matter experts (SMEs) who can answer questions about how data is generated and the meaning of each data element. The project manager will be key to facilitating requests for information (RFIs) between the data team and the stakeholders. Also, IT or Data Engineering support will be necessary at this phase to facilitate access to key datasets.

What tasks are performed during the Data Understanding phase of CRISP-DM?

Process diagram of Phase 2: Data Understanding.

Here’s a quick outline of tasks in the second phase of CRISP-DM:

  • Collect Initial Data

    • Establish access
    • Ask for a data matrix or documentation for each source
    • Document who is the Data Owner or an SME per data source
  • Describe Data

    • Consult with subject matter experts (SMEs) and data matrices
      • Understand the semantic and syntactic meaning of the data elements
      • Understand the data-generating process
      • Document known and suspected biases in the dataset (every data set has biases!)
      • See Datasheets for Datasets and Nutrition Labels
  • Verify Data Quality

  • Explore Data

    • Consult with SMEs AGAIN to share EDA results
      • Verify Findings
      • Discuss application to the business problem
      • Brainstorm feature engineering
  • Clarify Legal, Regulatory and Ethical Considerations of the data

  • Update Documentation

Collect Initial Data

Data resources can come in all shapes and sizes. A data resource might just be a collection of files: CSVs, text files, PDFs, log files, etc. This is incredibly common in businesses early in their Digital Transformation journey. Data resources can also be more sophisticated datasets, such as tables in a database or an entire data warehouse. Here are some common data sources:

Flat Files

Comma Separated Values (CSVs), Excel files, text (txt) files, PDFs, log files, etc. These can also be domain-specific files, such as machine-generated XML files.

Accessing these files might require gaining permission to a shared folder.

Preprocessing this data type will take a little more work since it will probably require writing a script to parse the files into an analytics-ready dataset.
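For example, a minimal sketch of such a parsing script, assuming a hypothetical folder of machine-generated CSV exports and the pandas library, might look like this:

```python
from pathlib import Path

import pandas as pd

# Hypothetical folder of raw, machine-generated CSV exports
RAW_DIR = Path("data/raw_exports")

frames = []
for csv_file in sorted(RAW_DIR.glob("*.csv")):
    df = pd.read_csv(csv_file)
    df["source_file"] = csv_file.name  # keep provenance for later questions
    frames.append(df)

# Combine everything into a single analytics-ready table
combined = pd.concat(frames, ignore_index=True)
combined.to_csv("data/combined_raw.csv", index=False)
```

Real parsing work (fixed-width logs, XML, PDFs) will be messier, but the goal is the same: one documented, repeatable script that takes you from raw files to a tidy table.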

If you are lucky enough to have a Data Engineering team, they may be able to take the raw files and load them into a table in a Data Lake.

Traditional Databases

Businesses and organizations store a tremendous amount of information in traditional databases. If your data is in a database, you will need credentials to authenticate with appropriate access levels to read and query the necessary tables.

A subject matter expert should be able to point you in the right direction regarding what tables you might want to look at. Ask a lot of questions about available tables. Sometimes there are additional lookup tables that provide the key to understanding data in larger tables. In addition, you may be able to enrich one table with information from another and thus add features to your dataset.

Documentation is also key when working with traditional databases. You may find coded values or non-human-readable values that you will want a “decoder ring” to understand. Unfortunately, this documentation doesn’t always exist, so again, lean on SMEs to help make sense of the data and document everything!
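As a rough sketch, pulling and enriching a table with SQLAlchemy and pandas might look like the following; the connection string, table names, and columns here are all hypothetical:

```python
import pandas as pd
from sqlalchemy import create_engine

# Hypothetical read-only connection string -- never hard-code real credentials
engine = create_engine("postgresql://readonly_user:password@db-host:5432/operations")

# Join a fact table to a lookup table so coded values become human-readable
query = """
    SELECT o.order_id,
           o.order_date,
           o.status_code,
           s.status_description
    FROM orders AS o
    LEFT JOIN order_status_lookup AS s
           ON o.status_code = s.status_code
    WHERE o.order_date >= '2023-01-01'
"""

orders = pd.read_sql(query, engine)
print(orders.head())
```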

Data Warehouses

A Data Warehouse is a long-term data store that usually organizes data into data cubes of facts and dimensions. These facts and dimensions are carefully crafted aggregations of raw transactional data that provide the granularity various business units need to monitor Key Performance Indicators (KPIs) and make decisions.

For the sake of a Data Mining Project, data professionals typically try to get as close to the raw data as possible, but in some cases, the raw data is only kept for a certain period and all historical data is aggregated into a data warehouse using Business Rules. You will need to find a balance between the age of the data and the granularity needed because older data will be less granular in a data warehouse.

Similar to accessing data in a traditional database, you will need credentials with appropriate access to query a data warehouse. You will also need access to an SME or documentation to understand the transformations that underlie the datasets of interest.

Unlike traditional databases, data warehouses naturally have more documentation since they require business rules to be coded into the underlying transformations that populate the facts and dimensions. You will want an SME to help point you to some places of interest in the Data Warehouse.

Data Lakes

A Data Lake is an interesting place to pull data from. Depending on your organization, this may be very organized, or it may be an undocumented data swamp. Look for an SME and ask a lot of questions. Don't be surprised if you find duplicate data and intermediate datasets here. If you are given access to a development environment, there might also be experimental tables, datasets, and views.

Just like the traditional database and the data warehouse, you will need credentials and permissions to access specific data in the lake.

APIs

Application Programming Interfaces (APIs) are excellent ways to access data that exists outside of your organization. Keep in mind that many enterprise APIs have terms and conditions, so take time to become familiar with these. Also, familiarize yourself with rate limits and pricing. You might be able to pull 1 TB of data, but your organization might get a big bill in the mail as a result, so do your homework before you query!
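Here is a minimal sketch of a polite API client, assuming a hypothetical REST endpoint and the requests library; check the provider's documentation for its real rate limits and authentication scheme:

```python
import time

import requests

BASE_URL = "https://api.example.com/v1/measurements"  # hypothetical endpoint
API_KEY = "YOUR_API_KEY"  # keep real keys in a secrets manager, not in code


def fetch_page(page: int) -> dict:
    """Fetch one page of results, raising on HTTP errors."""
    response = requests.get(
        BASE_URL,
        headers={"Authorization": f"Bearer {API_KEY}"},
        params={"page": page, "page_size": 100},
        timeout=30,
    )
    response.raise_for_status()
    return response.json()


records = []
for page in range(1, 11):  # cap how many pages we pull in one run
    payload = fetch_page(page)
    records.extend(payload.get("results", []))
    time.sleep(1)  # stay well under the documented rate limit
```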

Web Scraping

In some cases, you may need to scrape data from the web to build a dataset. Many websites offer APIs, which are generally preferred over web scrapers, but not all do. Make sure to read the fine print of the website and any disclaimers or terms and conditions before writing a script to collect data.
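If scraping is the only option, a cautious sketch (hypothetical URL, using requests and BeautifulSoup) might start by checking robots.txt:

```python
import urllib.robotparser

import requests
from bs4 import BeautifulSoup  # pip install beautifulsoup4

URL = "https://www.example.com/public-reports"  # hypothetical page

# Check robots.txt before requesting anything
robots = urllib.robotparser.RobotFileParser("https://www.example.com/robots.txt")
robots.read()

if robots.can_fetch("*", URL):
    html = requests.get(URL, timeout=30).text
    soup = BeautifulSoup(html, "html.parser")
    report_titles = [h2.get_text(strip=True) for h2 in soup.find_all("h2")]
    print(report_titles)
else:
    print("robots.txt disallows fetching this page -- find another source")
```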

Describe the Data

Consult with Subject Matter Experts (SMEs) and Data Matrices

It can be tempting to jump right into EDA and modeling at this point, but STOP! First, have a conversation with a subject matter expert on the data and ask questions about how the data is generated. Also ask about the semantic meaning of the values, the data's intended use, whether a data matrix exists, and whether they have any ideas about features that could be added that would be useful to your project.

What information can an SME provide?

SMEs are invaluable sources of information, and they might not even know it. They hold a treasure trove of knowledge about where the data comes from, who or what generates it, how long it has been collected, sources of error, syntax quirks, semantic oddities, which values might be interesting, which values are useless, what transformations of values could be useful (feature engineering), and more.

What is a Data Matrix?

Another excellent source of information about a dataset is a data matrix. A data matrix is a document that contains metadata about the elements in the dataset. In traditional columnar datasets, this will include the meaning of each column, the expected range of values, datatype, any encoding or lookup table information, null values, units of measurement, location of reference documentation, and more. You may also have a data matrix that describes the rows: what is being sampled and how often?
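As an illustration, here is what a miniature data matrix might look like for a hypothetical sensor dataset, captured as a pandas DataFrame so it can live alongside the code (the columns and notes are made up):

```python
import pandas as pd

# A miniature data matrix for a hypothetical sensor dataset
data_matrix = pd.DataFrame([
    {"column": "timestamp", "dtype": "datetime", "units": "UTC",
     "expected_range": "2020-present", "nulls": "never null",
     "notes": "sampled every 10 seconds"},
    {"column": "rpm", "dtype": "float", "units": "revolutions per minute",
     "expected_range": "0-3000", "nulls": "sensor offline",
     "notes": "-140 is an initialization value, not a real reading"},
    {"column": "status_code", "dtype": "int", "units": "n/a",
     "expected_range": "1-5", "nulls": "unknown",
     "notes": "decode with the status lookup table"},
])

data_matrix.to_csv("docs/data_matrix_sensors.csv", index=False)
```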

Some Data Professionals ARE the SMEs

Some data professionals were SMEs and are now running data projects. These data scientists have a huge advantage in being able to tap into their domain knowledge when getting started with a project. Their challenge will be not to skip this step, but to take the time to document their knowledge for the sake of the whole data team.

Other data professionals will come to the project with data skills, but no specific domain knowledge. In this case, collaboration with SMEs will be incredibly valuable and necessary.

Verify Data Quality

Before performing any exploratory data analysis (EDA), we need to do some initial data quality checks and cleaning. Chances are good that you will encounter dirty data. This is data that is incorrect, incomplete, or corrupted. It can result from errors in data entry, data transmission, or data storage, and can cause problems such as inaccurate or misleading analysis, wasted resources, and lost opportunities.

Questions to ask during data quality checks

  1. What are the null values?
  2. Are there any “initialization values” or other junk data that don’t fit the semantic definition of the columns? (e.g. values of -140 when the semantic meaning of the column is revolutions per minute)
  3. Are there any values that don’t fit the syntactic definition of the columns? (e.g. strings in a column of floats)
  4. Are there duplicates?
  5. Is there variation in the data?
  6. Are there gaps in the data? (e.g. if the data has timestamps, are the values sampled at regular intervals?)

These data quality checks are useful in identifying potential issues in the data. A thorough check here will save time down the road!
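A minimal pandas sketch of these six checks, assuming a hypothetical sensor dataset with rpm and timestamp columns, might look like this:

```python
import pandas as pd

# Hypothetical dataset from the collection step
df = pd.read_csv("data/combined_raw.csv", parse_dates=["timestamp"])

# 1. Null values per column
print(df.isna().sum())

# 2. Semantic check: junk such as -140 "initialization" readings in an rpm column
rpm = pd.to_numeric(df["rpm"], errors="coerce")  # coerce so stray strings don't break the comparison
print(((rpm < 0) | (rpm > 3000)).sum())

# 3. Syntactic check: strings hiding in a numeric column
print((rpm.isna() & df["rpm"].notna()).sum())

# 4. Duplicate rows
print(df.duplicated().sum())

# 5. Variation -- constant columns carry no information
print(df.nunique())

# 6. Gaps in timestamped data (irregular sampling intervals)
print(df["timestamp"].sort_values().diff().describe())
```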

Dirty Data Checklist

  • Perform Data Quality Checks to identify issues
  • Quantify how much of the data is “dirty”
  • Create a plan and a repeatable process (e.g. a script) for removing the issues
  • Update documentation with data cleaning details
  • Clean the data! (hit go on your script!)
  • Validate that the erroneous data is gone

You should have good documentation about what is removed from the dataset and what remains as this could be a source of bias or error in your dataset. For example, if you remove all of the rows where “revolutions per minute (rpm)” is null, you may have removed all of the rows that correspond to a specific event where “rpm” isn’t recorded. You need to do a little EDA on the data that you remove from a dataset to ensure you aren’t accidentally removing something valuable.
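Here is a rough sketch of a repeatable cleaning step that quantifies and inspects what it removes before dropping anything; the dataset path and column names are hypothetical:

```python
import pandas as pd

df = pd.read_csv("data/combined_raw.csv")  # hypothetical dataset

# Flag the rows we intend to drop rather than dropping them blindly
rpm = pd.to_numeric(df["rpm"], errors="coerce")
to_drop = rpm.isna() | (rpm < 0)

# Quantify and inspect the removed rows before committing to the rule
print(f"Dropping {to_drop.sum()} of {len(df)} rows ({to_drop.mean():.1%})")
print(df[to_drop].describe(include="all"))  # quick EDA on what is being removed

clean = df[~to_drop].copy()
df[to_drop].to_csv("data/removed_rows.csv", index=False)  # keep an audit trail
clean.to_csv("data/clean.csv", index=False)
```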

Perform Exploratory Data Analysis (EDA)

Okay, we can finally jump into EDA. Hadley Wickham, Chief Scientist at RStudio and author of R for Data Science, describes EDA as an iterative process by which you:

  1. Generate questions about your data.
  2. Search for answers by visualizing, transforming, and modeling your data.
  3. Use what you learn to refine your questions and/or generate new questions.

EDA is guided by the following questions:

  • What type of variation occurs within my variables?
  • What type of covariation occurs between my variables?

Here are some ideas about what to look for in different types of data, along with rough code sketches:

Continuous Data

  • 5-number summaries visualized as boxplots

  • Histograms

  • Scatterplot of value over time if data is timestamped

  • Scatterplot Matrix to compare pairs of variables

  • Correlation Coefficient (Pearson’s r) to test for linear correlation between pairs of variables
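A minimal pandas/seaborn sketch covering these checks for continuous columns might look like this (the dataset path and column names are hypothetical):

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

df = pd.read_csv("data/clean.csv", parse_dates=["timestamp"])  # hypothetical dataset
numeric_cols = ["rpm", "temperature", "pressure"]               # hypothetical columns

print(df[numeric_cols].describe())              # 5-number summaries (plus mean/std)
print(df[numeric_cols].corr(method="pearson"))  # Pearson's r between pairs

df[numeric_cols].plot(kind="box", subplots=True)  # boxplots
df[numeric_cols].hist(bins=30)                    # histograms

plt.figure()
plt.scatter(df["timestamp"], df["rpm"], s=5)      # value over time
sns.pairplot(df[numeric_cols])                    # scatterplot matrix
plt.show()
```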

Categorical Data

  • Barchart of number of observations per category

  • 2-way frequency tables to compare distributions of observations among categorical values

  • Chi-Squared Test to test for independence between categorical variables
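And a matching sketch for categorical columns, using pandas and SciPy (again, the column names are hypothetical):

```python
import pandas as pd
from scipy.stats import chi2_contingency

df = pd.read_csv("data/clean.csv")  # hypothetical dataset

# Bar chart of observations per category
df["machine_type"].value_counts().plot(kind="bar")

# Two-way frequency table comparing two categorical variables
table = pd.crosstab(df["machine_type"], df["shift"])
print(table)

# Chi-squared test of independence between the two variables
chi2, p_value, dof, expected = chi2_contingency(table)
print(f"chi2 = {chi2:.2f}, p = {p_value:.4f}")
```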

Ordinal Data

Ordinal data can often be explored with the same techniques as categorical data, with the added benefit that the ordering allows rank-based summaries such as medians and rank correlations (e.g. Spearman’s rho).

Many data platforms have developed automation around EDA, and I will list a few R and Python packages that do the same. Remember to go beyond the automated tools and look for things such as patterns within groups and patterns over periods of time.

Python Packages for Auto-EDA

Popular options include ydata-profiling (formerly pandas-profiling), Sweetviz, and D-Tale.

R Packages for Auto-EDA

Popular options include DataExplorer, skimr, and GGally.
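For example, ydata-profiling can generate a broad first-pass report in a couple of lines; the dataset path here is hypothetical:

```python
import pandas as pd
from ydata_profiling import ProfileReport  # pip install ydata-profiling

df = pd.read_csv("data/clean.csv")  # hypothetical dataset
report = ProfileReport(df, title="Initial Data Understanding Profile")
report.to_file("eda_profile.html")
```

Treat the report as a starting point: it won't surface group-level patterns or time-based behavior on its own.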

Clarify Legal, Regulatory, and Ethical Considerations

When working with data, there are a few things to keep in mind to make sure you’re doing it legally, ethically, and in compliance with regulations.


You’ll want to make sure you’re handling personal data in a way that follows laws like the GDPR and CCPA. If you are working with health or medical data, you may need to comply with HIPAA. Oh, and don’t forget to check whether there are any regulations specific to the industry you’re working in, such as ITAR compliance regulations for United States defense-related technical data.

Also, keep data secure and set up a good system for maintaining datasets. In some cases, the location of certain types of data is auditable, so it’s important not to scatter intermediary datasets all over your hard drive. Additionally, you should have a clear policy in place for how you’ll use data and how long you’ll keep it, and make sure you’re clear with stakeholders and data owners about what you’re doing.

With ethics in mind, be aware of any potential biases that could cause problems. Consider biases introduced during the collection of the data. Also, evaluate what biases could be introduced by your method of sampling the data.

Here’s a checklist to run through with the data owner or SME:

  • Data Privacy concerns

  • Data Security precautions

  • Data Accuracy issues

  • Sources of Data Bias

  • Data Governance Policies including retention and disposal

  • Laws and Regulations

Update Documentation

If your organization has a standard method of documentation, use that to track notes and findings from the data understanding phase. Try to find a method of documentation that is flexible because you will return to add notes again and again. By flexible, I just mean something that can be added to without getting too messy. A running Word document is okay, but probably not enough.

You will need code snippets, charts & graphs, notes from conversations, etc. You will likely have multiple documents:

  • A Word document or spreadsheet containing a list of data sources, SME contact info, data owners, access info, high-level notes, and the location of data documentation
  • Data Matrices
  • Code notebook or script containing steps (with comments!) used to clean the data
  • Word document containing dated notes from conversations with SMEs (this can be organized into an FAQ or Wiki)
  • EDA Code notebook or dashboard with charts, graphs, and analysis of initial impressions

Two excellent standards of dataset documentation are Datasheets for Datasets and Nutrition Labels for datasets.

Revisiting Data Understanding

The Data Understanding phase of CRISP-DM is one that we need to come back to again and again. Honestly, we will constantly be learning new quirks and details of the dataset as we go. We may uncover some hidden bias during the model evaluation phase, or we may realize that the method of data generation has been a limiting factor in how accurate our model can ever be.

CRISP-DM is meant to be an iterative, circular process. It’s important to communicate to the Project Manager and stakeholders that you will continue to come back and add to the documentation of this phase as you go.

Conclusion

Phase 2 of CRISP-DM: Data Understanding, is foundational to our success in achieving the business goals established in Phase 1. We need to document, clarify, revisit, and share our findings from the data understanding phase. The work we do here sets us up for being able to deliver a high-quality model, API, dashboard, report, or whatever is the expected deliverable.

Trust me when I emphasize that the time spent in the data understanding phase is well worth it and will directly affect the result of this project and any projects that are built upon it. In the next Phase, we will start prepping our data!
