Introduction

Data Preparation is the third phase of CRISP-DM and the not-so-secret sauce to successful data mining projects. It’s well known and often reported that organizations have messy data. Data science courses often focus on pre-prepped toy datasets, but in the “real world,” the data can be, well, just plain scary. Many hours are spent cleaning and bemoaning dirty data.

In 2016, Forbes reported that data professionals spend around 60% of their time cleaning data. Luckily, this number seems to have declined over the years, probably thanks to more automation and better tooling. Anaconda conducts an annual survey of data professionals and has been asking the same question year after year: “How do Data Scientists Spend their Time?” In 2022, they sampled nearly 2,000 data professionals and found that they spend about 38% of their time on data preparation and data cleaning tasks. A chunk of time is spent on reporting and visualizing data, while model selection and model training only account for about 18% of their time. This is down from about 45% of their time in 2020 (when data loading and data preparation were counted together).

A bar chart shows results of 2022 State of Data Science Survey question. "We asked our respondents how much time they spend on the above tasks, and for each item they entered a number reflecting the percentage of time spent relative to the other options. This is the average of the reported percentages." 22% Data Preparation, 16% Data Cleaning, 16% Reporting and Presentation, 13% Data Visualization, 9% Model Selection, 9% Model Training, 9% Deploying Models, and 7% Other.


This post is the third in a series of posts on CRISP-DM. If you haven’t yet, feel free to check out the other posts in this series.

Why is Data Preparation a Key to Success?

Data Preparation is a key to success because having a clean, robust, well-selected training set will

  1. Prevent unintended biases in your dataset from poorly selected training data,
  2. Simplify the modeling process through the inclusion of well-thought-out feature design,
  3. Aid in reproducibility through a well-documented data cleaning/preparation pipeline,
  4. Save you time in the future since your data will be well-prepped for future iterations and analysis and
  5. Increase stakeholders’ confidence in your analysis because you have meticulously understood and accounted for the gaps in your dataset.

We prep data not only out of necessity but also to ensure the quality and integrity of the final analysis. As they say, “Garbage in, garbage out”, which can be taken to mean that if you use garbage data to create a model, your model will be garbage.

How can you reduce the time spent in the Data Cleaning and Preparation Phase?

There are two great ways to reduce time spent in the data wrangling phase. The first is to spend adequate time in the first two phases of CRISP-DM: Business Understanding and Data Understanding. By thoroughly understanding the data and the business problem, you will have a good idea of what issues you will run into, how to mitigate them, and how data issues may affect your final results. You will also understand how data issues arise during data generation, which will lead to a quicker understanding of how to fix or mitigate them downstream.

A screenshot of a Data Preparation recipe within the Dataiku Data Science Studio as an example of a GUI-based data prep tool

Example of a Data Preparation Recipe in Dataiku Data Science Studio

The second way to streamline your data-cleaning process is to use tools. There are some great cleaning tools out there, and many data science platforms have built-in data-cleaning GUIs. For example, Dataiku Data Science Studio gives users the ability to use no/low-code GUIs to clean their data. The user can choose to run cleaning operations using whatever backend computation engines are available, without needing to write code-based pipelines. There are many other tools to automate data-cleaning pipelines, especially in the open-source Python, R, and Apache toolboxes.

What is data preparation?

Data Preparation is the process of selecting, cleaning, merging, formatting, and transforming data in order to use the data in a data product. Data Products can be anything ranging from simple charts or tables to fully automated machine learning pipelines. Typically, data preparation is done using tools that allow for the process to be automated and repeated as needed in a data pipeline.
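As a rough sketch of what “automated and repeated as needed” can look like in practice, here is a minimal scikit-learn Pipeline with made-up column names. It simply chains imputation and scaling so the exact same steps can be re-run on future data; treat it as an illustration, not a recommended recipe.

```python
# A minimal, hypothetical preparation pipeline: fill missing numeric values
# with the median, then standardize each column. Column names are placeholders.
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

prep = Pipeline(steps=[
    ("impute", SimpleImputer(strategy="median")),  # handle missing values
    ("scale", StandardScaler()),                   # put columns on a common scale
])

df = pd.DataFrame({"height_cm": [170, None, 182], "weight_kg": [65, 72, None]})
prepared = prep.fit_transform(df)  # reusable later via prep.transform(new_data)
```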

Data prep by any other name…would still be data prep

Just for fun, here’s a list of synonyms for “Data Preparation” that mean relatively the same thing.

  • Data cleaning
  • Data preprocessing
  • Data wrangling
  • Data transformation
  • Data munging
  • Data conditioning
  • Data processing
  • Data integration
  • Data curation
  • Data preparation pipeline

What are the steps of Data Preparation?

Process diagram of Phase 3: Data Prep

At a high level, we can group data preparation steps into a few groups:

  • Selecting the data sets: which data sets will you use?
  • Cleaning the data: how will you deal with missing and invalid data?
  • Feature engineering: will you create new columns of data?
  • Integrating/merging datasets: will you enrich data with additional information?
  • Formatting: Do you need to prepare unstructured data or normalize any columns?
  • Documentation: How will you ensure that others understand the decisions you made when designing the data preparation pipeline?

These are not ordered steps; we can do them in whatever order makes sense for our data and use case. For example, we might need to clean and merge our data before doing feature engineering, or we might have some features in mind that we create before we merge data together. It really depends, and you might prototype something that the data engineering team later decides to change for the sake of efficiency; you live and you learn.

Selecting Your Datasets

According to Google AutoML Best Practices Documentation, “A well-designed dataset increases the quality of the resulting machine learning model.” The opposite is also true: a bad training set can result in a worthless model.

This step builds upon the data understanding phase. In the second phase of CRISP-DM, you examined available datasets and determined which might be useful to solve the business problem. In the data prep stage, you actually pull your datasets together.

Are you going to pull the entire dataset or a sample?

If you are working with big data, consider your computing capabilities. You may need to create a strategy for how you will sample the data and build a prototype model before running larger experiments. You may need to use a distributed computing tool, like PySpark, to analyze your big data set before selecting your training dataset.
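For instance, assuming your raw data lives in a Spark table (the table name and sample fraction below are placeholders), pulling a reproducible sample for prototyping might look something like this:

```python
# Hypothetical example: draw a reproducible 1% sample from a large Spark table
# for prototyping before running experiments on the full dataset.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sampling-prototype").getOrCreate()

raw = spark.read.table("sensor_readings")     # placeholder table name
sample = raw.sample(fraction=0.01, seed=42)   # random sample with a fixed seed
pdf = sample.toPandas()                       # small enough to explore locally
```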

Some Considerations

  • Consider stratifying your dataset across groups or using random sampling.
  • If you have imbalanced data, how will you ensure you capture enough of all of the relevant groups?
  • If you have temporal data, do you need to grab continuous chunks (or, more formally, sequences) of data?
  • Be sure to document and track your datasets carefully so you don’t have data leakage.

Do you need all of the columns?

Again, if you are pulling from a large dataset, try to be efficient and only pull what you need. During the Data Understanding phase, you may have found columns that have no variation or don’t pertain to your business problem. Leave these out of your dataset so they don’t cost you extra storage and processing power down the road.
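A small example of that idea, with hypothetical file and column names: only read the columns you identified as useful during Data Understanding.

```python
# Hypothetical example: pull only the columns you actually need instead of
# loading the entire wide table into memory.
import pandas as pd

useful_columns = ["machine_id", "timestamp", "temperature", "failure_flag"]
df = pd.read_csv("sensor_data.csv", usecols=useful_columns)
```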

Ethical Considerations when Selecting Data

We need to consider ethics at every step of the modeling process. Here in the data selection step of the data prep phase, we need to consider what biases might exist within our dataset. Have we considered how the result of this model might affect different groups and do we have a good representation of groups within our data selection? This is very important when considering people’s data and the demographics of those people, but it’s also important when thinking about machine data or environmental data. Have we considered the geographic locations of the machines? Have we considered the latitude and longitude of our environmental data? Is our sample data over-representing one group and under-representing another group/make/model/location/period/etc?

It’s a good practice to work efficiently with data, even if you have relatively small data. When working with big data, any operation can be computationally expensive, which translates into energy usage and bandwidth. Be a good steward of energy consumption and a good neighbor on your tech stack: try not to hog resources or use excessive electrical and computing power! Being computationally considerate is just the right thing to do 🙂

Data Cleaning


Okay, so now we are getting into the weeds of the dataset. We need to comb through issues and programmatically determine how to solve them. Some common data cleaning and preparation tasks include

  • Handling missing values, outliers, and duplicate data
  • Identifying and correcting errors or inconsistencies in the data
  • Handling missing or incomplete data

Missing values

Missing values can be handled in a few ways, namely, by (1) removing the rows with missing information, (2) imputing the missing values if the data is continuous, or (3) filling in the value with a constant value.
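A rough pandas sketch of those three options (the columns below are made up); which one is appropriate depends on the pattern and mechanism of the missingness, discussed next.

```python
# Hypothetical illustrations of the three basic strategies for missing values.
import pandas as pd

df = pd.DataFrame({"temperature": [21.0, None, 23.5, None],
                   "eye_color": ["blue", None, "brown", "green"]})

dropped = df.dropna(subset=["temperature"])                    # (1) remove rows
imputed = df.assign(temperature=df["temperature"].fillna(
    df["temperature"].median()))                               # (2) impute a continuous value
constant = df.assign(eye_color=df["eye_color"].fillna("N/A"))  # (3) fill with a constant
```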

An article in the Journal of Big Data describes some approaches to handling missing data. An important consideration the authors mention is “what is the pattern of the missing data?” and “what is the mechanism of the missing data?”

The pattern of missing data

Let’s first consider the pattern of missing data. It can be either Univariate, Monotone, or Non-Monotone. Univariate missingness is rare and occurs when only one variable has missing values. Monotone missingness refers to when one missing value implies another missing value. For example, if you have the variables height, weight, temperature, and sleep hours for participants in a study and some participants drop out over time, you may still have height and weight data while temperature and sleep hours are missing, because those participants are no longer tracking them. Otherwise, the pattern of missingness is non-monotone (or general), and the missingness of one variable does not affect the missingness of another variable (source).

Representation of missing data patterns data. Blue represents observed values; red is missing values.

Mechanisms of missing data

Next, what is the mechanism of the missing data? There are 3 mechanisms: Missing at Random (MAR), Missing Completely at Random (MCAR), and Missing Not at Random (MNAR).

MAR: “There is a systematic relationship between the occurrence of missing values and the observed data, but not the missing data. Whether an observation is missing has nothing to do with the missing values, but it does have to do with the values of an individual’s observed variables. So, for example, if men are more likely to tell you their weight than women, weight is MAR” (source). Or if a machine is starting up or shutting down and some of the sensors stop sending data during the process, those variables would also be MAR.

MCAR: As the name suggests, these values are missing completely at random, with no indication of a systematic reason why.

MNAR: The missingness is related to the unobserved values themselves, or to factors that are not captured in the observed data, so it cannot be fully explained by the data you have. For example, this could be a case where vehicles with a wiring issue are not sending all sensor values.

Why does it matter?

Karen Grace-Martin provides an overview of why and how we need to look at missing values. Understanding the mechanism for missing data is important in order to not introduce bias into your analysis.

Outliers

Outliers deserve careful consideration. The first task is identifying outliers and this can usually be done via boxplots, statistical process control charts, or isolation forest algorithms. Outliers and anomalies are closely related, so it’s important to consider if the outliers provide useful information for your business problem. Are you trying to model “normal” behavior? If so, then you probably want to remove the outliers. Are you trying to detect fraud or rare events like mechanical failures? If so, then this information might be contained within the outliers.
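As a sketch, assuming a single numeric column, you could flag outliers with a simple IQR rule (the logic behind a boxplot) or with scikit-learn's IsolationForest; the toy data and thresholds here are illustrative, not recommendations.

```python
# Hypothetical outlier flagging on one numeric column.
import pandas as pd
from sklearn.ensemble import IsolationForest

s = pd.Series([10, 11, 9, 12, 10, 95])   # toy data with one obvious outlier

# (1) IQR rule, the same logic a boxplot uses
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
iqr_outlier = (s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)

# (2) Isolation forest: -1 marks anomalies, 1 marks normal points
labels = IsolationForest(random_state=0).fit_predict(s.to_frame())
```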

Consider the modeling approach, too. Some modeling approaches are much more sensitive to outliers than others. For example, robust regression methods are much less sensitive to outliers than ordinary linear regression.

Duplicate data

Duplicate data can occur due to human error, software error, or data pipeline error. Similar to analyzing missing values, it could be useful to the upstream process to understand if there is a systematic issue somewhere causing duplicate data (e.g., in the data ingestion process). However, for the sake of the analysis, true duplicates should be removed because they will skew the final analysis.
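A quick example in pandas; what counts as a “true duplicate” (all columns, or just a key subset) is a judgment call for your data.

```python
# Hypothetical de-duplication: count exact duplicates first (they may point to
# an upstream ingestion issue), then drop them before analysis.
import pandas as pd

df = pd.DataFrame({"id": [1, 1, 2], "reading": [0.5, 0.5, 0.7]})
num_dupes = df.duplicated().sum()   # how many exact duplicate rows exist
deduped = df.drop_duplicates()      # keep the first occurrence of each row
```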

Errors or inconsistencies

It’s common for human-generated data to contain errors in text fields. You may need to clean up misspellings or inconsistent use of punctuation, capitalization, etc. For non-human-generated data, you may run into error codes or situations where “bad” data was caused by wiring or battery issues. Careful consideration of these values will help you determine if you can fix them or if you need to remove them altogether.

Missing or incomplete data


Especially in the case of human generated data, you may encounter issues where forms are not fully completed or surveys are only filled out halfway. In the case of machines, you may again have cases where sensors failed and stopped sending data, resulting in missing values.

An important consideration for all of these issues is whether you are going to fix, eliminate, or ignore the issue, and what kind of bias that choice will introduce into your analysis. There are strategies for imputing or replacing missing values, such as replacing them with a measure of central tendency (mean, median, or mode). This is most common for continuous data. For temporal data, this could be done with a rolling window function. Missing values can also be replaced with a static value in some cases. For example, if you have categorical data for eye color, you can replace the missing value with a standard “N/A”.

Or you might have knowledge of an underlying process that led to the missing data. Say you know that data is missing during the start-up event; you could get creative and replace all start-up values with a static value. On that note, you could also use your knowledge of what start-up values look like to create a new column, “Start_Up”, coded 1 for start-up and 0 for regular operation. Now, you have just created a feature!
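A minimal sketch of that idea, assuming a hypothetical sensor column whose values go missing during start-up:

```python
# Hypothetical feature creation from a known missingness pattern: flag rows
# where the sensor is silent (assumed here to mean start-up), then fill them.
import pandas as pd

df = pd.DataFrame({"rpm": [None, None, 1500, 1520]})
df["Start_Up"] = df["rpm"].isna().astype(int)   # 1 = start-up, 0 = regular operation
df["rpm"] = df["rpm"].fillna(0)                 # replace start-up readings with a static value
```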

Feature Engineering

Features are just additional variables or attributes of your data. In tabular data, think of features as additional columns. There may be some necessary features that need to be created. For example, if your data includes longitude and latitude, these columns are treated as plain numbers with no special meaning. If location or distance provides a signal for your problem, you must engineer a feature that provides that information explicitly.
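For example, here is a hedged sketch of turning raw latitude/longitude into an explicit distance feature using the haversine formula (the coordinates and reference point are made up):

```python
# Hypothetical feature: great-circle distance (km) from each record to a
# fixed reference point, computed with the haversine formula.
import numpy as np
import pandas as pd

def haversine_km(lat1, lon1, lat2, lon2):
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = np.sin((lat2 - lat1) / 2) ** 2 + \
        np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371.0 * np.arcsin(np.sqrt(a))   # Earth radius ~6371 km

df = pd.DataFrame({"lat": [40.71, 34.05], "lon": [-74.01, -118.24]})
df["dist_to_hq_km"] = haversine_km(df["lat"], df["lon"], 41.88, -87.63)
```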

Some data types that might require feature engineering:

  • Longitude/Latitude
  • URLs
  • IP addresses
  • Email addresses
  • Phone numbers
  • Other geographic codes (for example, postal codes)
  • Sensor values
  • Engineering units
  • Time series data
  • Granular data at a smaller scale than is being analyzed (i.e., millisecond data analyzed at the 1-minute level of granularity)
  • Unstructured text

The Art of Feature Engineering

Data Professionals and Subject Matter Experts (SMEs) can really let their creativity shine during the feature engineering stage of data preparation. This is a place to get creative and think about what information you can build using the data that you have. Feature engineering can be the key to an amazing model. Here’s a quick run-down of some different features that can be created (a few of these are sketched in code after the list):

  • Geospatial features:
    • Distance between two points
    • Altitude
    • Geofencing groups
    • Zip Code, State, etc.
  • Creating flags
    • Binary values that indicate presence or absence of something
    • E.g., “Is_Adult” or “Is_Minor” flags based on the age column
    • One-hot encoding
  • Creating delta values
    • If the data can be ordered in different ways, you might create a delta value column such as “30_day_delta_weight”, which might be a sliding window value comparing the current weight to the weight 30 days prior
  • Aggregations
    • If you are grouping or resampling your data, you may create aggregated values such as the mean, median, mode, skew, kurtosis, or count of items in each group
  • Rolling Values/Window Functions
    • With temporal data, you might calculate a rolling 1-day or 1-week sum or mean value
  • Interactions
    • multiplying/dividing/adding/subtracting two columns together
  • Transformations
    • square root
    • squared-value
    • inverse
    • scaled
  • Text-based features
    • number of words
    • number of unique words
    • number of characters
    • text subjectivity

Beware of the Curse of Dimensionality


Many AutoML tools have functions that automatically generate interactions between features and some standard transformations of values. However, beware of the curse of dimensionality *dum dum dum*. The curse of dimensionality generally states that the more dimensions (i.e., features, variables, columns) your data has, the more data you need to get reliable results; exponentially more, in fact.

Recall conversations that you had with SMEs during the Data Understanding phase. Did they provide any insights into indexes or derivations of variables that relate to the physics or known processes in the system? These values should be prioritized.

Additionally, once you build up a feature space, you can use feature reduction methods such as forward feature selection or PCA to pick the most useful subset of features.
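As a rough illustration of both approaches, assuming a numeric feature matrix X and target y (generated synthetically here), scikit-learn provides PCA and a forward sequential feature selector:

```python
# Hypothetical feature reduction: keep the top principal components, or
# greedily select a small subset of the original features.
from sklearn.datasets import make_regression
from sklearn.decomposition import PCA
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

X, y = make_regression(n_samples=200, n_features=20, random_state=0)

X_pca = PCA(n_components=5).fit_transform(X)   # compressed representation

selector = SequentialFeatureSelector(
    LinearRegression(), n_features_to_select=5, direction="forward")
X_subset = selector.fit_transform(X, y)        # forward feature selection
```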

Ethics of feature engineering

There are some ethical considerations to make when engineering features for a dataset, especially when working with people data. Depending on the business problem and the use case of the analysis, it may be inappropriate or even illegal to include demographic information about people. There have been cases where no demographic information is included in models, yet the models still show signs of bias because the data that they are trained on is biased. So, it’s important to not only exclude features that may lead to bias, but also to check for signs of bias during model evaluation.

Integrating Datasets


You will need to spend some time combining your datasets together if you have various tables or if you are enriching your data with lookup tables. While this may seem straightforward, it’s an important step to double-check your work. Doing a join incorrectly can result in duplicates appearing, accidentally dropping rows, or unintentionally creating a sparse dataset.
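A hedged example with made-up table and key names; pandas’ validate argument makes a join fail loudly instead of silently duplicating rows:

```python
# Hypothetical join between sensor readings and a machine lookup table. The
# `validate` argument raises an error if the lookup key is unexpectedly duplicated.
import pandas as pd

readings = pd.DataFrame({"machine_id": [1, 2, 2], "temp": [70, 68, 71]})
machines = pd.DataFrame({"machine_id": [1, 2], "plant": ["A", "B"]})

enriched = readings.merge(machines, on="machine_id",
                          how="left", validate="many_to_one")
assert len(enriched) == len(readings)   # sanity check: no rows gained or lost
```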

Importantly, document how you combined the data and why you chose that approach.

Formatting Data

You might come back to this step after the next phase of CRISP-DM, Modeling, but this is where you will transform the data into a format suitable for modeling. This may include normalizing or scaling the data, if necessary, or you may need to convert to standardized units (metric versus imperial). If you are working with unstructured text data, you might actually do some formatting at the beginning of the data cleaning process in order to remove stop words, punctuation, etc.
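For instance, here is a small sketch (with hypothetical columns) of two common formatting steps: scaling a numeric column to [0, 1] and converting imperial units to metric.

```python
# Hypothetical formatting: min-max scale one column and standardize units.
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

df = pd.DataFrame({"pressure": [10.0, 55.0, 32.0],
                   "weight_lb": [150.0, 200.0, 175.0]})

df[["pressure"]] = MinMaxScaler().fit_transform(df[["pressure"]])  # scale to [0, 1]
df["weight_kg"] = df["weight_lb"] * 0.453592                       # imperial -> metric
```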

Documentation

Documenting your data cleaning decisions and processes will be the key to the reproducibility of your work and traceability of the data. Many data platforms offer interfaces that make it very easy to document and visualize the steps taken to create the analytics-ready dataset. However, if you are working strictly in a script or notebook, the burden will be on your shoulders to make your code neat and understandable for anyone to pick up and understand.

In your documentation, consider including descriptions of the input and output data, methods of cleaning the data and rationale, and any decisions you made. In future iterations of the CRISP-DM process, these decision points may be revisited and clear documentation will save lots of time!

Conclusion

So much time is spent on data cleaning because of the deleterious impact of poor data preparation on the accuracy and reliability of the final result. Developing systematic approaches to cleaning data will help reduce the time spent here. Documentation of datasets will go a long way toward saving time in future projects so that data teams don’t need to solve the same problems multiple times. And finally, spending sufficient time understanding the data with the help of SMEs and data owners will pay dividends toward a high-quality final product!

Additional resources for further reading
