In this post, we are going to discuss a basic framework that you can use when working through the Model Building phase of CRISP-DM.
Review of CRISP-DM
CRISP-DM stands for the CRoss Industry Standard Process for Data Mining. It is a 6-phase process model for data mining projects developed in the late 1990s and early 2000s by a consortium of companies funded by the European Union. The goal of the consortium was to develop a standardized process for data mining that could be used across industries.
Fast forward to today, the CRISP-DM model has received only minor updates since it was first developed, yet remains the most cited data science lifecycle model in academic papers and among practitioners.
Overview of data modeling
Data Modeling is the fourth phase of CRISP-DM. This is the point in the project where you fit a mathematical or visual model to the data to accomplish a task, answer a question, or solve a specific problem.
If you’re following the CRISP-DM model, then you will already have a clear idea of the business problem, an understanding of the datasets and how they were generated, and some preprocessed data.
Specifically, you have
✅ Gained a solid understanding of the problem and how it is generally solved
✅ Talked to SMEs about variables and built some initial features in the data prep phase
✅ Explored the dataset and have an understanding of the distribution of the variables
✅ Quantified how much complete data you have and formed some initial ideas about the limitations and biases of the datasets
✅ Verified that the data is accurate, complete, and consistent
Because of the work up to this point, you can have confidence in the results of this phase, and you are ready to build some models!
Steps
In the image above, you can see the original subtasks and outputs for the Modeling Phase of CRISP-DM as outlined in 1999. I’ve deviated slightly from these subtasks in the steps below by adding some more specificity, but generally, the steps are the same.
Here are some general steps if you are building a machine learning model:
- Revisit the stated project goals from Phase 1
- Select the models for experiments based on your data understanding
- Build a baseline model
- Experiment with additional models and complexity
- Track experiments
- Assess the models
These steps can be framed differently if you are building a data visualization or a dashboard.
Iteration
You can still expect some back and forth between the data preparation and model building phases. It’s likely that you will run into some questions and need to go back to the data understanding phase to fill in the gaps.
Even though data professionals come to expect this back and forth, it is one of the reasons that software development methodologies like Scrum and Kanban sometimes fail to support the data science process.
Sometimes, the data needs to be formatted differently to work with a model, e.g. through one-hot encoding, vectorizing text, or discretizing ranges. Other times, you may have follow-up questions about the behavior of certain variables. This is the iterative nature of data science, and the CRISP-DM model allows for this iteration.
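For illustration, here is a minimal sketch of those reformatting steps with pandas and scikit-learn; the column names, text, and bin edges are made up:

```python
# A hypothetical DataFrame with a categorical, a text, and a numeric column.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

df = pd.DataFrame({
    "category": ["a", "b", "a", "c"],
    "description": ["late delivery", "damaged box", "late pickup", "arrived fine"],
    "age": [23, 45, 31, 62],
})

# One-hot encode the categorical column.
encoded = pd.get_dummies(df, columns=["category"])

# Vectorize the free-text column into TF-IDF features.
tfidf = TfidfVectorizer()
text_features = tfidf.fit_transform(df["description"])

# Discretize the numeric column into labeled bins.
df["age_bucket"] = pd.cut(df["age"], bins=[0, 30, 50, 100],
                          labels=["young", "mid", "senior"])
```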
Data Modeling Tasks
When working through the data modeling tasks, consider the following questions:
- What assumptions have been made about the data that are driving your analysis?
- What hypotheses will you be testing?
- What data will be used to test the models? Have you partitioned the data into train/test sets? (This is a commonly used approach in modeling.)
- Are there best practices established for creating train/test sets with the data you are using?
- In supervised learning, make sure to keep a hold-out validation set as well; this set is not used in the model development process and acts as “unseen” data to help evaluate how the model might perform once deployed. (A minimal split sketch follows this list.)
- How many times are you willing to rerun a model with adjusted settings before attempting another type of model?
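If you need to design the partitions yourself, here is a minimal sketch of one common approach using scikit-learn and a synthetic dataset; the split proportions and the naming of the hold-out set are assumptions, and conventions vary by team:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1_000, random_state=42)

# Reserve a hold-out set that is never touched during model development.
X_dev, X_holdout, y_dev, y_holdout = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Split the remaining data into training and test sets for experimentation.
X_train, X_test, y_train, y_test = train_test_split(
    X_dev, y_dev, test_size=0.25, stratify=y_dev, random_state=42
)
```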
1. Revisit project goals
During the Business Understanding Phase, we discussed the goals of the project, including the type of problem we are solving: regression, classification, or clustering. We also identified target variables and use-case-specific data mining goals.
Are we determining the probability that a part will fail within the next 1000 miles? Are we trying to predict monthly sales revenue? Are we trying to gain insights into our customer base?
When designing the experiment, determine how to report on metrics related to the business goals. If the goal is to predict the probability of part failure within the next 1000 miles, do you have enough historical data to create a robust train/test set that will actually contain failures? Will you need to wait for parts to fail to determine the goodness of your model?
2. Select modeling techniques
When making your model selection, consider how the following issues might affect your decision:
- Does the model require the data to be split into test and training sets? Is there a process in place for this or do you need to design the test/train/validation split? If so, try to design representative test/train sets and avoid data leakage.
- Do you have enough data to produce reliable results for a given model? In statistical analysis, you can consider the power of the results. In machine learning, you may need to generate confidence intervals or prediction probabilities to accompany predictions.
- Does the model require a certain level of data quality? Can you meet this level with the current data?
- Are your data the proper type for a particular model?
- What are the underlying assumptions of the chosen modeling technique? Some models require normally distributed data. Most time-series forecasting models require a constant frequency.
- Do you need to balance the data for rare events? This is common in fault prediction, fraud detection, and, more generally, when working with data that contains underrepresented groups. (See the sketch after this list.)
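As an example of the last point, here is a minimal sketch of two common responses to class imbalance, stratified splitting and class weighting, on synthetic data with a made-up 5% positive rate:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic data where the positive class is rare (~5%).
X, y = make_classification(n_samples=5_000, weights=[0.95, 0.05], random_state=0)
print("positive rate:", y.mean())

# Stratify the split so both sets keep the same class proportions.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Weight classes inversely to their frequency instead of resampling.
model = LogisticRegression(class_weight="balanced", max_iter=1000)
model.fit(X_train, y_train)
```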
3. Baseline model or benchmark?
Sometimes the term “baseline model” is confused with “benchmark”. So what is the difference?
A baseline model is a low-complexity model that you use to measure the performance gains of, and potentially justify, a more complex model.
A benchmark is usually used to compare your results to industry standards. In data science, you will often hear about benchmark datasets that are designed to measure model performance for a certain task.
For example, the MNIST dataset is a benchmark for the task of handwritten digit recognition in computer vision. For time series anomaly detection, the NAB (Numenta Anomaly Benchmark) dataset is used as a benchmark to measure the accuracy of various anomaly detection models.
Baseline models are starting points
An important first step is to build a baseline model. This is a low-complexity model that is used to measure performance gains by more complex models. As you iterate through the model-building process, you perform experiments and build from low complexity to high complexity.
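Here is a minimal sketch of what that can look like in practice, using scikit-learn's DummyClassifier as the low-complexity starting point on synthetic data; the choice of logistic regression as the more complex model is just an example:

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1_000, weights=[0.8, 0.2], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

# Baseline: always predict the majority class.
baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
# A slightly more complex candidate model.
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

print("baseline accuracy:", accuracy_score(y_test, baseline.predict(X_test)))
print("logistic accuracy:", accuracy_score(y_test, model.predict(X_test)))
```

If a more complex model cannot clearly beat this kind of floor, the added complexity is hard to justify.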
But, beware of the Complexity Tradeoffs! They go by many names:
- Accuracy-Efficiency trade-off
- Complexity-Interpretability trade-off
- Accuracy-Explainability trade-off
- Bias-Variance trade-off
- etc.
Additionally, adding too much complexity can be computationally inefficient, creating waste in the form of the additional time and energy required to train the model or run inference. In a nutshell, too much complexity risks overfitting, higher cost, and reduced interpretability, so it is often preferred to start with the lowest-complexity model possible.
Key things to keep in mind regarding the complexity of a model (a rough measurement sketch follows this list):
- How much data is required to train?
- How many parameters are included in the model?
- How many computational steps are required to train the model?
- How much computation is required to score new data?
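As a rough illustration (not a rigorous benchmark), the sketch below compares training time, scoring time, and a crude size proxy for a linear model versus a random forest on synthetic data:

```python
import time

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=10_000, n_features=20, random_state=0)

for model in (LogisticRegression(max_iter=1000),
              RandomForestClassifier(n_estimators=200, random_state=0)):
    start = time.perf_counter()
    model.fit(X, y)
    fit_time = time.perf_counter() - start

    start = time.perf_counter()
    model.predict(X)
    score_time = time.perf_counter() - start

    # Crude size proxy: coefficient count for the linear model,
    # total tree nodes for the forest.
    if hasattr(model, "coef_"):
        size = model.coef_.size + model.intercept_.size
    else:
        size = sum(est.tree_.node_count for est in model.estimators_)

    print(f"{type(model).__name__}: fit={fit_time:.2f}s, "
          f"predict={score_time:.2f}s, size~{size}")
```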
Benchmarks are measuring sticks
Unlike baseline models, benchmarks are used to assess models and methods on how well they solve a general task such as digit identification or image recognition. In CRISP-DM, if it is your goal to create a state-of-the-art model to perform well on a general task, then keeping track of benchmarks would be important.
If you need to create or beat an existing benchmark, this should be discussed with stakeholders and documented in the project plan as a stated objective of the project.
Most of the time, companies and applied research projects are not focusing on developing general use case models and are more interested in how their models perform on the specific use case identified in phase one. For specific use cases, benchmarks are typically not useful, though you might be able to take a pre-trained, open-source model and fine-tune it for your use case.
4. Experiment with models & complexity
Plan ahead on how you will test the goodness of your model. Consider factors such as interpretability, accuracy, and efficiency.
The scientific method needs to be part of the model-building process: form a hypothesis about the underlying nature of the data, then build models that test that hypothesis, starting with the baseline model and adding complexity from there.
Evaluate the pre-determined model “goodness” metrics to drive the scientific process.
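Here is a minimal sketch of that progression, stepping from a baseline up in complexity and letting a pre-chosen metric (cross-validated MAE, as an example) drive the comparison:

```python
from sklearn.datasets import make_regression
from sklearn.dummy import DummyRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=2_000, n_features=15, noise=20.0, random_state=1)

models = {
    "baseline (mean)": DummyRegressor(strategy="mean"),
    "linear regression": LinearRegression(),
    "gradient boosting": GradientBoostingRegressor(random_state=1),
}

for name, model in models.items():
    # Cross-validated mean absolute error (scikit-learn reports it negated).
    scores = -cross_val_score(model, X, y, cv=5, scoring="neg_mean_absolute_error")
    print(f"{name}: MAE = {scores.mean():.1f} (+/- {scores.std():.1f})")
```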
5. Track experiments
Make sure to keep track of all of the metrics and hyperparameter information for each version of the model.
As you iterate through different modeling techniques, it is important to keep track of the models, the data used, the parameter settings, and the resulting metrics.
In the past, people used spreadsheets to keep track of these, but there are some open-source and automated tools that you can use now (a minimal MLflow logging sketch follows the list):
- Weights & Biases
- Neptune.ai
- MLflow
- TensorBoard
- and many more.
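As an example, here is a minimal experiment-tracking sketch with MLflow; the experiment name, parameters, and metric are placeholders:

```python
import mlflow
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=2_000, random_state=0)
params = {"n_estimators": 200, "max_depth": 5}

mlflow.set_experiment("part-failure-model")  # hypothetical experiment name
with mlflow.start_run():
    mlflow.log_params(params)
    model = RandomForestClassifier(random_state=0, **params)
    cv_f1 = cross_val_score(model, X, y, cv=5, scoring="f1").mean()
    mlflow.log_metric("cv_f1", cv_f1)
```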
This step is critical for creating reproducible results and for the next step: model evaluation.
6. Assess the models
You can assess the goodness of your model using different evaluation metrics. There are different metrics for different use cases, but they can generally be grouped based on the task: Regression or Classification.
Regression Metrics
In regression, we try to predict a numeric output. There are several statistical measures for quantifying the error of the predictions. Here are a few (computed in the sketch after this list):
- MSE: Mean Squared Error (MSE) is the mean of the squared differences between the predicted and actual values.
- RMSE: Root Mean Squared Error (RMSE) is the square root of the MSE.
- MAE: Mean Absolute Error (MAE) between the predicted and actual values. This value is easy to interpret in business terms. If you are predicting a $12,000 outcome and the MAE is $500, that is easy to talk about with business stakeholders: Is the model good enough?
- MAPE: Mean Absolute Percentage Error (MAPE) is the mean of the absolute percentage errors between the predicted and actual values.
- MSLE: Mean Squared Logarithmic Error (MSLE) is the mean of the squared differences between the logarithms of the predicted and actual values.
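Here is a minimal sketch computing these metrics with scikit-learn; the actual and predicted values are made-up numbers:

```python
import numpy as np
from sklearn.metrics import (
    mean_absolute_error,
    mean_absolute_percentage_error,
    mean_squared_error,
    mean_squared_log_error,
)

y_true = np.array([12000.0, 9500.0, 14200.0, 11000.0])
y_pred = np.array([11500.0, 10100.0, 13800.0, 11600.0])

mse = mean_squared_error(y_true, y_pred)
print("MSE: ", mse)
print("RMSE:", np.sqrt(mse))
print("MAE: ", mean_absolute_error(y_true, y_pred))
print("MAPE:", mean_absolute_percentage_error(y_true, y_pred))
print("MSLE:", mean_squared_log_error(y_true, y_pred))
```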
Classification Metrics
In classification tasks, we predict labels for inputs. There are many ways to measure a classifier’s performance. Here are just a few (see the sketch after this list):
- Accuracy: The ratio of correct predictions to total predictions.
- Precision: The ratio of True Positives to predicted positives, or the accuracy of the positive predictions.
- Recall: The ratio of true positives to all actual positives (true positives plus false negatives). This is sometimes called the True Positive Rate and measures the sensitivity of the model. It is a good measure for assessing performance on an imbalanced dataset.
- F1 score: A harmonic mean of the Precision and Recall. This is a way to assess and compare classification models that takes both precision and recall into account.
- Log-loss: A measure of how close the predicted probabilities are to the actual labels. It ranges from 0 upward; lower values are better, and 0 is a perfect score.
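Here is a minimal sketch of these metrics with scikit-learn, using made-up labels and predicted probabilities and a 0.5 decision threshold:

```python
import numpy as np
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, log_loss)

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_prob = np.array([0.9, 0.2, 0.4, 0.8, 0.1, 0.6, 0.7, 0.3])  # predicted P(class=1)
y_pred = (y_prob >= 0.5).astype(int)                          # threshold at 0.5

print("accuracy: ", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred))
print("recall:   ", recall_score(y_true, y_pred))
print("F1:       ", f1_score(y_true, y_pred))
print("log-loss: ", log_loss(y_true, y_prob))
```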
There are also some visual tools to evaluate classifiers (the sketch after this list computes their inputs):
- Confusion Matrix: A chart that shows the numbers of correct and incorrect labels per class.
- Precision-Recall Curve: A plot of precision versus recall across a range of decision thresholds (or, in ranking settings, across values of k, the number of predictions made).
- AUC-ROC: The Area Under the ROC Curve. The ROC curve shows the classifier’s performance across all possible decision thresholds, and the AUC summarizes it as a single number.
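And here is a minimal, self-contained sketch of the values behind these plots, again with made-up labels and probabilities; the resulting arrays can be handed to a plotting library (or scikit-learn's display helpers) to draw the curves:

```python
import numpy as np
from sklearn.metrics import (confusion_matrix, precision_recall_curve,
                             roc_curve, roc_auc_score)

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_prob = np.array([0.9, 0.2, 0.4, 0.8, 0.1, 0.6, 0.7, 0.3])
y_pred = (y_prob >= 0.5).astype(int)

# Counts of correct and incorrect labels per class.
print("confusion matrix:\n", confusion_matrix(y_true, y_pred))

# Points for the precision-recall and ROC curves across thresholds.
precision, recall, _ = precision_recall_curve(y_true, y_prob)
fpr, tpr, thresholds = roc_curve(y_true, y_prob)
print("AUC-ROC:", roc_auc_score(y_true, y_prob))
```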
Caveats
There are some caveats to using the metrics listed above. To avoid these pitfalls, it is crucial to understand your data and be on the lookout for a few things:
Unbalanced datasets: These datasets have rare events or imbalanced classes. Rare events are common in fraud detection and predictive maintenance. Imbalanced classes are very common in data about people where groups are underrepresented.
Overfitting: Be skeptical when metrics are too good. It is possible that the model is overfitted to the training data, making the metrics look great. The consequence of this is usually poor performance on real data.
Outliers: Outliers can have an impact on your model metrics, but it’s not always necessary or recommended to remove them. Sometimes, there are alternative modeling methods that are robust to outliers that can be used.
Training versus Evaluation Metrics: During training, we use a loss function to optimize the model. Then, we assess the model using evaluation metrics. It’s best to keep these aligned, so that optimizing the training loss also improves the evaluation metric you actually care about.
Model building phase outputs
Each task in the CRISP-DM Model Building phase has recommended outputs:
1. Modeling technique
Document the actual modeling technique or techniques that you have chosen. Include your rationale about why the techniques were chosen.
Include information about any underlying assumptions that are made in order to use the techniques. Did you conduct any statistical hypothesis testing to validate your assumptions? What was your hypothesis about the nature of the data that led to the model choice? Make sure to include the details here.
2. Test design
Describe the plan for training, testing, and evaluating the models. Include the hypotheses that are being tested and the results.
3. Parameter settings
When using a modeling tool, there are typically numerous parameters that can be customized. Keep track of the selected parameter values and explain the reasoning behind each choice.
4. Models
The models are an important output of this phase. These may be saved within a modeling tool, or saved to a file system, but should have a clear naming convention.
5. Model description
Begin developing both technical and non-technical descriptions of the models. The descriptions help promote transparency of the models and aid end-users, stakeholders, and developers in understanding the models.
The descriptions should include the strengths and weaknesses of the models, experiment metrics, and information about any biases or blindspots in the model.
Researchers at Google developed Model Cards to help standardize transparent documentation of machine learning models.
Next step: model evaluation
Once you have performed the experiments and tracked the model metadata, it is time for Model Evaluation, Phase 5 of CRISP-DM.
Get ready to evaluate the goodness of the model, assess how well it solves the business problem, and perform an internal audit of the model.
Conclusion
In this post, we looked at the model building process from a very high level and walked through some specific tasks and considerations that should be made at this step. You should have enough information to make a model development game plan for your project!
Developing the machine learning model during a data science project sometimes seems like the main event, but experience and research show that building ML models can actually be a very small part of the whole project lifecycle.
New tools spring up constantly to help automate time-consuming parts of data science, but automation does not remove the necessity of the time spent in understanding the problem and the data before diving into modeling.