Just like basic statistics, understanding a Data Science Lifecycle (DSLC) is foundational knowledge for Data Scientists, Data Analysts, and Data Team Managers. Here is a brief introduction to the tried and true process of CRISP-DM.

What is CRISP-DM?

CRISP-DM stands for the CRoss Industry Standard Process for Data Mining. It is a standard process for knowledge discovery consisting of 6 phases that can be applied across a wide range of applications. The 6 phases are Business Understanding, Data Understanding, Data Preparation, Modeling, Model Evaluation, and Deployment.

According to KDnuggets polls in 2007 and 2014, CRISP-DM is the most widely used methodology for data science and analytics projects: over 40% of respondents reported using it.

Overview of CRISP-DM

CRISP-DM is designed to improve the speed, efficiency, and accuracy of data analysis by iterating through 6 phases:

  • Business Understanding: Understand the current situation and determine the business goals for the project

  • Data Understanding: Gather data sources, data definitions, talk to subject matter experts, conduct exploratory data analysis and data quality checks

  • Data Preparation: Select, clean, format data and create any features needed

  • Modeling: Select a modeling technique, generate a test design, build the model, and assess its performance

  • Model Evaluation: Evaluate the process and model results in the context of the business problem

  • Deployment: Produce deliverables, develop plan for model monitoring and maintenance
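For readers who like to see structure in code, here is a minimal sketch of the six phases as a Python enum with a one-line reminder for each. The names `CrispDmPhase`, `CHECKLIST`, and `describe` are made up for illustration and are not part of any CRISP-DM tooling; the reminders are paraphrased from the list above.

```python
from enum import Enum


class CrispDmPhase(Enum):
    """The six CRISP-DM phases in their nominal order."""
    BUSINESS_UNDERSTANDING = 1
    DATA_UNDERSTANDING = 2
    DATA_PREPARATION = 3
    MODELING = 4
    MODEL_EVALUATION = 5
    DEPLOYMENT = 6


# Hypothetical checklist: one short reminder per phase.
CHECKLIST = {
    CrispDmPhase.BUSINESS_UNDERSTANDING: "Determine the business goals for the project",
    CrispDmPhase.DATA_UNDERSTANDING: "Gather data, explore it, and check its quality",
    CrispDmPhase.DATA_PREPARATION: "Select, clean, and format data; create features",
    CrispDmPhase.MODELING: "Choose a technique, design the test, build the model",
    CrispDmPhase.MODEL_EVALUATION: "Judge results against the business problem",
    CrispDmPhase.DEPLOYMENT: "Produce deliverables; plan monitoring and maintenance",
}


def describe(phase: CrispDmPhase) -> str:
    """Return a one-line summary such as '4. Modeling: ...'."""
    name = phase.name.replace("_", " ").title()
    return f"{phase.value}. {name}: {CHECKLIST[phase]}"


for phase in CrispDmPhase:
    print(describe(phase))
```

A checklist like this could serve as the skeleton for a project template or a status report, with one section per phase.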

Cyclical Model

The CRISP-DM diagram shown below makes it easy to visualize how the phases flow together:

CRISP-DM Process Diagram

These steps aren’t meant to be followed linearly. As the arrows in the diagram indicate, some of the steps lead sequentially from one to the next, while others loop back to an earlier phase. In this way, it is a cyclical and iterative process.

For example, when starting a new project, it would be normal to spend a significant amount of time iterating between business understanding, data understanding, and data preparation in order to have a firm grasp on the problem and the data.
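To make the iteration concrete, here is a small sketch that logs the phases a project visits and reports where it moved back to an earlier phase. The `history` list is an invented example and `backward_steps` is a hypothetical helper, not something defined by CRISP-DM.

```python
from collections import Counter

# Nominal phase order, as listed in the overview.
PHASES = [
    "Business Understanding",
    "Data Understanding",
    "Data Preparation",
    "Modeling",
    "Model Evaluation",
    "Deployment",
]


def backward_steps(history):
    """Return the (from, to) pairs where the project returned to an earlier phase."""
    order = {name: i for i, name in enumerate(PHASES)}
    return [(a, b) for a, b in zip(history, history[1:]) if order[b] < order[a]]


# Invented project log for illustration only.
history = [
    "Business Understanding",
    "Data Understanding",
    "Business Understanding",  # new questions after a first look at the data
    "Data Understanding",
    "Data Preparation",
    "Data Understanding",      # a data quality issue surfaced during cleaning
    "Data Preparation",
    "Modeling",
]

visits = Counter(history)
print(backward_steps(history))
print(visits["Data Understanding"])
```

In this made-up log the project returns to an earlier phase twice before any modeling starts, which is the kind of early back-and-forth described above.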

Hierarchical Model

The original CRISP-DM 1.0 Guide established the methodology as a hierarchical process model, consisting of sets of tasks described at four levels of abstraction: phase (high-level abstraction), generic task, specialized task, and process instance (low-level abstraction).

Four level breakdown of the CRISP-DM methodology
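One way to picture the four levels is as nested records. The sketch below is only an illustration: the class names and the example entries ("Clean numeric values" and the note about ages) are assumptions chosen for this post, not definitions taken from the guide.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ProcessInstance:
    """Lowest level: a record of what actually happened in one project."""
    note: str


@dataclass
class SpecializedTask:
    """How a generic task is carried out in a specific situation."""
    name: str
    instances: List[ProcessInstance] = field(default_factory=list)


@dataclass
class GenericTask:
    """A task general enough to apply to any data mining project."""
    name: str
    specialized: List[SpecializedTask] = field(default_factory=list)


@dataclass
class Phase:
    """Highest level: one of the six phases."""
    name: str
    tasks: List[GenericTask] = field(default_factory=list)


def outline(phase):
    """Return the hierarchy as indented lines, one indent step per level."""
    lines = [phase.name]
    for task in phase.tasks:
        lines.append("  " + task.name)
        for spec in task.specialized:
            lines.append("    " + spec.name)
            for inst in spec.instances:
                lines.append("      " + inst.note)
    return lines


# Illustrative entries only.
prep = Phase(
    name="Data Preparation",
    tasks=[
        GenericTask(
            name="Clean data",
            specialized=[
                SpecializedTask(
                    name="Clean numeric values",
                    instances=[ProcessInstance("Replaced negative ages with missing values")],
                )
            ],
        )
    ],
)

print("\n".join(outline(prep)))
```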

The model was designed to be a high-level framework that included a strategy for mapping the generic process model to the specialized level. These are the mapping steps provided by the original guide:

  1. Analyze your specific context
  2. Remove any details not applicable to your context
  3. Add any details specific to your context
  4. Specialize generic contents according to concrete characteristics of your context
  5. Rename generic contents to provide more explicit meanings in your context for the sake of clarity
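The remove, add, and rename steps can be sketched as a small function over a list of task names. This is a rough illustration under stated assumptions: `specialize` is a hypothetical helper, the churn project is invented, and step 1 (analyzing your context) stays a human job that simply decides what to pass in.

```python
def specialize(generic_tasks, remove=(), add=(), rename=None):
    """Apply the remove / add / rename steps to a list of generic task names."""
    rename = rename or {}
    dropped = set(remove)
    kept = [t for t in generic_tasks if t not in dropped]  # step 2: remove
    kept.extend(add)                                       # step 3: add
    return [rename.get(t, t) for t in kept]                # steps 4-5: specialize and rename


# Generic Data Preparation tasks; the context below is invented for illustration.
generic = ["Select data", "Clean data", "Construct data", "Integrate data", "Format data"]

churn_tasks = specialize(
    generic,
    remove=["Integrate data"],           # assume a single source table
    add=["Anonymize customer IDs"],      # assume a privacy requirement
    rename={"Construct data": "Engineer churn features"},
)

print(churn_tasks)
```

Keeping the generic list untouched and deriving the specialized list from it makes it easy to see exactly what a team changed for its own context.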

Many teams have used this mapping strategy and published extensions to the CRISP-DM model. These extensions provide practitioners with tested best practices for data mining in various industries!

History of CRISP-DM

The CRoss Industry Standard Process for Data Mining was a result of special funding by the European Commission (EC) in the late 90s. The program’s objective was to establish a standard process for data mining, as the name suggests.

The project statement noted that the rise in High-Performance Computing and problems with interpreting vast amounts of data had led to a need for a process of knowledge discovery that was fast, well-understood, reliable, and valid across a wide range of applications.

Industry Need

To solve the problem, the EC formed a special interest group to broaden the basis for development and testing without sacrificing the efficiency and effectiveness of a small, tightly-focused consortium. The special interest group would also help facilitate the dissemination and exploitation of the results. 

Business Opportunity

The vision was that data warehouse vendors and data mining tool suppliers could exploit the process model to enhance their product and service offerings. The user partners could exploit the results of the project internally to improve business intelligence and decision-making.

The EC saw the business potential and benefit of developing a standardized process for the industry. “[CRISP-DM] will make large data mining projects faster, more efficient, more reliable, more manageable, and less costly. A widely adopted process should foster the development of a multitude of data mining tools which support it, thereby significantly contributing to promoting a profitable use of HPCN technology,” according to the project’s stated objectives.

Present day CRISP-DM

Over two decades later, CRISP-DM has stood the test of time. As the field of data science has evolved, the process has been expanded to encompass areas such as data engineering, data analytics, and machine learning engineering.

Additionally, there are a number of published extensions of the model. Many researchers and practitioners in various industries have systematically analyzed their processes and published their best practices in the form of CRISP-DM extensions!

While there have been many specializations of the model, others have built upon it and revised it completely: DASC-PM, TDSP, etc. Yet, CRISP-DM remains the most cited data mining process model in academic papers and by professionals.

Conclusion

CRISP-DM is a standardized process for solving real-world problems with data. It is a simple process made up of 6 phases and you can iterate through these phases in the order that makes sense for your project.

There are many ways to customize it to any particular use case and many published examples to learn from. It’s the best way for small teams and individual data scientists to structure end-to-end projects to deliver value in a systematic way.
