
Data Cleaning and Wrangling Guide

Essential techniques and best practices for preparing ready-to-use data, with implementation examples in Google Sheets, Microsoft Excel, Python, and R.

Recommended Order

When it comes to data cleaning and wrangling, sequence matters. Some techniques are essential for creating a structured, consistent dataset, while others build on that foundation. The steps should not be performed in an arbitrary order: certain techniques must precede others to prevent unintended issues and make the data truly ready for analysis. In general, the recommended order is as follows:

  1. Data Standardization and Consistency Adjustments
  2. Handling Duplicates
  3. Handling Missing Values
  4. Addressing Outliers
  5. Data Wrangling Techniques

Why This Order?

Data cleaning techniques that focus solely on standardization or formatting adjustments (without removing or imputing values) should take precedence over those that alter the information contained in the dataset. This ordering establishes consistency and accuracy before more invasive cleaning operations. The techniques that generally come first in the cleaning process include the following (a short pandas sketch follows the list):

  1. Correcting Data Types
  2. Standardizing Formats
  3. Fixing Inconsistent Values
  4. Removing or Replacing Irrelevant Characters
  5. Unit Conversion
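
To make these formatting-first steps concrete, here is a minimal Python (pandas) sketch. The column names and sample fixes are hypothetical, chosen only to illustrate each numbered technique:

```python
import pandas as pd

# Hypothetical raw data with common standardization problems
df = pd.DataFrame({
    "order_date": ["2023-01-05", "01/07/2023", "2023/01/09"],
    "price": ["$1,200", "$950", "$1,075"],
    "state": ["NY", "new york", "N.Y."],
    "weight_lb": [12.0, 8.5, 10.2],
})

# 1. Correct data types: parse mixed date strings into datetimes
#    (format="mixed" requires pandas >= 2.0)
df["order_date"] = pd.to_datetime(df["order_date"], format="mixed")

# 2./4. Standardize formats and strip irrelevant characters:
#       remove "$" and "," so price can become numeric
df["price"] = df["price"].str.replace(r"[$,]", "", regex=True).astype(float)

# 3. Fix inconsistent values: collapse variant spellings to one label
df["state"] = df["state"].str.upper().str.replace(".", "", regex=False)
df["state"] = df["state"].replace({"NEW YORK": "NY"})

# 5. Unit conversion: pounds to kilograms (1 lb = 0.45359237 kg)
df["weight_kg"] = df["weight_lb"] * 0.45359237
```

Note that none of these operations delete or impute observations; they only make existing values consistent, which is why they can safely run first.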

Applying these initial steps ensures that the data is consistently structured, providing a strong foundation for addressing more complex issues like duplicates, missing values, and outliers. Once the dataset is consistent and well formatted, techniques that remove or impute values can follow in this order (each step is illustrated with a short pandas sketch after the list):

  1. Duplicates (First Priority)
    Duplicates represent repeated observations that can skew results. Removing them early ensures that each observation is unique, preventing data from being misrepresented in subsequent steps.
  2. Missing Values (Second Priority)
    After duplicates are removed, you can address missing values to make the dataset as complete as possible before dealing with outliers. Missing values can affect outlier detection and imputation strategies, so it is helpful to address them before examining data distributions.
  3. Outliers (Third Priority)
    Outliers are best addressed last because accurate detection requires a dataset that is free of duplicates and as complete as possible. Residual duplicates or missing values can distort outlier thresholds and lead to misclassification.
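
The sketches below show one possible pandas implementation of each step; the column names and sample values are hypothetical. First, duplicates:

```python
import pandas as pd

df = pd.DataFrame({
    "id": [1, 2, 2, 3],
    "score": [88, 92, 92, 75],
})

# Count fully identical rows before removing anything
print(df.duplicated().sum())  # -> 1

# Keep the first occurrence of each repeated observation;
# pass subset=["id"] to deduplicate on selected columns instead
df = df.drop_duplicates(keep="first").reset_index(drop=True)
```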
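
Next, missing values. Median imputation is used here because it is robust to extreme values, which matters since outliers have not been handled yet; this is an illustrative choice, not the only valid strategy:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age": [34, np.nan, 29, 41],
    "income": [52000, 61000, np.nan, 48000],
    "notes": ["ok", None, "ok", "ok"],
})

# Quantify missingness per column before choosing a strategy
print(df.isna().sum())

# Impute numeric columns with the median (robust to outliers)
for col in ["age", "income"]:
    df[col] = df[col].fillna(df[col].median())

# Drop rows where a column that cannot be imputed is still missing
df = df.dropna(subset=["notes"])
```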
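
Finally, outliers, using the common interquartile-range (IQR) rule as one example of a detection method. Flagging rather than deleting keeps the decision reviewable:

```python
import pandas as pd

df = pd.DataFrame({"income": [48000, 52000, 55000, 61000, 300000]})

# 1.5 * IQR fences: a common rule of thumb, not the only definition
q1, q3 = df["income"].quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Flag values outside the fences instead of deleting them outright
df["income_outlier"] = ~df["income"].between(lower, upper)
```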