
Data Cleaning and Wrangling Guide

Essential techniques and best practices for preparing ready-to-use data, with implementation examples in Google Sheets, Microsoft Excel, Python, and R.

Recommended Order

When it comes to data cleaning and wrangling, sequence matters. Some techniques are essential for creating a structured, consistent dataset, while others build on that foundation. The steps should not be performed in an arbitrary order: certain techniques must precede others to prevent unintended issues and make the data truly ready for analysis. In general, the recommended order is as follows:

  1. Data Standardization and Consistency Adjustments
  2. Handling Duplicates
  3. Handling Missing Values
  4. Addressing Outliers
  5. Data Wrangling Techniques

Why This Order?

Data cleaning techniques that focus solely on standardization or formatting adjustments (without removing or imputing values) should take precedence over those that alter the information contained in the dataset. This ordering establishes consistency and accuracy before more invasive cleaning operations. The techniques that generally come first in the cleaning process include the following (a short pandas sketch follows the list):

  1. Correcting Data Types
  2. Standardizing Formats
  3. Fixing Inconsistent Values
  4. Removing or Replacing Irrelevant Characters
  5. Unit Conversion
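
To make these formatting-first steps concrete, here is a minimal Python (pandas) sketch. The column names and sample fixes are hypothetical, chosen only to illustrate each numbered technique:

```python
import pandas as pd

# Hypothetical raw data with common standardization problems
df = pd.DataFrame({
    "order_date": ["2023-01-05", "01/07/2023", "2023/01/09"],
    "price": ["$1,200", "$950", "$1,075"],
    "state": ["NY", "new york", "N.Y."],
    "weight_lb": [12.0, 8.5, 10.2],
})

# 1. Correct data types: parse mixed date strings into datetimes
#    (format="mixed" requires pandas >= 2.0)
df["order_date"] = pd.to_datetime(df["order_date"], format="mixed")

# 2./4. Standardize formats and strip irrelevant characters:
#       remove "$" and "," so price can become numeric
df["price"] = df["price"].str.replace(r"[$,]", "", regex=True).astype(float)

# 3. Fix inconsistent values: collapse variant spellings to one label
df["state"] = df["state"].str.upper().str.replace(".", "", regex=False)
df["state"] = df["state"].replace({"NEW YORK": "NY"})

# 5. Unit conversion: pounds to kilograms (1 lb = 0.45359237 kg)
df["weight_kg"] = df["weight_lb"] * 0.45359237
```

Note that none of these operations delete or impute observations; they only make existing values consistent, which is why they can safely run first.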

Applying these initial steps ensures that the data is consistently structured, providing a strong foundation for addressing more complex issues like duplicates, missing values, and outliers. Once the dataset is consistent and well formatted, techniques that remove or impute values can follow in this order (each step is illustrated with a short pandas sketch after the list):

  1. Duplicates (First Priority)
    Duplicates represent repeated observations that can skew results. Removing them early ensures that each observation is unique, preventing data from being misrepresented in subsequent steps.
  2. Missing Values (Second Priority)
    After duplicates are removed, you can address missing values to make the dataset as complete as possible before dealing with outliers. Missing values can affect outlier detection and imputation strategies, so it is helpful to address them before examining data distributions.
  3. Outliers (Third Priority)
    Outliers are best addressed last because accurate detection requires a dataset that is free of duplicates and as complete as possible. Residual duplicates or missing values can distort outlier thresholds and lead to misclassification.
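
The sketches below show one possible pandas implementation of each step; the column names and sample values are hypothetical. First, duplicates:

```python
import pandas as pd

df = pd.DataFrame({
    "id": [1, 2, 2, 3],
    "score": [88, 92, 92, 75],
})

# Count fully identical rows before removing anything
print(df.duplicated().sum())  # -> 1

# Keep the first occurrence of each repeated observation;
# pass subset=["id"] to deduplicate on selected columns instead
df = df.drop_duplicates(keep="first").reset_index(drop=True)
```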
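
Next, missing values. Median imputation is used here because it is robust to extreme values, which matters since outliers have not been handled yet; this is an illustrative choice, not the only valid strategy:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age": [34, np.nan, 29, 41],
    "income": [52000, 61000, np.nan, 48000],
    "notes": ["ok", None, "ok", "ok"],
})

# Quantify missingness per column before choosing a strategy
print(df.isna().sum())

# Impute numeric columns with the median (robust to outliers)
for col in ["age", "income"]:
    df[col] = df[col].fillna(df[col].median())

# Drop rows where a column that cannot be imputed is still missing
df = df.dropna(subset=["notes"])
```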
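
Finally, outliers, using the common interquartile-range (IQR) rule as one example of a detection method. Flagging rather than deleting keeps the decision reviewable:

```python
import pandas as pd

df = pd.DataFrame({"income": [48000, 52000, 55000, 61000, 300000]})

# 1.5 * IQR fences: a common rule of thumb, not the only definition
q1, q3 = df["income"].quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Flag values outside the fences instead of deleting them outright
df["income_outlier"] = ~df["income"].between(lower, upper)
```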