Data Cleaning and Wrangling Guide

Essential techniques and best practices for preparing ready-to-use data, with implementation examples in Google Sheets, Microsoft Excel, Python, and R.

Reshaping Data

Reshaping data involves changing the structure or layout of your dataset, making it easier to work with during analysis. This can include pivoting data from long to wide format (or vice versa), splitting or combining columns, and more.

Pivoting Data

Rearranging the columns and rows of your data can help you view it from different perspectives. 

For example, consider this clinical trials data from a study testing the effectiveness of three different drugs (Drug A, Drug B, and Drug C) on patients with a specific medical condition. The dataset records information about patients' responses to treatment across various age groups, treatment groups, and outcomes. The data is in a "long" format, where each row represents a single patient’s response to a treatment.

Patient ID  Age Group  Treatment  Response
P1          18-30      Drug A     Positive
P2          31-50      Drug B     Negative
P3          51-65      Drug A     Positive
P4          18-30      Drug C     Negative
P5          31-50      Drug B     Positive
P6          51-65      Drug C     Positive

Pivoting the data allows you to summarize patient responses to different treatments based on age group and treatment type in a "wide" format. By pivoting, you can easily compare the effectiveness of each drug across age groups and identify patterns in patient outcomes.

Python

# Load the pandas package
import pandas as pd

# Create the dataset
data = {
    'Patient_ID': ['P1', 'P2', 'P3', 'P4', 'P5', 'P6'],
    'Age Group': ['18-30', '31-50', '51-65', '18-30', '31-50', '51-65'],
    'Treatment': ['Drug A', 'Drug B', 'Drug A', 'Drug C', 'Drug B', 'Drug C'],
    'Response': ['Positive', 'Negative', 'Positive', 'Negative', 'Positive', 'Positive']
}
df = pd.DataFrame(data)

# Create a new column with 1 to indicate occurrences
df['Count'] = 1

# Pivot the data
pivoted_data = df.pivot_table(
    index='Age Group',
    columns=['Treatment', 'Response'],
    values='Count',
    aggfunc='sum',
    fill_value=0
)

# Display the final pivoted data
print(pivoted_data)

R

# Load the tidyverse package
library(tidyverse)

# Create the dataset
data <- data.frame(
  Patient_ID = c('P1', 'P2', 'P3', 'P4', 'P5', 'P6'),
  Age_Group = c('18-30', '31-50', '51-65', '18-30', '31-50', '51-65'),
  Treatment = c('Drug A', 'Drug B', 'Drug A', 'Drug C', 'Drug B', 'Drug C'),
  Response = c('Positive', 'Negative', 'Positive', 'Negative', 'Positive', 'Positive')
)

# Pivot and count the positive/negative responses for each treatment and age group
pivoted_data <- data %>%
  # Create a new variable with 1 for each row to count occurrences
  mutate(count = 1) %>%
  # Pivot the data wider, creating columns for each Treatment and Response combination
  pivot_wider(names_from = c(Treatment, Response), 
              values_from = count, 
              values_fill = list(count = 0)) %>%
  # Group by Age Group and summarize the results by summing up counts
  group_by(Age_Group) %>%
  summarize(across(-Patient_ID, sum)) # Exclude Patient_ID from the sum operation

# Display the final pivoted data
print(pivoted_data)

Result:
Age Group  Drug A (Positive)  Drug A (Negative)  Drug B (Positive)  Drug B (Negative)  Drug C (Positive)  Drug C (Negative)
18-30      1                  0                  0                  0                  0                  1
31-50      0                  0                  1                  1                  0                  0
51-65      1                  0                  0                  0                  1                  0
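
Pivoting also works in the reverse direction (wide back to long). As a minimal sketch using a small hypothetical wide table (not the clinical data above), pandas' melt() gathers a set of columns into rows:

# Load the pandas package
import pandas as pd

# Hypothetical wide table: one row per patient, one column per visit
wide = pd.DataFrame({
    'Patient_ID': ['P1', 'P2'],
    'Visit_1': [120, 130],
    'Visit_2': [118, 128]
})

# melt() stacks the visit columns into long (variable, value) pairs
long_format = wide.melt(id_vars='Patient_ID', var_name='Visit', value_name='Measurement')
print(long_format)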

Splitting and Combining Columns

In some cases, splitting columns helps separate meaningful components from a single string, making it easier to analyze specific parts of the data.

For example, let's consider this data with an address column in it:

Patient ID  Address
P1          123 Main St, Stony Brook, NY, 11790
P2          123 Main St, Setauket-East Setauket, NY, 11733
P3          123 Main St, Port Jefferson, NY, 11777

You can split the address column into four columns: street address, city, state, and ZIP code. This makes it possible to sort or categorize the data by city, state, or ZIP code, which you could not easily do in the original format.

Python

# Load the pandas package
import pandas as pd

# Create the dataset
data = {
    'Patient ID': ['P1', 'P2', 'P3'],
    'Address': ['123 Main St, Stony Brook, NY, 11790', 
                '123 Main St, Setauket-East Setauket, NY, 11733', 
                '123 Main St, Port Jefferson, NY, 11777']
}
df = pd.DataFrame(data)

# Split the address column into four new columns (this assumes every address has exactly four comma-separated parts)
df[['Street Address', 'City', 'State', 'ZIP Code']] = df['Address'].str.split(', ', expand=True)

# Display the result
print(df)

R

# Load the tidyverse package
library(tidyverse)

# Create the dataset
data <- data.frame(
  Patient_ID = c('P1', 'P2', 'P3'),
  Address = c('123 Main St, Stony Brook, NY, 11790', 
              '123 Main St, Setauket-East Setauket, NY, 11733', 
              '123 Main St, Port Jefferson, NY, 11777')
)

# Split the address column (this assumes every address has exactly four comma-separated parts)
data <- data %>%
  separate(Address, into = c("Street_Address", "City", "State", "ZIP"), sep = ", ")

# Display the result
print(data) 

Result:
Patient ID  Street Address  City                    State  ZIP Code
P1          123 Main St     Stony Brook             NY     11790
P2          123 Main St     Setauket-East Setauket  NY     11733
P3          123 Main St     Port Jefferson          NY     11777

In other cases, the opposite is useful: combining multiple columns into one, especially when creating summary data for reports or simplifying data views for presentation.

Let's combine the previously split columns into a single address column:

Python

# Combine the columns back into a single Address column
df['Full Address'] = df['Street Address'] + ', ' + df['City'] + ', ' + df['State'] + ', ' + df['ZIP Code']

# Drop the split columns (optional)
df = df.drop(columns=['Street Address', 'City', 'State', 'ZIP Code'])

# Display the result
print(df)

R

# Combine the columns back into a single Full_Address column
data <- data %>%
  unite(Full_Address, Street_Address, City, State, ZIP, sep = ", ", remove = TRUE) # Set remove = FALSE to keep the split columns

# Display the result
print(data) 

Result:
Patient ID  Full Address
P1          123 Main St, Stony Brook, NY, 11790
P2          123 Main St, Setauket-East Setauket, NY, 11733
P3          123 Main St, Port Jefferson, NY, 11777

Merging and Joining Datasets

Merging datasets is essential when working with data from multiple sources. This technique allows you to combine information based on common fields, such as IDs or dates. There are several types of merges:

  1. Inner Join: Returns rows with matching values in both datasets.
  2. Left Join: Returns all rows from the left dataset and matching rows from the right dataset.
  3. Right Join: Returns all rows from the right dataset and matching rows from the left dataset.
  4. Outer Join: Returns all rows from both datasets, filling in missing values where there is no match.

For example, let's consider these two datasets for an e-commerce store. One dataset contains order information, and the other contains customer data.

Dataset 1: Order Data

order_id  customer_id  order_date  total_amount
101       1            2023-10-26  100
102       2            2023-10-27  50
103       3            2023-10-28  75
104       1            2023-10-29  120
105       4            2023-10-30  80

Dataset 2: Customer Data

customer_id  customer_name  city
1            Alice          New York
2            Bob            Los Angeles
3            Charlie        Chicago
5            David          San Francisco
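
Before running the join examples below, the two tables need to exist as data frames named order_data and customer_data. Here is a minimal setup in Python; the R examples assume equivalent data.frame objects with the same names and columns:

# Load the pandas package
import pandas as pd

# Order data (Dataset 1)
order_data = pd.DataFrame({
    'order_id': [101, 102, 103, 104, 105],
    'customer_id': [1, 2, 3, 1, 4],
    'order_date': ['2023-10-26', '2023-10-27', '2023-10-28', '2023-10-29', '2023-10-30'],
    'total_amount': [100, 50, 75, 120, 80]
})

# Customer data (Dataset 2)
customer_data = pd.DataFrame({
    'customer_id': [1, 2, 3, 5],
    'customer_name': ['Alice', 'Bob', 'Charlie', 'David'],
    'city': ['New York', 'Los Angeles', 'Chicago', 'San Francisco']
})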

The two datasets can be merged based on the shared column (customer_id) in four different ways:

Inner Join (keeps only matching records)

An inner join would only return rows where the customer_id exists in both the first (i.e., Order) and second (i.e., Customer) datasets. This would exclude:

  • Orders made by unregistered customers (e.g., Order 105).
  • Customers who haven't placed any orders (e.g., David).

Python

import pandas as pd
inner_join = pd.merge(order_data, customer_data, on='customer_id', how='inner')
print(inner_join)

R

library(tidyverse)
inner_join <- inner_join(order_data, customer_data, by = "customer_id")
print(inner_join)

Result:

order_id  customer_id  order_date  total_amount  customer_name  city
101       1            2023-10-26  100           Alice          New York
104       1            2023-10-29  120           Alice          New York
102       2            2023-10-27  50            Bob            Los Angeles
103       3            2023-10-28  75            Charlie        Chicago

Left Join (keeps all records from the order dataset)

A left join will keep all records from the order dataset, even if there is no matching customer information. This is useful if you are focused on analyzing orders and want to see if there are any orders from customers who are not registered in your system (for example, guest checkouts or unregistered customers).

Python

import pandas as pd
left_join = pd.merge(order_data, customer_data, on='customer_id', how='left')
print(left_join)

R

library(tidyverse)
left_join <- left_join(order_data, customer_data, by = "customer_id")
print(left_join)

Result:

order_id  customer_id  order_date  total_amount  customer_name  city
101       1            2023-10-26  100           Alice          New York
102       2            2023-10-27  50            Bob            Los Angeles
103       3            2023-10-28  75            Charlie        Chicago
104       1            2023-10-29  120           Alice          New York
105       4            2023-10-30  80            NaN            NaN

Right Join (keeps all records from the customer dataset)

A right join will keep all records from the Customer Information dataset, including customers who haven’t placed any orders yet. This can be valuable if you want to track customers who are registered but haven't made any purchases, allowing for customer re-engagement efforts.

Python

import pandas as pd
right_join = pd.merge(order_data, customer_data, on='customer_id', how='right')
print(right_join)

R

library(tidyverse)
right_join <- right_join(order_data, customer_data, by = "customer_id")
print(right_join)

Result:

order_id  customer_id  order_date  total_amount  customer_name  city
101       1            2023-10-26  100           Alice          New York
104       1            2023-10-29  120           Alice          New York
102       2            2023-10-27  50            Bob            Los Angeles
103       3            2023-10-28  75            Charlie        Chicago
NaN       5            NaN         NaN           David          San Francisco

Outer Join (keeps all records from both datasets)

An outer join will include all records from both datasets, even if there are no matching records in the other dataset. This is valuable for a comprehensive overview, showing both unregistered customers and customers who haven’t placed orders.

Python

import pandas as pd
outer_join = pd.merge(order_data, customer_data, on='customer_id', how='outer')
print(outer_join)

R

library(tidyverse)
outer_join <- full_join(order_data, customer_data, by = "customer_id")
print(outer_join)

Result:

order_id  customer_id  order_date  total_amount  customer_name  city
101       1            2023-10-26  100           Alice          New York
104       1            2023-10-29  120           Alice          New York
102       2            2023-10-27  50            Bob            Los Angeles
103       3            2023-10-28  75            Charlie        Chicago
105       4            2023-10-30  80            NaN            NaN
NaN       5            NaN         NaN           David          San Francisco

As you may have noticed, apart from an inner join, all of the other merge types can produce rows with missing values (shown as NaN in the pandas output above; R marks them as NA). Therefore, it is important to choose the most appropriate join for your specific needs.
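
If you need to audit where each row in a merged result came from, pandas' merge() also accepts indicator=True, which appends a _merge column marking every row as both, left_only, or right_only:

# Flag the origin of each row in the outer join
audit = pd.merge(order_data, customer_data, on='customer_id',
                 how='outer', indicator=True)
print(audit[['order_id', 'customer_id', '_merge']])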

Filtering and Subsetting Data

Filtering and subsetting involve selecting a specific subset of your data based on conditions or criteria. This technique is crucial for focusing your analysis on the most relevant data.

For example, let's consider this dataset that contains product sales records. Your goal is to filter or subset the data to analyze sales based on specific criteria like product category, date, or total sales amount.

Dataset 3: Sales Data

Order ID  Product ID  Product Name  Category     Quantity  Total Amount  Date        Region
1001      P101        Laptop        Electronics  2         $1,600        2024-01-10  North
1002      P102        Smartphone    Electronics  1         $600          2024-01-11  South
1003      P103        Headphones    Accessories  5         $500          2024-01-12  East
1004      P101        Laptop        Electronics  3         $2,400        2024-01-12  West
1005      P104        Coffee Maker  Kitchen      2         $100          2024-01-13  North
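
The examples below assume this table has been loaded into a pandas DataFrame named df, with Total Amount stored as a number (currency strings such as "$1,600" would need to be stripped and converted before numeric comparisons work). Here is a minimal setup in Python; the R examples assume an equivalent data.frame named df whose column names use dots in place of spaces (Order.ID, Total.Amount, and so on), as base R produces for headers containing spaces:

# Load the pandas package
import pandas as pd

# Sales data (Dataset 3); Total Amount is numeric so comparisons like > 1000 work
df = pd.DataFrame({
    'Order ID': [1001, 1002, 1003, 1004, 1005],
    'Product ID': ['P101', 'P102', 'P103', 'P101', 'P104'],
    'Product Name': ['Laptop', 'Smartphone', 'Headphones', 'Laptop', 'Coffee Maker'],
    'Category': ['Electronics', 'Electronics', 'Accessories', 'Electronics', 'Kitchen'],
    'Quantity': [2, 1, 5, 3, 2],
    'Total Amount': [1600, 600, 500, 2400, 100],
    'Date': ['2024-01-10', '2024-01-11', '2024-01-12', '2024-01-12', '2024-01-13'],
    'Region': ['North', 'South', 'East', 'West', 'North']
})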

You may want to filter the dataset to show only sales related to Electronics products, so you filter your dataset by Category.

Filtered by Category (Electronics)

Python

import pandas as pd
electronics_sales = df[df['Category'] == 'Electronics']
print(electronics_sales)

R

library(tidyverse)
electronics_sales <- filter(df, Category == "Electronics")
print(electronics_sales)

Result:

Order ID  Product ID  Product Name  Category     Quantity  Total Amount  Date        Region
1001      P101        Laptop        Electronics  2         $1,600        2024-01-10  North
1002      P102        Smartphone    Electronics  1         $600          2024-01-11  South
1004      P101        Laptop        Electronics  3         $2,400        2024-01-12  West

Let's say you want to filter sales that are greater than $1,000.

Filtered by Total Amount (>$1,000)

Python

import pandas as pd
big_sales = df[df['Total Amount'] > 1000]
print(big_sales)

R

library(tidyverse)
big_sales <- filter(df, Total.Amount > 1000)
print(big_sales)

Result:
Order ID  Product ID  Product Name  Category     Quantity  Total Amount  Date        Region
1001      P101        Laptop        Electronics  2         $1,600        2024-01-10  North
1004      P101        Laptop        Electronics  3         $2,400        2024-01-12  West

Let's say you want to further filter the dataset to show sales where the Total Amount is greater than $1,000 and the Date is before 2024-01-12.

Filtered by Total Amount (>$1,000) AND Date (Before Jan 12, 2024)

In Google Sheets, you can either apply multiple column filters, as in the previous examples, or use the QUERY function.

Python

import pandas as pd
filtered_df = df[(df['Total Amount'] > 1000) & (pd.to_datetime(df['Date']) < '2024-01-12')]
print(filtered_df)

R

library(tidyverse)
filtered_df <- filter(df, Total.Amount > 1000 & Date < "2024-01-12")
print(filtered_df)

Result:
Order ID  Product ID  Product Name  Category     Quantity  Total Amount  Date        Region
1001      P101        Laptop        Electronics  2         $1,600        2024-01-10  North

You might also want to create a subset of the data that only contains Order ID, Product Name, and Total Amount for a sales report. In this case, subsetting your original dataset by specific columns will help.

Subset by Specific Columns (Order ID, Product Name, and Total Amount)

Python

import pandas as pd
subset_df = df[['Order ID', 'Product Name', 'Total Amount']]
print(subset_df)

R

library(tidyverse)
subset_df <- df %>%
  select(Order.ID, Product.Name, Total.Amount)
print(subset_df)

Result:
Order ID  Product Name  Total Amount
1001      Laptop        $1,600
1002      Smartphone    $600
1003      Headphones    $500
1004      Laptop        $2,400
1005      Coffee Maker  $100

In summary, filtering helps you narrow down your dataset to specific rows that meet your criteria (e.g., by date, category, or region), whereas subsetting allows you to focus on specific columns or combinations of columns that are relevant to your analysis, removing unnecessary data.

These operations enable you to extract relevant insights from large datasets by focusing only on the portions of the data that matter for your specific goals.
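
Filtering and subsetting can also be combined in a single step. As a small illustration in pandas (using the df defined above), .loc accepts a row condition and a column list at the same time:

# Electronics rows only, keeping just the three report columns
report = df.loc[df['Category'] == 'Electronics',
                ['Order ID', 'Product Name', 'Total Amount']]
print(report)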

Aggregating and Transforming Data

Aggregation involves summarizing data, such as calculating averages, sums, or counts. Transforming data includes creating new calculated fields, standardizing data, or applying mathematical functions.

Let's consider this new dataset where students have scores in multiple subjects.

Dataset 4: Student Test Results

Student ID  Math  Science  English
S001        85    78       90
S002        75    88       85
S003        95    82       80
S004        65    85       70
S005        90    79       95
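
The Python examples in this section assume these scores are in a pandas DataFrame named test_df; the R examples assume an equivalent data.frame named data. A minimal setup:

# Load the pandas package
import pandas as pd

# Student test results (Dataset 4)
test_df = pd.DataFrame({
    'Student ID': ['S001', 'S002', 'S003', 'S004', 'S005'],
    'Math': [85, 75, 95, 65, 90],
    'Science': [78, 88, 82, 85, 79],
    'English': [90, 85, 80, 70, 95]
})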

With that dataset, we will perform the following actions:

  1. Creating new calculated fields: Calculating the average and weighted average of test scores.
  2. Standardizing data: Standardizing the test scores.
  3. Applying functions: Assigning final grades based on the weighted average.

Average 

Python

import numpy as np
test_df['Average'] = np.mean(test_df[['Math', 'Science', 'English']], axis=1)
print(test_df)

R

data$Average <- rowMeans(data[, c('Math', 'Science', 'English')])
print(data)

Result:
Student ID  Math  Science  English  Average
S001        85    78       90       84.33
S002        75    88       85       82.67
S003        95    82       80       85.67
S004        65    85       70       73.33
S005        90    79       95       88.00

Weighted Average 

Suppose the Math subject is weighted slightly more (0.4) than Science and English (0.3 each). We can calculate the weighted average and display the results in a new column:

Python

import numpy as np

test_df['Weighted Average'] = (test_df['Math'] * 0.4) + (test_df['Science'] * 0.3) + (test_df['English'] * 0.3)
print(test_df)

R

data$Weighted_Average <- data$Math * 0.4 +
                         data$Science * 0.3 +
                         data$English * 0.3
print(data)

Result:
Student ID  Math  Science  English  Weighted Average
S001        85    78       90       84.4
S002        75    88       85       81.9
S003        95    82       80       86.6
S004        65    85       70       72.5
S005        90    79       95       88.2

Standardized Scores

We can also calculate the standardized value for each test score. Standardization transforms each value by subtracting the mean and dividing by the standard deviation (z = (x - mean) / SD), so the transformed data has a mean of 0 and a standard deviation of 1. This technique is especially useful when comparing variables that have different units or scales, since it ensures each variable contributes equally to the analysis, making comparisons more meaningful.

Python

import numpy as np

# Standardization function
def standardize(column):
    mean = np.mean(column)
    std_dev = np.std(column)
    return (column - mean) / std_dev

# Standardize the columns
test_df['Math_Standardized'] = standardize(test_df['Math'])
test_df['Science_Standardized'] = standardize(test_df['Science'])
test_df['English_Standardized'] = standardize(test_df['English'])

print(test_df)

R

# Note: R's scale() divides by the sample standard deviation (n - 1),
# so its output differs slightly from the Python example above, which
# uses np.std's population default (n). as.numeric() drops the matrix
# attributes that scale() attaches.
data$Math_Standardized <- as.numeric(scale(data$Math))
data$Science_Standardized <- as.numeric(scale(data$Science))
data$English_Standardized <- as.numeric(scale(data$English))

print(data)

Result:

Student ID  Math_Standardized  Science_Standardized  English_Standardized
S001        0.278543           -1.182727             0.697486
S002        -0.649934          1.505289              0.116248
S003        1.207020           -0.107521             -0.464991
S004        -1.578410          0.698884              -1.627467
S005        0.742781           -0.913926             1.278724

Looking at the table above, a score close to 0 indicates the student performed near the class average. A positive standardized score means the student scored above the class average in that subject, while a negative score means they scored below it.

Final Grades Based on Weighted Average 

Let's use the weighted average that we calculated earlier to assign a final grade to each student using the following grading scale:

  • A: 85 and above
  • B: 75 to 84.99
  • C: 65 to 74.99
  • D: 55 to 64.99
  • F: Below 55

Python

import numpy as np

# Define a function to assign a grade based on the weighted average
def assign_grade(weighted_avg):
    if weighted_avg >= 85:
        return 'A'
    elif weighted_avg >= 75:
        return 'B'
    elif weighted_avg >= 65:
        return 'C'
    elif weighted_avg >= 55:
        return 'D'
    else:
        return 'F'

# Apply the grading function to the weighted averages
test_df['Final_Grade'] = test_df['Weighted Average'].apply(assign_grade)
print(test_df)

R

# Function to assign final grades
assign_grade <- function(weighted_avg) {
  if (weighted_avg >= 85) {
    return("A")
  } else if (weighted_avg >= 75) {
    return("B")
  } else if (weighted_avg >= 65) {
    return("C")
  } else if (weighted_avg >= 55) {
    return("D")
  } else {
    return("F")
  }
}

# Apply the function to assign final grades based on the weighted average
data$Final_Grade <- sapply(data$Weighted_Average, assign_grade)

# Display the final dataset with weighted averages and final grades
print(data)

Result:
Student ID  Math  Science  English  Weighted Average  Final_Grade
S001        85    78       90       84.4              B
S002        75    88       85       81.9              B
S003        95    82       80       86.6              A
S004        65    85       70       72.5              C
S005        90    79       95       88.2              A
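
Note that the calculations above transform each row; aggregation proper collapses many rows into summary values. As a quick illustration in pandas (using the test_df defined earlier), a class-level aggregate per subject:

# Class average for each subject (one summary value per column)
print(test_df[['Math', 'Science', 'English']].mean())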