Data Cleaning and Wrangling Guide

Essential techniques and best practices for preparing ready-to-use data, with implementation examples in Google Sheets, Microsoft Excel, Python, and R.

Reshaping Data

Reshaping data involves changing the structure or layout of your dataset, making it easier to work with during analysis. This can include pivoting data from long to wide format (or vice versa), splitting or combining columns, and more.

Pivoting Data

Rearranging the columns and rows of your data can help you view it from different perspectives. 

For example, consider this clinical trials data from a study testing the effectiveness of three different drugs (Drug A, Drug B, and Drug C) on patients with a specific medical condition. The dataset records information about patients' responses to treatment across various age groups, treatment groups, and outcomes. The data is in a "long" format, where each row represents a single patient’s response to a treatment.

Patient ID  Age Group  Treatment  Response
P1          18-30      Drug A     Positive
P2          31-50      Drug B     Negative
P3          51-65      Drug A     Positive
P4          18-30      Drug C     Negative
P5          31-50      Drug B     Positive
P6          51-65      Drug C     Positive

Pivoting the data allows you to summarize patient responses to different treatments based on age group and treatment type in a "wide" format. By pivoting, you can easily compare the effectiveness of each drug across age groups and identify patterns in patient outcomes.

Python

# Load the pandas package
import pandas as pd

# Create the dataset
data = {
    'Patient_ID': ['P1', 'P2', 'P3', 'P4', 'P5', 'P6'],
    'Age Group': ['18-30', '31-50', '51-65', '18-30', '31-50', '51-65'],
    'Treatment': ['Drug A', 'Drug B', 'Drug A', 'Drug C', 'Drug B', 'Drug C'],
    'Response': ['Positive', 'Negative', 'Positive', 'Negative', 'Positive', 'Positive']
}
df = pd.DataFrame(data)

# Create a new column with 1 to indicate occurrences
df['Count'] = 1

# Pivot the data
pivoted_data = df.pivot_table(
    index='Age Group',
    columns=['Treatment', 'Response'],
    values='Count',
    aggfunc='sum',
    fill_value=0
)

# Display the final pivoted data
print(pivoted_data)

R

# Load the tidyverse package
library(tidyverse)

# Create the dataset
data <- data.frame(
  Patient_ID = c('P1', 'P2', 'P3', 'P4', 'P5', 'P6'),
  Age_Group = c('18-30', '31-50', '51-65', '18-30', '31-50', '51-65'),
  Treatment = c('Drug A', 'Drug B', 'Drug A', 'Drug C', 'Drug B', 'Drug C'),
  Response = c('Positive', 'Negative', 'Positive', 'Negative', 'Positive', 'Positive')
)

# Pivot and count the positive/negative responses for each treatment and age group
pivoted_data <- data %>%
  # Create a new variable with 1 for each row to count occurrences
  mutate(count = 1) %>%
  # Pivot the data wider, creating columns for each Treatment and Response combination
  pivot_wider(names_from = c(Treatment, Response), 
              values_from = count, 
              values_fill = list(count = 0)) %>%
  # Group by Age Group and summarize the results by summing up counts
  group_by(Age_Group) %>%
  summarize(across(-Patient_ID, sum)) # Exclude Patient_ID from the sum operation

# Display the final pivoted data
print(pivoted_data)

Result:
Age Group  Drug A (Positive)  Drug A (Negative)  Drug B (Positive)  Drug B (Negative)  Drug C (Positive)  Drug C (Negative)
18-30      1                  0                  0                  0                  0                  1
31-50      0                  0                  1                  1                  0                  0
51-65      1                  0                  0                  0                  1                  0
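
Pivoting also works in the reverse direction (wide back to long). As a minimal sketch using a small hypothetical wide table (not the clinical data above), pandas' melt() gathers a set of columns into rows:

# Load the pandas package
import pandas as pd

# Hypothetical wide table: one row per patient, one column per visit
wide = pd.DataFrame({
    'Patient_ID': ['P1', 'P2'],
    'Visit_1': [120, 130],
    'Visit_2': [118, 128]
})

# melt() stacks the visit columns into long (variable, value) pairs
long_format = wide.melt(id_vars='Patient_ID', var_name='Visit', value_name='Measurement')
print(long_format)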

Splitting and Combining Columns

In some cases, splitting columns helps separate meaningful components from a single string, making it easier to analyze specific parts of the data.

For example, let's consider this data with an address column in it:

Patient ID  Address
P1          123 Main St, Stony Brook, NY, 11790
P2          123 Main St, Setauket-East Setauket, NY, 11733
P3          123 Main St, Port Jefferson, NY, 11777

You can split the address column into four columns: street address, city, state, and ZIP code. This makes it possible to sort or categorize the data by city, state, or ZIP code, which you could not easily do in the original format.

Python

# Load the pandas package
import pandas as pd

# Create the dataset
data = {
    'Patient ID': ['P1', 'P2', 'P3'],
    'Address': ['123 Main St, Stony Brook, NY, 11790', 
                '123 Main St, Setauket-East Setauket, NY, 11733', 
                '123 Main St, Port Jefferson, NY, 11777']
}
df = pd.DataFrame(data)

# Split the address column into four new columns (this assumes every address has exactly four comma-separated parts)
df[['Street Address', 'City', 'State', 'ZIP Code']] = df['Address'].str.split(', ', expand=True)

# Display the result
print(df)

R

# Load the tidyverse package
library(tidyverse)

# Create the dataset
data <- data.frame(
  Patient_ID = c('P1', 'P2', 'P3'),
  Address = c('123 Main St, Stony Brook, NY, 11790', 
              '123 Main St, Setauket-East Setauket, NY, 11733', 
              '123 Main St, Port Jefferson, NY, 11777')
)

# Split the address column (this assumes every address has exactly four comma-separated parts)
data <- data %>%
  separate(Address, into = c("Street_Address", "City", "State", "ZIP"), sep = ", ")

# Display the result
print(data) 

Result:
Patient ID  Street Address  City                    State  ZIP Code
P1          123 Main St     Stony Brook             NY     11790
P2          123 Main St     Setauket-East Setauket  NY     11733
P3          123 Main St     Port Jefferson          NY     11777

In other cases, the opposite is useful: combining multiple columns into one, especially when creating summary data for reports or simplifying data views for presentation.

Let's combine the previously split columns into a single address column:

Python

# Combine the columns back into a single Address column
df['Full Address'] = df['Street Address'] + ', ' + df['City'] + ', ' + df['State'] + ', ' + df['ZIP Code']

# Drop the split columns (optional)
df = df.drop(columns=['Street Address', 'City', 'State', 'ZIP Code'])

# Display the result
print(df)

R

# Combine the columns back into a single Full_Address column
data <- data %>%
  unite(Full_Address, Street_Address, City, State, ZIP, sep = ", ", remove = TRUE) # Set remove = FALSE to keep the split columns

# Display the result
print(data) 

Result:
Patient ID  Full Address
P1          123 Main St, Stony Brook, NY, 11790
P2          123 Main St, Setauket-East Setauket, NY, 11733
P3          123 Main St, Port Jefferson, NY, 11777

Merging and Joining Datasets

Merging datasets is essential when working with data from multiple sources. This technique allows you to combine information based on common fields, such as IDs or dates. There are several types of merges:

  1. Inner Join: Returns rows with matching values in both datasets.
  2. Left Join: Returns all rows from the left dataset and matching rows from the right dataset.
  3. Right Join: Returns all rows from the right dataset and matching rows from the left dataset.
  4. Outer Join: Returns all rows from both datasets, filling in missing values where there is no match.

For example, let's consider these two datasets for an e-commerce store. One dataset contains order information, and the other contains customer data.

Dataset 1: Order Data

order_id  customer_id  order_date  total_amount
101       1            2023-10-26  100
102       2            2023-10-27  50
103       3            2023-10-28  75
104       1            2023-10-29  120
105       4            2023-10-30  80

Dataset 2: Customer Data

customer_id  customer_name  city
1            Alice          New York
2            Bob            Los Angeles
3            Charlie        Chicago
5            David          San Francisco
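
Before running the join examples below, the two tables need to exist as data frames named order_data and customer_data. Here is a minimal setup in Python; the R examples assume equivalent data.frame objects with the same names and columns:

# Load the pandas package
import pandas as pd

# Order data (Dataset 1)
order_data = pd.DataFrame({
    'order_id': [101, 102, 103, 104, 105],
    'customer_id': [1, 2, 3, 1, 4],
    'order_date': ['2023-10-26', '2023-10-27', '2023-10-28', '2023-10-29', '2023-10-30'],
    'total_amount': [100, 50, 75, 120, 80]
})

# Customer data (Dataset 2)
customer_data = pd.DataFrame({
    'customer_id': [1, 2, 3, 5],
    'customer_name': ['Alice', 'Bob', 'Charlie', 'David'],
    'city': ['New York', 'Los Angeles', 'Chicago', 'San Francisco']
})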

The two datasets can be merged based on the shared column (customer_id) in four different ways:

Inner Join (keeps only matching records)

An inner join would only return rows where the customer_id exists in both the first (i.e., Order) and second (i.e., Customer) datasets. This would exclude:

  • Orders made by unregistered customers (e.g., Order 105).
  • Customers who haven't placed any orders (e.g., David).

Python

import pandas as pd
inner_join = pd.merge(order_data, customer_data, on='customer_id', how='inner')
print(inner_join)

R

library(tidyverse)
inner_join <- inner_join(order_data, customer_data, by = "customer_id")
print(inner_join)

Result:

order_id  customer_id  order_date  total_amount  customer_name  city
101       1            2023-10-26  100           Alice          New York
104       1            2023-10-29  120           Alice          New York
102       2            2023-10-27  50            Bob            Los Angeles
103       3            2023-10-28  75            Charlie        Chicago

Left Join (keeps all records from the order dataset)

A left join will keep all records from the order dataset, even if there is no matching customer information. This is useful if you are focused on analyzing orders and want to see if there are any orders from customers who are not registered in your system (for example, guest checkouts or unregistered customers).

Python

import pandas as pd
left_join = pd.merge(order_data, customer_data, on='customer_id', how='left')
print(left_join)

R

library(tidyverse)
left_join <- left_join(order_data, customer_data, by = "customer_id")
print(left_join)

Result:

order_id  customer_id  order_date  total_amount  customer_name  city
101       1            2023-10-26  100           Alice          New York
102       2            2023-10-27  50            Bob            Los Angeles
103       3            2023-10-28  75            Charlie        Chicago
104       1            2023-10-29  120           Alice          New York
105       4            2023-10-30  80            NaN            NaN

Right Join (keeps all records from the customer dataset)

A right join will keep all records from the Customer Information dataset, including customers who haven’t placed any orders yet. This can be valuable if you want to track customers who are registered but haven't made any purchases, allowing for customer re-engagement efforts.

Python

import pandas as pd
right_join = pd.merge(order_data, customer_data, on='customer_id', how='right')
print(right_join)

R

library(tidyverse)
right_join <- right_join(order_data, customer_data, by = "customer_id")
print(right_join)

Result:

order_id  customer_id  order_date  total_amount  customer_name  city
101       1            2023-10-26  100           Alice          New York
104       1            2023-10-29  120           Alice          New York
102       2            2023-10-27  50            Bob            Los Angeles
103       3            2023-10-28  75            Charlie        Chicago
NaN       5            NaN         NaN           David          San Francisco

Outer Join (keeps all records from both datasets)

An outer join will include all records from both datasets, even if there are no matching records in the other dataset. This is valuable for a comprehensive overview, showing both unregistered customers and customers who haven’t placed orders.

Python

import pandas as pd
outer_join = pd.merge(order_data, customer_data, on='customer_id', how='outer')
print(outer_join)

R

library(tidyverse)
outer_join <- full_join(order_data, customer_data, by = "customer_id")
print(outer_join)

Result:

order_id  customer_id  order_date  total_amount  customer_name  city
101       1            2023-10-26  100           Alice          New York
104       1            2023-10-29  120           Alice          New York
102       2            2023-10-27  50            Bob            Los Angeles
103       3            2023-10-28  75            Charlie        Chicago
105       4            2023-10-30  80            NaN            NaN
NaN       5            NaN         NaN           David          San Francisco

As you may have noticed, apart from an inner join, all of the other merge types can produce rows with missing values (shown as NaN in the pandas output above; R marks them as NA). Therefore, it is important to choose the most appropriate join for your specific needs.
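
If you need to audit where each row in a merged result came from, pandas' merge() also accepts indicator=True, which appends a _merge column marking every row as both, left_only, or right_only:

# Flag the origin of each row in the outer join
audit = pd.merge(order_data, customer_data, on='customer_id',
                 how='outer', indicator=True)
print(audit[['order_id', 'customer_id', '_merge']])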

Filtering and Subsetting Data

Filtering and subsetting involve selecting a specific subset of your data based on conditions or criteria. This technique is crucial for focusing your analysis on the most relevant data.

For example, let's consider this dataset that contains product sales records. Your goal is to filter or subset the data to analyze sales based on specific criteria like product category, date, or total sales amount.

Dataset 3: Sales Data

Order ID  Product ID  Product Name  Category     Quantity  Total Amount  Date        Region
1001      P101        Laptop        Electronics  2         $1,600        2024-01-10  North
1002      P102        Smartphone    Electronics  1         $600          2024-01-11  South
1003      P103        Headphones    Accessories  5         $500          2024-01-12  East
1004      P101        Laptop        Electronics  3         $2,400        2024-01-12  West
1005      P104        Coffee Maker  Kitchen      2         $100          2024-01-13  North
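
The examples below assume this table has been loaded into a pandas DataFrame named df, with Total Amount stored as a number (currency strings such as "$1,600" would need to be stripped and converted before numeric comparisons work). Here is a minimal setup in Python; the R examples assume an equivalent data.frame named df whose column names use dots in place of spaces (Order.ID, Total.Amount, and so on), as base R produces for headers containing spaces:

# Load the pandas package
import pandas as pd

# Sales data (Dataset 3); Total Amount is numeric so comparisons like > 1000 work
df = pd.DataFrame({
    'Order ID': [1001, 1002, 1003, 1004, 1005],
    'Product ID': ['P101', 'P102', 'P103', 'P101', 'P104'],
    'Product Name': ['Laptop', 'Smartphone', 'Headphones', 'Laptop', 'Coffee Maker'],
    'Category': ['Electronics', 'Electronics', 'Accessories', 'Electronics', 'Kitchen'],
    'Quantity': [2, 1, 5, 3, 2],
    'Total Amount': [1600, 600, 500, 2400, 100],
    'Date': ['2024-01-10', '2024-01-11', '2024-01-12', '2024-01-12', '2024-01-13'],
    'Region': ['North', 'South', 'East', 'West', 'North']
})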

You may want to filter the dataset to show only sales related to Electronics products, so you filter your dataset by Category.

Filtered by Category (Electronics)

Python

import pandas as pd
electronics_sales = df[df['Category'] == 'Electronics']
print(electronics_sales)

R

library(tidyverse)
electronics_sales <- filter(df, Category == "Electronics")
print(electronics_sales)

Result:

Order ID  Product ID  Product Name  Category     Quantity  Total Amount  Date        Region
1001      P101        Laptop        Electronics  2         $1,600        2024-01-10  North
1002      P102        Smartphone    Electronics  1         $600          2024-01-11  South
1004      P101        Laptop        Electronics  3         $2,400        2024-01-12  West

Let's say you want to filter sales that are greater than $1,000.

Filtered by Total Amount (>$1,000)

Python

import pandas as pd
big_sales = df[df['Total Amount'] > 1000]
print(big_sales)

R

library(tidyverse)
big_sales <- filter(df, Total.Amount > 1000)
print(big_sales)

Result:
Order ID  Product ID  Product Name  Category     Quantity  Total Amount  Date        Region
1001      P101        Laptop        Electronics  2         $1,600        2024-01-10  North
1004      P101        Laptop        Electronics  3         $2,400        2024-01-12  West

Let's say you want to further filter the dataset to show sales where the Total Amount is greater than $1,000 and the Date is before 2024-01-12.

Filtered by Total Amount (>$1,000) AND Date (Before Jan 12, 2024)

In Google Sheets, you can either apply multiple column filters, as in the previous examples, or use the QUERY function.

Python

import pandas as pd
filtered_df = df[(df['Total Amount'] > 1000) & (pd.to_datetime(df['Date']) < '2024-01-12')]
print(filtered_df)

R

library(tidyverse)
filtered_df <- filter(df, Total.Amount > 1000 & Date < "2024-01-12")
print(filtered_df)

Result:
Order ID  Product ID  Product Name  Category     Quantity  Total Amount  Date        Region
1001      P101        Laptop        Electronics  2         $1,600        2024-01-10  North

You might also want to create a subset of the data that only contains Order ID, Product Name, and Total Amount for a sales report. In this case, subsetting your original dataset by specific columns will help.

Subset by Specific Columns (Order ID, Product Name, and Total Amount)

Python

import pandas as pd
subset_df = df[['Order ID', 'Product Name', 'Total Amount']]
print(subset_df)

R

library(tidyverse)
subset_df <- df %>%
  select(Order.ID, Product.Name, Total.Amount)
print(subset_df)

Result:
Order ID  Product Name  Total Amount
1001      Laptop        $1,600
1002      Smartphone    $600
1003      Headphones    $500
1004      Laptop        $2,400
1005      Coffee Maker  $100

In summary, filtering helps you narrow down your dataset to specific rows that meet your criteria (e.g., by date, category, or region), whereas subsetting allows you to focus on specific columns or combinations of columns that are relevant to your analysis, removing unnecessary data.

These operations enable you to extract relevant insights from large datasets by focusing only on the portions of the data that matter for your specific goals.
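
Filtering and subsetting can also be combined in a single step. As a small illustration in pandas (using the df defined above), .loc accepts a row condition and a column list at the same time:

# Electronics rows only, keeping just the three report columns
report = df.loc[df['Category'] == 'Electronics',
                ['Order ID', 'Product Name', 'Total Amount']]
print(report)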

Aggregating and Transforming Data

Aggregation involves summarizing data, such as calculating averages, sums, or counts. Transforming data includes creating new calculated fields, standardizing data, or applying mathematical functions.

Let's consider this new dataset where students have scores in multiple subjects.

Dataset 4: Student Test Results

Student ID  Math  Science  English
S001        85    78       90
S002        75    88       85
S003        95    82       80
S004        65    85       70
S005        90    79       95
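
The Python examples in this section assume these scores are in a pandas DataFrame named test_df; the R examples assume an equivalent data.frame named data. A minimal setup:

# Load the pandas package
import pandas as pd

# Student test results (Dataset 4)
test_df = pd.DataFrame({
    'Student ID': ['S001', 'S002', 'S003', 'S004', 'S005'],
    'Math': [85, 75, 95, 65, 90],
    'Science': [78, 88, 82, 85, 79],
    'English': [90, 85, 80, 70, 95]
})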

With that dataset, we will perform the following actions:

  1. Creating new calculated fields: Calculating the average and weighted average of test scores.
  2. Standardizing data: Standardizing the test scores.
  3. Applying functions: Assigning final grades based on the weighted average.

Average 

Python

import numpy as np
test_df['Average'] = np.mean(test_df[['Math', 'Science', 'English']], axis=1)
print(test_df)

R

data$Average <- rowMeans(data[, c('Math', 'Science', 'English')])
print(data)

Result:
Student ID  Math  Science  English  Average
S001        85    78       90       84.33
S002        75    88       85       82.67
S003        95    82       80       85.67
S004        65    85       70       73.33
S005        90    79       95       88.00

Weighted Average 

Suppose the Math subject is weighted slightly more (0.4) than Science and English (0.3 each). We can calculate the weighted average and display the results in a new column:

Python

import numpy as np

test_df['Weighted Average'] = (test_df['Math'] * 0.4) + (test_df['Science'] * 0.3) + (test_df['English'] * 0.3)
print(test_df)

R

data$Weighted_Average <- data$Math * 0.4 +
                         data$Science * 0.3 +
                         data$English * 0.3
print(data)

Result:
Student ID  Math  Science  English  Weighted Average
S001        85    78       90       84.4
S002        75    88       85       81.9
S003        95    82       80       86.6
S004        65    85       70       72.5
S005        90    79       95       88.2

Standardized Scores

We can also calculate the standardized value for each test score. Standardization transforms each value by subtracting the mean and dividing by the standard deviation (z = (x - mean) / SD), so the transformed data has a mean of 0 and a standard deviation of 1. This technique is especially useful when comparing variables that have different units or scales, since it ensures each variable contributes equally to the analysis, making comparisons more meaningful.

Python

import numpy as np

# Standardization function
def standardize(column):
    mean = np.mean(column)
    std_dev = np.std(column)
    return (column - mean) / std_dev

# Standardize the columns
test_df['Math_Standardized'] = standardize(test_df['Math'])
test_df['Science_Standardized'] = standardize(test_df['Science'])
test_df['English_Standardized'] = standardize(test_df['English'])

print(test_df)

R

# Note: R's scale() divides by the sample standard deviation (n - 1),
# so its output differs slightly from the Python example above, which
# uses np.std's population default (n). as.numeric() drops the matrix
# attributes that scale() attaches.
data$Math_Standardized <- as.numeric(scale(data$Math))
data$Science_Standardized <- as.numeric(scale(data$Science))
data$English_Standardized <- as.numeric(scale(data$English))

print(data)

Result:

Student ID  Math_Standardized  Science_Standardized  English_Standardized
S001        0.278543           -1.182727             0.697486
S002        -0.649934          1.505289              0.116248
S003        1.207020           -0.107521             -0.464991
S004        -1.578410          0.698884              -1.627467
S005        0.742781           -0.913926             1.278724

Looking at the table above, a score close to 0 indicates the student performed near the class average. A positive standardized score means the student scored above the class average in that subject, while a negative score means they scored below it.

Final Grades Based on Weighted Average 

Let's use the weighted average that we calculated earlier to assign a final grade to each student using the following grading scale:

  • A: 85 and above
  • B: 75 to 84.99
  • C: 65 to 74.99
  • D: 55 to 64.99
  • F: Below 55

Python

import numpy as np

# Define a function to assign a grade based on the weighted average
def assign_grade(weighted_avg):
    if weighted_avg >= 85:
        return 'A'
    elif weighted_avg >= 75:
        return 'B'
    elif weighted_avg >= 65:
        return 'C'
    elif weighted_avg >= 55:
        return 'D'
    else:
        return 'F'

# Apply the grading function to the weighted averages
test_df['Final_Grade'] = test_df['Weighted Average'].apply(assign_grade)
print(test_df)

R

# Function to assign final grades
assign_grade <- function(weighted_avg) {
  if (weighted_avg >= 85) {
    return("A")
  } else if (weighted_avg >= 75) {
    return("B")
  } else if (weighted_avg >= 65) {
    return("C")
  } else if (weighted_avg >= 55) {
    return("D")
  } else {
    return("F")
  }
}

# Apply the function to assign final grades based on the weighted average
data$Final_Grade <- sapply(data$Weighted_Average, assign_grade)

# Display the final dataset with weighted averages and final grades
print(data)

Result:
Student ID  Math  Science  English  Weighted Average  Final_Grade
S001        85    78       90       84.4              B
S002        75    88       85       81.9              B
S003        95    82       80       86.6              A
S004        65    85       70       72.5              C
S005        90    79       95       88.2              A
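
Note that the calculations above transform each row; aggregation proper collapses many rows into summary values. As a quick illustration in pandas (using the test_df defined earlier), a class-level aggregate per subject:

# Class average for each subject (one summary value per column)
print(test_df[['Math', 'Science', 'English']].mean())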