Reshaping data involves changing the structure or layout of your dataset, making it easier to work with during analysis. This can include pivoting data from long to wide format (or vice versa), splitting or combining columns, and more.
Rearranging the columns and rows of your data can help you view it from different perspectives.
For example, consider this clinical trials data from a study testing the effectiveness of three different drugs (Drug A, Drug B, and Drug C) on patients with a specific medical condition. The dataset records information about patients' responses to treatment across various age groups, treatment groups, and outcomes. The data is in a "long" format, where each row represents a single patient’s response to a treatment.
Patient ID | Age Group | Treatment | Response |
---|---|---|---|
P1 | 18-30 | Drug A | Positive |
P2 | 31-50 | Drug B | Negative |
P3 | 51-65 | Drug A | Positive |
P4 | 18-30 | Drug C | Negative |
P5 | 31-50 | Drug B | Positive |
P6 | 51-65 | Drug C | Positive |
Pivoting the data allows you to summarize patient responses to different treatments based on age group and treatment type in a "wide" format. By pivoting, you can easily compare the effectiveness of each drug across age groups and identify patterns in patient outcomes.
# Load the pandas package
import pandas as pd
# Create the dataset
data = {
    'Patient_ID': ['P1', 'P2', 'P3', 'P4', 'P5', 'P6'],
    'Age Group': ['18-30', '31-50', '51-65', '18-30', '31-50', '51-65'],
    'Treatment': ['Drug A', 'Drug B', 'Drug A', 'Drug C', 'Drug B', 'Drug C'],
    'Response': ['Positive', 'Negative', 'Positive', 'Negative', 'Positive', 'Positive']
}
df = pd.DataFrame(data)
# Create a new column with 1 to indicate occurrences
df['Count'] = 1
# Pivot the data
pivoted_data = df.pivot_table(
    index='Age Group',
    columns=['Treatment', 'Response'],
    values='Count',
    aggfunc='sum',
    fill_value=0
)
# Display the final pivoted data
print(pivoted_data)
# Load the tidyverse package
library(tidyverse)
# Create the dataset
data <- data.frame(
  Patient_ID = c('P1', 'P2', 'P3', 'P4', 'P5', 'P6'),
  Age_Group = c('18-30', '31-50', '51-65', '18-30', '31-50', '51-65'),
  Treatment = c('Drug A', 'Drug B', 'Drug A', 'Drug C', 'Drug B', 'Drug C'),
  Response = c('Positive', 'Negative', 'Positive', 'Negative', 'Positive', 'Positive')
)
# Pivot and count the positive/negative responses for each treatment and age group
pivoted_data <- data %>%
  # Create a new variable with 1 for each row to count occurrences
  mutate(count = 1) %>%
  # Pivot the data wider, creating columns for each Treatment and Response combination
  pivot_wider(names_from = c(Treatment, Response),
              values_from = count,
              values_fill = list(count = 0)) %>%
  # Group by Age Group and summarize the results by summing up counts
  group_by(Age_Group) %>%
  summarize(across(-Patient_ID, sum)) # Exclude Patient_ID from the sum operation
# Display the final pivoted data
print(pivoted_data)
Age Group | Drug A (Positive) | Drug A (Negative) | Drug B (Positive) | Drug B (Negative) | Drug C (Positive) | Drug C (Negative) |
---|---|---|---|---|---|---|
18-30 | 1 | 0 | 0 | 0 | 0 | 1 |
31-50 | 0 | 0 | 1 | 1 | 0 | 0 |
51-65 | 1 | 0 | 0 | 0 | 1 | 0 |
In some cases, splitting columns helps separate meaningful components from a single string, making it easier to analyze specific parts of the data.
For example, let's consider this data with an address column in it:
Patient ID | Address |
---|---|
P1 | 123 Main St, Stony Brook, NY, 11790 |
P2 | 123 Main St, Setauket-East Setauket, NY, 11733 |
P3 | 123 Main St, Port Jefferson, NY, 11777 |
You can split the address column into four: street address, city, state, and ZIP code. By doing so, you will be able to sort or categorize your data by city, state, or ZIP code, which you could not do while the address was stored in a single column.
# Load the pandas package
import pandas as pd
# Create the dataset
data = {
    'Patient ID': ['P1', 'P2', 'P3'],
    'Address': ['123 Main St, Stony Brook, NY, 11790',
                '123 Main St, Setauket-East Setauket, NY, 11733',
                '123 Main St, Port Jefferson, NY, 11777']
}
df = pd.DataFrame(data)
# Split the address column
df[['Street Address', 'City', 'State', 'ZIP Code']] = df['Address'].str.split(', ', expand=True)
# Display the result
print(df)
# Load the tidyverse package
library(tidyverse)
# Create the dataset
data <- data.frame(
  Patient_ID = c('P1', 'P2', 'P3'),
  Address = c('123 Main St, Stony Brook, NY, 11790',
              '123 Main St, Setauket-East Setauket, NY, 11733',
              '123 Main St, Port Jefferson, NY, 11777')
)
# Split the address column
data <- data %>%
  separate(Address, into = c("Street_Address", "City", "State", "ZIP"), sep = ", ")
# Display the result
print(data)
Patient ID | Street Address | City | State | ZIP Code |
---|---|---|---|---|
P1 | 123 Main St | Stony Brook | NY | 11790 |
P2 | 123 Main St | Setauket-East Setauket | NY | 11733 |
P3 | 123 Main St | Port Jefferson | NY | 11777 |
In other cases, the opposite is useful: combining multiple columns into one, especially when creating summary data for reports or simplifying data views for presentation.
Let's combine the previously split columns into a single address column:
# Combine the columns back into a single Address column
df['Full Address'] = df['Street Address'] + ', ' + df['City'] + ', ' + df['State'] + ', ' + df['ZIP Code']
# Drop the split columns (optional)
df = df.drop(columns=['Street Address', 'City', 'State', 'ZIP Code'])
# Display the result
print(df)
# Combine the columns back into a single Full_Address column
data <- data %>%
  unite(Full_Address, Street_Address, City, State, ZIP, sep = ", ", remove = TRUE) # Use remove = FALSE if you want to keep the split columns
# Display the result
print(data)
Patient ID | Full_Address |
---|---|
P1 | 123 Main St, Stony Brook, NY, 11790 |
P2 | 123 Main St, Setauket-East Setauket, NY, 11733 |
P3 | 123 Main St, Port Jefferson, NY, 11777 |
Merging datasets is essential when working with data from multiple sources. This technique allows you to combine information based on common fields, such as IDs or dates, and there are several types of merges to choose from.
For example, let's consider these two datasets for an e-commerce store. One dataset contains order information, and the other contains customer data.
order_id | customer_id | order_date | total_amount |
---|---|---|---|
101 | 1 | 2023-10-26 | 100 |
102 | 2 | 2023-10-27 | 50 |
103 | 3 | 2023-10-28 | 75 |
104 | 1 | 2023-10-29 | 120 |
105 | 4 | 2023-10-30 | 80 |
customer_id | customer_name | city |
---|---|---|
1 | Alice | New York |
2 | Bob | Los Angeles |
3 | Charlie | Chicago |
5 | David | San Francisco |
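The join examples below assume these two tables are already loaded as data frames named order_data and customer_data (the names used in the code). If you want to follow along, here is a minimal sketch that builds them in pandas from the tables above; the R versions would need equivalent data frames with the same names and columns.
# Load the pandas package
import pandas as pd
# Build the order dataset shown above
order_data = pd.DataFrame({
    'order_id': [101, 102, 103, 104, 105],
    'customer_id': [1, 2, 3, 1, 4],
    'order_date': ['2023-10-26', '2023-10-27', '2023-10-28', '2023-10-29', '2023-10-30'],
    'total_amount': [100, 50, 75, 120, 80]
})
# Build the customer dataset shown above
customer_data = pd.DataFrame({
    'customer_id': [1, 2, 3, 5],
    'customer_name': ['Alice', 'Bob', 'Charlie', 'David'],
    'city': ['New York', 'Los Angeles', 'Chicago', 'San Francisco']
})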
The two datasets can be merged based on the shared column (customer_id) in four different ways: an inner join, a left join, a right join, or an outer (full) join.
An inner join would only return rows where the customer_id exists in both the first (i.e., Order) and second (i.e., Customer) datasets. This would exclude order 105 (its customer_id, 4, does not appear in the customer data) and customer 5 (David, who has not placed any orders).
import pandas as pd
inner_join = pd.merge(order_data, customer_data, on='customer_id', how='inner')
print(inner_join)
library(tidyverse)
inner_join <- inner_join(order_data, customer_data, by = "customer_id")
print(inner_join)
order_id | customer_id | order_date | total_amount | customer_name | city |
---|---|---|---|---|---|
101 | 1 | 2023-10-26 | 100 | Alice | New York |
104 | 1 | 2023-10-29 | 120 | Alice | New York |
102 | 2 | 2023-10-27 | 50 | Bob | Los Angeles |
103 | 3 | 2023-10-28 | 75 | Charlie | Chicago |
A left join will keep all records from the order dataset, even if there is no matching customer information. This is useful if you are focused on analyzing orders and want to see if there are any orders from customers who are not registered in your system (for example, guest checkouts or unregistered customers).
import pandas as pd
left_join = pd.merge(order_data, customer_data, on='customer_id', how='left')
print(left_join)
library(tidyverse)
left_join <- left_join(order_data, customer_data, by = "customer_id")
print(left_join)
order_id | customer_id | order_date | total_amount | customer_name | city |
---|---|---|---|---|---|
101 | 1 | 2023-10-26 | 100 | Alice | New York |
102 | 2 | 2023-10-27 | 50 | Bob | Los Angeles |
103 | 3 | 2023-10-28 | 75 | Charlie | Chicago |
104 | 1 | 2023-10-29 | 120 | Alice | New York |
105 | 4 | 2023-10-30 | 80 | NaN | NaN |
A right join will keep all records from the Customer Information dataset, including customers who haven’t placed any orders yet. This can be valuable if you want to track customers who are registered but haven't made any purchases, allowing for customer re-engagement efforts.
import pandas as pd
right_join = pd.merge(order_data, customer_data, on='customer_id', how='right')
print(right_join)
library(tidyverse)
right_join <- right_join(order_data, customer_data, by = "customer_id")
print(right_join)
order_id | customer_id | order_date | total_amount | customer_name | city |
---|---|---|---|---|---|
101 | 1 | 2023-10-26 | 100 | Alice | New York |
104 | 1 | 2023-10-29 | 120 | Alice | New York |
102 | 2 | 2023-10-27 | 50 | Bob | Los Angeles |
103 | 3 | 2023-10-28 | 75 | Charlie | Chicago |
NaN | 5 | NaN | NaN | David | San Francisco |
An outer join will include all records from both datasets, even if there are no matching records in the other dataset. This is valuable for a comprehensive overview, showing both unregistered customers and customers who haven’t placed orders.
import pandas as pd
outer_join = pd.merge(order_data, customer_data, on='customer_id', how='outer')
print(outer_join)
library(tidyverse)
outer_join <- full_join(order_data, customer_data, by = "customer_id")
print(outer_join)
order_id | customer_id | order_date | total_amount | customer_name | city |
---|---|---|---|---|---|
101 | 1 | 2023-10-26 | 100 | Alice | New York |
104 | 1 | 2023-10-29 | 120 | Alice | New York |
102 | 2 | 2023-10-27 | 50 | Bob | Los Angeles |
103 | 3 | 2023-10-28 | 75 | Charlie | Chicago |
105 | 4 | 2023-10-30 | 80 | NaN | NaN |
NaN | 5 | NaN | NaN | David | San Francisco |
As you may have noticed, apart from an inner join, all other merge types can produce rows with missing values (shown as "NaN" in the tables above). Therefore, it is important to choose the join type that best matches your specific needs.
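For instance, after an outer join you can isolate the rows that picked up missing values by filtering on a column that exists in only one of the two original tables. This is a small sketch in pandas, building on the outer_join result created above:
# Orders that have no matching registered customer (customer columns are missing)
unmatched_orders = outer_join[outer_join['customer_name'].isna()]
print(unmatched_orders)
# Registered customers who have not placed any orders (order columns are missing)
customers_without_orders = outer_join[outer_join['order_id'].isna()]
print(customers_without_orders)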
Filtering and subsetting involve selecting a specific subset of your data based on conditions or criteria. This technique is crucial for focusing your analysis on the most relevant data.
For example, let's consider this dataset that contains product sales records. Your goal is to filter or subset the data to analyze sales based on specific criteria like product category, date, or total sales amount.
Order ID | Product ID | Product Name | Category | Quantity | Total Amount | Date | Region |
---|---|---|---|---|---|---|---|
1001 | P101 | Laptop | Electronics | 2 | $1,600 | 2024-01-10 | North |
1002 | P102 | Smartphone | Electronics | 1 | $600 | 2024-01-11 | South |
1003 | P103 | Headphones | Accessories | 5 | $500 | 2024-01-12 | East |
1004 | P101 | Laptop | Electronics | 3 | $2,400 | 2024-01-12 | West |
1005 | P104 | Coffee Maker | Kitchen | 2 | $100 | 2024-01-13 | North |
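The filtering examples below assume this sales table is already loaded into a DataFrame named df (the R versions reference the same columns with dots in place of spaces, e.g. Total.Amount). Here is one possible pandas sketch; note that Total Amount is stored as a plain number (1600 rather than the formatted $1,600 shown in the table) so that numeric comparisons work.
# Load the pandas package
import pandas as pd
# Build the sales dataset shown above (Total Amount as numbers, Date as text in YYYY-MM-DD format)
df = pd.DataFrame({
    'Order ID': [1001, 1002, 1003, 1004, 1005],
    'Product ID': ['P101', 'P102', 'P103', 'P101', 'P104'],
    'Product Name': ['Laptop', 'Smartphone', 'Headphones', 'Laptop', 'Coffee Maker'],
    'Category': ['Electronics', 'Electronics', 'Accessories', 'Electronics', 'Kitchen'],
    'Quantity': [2, 1, 5, 3, 2],
    'Total Amount': [1600, 600, 500, 2400, 100],
    'Date': ['2024-01-10', '2024-01-11', '2024-01-12', '2024-01-12', '2024-01-13'],
    'Region': ['North', 'South', 'East', 'West', 'North']
})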
You may want to filter the dataset to show only sales related to Electronics products, so you filter your dataset by Category.
import pandas as pd
electronics_sales = df[df['Category'] == 'Electronics']
print(electronics_sales)
library(tidyverse)
electronics_sales <- filter(df, Category == "Electronics")
print(electronics_sales)
Order ID | Product ID | Product Name | Category | Quantity | Total Amount | Date | Region |
---|---|---|---|---|---|---|---|
1001 | P101 | Laptop | Electronics | 2 | $1,600 | 2024-01-10 | North |
1002 | P102 | Smartphone | Electronics | 1 | $600 | 2024-01-11 | South |
1004 | P101 | Laptop | Electronics | 3 | $2,400 | 2024-01-12 | West |
Let's say you want to filter sales that are greater than $1,000.
import pandas as pd
big_sales = df[df['Total Amount'] > 1000]
print(big_sales)
library(tidyverse)
big_sales <- filter(df, Total.Amount > 1000)
print(big_sales)
Order ID | Product ID | Product Name | Category | Quantity | Total Amount | Date | Region |
---|---|---|---|---|---|---|---|
1001 | P101 | Laptop | Electronics | 2 | $1,600 | 2024-01-10 | North |
1004 | P101 | Laptop | Electronics | 3 | $2,400 | 2024-01-12 | West |
Let's say you want to further filter the dataset to show sales where the Total Amount is greater than $1000 and the Date is before 2024-01-12.
You can combine both conditions in a single filter:
import pandas as pd
filtered_df = df[(df['Total Amount'] > 1000) & (pd.to_datetime(df['Date']) < '2024-01-12')]
print(filtered_df)
library(tidyverse)
filtered_df <- filter(df, Total.Amount > 1000 & Date < "2024-01-12")
print(filtered_df)
Order ID | Product ID | Product Name | Category | Quantity | Total Amount | Date | Region |
---|---|---|---|---|---|---|---|
1001 | P101 | Laptop | Electronics | 2 | $1,600 | 2024-01-10 | North |
You might also want to create a subset of the data that only contains Order ID, Product Name, and Total Amount for a sales report. In this case, subsetting your original dataset by specific columns will help.
import pandas as pd
subset_df = df[['Order ID', 'Product Name', 'Total Amount']]
print(subset_df)
library(tidyverse)
subset_df <- df %>%
  select(Order.ID, Product.Name, Total.Amount)
print(subset_df)
Order ID | Product Name | Total Amount |
---|---|---|
1001 | Laptop | $1,600 |
1002 | Smartphone | $600 |
1003 | Headphones | $500 |
1004 | Laptop | $2,400 |
1005 | Coffee Maker | $100 |
In summary, filtering helps you narrow down your dataset to specific rows that meet your criteria (e.g., by date, category, or region), whereas subsetting allows you to focus on specific columns or combinations of columns that are relevant to your analysis, removing unnecessary data.
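The two techniques are often combined in a single step. For example, here is one way (in pandas) to keep only the Electronics rows and, at the same time, only the three report columns; this sketch reuses the df built earlier:
# Filter rows (Electronics only) and subset columns in one step
electronics_report = df.loc[df['Category'] == 'Electronics', ['Order ID', 'Product Name', 'Total Amount']]
print(electronics_report)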
These operations enable you to extract relevant insights from large datasets by focusing only on the portions of the data that matter for your specific goals.
Aggregation involves summarizing data, such as calculating averages, sums, or counts. Transforming data includes creating new calculated fields, standardizing data, or applying mathematical functions.
Let's consider this new dataset where students have scores in multiple subjects.
Student ID | Math | Science | English |
---|---|---|---|
S001 | 85 | 78 | 90 |
S002 | 75 | 88 | 85 |
S003 | 95 | 82 | 80 |
S004 | 65 | 85 | 70 |
S005 | 90 | 79 | 95 |
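The examples below assume this table is loaded as a DataFrame named test_df in Python (the R versions assume an equivalent data.frame named data). Here is a minimal sketch that builds it, along with a simple column-wise aggregation: the class average in each subject.
# Load the pandas package
import pandas as pd
# Build the test score dataset shown above
test_df = pd.DataFrame({
    'Student ID': ['S001', 'S002', 'S003', 'S004', 'S005'],
    'Math': [85, 75, 95, 65, 90],
    'Science': [78, 88, 82, 85, 79],
    'English': [90, 85, 80, 70, 95]
})
# A simple aggregation: the class average for each subject
print(test_df[['Math', 'Science', 'English']].mean())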
With that dataset, we will calculate each student's average score, a weighted average, standardized scores for each subject, and a final letter grade.
import numpy as np
test_df['Average'] = np.mean(test_df[['Math', 'Science', 'English']], axis=1)
print(test_df)
data$Average <- rowMeans(data[, c('Math', 'Science', 'English')])
print(data)
Student ID | Math | Science | English | Average |
---|---|---|---|---|
S001 | 85 | 78 | 90 | 84.33 |
S002 | 75 | 88 | 85 | 82.67 |
S003 | 95 | 82 | 80 | 85.67 |
S004 | 65 | 85 | 70 | 73.33 |
S005 | 90 | 79 | 95 | 88.00 |
Suppose the Math subject is weighted slightly more (0.4) than Science and English (0.3 each). We can calculate the weighted average and display the results in a new column:
import numpy as np
test_df['Weighted_Average'] = (test_df['Math'] * 0.4) + (test_df['Science'] * 0.3) + (test_df['English'] * 0.3)
print(test_df)
data$Weighted_Average <- data$Math * 0.4 +
  data$Science * 0.3 +
  data$English * 0.3
print(data)
Student ID | Math | Science | English | Weighted_Average |
---|---|---|---|---|
S001 | 85 | 78 | 90 | 84.4 |
S002 | 75 | 88 | 85 | 81.9 |
S003 | 95 | 82 | 80 | 86.6 |
S004 | 65 | 85 | 70 | 72.5 |
S005 | 90 | 79 | 95 | 88.2 |
We can also calculate the standardized value for each test score. Standardization is the process of transforming data to have a mean of 0 and a standard deviation of 1. This technique is especially useful when comparing variables that have different units or scales. Standardizing the data ensures that each variable contributes equally to the analysis, making comparisons more meaningful.
import numpy as np
# Standardization function
def standardize(column):
    mean = np.mean(column)
    std_dev = np.std(column)
    return (column - mean) / std_dev
# Standardize the columns
test_df['Math_Standardized'] = standardize(test_df['Math'])
test_df['Science_Standardized'] = standardize(test_df['Science'])
test_df['English_Standardized'] = standardize(test_df['English'])
print(test_df)
# Note: scale() divides by the sample standard deviation (n - 1), so these values
# will differ slightly from the Python output above, which uses the population
# standard deviation via np.std
data$Math_Standardized <- scale(data$Math)
data$Science_Standardized <- scale(data$Science)
data$English_Standardized <- scale(data$English)
print(data)
Student ID | Math_Standardized | Science_Standardized | English_Standardized |
---|---|---|---|
S001 | 0.278543 | -1.182727 | 0.697486 |
S002 | -0.649934 | 1.505289 | 0.116248 |
S003 | 1.207020 | -0.107521 | -0.464991 |
S004 | -1.578410 | 0.698884 | -1.627467 |
S005 | 0.742781 | -0.913926 | 1.278724 |
Looking at the table above, a standardized score close to 0 indicates the student performed close to the class average. A higher (positive) standardized score means the student performed better than the class average, while a lower (negative) score means they performed worse than the rest of the class.
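As a quick check, each standardized column should now have a mean of (approximately) 0 and a standard deviation of 1. A small sketch, assuming the test_df columns created above:
# Means should be approximately 0 (tiny floating-point remainders are expected)
std_cols = ['Math_Standardized', 'Science_Standardized', 'English_Standardized']
print(test_df[std_cols].mean())
# Standard deviations should be exactly 1 (ddof=0 matches np.std used in the standardize function)
print(test_df[std_cols].std(ddof=0))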
Let's use the weighted average that we calculated earlier to assign a final grade to each student using the following grading scale: A for 85 and above, B for 75-84, C for 65-74, D for 55-64, and F for below 55.
# Define a function to assign a grade based on the weighted average
def assign_grade(weighted_avg):
    if weighted_avg >= 85:
        return 'A'
    elif weighted_avg >= 75:
        return 'B'
    elif weighted_avg >= 65:
        return 'C'
    elif weighted_avg >= 55:
        return 'D'
    else:
        return 'F'
# Apply the grading function to the weighted averages
test_df['Final_Grade'] = test_df['Weighted_Average'].apply(assign_grade)
# Display the final dataset with weighted averages and final grades
print(test_df)
# Function to assign final grades
assign_grade <- function(weighted_avg) {
  if (weighted_avg >= 85) {
    return("A")
  } else if (weighted_avg >= 75) {
    return("B")
  } else if (weighted_avg >= 65) {
    return("C")
  } else if (weighted_avg >= 55) {
    return("D")
  } else {
    return("F")
  }
}
# Apply the function to assign final grades based on the weighted average
data$Final_Grade <- sapply(data$Weighted_Average, assign_grade)
# Display the final dataset with weighted averages and final grades
print(data)
Student ID | Math | Science | English | Weighted_Average | Final_Grade |
---|---|---|---|---|---|
S001 | 85 | 78 | 90 | 84.4 | B |
S002 | 75 | 88 | 85 | 81.9 | B |
S003 | 95 | 82 | 80 | 86.6 | A |
S004 | 65 | 85 | 70 | 72.5 | C |
S005 | 90 | 79 | 95 | 88.2 | A |