Missing data is a common issue in datasets and can arise from various sources, such as incomplete data collection or data entry errors. How you handle missing data depends on the nature of your analysis and the extent of the missing values. Below are some common methods for handling missing data:
Let's use this dataset as an example:
ID | Name | Age | Salary |
---|---|---|---|
101 | Alice | 25 | 55000 |
102 | Bob | | 48000 |
103 | Charlie | 29 | |
104 | David | 33 | 62000 |
105 | Eve | | |
import pandas as pd
# Create the dataset
data = {'ID': [101, 102, 103, 104, 105],
'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
'Age': [25, None, 29, 33, None],
'Salary': [55000, 48000, None, 62000, None]}
df = pd.DataFrame(data)
# Remove rows with missing values
df_cleaned = df.dropna()
print(df_cleaned)
library(tidyverse)
# Create the dataset
data <- data.frame(
ID = c(101, 102, 103, 104, 105),
Name = c("Alice", "Bob", "Charlie", "David", "Eve"),
Age = c(25, NA, 29, 33, NA),
Salary = c(55000, 48000, NA, 62000, NA)
)
# Remove rows with missing values
cleaned_data <- na.omit(data)
print(cleaned_data)
ID | Name | Age | Salary |
---|---|---|---|
101 | Alice | 25 | 55000 |
104 | David | 33 | 62000 |
Assuming the same structure, with Age in column C, the following formula for Age_Imputed in column E should work in both spreadsheet applications.
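For example, assuming the records occupy rows 2 through 6, one possible formula for Cell E2 (dragged down through E6) is:
=IF(ISBLANK(C2), AVERAGE(C$2:C$6), C2)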
# Impute missing values with the mean of the column
for column in df.columns:
if df[column].dtype == 'float64' or df[column].dtype == 'int64':
        df[column] = df[column].fillna(df[column].mean())
print(df)
# Impute missing values with the mean
data <- data %>%
mutate_if(is.numeric, ~ ifelse(is.na(.), mean(., na.rm = TRUE), .))
print(data)
ID | Name | Age | Salary |
---|---|---|---|
101 | Alice | 25 | 55000 |
102 | Bob | 29 | 48000 |
103 | Charlie | 29 | 55000 |
104 | David | 33 | 62000 |
105 | Eve | 29 | 55000 |
Depending on the nature of your data and the goal of your analysis, you might want to consider different flagging strategies. If your dataset is small or if individual column values are critical to your analysis, column-based flagging is preferable as it gives you a detailed view. But if your dataset is large or you are primarily concerned with overall row completeness, row-based flagging might be simpler and more effective. You can also use both strategies, first flagging columns for missing values and then flagging rows that contain any missing values, allowing for both granular and holistic views of missing data.
Assuming the same spreadsheet structure is used, you can apply column-based flagging to the data in Cell A2 by entering the following formula in Cell E2, then dragging it across to the other columns (e.g., from B2 to F2, C2 to G2) and down across all rows (e.g., from A3 to E3, continuing through An to En):
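For example (adjust cell references to your sheet):
=IF(ISBLANK(A2), "Missing", "Not Missing")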
Then, you can use this formula to create a row-based flag across all original columns:
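Assuming the original data sits in columns A through D, one possible formula is:
=IF(COUNTBLANK(A2:D2)>0, "Missing", "Not Missing")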
import pandas as pd
# Create a column-based flag
for column in df.columns:
if df[column].isnull().any():
df[f'{column}_Flag'] = df[column].isna().apply(lambda x: 'Missing' if x else 'Not Missing')
# Create a row-based flag
df['Missing_Flag'] = df.isnull().any(axis=1).apply(lambda x: 'Missing' if x else 'Not Missing')
print(df)
library(tidyverse)
# Create a column-based flag
data <- data %>%
mutate_at(vars(everything()),
list(Flag = ~ ifelse(is.na(.), "Missing", "Not Missing")))
# Create a row-based flag
data <- data %>%
rowwise() %>%
mutate(Missing_Flag = ifelse(any(is.na(across(everything()))), "Missing", "Not Missing"))
print(data)
ID | Name | Age | Salary | ID_Flagged | Name_Flagged | Age_Flagged | Salary_Flagged | Missing_Flagged |
---|---|---|---|---|---|---|---|---|
101 | Alice | 25 | 55000 | Not Missing | Not Missing | Not Missing | Not Missing | Not Missing |
102 | Bob | | 48000 | Not Missing | Not Missing | Missing | Not Missing | Missing |
103 | Charlie | 29 | | Not Missing | Not Missing | Not Missing | Missing | Missing |
104 | David | 33 | 62000 | Not Missing | Not Missing | Not Missing | Not Missing | Not Missing |
105 | Eve | | | Not Missing | Not Missing | Missing | Missing | Missing |
Duplicates can skew your analysis by giving undue weight to certain data points. Identifying and removing duplicates is crucial to ensure the accuracy of your results. Below are some common methods for identifying duplicates:
Let's use this dataset as an example:
ID | Name | Age | Salary | City |
---|---|---|---|---|
101 | Alice | 25 | 55000 | NY |
102 | Bob | 29 | 48000 | LA |
103 | Charlie | 29 | 55000 | NY |
104 | David | 33 | 62000 | SF |
102 | Bob | 29 | 48000 | LA |
106 | Alice | 27 | 59000 | LA |
101 | Alice | 25 | 55000 | NY |
In this dataset, the rows for ID 101 and ID 102 each appear twice; these are exact duplicates. There are also two presumably distinct people who share the name "Alice" but have different IDs, ages, salaries, and cities. The approach you take to removing duplicates can therefore significantly affect the final results.
import pandas as pd
# Sample dataset
data = {'ID': [101, 102, 103, 104, 102, 106, 101],
        'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Bob', 'Alice', 'Alice'],
        'Age': [25, 29, 29, 33, 29, 27, 25],
        'Salary': [55000, 48000, 55000, 62000, 48000, 59000, 55000],
        'City': ['NY', 'LA', 'NY', 'SF', 'LA', 'LA', 'NY']}
# Original Dataset
df = pd.DataFrame(data)
print("Original Dataset")
print(df, '\n')
# Exact Match - based on all columns
df_cleaned = df.drop_duplicates()
print("Duplicates Removed - Exact Match")
print(df_cleaned, '\n')
# Partial Match - based on a specific column only (e.g., 'ID')
df_cleaned_id = df.drop_duplicates(subset=['ID'])
print("Duplicates Removed - Partial Match based on ID")
print(df_cleaned_id, '\n')
# Partial Match - based on a specific column only (e.g., 'Name')
df_cleaned_name = df.drop_duplicates(subset=['Name'])
print("Duplicates Removed - Partial Match based on Name")
print(df_cleaned_name, '\n')
# Partial Match - based on specific columns only (e.g., 'ID' and 'Name')
df_cleaned_specific = df.drop_duplicates(subset=['ID', 'Name'])
print("Duplicates Removed - Partial Match based on ID and Name")
print(df_cleaned_specific, '\n')
library(tidyverse)
# Create the dataset
data <- data.frame(
  ID = c(101, 102, 103, 104, 102, 106, 101),
  Name = c("Alice", "Bob", "Charlie", "David", "Bob", "Alice", "Alice"),
  Age = c(25, 29, 29, 33, 29, 27, 25),
  Salary = c(55000, 48000, 55000, 62000, 48000, 59000, 55000),
  City = c("NY", "LA", "NY", "SF", "LA", "LA", "NY")
)
# Original Dataset
cat("Original Dataset\n")
print(data)
cat("\n")
# Exact Match - based on all columns
data_cleaned <- distinct(data)
cat("Duplicates Removed - Exact Match\n")
print(data_cleaned)
cat("\n")
# Partial Match - based on a specific column only (e.g., 'ID')
data_cleaned_id <- distinct(data, ID, .keep_all = TRUE)
cat("Duplicates Removed - Partial Match based on ID\n")
print(data_cleaned_id)
cat("\n")
# Partial Match - based on a specific column only (e.g., 'Name')
data_cleaned_name <- distinct(data, Name, .keep_all = TRUE)
cat("Duplicates Removed - Partial Match based on Name\n")
print(data_cleaned_name)
cat("\n")
# Partial Match - based on specific columns only (e.g., 'ID' and 'Name')
data_cleaned_specific <- distinct(data, ID, Name, .keep_all = TRUE)
cat("Duplicates Removed - Partial Match based on ID and Name\n")
print(data_cleaned_specific)
Exact Match
ID | Name | Age | Salary | City |
---|---|---|---|---|
101 | Alice | 25 | 55000 | NY |
102 | Bob | 29 | 48000 | LA |
103 | Charlie | 29 | 55000 | NY |
104 | David | 33 | 62000 | SF |
106 | Alice | 27 | 59000 | LA |
Partial Match (ID)
ID | Name | Age | Salary | City |
---|---|---|---|---|
101 | Alice | 25 | 55000 | NY |
102 | Bob | 29 | 48000 | LA |
103 | Charlie | 29 | 55000 | NY |
104 | David | 33 | 62000 | SF |
106 | Alice | 27 | 59000 | LA |
Partial Match (Name)
ID | Name | Age | Salary | City |
---|---|---|---|---|
101 | Alice | 25 | 55000 | NY |
102 | Bob | 29 | 48000 | LA |
103 | Charlie | 29 | 55000 | NY |
104 | David | 33 | 62000 | SF |
Partial Match (ID and Name)
ID | Name | Age | Salary | City |
---|---|---|---|---|
101 | Alice | 25 | 55000 | NY |
102 | Bob | 29 | 48000 | LA |
103 | Charlie | 29 | 55000 | NY |
104 | David | 33 | 62000 | SF |
106 | Alice | 27 | 59000 | LA |
As shown above, in this particular case a partial match based solely on "Name" yields a different result than the other methods, because two distinct observations share only that one attribute. With other datasets, other matching choices could diverge in the same way, so it is crucial to select appropriate key identifiers carefully when using a partial match approach.
Data collected from various sources might be in different formats, making it difficult to analyze. Normalizing these formats ensures consistency across your dataset. Below are some common tasks in normalization:
Let's use this dataset as an example:
Event ID | Event Date |
---|---|
1 | 12/01/2023 |
2 | December 2, 2023 |
3 | 2023-12-03 |
4 | 12.04.2023 |
5 | 5th Dec 2023 |
Note that in both spreadsheet applications, dates may not automatically convert to a standardized format, especially when they are typed with unsupported separators (such as a period "." or a comma ",") instead of a dash "-" or a slash "/", or when they include ordinal suffixes such as "st," "nd," "rd," or "th." In such cases, extra steps may be needed to correct the format, either by manually replacing or removing the unsupported characters or by using the "Find and Replace" feature.
import pandas as pd
from dateutil.parser import parse
# Sample data
data = {
'Event ID': [1, 2, 3, 4, 5],
    'Event Date': ['12/01/2023', 'December 2, 2023', '2023-12-03', '12.04.2023', '5th Dec 2023']
}
df = pd.DataFrame(data)
print("Original Dataset")
print(df, '\n')
# Custom function to parse dates
def parse_date(date_str):
try:
return parse(date_str, dayfirst=False)
except ValueError:
return pd.NaT
# Apply the custom function
df['Event Date'] = df['Event Date'].apply(parse_date)
# Standardize format to YYYY-MM-DD
df['Event Date'] = df['Event Date'].dt.strftime('%Y-%m-%d')
print("Result")
print(df)
library(tidyverse)
library(lubridate)
# Sample data
data <- tibble(
`Event ID` = 1:5,
`Event Date` = c('01/12/2023', 'December 2, 2023', '2023-12-03', '12.04.2023', '5th Dec 2023')
)
cat("Original Dataset")
print(data)
cat("\n")
# Convert 'Event Date' to Date format
data <- data %>%
mutate(`Event Date` = parse_date_time(`Event Date`, orders = c("dmy", "mdy", "ymd")))
# Standardize format to YYYY-MM-DD
data <- data %>%
mutate(`Event Date` = format(`Event Date`, "%Y-%m-%d"))
cat("Result")
print(data)
Event ID | Event Date |
---|---|
1 | 2023-12-01 |
2 | 2023-12-02 |
3 | 2023-12-03 |
4 | 2023-12-04 |
5 | 2023-12-05 |
Let's use this dataset as an example:
Product ID | Product Name |
---|---|
101 | apple |
102 | Banana |
103 | ORANGE |
104 | baNana |
105 | Apple |
Assuming the Product Name header is located in Cell B1, you can use the following formula in both spreadsheet applications, starting in Cell C2:
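=LOWER(B2)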
You can then drag it down to apply the same formula to all other rows (i.e., covering B3 through Bn).
import pandas as pd
# Sample data
data = {
'Product ID': [101, 102, 103, 104, 105],
'Product Name': ['apple', 'Banana', 'ORANGE', 'baNana', 'Apple']
}
df = pd.DataFrame(data)
# Convert 'Product Name' to lowercase
df['Product Name'] = df['Product Name'].str.lower()
print(df)
library(tidyverse)
# Sample data
data <- tibble(
`Product ID` = c(101, 102, 103, 104, 105),
`Product Name` = c('apple', 'Banana', 'ORANGE', 'baNana', 'Apple')
)
# Convert 'Product Name' to lowercase
data <- data %>%
mutate(`Product Name` = tolower(`Product Name`))
print(data)
Product ID | Product Name |
---|---|
101 | apple |
102 | banana |
103 | orange |
104 | banana |
105 | apple |
Let's use this dataset as an example:
Sample ID | Weight |
---|---|
A1 | 70 kg |
A2 | 154 lbs |
A3 | 68 KG |
A4 | 160 pound |
A5 | 72 kilograms |
Formulas:
1. Extracting numeric value from Weight (Column B) and placing it in Column C:
Google Sheets (using REGEXEXTRACT)
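For example, with the first record in row 2:
=VALUE(REGEXEXTRACT(B2, "\d+\.?\d*"))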
For both Google Sheets and Microsoft Excel:
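Assuming the numeric value and the unit are separated by a single space:
=VALUE(LEFT(B2, FIND(" ", B2)-1))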
2. Extracting the string (unit) from Weight (Column B) and placing it in Column D:
Google Sheets (using REGEXEXTRACT)
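For example:
=LOWER(REGEXEXTRACT(B2, "[A-Za-z]+"))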
For both Google Sheets and Microsoft Excel:
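Again assuming a single space separates the value and the unit:
=LOWER(MID(B2, FIND(" ", B2)+1, LEN(B2)))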
3. Converting the weight to kg based on the original unit and placing it in Column E:
For both Google Sheets and Microsoft Excel:
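With the numeric value in Cell C2 and the lowercase unit in Cell D2, one possible formula is:
=IF(OR(D2="kg", D2="kilograms"), C2, IF(OR(D2="lbs", D2="pound"), C2*0.45359237, ""))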
import pandas as pd
# Sample data
data = {
'Sample ID': ['A1', 'A2', 'A3', 'A4', 'A5'],
'Weight': ['70 kg', '154 lbs', '68 KG', '160 pound', '72 kilograms']
}
df = pd.DataFrame(data)
# Extract numeric value and unit
df['Numeric Value'] = df['Weight'].str.extract(r'(\d+\.?\d*)').astype(float)
# Use str to convert the extracted unit to lowercase
df['Unit'] = df['Weight'].str.extract(r'([a-zA-Z]+)')[0].str.lower() # Access the first element of the returned DataFrame, which is a Series
# Conversion function
def convert_to_kg(row):
if row['Unit'] in ['kg', 'kilograms']:
return row['Numeric Value']
elif row['Unit'] in ['lbs', 'pound']:
return row['Numeric Value'] * 0.453592
else:
return None # or handle unknown units
df['Weight in kg'] = df.apply(convert_to_kg, axis=1)
print(df[['Sample ID', 'Weight', 'Weight in kg']])
library(tidyverse)
# Sample data
data <- tibble(
`Sample ID` = c('A1', 'A2', 'A3', 'A4', 'A5'),
Weight = c('70 kg', '154 lbs', '68 KG', '160 pound', '72 kilograms')
)
# Extract numeric value and unit
data <- data %>%
mutate(
`Numeric Value` = as.numeric(str_extract(Weight, "\\d+\\.?\\d*")),
Unit = tolower(str_extract(Weight, "[a-zA-Z]+"))
)
# Conversion function
convert_to_kg <- function(value, unit) {
if (unit %in% c('kg', 'kilograms')) {
return(value)
} else if (unit %in% c('lbs', 'pound')) {
return(value * 0.453592)
} else {
return(NA)
}
}
# Apply conversion
data <- data %>%
rowwise() %>%
mutate(`Weight in kg` = convert_to_kg(`Numeric Value`, Unit)) %>%
ungroup()
print(data %>% select(`Sample ID`, Weight, `Weight in kg`))
Sample ID | Weight | Numeric Value | Unit | Weight_kg |
---|---|---|---|---|
A1 | 70 kg | 70 | kg | 70 |
A2 | 154 lbs | 154 | lbs | 69.85322498 |
A3 | 68 KG | 68 | kg | 68 |
A4 | 160 pound | 160 | pound | 72.5747792 |
A5 | 72 kilograms | 72 | kilograms | 72 |
Outliers are data points that deviate significantly from other observations in the dataset. They can be the result of errors or true anomalies, and handling them appropriately is essential for accurate analysis. Below are some common methods for handling outliers:
Let's use this annual income dataset as an example:
Person ID | Income ($) |
---|---|
1 | 80,000 |
2 | 95,000 |
3 | 115,000 |
4 | 75,000 |
5 | 110,000 |
6 | 85,000 |
7 | 100,000 |
8 | 199,000 |
9 | 105,000 |
10 | 90,000 |
import pandas as pd
import matplotlib.pyplot as plt
# Sample data
data = {
'Person ID': range(1, 11),
'Income': [80000, 95000, 115000, 75000, 110000, 85000, 100000, 199000, 105000, 90000]
}
df = pd.DataFrame(data)
# Box Plot
plt.figure(figsize=(8, 4))
plt.boxplot(df['Income'], vert=False, showfliers=True)
plt.title('Box Plot of Incomes')
plt.xlabel('Income ($)')
plt.show()
# Scatter Plot
plt.figure(figsize=(8, 4))
plt.scatter(df['Person ID'], df['Income'])
plt.title('Scatter Plot of Incomes')
plt.xlabel('Person ID')
plt.ylabel('Income ($)')
plt.show()
library(tidyverse)
# Sample data
data <- tibble(
`Person ID` = 1:10,
`Income` = c(80000, 95000, 115000, 75000, 110000, 85000, 100000, 199000, 105000, 90000)
)
# Box Plot
ggplot(data, aes(y = Income)) +
geom_boxplot(outlier.colour = "red", outlier.shape = 8) +
ggtitle("Box Plot of Incomes") +
ylab("Income ($000)")
# Scatter Plot
ggplot(data, aes(x = `Person ID`, y = Income)) +
geom_point() +
ggtitle("Scatter Plot of Incomes") +
xlab("Person ID") +
ylab("Income ($000)")
Box Plot:
Scatter Plot:
import pandas as pd
# Sample data
data = {
'Person ID': range(1, 11),
'Income': [80000, 95000, 115000, 75000, 110000, 85000, 100000, 199000, 105000, 90000]
}
df = pd.DataFrame(data)
# Calculate Q1 and Q3
Q1 = df['Income'].quantile(0.25)
Q3 = df['Income'].quantile(0.75)
IQR = Q3 - Q1
# Calculate bounds
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
# Identify outliers
df['Outlier'] = df['Income'].apply(lambda x: 'Outlier' if (x < lower_bound or x > upper_bound) else 'Normal')
print(df)
library(tidyverse)
# Sample data
data <- tibble(
`Person ID` = 1:10,
`Income` = c(80000, 95000, 115000, 75000, 110000, 85000, 100000, 199000, 105000, 90000)
)
# Calculate Q1 and Q3
Q1 <- quantile(data$Income, 0.25)
Q3 <- quantile(data$Income, 0.75)
IQR <- Q3 - Q1
# Calculate bounds
lower_bound <- Q1 - 1.5 * IQR
upper_bound <- Q3 + 1.5 * IQR
# Identify outliers
data <- data %>%
  mutate(Outlier = ifelse(Income < lower_bound | Income > upper_bound, "Outlier", "Normal"))
print(data)
Person_ID | Income | Outlier |
---|---|---|
1 | 80,000 | Normal |
2 | 95,000 | Normal |
3 | 115,000 | Normal |
4 | 75,000 | Normal |
5 | 110,000 | Normal |
6 | 85,000 | Normal |
7 | 100,000 | Normal |
8 | 199,000 | Outlier |
9 | 105,000 | Normal |
10 | 90,000 | Normal |
Let's assess the impact of the identified outlier on the overall dataset.
# Mean and Standard Deviation with Outlier
mean_with_outlier = df['Income'].mean()
std_with_outlier = df['Income'].std()
# Mean and Standard Deviation without Outlier
df_clean = df[df['Outlier'] == 'Normal']
mean_without_outlier = df_clean['Income'].mean()
std_without_outlier = df_clean['Income'].std()
print("Income with Outlier")
print(f"Mean: ${mean_with_outlier:,.2f}")
print(f"SD: ${std_with_outlier:,.2f}")
print("Income without Outlier")
print(f"Mean: ${mean_without_outlier:,.2f}")
print(f"SD: ${std_without_outlier:,.2f}")
# Mean and Standard Deviation with Outlier
mean_with_outlier <- mean(data$Income)
std_with_outlier <- sd(data$Income)
# Filter out the outlier
data_clean <- data %>% filter(Outlier == "Normal")
# Mean and Standard Deviation without Outlier
mean_without_outlier <- mean(data_clean$Income)
std_without_outlier <- sd(data_clean$Income)
# Print results
cat("Income with Outlier\n")
cat(sprintf("Mean: $%.2f\n", mean_with_outlier))
cat(sprintf("SD: $%.2f\n", std_with_outlier))
cat("\nIncome without Outlier\n")
cat(sprintf("Mean: $%.2f\n", mean_without_outlier))
cat(sprintf("SD: $%.2f\n", std_without_outlier))
As you can see, the outlier increases the mean by $10,400 and the standard deviation by $21,637.76. Based on this finding, you have three options: keep the outlier (if it is a legitimate value that belongs in your analysis), remove it (if it is an error or clearly unrepresentative), or transform the data (for example, with a log transformation or by capping extreme values) to reduce its influence.
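As a rough illustration of the remove and transform options, the Python sketch below reuses the df and upper_bound objects created earlier; the log transformation shown is only one of several possible transformations.
import numpy as np
# Option: remove rows beyond the upper bound
df_removed = df[df['Income'] <= upper_bound]
print(df_removed)
# Option: transform the data to reduce the outlier's influence
# (a log transformation compresses large values)
df['Log_Income'] = np.log(df['Income'])
print(df[['Person ID', 'Income', 'Log_Income']])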
In the end, the decision on how to handle outliers is yours, as long as it is based on a thoughtful and informed rationale. Whether you choose to keep, remove, or transform outliers should depend on the context of your analysis and the impact these values have on your results. Outliers can provide valuable insights or skew your data, so it is important to consider the nature of your data, the objectives of your analysis, and how outliers affect the accuracy and interpretation of your findings. Ultimately, your approach should align with your goals, ensuring that the conclusions you draw are both robust and credible.
Data entry errors are a common challenge, particularly in manual data collection processes. Left unaddressed, these errors can skew analyses, lead to incorrect conclusions, and compromise the integrity of your dataset. Therefore, it is essential to identify and correct them to ensure data accuracy and reliability. Below are some common methods for detecting and fixing data entry errors:
Validation rules help prevent errors before they occur by setting up conditions that data must meet. This method ensures that values entered into the system are consistent, logical, and fall within expected ranges. Examples of validation rules include range checks (e.g., an age must be between 0 and 120), format checks (e.g., an email address must contain "@" and a domain), and logical checks (e.g., a date of birth cannot be in the future, as demonstrated below).
import pandas as pd
from datetime import datetime
# Sample data
data = {
'Name': ['John Doe', 'Jane Smith', 'Alice Johnson'],
'DOB': ['1990-01-15', '2094-05-10', '1985-07-23']
}
df = pd.DataFrame(data)
# Convert 'DOB' to datetime and flag future dates
df['DOB'] = pd.to_datetime(df['DOB'], errors='coerce')
df['Valid DOB'] = df['DOB'] <= pd.Timestamp(datetime.today())
print(df)
library(tidyverse)
library(lubridate)
# Sample data
data <- tibble(
Name = c("John Doe", "Jane Smith", "Alice Johnson"),
DOB = c("1990-01-15", "2094-05-10", "1985-07-23")
)
# Convert 'DOB' to datetime and flag future dates
data <- data %>%
mutate(DOB = ymd(DOB),
`Valid DOB` = DOB <= today())
print(data)
Name | DOB | Valid DOB |
---|---|---|
John Doe | 1990-01-15 0:00:00 | TRUE |
Jane Smith | 2094-05-10 0:00:00 | FALSE |
Alice Johnson | 1985-07-23 0:00:00 | TRUE |
For key variables, especially in small datasets, a manual review of entries is often necessary. This method involves going through the dataset by hand to identify inconsistencies, missing values, or obvious mistakes. While time-consuming, manual review is highly effective in catching subtle errors that automated tools might miss, such as transposed digits (e.g., entering "1987" instead of "1897" for a birth year).
Automated checks leverage software tools to automatically identify and flag common data entry errors. These checks can be programmed to detect outliers, incorrect formats, missing values, or other anomalies. Examples include range checks that flag values outside plausible limits, format checks that verify entries such as dates or email addresses, and completeness checks that count missing values per record (see the sketch below).
Automated tools like Apps Script in Google Sheets, Macros in Microsoft Excel, Python, and R allow you to perform these checks efficiently, especially for larger datasets. Python and R, in particular, offer robust libraries for data validation and error correction.
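For instance, a lightweight automated check in Python might combine a range check, a format check, and a completeness check. This is a minimal sketch using made-up records and a deliberately simple email pattern; adapt the rules to your own dataset.
import pandas as pd
# Hypothetical records containing an out-of-range age and a malformed email
records = pd.DataFrame({
    'Name': ['John Doe', 'Jane Smith', 'Alice Johnson'],
    'Age': [34, 290, 41],
    'Email': ['john@example.com', 'jane@example.com', 'alice.example.com']
})
# Range check: ages should fall between 0 and 120
records['Age_OK'] = records['Age'].between(0, 120)
# Format check: emails should match a simple user@domain pattern
records['Email_OK'] = records['Email'].str.match(r'^[^@\s]+@[^@\s]+\.[^@\s]+$')
# Completeness check: flag rows with any missing values
records['Complete'] = records.notna().all(axis=1)
print(records)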