
Data Cleaning and Wrangling Guide

Essential techniques and best practices for preparing ready-to-use data, with implementation examples in Google Sheets, Microsoft Excel, Python, and R.

Handling Missing Data

Missing data is a common issue in datasets and can arise from various sources, such as incomplete data collection or data entry errors. How you handle missing data depends on the nature of your analysis and the extent of the missing values. Below are some common methods for handling missing data:

  1. Removal: In some cases, it might be appropriate to remove rows or columns with missing data, especially if the missingness is random and minimal.
  2. Imputation: Replacing missing values with substituted data, such as the mean, median, or a predicted value based on other variables.
  3. Flagging: Adding a flag to indicate that data is missing, which can be useful in tracking how missing data might affect your analysis.

Let's use this dataset as an example:

ID   Name     Age  Salary
101  Alice    25   55000
102  Bob           48000
103  Charlie  29
104  David    33   62000
105  Eve

Method 1: Removal

Google Sheets and Microsoft Excel

In both applications, you can apply a filter to the data range (Data > Create a filter in Google Sheets; Data > Filter in Excel), filter each column for blank cells, and delete the matching rows.

Python
import pandas as pd

# Create the dataset
data = {'ID': [101, 102, 103, 104, 105],
        'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
        'Age': [25, None, 29, 33, None],
        'Salary': [55000, 48000, None, 62000, None]}

df = pd.DataFrame(data)

# Remove rows with missing values
df_cleaned = df.dropna()

print(df_cleaned)
R
library(tidyverse)

# Create the dataset
data <- data.frame(
  ID = c(101, 102, 103, 104, 105),
  Name = c("Alice", "Bob", "Charlie", "David", "Eve"),
  Age = c(25, NA, 29, 33, NA),
  Salary = c(55000, 48000, NA, 62000, NA)
)

# Remove rows with missing values
cleaned_data <- na.omit(data)

print(cleaned_data)
Result:
ID Name Age Salary
101 Alice 25 55000
104 David 33 62000

Method 2: Imputation

Google Sheets and Microsoft Excel

Assuming the same structure, with Age in column C, the following formula for Age_Imputed in column E should work in both spreadsheet applications.

=IF(ISBLANK(C2), AVERAGE($C$2:$C$6), C2)
Python
# Impute missing values in numeric columns with the column mean
for column in df.select_dtypes(include='number').columns:
    df[column] = df[column].fillna(df[column].mean())

print(df)
R
# Impute missing values in numeric columns with the column mean
data <- data %>%
  mutate(across(where(is.numeric), ~ ifelse(is.na(.), mean(., na.rm = TRUE), .)))

print(data)
Result:
ID Name Age Salary
101 Alice 25 55000
102 Bob 29 48000
103 Charlie 29 55000
104 David 33 62000
105 Eve 29 55000
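
Mean imputation is sensitive to extreme values, so the methods list above also mentions the median or a predicted value. A minimal Python sketch of median imputation, applied to the original df in place of the mean:

# Impute missing numeric values with the column median,
# which is more robust to extreme values than the mean
for column in df.select_dtypes(include='number').columns:
    df[column] = df[column].fillna(df[column].median())

print(df)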

Method 3: Flagging

Depending on the nature of your data and the goal of your analysis, you might want to consider different flagging strategies. If your dataset is small or if individual column values are critical to your analysis, column-based flagging is preferable as it gives you a detailed view. But if your dataset is large or you are primarily concerned with overall row completeness, row-based flagging might be simpler and more effective. You can also use both strategies, first flagging columns for missing values and then flagging rows that contain any missing values, allowing for both granular and holistic views of missing data.

Google Sheets and Microsoft Excel

Assuming the same spreadsheet structure, you can apply column-based flagging to the data in Cell A2 using the following formula, starting in Cell E2. You can then drag it across to cover the other columns (e.g., B2 to F2, C2 to G2) and down to cover all rows (e.g., A3 to E3, continuing through An to En).

=IF(ISBLANK(A2), "Missing", "Not Missing")

Then, you can use this formula to create a row-based flag across all original columns:

=IF(OR(ISBLANK(A2), ISBLANK(B2), ISBLANK(C2), ISBLANK(D2)), "Missing", "Not Missing")
Python
import pandas as pd

# Create a column-based flag for every column
for column in df.columns.tolist():  # snapshot the names so new flag columns are not re-flagged
    df[f'{column}_Flag'] = df[column].isna().apply(lambda x: 'Missing' if x else 'Not Missing')

# Create a row-based flag (the flag columns contain no missing values, so checking all columns is safe)
df['Missing_Flag'] = df.isnull().any(axis=1).apply(lambda x: 'Missing' if x else 'Not Missing')

print(df)
R
library(tidyverse)

# Create a column-based flag for every column
data <- data %>%
  mutate(across(everything(),
                ~ ifelse(is.na(.), "Missing", "Not Missing"),
                .names = "{.col}_Flag"))

# Create a row-based flag (the flag columns contain no NAs, so checking every column is safe)
data <- data %>%
  mutate(Missing_Flag = ifelse(if_any(everything(), is.na), "Missing", "Not Missing"))

print(data)
Result:

ID   Name     Age  Salary  ID_Flag      Name_Flag    Age_Flag     Salary_Flag  Missing_Flag
101  Alice    25   55000   Not Missing  Not Missing  Not Missing  Not Missing  Not Missing
102  Bob           48000   Not Missing  Not Missing  Missing      Not Missing  Missing
103  Charlie  29           Not Missing  Not Missing  Not Missing  Missing      Missing
104  David    33   62000   Not Missing  Not Missing  Not Missing  Not Missing  Not Missing
105  Eve                   Not Missing  Not Missing  Missing      Missing      Missing
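
Once the flags are in place, a quick summary shows how much missingness you are dealing with. A minimal Python sketch using the same df:

# Count missing values per original column, then count affected rows
print(df[['ID', 'Name', 'Age', 'Salary']].isnull().sum())
print((df['Missing_Flag'] == 'Missing').sum(), 'row(s) contain missing values')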

Removing Duplicates

Duplicates can skew your analysis by giving undue weight to certain data points. Identifying and removing duplicates is crucial to ensure the accuracy of your results. Below are some common methods for identifying duplicates:

  1. Exact Match: Finding rows that are completely identical across all columns.
  2. Partial Match: Identifying duplicates based on a subset of columns (e.g., the same name and date, but different addresses).

Let's use this dataset as an example: 

ID Name Age Salary City
101 Alice 25 55000 NY
102 Bob 29 48000 LA
103 Charlie 29 55000 NY
104 David 33 62000 SF
102 Bob 29 48000 LA
106 Alice 27 59000 LA
101 Alice 25 55000 NY

In this dataset, you can see that ID 102 and ID 101 each appear twice, which are considered duplicates. However, there are two presumably distinct observations that share the name "Alice" but have different IDs, ages, salaries, and cities. The approach taken to remove duplicates could significantly impact the final results.

Google Sheets

Use Data > Data cleanup > Remove duplicates, which lets you choose which columns to compare when identifying duplicates.

Microsoft Excel

Use Data > Remove Duplicates on the Data tab and select the columns that define a duplicate.

Python
import pandas as pd

# Sample dataset
data = {'ID': [101, 102, 103, 104, 102, 106, 101],
        'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Bob', 'Alice', 'Alice'],
        'Age': [25, 29, 29, 33, 29, 27, 25],
        'Salary': [55000, 48000, 55000, 62000, 48000, 59000, 55000],
        'City': ['NY', 'LA', 'NY', 'SF', 'LA', 'LA', 'NY']}

# Original Dataset
df = pd.DataFrame(data)
print("Original Dataset")
print(df, '\n')

# Exact Match - based on all columns
df_cleaned = df.drop_duplicates()
print("Duplicates Removed - Exact Match")
print(df_cleaned, '\n')

# Partial Match - based on specific columns only (e.g., 'ID')
df_cleaned_id = df.drop_duplicates(subset=['ID'])
print("Duplicates Removed - Partial Match based on ID")
print(df_cleaned_id, '\n')

# Partial Match - based on specific columns only (e.g., 'Name')
df_cleaned_name = df.drop_duplicates(subset=['Name'])
print("Duplicates Removed - Partial Match based on Name")
print(df_cleaned_name, '\n')

# Partial Match - based on specific columns only (e.g., 'ID' and 'Name')
df_cleaned_specific = df.drop_duplicates(subset=['ID', 'Name'])
print("Duplicates Removed - Partial Match based on ID and Name")
print(df_cleaned_specific, '\n')
R
library(tidyverse)

# Create the dataset
data <- data.frame(
  ID = c(101, 102, 103, 104, 102, 106, 101),
  Name = c("Alice", "Bob", "Charlie", "David", "Bob", "Alice", "Alice"),
  Age = c(25, 29, 29, 33, 29, 27, 25),
  Salary = c(55000, 48000, 55000, 62000, 48000, 59000, 55000),
  City = c("NY", "LA", "NY", "SF", "LA", "LA", "NY")
)

# Original Dataset
cat("Original Dataset\n")
print(data)
cat("\n")

# Exact Match - based on all columns
data_cleaned <- distinct(data)
cat("Duplicates Removed - Exact Match\n")
print(data_cleaned)
cat("\n")

# Partial Match - based on specific columns only (e.g., 'ID')
data_cleaned_id <- distinct(data, ID, .keep_all = TRUE)
cat("Duplicates Removed - Partial Match based on ID\n")
print(data_cleaned_id)
cat("\n")

# Partial Match - based on specific columns only (e.g., 'Name')
data_cleaned_name <- distinct(data, Name, .keep_all = TRUE)
cat("Duplicates Removed - Partial Match based on Name\n")
print(data_cleaned_name)
cat("\n")

# Partial Match - based on specific columns only (e.g., 'ID' and 'Name')
data_cleaned_specific <- distinct(data, ID, Name, .keep_all = TRUE)
cat("Duplicates Removed - Partial Match based on ID and Name\n")
print(data_cleaned_specific)

 

Result:

Exact Match

ID Name Age Salary City
101 Alice 25 55000 NY
102 Bob 29 48000 LA
103 Charlie 29 55000 NY
104 David 33 62000 SF
106 Alice 27 59000 LA

Partial Match (ID)

ID Name Age Salary City
101 Alice 25 55000 NY
102 Bob 29 48000 LA
103 Charlie 29 55000 NY
104 David 33 62000 SF
106 Alice 27 59000 LA

Partial Match (Name)

ID Name Age Salary City
101 Alice 25 55000 NY
102 Bob 29 48000 LA
103 Charlie 29 55000 NY
104 David 33 62000 SF

Partial Match (ID and Name)

ID Name Age Salary City
101 Alice 25 55000 NY
102 Bob 29 48000 LA
103 Charlie 29 55000 NY
104 David 33 62000 SF
106 Alice 27 59000 LA

As shown above, using a partial match based solely on "Name" yields a different result than the other methods, because two distinct observations happen to share only that one attribute. With other datasets, a different choice of key could be the one that changes the outcome, so it is crucial to carefully select the appropriate key identifiers when using a partial match approach.
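
Before removing anything, it can help to inspect the candidate duplicates side by side. A minimal Python sketch using the same df:

# keep=False marks every member of a duplicate group, not just the repeats,
# so all candidate rows can be reviewed together before deciding what to drop
dupes_by_name = df[df.duplicated(subset=['Name'], keep=False)]
print(dupes_by_name.sort_values('Name'))

Sorted by Name, the three "Alice" rows sit together, and the ID 106 row clearly differs in ID, age, salary, and city, which shows at a glance that Name alone is too loose a key for this dataset.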

Normalizing Data Formats

Data collected from various sources might be in different formats, making it difficult to analyze. Normalizing these formats ensures consistency across your dataset. Below are some common tasks in normalization:

  1. Standardizing Dates: Converting dates to a consistent format (e.g., YYYY-MM-DD).
  2. Text Case Standardization: Ensuring that text data is consistently in uppercase or lowercase.
  3. Unit Conversion: Converting measurements to a common unit (e.g., converting all weights to kilograms).

Case 1: Standardizing Dates

Let's use this dataset as an example:

Event ID Event Date
1 12/01/2023
2 December 2, 2023
3 2023-12-03
4 12.04.2023
5 5th Dec 2023
Google Sheets and Microsoft Excel

Note that in both spreadsheet applications, there are instances where dates may not automatically convert to a standardized format, especially when they are typed with unsupported separators (such as a dot . or comma ,) instead of a dash - or slash /, or when they include ordinal suffixes like "st," "nd," "rd," or "th." In such cases, extra steps may be needed to correct the format, either by manually replacing or removing the unsupported characters or by using the "Find and Replace" feature.

Python
import pandas as pd
from dateutil.parser import parse

# Sample data
data = {
    'Event ID': [1, 2, 3, 4, 5],
    'Event Date': ['12/01/2023', 'December 2, 2023', '2023-12-03', '12.04.2023', '5th Dec 2023']
}

df = pd.DataFrame(data)

print("Original Dataset")
print(df, '\n')

# Custom function to parse dates; dayfirst=False reads ambiguous
# dates like 12/01/2023 as month-first (December 1)
def parse_date(date_str):
    try:
        return parse(date_str, dayfirst=False)
    except ValueError:
        return pd.NaT

# Apply the custom function
df['Event Date'] = df['Event Date'].apply(parse_date)

# Standardize format to YYYY-MM-DD
df['Event Date'] = df['Event Date'].dt.strftime('%Y-%m-%d')

print("Result")
print(df)
R
library(tidyverse)
library(lubridate)

# Sample data
data <- tibble(
  `Event ID` = 1:5,
  `Event Date` = c('12/01/2023', 'December 2, 2023', '2023-12-03', '12.04.2023', '5th Dec 2023')
)
cat("Original Dataset")
print(data)
cat("\n")

# Convert 'Event Date' to Date format; listing "mdy" first makes
# ambiguous dates like 12/01/2023 parse as month-first (December 1)
data <- data %>%
  mutate(`Event Date` = parse_date_time(`Event Date`, orders = c("mdy", "dmy", "ymd")))

# Standardize format to YYYY-MM-DD
data <- data %>%
  mutate(`Event Date` = format(`Event Date`, "%Y-%m-%d"))

cat("Result")
print(data)
Result:

Event ID Event Date
1 2023-12-01
2 2023-12-02
3 2023-12-03
4 2023-12-04
5 2023-12-05

Case 2: Text Case Standardization

Let's use this dataset as an example:

Product ID Product Name
101 apple
102 Banana
103 ORANGE
104 baNana
105 Apple
Google Sheets and Microsoft Excel

Assuming the Product Name header is located in Cell B1, you can use the following formula in both spreadsheet applications, starting in Cell C2:

=LOWER(B2)

You can then drag it down to apply the same formula to all other rows (i.e., from B3 continuing through Bn).

Python
import pandas as pd

# Sample data
data = {
    'Product ID': [101, 102, 103, 104, 105],
    'Product Name': ['apple', 'Banana', 'ORANGE', 'baNana', 'Apple']
}

df = pd.DataFrame(data)

# Convert 'Product Name' to lowercase
df['Product Name'] = df['Product Name'].str.lower()

print(df)
R
library(tidyverse)

# Sample data
data <- tibble(
  `Product ID` = c(101, 102, 103, 104, 105),
  `Product Name` = c('apple', 'Banana', 'ORANGE', 'baNana', 'Apple')
)

# Convert 'Product Name' to lowercase
data <- data %>%
  mutate(`Product Name` = tolower(`Product Name`))

print(data)
Result:

Product ID Product Name
101 apple
102 banana
103 orange
104 banana
105 apple

Case 3: Unit Conversion

Let's use this dataset as an example:

Sample ID Weight
A1 70 kg
A2 154 lbs
A3 68 KG
A4 160 pound
A5 72 kilograms

 

Google Sheets and Microsoft Excel

Formulas:

1. Extracting the numeric value from Weight (Column B) and placing it in Column C:

Google Sheets (using REGEXEXTRACT)

=VALUE(REGEXEXTRACT(B2, "[0-9.]+"))


For both Google Sheets and Microsoft Excel:

=VALUE(LEFT(B2, FIND(" ", B2) - 1))


2. Extracting the string (unit) from Weight (Column B) and placing it in Column D:

Google Sheets (using REGEXEXTRACT)

=LOWER(TRIM(REGEXEXTRACT(B2, "[A-Za-z]+")))


For both Google Sheets and Microsoft Excel:

=LOWER(TRIM(MID(B2, FIND(" ", B2) + 1, LEN(B2))))


3. Converting the weight to kg based on the original unit and placing it in Column E:

For both Google Sheets and Microsoft Excel:

=IF(OR(D2="kg", D2="kilogram", D2="kilograms"),C2, IF(OR(D2="lb", D2="lbs", D2="pound", D2="pounds"), CONVERT(C2, "lbm", "kg"), "Unknown Unit"))
Python
import pandas as pd

# Sample data
data = {
    'Sample ID': ['A1', 'A2', 'A3', 'A4', 'A5'],
    'Weight': ['70 kg', '154 lbs', '68 KG', '160 pound', '72 kilograms']
}

df = pd.DataFrame(data)

# Extract numeric value and unit
df['Numeric Value'] = df['Weight'].str.extract(r'(\d+\.?\d*)').astype(float)

# Use str to convert the extracted unit to lowercase
df['Unit'] = df['Weight'].str.extract(r'([a-zA-Z]+)')[0].str.lower() # Access the first element of the returned DataFrame, which is a Series

# Conversion function (1 lb = 0.45359237 kg)
def convert_to_kg(row):
    if row['Unit'] in ['kg', 'kilogram', 'kilograms']:
        return row['Numeric Value']
    elif row['Unit'] in ['lb', 'lbs', 'pound', 'pounds']:
        return row['Numeric Value'] * 0.45359237
    else:
        return None  # or handle unknown units

df['Weight in kg'] = df.apply(convert_to_kg, axis=1)

print(df[['Sample ID', 'Weight', 'Weight in kg']])
R
library(tidyverse)

# Sample data
data <- tibble(
  `Sample ID` = c('A1', 'A2', 'A3', 'A4', 'A5'),
  Weight = c('70 kg', '154 lbs', '68 KG', '160 pound', '72 kilograms')
)

# Extract numeric value and unit
data <- data %>%
  mutate(
    `Numeric Value` = as.numeric(str_extract(Weight, "\\d+\\.?\\d*")),
    Unit = tolower(str_extract(Weight, "[a-zA-Z]+"))
  )

# Conversion function (1 lb = 0.45359237 kg)
convert_to_kg <- function(value, unit) {
  if (unit %in% c('kg', 'kilogram', 'kilograms')) {
    return(value)
  } else if (unit %in% c('lb', 'lbs', 'pound', 'pounds')) {
    return(value * 0.45359237)
  } else {
    return(NA)
  }
}

# Apply conversion
data <- data %>%
  rowwise() %>%
  mutate(`Weight in kg` = convert_to_kg(`Numeric Value`, Unit)) %>%
  ungroup()

print(data %>% select(`Sample ID`, Weight, `Weight in kg`))
Result:

Sample ID  Weight        Numeric Value  Unit       Weight in kg
A1         70 kg         70             kg         70
A2         154 lbs       154            lbs        69.85322498
A3         68 KG         68             kg         68
A4         160 pound     160            pound      72.5747792
A5         72 kilograms  72             kilograms  72

Outlier Detection and Handling

Outliers are data points that deviate significantly from other observations in the dataset. They can be the result of errors or true anomalies, and handling them appropriately is essential for accurate analysis. Below are some common methods for handling outliers:

  1. Visual Inspection: Using plots like box plots or scatter plots to identify outliers visually.
  2. Statistical Methods: Using standard deviations, interquartile range (IQR), or z-scores to detect outliers.
  3. Decision on Handling: Decide whether to keep, remove, or transform outliers based on their impact on the analysis.

Let's use this annual income dataset as an example:

Person ID Income ($)
1 80,000
2 95,000
3 115,000
4 75,000
5 110,000
6 85,000
7 100,000
8 199,000
9 105,000
10 90,000

Visual Inspection 

Python
import pandas as pd
import matplotlib.pyplot as plt

# Sample data
data = {
    'Person ID': range(1, 11),
    'Income': [80000, 95000, 115000, 75000, 110000, 85000, 100000, 199000, 105000, 90000]
}

df = pd.DataFrame(data)

# Box Plot
plt.figure(figsize=(8, 4))
plt.boxplot(df['Income'], vert=False, showfliers=True)
plt.title('Box Plot of Incomes')
plt.xlabel('Income ($)')
plt.show()

# Scatter Plot
plt.figure(figsize=(8, 4))
plt.scatter(df['Person ID'], df['Income'])
plt.title('Scatter Plot of Incomes')
plt.xlabel('Person ID')
plt.ylabel('Income ($)')
plt.show()

 

R
library(tidyverse)

# Sample data
data <- tibble(
  `Person ID` = 1:10,
  `Income` = c(80000, 95000, 115000, 75000, 110000, 85000, 100000, 199000, 105000, 90000)
)

# Box Plot
ggplot(data, aes(y = Income)) +
  geom_boxplot(outlier.colour = "red", outlier.shape = 8) +
  ggtitle("Box Plot of Incomes") +
  ylab("Income ($000)")

# Scatter Plot
ggplot(data, aes(x = `Person ID`, y = Income)) +
  geom_point() +
  ggtitle("Scatter Plot of Incomes") +
  xlab("Person ID") +
  ylab("Income ($000)")

Analysis:

Box Plot:

  • The box plot shows the distribution of incomes, where the $199,000 income appears as an outlier above the upper whisker.

Scatter Plot:

  • The scatter plot shows Person 8's income sitting distinctly above the others.

Statistical Methods

Python
import pandas as pd

# Sample data
data = {
    'Person ID': range(1, 11),
    'Income': [80000, 95000, 115000, 75000, 110000, 85000, 100000, 199000, 105000, 90000]
}

df = pd.DataFrame(data)

# Calculate Q1 and Q3
Q1 = df['Income'].quantile(0.25)
Q3 = df['Income'].quantile(0.75)
IQR = Q3 - Q1                   

# Calculate bounds
lower_bound = Q1 - 1.5 * IQR    
upper_bound = Q3 + 1.5 * IQR    

# Identify values outside either bound
df['Outlier'] = df['Income'].apply(lambda x: 'Outlier' if x < lower_bound or x > upper_bound else 'Normal')

print(df)

R

library(tidyverse)

# Sample data
data <- tibble(
  `Person ID` = 1:10,
  `Income` = c(80000, 95000, 115000, 75000, 110000, 85000, 100000, 199000, 105000, 90000)
)

# Calculate Q1 and Q3
Q1 <- quantile(data$Income, 0.25)
Q3 <- quantile(data$Income, 0.75)
IQR <- Q3 - Q1                   

# Calculate bounds
lower_bound <- Q1 - 1.5 * IQR    
upper_bound <- Q3 + 1.5 * IQR    

# Identify values outside either bound
data <- data %>%
  mutate(Outlier = ifelse(Income < lower_bound | Income > upper_bound, "Outlier", "Normal"))

print(data)
Results:
Person ID Income Outlier
1 80,000 Normal
2 95,000 Normal
3 115,000 Normal
4 75,000 Normal
5 110,000 Normal
6 85,000 Normal
7 100,000 Normal
8 199,000 Outlier
9 105,000 Normal
10 90,000 Normal
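
The methods list above also mentions z-scores as an alternative to the IQR rule. A minimal sketch, flagging incomes more than 2.5 standard deviations from the mean (thresholds between 2 and 3 are common; note that in small samples the outlier itself inflates the standard deviation, which is one reason the IQR rule is often preferred):

# Z-score method: measure each income's distance from the mean in SD units
z_scores = (df['Income'] - df['Income'].mean()) / df['Income'].std()
df['Z_Outlier'] = z_scores.abs().apply(lambda z: 'Outlier' if z > 2.5 else 'Normal')

print(df[['Person ID', 'Income', 'Z_Outlier']])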

Decision on Handling Outliers

Let's assess the impact of the identified outlier on the overall dataset.

Python
# Mean and Standard Deviation with Outlier
mean_with_outlier = df['Income'].mean()
std_with_outlier = df['Income'].std()

# Mean and Standard Deviation without Outlier
df_clean = df[df['Outlier'] == 'Normal']
mean_without_outlier = df_clean['Income'].mean()
std_without_outlier = df_clean['Income'].std()

print("Income with Outlier")
print(f"Mean: ${mean_with_outlier:,.2f}")
print(f"SD: ${std_with_outlier:,.2f}")
print("Income without Outlier")
print(f"Mean: ${mean_without_outlier:,.2f}")
print(f"SD: ${std_without_outlier:,.2f}")

R

# Mean and Standard Deviation with Outlier
mean_with_outlier <- mean(data$Income)
std_with_outlier <- sd(data$Income)

# Filter out the outlier
data_clean <- data %>% filter(Outlier == "Normal")

# Mean and Standard Deviation without Outlier
mean_without_outlier <- mean(data_clean$Income)
std_without_outlier <- sd(data_clean$Income)

# Print results
cat("Income with Outlier\n")
cat(sprintf("Mean: $%.2f\n", mean_with_outlier))
cat(sprintf("SD: $%.2f\n", std_with_outlier))

cat("\nIncome without Outlier\n")
cat(sprintf("Mean: $%.2f\n", mean_without_outlier))
cat(sprintf("SD: $%.2f\n", std_without_outlier))
Results:

Income with Outlier
Mean: $105,400.00
SD: $35,330.82
Income without Outlier
Mean: $95,000.00
SD: $13,693.06

 

As you can see, the outlier increases the mean by $10,400 and the standard deviation by $21,637.76. Based on this finding, you have three options:

  1. Keep the Outlier
    • This is an option if you can verify the income is accurate and the person doesn't belong to a different income group (e.g., executive level rather than regular staff).
    • This is also a good option if this data point is important for your purpose (e.g., showing income distribution and inequality).
  2. Remove the Outlier
    • This is an option if the outlier skews the data and is not representative.
    • This is also appropriate for analyses focusing on the typical income range.
  3. Transform the Data
    • Applying a logarithmic transformation can help reduce skewness in your data distribution (see the sketch below this list).
    • This option allows inclusion of the outlier while minimizing its impact.
    • Keep in mind that transformed data can be more difficult to interpret.
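
As a minimal sketch of option 3, applying numpy's natural log to the same df:

import numpy as np

# A log transform compresses the right tail, pulling the outlier toward the rest
df['Log_Income'] = np.log(df['Income'])

print(df[['Person ID', 'Income', 'Log_Income']].round(2))

On the log scale, the $199,000 income sits about one unit above the smallest value rather than 2.7 times it, so summary statistics are far less distorted.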

In the end, the decision on how to handle outliers is yours, as long as it is based on a thoughtful and informed rationale. Whether you choose to keep, remove, or transform outliers should depend on the context of your analysis and the impact these values have on your results. Outliers can provide valuable insights or skew your data, so it is important to consider the nature of your data, the objectives of your analysis, and how outliers affect the accuracy and interpretation of your findings. Ultimately, your approach should align with your goals, ensuring that the conclusions you draw are both robust and credible.

Correcting Data Entry Errors

Data entry errors are a common challenge, particularly in manual data collection processes. Left unaddressed, these errors can skew analyses, lead to incorrect conclusions, and compromise the integrity of your dataset. Therefore, it is essential to identify and correct them to ensure data accuracy and reliability. Below are some common methods for detecting and fixing data entry errors:

1. Validation Rules

Validation rules help prevent errors before they even occur by setting up conditions that data must meet. This method ensures that values entered into the system are consistent, logical, and fall within expected ranges. Examples of validation rules include:

  • Date of Birth cannot be a future date.
  • Age should be within a reasonable range (e.g., 0 to 120 years).
  • Email Address should follow the standard format (e.g., name@domain.com).

Here is how you might implement the date-of-birth rule in each tool:

Google Sheets

Select the DOB column, then use Data > Data validation with a date criterion that only accepts dates on or before today.

Microsoft Excel

Select the DOB column, then use Data > Data Validation, allow "Date," and set the rule to "less than or equal to" =TODAY().

Python

import pandas as pd
from datetime import datetime

# Sample data
data = {
    'Name': ['John Doe', 'Jane Smith', 'Alice Johnson'],
    'DOB': ['1990-01-15', '2094-05-10', '1985-07-23']  
}

df = pd.DataFrame(data)

# Convert 'DOB' to datetime and flag future dates
df['DOB'] = pd.to_datetime(df['DOB'], errors='coerce')
df['Valid DOB'] = df['DOB'] <= pd.Timestamp(datetime.today())

print(df)

R

library(tidyverse)
library(lubridate)

# Sample data
data <- tibble(
  Name = c("John Doe", "Jane Smith", "Alice Johnson"),
  DOB = c("1990-01-15", "2094-05-10", "1985-07-23")
)

# Convert 'DOB' to datetime and flag future dates
data <- data %>%
  mutate(DOB = ymd(DOB),
         `Valid DOB` = DOB <= today())

print(data)

Results:

Name           DOB         Valid DOB
John Doe       1990-01-15  TRUE
Jane Smith     2094-05-10  FALSE
Alice Johnson  1985-07-23  TRUE
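
The age-range and email-format rules from the list above can be checked the same way. A minimal pandas sketch (the sample data and column names are illustrative):

import pandas as pd

df = pd.DataFrame({
    'Age': [34, 150, 27],
    'Email': ['a@example.com', 'b@example', 'c@example.org']
})

# Age must fall within a plausible range
df['Valid Age'] = df['Age'].between(0, 120)

# Email must match a simple name@domain pattern (a rough check, not full RFC validation)
df['Valid Email'] = df['Email'].str.match(r'^[^@\s]+@[^@\s]+\.[^@\s]+$')

print(df)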

2. Manual Review

For key variables, especially in small datasets, a manual review of entries is often necessary. This method involves going through the dataset by hand to identify inconsistencies, missing values, or obvious mistakes. While time-consuming, manual review is highly effective in catching subtle errors that automated tools might miss, such as transposed digits (e.g., entering "1987" instead of "1897" for a birth year).

3. Automated Checks

Automated checks leverage software tools to automatically identify and flag common data entry errors. These checks can be programmed to detect outliers, incorrect formats, missing values, or other anomalies. Examples of automated checks include:

  1. Range Checks: Ensuring all values fall within predefined limits.
  2. Consistency Checks: Verifying that related fields make sense together (e.g., the “end date” of an event cannot be earlier than the “start date”).
  3. Format Validation: Checking that data entries adhere to a specific format (e.g., email addresses must include "@" and a domain).
  4. Spelling and Variations Detection: Automated tools can identify spelling errors and variations in text data (e.g., “USA,” “United States,” “U.S.”). This ensures consistency in categorical data, which can otherwise lead to inaccurate groupings or analysis. Some tools can flag these inconsistencies or automatically correct them to a predefined standard (see the sketch at the end of this section).

Automated tools like Apps Script in Google Sheets, Macros in Microsoft Excel, Python, and R allow you to perform these checks efficiently, especially for larger datasets. Python and R, in particular, offer robust libraries for data validation and error correction.
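
As a minimal Python sketch of checks 2 and 4 (the sample data and the country mapping are illustrative):

import pandas as pd

df = pd.DataFrame({
    'Start': pd.to_datetime(['2023-01-05', '2023-02-10']),
    'End': pd.to_datetime(['2023-01-07', '2023-02-01']),
    'Country': ['USA', 'United States']
})

# Consistency check: an event cannot end before it starts
df['Dates OK'] = df['End'] >= df['Start']

# Variation detection: map known variants to one standard label
country_map = {'USA': 'United States', 'U.S.': 'United States'}
df['Country'] = df['Country'].replace(country_map)

print(df)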