5 Score and Clean Data
Data Scoring
Before you can analyze the data, you need to process the raw data into aggregate summary variables. This involves several steps:
Calculate task scores
Evaluate and remove problematic subjects and outliers
Calculate reliability
Overview of Template
You can download an R script template to score tidy raw data files:

englelab::get_template(score_script = TRUE)
Setup
# ---- Setup ------------------
# packages
library(here)
library(readr)
library(dplyr)
library(englelab) # for data cleaning functions
library(tidyr) # for pivot_wider; delete if not using
library(psych) # for Cronbach's alpha; delete if not using
# directories
import_dir <- "data/raw"
output_dir <- "data/scored"
# file names
task <- "taskname"
import_file <- paste(task, "raw.csv", sep = "_")
output_file <- paste(task, "Scores.csv", sep = "_")
## data cleaning parameters
outlier_cutoff <- 3.5
# -----------------------------
The Setup section is where you:
Load any packages used in the script
Set the directories to import data files from and output data files to
Set the file names of the data files to be imported and exported
Set data cleaning parameters (e.g., the outlier criterion)
Import
Given that 1) you created a tidy raw data file in .csv format, and 2) you specified import_dir and import_file in the Setup section, you most likely do not need to change anything here.
Optionally, you might want to go ahead and filter out any rows that you absolutely do not want to include at any point in scoring the data (e.g., practice trials or certain subjects).
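A minimal version of the import step might look like this (a sketch; the exact code in the template may differ, and the TrialProc filter matches the example data from the previous chapter):

# ---- Import -----------------
data_import <- read_csv(here(import_dir, import_file)) |>
  filter(TrialProc == "real") # optional: keep only real (non-practice) trials
# -----------------------------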
Score Data
The Score Data section is where most of the work needs to be done. You should be using dplyr and possibly tidyr to do most of the work here, though you may need other packages and functions. You can delete whatever is in there; it is just a placeholder as an example of the type of functions you might use.
# ---- Score Data -------------
data_scores <- data_import |>
  summarise(.by = Subject) # placeholder: add your summary variables inside summarise()
# -----------------------------
Clean Data
The next section of the script template is for cleaning the data by removing problematic subjects and/or removing outliers.
Reliability
There are two standard ways of calculating reliability: split-half and Cronbach’s alpha. The script template provides starter code for calculating both.
Save Data
No need to change anything here. Isn’t that nice?
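For reference, this section is just a write_csv() call built from the directories and file names set in the Setup section (a minimal sketch; the template may format it slightly differently):

# ---- Save Data --------------
write_csv(data_cleaned, here(output_dir, output_file))
# -----------------------------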
Tidy Data Example
Let’s use the tidy data created in the previous chapter (with two added participants), keeping only the real trials:
data_import <- filter(data_import, TrialProc == "real")
Trial-Level Cleaning
It can be a good idea to do some initial data cleaning at the trial level before scoring the data to make sure we are using only quality data.
The data cleaning that you need to do will of course depend on the nature of your data and study. Give some thought to the type of data cleaning steps you might want to take. Think about what would identify an observation as having low quality data (e.g., unrealistically short or long reaction times).
Reaction Times
For behavioral and cognitive tasks that include measures of reaction time, researchers will often exclude trials that have unrealistically fast reaction times or extremely long reaction times.
You will notice that in the example tidy raw data file, there are a few trials that contain unrealistically fast reaction times and extremely long reaction times.
A common criterion for unrealistically fast reaction times is anything less than 200 milliseconds. For extremely long reaction times, we can just use 5000 milliseconds (5 seconds) as the criterion.
There are two general strategies to do this:
Use filter() to get rid of the rows entirely
Use mutate() and case_when() to set the reaction time and accuracy values to missing (NA)
Using filter
# set criterion in setup section of script
rt_short_criterion <- 200
rt_long_criterion <- 5000
data_scores <- data_import |>
filter(RT >= rt_short_criterion, RT <= rt_long_criterion)
Set to missing
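The idea is shown in the following sketch using mutate() and case_when() (a minimal example, not necessarily the template’s exact code; Accuracy is recoded before RT so both conditions are evaluated on the original RT values):

# set criterion in setup section of script
rt_short_criterion <- 200
rt_long_criterion <- 5000
data_scores <- data_import |>
  mutate(Accuracy = case_when(RT < rt_short_criterion ~ NA,
                              RT > rt_long_criterion ~ NA,
                              .default = Accuracy),
         RT = case_when(RT < rt_short_criterion ~ NA,
                        RT > rt_long_criterion ~ NA,
                        .default = RT))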
Data Scoring
How you need to score your data depends on what you are measuring and what type of conditions you have. However, most approaches require calculating a summary statistic at some point. We will cover how to use summarise(.by = ) to calculate summary statistics.
With this example data, let’s calculate a few different scores, aggregated by ID and Condition:
Mean accuracy
Mean reaction time
Reaction time variability (e.g., standard deviation)
Number of trials aggregated over
# set criterion in setup section of script
rt_short_criterion <- 200
rt_long_criterion <- 5000
data_scores <- data_import |>
filter(RT >= rt_short_criterion, RT <= rt_long_criterion) |>
  summarise(.by = c(ID, Condition),
            Accuracy.mean = mean(Accuracy, na.rm = TRUE),
            RT.mean = mean(RT, na.rm = TRUE),
            RT.sd = sd(RT, na.rm = TRUE),
            Trials.n = n())
Pivot
Now, depending on how this data is going to be analyzed (and even what statistical software you are using to analyze the data), you may want this data frame in either long (its current form) or wide format.
Between-Subjects ANOVA
Condition is not a between-subjects variable in this sample data, but if you have a between-subjects variable or design, you will want to keep the column(s) for those variable(s) in long format.
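For example, if the study also had a hypothetical between-subjects variable called Group (not in the example data), you would simply add it to the .by = grouping so its column is kept in the long-format scores:

data_scores_between <- data_import |>
  filter(RT >= rt_short_criterion, RT <= rt_long_criterion) |>
  summarise(.by = c(ID, Condition, Group),
            Accuracy.mean = mean(Accuracy, na.rm = TRUE),
            RT.mean = mean(RT, na.rm = TRUE),
            RT.sd = sd(RT, na.rm = TRUE))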
Within-Subjects ANOVA
Condition is a within-subjects variable in this sample data. Whether you want this variable in long or wide format depends on whether you will use R or JASP to analyze the data.
For R, you can keep it in long format
For JASP, you will need to restructure it to wide format
# set criterion in setup section of script
rt_short_criterion <- 200
rt_long_criterion <- 5000
data_scores <- data_import |>
filter(RT >= rt_short_criterion, RT <= rt_long_criterion) |>
summarise(.by = c(ID, Condition),
Accuracy.mean = mean(Accuracy, na.rm = TRUE),
RT.mean = mean(RT, na.rm = TRUE),
RT.sd = sd(RT, na.rm = TRUE)) |>
pivot_wider(id_cols = ID,
names_from = Condition,
values_from = c(Accuracy.mean, RT.mean, RT.sd),
names_glue = "{Condition}_{.value}")
Correlation and Regression
For correlation and regression you will want to restructure the data to wide format (same as the code above).
If you have multiple data files from different measures or tasks that you eventually want to merge into a single data frame for analysis, it is a good idea to add the measure or task name to the column names. That way, when you merge the data, you know which column corresponds to which task.
# set criterion in setup section of script
rt_short_criterion <- 200
rt_long_criterion <- 5000
data_scores <- data_import |>
filter(RT >= rt_short_criterion, RT <= rt_long_criterion) |>
summarise(.by = c(ID, Condition),
Accuracy.mean = mean(Accuracy, na.rm = TRUE),
RT.mean = mean(RT, na.rm = TRUE),
RT.sd = sd(RT, na.rm = TRUE)) |>
pivot_wider(id_cols = ID,
names_from = Condition,
values_from = c(Accuracy.mean, RT.mean, RT.sd),
names_glue = "{task}_{Condition}_{.value}")
Subject-Level Cleaning
The next section of the script template is for cleaning the data by removing problematic subjects and/or removing outliers.
# ---- Clean Data --------------------------------------------------------------
data_cleaned <- data_scores |>
remove_problematic(
filter = "", # add a statement that would go in dplyr::filter
log_file = here("data/logs", paste(task, "_problematic.csv", sep = ""))) |>
replace_outliers(
variables = c(),
cutoff = outlier_cutoff,
with = "NA",
log_file = here("data/logs", paste(task, "_outliers.csv", sep = ""))) |>
  filter(!is.na()) # specify the score variable to check for NA (see the worked example below)
# ------------------------------------------------------------------------------
Remove Problematic Subjects
Depending on the task, problematic subjects can be detected in different ways. For this example data, we will simply remove subjects that had less than chance performance on congruent trials (i.e., they were just guessing or did not understand the task).
To do so, we will use a custom function created for this purpose, remove_problematic() from our englelab package (see ?englelab::remove_problematic).
The main argument is remove = . It takes a logical statement (e.g., remove = Accuracy.mean <= .5).
The other, optional argument is log_file = . This allows us to save a data file containing only the subjects that were removed, which is useful if we later want to report in publications how many subjects were removed.
# set criterion in setup section of script
acc_criterion <- .34
data_cleaned <- data_scores |>
remove_problematic(
remove = "Stroop_congruent_Accuracy.mean <= acc_criterion",
log_file = here("data/logs", paste(task, "_problematic.csv", sep = "")))
Remove Outliers
Remove outliers based on their final task scores. A typical way to remove outliers is to set a subject’s score to missing (NA) if it is 3.5 standard deviations above or below the mean.
To do so, we will use a custom function created for this purpose, replace_outliers() from our englelab package. There are several arguments that need to be defined; see ?englelab::replace_outliers for a description of them.
The fully piped |> code for this entire section looks like:
# set criterion in setup section of script
acc_criterion <- .34
outlier_criterion <- 3.5
data_cleaned <- data_scores |>
remove_problematic(
remove = "Stroop_congruent_Accuracy.mean <= acc_criterion",
log_file = here("data/logs", paste(task, "_problematic.csv", sep = ""))) |>
replace_outliers(
variables = "Stroop_incongruent_RT.mean",
cutoff = outlier_criterion,
with = "NA",
pass = 1,
id = "ID",
log_file = here("data/logs", paste(task, "_outliers.csv", sep = ""))) |>
  filter(!is.na(Stroop_incongruent_RT.mean))
Problematic subjects removed: 1
Outliers detected (pass = 1): 0
Notice that filter(!is.na()) is specified after replace_outliers(). This removes any of the outliers (scores that were replaced with NA) from the data frame.
Calculate Reliability
There are two standard ways of calculating reliability: split-half and Cronbach’s alpha.
It is best to calculate reliability only on the subjects that made it past data cleaning.
reliability <- data_import |>
filter(ID %in% data_cleaned$ID) |>
mutate(.by = ID,
Trial = row_number())
Split-half reliability
First, trials need to be split into even and odd trials (or whatever split-half scheme one is using).
Then scores can be re-calculated based on the even/odd splits. Everything that was done to score the data needs to be included, such as trial-level cleaning.
splithalf <- reliability |>
mutate(.by = ID,
Split = ifelse(Trial %% 2, "odd", "even")) |>
filter(RT >= rt_short_criterion, RT <= rt_long_criterion) |>
summarise(.by = c(ID, Condition, Split),
RT.mean = mean(RT, na.rm = TRUE)) |>
pivot_wider(id_cols = ID,
names_from = c(Condition, Split),
values_from = RT.mean)
Finally, Spearman-Brown corrected split-half reliability can be calculated and added to the data frame:
splithalf <- reliability |>
mutate(.by = ID,
Split = ifelse(Trial %% 2, "odd", "even")) |>
filter(RT >= rt_short_criterion, RT <= rt_long_criterion) |>
summarise(.by = c(ID, Condition, Split),
RT.mean = mean(RT, na.rm = TRUE)) |>
pivot_wider(id_cols = ID,
names_from = c(Condition, Split),
values_from = RT.mean) |>
summarise(r = cor(incongruent_even, incongruent_odd)) |>
mutate(r = (2 * r) / (1 + r))
data_cleaned$Stroop_incongruent_RT.mean_splithalf <- splithalf$r
Cronbach’s alpha
Cronbach’s alpha is easier to calculate because we do not need to re-calculate the scores.
The first step is to get the data frame set up such that trial number is spread across columns:
cronbachalpha <- reliability |>
filter(RT >= rt_short_criterion, RT <= rt_long_criterion) |>
select(ID, Trial, Accuracy) |>
pivot_wider(id_cols = ID,
names_from = Trial,
values_from = Accuracy)
Then we can use psych::alpha() to calculate the average correlation between items and save it to the data frame:
cronbachalpha <- reliability |>
  filter(RT >= rt_short_criterion, RT <= rt_long_criterion) |>
  select(ID, Trial, Accuracy) |>
pivot_wider(id_cols = ID,
names_from = Trial,
values_from = Accuracy) |>
select(-ID) |>
alpha() # from the psych package
Some items ( 1 ) were negatively correlated with the total scale and
probably should be reversed.
To do this, run the function again with the 'check.keys=TRUE' option
data_cleaned$Stroop_incongruent_RT.mean_cronbachalpha <-
cronbachalpha$total$std.alpha