NOTE: When updating my website in June 2021, I no longer could easily find my fake data file. I am using eval = F
to preserve the post, but we will not be able to see output. I will try to find a copy of the data in my backup drive and fix at a later time.
It’s that time of year… The end of the semester means grading galore for professors and graduate student instructors. In this post, I will explore tidyr in the context of organizing and calculating grades.
I have some familiarity with reshape2, but I always have to Google an example to remember how to go from wide to long format and vice versa. I’m hoping the tidyr functions will be more intuitive, so I won’t end up like:
I’ve generated some fake grade data based on the format that grades come in when you are an instructor at Berkeley (we use bCourses, similar to Moodle, Canvas, etc.).
Each row is a student. Each column is an assignment. The value in this table is the number of points earned.
head(grades,2)
gather
: use when your column names are actually levels of a particular variableIn the gather
syntax, the first element (after piping in the dataframe) is the new key (the name of the new variable we want, refers to what the current column names are) and the second element is the value (associated with the key, refers to what values are currently in each column). The next arguments are the columns that are going to be turned into key-value pairs. By using the minus sign we can say we want to reshape every column except student.
This gathered format allows us to easily group by student or assignment and see how things are going. Is a particular student struggling? Is there an assignment that everyone is stuggling with?
Note: This example is oversimplifying things since we haven’t yet said how many points each assignment is worth, but we’ll get there.
spread
: use when you want the levels of a variable to be separate columnsWe can undo gather
by using spread
. We may want to look at this shape if we want to see by assignment, how each student does.
Annoyingly, this doesn’t get us quite back to the original data because the assignments are organized alphabetically instead of numerically.
But we can do a little hack and get things back to normal.
For simplicity, let’s assume that there exists a student for each assigment who got a perfect score, so we can use the maximum per assignment as the number of possible points.
The use of purrr is coming in a future blog post (with @kelliotto) but for now I’m sticking to apply
.
spread
: use when you want to split levels (or values) of a variable into multiple componentsWe can imagine many different ways of organizing grades that may require other verbs in the tidyr package. What if our assignment names also included the due date? We may want to be able to calculate the average grade per month of the semester to assess pacing.
grades2=grades
#https://stackoverflow.com/questions/21502332/generating-random-dates
td = as.Date('2018/05/01') - as.Date('2018/02/01')
dates=as.Date('2018/02/01') + sample(0:td, ncol(grades)-1)
dates=dates[order(dates)]
names(grades2)[2:ncol(grades2)]=paste(names(grades2)[2:ncol(grades2)],dates,sep="_")
head(grades2,2)
gathered.grades2=gather(grades2,assignment,grade,-Student)
head(gathered.grades2,2)
gathered.grades.dueDates=gathered.grades2 %>% separate(assignment,c("assignment","year","month","day"))
head(gathered.grades.dueDates,2)
There is not a lot to go on here, but we could imagine breaking things down by week instead of month to get a better sense.
We could do something similar with str_split
but it would take many more lines of code.
unite
: use when you want to concatenate multiple variables into oneTo undo separate
, we use unite
. We want to collapse all but Student and grade into a variable called assignment. I would usually use paste
.
replace_NA
: use to replace missing values with a particular valueWe may want to replace assignments that are missing with zeros. replace_na
requires a named list of what to replace an NA with in each column. Since we have many assignment columns, I had to ask Stack Overflow for some assistance.
## this seems like a prime candidate for walk from purrr, to be continued...
grades[1,sample(1:(ncol(grades)-1),1)]=NA
grades[5,sample(1:(ncol(grades)-1),1)]=NA
grades[10,sample(1:(ncol(grades)-1),1)]=NA
grades[15,sample(1:(ncol(grades)-1),1)]=NA
#https://stackoverflow.com/questions/45576805/how-to-replace-all-na-in-a-dataframe-using-tidyrreplace-na
myList <- setNames(lapply(vector("list", ncol(grades)-1), function(x) x <- 0), names(grades)[-1])
head(myList,2)
grades=grades %>% replace_na(myList)
head(grades,2)
drop_na
: use when you want to remove rows with missing values in certain columnsdrop_na
may be useful when we want to drop each student’s lowest grade in a certain category of grades. Assuming all other NA values are converted to zeros…
gathered.grades.cat=gathered.grades %>% mutate(category=sample(1:3,nrow(gathered.grades),replace=T))
byStudent=split(gathered.grades.cat,pull(gathered.grades.cat,Student))
dropLowestScore=function(studentGrades,categoryToDrop){
catToDrop=studentGrades %>% filter(category==categoryToDrop)
toDrop=which.min(pull(catToDrop,grade))
studentGrades$grade[toDrop]=NA
## I don't think I can use pull here
studentGradesAdj=studentGrades %>% drop_na()
return(studentGradesAdj)
}
adjustedGrades=do.call("rbind",lapply(byStudent,dropLowestScore,1))
length(unique(pull(adjustedGrades,assignment)))
adjustedGrades%>%group_by(Student)%>% summarise(count=n())%>%head(2)
complete
: use when you want to make implicit missing data explicitI often use expand.grid
and then merge
to do this. This is much more concise.
Note: If we only want to include levels of a variable that are present in the data we use nesting
to denote this within the complete
call.
separate_rows
: use when a value contains more than one piece of information and you want separate rows for each piece of informationNow suppose our spreadsheet of grades contains the maximum points possible for each assignment. We can separate this information to make it easier to do calculations.
maxPts=apply(grades[,2:ncol(grades)],2,max)
pasteGrades=function(idx){
x=paste(grades[,idx+1],rep(maxPts[idx],nrow(grades)),sep="/")
return(x)
}
pastedGrades=lapply(1:(ncol(grades)-1),pasteGrades)
adjGrades=do.call("cbind",pastedGrades)
grades3=cbind.data.frame(pull(grades,Student),adjGrades)
names(grades3)=names(grades)
head(grades3,2)
separated.grades=grades3 %>% separate_rows(-Student,sep="/")
head(separated.grades,4)
pointsEarned=separated.grades[seq(1,nrow(separated.grades),by=2),]
pointsPossible=separated.grades[2,-1]
row.names(pointsPossible)=NULL
head(pointsEarned,2)
pointsPossible
nest
: use when you want to see which levels of a variable occur with the same combination of levels of other variablesWe can use nest
to see who got the top grade for each assignment (and use unnest
to undo the operation). This call makes every row a combination of assignment and grade level. The last column is a list of all the students who received that particular grade on the assignment.
That covers most of the tidyr verbs. I suspect that I will still need to look up examples to remember how to use gather
and spread
, but I’m hoping this cheat sheet will help me remember when to use which without having to try both on a sample data set.
The other verbs that I can see myself using frequently are complete
and spread
. I’m doing a project now where I use the expand.grid
and merge
trick way too often. I suspect it is a lot slower than complete
, so I’m going to make this switch and see if it speeds things up.
P.S. Just for the record: before posting this I searched for the dollar sign and rewrote the lines that involved it. Working on that guilty pleasure…
Happy Grading!