Grading with tidyr

PUBLISHED ON MAY 17, 2018

It’s that time of year… The end of the semester means grading galore for professors and graduate student instructors. In this post, I will explore tidyr in the context of organizing and calculating grades.

I have some familiarity with reshape2, but I always have to Google an example to remember how to go from wide to long format and vice versa. I’m hoping the tidyr functions will be more intuitive, so I won’t end up like:

require(tidyr)
require(dplyr)
require(ggplot2)

I’ve generated some fake grade data based on the format that grades come in when you are an instructor at Berkeley (we use bCourses, similar to Moodle, Canvas, etc.).

setwd("~/Desktop")
grades=read.csv("gradesFake.csv")

Each row is a student. Each column is an assignment. The value in this table is the number of points earned.

head(grades,2) 
##   Student A1  A2  A3 A4  A5 A6 A7 A8 A9 A10 A11 A12 A13 A14 A15 A16 A17
## 1       A  1 0.5 0.5  1 1.0  1  1  1  1   2   3   2   3   2   2   2   2
## 2       B  1 1.0 1.0  1 0.5  1  1  1  2   3   3   3   3   3   2   2   3
##   A18 A19 A20 A21 A22 A23
## 1 1.0   1 1.0   1   1 0.5
## 2 0.5   1 0.5   1   1 1.0

gather: use when your column names are actually levels of a particular variable

In the gather syntax, the first element (after piping in the dataframe) is the new key (the name of the new variable we want, refers to what the current column names are) and the second element is the value (associated with the key, refers to what values are currently in each column). The next arguments are the columns that are going to be turned into key-value pairs. By using the minus sign we can say we want to reshape every column except student.

gathered.grades=grades %>% gather(assignment,grade,-Student)

head(gathered.grades,2)
##   Student assignment grade
## 1       A         A1     1
## 2       B         A1     1

This gathered format allows us to easily group by student or assignment and see how things are going. Is a particular student struggling? Is there an assignment that everyone is stuggling with?

Note: This example is oversimplifying things since we haven’t yet said how many points each assignment is worth, but we’ll get there.

gathered.grades %>% group_by(Student) %>% summarise(meanPoints=mean(grade),sdPoints=sd(grade))%>% head(2)
## # A tibble: 2 x 3
##   Student meanPoints sdPoints
##   <chr>        <dbl>    <dbl>
## 1 A             1.37    0.726
## 2 B             1.59    0.949
gathered.grades %>% group_by(assignment) %>% summarise(meanPoints=mean(grade),sdPoints=sd(grade))%>% head(2)
## # A tibble: 2 x 3
##   assignment meanPoints sdPoints
##   <chr>           <dbl>    <dbl>
## 1 A1              0.933    0.258
## 2 A10             2.47     0.516

spread: use when you want the levels of a variable to be separate columns

We can undo gather by using spread. We may want to look at this shape if we want to see by assignment, how each student does.

Annoyingly, this doesn’t get us quite back to the original data because the assignments are organized alphabetically instead of numerically.

byStudent=gathered.grades %>% spread(assignment, grade) 
byAssignment=gathered.grades %>% spread(Student, grade) 

head(byStudent,2) ## alphabetical weirdness
##   Student A1 A10 A11 A12 A13 A14 A15 A16 A17 A18 A19  A2 A20 A21 A22 A23
## 1       A  1   2   3   2   3   2   2   2   2 1.0   1 0.5 1.0   1   1 0.5
## 2       B  1   3   3   3   3   3   2   2   3 0.5   1 1.0 0.5   1   1 1.0
##    A3 A4  A5 A6 A7 A8 A9
## 1 0.5  1 1.0  1  1  1  1
## 2 1.0  1 0.5  1  1  1  2
head(byAssignment,2) ## alphabetical weirdness
##   assignment A B C D E F G H I J K L M N O
## 1         A1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1
## 2        A10 2 3 3 3 2 2 2 2 3 2 3 2 3 2 3

But we can do a little hack and get things back to normal.

orderIwant=paste("A",1:(ncol(byStudent)-1),sep="")

head(byStudent[,c("Student",orderIwant)],3)
##   Student A1  A2  A3 A4  A5 A6 A7 A8 A9 A10 A11 A12 A13 A14 A15 A16 A17
## 1       A  1 0.5 0.5  1 1.0  1  1  1  1   2   3   2   3   2   2   2   2
## 2       B  1 1.0 1.0  1 0.5  1  1  1  2   3   3   3   3   3   2   2   3
## 3       C  1 1.0 1.0  1 1.0  1  1  1  3   3   3   3   3   3   2   3   3
##   A18 A19 A20 A21 A22 A23
## 1 1.0   1 1.0   1   1 0.5
## 2 0.5   1 0.5   1   1 1.0
## 3 1.0   1 1.0   1   1 1.0
byAssignmentAdjust=byAssignment[unlist(lapply(orderIwant,function(x){which(pull(byAssignment,assignment)==x)})),]

row.names(byAssignmentAdjust)=NULL

head(byAssignmentAdjust,3)
##   assignment   A B C D   E   F G H I J K L M N O
## 1         A1 1.0 1 1 1 1.0 0.0 1 1 1 1 1 1 1 1 1
## 2         A2 0.5 1 1 1 0.5 0.0 1 1 1 1 1 1 1 1 1
## 3         A3 0.5 1 1 1 0.5 0.5 1 1 1 1 1 1 1 1 1

Calculate grades

For simplicity, let’s assume that there exists a student for each assigment who got a perfect score, so we can use the maximum per assignment as the number of possible points.

scores1=select(byStudent,starts_with("A")) 
scores2=select(byAssignment,one_of(LETTERS[1:(ncol(byAssignment)-1)])) 

head(scores1,2)
##   A1 A10 A11 A12 A13 A14 A15 A16 A17 A18 A19  A2 A20 A21 A22 A23  A3 A4
## 1  1   2   3   2   3   2   2   2   2 1.0   1 0.5 1.0   1   1 0.5 0.5  1
## 2  1   3   3   3   3   3   2   2   3 0.5   1 1.0 0.5   1   1 1.0 1.0  1
##    A5 A6 A7 A8 A9
## 1 1.0  1  1  1  1
## 2 0.5  1  1  1  2
head(scores2,2)
##   A B C D E F G H I J K L M N O
## 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1
## 2 2 3 3 3 2 2 2 2 3 2 3 2 3 2 3

The use of purrr is coming in a future blog post (with @kelliotto) but for now I’m sticking to apply.

possiblePoints=sum(apply(scores1,2,max))
pointsPerStudent=apply(scores1,1,sum)
studentPercentages=pointsPerStudent/possiblePoints
studentPercentages
##         1         2         3         4         5         6         7 
## 0.7682927 0.8902439 0.9756098 0.9512195 0.8414634 0.4756098 0.9756098 
##         8         9        10        11        12        13        14 
## 0.9024390 0.8170732 0.9268293 0.9512195 0.9146341 0.9024390 0.8536585 
##        15 
## 0.9756098
possiblePoints=sum(apply(scores2,1,max))
pointsPerStudent=apply(scores2,2,sum)
studentPercentages=pointsPerStudent/possiblePoints
studentPercentages
##         A         B         C         D         E         F         G 
## 0.7682927 0.8902439 0.9756098 0.9512195 0.8414634 0.4756098 0.9756098 
##         H         I         J         K         L         M         N 
## 0.9024390 0.8170732 0.9268293 0.9512195 0.9146341 0.9024390 0.8536585 
##         O 
## 0.9756098

spread: use when you want to split levels (or values) of a variable into multiple components

We can imagine many different ways of organizing grades that may require other verbs in the tidyr package. What if our assignment names also included the due date? We may want to be able to calculate the average grade per month of the semester to assess pacing.

grades2=grades
#https://stackoverflow.com/questions/21502332/generating-random-dates
td = as.Date('2018/05/01') - as.Date('2018/02/01')
dates=as.Date('2018/02/01') + sample(0:td, ncol(grades)-1)
dates=dates[order(dates)]
names(grades2)[2:ncol(grades2)]=paste(names(grades2)[2:ncol(grades2)],dates,sep="_")
head(grades2,2)
##   Student A1_2018-02-01 A2_2018-02-08 A3_2018-02-10 A4_2018-02-13
## 1       A             1           0.5           0.5             1
## 2       B             1           1.0           1.0             1
##   A5_2018-02-23 A6_2018-02-28 A7_2018-03-06 A8_2018-03-08 A9_2018-03-09
## 1           1.0             1             1             1             1
## 2           0.5             1             1             1             2
##   A10_2018-03-13 A11_2018-03-15 A12_2018-03-16 A13_2018-03-17
## 1              2              3              2              3
## 2              3              3              3              3
##   A14_2018-03-19 A15_2018-03-24 A16_2018-03-25 A17_2018-03-27
## 1              2              2              2              2
## 2              3              2              2              3
##   A18_2018-03-31 A19_2018-04-01 A20_2018-04-18 A21_2018-04-26
## 1            1.0              1            1.0              1
## 2            0.5              1            0.5              1
##   A22_2018-04-30 A23_2018-05-01
## 1              1            0.5
## 2              1            1.0
gathered.grades2=gather(grades2,assignment,grade,-Student)
head(gathered.grades2,2)
##   Student    assignment grade
## 1       A A1_2018-02-01     1
## 2       B A1_2018-02-01     1
gathered.grades.dueDates=gathered.grades2 %>% separate(assignment,c("assignment","year","month","day"))
head(gathered.grades.dueDates,2)
##   Student assignment year month day grade
## 1       A         A1 2018    02  01     1
## 2       B         A1 2018    02  01     1

There is not a lot to go on here, but we could imagine breaking things down by week instead of month to get a better sense.

gathered.grades.dueDates=gathered.grades %>% group_by(assignment)%>%summarise(maxPoss=max(grade)) %>%inner_join(gathered.grades.dueDates,by=c("assignment"="assignment"))

gathered.grades.dueDates %>% group_by(month)%>% summarise(avgGrade=mean(grade/maxPoss))
## # A tibble: 4 x 2
##   month avgGrade
##   <chr>    <dbl>
## 1 02       0.883
## 2 03       0.871
## 3 04       0.917
## 4 05       0.800

We could do something similar with str_split but it would take many more lines of code.

unite: use when you want to concatenate multiple variables into one

To undo separate, we use unite. We want to collapse all but Student and grade into a variable called assignment. I would usually use paste.

gathered.grades.dueDates %>%unite(assignment,-Student, -grade,sep="-") %>% head(2)
## # A tibble: 2 x 3
##   assignment      Student grade
##   <chr>           <chr>   <dbl>
## 1 A1-1-2018-02-01 A          1.
## 2 A1-1-2018-02-01 B          1.

replace_NA: use to replace missing values with a particular value

We may want to replace assignments that are missing with zeros. replace_na requires a named list of what to replace an NA with in each column. Since we have many assignment columns, I had to ask Stack Overflow for some assistance.

## this seems like a prime candidate for walk from purrr, to be continued...
grades[1,sample(1:(ncol(grades)-1),1)]=NA
grades[5,sample(1:(ncol(grades)-1),1)]=NA
grades[10,sample(1:(ncol(grades)-1),1)]=NA
grades[15,sample(1:(ncol(grades)-1),1)]=NA


#https://stackoverflow.com/questions/45576805/how-to-replace-all-na-in-a-dataframe-using-tidyrreplace-na
myList <- setNames(lapply(vector("list", ncol(grades)-1), function(x) x <- 0), names(grades)[-1])
head(myList,2)
## $A1
## [1] 0
## 
## $A2
## [1] 0
grades=grades %>% replace_na(myList)
head(grades,2)
##   Student A1  A2  A3 A4  A5 A6 A7 A8 A9 A10 A11 A12 A13 A14 A15 A16 A17
## 1       A  1 0.5 0.5  1 1.0  1  0  1  1   2   3   2   3   2   2   2   2
## 2       B  1 1.0 1.0  1 0.5  1  1  1  2   3   3   3   3   3   2   2   3
##   A18 A19 A20 A21 A22 A23
## 1 1.0   1 1.0   1   1 0.5
## 2 0.5   1 0.5   1   1 1.0

drop_na: use when you want to remove rows with missing values in certain columns

drop_na may be useful when we want to drop each student’s lowest grade in a certain category of grades. Assuming all other NA values are converted to zeros…

gathered.grades.cat=gathered.grades %>% mutate(category=sample(1:3,nrow(gathered.grades),replace=T))

byStudent=split(gathered.grades.cat,pull(gathered.grades.cat,Student))


dropLowestScore=function(studentGrades,categoryToDrop){

  catToDrop=studentGrades %>% filter(category==categoryToDrop)
  
  toDrop=which.min(pull(catToDrop,grade))
  
  studentGrades$grade[toDrop]=NA
  ## I don't think I can use pull here
  
  studentGradesAdj=studentGrades %>% drop_na()
  
  return(studentGradesAdj)
}

adjustedGrades=do.call("rbind",lapply(byStudent,dropLowestScore,1))

length(unique(pull(adjustedGrades,assignment)))
## [1] 23
adjustedGrades%>%group_by(Student)%>% summarise(count=n())%>%head(2)
## # A tibble: 2 x 2
##   Student count
##   <chr>   <int>
## 1 A          22
## 2 B          22

complete: use when you want to make implicit missing data explicit

I often use expand.grid and then merge to do this. This is much more concise.

Note: If we only want to include levels of a variable that are present in the data we use nesting to denote this within the complete call.

dim(adjustedGrades)
## [1] 330   4
gathered.grades.cat%>%complete(Student,assignment)%>% dim
## [1] 345   4

separate_rows: use when a value contains more than one piece of information and you want separate rows for each piece of information

Now suppose our spreadsheet of grades contains the maximum points possible for each assignment. We can separate this information to make it easier to do calculations.

maxPts=apply(grades[,2:ncol(grades)],2,max)

pasteGrades=function(idx){
  x=paste(grades[,idx+1],rep(maxPts[idx],nrow(grades)),sep="/")
  return(x)
}

pastedGrades=lapply(1:(ncol(grades)-1),pasteGrades)

adjGrades=do.call("cbind",pastedGrades)

grades3=cbind.data.frame(pull(grades,Student),adjGrades)
names(grades3)=names(grades)

head(grades3,2)
##   Student  A1    A2    A3  A4    A5  A6  A7  A8  A9 A10 A11 A12 A13 A14
## 1       A 1/1 0.5/1 0.5/1 1/1   1/1 1/1 0/1 1/1 1/3 2/3 3/3 2/3 3/3 2/3
## 2       B 1/1   1/1   1/1 1/1 0.5/1 1/1 1/1 1/1 2/3 3/3 3/3 3/3 3/3 3/3
##   A15 A16 A17   A18 A19   A20 A21 A22   A23
## 1 2/3 2/3 2/3   1/1 1/1   1/1 1/1 1/1 0.5/1
## 2 2/3 2/3 3/3 0.5/1 1/1 0.5/1 1/1 1/1   1/1
separated.grades=grades3 %>% separate_rows(-Student,sep="/")
head(separated.grades,4)
##   Student A1  A2  A3 A4  A5 A6 A7 A8 A9 A10 A11 A12 A13 A14 A15 A16 A17
## 1       A  1 0.5 0.5  1   1  1  0  1  1   2   3   2   3   2   2   2   2
## 2       A  1   1   1  1   1  1  1  1  3   3   3   3   3   3   3   3   3
## 3       B  1   1   1  1 0.5  1  1  1  2   3   3   3   3   3   2   2   3
## 4       B  1   1   1  1   1  1  1  1  3   3   3   3   3   3   3   3   3
##   A18 A19 A20 A21 A22 A23
## 1   1   1   1   1   1 0.5
## 2   1   1   1   1   1   1
## 3 0.5   1 0.5   1   1   1
## 4   1   1   1   1   1   1
pointsEarned=separated.grades[seq(1,nrow(separated.grades),by=2),]
pointsPossible=separated.grades[2,-1]
row.names(pointsPossible)=NULL

head(pointsEarned,2)
##   Student A1  A2  A3 A4  A5 A6 A7 A8 A9 A10 A11 A12 A13 A14 A15 A16 A17
## 1       A  1 0.5 0.5  1   1  1  0  1  1   2   3   2   3   2   2   2   2
## 3       B  1   1   1  1 0.5  1  1  1  2   3   3   3   3   3   2   2   3
##   A18 A19 A20 A21 A22 A23
## 1   1   1   1   1   1 0.5
## 3 0.5   1 0.5   1   1   1
pointsPossible
##   A1 A2 A3 A4 A5 A6 A7 A8 A9 A10 A11 A12 A13 A14 A15 A16 A17 A18 A19 A20
## 1  1  1  1  1  1  1  1  1  3   3   3   3   3   3   3   3   3   1   1   1
##   A21 A22 A23
## 1   1   1   1

nest: use when you want to see which levels of a variable occur with the same combination of levels of other variables

We can use nest to see who got the top grade for each assignment (and use unnest to undo the operation). This call makes every row a combination of assignment and grade level. The last column is a list of all the students who received that particular grade on the assignment.

byAssignmentGrade=gathered.grades%>%nest(Student)
head(byAssignmentGrade)
##   assignment grade                                     data
## 1         A1   1.0 A, B, C, D, E, G, H, I, J, K, L, M, N, O
## 2         A1   0.0                                        F
## 3         A2   0.5                                     A, E
## 4         A2   1.0       B, C, D, G, H, I, J, K, L, M, N, O
## 5         A2   0.0                                        F
## 6         A3   0.5                                  A, E, F
pull(byAssignmentGrade,data)[1]
## [[1]]
##    Student
## 1        A
## 2        B
## 3        C
## 4        D
## 5        E
## 7        G
## 8        H
## 9        I
## 10       J
## 11       K
## 12       L
## 13       M
## 14       N
## 15       O
unnest(byAssignmentGrade,data) %>%head
##   assignment grade Student
## 1         A1     1       A
## 2         A1     1       B
## 3         A1     1       C
## 4         A1     1       D
## 5         A1     1       E
## 6         A1     1       G

That covers most of the tidyr verbs. I suspect that I will still need to look up examples to remember how to use gather and spread, but I’m hoping this cheat sheet will help me remember when to use which without having to try both on a sample data set.

The other verbs that I can see myself using frequently are complete and spread. I’m doing a project now where I use the expand.grid and merge trick way too often. I suspect it is a lot slower than complete, so I’m going to make this switch and see if it speeds things up.

P.S. Just for the record: before posting this I searched for the dollar sign and rewrote the lines that involved it. Working on that guilty pleasure…

Happy Grading!

Feedback, questions, comments, etc. are welcome (@sastoudt).