NOTE: When updating my website in June 2021, I no longer could easily find this data (the 538 GitHub data page has changed?). I am using eval = F
to preserve the post, but we will not be able to see output. I will try to find a copy of the data in my backup drive and fix at a later time.
Next up on my to-learn list is dplyr. I use group_by
and summarize
fairly regularly, but there is other functionality that I want to learn to take advantage of.
mutate
select
filter
arrange
pull
(instead of my favorite, $)merge
)In anticipation of The Bachelorette starting at the end of May, I’m going to wrangle some data on previous contestants. Yes, watching The Bachelorette/The Bachelor is my guilty pleasure. But this feels fitting because the dollar sign is my R
guilty pleasure. I think this will be the hardest habit to break, but here I try to gain some muscle memory for mutate
and pull
.
I feel the disapproval already, but read this super-scientific article before judging me.
Luckily, both 538 and Kaggle have some data on this, so I don’t have to do any web scraping.
#https://github.com/fivethirtyeight/data/tree/master/bachelorette
#https://www.kaggle.com/brianbgonz/the-bachelor-contestants/data
setwd("~/Desktop/data-538/bachelorette")
contestants<-read.csv("contestants.csv",stringsAsFactors=F) ## just female contestants
bachelors<-read.csv("bachelors.csv",stringsAsFactors=F) ## bachelors
weekByWeek<-read.csv("bachelorette.csv",stringsAsFactors=F) ## both
These datasets contain different information, so the goal of this exercise is to wrangle them together and do something simple with the result. Plenty of others have done fancier stuff with this kind of data. I’m just trying to learn some new tidy verbs.
Get rid of headers.
weekByWeek=weekByWeek[-which(weekByWeek$SEASON=="SEASON"),]
Use select
and mutate
to add the number of dates.
Usually I would usually just use the dollar sign to add new columns (and I would just manually specify the dates columns).
dates=select(weekByWeek,starts_with("DATES"))
weekByWeek=weekByWeek %>% mutate(numOneOnOneDates=apply(dates,1,function(x){length(which(x=="D1"))}))
weekByWeek=weekByWeek %>% mutate(numDates=apply(dates,1,function(x){length(which(x!=""))}))
weekByWeek=weekByWeek %>% mutate(numGroupDates=numDates-numOneOnOneDates) ##
Get the names ready to join using mutate
and pull
. To standardize each data set, I want first names and last initials in all capitals.
Note: The string processing here is rudimentary. stringr will have its own blog post.
weekByWeek= weekByWeek %>% mutate(firstName=unlist(lapply(pull(weekByWeek, CONTESTANT),
function(x){unlist(str_split(x, "_"))[2]})))
weekByWeek= weekByWeek %>% mutate(lastInitial=unlist(lapply(pull(weekByWeek, CONTESTANT),
function(x){unlist(str_split(x, "_"))[3]})))
weekByWeek=weekByWeek %>% mutate(lastInitial=unlist(lapply(pull(weekByWeek,lastInitial),function(x){ifelse(is.na(x),"",x)})))
weekByWeek=weekByWeek %>% mutate(nameNice=paste(firstName,lastInitial,sep=" "))
contestants=contestants %>% mutate(firstName=unlist(lapply(pull(contestants,Name),function(x){unlist(str_split(x," "))[1]})))
contestants=contestants %>% mutate(lastName=unlist(lapply(pull(contestants,Name),function(x){unlist(str_split(x," "))[2]})))
contestants=contestants %>% mutate(lastInitial=unlist(lapply(pull(contestants,lastName),function(x){unlist(str_split(x,""))[1]})))
contestants=contestants %>% mutate(lastInitial=unlist(lapply(pull(contestants,lastInitial),function(x){ifelse(is.na(x),"",x)})))
contestants=contestants %>% mutate(nameNice=toupper(paste(firstName,lastInitial,sep=" ")))
Deal with some weird entries (nicknames have parentheses that mess things up).
Before we start merging, we should have a sense of what the best we can do is. Kaggle warns us that the data is missing some seasons.
I usually use merge
, but dplyr focuses on inner_join
, left_join
, right_join
, etc. so let’s get used to that syntax.
weekByWeek=weekByWeek %>% mutate(SEASON=as.numeric(SEASON))
tryMerge=inner_join(contestants,weekByWeek,by=c("nameNice"="nameNice","Season"="SEASON"))
dim(tryMerge)
Because some contestants don’t have last names listed in the Kaggle data, we are losing a lot of rows. Within a season we should be able to do a rough join instead to recover some of these. However, there are seasons where multiple contestants have the same first name. We can see that duplicates do occur with this fuzzy join.
mergedData=weekByWeek %>% regex_inner_join(contestants, by = c(nameNice = "nameNice",SEASON="Season"))
dim(mergedData)
length(which(duplicated(mergedData$CONTESTANT)))
Since one of the datasets only has the female contestants, our merged file will only have the women in it. Where are the female contestants from?
Using my ggplot skills from previous posts…
Let’s find out how many per state (and take advantage of arrange
).
More building off of previous ggplot skills…
pts = SpatialPoints(coordinates[complete.cases(coordinates),])
#https://journal.r-project.org/archive/2016/RJ-2016-043/RJ-2016-043.pdf
## There is probably an easier way to do this.
us_states <- unique(fips_codes$state)[1:51]
continental_states <- us_states[!us_states %in% c("AK", "HI")]
us_pumas <- rbind_tigris(
lapply(
continental_states, function(x) {
pumas(state = x, cb = TRUE)
}
)
)
proj4string(pts)=proj4string(us_pumas) ## this is needed for over
withinContinental=over(pts,us_pumas)
byState=group_by(withinContinental, STATEFP10) %>% summarise(count=n())
byState=byState %>% mutate(STATEFP10=as.numeric(STATEFP10))
byState=inner_join(byState,fips.state,by=c("STATEFP10"="STATE"))%>% arrange(desc(count))
I’m curious if the males participating on The Bachelorette are from the same types of places. It seems to me like small town southern girls are more represented than small town southern boys, but I want to test this theory. Let’s filter
for the South to at least start getting a sense.
byState
region=cbind.data.frame(state.abb, as.character(state.region))
names(region)=c("abb","reg")
south=inner_join(byState,region,by=c("STUSAB"="abb"))%>%filter(reg=="South")
south
sum(pull(south,count))/sum(pull(byState,count))
That covers the dplyr verbs, and I’m starting to get the hang of it. I think replacing the dollar sign with mutate
will come more easily, but I fear that pull
will always be a stretch for me. Fingers crossed I can kick the dollar sign habit.
R
guilty pleasure?