Guilty Pleasures via dplyr

PUBLISHED ON MAY 9, 2018

Next up on my to-learn list is dplyr. I use group_by and summarize fairly regularly, but there is other functionality that I want to learn to take advantage of.

  • mutate
  • select
  • filter
  • arrange
  • pull (instead of my favorite, $)
  • various joins (instead of merge)

In anticipation of The Bachelorette starting at the end of May, I’m going to wrangle some data on previous contestants. Yes, watching The Bachelorette/The Bachelor is my guilty pleasure. But this feels fitting because the dollar sign is my R guilty pleasure. I think this will be the hardest habit to break, but here I try to gain some muscle memory for mutate and pull.

I feel the disapproval already, but read this super-scientific article before judging me.

require(stringr)   ## string processing for names
require(dplyr)     ## the verbs I'm practicing
require(ggmap)     ## geocoding hometowns
require(fuzzyjoin) ## approximate joins
require(maps)      ## state map data
require(tigris)    ## census geographies (pumas, fips_codes)
require(sp)        ## SpatialPoints and over()
require(acs)       ## the fips.state lookup table

Luckily, both 538 and Kaggle have some data on this, so I don’t have to do any web scraping.

#https://github.com/fivethirtyeight/data/tree/master/bachelorette
#https://www.kaggle.com/brianbgonz/the-bachelor-contestants/data

setwd("~/Desktop/data-538/bachelorette")

contestants<-read.csv("contestants.csv",stringsAsFactors=F)  ## just female contestants
bachelors<-read.csv("bachelors.csv",stringsAsFactors=F) ## bachelors
weekByWeek<-read.csv("bachelorette.csv",stringsAsFactors=F) ## both

These datasets contain different information, so the goal of this exercise is to wrangle them together and do something simple with the result. Plenty of others have done fancier stuff with this kind of data. I’m just trying to learn some new tidy verbs.

head(contestants)
##             Name Age                      Occupation             Hometown
## 1   Amanda Marsh  23                   Event Planner      Chanute, Kansas
## 2    Trista Rehn  29               Miami Heat Dancer       Miami, Florida
## 3 Shannon Oliver  24 Financial Management Consultant        Dallas, Texas
## 4            Kim  24                           Nanny       Tempe, Arizona
## 5   Cathy Grimes  22                Graduate Student Terra Haute, Indiana
## 6      Christina  28                        Attorney   Bonita, California
##   Height ElimWeek Season
## 1     NA       NA      1
## 2     NA        6      1
## 3     NA        5      1
## 4     NA        4      1
## 5     NA        3      1
## 6     NA        3      1
head(bachelors)
##               Name Age                  Hometown Height Season
## 1      Alex Michel  32 Charlottesville, Virginia     NA      1
## 2     Aaron Buerge  28          Butler, Missouri     NA      2
## 3     Jesse Palmer  34          Toronto, Ontario     NA      5
## 4 Lorenzo Borghese  34              Milan, Italy     NA      9
## 5     Andy Baldwin  30   Lancaster, Pennsylvania     NA     10
## 6      Brad Womack  35             Austin, Texas     NA     11
head(weekByWeek,2)
##           SHOW SEASON CONTESTANT ELIMINATION.1 ELIMINATION.2 ELIMINATION.3
## 1         SHOW SEASON         ID             1             2             3
## 2 Bachelorette     13 13_BRYAN_A            R1                            
##   ELIMINATION.4 ELIMINATION.5 ELIMINATION.6 ELIMINATION.7 ELIMINATION.8
## 1             4             5             6             7             8
## 2             R             R                           R              
##   ELIMINATION.9 ELIMINATION.10 DATES.1 DATES.2 DATES.3 DATES.4 DATES.5
## 1             9             10       1       2       3       4       5
## 2                            W                      D6     D13      D1
##   DATES.6 DATES.7 DATES.8 DATES.9 DATES.10
## 1       6       7       8       9       10
## 2      D7      D1      D1      D1       D1

Get rid of the header rows embedded in the data itself.

weekByWeek=weekByWeek[-which(weekByWeek$SEASON=="SEASON"),]
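Since filter is on my to-learn list anyway, the dplyr version of that cleanup is a one-liner:

weekByWeek=weekByWeek %>% filter(SEASON!="SEASON") ## same cleanup, no which() needed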

Counting Dates

Use select and mutate to add the number of dates.

Usually I would just use the dollar sign to add new columns (and manually specify the dates columns).

dates=select(weekByWeek,starts_with("DATES")) 

weekByWeek=weekByWeek %>% mutate(numOneOnOneDates=apply(dates,1,function(x){length(which(x=="D1"))}))

weekByWeek=weekByWeek %>% mutate(numDates=apply(dates,1,function(x){length(which(x!=""))}))

weekByWeek=weekByWeek %>% mutate(numGroupDates=numDates-numOneOnOneDates)
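Incidentally, the three steps can collapse into one mutate: comparing the data frame to a value gives a logical matrix, and rowSums counts the TRUEs per contestant. An equivalent sketch:

weekByWeek=weekByWeek %>% mutate(numOneOnOneDates=rowSums(dates=="D1"),
                                 numDates=rowSums(dates!=""),
                                 numGroupDates=numDates-numOneOnOneDates)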

String Processing

Get the names ready to join using mutate and pull. To standardize each data set, I want first names and last initials in all capitals.

Note: The string processing here is rudimentary. stringr will have its own blog post.

weekByWeek= weekByWeek %>% mutate(firstName=unlist(lapply(pull(weekByWeek, CONTESTANT),
                                        function(x){unlist(str_split(x, "_"))[2]})))

weekByWeek= weekByWeek %>% mutate(lastInitial=unlist(lapply(pull(weekByWeek, CONTESTANT),
                                                          function(x){unlist(str_split(x, "_"))[3]})))

weekByWeek=weekByWeek %>% mutate(lastInitial=unlist(lapply(pull(weekByWeek,lastInitial),function(x){ifelse(is.na(x),"",x)})))

weekByWeek=weekByWeek %>% mutate(nameNice=paste(firstName,lastInitial,sep=" ")) 



contestants=contestants %>% mutate(firstName=unlist(lapply(pull(contestants,Name),function(x){unlist(str_split(x," "))[1]})))

contestants=contestants %>% mutate(lastName=unlist(lapply(pull(contestants,Name),function(x){unlist(str_split(x," "))[2]}))) 

contestants=contestants %>% mutate(lastInitial=unlist(lapply(pull(contestants,lastName),function(x){unlist(str_split(x,""))[1]}))) 

contestants=contestants %>% mutate(lastInitial=unlist(lapply(pull(contestants,lastInitial),function(x){ifelse(is.na(x),"",x)})))

contestants=contestants %>% mutate(nameNice=toupper(paste(firstName,lastInitial,sep=" "))) 
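As an aside, stringr’s word() is vectorized, so the same parsing works without the lapply/unlist gymnastics. A sketch for the week-by-week IDs (the contestants columns would follow the same pattern):

## vectorized version of the ID parsing above
weekByWeek=weekByWeek %>% mutate(firstName=word(CONTESTANT,2,sep="_"),
                                 lastInitial=coalesce(word(CONTESTANT,3,sep="_"),""),
                                 nameNice=paste(firstName,lastInitial,sep=" "))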

Deal with some weird entries (nicknames have parentheses that mess things up).

contestants[which(grepl("\\(",contestants$Name)),]
##                       Name Age                  Occupation
## 100    Britt (Bowe) Newton  28                Beer chemist
## 212    Elizabeth Kitt (NE)  29                       Nanny
## 218 Alexa (Lex) McAllister  25                Entrepreneur
## 221 Elizabeth Kreft (D.C.)  29 Captain, Air National Guard
##               Hometown Height ElimWeek Season firstName lastName
## 100       Columbus, OH     NA        1      9     Britt   (Bowe)
## 212 Imperial, Nebraska     NA        3     14 Elizabeth     Kitt
## 218     Galloway, Ohio     NA        1     14     Alexa    (Lex)
## 221    Union, Kentucky     NA        1     14 Elizabeth    Kreft
##     lastInitial    nameNice
## 100           (     BRITT (
## 212           K ELIZABETH K
## 218           (     ALEXA (
## 221           K ELIZABETH K
contestants$nameNice[which(grepl("Bowe",contestants$Name))]="Britt N"
contestants$nameNice[which(grepl("McAllister",contestants$Name))]="Alexa A"

Merging/Joining

Before we start merging, we should have a sense of the best we can hope for. Kaggle warns us that the data is missing some seasons.

dim(contestants)
## [1] 423  11
dim(weekByWeek)
## [1] 887  29

I usually use merge, but dplyr focuses on inner_join, left_join, right_join, etc., so let’s get used to that syntax.

weekByWeek=weekByWeek %>% mutate(SEASON=as.numeric(SEASON))
tryMerge=inner_join(contestants,weekByWeek,by=c("nameNice"="nameNice","Season"="SEASON"))
dim(tryMerge)
## [1] 313  38
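Which rows did we lose? anti_join (another member of dplyr’s join family) returns the Kaggle contestants with no exact match in the week-by-week data:

noMatch=anti_join(contestants,weekByWeek,by=c("nameNice"="nameNice","Season"="SEASON"))
head(pull(noMatch,Name)) ## peek at who fell out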

Because some contestants don’t have last names listed in the Kaggle data, we are losing a lot of rows. Within a season we should be able to do a rough join instead to recover some of these. However, there are seasons where multiple contestants have the same first name. We can see that duplicates do occur with this fuzzy join.

mergedData=weekByWeek %>% regex_inner_join(contestants, by = c(nameNice = "nameNice",SEASON="Season"))
dim(mergedData)
## [1] 433  40
length(which(duplicated(mergedData$CONTESTANT))) 
## [1] 41
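To see which fuzzy matches were ambiguous, group_by and filter team up nicely (a quick sketch):

## week-by-week IDs that matched more than one Kaggle contestant
mergedData %>% group_by(CONTESTANT) %>% filter(n()>1) %>% select(CONTESTANT,Name,Season) %>% head()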

Location, Location, Location

Since one of the datasets only has the female contestants, our merged file will only have the women in it. Where are the female contestants from?

coordinates=geocode(pull(mergedData,Hometown),output="latlon") ## queries a geocoding API, so this takes a while
write.csv(coordinates,"bachelorCoords.csv",row.names=F) ## cache the results to avoid re-geocoding

Using my ggplot skills from previous posts…

all_states <- map_data("state")

p <- ggplot()+ geom_polygon( data=all_states, aes(x=long, y=lat, group = group),colour="black", fill="white" )
p <- p+ geom_point(data=coordinates,aes(x=lon,y=lat))+xlim(-125,-60)+ylim(25,50)+theme_void()
p

Let’s find out how many per state (and take advantage of arrange).

More building off of previous ggplot skills…

pts = SpatialPoints(coordinates[complete.cases(coordinates),])

#https://journal.r-project.org/archive/2016/RJ-2016-043/RJ-2016-043.pdf
## There is probably an easier way to do this.
us_states <- unique(fips_codes$state)[1:51]
continental_states <- us_states[!us_states %in% c("AK", "HI")]
us_pumas <- rbind_tigris(
 lapply(
    continental_states, function(x) {
      pumas(state = x, cb = TRUE)
    }
  )
)

proj4string(pts)=proj4string(us_pumas) ## this is needed for over

withinContinental=over(pts,us_pumas)

byState=group_by(withinContinental, STATEFP10) %>% summarise(count=n()) 
byState=byState %>% mutate(STATEFP10=as.numeric(STATEFP10))
byState=inner_join(byState,fips.state,by=c("STATEFP10"="STATE"))%>% arrange(desc(count))
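In answer to my own comment that there is probably an easier way: maps::map.where() looks up which state polygon contains each point, which skips the PUMA download entirely. A sketch I haven’t run on this data (the coarser map polygons may disagree with the PUMA version near coasts and borders):

ptsOK=coordinates[complete.cases(coordinates),]
stateHits=map.where("state",pull(ptsOK,lon),pull(ptsOK,lat))
byStateAlt=tibble(state=sub(":.*","",stateHits)) %>% ## drop sub-region suffixes like "massachusetts:main"
  filter(!is.na(state)) %>% count(state,sort=TRUE)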

I’m curious whether the males participating on The Bachelorette come from the same types of places. It seems to me like small-town southern girls are more represented than small-town southern boys, but I want to test this theory. Let’s filter for the South to at least start getting a sense.

byState
## # A tibble: 39 x 5
##    STATEFP10 count STUSAB STATE_NAME   STATENS
##        <dbl> <int> <chr>  <chr>          <int>
##  1        6.    46 CA     California   1779778
##  2       48.    37 TX     Texas        1779801
##  3       12.    27 FL     Florida       294478
##  4       36.    21 NY     New York     1779796
##  5       17.    19 IL     Illinois     1779784
##  6       26.    13 MI     Michigan     1779789
##  7       42.    12 PA     Pennsylvania 1779798
##  8        4.    11 AZ     Arizona      1779777
##  9       47.    10 TN     Tennessee    1325873
## 10       49.    10 UT     Utah         1455989
## # ... with 29 more rows
region=cbind.data.frame(state.abb, as.character(state.region))
names(region)=c("abb","reg")

south=inner_join(byState,region,by=c("STUSAB"="abb"))%>%filter(reg=="South")
south
## # A tibble: 15 x 6
##    STATEFP10 count STUSAB STATE_NAME     STATENS reg  
##        <dbl> <int> <chr>  <chr>            <int> <fct>
##  1       48.    37 TX     Texas          1779801 South
##  2       12.    27 FL     Florida         294478 South
##  3       47.    10 TN     Tennessee      1325873 South
##  4       37.     8 NC     North Carolina 1027616 South
##  5       21.     6 KY     Kentucky       1779786 South
##  6       40.     6 OK     Oklahoma       1102857 South
##  7       13.     5 GA     Georgia        1705317 South
##  8       45.     4 SC     South Carolina 1779799 South
##  9       51.     4 VA     Virginia       1779803 South
## 10        5.     3 AR     Arkansas         68085 South
## 11       24.     3 MD     Maryland       1714934 South
## 12        1.     2 AL     Alabama        1779775 South
## 13       22.     2 LA     Louisiana      1629543 South
## 14       28.     1 MS     Mississippi    1779790 South
## 15       54.     1 WV     West Virginia  1779805 South
sum(pull(south,count))/sum(pull(byState,count))
## [1] 0.3541667

So roughly 35% of the geocoded women come from the South, a baseline to compare against the men eventually. That covers the dplyr verbs, and I’m starting to get the hang of it. I think replacing the dollar sign with mutate will come more easily, but I fear that pull will always be a stretch for me. Fingers crossed I can kick the dollar sign habit.

Feedback, questions, comments, etc. are welcome (@sastoudt). What’s your R guilty pleasure?