Cursing with Instead of at Factors


## Thank you to Chris Kennedy for kindly telling me I should be using library instead of require on my posts.

Forever ago @dpseidel drew my attention to an awesome dataset collected by @jimwebb about tweets that cursed being cold/hot. This reminded me of a project that @danascientist (so many Dana’s!) and I did in @BaumerBen’s class where we tried to assess how cold it had to be for people to talk about being cold on Twitter. I was curious whether differences in region would impact the overall median per phrase since we expect those in warmer regions to have less of a tolerance to the cold.

This seemed like a perfect dataset to work through the forcats package and tackle some factors. Be warned, curse words are involved.

data <- read.csv("data/collected-tweets.csv")

First I translate lat/long to state.

latlong2state <- function(pointsDF) {
  # Prepare SpatialPolygons object with one SpatialPolygon
  # per state (plus DC, minus HI & AK)
  states <- map('state', fill=TRUE, col="transparent", plot=FALSE)
  IDs <- sapply(strsplit(states$names, ":"), function(x) x[1])
  states_sp <- map2SpatialPolygons(states, IDs=IDs,
                                   proj4string=CRS("+proj=longlat +datum=WGS84"))
  # Convert pointsDF to a SpatialPoints object 
  pointsSP <- SpatialPoints(pointsDF, 
                            proj4string=CRS("+proj=longlat +datum=WGS84"))
  # Use 'over' to get _indices_ of the Polygons object containing each point 
  indices <- over(pointsSP, states_sp)
  # Return the state names of the Polygons object containing each point
  stateNames <- sapply(states_sp@polygons, function(x) x@ID)

data$state <- latlong2state(data[,c("long","lat")])

Then I break down the states into regions using fct_collapse since we don’t have enough tweets per state.

data$division <- fct_collapse(data$state,

newengland = c("connecticut", "maine", "massachusetts", "new hampshire", "rhode island", "vermont"),
midatlantic = c("new jersey", "new york", "pennsylvania"),
eastnorthcentral = c("illinois", "indiana", "michigan", "ohio", "wisconsin"),
westnorthcentral = c("iowa", "kansas", "minnesota", "missouri", "nebraska", "north dakota", "south dakota"),
southatlantic = c("delaware", "florida", "georgia", "maryland", "north carolina", "south carolina","district of columbia", "west virginia", "virginia"),
eastsouthcentral = c("alabama", "kentucky", "mississippi", "tennessee"),
westsouthcentral = c("arkansas", "louisiana", "oklahoma", "texas"),
mountain = c("arizona", "colorado", "idaho", "montana", "nevada", "new mexico", "utah", "wyoming"),
pacific = c("alaska", "california", "hawaii", "oregon", "washington")
## Warning: Unknown levels in `f`: alaska, hawaii
data$region <- fct_collapse(data$division,
northeast = c("newengland", "midatlantic"),
midwest = c("eastnorthcentral", "westnorthcentral"),
south = c("southatlantic", "eastsouthcentral", "westsouthcentral"),
west = c("mountain", "pacific")

No Hawaii or Alaska in our dataset, but that’s fine. Also, am I the only one who thinks it is weird that Delaware is considered the south?

We can use fct_count to easily count how many tweets fall in each category.

## # A tibble: 10 x 2
##    f                    n
##    <fct>            <int>
##  1 eastsouthcentral   313
##  2 mountain           224
##  3 westsouthcentral   709
##  4 pacific            911
##  5 newengland         108
##  6 southatlantic     1043
##  7 eastnorthcentral   697
##  8 westnorthcentral   172
##  9 midatlantic        500
## 10 <NA>               698
## # A tibble: 5 x 2
##   f             n
##   <fct>     <int>
## 1 south      2065
## 2 west       1135
## 3 northeast   608
## 4 midwest     869
## 5 <NA>        698

Even these are a bit sparse, so we’ll stick with region to try to get a reasonable number of tweets per phrase.

For reference I want to easily be able to tell which phrase contains “hot” or “cold”.

data$type <- ifelse(grepl("hot",data$phrase, = T),"H",ifelse(grepl("cold",data$phrase, = T),"C","O"))
## I should be using stringr, but forgive me

Now I want to pick the phrases that are displayed in Jim’s plots to make comparisons easier.

toCompare = c("colder than mars", "colder than a witch's tit", "cold as heck", "colder than a mf", "cold as a bitch", "cold as balls", "cold as tits", "cold as fuck", "colder than my heart", "colder than a bitch", "cold as a mf", "cold as hell", "cold as ice", "hot as heck", "hotter than two rats", "hot as balls", "hotter than hell", "hot as tits", "hot as hell", "hot as fuck", "hot as shit", "hotter than satan's asshole", "hot as a mf", "hot as a bitch", "hotter than a mf", "hot as hades", "hot as dick")

We can use fct_other to lump all the other phrases into an “Other” category.

data$phrase <- fct_other(data$phrase, keep = toCompare)

Some regions don’t have every phrase, but I want that to be explicit in the data, so we break out our new friend tidyr. You can read more about my adventures with tidyr here.

data <- left_join(expand(data,phrase,region),data,c("phrase"="phrase","region"="region"))
res <- data %>% group_by(phrase,region) %>% summarise(medAppT = median(apparentTemperature, na.rm = T), count = n()) 

data %>% group_by(phrase) %>% summarise(medAppT = median(apparentTemperature, na.rm = T),count = n()) %>% arrange(medAppT)
## # A tibble: 28 x 3
##    phrase                    medAppT count
##    <fct>                       <dbl> <int>
##  1 colder than mars            -7.36    12
##  2 colder than a witch's tit   24.9     38
##  3 cold as heck                28.3     39
##  4 colder than a mf            33.2     16
##  5 cold as a bitch             36.5     43
##  6 cold as balls               36.6     78
##  7 cold as tits                40.5     24
##  8 cold as fuck                40.7    623
##  9 colder than my heart        41.8     19
## 10 colder than a bitch         44.3     14
## # ... with 18 more rows

Success! We match Jim’s values. I can try to make a quick plot to assess differences per region.

res2 <- subset(res,! & phrase != "Other")
ggplot(data = res2, aes(x = phrase, y = medAppT, col = region)) + geom_point() + theme(axis.text.x = element_text(angle = 45, hjust = 1))

But my ordering doesn’t match Jim’s so it’s hard to compare. He ordered them by overall median, so I can compute that and then use fct_reorder to rearrange. But first I need to get rid of the pesky “Other” category using fct_drop.

data2 <- data %>% filter(phrase != "Other") 
data2$phrase <- fct_drop(data2$phrase) ## defaults to drop what isn't present

toOrder <- data2 %>% group_by(phrase) %>% summarise(medAppT = median(apparentTemperature, na.rm = T), count = n()) %>% arrange(medAppT) %>% mutate(order = 1:nrow(.))

res2$phrase <- fct_drop(res2$phrase) ## drop before merging so levels match

res2 <- left_join(x = res2, y = toOrder, by = c("phrase"="phrase"))

res2$phrase <- fct_reorder(res2$phrase, res2$order)

ggplot(data = res2, aes(x = phrase, y = medAppT.x, col = region)) + geom_point() + theme(axis.text.x = element_text(angle = 45, hjust = 1)) + ylab("median apparent temp (F)") + xlab("")

There we go! But it is still hard to see what is going on. The south and west have larger temperatures for cold and hot categories which makes sense. But how much do the differences in number of tweets by region affect the median? Regional differences in phrase usage could potentially skew even a median.

## drop levels
data2 <- data %>% filter(phrase!="Other")
data2$phrase <- fct_drop(data2$phrase,only="Other")
## I would like to be able to do this in one step?

## reorder
data2 <- left_join(x = data2,y = toOrder, by = c("phrase"="phrase"))
data2$phrase <- fct_reorder(data2$phrase, data2$order)

## add on medians in black
ggplot(data = filter(data2, !, aes(x = phrase, y = apparentTemperature, col = region)) + geom_point(alpha = .5) + geom_point(aes(x = phrase,  y = medAppT), cex = 2, col = "black") + theme(axis.text.x = element_text(angle = 45, hjust = 1)) + xlab("") + ylab("apparent temperature (F)")

Even with some transparency in the points, we can’t see what may be affecting the medians (shown in black). Jittering on the categorical scale to the rescue!

ggplot(data = filter(data2, !, aes(x = phrase, y = apparentTemperature, col = region)) + geom_point(alpha = .5, position = position_jitter(w = .25, h = 0)) + geom_point(aes(x = phrase, y = medAppT), cex = 2, col = "black") + theme(text = element_text(size=15), axis.text.x = element_text(angle = 45, hjust = 1)) + xlab("") + ylab("apparent temperature (F)")

There are no immediately obvious clusters by region in the temperature direction per phrase, so the median seems like a reasonable choice. However, we can see how the mean could be affected by values that differ from the majority per phrase.

Actually drawing conclusions is beyond the scope of this post but questions remain:

  • Do particular regions use certain phrases more than we would expect by proportion of people/Twitter users?

  • Within phrases that are common across all regions, are there statistically and practically significant differences between the average temperature when a phrase is used?

I almost made it through this post without a gif (the horror!).

Feedback, questions, comments, etc. are welcome (@sastoudt).

P.S. @AmeliaMN and @askdrstats have a great paper about Wrangling Categorical Data in R if you want some serious guidance on factors.