There are lots of functions in stringr that improve upon base R equivalents for string processing. I’m not going to go through all the functionality, but at the end of the post, after the main attraction, I’ll go through examples in the stringr documentation and pick out the ones that seem handiest for scenarios where I have found base R wanting.
Now for the star of the show. I want to be able to read in the text of a letter of recommendation and make all the pronouns gender neutral. I found a template here for a letter of recommendation that we will use. Since names are often a give-away of gender, I also would like to replace any names with “Student”. I’m going to change the generic name to “Sara” to test this out. Note that originally I wanted to replace with the less awkward “my student”, but then I would have to worry about whether “my” should be capitalized or not depending on the position of the name in the sentence.
letter=[1238 chars quoted with '"']
If the candidate has a name that has at least 5 uses in the United States, we can use the babynames package to locate it and replace it. This approach has limitations for international names. str_detect only searches for one pattern. We don’t want to search for every single name one at a time. Instead, I’m going to find each capitalized word in the letter of recommendation and use it as the pattern to look for in the babynames. This is somewhat wasteful because the first word of every sentence is capitalized, but for now, I don’t want to have to deal with deciding whether or not each word comes directly after a punctuation mark.
[1] "Mary" "Anna" "Emma" "Elizabeth" "Minnie"
[6] "Margaret"
#https://stackoverflow.com/questions/21781014/remove-all-line-breaks-enter-symbols-from-the-string-using-r
nolines=str_replace_all(letter,"[\n]"," ")
nolines=str_replace_all(nolines,fixed("["),"\\[")
nolines=str_replace_all(nolines,fixed("]"),"\\]")
words=str_split(nolines," ")
words=words[[1]]
capitalized=unique(words[str_detect(words,"[A-Z]")])
The following is not restrictive enough since the words can be part of a name.
isName=lapply(capitalized,function(x){sum(str_detect(unique(babynames$name),x))})
capitalized[which(unlist(isName)>0)] ## not restrictive enough
[1] "Dear" "Sara" "I" "She" "Her" "As" "Best"
The following is too restrictive:
isName=lapply(capitalized,function(x){sum(str_detect(unique(babynames$name),paste(x,"\\>",sep="")))})
capitalized[unlist(isName)]
character(0)
But this is weird. This tells me that “\>” is the regular expression for the pattern being found at the end of the word.
sum(str_detect(unique(babynames$name),"Sara\\>")) ## default is regex so not looking for that actually
[1] 0
sum(str_detect(unique(babynames$name),regex("Sara\\>"))) ## default is regex so not looking for that actually
[1] 0
This works using grepl. Can someone please explain this to me? I thought it might have to do with default engines, but I couldn’t find much information on the base R default beyond a note in the “performance consideration section” here.
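A possible explanation, offered as a sketch rather than a definitive answer: base R's grepl defaults to the TRE engine, where \< and \> are word-boundary anchors, while stringr hands patterns to the ICU engine (via stringi), where \> is not an anchor (an escaped > appears to be read as a literal character) and \b is the word-boundary assertion. The vector toy_names below is a made-up stand-in for unique(babynames$name).

```r
library(stringr)

# Toy stand-in for unique(babynames$name); the real vector is much longer
toy_names <- c("Sara", "Sarah", "Sarai")

# TRE (base R's default engine) understands \> as "end of word"
base_hits <- grepl("Sara\\>", toy_names)

# ICU (used by stringr) does not treat \> as an anchor; \b is its word boundary
icu_gt <- str_detect(toy_names, "Sara\\>")
icu_b  <- str_detect(toy_names, "Sara\\b")

# For whole-name matching, anchoring both ends is stricter still
icu_exact <- str_detect(toy_names, "^Sara$")
```

If this reading is right, swapping \\> for \\b (or anchoring with ^ and $) should make the stringr calls agree with grepl.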
Here is a hack:
allnames=str_flatten(unique(babynames$name)," ")
isName=lapply(capitalized,function(x){str_detect(allnames,paste(" ",x," ",sep=""))})
capitalized[unlist(isName)]
[1] "Sara" "She" "Her"
This is annoying. These are actually names.
[1] "She"
[1] "Her"
I’ll just create an exception list.
exception=c("She","Her") ## may need to add more as we experience more weird things
Pick names to replace. Note we don’t have to worry about “Sara’s” because we will still replace the “Sara” portion with “Student”.
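The code for this step is not shown, so here is a minimal sketch of one way it could look. It assumes the goal is exact, whole-name matches minus the exception list; toy_names and candidates are hypothetical names, and toy_names stands in for unique(babynames$name).

```r
# Toy stand-ins for objects built earlier in the post
capitalized <- c("Dear", "Sara", "I", "She", "Her", "As", "Best")
toy_names   <- c("Sara", "Sarah", "She", "Her", "Mary")  # stands in for unique(babynames$name)
exception   <- c("She", "Her")

# Exact, whole-name matches only (no regular expression involved)
candidates <- capitalized[capitalized %in% toy_names]

# Drop the pronouns that also happen to be names
namesToReplace <- setdiff(candidates, exception)
```

Using %in% sidesteps the anchor question entirely, since it compares whole strings. Like the space-padded search, it will miss tokens that still carry punctuation, such as "Sara,".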
Do the replacing. I don’t want to use a loop but I need to continually update words. Any suggestions? Will walk in purrr do this?
### go through everything in namesToReplace, saving the updated words each time
for(i in seq_along(namesToReplace)){
  words=str_replace_all(words,namesToReplace[i],"Student")
}
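On the loop question: walk is meant for side effects and returns its input unchanged, so it is probably not the right tool here. Two alternatives are sketched below, with made-up stand-ins for words and namesToReplace. The names out_reduce and out_regex are hypothetical, and the first option uses fixed() where the loop above uses a regular expression.

```r
library(stringr)

# Hypothetical stand-ins for the post's objects
words <- c("Dear", "Sara,", "meet", "Maria's", "team")
namesToReplace <- c("Sara", "Maria")

# Option 1: fold over the names, threading the updated vector through each step
out_reduce <- purrr::reduce(
  namesToReplace,
  function(acc, nm) str_replace_all(acc, fixed(nm), "Student"),
  .init = words
)

# Option 2: a single call with an alternation pattern ("Sara|Maria")
pattern <- str_c(namesToReplace, collapse = "|")
out_regex <- str_replace_all(words, pattern, "Student")
```

Both give the same result on this toy input. The alternation version is shorter, but it assumes the names contain no regex metacharacters.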
Ideally, we could just change everything to the gender neutral singular they/them. However, this would require us to change the verbs. Instead we will use “s/he”, while recognizing that this binary is not fully inclusive.
Again the mystery of different syntax for anchors comes up:
which(str_detect(words,"^She\\>")>0)
integer(0)
grep("^She\\>",words)
[1] 51 89 116
grep("^She$",words)
[1] 51 89 116
#words= str_replace_all(words,"^She\\>","S/He") ## doesn't work
words= str_replace_all(words,"^She$","S/He")
words= str_replace_all(words,"^she$","s/he")
words= str_replace_all(words,"^he$","s/he")
words= str_replace_all(words,"^He$","S/He")
## need contractions (she's, he's)
words= str_replace_all(words,"^She's$","S/He's")
words= str_replace_all(words,"^she's$","s/he's")
words= str_replace_all(words,"^he's$","s/he's")
words= str_replace_all(words,"^He's$","S/He's")
words= str_replace_all(words,"^hers\\>","theirs") ## shouldn't be first so no capitalization
words= str_replace_all(words,"^him\\>","them") ## shouldn't be first
str_flatten(words," ")
[1] "Dear Mr./Mrs./Ms. \\[Last Name\\], It’s my absolute pleasure to recommend Student for \\[position\\] with \\[Company\\]. Student and I \\[relationship\\] at \\[Company\\] for \\[length of time\\]. I thoroughly enjoyed my time working with Student, and came to know her as a truly valuable asset to absolutely any team. S/He is honest, dependable, and incredibly hard-working. Beyond that, s/he is an impressive \\[soft skill\\] who is always \\[result\\]. Her knowledge of \\[specific subject\\] and expertise in \\[specific subject\\] was a huge advantage to our entire office. S/He put this skillset to work in order to \\[specific achievement\\]. Along with her undeniable talent, Student has always been an absolute joy to work with. S/He is a true team player, and always manages to foster positive discussions and bring the best out of other employees. Without a doubt, I confidently recommend Student to join your team at \\[Company\\]. As a dedicated and knowledgeable employee and an all-around great person, I know that s/he will be a beneficial addition to your organization. Please feel free to contact me at \\[your contact information\\] should you like to discuss Student’s qualifications and experience further. I’d be happy to expand on my recommendation. Best wishes, \\[Your Name\\] "
Now because English is weird we have a problem. How do we distinguish between the following examples?
1. That is hers. --> theirs
2. That is his. --> theirs
3. That is his experience. --> their
4. That is her experience. --> their
5. Get to know her. --> them
6. Get to know him. --> them
Numbers 1 and 6 are not ambiguous, so we can fix those.
words= str_replace_all(words,"^hers$","theirs")
words= str_replace_all(words,"^him$","them")
toParse= str_flatten(words," ")
#toParse=r_to_py(toParse)
To distinguish between cases 2 and 3, and between cases 4 and 5, we need to automatically determine what part of speech the words are.
his=which(str_detect(words,"^his$"))
His=which(str_detect(words,"^His$"))
her=which(str_detect(words,"^her$"))
Her=which(str_detect(words,"^Her$"))
toChange=c(his,His,her,Her)
Bet you didn’t expect to see NLP when you clicked on this post. Apparently we need a part-of-speech (POS) tagger to tell us what type of word each one is in a sentence.
Both the R packages I found to do this had rJava issues.
## not run
require(openNLP)
devtools::install_github("bnosac/RDRPOSTagger")
I guess now is the time to learn some reticulate basics.
I tried to use r_to_py to pass in toParse, but was having trouble (see commented out code), so for now, I’m just copying the contents of toParse into this chunk. Can somebody please point me to an example of getting an R object to Python in Markdown?
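One possible route, sketched under the assumption that reticulate and a working Python installation are available: inside a Python chunk in R Markdown, reticulate exposes R objects through an object named r, so the string should be reachable as r.toParse without any explicit r_to_py call. From the R side, R strings are converted automatically when passed to a Python function. The example below uses Python's built-in len as a stand-in for the nltk calls, and the value of toParse is made up.

```r
library(reticulate)

# Made-up stand-in for the flattened letter
toParse <- "Student and I worked together."

# Python built-ins, callable from R; the R string is converted on the way in
builtins <- import_builtins()
n_chars <- builtins$len(toParse)
```

The same automatic conversion should apply to functions from a module loaded with import("nltk"), though that was not tested here. Note that nltk.pos_tag expects a list of tokens, so the raw string would need to go through nltk.word_tokenize first.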
import nltk
text = nltk.word_tokenize("Dear Mr./Mrs./Ms. [Last Name], It’s my absolute pleasure to recommend Student for [position] with [Company]. Student and I [relationship] at [Company] for [length of time]. I thoroughly enjoyed my time working with Student, and came to know her as a truly valuable asset to absolutely any team. S/He is honest, dependable, and incredibly hard-working. Beyond that, s/he is an impressive [soft skill] who is always [result]. Her knowledge of [specific subject] and expertise in [specific subject] was a huge advantage to our entire office. S/He put this skillset to work in order to [specific achievement]. Along with her undeniable talent, Student has always been an absolute joy to work with. S/He is a true team player, and always manages to foster positive discussions and bring the best out of other employees. Without a doubt, I confidently recommend Student to join your team at [Company]. As a dedicated and knowledgeable employee and an all-around great person, I know that s/he will be a beneficial addition to your organization. Please feel free to contact me at [your contact information] should you like to discuss Student’s qualifications and experience further. I’d be happy to expand on my recommendation. Best wishes, [Your Name]")
test=nltk.pos_tag(text)
#test=nltk.pos_tag(toParse)
print(py$test[[1]])
[[1]]
[1] "Dear"
[[2]]
[1] "NNP"
wordsPy=unlist(lapply(py$test,function(x){x[[1]]}))
hisPy=which(str_detect(wordsPy,"^his$"))
HisPy=which(str_detect(wordsPy,"^His$"))
herPy=which(str_detect(wordsPy,"^her$"))
HerPy=which(str_detect(wordsPy,"^Her$"))
toGet=c(hisPy,HisPy,herPy,HerPy)
pos=unlist(lapply(toGet,function(x){py$test[[x]][[2]]}))
pos
[1] "PRP" "PRP$" "PRP$"
According to the key here:
PRP: pronoun, personal (case 5)
PRP$: pronoun, possessive (case 4)
So now we can determine what to replace them with. Bear with this loop please.
for(i in seq_along(pos)){
  if(pos[i]=="PRP"){
    words[toChange[i]]="them"
  }else if(pos[i]=="PRP$"&&str_detect(words[toChange[i]],"[A-Z]")){
    words[toChange[i]]="Their"
  }else if(pos[i]=="PRP$"&&!str_detect(words[toChange[i]],"[A-Z]")){
    words[toChange[i]]="their"
  }
}
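For anyone who would rather skip the loop, the same decision can be sketched in vectorized base R. The inputs below are toy stand-ins for words, toChange and pos, and the helper names current, is_cap, replacement and known are hypothetical.

```r
# Toy stand-ins for the post's objects
words    <- c("know", "her", "Her", "knowledge", "her", "talent")
toChange <- c(2, 5, 3)                  # positions of her, her, Her
pos      <- c("PRP", "PRP$", "PRP$")    # tags in the same order as toChange

current <- words[toChange]
is_cap  <- grepl("^[A-Z]", current)

# Personal pronoun -> "them"; possessive -> "Their"/"their" depending on case
replacement <- ifelse(pos == "PRP", "them",
                      ifelse(is_cap, "Their", "their"))

# Only touch positions whose tag is one of the two handled cases
known <- pos %in% c("PRP", "PRP$")
words[toChange[known]] <- replacement[known]
```

Any position with a tag other than PRP or PRP$ is left untouched, which matches what the loop does.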
Finally, we can take away the extra escape characters to get back to the original.
words= str_replace_all(words,fixed("\\["),"[")
words= str_replace_all(words,fixed("\\]"),"]")
str_flatten(words," ")
[1] "Dear Mr./Mrs./Ms. [Last Name], It’s my absolute pleasure to recommend Student for [position] with [Company]. Student and I [relationship] at [Company] for [length of time]. I thoroughly enjoyed my time working with Student, and came to know them as a truly valuable asset to absolutely any team. S/He is honest, dependable, and incredibly hard-working. Beyond that, s/he is an impressive [soft skill] who is always [result]. Their knowledge of [specific subject] and expertise in [specific subject] was a huge advantage to our entire office. S/He put this skillset to work in order to [specific achievement]. Along with their undeniable talent, Student has always been an absolute joy to work with. S/He is a true team player, and always manages to foster positive discussions and bring the best out of other employees. Without a doubt, I confidently recommend Student to join your team at [Company]. As a dedicated and knowledgeable employee and an all-around great person, I know that s/he will be a beneficial addition to your organization. Please feel free to contact me at [your contact information] should you like to discuss Student’s qualifications and experience further. I’d be happy to expand on my recommendation. Best wishes, [Your Name] "
I thought this would be a quick, cute thing, but I was SO wrong; it turned into a mess. But it finally works!!
toupper and tolower have equivalents in stringr, but stringr also has a function to convert strings to title case. This can come in handy, for example, when you need state names to start with a capital letter for facet_geo.
states<-c("pennsylvania","massachusetts","maryland","california")
#str_to_upper ## toupper
#str_to_lower ## tolower
str_to_title(states) ## this format needed for geofacet
[1] "Pennsylvania" "Massachusetts" "Maryland" "California"
A period matches any character in a regular expression, but sometimes you want to search for actual periods. You can use fixed in stringr functions to do this without having to remember escape characters. Apparently, base R string functions have a fixed parameter as well, but I wasn’t aware of it before now.
pattern<-"a.b"
strings<-c("abb","a.b")
str_detect(strings,pattern)
[1] TRUE TRUE
str_detect(strings,fixed(pattern))
[1] FALSE TRUE
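For comparison, here is a small sketch of the three ways to match a literal period mentioned above: escaping it in a regular expression, stringr's fixed(), and base R's fixed argument. The variable names escaped, literal and base_r are only for illustration.

```r
library(stringr)

strings <- c("abb", "a.b")

escaped <- str_detect(strings, "a\\.b")         # regex with an escaped period
literal <- str_detect(strings, fixed("a.b"))    # stringr's fixed() modifier
base_r  <- grepl("a.b", strings, fixed = TRUE)  # base R's fixed argument
```

All three should flag only the second string.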
Using boundary you can split on words and allow for inconsistent spacing.
words<-c("These are some words.")
str_split(words,boundary("word"))[[1]] ## character, line_break, sentence, word
[1] "These" "are" "some" "words"
I always put the wrong argument first in grep and grepl, but the stringr functions take their parameters in the order I expect.
str_detect(fruit,"a") ## grepl("a",fruit)
I always forget how to concatenate a vector with a particular separation using paste.
str_flatten(letters,"-")
[1] "a-b-c-d-e-f-g-h-i-j-k-l-m-n-o-p-q-r-s-t-u-v-w-x-y-z"
paste(letters,collapse="-")
[1] "a-b-c-d-e-f-g-h-i-j-k-l-m-n-o-p-q-r-s-t-u-v-w-x-y-z"
The glue-related functions seem handy. This could be a whole other post, so I’ll save the details for later.
name <- "Fred"
str_glue("My name is {name}, not {{name}}.")
My name is Fred, not {name}.
mtcars %>% str_glue_data("{rownames(.)} has {hp} hp") %>% head()
Mazda RX4 has 110 hp
Mazda RX4 Wag has 110 hp
Datsun 710 has 93 hp
Hornet 4 Drive has 110 hp
Hornet Sportabout has 175 hp
Valiant has 105 hp
stringr has fancier trimming functions.
str_trim(" test ",side="both") ## trimws
[1] "test"
str_squish("\n\nString with excess, trailing and leading white space\n\n")
[1] "String with excess, trailing and leading white space"
If you have any insight into my remaining mysteries, please let me know!