Wrangling USDA Data with readr

PUBLISHED ON MAY 26, 2018

When I saw that this week’s blog post was supposed to be about readr, I drew a blank on how to get my hands dirty with the functionality in the package. I didn’t want to reuse the sample data from the documentation, but I also didn’t want to go scouring for a dataset I wasn’t really motivated to munge. Then fate stepped in.

At work I wanted to get some data from the USDA that is not available through their API service. Many USDA reports come as .txt or .pdf files, and these files contain tables holding the information I needed. I literally spent days getting things organized (but I did it!). Even though it still took a long time (and I am not a patient person), I was grateful that I had perused the readr documentation ahead of time in preparation for writing this blog post.

I can’t go into too much detail about what I did at work, so instead I’ll show a representative example of a USDA text file full of tables. This example is actually a bit gnarlier than the stuff I was dealing with, so that seems fair.

Go ahead and look at the original file. WHY?! Why must our government store data in a super inaccessible format?! As much as we complain about Excel, I would be grateful for a .xls file here.

But we must persevere.

Disclaimer: I am not claiming that this is the best way to use readr to wrangle the information in these tables. I would love it if someone had a less clunky approach they were willing to share.

Read in using the default read_table

I manually looked through the file for the number of lines I could skip before getting to the good stuff.

library(readr)
library(dplyr)
setwd("~/Desktop")
raw=read_table("usv1.txt")

head(raw)
## # A tibble: 6 x 1
##   X1                                                 
##   <chr>                                              
## 1 United States                                      
## 2 Summary and State Data                             
## 3 "Volume 1 \x95 Geographic Area Series \x95 Part 51"
## 4 <NA>                                               
## 5 AC-12-A-51                                         
## 6 <NA>
rawSkip=read_table("usv1.txt",skip=475)

grep for “Table” (or any other keywords)

Note: This is why I manually skipped over the table of contents, since it lists all the tables.

tables=which(grepl("Table",rawSkip[[1]]))

Simplify the problem further to start

Even Table 1 has lots of components and weird structure, so let’s simplify to the smallest chunk in Table 1 that seems to stand alone.

toSave=rawSkip[tables[1]:(tables[1]+10),]%>% as.data.frame()
toSave
##                                                                                                                                                                                            X1
## 1                                                                                                                             Table 1.  Historical Highlights:  2012 and Earlier Census Years
## 2                                                                                                                          [For meaning of abbreviations and symbols, see introductory text.]
## 3  ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
## 4                                                                         :                :                :                :                :                     Not adjusted for coverage
## 5                                                    :                :                :                :                :-------------------------------------------------------------------
## 6                             All farms                    :      2012      :      2007      :      2002      :      1997      :      1997      :      1992      :      1987      :      1982
## 7  ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
## 8     Farms ......................................number:    2,109,303        2,204,792        2,128,982        2,215,876        1,911,859        1,925,300        2,087,759        2,240,976
## 9     Land in farms ...............................acres:  914,527,657      922,095,840      938,279,056      954,752,502      931,795,255      945,531,506      964,470,625      986,796,579
## 10        Average size of farm ....................acres:          434              418              441              431              487              491              462              440
## 11                                                                                                                                                                                          :

Trim a bunch of extra characters using gsub

I initially thought that some of these extra characters (like “:”) would help as delimiters, but because the first column is separated from the rest differently than the other columns are from each other, using the extra characters to split things up didn’t end up working.

toSave[,1]=gsub(":"," ",toSave[,1])
toSave[,1]=gsub("\\.","",toSave[,1])

Because white space acts as a delimiter everywhere except the first column, I wanted to replace the spaces between words in the first column with something else.

collapseNames=function(x){
  words=unlist(strsplit(x," ")) ## split the line into individual words
  if(any(words=="")){
    firstBlank=which(words=="")[1]
    toReturna=paste(words[1:(firstBlank-1)],collapse="_") ## collapse the actual words in column 1
    toReturnb=paste(words[firstBlank:length(words)],collapse=" ") ## paste back the info in the remaining columns
    toReturn=paste(toReturna,toReturnb,sep=" ")
  }else{
    toReturn=x ## if the line can't be broken into words, just return it
  }
  return(toReturn)
}

toSave[,1]=unlist(lapply(toSave[,1],collapseNames))
toSave[5:10,1]
## [1] "                                                                      -------------------------------------------------------------------"                                                 
## [2] "All_farms                           2012             2007             2002             1997             1997             1992             1987             1982"                           
## [3] "------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------"
## [4] "Farms_number     2,109,303        2,204,792        2,128,982        2,215,876        1,911,859        1,925,300        2,087,759        2,240,976"                                         
## [5] "Land_in_farms_acres   914,527,657      922,095,840      938,279,056      954,752,502      931,795,255      945,531,506      964,470,625      986,796,579"                                  
## [6] "Average_size_of_farm_acres           434              418              441              431              487              491              462              440"

Save a subset and read it back in (using read_table2) with a forced number of columns (using col_names)

read_table2 “allows any number of whitespace characters between columns, and the lines can be of different lengths.”

By supplying column names, we can ensure that the desired number of columns is respected.

setwd("~/Desktop")
write.table(toSave,"tmp.txt",row.names=F,col.names=F)
readIn=read_table2("tmp.txt",skip=5,col_names=paste("V",1:9,sep=""))
head(readIn)
## # A tibble: 6 x 9
##   V1          V2          V3      V4      V5      V6      V7      V8 V9   
##   <chr>       <chr>    <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl> <chr>
## 1 "\"All_far… 2012    2.01e3  2.00e3  2.00e3  2.00e3  1.99e3  1.99e3 "198…
## 2 "\"-------… <NA>   NA      NA      NA      NA      NA      NA      <NA> 
## 3 "\"Farms_n… 2,109…  2.20e6  2.13e6  2.22e6  1.91e6  1.93e6  2.09e6 "2,2…
## 4 "\"Land_in… 914,5…  9.22e8  9.38e8  9.55e8  9.32e8  9.46e8  9.64e8 "986…
## 5 "\"Average… 434     4.18e2  4.41e2  4.31e2  4.87e2  4.91e2  4.62e2 "440…
## 6 "\""        "\""   NA      NA      NA      NA      NA      NA      <NA>

Parse numbers (using parse_number) and remove empty rows

We convert strings that are clearly numbers into actual numbers (removing commas, etc.) using parse_number. This also has the added benefit of turning filler strings into NA values in the columns where we expect numbers.
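As a small illustration of what parse_number does on its own (separate from the table; the example strings here are just made up to mimic the file):

```r
library(readr)

# parse_number() strips grouping commas and surrounding non-numeric text
parse_number("2,109,303")   # 2109303
parse_number("434 acres")   # 434
# a filler string contains no number, so it parses to NA (with a warning)
parse_number("----------")
```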

readIn[,2:9]=apply(readIn[,2:9],2,parse_number)

readIn=readIn[-which(is.na(readIn[,2]) & is.na(readIn[,3])),]

readIn
## # A tibble: 4 x 9
##   V1              V2      V3      V4      V5      V6      V7     V8     V9
##   <chr>        <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>  <dbl>  <dbl>
## 1 "\"All_fa…  2.01e3  2.01e3  2.00e3  2.00e3  2.00e3  1.99e3 1.99e3 1.98e3
## 2 "\"Farms_…  2.11e6  2.20e6  2.13e6  2.22e6  1.91e6  1.93e6 2.09e6 2.24e6
## 3 "\"Land_i…  9.15e8  9.22e8  9.38e8  9.55e8  9.32e8  9.46e8 9.64e8 9.87e8
## 4 "\"Averag…  4.34e2  4.18e2  4.41e2  4.31e2  4.87e2  4.91e2 4.62e2 4.40e2

Get more ambitious…

Now that we have things working for a subset of the table, let’s try to do the same thing for the rest of the document.

toSave=rawSkip[tables[1]:nrow(rawSkip),]%>% as.data.frame()
toSave[,1]=gsub(":"," ",toSave[,1])
toSave[,1]=gsub("\\.","",toSave[,1])
toSave[,1]=unlist(lapply(toSave[,1],collapseNames))
setwd("~/Desktop")
write.table(toSave,"tmp.txt",row.names=F,col.names=F)
readIn=read_table2("tmp.txt",skip=5,col_names=paste("V",1:9,sep=""))

readIn[,2:9]=apply(readIn[,2:9],2,parse_number)

Deal with non-ASCII characters and remove extra rows

The first column of this data frame contained some non-ASCII characters, which were giving the string-related functions a hard time. We remove those and then get rid of some extra rows.

The resulting file still has some rather sparse rows, but it maintains the structure of the document rather well. We could now grep for the things we are interested in and easily get the numeric values associated with that chunk of the file.

#https://stackoverflow.com/questions/9934856/removing-non-ascii-characters-from-data-files
readIn[,1]=iconv(readIn[,1], "latin1", "ASCII", sub="")
readIn=as.data.frame(readIn) ## tibble is giving me a hard time here


readIn=readIn[-which(is.na(readIn[,2]) & is.na(readIn[,3]) & nchar(readIn[,1])<=1),]
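To show what that grepping could look like, here is a toy version of the cleaned frame (the labels and values are copied from the Table 1 output above, not the full file):

```r
# Toy version of the cleaned frame: grep the label column, read off numbers
df <- data.frame(
  V1 = c("Farms_number", "Land_in_farms_acres", "Average_size_of_farm_acres"),
  V2 = c(2109303, 914527657, 434),
  stringsAsFactors = FALSE
)

# pull the acreage rows by pattern-matching the collapsed labels
df[grepl("acres", df$V1), ]
```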

An interesting dead end…

Future Pipe Dream: I wanted to make a tokenizer that would first try one delimiter and, if that didn’t split the line into the desired number of columns, fall back to another one (using tokenizer_delim).
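A rough sketch of that fallback idea, using plain base-R strsplit rather than readr’s tokenizers (the helper name and default arguments are my own inventions):

```r
# Hypothetical helper: try delimiters in order until one yields the
# desired number of fields; otherwise give up and return the raw line.
split_with_fallback <- function(line, delims = c(":", "\\s+"), n_wanted = 9) {
  for (d in delims) {
    fields <- unlist(strsplit(line, d))
    fields <- fields[fields != ""]  # drop empty fields between delimiters
    if (length(fields) == n_wanted) return(fields)
  }
  line  # no delimiter produced the right shape
}

split_with_fallback("a:b:c", delims = c(":", "\\s+"), n_wanted = 3)
# c("a", "b", "c")
```

A real version would live inside a readr callback so the fallback could happen line by line during parsing.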

I used read_delim_chunked to see how many columns a particular delimiter would produce. I still think designing better callbacks could help make a more flexible tokenizer, but further investigation will have to be saved for later (perhaps for a future post).

f=function(x,pos){length(which(!is.na(x)))} ## tell how many columns the data is actually put in 
f2=function(x,pos){x} ## show what the output will look like if we use this delimiter

setwd("~/Desktop")
test=read_delim_chunked("tmp.txt",delim="   ",callback=DataFrameCallback$new(f),chunk_size=1,col_names=paste("V",1:9,sep=""))
## want this to be 9

#Note: the callback happens per chunk, so I made the chunk_size 1 to just get the answers per line

test2=read_delim_chunked("tmp.txt",delim="   ",callback=DataFrameCallback$new(f2),chunk_size=1,col_names=paste("V",1:9,sep=""))

head(test)
##      [,1]
## [1,]    1
## [2,]    1
## [3,]    1
## [4,]    1
## [5,]    1
## [6,]    1
head(test2)
## # A tibble: 6 x 9
##   V1                       V2    V3    V4    V5    V6    V7    V8    V9   
##   <chr>                    <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 Table_1  Historical Hig… <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA> 
## 2 [For meaning of abbrevi… <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA> 
## 3 -----------------------… <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA> 
## 4 "                      … <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA> 
## 5 "                      … <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA> 
## 6 All_farms              … <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>

This post took much more time than I anticipated, but it is reassuring that we can eventually make some sense of this poorly formatted data. Even though this approach seems specialized, there are files for other agricultural census years that I hope keep a consistent, if gross, format, so we could repurpose this code for them. I do wonder if we could do something more clever, perhaps just with readLines, so I’m open to other ideas.

Sidenote: If anyone has a similarly convoluted way to wrangle a particular type of government data, I would be curious to see the approach. It would be awesome if we could collect these approaches in one place. Even if they are hacky, they would help increase the accessibility of data that hasn’t been API-ified yet.

Feedback, questions, comments, etc. are welcome (@sastoudt).