When I saw that this week’s blog post was supposed to be about readr I drew a blank on how to get my hands dirty using the functionality in the package. I didn’t want to use the same sample data in the documentation, but I also didn’t want to go scouring for a dataset that I wasn’t really motivated to munge. Then fate stepped in.
At work I wanted to get some data from the USDA that is not available through their API service. Many of the USDA reports come in .txt or .pdf files, and these files contain tables that have information that I needed. I literally spent days trying to get things organized (but I did it!). Even though it still took a long time (and I am not a patient person), I was grateful that I had perused the readr documentation ahead of time in preparation for writing this blog.
I can’t go into too much detail about what I did at work, but instead, I’ll show a representative example of a USDA text file full of tables. This example is actually a bit gnarlier than the stuff I was dealing with, so this seems fair.
Go ahead and look at this original. WHY?! Why must our government store data in a super inaccessible format?! As much as we complain about Excel, I would be grateful for a .xls file here.
But we must persevere.
Disclaimer: I am not claiming that this is the best way to use readr to wrangle the information in these tables. I would love if someone had a less clunky approach they were willing to share.
read_table
I manually looked for the number of lines I could skip before getting to the good stuff.
# A tibble: 6 x 1
`United States`
<chr>
1 "Summary and State Data"
2 "Volume 1 \x95 Geographic Area Series \x95 Part 51"
3 "AC-12-A-51"
4 "Issued May 2014"
5 "United States Department of Agriculture"
6 "Tom Vilsack, Secretary"
rawSkip <- read_table("usv1.txt", skip = 475)
grep
for “Table” (or any other key words)Note: This is why I manually skipped over the table of contents, since it lists all the tables.
Even Table 1 has lots of components and weird structure, so let’s simplify to the smallest chunk in Table 1 that seems to stand alone.
toSave <- rawSkip[tables[1]:(tables[1] + 10), ] %>% as.data.frame()
toSave
[For meaning of abbreviations and symbols, see introductory text.]
1 Table 1. Historical Highlights: 2012 and Earlier Census Years (continued)
2 [For meaning of abbreviations and symbols, see introductory text.]
3 ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
4 : : : : : Not adjusted for coverage
5 : : : : :-------------------------------------------------------------------
6 All farms : 2012 : 2007 : 2002 : 1997 : 1997 : 1992 : 1987 : 1982
7 ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
8 Livestock and poultry: - Con. :
9 :
10 Layers inventory (see text) ...............farms: 198,272 145,615 98,315 (NA) (NA) (NA) (NA) (NA)
11 number: 350,715,978 349,772,508 334,435,155 (NA) (NA) (NA) (NA) (NA)
gsub
I initially thought that some of these extra characters (like “:”) would help as delimiters, but because the first column is broken away from the rest in a different way, using the extra characters to help split things up didn’t end up working.
Because white space acts as a delimiter everywhere except the first column, I wanted to replace the spaces between words in the first column with something else.
collapseNames <- function(x) {
# browser()
words <- unlist(strsplit(x, " ")) ## get individual words
if (length(which(words == "")) > 0) {
toReturna <- paste(words[1:(which(words == "")[1] - 1)], collapse = "_") ## collapse the actual words in column 1
## need to paste the rest
toReturnb <- paste(words[which(words == "")[1]:length(words)], collapse = " ") ## paste on the info in the extra columns
toReturn <- paste(toReturna, toReturnb, sep = " ")
} else {
toReturn <- x ## if can't be broken into words, just return the line
}
return(toReturn)
}
toSave[, 1] <- unlist(lapply(toSave[1:nrow(toSave), 1], collapseNames))
toSave[5:10, 1]
[1] " -------------------------------------------------------------------"
[2] "All_farms 2012 2007 2002 1997 1997 1992 1987 1982"
[3] "------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------"
[4] "Livestock_and_poultry - Con "
[5] " "
[6] "Layers_inventory_(see_text)_farms 198,272 145,615 98,315 (NA) (NA) (NA) (NA) (NA)"
read_table2
) with a forced number of columns (using col_names
)read_table2
“allows any number of whitespace characters between columns, and the lines can be of different lengths.”
By giving column names we can ensure that the desired number of columns is respected.
setwd("~/Desktop")
write.table(toSave, "tmp.txt", row.names = F, col.names = F)
readIn <- read_table2("tmp.txt", skip = 5, col_names = paste("V", 1:9, sep = ""))
head(readIn)
# A tibble: 6 x 9
V1 V2 V3 V4 V5 V6 V7 V8 V9
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 "\"All_farms" "2012" 2007 "2002" 1997 1997 1992 1987 "198…
2 "\"------------… <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
3 "\"Livestock_an… "-" Con "\"" <NA> <NA> <NA> <NA> <NA>
4 "\"" "\"" <NA> <NA> <NA> <NA> <NA> <NA> <NA>
5 "\"Layers_inven… "198,2… 145,6… "98,3… (NA) (NA) (NA) (NA) "(NA…
6 "\"number" "350,7… 349,7… "334,… (NA) (NA) (NA) (NA) "(NA…
parse_number
) and remove empty rowsWe convert strings that are clearly numbers into numbers (remove commas, etc.) using parse_number
. This also has the added benefit of making filler strings into NA values in the columns where we expect numbers.
readIn[, 2:9] <- apply(readIn[, 2:9], 2, parse_number)
readIn <- readIn[-which(is.na(readIn[, 2]) & is.na(readIn[, 3])), ]
readIn
# A tibble: 3 x 9
V1 V2 V3 V4 V5 V6 V7 V8 V9
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 "\"All_farms" 2.01e3 2.01e3 2.00e3 1997 1997 1992 1987 1982
2 "\"Layers_inve… 1.98e5 1.46e5 9.83e4 NA NA NA NA NA
3 "\"number" 3.51e8 3.50e8 3.34e8 NA NA NA NA NA
Now that we got things working for a subset of the table, let’s try to do the same thing for the rest of the document.
setwd("~/Desktop")
write.table(toSave, "tmp.txt", row.names = F, col.names = F)
readIn <- read_table2("tmp.txt", skip = 5, col_names = paste("V", 1:9, sep = ""))
readIn[, 2:9] <- apply(readIn[, 2:9], 2, parse_number)
The first column of this dataframe contained some non-ASCII characters which was giving the string related functions a hard time. We remove those and then get rid of some extra rows.
The resulting file still has some rather sparse rows, but it maintains the structure of the document rather well. We could now grep
for the things we are interested in and easily get the numeric values associated with that chunk of the file.
# https://stackoverflow.com/questions/9934856/removing-non-ascii-characters-from-data-files
readIn[, 1] <- iconv(readIn[, 1], "latin1", "ASCII", sub = "")
readIn <- as.data.frame(readIn) ## tibble is giving me a hard time here
readIn <- readIn[-which(is.na(readIn[, 2]) & is.na(readIn[, 3]) & nchar(readIn[, 1]) <= 1), ]
Future Pipe Dream: I wanted to make a tokenizer that would first try one delimiter, and if it didn’t split the line into the desired number of columns, then it would try the other one (using tokenizer_delim
).
I used read_delim_chunked
to help understand how many columns a particular delimiter would produce. I still think designing better callbacks could help make a more flexible tokenizer, but further investigation will have to be saved for later (perhaps for a future post).
f <- function(x, pos) {
length(which(!is.na(x)))
} ## tell how many columns the data is actually put in
f2 <- function(x, pos) {
x
} ## show what the output will look like if we use this delimiter
setwd("~/Desktop")
test <- read_delim_chunked("tmp.txt", delim = " ", callback = DataFrameCallback$new(f), chunk_size = 1, col_names = paste("V", 1:9, sep = ""))
## want this to be 9
# Note: the callback happens per chunk, so I made the chunk_size 1 to just get the answers per line
test2 <- read_delim_chunked("tmp.txt", delim = " ", callback = DataFrameCallback$new(f2), chunk_size = 1, col_names = paste("V", 1:9, sep = ""))
head(test)
[,1]
[1,] 1
[2,] 1
[3,] 1
[4,] 1
[5,] 1
[6,] 1
head(test2)
# A tibble: 6 x 9
V1 V2 V3 V4 V5 V6 V7 V8 V9
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 "Table_1 Historica… <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
2 "[For meaning of ab… <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
3 "------------------… <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
4 " … <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
5 " … <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
6 "All_farms … <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
This post took much more time than I anticipated, but it is reassuring that eventually we can make some sense of this poorly formatted data. Even though this approach seems specialized, there are files for other agricultural census years that I hope would at least keep a consistent, if gross, formatting approach, that we could repurpose this code for. I do wonder if we could do something more clever, perhaps just with readLines
, so I’m open to other ideas.
Sidenote: If anyone has a similarly convoluted way to wrangle a particular type of government data, I would be curious to see the approach. It would be awesome if we could organize these approaches in one place. Even if they are hacky, it would help increase accessibility of data that hasn’t been API-ified yet.