Sara A. Stoudt, PhD: DataFest Advice

With DataFest season right around the corner, I wanted to collect some tips for participants.

Patience is not my greatest virtue, so being able to quickly load data, make plots, and look at summaries is key for my initial exploration phase. If you are the same way, I recommend using the data.table::fread() function to read in what presumably is a large data file. This will be faster than the usual read.csv() or read_csv(). Then consider taking a random subset of the data to start with.

data <- data.table::fread("data.csv")

set.seed(315235) ## set the seed so you and your group have the same random sample

idx <- sample(1:nrow(data), 5000, replace = F) ## take a random sample of size 5000

data_sub <- data[idx,]

rm(data) ## clear up your memory

When you first get started as a team, consider taking some solo exploration time. During this time everyone should just play around with the data and investigate what interests them. Ideally you are not talking to one another at this time. You don’t want to mind-meld too quickly; you want a diversity of approaches to the data. Make some ugly plots! Understand the variables. What is not in the data (i.e. consider the missing data as well, there might be a story there)?
During this exploration phase, dig into something niche. The company donating the dataset already knows the most obvious or at least the biggest picture things about the data. What you can contribute is an outsider’s perspective. What little thing do you notice that someone working for the company may have overlooked? This is the time to be weird. :D
When you come back together as a team after individual exploration, take turns describing what you investigated and what you found. After hearing from everyone, discuss what seems most promising and worth pursuing. Is there a common theme in a subset of your investigations? Remember at the end you want to tell a cohesive story rather than something like “here’s 3 weird things we found”.
Once you have decided on the subset of ideas to keep pursuing, reassign more deep-dive tasks. You might want to do this in pairs so that everyone has someone else to bounce ideas off of.
Now this might be controversial, but consider not fitting any models during this deep-dive stage. I know, I know. You want to show off your fancy skills, but hear me out. First, you do not have a lot of time to present your findings, so you want to be able to get to the point. Second, you don’t know the intricacies of the data, but I’ll bet that at least one major assumption of any model you might use is not met. Third, if a model is going to pick up on something interesting, you should be able to design a visualization to show that exact thing. (Corollary: if the finding is weak/sensitive to the model, it won’t be as obvious in the data, and you are putting all your chips on the wrong story) It’s cliche but true: a picture is worth a thousand words (or here, parameters of a model). If you must fit a model though, pick something simple. I’m talking regression; save the machine learning bells and whistles for another time.
Come back together and present your findings. As you are listening to your group members, be thinking about what the broader story is. Consider how you will transition between each sub-group’s approach and findings.
Do not wait until the last minute to put together your slides. Keep your slides simple. Focus on pictures that you can narrate. Make sure all labels are in big enough font. For some ggplot customization tips check out this source.
Practice, practice, practice your talk (with a timer!). It is much harder to give a shorter talk than a longer one. You do not want to end up monologuing and getting cut off before your big reveal.
Have fun! I get pretty competitive myself, so I know it can be hard to zoom out and just enjoy the experience. Talk to other teams. Learn from them. Network with judges and helpers.

Want some more resources?