by Rhea Ehresmann
Experiences in data organization and management
It was 2 AM the week before a fisheries conference, and I was attempting some last-minute analyses. After adding a couple hundred lines of new code for figures, I found my script would no longer run from the beginning, and I had a hunch it was because I hadn’t set everything up correctly in the first place. Ten hours and 25 Stack Exchange posts later, I had the realization no researcher wants to have: my data needed to be completely reorganized from the ground up. I should have organized my data months ago (or ideally from the get-go), but instead I ignored the problem until it became absolutely necessary. Complicating matters further, I am a remote student working in Sitka (most of the faculty and students are in Juneau or Fairbanks), so finding the solution was entirely on me. Data organization is like working out or eating healthy: we all know we should do it, but often it gets the “I’ll do that tomorrow” excuse. Out of this minor meltdown, I learned a lot about data organization and management, and even how to overcome roadblocks independently.
How things go wrong
It often feels that life as a remote student is all about learning things the hard way. Being on my own in Sitka, I can’t walk down the hall to talk with another student about a problem or spontaneously chat with a professor in passing or after class. With only my screen in front of me, I have little opportunity to compare code-writing techniques or data management practices with other students. With my complete dataset nearly 4 million lines long, bad habits took hold without correction: a “run the code until it breaks” philosophy, overwriting and deleting code I didn’t think I’d need, and poor file naming that amounted to tacking another number or “new” onto the end of the name.
While I now avoid these practices as a matter of common sense, I learned to do so only after much trial and error, and after reading other articles on data organization and management in R. There are many other struggles that come with being a distance student, but I believe it ultimately pushes me to be a better researcher. And the best part of being remote is that no one can see your meltdowns! But you don’t have to wait until you are in a panic to start some best practices.
Five data management tips
Careful organization and management of data and code is essential for any analysis. Taking the time to annotate your code clearly, to be consistent with code-writing techniques, and to organize the script files in your directory will do more than keep the project’s data tidy. These steps also protect against data loss and analytical errors while letting your analyses run smoothly from one to the next. There are many tips and tricks for data management taught in classes and online, but here are the top five I’ve learned from my solo trek to good data practices, with links to more information on each:
- Keep raw data as a “flat” table saved in an open data format (like .csv) with records saved in rows, using descriptive and concise names for the data files. Though it is tempting to quickly add in a new column or delete some rows of data in Excel, don’t do it in the raw data file (more here).
- Be diligent and consistent with notation, syntax, and commenting in script files. Comment often! This will help you remember what you did (and why) when revisiting an earlier analysis, and it helps others make sense of your code if you need to share it. Also, don’t delete code you think you don’t need. I can’t count the number of times I’ve gone back to this “unused code” months later, only to tweak it a bit and make it usable. Keep separate analyses in their own script files. Style guides like Google’s and Hadley’s R Style Guides are great resources.
- Use Stack Exchange to solve an issue or google your R questions. I’ve developed a special knack for figuring out how to do obscure things in R by scouring these websites. My philosophy is that anything I’m trying to do in R has already been done, and usually that is the case. It just requires a bit more patience and persistence to identify the search terms needed for finding the solution.
- Back up everything and keep your folders organized. Time spent searching folders for an older version of a script or data file is not efficient. I save script files and raw data files on Google Drive for access from any computer, along with an external hard drive. Dropbox is commonly used because it backs up files automatically. There are other options for version control and backing up data, like Git, that I don’t currently use but that are worth looking into (more here).
- Don’t cut corners to save a few minutes in the short term. It’s easy to make quick fixes to get that figure to render or to reach the desired output, but these shortcuts aren’t worth the time you’ll spend down the road fixing the code or trying to remember what you did. Take the time to establish a good workflow for yourself, and be consistent with it while working on your analyses. There are great resources for establishing a solid workflow (more here).
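To make the first tip concrete, here is a minimal sketch in base R. The folder layout and column names (`tag_id`, `length_mm`) are hypothetical, chosen only for illustration; the point is that the raw file is written once and never edited by hand afterward.

```r
# For illustration, create a tiny "raw" file (in a real project this
# would come straight from the field or the instrument).
dir.create("data/raw", recursive = TRUE, showWarnings = FALSE)
dir.create("data/derived", recursive = TRUE, showWarnings = FALSE)
write.csv(data.frame(tag_id = c(101, NA, 103),
                     length_mm = c(412, 385, 440)),
          "data/raw/detections_raw.csv", row.names = FALSE)

# Read the raw file and treat it as read-only from here on:
# every fix happens in R, never by hand in Excel.
raw <- read.csv("data/raw/detections_raw.csv")

# Cleaning produces a *separate* derived file; the raw .csv is untouched,
# so the whole analysis can always be rerun from the original records.
clean <- raw[!is.na(raw$tag_id), ]
write.csv(clean, "data/derived/detections_clean.csv", row.names = FALSE)
```

Keeping the raw-to-derived step in code means every cleaning decision is documented and repeatable, instead of hidden in an edited spreadsheet.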
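The commenting and workflow tips can be sketched together in one short R example. The step functions and toy numbers here are hypothetical; in a real project each step would live in its own script file, sourced in order.

```r
# Step 1: load the raw records (toy data here for illustration;
# normally this would be read.csv() on the untouched raw file).
read_raw <- function() {
  data.frame(age_yr    = c(1, 2, 3, 4, 5),
             length_mm = c(250, 340, 400, 445, 470))
}

# Step 2: derive the working dataset (drop records with no length).
clean_data <- function(d) d[!is.na(d$length_mm), ]

# Step 3: fit a simple length-at-age model; lengths in mm, ages in years.
fit_growth <- function(d) lm(length_mm ~ log(age_yr), data = d)

# Every run starts from the raw data and moves forward in order, so the
# analysis can always be rerun from the beginning -- exactly what failed
# at 2 AM in the story above.
raw   <- read_raw()
clean <- clean_data(raw)
fit   <- fit_growth(clean)
summary(fit)
```

Writing each step as a named, commented unit makes it obvious where a change belongs, and leaves nothing that only runs because of leftover objects in the workspace.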
Even though data management ranks up there in fun with doing your taxes, folding laundry, and weeding the garden, the short-term pain will save you a lot of headache and time, and leave you with a well-organized and accurate end result. As I move on to new analyses for the next chapter of my thesis, I now know how important it is to set things up correctly from the beginning. Being a remote student through all of this has been challenging, but it has ultimately made me a much stronger and more self-reliant student. Meltdowns not included.
Rhea is a Master’s student in the Coastal Fisheries Ecology Lab and is studying the movement ecology of juvenile sablefish near Sitka, Alaska.