Can you imagine a database of datasets paired with a tool that automatically downloads, preprocesses, and imports them into the format you want? For this week’s RSEPedia Software Survey, we introduce you to weecology/retriever, a tool that does just that.
Are you already familiar with this software? We encourage you to contribute to the Research Software Encyclopedia and annotate the repository. Otherwise, keep reading!
If you’ve ever been a student or researcher, you know that data is gold. The quality and size of a dataset can make or break your analysis, or as they like to say, “Trash in, trash out!” This is why we should be so excited about a tool like the Data Retriever. In only a few commands with the retriever client, you can install a dataset, meaning that you download it along with rich metadata, clean and standardize it, and then import it into your final destination of choice. That could be a full-blown relational database, or a flat file like CSV or JSON.
For example, first we might install retriever:
pip install retriever
And then use the install command to select a dataset:
usage: retriever install [-h] [--compile] [--debug]
                         {mysql,postgres,sqlite,msaccess,csv,json,xml} ...

positional arguments:
  {mysql,postgres,sqlite,msaccess,csv,json,xml}
                        engine-specific help
    mysql               MySQL
    postgres            PostgreSQL
    sqlite              SQLite
    msaccess            Microsoft Access
    csv                 CSV
    json                JSON
    xml                 XML

optional arguments:
  -h, --help            show this help message and exit
  --compile             force re-compile of script before downloading
  --debug               run in debug mode
Do you see how many options for databases you can use? When data is shared online (if it is shared at all), you usually have to find it, download one or more files, and then clean and preprocess it before loading it into the destination of your choice. This tool does all of that for you, saving you time and energy, and ensuring that the dataset is consistent between users. And hold on to your hat: there are over 200 datasets available in this manner, covering everything from demographics to agriculture and geography.
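To make the install command more concrete, here is a minimal Python sketch that assembles a retriever install call from an engine and a dataset name. The helper build_install_command is hypothetical (it is not part of retriever), and the dataset name iris is assumed purely for illustration. The list of engines and the --debug flag are taken from the help output above.

```python
import shlex

# Engines listed in the `retriever install` help output.
ENGINES = ("mysql", "postgres", "sqlite", "msaccess", "csv", "json", "xml")


def build_install_command(engine, dataset, debug=False):
    """Return the argument list for `retriever install <engine> <dataset>`.

    Hypothetical helper for illustration; it only builds the command
    and does not run it.
    """
    if engine not in ENGINES:
        raise ValueError(f"unknown engine {engine!r}; expected one of {ENGINES}")
    command = ["retriever", "install"]
    if debug:
        # Per the usage line, --debug comes before the engine name.
        command.append("--debug")
    command += [engine, dataset]
    return command


# "iris" is an assumed dataset name, used only as an example.
command = build_install_command("csv", "iris")
print(shlex.join(command))
```

If retriever is installed and you have network access, you could pass the resulting list to subprocess.run(command, check=True) to perform the install. Building the command as a list rather than a single string avoids shell quoting problems with unusual dataset names.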
You can use the following Zenodo citation for the software. What is so great about this citation is that we can see how many folks worked together to make the software!
@software{ben_morris_2017_1038272,
author = {Ben Morris and
Ethan White and
Henry Senyondo and
Akash Goel and
Shivam Negi and
Elita Baldridge and
Andrew Zhang and
Dan McGlinn and
Akshay and
David J. Harris and
Kate Thibault and
Deborah Gertrude Digges and
Pankaj and
Paul Wolf and
Kapil kumar and
Amritanshu jain and
Sarah Reehl and
Kunal Pal and
Kevin Amipara and
Erica Christensen and
Yanghao Li and
Xiao Xiao and
Kristina Riemer and
Saket Choudhary and
Morgan Ernest and
James Quadrino and
David LeBauer and
carol-rowe666 and
Bishakh Ghosh and
Barry Wark},
title = {weecology/retriever: v2.1.0},
month = oct,
year = 2017,
publisher = {Zenodo},
version = {v2.1.0},
doi = {10.5281/zenodo.1038272},
url = {https://doi.org/10.5281/zenodo.1038272}
}
You probably want to start with your programming language of interest: the Data Retriever is available for both R and Python!
To contribute, you can clone the software repository to do bulk annotation, or annotate any repository in the software database. We want annotation to be fun, straightforward, and easy, so we will be showcasing one repository to annotate per week. If you’d like to request annotation of a particular repository (or its addition to the software database), please don’t hesitate to open an issue or even a pull request.
You might find these other resources useful:
For any resource, you are encouraged to give feedback and contribute!