Jan 10, 2021. | By: @vsoch

Can you imagine a database of datasets paired with a tool to automatically download, preprocess, and import into a desired format for your usage? For this week’s RSEPedia Software Survey, we introduce you to weecology/retriever, a tool to do just that.


Are you already familiar with this software? We encourage you to contribute to the research software encyclopedia and annotate the repository. Otherwise, keep reading!

What is data retriever?

If you’ve ever been a student or researcher, you know that data is gold. The quality and size of the dataset can make or break your analysis, or as they like to say, “Trash in, trash out!” This is why we should be so excited about a tool like data retriever. In only a few commands with the retriever client, you can install a dataset, meaning that you download it along with rich metadata, clean and standardize it, and then import it into your final destination of choice. That could mean a full-blown relational database, or a flat file like CSV or JSON.

For example, first we might install retriever:

pip install retriever

And then use the install command, which takes the engine (the output format) to install a dataset into:

usage: retriever install [-h] [--compile] [--debug]
                         {mysql,postgres,sqlite,msaccess,csv,json,xml} ...

positional arguments:
                        engine-specific help
    mysql               MySQL
    postgres            PostgreSQL
    sqlite              SQLite
    msaccess            Microsoft Access
    csv                 CSV
    json                JSON
    xml                 XML

optional arguments:
  -h, --help            show this help message and exit
  --compile             force re-compile of script before downloading
  --debug               run in debug mode

Do you see how many options for databases you can use? When data is shared online (if it is shared at all), you usually have to find it, download one or more files, and then clean and preprocess them into the final destination of your choice. This tool does all of that for you, saving you time and energy, and ensuring that the dataset is consistent between users. And hold on to your hat: there are over 200 datasets available in this manner, covering everything from demographics to agriculture and geography.
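To make the help text above a bit more concrete, here is a minimal sketch in Python that assembles an install command from an engine and a dataset name. It assumes the dataset name follows the engine on the command line, and both the helper `build_install_command` and the dataset name `iris` are hypothetical choices for illustration, not something taken from the retriever documentation. The sketch only builds and prints the command; it does not download anything.

```python
# Engines copied from the `retriever install` help text shown above.
ENGINES = ("mysql", "postgres", "sqlite", "msaccess", "csv", "json", "xml")


def build_install_command(engine, dataset):
    """Return the argument list for `retriever install <engine> <dataset>`.

    Hypothetical helper: it assumes the dataset name comes right after
    the engine, and rejects engines that are not in the help text.
    """
    if engine not in ENGINES:
        raise ValueError(f"unknown engine {engine!r}; expected one of {ENGINES}")
    return ["retriever", "install", engine, dataset]


# "iris" is an assumed dataset name, used here only as an example.
command = build_install_command("csv", "iris")
print(" ".join(command))  # prints: retriever install csv iris
```

Once retriever is installed, the same list could be handed to `subprocess.run(command, check=True)` to actually run the install. Keeping the command as a list rather than a single string avoids shell quoting surprises.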

How do I cite it?

You can use the following Zenodo citation for the software. What is so great about this citation is we can see how many folks worked together to make the software!

  author       = {Ben Morris and
                  Ethan White and
                  Henry Senyondo and
                  Akash Goel and
                  Shivam Negi and
                  Elita Baldridge and
                  Andrew Zhang and
                  Dan McGlinn and
                  Akshay and
                  David J. Harris and
                  Kate Thibault and
                  Deborah Gertrude Digges and
                  Pankaj and
                  Paul Wolf and
                  Kapil kumar and
                  Amritanshu jain and
                  Sarah Reehl and
                  Kunal Pal and
                  Kevin Amipara and
                  Erica Christensen and
                  Yanghao Li and
                  Xiao Xiao and
                  Kristina Riemer and
                  Saket Choudhary and
                  Morgan Ernest and
                  James Quadrino and
                  David LeBauer and
                  carol-rowe666 and
                  Bishakh Ghosh and
                  Barry Wark},
  title        = {weecology/retriever: v2.1.0},
  month        = oct,
  year         = 2017,
  publisher    = {Zenodo},
  version      = {v2.1.0},
  doi          = {10.5281/zenodo.1038272},
  url          = {}

How do I get started?

You probably want to start with your programming language of interest - data retriever is available for both R and Python!

How do I contribute to the software survey?

or read more about annotation here. You can clone the software repository to do bulk annotation, or annotate any repository in the software database. We want annotation to be fun, straightforward, and easy, so we will be showcasing one repository to annotate per week. If you’d like to request annotation of a particular repository (or addition to the software database), please don’t hesitate to open an issue or even a pull request.

Where can I learn more?

You might find these other resources useful:

For any resource, you are encouraged to give feedback and contribute!

