This section covers the commands of the Research Software Encyclopedia (rse) client. These commands let you interact with a software database, control parsers, and update the metadata stored for your software.

• Exists: determine if a particular repository exists in your local database
• Get: retrieve current metadata for a piece of software, or all software
• Scrape: automated update of new software repositories from a remote resource
• Update: update metadata for a single repository or all repositories
• Label: label a software repository with custom metadata
• List: list all software or software specific to a parser
• Export: export a list of software, or a static interface (with or without annotation)
• Import: import software from a known source (e.g., Google Sheet)
• Clear: clear a software repository, all under a parser, or the entire database
• Search: search across your software to find a particular one
• Summary: summarize your research software database
• Analyze: analyze a specific software repository, indicating a consensus/summary about criteria and taxonomy
• Shell: open a Python shell to interact with an encyclopedia client
• Start: start an interactive dashboard to see software and annotate criteria and taxonomy membership
• Topics: list topics (tags) associated with a repository

## Exists

If you are working locally, the first thing you might want to do is determine if a particular software repository exists in your database. You can thus do:

$ rse exists github.com/singularityhub/sregistry
INFO:rse.main:Database: filesystem
github.com/singularityhub/sregistry does not exist.

This assumes the rse.ini config file is in the present working directory. If not, you should specify it:

$ rse --config_file ../software/rse.ini exists github.com/singularityhub/sregistry


Suppose we checked whether our software of interest exists in the database, found that it does not, and want to add it. We can use rse add to do this.

$ rse --config_file ../software/rse.ini add github.com/singularityhub/sregistry
INFO:rse.main:Database: filesystem
INFO:rse.main:github.com/singularityhub/sregistry was added to the the database.

If the entry already exists, you will be told that!

$ rse add github.com/singularityhub/sregistry
INFO:rse.main:Database: filesystem
ERROR:rse.main:github.com/singularityhub/sregistry already exists in the database.


And if the entry doesn’t exist on the remote (e.g., GitHub) you’ll see:

$ rse add github.com/singularityhub/singularity
INFO:rse.main:Database: filesystem
ERROR:rse.main.parsers.github:Cannot find repository singularityhub/singularity.

We can also add in bulk from a text file with one repository per line. For example, here is a small one:

github.com/stan-dev/stan
github.com/mathjax/MathJax
github.com/optuna/optuna
github.com/PyTables/PyTables

We would add in bulk as follows:

$ rse add --file repos.txt
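If your repository addresses come from mixed sources (full URLs, trailing slashes, duplicates), it may help to normalize them before writing the bulk file. The following is a minimal sketch, not part of rse: the helper names `normalize_repo` and `build_repo_list` are hypothetical, and the only assumption is the `host/owner/name` form shown in the example file above.

```python
from urllib.parse import urlparse


def normalize_repo(line):
    """Reduce a URL or identifier to the host/owner/name form (hypothetical helper)."""
    line = line.strip()
    if not line:
        return None
    # urlparse only fills in the host when a scheme is present, so add one if missing
    parsed = urlparse(line if "://" in line else "https://" + line)
    path = parsed.path.strip("/")
    if path.endswith(".git"):
        path = path[: -len(".git")]
    parts = [part for part in path.split("/") if part]
    if len(parts) < 2:
        # Not enough information to identify a repository
        return None
    return "/".join([parsed.netloc] + parts)


def build_repo_list(lines):
    """Normalize lines and drop blanks and duplicates, keeping the first-seen order."""
    seen = set()
    repos = []
    for line in lines:
        repo = normalize_repo(line)
        if repo and repo not in seen:
            seen.add(repo)
            repos.append(repo)
    return repos


if __name__ == "__main__":
    raw = [
        "github.com/stan-dev/stan",
        "https://github.com/mathjax/MathJax/",
        "github.com/mathjax/MathJax",
        "",
    ]
    # One repository per line, ready to save as repos.txt
    print("\n".join(build_repo_list(raw)))
```

The printed text can be saved as repos.txt and passed to the `--file` option.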


And of course you can add urls for other version control systems that are supported.

$ rse add gitlab.com/singularityhub/gitlab-ci

By default, repos that are already added will be skipped over.

## Get

A get will retrieve a known identifier. Unlike exists, if it doesn’t exist, it will parse and retrieve it for you.

$ rse get github.com/singularityhub/sregistry
INFO:rse.main:Database: filesystem
{
    "parser": "github",
    "uid": "github/singularityhub/sregistry",
    "data": {
        "timestamp": "2020-06-07 14:00:24.727142",
        "url": "https://api.github.com/repos/singularityhub/sregistry",
        "id": 99180575,
        "node_id": "MDEwOlJlcG9zaXRvcnk5OTE4MDU3NQ==",
        "name": "sregistry",
        ...
        "network_count": 32,
        "subscribers_count": 10
    }
}
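Since the record is plain JSON, it can be inspected with standard tools. The sketch below parses a trimmed copy of the record shown above (the elided fields are simply left out) and pulls out a few values. The `describe` function is a hypothetical helper, not part of the rse client.

```python
import json
from datetime import datetime

# A trimmed copy of the record printed by the get command above.
RECORD_TEXT = """
{
    "parser": "github",
    "uid": "github/singularityhub/sregistry",
    "data": {
        "timestamp": "2020-06-07 14:00:24.727142",
        "url": "https://api.github.com/repos/singularityhub/sregistry",
        "name": "sregistry",
        "network_count": 32,
        "subscribers_count": 10
    }
}
"""


def describe(record):
    """Collect a few fields of interest from a metadata record (hypothetical helper)."""
    data = record.get("data", {})
    return {
        "uid": record["uid"],
        "parser": record["parser"],
        "name": data.get("name"),
        "subscribers": data.get("subscribers_count"),
        # The timestamp records when the metadata was retrieved
        "retrieved": datetime.strptime(data["timestamp"], "%Y-%m-%d %H:%M:%S.%f"),
    }


record = json.loads(RECORD_TEXT)
summary = describe(record)
print(summary["uid"], summary["subscribers"], summary["retrieved"].date())
```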


If you run get without an argument, it will retrieve the last modified entry for you:

$ rse get


## Scrape

Adding a repository here and there is manageable, but it would be arduous to consistently look for and add new software repositories by hand. To address this, the Research Software Encyclopedia has a scrape command that allows you to programmatically discover new repos from an external resource. For example, to query the Journal of Open Source Software:

$ rse scrape joss

If you don’t provide a query term, the latest set will be returned. If you do provide a term,

$ rse scrape joss docker


The term will be searched for instead. You can also do a dry run to see the repos found, but not add them to the software repository:

$ rse scrape --dry-run joss

For more detailed scraping, it’s recommended to interact with a scraper from within Python. See the scrapers getting started pages to do this.

## Update

Updating an entry means retrieving updated metadata. You can do this for an existing software repository:

$ rse update github.com/singularityhub/sregistry
INFO:rse.main:Database: filesystem
INFO:rse.main:github/singularityhub/sregistry has been updated.


And of course you cannot update an entry that doesn’t exist.

$ rse update github.com/singularityhub/doesnotexist
INFO:rse.main:Database: filesystem
ERROR:rse.main:github.com/singularityhub/doesnotexist does not exist.

We can also update in bulk from a text file with one repository per line. For example, here is a small one:

github.com/stan-dev/stan
github.com/mathjax/MathJax
github.com/optuna/optuna
github.com/PyTables/PyTables

We would update in bulk as follows:

$ rse update --file repos.txt


By default, repos that are not present will be skipped over. If you want to rewrite existing metadata (for example, if you change the structure of the data) you can add the --rewrite flag:

$ rse update --file repos.txt --rewrite

This works for single repository updates as well.

## Label

Let’s say that we know a DOI (digital object identifier) for a repository, and we want to label it. We can do that as follows:

$ rse label github/singularityhub/sregistry doi 10.5281/zenodo.1012531
INFO:rse.main:Database: sqlite
INFO:rse.main:github/singularityhub/sregistry has been updated.


The above command says: add the metadata value for “doi” to the GitHub repository for Singularity Registry Server. You would then see the value in the metadata:

$ rse get
INFO:rse.main:Database: sqlite
{
    "parser": "github",
    "uid": "github/singularityhub/sregistry",
    "data": {
        "timestamp": "2020-06-19 16:26:07.115675",
        ...
        "subscribers_count": 10,
        "doi": "10.5281/zenodo.1012531"
    }
}

If you try to add a value that already exists, you’ll get a warning and be asked to use --force.

INFO:rse.main:Database: sqlite
doi is already defined for github/singularityhub/sregistry. Use --force to overwrite.

Although it’s less likely for someone to do this on the command line, the function is used by scrapers when a link is found between a software repository and some external DOI.

## List

For the command line, you can easily list repos. For the filesystem database, since we would need to read in several json files, the listing just shows the repo ids. If you do a general list with rse ls, it will show all software ids:

$ rse ls
DATABASE: filesystem
INFO:rse.main:Database: filesystem
1  github/singularityhub/sregistry


Remember that the rse.ini needs to be in the present working directory, or specified with --config_file. You can also list a particular parser:

$ rse ls github
DATABASE: filesystem
INFO:rse.main:Database: filesystem
1  github/singularityhub/sregistry

## Export

If you want to export a flat listing of repos, you can do so like:

$ rse export repos.txt


The default filename is repos.txt, so you could also leave this out:

$ rse export

The --type option can be changed to export a static interface instead, which (if you want annotations) will start the web server and then query endpoints to export static files to a folder of interest. You’re also required to indicate a path.

$ rse export --type static-web docs/


If the folder already exists and you want to overwrite it, it’s suggested to remove it first and then run the export. If you want to overwrite without removal, just add --force:

$ rse export --type static-web --force docs/

To export a static Jekyll interface (without annotation) you can do:

$ rse export --type jekyll-web docs/


Make sure the directory does not exist the first time you export! For times after that, only the inner _software collection will be updated with your current software database. If you’d like a complete tutorial for deploying a static web interface (that automatically updates itself from your sheet) see the rse-jekyll-web repository, where the README provides instructions to deploy a web interface akin to rseng/web.

## Import

The import command can be used to take a remote source of data (e.g., a Google Spreadsheet exposed as a comma separated value export) and add to your database. The following syntax is used for import:

$ rse import --type <type> <param1> ... <paramN>

Generally, any imported source of information must have a unique identifier that distinguishes each record and tells the rse software whether the record already exists and, if so, whether it needs to be updated. Since the rse software typically uses a GitHub or GitLab (or other version control) identifier for this purpose, records that don’t provide such an identifier will be stored under a custom namespace, with an identifier determined by the title. By default, the namespace is “custom” and you can change it by exporting RSE_CUSTOM_DATABASE_DIR, for example:

export RSE_CUSTOM_DATABASE_DIR=research

Since we cannot guarantee that titles are unique, this management will be up to you. As a set of examples, here is how titles might be translated into uids and paths in a filesystem database:

| Title | Identifier | Path |
|---|---|---|
| Research / Acoustic Indices | custom/research/acoustic-indices | database/custom/research/acoustic-indices |
| Acoustic Indices | custom/acoustic-indices | database/custom/acoustic-indices |
| Adobe Audition | custom/adobe-audition | database/custom/adobe-audition |
| Animal Sound Identifier | custom/animal-sound-identifier | database/custom/animal-sound-identifier |
| ANIMAL-SPOT | custom/animal-spot | database/custom/animal-spot |
| ARTWARP | custom/artwarp | database/custom/artwarp |

If you find a title that doesn’t parse well or would like to request a new kind of import, please open an issue. Each importer is described in more detail below.

### CSV

A csv import takes a basic csv (or other delimited) file and imports it. If you don’t want to use the Google Sheets importer, or want to inspect your csv first, this could be an option. The following fields (first row) are expected:

• Title: (required) A human-friendly title to describe the software. If the Url doesn’t have a version control address, this will be parsed for a unique identifier.
• Url: (required) A link to GitHub, GitLab, or another online resource (this will be parsed looking for a unique identifier)
• Description: (required) The description of the software project
• Tags: (optional) A list of tags, comma separated, to parse into the metadata.

Here is an example import:

$ mkdir bioacousics
$ cd bioacousics
$ rse init .
INFO:rse.main.config:Generating configuration file rse.ini


We can then run the import:

$ rse import --type csv software-sheet.csv

You can also tweak the newline parameter, and delimiter (default ,).

$ rse import --type csv --delim="," software-sheet.csv
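To make the expected layout concrete, here is a minimal sketch that builds a csv with the Title, Url, Description, and Tags header described for this importer. The `build_sheet` helper is hypothetical, and the descriptions and tags in the example rows are placeholders for illustration, not data from a real sheet.

```python
import csv
import io

REQUIRED = ["Title", "Url", "Description"]
FIELDS = REQUIRED + ["Tags"]


def build_sheet(rows):
    """Return csv text with the header row used by the csv import (hypothetical helper)."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    for row in rows:
        missing = [field for field in REQUIRED if not row.get(field)]
        if missing:
            raise ValueError(f"missing required fields: {missing}")
        record = {field: row[field] for field in REQUIRED}
        # Tags are optional and stored as one comma separated cell
        record["Tags"] = ",".join(row.get("Tags", []))
        writer.writerow(record)
    return buffer.getvalue()


example_rows = [
    {
        "Title": "Acoustic Indices",
        "Url": "https://github.com/patriceguyot/Acoustic_Indices",
        "Description": "Placeholder description",
        "Tags": ["placeholder-tag", "another-tag"],
    },
    {
        "Title": "ARTWARP",
        "Url": "https://example.org/artwarp",  # placeholder address
        "Description": "Placeholder description",
    },
]

sheet_text = build_sheet(example_rows)
print(sheet_text)
```

Writing `sheet_text` to a file such as software-sheet.csv gives a file in the shape the import command above expects. The csv module quotes the Tags cell automatically when it contains commas.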


### Google Sheet

It might be the case that you want to collect software entries from a form, and then import them via a Google Sheet. We provide an example template sheet that you could provide to rse import with the google-sheet type and the path to its exported csv. The fields (first row) required are the same as for the CSV import detailed above. Once you have your data sheet, you’ll want to make sure to generate a public link to export csv. You can do that via:

File -> Share -> Publish to Web -> Form Responses 1 (or the sheet name in first dropdown) -> Comma-separated value (csv) (second dropdown)


Here is an example from that same sheet.

Then to run the import, let’s say we create a new rse.ini database first:

$ mkdir bioacousics
$ cd bioacousics
$ rse init .
INFO:rse.main.config:Generating configuration file rse.ini

We can then run the import:

$ rse import --type google-sheet "https://docs.google.com/spreadsheets/d/e/2PACX-1vTsPmEWUg8Tr1ZoYTcQ0kTdsCrVskQveSuwfdEHaktHtQG693O4DHQrZotoFd5dXCLAciykAYNf-RSz/pub?gid=0&single=true&output=csv"


Important: the sheet URL must be provided in quotes, because it contains characters (such as & and ?) that the shell would otherwise interpret.
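If you generate the import command from a script, the quoting can be handled for you. This sketch uses Python's standard `shlex.quote` to build a shell-safe command line. The `build_import_command` helper and the `SHEET_ID` placeholder in the URL are hypothetical.

```python
import shlex

# SHEET_ID is a placeholder; substitute the published link for your own sheet.
SHEET_URL = "https://docs.google.com/spreadsheets/d/e/SHEET_ID/pub?gid=0&single=true&output=csv"


def build_import_command(url):
    """Return a shell-safe rse import command line for a published sheet URL."""
    args = ["rse", "import", "--type", "google-sheet", url]
    # shlex.quote wraps any argument containing shell-special characters in quotes
    return " ".join(shlex.quote(arg) for arg in args)


command = build_import_command(SHEET_URL)
print(command)
```

Without the quotes, a shell would treat each & as a request to run the preceding text in the background, and the import would receive a truncated URL.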

The resulting structure will look like the following:

database/
├── custom
│   ├── anabat-insight
│   ├── animal-sound-identifier
│   ├── artwarp
│   └── audacity
└── github
    ├── ChristianBergler
    │   └── ANIMAL-SPOT
    ├── nwolek
    │   └── audiomoth-scripts
    └── patriceguyot
        └── Acoustic_Indices


And if you just want to do a dry run, add --dry-run to the above.

Note: from a developer standpoint, although “import” is its own command, it is represented under “scrapers” in the code, as an import is another format of a scrape.

As a reminder, to change where custom entries are added, you can update RSE_CUSTOM_DATABASE_DIR in your rse/defaults.py, or set it in the environment with export RSE_CUSTOM_DATABASE_DIR=research.
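The translation of titles into custom identifiers described for the import command can be sketched as follows. This is an illustration that reproduces the example titles, identifiers, and paths listed in this section, not the actual rse implementation: the `title_to_uid` and `uid_to_path` helpers are hypothetical, and the exact handling of punctuation in rse may differ.

```python
import os
import re


def title_to_uid(title, namespace=None):
    """Turn a title into a namespace/slug identifier (hypothetical helper)."""
    if namespace is None:
        # Assumption: the namespace follows RSE_CUSTOM_DATABASE_DIR, defaulting to "custom"
        namespace = os.environ.get("RSE_CUSTOM_DATABASE_DIR", "custom")
    slugs = []
    # A slash in the title is treated as a path separator
    for part in title.split("/"):
        # Lowercase, then collapse runs of other characters into single dashes
        slug = re.sub(r"[^a-z0-9]+", "-", part.lower()).strip("-")
        if slug:
            slugs.append(slug)
    return "/".join([namespace] + slugs)


def uid_to_path(uid, root="database"):
    """Location of the entry inside a filesystem database rooted at `root`."""
    return f"{root}/{uid}"


for example in ["Research / Acoustic Indices", "Adobe Audition", "ANIMAL-SPOT"]:
    uid = title_to_uid(example, namespace="custom")
    print(example, "->", uid, "->", uid_to_path(uid))
```

Because two different titles can reduce to the same slug, a check for collisions before importing is worthwhile when titles are not guaranteed to be unique.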

## Clear

If you want to delete a software entry, just use clear with its unique id:

$ rse clear github.com/singularityhub/sregistry
INFO:rse.main:Database: filesystem
This will delete software github.com/singularityhub/sregistry, are you sure? [n]|y: y
INFO:rse.main.database.filesystem:github.com/singularityhub/sregistry has been removed.

You can also remove an entire parser:

$ rse clear github
INFO:rse.main:Database: filesystem
This will delete all github software in the database, are you sure? [n]|y: y


or all software repositories in the database:

$ rse clear
DATABASE: filesystem
This will delete all software in the database, are you sure? [n]|y: y
INFO:rse.main.database.filesystem:Removing /home/vanessa/Desktop/Code/rseng/software/database/github

Each time you’ll be asked for a confirmation first, in case the command was run in error.

## Search

We can easily search across our software repos with search. For a filesystem database, this means only the filenames.

$ rse search term
INFO:rse.main:Database: filesystem


Here is an example with results:

$ rse search singularity
INFO:rse.main:Database: filesystem
1  github/singularityhub/sregistry

For a filesystem database, you can also search across taxonomy and/or criteria items:

$ rse search --taxonomy package
RSE-taxonomy-package-management
1  github/easybuilders/easybuild
2  github/spack/spack

$ rse search --criteria research
RSE-research-intention
1  github/AA-ALERT/AstroData
2  github/fair-software/howfairis
3  github/BrianAronson/birankr
4  github/3D-e-Chem/knime-sstea
5  github/davidebolo1993/TRiCoLOR
6  github/AA-ALERT/AMBER
7  gitlab/davidtourigny/dynamic-fba
8  github/Sulstice/cocktail-shaker
9  github/spack/spack
10  github/snakemake/snakemake
11  github/potree/PotreeConverter
12  github/Effective-Quadratures/Effective-Quadratures
13  github/3D-e-Chem/knime-pharmacophore
14  github/sunpy/sunpy
15  github/AA-ALERT/frbcatdb
16  github/AA-ALERT/frbcat-web
17  github/Parsl/parsl
18  github/JuliaOpt/JuMP.jl
19  github/AA-ALERT/Dedispersion
20  github/scikit-image/scikit-image
21  github/3D-e-Chem/sygma
22  github/nextflow-io/nextflow
23  gitlab/LouisLab/PiVR
24  github/3D-e-Chem/knime-gpcrdb
25  gitlab/cosmograil/PyCS3
26  github/sjvrijn/mf2
27  github/KVSlab/turtleFSI
28  github/ropensci/chirps
29  gitlab/ampere2/metalwalls

The searches are independent, meaning that you might see the same repository in two results listings if it has more than one match for a given taxonomy or criteria item. The same is true for adding a search term at the onset:

$ rse search singularity --taxonomy package
singularity
1  github/hpcng/singularity
2  github/singularityhub/singularity-compose
3  github/singularityhub/sregistry
4  github/eWaterCycle/setup-singularity

RSE-taxonomy-package-management
1  github/spack/spack
2  github/easybuilders/easybuild


## Summary

You might want a quick summary of the annotations, whether taxonomy or criteria, or number of unique users that have annotated your database. The summary command can help you here.

$ rse summary
INFO:rse.main:Database: filesystem
{
    "repos": 86,
    "taxonomy-count": 17,
    "criteria-count": 6,
    "users": {
        "vsoch": {
            "criteria-annotations": 2,
            "taxonomy-annotations": 0
        }
    },
    "taxonomy": {
        "github/vsoch/gridtest": {
            "RSE-taxonomy-numerical-libraries": 1
        },
        "github/singularityhub/sregistry": {
            "RSE-taxonomy-databases": 1,
            "RSE-taxonomy-application-programming-interfaces": 1
        }
    },
    "criteria": {
        "github/singularityhub/sregistry": {
            "yes": 1,
            "no": 0
        },
        "github/singularityhub/singularity-compose": {
            "yes": 0,
            "no": 1
        }
    },
    "users-count": 1
}

You can also ask to show just metrics associated with taxonomy, criteria, or users:

$ rse summary --type criteria
INFO:rse.main:Database: filesystem
{
    "criteria": {
        "github/singularityhub/sregistry": {
            "yes": 1,
            "no": 0
        },
        "github/singularityhub/singularity-compose": {
            "yes": 0,
            "no": 1
        }
    },
    "criteria-count": 6,
    "repos": 86
}

$ rse summary --type taxonomy
INFO:rse.main:Database: filesystem
{
    "taxonomy": {
        "github/vsoch/gridtest": {
            "RSE-taxonomy-numerical-libraries": 1
        },
        "github/singularityhub/sregistry": {
            "RSE-taxonomy-databases": 1,
            "RSE-taxonomy-application-programming-interfaces": 1
        }
    },
    "taxonomy-count": 17,
    "repos": 86
}

$ rse summary --type users
INFO:rse.main:Database: filesystem
{
    "users-count": 1,
    "users": {
        "vsoch": {
            "criteria-annotations": 2,
            "taxonomy-annotations": 0
        }
    },
    "repos": 86
}


or ask to filter down to one repository:

$ rse summary github/singularityhub/sregistry
INFO:rse.main:Database: filesystem
{
    "repo": "github/singularityhub/sregistry",
    "taxonomy-count": 17,
    "criteria-count": 6,
    "users": {
        "vsoch": {
            "criteria-annotations": 1,
            "taxonomy-annotations": 0
        }
    },
    "taxonomy": {
        "github/singularityhub/sregistry": {
            "RSE-taxonomy-databases": 1,
            "RSE-taxonomy-application-programming-interfaces": 1
        }
    },
    "criteria": {
        "github/singularityhub/sregistry": {
            "yes": 1,
            "no": 0
        }
    },
    "users-count": 1
}

## Analyze

Analyze can provide metrics (or calculations) specific to a single repository, or across all repositories. We will start with the single repository example. Let’s say that we want to analyze the repository github.com/singularityhub/sregistry. For criteria, by default it will give you a “final answer” of yes/no depending on the majority, or indicate a tie otherwise. For taxonomy items, it will list all categories with at least 1 vote.

$ rse analyze github/singularityhub/sregistry
INFO:rse.main:Database: filesystem
Summary for github/singularityhub/sregistry

Criteria
1  yes	Would taking away the software be a detriment to research?
2  no	Is the software intended for a particular domain?
3  no	Was the software created with intention to solve a research question?
4  no	Is the software intended for research?
5  yes	Has the software been used by researchers?
6  yes	Has the software been cited?

Taxonomy
1  1	Databases
2  1	Application Programming Interfaces
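The consensus logic described for this command (a majority answer for each criterion, and a vote threshold for taxonomy categories) could look roughly like the sketch below. It is an illustration and not the rse implementation: the function names are hypothetical, the defaults of 0.5 and 1 are assumptions, and so is the choice of a strict comparison for criteria and an inclusive one for taxonomy. The inclusive taxonomy threshold is assumed because the categories listed above each have a single vote.

```python
def criteria_answer(yes, no, cthresh=0.5):
    """Return "yes", "no", or "tie" for one criterion given its vote counts."""
    total = yes + no
    if total == 0:
        return None  # nobody has annotated this criterion
    if yes / total > cthresh:
        return "yes"
    if no / total > cthresh:
        return "no"
    # Neither side exceeds the threshold
    return "tie"


def taxonomy_categories(counts, tthresh=1):
    """Keep the taxonomy categories that reach the vote threshold."""
    return {name: votes for name, votes in counts.items() if votes >= tthresh}


votes = {"Databases": 1, "Application Programming Interfaces": 1}
print(criteria_answer(1, 0))
print(taxonomy_categories(votes))
print(taxonomy_categories(votes, tthresh=2))
```

Under these assumptions, raising the taxonomy threshold to 2 leaves no categories for this repository, since each has only one vote.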


The above shows us that the majority (>50%) said that taking away the software would be a detriment to research, and that it’s been used and cited by researchers. The taxonomy categories that were voted for (by at least one user) include Databases and Application Programming Interfaces. You can change these thresholds easily:

$ rse analyze github/singularityhub/sregistry --cthresh 0.6 --tthresh 2

For the above, we wouldn’t have any taxonomy results because there are none with at least two votes. If you want to do a bulk analysis for all repositories, you are required to use the internal client:

from rse.main import Encyclopedia
client = Encyclopedia()

The client analyze_bulk function takes the same arguments, but returns a large json structure with all repos that have annotations for criteria or taxonomy categories.

client.analyze_bulk()
[{'repo': 'github/vsoch/gridtest',
  'criteria': {},
  'taxonomy': {'RSE-taxonomy-numerical-libraries': 1}},
 {'repo': 'github/singularityhub/sregistry',
  'criteria': {'RSE-absence': 'yes',
   'RSE-domain-intention': 'no',
   'RSE-question-intention': 'no',
   'RSE-research-intention': 'no',
   'RSE-usage': 'yes',
   'RSE-citation': 'yes'},
  'taxonomy': {'RSE-taxonomy-databases': 1,
   'RSE-taxonomy-application-programming-interfaces': 1}},
 {'repo': 'github/singularityhub/singularity-compose',
  'criteria': {'RSE-absence': 'yes',
   'RSE-domain-intention': 'no',
   'RSE-question-intention': 'no',
   'RSE-research-intention': 'no',
   'RSE-usage': 'yes',
   'RSE-citation': 'no'},
  'taxonomy': {}}]

If you want to include empty repos (without votes) set include_empty to True.

## Shell

The shell is a quick way to open up an interactive environment with an encyclopedia client. For example, let’s say we are sitting at the root of a database, such as the repository rseng/software:

git clone git@github.com:rseng/software.git
cd software

This means that we have an rse.ini file, and can then start a shell to interact with the software there:

$ rse shell
INFO:rse.main:Database: sqlite
Python 3.7.4 (default, Aug 13 2019, 20:35:49)
IPython 7.13.0 -- An enhanced Interactive Python. Type '?' for help.

client
Out[1]: <rse.main.Encyclopedia at 0x7f2135346690>


From that point on, you’d be interacting with the Encyclopedia client.

## Start

The rse start command will open a web interface with an interactive table for your tasks.

$ rse start

You can run it in debug mode:

$ rse start --debug


or further customize the port or hostname:

$ rse start --port 8000 --host 0.0.0.0

For each, you can specify a particular action (e.g., delete or re-run) or click on it for further details. See the dashboard documentation page for more details.

## Topics

As of version 0.0.29, the Research Software Encyclopedia has support for topics, primarily for the GitHub parser. If you have an older version you can update your metadata with:

rse update github/<username>/<repo>

or export all names to file, and bulk update:

rse export repos.txt
rse update --file repos.txt

Then you can ask to list topics, for example filtered by a pattern:

$ rse topics --pattern meta
INFO:rse.main:Database: filesystem


If you want to see topics for a single repository, they are part of the standard metadata returned by get:

$ rse get github/<username>/<repo>

If you want to list all unique topics:

$ rse topics
INFO:rse.main:Database: filesystem
cli
client
cloud-native
container
container-friends
container-orchestration
containers
cosmology
date
date-parser
datetime
docker
entity-extraction
hdf5
hpc
html-parsing
information-extraction
linux
management
natural-language-processing
nlp
parallel
particle
portability
portable
registry
reproducible
reproducible-science
rootless-containers
science
singularity
singularity-compose
singularity-container
singularity-containers
singularity-python
singularityhub
web-scraping
webscraping


Finally, you can search for one or more topics, and find repositories that are labeled as such:

$ rse topics --search science
INFO:rse.main:Database: filesystem
github/JuliaLang/julia
github/MD-Studio/MDStudio
github/MDAnalysis/mdanalysis
github/SCM-NV/qmflows
github/SCM-NV/qmflows-namd
github/astropy/astropy
github/hpcng/singularity
github/recipy/recipy
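The listing, pattern filtering, and searching of topics can be pictured as an index from topic to repositories. The sketch below builds such an index from metadata records and is only an illustration: it assumes topics are stored as a list under `data["topics"]` (the key name is an assumption), treats the pattern as a regular expression (also an assumption), and the topics attached to the example records are placeholders except for "science" on github/hpcng/singularity, which appears in the search output above.

```python
import re
from collections import defaultdict


def index_topics(records):
    """Build a mapping of topic -> set of repository uids."""
    index = defaultdict(set)
    for record in records:
        # Assumption: topics live in a list under data["topics"]
        for topic in record.get("data", {}).get("topics", []):
            index[topic].add(record["uid"])
    return index


def filter_topics(index, pattern):
    """Return the sorted topics that match a regular expression."""
    regex = re.compile(pattern)
    return sorted(topic for topic in index if regex.search(topic))


# Placeholder records in the shape returned by get
records = [
    {"uid": "github/hpcng/singularity", "data": {"topics": ["science", "containers"]}},
    {"uid": "github/astropy/astropy", "data": {"topics": ["science"]}},
    {"uid": "github/singularityhub/sregistry", "data": {"topics": ["registry", "containers"]}},
]

index = index_topics(records)
print(filter_topics(index, "contain"))
print(sorted(index["science"]))
```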