You might want to interact with a software database via Python.
Why would I want to do that?
You might want to contribute repositories to a database, or just use it for your own analysis. In this walkthrough, we will clone the the github.com/rseng/software database, checkout a new branch, and then make changes that we would intend to do a pull request to contribute back.
Get the Database
While you could start from scratch and create your own database with rse init
and then rse add
, it’s more likely you’ll want to start with an already
existing base. Let’s do that by cloning:
git clone https://github.com/rseng/software
cd software
You will notice an rse.ini
file that has a basic configuration for the filesystem.
While other formats are supported, because we want to keep this database in version
control, a filesystem format is optimal. Once we have our repository cloned,
we can start a shell to interact with it:
$ rse shell
...
client
Out[1]: <rse.main.Encyclopedia at 0x7fb543b25590>
And remember that you can always export the full path to an rse.ini config file,
in the case that you want to be able to run rse shell
from anywhere:
export RSE_CONFIG_FILE=/home/vanessa/Desktop/Code/rseng/software/rse.ini
Commands
All of the commands that are available on the command line (and a few more)! are available to you.
- Exists: determine if a particular repository exists in your local repository
- Add: add a new software repository to your database
- Get: retrieve current metadata for a piece of software, or all software
- Update: update metadata for a single repository or all repositories
-
List all software or software specific to a parser
- Remote: query the rseng/software remote database
- Clear a software repository, all under a parser, or the entire database.
-
Search across your software to find a particular one.
- Criteria: list criteria from the rseng criteria and taxonomy API
Exists
Does a particular software repository exists in your database?
$ client.exists("github.com/singularityhub/sregistry")
True
$ client.exists("github.com/singularityhub/pancakes")
False
Add
Let’s say that we tested if our software of interest existed in the repository,
and then we found that it did not, and want to add it. We can use rse add
to do this.
$ client.add("github.com/sci-f/scif-go")
INFO:rse.main.database.relational:github/sci-f/scif-go was added to the the database.
Ah! But don’t forget that you need to export your RSE_GITHUB_TOKEN
.
If the entry already exists, you will be told that!
$ client.add("github.com/sci-f/scif-go")
ERROR:rse.main:github.com/sci-f/scif-go already exists in the database.
And if the entry doesn’t exist on the remote (e.g., GitHub) you’ll see:
$ client.add("github.com/singularityhub/pancakes")
ERROR:rse.main.parsers.github:Cannot find repository singularityhub/pancakes.
We can also add in bulk from a text file with one repository per line. For example, here is a small one:
github.com/stan-dev/stan
github.com/mathjax/MathJax
github.com/optuna/optuna
github.com/PyTables/PyTables
We would add in bulk as follows:
$ client.bulk_add("repos.txt")
By default, repos that are already added will be skipped over.
Get
A get will retrieve a known identifier. Unlike exists, if it doesn’t exist, it will parse and retrieve it for you.
$ repo = client.get("github.com/singularityhub/sregistry")
# <SoftwareRepository 'github/singularityhub/sregistry'>
You can then inspect, export, or otherwise interact with the repository instance.
repo.data
repo.uid
repo.export()
repo.load()
repo.parser
repo.summary()
repo.timestamp
If you run get without an argument, it will retrieve the last modified entry for you:
$ client.get()
# <SoftwareRepository 'github/sci-f/scif-go'>
Update
Updating an entry coincides with retriving updated metadata. You can do this for an existing software repository:
> client.update("github.com/singularityhub/sregistry")
INFO:rse.main:github/singularityhub/sregistry has been updated.
# <SoftwareRepository 'github/singularityhub/sregistry'>
And of course you cannot update an entry that doesn’t exist.
$ client.update("github.com/singularityhub/noodles")
ERROR:rse.main:github.com/singularityhub/noodles does not exist.
We can also update in bulk from a text file with one repository per line. For example, here is a small one:
github.com/stan-dev/stan
github.com/mathjax/MathJax
github.com/optuna/optuna
github.com/PyTables/PyTables
We would update in bulk as follows:
$ client.bulk_update("repos.txt")
By default, repos that are not present will be skipped over.
List
We can easily list repos with list.
> client.list()
[['github/singularityhub/sregistry'],
['github/scikit-learn/scikit-learn'],
['github/tensorflow/tensorflow'],
['github/mlpack/mlpack'],
['github/sunpy/sunpy'],
['github/stan-dev/stan'],
...
['github/mathjax/MathJax'],
['github/optuna/optuna'],
['github/PyTables/PyTables'],
['github/nteract/nteract'],
['github/yt-project/yt'],
['github/ropensci/rtweet'],
['github/sci-f/scif-go']]
You can also list a particular parser:
> client.list("github")
Clear
If you want to delete a software entry, just use clear with it’s unique id:
> client.clear("github.com/singularityhub/sregistry")
This will delete software github.com/singularityhub/sregistry, are you sure? [n]|y: y
If you don’t want the prompt, add noprompt=True
> client.clear("github.com/singularityhub/sregistry", noprompt=True)
You can also remove an entire parser:
> client.clear("github")
This will delete all github software in the database, are you sure? [n]|y: y
or all software repositories in the database:
> client.clear()
This will delete all software in the database, are you sure? [n]|y: y
Unless you set noprompt
to True, each time you’ll be asked for a confirmation first, in case the command was
run in error.
Search
We can easily search across our software repos with search. For a filesystem database, this means only the filenames.
> client.search("singularity")
[['github/singularityhub/sregistry', 'github', '2020-06-09 17:59:41'],
['github/hpcng/singularity', 'github', '2020-06-09 19:44:04']]
Depending on your database backend, you might retrieve more metadata (the above is for the sqlite backend).
Criteria
The criteria that we use to populate the client is the present version available from [https://rseng.github.io/rseng]. If you need an earlier verison, you can interact with the rseng library directly. To do this with the client here, you can simply list criteria:
> client.list_criteria()
[{'uid': 'RSE-absence',
'name': 'Would taking away the software be a detriment to research?',
'options': ['yes', 'no'],
'date': '2020-06-13 15:04:25 +0000'},
{'uid': 'RSE-citation',
'name': 'Has the software been cited?',
'options': ['yes', 'no'],
'date': '2020-06-13 15:04:25 +0000'},
{'uid': 'RSE-domain-intention',
'name': 'Is the software intended for a particular domain?',
'options': ['yes', 'no'],
'date': '2020-06-13 15:04:25 +0000'},
{'uid': 'RSE-question-intention',
'name': 'Was the software created with intention to solve a research question?',
'options': ['yes', 'no'],
'date': '2020-06-13 15:04:25 +0000'},
{'uid': 'RSE-research-intention',
'name': 'Is the software intended for research?',
'options': ['yes', 'no'],
'date': '2020-06-13 15:04:25 +0000'},
{'uid': 'RSE-usage',
'name': 'Has the software been used by researchers?',
'options': ['yes', 'no'],
'date': '2020-06-13 15:04:25 +0000'}]
Taxonomy
You can also list a flattened version of the taxonomy from from [https://rseng.github.io/rseng].
> client.list_taxonomy()
[{'uid': 'RSE-taxonomy-analysis',
'name': 'Domain-specific analysis software',
'example': 'SPM, fsl, afni for neuroscience',
'date': '2020-06-13 14:48:53 +0000'},
{'uid': 'RSE-taxonomy-application-programming-interfaces',
'name': 'Application Programming Interfaces',
'date': '2020-06-13 14:48:53 +0000'},
{'uid': 'RSE-taxonomy-communication-tools',
'name': 'Communication tools or platforms',
'example': 'email, slack, etc.',
'date': '2020-06-13 14:48:53 +0000'},
{'uid': 'RSE-taxonomy-data-collection',
'name': 'Data collection',
'example': 'web-based experiments or portals',
'date': '2020-06-13 14:48:53 +0000'},
{'uid': 'RSE-taxonomy-databases',
'name': 'Databases',
'date': '2020-06-13 14:48:53 +0000'},
{'uid': 'RSE-taxonomy-domain-hardware',
'name': 'Domain-specific hardware',
'example': 'software for physics to control lab equipment',
'date': '2020-06-13 14:48:53 +0000'},
{'uid': 'RSE-taxonomy-frameworks',
'name': 'Frameworks',
'example': 'to generate documentation, content management systems',
'date': '2020-06-13 14:48:53 +0000'},
{'uid': 'RSE-taxonomy-ide-research',
'name': 'Interactive development environments for research',
'example': 'Matlab, Jupyter',
'date': '2020-06-13 14:48:53 +0000'},
{'uid': 'RSE-taxonomy-numerical libraries',
'name': 'Numerical libraries',
'example': 'includes optimization, statistics, simulation, e.g., numpy',
'date': '2020-06-13 14:48:53 +0000'},
{'uid': 'RSE-taxonomy-operating-systems',
'name': 'Operating systems',
'date': '2020-06-13 14:48:53 +0000'},
{'uid': 'RSE-taxonomy-optimized',
'name': 'Domain-specific optimized software',
'example': 'neuroscience software optimized for GPU',
'date': '2020-06-13 14:48:53 +0000'},
{'uid': 'RSE-taxonomy-personal-scheduling-task-management',
'name': 'Personal scheduling and task management',
'date': '2020-06-13 14:48:53 +0000'},
{'uid': 'RSE-taxonomy-provenance-metadata-tools',
'name': 'Provenance and metadata collection tools',
'date': '2020-06-13 14:48:53 +0000'},
{'uid': 'RSE-taxonomy-text-editors-ides',
'name': 'Text editors and integrated development environments',
'date': '2020-06-13 14:48:53 +0000'},
{'uid': 'RSE-taxonomy-version-control',
'name': 'Version control',
'date': '2020-06-13 14:48:53 +0000'},
{'uid': 'RSE-taxonomy-visualization',
'name': 'Visualization',
'example': 'interfaces to interact with, understand, and see data, plotting tools',
'date': '2020-06-13 14:48:53 +0000'},
{'uid': 'RSE-taxonomy-workflow-managers',
'name': 'Workflow managers',
'date': '2020-06-13 14:48:53 +0000'}]