34 changes: 34 additions & 0 deletions .github/workflows/action-test-before-PR.yml
@@ -0,0 +1,34 @@
name: Run SOMEF tests before PR

on:
  pull_request:
    branches:
      - main #should be this branch or master?
  workflow_dispatch:
jobs:
  test:
    runs-on: ubuntu-latest

    steps:
      - name: Checkout code
        uses: actions/checkout@v2

      - name: Set up Python
        uses: actions/setup-python@v2
        with:
          python-version: "3.11"

      - name: Install Poetry
        run: curl -sSL https://install.python-poetry.org | python3 -

      - name: Install dependencies
        run: poetry install

      - name: Download NLTK data
        run: poetry run python -m nltk.downloader wordnet omw-1.4 punkt punkt_tab stopwords

      - name: Configure SOMEF
        run: poetry run somef_core configure -a

      - name: Run pytest
        run: poetry run pytest -v src/somef_core/test/
4 changes: 3 additions & 1 deletion .gitignore
@@ -1,10 +1,12 @@
build/*
dist/*
env*/*
somef_core_env/
.vscode/
.ipynb_checkpoints/
config.json
__pycache__/
**/__pycache__/
.DS_Store
Lib/*
Scripts/*
@@ -13,7 +15,7 @@ Scripts/*
*.ttl
env_3.9/*
src/somef_core/create_corpus_for_NER.py
src/somef.egg-info/*
src/somef_core.egg-info/*
local_tests/*
Dockerfile_old
repos.txt
23 changes: 14 additions & 9 deletions CITATION.cff
@@ -1,9 +1,16 @@
title: "SOMEF: Software metadata extraction framework"
license: Apache-2.0
# This CITATION.cff file was generated with cffinit.
# Visit https://bit.ly/cffinit to generate yours today!

cff-version: 1.2.0
title: 'SOMEF: Software metadata extraction framework'
message: >-
If you use this software, please cite both the article
from preferred-citation and the software itself.
type: software
authors:
- family-names: Garijo
given-names: Daniel
orcid: "http://orcid.org/0000-0003-0454-7145"
orcid: 'https://orcid.org/0000-0003-0454-7145'
- family-names: Mao
given-names: Allen
- family-names: Dharmala
@@ -20,18 +27,16 @@ authors:
given-names: Jenifer
- family-names: Mendoza
given-names: Juanje

cff-version: 1.2.0
message: "If you use this software, please cite both the article from preferred-citation and the software itself."
license: Apache-2.0
preferred-citation:
authors:
- family-names: Kelley
given-names: Aidan
- family-names: Garijo
given-names: Daniel
title: "A Framework for Creating Knowledge Graphs of Scientific Software Metadata"
title: A Framework for Creating Knowledge Graphs of Scientific Software Metadata
type: article
journal: "Quantitative Science Studies"
pages: "1-37"
journal: Quantitative Science Studies
pages: 1-37
year: 2021
doi: 10.1162/qss_a_00167
2 changes: 1 addition & 1 deletion Dockerfile
@@ -6,7 +6,7 @@ RUN curl -sSL https://install.python-poetry.org | python3 -

RUN pip install poetry-plugin-shell

WORKDIR "/somef-core"
WORKDIR "/somef_core"

RUN poetry install

120 changes: 99 additions & 21 deletions README.md
@@ -1,16 +1,15 @@
# Software Metadata Extraction Framework CORE (SOMEF-core)

TO DO: Add other badges.

[![Documentation Status](https://readthedocs.org/projects/somef-core/badge/?version=latest)](https://somef-core.readthedocs.io/en/latest/?badge=latest)

[![Documentation Status](https://readthedocs.org/projects/somef/badge/?version=latest)](https://somef.readthedocs.io/en/latest/?badge=latest)
[![Python](https://img.shields.io/pypi/pyversions/somef.svg?style=plastic)](https://badge.fury.io/py/somef) [![PyPI](https://badge.fury.io/py/somef.svg)](https://badge.fury.io/py/somef) [![DOI](https://zenodo.org/badge/190487675.svg)](https://zenodo.org/badge/latestdoi/190487675) [![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/KnowledgeCaptureAndDiscovery/somef/HEAD?filepath=notebook%2FSOMEF%20Usage%20Example.ipynb) [![Project Status: Active – The project has reached a stable, usable state and is being actively developed.](https://www.repostatus.org/badges/latest/active.svg)](https://www.repostatus.org/#active)

<img src="docs/logo.png" alt="logo" width="150"/>

A command line interface for automatically extracting relevant metadata from code repositories (readme, configuration files, documentation, etc.).

This repository is extracted from https://github.com/SciCodes/somef-core/issues
This repository is extracted from https://github.com/SciCodes/somef-core/

**Demo:** See a [demo running somef as a service](https://somef.linkeddata.es), through the [SOMEF-Vider tool](https://github.com/SoftwareUnderstanding/SOMEF-Vider/).

**Authors:** Daniel Garijo, Allen Mao, Miguel Ángel García Delgado, Haripriya Dharmala, Vedant Diwanji, Jiaying Wang, Aidan Kelley, Jenifer Tabita Ciuciu-Kiss, Luca Angheluta and Juanje Mendoza.

@@ -19,6 +18,13 @@ This repository is extracted from https://github.com/SciCodes/somef-core/issues
Given a readme file (or a GitHub/Gitlab repository) SOMEF will extract the following categories (if present), listed in alphabetical order:

- **Acknowledgement**: Text acknowledging funding sources or contributors
- **Application domain**: The application domain of the repository. Current supported domains include: Astrophysics, Audio, Computer vision, Graphs, Natural language processing, Reinforcement learning, Semantic web, Sequential. Domains are not mutually exclusive. These domains have been extracted from [awesome lists](https://github.com/topics/awesome-list) and [Papers with code](https://paperswithcode.com/). Find more information in our [documentation](https://somef.readthedocs.io/en/latest/).
- **Authors**: Person(s) or organization(s) responsible for the project. We recognize the following properties:
- Name: name of the author (including last name)
- Given name: First name of an author
- Family name: Last name of an author
- Email: email of author
- URL: website or ORCID associated with the author
- **Build file**: Build file(s) of the project. For example, files used to create a Docker image for the target software, package files, etc.
- **Citation**: Preferred citation as the authors have stated in their readme file. SOMEF recognizes Bibtex, Citation File Format files and other means by which authors cite their papers (e.g., by in-text citation). We aim to recognize the following properties:
- Title: Title of the publication
@@ -43,7 +49,7 @@ Given a readme file (or a GitHub/Gitlab repository) SOMEF will extract the follo
- **Forks url**: Links to forks made of the project
- **Full name**: Name + owner (owner/name)
- **Full title**: If the repository is a short name, we will attempt to extract the longer version of the repository name
- **Identifier**: Identifier associated with the software (if any), such as Digital Object Identifiers. DOIs associated with publications will also be detected.
- **Identifier**: Identifier associated with the software (if any), such as Digital Object Identifiers and Software Heritage identifiers (SWH). DOIs associated with publications will also be detected.
- **Images**: Images used to illustrate the software component
- **Installation instructions**: A set of instructions that indicate how to install a target repository
- **Invocation**: Execution command(s) needed to run a scientific software component
@@ -52,6 +58,7 @@ Given a readme file (or a GitHub/Gitlab repository) SOMEF will extract the follo
- **License**: License and usage terms of a software component
- **Logo**: Main logo used to represent the target software component
- **Name**: Name identifying a software component
- **Ontologies**: URL and path to the ontology files present in the repository
- **Owner**: Name and type of the user or organization in charge of the repository
- **Package distribution**: Links to package sites like pypi in case the repository has a package available.
- **Package files**: Links to package files used to wrap the project in a package.
@@ -69,18 +76,23 @@ Given a readme file (or a GitHub/Gitlab repository) SOMEF will extract the follo
- Link to the tarball zip and code of the release
- **Repository status**: Repository status as it is described in [repostatus.org](https://www.repostatus.org/).
- **Requirements**: Pre-requisites and dependencies needed to execute a software component
- **Run**: Running instructions of a software component. It may be wider than the `invocation` category, as it may include several steps and explanations.
- **Runtime platform**: Runtime platform or script interpreter dependencies required to run the project.
- **Script files**: Bash script files contained in the repository
- **Stargazers count**: Total number of stargazers of the project
- **Support**: Guidelines and links of where to obtain support for a software component
- **Support channels**: Help channels one can use to get support about the target software component
- **Type**: type of software (command line application, notebook, ontology, scientific workflow, etc.)
- **Usage examples**: Assumptions and considerations recorded by the authors when executing a software component, or examples on how to use it
- **Workflows**: URL and path to the computational workflow files present in the repository

We use different supervised classifiers, header analysis, regular expressions, the GitHub/GitLab API and language-specific metadata parsers (e.g., for package files) to retrieve all these fields (more than one technique may be used for each field). Each extraction records its provenance, with the confidence and technique used on each step. For more information check the [output format description](https://somef.readthedocs.io/en/latest/output/).
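
As a rough illustration of how the recorded confidence and technique might be used downstream, the sketch below keeps only the entries above a chosen confidence. The JSON layout (a mapping from category name to a list of entries with `result`, `confidence` and `technique` keys), the helper name `filter_by_confidence` and all sample values are assumptions made for this example; check the output format description for the actual schema.

```python
from typing import Any

# Hypothetical sample shaped like the ASSUMED output layout.
# Every value here is invented for illustration only.
sample_output: dict[str, list[dict[str, Any]]] = {
    "description": [
        {"result": {"value": "A documentation generator"},
         "confidence": 0.93, "technique": "supervised_classification"},
        {"result": {"value": "See the wiki for details"},
         "confidence": 0.62, "technique": "supervised_classification"},
    ],
    "license": [
        {"result": {"value": "Apache-2.0"},
         "confidence": 1.0, "technique": "GitHub_API"},
    ],
    "acknowledgement": [
        {"result": {"value": "Thanks to everyone"},
         "confidence": 0.4, "technique": "header_analysis"},
    ],
}


def filter_by_confidence(
    output: dict[str, list[dict[str, Any]]], minimum: float
) -> dict[str, list[dict[str, Any]]]:
    """Keep entries whose confidence is at least `minimum`; drop empty categories."""
    filtered: dict[str, list[dict[str, Any]]] = {}
    for category, entries in output.items():
        kept = [e for e in entries if e.get("confidence", 0.0) >= minimum]
        if kept:
            filtered[category] = kept
    return filtered


confident = filter_by_confidence(sample_output, 0.8)
for category, entries in confident.items():
    for entry in entries:
        print(category, entry["technique"], entry["confidence"])
```

To apply the same filter to a real run, the dictionary would be loaded with `json.load` from the file passed to `-o`, provided the file follows the layout assumed here.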

## Documentation

The documentation for somef-core is available at [https://somef-core.readthedocs.io/en/latest/](https://somef-core.readthedocs.io/en/latest/)
The full documentation for somef-core is available at [https://somef-core.readthedocs.io/en/latest/](https://somef-core.readthedocs.io/en/latest/)

## Cite SOMEF and SOMEF-core:
## Cite SOMEF-core:

Journal publication (preferred):

@@ -132,7 +144,7 @@ To run somef_core, please follow the next steps:
Clone this GitHub repository

```
git clone https://github.com/KnowledgeCaptureAndDiscovery/somef-core.git
git clone https://github.com/SciCodes/somef-core.git
```

We use [Poetry](https://python-poetry.org/) to ensure library compatibility. It can be installed as follows:
@@ -145,32 +157,36 @@ This option is recommended over installing Poetry with pip install.

Now Poetry will handle the installation of SOMEF-core and all its dependencies configured in the `toml` file.

To test the correct installation of poetry run:
To check that Poetry is installed correctly, run the following command (Poetry version `> 2.0.0` is expected):

```
poetry --version
```

Install somef and all their dependencies.
Install somef-core and all its dependencies.

```
cd /somef_core
poetry install
```

Now we need to access our virtual environment, to do so you have to install the [poetry plugin shell](https://github.com/python-poetry/poetry-plugin-shell) and run the following command:
Now we need to access our virtual environment. To do so, run the following command:

```bash
poetry env activate
```
pip install poetry-plugin-shell
```
After `shell` is set up, you can run the following command to access the virtual environment
```
poetry shell
If the environment is not active, paste the command shown when `poetry env activate` is run, typically something like the command below:

```bash
source /path_to_env/ENV_NAME/bin/activate
```
Test SOMEF installation

To learn more about poetry environment management, visit their official documentation [here](https://python-poetry.org/docs/managing-environments/).

To test the SOMEF-core installation, run:

```bash
somef --help
somef_core --help
```

If everything goes fine, you should see:
@@ -189,7 +205,31 @@ Commands:

## Installing through Docker

We are working on this section
We provide a Docker image with SOMEF already installed. To run through Docker, you may build the Dockerfile provided in the repository by running:

```bash
docker build -t somef .
```

Or just use the Docker image already built in [DockerHub](https://hub.docker.com/r/kcapd/somef):

```bash
docker pull kcapd/somef
```

Then, to run your image just type:

```bash
docker run --rm -it kcapd/somef
```

And you will be ready to use SOMEF (see section below). If you want to have access to the results we recommend [mounting a volume](https://docs.docker.com/storage/volumes/). For example, the following command will mount the current directory as the `out` folder in the Docker image:

```bash
docker run -it --rm -v $PWD/:/out kcapd/somef
```

If you move any files produced by somef into `/out`, then you will be able to see them in your current directory.

## Configure

@@ -228,6 +268,9 @@ Options:
-h, --help Show this message and exit.
```

### Updating SOMEF

If you update SOMEF to a newer version, we recommend configuring the library again (by running `somef configure`). The rationale is that different versions may rely on classifiers that are stored in a different path.

## Usage

@@ -239,6 +282,7 @@ Usage: somef_core describe [OPTIONS]
Running the Command Line Interface

Options:
-t, --threshold FLOAT Threshold to classify the text [required]
Input: [mutually_exclusive, required]
-r, --repo_url URL Github/Gitlab Repository URL
-d, --doc_src PATH Path to the README file source
@@ -250,6 +294,15 @@ Options:
output will be in JSON

-c, --codemeta_out PATH Path to an output codemeta file
-g, --graph_out PATH Path to the output Knowledge Graph export
file. If supplied, the output will be a
Knowledge Graph, in the format given in the
--format option chosen (turtle, json-ld)

-f, --graph_format [turtle|json-ld]
If the --graph_out option is given, this is
the format that the graph will be stored in

-p, --pretty Pretty print the JSON output file so that it
is easy to compare to another JSON output
file.
@@ -264,6 +317,11 @@ Options:
will be stored at the
desired path

-all, --requirements_all Export all detected requirements, including
text and libraries (default).

-v, --requirements_v Export only requirements from structured
sources (pom.xml, requirements.txt, etc.)

-h, --help Show this message and exit.
```
@@ -273,9 +331,29 @@ Options:
The following command extracts all metadata available from [https://github.com/dgarijo/Widoco/](https://github.com/dgarijo/Widoco/).

```bash
somef_core describe -r https://github.com/dgarijo/Widoco/ -o test.json
somef_core describe -r https://github.com/dgarijo/Widoco/ -o test.json -t 0.8
```
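
For scripted runs, the same invocation can be assembled from Python. The sketch below only builds the argument list and checks that exactly one input source is given; it uses the `-r`, `-d`, `-o` and `-t` flags from the help text above, covers only those two input options, and the helper name `build_describe_command` is hypothetical (it is not part of SOMEF-core).

```python
import shlex
from typing import List, Optional


def build_describe_command(
    repo_url: Optional[str] = None,
    doc_src: Optional[str] = None,
    output: str = "test.json",
    threshold: float = 0.8,
) -> List[str]:
    """Build a `somef_core describe` argument list (hypothetical helper)."""
    # The help text marks the input options as mutually exclusive and required,
    # so exactly one of the two inputs handled here must be provided.
    if (repo_url is None) == (doc_src is None):
        raise ValueError("Provide exactly one of repo_url or doc_src")

    command = ["somef_core", "describe"]
    if repo_url is not None:
        command += ["-r", repo_url]
    else:
        command += ["-d", str(doc_src)]
    command += ["-o", output, "-t", str(threshold)]
    return command


command = build_describe_command(repo_url="https://github.com/dgarijo/Widoco/")
print(shlex.join(command))

# To execute it, somef_core must be installed and configured:
# import subprocess
# subprocess.run(command, check=True)
```

Passing the list to `subprocess.run` instead of a single shell string avoids quoting problems with paths that contain spaces.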

Try SOMEF in Binder with our sample notebook: [![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/KnowledgeCaptureAndDiscovery/somef/HEAD?filepath=notebook%2FSOMEF%20Usage%20Example.ipynb)

## Contribute:

If you want to contribute with a pull request, please do so by submitting it to the `dev` branch.

## Next features:

To see upcoming features, please have a look at our [open issues](https://github.com/SciCodes/somef-core/issues) and [milestones](https://github.com/SciCodes/somef-core/milestones)

## Extending SOMEF categories:

To run a classifier with an additional category or remove an existing one, a corresponding path entry in the config.json should be provided and the category type should be added/removed in the category variable in `cli.py`.

## Metadata Support

SOMEF supports the extraction and analysis of metadata in package files of several programming languages. Current support includes: `setup.py` and `pyproject.toml` for Python, `pom.xml` for Java, `.gemspec` for Ruby, `DESCRIPTION` for R, `bower.json` for JavaScript, HTML or CSS, `.cabal` for Haskell, `cargo.toml` for Rust, `composer` for PHP and `.juliaProject.toml` for Julia, as well as `AUTHORS`, `codemeta.json`, `publiccode.yml`, `dockerfile` and `citation.cff`.
This includes identifying dependencies, runtime requirements, and development tools specified in project configuration files.
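
To make the mapping between package files and languages concrete, here is a minimal sketch that guesses a language from a file name, using some of the file names listed in this section. It is not how SOMEF detects package files: the helper `guess_language`, the case-insensitive matching and the lookup tables are assumptions for illustration, and the PHP and Julia entries are left out because their exact file names are not spelled out here.

```python
from pathlib import PurePosixPath
from typing import Optional

# Exact file names (lower-cased) taken from the list in this section.
EXACT_NAMES = {
    "setup.py": "Python",
    "pyproject.toml": "Python",
    "pom.xml": "Java",
    "description": "R",
    "bower.json": "JavaScript, HTML or CSS",
    "cargo.toml": "Rust",
}

# File extensions taken from the same list.
SUFFIXES = {
    ".gemspec": "Ruby",
    ".cabal": "Haskell",
}


def guess_language(path: str) -> Optional[str]:
    """Return the language associated with a package file, or None if unknown."""
    # Assumption: names are compared case-insensitively.
    name = PurePosixPath(path).name.lower()
    if name in EXACT_NAMES:
        return EXACT_NAMES[name]
    return SUFFIXES.get(PurePosixPath(name).suffix)


for candidate in ["pom.xml", "lib/my_gem.gemspec", "DESCRIPTION", "README.md"]:
    print(candidate, "->", guess_language(candidate))
```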

## Limitations

SOMEF is designed to work primarily with repositories written in English.
Repositories in other languages may not be processed as effectively, and results could be incomplete or less accurate.