Merged
2 changes: 1 addition & 1 deletion .pre-commit-config.yaml
@@ -19,4 +19,4 @@ repos:
hooks:
- id: check-yaml
- id: end-of-file-fixer
- id: trailing-whitespace
- id: trailing-whitespace
10 changes: 5 additions & 5 deletions README.md
@@ -19,7 +19,7 @@ Concepts have to be identifiable by `rdf:type`.
The training of the predictor requires annotated text.
Each training sample should be annotated with one or more concepts from the thesaurus.

## Installation
## Installation

### Requirements

@@ -32,7 +32,7 @@ stwfsapy is available on [PyPI](pypi.org) . You can install stwfsapy using pip:

This will install a python package called `stwfsapy`.

Note that it is generally recommended to use a [virtual environment](https://docs.python.org/3/tutorial/venv.html) to avoid
Note that it is generally recommended to use a [virtual environment](https://docs.python.org/3/tutorial/venv.html) to avoid
conflicting behaviour with the system package manager.

### From source
@@ -41,7 +41,7 @@ You also have the option to checkout the repository and install the packages fro

```shell
# call inside the project directory
poetry install --without ci
poetry install --without ci
```

## Usage
@@ -105,15 +105,15 @@ Afterwards it can be loaded as follows:
from stwfsapy.predictor import StwfsapyPredictor

StwfsapyPredictor.load('/path/to/storage/location')
```
```

## Contribute

Contributions via pull requests are welcome. Please create an issue beforehand
to explain and discuss the reasons for the respective contribution. We recommend
[forking](https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/fork-a-repo) the repository, if you have not already done so, before working on any possible pull request.

`stwfsapy` code should follow the [Black style] (https://black.readthedocs.io/en/stable/). The Black tool is included as a development dependency; you can run `black .` in the project root to autoformat code. There is also the possibility of doing linting and code formatting with a Git Pre-Commit hook script. To this end a `.pre-commit-config.yaml` configuration file has been added. The [pre-commit](https://pre-commit.com/) tool has been included as a development dependency. You would have to run the command `pre-commit install` inside your local virtual environment. Subsequently, the Black and Ruff tools will automatically check the linting and formatting of modified or new scripts after each time a `git commit` command is executed.
`stwfsapy` code should follow the [Black style](https://black.readthedocs.io/en/stable/). The Black tool is included as a development dependency; you can run `black .` in the project root to autoformat code. There is also the possibility of doing linting and code formatting with a Git Pre-Commit hook script. To this end a `.pre-commit-config.yaml` configuration file has been added. The [pre-commit](https://pre-commit.com/) tool has been included as a development dependency. You would have to run the command `pre-commit install` inside your local virtual environment. Subsequently, the Black and Ruff tools will automatically check the linting and formatting of modified or new scripts after each time a `git commit` command is executed.

## References
[1] [Toepfer, Martin, and Christin Seifert. "Fusion architectures for automatic subject indexing under concept drift" International Journal on Digital Libraries (IJDL), 2018.](https://ris.utwente.nl/ws/portalfiles/portal/248044709/Toepfer2018fusion.pdf)
10 changes: 10 additions & 0 deletions pyproject.toml
@@ -34,6 +34,16 @@ bandit = "~1.8"
[tool.ruff]
target-version = "py310"

[tool.ruff.lint]
select = [
# pycodestyle
"E",
# Pyflakes
"F",
# isort
"I"
]

[build-system]
requires = ["poetry-core"]
build-backend = "poetry.core.masonry.api"
1 change: 1 addition & 0 deletions stwfsapy/automata/construction.py
@@ -14,6 +14,7 @@


from typing import List

from stwfsapy.automata import nfa


6 changes: 3 additions & 3 deletions stwfsapy/automata/conversion.py
@@ -14,10 +14,10 @@


from collections import defaultdict
from typing import Tuple, Set, Dict, FrozenSet, Iterable, List
from stwfsapy.automata import dfa
from stwfsapy.automata import nfa
from queue import Queue
from typing import Dict, FrozenSet, Iterable, List, Set, Tuple

from stwfsapy.automata import dfa, nfa


class NfaToDfaConverter:
2 changes: 1 addition & 1 deletion stwfsapy/automata/dfa.py
@@ -13,7 +13,7 @@
# limitations under the License.


from typing import List, Dict, Iterable, Tuple, Any, Callable, Optional
from typing import Any, Callable, Dict, Iterable, List, Optional, Tuple

_KEY_STATE_SYMBOL_TRANSITIONS = "symbol_transitions"
_KEY_STATE_NON_WORD_CHAR_TRANSITION = "non_word_char_transitions"
2 changes: 1 addition & 1 deletion stwfsapy/automata/heap.py
@@ -13,7 +13,7 @@
# limitations under the License.


from typing import List, Dict, Tuple, Any, SupportsFloat
from typing import Any, Dict, List, SupportsFloat, Tuple


class BinaryMinHeap:
3 changes: 2 additions & 1 deletion stwfsapy/automata/nfa.py
@@ -14,7 +14,8 @@


from collections import defaultdict
from typing import Set, List, Any, DefaultDict
from typing import Any, DefaultDict, List, Set

from stwfsapy.automata.heap import BinaryMinHeap


3 changes: 1 addition & 2 deletions stwfsapy/expansion.py
@@ -13,9 +13,8 @@
# limitations under the License.


from typing import Callable, Pattern, List
import re

from typing import Callable, List, Pattern

_symbol_base_expression = re.compile(r"([\[\]()\{\}*?])")

7 changes: 4 additions & 3 deletions stwfsapy/frequency_features.py
@@ -13,11 +13,12 @@
# limitations under the License.


from collections import defaultdict, OrderedDict
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.exceptions import NotFittedError
from collections import OrderedDict, defaultdict
from math import log

import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.exceptions import NotFittedError


class FrequencyFeatures(BaseEstimator, TransformerMixin):
2 changes: 1 addition & 1 deletion stwfsapy/position_features.py
@@ -13,8 +13,8 @@
# limitations under the License.


from sklearn.base import BaseEstimator, TransformerMixin
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin


class PositionFeatures(BaseEstimator, TransformerMixin):
47 changes: 25 additions & 22 deletions stwfsapy/predictor.py
@@ -13,33 +13,32 @@
# limitations under the License.


from stwfsapy.util.passthrough_transformer import PassthroughTransformer
from stwfsapy.frequency_features import FrequencyFeatures
from stwfsapy.position_features import PositionFeatures
import pickle as pkl
from collections import defaultdict
from typing import Dict, FrozenSet, List, Iterable, Container, Tuple, TypeVar, Union
from scipy.sparse import spmatrix
from numpy import array
from json import dumps, loads
from logging import getLogger
from rdflib.term import URIRef
from typing import Container, Dict, FrozenSet, Iterable, List, Tuple, TypeVar, Union
from zipfile import ZipFile

from numpy import array
from rdflib import Graph
from rdflib.term import URIRef
from scipy.sparse import csr_matrix, spmatrix
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.tree import DecisionTreeClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from scipy.sparse import csr_matrix
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

from stwfsapy import case_handlers, expansion
from stwfsapy import thesaurus as t
from stwfsapy.automata import nfa, construction, conversion, dfa
from stwfsapy.thesaurus_features import ThesaurusFeatureTransformation
from stwfsapy.automata import construction, conversion, dfa, nfa
from stwfsapy.frequency_features import FrequencyFeatures
from stwfsapy.position_features import PositionFeatures
from stwfsapy.text_features import mk_text_features
from stwfsapy.thesaurus_features import ThesaurusFeatureTransformation
from stwfsapy.util.input_handler import get_input_handler
from stwfsapy import case_handlers
from stwfsapy import expansion
import pickle as pkl
from json import dumps, loads
from zipfile import ZipFile

from stwfsapy.util.passthrough_transformer import PassthroughTransformer

T = TypeVar("T")
N = TypeVar("N", int, float)
@@ -273,7 +272,8 @@ def fit(self, X, y=None, **kwargs):
Fits the classifier to the given training data.

:params X: Iterable of text inputs.
:params y: Iterable of correct concepts given by their URI for supervised training.
:params y: Iterable of correct concepts given by their URI for supervised
training.

Returns:
self: The fitted StwfsapyPredictor instance.
@@ -295,7 +295,8 @@ def predict_proba(self, X) -> csr_matrix:
:params X: Iterable of input texts.

Returns:
A sparse matrix of shape (n_samples, n_concepts) with concept match probabilities.
A sparse matrix of shape (n_samples, n_concepts) with concept match
probabilities.
"""
match_X, doc_counts = self.match_and_extend(X)
if match_X:
@@ -314,7 +315,8 @@ def suggest_proba(self, texts) -> List[List[Tuple[str, float]]]:
:params texts: Iterable of strings (documents).

Returns:
A list of lists, where each inner list contains tuples of (concept, probability).
A list of lists, where each inner list contains tuples of
(concept, probability).
"""
match_X, doc_counts = self.match_and_extend(texts)
if match_X:
@@ -336,7 +338,8 @@ def predict(self, X) -> csr_matrix:
:params X: Iterable of input strings.

Returns:
A sparse matrix of shape (n_samples, n_concepts) indicating predicted concept matches.
A sparse matrix of shape (n_samples, n_concepts) indicating predicted
concept matches.
"""
match_X, doc_counts = self.match_and_extend(X)
if match_X:
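
The docstring edits in this file only rewrap lines; the wording is unchanged. A plausible reason is the `E` (pycodestyle) selector, whose line-too-long check would flag the original lines, assuming Ruff's default limit of 88 characters applies (no `line-length` setting is visible in the `pyproject.toml` hunk). The sketch below uses a hypothetical helper, `find_long_lines`, to show the effect on the `Returns:` line of `predict`. The 12-space indentation is an assumption, since this view does not preserve leading whitespace.

```python
def find_long_lines(source: str, limit: int = 88) -> list[tuple[int, int]]:
    """Return (line number, length) for each line longer than `limit`."""
    return [
        (number, len(line))
        for number, line in enumerate(source.splitlines(), start=1)
        if len(line) > limit
    ]


indent = " " * 12  # assumed docstring indentation, not taken from the source
text = "A sparse matrix of shape (n_samples, n_concepts) indicating predicted"
rest = "concept matches."

# Docstring line of `predict` before the change: everything on one line.
before = f"{indent}{text} {rest}"

# The same text after the change, wrapped onto two lines.
after = f"{indent}{text}\n{indent}{rest}"

print(find_long_lines(before))
print(find_long_lines(after))
```

Under these assumptions the unwrapped line is reported and the wrapped version is not.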
3 changes: 1 addition & 2 deletions stwfsapy/tests/automata/construction_test.py
@@ -13,9 +13,8 @@
# limitations under the License.


from stwfsapy.automata import nfa
from stwfsapy.automata import construction as c

from stwfsapy.automata import nfa
from stwfsapy.tests.automata.data import accept

expression = "test"
2 changes: 1 addition & 1 deletion stwfsapy/tests/automata/data.py
@@ -14,8 +14,8 @@


import pytest
from stwfsapy.automata import nfa

from stwfsapy.automata import nfa

symbol0 = "s"
symbol1 = "t"
3 changes: 2 additions & 1 deletion stwfsapy/tests/automata/heap_test.py
@@ -13,9 +13,10 @@
# limitations under the License.


from stwfsapy.automata import heap
import pytest

from stwfsapy.automata import heap


def check_heap(queue):
for i in range(1, len(queue.heap)):
2 changes: 1 addition & 1 deletion stwfsapy/tests/automata/integration_test.py
@@ -13,9 +13,9 @@
# limitations under the License.


from stwfsapy.automata import nfa
from stwfsapy.automata import construction as const
from stwfsapy.automata import conversion as conv
from stwfsapy.automata import nfa
from stwfsapy.tests.automata.data import accept


1 change: 1 addition & 0 deletions stwfsapy/tests/automata/nfa_test.py
@@ -14,6 +14,7 @@


import pytest

from stwfsapy.automata import nfa
from stwfsapy.tests.automata.data import symbol0

6 changes: 3 additions & 3 deletions stwfsapy/tests/automata/search_overlap_regression_test.py
@@ -14,11 +14,11 @@

"""The tests in this file compare the behaviors of
stwfsapy and zaptain-stwfsa regarding overlap of potential matches."""
from stwfsapy.automata import nfa
import stwfsapy.automata.construction as const
import stwfsapy.automata.conversion as conv
import pytest

import stwfsapy.automata.construction as const
import stwfsapy.automata.conversion as conv
from stwfsapy.automata import nfa

label_global = "global"
id_global = "id_global"
1 change: 0 additions & 1 deletion stwfsapy/tests/common.py
@@ -15,7 +15,6 @@
from rdflib import URIRef
from rdflib.term import Literal


test_type_thesaurus = URIRef("http://type.org/thesaurus")
test_type_concept = URIRef("http://type.org/concept")

1 change: 1 addition & 0 deletions stwfsapy/tests/conftest.py
@@ -15,6 +15,7 @@
from pytest import fixture
from rdflib.graph import Graph
from rdflib.namespace import RDF, SKOS

from stwfsapy.tests import common as c


3 changes: 2 additions & 1 deletion stwfsapy/tests/expansion/any_case_from_braces_test.py
@@ -13,9 +13,10 @@
# limitations under the License.


from stwfsapy import expansion as e
import common as c

from stwfsapy import expansion as e

replacement_fun_any = e._replace_by_pattern_fun(e._any_case_from_braces_expression)


3 changes: 2 additions & 1 deletion stwfsapy/tests/expansion/collect_test.py
@@ -13,9 +13,10 @@
# limitations under the License.


import stwfsapy.expansion as e
from inspect import signature

import stwfsapy.expansion as e

_name_abbreviation_fun = e._expand_abbreviation_with_punctuation_fun.__name__
_name_ampersand_fun = e._expand_ampersand_with_spaces_fun.__name__
_name_replacer = "replacer"
@@ -13,9 +13,10 @@
# limitations under the License.


from stwfsapy import expansion as e
import common as c

from stwfsapy import expansion as e

replacement_fun_upper = e._replace_by_pattern_fun(
e._upper_case_abbreviation_from_braces_expression
)
7 changes: 4 additions & 3 deletions stwfsapy/tests/frequency_features_test.py
@@ -13,12 +13,13 @@
# limitations under the License.


from stwfsapy.frequency_features import FrequencyFeatures
from sklearn.exceptions import NotFittedError
import numpy as np
from math import log

import numpy as np
import pytest
from sklearn.exceptions import NotFittedError

from stwfsapy.frequency_features import FrequencyFeatures

frequency_input = [
("cncpt_1", [3, 4, 0, 2], 0),
2 changes: 1 addition & 1 deletion stwfsapy/tests/position_features_test.py
@@ -13,9 +13,9 @@
# limitations under the License.


from stwfsapy.position_features import PositionFeatures
import numpy as np

from stwfsapy.position_features import PositionFeatures

position_feature_data = [
(3, [3, 4, 0, 2]),