This repository contains supplementary data and links to the model and corpora used in the paper "Transfer learning for biomedical named entity recognition with neural networks".
The corpus pre-processing steps are collected in a single script, accompanied by a Jupyter notebook for ease of use. Both the script and the notebook can be found in code.
The model used in this study is NeuroNER [1], a domain-independent named entity recognizer (NER) based on a bidirectional long short-term memory network with a conditional random field output layer (LSTM-CRF). A repository for the model can be found here.
NeuroNER uses standard Python config files to specify hyperparameters. We provide three of these config files for reproducibility (see code/configs):
- baseline.ini: config used while training on the target data sets (i.e., the baseline).
- source.ini: config used while training on the source data sets.
- transfer.ini: config used while transferring a model trained on the source data set for training on a target data set.
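To see which settings separate two of these configs, a small comparison helper can be useful. The following is a minimal sketch using Python's standard configparser module; the section names, keys, and paths in the two example strings are illustrative assumptions, not the contents of the files in code/configs, and the helper names load_flat and diff_configs are hypothetical.

```python
import configparser


def load_flat(ini_text):
    """Parse INI text and flatten it to {(section, key): value}."""
    parser = configparser.ConfigParser()
    parser.read_string(ini_text)
    return {
        (section, key): value
        for section in parser.sections()
        for key, value in parser[section].items()
    }


def diff_configs(a_text, b_text):
    """Return {(section, key): (value_in_a, value_in_b)} for settings that differ."""
    a, b = load_flat(a_text), load_flat(b_text)
    return {
        key: (a.get(key), b.get(key))
        for key in sorted(set(a) | set(b))
        if a.get(key) != b.get(key)
    }


# Illustrative config contents only (assumed, not copied from the repository).
BASELINE = """
[mode]
train_model = True
use_pretrained_model = False

[dataset]
dataset_text_folder = ../data/target_corpus
"""

TRANSFER = """
[mode]
train_model = True
use_pretrained_model = True
pretrained_model_folder = ../trained_models/source_model

[dataset]
dataset_text_folder = ../data/target_corpus
"""

if __name__ == "__main__":
    for (section, key), (old, new) in diff_configs(BASELINE, TRANSFER).items():
        print(f"[{section}] {key}: {old!r} -> {new!r}")
```

To compare files on disk instead of strings, read each file's text and pass it to diff_configs.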
The word embeddings used in this study were obtained from here [2]. Code for converting the word vectors to the .txt format required by NeuroNER can be found in the Jupyter notebook in code, under data cleaning.
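The notebook holds the actual conversion code. Purely as an illustration of what such a conversion involves, the sketch below assumes that the downloaded vectors are stored in the word2vec binary format and that the target is a headerless text file with one token per line followed by its space-separated values. Both points are assumptions to verify against the notebook, and the function name word2vec_bin_to_txt is hypothetical.

```python
import struct


def word2vec_bin_to_txt(bin_path, txt_path):
    """Convert a word2vec binary file to headerless, space-separated text.

    Returns (vocab_size, vector_dim) as read from the binary header.
    """
    with open(bin_path, "rb") as src, open(txt_path, "w", encoding="utf-8") as dst:
        # The header is a single text line: "<vocab_size> <vector_dim>\n".
        vocab_size, dim = map(int, src.readline().split())
        for _ in range(vocab_size):
            # A token is the run of bytes up to the next space. Some writers
            # emit a newline after each vector, so newlines are skipped here.
            token_bytes = []
            while True:
                ch = src.read(1)
                if ch in (b" ", b""):
                    break
                if ch != b"\n":
                    token_bytes.append(ch)
            token = b"".join(token_bytes).decode("utf-8", errors="replace")
            # The vector follows as `dim` little-endian 32-bit floats.
            values = struct.unpack(f"<{dim}f", src.read(4 * dim))
            dst.write(token + " " + " ".join(f"{v:.6f}" for v in values) + "\n")
    return vocab_size, dim
```

Reading one entry at a time keeps memory use flat, which matters for large embedding files.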
All corpora used in this study that can be redistributed are in the corpora folder, in brat standoff format.
Data can be uncompressed with the following command:
tar -zxvf <name_of_corpora>
Alternatively, the corpora can be publicly accessed at the following links:
| Corpus | Text Genre | Standard | Entities | Publication |
|---|---|---|---|---|
| AZDC | Scientific Article | Gold | diseases | link |
| BioCreative II GM | Scientific Article | Gold | genes/proteins | link |
| BioInfer | Scientific Article | Gold | genes/proteins | link |
| BioSemantics | Patent | Gold | chemicals, diseases | link |
| CALBC-III-Small | Scientific Article | Silver | chemicals, diseases, species, genes/proteins | link |
| CDR | Scientific Article | Gold | chemicals, diseases | link |
| CellFinder | Scientific Article | Gold | species, genes/proteins, cells, anatomy | link |
| CHEMDNER Patent | Patent | Gold | chemicals | link |
| DECA | Scientific Article | Gold | genes/proteins | link |
| FSU-PRGE | Scientific Article | Gold | genes/proteins | link |
| Linnaeus | Scientific Article | Gold | species | link |
| LocText | Scientific Article | Gold | species, genes/proteins | link |
| IEPA | Scientific Article | Gold | genes/proteins | link |
| miRNA | Scientific Article | Gold | diseases, species, genes/proteins | link |
| NCBI disease | Scientific Article | Gold | diseases | link |
| S800 | Scientific Article | Gold | species | link |
| Variome | Scientific Article | Gold | diseases, species, genes/proteins | link |
Many of these corpora can also be accessed and visualized in the browser here [3].
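Since the corpora are distributed in brat standoff format, entity annotations live in .ann files next to the raw .txt files. The sketch below shows one way to read the text-bound (entity) lines of such a file. It is a minimal reader written for illustration, not code from this repository; the Entity type, the parse_ann name, and the labels in the usage example are hypothetical. It skips relation, event, and note lines.

```python
from typing import List, NamedTuple, Tuple


class Entity(NamedTuple):
    id: str                        # e.g. "T1"
    label: str                     # entity type, e.g. "Disease"
    spans: List[Tuple[int, ...]]   # (start, end) character offsets
    text: str                      # surface text covered by the spans


def parse_ann(ann_text):
    """Extract text-bound annotations (lines whose ID starts with 'T')."""
    entities = []
    for line in ann_text.splitlines():
        if not line.startswith("T"):
            continue  # relations, events, attributes, and notes are ignored
        # Format: ID <TAB> TYPE START END[;START END...] <TAB> TEXT
        ent_id, type_and_offsets, surface = line.split("\t", 2)
        label, offsets = type_and_offsets.split(" ", 1)
        # Discontinuous entities separate their fragments with ';'.
        spans = [tuple(int(n) for n in frag.split()) for frag in offsets.split(";")]
        entities.append(Entity(ent_id, label, spans, surface))
    return entities


if __name__ == "__main__":
    example = "T1\tDisease 0 13\tbreast cancer\nT2\tGene 20 25;30 35\tBRCA1 BRCA2\n"
    for entity in parse_ann(example):
        print(entity.id, entity.label, entity.spans, entity.text)
```

The offsets index into the matching .txt file, so slicing the document text with each span is a quick consistency check after any pre-processing step.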
The supplementary data can be found in the file supplementary/additional_file_1.pdf. Additionally, blacklists used for the silver-standard corpora (SSCs) can be found in supplementary/blacklists.
1. Dernoncourt, F., Lee, J. Y., & Szolovits, P. (2017). NeuroNER: an easy-to-use program for named-entity recognition based on neural networks. arXiv preprint arXiv:1705.05487.
2. Pyysalo, S., Ginter, F., Moen, H., Salakoski, T., & Ananiadou, S. (2013). Distributional semantics resources for biomedical text processing. In Proceedings of the 5th International Symposium on Languages in Biology and Medicine, Tokyo, Japan (pp. 39-43).
3. Stenetorp, P., Topić, G., Pyysalo, S., Ohta, T., Kim, J. D., & Tsujii, J. I. (2011, June). BioNLP shared task 2011: Supporting resources. In Proceedings of the BioNLP Shared Task 2011 Workshop (pp. 112-120). Association for Computational Linguistics.