Can't find T790M mutation in civicmine #6

@hongiiv

Hi jakelever,

Thanks for this wonderful project.

When I searched civicmine (http://bionlp.bcgsc.ca/civicmine), I couldn't find "T790M" in any sentence. That seemed odd to me, because EGFR T790M is a very well-known biomarker in cancer treatment.

This looks like a tokenizer problem: the spaCy language model (en_core_web_sm) splits "T790M" into two tokens, "T790" and "M": (('T790', 'NOUN'), ('M', 'PROPN'))

I changed the kindred package like this (kindred/Parser.py):

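# needed for the special case below (add next to the other imports)
from spacy.symbols import ORTH

if model not in Parser._models:
    Parser._models[model] = spacy.load(model, disable=['ner'])

self.nlp = Parser._models[model]
# keep "T790M" as one token instead of splitting it into "T790" and "M"
special_case = [{ORTH: "T790M"}]
self.nlp.tokenizer.add_special_case("T790M", special_case)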

With this change, "T790M" is kept as a single token: ('T790M', 'VERB').
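For anyone who wants to try the same idea outside of kindred, here is a minimal standalone sketch. It assumes spaCy is installed and uses `spacy.blank("en")`, which should apply the English tokenizer rules without needing a trained model; I have not checked that it behaves identically to en_core_web_sm. The helper names `add_variant_special_cases` and `tokenize` and the variant list are only illustrative and are not part of kindred or civicmine.

```python
import spacy
from spacy.symbols import ORTH


def add_variant_special_cases(nlp, variants):
    """Register each variant string as a single-token tokenizer special case."""
    for variant in variants:
        # The ORTH values of a special case must join back to the original string
        nlp.tokenizer.add_special_case(variant, [{ORTH: variant}])
    return nlp


def tokenize(nlp, text):
    """Return the token texts produced by the tokenizer alone (no tagging)."""
    return [token.text for token in nlp.tokenizer(text)]


# Hypothetical variant list; only "T790M" comes from this issue
nlp = add_variant_special_cases(spacy.blank("en"), ["T790M", "L858R"])
tokens = tokenize(nlp, "EGFR T790M confers resistance to gefitinib")
print(tokens)
```

A fixed list only covers the variants you enumerate, so this is a workaround rather than a general fix for mutation names.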

best,
jakelever
