Augustus Training Fails with: size 100 is greater than the number of genes in file #556

@amsesk

Are you using the latest release?
If you are not using the latest release of funannotate, please upgrade, if bug persists then report here.

I am using funannotate 1.7.3 as it is the version that we had the HPC folks install on our cluster for us.

Describe the bug
funannotate predict fails during augustus training with the following error: size 100 is greater than the number of genes in file

I know that the problem is caused by BUSCO not being able to validate enough models for training augustus. For a variety of reasons (e.g., the genome is probably incomplete, or it is subject to rapid evolution), BUSCO only finds and validates 60-70 markers in each nucleotide assembly. I specified --min_training_models 60 to attempt to circumvent this issue. That got past the logic check in funannotate.library.trainAugustus, but then augustus throws the above error. I'm fairly certain that it's the call to randomSplit.pl that produces the error.
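A quick way to confirm the mismatch is to count the records in the training file and compare that count with the requested test-set size. The sketch below is only illustrative: it assumes each gene record in augustus.training.busco.gb starts with a LOCUS line, as in standard GenBank flat files, and the helper names count_genbank_records and split_is_feasible are hypothetical, not part of funannotate or Augustus.

```python
import io


def count_genbank_records(handle):
    """Count records by counting lines that start with 'LOCUS'.

    Assumption: one LOCUS line per gene record, as in standard GenBank files.
    """
    return sum(1 for line in handle if line.startswith("LOCUS"))


def split_is_feasible(n_records, test_size):
    """Return True when the requested test-set size does not exceed the
    number of records, mirroring the wording of the error message."""
    return test_size <= n_records


# Tiny in-memory stand-in for a GenBank training file with two records.
demo = io.StringIO(
    u"LOCUS       gene_a 10 bp DNA\n//\n"
    u"LOCUS       gene_b 10 bp DNA\n//\n"
)
n_demo = count_genbank_records(demo)

# With 64 validated BUSCO models, a test set of 100 cannot be drawn.
can_split_64_into_100 = split_is_feasible(64, 100)
```

On a real run, the same count could be taken by opening the .gb file from predict_misc and passing the file handle to count_genbank_records.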

I examined your source and see that in predict.py on lines 1390-1393 (version as of 02/22/2021 @ 21:00:00 EST) the numTrainingSet variable is hard-coded to be either 10% of args.min_training_models or, if args.min_training_models <= 1000, 100.

Is this hard-coded on purpose, to keep me from training augustus at all when BUSCO can't pull enough markers for training (i.e., < 100)? Is training augustus with <100 markers a bad idea? It seems to me that I could circumvent this error by setting numTrainingSet = numpy.floor(args.min_training_models / 2) in cases where args.min_training_models < 200 (or something like that), but maybe this is hard-coded for a good reason.
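To make the proposal concrete, here is a minimal sketch of the sizing rule as I understand it next to the fallback I have in mind. None of this is copied from funannotate: the function names, the 200-model cutoff, and the use of integer division in place of numpy.floor are assumptions for illustration only.

```python
def reported_test_size(n_models):
    """Sizing rule as described in this report (an assumption, not the
    actual funannotate source): 10% of the models above 1000, else 100."""
    if n_models > 1000:
        return int(round(n_models * 0.10))
    return 100


def proposed_test_size(n_models, small_cutoff=200):
    """Hypothetical fallback: hold out half of the models when fewer than
    `small_cutoff` are available, otherwise keep the reported rule."""
    if n_models < small_cutoff:
        return n_models // 2  # integer division, same effect as floor(n / 2)
    return reported_test_size(n_models)


# With --min_training_models 60, the reported rule still asks for 100,
# while the fallback would ask for 30.
size_reported = reported_test_size(60)
size_proposed = proposed_test_size(60)
```

Whether a held-out set this small gives a meaningful accuracy estimate for augustus is exactly the open question above; the sketch only shows that the split itself would no longer fail.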

Can you please clarify if this is a bug or intended behavior? I'd like to train augustus for these predictions, but don't see a way around this without forcing funannotate to call trainAugustus with a lower value of numTrainingSet.

Thanks for your help!

What command did you issue?
funannotate predict -i <masked_assembly> --species "<my_fungus>" --isolate "<my_strain>" -o "<my_strain>_funannotate" --cpus 12 --busco_db fungi --busco_seed_species rhizopus_oryzae --optimize_augustus --min_training_models 60

Logfiles

-------------------------------------------------------
[06:50 AM]: OS: linux2, 36 cores, ~ 196 GB RAM. Python: 2.7.16
[06:50 AM]: Running funannotate v1.7.3
[06:50 AM]: Parsed training data, run ab-initio gene predictors as follows:
[06:50 AM]: Loading genome assembly and parsing soft-masked repetitive sequences
  Program      Training-Method
  augustus     busco          
  genemark     selftraining   
  glimmerhmm   busco          
  snap         busco          
[06:51 AM]: Genome loaded: 8,411 scaffolds; 143,003,589 bp; 10.56% repeats masked
[06:51 AM]: Mapping 550,173 proteins to genome using diamond and exonerate
[06:55 AM]: Found 369,950 preliminary alignments --> aligning with exonerate
[07:22 AM]: Exonerate finished: found 1,493 alignments
[07:22 AM]: Running GeneMark-ES on assembly
[07:24 AM]: GeneMark-ES failed: ARSEF_5376_funannotate/predict_misc/genemark/output/gmhmm.mod file missing, please check logfiles.
[07:24 AM]: GeneMark predictions failed. If you can run GeneMark outside of funannotate, then pass the results to --genemark_gtf.
[07:24 AM]: Running BUSCO to find conserved gene models for training ab-initio predictors
[07:25 AM]: 64 valid BUSCO predictions found, now formatting for EVM
[07:26 AM]: Running EVM commands with 11 CPUs
[07:34 AM]: Converting to GFF3 and collecting all EVM results
[07:35 AM]: 64 total gene models from EVM, now validating with BUSCO HMM search
[07:35 AM]: 64 BUSCO predictions validated
[07:35 AM]: Training Augustus using BUSCO gene models
size 100 is greater than the number of genes in file ### <-- here's the error
/scratch/tyjames_root/tyjames/amsesk/2021_neozygites/funannotate/ARSEF_5376_funannotate/predict_misc/augustus.training.busco.gb. Aborting.

OS/Install Information

  • output of funannotate check --show-versions
-------------------------------------------------------
Checking dependencies for 1.7.3
-------------------------------------------------------
You are running Python v 2.7.16. Now checking python packages...
biopython: 1.76
goatools: 0.9.9
matplotlib: 2.2.3
natsort: 6.2.0
numpy: 1.16.2
pandas: 0.24.2
psutil: 5.7.0
requests: 2.23.0
scikit-learn: 0.20.3
scipy: 1.2.1
seaborn: 0.9.0
All 11 python packages installed


You are running Perl v 5.026002. Now checking perl modules...
Bio::Perl: 1.007002
Carp: 1.38
Clone: 0.42
DBD::SQLite: 1.64
DBD::mysql: 4.046
DBI: 1.642
DB_File: 1.852
Data::Dumper: 2.173
File::Basename: 2.85
File::Which: 1.23
Getopt::Long: 2.5
Hash::Merge: 0.300
JSON: 4.02
LWP::UserAgent: 6.43
Logger::Simple: 2.0
POSIX: 1.76
Parallel::ForkManager: 2.02
Pod::Usage: 1.69
Scalar::Util::Numeric: 0.40
Storable: 3.15
Text::Soundex: 3.05
Thread::Queue: 3.12
Tie::File: 1.02
URI::Escape: 3.31
YAML: 1.30
threads: 2.15
threads::shared: 1.56
All 27 Perl modules installed


Checking Environmental Variables...
$FUNANNOTATE_DB=/nfs/turbo/lsa-amsesk/database/funannotateDB
$PASAHOME=/opt/Anaconda-Python2-2019.03/opt/pasa-2.4.1
$TRINITYHOME=/opt/Anaconda-Python2-2019.03/opt/trinity-2.8.5
$EVM_HOME=/opt/Anaconda-Python2-2019.03/opt/evidencemodeler-1.1.1
$AUGUSTUS_CONFIG_PATH=/opt/Anaconda-Python2-2019.03/pkgs/augustus-3.3.2-pl526h985c5e9_2/config
$GENEMARK_PATH=/opt/GeneMark/gm_et_linux_64/gmes_petap
All 6 environmental variables are set
-------------------------------------------------------
Checking external dependencies...
PASA: 2.4.1
CodingQuarry: 2.0
Trinity: 2.8.5
augustus: 3.3.2
bamtools: bamtools 2.5.1
bedtools: bedtools v2.29.2
blat: BLAT v36
diamond: 0.9.24
emapper.py: 2.0.1-4-g2466c1b
ete3: 3.1.1
exonerate: exonerate 2.4.0
fasta: no way to determine
glimmerhmm: 3.0.4
gmap: 2017-11-15
gmes_petap.pl: 4.38
hisat2: 2.1.0
hmmscan: HMMER 3.3 (Nov 2019)
hmmsearch: HMMER 3.3 (Nov 2019)
java: 11.0.1
kallisto: 0.46.0
mafft: v7.455 (2019/Dec/7)
makeblastdb: makeblastdb 2.2.31+
minimap2: 2.17-r941
proteinortho: 6.0.14
pslCDnaFilter: no way to determine
salmon: salmon 0.14.1
samtools: samtools 1.9
signalp: 4.1
snap: 2006-07-28
stringtie: 2.1.1
tRNAscan-SE: 2.0.5 (October 2019)
tantan: tantan 13
tbl2asn: no way to determine, likely 25.X
tblastn: tblastn 2.2.31+
trimal: trimAl v1.4.rev15 build[2013-12-17]
trimmomatic: 0.39
All 36 external dependencies are installed
