Augustus Training Fails with: size 100 is greater than the number of genes in file

**Are you using the latest release?**
If you are not using the latest release of funannotate, please upgrade; if the bug persists, report it here.

I am using funannotate 1.7.3 as it is the version that we had the HPC folks install on our cluster for us.

**Describe the bug**
A clear and concise description of what the bug is.

`funannotate predict` fails during augustus training with the following error: `size 100 is greater than the number of genes in file`

I know that the problem is caused by BUSCO not being able to validate enough models for training augustus. For a variety of reasons (e.g., genome is probably incomplete, genome is subject to rapid evolution, etc.) BUSCO only finds and validates 60-70 markers in each nucleotide assembly. I specified `--min_training_models 60` to attempt to circumvent this issue. That worked for the logic check in `funannotate.library.trainAugustus`, but then augustus throws the above error. I'm fairly certain that it's the call to `randomSplit.pl` that produces the error.
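To double-check how many gene models actually ended up in the training file, something like the sketch below can be used. It assumes each gene model in `augustus.training.busco.gb` is one GenBank record whose header line starts with `LOCUS`; `count_genbank_records` is a hypothetical helper name, not part of funannotate or augustus.

```python
def count_genbank_records(lines):
    """Count GenBank records by their LOCUS header lines.

    Assumption: one record (one LOCUS line) per training gene model.
    `lines` can be an open file handle or any iterable of strings.
    """
    return sum(1 for line in lines if line.startswith("LOCUS"))


# Example usage (path taken from the log below, relative to the output folder):
# with open("predict_misc/augustus.training.busco.gb") as handle:
#     print(count_genbank_records(handle))
```

If the count is below 100, a request for a test set of size 100 cannot be satisfied, which would be consistent with the error message.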

I examined your source and see that in `predict.py` on lines 1390-1393 (version as of 02/22/2021 @ 21:00:00 EST) the `numTrainingSet` variable is hard-coded to be either 10% of `args.min_training_models` or 100 if `args.min_training_models <= 1000`.

Is this hard-coded on purpose, to protect me from training augustus at all when BUSCO can't pull enough markers for training (i.e., < 100)? Is training augustus with fewer than 100 markers a bad idea? It seems to me that I could circumvent this error by setting `numTrainingSet = numpy.floor(args.min_training_models / 2)` in cases where `args.min_training_models < 200` (or something like that), but maybe this is hard-coded for a good reason.
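For concreteness, the fallback I have in mind could be sketched roughly as follows. This is not funannotate's code: `pick_test_set_size` and `default_size` are hypothetical names, and holding out half of the models is just the rule of thumb suggested above. It uses integer division instead of `numpy.floor`, since `numpy.floor` returns a float and the size presumably needs to be an integer.

```python
def pick_test_set_size(n_models, default_size=100):
    """Choose how many gene models to hold out as a test set.

    Hypothetical helper, not part of funannotate. Keeps the usual
    default when there are plenty of models; otherwise holds out half
    of them so the requested size never exceeds the number available.
    """
    if n_models >= 2 * default_size:
        return default_size
    # Integer division keeps the result an int; never return 0.
    return max(1, n_models // 2)


# With the 64 validated BUSCO models from the log below:
# pick_test_set_size(64) -> 32
```

Whether 32 test models and 32 training models is enough to train augustus sensibly is exactly the part I'm unsure about.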

Can you please clarify if this is a bug or intended behavior? I'd like to train augustus for these predictions, but don't see a way around this without forcing funannotate to call trainAugustus with a lower value of `numTrainingSet`.

Thanks for your help!

**What command did you issue?**
Copy/paste the command used.
```
funannotate predict -i <masked_assembly> --species "<my_fungus>" --isolate "<my_strain>" -o "<my_strain>_funannotate" --cpus 12 --busco_db fungi --busco_seed_species rhizopus_oryzae --optimize_augustus --min_training_models 60
```

**Logfiles**
```
-------------------------------------------------------
[06:50 AM]: OS: linux2, 36 cores, ~ 196 GB RAM. Python: 2.7.16
[06:50 AM]: Running funannotate v1.7.3
[06:50 AM]: Parsed training data, run ab-initio gene predictors as follows:
[06:50 AM]: Loading genome assembly and parsing soft-masked repetitive sequences
  Program      Training-Method
  augustus     busco          
  genemark     selftraining   
  glimmerhmm   busco          
  snap         busco          
[06:51 AM]: Genome loaded: 8,411 scaffolds; 143,003,589 bp; 10.56% repeats masked
[06:51 AM]: Mapping 550,173 proteins to genome using diamond and exonerate
[06:55 AM]: Found 369,950 preliminary alignments --> aligning with exonerate
[07:22 AM]: Exonerate finished: found 1,493 alignments
[07:22 AM]: Running GeneMark-ES on assembly
[07:24 AM]: GeneMark-ES failed: ARSEF_5376_funannotate/predict_misc/genemark/output/gmhmm.mod file missing, please check logfiles.
[07:24 AM]: GeneMark predictions failed. If you can run GeneMark outside of funannotate, then pass the results to --genemark_gtf.
[07:24 AM]: Running BUSCO to find conserved gene models for training ab-initio predictors
[07:25 AM]: 64 valid BUSCO predictions found, now formatting for EVM
[07:26 AM]: Running EVM commands with 11 CPUs
[07:34 AM]: Converting to GFF3 and collecting all EVM results
[07:35 AM]: 64 total gene models from EVM, now validating with BUSCO HMM search
[07:35 AM]: 64 BUSCO predictions validated
[07:35 AM]: Training Augustus using BUSCO gene models
size 100 is greater than the number of genes in file ### <-- here's the error
/scratch/tyjames_root/tyjames/amsesk/2021_neozygites/funannotate/ARSEF_5376_funannotate/predict_misc/augustus.training.busco.gb. Aborting.
```

**OS/Install Information**
 -  output of `funannotate check --show-versions`
```
-------------------------------------------------------
Checking dependencies for 1.7.3
-------------------------------------------------------
You are running Python v 2.7.16. Now checking python packages...
biopython: 1.76
goatools: 0.9.9
matplotlib: 2.2.3
natsort: 6.2.0
numpy: 1.16.2
pandas: 0.24.2
psutil: 5.7.0
requests: 2.23.0
scikit-learn: 0.20.3
scipy: 1.2.1
seaborn: 0.9.0
All 11 python packages installed


You are running Perl v 5.026002. Now checking perl modules...
Bio::Perl: 1.007002
Carp: 1.38
Clone: 0.42
DBD::SQLite: 1.64
DBD::mysql: 4.046
DBI: 1.642
DB_File: 1.852
Data::Dumper: 2.173
File::Basename: 2.85
File::Which: 1.23
Getopt::Long: 2.5
Hash::Merge: 0.300
JSON: 4.02
LWP::UserAgent: 6.43
Logger::Simple: 2.0
POSIX: 1.76
Parallel::ForkManager: 2.02
Pod::Usage: 1.69
Scalar::Util::Numeric: 0.40
Storable: 3.15
Text::Soundex: 3.05
Thread::Queue: 3.12
Tie::File: 1.02
URI::Escape: 3.31
YAML: 1.30
threads: 2.15
threads::shared: 1.56
All 27 Perl modules installed


Checking Environmental Variables...
$FUNANNOTATE_DB=/nfs/turbo/lsa-amsesk/database/funannotateDB
$PASAHOME=/opt/Anaconda-Python2-2019.03/opt/pasa-2.4.1
$TRINITYHOME=/opt/Anaconda-Python2-2019.03/opt/trinity-2.8.5
$EVM_HOME=/opt/Anaconda-Python2-2019.03/opt/evidencemodeler-1.1.1
$AUGUSTUS_CONFIG_PATH=/opt/Anaconda-Python2-2019.03/pkgs/augustus-3.3.2-pl526h985c5e9_2/config
$GENEMARK_PATH=/opt/GeneMark/gm_et_linux_64/gmes_petap
All 6 environmental variables are set
-------------------------------------------------------
Checking external dependencies...
PASA: 2.4.1
CodingQuarry: 2.0
Trinity: 2.8.5
augustus: 3.3.2
bamtools: bamtools 2.5.1
bedtools: bedtools v2.29.2
blat: BLAT v36
diamond: 0.9.24
emapper.py: 2.0.1-4-g2466c1b
ete3: 3.1.1
exonerate: exonerate 2.4.0
fasta: no way to determine
glimmerhmm: 3.0.4
gmap: 2017-11-15
gmes_petap.pl: 4.38
hisat2: 2.1.0
hmmscan: HMMER 3.3 (Nov 2019)
hmmsearch: HMMER 3.3 (Nov 2019)
java: 11.0.1
kallisto: 0.46.0
mafft: v7.455 (2019/Dec/7)
makeblastdb: makeblastdb 2.2.31+
minimap2: 2.17-r941
proteinortho: 6.0.14
pslCDnaFilter: no way to determine
salmon: salmon 0.14.1
samtools: samtools 1.9
signalp: 4.1
snap: 2006-07-28
stringtie: 2.1.1
tRNAscan-SE: 2.0.5 (October 2019)
tantan: tantan 13
tbl2asn: no way to determine, likely 25.X
tblastn: tblastn 2.2.31+
trimal: trimAl v1.4.rev15 build[2013-12-17]
trimmomatic: 0.39
All 36 external dependencies are installed
```

Augustus Training Fails with: size 100 is greater than the number of genes in file #556