Cactus can take more than an hour to align this dataset on your instance, and it is strongly dependent on the allocated CPUs and memory for the job. I have provided the alignment in case you want to skip this step. Look for [HelicChr2.hal] among the files you downloaded.
The output of Cactus is a HAL (Hierarchical Alignment Format) file. As the name suggests, the file is hierarchical and designed to store genome-scale multiple sequence alignments.
A HAL can represent a complete tree of genomes, including their sequence data, annotations, and evolutionary relationships. You can also think of a HAL file as some kind of database in binary format, meaning that you cannot directly access with a simple text editor. The HAL file is optimized for random access, as well as supporting incremental updates to the alignment. If needed, you can use the Haltools to manipulate and convert the alignment into other formats.
Figure. (A) A single genome as represented in HAL. Two sequences are stored in an array of DNA characters and are segmented with respect to its parent (top segments) and children (bottom segments). (B) The same genome in the context of HAL graph of five genomes. The dashed edge corresponds to an inversion event. Hickey G, Paten B, Earl D, Zerbino D, Haussler D (2013). HAL: a hierarchical format for storing and analyzing multiple genome alignments. Bioinformatics
In this section, you can utilize halSummarizeMutations to extract various useful information, such as mutations (insertions, deletions, inversions, duplications, transpositions, gap insertions, and gap deletions) in each branch of the alignment. The job is a single CPU process and may require ~15 minutes to complete. You can try redirecting the output to a log file and running it in the background using the '&' symbol.
Solution
halSummarizeMutations HelicChr2.hal [optional --maxNFraction]NOTE: In halSummarizeMutations you can specify --targetGenomes or --rootGenome option, --maxNFraction 0 will instead prevent rearrangements with missing data as being identified as such. More generally, if an insertion of length 50 contains c N-characters, it will be labeled as missing data (rather than an insertion) if c/N > maxNFraction.
Once the analysis is done can you assess:
- What predominant mutation occurs at the branch of Heliconius?
- And how many bases and how many events does it involve?
Solution 1. The predominant mutation, in terms of bases, should be Insertions, with over 1.6 Mb, with ~9 thousand events. .
NOTE: This command can take a while, if you don't want to wait there's a hidden file named .HelicChr2.SummarizeMutations.csv you can run the mv command to make it visible:
mv .HelicChr2.SummarizeMutations.csv HelicChr2.SummarizeMutations.csvYou can also compute the alignment coverage using halAlignmentDepth, which returns the number of genomes aligned to a reference species of your choice. It will output data to the Stdout in wiggle format. You can redirect the output to a file and convert it into a compressed binary file (BigWig format) using wigToBigWig that can be easily visualized on a genome browser such as IGV.
halAlignmentDepth has few options, if you execute it you'll see what these are:
halAlignmentDepth v2.2: Make alignment depth wiggle plot for a genome. By default,
this is a count of the number of other unique genomes each
base aligns to, including ancestral genomes.
USAGE:
halAlignmentDepth [Options] <halPath> <refGenome>
ARGUMENTS:
halPath: input hal file
refGenome: reference genome to scan
OPTIONS:
--cacheBytes <value>: obsolete name for --hdf5CacheBytes [default = 1048576]
--cacheMDC <value>: obsolete name for --hdf5CacheMDC [default = 113]
--cacheRDC <value>: obsolete name for --hdf5CacheRDC [default = 521]
--cacheW0 <value>: obsolete name for --hdf5CacheW0 [default = 0.75]
--countDupes: count each other *position* each base aligns to, rather
than the number of unique genomes, including paralogies
so a genome can be counted multiple times. This will
give the height of the MAF column created with hal2maf.
[default = 0]
--format <value>: choose the back-end storage format. [default = hdf5]
--hdf5CacheBytes <value>: maximum size in bytes of regular hdf5 cache [default =
1048576]
--hdf5CacheMDC <value>: number of metadata slots in hdf5 cache [default = 113]
--hdf5CacheRDC <value>: number of regular slots in hdf5 cache. should be a prime
number ~= 10 * DefaultCacheRDCBytes / chunk [default =
521]
--hdf5CacheW0 <value>: w0 parameter for hdf5 cache [default = 0.75]
--hdf5InMemory: load all data in memory (and disable hdf5 cache) [default
= 0]
--help: display this help page [default = 0]
--inMemory: obsolete name for --hdf5InMemory [default = 0]
--length <value>: length of the reference genome (or sequence if specified)
to convert. If set to 0, the entire thing is converted
[default = 0]
--noAncestors: do not count ancestral genomes. [default = 0]
--outWiggle <value>: output wig file (stdout if none) [default = stdout]
--refSequence <value>: sequence name to export (all sequences by default)
[default = ""]
--rootGenome <value>: name of root genome (none if empty) [default = ""]
--start <value>: coordinate within reference genome (or sequence if
specified) to start at [default = 0]
--step <value>: step size [default = 1]
--targetGenomes <value>: comma-separated (no spaces) list of target genomes
(others are excluded) (vist all if empty) [default = ""]
You gonna be using --noAncestors, --rootGenome and --targetGenomes. Check what these options do.
- Use the tool to compute the overall coverage using H. melpomene (
Hmel) as a reference
halAlignmentDepth --noAncestors HelicChr2.hal Hmel > Hmel.Cov.wig- convert the wig file into a compress binary using
wigToBigWig.
wigToBigWig Hmel.Cov.wig Hmel.Chr2.fasta.fai Hmel.Cov.bwNOTE: You may have notice that wigToBigWig requires the fasta index file. This file is a tab-delimited file which contains information of the fasta file, such as the lengths of sequences. You can easily generate it with samtools
samtools faidx Hmel.Chr2.fasta- Compute the coverage of only Heliconius species and convert the output
halAlignmentDepth --rootGenome Heliconius --noAncestors HelicChr2.hal Hmel > Hmel.HelOnly.Cov.wig
wigToBigWig Hmel.HelOnly.Cov.wig Hmel.Chr2.fasta.fai Hmel.HelOnly.Cov.bw- Compute the coverage of only non-Heliconius species
halAlignmentDepth --targetGenomes Eisa,Dpha,Smor --noAncestors HelicChr2.hal Hmel > Hmel.NonHelOnly.Cov.wig
wigToBigWig Hmel.NonHelOnly.Cov.wig Hmel.Chr2.fasta.fai Hmel.NonHelOnly.Cov.bw- Now load the files (
Hmel.Cov.bw,Hmel.HelOnly.Cov.bw,Hmel.NonHelOnly.Cov.bw) into IGV, can you spot any region of the chromosome that is more specific to Heliconius species?
NOTE: To load data in IGV you need the genome in fasta format (what is our genome?) and its index.
Previous section: Part I: Whole Genome Alignment
Next section: Part III: Identification of Conserved Regions