This is my attempt at the selection tests for the GSoC 2019 project, "Human history and data visualization" under the Canadian Center for Computational Genomics
usage: main.py [-h] [--log_dir] [--plot_dir] [--coords_dir] [--genedata]
[--labels] [--pops] [--seed] [--downsample_n] [--prune_size]
[--prune_step] [--prune_thresh] [--prune_iter] [--pca_n]
[--pca_scaler] [--plot_limit_components] [--title TITLE]
Applying Dimensionality Reduction on Genomic Data
optional arguments:
-h, --help show this help message and exit
--log_dir log directory
--plot_dir plot directory
--coords_dir coordinates directory
--genedata genomic data file
--labels sample population mappings file
--pops population description file
--seed seed for prng
--downsample_n number of SNPs to randomly choose
--prune_size size for pruning
--prune_step step for pruning
--prune_thresh threshold for pruning
--prune_iter iterations for pruning
--pca_n number of principal components
--pca_scaler scaler for pca
--plot_limit_components
number of components to consider plotting, [None]
means plot all
--title TITLE process title
With the dataset in a folder data in the repository directory, run the following command :
python main.py --genedata=data/ALL.wgs.nhgri_coriell_affy_6.20140825.genotypes_has_ped.vcf --labels=data/affy_samples.20141118.panel --pops=data/20131219.populations.tsv --title="Chromosome 1 Assembly B37" --plot_limit_components=4After running command, the project structure changes to :
.
├── coords
│ └── [1554763502304] Chromosome 1 Assembly B37
│ ├── [1554763502304] Chromosome 1 Assembly B37 - [PCA] coords.txt
│ ├── [1554763502304] Chromosome 1 Assembly B37 - [UMAP] coords.txt
│ └── [1554763502304] Chromosome 1 Assembly B37 - [UMAP_PCA] coords.txt
├── logs
│ └── [1554763502304] Chromosome 1 Assembly B37.log
├── main.py
├── plots
│ └── [1554763502304] Chromosome 1 Assembly B37 plots
│ ├── [1554763502304] Chromosome 1 Assembly B37 [PCA]
│ │ ├── [1554763502304] Chromosome 1 Assembly B37 [PCA] - PC[1][2].png
│ │ ├── [1554763502304] Chromosome 1 Assembly B37 [PCA] - PC[1][3].png
│ │ ├── [1554763502304] Chromosome 1 Assembly B37 [PCA] - PC[1][4].png
│ │ ├── [1554763502304] Chromosome 1 Assembly B37 [PCA] - PC[2][3].png
│ │ ├── [1554763502304] Chromosome 1 Assembly B37 [PCA] - PC[2][4].png
│ │ └── [1554763502304] Chromosome 1 Assembly B37 [PCA] - PC[3][4].png
│ ├── [1554763502304] Chromosome 1 Assembly B37 [UMAP]
│ │ └── [1554763502304] Chromosome 1 Assembly B37 [UMAP] - PC[1][2].png
│ └── [1554763502304] Chromosome 1 Assembly B37 [UMAP_PCA]
│ └── [1554763502304] Chromosome 1 Assembly B37 [UMAP_PCA] - PC[1][2].png
└── README.md
The coords directory contains the lower dimensional coordinates written into text files.
The logs directory contains the log reports.
The plots directory has the plots of the components with one another.
![Chromosome 1 Assembly B37 [PCA] - PC[1][2] alt text](/kelvindecosta/genomic-data-visualization-pipeline/raw/master/plots/[1554763502304] Chromosome 1 Assembly B37 plots/[1554763502304] Chromosome 1 Assembly B37 [PCA]/[1554763502304] Chromosome 1 Assembly B37 [PCA] - PC[1][2].png)
![[1554763502304] Chromosome 1 Assembly B37 [UMAP] - PC[1][2] alt text](/kelvindecosta/genomic-data-visualization-pipeline/raw/master/plots/[1554763502304] Chromosome 1 Assembly B37 plots/[1554763502304] Chromosome 1 Assembly B37 [UMAP]/[1554763502304] Chromosome 1 Assembly B37 [UMAP] - PC[1][2].png)
![[1554763502304] Chromosome 1 Assembly B37 [UMAP_PCA] - PC[1][2] alt text](/kelvindecosta/genomic-data-visualization-pipeline/raw/master/plots/[1554763502304] Chromosome 1 Assembly B37 plots/[1554763502304] Chromosome 1 Assembly B37 [UMAP_PCA]/[1554763502304] Chromosome 1 Assembly B37 [UMAP_PCA] - PC[1][2].png)