
Commit bfd1cee

Docs grammar/style cleanup

Signed-off-by: Timur Rvachov <[email protected]>

Parent: 192e537

File tree

8 files changed: +129, −136 lines

docs/docs/datasets/CELLxGENE.md

Lines changed: 6 additions & 6 deletions

@@ -6,9 +6,9 @@

 ## Dataset attributes of version 2023-12-15

-Data was downloaded using the [CELLxGENE Discover Census version `2023-12-15`](https://chanzuckerberg.github.io/cellxgene-census/cellxgene_census_docsite_data_release_info.html#lts-2023-12-15). We first downloaded cellxgene census version 2023-12-15 using the `cellxgene_census` python API. We limited cell data to `organism=”Homo sapiens”`, with a non “na” `suspension_type`, `is_primary_data=True`, and `disease=”normal”` to limit to non-diseased tissues that are also the primary data source per cell to make sure that cells are only included once in the download. We tracked metadata including “assay”, “sex”, “development_stage”, “tissue_general”, “dataset_id” and “self_reported_ethnicity”. The metadata “assay”, “tissue_general”, and “dataset_id” were used to construct dataset splits into train, validation, and test sets. The training set represented 99% of the downloaded cells. We partitioned the data by dataset_id into a train set (99%) and a hold-out set (1%), to make sure that the hold-out datasets were independently collected single cell experiments, which helps evaluate generalizability to new future datasets. In this training split, we made sure that all “assay” and “tissue_general” labels were present in the training set so that our model would have maximal visibility into different tissues and assay biases. Finally the 1% hold-out set was split further into a validation and test set. This final split was mostly done randomly by cell, however we set aside a full dataset into the test split so that we could evaluate performance after training on a completely unseen dataset, including when monitoring the validation loss during training.
+Data was downloaded using the [CELLxGENE Discover Census version `2023-12-15`](https://chanzuckerberg.github.io/cellxgene-census/cellxgene_census_docsite_data_release_info.html#lts-2023-12-15). We first downloaded CELLxGENE census version 2023-12-15 using the `cellxgene_census` python API. We limited cell data to `organism="Homo sapiens"`, with a non "na" `suspension_type`, `is_primary_data=True`, and `disease="normal"` to limit to non-diseased tissues that are also the primary data source per cell to make sure that cells are only included once in the download. We tracked metadata including "assay", "sex", "development_stage", "tissue_general", "dataset_id" and "self_reported_ethnicity". The metadata "assay", "tissue_general", and "dataset_id" were used to construct dataset splits into train, validation, and test sets. The training set represented 99% of the downloaded cells. We partitioned the data by dataset_id into a train set (99%) and a hold-out set (1%), to make sure that the hold-out datasets were independently collected single cell experiments, which helps evaluate generalizability to new future datasets. In this training split, we made sure that all "assay" and "tissue_general" labels were present in the training set so that our model would have maximal visibility into different tissues and assay biases. Finally the 1% hold-out set was split further into a validation and test set. This final split was mostly done randomly by cell, however we set aside a full dataset into the test split so that we could evaluate performance after training on a completely unseen dataset, including when monitoring the validation loss during training.

-These parameters resulted in 23.87 Million single cells collected from a variety of public datasets, all hosted by CZI cell x gene census. After the splitting procedure we had:
+These parameters resulted in 23.87 Million single cells collected from a variety of public datasets, all hosted by CZI CELLxGENE census. After the splitting procedure we had:

 - 23.64 Million cells in the training split
 - 0.13 Million cells in the validation split
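For readers following along, the selection described in the paragraph above corresponds roughly to a census query of the following shape — a minimal sketch, not the authors' actual pipeline. The filter values and tracked metadata columns come from the text; the call signature is assumed from the published `cellxgene_census` examples and may differ between package versions:

```python
import cellxgene_census

# Open the same long-term-support census release named in the docs.
with cellxgene_census.open_soma(census_version="2023-12-15") as census:
    adata = cellxgene_census.get_anndata(
        census,
        organism="Homo sapiens",
        # SOMA value filter mirroring the selection criteria in the text:
        # primary data only, non-diseased tissue, suspension_type not "na".
        obs_value_filter=(
            'is_primary_data == True and disease == "normal" '
            'and suspension_type != "na"'
        ),
        # Metadata columns the docs say were tracked for splitting/analysis.
        column_names={
            "obs": [
                "assay", "sex", "development_stage", "tissue_general",
                "dataset_id", "self_reported_ethnicity",
            ]
        },
    )
```

The subsequent 99/1 train/hold-out partition then groups cells by the `dataset_id` column returned here, so that hold-out datasets are entire, independently collected experiments rather than random cells.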
@@ -53,11 +53,11 @@ Different assays have different ranges of reported gene measurements. On the low

 #### Dataset distribution

-Dataset (eg a publication that produces data and uploads to cellxgene) leads to known batch effects due to different handling proceedures, collection procedures, etc. We stratify our training vs hold-out split by this covariate for this reason. Exploring the breakdown of datasets we see that the top 10 datsets represent approximately 10 million cells of the full cellxgene datset. The largest dataset alone has 4 million cells.
+Dataset (e.g., a publication that produces data and uploads to CELLxGENE) leads to known batch effects due to different handling procedures, collection procedures, etc. We stratify our training vs hold-out split by this covariate for this reason. Exploring the breakdown of datasets we see that the top 10 datasets represent approximately 10 million cells of the full CELLxGENE dataset. The largest dataset alone has 4 million cells.

 ![Top datasets make up a large fraction of cells](../assets/old_images/cellxgene/num_cells_by_dataset.png)

-Looking at the makeup of these top datasets, we see that most represent single tissue categories predominately. Most of these tend to be nervous system datsets with the exception of one which is balanced between many cell types.
+Looking at the makeup of these top datasets, we see that most represent single tissue categories predominately. Most of these tend to be nervous system datasets with the exception of one which is balanced between many cell types.
 ![Top 9 datasets are largely biased toward single cell types](../assets/old_images/cellxgene/top9_datasets_tissue_distribution.png)

 ## References
@@ -87,7 +87,7 @@ Our training, validation and test data, including subsets made available for tes
 * Publication Reference: Cheng et al. (2018) Cell Reports; Publication: https://doi.org/10.1016/j.celrep.2018.09.006 Dataset Version: https://datasets.cellxgene.cziscience.com/912d943b-9060-4fd3-a12c-ad641a89f0e4.h5ad curated and distributed by CZ CELLxGENE Discover in Collection: https://cellxgene.cziscience.com/collections/43d4bb39-21af-4d05-b973-4c1fed7b916c
 * Publication Reference: Cowan et al. (2020) Cell; Publication: https://doi.org/10.1016/j.cell.2020.08.013 Dataset Version: https://datasets.cellxgene.cziscience.com/b1989183-5808-46ab-87f5-978febb2d26e.h5ad curated and distributed by CZ CELLxGENE Discover in Collection: https://cellxgene.cziscience.com/collections/2f4c738f-e2f3-4553-9db2-0582a38ea4dc
 * Publication Reference: Cowan et al. (2020) Cell; Publication: https://doi.org/10.1016/j.cell.2020.08.013 Dataset Version: https://datasets.cellxgene.cziscience.com/c0d3867e-1a7b-4e57-af62-c563f1934226.h5ad curated and distributed by CZ CELLxGENE Discover in Collection: https://cellxgene.cziscience.com/collections/2f4c738f-e2f3-4553-9db2-0582a38ea4dc
-* Publication Reference: Dom\u00ednguez Conde et al. (2022) Science; Publication: https://doi.org/10.1126/science.abl5197 Dataset Version: https://datasets.cellxgene.cziscience.com/08f58b32-a01b-4300-8ebc-2b93c18f26f7.h5ad curated and distributed by CZ CELLxGENE Discover in Collection: https://cellxgene.cziscience.com/collections/62ef75e4-cbea-454e-a0ce-998ec40223d3
+* Publication Reference: Domínguez Conde et al. (2022) Science; Publication: https://doi.org/10.1126/science.abl5197 Dataset Version: https://datasets.cellxgene.cziscience.com/08f58b32-a01b-4300-8ebc-2b93c18f26f7.h5ad curated and distributed by CZ CELLxGENE Discover in Collection: https://cellxgene.cziscience.com/collections/62ef75e4-cbea-454e-a0ce-998ec40223d3
 * Publication Reference: Easter et al. (2024) Nat Commun; Publication: https://doi.org/10.1038/s41467-024-49037-y Dataset Version: https://datasets.cellxgene.cziscience.com/221dff56-a47d-4563-90ed-51b60e2f16d5.h5ad curated and distributed by CZ CELLxGENE Discover in Collection: https://cellxgene.cziscience.com/collections/71f4bccf-53d4-4c12-9e80-e73bfb89e398
 * Publication Reference: Egozi et al. (2021) Nat Med; Publication: https://doi.org/10.1038/s41591-021-01586-1 Dataset Version: https://datasets.cellxgene.cziscience.com/e3a84fef-b6df-49b2-b0ca-ecaf444773ec.h5ad curated and distributed by CZ CELLxGENE Discover in Collection: https://cellxgene.cziscience.com/collections/7651ac1a-f947-463a-9223-a9e408a41989
 * Publication Reference: Elmentaite et al. (2020) Developmental Cell; Publication: https://doi.org/10.1016/j.devcel.2020.11.010 Dataset Version: https://datasets.cellxgene.cziscience.com/3aedefc0-401a-4ee8-a1b5-a0ffc20e1ff2.h5ad curated and distributed by CZ CELLxGENE Discover in Collection: https://cellxgene.cziscience.com/collections/17481d16-ee44-49e5-bcf0-28c0780d8c4a
@@ -282,7 +282,7 @@ Our training, validation and test data, including subsets made available for tes
 * Publication Reference: Smillie et al. (2019) Cell; Publication: https://doi.org/10.1016/j.cell.2019.06.029 Dataset Version: https://datasets.cellxgene.cziscience.com/6c483976-30de-4835-97f0-2b9bc93614e7.h5ad curated and distributed by CZ CELLxGENE Discover in Collection: https://cellxgene.cziscience.com/collections/33d19f34-87f5-455b-8ca5-9023a2e5453d
 * Publication Reference: Smith et al. (2021) Proc. Natl. Acad. Sci. U.S.A.; Publication: https://doi.org/10.1073/pnas.2023333118 Dataset Version: https://datasets.cellxgene.cziscience.com/bf50dbfb-9ca0-4f0d-8deb-a1a810a0e313.h5ad curated and distributed by CZ CELLxGENE Discover in Collection: https://cellxgene.cziscience.com/collections/e02201d7-f49f-401f-baf0-1eb1406546c0
 * Publication Reference: Smith et al. (2021) Proc. Natl. Acad. Sci. U.S.A.; Publication: https://doi.org/10.1073/pnas.2023333118 Dataset Version: https://datasets.cellxgene.cziscience.com/ff7778bf-7a65-4d23-a9f4-b26c47926c28.h5ad curated and distributed by CZ CELLxGENE Discover in Collection: https://cellxgene.cziscience.com/collections/e02201d7-f49f-401f-baf0-1eb1406546c0
-* Publication Reference: Sol\u00e9-Boldo et al. (2020) Commun Biol; Publication: https://doi.org/10.1038/s42003-020-0922-4 Dataset Version: https://datasets.cellxgene.cziscience.com/bc8d7152-3b69-4153-9314-7342ae58fbde.h5ad curated and distributed by CZ CELLxGENE Discover in Collection: https://cellxgene.cziscience.com/collections/c353707f-09a4-4f12-92a0-cb741e57e5f0
+* Publication Reference: Solé-Boldo et al. (2020) Commun Biol; Publication: https://doi.org/10.1038/s42003-020-0922-4 Dataset Version: https://datasets.cellxgene.cziscience.com/bc8d7152-3b69-4153-9314-7342ae58fbde.h5ad curated and distributed by CZ CELLxGENE Discover in Collection: https://cellxgene.cziscience.com/collections/c353707f-09a4-4f12-92a0-cb741e57e5f0
 * Publication Reference: Stephenson et al. (2021) Nat Med; Publication: https://doi.org/10.1038/s41591-021-01329-2 Dataset Version: https://datasets.cellxgene.cziscience.com/46586a98-b75d-4557-9cc4-839fc28e67d5.h5ad curated and distributed by CZ CELLxGENE Discover in Collection: https://cellxgene.cziscience.com/collections/ddfad306-714d-4cc0-9985-d9072820c530
 * Publication Reference: Stewart et al. (2019) Science; Publication: https://doi.org/10.1126/science.aat5031 Dataset Version: https://datasets.cellxgene.cziscience.com/40ebb8e4-1a25-4a33-b8ff-02d1156e4e9b.h5ad curated and distributed by CZ CELLxGENE Discover in Collection: https://cellxgene.cziscience.com/collections/120e86b4-1195-48c5-845b-b98054105eec
 * Publication Reference: Stewart et al. (2019) Science; Publication: https://doi.org/10.1126/science.aat5031 Dataset Version: https://datasets.cellxgene.cziscience.com/fe7e4408-7390-4f93-95aa-ffe472843421.h5ad curated and distributed by CZ CELLxGENE Discover in Collection: https://cellxgene.cziscience.com/collections/120e86b4-1195-48c5-845b-b98054105eec

docs/docs/datasets/index.md

Lines changed: 1 addition & 1 deletion

@@ -4,7 +4,7 @@ The BioNeMo Framework provides access to a variety of high-quality datasets for

 | **Dataset** | **Modality** | **Uses** |
 | -------------------------------------------------------- | -------------- | ------------------------------------------------ |
-| [CELLxGENE](./CELLxGENE.md) | Single Cell | Single-Cell Gene Expression
+| [CELLxGENE](./CELLxGENE.md) | Single Cell | Single-Cell Gene Expression |
 | [UniProt](./uniprot.md) | Protein | Protein Sequence and Function Analysis |

 For more information about the datasets included in the BioNeMo Framework, refer to the Dataset Cards linked in the table above or the original sources referenced in the respective dataset descriptions.

docs/docs/datasets/uniprot.md

Lines changed: 3 additions & 3 deletions

@@ -21,8 +21,8 @@ randomly chosen UniRef90 sequence from each.

 ## Data Availability

-Two versions of the dataset are distributed, a full training dataset (~80Gb) and a 10,000 UniRef50 cluster random slice
-(~150Mb). To load and use the sanity dataset, the [bionemo.core.data.load][bionemo.core.data.load.load] function
+Two versions of the dataset are distributed, a full training dataset (~80GB) and a 10,000 UniRef50 cluster random slice
+(~150MB). To load and use the sanity dataset, the [bionemo.core.data.load][bionemo.core.data.load.load] function
 can be used to materialize the sanity dataset in the BioNeMo2 cache directory:

 ```python

@@ -36,7 +36,7 @@ sanity_data_dir = load("esm2/testdata_esm2_pretrain:2.0")
 * [Sanity Dataset](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/clara/resources/esm2_pretrain_nemo2_testdata/files)
 * [Full Dataset]

-## Reference
+## References

 1. UniProt Consortium. (2023). UniProt: The universal protein knowledgebase in 2023. Nucleic Acids Research, 51(D1),
 D523–D531. doi:10.1093/nar/gkac1052
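The code block elided between the two hunks above is the sanity-dataset loader; its visible fragments resolve to roughly the following usage — a minimal sketch, assuming the import path implied by the `bionemo.core.data.load.load` cross-reference rather than a verified API:

```python
# Minimal sketch; the import path is inferred from the docs cross-reference
# and should be checked against the installed bionemo-core package.
from bionemo.core.data.load import load

# Downloads the ~150MB sanity slice on first use, caching it in the BioNeMo2
# cache directory, and returns the local path it was materialized to.
sanity_data_dir = load("esm2/testdata_esm2_pretrain:2.0")
print(sanity_data_dir)
```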

docs/docs/models/ESM-2/index.md

Lines changed: 20 additions & 20 deletions

@@ -14,9 +14,9 @@ These models are ready for commercial use.

 ### Third-Party Community Consideration

-This model is not owned or developed by NVIDIA. This model has been developed and built to a third-partys requirements
+This model is not owned or developed by NVIDIA. This model has been developed and built to a third-party's requirements
 for this application and use case [1]; see link to [Non-NVIDIA Model Card for ESM-2 3B model](
-https://huggingface.co/facebook/esm2_t36_3B_UR50D) and [non-NVIDIA Model Card for ESM-2 650M model](
+https://huggingface.co/facebook/esm2_t36_3B_UR50D) and [Non-NVIDIA Model Card for ESM-2 650M model](
 https://huggingface.co/facebook/esm2_t33_650M_UR50D)

 ### References
@@ -27,7 +27,7 @@ Santos Costa, A., 2023. Evolutionary-scale prediction of atomic-level protein st

 [2] "UniProt: the universal protein knowledgebase in 2021." Nucleic acids research 49, no. D1 (2021): D480-D489.

-[3] Devlin, J., Chang, M.W., Lee, K. and Toutanova, K., 2018. Bert: Pre-training of deep bidirectional transformers for
+[3] Devlin, J., Chang, M.W., Lee, K. and Toutanova, K., 2018. BERT: Pre-training of deep bidirectional transformers for
 language understanding. arXiv preprint arXiv:1810.04805.

 ### Model Architecture
@@ -47,7 +47,7 @@ length 1022. Longer sequences are automatically truncated to this length.

 ### Output

-**Output Type(s):** Embeddings (Amino-acid and sequence-level)
+**Output Type(s):** Embeddings (Amino acid and sequence-level)

 **Output Parameters:** 1D

@@ -63,15 +63,15 @@ acid.

 **Supported Hardware Microarchitecture Compatibility**

-* [Ampere]
-* [Hopper]
-* [Volta]
+* NVIDIA Ampere
+* NVIDIA Hopper
+* NVIDIA Volta

 **[Preferred/Supported] Operating System(s)**

-* [Linux]
+* Linux

-### Model Version(s)
+### Model Versions

 * [esm2/650m:2.0](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/clara/models/esm2nv650m)
 * [esm2/3b:2.0](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/clara/models/esm2nv3b)
@@ -81,30 +81,30 @@ acid.

 ### Training Dataset

 Original ESM-2 checkpoints from HuggingFace were trained with the UniProt 2021_04 sequence database. For more details on
-the training dataset, see Lin *et al.* 2023. The train / test splits used by the original authors were not distributed.
+the training dataset, see Lin *et al.* 2023. The train/test splits used by the original authors were not distributed.
 A pre-training database compiled by NVIDIA following a similar approach is described in [UniProt
-Dataset](../datasets/uniprot.md).
+Dataset](../../datasets/uniprot.md).

 ### Inference

 **Engine:** BioNeMo, NeMo

 **Test Hardware**

-* [Ampere]
-* [Hopper]
-* [Volta]
+* NVIDIA Ampere
+* NVIDIA Hopper
+* NVIDIA Volta

 ## License

-ESM-2 is as provided under the Apache 2.0 license.
+ESM-2 is provided under the Apache 2.0 license.

 ## Competitive Benchmarking

 ### Accuracy

 A validation set of 328,360 UniRef50 representative sequences were randomly selected from UniRef 2024_03 (see [UniProt
-Dataset](../datasets/uniprot.md)). This validation set was used to ensure that the output of BioNeMo-converted
+Dataset](../../datasets/uniprot.md)). This validation set was used to ensure that the output of BioNeMo-converted
 checkpoints is consistent with their outputs when evaluated with the HuggingFace Transformers library.

 | Checkpoint | HuggingFace | BioNeMo2 | Lin *et al.* 2023 |
@@ -123,21 +123,21 @@ checkpoints is consistent with their outputs when evaluated with the HuggingFace

 ![ESM-2 Single-Device Training Performance](../../assets/images/esm2/esm2_single_node_training_perf.png)

-The pure-pytorch baseline (compiled with `torch.compile()`) raised an out-of-memory error for batch sizes larger than 16
-at the ESM2-650M model size. The `bionemo2` model could handle batch sizes of 46, reaching a model flops utilization of
+The pure-PyTorch baseline (compiled with `torch.compile()`) raised an out-of-memory error for batch sizes larger than 16
+at the ESM2-650M model size. The `bionemo2` model could handle batch sizes of 46, reaching a model FLOPs utilization of
 59.2% on an NVIDIA A100.

 #### Model Scaling

 ![ESM-2 Model Scaling](../../assets/images/esm2/esm2_model_scaling.png)

 Training ESM-2 at the 650M, 3B, and 15B model variants show improved performance with the BioNeMo2 framework over the
-pure-pytorch baseline. These experiments were conducted on 16x NVIDIA A100 or 16x NVIDIA H100 GPUs split across two
+pure-PyTorch baseline. These experiments were conducted on 16x NVIDIA A100 or 16x NVIDIA H100 GPUs split across two
 nodes. <sup>*</sup>*Note:* 15B model variants were trained on 64 GPUs with the BioNeMo2 framework.

 #### Device Scaling

 ![ESM-2 Device Scaling](../../assets/images/esm2/esm2_device_scaling.png)

 Training ESM-3B on 256 NVIDIA A100s on 32 nodes achieved 96.85% of the theoretical linear throughput expected from
-extrapolating single-node (8 GPU) performance, representing a model flops utilization of 60.6% at 256 devices.
+extrapolating single-node (8 GPU) performance, representing a model FLOPs utilization of 60.6% at 256 devices.
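For readers checking the MFU figures quoted in these performance sections, model FLOPs utilization is conventionally achieved model FLOPs per second divided by aggregate peak accelerator FLOPs. A minimal sketch with placeholder inputs (none of these values are measurements from the benchmarks above):

```python
# Illustrative MFU computation; all inputs below are placeholders.
def mfu(tokens_per_second: float, flops_per_token: float,
        num_gpus: int, peak_flops_per_gpu: float) -> float:
    """Model FLOPs utilization: achieved model FLOPs / aggregate peak FLOPs."""
    return (tokens_per_second * flops_per_token) / (num_gpus * peak_flops_per_gpu)

# Example: a ~650M-parameter model at the common ~6N FLOPs-per-token estimate
# (forward + backward) on one A100 (312 TFLOPS peak BF16), with a hypothetical
# throughput of 20,000 tokens/s.
print(f"{mfu(20_000, 6 * 650e6, 1, 312e12):.1%}")  # prints "25.0%"
```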
