Summary:
Signed-off-by: Ryan Russell <[email protected]>
Various readability fixes focused on `.md` files:
- Grammar
- Fix some incorrect command references to `distributed_kmeans.py`
- Style the markdown bash code snippets so they render with proper formatting
Attempted to put a lot of little things into one PR and commit; let me know if any mods are needed!
Best,
Ryan
Pull Request resolved: facebookresearch/faiss#2378
Reviewed By: alexanderguzhva
Differential Revision: D37717671
Pulled By: mdouze
fbshipit-source-id: 0039192901d98a083cd992e37f6b692d0572103a

benchs/distributed_ondisk/README.md: 17 additions, 17 deletions

@@ -10,15 +10,15 @@ Hopefully, changing to another type of scheduler should be quite straightforward

 ## Distributed k-means

-To cluster 500M vectors to 10M centroids, it is useful to have a distriubuted k-means implementation.
+To cluster 500M vectors to 10M centroids, it is useful to have a distributed k-means implementation.
 The distribution simply consists in splitting the training vectors across machines (servers) and have them do the assignment.
 The master/client then synthesizes the results and updates the centroids.

 The distributed k-means implementation here is based on 3 files:

 - [`distributed_kmeans.py`](distributed_kmeans.py) contains the k-means implementation.
 The main loop of k-means is re-implemented in python but follows closely the Faiss C++ implementation, and should not be significantly less efficient.
-It relies on a `DatasetAssign` object that does the assignement to centrtoids, which is the bulk of the computation.
+It relies on a `DatasetAssign` object that does the assignment to centroids, which is the bulk of the computation.
 The object can be a Faiss CPU index, a GPU index or a set of remote GPU or CPU indexes.

 - [`run_on_cluster.bash`](run_on_cluster.bash) contains the shell code to run the distributed k-means on a cluster.
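
As a rough illustration of the split this section describes (servers do the assignment, the client synthesizes the results and updates the centroids), here is a minimal sketch. It is not the actual `distributed_kmeans.py` code: the `LocalAssign` class, the `distributed_kmeans_sketch` function and the naive initialization are simplified stand-ins for the real `DatasetAssign` interface.

```python
import numpy as np
import faiss

class LocalAssign:
    """Stand-in for a DatasetAssign-style object: it owns one split of the
    training vectors and assigns them to the current centroids."""
    def __init__(self, x):
        self.x = np.ascontiguousarray(x, dtype="float32")

    def assign_to(self, centroids):
        # The bulk of the computation: nearest-centroid search. Here a flat CPU
        # index; the real object can be a GPU index or a set of remote indexes.
        index = faiss.IndexFlatL2(centroids.shape[1])
        index.add(centroids)
        _, assign = index.search(self.x, 1)
        assign = assign.ravel()
        # Return per-centroid sums and counts for the client-side update.
        k = centroids.shape[0]
        sums = np.zeros_like(centroids, dtype="float64")
        counts = np.zeros(k, dtype="int64")
        np.add.at(sums, assign, self.x)
        np.add.at(counts, assign, 1)
        return counts, sums


def distributed_kmeans_sketch(splits, k, niter=10):
    """Client-side loop: the splits assign, the client updates the centroids."""
    d = splits[0].x.shape[1]
    # Naive initialization from the first split (assumes it holds >= k points).
    centroids = splits[0].x[:k].copy()
    for _ in range(niter):
        counts = np.zeros(k, dtype="int64")
        sums = np.zeros((k, d), dtype="float64")
        for s in splits:  # in the real setup these calls go to remote servers
            c, sm = s.assign_to(centroids)
            counts += c
            sums += sm
        nonempty = counts > 0
        centroids[nonempty] = (sums[nonempty] / counts[nonempty, None]).astype("float32")
    return centroids
```

In the real setup one such object lives per server and the `assign_to` calls go over the rpc layer instead of running locally.
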
@@ -30,7 +30,7 @@ The file is also assumed to be accessible from all server machines with eg. a di

 ### Local tests

-Edit `distibuted_kmeans.py` to point `testdata` to your local copy of the dataset.
+Edit `distributed_kmeans.py` to point `testdata` to your local copy of the dataset.

 Then, 4 levels of sanity check can be run:
 ```bash
@@ -47,7 +47,7 @@ The output should look like [This gist](https://gist.github.com/mdouze/ffa01fe66

 ### Distributed sanity check

-To run the distributed k-means, `distibuted_kmeans.py` has to be run both on the servers (`--server` option) and client sides (`--client` option).
+To run the distributed k-means, `distributed_kmeans.py` has to be run both on the servers (`--server` option) and client sides (`--client` option).
 Edit the top of `run_on_cluster.bash` to set the path of the data to cluster.

 Sanity checks can be run with
@@ -56,7 +56,7 @@ Sanity checks can be run with
 bash run_on_cluster.bash test_kmeans_0
 # using all the machine's GPUs
 bash run_on_cluster.bash test_kmeans_1
-#distrbuted run, with one local server per GPU
+#distributed run, with one local server per GPU
 bash run_on_cluster.bash test_kmeans_2
 ```
 The test `test_kmeans_2` simulates a distributed run on a single machine by starting one server process per GPU and connecting to the servers via the rpc protocol.
@@ -67,10 +67,10 @@ The output should look like [this gist](https://gist.github.com/mdouze/5b2dc69b7
 ### Distributed run

 The way the script can be distributed depends on the cluster's scheduling system.
-Here we use Slurm, but it should be relatively easy to adapt to any scheduler that can allocate a set of matchines and start the same exectuable on all of them.
+Here we use Slurm, but it should be relatively easy to adapt to any scheduler that can allocate a set of machines and start the same executable on all of them.

 The command
-```
+```bash
 bash run_on_cluster.bash slurm_distributed_kmeans
 ```
 asks SLURM for 5 machines with 4 GPUs each with the `srun` command.
@@ -90,12 +90,12 @@ The output should look like [this gist](https://gist.github.com/mdouze/8d25e89fb
 For the real run, we run the clustering on 50M vectors to 1M centroids.
 This is just a matter of using as many machines / GPUs as possible in setting the output centroids with the `--out filename` option.
 0: writing centroids to /checkpoint/matthijs/ondisk_distributed/1M_centroids.npy
 ```
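
The centroids written above are the input to the next step, building the empty trained index (the hunk below touches the section that points to `make_trained_index.py` for this). A minimal sketch of what that step can look like, assuming a flat coarse quantizer and an arbitrary PQ configuration; this is not the actual `make_trained_index.py` code and the file names are placeholders:

```python
import numpy as np
import faiss

# Placeholder path; the real run writes /checkpoint/.../1M_centroids.npy as shown above.
centroids = np.load("1M_centroids.npy").astype("float32")
nlist, d = centroids.shape

# Pre-populate the coarse quantizer with the k-means centroids.
# Faiss skips coarse-quantizer training when the quantizer already holds nlist
# vectors, so train() below only fits the PQ codebooks.
quantizer = faiss.IndexFlatL2(d)
quantizer.add(centroids)
index = faiss.IndexIVFPQ(quantizer, d, nlist, 32, 8)  # 32 sub-quantizers x 8 bits (assumed)

xt = np.random.rand(100_000, d).astype("float32")  # stand-in for real training vectors
index.train(xt)
faiss.write_index(index, "trained.faissindex")
```
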
@@ -121,25 +121,25 @@ This is performed by the script [`make_trained_index.py`](make_trained_index.py)

 ## Building the index by slices

-We call the slices "vslices" as they are vertical slices of the big matrix, see explanation in the wiki section [Split across datanbase partitions](https://github.com/facebookresearch/faiss/wiki/Indexing-1T-vectors#split-across-database-partitions).
+We call the slices "vslices" as they are vertical slices of the big matrix, see explanation in the wiki section [Split across database partitions](https://github.com/facebookresearch/faiss/wiki/Indexing-1T-vectors#split-across-database-partitions).

 The script [make_index_vslice.py](make_index_vslice.py) makes an index for a subset of the vectors of the input data and stores it as an independent index.
 There are 200 slices of 5M vectors each for Deep1B.
 It can be run in a brute-force parallel fashion, there is no constraint on ordering.
 To run the script in parallel on a slurm cluster, use:
-```
+```bash
 bash run_on_cluster.bash make_index_vslices
 ```
 For a real dataset, the data would be read from a DBMS.
 In that case, reading the data and indexing it in parallel is worthwhile because reading is very slow.

-## Splitting accross inverted lists
+## Splitting across inverted lists

 The 200 slices need to be merged together.
 This is done with the script [merge_to_ondisk.py](merge_to_ondisk.py), that memory maps the 200 vertical slice indexes, extracts a subset of the inverted lists and writes them to a contiguous horizontal slice.
 We slice the inverted lists into 50 horizontal slices.
 This is run with
-```
+```bash
 bash run_on_cluster.bash make_index_hslices
 ```

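
For a sense of what merging sharded IVF indexes into an on-disk index involves, here is a hedged sketch built on the `faiss.contrib.ondisk` helper (described further down in `contrib/README.md`). It merges all shards into a single inverted-list file rather than extracting per-job subsets of lists, so it is a simplification of what `merge_to_ondisk.py` does, and all file names are placeholders.

```python
import faiss
from faiss.contrib.ondisk import merge_ondisk

# Start from the empty trained index and the 200 vslice shard files (names assumed).
index = faiss.read_index("trained.faissindex")
shard_fnames = ["vslice_%03d.faissindex" % i for i in range(200)]

# Move the inverted lists of all shards into one memory-mappable .ivfdata file;
# the written index then only stores pointers into that file.
merge_ondisk(index, shard_fnames, "merged.ivfdata")
faiss.write_index(index, "populated.faissindex")
```
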
@@ -150,11 +150,11 @@ The horizontal slices need to be loaded in the right order and combined into an
 This is done in the [combined_index.py](combined_index.py) script.
 It provides a `CombinedIndexDeep1B` object that contains an index object that can be searched.
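
Searching such an on-disk index from Python looks roughly like the sketch below; it is not the `CombinedIndexDeep1B` class itself, and the file name and `nprobe` value are assumptions.

```python
import numpy as np
import faiss

# IO_FLAG_ONDISK_SAME_DIR makes Faiss look for the .ivfdata file next to the
# index file instead of at the absolute path recorded inside it.
index = faiss.read_index("populated.faissindex", faiss.IO_FLAG_ONDISK_SAME_DIR)
index.nprobe = 16  # number of inverted lists visited per query (assumed)

xq = np.random.rand(5, index.d).astype("float32")  # placeholder queries
D, I = index.search(xq, 10)
print(I)
```
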

contrib/README.md: 2 additions, 2 deletions

@@ -19,7 +19,7 @@ A very simple Remote Procedure Call library, where function parameters and resul
 ### client_server.py

 The server handles requests to a Faiss index. The client calls the remote index.
-This is mainly to shard datasets over several machines, see [Distributd index](https://github.com/facebookresearch/faiss/wiki/Indexes-that-do-not-fit-in-RAM#distributed-index)
+This is mainly to shard datasets over several machines, see [Distributed index](https://github.com/facebookresearch/faiss/wiki/Indexes-that-do-not-fit-in-RAM#distributed-index)

 ### ondisk.py

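
A rough usage sketch for this client/server pair, assuming the module exposes `run_index_server` and `ClientIndex` as in recent Faiss versions; host names, ports and the vector dimension below are placeholders:

```python
import numpy as np
import faiss
from faiss.contrib.client_server import run_index_server, ClientIndex

# On each server machine, serve its shard of the data (blocking call):
#     index = faiss.read_index("shard_0.faissindex")
#     run_index_server(index, 12345)

# On the client, the set of remote shards is queried like a single index.
client = ClientIndex([("server0", 12345), ("server1", 12345)])
d = 96  # vector dimension of the shards (assumed)
xq = np.random.rand(8, d).astype("float32")
D, I = client.search(xq, 10)
```
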
@@ -52,7 +52,7 @@ A few functions to override the coarse quantizer in IVF, providing additional fl

 (may require h5py)

-Defintion of how to access data for some standard datsets.
+Definition of how to access data for some standard datasets.
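
As a pointer to how those dataset definitions are typically consumed, a short hedged example; `DatasetSIFT1M` is one of the datasets defined in `faiss.contrib.datasets`, and the sketch assumes the data files are already in the directory the module expects:

```python
from faiss.contrib.datasets import DatasetSIFT1M

# Loads vectors from the standard SIFT1M files (downloaded separately).
ds = DatasetSIFT1M()
xt = ds.get_train()        # training vectors
xb = ds.get_database()     # database vectors to index
xq = ds.get_queries()      # query vectors
gt = ds.get_groundtruth()  # ground-truth nearest neighbors for evaluation
print(xb.shape, xq.shape, gt.shape)
```
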