Indexer as hierarchicalkmeans cannot get the correct label partition

```python
import numpy as np
import scipy.sparse as smat
from pecos.xmc import Indexer

# Ten 2-D label embeddings that form four visually separated groups.
label_embeddings = np.array(
    [
        [-9.21174158, 5.11299655],
        [-8.59250195, -1.11406841],
        [-4.30549653, 3.99404334],
        [-4.43811548, 4.68773409],
        [-6.00330942, 7.96222741],
        [-6.87172864, 8.01769469],
        [-8.86330667, 4.96141572],
        [-4.3774397, 4.60103839],
        [-6.42845615, 7.20886612],
        [-9.69681323, -2.32416397],
    ],
    dtype=np.float32,
)
# Ground-truth cluster ID for each label.
target_label_clusters = np.array([0, 2, 3, 3, 1, 1, 0, 3, 1, 2])

# Convert to a sparse CSR matrix before handing it to the indexer.
label_embeddings = smat.csr_matrix(label_embeddings)
chain = Indexer.gen(
    feat_mat=label_embeddings,
    indexer_type="hierarchicalkmeans",
    max_leaf_size=3,
    spherical=False,
)
```
I made up a small `label_embeddings` matrix to make the problem easy to see. Plotted in 2D, the points should be partitioned like this:
![kmeans_cluster](https://github.com/amzn/pecos/assets/5330101/0b42347d-735f-49bd-bc7a-29b623e924fc)

However, when I set a breakpoint at [codes = clib.run_clustering](https://github.com/amzn/pecos/blob/mainline/pecos/xmc/base.py#L207C35-L207C35), the returned `codes` were `[1, 0, 1, 3, 2, 2, 1, 3, 3, 0]`.

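Compare that with `target_label_clusters = [0, 2, 3, 3, 1, 1, 0, 3, 1, 2]`. Cluster IDs are arbitrary, so the two arrays only need to agree up to a renaming: going by the majority of each group, target clusters 0, 1, 2, 3 correspond to codes 1, 2, 0, 3. Even with that renaming, two labels end up in the wrong group: `codes[2]` is 1 but should be 3, and `codes[8]` is 3 but should be 2.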
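To make the comparison independent of how the cluster IDs happen to be numbered, the two labelings can be cross-tabulated. The following is only a minimal sketch using NumPy; the helper name `contingency` and the "majority code per target cluster" rule are my own illustrative choices, not part of PECOS.

```python
import numpy as np

target = np.array([0, 2, 3, 3, 1, 1, 0, 3, 1, 2])  # expected partition
codes = np.array([1, 0, 1, 3, 2, 2, 1, 3, 3, 0])   # values seen at the breakpoint


def contingency(a, b):
    """Hypothetical helper: table[i, j] = number of items with a == i and b == j."""
    table = np.zeros((a.max() + 1, b.max() + 1), dtype=int)
    np.add.at(table, (a, b), 1)
    return table


table = contingency(target, codes)

# The two labelings describe the same partition only if every row and
# every column of the table has exactly one nonzero cell.
nonzero = table > 0
same_partition = bool((nonzero.sum(axis=0) == 1).all() and (nonzero.sum(axis=1) == 1).all())

# Map each target cluster to the code most of its members received,
# then list the items whose code differs from that majority.
majority_code = table.argmax(axis=1)
misplaced = np.flatnonzero(codes != majority_code[target])

print(table)
print("same partition:", same_partition)
print("misplaced items:", misplaced.tolist())
```

Rows of the table are target clusters and columns are returned codes, so any row with more than one nonzero entry is a target group that was split across clusters.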

I can't figure out why such a simple set of features isn't partitioned correctly.
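One way to quantify "correctly" is to compare the within-cluster sum of squared distances of the two partitions. This is a rough sketch that assumes the plain squared-Euclidean k-means objective is the right yardstick for `spherical=False`; `within_cluster_sse` is a hypothetical helper, and the check says nothing about why the indexer returns the partition it does.

```python
import numpy as np

X = np.array(
    [
        [-9.21174158, 5.11299655],
        [-8.59250195, -1.11406841],
        [-4.30549653, 3.99404334],
        [-4.43811548, 4.68773409],
        [-6.00330942, 7.96222741],
        [-6.87172864, 8.01769469],
        [-8.86330667, 4.96141572],
        [-4.3774397, 4.60103839],
        [-6.42845615, 7.20886612],
        [-9.69681323, -2.32416397],
    ]
)
target = np.array([0, 2, 3, 3, 1, 1, 0, 3, 1, 2])  # expected partition
codes = np.array([1, 0, 1, 3, 2, 2, 1, 3, 3, 0])   # values seen at the breakpoint


def within_cluster_sse(points, labels):
    """Hypothetical helper: sum of squared distances to each cluster's mean."""
    total = 0.0
    for c in np.unique(labels):
        members = points[labels == c]
        total += ((members - members.mean(axis=0)) ** 2).sum()
    return float(total)


sse_target = within_cluster_sse(X, target)
sse_codes = within_cluster_sse(X, codes)
print(f"target partition SSE:   {sse_target:.3f}")
print(f"returned partition SSE: {sse_codes:.3f}")
```

A lower value means tighter clusters under this objective, so the two printed numbers show how far the returned partition is from the expected one.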

Indexer as hierarchicalkmeans cannot get the correct label partition #253
