shuffle index list with numpy, scatter list, use file for large lists #63
Changes from 2 commits
```diff
@@ -219,26 +219,21 @@ def barrier(args):
     else:
         dist.barrier()
 
-def bcast(args, vals, root=0):
-    """Broadcast list of vals from root to all ranks, returns newly allocated list"""
+def scatterv_(args, invals, counts, outval, root=0):
+    """Scatter values from invals according to counts array, receive values in outval"""
     if args.use_mpi:
```
Member: You could actually assert that

Author: Yes, thanks.
Diff (continued):

```diff
-        vals = args.mpi_comm.bcast(vals, root=root)
-        return vals
+        displ = [sum(counts[:rank]) for rank in range(args.numranks)]
+        args.mpi_comm.Scatterv([invals, np.array(counts), np.array(displ), args.MPI.INT64_T], outval, root=root)
     else:
-        # broadcast length of vals list
-        length = [len(vals)]
-        dist.broadcast_object_list(length, src=root)
-
         # allocate a tensor of appropriate size
         # initialize tensor with list values on root
         tensors = []
```

Suggested change:

```diff
-        tensors = []
+        chunks = []
```
Member (outdated): If you convert it into a torch tensor, you can use cumsum in order to get the end offsets. Or better, there seems to be a built-in function for what you want: https://pytorch.org/docs/stable/generated/torch.split.html
Author: Cool. Thanks for the tip on torch.split. That looks to be perfect.
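For illustration, a minimal sketch of what the torch.split approach could look like in the non-MPI branch; the tensor and counts below are made up for this example, not taken from the PR:

```python
import torch

# Hypothetical example: split the root's flat value tensor into one chunk per
# rank, using the per-rank counts directly instead of hand-computed offsets.
invals = torch.arange(10, dtype=torch.int64)  # values held on the root rank
counts = [2, 3, 5]                            # elements destined for ranks 0, 1, 2

chunks = torch.split(invals, counts)          # tuple of tensors, one per rank
assert [len(c) for c in chunks] == counts
```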
Member (outdated): Nit: tensor doesn't express much.
Suggested change:

```diff
-tensor = torch.from_numpy(outval)
+out = torch.from_numpy(outval)
```
Member: Isn't there an issue where the sizes of the tensors don't match anymore? Could you add a check that outval has the same size as the corresponding count?
Author: Yes, that name is not very descriptive; I'll rename it. And good idea to check the size of the output array.
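A sketch of the kind of size check being discussed; the attribute name args.rank is assumed here, not taken from the PR:

```python
# Hypothetical check: the receive buffer on each rank must hold exactly
# counts[rank] elements before the scatter runs.
def check_outval_size(args, counts, outval):
    expected = counts[args.rank]  # args.rank assumed to hold this process's rank
    assert len(outval) == expected, (
        f"outval has {len(outval)} elements, expected {expected} for rank {args.rank}"
    )
```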
Member (outdated): Suggestion: `idxlist = idxlist[:num_samples]`
Author: Got it. Thanks.
Member (outdated): Why do we need this? Also, if you don't mind, can you use 20_000_000? It's easier to read, IMO.
Author: This will go away. That 20 million was an arbitrary number I picked out of the air anyway. Thanks for the tip on the underscores for readability; I'll do that if this has to come back.
Member (outdated): Is there such a thing? That we can't load all indices in memory? 8 bytes per int64, so 20_000_000 elements represent around 160 MB, which should be more than okay.
Author: I'm more worried about the communication method being used. We can revisit this if it happens to show up.
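For reference, the arithmetic behind that estimate:

```python
import numpy as np

# 20 million int64 indices at 8 bytes each is on the order of 160 MB.
num_indices = 20_000_000
bytes_needed = num_indices * np.dtype(np.int64).itemsize
print(f"{bytes_needed / 1e6:.0f} MB")  # 160 MB
```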
Member (outdated): We should compute those offsets only once and dispatch to every rank.
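One possible way to build those offsets once, say on the root, is an exclusive prefix sum over the counts; this is a sketch with made-up counts, not the PR's code:

```python
import numpy as np

counts = np.array([4, 4, 3, 3], dtype=np.int64)  # elements per rank (illustrative)
displ = np.zeros_like(counts)
displ[1:] = np.cumsum(counts)[:-1]               # exclusive prefix sum: [0, 4, 8, 11]
# counts and displ could then be handed to Scatterv on the root instead of
# recomputing sum(counts[:rank]) separately for every rank.
```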
Member (outdated): Add a comment that this is linked to int64.
Member (outdated): I don't think this last barrier is needed; you could just let the other ranks go on to the next step without worrying, no?
Member: I think you can even run dset[idx], if I'm not wrong; to be confirmed. If so, I'd suggest something like:

```python
for row in dset[idx]:
    for key in args.columns:
        text = row[key]
```
Author: Oh, that's useful to know. If you don't mind, I'd like to stick with the harder way for now, because I'm using this same code to support a JSON file, and that particular code only supports an int for the argument in the __get__ call.
Member: Isn't that just len(dset)? Unless you want to take count into account also. I don't think it's worth adding a parameter to the signature of the method, since you can recover it from args.count and len(dset). Seeing as it's already an approximation, you could say that percent = dset_stats[0] / len(idx) * 100.
Author: Right, this is to adjust for count being less than len(dset). I'll create a separate function to compute num_samples and drop it from the param list.
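A rough sketch of what such a helper might look like, assuming args.count is an optional cap on the number of samples; the names here are illustrative, not the final code:

```python
def get_num_samples(args, dset):
    """Hypothetical helper: number of samples to process, capped by args.count."""
    num_samples = len(dset)
    if args.count is not None:
        num_samples = min(num_samples, args.count)
    return num_samples
```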
Member: Ditto.
Member: You could extrapolate also? Overall I feel it's a shame to add arguments for the sake of a logger. If you really want to keep it, I'm also okay with that.
Member (outdated): Let's move this into get_args.
Author: Yep, done.
Member: You should mention that this supports only int64 for now.
Author: Added a note to the docstring about that.