ADIOS Restart: Avoid NxN access pattern with Aggregators #900
Conversation
The code changes look good!
Force-pushed from 3f51adc to 8d84cb0.
remove comment
Weird, but it's not restarting... it runs fine until and then hangs. Not all ranks reach

Update: the last two messages only appear if the rank actually has
Use MPI_Exscan for an exclusive scan (without rank i's own contribution). Careful:
"The value in recvbuf on process 0 is undefined and unreliable as recvbuf is not significant for process 0. The value of recvbuf on process 1 is always the value in sendbuf on process 0."
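A minimal sketch of that pattern, assuming a uint64_t particle count per rank (names are illustrative, not taken from the PR):

```cpp
#include <mpi.h>
#include <cstdint>

/* Exclusive prefix sum over per-rank particle counts: each rank obtains the
 * global offset of its own particle block. The result is pre-initialized and
 * reset on rank 0 because MPI_Exscan leaves recvbuf undefined there. */
uint64_t exclusiveOffset(uint64_t localNumParticles, MPI_Comm comm)
{
    int rank = 0;
    MPI_Comm_rank(comm, &rank);

    uint64_t offset = 0u; // sane default for rank 0
    MPI_Exscan(&localNumParticles, &offset, 1,
               MPI_UNSIGNED_LONG_LONG, MPI_SUM, comm);

    if (rank == 0)
        offset = 0u; // the standard does not guarantee recvbuf stays untouched on rank 0

    return offset;
}
```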
totalNumParticles is of type uint64_cu and might not go well with uint64_t ...
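As a hedge against such a mismatch, a compile-time check along these lines could be added (a sketch only; the uint64_cu alias below is an illustrative stand-in for the PMacc typedef, not its actual definition):

```cpp
#include <cstdint>
#include <type_traits>

// illustrative stand-in for the PMacc typedef discussed above
using uint64_cu = unsigned long long int;

static_assert(sizeof(uint64_cu) == sizeof(uint64_t) &&
                  std::is_unsigned<uint64_cu>::value,
              "uint64_cu must be a 64-bit unsigned type to be passed as MPI_UNSIGNED_LONG_LONG");
```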
bug:

the

yes, the

hm, not all ranks reach "load particles on device chunk offset=0" ...

Yes, this is the current limit of

Can you check if all ranks leave the ADIOS call and enter MPI_Scan?

yes, 96 ranks reach "... particles from offset ..." and the lines below

hm, the "ADIOS: ( end ) load species" is in the "if" of ">0 particles"...

Some posts before you wrote that

Ahh ok, I can answer my last question myself. The message is from the

not sure, I am mixing

I have to declutter these
This is for sanity. I am not sure if "undefined" means "guaranteed to be not overwritten" in this manual:
"The value in recvbuf on process 0 is undefined and unreliable as recvbuf is not significant for process 0."
After the last commits, all 48 ranks reach for both species the
Nevertheless, the whole sim still hangs after that, outputting as before:
and that's it ^^
Would it be safer to use gc.getCommunicator().getMPIComm() here?
adiosComm is initialized in pluginLoad via an MPI_Comm_dup from it.
adios_read_init_method uses adiosComm afterwards.
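For reference, a minimal sketch of the communicator handling described here (function and variable names are illustrative; the actual plugin code differs):

```cpp
#include <mpi.h>

/* Sketch: the plugin keeps its own duplicate of the grid controller's
 * communicator, so collective calls in the restart path cannot interfere
 * with other traffic on the original communicator. */
MPI_Comm adiosComm = MPI_COMM_NULL;

void pluginLoadSketch(MPI_Comm gridComm /* e.g. gc.getCommunicator().getMPIComm() */)
{
    MPI_Comm_dup(gridComm, &adiosComm);
    // adios_read_init_method(...) is then called with adiosComm (see discussion above)
}

void pluginUnloadSketch()
{
    if (adiosComm != MPI_COMM_NULL)
        MPI_Comm_free(&adiosComm);
}
```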
in ADIOSCountParticles.hpp we use MPI_UNSIGNED_LONG_LONG for uint64_t values
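A minimal sketch of that datatype pairing, here reducing the per-rank counts to a global total (illustrative names, not the plugin code):

```cpp
#include <mpi.h>
#include <cstdint>

/* uint64_t values are passed to MPI as MPI_UNSIGNED_LONG_LONG; here the
 * per-rank particle counts are summed into a global total as an example. */
uint64_t globalNumParticles(uint64_t localNumParticles, MPI_Comm comm)
{
    uint64_t total = 0u;
    MPI_Allreduce(&localNumParticles, &total, 1,
                  MPI_UNSIGNED_LONG_LONG, MPI_SUM, comm);
    return total;
}
```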
Hm, could it be possible that the ranks we use here are not the same as in gc.getScalarPosition() ? That would explain a lot ^^
getScalarPosition() in libPMacc/include/mappings/simulation/GridController.hpp comes from

```cpp
uint32_t getScalarPosition() const
{
    return DataSpaceOperations<DIM>::map(getGpuNodes(), getPosition());
}
```

Hm... I would really like to use an MPI_Scan here, it scales (log N) way better than a clumsy MPI_Allgather ^^
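For illustration only, a scalar position as a plain linear index over the 3D GPU grid; whether DataSpaceOperations<DIM>::map uses exactly this ordering, and whether it matches the MPI rank order, is exactly the open question here:

```cpp
#include <cstdint>

/* Illustration only: a scalar GPU position as a linear (row-major) index of
 * the 3D grid position. The real mapping lives in DataSpaceOperations<DIM>::map
 * and may use a different ordering. */
uint32_t scalarPositionSketch(const uint32_t gpuNodes[3], const uint32_t pos[3])
{
    return pos[2] * gpuNodes[1] * gpuNodes[0]
         + pos[1] * gpuNodes[0]
         + pos[0];
}
```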
Ah crap, that's a communicator and grid controller design flaw... I will use the less efficient Allgather now, but we should refactor the thing so that getScalarPosition can actually be used with MPI communicators again 🔓
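A sketch of that Allgather fallback, assuming the original write order corresponds to the rank order in the given communicator (names are illustrative):

```cpp
#include <mpi.h>
#include <cstdint>
#include <vector>

/* Fallback without MPI_Scan: every rank gathers all local particle counts and
 * sums the counts of the ranks that wrote before it. O(N) data per rank, but
 * independent of how scalar positions map to communicator ranks. */
uint64_t offsetViaAllgather(uint64_t localNumParticles, MPI_Comm comm)
{
    int rank = 0, size = 0;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    std::vector<uint64_t> counts(size, 0u);
    MPI_Allgather(&localNumParticles, 1, MPI_UNSIGNED_LONG_LONG,
                  counts.data(), 1, MPI_UNSIGNED_LONG_LONG, comm);

    uint64_t offset = 0u;
    for (int r = 0; r < rank; ++r) // ranks that come before this one in write order
        offset += counts[r];
    return offset;
}
```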
Force-pushed from f3c58ed to 3666d7d.
@psychocoderHPC for the first time in my life, I found a useful application of
P.S.: still not running, but due to an std::bad_alloc during attribute reads... too tired to continue just now, I must have mixed up some lengths ;)

Yes, that's the plan. The question is if I can apply a fix to load the
to be consistent, that would be the actual rank (but does not matter due to allgather).
Force-pushed from f5d1ea2 to 8900924.
i*5
Force-pushed from 453f280 to a75d432.
Just needs a moving-window test; besides that, it works on both small and very large scale runs.

rebase me :) (#907 was merged)

I get the following error with LWFA (moving window enabled) with debug output. I added additional debug output of my own to check the pointer address.

Looks like a failed open or a memory corruption. Let's investigate.

-> you are working on outdated code, please check out this branch (forgot to

NOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOO

OK, all is fine. LWFA (with moving window) is restartable and the PNGs show no difference. Please rebase this branch!
During restart, the N files written by N aggregators are read again by each rank, which causes NxN open files and fails for large numbers of aggregators on parallel file systems. We now read a local block only and allgather it to calculate the offset in the 1D particle attribute arrays again. Due to an implementation bug in `ADIOSCountParticles.hpp:120` we have to imitate the (wrong) Sort-by-Rank for particles. Particle offsets are (unfortunately) calculated by rank, but only the scalarPos is given in the `particles_info` object. An upcoming change in that object with `OpenPMD` `particlePatches` will change that.
Force-pushed from a75d432 to 8f845fb.
cool, thx for testing! :)
During restart, the N files written by N aggregators (N <= `MPI_Size`) are read again by each rank, which causes NxN open files and fails for large numbers of aggregators on parallel file systems.
We now read our individual chunk instead and perform a rather lightweight `MPI_Scan` on it to retrieve the offsets (a sketch is given after the notes below).
Note 1: This bug is basically the same problem we once had with the `libSplash` `SerialDataCollector`, when we assumed that, in general, a re-sorting of GPUs *could* have happened (a pre-`alpha` bug during GB2013). On the ADIOS side, this can actually be achieved very efficiently with `ADIOS_READ_METHOD_BP_AGGREGATE`, but that is still very experimental (it did not work with zero-reads and/or the zlib transport enabled).
Note 2: The HDF5 implementation (picongpu/src/picongpu/include/plugins/hdf5/restart/LoadSpecies.hpp, lines 114 to 137 in f0ba515) can be optimized the same way; nevertheless, due to our MPI-I/O read there, it is not as severe as with aggregations right now.
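For completeness, a sketch of the `MPI_Scan` variant named above: an inclusive scan minus the local count yields the exclusive offset without relying on the undefined rank-0 result of `MPI_Exscan` (names are illustrative, not the actual plugin code):

```cpp
#include <mpi.h>
#include <cstdint>

/* Offset of this rank's particle chunk via an inclusive MPI_Scan:
 * subtracting the local count turns the inclusive prefix sum into the
 * exclusive one needed as a read/write offset. */
uint64_t offsetViaScan(uint64_t localNumParticles, MPI_Comm comm)
{
    uint64_t inclusiveSum = 0u;
    MPI_Scan(&localNumParticles, &inclusiveSum, 1,
             MPI_UNSIGNED_LONG_LONG, MPI_SUM, comm);
    return inclusiveSum - localNumParticles;
}
```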