ADIOS Restart: Avoid NxN access pattern with Aggregators #900
Conversation
The code changes look good!
Force-pushed from 3f51adc to 8d84cb0.
remove comment
Weird, but it's not restarting... it runs fine until and then hangs. Not all ranks reach

Update: the last two messages only appear if the rank actually has
Use MPI_Exscan for an exclusive scan (without rank i's own contribution). Careful:
"The value in recvbuf on process 0 is undefined and unreliable as recvbuf is not significant for process 0. The value of recvbuf on process 1 is always the value in sendbuf on process 0."
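A minimal sketch of that pattern, assuming a uint64_t particle count per rank (names are illustrative, not taken from the PR):

```cpp
#include <mpi.h>
#include <cstdint>

/* Exclusive prefix sum over per-rank particle counts: each rank obtains the
 * global offset of its own particle block. The result is pre-initialized and
 * reset on rank 0 because MPI_Exscan leaves recvbuf undefined there. */
uint64_t exclusiveOffset(uint64_t localNumParticles, MPI_Comm comm)
{
    int rank = 0;
    MPI_Comm_rank(comm, &rank);

    uint64_t offset = 0u; // sane default for rank 0
    MPI_Exscan(&localNumParticles, &offset, 1,
               MPI_UNSIGNED_LONG_LONG, MPI_SUM, comm);

    if (rank == 0)
        offset = 0u; // the standard does not guarantee recvbuf stays untouched on rank 0

    return offset;
}
```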
totalNumParticles is of type uint64_cu and might not go well with uint64_t ...
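As a hedge against such a mismatch, a compile-time check along these lines could be added (a sketch only; the uint64_cu alias below is an illustrative stand-in for the PMacc typedef, not its actual definition):

```cpp
#include <cstdint>
#include <type_traits>

// illustrative stand-in for the PMacc typedef discussed above
using uint64_cu = unsigned long long int;

static_assert(sizeof(uint64_cu) == sizeof(uint64_t) &&
                  std::is_unsigned<uint64_cu>::value,
              "uint64_cu must be a 64-bit unsigned type to be passed as MPI_UNSIGNED_LONG_LONG");
```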
bug:

the

yes, the

hm, not all ranks reach "load particles on device chunk offset=0" ...

Yes, this is the current limit of

Can you check if all ranks leave the ADIOS call and enter MPI_Scan?

yes, 96 ranks reach "... particles from offset ..." and the lines below

hm, the "ADIOS: ( end ) load species" is in the "if" of ">0 particles"...

Some posts before you wrote that

Ahh ok, I can answer my last question myself. The message is from the

not sure, I am mixing

I have to declutter these
This is for sanity. I am not sure if "undefined" means "guaranteed to be not overwritten" in this manual:
"The value in recvbuf on process 0 is undefined and unreliable as recvbuf is not significant for process 0."
After the last commits, all 48 ranks reach for both species the
Nevertheless, the whole sim still hangs after that, outputting as before:
and that's it ^^
Would it be safer to use gc.getCommunicator().getMPIComm() here?
adiosComm is initialized in pluginLoad via an MPI_Comm_dup from it.
adios_read_init_method uses adiosComm afterwards.
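For reference, a minimal sketch of the communicator handling described here (function and variable names are illustrative; the actual plugin code differs):

```cpp
#include <mpi.h>

/* Sketch: the plugin keeps its own duplicate of the grid controller's
 * communicator, so collective calls in the restart path cannot interfere
 * with other traffic on the original communicator. */
MPI_Comm adiosComm = MPI_COMM_NULL;

void pluginLoadSketch(MPI_Comm gridComm /* e.g. gc.getCommunicator().getMPIComm() */)
{
    MPI_Comm_dup(gridComm, &adiosComm);
    // adios_read_init_method(...) is then called with adiosComm (see discussion above)
}

void pluginUnloadSketch()
{
    if (adiosComm != MPI_COMM_NULL)
        MPI_Comm_free(&adiosComm);
}
```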
in ADIOSCountParticles.hpp we use MPI_UNSIGNED_LONG_LONG for uint64_t values
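A minimal sketch of that datatype pairing, here reducing the per-rank counts to a global total (illustrative names, not the plugin code):

```cpp
#include <mpi.h>
#include <cstdint>

/* uint64_t values are passed to MPI as MPI_UNSIGNED_LONG_LONG; here the
 * per-rank particle counts are summed into a global total as an example. */
uint64_t globalNumParticles(uint64_t localNumParticles, MPI_Comm comm)
{
    uint64_t total = 0u;
    MPI_Allreduce(&localNumParticles, &total, 1,
                  MPI_UNSIGNED_LONG_LONG, MPI_SUM, comm);
    return total;
}
```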
Hm, could it be possible that the ranks we use here are not the same as in gc.getScalarPosition() ? That would explain a lot ^^
getScalarPosition() in libPMacc/include/mappings/simulation/GridController.hpp comes from

```cpp
uint32_t getScalarPosition() const
{
    return DataSpaceOperations<DIM>::map(getGpuNodes(), getPosition());
}
```

Hm... I would really like to use an MPI_Scan here, it scales (log N) way better than a clumsy MPI_Allgather ^^
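For illustration only, a scalar position as a plain linear index over the 3D GPU grid; whether DataSpaceOperations<DIM>::map uses exactly this ordering, and whether it matches the MPI rank order, is exactly the open question here:

```cpp
#include <cstdint>

/* Illustration only: a scalar GPU position as a linear (row-major) index of
 * the 3D grid position. The real mapping lives in DataSpaceOperations<DIM>::map
 * and may use a different ordering. */
uint32_t scalarPositionSketch(const uint32_t gpuNodes[3], const uint32_t pos[3])
{
    return pos[2] * gpuNodes[1] * gpuNodes[0]
         + pos[1] * gpuNodes[0]
         + pos[0];
}
```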
Ah crap, that's a communicator and grid controller design flaw... I will use the less efficient Allgather now, but we should refactor the thing so that getScalarPosition can actually be used with MPI communicators again 🔓
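A sketch of that Allgather fallback, assuming the original write order corresponds to the rank order in the given communicator (names are illustrative):

```cpp
#include <mpi.h>
#include <cstdint>
#include <vector>

/* Fallback without MPI_Scan: every rank gathers all local particle counts and
 * sums the counts of the ranks that wrote before it. O(N) data per rank, but
 * independent of how scalar positions map to communicator ranks. */
uint64_t offsetViaAllgather(uint64_t localNumParticles, MPI_Comm comm)
{
    int rank = 0, size = 0;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    std::vector<uint64_t> counts(size, 0u);
    MPI_Allgather(&localNumParticles, 1, MPI_UNSIGNED_LONG_LONG,
                  counts.data(), 1, MPI_UNSIGNED_LONG_LONG, comm);

    uint64_t offset = 0u;
    for (int r = 0; r < rank; ++r) // ranks that come before this one in write order
        offset += counts[r];
    return offset;
}
```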
Force-pushed from f3c58ed to 3666d7d.
@psychocoderHPC for the first time in my life, I found a useful application of
P.S.: still not running, but due to an std::bad_alloc during attribute reads... too tired to continue just now, I must have mixed up some lengths ;)

Yes, that's the plan. The question is if I can apply a fix to load the
to be consistent, that would be the actual rank (but does not matter due to allgather).
Force-pushed from f5d1ea2 to 8900924.
i*5
Force-pushed from 453f280 to a75d432.
Just needs a moving-window test; besides that, it works on both small and very large scale runs.

rebase me :) (#907 was merged)

I get the following error with LWFA (moving window enabled) with debug output. I added additional debug output of my own to check the pointer address.

Looks like a failed open or a memory corruption. Let's investigate.

-> you are working on outdated code, please check out this branch (forgot to

NOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOO

OK, all is fine. LWFA (with moving window) is restartable and the PNGs show no difference. Please rebase this branch!
During restart, the N files written by N aggregators are read again by each rank, which causes NxN open files and fails for large numbers of aggregators on parallel file systems. We now read a local block only and allgather it to calculate the offset in the 1D particle attribute arrays again. Due to an implementation bug in `ADIOSCountParticles.hpp:120` we have to imitate the (wrong) Sort-by-Rank for particles. Particle offsets are (unfortunately) calculated by rank, but only the scalarPos is given in the `particles_info` object. An upcoming change in that object with `OpenPMD` `particlePatches` will change that.
Force-pushed from a75d432 to 8f845fb.
cool, thx for testing! :)
During restart, the N files written by N aggregators (N <= `MPI_Size`) are read again by each rank, which causes NxN open files and fails for large numbers of aggregators on parallel file systems.
We now read our individual chunk instead and perform a rather lightweight `MPI_Scan` on it to retrieve the offsets (a sketch is given after the notes below).
Note 1: This bug is basically the same problem we once had with the `libSplash` `SerialDataCollector`, when we assumed that, in general, a re-sorting of GPUs *could* have happened (a pre-`alpha` bug during GB2013). On the ADIOS side, this can actually be achieved very efficiently with `ADIOS_READ_METHOD_BP_AGGREGATE`, but that is still very experimental (it did not work with zero-reads and/or the zlib transport enabled).
Note 2: The HDF5 implementation (picongpu/src/picongpu/include/plugins/hdf5/restart/LoadSpecies.hpp, lines 114 to 137 in f0ba515) can be optimized the same way; nevertheless, due to our MPI-I/O read there, it is not as severe as with aggregations right now.
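For completeness, a sketch of the `MPI_Scan` variant named above: an inclusive scan minus the local count yields the exclusive offset without relying on the undefined rank-0 result of `MPI_Exscan` (names are illustrative, not the actual plugin code):

```cpp
#include <mpi.h>
#include <cstdint>

/* Offset of this rank's particle chunk via an inclusive MPI_Scan:
 * subtracting the local count turns the inclusive prefix sum into the
 * exclusive one needed as a read/write offset. */
uint64_t offsetViaScan(uint64_t localNumParticles, MPI_Comm comm)
{
    uint64_t inclusiveSum = 0u;
    MPI_Scan(&localNumParticles, &inclusiveSum, 1,
             MPI_UNSIGNED_LONG_LONG, MPI_SUM, comm);
    return inclusiveSum - localNumParticles;
}
```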