
Conversation


@ax3l ax3l commented Jun 1, 2015

During restart, the N files written by N aggregators (N <= MPI size) are read again by every rank, which causes N×N open files and fails for large numbers of aggregators on parallel file systems.

We now read only our individual chunk instead and perform a rather lightweight `MPI_Scan` on it to retrieve the offsets.

Note 1: This bug is basically the same problem we once had with the libSplash `SerialDataCollector` when we assumed that, in general, a re-sorting of GPUs *could* have happened (a pre-alpha bug during GB2013). On the ADIOS side, this can actually be achieved very efficiently with `ADIOS_READ_METHOD_BP_AGGREGATE`, but that method is still very experimental (it did not work with zero-reads and/or the zlib transport enabled).

Note 2: The HDF5 implementation

```cpp
uint64Quint particlesInfo[gc.getGlobalSize()];
Dimensions particlesInfoSizeRead;

params->dataCollector->read(params->currentStep,
                            (std::string(subGroup) + std::string("/particles_info")).c_str(),
                            particlesInfoSizeRead,
                            particlesInfo);
assert(particlesInfoSizeRead[0] == gc.getGlobalSize());

/* search my entry (using my scalar position) in particlesInfo */
uint64_t particleOffset = 0;
uint64_t myScalarPos = gc.getScalarPosition();
for (size_t i = 0; i < particlesInfoSizeRead[0]; ++i)
{
    if (particlesInfo[i][1] == myScalarPos)
    {
        totalNumParticles = particlesInfo[i][0];
        break;
    }
    particleOffset += particlesInfo[i][0];
}
```

can be optimized the same way; nevertheless, due to our MPI-I/O read there, it is not as severe as it currently is with aggregations.

  • tested with large-scale runs and LWFA moving window

@ax3l ax3l added bug a bug in the project's code component: plugin in PIConGPU plugin labels Jun 1, 2015
@ax3l ax3l added this to the Open Beta milestone Jun 1, 2015
@ax3l ax3l commented on the diff

  • 5

@psychocoderHPC

the code changes look good!

@ax3l ax3l force-pushed the fix-restartOffsetMPIScan branch from 3f51adc to 8d84cb0 Compare June 1, 2015 17:59
@ax3l ax3l commented on the diff

remove comment

@ax3l

ax3l commented Jun 1, 2015

weird, but it's not restarting...

runs fine until

```
[...]
PIConGPUVerbose MEMORY(8) | free mem after all particles are initialized 383 MiB
initialization time: 19sec 854msec = 19 sec
 50 % =      250 | time elapsed:                    0msec | avg time per step:   0msec
```

and then hangs.

Not all ranks reach

  • "[...] used frames to load particles [...]" (wc -l is 60 instead of 2*48 = 96)
  • Only 24 ranks reach "( end ) load species: i" and only 36 reach "( end ) load species: e"

Update: the last two messages only appear if the rank actually has >0 particles at all... -> will be updated for at least the ( end ) load species: output...

@ax3l ax3l commented on the diff

Use `MPI_Exscan` for an exclusive scan (i.e., excluding rank i's own contribution):

> Careful: The value in recvbuf on process 0 is undefined and unreliable as recvbuf is not significant for process 0. The value of recvbuf on process 1 is always the value in sendbuf on process 0.

@ax3l ax3l commented on the diff

totalNumParticles is of type uint64_cu and might not mix well with uint64_t ...

@ax3l

ax3l commented Jun 1, 2015

bug: uint32_t leftOverParticles should be uint64_t

@ax3l

ax3l commented Jun 1, 2015

the copySpeciesGlobal2Local should also treat all arguments as 64bit...

@psychocoderHPC

uint32_t should currently be OK because the GridBuffer can only handle int (32 bit). In most PMacc operations you can only allocate 2^31 elements (not bytes).
But true, since we use mallocMC we can create more than 2^31 particles per species.

@ax3l

ax3l commented Jun 1, 2015

yes, the copySpeciesGlobal2Local seems valid after a check (for less than 2 billion particles per GPU).

@ax3l

ax3l commented Jun 1, 2015

hm, not all ranks reach "load particles on device chunk offset=0" ...

@psychocoderHPC

Yes, this is the current limit of PMacc to save registers. I think we can change this for particles in the near future.

@psychocoderHPC

Can you check if all ranks leave the ADIOS call and enter MPI_Scan?

@ax3l

ax3l commented Jun 1, 2015

yes, 96 ranks reach "... particles from offset ..." and the lines below

@ax3l

ax3l commented Jun 1, 2015

hm, the "ADIOS: ( end ) load species" is in the "if" of ">0 particles"...

@psychocoderHPC

FreeMemory is also in the same if condition but the allocation is outside.

A few posts back you wrote that not all ranks reach "... particles from offset ...". Is this problem solved now?

@psychocoderHPC

Ahh OK, I can answer my last question myself. The message is from the for loop where particles are loaded chunk-wise, and only ranks where there is something to do enter this part of the code.

@ax3l

ax3l commented Jun 1, 2015

not sure, I am mixing uint64_cu and uint64_t in MPI_Scan - I will try that again, but all ranks seem to leave the scan again...

@ax3l

ax3l commented Jun 1, 2015

I have to declutter these if's so they also show up for ==0 particles loaded.

@ax3l ax3l commented on the diff

This is for sanity. I am not sure if "undefined" means "guaranteed to not be overwritten" in this manual:

> The value in recvbuf on process 0 is undefined and unreliable as recvbuf is not significant for process 0.

@ax3l

ax3l commented Jun 1, 2015

After the last commits, all 48 ranks reach the `( end ) load species:` output for both species.

nevertheless, the whole sim still hangs after that output as before:

```
initialization time: 20sec 793msec = 20 sec
 50 % =      250 | time elapsed:                    0msec | avg time per step:   0msec
```

and that's it ^^

@ax3l ax3l commented on the diff

would it be safer to use gc.getCommunicator().getMPIComm() here?
adiosComm is initialized in pluginLoad via MPI_Comm_dup from it.
adios_read_init_method uses adiosComm afterwards.

@ax3l ax3l commented on the diff

in ADIOSCountParticles.hpp we use MPI_UNSIGNED_LONG_LONG for uint64_t values

@ax3l ax3l commented on the diff

Hm, could it be possible that the ranks we use here are not the same as in gc.getScalarPosition() ? That would explain a lot ^^

@ax3l ax3l commented on the diff

getScalarPosition() in libPMacc/include/mappings/simulation/GridController.hpp comes from

```cpp
uint32_t getScalarPosition() const
{
    return DataSpaceOperations<DIM>::map(getGpuNodes(), getPosition());
}
```

hm... I would really like to use an MPI_Scan here, it scales (log N) way better than a clumsy MPI_Allgather ^^

@ax3l ax3l commented on the diff

ah crap, that's a communicator and grid controller design flaw... I will use the less efficient Allgather for now, but we should refactor the thing so that getScalarPosition can actually be used with MPI communicators again 🔓

@ax3l ax3l force-pushed the fix-restartOffsetMPIScan branch 3 times, most recently from f3c58ed to 3666d7d Compare June 1, 2015 22:22
@ax3l

ax3l commented Jun 1, 2015

@psychocoderHPC for the first time in my life, I found a useful application of MPI_Scan. We really need to redesign the communicator so we can use more efficient collectives...


P.S.: still not running, but due to an std::bad_alloc during attribute reads... too tired to continue just now, I must have mixed up some lengths ;)

@ax3l

ax3l commented Jun 3, 2015

yes, that's the plan. The question is whether I can apply a fix to load the current scheme for loads, so I do not have to change it twice.

@ax3l ax3l commented on the diff

to be consistent, that would be the actual rank (but it does not matter due to the allgather).

@ax3l ax3l force-pushed the fix-restartOffsetMPIScan branch from f5d1ea2 to 8900924 Compare June 3, 2015 08:15
@ax3l ax3l commented on the diff

i*5

@ax3l ax3l force-pushed the fix-restartOffsetMPIScan branch 6 times, most recently from 453f280 to a75d432 Compare June 3, 2015 09:32
@ax3l

ax3l commented Jun 22, 2015

just needs a moving window test, besides that it works on both small and very large scale runs.

@ax3l

ax3l commented Jul 22, 2015

rebase me :) (#907 was merged)

@psychocoderHPC

I get the following error with LWFA (moving window enabled)

```
[1,2]<stderr>:ERROR: Null pointer passed as file to adios_inq_var()
[1,2]<stderr>:ERROR: Null pointer passed as file to adios_perform_reads()
[1,2]<stderr>:ADIOS: error at cmd 'adios_perform_reads( params->fp, 1 )' (-4) in /bigdata/hplsim/scratch/widera/dev/src/picongpu/include/plugins/adios/restart/LoadParticleAttributesFromADIOS.hpp:113Null pointer passed as file to adios_perform_reads()
```

with debug output

```
...
[1,7]<stdout>:PIConGPUVerbose INPUT_OUTPUT(32) | ADIOS: Loading 0 particles from offset 140733876213206
[1,7]<stdout>:PIConGPUVerbose INPUT_OUTPUT(32) | ADIOS: malloc mapped memory: e
```

added additional debug output of my own to check the pointer address

```
...
[1,7]<stdout>:PIConGPUVerbose INPUT_OUTPUT(32) | ADIOS:  ptr before read /data/3000/particles/e/position/x from e: 0
...
```

@ax3l

ax3l commented Jul 23, 2015

looks like a failed open or a memory corruption. let's investigate.

@ax3l

ax3l commented Jul 23, 2015

-> you are working on outdated code, please check out this branch (forgot to git fetch?)

@psychocoderHPC

NOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOO
git fetch axel was missing

@psychocoderHPC

OK all is fine.

LWFA (with moving window) is restartable and the PNGs show no difference.

please rebase this branch!

ax3l added 2 commits July 23, 2015 14:05
During restart, the N files written by N aggregators are
read again by each rank, which causes N×N open files and fails
for large numbers of aggregators on parallel file systems.

We now read a local block only and allgather it to calculate the
offset in the 1D particle attribute arrays again.

Due to an implementation bug in `ADIOSCountParticles.hpp:120` we
have to imitate the (wrong) sort-by-rank for particles.
Particle offsets are (unfortunately) calculated by rank, but only
the scalarPos is given in the `particles_info` object.
An upcoming change in that object with `OpenPMD` `particlePatches`
will change that.
@ax3l ax3l force-pushed the fix-restartOffsetMPIScan branch from a75d432 to 8f845fb Compare July 23, 2015 12:05
@ax3l

ax3l commented Jul 23, 2015

cool, thx for testing! :)
rebased now onto the latest version of our dev.

psychocoderHPC added a commit that referenced this pull request Jul 23, 2015
ADIOS Restart: Avoid NxN access pattern with Aggregators
@psychocoderHPC psychocoderHPC merged commit cc1fa31 into ComputationalRadiationPhysics:dev Jul 23, 2015
@ax3l ax3l deleted the fix-restartOffsetMPIScan branch July 23, 2015 12:38
psychocoderHPC pushed a commit to psychocoderHPC/picongpu that referenced this pull request Feb 20, 2020
psychocoderHPC pushed a commit to psychocoderHPC/picongpu that referenced this pull request Feb 20, 2020
@ax3l ax3l mentioned this pull request Mar 31, 2020
