Fix #631 Allocation Of Memory On Jetson Boards by ax3l · Pull Request #633 · ComputationalRadiationPhysics/picongpu

ax3l · 2015-01-16T01:22:12Z

SoC-like boards share the same memory for the device and host. Since we still do double buffering, we should only reserve half of the available "device" memory for particles.
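The halving rule can be sketched in a few lines. This is only an illustration of the reasoning, not the code in this pull request: the helper name `particle_memory_budget`, its parameters and the numbers in the usage note are all hypothetical. For reference, a CUDA program can usually tell that device and host share memory from the `integrated` field of `cudaDeviceProp`; whether this pull request uses that field is not shown here.

```python
MIB = 1024 * 1024


def particle_memory_budget(free_bytes, reserved_bytes, shared_with_host):
    """Bytes that may be handed to the particle buffers.

    Hypothetical helper: `reserved_bytes` stands for whatever is kept
    back for other allocations, `shared_with_host` is True on SoC-like
    boards where "device" memory is the same RAM as host memory.
    """
    usable = max(free_bytes - reserved_bytes, 0)
    if shared_with_host:
        # The double-buffered host copy comes out of the same physical
        # RAM, so only half of the remainder is safe to claim.
        return usable // 2
    return usable
```

With made-up inputs, `particle_memory_budget(1000 * MIB, 400 * MIB, True)` yields 300 MiB, while the same call with `shared_with_host=False` yields 600 MiB.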

Add 8 for MEMORY to PIC_VERBOSE, e.g.:

$PICSRC/configure -a sm_35 -c"-DPIC_VERBOSE=9" ../paramSets/lwfa
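"Add 8" suggests that PIC_VERBOSE is a sum of power-of-two flags, so 9 would be MEMORY (8) plus one other level with value 1. A minimal sketch of that reading follows; the bitmask interpretation is an assumption, and the meaning of the level with value 1 is not stated in this thread.

```python
MEMORY = 8  # value given in the description above


def level_enabled(pic_verbose, level):
    """True if `level` is switched on in the combined verbosity value.

    Assumes the levels are distinct powers of two that are added up.
    """
    return (pic_verbose & level) != 0


# 9 = 8 (MEMORY) + 1 (some other level, not named in this thread)
combined = MEMORY | 1
```

Under that assumption `level_enabled(9, MEMORY)` is true and `level_enabled(1, MEMORY)` is false.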

Still needs run-time testing:

- jetson
- k20, fermi, k80, ...

Also:

- add a MEMORY log output about the selected choice

PrometheusPi · 2015-01-16T08:28:58Z

Great job! 👍 I will try to test this on the jetson cluster here in Jena.

ax3l · 2015-01-16T10:33:32Z

Fantastic, I already validated the bare function on our local Jetson cluster (but not this pull request yet).

ax3l · 2015-01-16T10:39:51Z

@sigkill @ssstuvz that issue is relevant for your jetson (clusters) and should remove the instability and poor performance we observed. Please feel free to update to dev after this pull is reviewed and merged.

ax3l · 2015-01-16T13:11:31Z

Test of examples/LaserWakefield on a single jetson:

[1,0]<stdout>:PIConGPUVerbose MEMORY(8) | size for all exchange = 36 MiB
[1,0]<stdout>:PIConGPUVerbose MEMORY(8) | Shared RAM between GPU and host detected - using only half of the 'device' memory.
[1,0]<stdout>:PIConGPUVerbose MEMORY(8) | create 48 MiB for species e
[1,0]<stdout>:mem for particles=48 MiB = 6445 Frames = 1649920 Particles
[1,0]<stdout>:PIConGPUVerbose MEMORY(8) | free mem after all mem is allocated 446 MiB

with .cfg

TBG_gpu_x=1
TBG_gpu_y=1
TBG_gpu_z=1

TBG_gridSize="-g 128 256 128"
TBG_steps="-s 1024"

48 MB for e- is not that much... they have nearly 2 GB available in total; we spare 450 MB for CUDA weirdness (random number states, overheads and similar).
~~I did run cuda_memtest successfully before that (it allocated ~1400 MB itself).~~ (tested: no influence on sim alloc if disabled/enabled)

ax3l · 2015-01-16T13:17:09Z

ok, looks better with -g 64 64 64:

[1,0]<stdout>:PIConGPUVerbose MEMORY(8) | size for all exchange = 36 MiB
[1,0]<stdout>:PIConGPUVerbose MEMORY(8) | Shared RAM between GPU and host detected - using only half of the 'device' memory.
[1,0]<stdout>:PIConGPUVerbose MEMORY(8) | create 370 MiB for species e
[1,0]<stdout>:mem for particles=370 MiB = 48838 Frames = 12502528 Particles
[1,0]<stdout>:PIConGPUVerbose MEMORY(8) | free mem after all mem is allocated 265 MiB
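The "mem for particles" lines in the two logs above can be cross-checked with a little arithmetic. The sketch below only uses the numbers printed in those logs; the helper names are made up for illustration, and the bytes-per-particle figure is approximate because the MiB values in the log are whole numbers.

```python
MIB = 1024 * 1024

# (MiB reported, frames, particles) copied from the two log excerpts
RUNS = {
    "-g 128 256 128": (48, 6445, 1649920),
    "-g 64 64 64": (370, 48838, 12502528),
}


def particles_per_frame(frames, particles):
    """Particles stored in one frame; raises if the counts do not divide."""
    per_frame, remainder = divmod(particles, frames)
    if remainder:
        raise ValueError("particle count is not a whole number of frames")
    return per_frame


def approx_bytes_per_particle(mib, particles):
    """Rough memory cost per particle, limited by the rounded MiB value."""
    return mib * MIB / particles


# Both runs give 256 particles per frame.
per_frame = {name: particles_per_frame(f, p) for name, (_, f, p) in RUNS.items()}
```

Both runs come out at 256 particles per frame and at roughly 31 bytes per particle, so the smaller grid simply leaves more room for particle frames.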

init time: ~60 sec

It works with submit/bash/bash_mpirun.tpl on 4 and 5 Jetsons, too.
Take care to set TBG_gpusPerNode to 1 and to add -x LD_LIBRARY_PATH to the mpirun call in submit/bash/bash_mpirun.tpl.

P.S.: I recommend uninstalling the X server to get more memory for your sims :)

PrometheusPi · 2015-01-16T14:45:30Z

on 3 Jetsons in Jena:

[1,0]<stdout>:mem for particles=129 MiB = 75384 Frames = 4824576 Particles
[1,1]<stdout>:mem for particles=57 MiB = 33494 Frames = 2143616 Particles
[1,2]<stdout>:mem for particles=434 MiB = 252439 Frames = 16156096 Particles

with -d 3 1 1 -g 192 2560 1

MPI rank 1 has only very limited memory in this run! This might be problematic.

Should I rerun with verbose memory output?

sigkill · 2015-01-16T14:47:18Z

Wow, still, these improvements are awesome! Thank you for all the hard work!

ax3l · 2015-01-16T14:55:40Z

@PrometheusPi do not use that many cells on the small GPUs; they only have 2 GB to fit the whole OS, other processes and PIConGPU.
Also: do you run X on your nodes? The main reasons for your problems are:

- your GPU distribution -d is not a good fit: use more GPUs in the y direction (-> smaller guard buffers)
  -> your central GPU in x currently allocates guard buffers, the others don't
- too many processes / other users on the nodes -> varying memory consumption and, due to that, a varying amount of memory PIConGPU can allocate
- maybe X11 is running -> disable it (check all running daemons on your nodes via the /etc/init.d/ scripts and htop)

that is how it should look (for an empty cluster with minimal daemons running - well, we even have X11 on -.-). I used 64^3 cells per Jetson (25% less than your setup):

[1,4]<stdout>:PIConGPUVerbose MEMORY(8) | Shared RAM between GPU and host detected - using only half of the 'device' memory.
[1,4]<stdout>:PIConGPUVerbose MEMORY(8) | create 446 MiB for species e
[1,1]<stdout>:PIConGPUVerbose MEMORY(8) | Shared RAM between GPU and host detected - using only half of the 'device' memory.
[1,1]<stdout>:PIConGPUVerbose MEMORY(8) | create 490 MiB for species e
[1,0]<stdout>:PIConGPUVerbose MEMORY(8) | Shared RAM between GPU and host detected - using only half of the 'device' memory.
[1,0]<stdout>:PIConGPUVerbose MEMORY(8) | create 467 MiB for species e
[1,3]<stdout>:PIConGPUVerbose MEMORY(8) | Shared RAM between GPU and host detected - using only half of the 'device' memory.
[1,3]<stdout>:PIConGPUVerbose MEMORY(8) | create 490 MiB for species e
[1,2]<stdout>:PIConGPUVerbose MEMORY(8) | Shared RAM between GPU and host detected - using only half of the 'device' memory.

@sigkill thanks! :)

PrometheusPi · 2015-01-16T15:28:51Z

@ax3l Yes, I know that we have varying memory consumption here.

I just added your new code to my current run with Foil, where this GPU setup works best.
Yes, some demons eat memory 👿
We disabled the GUI, but kept X running (only 1% of memory).

Your example on ZIH/Jetson looks great.

ax3l · 2015-01-16T15:39:14Z

Yes, but the main thing is still: one GPU (the central one) will use 2x more memory for the guards. With your setting that is quite relevant, too.
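The guard-buffer argument can be made concrete with a rough sketch. It assumes non-periodic boundaries in x and that guard memory scales with the number of faces shared with a neighbouring GPU and with the number of cells on such a face; both are simplifications, and the function names are hypothetical rather than anything from PIConGPU.

```python
def exchange_faces_in_x(rank_x, gpus_x):
    """Number of x faces of this GPU that border another GPU.

    Assumes a non-periodic chain of `gpus_x` GPUs along x.
    """
    left = 1 if rank_x > 0 else 0
    right = 1 if rank_x < gpus_x - 1 else 0
    return left + right


def face_cells(grid, gpus, axis):
    """Cells on one face normal to `axis` of a single GPU's 2D sub-domain."""
    other = 1 - axis
    return grid[other] // gpus[other]


# Setup from the thread: -d 3 1 1 -g 192 2560 1 (treated as 2D here)
faces = [exchange_faces_in_x(r, 3) for r in range(3)]  # [1, 2, 1]
x_split_face = face_cells((192, 2560), (3, 1), axis=0)  # 2560 cells
y_split_face = face_cells((192, 2560), (1, 3), axis=1)  # 192 cells
```

Under these assumptions the central rank has twice as many exchange faces as the outer ranks, and splitting along y instead of x would shrink each exchanged face from 2560 to 192 cells.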

Here are some performance tuning tips, especially for the network.
To increase networking performance, increase the socket buffer sizes (in this case to 32MB, performed at boot):

sysctl -w net.core.rmem_max=33554432
sysctl -w net.core.wmem_max=33554432
sysctl -w net.core.rmem_default=33554432
sysctl -w net.core.wmem_default=33554432
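The value 33554432 is 32 × 1024 × 1024, i.e. 32 MiB expressed in bytes. A small sketch for producing the same four commands for another buffer size follows; the helper names are made up for illustration, and only the four keys listed above are used.

```python
KEYS = [
    "net.core.rmem_max",
    "net.core.wmem_max",
    "net.core.rmem_default",
    "net.core.wmem_default",
]


def mib_to_bytes(mib):
    """Convert a size in MiB to bytes."""
    return mib * 1024 * 1024


def sysctl_commands(mib):
    """Build one `sysctl -w key=value` line per socket buffer key."""
    size = mib_to_bytes(mib)
    return [f"sysctl -w {key}={size}" for key in KEYS]
```

`sysctl_commands(32)` reproduces the four lines above; note that the option is a plain ASCII hyphen (`-w`).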

P.S.: argh, now I want a PCI-E switch instead of an ethernet interconnect. Brrr.

Fix ComputationalRadiationPhysics#631 Allocation Of Memory On Jetson Boards

39d1dc2

ax3l added the "bug" label (a bug in the project's code) Jan 16, 2015

ax3l added this to the Open Beta milestone Jan 16, 2015

Add log message MEMORY about SoC Mode

ac0ea9b

ax3l mentioned this pull request Jan 16, 2015

False density initialization with several GPUs using FreeFormula? #635

Closed

Replace 1Byte Char with uint8_t

ece91a5

- uint8_t is well defined, char is not

psychocoderHPC added a commit that referenced this pull request Jan 21, 2015

Merge pull request #633 from ax3l/fix-memOnSoC

ca1f768

Fix #631 Allocation Of Memory On Jetson Boards

psychocoderHPC merged commit ca1f768 into ComputationalRadiationPhysics:dev Jan 21, 2015

ax3l mentioned this pull request Jan 21, 2015

Jetson TK1: Shared Memory #631

Closed

ax3l added the "affects latest release" label (a bug that affects the latest stable release) Jan 21, 2015

ax3l deleted the fix-memOnSoC branch May 21, 2015 18:05
