
Fix #631 Allocation Of Memory On Jetson Boards #633

Merged
psychocoderHPC merged 3 commits into ComputationalRadiationPhysics:dev from ax3l:fix-memOnSoC
Jan 21, 2015

@ax3l
Member

@ax3l ax3l commented Jan 16, 2015

SoC-like boards share the same memory for the device and host. Since we still do double buffering, we should only reserve half of the available "device" memory for particles.
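The halving rule can be sketched as a small pure function. This is only an illustration under stated assumptions, not the code from this pull request: `particleBudget`, its parameters and the idea of a fixed reserve are hypothetical names chosen here. How the shared-memory case is detected is not shown in this thread; in CUDA, the `integrated` field of `cudaDeviceProp` would be one candidate.

```cpp
#include <cstddef>

// Hypothetical helper, NOT PIConGPU's actual implementation.
// freeBytes:      memory currently reported as free on the "device"
// reservedBytes:  memory to leave untouched (overheads, RNG states, ...)
// sharedWithHost: true on SoC-like boards where host and device share RAM
inline std::size_t particleBudget(std::size_t freeBytes,
                                  std::size_t reservedBytes,
                                  bool sharedWithHost)
{
    // Nothing is left for particles if the reserve exceeds the free memory.
    if (freeBytes <= reservedBytes)
        return 0;

    const std::size_t usable = freeBytes - reservedBytes;

    // With double buffering on shared RAM, every "device" byte is
    // mirrored by a host byte in the same physical memory, so only
    // half of the usable amount may be claimed.
    return sharedWithHost ? usable / 2 : usable;
}
```

On a discrete GPU the function returns the full usable amount; on a shared-memory board it returns half of it.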

Add 8 for MEMORY to PIC_VERBOSE, e.g.:

$PICSRC/configure -a sm_35 -c"-DPIC_VERBOSE=9" ../paramSets/lwfa
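"Add 8" suggests that `PIC_VERBOSE` is a sum of power-of-two flags, so 9 would be the MEMORY flag (8) plus the flag with value 1. Assuming that reading is right, a check for the MEMORY bit might look like the sketch below; `LOG_MEMORY` and `memoryLogEnabled` are hypothetical names, and only the value 8 comes from the description above.

```cpp
#include <cstdint>

// Value 8 for MEMORY is taken from the description above;
// the constant and function names are hypothetical.
constexpr std::uint32_t LOG_MEMORY = 8u;

// True if the MEMORY bit is set in a PIC_VERBOSE-style bitmask.
constexpr bool memoryLogEnabled(std::uint32_t verboseLevel)
{
    return (verboseLevel & LOG_MEMORY) != 0u;
}
```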

Still needs run time testing:

  • jetson
  • k20, fermi, k80, ...

Also:

  • add a MEMORY log output about the selected choice

…Boards

SoC-like boards share the same memory for the device
and host. Since we still do double buffering, we should
only reserve *half* of the available "device" memory
for particles.
@ax3l ax3l added the bug a bug in the project's code label Jan 16, 2015
@ax3l ax3l added this to the Open Beta milestone Jan 16, 2015
@PrometheusPi
Member

Great job! 👍 I will try to test this on the jetson cluster here in Jena.

@ax3l
Member Author

ax3l commented Jan 16, 2015

fantastic, I already validated the bare function on our local jetson cluster (but not this pull yet).

@ax3l
Member Author

ax3l commented Jan 16, 2015

@sigkill @ssstuvz that issue is relevant for your jetson (clusters) and should remove the instability and poor performance we observed. Please feel free to update to dev after this pull is reviewed and merged.

@ax3l
Member Author

ax3l commented Jan 16, 2015

Test of examples/LaserWakefield on a single jetson:

[1,0]<stdout>:PIConGPUVerbose MEMORY(8) | size for all exchange = 36 MiB
[1,0]<stdout>:PIConGPUVerbose MEMORY(8) | Shared RAM between GPU and host detected - using only half of the 'device' memory.
[1,0]<stdout>:PIConGPUVerbose MEMORY(8) | create 48 MiB for species e
[1,0]<stdout>:mem for particles=48 MiB = 6445 Frames = 1649920 Particles
[1,0]<stdout>:PIConGPUVerbose MEMORY(8) | free mem after all mem is allocated 446 MiB

with .cfg

TBG_gpu_x=1
TBG_gpu_y=1
TBG_gpu_z=1

TBG_gridSize="-g 128 256 128"
TBG_steps="-s 1024"

48 MiB for e- is not that much... they have nearly 2 GB available in total, and we spare 450 MB for cuda weirdness (random number states, overheads and similar).
~~I did run cuda_memtest successfully before that (allocated itself ~1400 MB).~~ (tested: no influence on sim alloc if disabled/enabled)

@ax3l
Member Author

ax3l commented Jan 16, 2015

ok, looks better with -g 64 64 64:

[1,0]<stdout>:PIConGPUVerbose MEMORY(8) | size for all exchange = 36 MiB
[1,0]<stdout>:PIConGPUVerbose MEMORY(8) | Shared RAM between GPU and host detected - using only half of the 'device' memory.
[1,0]<stdout>:PIConGPUVerbose MEMORY(8) | create 370 MiB for species e
[1,0]<stdout>:mem for particles=370 MiB = 48838 Frames = 12502528 Particles
[1,0]<stdout>:PIConGPUVerbose MEMORY(8) | free mem after all mem is allocated 265 MiB

init time: ~60sec

works with submit/bash/bash_mpirun.tpl on 4 and 5 jetsons, too.
take care to set the TBG_gpusPerNode to 1 and to add -x LD_LIBRARY_PATH to the mpirun call in submit/bash/bash_mpirun.tpl.

P.S.: I recommend uninstalling the X server to get more memory for your sims :)
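The "Frames" and "Particles" figures in both logs above are related by a constant factor: 6445 × 256 = 1649920 and 48838 × 256 = 12502528. The sketch below reproduces that arithmetic. The factor of 256 particles per frame is inferred from these two log lines only, and `particleCapacity` is a hypothetical helper name.

```cpp
#include <cstdint>

// Number of particles that fit into the given number of frames.
// particlesPerFrame is inferred from the log output in this thread
// (256 for these runs); it is an assumption, not a documented constant.
inline std::uint64_t particleCapacity(std::uint64_t frames,
                                      std::uint64_t particlesPerFrame)
{
    return frames * particlesPerFrame;
}
```

For example, `particleCapacity(6445, 256)` gives the 1649920 particles reported for the `-g 128 256 128` run.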

@PrometheusPi
Member

on 3 Jetsons in Jena:

[1,0]<stdout>:mem for particles=129 MiB = 75384 Frames = 4824576 Particles
[1,1]<stdout>:mem for particles=57 MiB = 33494 Frames = 2143616 Particles
[1,2]<stdout>:mem for particles=434 MiB = 252439 Frames = 16156096 Particles

with -d 3 1 1 -g 192 2560 1

MPI rank 1 has only very limited memory in this run! This might be problematic.

Should I rerun with verbose memory output?

@sigkill

sigkill commented Jan 16, 2015

Wow still these improvements are awesome! Thank you for all the hard work!

@ax3l
Member Author

ax3l commented Jan 16, 2015

@PrometheusPi do not use that many cells for the small GPUs, they only have 2GB to fit in the whole OS, processes and PIConGPU.
Also: do you run X on your nodes? The main reasons for your problems are:

  • your gpu distribution -d is not a good fit: use more gpus in y direction (-> smaller guard buffers)
    -> your central GPU in x right now allocates guard buffers, the others don't
  • too many processes / other users on the nodes -> varying memory consumption and, due to that, a varying amount of memory PIConGPU can allocate
  • maybe X11 running -> disable it (check all running daemons for your nodes via the /etc/init.d/ scripts and htop)

that is how it should look (for an empty cluster with minimal daemons running - well, we even have X11 on -.-). I used 64^3 cells per Jetson (25% less than your setup):

[1,4]<stdout>:PIConGPUVerbose MEMORY(8) | Shared RAM between GPU and host detected - using only half of the 'device' memory.
[1,4]<stdout>:PIConGPUVerbose MEMORY(8) | create 446 MiB for species e
[1,1]<stdout>:PIConGPUVerbose MEMORY(8) | Shared RAM between GPU and host detected - using only half of the 'device' memory.
[1,1]<stdout>:PIConGPUVerbose MEMORY(8) | create 490 MiB for species e
[1,0]<stdout>:PIConGPUVerbose MEMORY(8) | Shared RAM between GPU and host detected - using only half of the 'device' memory.
[1,0]<stdout>:PIConGPUVerbose MEMORY(8) | create 467 MiB for species e
[1,3]<stdout>:PIConGPUVerbose MEMORY(8) | Shared RAM between GPU and host detected - using only half of the 'device' memory.
[1,3]<stdout>:PIConGPUVerbose MEMORY(8) | create 490 MiB for species e
[1,2]<stdout>:PIConGPUVerbose MEMORY(8) | Shared RAM between GPU and host detected - using only half of the 'device' memory.

@sigkill thanks! :)

@PrometheusPi
Member

@ax3l Yes, I know that we have varying memory consumption here.

  • I just added your new code to my current run with Foil where this GPU setup works best.
  • Yes, some demons eat memory 👿
  • We disabled the GUI interface, but kept X running (only 1% of memory)

Your example on ZIH/Jetson looks great.

@ax3l
Member Author

ax3l commented Jan 16, 2015

Yes, but the main thing is still: one GPU (the central one) will use 2x more memory for the guards. With your setting that is quite relevant, too.

Here are some performance tuning tips, especially for the network.
To increase networking performance, increase the socket buffer sizes (in this case to 32MB, performed at boot):

sysctl -w net.core.rmem_max=33554432
sysctl -w net.core.wmem_max=33554432
sysctl -w net.core.rmem_default=33554432
sysctl -w net.core.wmem_default=33554432

P.S.: argh, now I want a PCI-E switch instead of an ethernet interconnect. Brrr.

- uint8_t is well defined, char is not
psychocoderHPC added a commit that referenced this pull request Jan 21, 2015
Fix #631 Allocation Of Memory On Jetson Boards
@psychocoderHPC psychocoderHPC merged commit ca1f768 into ComputationalRadiationPhysics:dev Jan 21, 2015
@ax3l ax3l mentioned this pull request Jan 21, 2015
@ax3l ax3l added the affects latest release a bug that affects the latest stable release label Jan 21, 2015
@ax3l ax3l deleted the fix-memOnSoC branch May 21, 2015 18:05