Improve NEST performance by revised connection exchange and spike delivery #2926
Merged
heplesser merged 146 commits into nest:master on Sep 13, 2023
mlober reviewed Sep 11, 2023
Co-authored-by: Jochen Martin Eppler <jougs@gmx.net>
Co-authored-by: Melissa Lober <47082241+mlober@users.noreply.github.com>
jessica-mitchell approved these changes Sep 13, 2023
mlober approved these changes Sep 13, 2023
tomtetzlaff pushed a commit to tomtetzlaff/nest-simulator that referenced this pull request Mar 11, 2025
This pull request significantly improves NEST performance by revising connection exchange and spike delivery. The changes are described in detail below.
Note that transmission of `SecondaryEvents` is essentially not affected by this PR since they are written directly into ready-made buffers on the receiver side. There is only some effect on connection transmission to the presynaptic side.

Many NEST developers have contributed to this work, especially @diesmann, @suku248, @JoseJVS, @med-ayssar, @mlober, @hakonsbm, @ackurth and @JanVogelsang.
Breaking changes in NEST
Spikes from last slice not delivered
- … `spike_recorder`) are not affected since they receive spikes locally.

Removed kernel parameters
- `sort_connections_by_source`: `use_compressed_spikes` remains and automatically activates connection sorting. There is no relevant use case in which sorting without compressing would make sense.
- `adaptive_spike_buffers`: spike buffers are now always adaptive, see the section on Buffer growth and shrinking.
- `max_buffer_size_spike_data`: there is no upper limit since all spikes need to be transmitted in one round.

New kernel parameters
The following parameters control or report spike buffer resizing (see Buffer growth and shrinking for details):
- `spike_buffer_grow_extra`
- `spike_buffer_shrink_limit`
- `spike_buffer_shrink_spare`
- `spike_buffer_resize_log`

Things to check in particular
- … `kernel_manager.h`)

Modified tests
- `test_mip_corrdet`: needs to simulate one step longer due to delivery at the beginning of the next step
- `test_regression_issue-1034`: needs to subtract min delay due to moved delivery

Connection compression and transmission
- … `gather_target_data_compressed()`.
- `spike` in a number of method and data structure names below has historic roots and should be cleaned up in a follow-up step.

Data Structures
`SourceTable::sources_`
- `sources_[target_thread][syn_id][lcid]` stores raw connection info, built during connection creation

`SourceTable::compressible_sources_`
- `compressible_sources_[target_thread][syn_id][idx]` contains one `pair<source_node_id, SpikeData(target_thread, syn_id, conn_lcid, 0)>` entry, mapping each source node id to a `SpikeData` entry identifying the first `sources_` entry for that source node in the sorted `sources_` array

`ConnectionManager::compressed_spike_data`
- `compressed_spike_data[syn_id][source_index][target_thread]` is the result of the second compression step
- … `pair` from `compressible_sources_`.
- `SourceTable::compressed_spike_data_map_` provides an index from source node id to `source_index` in `compressed_spike_data`.

`SourceTable::compressed_spike_data_map_`
- `compressed_spike_data_map_[syn_id]` maps each unique source node id onto the corresponding `source_index` in the `compressed_spike_data`
- … `CSDMapEntry` …
- … `compressed_spike_data_map_`.

`ConnectionManager::iteration_state_`
- `vector< pair< syn_id, map< source_gid, CSDMapEntry >::const_iterator > >`
- … `compressed_spike_data_map_`.

Source compression
- First compression step (`SourceTable::collect_compressible_sources()`, thread parallel)
  - … `SourceTable::sources_`, create entry in `compressible_sources_` connecting source ID to info about first entry for that source in `sources_` and mark sequence of connections from that source neuron as `source_has_more_targets`.
- Second compression step (`SourceTable::fill_compressed_spike_data()`, serial)
  - … `syn_id`, iterate over connections on all threads in `compressible_sources_`
  - … `compressed_spike_data_map_` (for that `syn_id`)
  - … `compressed_spike_data` with one slot per thread
  - … `compressed_spike_data_map_`
  - … `SpikeData` entry created in the first compression step.

Connection transmission
Connection transmission works in multiple rounds if necessary, buffer size may be adjusted
Collocation of data is assigned to "assigned ranks"
Writing is mainly done by `ConnectionManager::fill_target_buffer` as follows:
- Iterate over `compressed_spike_data_map_`, outermost by `syn_id`, then over source entries.
Gather MPI-exchanges data that has been written to buffers
If not all data has been transmitted, do more rounds until all data has been transmitted.
For each compressed set of connections, we send to the presynaptic side
- … `compressed_spike_data`

NOTE: The iteration scheme is different from the original approach. We stop as soon as a single rank has filled its part of the buffer. In the original, iteration would continue until the last rank had filled its chunk. CSDMap entries were marked as processed when written. On the next round, iteration through CSDMap would start at the point where the first rank had to stop writing, skipping all entries that had been written.
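
To make the two compression steps concrete, here is a minimal Python model of the data structures described under Data Structures and Source compression. It is a sketch, not the NEST implementation: plain lists and dicts stand in for the `SourceTable` and `ConnectionManager` members, a `(target_thread, syn_id, lcid)` tuple stands in for `SpikeData`, the `source_has_more_targets` marking is left out, and `sources` is assumed to be already sorted by source node id. The function names mirror the C++ methods only for readability.

```python
def collect_compressible_sources(sources):
    """First step: one entry per run of identical source ids.

    sources[tid][syn_id] is a list of source node ids, indexed by lcid and
    assumed to be sorted by source id (a modelling assumption).
    """
    compressible = []
    for tid, per_syn in enumerate(sources):
        compressible.append({})
        for syn_id, src_ids in per_syn.items():
            entries = []
            for lcid, src in enumerate(src_ids):
                # Keep only the first connection of each source.
                if lcid == 0 or src_ids[lcid - 1] != src:
                    entries.append((src, (tid, syn_id, lcid)))
            compressible[tid][syn_id] = entries
    return compressible


def fill_compressed_spike_data(compressible):
    """Second step: merge per-thread entries into one entry per source."""
    num_threads = len(compressible)
    syn_ids = sorted({s for per_syn in compressible for s in per_syn})
    csd = {}      # syn_id -> list over source_index of per-thread slots
    csd_map = {}  # syn_id -> {source node id: source_index}
    for syn_id in syn_ids:
        csd[syn_id] = []
        csd_map[syn_id] = {}
        for tid in range(num_threads):
            for src, spike_data in compressible[tid].get(syn_id, []):
                if src not in csd_map[syn_id]:
                    csd_map[syn_id][src] = len(csd[syn_id])
                    csd[syn_id].append([None] * num_threads)
                csd[syn_id][csd_map[syn_id][src]][tid] = spike_data
    return csd, csd_map


# Two threads; thread 0 has no connections of synapse type 1.
sources = [
    {0: [1, 1, 2, 5]},
    {0: [2, 2, 5], 1: [1]},
]
compressible = collect_compressible_sources(sources)
csd, csd_map = fill_compressed_spike_data(compressible)
```

In this model, a source without targets on a given thread simply keeps `None` in that thread's slot; how the real code marks such empty slots is not covered here.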
Spike transmission
- Spikes stored in `emitted_spikes_register_` during node update are gathered at the end of the time slice and exchanged between ranks by `gather_spike_data()` and `collocate_spike_data_buffers_()`.
- … `deliver_events_()`

Data structures
Spike register
- `emitted_spikes_register` … `SpikeData` (to be written directly to transmission buffer) and rank of target neuron (for writing to correct section of target buffer)
- … `emitted_spikes_register_` in `EventDeliveryManager::send_remote()`, which is called when a node sends a spike
- … `emitted_spikes_register_` in `EventDeliveryManager::collocate_spike_data_buffers_()`, called from `gather_spike_data()`
- `SendBufferPosition` … `TargetSendBufferPosition` used for connection communication with assigned ranks.
- `SpikeData` … `OffGrid` version
- … `send_remote()`, we immediately create the eventual `SpikeData` entry, which is later copied to the transmission buffer by `collocate_spike_data_buffers_()`, no more re-coding in the process
- … `(Target, lag)` for direct insertion to `emitted_spikes_register` as part of `SpikeDataWithRank` entry
- `SpikeDataWithRank` combines `SpikeData` with target rank information needed for eventual writing to transmission buffer.
- … `emplace_back()` into `emitted_spikes_register`, i.e., direct construction instead of construct and copy.
- … `set_lcid()` to allow transmission of locally required buffer size per rank in LCID field
- … `get_marker()`
- … `struct SpikeDataWithRank` for `emitted_spikes_register`, also variant for `OffGrid`

Deliver events first
- … `deliver_events_()` again in a separate method at the beginning of each update loop.
- `clock_` is advanced by one `min_delay` when spikes are delivered compared to when they were sent.
- `min_delay` needs to be subtracted from `clock_` when computing arrival times
  - `prepared_timestamps` in `deliver_events_()`
  - `rate_*_impl.h` files

Spike gathering and transmission
Marking completeness and required buffer size
- … `SpikeData::marker_` field …
- … `begpos`, the last `endpos` (this position is included in the chunk, it is not one beyond)
- `local_max_spikes_per_rank` is the largest number of spikes a given rank needs to transmit to any other rank.
- `global_max_spikes_per_rank` is the maximum of all `local_max_spikes_per_rank` values. It determines the minimum required buffer chunk size.
- `SpikeData` marker values are defined as follows:
  - `DEFAULT`: Normal entry, cannot occur in endpos
  - `END`: Marks last entry containing data. … `local_max_spikes_per_rank` of the sending rank is equal to the current buffer size
  - … `local_max_spikes_per_rank`.
- If `local_max >= chunk_size`, set endpos markers to `INVALID` and store `local_max` there.
- … `INVALID`, otherwise set `END` marker on last position written to.
- … `COMPLETE` on endpos for chunk and store `local_max` in endpos LCID
- … `global_max_spikes_per_rank` from all `local_max` information obtained
- If `global_max > chunk_size`, grow buffer and repeat entire process.

Buffer growth and shrinking
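
The rule from the marker description (compute the global maximum from all local maxima; grow and repeat if it exceeds the chunk size) is what triggers buffer growth during gathering. The toy model below illustrates just that loop. It is a sketch under several assumptions: the marker encoding in `endpos` is not modelled, a Python list of per-rank registers replaces the MPI exchange, `model_gather` and its `grow` argument are hypothetical names, and the default `grow` simply resizes to exactly the required size.

```python
def local_max_spikes_per_rank(register, num_ranks):
    """Largest number of spikes one rank has to send to any single rank.

    register is a list of (target_rank, spike_data) pairs, loosely modelled
    on SpikeDataWithRank entries.
    """
    counts = [0] * num_ranks
    for target_rank, _spike_data in register:
        counts[target_rank] += 1
    return max(counts)


def model_gather(registers, chunk_size, grow=lambda global_max: global_max):
    """Repeat the exchange until every rank's spikes fit into one chunk.

    Returns the final chunk size and the number of rounds needed.
    grow must return a size of at least global_max, otherwise the loop
    would not terminate.
    """
    num_ranks = len(registers)
    rounds = 0
    while True:
        rounds += 1
        local_max = [local_max_spikes_per_rank(r, num_ranks) for r in registers]
        global_max = max(local_max)
        if global_max <= chunk_size:
            return chunk_size, rounds
        # Not everything fit: grow the buffer and repeat the entire process.
        chunk_size = grow(global_max)


# Rank 0 sends three spikes to rank 1, rank 1 sends one spike to rank 0.
registers = [
    [(0, "a"), (1, "b"), (1, "c"), (1, "d")],
    [(0, "e")],
]
result_small = model_gather(registers, chunk_size=2)
result_large = model_gather(registers, chunk_size=4)
```

With a chunk size of 2 the first round cannot hold rank 0's three spikes for rank 1, so the model grows the chunk and needs a second round; with a chunk size of 4 a single round suffices.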
- … `gather_spike_data_()` … `global_max_spikes_per_rank_`, i.e., the largest number of spikes that any rank has sent to any other rank. The individual sections of the spike transmission buffer must be at least this size.
- … `global_max_spikes_per_rank_`, growth during gathering if required.

Growing
- … `global_max_spikes_per_rank_`
- … `(1 + spike_buffer_grow_extra) * global_max_spikes_per_rank_` to keep number of grow operations small
- … `spike_buffer_grow_extra == 0.5`

Shrinking
- … `global_max_spikes_per_rank_ < spike_buffer_shrink_limit * buffer_size`
- … `new_size = ( 1 + spike_buffer_shrink_spare ) * global_max_spikes_per_rank_`
- … `spike_buffer_shrink_limit = 0`
- … `spike_buffer_shrink_limit == 0.3`
- … `spike_buffer_shrink_spare == 0.1`

Logging
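
The growing and shrinking formulas above determine when the buffer is resized, and therefore when a resize is logged. As a rough sketch of the sizing rule only, the function below applies both formulas to a per-rank chunk size. The function name is hypothetical, truncation to an integer is an assumption (the rounding used in NEST is not stated here), and the values 0.5, 0.3 and 0.1 are taken from the lists above as example settings.

```python
def resized_chunk_size(buffer_size, global_max,
                       grow_extra=0.5, shrink_limit=0.3, shrink_spare=0.1):
    """Return the chunk size to use after a gather round.

    buffer_size: current chunk size per rank
    global_max:  largest number of spikes any rank sent to any other rank
    """
    if global_max > buffer_size:
        # Grow with headroom to keep the number of grow operations small.
        return int((1 + grow_extra) * global_max)
    if global_max < shrink_limit * buffer_size:
        # Shrink, but keep some spare room (never below one entry).
        return max(1, int((1 + shrink_spare) * global_max))
    return buffer_size


grown = resized_chunk_size(10, 12)                        # too small: grow
shrunk = resized_chunk_size(100, 20)                      # mostly empty: shrink
kept = resized_chunk_size(100, 50)                        # in between: keep
never = resized_chunk_size(100, 20, shrink_limit=0.0)     # limit 0: no shrinking
```

Note that a `shrink_limit` of 0 makes the shrink condition unsatisfiable, since `global_max` is never negative, so the buffer then only ever grows.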
- … `global_max_spikes_per_rank_` and the new buffer size are recorded.
- … `spike_buffer_resize_log`, which is a dictionary with the same structure as `events` dictionaries of recorders, i.e., containing one array for each of the three quantities recorded.

Spike delivery
- `deliver_events_()` is called at the beginning of each time slice except for the very first time slice (time 0, nothing to deliver)
- `deliver_events_()` is called in a thread-parallel context
- … `end_marker` in section from each rank

Minor changes
Limit on LCID values
- `MAX_LCID` now used to mark `invalid_lcid`
- … `MAX_LCID-1`
- … `nest_types.h`

`MPIManager` changes
- … `EventDeliveryManager`.
- … `FULL_LOGGING()` macro
- … `write_to_dump()` method for logging output
- … `critical` sections for output
- `CMakeLists.txt`
- `cmake/ProcessOptions.cmake`
- `libnestutil/config.h.in`
- `kernel_manager.h,cpp`

Touch ups
- … `int` by `size_t` for spike multiplicity
- `music_event_out_proxy`
- `spike_recorder`
- `event`
- `stimulation_backend_mpi.h`
- `recording_backend_mpi.h`

Updated tests
- `test_stdp_synapse`: modernization, no change to logic

SLI unittest
- `distributed_process_invariant_events...`

Open issues to be followed up
- … `MAX_` and `invalid_` constants, see Systematize definition of INVALID_* and MAX_* constants #2529
- … `EventDeliveryManager`. Move to `ConnectionManager`.
- … `MPIManager` and code using the buffers in complicated ways. This should be made more systematic.
- `deliver_events_()` can be simplified by use of functions.
- Should `SendBufferPosition` be turned into a proper iterator (or array of iterators), and should `TargetSendBufferPosition` be moved to a file of its own?
- … `spike` in names in connection infrastructure building
- `SourceTable::compressed_spike_data_map_` can be cleared after connection transmission

This PR replaces #2617.