Skip to content

Commit 943d9c7

Browse files
authored
Move sleep cluster logic to its own class (#3746)
* Move cluster logic to its own class Simplify the calculation. This is an alternative to #3723. Puma 6.6.1 behavior documented here https://gist.github.com/schneems/cef38e9448dfb72943d13050a7da0869 This changes the puma 7 behavior to: - Starts sleeping the event loop before being overloaded (the prior `pool.busy_threads` included in the ready @todo queue) and would only fire. This is closer to puma 6.6.1 behavior - Sleeps a consistent proportional amount, closer to puma 7 behavior, versus puma 6 used a static value. - Sleeping behavior is restricted to 2 or more workers, verus puma 7 was accidentally enabled for clusters with only 1 worker Fixes #3740. This `wait_for_less_busy_worker` value is now treated as a maximum value for the sleep calculation. It also now respects a value of 0 to mean "don't sleep at all" which was the prior behavior of that value. * Whitespace * Remove unused method The busy_thread call is now accessed directly through the thread pool and only used in one place. * Add docs about server <-> reactor relationship The code doesn't make it clear who is calling whom when looking at it in isolation. * Make maximum 25x thread count and allow for overage Puma can take in more requests than it has threads. This change preserves the same logic as before, but it doesn't hit maximum sleep until it is at 25x the number of max threads. That means that if a server was using 5 threads, before it would hit 0.005 sleep if those five threads were busy, now if (only) five threads are busy it will sleep 0.0002 seconds. * Move order of comparison If busy threads is zero we don't need to calculate percentage. * Refactor out clamp Because we're clamped at the numerator, the division will naturally tend towards 1.0. We don't need a second clamp on the result. This allows us to remove an intermediate variable and multiply directly on the return calculation. * Lazy max_threads There are proposals to make thread counts dynamic RE #3658. If this happens we need to pull in the current value, and cannot rely on it being set once. * Avoid re-checking the `max_delay` number every iteration * Refactor out branching The equation already holds up the tested properties without needing to special-case when busy threads is zero. Removes a branching conditional. * Push worker logic into delay class The logic of whether or not the calculation should be performed based on worker count can be calculated once and not repeated on every iteration. * Optimize case where no threads are busy Moves the condition from checking if the delay is zero to checking if the input is zero earlier. This is prevents some calculations and preserves the same number of conditionals.
1 parent 7aadde5 commit 943d9c7

3 files changed

Lines changed: 188 additions & 17 deletions

File tree

Lines changed: 92 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,92 @@
1+
# frozen_string_literal: true
2+
3+
module Puma
4+
# Calculate a delay value for sleeping when running in clustered mode
5+
#
6+
# The main reason this is a class is so it can be unit tested independently.
7+
# This makes modification easier in the future if we can encode properties of the
8+
# delay into a test instead of relying on end-to-end testing only.
9+
#
10+
# This is an imprecise mechanism to address specific goals:
11+
#
12+
# - Evenly distribute requests across all workers at start
13+
# - Evenly distribute CPU resources across all workers
14+
#
15+
# ## Goal: Distribute requests across workers at start
16+
#
17+
# There was a perf bug in Puma where one worker would wake up slightly before the rest and accept
18+
# all the requests on the socket even though it didn't have enough resources to process all of them.
19+
# This was originally fixed by never calling accept when a worker had more requests than threads
20+
# already https://github.com/puma/puma/pull/3678/files/2736ebddb3fc8528e5150b5913fba251c37a8bf7#diff-a95f46e7ce116caddc9b9a9aa81004246d5210d5da5f4df90a818c780630166bL251-L291
21+
#
22+
# With the introduction of true keepalive support, there are two ways a request can come in:
23+
# - A new request from a new client comes into the socket and it must be "accept"-d
24+
# - A keepalive request is served and the connection is retained. Another request is then accepted
25+
#
26+
# Ideally the server handles requests in the order they come in, and ideally it doesn't accept more requests than it can handle.
27+
# These goals are contradictory, because when the server is at maximum capacity due to keepalive connections, it could mean we
28+
# block all new requests, even if those came in before the new request on the older keepalive connection.
29+
#
30+
# ## Distribute CPU resources across all workers
31+
#
32+
# - This issue was opened https://github.com/puma/puma/issues/2078
33+
#
34+
# There are several entangled issues and it's not exactly clear the root cause, but the observable outcome
35+
# was that performance was better with a small sleep, and that eventually became the default.
36+
#
37+
# An attempt to describe why this works is here: https://github.com/puma/puma/issues/2078#issuecomment-3287032470.
38+
#
39+
# Summarizing: The delay is for tuning the rate at which "accept" is called on the socket.
40+
# Puma works by calling "accept" nonblock on the socket in a loop. When there are multiple workers,
41+
# (processes) then they will "race" to accept a request at roughly the same rate. However if one
42+
# worker has all threads busy processing requests, then accepting a new request might "steal" it from
43+
# a less busy worker. If a worker has no work to do, it should loop as fast as possible.
44+
#
45+
# ## Solution(s): Distribute requests across workers at start
46+
#
47+
# For now, both goals are framed as "load balancing" across workers (processes) and achieved through
48+
# the same mechanism of sleeping longer to delay busier workers. Rather than the prior Puma 6.x
49+
# and earlier behavior of using a binary on/off sleep value, we increase it an amound proportional
50+
# to the load the server is under. Capping the maximum delay to the scenario where all threads are busy
51+
# and the todo list has reached a multiplier of the maximum number of threads.
52+
#
53+
# Private: API may change unexpectedly
54+
class ClusterAcceptLoopDelay
55+
attr_reader :max_threads, :max_delay
56+
57+
# Initialize happens once, `call` happens often. Push global calculations here
58+
def initialize(
59+
# Number of workers in the cluster
60+
workers: ,
61+
# Maximum delay in seconds i.e. 0.005 is 5 microseconds
62+
max_delay: # In seconds i.e. 0.005 is 5 microseconds
63+
64+
)
65+
@on = max_delay > 0 && workers >= 2
66+
@max_delay = max_delay.to_f
67+
68+
# Reach maximum delay when `max_threads * overload_multiplier` is reached in the system
69+
@overload_multiplier = 25.0
70+
end
71+
72+
def on?
73+
@on
74+
end
75+
76+
# We want the extreme values of this delay to be known (minimum and maximum) as well as
77+
# a predictable curve between the two. i.e. no step functions or hard cliffs.
78+
#
79+
# Return value is always numeric. Returns 0 if there should be no delay
80+
def calculate(
81+
# Number of threads working right now, plus number of requests in the todo list
82+
busy_threads_plus_todo:,
83+
# Maximum number of threads in the pool, note that the busy threads (alone) may go over this value at times
84+
# if the pool needs to be reaped. The busy thread plus todo count may go over this value by a large amount
85+
max_threads:
86+
)
87+
max_value = @overload_multiplier * max_threads
88+
# Approaches max delay when `busy_threads_plus_todo` approaches `max_value`
89+
return max_delay * busy_threads_plus_todo.clamp(0, max_value) / max_value
90+
end
91+
end
92+
end

lib/puma/server.rb

Lines changed: 20 additions & 17 deletions
Original file line numberDiff line numberDiff line change
@@ -13,6 +13,7 @@
1313
require_relative 'util'
1414
require_relative 'request'
1515
require_relative 'configuration'
16+
require_relative 'cluster_accept_loop_delay'
1617

1718
require 'socket'
1819
require 'io/wait' unless Puma::HAS_NATIVE_IO_WAIT
@@ -58,7 +59,6 @@ def handle_request(client, requests)
5859
attr_accessor :app
5960
attr_accessor :binder
6061

61-
6262
# Create a server for the rack app +app+.
6363
#
6464
# +log_writer+ is a Puma::LogWriter object used to log info and error messages.
@@ -110,6 +110,10 @@ def initialize(app, events = nil, options = {})
110110
@enable_keep_alives &&= @queue_requests
111111
@io_selector_backend = @options[:io_selector_backend]
112112
@http_content_length_limit = @options[:http_content_length_limit]
113+
@cluster_accept_loop_delay = ClusterAcceptLoopDelay.new(
114+
workers: @options[:workers],
115+
max_delay: @options[:wait_for_less_busy_worker] || 0 # Real default is in Configuration::DEFAULTS, this is for unit testing
116+
)
113117

114118
if @options[:fiber_per_request]
115119
singleton_class.prepend(FiberPerRequest)
@@ -245,11 +249,6 @@ def pool_capacity
245249
@thread_pool&.pool_capacity
246250
end
247251

248-
# @!attribute [r] busy_threads
249-
def busy_threads
250-
@thread_pool&.busy_threads
251-
end
252-
253252
# Runs the server.
254253
#
255254
# If +background+ is true (the default) then a thread is spun
@@ -266,7 +265,11 @@ def run(background=true, thread_name: 'srv')
266265
@thread_pool = ThreadPool.new(thread_name, options) { |client| process_client client }
267266

268267
if @queue_requests
269-
@reactor = Reactor.new(@io_selector_backend) { |c| reactor_wakeup c }
268+
@reactor = Reactor.new(@io_selector_backend) { |c|
269+
# Inversion of control, the reactor is calling a method on the server when it
270+
# is done buffering a request or receives a new request from a keepalive connection.
271+
self.reactor_wakeup(c)
272+
}
270273
@reactor.run
271274
end
272275

@@ -291,6 +294,9 @@ def run(background=true, thread_name: 'srv')
291294
# This method is called from the Reactor thread when a queued Client receives data,
292295
# times out, or when the Reactor is shutting down.
293296
#
297+
# While the code lives in the Server, the logic is executed on the reactor thread, independently
298+
# from the server.
299+
#
294300
# It is responsible for ensuring that a request has been completely received
295301
# before it starts to be processed by the ThreadPool. This may be known as read buffering.
296302
# If read buffering is not done, and no other read buffering is performed (such as by an application server
@@ -339,7 +345,6 @@ def handle_servers
339345
pool = @thread_pool
340346
queue_requests = @queue_requests
341347
drain = options[:drain_on_shutdown] ? 0 : nil
342-
max_flt = @max_threads.to_f
343348

344349
addr_send_name, addr_value = case options[:remote_address]
345350
when :value
@@ -384,15 +389,13 @@ def handle_servers
384389
# clients until the code is finished.
385390
pool.wait_while_out_of_band_running
386391

387-
# only use delay when clustered and busy
388-
if pool.busy_threads >= @max_threads
389-
if @clustered
390-
delay = 0.0001 * ((@reactor&.reactor_size || 0) + pool.busy_threads * 1.5)/max_flt
391-
sleep delay
392-
else
393-
# use small sleep for busy single worker
394-
sleep 0.0001
395-
end
392+
# A well rested herd (cluster) runs faster
393+
if @cluster_accept_loop_delay.on? && (busy_threads_plus_todo = pool.busy_threads) > 0
394+
delay = @cluster_accept_loop_delay.calculate(
395+
max_threads: @max_threads,
396+
busy_threads_plus_todo: busy_threads_plus_todo
397+
)
398+
sleep(delay)
396399
end
397400

398401
io = begin
Lines changed: 76 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,76 @@
1+
# frozen_string_literal: true
2+
3+
require_relative "helper"
4+
require "puma/cluster_accept_loop_delay"
5+
6+
class TestClusterAcceptLoopDelay < PumaTest
7+
parallelize_me!
8+
9+
def test_off_when_fewer_than_two_workers
10+
cal_delay = Puma::ClusterAcceptLoopDelay.new(
11+
workers: 2,
12+
max_delay: 1
13+
)
14+
assert_equal true, cal_delay.on?
15+
16+
cal_delay = Puma::ClusterAcceptLoopDelay.new(
17+
workers: 1,
18+
max_delay: 1
19+
)
20+
assert_equal false, cal_delay.on?
21+
end
22+
23+
def test_zero_max_delay_always_returns_zero
24+
cal_delay = Puma::ClusterAcceptLoopDelay.new(
25+
workers: 2,
26+
max_delay: 0
27+
)
28+
assert_equal false, cal_delay.on?
29+
assert_equal 0, cal_delay.calculate(busy_threads_plus_todo: 0, max_threads: 16)
30+
assert_equal 0, cal_delay.calculate(busy_threads_plus_todo: 42, max_threads: 16)
31+
assert_equal 0, cal_delay.calculate(busy_threads_plus_todo: 42 * 42, max_threads: 16)
32+
end
33+
34+
def test_zero_busy_threads_plus_todo_always_returns_zero
35+
cal_delay = Puma::ClusterAcceptLoopDelay.new(
36+
workers: 2,
37+
max_delay: 0.005
38+
)
39+
40+
assert_equal 0, cal_delay.calculate(busy_threads_plus_todo: 0, max_threads: 10)
41+
end
42+
43+
def test_linear_increase_with_busy_threads_plus_todo
44+
cal_delay = Puma::ClusterAcceptLoopDelay.new(
45+
workers: 2,
46+
max_delay: 0.05
47+
)
48+
49+
assert_in_delta 0, cal_delay.calculate(busy_threads_plus_todo: 0, max_threads: 1), 0.001
50+
assert_in_delta 0.002, cal_delay.calculate(busy_threads_plus_todo: 1, max_threads: 1), 0.001
51+
assert_in_delta 0.05, cal_delay.calculate(busy_threads_plus_todo: 25, max_threads: 1), 0.001
52+
assert_in_delta 0.05, cal_delay.calculate(busy_threads_plus_todo: 26, max_threads: 1), 0.001
53+
end
54+
55+
def test_always_return_float_when_non_zero
56+
# Dividing integers accidentally returns 0 so want to make sure we are correctly converting to float before division
57+
cal_delay = Puma::ClusterAcceptLoopDelay.new(
58+
workers: 2,
59+
max_delay: Integer(5)
60+
)
61+
62+
assert_in_delta 0, cal_delay.calculate(busy_threads_plus_todo: 0.to_f, max_threads: Integer(1)), 0.001
63+
assert_equal Float, cal_delay.calculate(busy_threads_plus_todo: Integer(25), max_threads: Integer(1)).class
64+
assert_in_delta 5, cal_delay.calculate(busy_threads_plus_todo: 25, max_threads: Integer(1)), 0.001
65+
end
66+
67+
def test_extreme_busy_values_produce_sensible_delays
68+
cal_delay = Puma::ClusterAcceptLoopDelay.new(
69+
workers: 2,
70+
max_delay: 0.05
71+
)
72+
73+
assert_in_delta 0, cal_delay.calculate(busy_threads_plus_todo: -10, max_threads: 5), 0.001
74+
assert_in_delta 0.05, cal_delay.calculate(busy_threads_plus_todo: Float::INFINITY, max_threads: 5), 0.001
75+
end
76+
end

0 commit comments

Comments
 (0)