Commit cd359e4

huansongmy-ship-it
authored and committed
Reduce flakiness in test fts_segment_reset
The test fts_segment_reset has been flaky because FTS would sometimes still promote the mirror if the primary takes a bit longer to restart after getting out of the RESET stage. An example:

- Primary 0 gets out of RESET and is about to be restarted:

  2022-05-23 15:32:53.924540 UTC,,,p105578,th1560833280,,,,0,,,seg0,,,,,"LOG","00000","all server processes terminated; reinitializing",,,,,,,0,,"postmaster.c",4284,

- It takes primary 0 about 2-3 seconds to do so:

  2022-05-23 15:32:56.184117 UTC,,,p105578,th1560833280,,,,0,,,seg0,,,,,"LOG","00000","database system is ready to accept connections"

- Unfortunately, before primary 0 could restart, FTS makes one last probe and finds that it is in recovery mode and not making progress (which is "correct" because primary 0 has finished recovery):

  2022-05-23 15:32:56.009206 UTC,,,p102591,th2023709952,,,,0,con3,,seg-1,,,,,"LOG","00000","FTS: detected segment is in recovery mode and not making progress (content=0) primary dbid=2, mirror dbid=5",,,,,,,0,,"ftsprobe.c",254,
  2022-05-23 15:32:56.065399 UTC,,,p102591,th2023709952,,,,0,con3,,seg-1,,,,,"LOG","00000","FTS max (5) retries exhausted (content=0, dbid=2) state=9",,,,,,,0,,"ftsprobe.c",788

Currently, we let the primary stay in the RESET stage for 27 seconds. FTS has a default 5-second retry cycle, at the end of which it makes the promote decision. That leaves about 3 seconds for the primary to start after getting out of RESET, which is probably too short.

Now make the retry cycle 15 seconds and the RESET delay 17 seconds. That leaves about 13 seconds for the primary to start after getting out of RESET, which should be enough to reduce the common flakiness.
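The timing argument in this message can be checked with a small sketch. It assumes a simplified model that is not taken from the FTS code: retry cycles start at the same moment as the RESET delay, and FTS makes a promote decision at the end of every cycle. The function name `slack_after_reset` is hypothetical and exists only for this illustration.

```python
def slack_after_reset(reset_delay_s: int, retry_cycle_s: int) -> int:
    """Seconds left between the primary leaving RESET and the next
    FTS promote decision.

    Simplified model (an assumption, not the FTS implementation):
    decisions happen at every multiple of retry_cycle_s, counted from
    the moment the RESET delay starts.
    """
    # First decision point that falls strictly after the RESET delay ends.
    next_decision_s = (reset_delay_s // retry_cycle_s + 1) * retry_cycle_s
    return next_decision_s - reset_delay_s


# Old settings: 27-second RESET delay, 5-second retry cycle.
old_slack = slack_after_reset(27, 5)    # 30 - 27 = 3

# New settings: 17-second RESET delay, 15-second retry cycle.
new_slack = slack_after_reset(17, 15)   # 30 - 17 = 13

print(f"old slack: {old_slack}s, new slack: {new_slack}s")
```

Under this model the primary gets roughly four times as long to finish restarting, and the 2-3 second restart seen in the log above fits comfortably inside the new window.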
1 parent a31ea51 commit cd359e4

2 files changed: 23 additions and 12 deletions

src/test/isolation2/expected/fts_segment_reset.out

Lines changed: 12 additions & 6 deletions
@@ -10,19 +10,24 @@
 -- start_ignore
 alter system set gp_fts_probe_interval to 10;
 ALTER
+-- Because after RESET, it still takes a little while for the primary
+-- to restart, and potentially makes FTS think it's in "recovery not
+-- in progress" stage and promote the mirror, we would need the FTS
+-- to make that decision a bit less frequently.
+alter system set gp_fts_probe_retries to 15;
+ALTER
 select pg_reload_conf();
 pg_reload_conf
 ----------------
 t
 (1 row)
 -- end_ignore

--- Let the background writer sleep 27 seconds to delay the resetting.
--- This number is selected because there's a slight chance that FTS senses
--- "recovery not in progress" after its 5-second retry window and promote
--- the mirror. So just put the end of the sleep perid away from the end
--- of the retry windows.
-select gp_inject_fault('fault_in_background_writer_quickdie', 'sleep', '', '', '', 1, 1, 27, dbid) from gp_segment_configuration where role = 'p' and content = 0;
+-- Let the background writer sleep 17 seconds to delay the resetting.
+-- This number is selected to be larger than the 15-second retry window
+-- which makes a meaningful test, meanwhile reduce the chance that FTS sees
+-- a "recovery not in progress" primary as much as possible.
+select gp_inject_fault('fault_in_background_writer_quickdie', 'sleep', '', '', '', 1, 1, 17, dbid) from gp_segment_configuration where role = 'p' and content = 0;
 gp_inject_fault
 -----------------
 Success:
@@ -94,6 +99,7 @@ select pg_sleep(30);
 -- start_ignore
 -- restore parameters
 alter system reset gp_fts_probe_interval;
+alter system reset gp_fts_probe_retries;
 select pg_reload_conf();
 -- end_ignore

src/test/isolation2/sql/fts_segment_reset.sql

Lines changed: 11 additions & 6 deletions
@@ -9,15 +9,19 @@
 -- Let FTS detect/declare failure sooner
 -- start_ignore
 alter system set gp_fts_probe_interval to 10;
+-- Because after RESET, it still takes a little while for the primary
+-- to restart, and potentially makes FTS think it's in "recovery not
+-- in progress" stage and promote the mirror, we would need the FTS
+-- to make that decision a bit less frequently.
+alter system set gp_fts_probe_retries to 15;
 select pg_reload_conf();
 -- end_ignore

--- Let the background writer sleep 27 seconds to delay the resetting.
--- This number is selected because there's a slight chance that FTS senses
--- "recovery not in progress" after its 5-second retry window and promote
--- the mirror. So just put the end of the sleep perid away from the end
--- of the retry windows.
-select gp_inject_fault('fault_in_background_writer_quickdie', 'sleep', '', '', '', 1, 1, 27, dbid)
+-- Let the background writer sleep 17 seconds to delay the resetting.
+-- This number is selected to be larger than the 15-second retry window
+-- which makes a meaningful test, meanwhile reduce the chance that FTS sees
+-- a "recovery not in progress" primary as much as possible.
+select gp_inject_fault('fault_in_background_writer_quickdie', 'sleep', '', '', '', 1, 1, 17, dbid)
 from gp_segment_configuration where role = 'p' and content = 0;

 -- Do not let the postmaster send SIGKILL to the bgwriter
@@ -54,6 +58,7 @@ select pg_sleep(30);
 -- start_ignore
 -- restore parameters
 alter system reset gp_fts_probe_interval;
+alter system reset gp_fts_probe_retries;
 select pg_reload_conf();
 -- end_ignore
