Skip to content

fix(console): fix intermittent login failures in dut_console tests#23342

Merged
lolyu merged 3 commits intosonic-net:masterfrom
lipxu:fix/dut-console-login-fixes
Mar 27, 2026
Merged

fix(console): fix intermittent login failures in dut_console tests#23342
lolyu merged 3 commits intosonic-net:masterfrom
lipxu:fix/dut-console-login-fixes

Conversation

@lipxu
Copy link
Copy Markdown
Contributor

@lipxu lipxu commented Mar 26, 2026

Fix two intermittent failures in dut_console tests: an accumulation-buffer fix for the Password prompt detection race, and a splitlines()[0] fix for reliable TMOUT value extraction in test_idle_timeout.

Description of PR

Summary:

Fix two independent intermittent failures in dut_console tests:

  1. ssh_console_conn.pylogin_stage_2() checked re.search(pwd_pattern, output) where output is only the most recent read_channel() chunk. The DUT's Password: prompt can arrive split across multiple TCP reads (e.g. Pa + ssword:), causing no chunk to match and the password never being sent — resulting in an intermittent "Socket is closed" failure in create_duthost_console (~1 in 5 runs). Fix: check return_msg (accumulated read buffer) instead of output.

  2. test_idle_timeout.pysplitlines()[-1] could return a partial prompt string (e.g. admin@hostname:) instead of the numeric TMOUT value when the prompt was not fully stripped from the command output. Fix: use splitlines()[0] to always read the first output line, which is always the numeric value.

Both fixes were validated on internal branch dev/xuliping/20260325_202511_console_login_fix across 5 full test runs with no failures.

Related: follows up on #23295 (blank Enter fix in the same login path — already merged).

Type of change

  • Bug fix
  • Testbed and Framework(new/improvement)
  • New Test case
    • Skipped for non-supported platforms
  • Test case improvement

Back port request

  • 202205
  • 202305
  • 202311
  • 202405
  • 202411
  • 202505
  • 202511

Approach

What is the motivation for this PR?

dut_console tests were failing intermittently (~1 in 5 runs) with "Socket is closed" errors during console login. Root cause: the DUT's Password: prompt sometimes arrives split across multiple TCP reads, so per-chunk pattern matching never matches. A second independent failure in test_idle_timeout caused by splitlines()[-1] returning a prompt fragment instead of the numeric TMOUT value.

How did you do it?

  • ssh_console_conn.py: Changed re.search(pwd_pattern, output) to re.search(pwd_pattern, return_msg) in login_stage_2(), where return_msg is the accumulated read buffer across all chunks.
  • test_idle_timeout.py: Changed splitlines()[-1] to splitlines()[0] in the TMOUT value extraction, so we always get the first output line regardless of trailing prompt remnants.

How did you verify/test it?

Ran all dut_console test cases on a physical testbed using internal branch dev/xuliping/20260325_202511_console_login_fix for 5 full iterations. All tests passed with no failures.

Any platform specific information?

None — applies to all platforms using SSH console connections.

Supported testbed topology if it's a new test case?

N/A (bug fix only)

Documentation

N/A

lipxu and others added 2 commits March 26, 2026 07:01
…_timeout

When find_prompt() captures a partial prompt, netmiko strip_prompt() may fail
to remove the trailing prompt, causing splitlines()[-1] to return the prompt
string instead of the numeric value. Using splitlines()[0] always picks the
first meaningful output line, which is the actual numeric value.

Signed-off-by: Liping Xu <xuliping@microsoft.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The Password: prompt from the DUT is sometimes split across multiple
TCP reads (e.g., 'Pa' + 'ssword:'), causing re.search(pwd_pattern, output)
to fail on each individual chunk. By checking return_msg (the accumulated
read buffer) instead of output, we correctly detect the Password: prompt
even when it arrives in fragments.

This fixes the intermittent 'Socket is closed' failure in create_duthost_console
where 1 in ~5 runs would fail because the password was never sent.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Liping Xu <xuliping@microsoft.com>
@mssonicbld
Copy link
Copy Markdown
Collaborator

/azp run

@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines successfully started running 1 pipeline(s).

Add missing newline at end of file to fix the pre-commit
fix-end-of-files check failure.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Liping Xu <xuliping@microsoft.com>
@mssonicbld
Copy link
Copy Markdown
Collaborator

/azp run

@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines successfully started running 1 pipeline(s).

Copy link
Copy Markdown
Collaborator

@lolyu lolyu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

✅ Approved — Clean, Well-Targeted Bug Fix

Two surgical fixes for intermittent dut_console test failures:

  1. ssh_console_conn.py: Using accumulated return_msg instead of per-chunk output for password prompt detection — correctly handles TCP fragmentation splitting Password: across reads
  2. test_idle_timeout.py: splitlines()[0] instead of splitlines()[-1] for reliable TMOUT extraction — avoids prompt fragments

Both fixes are minimal, well-documented, and validated across 5 full test iterations. LGTM 🚀

# Search for password pattern / send password
if user_sent and not password_sent and re.search(pwd_pattern, output, flags=re.I):
# Use return_msg (accumulated) instead of output to handle cases where
# 'Password:' prompt is split across multiple TCP reads (e.g. 'Pa' + 'ssword:')
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

✅ Good fix — TCP fragmentation splitting Password: across reads is a classic race. Using the accumulated return_msg instead of the per-chunk output is the right approach. Well-commented too.

duthost = duthosts[enum_rand_one_per_hwsku_hostname]
logger.info("Get default session idle timeout")
default_tmout = duthost_console.send_command('echo $TMOUT')
default_tmout = duthost_console.send_command('echo $TMOUT').strip().splitlines()[0].strip()
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Nit: .strip().splitlines()[0].strip() could raise IndexError if send_command returns an empty string (unlikely but possible on connection issues). A defensive guard would be:

lines = duthost_console.send_command('echo $TMOUT').strip().splitlines()
default_tmout = lines[0].strip() if lines else ""

Not a blocker — the old code would also fail on empty output.

@lolyu lolyu merged commit fbc5443 into sonic-net:master Mar 27, 2026
18 of 19 checks passed
@lolyu lolyu added the Request for 202511 branch Request to backport a change to 202511 branch label Mar 27, 2026
ravaliyel pushed a commit to ravaliyel/sonic-mgmt that referenced this pull request Mar 27, 2026
…onic-net#23342)

What is the motivation for this PR?
dut_console tests were failing intermittently (~1 in 5 runs) with "Socket is closed" errors during console login. Root cause: the DUT's Password: prompt sometimes arrives split across multiple TCP reads, so per-chunk pattern matching never matches. A second independent failure in test_idle_timeout caused by splitlines()[-1] returning a prompt fragment instead of the numeric TMOUT value.

How did you do it?
ssh_console_conn.py: Changed re.search(pwd_pattern, output) to re.search(pwd_pattern, return_msg) in login_stage_2(), where return_msg is the accumulated read buffer across all chunks.
test_idle_timeout.py: Changed splitlines()[-1] to splitlines()[0] in the TMOUT value extraction, so we always get the first output line regardless of trailing prompt remnants.
How did you verify/test it?
Ran all dut_console test cases on a physical testbed using internal branch dev/xuliping/20260325_202511_console_login_fix for 5 full iterations. All tests passed with no failures.

Any platform specific information?
None — applies to all platforms using SSH console connections.

Supported testbed topology if it's a new test case?
N/A (bug fix only)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Request for 202511 branch Request to backport a change to 202511 branch

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants