[Mellanox] Disable SSD NCQ on Mellanox platforms#17567
liat-grozovik merged 1 commit into sonic-net:master
|
@StormLiangMS,@liushilongbuaa PR: #17567 is conflict with MS internal repo |
|
@saiarcot895 Could you please help review? Thanks! |
|
How about the SN2700 A1? |
|
/azpw ms_conflict |
liat-grozovik
left a comment
@volodymyrsamotiy you are missing SN5600 as well as SN2700-A1.
Please go over the list of systems. BTW, this means you will have a conflict with 202205, so please add the missing systems and create a backport PR for 202205.
|
@liat-grozovik are we good to go? |
472b63f to 49e301e
|
@StormLiangMS,@liushilongbuaa PR: #17567 is conflict with MS internal repo |
@liat-grozovik can we move forward with this PR and handle 5600 and 2700-A1 with another PR? |
|
/azpw ms_conflict |
|
@StormLiangMS,@liushilongbuaa PR: #17567 is conflict with MS internal repo |
|
/azpw ms_conflict |
@yxieca, there is no need to handle 5600 and 2700-A1 in another PR; this PR has already been updated and includes all the changes. |
@volodymyrsamotiy what is keeping this PR in draft mode? @liat-grozovik any other blocking issues? |
|
@liat-grozovik Can we unblock this PR now? |
|
Removed the label for the 202205 branch, as another PR has been opened against 202205: #17662 |
@liat-grozovik I think this has been addressed, so please check again |
|
@volodymyrsamotiy PR conflicts with 202311 branch |
|
@volodymyrsamotiy can you help create a 202311 PR? |
- Why I did it
Based on some research, some products may experience occasional I/O failures in the communication between the CPU and the SSD because of NCQ (Native Command Queuing).
There seems to be a problem between some kernel versions and some SATA controllers.
Syslog error message examples:
Error "ata1: SError: { UnrecovData Handshk }" - "failed command: WRITE FPDMA QUEUED".
Error "ata1: SError: { RecovComm HostInt PHYRdyChg CommWake 10B8B DevExch }" - "failed command: READ FPDMA QUEUED".
Some vendors have already disabled NCQ on their platforms in SONiC due to a similar issue:
[Arista] Disable ATA NCQ for a few products (sonic-net#13739)
[Arista] Disable SSD NCQ on DCS-7050CX3-32S (sonic-net#13964)
There are also discussions on Debian/Ubuntu forums about similar issues, where disabling NCQ was suggested:
https://askubuntu.com/questions/133946/are-these-sata-errors-dangerous
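To check whether a system shows the syslog errors quoted above, something like the following sketch could be used. The function name `count_ncq_failures` is hypothetical; the pattern is taken from the two "failed command" examples in this description and may not cover every variant of the error.

```shell
#!/bin/sh
# Count kernel log lines that report a failed NCQ read or write.
# Reads log text on stdin and prints the number of matching lines.
# Note: grep exits with status 1 when the count is 0.
count_ncq_failures() {
    grep -cE 'failed command: (READ|WRITE) FPDMA QUEUED'
}

# Usage on a live system (not run here):
#   dmesg | count_ncq_failures
```

A non-zero count suggests the platform is hitting the failure described above.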
- How I did it
Add a kernel parameter to tell libata to disable NCQ
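The description does not name the parameter. A minimal sketch for confirming the change on a booted system, assuming the parameter is libata's documented `libata.force=noncq` option (an assumption, not confirmed by the text above), could look like this. The function name `cmdline_disables_ncq` is hypothetical.

```shell
#!/bin/sh
# Return 0 if the given kernel command line contains a libata.force
# option with a "noncq" value, optionally prefixed by a port ID such
# as "1.00:". Assumption: the PR uses libata.force=noncq.
cmdline_disables_ncq() {
    printf '%s\n' "$1" |
        tr ' ' '\n' |                         # one argument per line
        sed -n 's/^libata\.force=//p' |       # keep only libata.force values
        tr ',' '\n' |                         # one forced value per line
        sed 's/^.*://' |                      # drop an optional "ID:" prefix
        grep -qx 'noncq'                      # exact match, so "noncqtrim" is rejected
}

# Usage on a live system (not run here):
#   cmdline_disables_ncq "$(cat /proc/cmdline)" && echo "NCQ disabled"
#   cat /sys/block/sda/device/queue_depth   # a depth of 1 typically means NCQ is off
```

The exact match matters because `noncqtrim` is a different libata option that only disables queued TRIM.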
- How to verify it
Use FIO tool - fio --direct=1 --rw=randrw --bs=64k --ioengine=libaio --iodepth=64 --runtime=120 --numjobs=4
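The command above omits a job name and a target, which fio normally needs in order to define a job. The sketch below prints a fuller command line; `JOB_NAME`, `TEST_FILE` and `TEST_SIZE` are hypothetical placeholders, not values from this PR, and the remaining options are copied from the command above.

```shell
#!/bin/sh
# Hypothetical placeholders; adjust for the system under test.
JOB_NAME="ncq-stress"
TEST_FILE="/host/fio-ncq-test.bin"
TEST_SIZE="1G"

# Print (not run) the fio command: the PR's options plus a job name,
# a target file and a file size.
build_fio_cmd() {
    printf 'fio --name=%s --filename=%s --size=%s ' \
        "$JOB_NAME" "$TEST_FILE" "$TEST_SIZE"
    printf '%s\n' '--direct=1 --rw=randrw --bs=64k --ioengine=libaio --iodepth=64 --runtime=120 --numjobs=4'
}

build_fio_cmd
```

Running the printed command with NCQ enabled and then with NCQ disabled, and checking syslog after each run, gives a before/after comparison.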
|
Cherry-pick PR to 202305: #17960 |
|
ADO: 25853968 |