Description
Hi,
There is an issue when you mount MooseFS with default options: if the network is interrupted for 4 minutes, all VMs crash with errors, and everything stays broken even after the network is brought back online. Ceph did not have this issue, so I wanted MooseFS to be as bulletproof as Ceph is.
I have just tested this, so I have a solution ready to implement. I contacted Piotr from MooseFS about this matter and was able to fix it with his help.
How to crash:
- Run a VM on MooseFS (it can be a raw file, but bdev will probably behave the same).
- Disconnect the client (the node that runs mfsmount) from the chunkservers and the master for 4 minutes.
- The VM running on that node will crash and won't be able to recover.
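One possible way to simulate the disconnect in the steps above is to drop the client's outgoing traffic to the MooseFS ports for 4 minutes. This is only a sketch: it assumes the default ports (9421 for client connections to the master, 9422 for chunkservers) and uses iptables, which may not match how the outage was produced in the original test. The function name `outage_commands` is made up for this example. It prints the commands instead of running them, so they can be reviewed first.

```shell
#!/bin/sh
# Sketch: print the commands that would cut a client off from MooseFS
# for a given number of seconds, then restore connectivity.
# Assumption: default MooseFS ports 9421 (master) and 9422 (chunkservers).

outage_commands() {
    seconds="${1:-240}"   # 4 minutes by default, as in the steps above
    # Block outgoing traffic to the master and the chunkservers.
    for port in 9421 9422; do
        echo "iptables -I OUTPUT -p tcp --dport $port -j DROP"
    done
    # Keep the outage in place for the requested time.
    echo "sleep $seconds"
    # Remove the same rules again to restore the network.
    for port in 9421 9422; do
        echo "iptables -D OUTPUT -p tcp --dport $port -j DROP"
    done
}

outage_commands 240
```

To actually apply it on a test node, the printed commands could be piped into a root shell after checking them.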
oplog errors on the node after the network is restored (it won't commit those changes and will probably stay in that state forever):
07.18 17:26:26.159218: uid:0 gid:0 pid:1211923 cmd:read (6247965,4096,15040622592) [handle:01000002]: EIO (Input/output error)
07.18 17:26:26.175195: uid:0 gid:0 pid:1211923 cmd:read (6247965,4096,15040622592) [handle:01000002]: EIO (Input/output error)
07.18 17:26:26.175311: uid:0 gid:0 pid:1211923 cmd:getxattr (6247965,security.capability,0) (using cache): EIO (Input/output error)
07.18 17:26:26.175338: uid:0 gid:0 pid:1211923 cmd:write (6247965,4096,116391936) [handle:01000002]: EIO (Input/output error)
07.18 17:26:26.175520: uid:0 gid:0 pid:1211923 cmd:read (6247965,4096,15040622592) [handle:01000002]: EIO (Input/output error)
07.18 17:26:26.183241: uid:0 gid:0 pid:1211923 cmd:read (6247965,4096,15040622592) [handle:01000002]: EIO (Input/output error)
07.18 17:26:26.199212: uid:0 gid:0 pid:1211923 cmd:getxattr (6247965,security.capability,0) (using cache): EIO (Input/output error)
07.18 17:26:26.199236: uid:0 gid:0 pid:1211923 cmd:write (6247965,4096,116391936) [handle:01000002]: EIO (Input/output error)
07.18 17:26:26.199311: uid:0 gid:0 pid:1211923 cmd:read (6247965,4096,15040622592) [handle:01000002]: EIO (Input/output error)
07.18 17:26:26.215169: uid:0 gid:0 pid:1211923 cmd:getxattr (6247965,security.capability,0) (using cache): EIO (Input/output error)
Errors inside the VM after the network is restored:

There is a very simple fix, which I applied on my node and tested:
Just add this parameter to the MooseFS mount command:
-o mfsioretries=999999999
Result: the VM does not crash, and even with a 20-minute network outage, everything is written properly once the network is restored.
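For a manual mount outside the plugin, the option might be passed roughly like this. This is a sketch, not the plugin's code: the helper name `build_mfsmount_cmd`, the master host `mfsmaster`, and the mount point `/mnt/mfs` are placeholders, and the retry value is the one from the patch in this issue. The script only prints the command so it can be checked before use.

```shell
#!/bin/sh
# Sketch: assemble an mfsmount command line with a raised I/O retry limit.
# build_mfsmount_cmd, the host name, and the mount point are hypothetical.

build_mfsmount_cmd() {
    master="$1"                 # MooseFS master host
    mountpoint="$2"             # local mount point
    retries="${3:-99999999}"    # value used in the patch in this issue
    printf 'mfsmount %s -H %s -o mfsioretries=%s\n' \
        "$mountpoint" "$master" "$retries"
}

# Print the command for review instead of running it.
build_mfsmount_cmd mfsmaster /mnt/mfs
```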
Please add this to the default mount command.
Line 129:
if (defined $mfssubfolder) {
    push @$cmd, '-o', "mfssubfolder=$mfssubfolder";
}
# Tytanick patch
push @$cmd, '-o', "mfsioretries=99999999";
push @$cmd, $scfg->{path};
run_command($cmd, errmsg => "mount error");
We should also add this for bdev, but I have NOT tested whether this parameter works properly there, as I have not been able to use bdev yet.
Line 89, with the same option that fixed my issues when the network is interrupted (please test it):
my $cmd = ['/usr/sbin/mfsbdev', 'start', '-H', $mfsmaster, '-S', 'proxmox', '-p', $mfspassword, '-o mfsioretries=99999999'];
So this should be added to bdev as well.