4 minute network interruption causes VM to crash #32

@tytanick

Hi,
When MooseFS is mounted with the default options and the network is interrupted for 4 minutes, all VMs crash with errors, and everything stays broken even after the network is back online. Ceph did not have this issue, so I wanted MooseFS to be as bulletproof as Ceph is.
I have just tested this, so I have a solution ready to implement. I contacted Piotr from MooseFS about this and was able to fix it with his help.

How to crash:

  1. Run a VM on MooseFS (stored as a raw file; bdev will probably behave the same).
  2. Disconnect the client (the node that runs mfsmount) from the chunkservers and the master for 4 minutes.
  3. The VM running on that node will crash and won't be able to recover.

oplog errors on the node after the network is restored (it won't commit those changes and will probably stay in that state forever):
07.18 17:26:26.159218: uid:0 gid:0 pid:1211923 cmd:read (6247965,4096,15040622592) [handle:01000002]: EIO (Input/output error)
07.18 17:26:26.175195: uid:0 gid:0 pid:1211923 cmd:read (6247965,4096,15040622592) [handle:01000002]: EIO (Input/output error)
07.18 17:26:26.175311: uid:0 gid:0 pid:1211923 cmd:getxattr (6247965,security.capability,0) (using cache): EIO (Input/output error)
07.18 17:26:26.175338: uid:0 gid:0 pid:1211923 cmd:write (6247965,4096,116391936) [handle:01000002]: EIO (Input/output error)
07.18 17:26:26.175520: uid:0 gid:0 pid:1211923 cmd:read (6247965,4096,15040622592) [handle:01000002]: EIO (Input/output error)
07.18 17:26:26.183241: uid:0 gid:0 pid:1211923 cmd:read (6247965,4096,15040622592) [handle:01000002]: EIO (Input/output error)
07.18 17:26:26.199212: uid:0 gid:0 pid:1211923 cmd:getxattr (6247965,security.capability,0) (using cache): EIO (Input/output error)
07.18 17:26:26.199236: uid:0 gid:0 pid:1211923 cmd:write (6247965,4096,116391936) [handle:01000002]: EIO (Input/output error)
07.18 17:26:26.199311: uid:0 gid:0 pid:1211923 cmd:read (6247965,4096,15040622592) [handle:01000002]: EIO (Input/output error)
07.18 17:26:26.215169: uid:0 gid:0 pid:1211923 cmd:getxattr (6247965,security.capability,0) (using cache): EIO (Input/output error)
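
To see which operations keep failing, the oplog lines above can be tallied per command. This is only a rough sketch: the helper name `count_eio` is made up for illustration, and the regex assumes the line layout shown in the log excerpt above (a `cmd:<name>` field and a trailing `: EIO (...)` status).

```perl
use strict;
use warnings;

# Hypothetical helper: count EIO failures per operation in oplog lines.
# Assumes the format seen above: "... cmd:<name> (...) ...: EIO (Input/output error)".
sub count_eio {
    my (@lines) = @_;
    my %count;
    for my $line (@lines) {
        # Skip lines that are not EIO failures.
        next unless $line =~ /\bcmd:(\w+)\b.*:\s+EIO\b/;
        $count{$1}++;
    }
    return \%count;
}

# Sample lines copied from the log excerpt in this issue.
my @sample = (
    '07.18 17:26:26.159218: uid:0 gid:0 pid:1211923 cmd:read (6247965,4096,15040622592) [handle:01000002]: EIO (Input/output error)',
    '07.18 17:26:26.175311: uid:0 gid:0 pid:1211923 cmd:getxattr (6247965,security.capability,0) (using cache): EIO (Input/output error)',
    '07.18 17:26:26.175338: uid:0 gid:0 pid:1211923 cmd:write (6247965,4096,116391936) [handle:01000002]: EIO (Input/output error)',
);

my $counts = count_eio(@sample);
print "$_: $counts->{$_}\n" for sort keys %$counts;
```

If the counts keep growing after the network is back, the mount is still stuck in the failed state described above.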

Errors inside VM after network is restored:
(screenshot of the errors inside the VM; image not preserved)

There is a very simple fix, which I applied on my node and tested.
Just add this parameter to the mfsmount command:
-o mfsioretries=999999999

Result: the VM does not crash, even with a 20-minute network outage, and after the network is restored everything is written properly.
Please add this to the default mount command.

Line 129:
if (defined $mfssubfolder) {
    push @$cmd, '-o', "mfssubfolder=$mfssubfolder";
}

# Tytanick patch
push @$cmd, '-o', "mfsioretries=99999999";

push @$cmd, $scfg->{path};

run_command($cmd, errmsg => "mount error");
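
Instead of hard-coding the value, the retry count could be read from the storage configuration with a large default. The following is only a sketch under assumptions: the helper `add_ioretries` and the config key `mfsioretries` in `$scfg` are hypothetical and do not exist in the plugin; only the `-o mfsioretries=N` mount option itself comes from the fix above.

```perl
use strict;
use warnings;

# Hypothetical helper: append the I/O retry option to a mount command.
# $cmd  - array reference holding the command being built
# $scfg - storage config hash; the 'mfsioretries' key is an assumption
sub add_ioretries {
    my ($cmd, $scfg) = @_;

    # Fall back to the large value used in the patch above.
    my $retries = $scfg->{mfsioretries} // 99999999;

    # Reject anything that is not a positive integer before it reaches the command line.
    die "mfsioretries must be a positive integer\n"
        unless $retries =~ /^[1-9][0-9]*$/;

    # Pass the flag and its value as separate arguments.
    push @$cmd, '-o', "mfsioretries=$retries";
    return $cmd;
}

# Usage: 'mfsmount' stands in for the command the plugin has already built.
my $cmd = ['mfsmount'];
add_ioretries($cmd, { mfsioretries => 500 });
print join(' ', @$cmd), "\n";
```

Keeping the default high preserves the behaviour of the patch, while a config key would let an admin pick a lower value if endless retries are not wanted.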

The same option should also be added for bdev, but I HAVE NOT TESTED whether this parameter works properly there, as I have not been able to use bdev yet.
Line 89, with the same option that fixed my issues when the network is interrupted (please test it):
my $cmd = ['/usr/sbin/mfsbdev', 'start', '-H', $mfsmaster, '-S', 'proxmox', '-p', $mfspassword, '-o', 'mfsioretries=99999999'];
