@DaveTeng0

What changes were proposed in this pull request?

Bugs like HDDS-7592 can break the FSO tree and cause data to be orphaned in the OM. We have developed a tool to identify and repair this condition in the OM and tested it on affected clusters. This jira is to contribute the tool back to the community under the ozone CLI.
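
To picture the failure mode, orphaned data can be treated as a reachability problem: every FSO entry points at a parent ID, and anything that cannot be reached by walking down from a bucket is orphaned. The following is only a minimal Python sketch of that idea, not the logic of the contributed tool. The function name, the plain ID-to-parent dictionary, and the sample IDs are all assumptions made for illustration.

```python
from collections import deque


def find_unreachable(bucket_ids, entries):
    """Return sorted IDs of entries that cannot be reached from any bucket.

    entries maps an object ID to its parent ID (hypothetical layout);
    a top-level entry has a bucket ID as its parent.
    """
    # Index children by parent so the tree can be walked top-down.
    children = {}
    for obj_id, parent_id in entries.items():
        children.setdefault(parent_id, []).append(obj_id)

    # Breadth-first walk from every bucket root, marking what is visited.
    reachable = set()
    queue = deque(bucket_ids)
    while queue:
        current = queue.popleft()
        for child in children.get(current, []):
            if child not in reachable:
                reachable.add(child)
                queue.append(child)

    # Whatever was never visited has no path back to a bucket.
    return sorted(set(entries) - reachable)


# Hypothetical data: parent 10 no longer exists, so directory 20 and the
# file 30 beneath it are orphaned.
entries = {
    11: 1,   # directory directly under bucket 1
    12: 11,  # file under directory 11
    20: 10,  # directory whose parent entry is missing
    30: 20,  # file under the orphaned directory
}
print(find_unreachable([1], entries))  # prints [20, 30]
```

A read-only run would only report the unreachable IDs; a repair run would additionally act on them.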

What is the link to the Apache JIRA?

https://issues.apache.org/jira/browse/HDDS-8101

How was this patch tested?

Unit tests and integration tests.

@DaveTeng0

cc. @errose28

@adoroszlai left a comment

Thanks @DaveTeng0 for the patch.

Some comments about POM and CLI. Note: I haven't checked the code of the tool itself (FSORepairTool).

@errose28 changed the title from "Add FSO repair tool to ozone CLI in read-only and repair modes" to "HDDS-8101. Add FSO repair tool to ozone CLI in read-only and repair modes." on Apr 30, 2024

@errose28 left a comment

I think we still need to decide what the CLI for this should look like. We could do ozone {debug,repair} fso-tree or ozone repair fso-tree [--dry-run]. Also, as we add more commands of this type, I think those that are specific to a component should go under their own subcommand for organization, like ozone repair om fso-tree.

Attila also brought up the --dry-run mode. I think if the command is under repair only, then dry run would not be the expected default value. If we add the read-only invocation under debug then that becomes the equivalent of dry run and no flag is needed.

@DaveTeng0

Yeah! I extracted the common code between FSODebugCLI and FSORepairCLI into separate base classes, FSOBaseCLI and FSOBaseTool, so that both commands reuse the same logic.
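
The shape of that refactoring can be sketched roughly as below. This is a Python illustration of the pattern only, not the Java classes in the patch: the class names, the `repair` attribute, and the list of unreachable IDs passed to the constructor are assumptions.

```python
class FsoTreeToolBase:
    """Hypothetical shared base: one scan, used by both commands."""

    # Subclasses decide whether the scan is allowed to modify anything.
    repair = False

    def __init__(self, unreachable_ids):
        # Stand-in for the result of walking the OM DB (assumed input).
        self.unreachable_ids = list(unreachable_ids)
        self.deleted = []

    def run(self):
        for obj_id in self.unreachable_ids:
            if self.repair:
                # Only the repair variant acts on what it finds.
                self.deleted.append(obj_id)
        mode = "repair" if self.repair else "read-only"
        return (f"{mode}: found {len(self.unreachable_ids)}, "
                f"deleted {len(self.deleted)}")


class FsoDebugTool(FsoTreeToolBase):
    """Read-only variant: reports, never modifies."""
    repair = False


class FsoRepairTool(FsoTreeToolBase):
    """Repair variant: same scan, but acts on unreachable entries."""
    repair = True


print(FsoDebugTool([20, 30]).run())   # prints "read-only: found 2, deleted 0"
print(FsoRepairTool([20, 30]).run())  # prints "repair: found 2, deleted 2"
```

Keeping the scan in one place means the debug output is exactly what the repair command would act on.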

@DaveTeng0

Hello team! Please feel free to let me know if there are any new comments. Thanks!

@errose28 left a comment

Hi @DaveTeng0, I just did a quick pass on the code outside of the main repair logic.

There seem to be a lot of failures on your branch, although I haven't looked deeply and it could just be a bad CI run. Can you get it to a green state?

For the CLI, if we go the route of different debug and repair commands, I think we want ozone debug om fso-tree and ozone repair om fso-tree to be the two options. Just having ozone repair om fso-tree [--dry-run] isn't a bad approach either. I think I slightly prefer that one, since it avoids the need for an extra om subcommand under each option (for now at least), but I'm ok with two separate commands as well.
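
To compare the two layouts side by side, here is a small sketch using Python's argparse. It is a mock-up of the command shapes discussed above, not the actual ozone CLI wiring; `build_parser` and `is_read_only` are hypothetical helper names, and only the command and flag names come from this thread.

```python
import argparse


def build_parser():
    """Mock-up of: ozone {debug,repair} om fso-tree, with --dry-run on repair."""
    parser = argparse.ArgumentParser(prog="ozone")
    top = parser.add_subparsers(dest="command", required=True)

    for command in ("debug", "repair"):
        # Component level: tools specific to the OM live under "om".
        components = top.add_parser(command).add_subparsers(
            dest="component", required=True)
        tools = components.add_parser("om").add_subparsers(
            dest="tool", required=True)
        fso = tools.add_parser("fso-tree")
        if command == "repair":
            # Under repair, modifying is the default; dry run is opt-in.
            fso.add_argument("--dry-run", action="store_true",
                             help="report problems without changing anything")
    return parser


def is_read_only(args):
    # "debug" is always read-only; "repair" is read-only only with --dry-run.
    return args.command == "debug" or getattr(args, "dry_run", False)


parser = build_parser()
for argv in (["debug", "om", "fso-tree"],
             ["repair", "om", "fso-tree"],
             ["repair", "om", "fso-tree", "--dry-run"]):
    print(" ".join(argv), "-> read-only:", is_read_only(parser.parse_args(argv)))
```

In this mock-up, ozone debug om fso-tree and ozone repair om fso-tree --dry-run behave identically, which is why only one of the two would be strictly necessary.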

@adoroszlai marked this pull request as draft on October 18, 2024 06:25
@adoroszlai marked this pull request as ready for review on October 18, 2024 08:49
@adoroszlai requested a review from errose28 on October 22, 2024 18:39
@errose28

@sarvekshayr will be resuming the work on this task. The remaining items to be addressed are:

  • Print all messages to stdout. Remove log4j usage. A --verbose flag should indicate whether or not individual keys are printed.
  • Check that RocksDB wrappers are being used correctly and no banned imports are failing the build.
    • Direct usage of RocksDB will fail the build.
  • Add --bucket and --volume CLI flags to allow the tool to only focus on problematic areas.
  • Add disclaimer to the output that the tool is currently not supported with snapshots.
  • Estimate replicated size if possible.
  • Create a follow-up Jira to support snapshots.
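
As a rough illustration of how the --volume, --bucket, and --verbose items could fit together, here is a small Python sketch. It is not the tool's code: the function name, the (volume, bucket, key) tuples, and the exact wording of the printed lines, including the snapshot disclaimer, are assumptions.

```python
import sys


def report_unreachable(keys, volume=None, bucket=None, verbose=False,
                       out=sys.stdout):
    """Print a summary of unreachable keys, optionally scoped and detailed.

    keys is an iterable of (volume, bucket, key_name) tuples (assumed shape).
    """
    # --volume / --bucket narrow the report to the problematic area.
    selected = [
        k for k in keys
        if (volume is None or k[0] == volume)
        and (bucket is None or k[1] == bucket)
    ]

    # Everything goes to stdout (or the given stream), not to a logger.
    print("Note: snapshots are not currently supported by this tool.", file=out)
    if verbose:
        # --verbose controls whether individual keys are listed.
        for vol, buck, name in selected:
            print(f"unreachable: /{vol}/{buck}/{name}", file=out)
    print(f"Unreachable keys: {len(selected)}", file=out)
    return selected


# Hypothetical sample data.
keys = [
    ("vol1", "bucket1", "dir/a"),
    ("vol1", "bucket2", "dir/b"),
    ("vol2", "bucket1", "dir/c"),
]
report_unreachable(keys, volume="vol1", bucket="bucket2", verbose=True)
```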

@errose28 marked this pull request as draft on October 23, 2024 17:35
@adoroszlai

Continued in #7368.

@adoroszlai closed this on Dec 4, 2024