-
Notifications
You must be signed in to change notification settings - Fork 2.5k
[HUDI-4221] Fixing getAllPartitionPaths perf hit w/ FileSystemBackedMetadata #5829
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -68,41 +68,46 @@ public FileStatus[] getAllFilesInPartition(Path partitionPath) throws IOExceptio | |
|
|
||
| @Override | ||
| public List<String> getAllPartitionPaths() throws IOException { | ||
| FileSystem fs = new Path(datasetBasePath).getFileSystem(hadoopConf.get()); | ||
| Path basePath = new Path(datasetBasePath); | ||
| FileSystem fs = basePath.getFileSystem(hadoopConf.get()); | ||
| if (assumeDatePartitioning) { | ||
| return FSUtils.getAllPartitionFoldersThreeLevelsDown(fs, datasetBasePath); | ||
| } | ||
|
|
||
| List<Path> pathsToList = new CopyOnWriteArrayList<>(); | ||
| pathsToList.add(new Path(datasetBasePath)); | ||
| pathsToList.add(basePath); | ||
| List<String> partitionPaths = new CopyOnWriteArrayList<>(); | ||
|
|
||
| while (!pathsToList.isEmpty()) { | ||
| // TODO: Get the parallelism from HoodieWriteConfig | ||
| int listingParallelism = Math.min(DEFAULT_LISTING_PARALLELISM, pathsToList.size()); | ||
|
|
||
| // List all directories in parallel | ||
| List<FileStatus[]> dirToFileListing = engineContext.map(pathsToList, path -> { | ||
| List<Pair<Path, FileStatus[]>> dirToFileListing = engineContext.map(pathsToList, path -> { | ||
| FileSystem fileSystem = path.getFileSystem(hadoopConf.get()); | ||
| return fileSystem.listStatus(path); | ||
| return Pair.of(path, fileSystem.listStatus(path)); | ||
| }, listingParallelism); | ||
| pathsToList.clear(); | ||
|
|
||
| // if current dictionary contains PartitionMetadata, add it to result | ||
| // if current dictionary does not contain PartitionMetadata, add it to queue | ||
| dirToFileListing.stream().flatMap(Arrays::stream).parallel() | ||
| .forEach(fileStatus -> { | ||
| if (fileStatus.isDirectory()) { | ||
| if (HoodiePartitionMetadata.hasPartitionMetadata(fs, fileStatus.getPath())) { | ||
| partitionPaths.add(FSUtils.getRelativePartitionPath(new Path(datasetBasePath), fileStatus.getPath())); | ||
| } else if (!fileStatus.getPath().getName().equals(HoodieTableMetaClient.METAFOLDER_NAME)) { | ||
| pathsToList.add(fileStatus.getPath()); | ||
| } | ||
| } else if (fileStatus.getPath().getName().startsWith(HoodiePartitionMetadata.HOODIE_PARTITION_METAFILE_PREFIX)) { | ||
| String partitionName = FSUtils.getRelativePartitionPath(new Path(datasetBasePath), fileStatus.getPath().getParent()); | ||
| partitionPaths.add(partitionName); | ||
| } | ||
yihua marked this conversation as resolved.
Show resolved
Hide resolved
|
||
| }); | ||
| dirToFileListing.forEach(p -> { | ||
|
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. don't we have an utilty for this in FSUtils. |
||
| Option<FileStatus> partitionMetaFile = Option.fromJavaOptional(Arrays.stream(p.getRight()).parallel() | ||
| .filter(fileStatus -> fileStatus.getPath().getName().startsWith(HoodiePartitionMetadata.HOODIE_PARTITION_METAFILE_PREFIX)) | ||
| .findFirst()); | ||
|
|
||
| if (partitionMetaFile.isPresent()) { | ||
| // Is a partition. | ||
| String partitionName = FSUtils.getRelativePartitionPath(basePath, p.getLeft()); | ||
| partitionPaths.add(partitionName); | ||
| } else { | ||
| // Add sub-dirs to the queue | ||
| pathsToList.addAll(Arrays.stream(p.getRight()) | ||
| .filter(fileStatus -> fileStatus.isDirectory() && !fileStatus.getPath().getName().equals(HoodieTableMetaClient.METAFOLDER_NAME)) | ||
|
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. looks like you don't need to check
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I tried to incorporate this. I wanted to see if we can do this only for first time where we list the base path. but we have to do it within the while loop and so we might do it unnecessarily for every loop anyways. will go ahead w/ the patch for now for 0.11.1. but lets discuss and I can put up a follow up PR.
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. is there a follow up JIRA?
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Reverted this in the latest patch put up #6234 . no follow ups required for now. |
||
| .map(fileStatus -> fileStatus.getPath()) | ||
| .collect(Collectors.toList())); | ||
| } | ||
| }); | ||
| } | ||
| return partitionPaths; | ||
| } | ||
|
|
||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@nsivabalan making this parallelized with
engineContextshould solve the problem and avoid listing each partition, right, instead of reverting the changes?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 . I think if we simply do the processing we need to do within the single engineContext.map(..) call, we should be able to solve the original problem in #4643.