Closed

Commits (45 total; changes shown from 36 commits)
8708856
add getContentSummary and prelim test
sumangala17 Dec 11, 2020
01aba06
remove gfs call
sumangala17 Dec 14, 2020
2876a8b
add tests
sumangala17 Dec 15, 2020
a9960da
pr draft
sumangala17 Dec 15, 2020
30cf195
checkstyle fix
sumangala17 Dec 16, 2020
95d1396
linkedBlockingQ + junit test fix
sumangala17 Dec 17, 2020
03d342c
linkedBlockingQ + junit test fix (#5)
sumangala-patki Dec 17, 2020
bb55b14
using executors
sumangala17 Dec 22, 2020
a9e94a9
using executors
sumangala17 Dec 22, 2020
06609da
run()->call(), terminate condition, add invalid path test
sumangala17 Dec 24, 2020
1433c85
pr revw + checkstyle
sumangala17 Dec 24, 2020
27b6007
Merge branch 'trunk' into HADOOP-17428
sumangala17 Dec 24, 2020
d747f06
findbugs use future returned
sumangala17 Dec 24, 2020
be2daf0
completion service + temp concurrency tests
sumangala17 Jan 5, 2021
96cd2b9
pr revw + exec test
sumangala17 Jan 7, 2021
e3eaca7
clean up
sumangala17 Jan 8, 2021
94a95df
minor changes
sumangala17 Jan 8, 2021
48d0607
rm thread test
sumangala17 Jan 9, 2021
a10be00
checkstyle
sumangala17 Jan 10, 2021
636b434
Merge branch 'trunk' into HADOOP-17428
sumangala17 Jan 10, 2021
bc276b2
revw changes + doc
sumangala17 Jan 12, 2021
744f8c4
javadoc
sumangala17 Jan 12, 2021
9070413
trigger yetus
sumangala17 Jan 12, 2021
657d7ea
Merge branch 'trunk' into HADOOP-17428
sumangala17 Feb 7, 2021
041d9bc
use listingsupport to abstract store
sumangala17 Feb 7, 2021
9c92338
merge
sumangala17 Feb 22, 2021
4be7b19
checkstyle
sumangala17 Feb 22, 2021
7a2e218
Merge branch 'trunk' into HADOOP-17428
sumangala17 Feb 22, 2021
d21b58a
import order
sumangala17 Feb 22, 2021
9b2723b
Merge branch 'trunk' into HADOOP-17428
sumangala17 Mar 31, 2021
2378431
log ex
sumangala17 Apr 19, 2021
fe71af1
Merge branch 'trunk' into HADOOP-17428
sumangala17 May 11, 2021
fa34b57
rm abfs cs
sumangala17 May 11, 2021
aa48086
test fix
sumangala17 May 12, 2021
f320785
clean up
sumangala17 May 12, 2021
be0e94c
merge with tc
sumangala17 Jul 5, 2021
2104268
Merge branch 'trunk' into HADOOP-17428
sumangala17 Aug 23, 2021
a718cbd
address revw comments
sumangala17 Aug 25, 2021
c9d65aa
review comments part 2: move executor->abfsStore
sumangala17 Aug 26, 2021
4003aff
Merge branch 'trunk' into HADOOP-17428
sumangala17 Sep 8, 2021
3039f7f
use iterator + rejected-ex handler
sumangala17 Sep 10, 2021
8259a2e
undo extra formatting
sumangala17 Sep 13, 2021
b64b492
more formatting
sumangala17 Sep 13, 2021
16d9436
format
sumangala17 Sep 13, 2021
137627d
fix merge conflict
sumangala17 Jan 10, 2022
AzureBlobFileSystem.java

@@ -57,6 +57,7 @@
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.CommonPathCapabilities;
import org.apache.hadoop.fs.ContentSummary;
import org.apache.hadoop.fs.CreateFlag;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
@@ -79,6 +80,7 @@
import org.apache.hadoop.fs.azurebfs.contracts.services.AzureServiceErrorCode;
import org.apache.hadoop.fs.azurebfs.security.AbfsDelegationTokenManager;
import org.apache.hadoop.fs.azurebfs.services.AbfsCounters;
import org.apache.hadoop.fs.azurebfs.services.ContentSummaryProcessor;
import org.apache.hadoop.fs.azurebfs.utils.Listener;
import org.apache.hadoop.fs.azurebfs.utils.TracingContext;
import org.apache.hadoop.fs.azurebfs.utils.TracingHeaderFormat;
@@ -433,6 +435,31 @@ public boolean delete(final Path f, final boolean recursive) throws IOException

}

/**
* Returns a ContentSummary instance containing the count of directories,
* files, and the total number of bytes under a given path.
* @param path the given path
* @return ContentSummary of the given path
* @throws IOException if an error is encountered during listStatus calls,
* or if there is any issue with the thread pool used while processing
*/
@Override
public ContentSummary getContentSummary(Path path) throws IOException {
try {
TracingContext tracingContext = new TracingContext(clientCorrelationId,
fileSystemId, FSOperationType.GET_CONTENT_SUMMARY, true,
tracingHeaderFormat, listener);
return (new ContentSummaryProcessor(abfsStore)).getContentSummary(path,
tracingContext);
} catch (InterruptedException e) {
LOG.debug("Thread interrupted");
throw new IOException(e);
Contributor: InterruptedIOException

Author: changed
} catch (ExecutionException ex) {
LOG.debug("GetContentSummary failed with error: {}", ex.getMessage());
throw new IOException(ex);
Contributor: prefer PathIOException with path included

Author: done

}
}
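Note: per the two threads above, the final handling maps InterruptedException to InterruptedIOException and ExecutionException to PathIOException. A minimal sketch of that mapping, as a hypothetical helper (not the merged code):

// Sketch only: the exception mapping suggested in review. The helper name
// and message are illustrative; InterruptedIOException (java.io) and
// PathIOException (org.apache.hadoop.fs) are existing classes.
import java.io.IOException;
import java.io.InterruptedIOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.PathIOException;

final class ContentSummaryErrors {
  private ContentSummaryErrors() {
  }

  static IOException map(Path path, Exception e) {
    if (e instanceof InterruptedException) {
      Thread.currentThread().interrupt(); // restore the interrupt flag
      InterruptedIOException iioe =
          new InterruptedIOException("getContentSummary interrupted");
      iioe.initCause(e);
      return iioe;
    }
    // ExecutionException and other failures: surface the failing path
    return new PathIOException(path.toString(), e);
  }
}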

@Override
public FileStatus[] listStatus(final Path f) throws IOException {
LOG.debug(
@@ -1435,7 +1462,7 @@ Map<String, Long> getInstrumentationMap() {
}

@VisibleForTesting
String getFileSystemId() {
public String getFileSystemId() {
return fileSystemId;
}

FSOperationType.java

@@ -28,6 +28,7 @@ public enum FSOperationType {
DELETE("DL"),
GET_ACL_STATUS("GA"),
GET_ATTR("GR"),
GET_CONTENT_SUMMARY("GC"),
GET_FILESTATUS("GF"),
LISTSTATUS("LS"),
MKDIR("MK"),
ContentSummaryProcessor.java

@@ -0,0 +1,147 @@
/**
* Licensed to the Apache Software Foundation (ASF) under one
* or more contributor license agreements. See the NOTICE file
* distributed with this work for additional information
* regarding copyright ownership. The ASF licenses this file
* to you under the Apache License, Version 2.0 (the
* "License"); you may not use this file except in compliance
* with the License. You may obtain a copy of the License at
* <p>
* http://www.apache.org/licenses/LICENSE-2.0
* <p>
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/

package org.apache.hadoop.fs.azurebfs.services;

import java.io.IOException;
import java.util.concurrent.CompletionService;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorCompletionService;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.SynchronousQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicLong;

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import org.apache.hadoop.fs.ContentSummary;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.azurebfs.utils.TracingContext;

/**
* Computes the ContentSummary (count of subdirectories, files, and total
* bytes) of a path by listing its subtree in parallel.
*/
public class ContentSummaryProcessor {
Contributor: nit: javadocs

Author: added description

private static final int CORE_POOL_SIZE = 1;
private static final int MAX_THREAD_COUNT = 16;
private static final int KEEP_ALIVE_TIME = 5;
private static final int POLL_TIMEOUT = 100;
private static final Logger LOG = LoggerFactory.getLogger(ContentSummaryProcessor.class);
private final AtomicLong fileCount = new AtomicLong(0L);
private final AtomicLong directoryCount = new AtomicLong(0L);
private final AtomicLong totalBytes = new AtomicLong(0L);
private final AtomicInteger numTasks = new AtomicInteger(0);
private final ListingSupport abfsStore;
private final ExecutorService executorService = new ThreadPoolExecutor(
CORE_POOL_SIZE, MAX_THREAD_COUNT, KEEP_ALIVE_TIME, TimeUnit.SECONDS,
new SynchronousQueue<>());
private final CompletionService<Void> completionService =
new ExecutorCompletionService<>(executorService);
private final LinkedBlockingQueue<FileStatus> queue = new LinkedBlockingQueue<>();

Contributor: Abfs Store to create the executor so that all ops which do async IO (this, block uploads, prefetch, openFile async...) can share a single pool

Author: done
Changes: abfsStore creates the thread pool executor; each individual getContentSummary call creates a separate completionService instance (which uses the common thread pool). A rough sketch of this arrangement follows the constructor below.

/**
* Creates a processor that computes the count of subdirectories, files, and
* the total number of bytes under a given path.
* @param abfsStore instance of AzureBlobFileSystemStore (as ListingSupport),
* used to make listStatus calls to the server
*/
public ContentSummaryProcessor(ListingSupport abfsStore) {
this.abfsStore = abfsStore;
}
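As described in the exchange above, a minimal sketch of the shared-pool arrangement, with illustrative names (an assumption of the final shape, not the merged AzureBlobFileSystemStore code):

import java.util.concurrent.CompletionService;
import java.util.concurrent.ExecutorCompletionService;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.SynchronousQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Sketch only: the store owns one bounded pool for all async IO (content
// summary, block uploads, prefetch, ...); each getContentSummary call layers
// its own ExecutorCompletionService on top, so completions are tracked
// per call while threads are shared.
class SharedPoolStore {
  private final ExecutorService sharedPool = new ThreadPoolExecutor(
      1, 16, 5, TimeUnit.SECONDS, new SynchronousQueue<>());

  ExecutorService getExecutorService() {
    return sharedPool;
  }
}

class PerCallProcessor {
  private final CompletionService<Void> completionService;

  PerCallProcessor(SharedPoolStore store) {
    // fresh completion service per call, backed by the common pool
    this.completionService =
        new ExecutorCompletionService<>(store.getExecutorService());
  }
}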

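/**
* Drives the parallel listing of the subtree and returns the aggregated
* content summary.
* @param path path to the file or directory being summarized
* @param tracingContext tracing context for the listStatus calls
* @return ContentSummary with directory count, file count and total bytes
* @throws IOException listStatus error
* @throws ExecutionException if a listing task fails
* @throws InterruptedException if interrupted while waiting for tasks
*/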
public ContentSummary getContentSummary(Path path, TracingContext tracingContext)
throws IOException, ExecutionException, InterruptedException {
try {
processDirectoryTree(path, tracingContext);
while (!queue.isEmpty() || numTasks.get() > 0) {
try {
completionService.take().get();
} finally {
numTasks.decrementAndGet();
LOG.debug("FileStatus queue size = {}, number of submitted unfinished tasks = {}, active thread count = {}",
queue.size(), numTasks, ((ThreadPoolExecutor) executorService).getActiveCount());
}
}
} finally {
executorService.shutdownNow();
LOG.debug("Executor shutdown");
}
LOG.debug("Processed content summary of subtree under given path");
ContentSummary.Builder builder = new ContentSummary.Builder()
.directoryCount(directoryCount.get()).fileCount(fileCount.get())
.length(totalBytes.get()).spaceConsumed(totalBytes.get());
return builder.build();
}

/**
* Calls listStatus on the given path and populates the FileStatus queue with
* subdirectories. Called by new tasks to process the complete subtree under
* a given path.
* @param path path to a file or directory
* @param tracingContext tracing context for the listStatus calls
* @throws IOException listStatus error
* @throws InterruptedException error while inserting into the queue
*/
private void processDirectoryTree(Path path, TracingContext tracingContext)
throws IOException, InterruptedException {
FileStatus[] fileStatuses = abfsStore.listStatus(path, tracingContext);

for (FileStatus fileStatus : fileStatuses) {
Contributor: If you supported paged results, you could start queuing subdir work while still iterating through the list.

In directories with many pages of results including directories, this could speed up processing, as subdirectory scanning could start as soon as the first page of results had been retrieved.

Author: Yes, we should be able to incorporate that using the list iterator. Thanks, will make the change. (A rough sketch of the iterator-based variant appears after this method.)

Author (@sumangala-patki, Sep 10, 2021): Trying to confirm the advantage of processing page-wise listStatus results; would like to know your opinion. Analyzed the time taken by a direct listStatus call vs using the list iterator (queueing subdirectories while iterating), but getting ambiguous results.

The tests involved creating a directory tree and calling getContentSummary on the top folder, as the primary use of this API might be on the root of an account.

Expt 1: Directory tree with 12 levels (tree height = 12), where each level comprises one dir and 1-2 files.
Expt 2: Same 12-level structure as Expt 1, with a branch (of 2 subdir levels) around the mid-level, i.e., two subdirs at level 5, each having a subdir. All directories in the tree have ~15 files.
Expt 3: Same as Expt 2, but with each dir having more than 5000 files (so listStatus results are fetched in multiple pages).

The analysis was done for both lexicographical positions of the directory with respect to files at the same level, as this determines whether the directory is fetched first. The time taken was measured from the first ListStatus REST call to the DeleteFileSystem call (after the last LS), which eliminates differences in file/dir creation time.

Expt | Dir after files | Dir before files
-----|-----------------|-----------------
1    | LS (few ms)     | LS
2    | LS (0.5s)       | Itr (8.7s)
3    | LS (3s)         | Itr (4.5s)

LS(t): the direct listStatus call was faster by t. Itr(t): the list iterator was faster by t.

Using the iterator seems beneficial in some scenarios, so will go ahead with it.

if (fileStatus.isDirectory()) {
queue.put(fileStatus);
processDirectory();
conditionalSubmitTaskToExecutor(tracingContext);
} else {
processFile(fileStatus);
}
}
}
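Following the thread above, a minimal sketch of the iterator-based variant. It assumes a paged listing source shaped like Hadoop's FileSystem#listStatusIterator (RemoteIterator is the real Hadoop interface; the listingSource field and the exact ListingSupport wiring are assumptions, not the merged code):

// Sketch only: consume listing results page by page, queuing subdirectory
// work as soon as each entry is seen instead of waiting for the full array.
// Requires: import org.apache.hadoop.fs.RemoteIterator;
private void processDirectoryTreePaged(Path path, TracingContext tracingContext)
    throws IOException, InterruptedException {
  RemoteIterator<FileStatus> entries =
      listingSource.listStatusIterator(path); // hypothetical paged source
  while (entries.hasNext()) {
    FileStatus status = entries.next();
    if (status.isDirectory()) {
      queue.put(status);
      processDirectory();
      conditionalSubmitTaskToExecutor(tracingContext);
    } else {
      processFile(status);
    }
  }
}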

private void processDirectory() {
directoryCount.incrementAndGet();
}

/**
* Increments the file count and byte count.
* @param fileStatus provides the file size used to update the byte count
*/
private void processFile(FileStatus fileStatus) {
fileCount.incrementAndGet();
totalBytes.addAndGet(fileStatus.getLen());
}

/**
* Submits a task to process a subdirectory, based on the current size of the
* FileStatus queue and the number of tasks already submitted.
*/
private synchronized void conditionalSubmitTaskToExecutor(TracingContext tracingContext) {
if (!queue.isEmpty() && numTasks.get() < MAX_THREAD_COUNT) {
numTasks.incrementAndGet();
Contributor (@bilaharith, Jan 11, 2021): See if there is a method for simple increment, since you are not using the return value

Author: Can't find any op without returning, will have to stick to this

completionService.submit(() -> {
FileStatus fileStatus1;
while ((fileStatus1 = queue.poll(POLL_TIMEOUT, TimeUnit.MILLISECONDS))
!= null) {
processDirectoryTree(fileStatus1.getPath(), new TracingContext(tracingContext));
}
return null;
});
}
}

}
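For completeness, a small usage sketch of the new API through the public FileSystem interface (the path argument is illustrative; the ContentSummary accessors are standard Hadoop API):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.ContentSummary;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Usage sketch: aggregate counts for a directory tree, e.g. on an abfs://
// URI once this change is in place.
public class GetContentSummaryExample {
  public static void main(String[] args) throws Exception {
    Path path = new Path(args[0]);
    FileSystem fs = FileSystem.get(path.toUri(), new Configuration());
    ContentSummary summary = fs.getContentSummary(path);
    System.out.println("dirs=" + summary.getDirectoryCount()
        + " files=" + summary.getFileCount()
        + " bytes=" + summary.getLength());
  }
}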