HADOOP-15183. S3Guard store becomes inconsistent after partial failure of rename #951
Status: Closed. steveloughran wants to merge 9 commits into apache:trunk from steveloughran:s3/HADOOP-15183-s3guard-rename-failures.
Changes shown are from 1 commit.

Commits (9, all by steveloughran):
- eedd4eb HADOOP-15183
- 67bec7f HADOOP-15183: review comments from Sean and Aaron
- dbdbfe8 HADOOP-15183: Factor out the rename code from S3A FS into its own Ren…
- 97e5eba HADOOP-15183 add some more debug level logging of what's been returne…
- d35424f HADOOP-15183: ongoing merge with OOB Deletes
- 4a77eb6 HADOOP-15183 ensuring parent entries are correctly set up.
- 95ce655 HADOOP-15183: fix intermittent test; style cleanups, finishedWrite ad…
- 930c27e HADOOP-15183: DynamoDBMetadataStore tuning.
- 29d160c HADOOP-15183: NPE in s3guard prune: both DDB init paths need to creat…
Diff hunks from `DynamoDBMetadataStore` (removed and replacement lines appear together, as in the diff view):

```java
@@ -198,14 +198,9 @@
 * directory helps prevent unnecessary queries during traversal of an entire
 * sub-tree.
 *
<<<<<<< ours
 * Some mutating operations, notably
 * {@link MetadataStore#deleteSubtree(Path, ITtlTimeProvider)} and
 * {@link MetadataStore#move(Collection, Collection, ITtlTimeProvider)},
=======
 * Some mutating operations, notably {@link #deleteSubtree(Path)} and
 * {@link MetadataStore#move(Collection, Collection, BulkOperationState)}
>>>>>>> theirs
 * {@link MetadataStore#move(Collection, Collection, ITtlTimeProvider, BulkOperationState)}
 * are less efficient with this schema.
 * They require mutating multiple items in the DynamoDB table.
 *

@@ -343,6 +338,12 @@ public class DynamoDBMetadataStore implements MetadataStore,
   */
  private ListeningExecutorService executor;

  /**
   * Time source. This is used during writes when parent
   * entries need to be created.
   */
  private ITtlTimeProvider timeProvider;

  /**
   * A utility function to create DynamoDB instance.
   * @param conf the file system configuration
```
```java
@@ -418,6 +419,7 @@ public void initialize(FileSystem fs) throws IOException {
        this::retryEvent
    );

    timeProvider = new S3Guard.TtlTimeProvider(conf);
    initTable();

    instrumentation.initialized();

@@ -437,6 +439,9 @@ void bindToOwnerFilesystem(final S3AFileSystem fs) {
    instrumentation = context.getInstrumentation().getS3GuardInstrumentation();
    username = context.getUsername();
    executor = context.createThrottledExecutor();
    timeProvider = Preconditions.checkNotNull(
        context.getTimeProvider(),
        "ttlTimeProvider must not be null");
  }

  /**
```
```java
@@ -795,11 +800,13 @@ DirListingMetadata getDirListingMetadataFromDirMetaAndList(Path path,
   * Callers are required to synchronize on ancestorState.
   * @param pathsToCreate paths to create
   * @param ancestorState ongoing ancestor state.
   * @param ttlTimeProvider Must not be null
   * @return the full ancestry paths
   */
  private Collection<DDBPathMetadata> completeAncestry(
      final Collection<DDBPathMetadata> pathsToCreate,
      final AncestorState ancestorState) throws PathIOException {
      final AncestorState ancestorState,
      final ITtlTimeProvider ttlTimeProvider) throws PathIOException {
    List<DDBPathMetadata> ancestorsToAdd = new ArrayList<>(0);
    LOG.debug("Completing ancestry for {} paths", pathsToCreate.size());
    // we sort the inputs to guarantee that the topmost entries come first.

@@ -845,7 +852,7 @@ private Collection<DDBPathMetadata> completeAncestry(
            parent, path);
        final S3AFileStatus status = makeDirStatus(parent, username);
        DDBPathMetadata md = new DDBPathMetadata(status, Tristate.FALSE,
            false);
            false, false, ttlTimeProvider.getNow());
        ancestorState.put(parent, md);
        ancestorsToAdd.add(md);
        parent = parent.getParent();
```
```java
@@ -857,12 +864,20 @@
  /**
   * {@inheritDoc}
   * <p>
   * The implementation scans all up the directory tree and does a get()
   * for each entry; at each level one is found it is added to the ancestor
   * state.
   * <p>
   * The original implementation would stop on finding the first non-empty
   * parent. This (re) implementation issues a GET for every parent entry
   * and so detects and recovers from a tombstone marker further up the tree
   * (i.e. an inconsistent store is corrected for).
   * <p>
   * if {@code operationState} is not null, when this method returns the
   * operation state will be updated with all new entries created.
   * This ensures that subsequent operations with the same store will not
   * trigger new updates.
   * @param qualifiedPath path to update
   * @param timeProvider
   * @param operationState (nullable) operational state for a bulk update
   * @throws IOException on failure.
   */
```
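The javadoc above describes the key behavioural change: issue a GET for every parent level instead of stopping at the first live entry, so that a tombstone higher up the tree is detected and repaired. As a simplified, hypothetical sketch (the `Store` and `Entry` types below are stand-ins, not the real `MetadataStore` API), the walk looks roughly like this; the actual change is in the `addAncestors` hunks that follow.

```java
import java.util.ArrayList;
import java.util.List;

public class AncestorWalkSketch {

  /** Hypothetical store entry: only a deleted/tombstone flag. */
  public static final class Entry {
    final boolean deleted;
    public Entry(boolean deleted) { this.deleted = deleted; }
  }

  /** Hypothetical key-value view of the metadata table. */
  public interface Store {
    Entry get(String path);          // null when no entry exists
    void put(String path, Entry e);
  }

  /**
   * Ensure every ancestor of {@code path} has a live directory entry.
   * Unlike an early-exit version, this keeps walking after the first live
   * parent is found, so a tombstone further up the tree is still detected
   * and replaced.
   */
  public static void addAncestors(Store store, String path) {
    List<String> toCreate = new ArrayList<>();
    boolean entryFound = false;
    for (String parent = parentOf(path); parent != null;
        parent = parentOf(parent)) {
      Entry e = store.get(parent);
      if (e == null || e.deleted) {
        if (entryFound) {
          // a live child entry exists below a missing/tombstoned parent:
          // the table is inconsistent, so repair it rather than stop.
          System.out.println("Inconsistent table: adding directory " + parent);
        }
        toCreate.add(parent);
      } else {
        entryFound = true;   // note: no break here; keep walking upwards
      }
    }
    for (String dir : toCreate) {
      store.put(dir, new Entry(false));
    }
  }

  private static String parentOf(String path) {
    int i = path.lastIndexOf('/');
    return i > 0 ? path.substring(0, i) : null;
  }
}
```

The point to note is that finding a live parent only sets a flag; there is no early `break`, which is what the review comments at the end of this page discuss.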
```java
@@ -878,6 +893,7 @@ public void addAncestors(
    final AncestorState ancestorState = extractOrCreate(operationState,
        BulkOperationState.OperationType.Rename);
    Path parent = qualifiedPath.getParent();
    boolean entryFound = false;

    // Iterate up the parents.
    // note that only ancestorState get/set operations are synchronized;

@@ -899,6 +915,9 @@
      // a directory entry will go in.
      PathMetadata directory = get(parent);
      if (directory == null || directory.isDeleted()) {
        if (entryFound) {
          LOG.warn("Inconsistent S3Guard table: adding directory {}", parent);
        }
        S3AFileStatus status = makeDirStatus(username, parent);
        LOG.debug("Adding new ancestor entry {}", status);
        DDBPathMetadata meta = new DDBPathMetadata(status, Tristate.FALSE,

@@ -909,6 +928,8 @@
        // here that put operation would actually (mistakenly) skip
        // creating the entry.
      } else {
        // an entry was found. Check its type
        entryFound = true;
        if (directory.getFileStatus().isFile()) {
          throw new PathIOException(parent.toString(),
              "Cannot overwrite parent file: metadatstore is"

@@ -918,7 +939,6 @@
        synchronized (ancestorState) {
          ancestorState.put(parent, new DDBPathMetadata(directory));
        }
        break;
      }
      parent = parent.getParent();
    }
```
```java
@@ -927,7 +947,7 @@
    if (!newDirs.isEmpty()) {
      // patch up the time.
      patchLastUpdated(newDirs, timeProvider);
      innerPut(newDirs, operationState);
      innerPut(newDirs, operationState, timeProvider);
    }
  }

@@ -983,7 +1003,8 @@ public void move(
        newItems.addAll(
            completeAncestry(
                pathMetaToDDBPathMeta(pathsToCreate),
                ancestorState));
                ancestorState,
                extractTimeProvider(ttlTimeProvider)));
      }
    }
    // sort all the new items topmost first.
```
```java
@@ -1162,7 +1183,7 @@ public void put(
  public void put(
      final Collection<? extends PathMetadata> metas,
      @Nullable final BulkOperationState operationState) throws IOException {
    innerPut(pathMetaToDDBPathMeta(metas), operationState);
    innerPut(pathMetaToDDBPathMeta(metas), operationState, timeProvider);
  }

  /**

@@ -1176,13 +1197,15 @@ public void put(
   * create entries in the table without parents.
   * @param metas metadata entries to write.
   * @param operationState (nullable) operational state for a bulk update
   * @param ttlTimeProvider
   * @throws IOException failure.
   */
  @SuppressWarnings("SynchronizationOnLocalVariableOrMethodParameter")
  @Retries.RetryTranslated
  private void innerPut(
      final Collection<DDBPathMetadata> metas,
      @Nullable final BulkOperationState operationState) throws IOException {
      @Nullable final BulkOperationState operationState,
      final ITtlTimeProvider ttlTimeProvider) throws IOException {
    if (metas.isEmpty()) {
      // Happens when someone calls put() with an empty list.
      LOG.debug("Ignoring empty list of entries to put");

@@ -1196,7 +1219,7 @@ private void innerPut(
    Item[] items;
    synchronized (ancestorState) {
      items = pathMetadataToItem(
          completeAncestry(metas, ancestorState));
          completeAncestry(metas, ancestorState, ttlTimeProvider));
    }
    LOG.debug("Saving batch of {} items to table {}, region {}", items.length,
        tableName, region);
```
```java
@@ -1534,7 +1557,7 @@ private void removeAuthoritativeDirFlag(
      try {
        LOG.debug("innerPut on metas: {}", metas);
        if (!metas.isEmpty()) {
          innerPut(metas, state);
          innerPut(metas, state, timeProvider);
        }
      } catch (IOException e) {
        String msg = String.format("IOException while setting false "

@@ -2210,6 +2233,17 @@ public AncestorState initiateBulkWrite(
    return new AncestorState(operation, dest);
  }

  /**
   * Extract a time provider from the argument or fall back to the
   * one in the constructor.
   * @param ttlTimeProvider nullable time source passed in as an argument.
   * @return a non-null time source.
   */
  private ITtlTimeProvider extractTimeProvider(
      @Nullable ITtlTimeProvider ttlTimeProvider) {
    return ttlTimeProvider != null ? ttlTimeProvider : timeProvider;
  }
```
Contributor comment (on `extractTimeProvider`): As we discussed w/ @bgaborg @mackrorysd this will go away soon, and is fine for now.
```java
  /**
   * Take an {@code IllegalArgumentException} raised by a DDB operation
   * and if it contains an inner SDK exception, unwrap it.
```
Reviewer comments (on the `addAncestors` diff above):

- interesting change to this function. slower but more robust (the removed `break` below, that is, not this log message)
- also, we might as well do the `depth(path)` get operations in parallel if they always happen, and the `break` behavior you remove is not configurable. In terms of write latency it would remove `depth(path)-1` round trips (approx.). Proposing this as a followup JIRA, not doing it here.
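As a rough illustration of that follow-up proposal (not code from this patch; the `Store` interface and names below are hypothetical stand-ins), the per-level lookups could be issued concurrently and then collected in order:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;

public class ParallelAncestorLookupSketch {

  /** Hypothetical lookup interface; returns null when no entry exists. */
  public interface Store {
    String get(String path);
  }

  /**
   * Issue one asynchronous GET per ancestor of {@code path}, then collect
   * the results in depth order. Since every ancestor is looked up anyway,
   * running the GETs concurrently hides most of the per-level latency.
   */
  public static Map<String, String> lookupAncestors(
      Store store, String path, ExecutorService pool) {
    List<String> parents = new ArrayList<>();
    for (String p = parentOf(path); p != null; p = parentOf(p)) {
      parents.add(p);
    }
    // launch one asynchronous GET per ancestor, all in flight at once
    List<CompletableFuture<String>> futures = new ArrayList<>();
    for (String p : parents) {
      futures.add(CompletableFuture.supplyAsync(() -> store.get(p), pool));
    }
    Map<String, String> results = new LinkedHashMap<>();
    for (int i = 0; i < parents.size(); i++) {
      // join() blocks for each result; the lookups ran concurrently
      results.put(parents.get(i), futures.get(i).join());
    }
    return results;
  }

  private static String parentOf(String path) {
    int i = path.lastIndexOf('/');
    return i > 0 ? path.substring(0, i) : null;
  }
}
```

As the comment says, doing this would save roughly `depth(path)-1` round trips of write latency compared with the sequential walk.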