Reduce the ListRequests for a recursive put for targets ending in "/"#821
Reduce the ListRequests for a recursive put for targets ending in "/"#821MatthijsPiek wants to merge 1 commit intofsspec:mainfrom
Conversation
|
Sorry, I broke some tests. I don't think I can justify spending more time on this, so not sure I'll be able to remedy that. Maybe some parts of this pull request are valuable to you. |
This certainly sounds useful. I'll look through the rest of your work in the near future. |
|
I am contemplating fsspec/filesystem_spec#1592 as an explicit override point, so that s3fs can control how it creates a "directory tree" and so enable not making the exists() calls without having to mess about with the directory cache. The s3fs (and gcsfs, adlfs) versions would select the paths to attempt to create as follows: |
|
Sounds good. Using the dircache for this isn't straightforward. |
Recently we have found that
s3.put("/var/tmp/", "s3://test/wibble")ands3.put("/var/tmp/", "s3://test/wibble/")result in one ListObjectsV2 per directory which is written.Investigation indicates it is the _exists check to see if the bucket you are writing to really exists. It does this for every directory. In practice we have seen 12 ListObjectsV2 calls for 1 PutObject.
This pull request resolves this for targets with do not end in "/" because it leverages the dircache state from the
isdir()call to see if this bucket already exists. Targets which do end in "/" skip theisdir()call and still exhibit the old behaviour. Best would be if there would be a way to cache the _exists result or maybe disable the bucket creation entirely.I didn't know how to cache the _exists result in the current
DirCachesetup.To test this properly I created a fixture which allows you to count the requests made to S3. That may be useful for other similar problems as well.