Skip to content

Get size of annexed files from keys where possible#86

Merged
yarikoptic merged 2 commits intomasterfrom
gh-84
Nov 23, 2022
Merged

Get size of annexed files from keys where possible#86
yarikoptic merged 2 commits intomasterfrom
gh-84

Conversation

@jwodder
Copy link
Copy Markdown
Contributor

@jwodder jwodder commented Nov 23, 2022

Closes #84.

@jwodder jwodder added the performance Improve performance of an existing feature label Nov 23, 2022
@codecov
Copy link
Copy Markdown

codecov bot commented Nov 23, 2022

Codecov Report

Base: 71.06% // Head: 71.51% // Increases project coverage by +0.44% 🎉

Coverage data is based on head (8412c61) compared to base (5421d3a).
Patch coverage: 70.32% of modified lines in pull request are covered.

Additional details and impacted files
@@            Coverage Diff             @@
##           master      #86      +/-   ##
==========================================
+ Coverage   71.06%   71.51%   +0.44%     
==========================================
  Files          11       11              
  Lines         788      839      +51     
==========================================
+ Hits          560      600      +40     
- Misses        228      239      +11     
Impacted Files Coverage Δ
datalad_fuse/fuse_.py 25.98% <8.69%> (-0.34%) ⬇️
datalad_fuse/fsspec.py 83.12% <70.00%> (-1.35%) ⬇️
datalad_fuse/utils.py 95.83% <94.33%> (-4.17%) ⬇️
datalad_fuse/tests/test_fuse.py 100.00% <100.00%> (ø)
datalad_fuse/tests/test_util.py 100.00% <100.00%> (ø)

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

☔ View full report at Codecov.
📢 Do you have feedback about the report comment? Let us know in this issue.

@jwodder jwodder marked this pull request as ready for review November 23, 2022 16:11
@yarikoptic
Copy link
Copy Markdown
Member

I am curious -- have you tried this branch on 000026 in https://github.com/dandi/dandisets-healthstatus -- did it provide remedy for the slow "traversal"?

@jwodder
Copy link
Copy Markdown
Contributor Author

jwodder commented Nov 23, 2022

@yarikoptic I'd rather not try the healthcheck on this unless #83 was merged in so I could rebase on top of it.

@yarikoptic
Copy link
Copy Markdown
Member

@yarikoptic I'd rather not try the healthcheck on this unless #83 was merged in so I could rebase on top of it.

take #83 out of draft? ;-)

@jwodder
Copy link
Copy Markdown
Contributor Author

jwodder commented Nov 23, 2022

@yarikoptic Traversing 000026 using this branch now takes about 2 or 3 minutes (I don't have an exact time).

Copy link
Copy Markdown
Member

@yarikoptic yarikoptic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks great! Just one comment possibly to act on -- let's not bother with commit date for the files under .git/annex/objects



@dataclass
class AnnexKey:
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

eh, we better have/re-use this construct in DataLad to avoid duplicating it across codebase. Would be useful/replace AnnexRepo.get_size_from_key (https://github.com/datalad/datalad/blob/HEAD/datalad/support/annexrepo.py#L560) and useful for _sanitize_key (https://github.com/datalad/datalad/blob/HEAD/datalad/support/annex_utils.py) -- probably get to_filename for that purpose

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you telling me to use those DataLad functions here, or to copy AnnexKey to DataLad, or something else?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was just saying that eventually we might want to "borrow" your construct from here, I like it, instead of our functions in datalad.

r = mkstat(
is_file=True,
size=iadok.size,
timestamp=self._adapter.get_commit_datetime(path),
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

aren't we under .git/annex/objects here and thus the commit date wouldn't really be pertinent to that key file ? then let's just use some arbitrary timestamp -- e.g. fixed timestamp on when we started this fusefs instance. Should help us to save some cpu cycles

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The commit date is cached when the adapter for the (sub)dataset is created [link], so there aren't many cycles to save.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it still needs to do some traversal to figure out the top of the dataset right? indeed might be negligible though

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, let's proceed for now as is, and optimize if we see it adds penalty

@yarikoptic
Copy link
Copy Markdown
Member

@yarikoptic Traversing 000026 using this branch now takes about 2 or 3 minutes (I don't have an exact time).

ok, not super fast but much better than before and given number of files -- not too bad really. Would be worth py-spy top'ing it to see where time is spent. Let's proceed with this as already significant improvement.

@yarikoptic yarikoptic added the release Create a release when this pr is merged label Nov 23, 2022
@yarikoptic yarikoptic merged commit 6c2a7dc into master Nov 23, 2022
@yarikoptic yarikoptic deleted the gh-84 branch November 23, 2022 20:26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

performance Improve performance of an existing feature release Create a release when this pr is merged

Projects

None yet

Development

Successfully merging this pull request may close these issues.

get stats (size) from the file/git without talking to remote location

2 participants