Get size of annexed files from keys where possible #86
Conversation
Codecov Report
Base: 71.06% // Head: 71.51% // Increases project coverage by +0.44%

Additional details and impacted files

```
@@            Coverage Diff             @@
##           master      #86      +/-   ##
==========================================
+ Coverage   71.06%   71.51%   +0.44%
==========================================
  Files          11       11
  Lines         788      839      +51
==========================================
+ Hits          560      600      +40
- Misses        228      239      +11
==========================================
```
I am curious -- have you tried this branch on 000026 in https://github.com/dandi/dandisets-healthstatus -- did it provide a remedy for the slow "traversal"?
@yarikoptic I'd rather not try the healthcheck on this unless #83 is merged in so I can rebase on top of it.
take #83 out of draft? ;-)
@yarikoptic Traversing 000026 using this branch now takes about 2 or 3 minutes (I don't have an exact time).
yarikoptic left a comment:
looks great! Just one comment possibly to act on -- let's not bother with commit date for the files under .git/annex/objects
```python
@dataclass
class AnnexKey:
```
eh, we better have/re-use this construct in DataLad to avoid duplicating it across the codebase. It would be useful to replace AnnexRepo.get_size_from_key (https://github.com/datalad/datalad/blob/HEAD/datalad/support/annexrepo.py#L560) and useful for _sanitize_key (https://github.com/datalad/datalad/blob/HEAD/datalad/support/annex_utils.py) -- probably get to_filename for that purpose
Are you telling me to use those DataLad functions here, or to copy AnnexKey to DataLad, or something else?
I was just saying that eventually we might want to "borrow" your construct from here (I like it) instead of our functions in datalad.
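For context, the idea behind such an `AnnexKey` construct can be sketched as below. This is a hypothetical illustration, not the exact code from this PR: a git-annex key like `MD5E-s1048576--<hash>.bin` encodes the backend, an optional `-s<bytes>` size field, and a backend-specific name, so the file size can be read straight from the key without stat'ing annexed content. The regex here handles only the common size field; other optional key fields (chunking, mtime) are omitted for brevity.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Matches the common shape of a git-annex key:
#   BACKEND[-sSIZE]--NAME
# e.g. "MD5E-s1048576--e2fc714c4727ee9395f324cd2e7f331f.bin"
KEY_RGX = re.compile(
    r"^(?P<backend>[A-Z0-9_]+)"
    r"(?:-s(?P<size>\d+))?"
    r"--(?P<name>.+)$"
)


@dataclass
class AnnexKey:
    """Parsed form of a git-annex key (illustrative sketch)."""

    backend: str
    name: str
    size: Optional[int] = None  # None when the backend does not embed a size

    @classmethod
    def parse(cls, key: str) -> "AnnexKey":
        m = KEY_RGX.match(key)
        if m is None:
            raise ValueError(f"Invalid git-annex key: {key!r}")
        size = m["size"]
        return cls(
            backend=m["backend"],
            name=m["name"],
            size=int(size) if size is not None else None,
        )


# Usage: the size comes from the key itself, no filesystem access needed.
k = AnnexKey.parse("MD5E-s1048576--e2fc714c4727ee9395f324cd2e7f331f.bin")
print(k.backend, k.size)
```

A helper like DataLad's `get_size_from_key` could then be a thin wrapper around `AnnexKey.parse(key).size`.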
```python
r = mkstat(
    is_file=True,
    size=iadok.size,
    timestamp=self._adapter.get_commit_datetime(path),
```
aren't we under .git/annex/objects here, and thus the commit date wouldn't really be pertinent to that key file? Then let's just use some arbitrary timestamp -- e.g. a fixed timestamp of when we started this fusefs instance. Should help us save some CPU cycles
The commit date is cached when the adapter for the (sub)dataset is created [link], so there aren't many cycles to save.
it still needs to do some traversal to figure out the top of the dataset, right? indeed, it might be negligible though
ok, let's proceed for now as is, and optimize if we see it adds penalty
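The optimization discussed above could look roughly like the following sketch. All names here (`DataladFuse`, `timestamp_for`, `get_commit_datetime`) are assumptions for illustration, not the actual implementation: one timestamp is captured when the FUSE filesystem starts and reused for paths under `.git/annex/objects`, instead of resolving a per-path commit date for content-addressed key files.

```python
import time


class DataladFuse:
    """Illustrative sketch of the mount-time-timestamp idea (not real code)."""

    def __init__(self) -> None:
        # Single timestamp captured once, at mount time.
        self.start_time = time.time()

    def timestamp_for(self, path: str) -> float:
        if ".git/annex/objects" in path:
            # Key files are content-addressed; a commit date is not
            # meaningful for them, so reuse the fixed mount-time stamp
            # and skip the per-path lookup entirely.
            return self.start_time
        return self.get_commit_datetime(path)

    def get_commit_datetime(self, path: str) -> float:
        # Placeholder for the real (cached) per-path commit-date lookup.
        return time.time()
```

As the thread concludes, this only matters if the commit-date lookup shows up as a measurable cost; the PR proceeded without it.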
ok, not super fast, but much better than before, and given the number of files -- not too bad really. It would be worth py-spy top'ing it to see where the time is spent. Let's proceed with this, as it is already a significant improvement.
Closes #84.