Current index feels much slower than it ought to be. I am creating this issue to track work on this topic.
See this message by Franek for his observations (I/O wait bottleneck, database caching, OOM issues).
I've got proof-of-concepts for the following topics:
- Rework of `update.py`. It is suboptimal. Having to take locks to write into databases makes everything really slow and (I believe) explains most of the performance issues (caused by I/O wait). This can be confirmed by running the same commands without doing any processing on the output: it is much faster. I have a PoC for solving this: it does all database accesses in the main thread (and uses `multiprocessing.Pool` to spawn sub-processes).
- Improve individual commands of `script.sh`. Some commands are more wasteful than needed:
  - The `sed(1)` call in `list-blobs` is a big bottleneck for no obvious reason. This won't be a massive time saver, as we are talking about a second per tag.
  - `find-file-doc-comments.pl` in `parse-docs` is really expensive. We could avoid calling it on files that we know cannot contain any doc comment.
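The `update.py` rework described above can be sketched roughly like this (names such as `parse_blob` and the dict-like `db` are illustrative, not the actual elixir code): workers only parse and return plain data, while the main thread is the sole database writer, so no locks are needed.

```python
import multiprocessing

def parse_blob(blob_id):
    # CPU-bound work done in a worker process; returns plain data
    # and never touches the database.
    return blob_id, f"parsed-{blob_id}"

def update(blob_ids, db, nproc=20):
    # "fork" keeps the workers from re-importing this module (Unix only);
    # an assumption made for the sketch, not taken from the PoC.
    ctx = multiprocessing.get_context("fork")
    with ctx.Pool(nproc) as pool:
        # imap_unordered streams results back as workers finish
        for blob_id, result in pool.imap_unordered(parse_blob, blob_ids):
            db[blob_id] = result  # sole writer: the main thread

if __name__ == "__main__":
    db = {}
    update(range(10), db, nproc=4)
    print(len(db))
```

The key property is that contention disappears entirely: workers never block on a database lock, so the I/O wait stays in one predictable place.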
With those combined, for the first 5 Linux tags, I get wallclock/usr/sys of 126s/1017s/395s versus 1009s/1341s/490s before. For the old `update.py`, I passed my CPU count (i.e. 20) as argument.
Those changes will require a way to compare databases; see this message for the reasoning behind that. Solutions to this are either a custom Python script, or a shell script that uses `db_dump -p` and `diff`, as recommended here.
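The custom-Python-script option could take a shape like the following: parse the textual output of `db_dump -p` into key/value pairs and report keys whose values differ. The dump layout assumed here (one space-prefixed line per key and per value, between `HEADER=END` and `DATA=END`) is my reading of db_dump's printable format, so treat it as an assumption to verify.

```python
def parse_dump(text):
    """Parse `db_dump -p` output into a dict of key -> value."""
    lines = iter(text.splitlines())
    # skip the metadata header, up to and including HEADER=END
    for line in lines:
        if line == "HEADER=END":
            break
    pairs = {}
    for key in lines:
        if key == "DATA=END":
            break
        value = next(lines)  # keys and values alternate line by line
        pairs[key.lstrip()] = value.lstrip()
    return pairs

def diff_dumps(text_a, text_b):
    """Return the set of keys whose values differ between two dumps."""
    da, db = parse_dump(text_a), parse_dump(text_b)
    return {k for k in set(da) | set(db) if da.get(k) != db.get(k)}
```

Compared to plain `diff`, this ignores key ordering, which matters if the rework changes insertion order without changing content.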
There could, however, be other avenues for improving performance. The question is whether they are worth it; probably not:
- We might want to change the overall structure: calling into a shell script for each blob, spawning multiple processes, is not the fastest way to solve the problem. We could have `script.sh` commands take multiple blobs.
- Or we could bypass `script.sh` and call `ctags` or `tokenize` ourselves.
- We could change the database structure. The current database compresses well (14G becomes 5.2G after `zstd -1`), which means there is superfluous information. The value format could be optimized, possibly made binary.
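To illustrate the binary-value idea: if a value is today a list of (blob id, line number) pairs stored as decimal text, packing each pair as fixed-width integers both shrinks it and removes the redundancy that zstd is currently exploiting. The record layout below is invented for the sketch and is not the actual elixir value format.

```python
import struct

def pack_entries(entries):
    # each entry: (blob_id, line_number) as two little-endian uint32s,
    # 8 bytes per entry instead of variable-length decimal text
    return b"".join(struct.pack("<II", blob, line) for blob, line in entries)

def unpack_entries(data):
    return [struct.unpack_from("<II", data, off)
            for off in range(0, len(data), 8)]
```

Whether this is worth the loss of human-readable values (and the migration cost) is exactly the "probably not" trade-off above.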