Fix and improve parallel mode #261
Conversation
allevato
left a comment
Thanks for looking into this and providing such thorough benchmarks.
My main concern here is that I'd prefer that swift-format not have to implement its own Lock primitive in order to get decent performance. AFAIK, neither implementation in that class will work on Windows, since it doesn't support pthreads (@compnerd correct me if I'm wrong here).
Ideally, I'd like to have this work without needing three (or even two) low-level platform-specific solutions, since it shouldn't be the responsibility of swift-format to maintain those. Can we improve the existing use of Dispatch or figure out why NSLock performed so horribly instead and still get some wins, even if it's not the absolute fastest outcome?
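For reference, the two portable options being weighed here look roughly like this. This is a hedged sketch, not swift-format's actual code; `QueueBackedCache` and `LockBackedCache` are illustrative names, and the cache shape (keyed by the rule type's `ObjectIdentifier`) is assumed from the discussion below.

```swift
import Foundation

// Option 1: serialize via a Dispatch queue. Every read pays for a
// queue hop, which is the overhead being measured in the benchmarks.
final class QueueBackedCache {
  private var storage = [ObjectIdentifier: String]()
  private let queue = DispatchQueue(label: "nameCache")  // serial queue

  func name(for id: ObjectIdentifier, compute: () -> String) -> String {
    return queue.sync {
      if let cached = storage[id] { return cached }
      let name = compute()
      storage[id] = name
      return name
    }
  }
}

// Option 2: serialize via NSLock, which is part of Foundation and
// therefore available on macOS, Linux, and Windows alike.
final class LockBackedCache {
  private var storage = [ObjectIdentifier: String]()
  private let lock = NSLock()

  func name(for id: ObjectIdentifier, compute: () -> String) -> String {
    lock.lock()
    defer { lock.unlock() }
    if let cached = storage[id] { return cached }
    let name = compute()
    storage[id] = name
    return name
  }
}
```

Both avoid any platform-specific lock primitive; the question in this thread is whether their overhead under contention is acceptable.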
Thanks @allevato 🙏🏼 Yeah, I had a feeling that adding our own `Lock` wouldn't be ideal. I'll give it another shot with `NSLock`. Will update this in the meantime.
@allevato - you are correct, pthreads (POSIX Threads) are not available on Windows, and we would need to use Windows threads. There are already 2 (or was it 3?) implementations of `Lock` that I've done previously.
Force-pushed: 3c5d9fb → 4449670
Sadly, using `NSLock` the improvements were not as good, unfortunately.
Force-pushed: 4449670 → 1fb7c19
Sources/SwiftFormatCore/Rule.swift (Outdated)

```swift
}

if let cachedName = cachedName {
  nameCacheLock.lock()
```
At the time, this worked fine, but maybe it's worth revisiting whether it's worth caching this at all: #242 (comment)
This is a very valid point. I did some tests flattening the `FileToProcess` array and removing `Rule`'s `nameCache`, effectively removing all need for synchronization in a concurrent setup (you can find them in one of the benchmarks in the initial PR description). It was indeed faster than the current `NSLock` version, but only in parallel mode. At the time I reverted back to using a cache with the low-level lock, as it was still more performant (13.5s vs 17.5s in parallel mode), while being the fastest setup I could achieve overall.

I've re-run some benchmarks and I think that it's probably not worth removing synchronization, as you can see below.

(I also noticed a very big difference in parallel mode from my previous tests, so it's likely that I screwed up when copy/pasting the previous results 🙈)
**flattened `FileToProcess` array, no `Rule` `nameCache`**

serial

```
Benchmark #1: ./swift-format format --configuration .swift-format -r -i Sources
  Time (mean ± σ):     96.948 s ±  2.903 s    [User: 93.642 s, System: 2.513 s]
  Range (min … max):   93.368 s … 102.528 s    10 runs
```

parallel

```
Benchmark #1: ./swift-format format --configuration .swift-format -r -i -p Sources
  Time (mean ± σ):     34.519 s ±  6.421 s    [User: 364.389 s, System: 13.934 s]
  Range (min … max):   18.097 s … 38.766 s    10 runs
```
After opening this PR I also did some other attempts at rethinking the `Rule.ruleName` mechanics, but ended up giving up because I never reached an elegant solution. I tried:

- Precomputing all rule names statically (during codegen), but hit some issues related to dynamic dispatch (the `Rule.ruleName` protocol extension would always be called).
- Removing the `Rule.ruleName` default protocol extension and using inheritance; this worked but forced rules to override `ruleName`, defeating the purpose of the protocol's abstraction a bit.
- Changing the `Rule` protocol itself (e.g. making `ruleName` an instance property), which created other issues, and the abstraction itself felt a bit wrong, because a rule's name should effectively be at the type level.

I definitely think we could start a discussion on how to improve this, because `Rule.ruleName` is part of a very hot path, especially since it's a crucial step in `visitAny`. In the meantime I'll give it another go to see if I can come up with something.
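For context, the default-implementation shape that makes `ruleName` hot looks roughly like this. This is a simplified sketch of the pattern under discussion, not swift-format's exact code, and `DoNotUseSemicolons` stands in for any concrete rule:

```swift
// A protocol-extension default that derives the name from the concrete
// type. Calls always land in this extension for rules that don't
// provide their own ruleName, and String(describing:) is costly enough
// to matter when consulted for every visited syntax node.
protocol Rule {
  static var ruleName: String { get }
}

extension Rule {
  static var ruleName: String { String(describing: Self.self) }
}

struct DoNotUseSemicolons: Rule {}  // stand-in for a concrete rule

// Without a cache, every access re-derives the string:
let name = DoNotUseSemicolons.ruleName  // "DoNotUseSemicolons"
```

This also illustrates the dynamic-dispatch issue from the first bullet: even if names were precomputed, unwitnessed static properties like this default always resolve to the extension.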
Well, I think I managed to come up with a pretty good trade-off that allows having a static rule name cache, by shuffling some files around and using codegen from `generate-pipeline` 😁
Let's see what @allevato thinks of it
Force-pushed: 1fb7c19 → 9823b4b
Given that using `NSLock` didn't bring the improvements I hoped for, the best solution was to try and remove all concurrency contention from `swift-format`. By flattening the `FileToProcess` array and making the `Rule`'s `nameCache` static, close to optimal performance should be achievable. In order to have both `Rule.ruleName` caching and avoid synchronization, a static cache of rule names is now generated via the `generate-pipeline` tool. I think this achieves quite a good tradeoff, and doesn't mess too much with project organization. What do you say, @allevato? New results:
flattened `FileToProcess` array, static `Rule` `nameCache`
This is shaping up nicely!
Almost. They're not concrete rules; they're the base types that other rules should extend. It's important that we keep those two types in the `SwiftFormatCore` module, because eventually we'd like to support dynamic discovery of rules so that implementers only need to import `SwiftFormatCore`. It may be some time before we have the runtime metadata APIs we need to make that a reality, but I'd like to keep the design with that in mind, otherwise we'd just have to back the changes out later anyway. I don't mind the generation of the cache itself, but can you make it work without moving those types around?
Yes, I am aware that they are base classes to the other rules (hence placing them in a directory called Base), but didn't consider the future plans of having dynamic discovery of rules without `generate-pipeline`.

It would be quite a trivial change to do, but it's a valid point nonetheless.
I can definitely try. But there's both a circular dependency and a dynamic dispatch problem as things are, which makes this complicated:
These were the reasons why I ended up moving those two base classes and the `Rule.ruleName` protocol extension into `SwiftFormatRules`. To work around the issue in point 5, a solution would be to implement `ruleName` on each rule directly. The absolute best approach for this performance-wise would be for rules to simply declare a name themselves, but that defeats the convenience of the default implementation. I'll give it yet another go to see what I can come up with. Do you have any idea of what could work?
Force-pushed: 9823b4b → 2c9cfb4
Well, I have yet another possible solution, which hopefully will tick all the necessary boxes 😅 So, in order to have a static rule name cache that's accessible from `SwiftFormatCore` (especially from `SyntaxFormatRule`), the solution was to inject a `ruleNameCache` dictionary into `Context`, which is injected into every `Rule`. This allows using `generate-pipeline` to produce a `ruleNameCache` in `SwiftFormatRules` that's injected into all `Context`s by each of the pipelines (which is done in the `SwiftFormat` target). While not as ideal as having the cache accessed in a single place (this approach incurs the cache being copied for each `Context`), it achieves a good tradeoff given the current constraints. All things considered, results were quite good nonetheless:
flattened `FileToProcess` array, static `ruleNameCache` injected in `Context`
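A minimal sketch of the injection idea described above; the type shapes are simplified stand-ins (the real `Context` carries much more state), and `ExampleRule` is a hypothetical rule:

```swift
protocol Rule {
  static var ruleName: String { get }
}

// Core target: Context receives the name cache at init time, so core
// code can look names up without computing or locking anything.
final class Context {
  let ruleNameCache: [ObjectIdentifier: String]

  init(ruleNameCache: [ObjectIdentifier: String]) {
    self.ruleNameCache = ruleNameCache
  }

  // Lookup by the rule's type identity.
  func name(of rule: Rule.Type) -> String? {
    ruleNameCache[ObjectIdentifier(rule)]
  }
}

// Rules target: a concrete rule.
struct ExampleRule: Rule {
  static var ruleName: String { "ExampleRule" }
}

// generate-pipeline would emit a literal along these lines, which the
// pipelines then pass into every Context they create:
let ruleNameCache: [ObjectIdentifier: String] = [
  ObjectIdentifier(ExampleRule.self): ExampleRule.ruleName,
]

let context = Context(ruleNameCache: ruleNameCache)
```

This keeps the generated code in the rules target (where the concrete types live) while the lookup API stays in core, matching the layering constraint raised earlier in the thread.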
This feels like a good way to do it! I can also see it scaling well to support dynamic discovery whenever we're able to do that, because we'd inject a different cache into the context instead of mutating some global singleton. It looks like there are still some outdated changes from the previous push in this PR (the old generated cache file is still there along with the new one, and the lint/format rule base types are still in the `SwiftFormatRules` module).
I don't think the cache will be copied here; the dictionary is never mutated after it's created, so creating a new `Context` with it just shares the same storage.
That's great news!
Whoops 🙈 Cherry-pick fail. Will fix right away
Yes, that makes sense. COW's got our backs here 😁
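The copy-on-write behavior being relied on here can be seen in a tiny standalone example (the keys and values are placeholders, not real rules):

```swift
// Assigning a dictionary shares its storage; an actual allocation and
// copy happen only when one side is mutated afterwards.
var generated = [ObjectIdentifier: String]()
generated[ObjectIdentifier(Int.self)] = "ExampleRule"

let injected = generated  // no copy here: storage is shared

// A later mutation of the original triggers the copy, and value
// semantics keep the injected snapshot unaffected.
generated[ObjectIdentifier(String.self)] = "AnotherRule"

assert(injected.count == 1)
assert(generated.count == 2)
```

So handing the same generated dictionary to every `Context` costs a retain, not a per-context copy, as long as no `Context` mutates it.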
Force-pushed: 2c9cfb4 → 39aa5ab
Now that we have the `isOptIn` property for rules, this should be used to disable rules by default instead of removing them from pipeline generation entirely. This also unblocks swiftlang#261, since excluded rules wouldn't show up in the generated name cache, causing their tests to crash.
We're almost there! I ran the tests locally and realized that the old hardcoded list of rules suppressed from pipeline generation meant those rules didn't show up in the generated name cache, causing their tests to crash. I've got #263 out to fix this, which should unblock this change.
Nice catch, and thanks for the quick fix! 🙏🏼 If I'm not mistaken, that was caused by the new assertion that fires when a rule's name is missing from the generated cache.
I think it's fine to have that assertion; if a rule's name isn't in the cache, it's probably a sign of other problems. I merged #263, so if you rebase and rerun `generate-pipeline`, this should be unblocked.
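The assertion being discussed can be sketched as a lookup that traps on a missing entry. This is a hedged sketch with illustrative names, not the PR's exact code:

```swift
// Treat a missing cache entry as a programmer error: it means a rule
// exists that generate-pipeline didn't see when emitting the cache.
func cachedName(
  of rule: Any.Type,
  in cache: [ObjectIdentifier: String]
) -> String {
  guard let name = cache[ObjectIdentifier(rule)] else {
    preconditionFailure("\(rule) is missing from the generated rule name cache")
  }
  return name
}
```

Trapping (rather than falling back to computing the name) surfaces a stale generated cache immediately in tests instead of silently hiding it.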
Yeah, that was precisely my thought process! 😄
Indeed it did! Pushing rebased and updated changes.
Force-pushed: 39aa5ab → 6d500ea
allevato
left a comment
Thanks for your patience! I just did some quick testing and it looks good, so I think we're good to merge now.
…l-mode Fix and improve parallel mode
## Motivation

Parallel mode was crashing with bad memory access due to data races accessing `Frontend.configurationLoader`, more specifically its `cache`.

After serializing access (initially with an `NSLock`), I observed that despite not crashing anymore, performance was surprisingly *worse* than in single-threaded mode.

That led me down a road of investigating why this was happening, and after some profiling I discovered that `Rule`'s `nameCache` was the main source of contention, causing around 30% of total time to be spent waiting for locks (`ulock_wait`) in my tests.

After replacing the synchronization mechanism on `nameCache` with a more efficient `os_unfair_lock_t` (`pthread_mutex_t` on Linux), the contention dropped significantly, and parallel mode now outperformed single-threaded mode as expected. As a bonus, these changes also improved single-threaded performance, due to the reduced overhead of a lock vs. a queue.

I then used the same `Lock` approach to serialize access to `Frontend.configurationLoader`, which increased the performance gap even further.

After these improvements, I was able to obtain quite significant performance gains:

- serial (non-optimized) vs parallel (optimized): ~5.4x (13.5s vs 74s)
- serial (optimized) vs serial (non-optimized): ~1.6x (44s vs 74s)
- serial (optimized) vs parallel (optimized): ~3.2x (13.5s vs 44s)

Tests were made on my `MacBookPro16,1` (8-core i9 @ 2.4GHz), on a project with 2135 Swift files, compiling `swift-format` in Release mode.

## Update 1
Sadly, a custom `Lock` implementation is not ideal for `swift-format` to maintain, and Windows support was not provided. As such, the alternative would be to use `NSLock`, which, being part of `Foundation`, is supported on all major platforms.

Using `NSLock`, the improvements were not so good, unfortunately:

- serial (non-optimized) vs parallel (NSLock): ~1.9x (38s vs 74s)
- serial (NSLock) vs serial (non-optimized): ~1.4x (52s vs 74s)
- serial (NSLock) vs parallel (NSLock): ~1.3x (38s vs 52s)

## Update 2
The best solution was then to try and remove all concurrency contention from `swift-format`. By flattening the `FileToProcess` array and making the `Rule`'s `nameCache` static, the best overall results were achieved.

In order to have both `Rule.ruleName` caching and avoid synchronization, a static cache of rule names is now generated via the `generate-pipeline` tool.

To make the rule types available to the `Rule.ruleName` protocol extension, it was moved to the `SwiftFormatRules` target, together with `SyntaxFormatRule` and `SyntaxLintRule`.

## Update 3
Base rules `SyntaxFormatRule` and `SyntaxLintRule` should be kept in the `SwiftFormatCore` target to eventually support dynamic discovery of rules (ditching `generate-pipeline`) and only require rule implementers to import `SwiftFormatCore`.

On the other hand, any statically generated `ruleName` cache must be in the `SwiftFormatRules` target, because that's where rule implementations reside, in order to capture their `ObjectIdentifier`s at compile time.

So, in order to have a static rule name cache that's accessible from `SwiftFormatCore` (especially from `SyntaxFormatRule`), the solution was to inject a `ruleNameCache` dictionary into `Context`, which is injected into every `Rule`.

This allows using `generate-pipeline` to produce a `ruleNameCache` in `SwiftFormatRules` that's injected into all `Context`s by each of the pipelines (which is done in the `SwiftFormat` target).

While not as ideal as having the cache accessed in a single place (this approach incurs the cache being copied for each `Context`), it achieves a good tradeoff given the current constraints. Let's hope the compiler is smart enough to optimize this somehow.

All things considered, results were quite good:

- serial (non-optimized) vs parallel (optimized): ~7x (10.5s vs 74s)
- serial (optimized) vs serial (non-optimized): ~1.7x (42.5s vs 74s)
- serial (optimized) vs parallel (optimized): ~4x (10.5s vs 42.5s)
## Changes

- Make file preparation serial by calculating all `FileToProcess` before processing them in parallel.
- Update `Context` to receive a `ruleNameCache` of type `[ObjectIdentifier: String]` on `init`.
- Update `Context.isRuleEnabled` to receive the `Rule` type instead of the rule's name, to use the new cache.
- Add a new `RuleNameCacheGenerator` to `generate-pipeline`, which outputs a static `ruleNameCache` dictionary to a `RuleNameCache+Generated.swift` file in the `SwiftFormatRules` target with all existing rules.
- Remove the `nameCache` and `nameCacheQueue` from `Rule.swift`.
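The first change, serial preparation followed by parallel processing, can be sketched like this. It is a hedged sketch with stand-in types: `FileToProcess` here is a simplified placeholder, and uppercasing stands in for the actual formatting work.

```swift
import Foundation

struct FileToProcess {
  let path: String
}

let paths = ["a.swift", "b.swift", "c.swift"]

// 1. Serial preparation: all shared-state access (e.g. loading
//    configurations) happens here, so it needs no synchronization.
let files = paths.map(FileToProcess.init)

// 2. Parallel processing: each iteration writes only to its own slot,
//    so the workers never contend on shared mutable state.
var results = [String?](repeating: nil, count: files.count)
results.withUnsafeMutableBufferPointer { buffer in
  let base = buffer.baseAddress!
  DispatchQueue.concurrentPerform(iterations: files.count) { i in
    base[i] = files[i].path.uppercased()  // stand-in for formatting
  }
}
```

Splitting the phases this way trades a little serial startup time for the removal of every lock on the hot parallel path, which is where the benchmark wins below come from.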
## Benchmarks

Below are benchmarks of some configurations attempted during development, using `hyperfine`. (Some of these configurations' implementations are available in branches on my fork.)
The flattened `FileToProcess` array was an attempt to completely remove the need to synchronize access to `Frontend.configurationLoader`, by making initial file preparation sequential. It results in a slight performance improvement in single-threaded mode (~3.6%), but a performance loss in parallel mode (~4.6%). I opted for maximizing parallel mode's performance, as that should be the more optimized mode.

Work was done based off `swift-5.4-branch`, because I'm using Xcode 12.5.

### single-thread
- no lock in `ConfigurationLoader`, `DispatchQueue` in `Rule` `nameCache` (branches: `swift-5.4-branch`, `main`)
- `Lock` (`os_unfair_lock_t`) in `ConfigurationLoader` and in `Rule` `nameCache` (branch: `swift-5.4-branch-lock-configuration-loader-and-rule-name-cache`, this PR's)
- flattened `FileToProcess` array, `Lock` (`os_unfair_lock_t`) in `Rule` `nameCache` (branch: `swift-5.4-branch-serial-files-to-process-and-lock-rule-name-cache`)

### parallel

- `NSLock` in `ConfigurationLoader`, `DispatchQueue` in `Rule` `nameCache`
- `Lock` (`os_unfair_lock_t`) in `ConfigurationLoader` and in `Rule` `nameCache` (branch: `swift-5.4-branch-lock-configuration-loader-and-rule-name-cache`, this PR's)
- flattened `FileToProcess` array, `Lock` (`os_unfair_lock_t`) in `Rule` `nameCache` (branch: `swift-5.4-branch-serial-files-to-process-and-lock-rule-name-cache`)
- flattened `FileToProcess` array, no `Rule` `nameCache` (branch: `swift-5.4-branch-serial-files-to-process-and-no-rule-name-cache`)

## Benchmarks - Update 1

`NSLock` in `ConfigurationLoader` and in `Rule` `nameCache` (branch: `swift-5.4-branch-nslock-configuration-loader-and-rule-name-cache`)

### serial

### parallel
## Benchmarks - Update 2

flattened `FileToProcess` array, static `Rule` `nameCache` (branch: `swift-5.4-branch-static-rule-name-cache`)

### serial

### parallel
## Benchmarks - Update 3

flattened `FileToProcess` array, static `ruleNameCache` injected in `Context` (branch: `swift-5.4-branch-rule-name-cache-in-context`)

### serial

### parallel