Switch List implementation to use Hash-based lookup #133
Merged
The Hash doesn't require manual reindexing when new rules are added.
Moreover, the Hash-based algorithm has almost O(1) lookup time.
Actually, the lookup time is O(k), where k is the number of parts in
the input string.
find("example.com") -> k = 2
find("www.example.com") -> k = 3
find("www.subdomain.example.com") -> k = 4
It's fair to consider that the average number of parts is 3, and
hostnames longer than 5 parts are quite uncommon.
Note that the Hash-based lookup is highly influenced by whatever
underlying Hash implementation is provided by the programming language.
A Perfect Hash would be preferable in terms of lookup time, as it offers
real O(1) lookup time complexity (whereas a dynamic Hash is O(1) on
average). However, a Perfect Hash requires computing a perfect hashing
function, and it would not allow the flexibility of adding/removing
rules at runtime.
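The O(k) behaviour described above can be illustrated with a minimal sketch. This is not the gem's actual code: the `RULES` table, its metadata, and the `find_rules` helper are hypothetical names chosen for illustration. It performs one Hash probe per label of the input name, building candidates from the rightmost label.

```ruby
# Hypothetical rule table: keys are rule names, values are illustrative metadata.
RULES = {
  "com"       => { private: false },
  "uk"        => { private: false },
  "co.uk"     => { private: false },
  "github.io" => { private: true },
}

# Returns every rule name matching `name`, probing one Hash key per label.
def find_rules(name, rules = RULES)
  labels = name.split(".")
  matches = []
  # Build candidates from the right: "uk", "co.uk", "example.co.uk", ...
  labels.length.times do |i|
    candidate = labels.last(i + 1).join(".")
    matches << candidate if rules.key?(candidate)
  end
  matches
end

find_rules("www.example.co.uk") # 4 labels, so 4 probes; matches "uk" and "co.uk"
```

In a structure like this, adding a rule at runtime is a plain Hash assignment, so no separate reindexing step is needed.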
➜ publicsuffix-ruby git:(thesis-hash) ✗ ruby benchmarks/bm_parts.rb
Warming up --------------------------------------
tokenizer1 26.384k i/100ms
tokenizer2 26.571k i/100ms
tokenizer3 32.293k i/100ms
tokenizer4 27.595k i/100ms
Calculating -------------------------------------
tokenizer1 310.488k (± 6.6%) i/s - 1.557M in 5.035961s
tokenizer2 308.801k (± 8.3%) i/s - 1.541M in 5.027643s
tokenizer3 378.716k (± 5.3%) i/s - 1.905M in 5.045422s
tokenizer4 305.493k (± 9.6%) i/s - 1.518M in 5.018550s
Comparison:
tokenizer3: 378716.5 i/s
tokenizer1: 310488.3 i/s - 1.22x slower
tokenizer2: 308800.6 i/s - 1.23x slower
tokenizer4: 305493.5 i/s - 1.24x slower
After I finally realized why the benchmarks were still using the old code, and fixed the issue in 5ed8d00, here are the new benchmarks that compare the existing implementation with the new lookup based on Hash.
Using the naive indexing:
➜ publicsuffix-ruby git:(master) ruby benchmarks/bm_find.rb
Rehearsal -------------------------------------------------------------
NAME_SHORT 1.550000 0.010000 1.560000 ( 1.563616)
NAME_SHORT (noprivate) 2.060000 0.020000 2.080000 ( 2.117548)
NAME_MEDIUM 1.720000 0.020000 1.740000 ( 1.760489)
NAME_MEDIUM (noprivate) 2.430000 0.020000 2.450000 ( 2.649166)
NAME_LONG 1.630000 0.000000 1.630000 ( 1.643268)
NAME_LONG (noprivate) 2.210000 0.020000 2.230000 ( 2.262352)
NAME_WILD 0.600000 0.000000 0.600000 ( 0.601043)
NAME_WILD (noprivate) 1.320000 0.070000 1.390000 ( 1.475682)
NAME_EXCP 0.940000 0.060000 1.000000 ( 1.071000)
NAME_EXCP (noprivate) 1.120000 0.010000 1.130000 ( 1.136978)
IAAA 0.690000 0.000000 0.690000 ( 0.694769)
IAAA (noprivate) 1.010000 0.010000 1.020000 ( 1.011105)
IZZZ 0.560000 0.000000 0.560000 ( 0.569191)
IZZZ (noprivate) 0.900000 0.000000 0.900000 ( 0.895128)
PAAA 7.310000 0.090000 7.400000 ( 8.036596)
PAAA (noprivate) 7.910000 0.080000 7.990000 ( 8.450394)
PZZZ 1.060000 0.000000 1.060000 ( 1.109186)
PZZZ (noprivate) 1.390000 0.010000 1.400000 ( 1.411946)
JP 50.590000 0.390000 50.980000 ( 52.698865)
JP (noprivate) 49.840000 0.230000 50.070000 ( 50.385524)
IT 9.440000 0.020000 9.460000 ( 9.502403)
IT (noprivate) 9.940000 0.030000 9.970000 ( 10.008055)
COM 8.610000 0.030000 8.640000 ( 8.657849)
COM (noprivate) 9.330000 0.130000 9.460000 ( 9.700029)
-------------------------------------------------- total: 175.410000sec
user system total real
NAME_SHORT 1.580000 0.000000 1.580000 ( 1.588811)
NAME_SHORT (noprivate) 2.000000 0.010000 2.010000 ( 2.024544)
NAME_MEDIUM 1.960000 0.020000 1.980000 ( 2.012659)
NAME_MEDIUM (noprivate) 2.150000 0.020000 2.170000 ( 2.193273)
NAME_LONG 1.660000 0.000000 1.660000 ( 1.666938)
NAME_LONG (noprivate) 2.010000 0.000000 2.010000 ( 2.018177)
NAME_WILD 0.600000 0.000000 0.600000 ( 0.601061)
NAME_WILD (noprivate) 0.920000 0.000000 0.920000 ( 0.920315)
NAME_EXCP 0.700000 0.010000 0.710000 ( 0.708406)
NAME_EXCP (noprivate) 1.260000 0.010000 1.270000 ( 1.298971)
IAAA 0.810000 0.010000 0.820000 ( 0.829160)
IAAA (noprivate) 1.180000 0.000000 1.180000 ( 1.207569)
IZZZ 0.640000 0.010000 0.650000 ( 0.646752)
IZZZ (noprivate) 1.020000 0.000000 1.020000 ( 1.037327)
PAAA 6.180000 0.020000 6.200000 ( 6.227082)
PAAA (noprivate) 6.970000 0.050000 7.020000 ( 7.089971)
PZZZ 0.930000 0.000000 0.930000 ( 0.937254)
PZZZ (noprivate) 1.310000 0.010000 1.320000 ( 1.324235)
JP 47.930000 0.200000 48.130000 ( 48.440196)
JP (noprivate) 48.440000 0.260000 48.700000 ( 49.110888)
IT 9.660000 0.090000 9.750000 ( 9.874755)
IT (noprivate) 9.950000 0.070000 10.020000 ( 10.163920)
COM 7.930000 0.020000 7.950000 ( 7.986893)
COM (noprivate) 8.170000 0.010000 8.180000 ( 8.186619)
Using Hash:
➜ publicsuffix-ruby git:(thesis-hash) ruby benchmarks/bm_find.rb
Rehearsal -------------------------------------------------------------
NAME_SHORT 0.310000 0.000000 0.310000 ( 0.363447)
NAME_SHORT (noprivate) 0.360000 0.000000 0.360000 ( 0.402509)
NAME_MEDIUM 0.320000 0.000000 0.320000 ( 0.317237)
NAME_MEDIUM (noprivate) 0.410000 0.000000 0.410000 ( 0.413092)
NAME_LONG 0.400000 0.000000 0.400000 ( 0.396608)
NAME_LONG (noprivate) 0.510000 0.000000 0.510000 ( 0.510915)
NAME_WILD 0.390000 0.000000 0.390000 ( 0.393804)
NAME_WILD (noprivate) 0.510000 0.010000 0.520000 ( 0.507487)
NAME_EXCP 0.400000 0.000000 0.400000 ( 0.401723)
NAME_EXCP (noprivate) 0.520000 0.000000 0.520000 ( 0.525549)
IAAA 0.240000 0.000000 0.240000 ( 0.244243)
IAAA (noprivate) 0.360000 0.000000 0.360000 ( 0.359558)
IZZZ 0.250000 0.000000 0.250000 ( 0.249716)
IZZZ (noprivate) 0.360000 0.000000 0.360000 ( 0.356862)
PAAA 0.440000 0.000000 0.440000 ( 0.445464)
PAAA (noprivate) 0.590000 0.000000 0.590000 ( 0.591834)
PZZZ 0.450000 0.000000 0.450000 ( 0.446044)
PZZZ (noprivate) 0.520000 0.000000 0.520000 ( 0.524458)
JP 0.320000 0.000000 0.320000 ( 0.327063)
JP (noprivate) 0.430000 0.000000 0.430000 ( 0.430906)
IT 0.270000 0.000000 0.270000 ( 0.265015)
IT (noprivate) 0.340000 0.000000 0.340000 ( 0.345299)
COM 0.250000 0.000000 0.250000 ( 0.244028)
COM (noprivate) 0.340000 0.010000 0.350000 ( 0.343862)
---------------------------------------------------- total: 9.310000sec
user system total real
NAME_SHORT 0.220000 0.000000 0.220000 ( 0.221509)
NAME_SHORT (noprivate) 0.320000 0.000000 0.320000 ( 0.329044)
NAME_MEDIUM 0.290000 0.000000 0.290000 ( 0.296088)
NAME_MEDIUM (noprivate) 0.390000 0.000000 0.390000 ( 0.393592)
NAME_LONG 0.420000 0.000000 0.420000 ( 0.419251)
NAME_LONG (noprivate) 0.500000 0.000000 0.500000 ( 0.499873)
NAME_WILD 0.420000 0.000000 0.420000 ( 0.421002)
NAME_WILD (noprivate) 0.480000 0.000000 0.480000 ( 0.485180)
NAME_EXCP 0.400000 0.000000 0.400000 ( 0.401010)
NAME_EXCP (noprivate) 0.510000 0.000000 0.510000 ( 0.506889)
IAAA 0.250000 0.000000 0.250000 ( 0.257035)
IAAA (noprivate) 0.350000 0.000000 0.350000 ( 0.352895)
IZZZ 0.250000 0.000000 0.250000 ( 0.250804)
IZZZ (noprivate) 0.350000 0.010000 0.360000 ( 0.352272)
PAAA 0.440000 0.000000 0.440000 ( 0.444238)
PAAA (noprivate) 0.540000 0.000000 0.540000 ( 0.549019)
PZZZ 0.440000 0.000000 0.440000 ( 0.449137)
PZZZ (noprivate) 0.550000 0.000000 0.550000 ( 0.559688)
JP 0.330000 0.000000 0.330000 ( 0.337413)
JP (noprivate) 0.450000 0.010000 0.460000 ( 0.458545)
IT 0.240000 0.000000 0.240000 ( 0.247337)
IT (noprivate) 0.350000 0.000000 0.350000 ( 0.351233)
COM 0.260000 0.000000 0.260000 ( 0.261882)
COM (noprivate) 0.340000 0.000000 0.340000 ( 0.347857)
Using the naive indexing:
➜ publicsuffix-ruby git:(master) ruby test/profilers/execution_profiler.rb
Total allocated: 204162 bytes (4420 objects)
Total retained: 0 bytes (0 objects)
allocated memory by gem
-----------------------------------
204002 publicsuffix-ruby/lib
160 other
allocated memory by class
-----------------------------------
177036 String
18416 Array
2560 Hash
2134 Regexp
1168 RubyVM::Env
1120 MatchData
800 Proc
576 Enumerator::Lazy
96 Enumerator::Generator
96 Enumerator::Yielder
80 PublicSuffix::Domain
80 PublicSuffix::Rule::Wildcard
allocated objects by gem
-----------------------------------
4416 publicsuffix-ruby/lib
4 other
allocated objects by class
-----------------------------------
4332 String
32 Array
16 Hash
10 Proc
10 RubyVM::Env
4 Enumerator::Lazy
4 MatchData
4 Regexp
2 Enumerator::Generator
2 Enumerator::Yielder
2 PublicSuffix::Domain
2 PublicSuffix::Rule::Wildcard
retained memory by gem
-----------------------------------
NO DATA
retained memory by file
-----------------------------------
NO DATA
retained memory by location
-----------------------------------
NO DATA
retained memory by class
-----------------------------------
NO DATA
retained objects by gem
-----------------------------------
NO DATA
retained objects by file
-----------------------------------
NO DATA
retained objects by location
-----------------------------------
NO DATA
retained objects by class
-----------------------------------
NO DATA
Using Hash:
➜ publicsuffix-ruby git:(thesis-hash) ruby test/profilers/execution_profiler.rb
Total allocated: 15170 bytes (160 objects)
Total retained: 0 bytes (0 objects)
allocated memory by gem
-----------------------------------
15010 publicsuffix-ruby/lib
160 other
allocated memory by class
-----------------------------------
8076 String
2560 Hash
2134 Regexp
1120 Array
1120 MatchData
80 PublicSuffix::Domain
80 PublicSuffix::Rule::Wildcard
allocated objects by gem
-----------------------------------
156 publicsuffix-ruby/lib
4 other
allocated objects by class
-----------------------------------
108 String
24 Array
16 Hash
4 MatchData
4 Regexp
2 PublicSuffix::Domain
2 PublicSuffix::Rule::Wildcard
retained memory by gem
-----------------------------------
NO DATA
retained memory by file
-----------------------------------
NO DATA
retained memory by location
-----------------------------------
NO DATA
retained memory by class
-----------------------------------
NO DATA
retained objects by gem
-----------------------------------
NO DATA
retained objects by file
-----------------------------------
NO DATA
retained objects by location
-----------------------------------
NO DATA
retained objects by class
-----------------------------------
NO DATA
When the rule is stored, we can remove the value from the Rule, as
the value is effectively the key of the Hash.
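A minimal sketch of this idea, assuming a stored record that carries only metadata and a full rule that is rebuilt on lookup. The `Entry`, `Rule`, and `RuleList` definitions here are hypothetical illustrations, not the gem's real classes.

```ruby
# Hypothetical internal record: note that it has no `value` field.
Entry = Struct.new(:type, :private)

# Hypothetical public-facing rule, rebuilt on demand from key + entry.
Rule = Struct.new(:value, :type, :private)

class RuleList
  def initialize
    @rules = {}
  end

  # Store only the metadata; the rule value lives in the Hash key.
  def add(value, type: :normal, private: false)
    @rules[value] = Entry.new(type, private)
    self
  end

  # Recombine the key and the stored entry into a full rule.
  def lookup(value)
    entry = @rules[value]
    entry && Rule.new(value, entry.type, entry.private)
  end
end

list = RuleList.new.add("co.uk").add("github.io", private: true)
rule = list.lookup("github.io")
```

A design of this shape keeps one String per rule (the key) instead of two, at the cost of allocating a new rule object on each successful lookup.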
➜ publicsuffix-ruby git:(before) ruby test/profilers/initialization_profiler.rb
Total allocated: 5882690 bytes (52219 objects)
Total retained: 1375819 bytes (24188 objects)
➜ publicsuffix-ruby git:(before) ruby test/profilers/execution_profiler.rb
Total allocated: 15170 bytes (160 objects)
Total retained: 0 bytes (0 objects)
➜ publicsuffix-ruby git:(after) ✗ ruby test/profilers/initialization_profiler.rb
Total allocated: 6205130 bytes (60280 objects)
Total retained: 1052404 bytes (16127 objects)
➜ publicsuffix-ruby git:(after) ✗ ruby test/profilers/execution_profiler.rb
Total allocated: 15330 bytes (164 objects)
Total retained: 0 bytes (0 objects)
compared to master
➜ publicsuffix-ruby git:(master) ruby test/profilers/initialization_profiler.rb
Total allocated: 6525758 bytes (72086 objects)
Total retained: 1020387 bytes (19234 objects)
➜ publicsuffix-ruby git:(master) ruby test/profilers/execution_profiler.rb
Total allocated: 204162 bytes (4420 objects)
Total retained: 0 bytes (0 objects)
Execution time is unchanged.
➜ publicsuffix-ruby git:(before) ruby test/benchmarks/bm_find.rb
user system total real
NAME_SHORT 0.260000 0.000000 0.260000 ( 0.262684)
NAME_SHORT (noprivate) 0.370000 0.010000 0.380000 ( 0.372534)
NAME_MEDIUM 0.330000 0.000000 0.330000 ( 0.335683)
NAME_MEDIUM (noprivate) 0.490000 0.000000 0.490000 ( 0.494590)
NAME_LONG 0.510000 0.010000 0.520000 ( 0.519750)
NAME_LONG (noprivate) 0.590000 0.000000 0.590000 ( 0.594626)
NAME_WILD 0.480000 0.000000 0.480000 ( 0.490432)
NAME_WILD (noprivate) 0.580000 0.010000 0.590000 ( 0.594776)
NAME_EXCP 0.460000 0.000000 0.460000 ( 0.470119)
NAME_EXCP (noprivate) 0.590000 0.010000 0.600000 ( 0.601316)
IAAA 0.300000 0.000000 0.300000 ( 0.305301)
IAAA (noprivate) 0.400000 0.000000 0.400000 ( 0.410586)
IZZZ 0.280000 0.000000 0.280000 ( 0.283711)
IZZZ (noprivate) 0.400000 0.010000 0.410000 ( 0.408137)
PAAA 0.490000 0.000000 0.490000 ( 0.501869)
PAAA (noprivate) 0.600000 0.000000 0.600000 ( 0.612187)
PZZZ 0.510000 0.010000 0.520000 ( 0.519206)
PZZZ (noprivate) 0.590000 0.000000 0.590000 ( 0.600264)
JP 0.390000 0.000000 0.390000 ( 0.404432)
JP (noprivate) 0.540000 0.010000 0.550000 ( 0.558351)
IT 0.290000 0.000000 0.290000 ( 0.298931)
IT (noprivate) 0.410000 0.000000 0.410000 ( 0.420742)
COM 0.290000 0.010000 0.300000 ( 0.300935)
COM (noprivate) 0.400000 0.000000 0.400000 ( 0.409309)
➜ publicsuffix-ruby git:(after) ✗ ruby test/benchmarks/bm_find.rb
user system total real
NAME_SHORT 0.320000 0.000000 0.320000 ( 0.320201)
NAME_SHORT (noprivate) 0.430000 0.000000 0.430000 ( 0.443678)
NAME_MEDIUM 0.380000 0.000000 0.380000 ( 0.388169)
NAME_MEDIUM (noprivate) 0.490000 0.010000 0.500000 ( 0.491073)
NAME_LONG 0.480000 0.000000 0.480000 ( 0.483376)
NAME_LONG (noprivate) 0.620000 0.010000 0.630000 ( 0.634896)
NAME_WILD 0.570000 0.020000 0.590000 ( 0.628489)
NAME_WILD (noprivate) 0.700000 0.030000 0.730000 ( 0.769070)
NAME_EXCP 0.580000 0.020000 0.600000 ( 0.618683)
NAME_EXCP (noprivate) 0.740000 0.030000 0.770000 ( 0.799244)
IAAA 0.410000 0.030000 0.440000 ( 0.474761)
IAAA (noprivate) 0.550000 0.040000 0.590000 ( 0.645329)
IZZZ 0.380000 0.020000 0.400000 ( 0.432898)
IZZZ (noprivate) 0.520000 0.020000 0.540000 ( 0.579073)
PAAA 0.680000 0.040000 0.720000 ( 0.760276)
PAAA (noprivate) 0.720000 0.020000 0.740000 ( 0.773864)
PZZZ 0.700000 0.040000 0.740000 ( 0.782113)
PZZZ (noprivate) 0.650000 0.010000 0.660000 ( 0.664647)
JP 0.470000 0.000000 0.470000 ( 0.478473)
JP (noprivate) 0.580000 0.010000 0.590000 ( 0.589827)
IT 0.360000 0.000000 0.360000 ( 0.379309)
IT (noprivate) 0.450000 0.010000 0.460000 ( 0.471794)
COM 0.330000 0.010000 0.340000 ( 0.334253)
COM (noprivate) 0.530000 0.030000 0.560000 ( 0.592813)
Using the new benchmarks introduced in dec53e6, the allocation is clearly lower even during execution time.
➜ publicsuffix-ruby git:(master) ✗ ruby test/profilers/find_profiler.rb
Total allocated: 31472 bytes (691 objects)
Total retained: 0 bytes (0 objects)
➜ publicsuffix-ruby git:(master) ✗ ruby test/profilers/domain_profiler.rb
Total allocated: 37410 bytes (744 objects)
Total retained: 0 bytes (0 objects)
vs
➜ publicsuffix-ruby git:(thesis-hash) ruby test/profilers/find_profiler.rb
Total allocated: 1264 bytes (22 objects)
Total retained: 0 bytes (0 objects)
➜ publicsuffix-ruby git:(thesis-hash) ruby test/profilers/domain_profiler.rb
Total allocated: 7202 bytes (75 objects)
Total retained: 0 bytes (0 objects)
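The profiler scripts themselves are not shown here. As a rough stand-in that relies only on the standard library, object allocations around a block can be counted with `GC.stat`. The `allocated_objects` helper is a hypothetical name, and this sketch counts objects only, not bytes, so it is not equivalent to the reports above.

```ruby
# Counts objects allocated while the block runs.
# GC.stat(:total_allocated_objects) is a monotonically increasing counter,
# so the difference is the number of objects created in between.
def allocated_objects
  before = GC.stat(:total_allocated_objects)
  yield
  GC.stat(:total_allocated_objects) - before
end

# Example: allocate an Array holding ten freshly built Strings.
count = allocated_objects { Array.new(10) { |i| "label#{i}" } }
puts "allocated objects: #{count}"
```

Wrapping a single lookup call in `allocated_objects { ... }` on two branches gives a quick, if coarse, way to compare them.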
.new now takes all parameters, as you would expect when creating a completely new instance from data you already have. A new method called .build is used to create a new Rule from rule content.
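The split between the two constructors might look like the following sketch. The class body, attribute names, and `type` symbols are assumptions made for illustration; only the `.new`/`.build` division comes from the commit message. The prefix handling follows the Public Suffix List syntax, where `!` marks an exception rule and `*.` a wildcard rule.

```ruby
class Rule
  attr_reader :value, :type, :private

  # Parses raw rule content and delegates to .new with explicit attributes.
  def self.build(content, private: false)
    case content
    when /\A!/    then new(value: content[1..-1], type: :exception, private: private)
    when /\A\*\./ then new(value: content[2..-1], type: :wildcard, private: private)
    else               new(value: content, type: :normal, private: private)
    end
  end

  # Takes every attribute explicitly; no parsing happens here.
  def initialize(value:, type:, private: false)
    @value = value
    @type = type
    @private = private
  end
end

wildcard  = Rule.build("*.ck")
exception = Rule.build("!www.ck", private: true)
plain     = Rule.new(value: "com", type: :normal)
```

Keeping parsing in `.build` means code that already holds the parsed attributes can call `.new` directly without going through the rule syntax again.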
Better distinguish between a Rule (public API) and an Entry (internal API).
It doesn't support keyword arguments with no default, and proper memory profiling.
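For reference, keyword arguments with no default are Ruby's required keyword arguments, available since Ruby 2.1. A small sketch, using a hypothetical `make_rule` method:

```ruby
# `value:` has no default, so callers must pass it; `private:` is optional.
def make_rule(value:, private: false)
  { value: value, private: private }
end

ok = make_rule(value: "com")

# Omitting the required keyword raises ArgumentError at call time.
error =
  begin
    make_rule(private: true)
    nil
  rescue ArgumentError => e
    e
  end
```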
Contributor
I tested it on my app. The gem now loads 3 times faster ( 👏 cc @burke
Owner
Author
Thanks for the feedback @casperisfine. I have some more research going on to use a modification of a Trie or a DAFSA to reduce the memory allocation. That said, I'm quite happy with the speed right now.
weppos added a commit that referenced this pull request Aug 4, 2017
roback added a commit to twingly/twingly-url that referenced this pull request Feb 9, 2018
Unfortunately it doesn't look like this fixes any of our issues, but since it made the profiling run a bit faster (and the fact that the tests didn't break) I made a PR of this anyway. (Profiling total run: 1.6663s -> 1.4801s).
Some related links:
* weppos/publicsuffix-ruby#130
* weppos/publicsuffix-ruby#133
* sporkmonger/addressable#267
This is a major refactoring of the internals of the List implementation (the way the list is stored), and of the find operation algorithm. The goal was to decrease the memory footprint and increase the speed of the lookup.
This is part of a study and research I am conducting about data structures and algorithms. The various commits contain extra information about the changes and optimizations.