New Modules
Starting from this version, more modules join the Apache Pegasus project. This release includes the following:
- RDSN: In the past, rdsn existed as a sub-project of the current repository; in this release, it officially becomes the core module of the project.
- PegasusClient: Pegasus supports multi-language clients. In the past, they lived in separate repositories; in this release, the clients (see the Java Client, Go Client, and Python Client sections below) are included in this repository.
- PegasusDocker: Pegasus supports Docker builds, which are now included in the latest version.
- PegasusShell: In the past, Pegasus used a C++ shell to manage the cluster; the latest version brings new shell tools: AdminCli and Pegic.
New Architecture
In this version, we remove the shared log to improve Pegasus performance. Related pull requests are as follows:
- feat: add tracer point in plog XiaoMi/rdsn#993
- feat: support callback for plog append XiaoMi/rdsn#994
- feat: add force flush option for private log XiaoMi/rdsn#999
- feat: remove shared log XiaoMi/rdsn#1019
- fix: stack overflow in duplication test XiaoMi/rdsn#1022
- feat: make option empty_write_disable mutable XiaoMi/rdsn#1048
- fix: remove unused judgement of FLAGS_enable_latency_tracer in private log XiaoMi/rdsn#1028
- feat(remove slog): adapt to modifications in rdsn #890
New Features
Replica-factor update
Support for a flexible replica count. In the past, the replication factor was immutable once a table was created. In the current version, users can dynamically adjust the replication factor of a specified table. Related pull requests are as follows:
- feat(update_replication_factor#1): support to get the replication factor of each table XiaoMi/rdsn#1061
- feat(update_replication_factor#2): add RPC for setting the replication factor of each table XiaoMi/rdsn#1072
- feat(update_replication_factor#3): support to set the replication factor for all partitions of each table XiaoMi/rdsn#1077
- feat(update_replication_factor#4): support to set env for the replication factor XiaoMi/rdsn#1087
- feat(update_replication_factor#7): replace cluster-level max_replicas_in_group with max_replica_count of each replica XiaoMi/rdsn#1109
- feat(update_replication_factor#8): recover uncompleted update of replication factor during initialization of meta service XiaoMi/rdsn#1110
- feat(update_replication_factor#9): merge max_replica_count_rebased_dev into master branch XiaoMi/rdsn#1115
- feat(update_replication_factor#10): support to get/set the replication factor of each table by shell #914
- feat(update_replication_factor#11): add max_reserved_dropped_replicas into server config file #999
- fix(update_replication_factor#12): update the mutation_2pc_min_replica_count base the max count of table #1035
Read Request Limiter
In the past, we only supported a write limiter; in this version, we add support for read throttling:
- feat: add a token_bucket_throttling_controller XiaoMi/rdsn#941
- feat: add an interface for rpc_holder to set error XiaoMi/rdsn#939
- feat: support read throttling by size #829
- feat: add new codes in env to support READ_SIZE_THROTTLING XiaoMi/rdsn#948
- feat: add an interface for app_base to get replica info XiaoMi/rdsn#947
- fix: add TokenBucket header in the token_bucket_throttling_controller.h header XiaoMi/rdsn#946
Jemalloc Support
Build Feature
We have added some restrictions on the compilation environment and now support macOS and aarch64:
- fix(build): update GCC and CMake checker XiaoMi/rdsn#1041
- feat(build): support to build on MacOS XiaoMi/rdsn#1034
- feat: support aarch64 XiaoMi/rdsn#1097
- fix(build): fix build failure on MacOS #1049
New BatchGet API
In the past, batchGet was implemented on top of singleGet. The latest version aggregates the requests before sending them, which improves performance (see the sketch after this list):
- feat: add 'BATCH_GET' interface for read optimization #897
- feat: add a batch get client interface which uses 'BATCH_GET' rpc for read optimization XiaoMi/pegasus-java-client#175
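As an illustration of the call shape, here is a minimal sketch using the Java client's existing batchGet method on PegasusClientInterface. The table name "temp", the keys, and the "resource:///pegasus.properties" config path are placeholders, and the new BATCH_GET-based interface added in XiaoMi/pegasus-java-client#175 may expose a different method name and signature, so check the client's javadoc for the exact API:

```java
import java.util.ArrayList;
import java.util.List;

import org.apache.commons.lang3.tuple.Pair;

import com.xiaomi.infra.pegasus.client.PegasusClientFactory;
import com.xiaomi.infra.pegasus.client.PegasusClientInterface;

public class BatchGetExample {
    public static void main(String[] args) throws Exception {
        // Client configured from a properties file on the classpath (placeholder path).
        PegasusClientInterface client =
                PegasusClientFactory.createClient("resource:///pegasus.properties");

        // Each entry is a (hashKey, sortKey) pair to read from the table in one batch.
        List<Pair<byte[], byte[]>> keys = new ArrayList<>();
        keys.add(Pair.of("user1".getBytes(), "profile".getBytes()));
        keys.add(Pair.of("user2".getBytes(), "profile".getBytes()));

        // Results are filled in the same order as the requested keys.
        List<byte[]> values = new ArrayList<>();
        client.batchGet("temp", keys, values);

        for (int i = 0; i < keys.size(); i++) {
            byte[] value = values.get(i);
            System.out.println(new String(keys.get(i).getLeft()) + " -> "
                    + (value == null ? "<not found>" : new String(value)));
        }
    }
}
```

For large batches, the RPC-based BATCH_GET path should reduce the number of round trips compared with issuing one get per key, which is the improvement described above.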
Task Queue Limiter
- feat(task): add dynamic control of task queue length XiaoMi/rdsn#902
- feat: add throttle config of read/write request #831
Feature Enhancements
Bulkload
We improve the bulk load feature to reduce the I/O load of downloading and ingesting. In addition, we offer better interfaces and failure-handling logic. Related pull requests are as follows:
- feat(bulk_load): async create connection with remote file provider XiaoMi/rdsn#952
- feat(bulk_load): not rollback to downloading if bulk load meet error in succeed stage XiaoMi/rdsn#958
- feat(bulk_load): remove bulk load request short interval avoid unnecessary timeout XiaoMi/rdsn#959
- refactor(bulk_load): update some bulk load logs XiaoMi/rdsn#960
- refactor(bulk_load): refactor get_app for bulk_load_service XiaoMi/rdsn#962
- feat(bulk_load): add unhealthy partition check XiaoMi/rdsn#964
- fix: remove bulk_load_meta_service::check_partition_status useless this XiaoMi/rdsn#967
- fix(bulk_load): fix replica validate status check XiaoMi/rdsn#1004
- fix(bulk_load): update ingestion error handling with move_files mode XiaoMi/rdsn#1011
- feat(bulk_load): avoid unecessary repeated ingestion XiaoMi/rdsn#1018
- feat(bulk_load): add verify_before_ingest option XiaoMi/rdsn#1027
- refactor(bulk_load): rename update_partition_status_on_remote_storage XiaoMi/rdsn#1031
- feat(bulk_load): support disk_level ingesting restriction part1 - add ingestion_context class XiaoMi/rdsn#1035
- feat(bulk_load): support disk_level ingesting restriction part2 implementation XiaoMi/rdsn#1039
- fix(bulk_load): fix bug that poping all committed mutations ineffective XiaoMi/rdsn#1102
- feat(bulk_load): support clear last bulk load state rpc XiaoMi/rdsn#1103
- feat(bulk_load): update downloading to avoid blocking default thread pool XiaoMi/rdsn#1104
- feat(bulk_load): support different tables can execute bulk load concurrently XiaoMi/rdsn#1105
- feat(direct_io): download files from hdfs with direct I/O XiaoMi/rdsn#1069
- feat(verify_file): update the file validation logic for bulk load downloading to avoid io-read workload XiaoMi/rdsn#1074
- fix: make is_bulk_loading filed optional in query_bulk_load_response XiaoMi/rdsn#1009
- fix(bulk_load): add ballot check before executing ingestion #881
- feat(bulk_load): support verify_before_ingest option #888
- feat(bulk_load): support clear last bulk load state interface #968
- feat(bulk_load): add bulk load ingestion progress for shell and admin-cli #975
- feat(online_migration): part2 - add ingest_behind option for bulk load XiaoMi/rdsn#1002
- fix: adapt query_bulk_load_response is_bulk_loading filed optional #868
Duplication
In the past, duplication had some shortcomings: it depended on a remote filesystem to sync the checkpoint, and plog data synchronization sent only a single mutation per RPC. In this version, we address these problems (see #892 for the detailed design). Related pull requests are as follows:
- feat(dup_enhancement#1): master add new dup status to support new dup XiaoMi/rdsn#1038
- feat(dup_enhancement#2): master add more follower's info for master meta XiaoMi/rdsn#1040
- feat(dup_enhancement#3): master update load_meta_servers to avoid crush when duplication config is wrong XiaoMi/rdsn#1045
- feat(dup_enhancement#4): master implement create_follower_app_for_duplication function XiaoMi/rdsn#1046
- feat(dup_enhancement#5): master add and support updating checkpoint_prepared XiaoMi/rdsn#1049
- feat(dup_enhancement#6): master implement check_follower_app_if_create_completed function XiaoMi/rdsn#1051
- feat(dup_enhancement#7): master implement prepare_dup function in replica side for meta DS_PREPARE status XiaoMi/rdsn#1053
- feat(dup_enhancement#8): master replica sync info to meta support checkpoint_has_prepared XiaoMi/rdsn#1055
- feat(dup_enhancement#9): master replica support on_query_last_checkpoint_info rpc for follower replica XiaoMi/rdsn#1056
- fix(dup_enhancement#10): follower meta need return ERR_APP_EXIST when receive create same app XiaoMi/rdsn#1059
- feat(dup_enhancement#11): follower replica add replica_follower to support duplicate checkpoint when open replica XiaoMi/rdsn#1060
- feat(dup_enhancement#12): follower replica add framework and implement the main function XiaoMi/rdsn#1063
- feat(dup_enhancement#13): follower replica support query and update master partition config XiaoMi/rdsn#1064
- feat(dup_enhancement#14): follower replica support nfs copy checkpoint XiaoMi/rdsn#1065
- feat(dup_enhancement#17): replica follower load duplication data when open replica #917
- fix(dup_enhancement#15): follower replica add duplciation_dir in replica_app_base XiaoMi/rdsn#1066
- fix(dup_enhancement#16): change some replica_app_base function to virtual for polymorphism XiaoMi/rdsn#1067
- feat(dup_enhancement#18): support batch send log on master side and handle it on follower side #919
- fix(dup_enhancement#19): change checkpoint status after manual checkpoint completed XiaoMi/rdsn#1071
- fix(dup_enhancement#20): put async_duplicate_checkpoint_from_master_replica into DEFAULT_POOL to avoid thread lock XiaoMi/rdsn#1076
- fix(dup_enhancement#21): add batch replay log to support batch send mutation XiaoMi/rdsn#1080
- fix(dup_enhancement#22): change batch send config by using rdsn config value #930
- refactor(dup_enhancement#22): delete useless freezed argument XiaoMi/rdsn#1084
- refactor(dup_enhancement#24): delete useless freezed argument #935
- feat(dup_enhancement): support using nfs to duplicate checkpoint replace fds and support batch send plog #936
- feat(dup_enhancement#25): add -s | --sst argument to support ignoring checkpoint when duplicate XiaoMi/rdsn#1085
- feat(dup_enhancement#25): add -s | --sst argument to support ignoring checkpoint when duplicate #940
- fix: assert is not necessary when the table config is updated XiaoMi/rdsn#1121
- feat(dup_enhancement#26): update duplicate rpc thrift in go-client #1007
- feat(dup_enhancement#27): update dulication command in admin-cli #1008
- fix: add app_info duplicating initial value XiaoMi/rdsn#976
- fix: fix recovery verification bug #1065
- fix: fix recovery code clang failed #1078
PerfCounter
In this version, we support a new metrics implementation to optimize performance:
- feat: implement long adder to optimize the counter of new metrics system XiaoMi/rdsn#1033
- feat(new_metrics): implement the metric entity & its prototype XiaoMi/rdsn#1070
- feat(new_metrics): implement the metric registry XiaoMi/rdsn#1073
- feat(new_metrics): implement the metric & its prototype XiaoMi/rdsn#1075
- feat(new_metrics): implement the counter XiaoMi/rdsn#1081
- feat: support to close percentile to prevent from heap-use-after-free error #1074
Manual Compaction
- feat(manual_compaction): meta server support starting compaction XiaoMi/rdsn#989
- feat(manual_compaction): meta server support querying compaction status XiaoMi/rdsn#987
- feat(manual_compaction): replica report status part3 XiaoMi/rdsn#983
- feat(manual_compaction): replica report status part2. implement function query_compact_status #854
- feat(manual_compaction): replica report status part1. add function query_compact_status XiaoMi/rdsn#981
Learn with NFS
To reduce the impact of data migration on I/O load while ensuring the migration rate, data transmission supports disk-level speed limits:
- feat: add rate limiter for per disk when nfs copy data XiaoMi/rdsn#944
- feat(utils): add the token buckets wrapper to get bucket by name XiaoMi/rdsn#943
- feat: change the nfs limiter diable by default XiaoMi/rdsn#985
Latency Tracer
The latest latency tracer supports perf-counters and fixes some bugs:
- feat: add dsn_unlikely in macro ADD_POINT XiaoMi/rdsn#1029
- fix: avoid to call add_point in latency_tracer XiaoMi/rdsn#1023
- refactor: use latest latency tool to add point for write process XiaoMi/rdsn#951
- feat: support perfcounter and more information trace log XiaoMi/rdsn#945
- fix: break execute some method when disable latency tracer XiaoMi/rdsn#965
Other Important Changes
- feat: add deny table env to reject client read/write request XiaoMi/rdsn#1086
- feat(backup): add write limiter for hdfs XiaoMi/rdsn#988
- feat: zookeeper client support kerberos XiaoMi/rdsn#979
- feat: implement grouped flag validators XiaoMi/rdsn#978
- feat: restrict the replication factor while creating app XiaoMi/rdsn#963
- build: require add issue reference in description using github action #870
- feat(scripts): update rolling_update and rebalance scripts #852
- feat(script): lengthen downgrade and close replica timeout #907
- feat: support to restrict the size of rocksdb info logs #1009
- chore(script): Use fqdn instead of ip to generate the meta_server conf #1085
- feat: The conf server_list of meta_server support use fqdn:port #1061
Java Client
- fix: HashKeyData constructor compatible XiaoMi/pegasus-java-client#184
- fix: fix a mismatch code XiaoMi/pegasus-java-client#181
- feat: add a app create interface to the java client XiaoMi/pegasus-java-client#180
- fix: batchMultiGet may not fetch all required data XiaoMi/pegasus-java-client#177
- feat(java-client): Add hasNext() for PegasusScannerInterface #1019
- ci(java-client): remove spotless dependency #1000
- feat: add app drop interface to the java client #973
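Among the changes above, PegasusScannerInterface gains a hasNext() method (#1019). The sketch below shows how a hash-key scan might use it; the package name, table name, empty sort-key bounds, and the assumption that hasNext() only peeks (without consuming the record returned by the following next()) are based on the pre-donation com.xiaomi.infra.pegasus.client API, so verify the exact signatures against the client version you run:

```java
import org.apache.commons.lang3.tuple.Pair;

import com.xiaomi.infra.pegasus.client.PegasusClientFactory;
import com.xiaomi.infra.pegasus.client.PegasusClientInterface;
import com.xiaomi.infra.pegasus.client.PegasusScannerInterface;
import com.xiaomi.infra.pegasus.client.ScanOptions;

public class ScanExample {
    public static void main(String[] args) throws Exception {
        PegasusClientInterface client =
                PegasusClientFactory.createClient("resource:///pegasus.properties");

        // Scan all sort keys under one hash key; empty sort-key bounds are
        // assumed to mean "unbounded" within that hash key.
        PegasusScannerInterface scanner = client.getScanner(
                "temp", "user1".getBytes(), new byte[0], new byte[0], new ScanOptions());

        // hasNext() (new in this release) lets callers test for more records
        // before calling next(), instead of checking next() for null.
        while (scanner.hasNext()) {
            Pair<Pair<byte[], byte[]>, byte[]> record = scanner.next();
            byte[] sortKey = record.getLeft().getRight();
            byte[] value = record.getRight();
            System.out.println(new String(sortKey) + " -> " + new String(value));
        }
        scanner.close();
    }
}
```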
Go Client
- feat: add clear bulk load rpc (#109) XiaoMi/pegasus-go-client#110
- feat: add clear bulk load rpc XiaoMi/pegasus-go-client#109
- feat: add manual compaction related rpc XiaoMi/pegasus-go-client#107
- fix: the timeout should use UnixNano to compute XiaoMi/pegasus-go-client#105
- fix: rpc will failed when server side enable drop timeout request feature XiaoMi/pegasus-go-client#104
- feat: handle ERR_DISK_INSUFFICIENT error XiaoMi/pegasus-go-client#102
- feat: add bulk load related structures XiaoMi/pegasus-go-client#101
- feat: support add disk rpc thrift XiaoMi/pegasus-go-client#99
Python Client
Admin Cli
- support nodes migrator
- feat: support single migrate to target pegasus-kv/admin-cli#62
- feature: support batch migrate multi tables pegasus-kv/admin-cli#60
- refactor: update too many stdout log pegasus-kv/admin-cli#59
- refactor(nodes-migratot): support the multi target node execute the migrate task pegasus-kv/admin-cli#54
- feat(nodes-migrator 4/4): implement the migrate task submit pegasus-kv/admin-cli#53
- feat(nodes-migrator 3/4): implement sync nodes info pegasus-kv/admin-cli#52
- feat(nodes-migrator 2/4): implement the migrator.run function pegasus-kv/admin-cli#51
- feat(nodes-migrator 1/4): the first framework step of migrate nodes pegasus-kv/admin-cli#50
- feat: support manual compaction rpc pegasus-kv/admin-cli#58
- feat: support disk check before partition split pegasus-kv/admin-cli#57
- feat: partition split support sub-cmd pegasus-kv/admin-cli#56
- feat: support bulk load rpc pegasus-kv/admin-cli#55
- fix(admin-cli): fix admin-cli usage doc #987
- feat(admin-cli): support nodes capacity balance using admin-cli #969
- refactor(admin-cli): move disk-balance code into toolkits package #958
- feat(admin-cli): support online-migrate table to another cluster using table-duplication #1006
- feat(admin-cli): add clear_bulk_load rpc #976
Pegasus Docker
Code Refactor
- refactor: make ctor/dtor function of singleton classes to be private XiaoMi/rdsn#1068
- refactor: use the corrent cast operation XiaoMi/rdsn#1062
- refactor: drop redundant config_status::pending_proposal XiaoMi/rdsn#1050
- refactor: change the type of prepare_list from raw pointer to std::unique_ptr XiaoMi/rdsn#1032
- refactor: remove coredump dir which is useless XiaoMi/rdsn#1030
- refactor: remove deprecated replica .info file XiaoMi/rdsn#1015
- refactor: simplify code by macros XiaoMi/rdsn#1012
- build: require add issue reference in description using github action XiaoMi/rdsn#1010
- refactor: add a new function to store app_info into file XiaoMi/rdsn#1003
- refactor: remove unused function in replication_common.h XiaoMi/rdsn#1000
- refactor: move backup common types to backup_common.h XiaoMi/rdsn#998
- refactor: move manual compaction common types to a seperate file XiaoMi/rdsn#997
- refactor: move partition split common types to a seperate file XiaoMi/rdsn#996
- refactor: move bulk load common types to a seperate file XiaoMi/rdsn#995
- refactor: move func get_current_cluster_name out of duplication_common.h XiaoMi/rdsn#992
- refactor: merge load balance refactor code XiaoMi/rdsn#974
- feat: add alive nodes perf-counter XiaoMi/rdsn#968
- refactor: add an interface for checking if the bucket is empty XiaoMi/rdsn#950
- refactor: reimplement copy operations in load balancer XiaoMi/rdsn#942
- refactor: drop unused task codes of AIO #1021
- refactor: drop unused 'this' from lambda capture for unit-tests that verify replication factor #1005
- refactor: make ctor/dtor function of singleton classes to be private #921
- refactor: use the correct cast operation #916
Bug Fixes
Core
- fix(network): use multi io_services in asio XiaoMi/rdsn#1016
- fix: message body size unset after parsed which leads to large io throughputs XiaoMi/rdsn#1008
- fix: crash when huge write XiaoMi/rdsn#1099
- fix: use fsync() to prevent .app-info/.init-info lost risk on XFS after power outage XiaoMi/rdsn#1017
- fix: shutdown error while closing replica XiaoMi/rdsn#1052
Common
- fix: fail_point is failed when use FAIL_POINT_INJECT_VOID_F XiaoMi/rdsn#1047
- fix: update disk state when disk migrate completed XiaoMi/rdsn#1088
- fix: add a decree check when replica server receive the assign_primary request (#895) XiaoMi/rdsn#1044
- fix: remove min_live_node_count_for_unfreeze from meta_options to ensure it can be correctly updated dynamically XiaoMi/rdsn#1037
- fix(build): fix higher version cmake produce filename as 'compiler_depend.ts' bug XiaoMi/rdsn#1036
- fix(rocksdb): upgrade rocksdb to fix NotifyOnFlushCompleted() for atomic flush XiaoMi/rdsn#1001
- fix(one-time backup): fix bug when backup request is timeout XiaoMi/rdsn#990
- fix: the join point name of on_rpc_create_response XiaoMi/rdsn#984
- fix: add pragma once in backup_engine.h XiaoMi/rdsn#982
- fix: drop redundant recent_choose_primary_fail_count XiaoMi/rdsn#980
- fix: coredump when table name contains '_' and prometheus is enabled #828
- fix(collector): fix unexpected crash #984
- fix: shutdown error while closing replica #909
- fix(script): change the library version of poco #911
- fix: ops tools--minos2 have no bootstrap function #833
Performance
In this benchmark, we use new machines; so that the results are comparable, we also re-ran the benchmark against Pegasus Server 2.3:
- Machine parameters: DDR4 16 GB * 8 | Intel Xeon Silver 4210 * 2, 2.20 GHz/3.20 GHz | SATA SSD 480 GB * 8
- Cluster servers: 3 MetaServer nodes + 5 ReplicaServer nodes
- YCSB clients: 3 client nodes
- Request length: 1 KB (set/get)
- OS: CentOS 7, kernel 5.4.54-2.0.4.std7c.el7.x86_64
Pegasus Server 2.3
| Case | client and thread | R:W | R-QPS | R-Avg | R-P99 | W-QPS | W-Avg | W-P99 |
|---|---|---|---|---|---|---|---|---|
| Write Only | 3 clients * 15 threads | 0:1 | - | - | - | 48805 | 919 | 2124 |
| Read Only | 3 clients * 50 threads | 1:0 | 370068 | 402 | 988 | - | - | - |
| Read Write | 3 clients * 30 threads | 1:1 | 50762 | 532 | 5859 | 50759 | 1233 | 4162 |
| Read Write | 3 clients * 15 threads | 1:3 | 14471 | 443 | 3869 | 43425 | 884 | 1899 |
| Read Write | 3 clients * 15 threads | 1:30 | 1583 | 473 | 3432 | 47551 | 928 | 2066 |
| Read Write | 3 clients * 30 threads | 3:1 | 119093 | 406 | 3367 | 39693 | 1035 | 2581 |
| Read Write | 3 clients * 50 threads | 30:1 | 322904 | 435 | 1034 | 10762 | 882 | 1392 |
Pegasus Server 2.4
| Case | client and thread | R:W | R-QPS | R-Avg | R-P99 | W-QPS | W-Avg | W-P99 |
|---|---|---|---|---|---|---|---|---|
| Write Only | 3 clients * 15 threads | 0:1 | - | - | - | 56953 | 787 | 1786 |
| Read Only | 3 clients * 50 threads | 1:0 | 360642 | 413 | 984 | - | - | - |
| Read Write | 3 clients * 30 threads | 1:1 | 62572 | 464 | 5274 | 62561 | 985 | 3764 |
| Read Write | 3 clients * 15 threads | 1:3 | 16844 | 372 | 3980 | 50527 | 762 | 1551 |
| Read Write | 3 clients * 15 threads | 1:30 | 1861 | 381 | 3557 | 55816 | 790 | 1688 |
| Read Write | 3 clients * 30 threads | 3:1 | 140484 | 351 | 3277 | 46822 | 856 | 2044 |
| Read Write | 3 clients * 50 threads | 30:1 | 336106 | 419 | 1221 | 11203 | 763 | 1276 |
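As a rough reading of the two tables above, taking the write-only case (3 clients * 15 threads) as an example, the relative change from 2.3 to 2.4 is:

$$
\Delta_{QPS} = \frac{56953 - 48805}{48805} \approx +16.7\%, \qquad
\Delta_{P99} = \frac{1786 - 2124}{2124} \approx -15.9\%
$$

Read-only throughput is roughly flat (360642 vs. 370068 QPS), so in this benchmark the gains show up mainly on the write path, which is consistent with write-side changes such as the shared-log removal.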
Config-Update
+ [pegasus.server]
+ rocksdb_max_log_file_size = 8388608
+ rocksdb_log_file_time_to_roll = 86400
+ rocksdb_keep_log_file_num = 32
+ [replication]
+ plog_force_flush = false
- mutation_2pc_min_replica_count = 2
+ mutation_2pc_min_replica_count = 0 # 0 means its value is based on the table's max replica count
+ enable_direct_io = false # whether to enable direct I/O when downloading files from hdfs, default false
+ direct_io_buffer_pages = 64 # number of pages for the direct I/O buffer; the default of 64 is recommended based on our tests
+ max_concurrent_manual_emergency_checkpointing_count = 10
+ enable_latency_tracer_report = false
+ latency_tracer_counter_name_prefix = trace_latency
+ hdfs_read_limit_rate_mb_per_sec = 200
+ hdfs_write_limit_rate_mb_per_sec = 200
+ duplicate_log_batch_bytes = 0 # 0 means no batch before sending
+ [nfs]
- max_copy_rate_megabytes = 500
+ max_copy_rate_megabytes_per_disk = 0
- max_send_rate_megabytes = 500
+ max_send_rate_megabytes_per_disk = 0
+ [meta_server]
+ max_reserved_dropped_replicas = 0
+ bulk_load_verify_before_ingest = false
+ bulk_load_node_max_ingesting_count = 4
+ bulk_load_node_min_disk_count = 1
+ enable_concurrent_bulk_load = false
+ max_allowed_replica_count = 5
+ min_allowed_replica_count = 1
+ [task.LPC_WRITE_REPLICATION_LOG_SHARED]
+ enable_trace = true # if true, the task's latency will be traced when the global latency tracer is enabled
Contributors
acelyc111
cauchy1988
empiredan
foreverneverer
GehaFearless
GiantKing
happydongyaoyao
hycdong
levy5307
lidingshengHHU
neverchanje
padmejin
Smityz
totalo
WHBANG
xxmazha
ZhongChaoqiang