-
Notifications
You must be signed in to change notification settings - Fork 36
docs: add cbdb feature overview and architecture #13
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Merged
Merged
Changes from 19 commits
Commits
Show all changes
23 commits
Select commit
Hold shift + click to select a range
5af098a
docs: add cbdb feature overview and architecture
TomShawn f6d8bcc
Merge branch 'main' into cbdb-feature-overview
TomShawn ba9dae5
Delete cbdb-architecture.md
TomShawn e823de5
Update cbdb-overview.md
TomShawn b086012
refine
TomShawn 1263098
remove user scenarios
TomShawn 1ff9bf5
add some blank files
TomShawn 5378ebf
address comments
TomShawn 12c65c9
address comment
TomShawn 133c48b
modify overview to make it simpler
TomShawn 25e04fc
refine feature overview
TomShawn 06dc4da
Merge remote-tracking branch 'upstream/main' into cbdb-feature-overview
TomShawn dc25715
refine architecture translation
TomShawn 1e072cb
refne
TomShawn fe9b968
add cbdb feature overview
TomShawn 9708868
Apply suggestions from code review
TomShawn 0dc729e
Update docs/cbdb-overview.md
TomShawn f1bfe0f
add cbdb scenarios
TomShawn 2ea499c
Update cbdb-scenarios.md
TomShawn de06486
Apply suggestions from ljj
TomShawn 92db885
address comments from my-ship-it
TomShawn 55eda0d
fix misspelling
TomShawn 1785a5a
add scenarios for en
TomShawn File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,39 @@ | ||
| --- | ||
| title: Architecture | ||
| --- | ||
|
|
||
| # Cloudberry Database Architecture | ||
|
|
||
| This document introduces the product architecture and the implementation mechanism of the internal modules in Cloudberry Database. | ||
|
|
||
| In most cases, Cloudberry Database is similar to PostgreSQL in terms of SQL support, features, configuration options, and user functionalities. The experience users have with Cloudberry Database is similar to interacting with a standalone PostgreSQL system. | ||
|
|
||
| Cloudberry Database uses MPP (Massively Parallel Processing) architecture to store and process large volumes of data, by distributing data and computing workloads across multiple servers or hosts. | ||
|
|
||
| MPP, also known as the shared-nothing architecture, refers to systems with multiple hosts that work together to perform a task. Each host has its own processor, memory, disk, network resources, and operating system. Cloudberry Database uses this high-performance architecture to distribute data loads and can use all system resources in parallel to process queries. | ||
TomShawn marked this conversation as resolved.
Outdated
Show resolved
Hide resolved
|
||
|
|
||
| From users' view, Cloudberry Database is a complete relational database management system (RDBMS). In a physical view, it contains multiple PostgreSQL instances. To make these independent PostgreSQL instances work together, Cloudberry Database performs distributed cluster processing at different levels for data storage, computing, communication, and management. Although Cloudberry Database is a cluster, it hides all the distributed details from the user and provides a single logical database. This greatly eases the work of developers and operational staff. | ||
TomShawn marked this conversation as resolved.
Outdated
Show resolved
Hide resolved
|
||
|
|
||
| The architecture diagram of Cloudberry Database is as follows: | ||
|
|
||
| <!--  --> | ||
|
|
||
| - **Master node** (or control node) is the gateway to the Cloudberry Database system, which accepts client connections and SQL queries, and allocates tasks to data node instances. Users interact with Cloudberry Database by connecting to the master node using a client program (such as psql) or an application programming interface (API) (such as JDBC, ODBC, or libpq PostgreSQL C API). | ||
| - The master node acts as the global system directory, containing a set of system tables that record the metadata of Cloudberry Database. | ||
| - The master node does not store any user data. User data is stored only in the data node instances. | ||
TomShawn marked this conversation as resolved.
Outdated
Show resolved
Hide resolved
|
||
| - The master node performs authentication for client connections, processes SQL commands, distributes workload among segments, coordinates the results returned by each segment, and presents the final results to the client program. | ||
TomShawn marked this conversation as resolved.
Outdated
Show resolved
Hide resolved
|
||
| - Cloudberry Database uses Write Ahead Logging (WAL) for master node/standby mirroring. In WAL-based logging, all modifications are first written to a log before being written to the disk, which ensures the data integrity of any in-process operation. | ||
TomShawn marked this conversation as resolved.
Outdated
Show resolved
Hide resolved
|
||
|
|
||
| - **Segment** (or data node) instances are individual Postgres processes, each storing a portion of the data and executing the corresponding part of the query. When a user connects to the database through the master node and submits a query request, a process is created on each segment node to handle the query. User-defined tables and their indexes are distributed across the available segments, and each segment node contains distinct portions of the data. The processes of data processing runs in the corresponding segment. Users interact with segments through the master, and the segment operate on servers known as the segment host. | ||
|
|
||
| Typically, a segment host runs 2 to 8 data nodes, depending on the processor, memory, storage, network interface, and workload. The configuration of the segment host needs to be balanced, because evenly distributing the data and workload among segments is the key to achieving optimal performance with Cloudberry Database, which allows all segments to start processing a task and finish the work at the same time. | ||
|
|
||
| - **Interconnect** is the network layer in the Cloudberry Database system architecture. Interconnect refers to the network infrastructure upon which the communication between the master node and the segments relies, which uses a standard Ethernet switching structure. | ||
|
|
||
| For performance reasons, it is recommended to use a network with a speed of 10 GB or faster. By default, the Interconnect module uses the UDP protocol with flow control (UDPIFC) for communication to send messages through the network. The data packet verification performed by Cloudberry Database exceeds the scope provided by UDP, meaning that its reliability is equivalent to using the TCP protocol, and its performance and scalability surpass the TCP protocol. If the Interconnect is changed to use the TCP protocol, the scalability of Cloudberry Database is limited to 1000 segments. This limit does not apply when UDPIFC is used as the default protocol. | ||
TomShawn marked this conversation as resolved.
Outdated
Show resolved
Hide resolved
|
||
|
|
||
| - Cloudberry Database uses Multiversion Concurrency Control (MVCC) to ensure data consistency. This means that when querying the database, each transaction only sees a snapshot of the data, ensuring that current transactions do not see modifications made by other transactions on the same records. This provides transaction isolation for each transaction in the database. | ||
TomShawn marked this conversation as resolved.
Outdated
Show resolved
Hide resolved
|
||
|
|
||
| MVCC minimizes lock contention to ensure performance in a multi-user environment. This is done by avoiding explicit locking for database transactions. | ||
|
|
||
| In concurrency control, MVCC does not introduce conflicts for query (read) locks and write locks. In addition, read and write operations do not block each other. This is the biggest advantages of using MVCC over using lock mechanism. | ||
TomShawn marked this conversation as resolved.
Outdated
Show resolved
Hide resolved
|
||
Oops, something went wrong.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Uh oh!
There was an error while loading. Please reload this page.