components:
(1) clients: run in user space; maintain their own cache, independent of the page or
buffer cache
(2) OSD(Object Storage Device): communicate directly with clients, manage
file IO
(3) MDS(Metadata Server): metadata operation
(4) monitor(5.4):
1/ collect failure reports from the cluster and filter out transient or systemic problems
2/ provide consistent access to the cluster map via election, active peer
monitoring, short-term leases and 2PC
How content is stored:
(1) an object is mapped to a PG (Placement Group), a logical group of objects
(2) a PG is mapped to OSDs (Object Storage Daemons), the storage units of RADOS
- multiple OSDs share CPU, memory and bandwidth
- the unit of replication, migration, recovery
(3) how an OSD writes to storage:
- data part: goes directly to the storage device
- metadata part: goes through RocksDB, then BlueFS, then the storage device
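The object -> PG -> OSD mapping in (1) and (2) can be sketched roughly as below. This is an illustration under stated assumptions, not Ceph's code: the oid string format, the helper names and the SHA-256 based hash are all hypothetical, pg_num is assumed to be a power of two, and place() uses rendezvous hashing as a simple stand-in for CRUSH (the real CRUSH walks a weighted cluster hierarchy).

```python
import hashlib
from typing import List


def stable_hash(data: str) -> int:
    """Deterministic 64-bit hash (Python's built-in hash() is salted per process)."""
    return int.from_bytes(hashlib.sha256(data.encode()).digest()[:8], "big")


def object_id(ino: int, ono: int) -> str:
    # file: (ino, ono) -> oid. The string format is illustrative only.
    return f"{ino:x}.{ono:08x}"


def pg_id(oid: str, pg_num: int) -> int:
    # object: hash(oid) & mask -> pgid. Assumes pg_num is a power of two.
    mask = pg_num - 1
    return stable_hash(oid) & mask


def place(pgid: int, osds: List[int], replicas: int) -> List[int]:
    # Stand-in for CRUSH(pgid): rank OSDs by a per-(pg, osd) hash and take
    # the top ones. Deterministic, so any party can recompute it, and the
    # result is ordered: the first entry acts as the primary.
    ranked = sorted(osds, key=lambda osd: stable_hash(f"{pgid}:{osd}"), reverse=True)
    return ranked[:replicas]


oid = object_id(ino=0x1234, ono=0)
pgid = pg_id(oid, pg_num=64)
acting = place(pgid, osds=list(range(10)), replicas=3)
primary, *replica_osds = acting
print(oid, pgid, acting)
```

Because every step is a pure function of its inputs, a client, an MDS or an OSD that holds the same OSD list computes the same ordered result without consulting any lookup table.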
design feature:
(1) decouple data and metadata, respectively managed by OSD and MDS
1/ distribute workload (GFS: separate control flow and data flow)
2/ delegate low-level block allocation to individual devices
(2) dynamic distributed metadata management: Dynamic Subtree Partitioning
(3) reliable autonomic distributed object storage
OSD manages data migration, replication, failure detection and failure
recovery
implementation details:
(1) client synchronization: for multiple concurrent writers,
1/ revoke read caching and write buffering capabilities, sync dirty data
2/ relax consistency level(3.2)
(2) MDS optimize for common metadata access(3.3)
(3) file: (ino, ono) -> oid
object: hash(oid) & mask -> pgid
pg(Placement Group): CRUSH(pgid) -> (osd1, ..., osd n)
note:
1/ CRUSH() is deterministic, so clients, MDS and OSDs can each compute it;
this removes the need to maintain and distribute object lists
2/ (osd1, ..., osd n) is ordered, which determines the primary and the replicas
(4) inode embedded within directory at MDS: locality and prefetching(4.1)
(5) anchor table: optimize for multiple hard link inodes(4.1)
(6) MDS cluster distributes cached metadata hierarchically across nodes(4.1)
(7) heavily read directories are replicated across nodes;
large directories, or those under heavy write workload, are hashed by file name across nodes(4.1)
(8) inode content: split into security, file and immutable groups, each assigned
a different consistency level(4.2)
(9) OSD data is written and replicated in sync mode(ack after primary and
replicas all ack)(5.2)
(10) object storage in user-space: self-implemented IO scheduler, cache
management, block allocation(5.6)
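The synchronous replication in (9) can be sketched as below. This is a minimal in-memory model under assumptions, not Ceph's implementation: the OSD class, its up flag and replicated_write() are hypothetical names, and the ordering chosen here (forward to the replicas first, apply on the primary last) is one way to satisfy "ack after primary and replicas all ack".

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class OSD:
    osd_id: int
    up: bool = True
    store: Dict[str, bytes] = field(default_factory=dict)

    def apply(self, oid: str, data: bytes) -> bool:
        """Apply a write locally; the return value models the ack."""
        if not self.up:
            return False
        self.store[oid] = data
        return True


def replicated_write(acting: List[OSD], oid: str, data: bytes) -> bool:
    # acting is ordered: the first OSD is the primary, the rest are replicas.
    primary, replicas = acting[0], acting[1:]
    if not primary.up:
        return False
    # The primary forwards the write and waits for every replica to ack.
    replica_acks = [r.apply(oid, data) for r in replicas]
    if not all(replica_acks):
        # No client ack. This sketch does not roll back replicas that
        # already applied the write.
        return False
    # The client is acked only after the primary has applied the write too.
    return primary.apply(oid, data)


osds = [OSD(0), OSD(1), OSD(2)]
ok = replicated_write(osds, "1234.00000000", b"hello")
osds[2].up = False
failed = replicated_write(osds, "1234.00000001", b"world")
print(ok, failed)
```

The client sees a single ack from the primary, so the cost of replication (one extra round trip inside the cluster) is hidden behind the same client-facing write path.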