Before any operation on Ceph, it’s always useful to dedicate a console to watching a ceph -w command.
Set the noout and norebalance global flags for your OSDs. You can do it from the Proxmox VE GUI or the command line. This will turn the Ceph status to HEALTH_WARN: noout,norebalance flag(s) set.
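If you prefer the command line, these flags can be set (and later cleared) with the standard Ceph commands below, shown here as a quick sketch:

ceph osd set noout        # don't mark stopped OSDs out of the cluster
ceph osd set norebalance  # don't start moving data around
# Once the maintenance is over and everything is back up:
ceph osd unset noout
ceph osd unset norebalance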
Safely shut down the Proxmox node (use the GUI).
Copy the old NVMe to the new one. You can also use dd if you have USB NVMe adapters.
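For the raw copy, a dd invocation could look like the sketch below; the device names are examples only, so verify them with lsblk before writing anything:

lsblk -o NAME,SIZE,MODEL                                          # identify the old and the new disk first
dd if=/dev/nvme0n1 of=/dev/sdX bs=4M status=progress conv=fsync   # old disk -> new disk over the USB adapter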
Start the Proxmox node with the new disk. From now on, all the following steps are done online, with services and Ceph running.
Use fdisk to rearrange partitions. You might have to delete & recreate the last partition.
root@pve3:~# fdisk /dev/nvme0n1
Welcome to fdisk (util-linux 2.36.1).
Changes will remain in memory only, until you decide to write them.
Be careful before using the write command.
GPT PMBR size mismatch (1000215215 != 7814037167) will be corrected by write.
The backup GPT table is not on the end of the device. This problem will be corrected by write.
Note the GPT PMBR size mismatch, showing that the new disk is 8 times bigger. List the partitions with p.
Command (m for help): p
Disk /dev/nvme0n1: 3.64 TiB, 4000787030016 bytes, 7814037168 sectors
Disk model: CT4000P3SSD8
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: gpt
Disk identifier: 6D06B1F4-8489-4C56-92D3-1EFCA1183DBD
Device Start End Sectors Size Type
/dev/nvme0n1p1 34 2047 2014 1007K BIOS boot
/dev/nvme0n1p2 2048 1050623 1048576 512M EFI System
/dev/nvme0n1p3 1050624 1000215060 999164437 476.4G Linux LVM
Fortunately, and in most cases, the partition holding the Linux LVM PV is the last one. We need to carefully note the ‘Start’ sector of that partition, delete it, and recreate it, since we want to use the new free space.
Command (m for help): d
Partition number (1-3, default 3): 3
Partition 3 has been deleted.
Command (m for help): n
Partition number (3-128, default 3):
First sector (1050624-7814037134, default 1050624):
Last sector, +/-sectors or +/-size{K,M,G,T,P} (1050624-7814037134, default 7814037134):
Created a new partition 3 of type 'Linux filesystem' and of size 3.6 TiB.
Partition #3 contains a LVM2_member signature.
Do you want to remove the signature? [Y]es/[N]o: N
Command (m for help): p
Disk /dev/nvme0n1: 3.64 TiB, 4000787030016 bytes, 7814037168 sectors
Disk model: CT4000P3SSD8
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: gpt
Disk identifier: 6D06B1F4-8489-4C56-92D3-1EFCA1183DBD
Device Start End Sectors Size Type
/dev/nvme0n1p1 34 2047 2014 1007K BIOS boot
/dev/nvme0n1p2 2048 1050623 1048576 512M EFI System
/dev/nvme0n1p3 1050624 7814037134 7812986511 3.6T Linux filesystem
Command (m for help): t
Partition number (1-3, default 3):
Partition type or alias (type L to list all): 30
Changed type of partition 'Linux filesystem' to 'Linux LVM'.
Command (m for help): p
Disk /dev/nvme0n1: 3.64 TiB, 4000787030016 bytes, 7814037168 sectors
Disk model: CT4000P3SSD8
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: gpt
Disk identifier: 6D06B1F4-8489-4C56-92D3-1EFCA1183DBD
Device Start End Sectors Size Type
/dev/nvme0n1p1 34 2047 2014 1007K BIOS boot
/dev/nvme0n1p2 2048 1050623 1048576 512M EFI System
/dev/nvme0n1p3 1050624 7814037134 7812986511 3.6T Linux LVM
Command (m for help): w
The partition table has been altered.
Syncing disks.
As you can see, all the defaults were fine when re-creating the partition; we only had to set the partition type (t) to 30 (Linux LVM) and keep the existing LVM2_member signature.
Now that the partition table reflects the new size, we can resize the LVM PV and check the new free space:
root@pve3:~# pvresize /dev/nvme0n1p3
Physical volume "/dev/nvme0n1p3" changed
1 physical volume(s) resized or updated / 0 physical volume(s) not resized
root@pve3:~# vgdisplay pve
--- Volume group ---
VG Name pve
System ID
Format lvm2
Metadata Areas 1
Metadata Sequence No 491
VG Access read/write
VG Status resizable
MAX LV 0
Cur LV 14
Open LV 9
Max PV 0
Cur PV 1
Act PV 1
VG Size <3.64 TiB
PE Size 4.00 MiB
Total PE 953733
Alloc PE / Size 82142 / <320.87 GiB
Free PE / Size 871591 / 3.32 TiB
VG UUID 1HhlCb-UbmY-1Smz-eEEa-YmzS-qdIB-YK3WoY
root@pve3:~# lvs
LV VG Attr LSize Pool Origin Data% Meta% Move Log Cpy%Sync Convert
cephusbdata1 USB -wi-ao---- 1.95t
bluestore-block1 patriot-vg -wi-ao---- 370.00g
bluestore-block2 patriot-vg -wi-ao---- 370.00g
patriot-tmp patriot-vg -wi-ao---- 75.00g
bluestore-db-usb1 pve -wi-ao---- 80.00g
bluestore-db1 pve -wi-ao---- 30.00g
bluestore-db2 pve -wi-ao---- 30.00g
data pve twi-aotz-- 154.00g 61.59 21.42
root pve -wi-ao---- 20.25g
swap pve -wi-ao---- 6.12g
vm-112-disk-0 pve Vwi-a-tz-- 10.00g data 99.42
vm-113-disk-0 pve Vwi-aotz-- 12.00g data 100.00
vm-122-disk-0 pve Vwi-aotz-- 4.00g data 99.65
vm-129-disk-0 pve Vwi-aotz-- 32.00g data 96.14
vm-203-disk-0 pve Vwi-aotz-- 16.00g data 99.23
vm-223-disk-0 pve Vwi-a-tz-- 8.00g data 30.92
vm-303-cloudinit pve Vwi-a-tz-- 4.00m data 9.38
vm-303-disk-0 pve Vwi-a-tz-- 20.00g data 99.02
We’re only interested in the bluestore-* LVs and cephusbdata1, since they are not in the pve VG. The USB VG is on a slow USB disk, and patriot-vg is on an aging SATA SSD. I want to move everything to the new NVMe, i.e. into the pve VG.
Since NVMe is faster, I won’t split bluestore-block and bluestore-db across different devices. I’ll create one new big OSD on a new LV, with the data block, DB (and WAL) on the same LV.
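A minimal sketch of how such an LV and OSD can be created from the command line (the LV name and size below are hypothetical, and the author's exact commands are not shown in this excerpt):

lvcreate -L 3T -n bluestore-block3 pve                           # new LV in the enlarged pve VG
ceph-volume lvm create --bluestore --data pve/bluestore-block3   # OSD with block, DB and WAL on that single LV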
stderr: 2023-05-08T19:34:25.969+0200 7fe8b6e93700 -1 auth: unable to find a keyring on /etc/pve/priv/ceph.client.bootstrap-osd.keyring: (2) No such file or directory
2023-05-08T19:34:25.969+0200 7fe8b6e93700 -1 AuthRegistry(0x7fe8b0060800) no keyring found at /etc/pve/priv/ceph.client.bootstrap-osd.keyring, disabling cephx
I mostly use the ceph -w command and the PVE GUI, but you might find more appropriate commands in the documentation for your case. The PVE GUI is quite basic, and its recovery/rebalance time estimate is always wrong.
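A few standard Ceph commands that are convenient for watching the recovery, as a sketch:

ceph -w               # follow the cluster log in real time
ceph -s               # one-shot status with recovery/rebalance progress
ceph osd df tree      # per-OSD usage, weights and PG counts
ceph health detail    # details behind any HEALTH_WARN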
After some time, you should get something like this:
The ‘4 pools have too few placement groups’ message is just because I set autoscaling to warn mode, and all the OSDs are still present.
Remove the old OSDs
It’s now time to set the old OSDs to out, if not already done, and check that the health is still OK. Then progressively shut them down, monitoring the health status. You can do everything from the PVE GUI, but for some operations you might have to log in using root / PAM.
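The command-line equivalent is roughly the following (same placeholder style as later in this post):

ceph osd out {osd-num}              # mark the old OSD out so its data gets migrated
systemctl stop ceph-osd@{osd-num}   # once health is OK again, stop the OSD daemon on its node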
If the health does not show any problem, you can select the old OSDs (now down) and destroy them from the GUI. It will offer to clean up the disks.
And keep only the new OSDs
Tuning
If you have a very large pool, you can activate the bulk flag for that pool. In my case:
root@pve3:~# ceph osd pool set cephfs_data bulk true
set pool 2 bulk to true
root@pve3:~# ceph osd pool autoscale-status
POOL SIZE TARGET SIZE RATE RAW CAPACITY RATIO TARGET RATIO EFFECTIVE RATIO BIAS PG_NUM NEW PG_NUM AUTOSCALE BULK
.mgr 40192k 3.0 9900G 0.0000 1.0 1 on False
cephfs_metadata 839.0M 3.0 9900G 0.0002 4.0 16 on False
cephblock 307.0G 3.0 9900G 0.0930 1.0 16 off False
cephfs_data_cache 52606M 51200M 2.0 9900G 0.0104 1.0 2 off False
.rgw.root 1327 3.0 9900G 0.0000 1.0 4 32 warn False
default.rgw.log 182 3.0 9900G 0.0000 1.0 4 32 warn False
default.rgw.control 0 3.0 9900G 0.0000 1.0 4 32 warn False
default.rgw.meta 0 3.0 9900G 0.0000 4.0 4 32 warn False
cephfs_data 1041G 3.0 9900G 0.3156 1.0 64 off True
Wait for the autoscaler to compute new PG numbers from the new OSD count, then turn autoscaling back on.
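Re-enabling the autoscaler is done per pool, e.g.:

ceph osd pool set {pool-name} pg_autoscale_mode on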
Possible problems
Running out of PGs
This can happen if you reduce the number of OSDs and end up with more PGs allocated than max_pg_per_osd allows. I had the problem because autoscaling was on, and Ceph does not exclude the out OSDs from the OSD count used to compute the number of PGs for each pool. Also, I had set a target ratio, which takes precedence. So I had to switch off autoscaling, disable the target ratio, and adjust the PG number for some pools to fit the default maximum of 100 PGs per OSD.
Useful commands:
ceph osd pool autoscale-status                     # Check PGs and autoscale recommendations.
ceph osd pool set {pool-name} target_size_ratio 0  # Disable the TARGET RATIO
ceph osd pool autoscale-status
ceph osd pool set {pool-name} pg_num 64
ceph osd pool autoscale-status
ceph osd pool set {pool-name} pgp_num 64           # You need to adjust both pg_num & pgp_num for the rebalance to occur.
ceph osd down {osd-num}
ceph osd dump | grep repli
The PG adjustments for pools might take time, and might occur after rebalancing (be patient).
PG stuck in rebalancing
First, reweighting the OSDs should prevent rebalancing problems.
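As a sketch, the usual reweighting commands are:

ceph osd reweight-by-utilization              # let Ceph compute new reweight values from current usage
ceph osd crush reweight osd.{osd-num} 3.64    # or set a CRUSH weight by hand (conventionally the capacity in TiB)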
Or: How to remove an unwanted or failed Ceph service created by entering a wrong name
This has no consequences, but if you do not want them in your startup logs any more, you might have to remove them manually. WARNING: with the command below, make sure you grep -v the
Don’t be surprised if you compare with a native disk: it’s a network cluster! Also, the above command can quickly make ceph-osd crash, especially if you use a high iodepth and numjobs. Increase the size according to your hardware performance.
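The benchmark command itself is not shown in this excerpt; as a purely illustrative sketch (not the author's exact command), a fio run against a file on a Ceph-backed mount could look like this, with size, iodepth and numjobs tuned down to what your hardware tolerates (the path is an example):

fio --name=cephtest --filename=/mnt/pve/cephfs/fio.test --size=4G \
    --rw=randwrite --bs=4k --ioengine=libaio --direct=1 \
    --iodepth=16 --numjobs=2 --runtime=60 --time_based --group_reporting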