Pivert's Blog

Upgrade NVMe on Linux / Proxmox / Ceph



I just upgraded my tiny 512GB NVMe SSD to a 4TB one on each node. I started with the host pve3. Here are the steps.

Prerequisites

  • You have a new NVMe SSD of at least the same size.
  • You have a way to clone the two NVMe SSDs (I opted for an ORICO NVMe SSD clone device)
  • Check that your Proxmox cluster is OK with pvecm status
  • Check ceph status shows HEALTH_OK

Glossary

  • LVM : Logical Volume Manager
  • PV : LVM Physical Volume
  • VG : LVM Volume Group
  • LV : LVM Logical Volume
  • OSD : Ceph Object Storage Daemon
  • PVE : Proxmox Virtual Environment, or Proxmox VE
  • PVE GUI : Proxmox VE web interface

Procedure

  • Before any operation on ceph, it’s always useful to dedicate a console to watch a ceph -w command.
  • Set the noout and norebalance global flags for your OSDs. You can do it from the Proxmox VE GUI or the command line (see the sketch after this list). This will turn the Ceph status to HEALTH_WARN : noout,norebalance flag(s) set
  • Safely shut down the Proxmox node (use the GUI)
  • Copy the old NVMe to the new one. You can also use dd if you have USB NVMe adapters (also shown in the sketch after this list).
  • Start the Proxmox node with the new disk. From now on, all the next steps are done online, with services and Ceph running.
  • Use fdisk to rearrange partitions. You might have to delete & recreate the last partition.
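
For reference, here is a minimal command-line sketch of the flag and copy steps above. The ceph commands are the counterparts of the unset commands shown later in this post; the dd device names are placeholders, so double-check them with lsblk before writing anything.

ceph osd set noout           # keep OSDs "in" while the node is down
ceph osd set norebalance     # avoid rebalancing during the maintenance
# Optional clone over two USB NVMe adapters instead of a hardware cloner.
# /dev/sdX = old disk, /dev/sdY = new disk -- placeholders, verify with lsblk first!
dd if=/dev/sdX of=/dev/sdY bs=4M status=progress conv=fsync

Back on the node booted from the new disk, the fdisk session looks like this: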
root@pve3:~# fdisk /dev/nvme0n1

Welcome to fdisk (util-linux 2.36.1).
Changes will remain in memory only, until you decide to write them.
Be careful before using the write command.

GPT PMBR size mismatch (1000215215 != 7814037167) will be corrected by write.
The backup GPT table is not on the end of the device. This problem will be corrected by write.

Note the GPT PMBR size mismatch, showing that the new disk is about 8 times bigger. List the partitions with p.

Command (m for help): p

Disk /dev/nvme0n1: 3.64 TiB, 4000787030016 bytes, 7814037168 sectors
Disk model: CT4000P3SSD8                            
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: gpt
Disk identifier: 6D06B1F4-8489-4C56-92D3-1EFCA1183DBD

Device           Start        End   Sectors   Size Type
/dev/nvme0n1p1      34       2047      2014  1007K BIOS boot
/dev/nvme0n1p2    2048    1050623   1048576   512M EFI System
/dev/nvme0n1p3 1050624 1000215060 999164437 476.4G Linux LVM

Hopefully, and in most cases, the Linux LVM partition holding the PV is the last one on the disk. So we need to carefully note the ‘Start’ of that partition, delete it, and recreate it so that it spans the new free space.

Command (m for help): d
Partition number (1-3, default 3): 3

Partition 3 has been deleted.

Command (m for help): n
Partition number (3-128, default 3): 
First sector (1050624-7814037134, default 1050624): 
Last sector, +/-sectors or +/-size{K,M,G,T,P} (1050624-7814037134, default 7814037134): 

Created a new partition 3 of type 'Linux filesystem' and of size 3.6 TiB.
Partition #3 contains a LVM2_member signature.

Do you want to remove the signature? [Y]es/[N]o: N

Command (m for help): p

Disk /dev/nvme0n1: 3.64 TiB, 4000787030016 bytes, 7814037168 sectors
Disk model: CT4000P3SSD8                            
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: gpt
Disk identifier: 6D06B1F4-8489-4C56-92D3-1EFCA1183DBD

Device           Start        End    Sectors  Size Type
/dev/nvme0n1p1      34       2047       2014 1007K BIOS boot
/dev/nvme0n1p2    2048    1050623    1048576  512M EFI System
/dev/nvme0n1p3 1050624 7814037134 7812986511  3.6T Linux filesystem

Command (m for help): t
Partition number (1-3, default 3): 
Partition type or alias (type L to list all): 30

Changed type of partition 'Linux filesystem' to 'Linux LVM'.

Command (m for help): p
Disk /dev/nvme0n1: 3.64 TiB, 4000787030016 bytes, 7814037168 sectors
Disk model: CT4000P3SSD8                            
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: gpt
Disk identifier: 6D06B1F4-8489-4C56-92D3-1EFCA1183DBD

Device           Start        End    Sectors  Size Type
/dev/nvme0n1p1      34       2047       2014 1007K BIOS boot
/dev/nvme0n1p2    2048    1050623    1048576  512M EFI System
/dev/nvme0n1p3 1050624 7814037134 7812986511  3.6T Linux LVM

Command (m for help): w
The partition table has been altered.
Syncing disks.

As you can see, all the defaults were fine for re-creating the partition. We only had to set the partition type (t) to 30 (Linux LVM).

We can check the new partition table

root@pve3:~# sfdisk -d /dev/nvme0n1 
label: gpt
label-id: 6D06B1F4-8489-4C56-92D3-1EFCA1183DBD
device: /dev/nvme0n1
unit: sectors
first-lba: 34
last-lba: 7814037134
sector-size: 512

/dev/nvme0n1p1 : start=          34, size=        2014, type=21686148-6449-6E6F-744E-656564454649, uuid=40F0D63F-33A4-40A3-8547-C9747E3DD8F6
/dev/nvme0n1p2 : start=        2048, size=     1048576, type=C12A7328-F81F-11D2-BA4B-00A0C93EC93B, uuid=A150376E-A003-4C05-AE41-28E6170D7AC6
/dev/nvme0n1p3 : start=     1050624, size=  7812986511, type=E6D6D379-F507-44C2-A23C-238F2A3DF928, uuid=FCD0E153-944A-BD4E-A82A-D47A8C4291B8
  • Now that the partition table reflects the new size, we can resize the LVM PV and check the new free space
root@pve3:~# pvresize /dev/nvme0n1p3 
  Physical volume "/dev/nvme0n1p3" changed
  1 physical volume(s) resized or updated / 0 physical volume(s) not resized
root@pve3:~# vgdisplay pve
  --- Volume group ---
  VG Name               pve
  System ID             
  Format                lvm2
  Metadata Areas        1
  Metadata Sequence No  491
  VG Access             read/write
  VG Status             resizable
  MAX LV                0
  Cur LV                14
  Open LV               9
  Max PV                0
  Cur PV                1
  Act PV                1
  VG Size               <3.64 TiB
  PE Size               4.00 MiB
  Total PE              953733
  Alloc PE / Size       82142 / <320.87 GiB
  Free  PE / Size       871591 / 3.32 TiB
  VG UUID               1HhlCb-UbmY-1Smz-eEEa-YmzS-qdIB-YK3WoY

Cool, the Free PE / Size line shows 3.32 TiB of unused space.

Let’s check the current data allocation

root@pve3:~# lvs
  LV                VG         Attr       LSize   Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert
  cephusbdata1      USB        -wi-ao----   1.95t                                                    
  bluestore-block1  patriot-vg -wi-ao---- 370.00g                                                    
  bluestore-block2  patriot-vg -wi-ao---- 370.00g                                                    
  patriot-tmp       patriot-vg -wi-ao----  75.00g                                                    
  bluestore-db-usb1 pve        -wi-ao----  80.00g                                                    
  bluestore-db1     pve        -wi-ao----  30.00g                                                    
  bluestore-db2     pve        -wi-ao----  30.00g                                                    
  data              pve        twi-aotz-- 154.00g             61.59  21.42                           
  root              pve        -wi-ao----  20.25g                                                    
  swap              pve        -wi-ao----   6.12g                                                    
  vm-112-disk-0     pve        Vwi-a-tz--  10.00g data        99.42                                  
  vm-113-disk-0     pve        Vwi-aotz--  12.00g data        100.00                                 
  vm-122-disk-0     pve        Vwi-aotz--   4.00g data        99.65                                  
  vm-129-disk-0     pve        Vwi-aotz--  32.00g data        96.14                                  
  vm-203-disk-0     pve        Vwi-aotz--  16.00g data        99.23                                  
  vm-223-disk-0     pve        Vwi-a-tz--   8.00g data        30.92                                  
  vm-303-cloudinit  pve        Vwi-a-tz--   4.00m data        9.38                                   
  vm-303-disk-0     pve        Vwi-a-tz--  20.00g data        99.02 

We’re only interested in the bluestore-* and cephusbdata1 volumes, since they are not in the pve VG. The USB VG is on a slow USB disk, and the patriot-vg is on an aging SATA SSD. I want to move everything to the new NVMe, i.e. into the pve VG.

Since NVMe is faster, I won’t split the bluestore block and db onto different devices. I’ll create a single big OSD on a new LV, with the data block, db (and WAL) on the same LV.

Create a new LV for the Ceph bluestore OSD

root@pve3:~# lvcreate -L 3300G -n bluestore-nvme pve
  Logical volume "bluestore-nvme" created.

Create the new OSD

root@pve3:~# ceph-volume lvm prepare --data pve/bluestore-nvme
Running command: /usr/bin/ceph-authtool --gen-print-key
Running command: /usr/bin/ceph --cluster ceph --name client.bootstrap-osd --keyring /var/lib/ceph/bootstrap-osd/ceph.keyring -i - osd new f13c6fe2-80a1-4da7-866e-e15493f23b5b
Running command: /usr/bin/ceph-authtool --gen-print-key
Running command: /usr/bin/mount -t tmpfs tmpfs /var/lib/ceph/osd/ceph-9
--> Executable selinuxenabled not in PATH: /usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
Running command: /usr/bin/chown -h ceph:ceph /dev/pve/bluestore-nvme
Running command: /usr/bin/chown -R ceph:ceph /dev/dm-21
Running command: /usr/bin/ln -s /dev/pve/bluestore-nvme /var/lib/ceph/osd/ceph-9/block
Running command: /usr/bin/ceph --cluster ceph --name client.bootstrap-osd --keyring /var/lib/ceph/bootstrap-osd/ceph.keyring mon getmap -o /var/lib/ceph/osd/ceph-9/activate.monmap
 stderr: 2023-05-08T19:34:25.969+0200 7fe8b6e93700 -1 auth: unable to find a keyring on /etc/pve/priv/ceph.client.bootstrap-osd.keyring: (2) No such file or directory
2023-05-08T19:34:25.969+0200 7fe8b6e93700 -1 AuthRegistry(0x7fe8b0060800) no keyring found at /etc/pve/priv/ceph.client.bootstrap-osd.keyring, disabling cephx
 stderr: got monmap epoch 6
--> Creating keyring file for osd.9
Running command: /usr/bin/chown -R ceph:ceph /var/lib/ceph/osd/ceph-9/keyring
Running command: /usr/bin/chown -R ceph:ceph /var/lib/ceph/osd/ceph-9/
Running command: /usr/bin/ceph-osd --cluster ceph --osd-objectstore bluestore --mkfs -i 9 --monmap /var/lib/ceph/osd/ceph-9/activate.monmap --keyfile - --osd-data /var/lib/ceph/osd/ceph-9/ --osd-uuid f13c6fe2-80a1-4da7-866e-e15493f23b5b --setuser ceph --setgroup ceph
 stderr: 2023-05-08T19:34:26.185+0200 7f9bf435e240 -1 bluestore(/var/lib/ceph/osd/ceph-9/) _read_fsid unparsable uuid
--> ceph-volume lvm prepare successful for: pve/bluestore-nvme

You can check with ceph status, ceph osd status and via the PVE GUI that you have a new OSD. It exists but is not up yet.

Activate the new OSD

When preparing the OSD, we noticed that it received ID 9. Take its fsid and activate it.

root@pve3:~# cat /var/lib/ceph/osd/ceph-9/fsid
f13c6fe2-80a1-4da7-866e-e15493f23b5b
root@pve3:~# ceph-volume lvm activate 9 f13c6fe2-80a1-4da7-866e-e15493f23b5b
Running command: /usr/bin/chown -R ceph:ceph /var/lib/ceph/osd/ceph-9
Running command: /usr/bin/ceph-bluestore-tool --cluster=ceph prime-osd-dir --dev /dev/pve/bluestore-nvme --path /var/lib/ceph/osd/ceph-9 --no-mon-config
Running command: /usr/bin/ln -snf /dev/pve/bluestore-nvme /var/lib/ceph/osd/ceph-9/block
Running command: /usr/bin/chown -h ceph:ceph /var/lib/ceph/osd/ceph-9/block
Running command: /usr/bin/chown -R ceph:ceph /dev/dm-21
Running command: /usr/bin/chown -R ceph:ceph /var/lib/ceph/osd/ceph-9
Running command: /usr/bin/systemctl enable ceph-volume@lvm-9-f13c6fe2-80a1-4da7-866e-e15493f23b5b
 stderr: Created symlink /etc/systemd/system/multi-user.target.wants/ceph-volume@lvm-9-f13c6fe2-80a1-4da7-866e-e15493f23b5b.service → /lib/systemd/system/ceph-volume@.service.
Running command: /usr/bin/systemctl enable --runtime ceph-osd@9
 stderr: Created symlink /run/systemd/system/ceph-osd.target.wants/ceph-osd@9.service → /lib/systemd/system/ceph-osd@.service.
Running command: /usr/bin/systemctl start ceph-osd@9
--> ceph-volume lvm activate successful for osd ID: 9

After activating it, ceph osd status will show the new OSD with STATE exists,up.

Proceed to other nodes

Repeat the above procedure on the other nodes.

Rebalance

root@pve3:~# ceph osd unset norebalance
norebalance is unset
root@pve3:~# ceph osd unset noout
noout is unset

The rebalance process will start immediately. Note that we now have a much bigger Ceph storage capacity thanks to the newly added OSDs.

It will use as much bandwidth as is available, and throttle itself to give priority to client activity on the Ceph cluster.
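
If recovery traffic still hurts your workloads, backfill activity can be throttled with the standard OSD options. A minimal sketch; the values are only examples, to be tuned for your cluster and reverted once the rebalance is done:

ceph config set osd osd_max_backfills 1          # limit concurrent backfills per OSD
ceph config set osd osd_recovery_max_active 1    # limit concurrent recovery ops per OSD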

Reweight the old OSDs

Instead of marking them out, I’ll reweight them to 0.00, to first empty them and speed up the rebalance process.

Use ceph osd reweight <osdid> 0.00 for all the OSDs to be removed.

root@pve3:~# for x in {0..8}; do echo ceph osd reweight $x 0.00; done;
ceph osd reweight 0 0.00
ceph osd reweight 1 0.00
ceph osd reweight 2 0.00
ceph osd reweight 3 0.00
ceph osd reweight 4 0.00
ceph osd reweight 5 0.00
ceph osd reweight 6 0.00
ceph osd reweight 7 0.00
ceph osd reweight 8 0.00
root@pve3:~# ceph osd reweight 0 0.00
ceph osd reweight 1 0.00
ceph osd reweight 2 0.00
ceph osd reweight 3 0.00
ceph osd reweight 4 0.00
ceph osd reweight 5 0.00
ceph osd reweight 6 0.00
ceph osd reweight 7 0.00
ceph osd reweight 8 0.00
reweighted osd.0 to 0 (0)
reweighted osd.1 to 0 (0)
reweighted osd.2 to 0 (0)
reweighted osd.3 to 0 (0)
reweighted osd.4 to 0 (0)
reweighted osd.5 to 0 (0)
reweighted osd.6 to 0 (0)
reweighted osd.7 to 0 (0)
reweighted osd.8 to 0 (0)
root@pve3:~# 
root@pve3:~# ceph osd status
ID  HOST   USED  AVAIL  WR OPS  WR DATA  RD OPS  RD DATA  STATE      
 0  pve2     0      0       0        0       0        0   exists,up  
 1  pve1     0      0       0        0       0        0   exists,up  
 2  pve3     0      0       0        0       0        0   exists,up  
 3  pve1     0      0       0        0       0        0   exists,up  
 4  pve3     0      0       0        0       0        0   exists,up  
 5  pve2     0      0       0        0       0        0   exists,up  
 6  pve1     0      0       0        0       0        0   exists,up  
 7  pve3     0      0       0        0       0        0   exists,up  
 8  pve2     0      0       0        0       0        0   exists,up  
 9  pve3   332G  2967G     20      112k      1     2457   exists,up  
10  pve2   332G  2967G     16      135k      0     6553   exists,up  
11  pve1   294G  3005G      2      686k      0       20   exists,up  

I used the first command just to print the commands; I checked them, then copy/pasted the output into the console to apply them.

Alternatively, you can use ceph osd crush reweight. Example to generate the commands ready for copy/paste:

for x in {0..8}; do echo ceph osd crush reweight osd.$x 0; done;

It will now take hours or days for all the data to migrate to the new OSDs.

You should have a HEALTH_OK status

root@pve3:~# ceph status
  cluster:
    id:     e7628d51-32b5-4f5c-8eec-1cafb41ead74
    health: HEALTH_OK
 
  services:
    mon: 3 daemons, quorum pve1,pve3,pve2 (age 55m)
    mgr: pve1(active, since 55m), standbys: pve3, pve2
    mds: 1/1 daemons up, 2 standby
    osd: 12 osds: 12 up (since 24m), 3 in (since 20m); 195 remapped pgs
 
  data:
    volumes: 1/1 healthy
    pools:   10 pools, 229 pgs
    objects: 1.04M objects, 1.2 TiB
    usage:   976 GiB used, 8.7 TiB / 9.7 TiB avail
    pgs:     4166448/3108540 objects misplaced (134.032%)
             160 active+clean+remapped
             35  active+remapped+backfilling
             33  active+clean
             1   active+clean+snaptrim
 
  io:
    client:   325 KiB/s rd, 1.3 MiB/s wr, 4 op/s rd, 21 op/s wr
    recovery: 112 MiB/s, 69 keys/s, 85 objects/s

Monitoring

I mostly use the ceph -w command and the PVE GUI, but you might find more appropriate commands in the documentation for your case. The PVE GUI is quite basic, and its recovery/rebalance time estimate is always wrong.

After some time, you should get back to a mostly clean status.

The ‘4 pools have too few placement groups’ message is only there because I set the autoscaling to warn mode, and all the OSDs (old and new) are still present.
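
For reference, the autoscaler mode is a per-pool setting; this is roughly how to put a pool in warn mode (cephfs_data is just one of my pools, used here as an example):

ceph osd pool set cephfs_data pg_autoscale_mode warn   # warn instead of auto-adjusting pg_num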

Remove the old OSDs

It’s now time to set the old OSDs to out, if not already done, and check that the health is still OK. Then progressively shut them down, monitoring the health status. You can do everything from the PVE GUI, but for some operations you might have to log in using root / PAM.

If the health does not show any problem, you can select the old OSDs (down) and destroy them from the GUI. It will offer to clean up the disks.

And keep only the new OSDs
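
If you prefer the command line over the GUI, the same steps look roughly like this. OSD id 0 is only an example; check ceph status between each step, and note that pveceph osd destroy with --cleanup is the Proxmox helper that also wipes the disk:

ceph osd out 0                    # stop placing data on the old OSD (if not already out)
systemctl stop ceph-osd@0         # stop the OSD daemon on the node hosting it
ceph osd safe-to-destroy osd.0    # confirm it can be removed without data loss
pveceph osd destroy 0 --cleanup   # remove the OSD and clean up its disk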

Tuning

If you have a very large pool, you can activate the bulk flag for that pool. In my case:

root@pve3:~# ceph osd pool set cephfs_data bulk true
set pool 2 bulk to true
root@pve3:~# ceph osd pool autoscale-status
POOL                   SIZE  TARGET SIZE  RATE  RAW CAPACITY   RATIO  TARGET RATIO  EFFECTIVE RATIO  BIAS  PG_NUM  NEW PG_NUM  AUTOSCALE  BULK   
.mgr                 40192k                3.0         9900G  0.0000                                  1.0       1              on         False  
cephfs_metadata      839.0M                3.0         9900G  0.0002                                  4.0      16              on         False  
cephblock            307.0G                3.0         9900G  0.0930                                  1.0      16              off        False  
cephfs_data_cache    52606M       51200M   2.0         9900G  0.0104                                  1.0       2              off        False  
.rgw.root             1327                 3.0         9900G  0.0000                                  1.0       4          32  warn       False  
default.rgw.log        182                 3.0         9900G  0.0000                                  1.0       4          32  warn       False  
default.rgw.control      0                 3.0         9900G  0.0000                                  1.0       4          32  warn       False  
default.rgw.meta         0                 3.0         9900G  0.0000                                  4.0       4          32  warn       False  
cephfs_data           1041G                3.0         9900G  0.3156                                  1.0      64              off        True   

Wait for the autoscaler to compute new PG numbers from the new OSD count, then turn the autoscaler back on.
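
Turning it back on is also done per pool; a quick sketch with the same example pool:

ceph osd pool set cephfs_data pg_autoscale_mode on   # let the autoscaler adjust pg_num again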

Possible problems

Running out of PGs

If you reduce the number of OSDs, you can end up with more PGs allocated than max_pg_per_osd allows. I hit this problem because autoscaling was on, and Ceph does not exclude the out OSDs from the OSD count it uses to compute the number of PGs for each pool. Also, I had set a target ratio, which takes precedence. So I had to switch off autoscaling, disable the target ratio, and adjust the PG number for some pools to stay within the default maximum of 100 PGs per OSD.

Useful commands:

ceph osd pool autoscale-status                     # Check PGs and autoscale recommendations.
ceph osd pool set {pool-name} target_size_ratio 0  # Disable the TARGET RATIO
ceph osd pool set {pool-name} pg_num 64
ceph osd pool set {pool-name} pgp_num 64           # You need to adjust both pg_num & pgp_num for the rebalance to occur.
ceph osd down {osd-num}
ceph osd dump | grep repli

The PG adjustments for pools might take time, and might occur after rebalancing (be patient).

PG stuck in rebalancing

Reweighting the old OSDs first should prevent most rebalancing problems.

But if you created a CRUSH rule like the one proposed in Extend a Ceph cluster with slower HDD disks using SSD cache-tiering, and you want to remove OSDs referenced by that rule, you might have to ensure all pools are back on the default replicated_rule first.
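
Checking and switching a pool back to the default rule can be done like this ({pool-name} is a placeholder):

ceph osd pool get {pool-name} crush_rule                   # show which CRUSH rule the pool uses
ceph osd pool set {pool-name} crush_rule replicated_rule   # move the pool back to the default rule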

OSD services remain active at startup

Or: how to remove an unwanted or failed Ceph service created after entering a wrong name.

This has no consequences, but if you do not want them in your startup logs any more, you might have to remove them manually.
WARNING: with the command below, make sure you grep -v the osd id you want to keep (test without the xargs part first).

root@pve1:~# ps -ef | grep osd
ceph        3266       1 12 16:18 ?        00:17:10 /usr/bin/ceph-osd -f --cluster ceph --id 11 --setuser ceph --setgroup ceph
root       89661   66392  0 18:38 pts/1    00:00:00 grep osd
root@pve1:~# ls -1 /etc/systemd/system/multi-user.target.wants/ | grep ceph-volume | grep -v lvm-11- | xargs systemctl disable
Removed /etc/systemd/system/multi-user.target.wants/ceph-volume@lvm-0-1d645695-bed4-42df-aa29-e7381d5afa59.service.
Removed /etc/systemd/system/multi-user.target.wants/ceph-volume@lvm-3-94d08df1-7c9e-4072-b5ba-e5399c1e77c7.service.
Removed /etc/systemd/system/multi-user.target.wants/ceph-volume@lvm-1-d3ab1ea9-e3db-426a-abe2-4000415decc5.service.
Removed /etc/systemd/system/multi-user.target.wants/ceph-volume@lvm-1-5501a4e2-67dc-4dfc-8510-f65920b18b2c.service.
Removed /etc/systemd/system/multi-user.target.wants/ceph-volume@lvm-0-7d865311-bf42-45cf-bc76-307f2452384d.service.
Removed /etc/systemd/system/multi-user.target.wants/ceph-volume@lvm-6-de79485d-414d-40aa-8290-83db26292a39.service.
Removed /etc/systemd/system/multi-user.target.wants/ceph-volume@lvm-3-0ee6fad6-cba7-48b4-8c3a-430190be03d2.service.
Removed /etc/systemd/system/multi-user.target.wants/ceph-volume@lvm-1-08566904-6f0e-4afd-ab25-079359feb0bd.service.
Removed /etc/systemd/system/multi-user.target.wants/ceph-volume@lvm-5-3dba44fa-599f-4056-9c9d-2eefc475ed96.service.

Benchmark

You might like to benchmark your setup. You’ll find many fio command lines looking like this

fio --filename=/mnt/pve/cephfs/backup/postgresql/bench/bench.file --size=128GB --direct=1 --rw=randrw --bs=4k --ioengine=libaio --iodepth=1 --runtime=120 --numjobs=4 --time_based --group_reporting --name=iops-test-job --eta-newline=1

Don’t be surprised if the numbers are lower than on a native disk: this is a networked cluster!
Also, the above command can quickly make ceph-osd crash, especially if you use high iodepth and numjobs values. Adjust the size according to your hardware performance.
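
If you want to measure the cluster itself rather than a CephFS mount, a raw RADOS benchmark is another option. A sketch, assuming you run it against a pool you can pollute with test objects (ideally a dedicated test pool; cephblock is only used as an example name here) and that you clean up afterwards:

rados bench -p cephblock 60 write --no-cleanup   # 60 s write benchmark, keep the objects
rados bench -p cephblock 60 rand                 # random read benchmark on those objects
rados -p cephblock cleanup                       # remove the benchmark objects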
