Notes on repairing a ZFS pool that faulted because device names changed
I noticed that a ZFS pool on an Ubuntu 23.10 machine had become degraded:
# zpool status stor
  pool: stor
 state: DEGRADED
status: One or more devices could not be used because the label is missing or
        invalid.  Sufficient replicas exist for the pool to continue
        functioning in a degraded state.
action: Replace the device using 'zpool replace'.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-4J
  scan: scrub repaired 0B in 19:13:23 with 0 errors on Sun Jun  9 19:37:31 2024
config:

        NAME                        STATE     READ WRITE CKSUM
        stor                        DEGRADED     0     0     0
          raidz2-0                  DEGRADED     0     0     0
            scsi-35000cca260cd1f18  ONLINE       0     0     0
            scsi-35000cca260cc3084  ONLINE       0     0     0
            scsi-35000cca260cc6be0  ONLINE       0     0     0
            13184766210832087855    FAULTED      0     0     0  was /dev/sdf1
            8984617841033776882     FAULTED      0     0     0  was /dev/sdg1
            wwn-0x5000cca2604ac3e0  ONLINE       0     0     0

errors: No known data errors

# zpool status -L stor
  pool: stor
 state: DEGRADED
status: One or more devices could not be used because the label is missing or
        invalid.  Sufficient replicas exist for the pool to continue
        functioning in a degraded state.
action: Replace the device using 'zpool replace'.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-4J
  scan: scrub repaired 0B in 19:13:23 with 0 errors on Sun Jun  9 19:37:31 2024
config:

        NAME                      STATE     READ WRITE CKSUM
        stor                      DEGRADED     0     0     0
          raidz2-0                DEGRADED     0     0     0
            sdc                   ONLINE       0     0     0
            sdd                   ONLINE       0     0     0
            sde                   ONLINE       0     0     0
            13184766210832087855  FAULTED      0     0     0  was /dev/sdf1
            8984617841033776882   FAULTED      0     0     0  was /dev/sdg1
            sdg                   ONLINE       0     0     0

errors: No known data errors
Good grief, the raidz2 vdev had lost two disks!
After checking, the disks themselves were fine; the device names had simply changed after a reboot (you can tell because sdg shows up as ONLINE in the list, yet one of the faulted entries says it used to be /dev/sdg1). And that alone is enough to break ZFS?! I have never run into this on FreeBSD.
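Before touching anything, it is worth confirming that the disks really did just change names rather than fail. A minimal sketch of the checks one might run (the sdX name below is taken from this pool and will differ elsewhere; smartctl comes from the smartmontools package):

# List whole disks with their stable WWN and serial number, so the current
# sdX assignments can be matched against the by-id names in 'zpool status'.
lsblk -d -o NAME,SIZE,WWN,SERIAL

# Show which sdX device each persistent by-id symlink points to right now.
ls -l /dev/disk/by-id/ | grep -E 'scsi-|wwn-'

# Quick SMART health check on one of the supposedly faulted disks.
smartctl -H /dev/sdf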
I tried zpool replace, and it failed:
# zpool replace stor 13184766210832087855 /dev/sdf
invalid vdev specification
use '-f' to override the following errors:
/dev/sdf1 is part of active pool 'stor'

# zpool replace -f stor 13184766210832087855 /dev/sdf
invalid vdev specification
the following errors must be manually repaired:
/dev/sdf1 is part of active pool 'stor'
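One way to see why ZFS insists the partition still belongs to the pool is to dump the vdev labels that remain on it. I did not run this at the time; it is only a read-only diagnostic sketch:

# Print the ZFS vdev labels stored on the old partition. If they still carry
# the pool name 'stor' and a vdev guid, that is exactly what makes
# 'zpool replace' refuse with "is part of active pool".
zdb -l /dev/sdf1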
I tried zpool labelclear, which also failed. I then wiped the disk with wipefs and attempted the replace again; it still failed:
# zpool labelclear /dev/sdf
failed to clear label for /dev/sdf

# wipefs -a /dev/sdf
/dev/sdf: 8 bytes were erased at offset 0x00000200 (gpt): 45 46 49 20 50 41 52 54
/dev/sdf: 8 bytes were erased at offset 0x74702555e00 (gpt): 45 46 49 20 50 41 52 54
/dev/sdf: 2 bytes were erased at offset 0x000001fe (PMBR): 55 aa
/dev/sdf: calling ioctl to re-read partition table: Success

# zpool replace stor 13184766210832087855 /dev/sdf
cannot replace 13184766210832087855 with /dev/sdf: /dev/sdf is busy, or device removal is in progress
Damn it. Words could no longer describe how I felt at this point.
In the end, after looking up a few similar cases online, the fix was to export the pool first and then re-import it with -d. Using only -d /dev/disk/by-id/ was not enough, though; the trick was to pass -d multiple times.
Take the problem devices offline, export the pool, import it again with -d /dev/disk/by-id/ plus several extra -d device names, and finally bring the problem devices back online:
# zpool offline stor 13184766210832087855
# zpool offline stor 8984617841033776882
# zpool export stor
# zpool import -d /dev/disk/by-id/ -d /dev/sdc -d /dev/sdd -d /dev/sde -d /dev/sdg -d /dev/sdf -d /dev/sdi stor
# zpool online stor 13184766210832087855
# zpool online stor 8984617841033776882
The pool finally came back online and began resilvering the two problem devices:
# zpool status stor
  pool: stor
 state: ONLINE
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Sun Jul  7 01:17:01 2024
        44.7G / 32.3T scanned at 1.09G/s, 0B / 32.3T issued
        0B resilvered, 0.00% done, no estimated completion time
config:

        NAME                        STATE     READ WRITE CKSUM
        stor                        ONLINE       0     0     0
          raidz2-0                  ONLINE       0     0     0
            scsi-35000cca260cd1f18  ONLINE       0     0     0
            scsi-35000cca260cc3084  ONLINE       0     0     0
            scsi-35000cca260cc6be0  ONLINE       0     0     0
            wwn-0x5000cca260cc504c  ONLINE       0     0     0
            wwn-0x5000cca260cc32d0  ONLINE       0     0     0  (awaiting resilver)
            wwn-0x5000cca2604ac3e0  ONLINE       0     0     0

errors: No known data errors
The actual rebuild turned out to be quite fast toward the end; most of the time was spent in the initial scanning phase.
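To reduce the chance of this happening again, the import can be told to prefer persistent device names. On Ubuntu the OpenZFS packaging ships /etc/default/zfs with a commented ZPOOL_IMPORT_PATH setting; uncommenting something like the following should make boot-time imports scan the by-id directory first. This is only a sketch of that idea, assuming the stock zfsutils packaging, and was not part of the repair above:

# /etc/default/zfs  (Ubuntu/Debian OpenZFS packaging; assumes the stock file)
# Make boot-time pool imports search persistent names first, so a reshuffled
# /dev/sdX order after a reboot no longer invalidates the recorded vdev paths.
ZPOOL_IMPORT_PATH="/dev/disk/by-id"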