etcd disaster recovery relies on a snapshot file; at its core, it is a snapshot-restore operation. When the majority of an etcd cluster's nodes are permanently lost, or cannot be brought back within a reasonable time, restoring service requires performing an etcd snapshot restore. Note that data recovery for v2 and v3 are two entirely different procedures; this article focuses on v3 backup and restore.
Lab environment

The environment left over from the previous article:
1161d5b4260241e3, started, lv-etcd-research-alpha-1, http://192.168.149.63:2380, http://192.168.149.63:2379
2145c204a51dbbc7, started, lv-etcd-research-alpha-0, http://192.168.149.60:2380, http://192.168.149.60:2379
4252aec339d438d9, started, lv-etcd-research-alpha-3, http://192.168.149.62:2380, http://192.168.149.62:2379
e26482910894af8d, started, lv-etcd-research-alpha-2, http://192.168.149.61:2380, http://192.168.149.61:2379
ea04db3353b9fd4e, started, lv-etcd-research-alpha-4, http://192.168.149.64:2380, http://192.168.149.64:2379
Building on this environment, this article uses a backed-up snapshot file to create a new cluster.
Overall steps
Create a snapshot file
Distribute the snapshot file to every host of the new cluster
Use the etcdctl snapshot restore command to build a temporary logical cluster, restoring the data into a new data directory
Start the etcd service using the new data directory
Create the snapshot file

Assume that of the five nodes in the cluster, only 192.168.149.60 is still alive. The first step is to export (snapshot) the data from the surviving node.
Of course, under normal circumstances a production environment takes periodic etcd snapshots; in that case, simply restore from the most recent snapshot.
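As an aside, such periodic backups are often just a scheduled etcdctl snapshot save; a minimal sketch as a crontab entry (the backup path and schedule are assumptions, not part of the original setup):

# Hypothetical crontab entry: snapshot etcd daily at 02:00 into a date-stamped file.
# Note that % is special in crontab and must be escaped as \%.
0 2 * * * ETCDCTL_API=3 etcdctl --endpoints http://192.168.149.60:2379 snapshot save /backup/etcd/snapshot-$(date +\%F).db

Back to the lab scenario: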
> ETCDCTL_API=3 etcdctl --endpoints http://192.168.149.60:2379 snapshot save snapshot.db
Snapshot saved at snapshot.db
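Before distributing the snapshot, it can be worth a quick integrity check; etcdctl's snapshot status subcommand reports the snapshot's hash, revision, total key count, and size:

# Sanity-check the snapshot before relying on it for recovery
> ETCDCTL_API=3 etcdctl snapshot status snapshot.db -w table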
Distribute the snapshot file

The second step is to distribute the snapshot file to the machines of the new cluster. In my current environment, that means sending the file to:
192.168.149.61
192.168.149.62
192.168.149.63
192.168.149.64
In a real incident, the other four hosts, or even every host of the original cluster, may be unusable; in that case, download the previously backed-up snapshot file from the backup server onto each host that will form the new cluster.
> scp snapshot.db root@192.168.149.61:/var/lib/etcd/
snapshot.db                100%   20MB  42.1MB/s   00:00
> scp snapshot.db root@192.168.149.62:/var/lib/etcd/
snapshot.db                100%   20MB  30.3MB/s   00:00
> scp snapshot.db root@192.168.149.63:/var/lib/etcd/
snapshot.db                100%   20MB  36.9MB/s   00:00
> scp snapshot.db root@192.168.149.64:/var/lib/etcd/
snapshot.db
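Optionally, checksums can confirm each copy arrived intact (a small sketch; assumes root ssh access to the hosts above):

# Compare the local snapshot's checksum with each remote copy
sha256sum snapshot.db
for ip in 192.168.149.61 192.168.149.62 192.168.149.63 192.168.149.64; do
    ssh root@$ip sha256sum /var/lib/etcd/snapshot.db
done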
Restore the snapshot

Run on 192.168.149.60:
cd /var/lib/etcd
ETCDCTL_API=3 etcdctl snapshot restore snapshot.db \
  --data-dir="/var/lib/etcd/new.etcd" \
  --name="lv-etcd-research-beta-0" \
  --initial-advertise-peer-urls="http://192.168.149.60:22380" \
  --initial-cluster="lv-etcd-research-beta-0=http://192.168.149.60:22380,lv-etcd-research-beta-1=http://192.168.149.63:22380,lv-etcd-research-beta-2=http://192.168.149.61:22380,lv-etcd-research-beta-3=http://192.168.149.62:22380,lv-etcd-research-beta-4=http://192.168.149.64:22380" \
  --initial-cluster-token="lv-etcd-research-beta-temp"
Output:
2019-06-14 14:47:34.172213 I | etcdserver/membership: added member 6914761fd26729d7 [http://192.168.149.62:22380] to cluster c1cdf0b2061f8dcc
2019-06-14 14:47:34.172370 I | etcdserver/membership: added member b8ca704ce48fc6c2 [http://192.168.149.63:22380] to cluster c1cdf0b2061f8dcc
2019-06-14 14:47:34.172419 I | etcdserver/membership: added member bff8d73529095f70 [http://192.168.149.64:22380] to cluster c1cdf0b2061f8dcc
2019-06-14 14:47:34.172464 I | etcdserver/membership: added member eb548d413adb4560 [http://192.168.149.61:22380] to cluster c1cdf0b2061f8dcc
2019-06-14 14:47:34.172506 I | etcdserver/membership: added member fee07bdb23e26b2f [http://192.168.149.60:22380] to cluster c1cdf0b2061f8dcc
Run on 192.168.149.63:
cd /var/lib/etcd
ETCDCTL_API=3 etcdctl snapshot restore snapshot.db \
  --data-dir="/var/lib/etcd/new.etcd" \
  --name="lv-etcd-research-beta-1" \
  --initial-advertise-peer-urls="http://192.168.149.63:22380" \
  --initial-cluster="lv-etcd-research-beta-0=http://192.168.149.60:22380,lv-etcd-research-beta-1=http://192.168.149.63:22380,lv-etcd-research-beta-2=http://192.168.149.61:22380,lv-etcd-research-beta-3=http://192.168.149.62:22380,lv-etcd-research-beta-4=http://192.168.149.64:22380" \
  --initial-cluster-token="lv-etcd-research-beta-temp"
Output:
2019-06-14 14:49:23.896672 I | etcdserver/membership: added member 6914761fd26729d7 [http://192.168.149.62:22380] to cluster c1cdf0b2061f8dcc
2019-06-14 14:49:23.897112 I | etcdserver/membership: added member b8ca704ce48fc6c2 [http://192.168.149.63:22380] to cluster c1cdf0b2061f8dcc
2019-06-14 14:49:23.897210 I | etcdserver/membership: added member bff8d73529095f70 [http://192.168.149.64:22380] to cluster c1cdf0b2061f8dcc
2019-06-14 14:49:23.897264 I | etcdserver/membership: added member eb548d413adb4560 [http://192.168.149.61:22380] to cluster c1cdf0b2061f8dcc
2019-06-14 14:49:23.897403 I | etcdserver/membership: added member fee07bdb23e26b2f [http://192.168.149.60:22380] to cluster c1cdf0b2061f8dcc
Run on 192.168.149.61:
cd /var/lib/etcd
ETCDCTL_API=3 etcdctl snapshot restore snapshot.db \
  --data-dir="/var/lib/etcd/new.etcd" \
  --name="lv-etcd-research-beta-2" \
  --initial-advertise-peer-urls="http://192.168.149.61:22380" \
  --initial-cluster="lv-etcd-research-beta-0=http://192.168.149.60:22380,lv-etcd-research-beta-1=http://192.168.149.63:22380,lv-etcd-research-beta-2=http://192.168.149.61:22380,lv-etcd-research-beta-3=http://192.168.149.62:22380,lv-etcd-research-beta-4=http://192.168.149.64:22380" \
  --initial-cluster-token="lv-etcd-research-beta-temp"
The output is the same as above.
Run on 192.168.149.62:
cd /var/lib/etcd
ETCDCTL_API=3 etcdctl snapshot restore snapshot.db \
  --data-dir="/var/lib/etcd/new.etcd" \
  --name="lv-etcd-research-beta-3" \
  --initial-advertise-peer-urls="http://192.168.149.62:22380" \
  --initial-cluster="lv-etcd-research-beta-0=http://192.168.149.60:22380,lv-etcd-research-beta-1=http://192.168.149.63:22380,lv-etcd-research-beta-2=http://192.168.149.61:22380,lv-etcd-research-beta-3=http://192.168.149.62:22380,lv-etcd-research-beta-4=http://192.168.149.64:22380" \
  --initial-cluster-token="lv-etcd-research-beta-temp"
The output is the same as above.
Run on 192.168.149.64:
cd /var/lib/etcd
ETCDCTL_API=3 etcdctl snapshot restore snapshot.db \
  --data-dir="/var/lib/etcd/new.etcd" \
  --name="lv-etcd-research-beta-4" \
  --initial-advertise-peer-urls="http://192.168.149.64:22380" \
  --initial-cluster="lv-etcd-research-beta-0=http://192.168.149.60:22380,lv-etcd-research-beta-1=http://192.168.149.63:22380,lv-etcd-research-beta-2=http://192.168.149.61:22380,lv-etcd-research-beta-3=http://192.168.149.62:22380,lv-etcd-research-beta-4=http://192.168.149.64:22380" \
  --initial-cluster-token="lv-etcd-research-beta-temp"
The output is the same as above.
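Since the five restore commands differ only in --name and --initial-advertise-peer-urls, they could also be driven from a single machine with a small ssh loop. This is a sketch, not the original procedure; it assumes root ssh access and that snapshot.db already sits in /var/lib/etcd on every host:

# Run the per-node restore over ssh instead of typing it five times.
# The name -> IP mapping matches the lab environment above.
CLUSTER="lv-etcd-research-beta-0=http://192.168.149.60:22380,lv-etcd-research-beta-1=http://192.168.149.63:22380,lv-etcd-research-beta-2=http://192.168.149.61:22380,lv-etcd-research-beta-3=http://192.168.149.62:22380,lv-etcd-research-beta-4=http://192.168.149.64:22380"
for node in beta-0:192.168.149.60 beta-1:192.168.149.63 beta-2:192.168.149.61 beta-3:192.168.149.62 beta-4:192.168.149.64; do
    name="lv-etcd-research-${node%%:*}"   # e.g. lv-etcd-research-beta-0
    ip="${node##*:}"                      # e.g. 192.168.149.60
    ssh root@$ip "cd /var/lib/etcd && ETCDCTL_API=3 etcdctl snapshot restore snapshot.db \
        --data-dir=/var/lib/etcd/new.etcd \
        --name=$name \
        --initial-advertise-peer-urls=http://$ip:22380 \
        --initial-cluster=$CLUSTER \
        --initial-cluster-token=lv-etcd-research-beta-temp"
done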
This step writes the data from the snapshot into the specified directory, along with the new cluster's membership metadata. Note that the data in the snapshot itself is "clean": it contains no metadata such as the original node ID or cluster ID. After the restore runs, the cluster information is determined by the command-line arguments, so each node subsequently only needs to be started with the new data directory; the cluster information need not be specified again, since it has already been written into the db.
Start the new cluster

Modify the etcd configuration file to point at the new data directory, then start the service. In my lab environment the old etcd cluster is still running normally, so I use different ports to start a second cluster on the same five machines to validate the restore.
Run on 192.168.149.60:
# Create a new configuration file
echo 'ETCD_DATA_DIR="/var/lib/etcd/new.etcd"
ETCD_LISTEN_PEER_URLS="http://0.0.0.0:22380"
ETCD_LISTEN_CLIENT_URLS="http://0.0.0.0:22379"
ETCD_NAME="lv-etcd-research-beta-0"
ETCD_ADVERTISE_CLIENT_URLS="http://192.168.149.60:22379"
ETCD_INITIAL_CLUSTER_TOKEN="lv-etcd-research-beta"' > /etc/etcd/etcd_new.conf

# Copy the original unit file
cp /usr/lib/systemd/system/etcd.service /usr/lib/systemd/system/etcd_new.service

# Point the new unit file at the new configuration file
sed -i s/etcd.conf/etcd_new.conf/g /usr/lib/systemd/system/etcd_new.service

systemctl daemon-reload
systemctl start etcd_new
Repeat analogously on the other four hosts, adjusting ETCD_NAME and ETCD_ADVERTISE_CLIENT_URLS for each…
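Once all five members are up, a quick health check against the new 22379 client ports can confirm the new cluster has formed (endpoints match the lab environment):

# Probe every member of the new cluster
ETCDCTL_API=3 etcdctl \
  --endpoints http://192.168.149.60:22379,http://192.168.149.61:22379,http://192.168.149.62:22379,http://192.168.149.63:22379,http://192.168.149.64:22379 \
  endpoint health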
Check the members of the old and new clusters:
> ETCDCTL_API=3 etcdctl --endpoints http://192.168.149.63:2379 member list -w table
+------------------+---------+--------------------------+----------------------------+----------------------------+
|        ID        | STATUS  |           NAME           |         PEER ADDRS         |        CLIENT ADDRS        |
+------------------+---------+--------------------------+----------------------------+----------------------------+
| 1161d5b4260241e3 | started | lv-etcd-research-alpha-1 | http://192.168.149.63:2380 | http://192.168.149.63:2379 |
| 2145c204a51dbbc7 | started | lv-etcd-research-alpha-0 | http://192.168.149.60:2380 | http://192.168.149.60:2379 |
| 4252aec339d438d9 | started | lv-etcd-research-alpha-3 | http://192.168.149.62:2380 | http://192.168.149.62:2379 |
| e26482910894af8d | started | lv-etcd-research-alpha-2 | http://192.168.149.61:2380 | http://192.168.149.61:2379 |
| ea04db3353b9fd4e | started | lv-etcd-research-alpha-4 | http://192.168.149.64:2380 | http://192.168.149.64:2379 |
+------------------+---------+--------------------------+----------------------------+----------------------------+

> ETCDCTL_API=3 etcdctl --endpoints http://192.168.149.63:22379 member list -w table
+------------------+---------+-------------------------+-----------------------------+-----------------------------+
|        ID        | STATUS  |          NAME           |         PEER ADDRS          |        CLIENT ADDRS         |
+------------------+---------+-------------------------+-----------------------------+-----------------------------+
| 6914761fd26729d7 | started | lv-etcd-research-beta-3 | http://192.168.149.62:22380 | http://192.168.149.63:22379 |
| b8ca704ce48fc6c2 | started | lv-etcd-research-beta-1 | http://192.168.149.63:22380 | http://192.168.149.63:22379 |
| bff8d73529095f70 | started | lv-etcd-research-beta-4 | http://192.168.149.64:22380 | http://192.168.149.64:22379 |
| eb548d413adb4560 | started | lv-etcd-research-beta-2 | http://192.168.149.61:22380 | http://192.168.149.61:22379 |
| fee07bdb23e26b2f | started | lv-etcd-research-beta-0 | http://192.168.149.60:22380 | http://192.168.149.60:22379 |
+------------------+---------+-------------------------+-----------------------------+-----------------------------+
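As a final sanity check, data written to the old cluster before the snapshot should now be readable through the new cluster's client port; for example (the key foo here is purely hypothetical, not from the original experiment):

# A key that existed at snapshot time should be present in the new cluster
ETCDCTL_API=3 etcdctl --endpoints http://192.168.149.60:22379 get foo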
Note the difference between v2 and v3

The official recommendation is not to mix v2 and v3 usage; that is, if you have both v2 and v3 storage needs, it is best to isolate them in two separate clusters. Backup and restore illustrates vividly why v2 and v3 should not be mixed: everything above applies only to v3 data. Even if the backed-up etcd cluster contains v2 data, that v2 data will not appear in the new cluster after restoring with this procedure.
If you need v2 backup and restore, refer to the official documentation:
https://etcd.io/docs/v2/admin_guide/#disaster-recovery
The rough steps are:
Use the etcdctl backup command to back up the data into a new directory
Start a single-node etcd service in --force-new-cluster mode, pointing it at the new directory
If you are restoring a whole cluster, first run the ETCDCTL_API=2 etcdctl member update command to correct the recovered member's advertised peer URL
Finally, add the remaining nodes following the normal runtime reconfiguration process
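A minimal sketch of those steps, following the v2 admin guide; the data directory, member ID, node names, and peer URLs below are placeholders, not values from this lab:

# 1. Back up the v2 data directory with the v2 backup subcommand
ETCDCTL_API=2 etcdctl backup \
  --data-dir /var/lib/etcd/default.etcd \
  --backup-dir /var/lib/etcd/backup.etcd

# 2. Start a single-node cluster from the backup; --force-new-cluster
#    discards the old membership, keeping only the local node
etcd --data-dir /var/lib/etcd/backup.etcd --force-new-cluster

# 3. Correct the recovered member's advertised peer URL
ETCDCTL_API=2 etcdctl member update <member-id> http://10.0.0.1:2380

# 4. Grow the cluster back by adding members as in normal operation
ETCDCTL_API=2 etcdctl member add node2 http://10.0.0.2:2380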