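There are two ways to move an etcd member to a different machine. One is data migration; the other is to add a new node, wait for it to finish syncing, and then remove the old node. This article covers the former: migrating an etcd node by moving its data directory.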
Lab environment
Current etcd cluster information:
    > ETCDCTL_API=3 etcdctl --endpoints http://192.168.149.60:2379,http://192.168.149.62:2379,http://192.168.149.61:2379 member list
    1161d5b4260241e3, started, lv-etcd-research-alpha-1, http://192.168.149.60:2380, http://192.168.149.60:2379
    4252aec339d438d9, started, lv-etcd-research-alpha-3, http://192.168.149.62:2380, http://192.168.149.62:2379
    e6f45ed7d9402b75, started, lv-etcd-research-alpha-2, http://192.168.149.61:2380, http://192.168.149.61:2379
    > ETCDCTL_API=3 etcdctl --endpoints http://192.168.149.60:2379,http://192.168.149.62:2379,http://192.168.149.61:2379 endpoint status -w table
    +----------------------------+------------------+---------+---------+-----------+-----------+------------+
    |          ENDPOINT          |        ID        | VERSION | DB SIZE | IS LEADER | RAFT TERM | RAFT INDEX |
    +----------------------------+------------------+---------+---------+-----------+-----------+------------+
    | http://192.168.149.60:2379 | 1161d5b4260241e3 |  3.2.28 |   18 MB |     false |         7 |     124802 |
    | http://192.168.149.62:2379 | 4252aec339d438d9 |  3.2.28 |   18 MB |     false |         7 |     124802 |
    | http://192.168.149.61:2379 | e6f45ed7d9402b75 |  3.2.28 |   18 MB |      true |         7 |     124802 |
    +----------------------------+------------------+---------+---------+-----------+-----------+------------+
The goal is to migrate node 192.168.149.60 to node 192.168.149.63.
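Overall migration steps
1. Stop the etcd service on 192.168.149.60. If the process has already died, you get to skip this step 😆 The catch is that you must be sure it is thoroughly dead and will not come back to life halfway through the migration…
2. Copy the data from the old machine into the corresponding directory on the new machine.
3. Run member update from any node to change the member's peerURLs to the new machine's IP:Port.
4. Copy the configuration file from the old machine to the new one, change the IP addresses to the new machine's, make sure it points at the correct data directory, and start the service.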
Step 1: Stop the service
On the node to be migrated, kill the etcd process. If circumstances allow, do not use kill -9; prefer a graceful shutdown.
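A graceful stop that also confirms the process is really gone could look roughly like the sketch below. The helper name stop_gracefully, the 30-second default timeout, and the pgrep lookup in the example are assumptions made for illustration, not part of the original procedure; if etcd runs under a service manager, stopping it through that manager is the more natural route.

```shell
#!/bin/sh
# Hypothetical helper (not from the original post): ask a process to exit with
# SIGTERM, then poll until it is gone or the timeout (in seconds) runs out.
# Returns 0 when the process no longer exists, 1 if it is still alive.
stop_gracefully() {
    pid=$1
    timeout=${2:-30}

    # kill fails if the process is already gone (or we may not signal it);
    # run this as a user that is allowed to signal etcd.
    kill -TERM "$pid" 2>/dev/null || return 0

    while [ "$timeout" -gt 0 ]; do
        # kill -0 sends no signal, it only checks that the PID still exists.
        kill -0 "$pid" 2>/dev/null || return 0
        sleep 1
        timeout=$((timeout - 1))
    done

    kill -0 "$pid" 2>/dev/null || return 0
    return 1
}

# Example (assumes a single etcd process on the host):
#   stop_gracefully "$(pgrep -x etcd)" 30 || echo "etcd is still running"
```

Only when the helper returns non-zero would it make sense to consider a harder kill, and even then the data directory should not be copied until the process has actually exited.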
Querying the cluster status at this point:
    > ETCDCTL_API=3 etcdctl --endpoints http://192.168.149.60:2379,http://192.168.149.62:2379,http://192.168.149.61:2379 endpoint status -w table
    Failed to get the status of endpoint http://192.168.149.60:2379 (context deadline exceeded)
    +----------------------------+------------------+---------+---------+-----------+-----------+------------+
    |          ENDPOINT          |        ID        | VERSION | DB SIZE | IS LEADER | RAFT TERM | RAFT INDEX |
    +----------------------------+------------------+---------+---------+-----------+-----------+------------+
    | http://192.168.149.62:2379 | 4252aec339d438d9 |  3.2.28 |   18 MB |     false |         7 |     124940 |
    | http://192.168.149.61:2379 | e6f45ed7d9402b75 |  3.2.28 |   18 MB |      true |         7 |     124940 |
    +----------------------------+------------------+---------+---------+-----------+-----------+------------+
Node 192.168.149.60 is now unreachable.
Step 2: Migrate the data directory
On 192.168.149.63, create the data directory in advance:
    > mkdir -p /var/lib/etcd/default.etcd
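On 192.168.149.60, from inside the etcd data directory (the one that contains the member directory), pack the data and send it over: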
    > tar -cvzf member.tar.gz member
    > scp member.tar.gz root@192.168.149.63:/var/lib/etcd/default.etcd/
On 192.168.149.63, unpack it:
    > cd /var/lib/etcd/default.etcd/
    > tar -xvzf member.tar.gz
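Since the whole migration depends on the archive arriving intact, it may be worth checking a checksum before unpacking. The sketch below is one way to do that; the function names pack_member and unpack_member are made up for illustration, and it assumes GNU sha256sum is available on both machines and that the .sha256 file is copied along with the archive.

```shell
#!/bin/sh
# Hypothetical helpers (not from the original post).

# Old node: pack the member directory and record a SHA-256 checksum next to it.
pack_member() {
    data_dir=$1
    ( cd "$data_dir" &&
      tar -czf member.tar.gz member &&
      sha256sum member.tar.gz > member.tar.gz.sha256 )
}

# New node: extract the archive only if its checksum still matches.
unpack_member() {
    data_dir=$1
    ( cd "$data_dir" &&
      sha256sum -c member.tar.gz.sha256 >/dev/null &&
      tar -xzf member.tar.gz )
}

# Example, with the scp step in between:
#   pack_member /var/lib/etcd/default.etcd      # on 192.168.149.60
#   unpack_member /var/lib/etcd/default.etcd    # on 192.168.149.63
```

If unpack_member fails, the archive was altered or truncated in transit and should be copied again rather than extracted.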
Step 3: Update the member information
Run the following from any node to update the original member's peerURLs:
    > ETCDCTL_API=3 etcdctl --endpoints http://192.168.149.61:2379 member update 1161d5b4260241e3 --peer-urls="http://192.168.149.63:2380"
    Member 1161d5b4260241e3 updated in cluster 2c25150e88501a13
--endpoints http://192.168.149.61:2379: because node 192.168.149.60 has been stopped, a different endpoint has to be chosen to operate on the cluster.
1161d5b4260241e3 is the member ID of 192.168.149.60. If you have forgotten it, run member list to look it up.
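Looking up the member ID by hand is easy to get wrong on a larger cluster, so it can be scripted. The sketch below assumes the comma-separated member list format shown earlier in this article (ID, status, name, peer URL, client URL) and a member with a single peer URL; the function name member_id_by_peer_url is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper (not from the original post): read the plain
# `etcdctl member list` output on stdin and print the ID of the member whose
# peer URL (4th comma-separated field) equals the first argument.
member_id_by_peer_url() {
    awk -F', ' -v url="$1" '$4 == url { print $1 }'
}

# Example:
#   ETCDCTL_API=3 etcdctl --endpoints http://192.168.149.61:2379 member list \
#       | member_id_by_peer_url http://192.168.149.60:2380
```

The printed ID can then be passed to member update in place of the hard-coded value.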
The output shows that the command succeeded. The member list now looks like this:
    > ETCDCTL_API=3 etcdctl --endpoints http://192.168.149.60:2379,http://192.168.149.62:2379,http://192.168.149.61:2379 member list -w table
    +------------------+---------+--------------------------+----------------------------+----------------------------+
    |        ID        | STATUS  |           NAME           |         PEER ADDRS         |        CLIENT ADDRS        |
    +------------------+---------+--------------------------+----------------------------+----------------------------+
    | 1161d5b4260241e3 | started | lv-etcd-research-alpha-1 | http://192.168.149.63:2380 | http://192.168.149.60:2379 |
    | 4252aec339d438d9 | started | lv-etcd-research-alpha-3 | http://192.168.149.62:2380 | http://192.168.149.62:2379 |
    | e6f45ed7d9402b75 | started | lv-etcd-research-alpha-2 | http://192.168.149.61:2380 | http://192.168.149.61:2379 |
    +------------------+---------+--------------------------+----------------------------+----------------------------+
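In the first row, PEER ADDRS has been correctly updated to http://192.168.149.63:2380, but CLIENT ADDRS still shows the old http://192.168.149.60:2379. There is no need to worry about this: once the new node starts, the value will change to the correct address.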
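Step 4: Start the service on the new node
Before starting the service on the new node, remember to copy the configuration file over from the old node. After copying it, be sure to update the following parameters: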
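    # If the following two parameters are set to 0.0.0.0, no change is needed; if they name a specific IP, change .60 to .63
    ETCD_LISTEN_PEER_URLS
    ETCD_LISTEN_CLIENT_URLS

    # Update the IP address in the following two parameters to the new machine's IP
    ETCD_INITIAL_ADVERTISE_PEER_URLS
    ETCD_ADVERTISE_CLIENT_URLS

    # In the cluster information below, also change the old IP to the new machine's IP
    ETCD_INITIAL_CLUSTER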
Of the parameters above, ETCD_LISTEN_PEER_URLS, ETCD_LISTEN_CLIENT_URLS and ETCD_ADVERTISE_CLIENT_URLS are the most important; they must match the new machine's IP address.
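Because the member was reconfigured at runtime and already has a data directory, the two INITIAL parameters (ETCD_INITIAL_ADVERTISE_PEER_URLS and ETCD_INITIAL_CLUSTER) no longer have any effect when the service starts. Still, to avoid confusion when someone reads the configuration file later, it is best to update them to the new machine's IP address as well.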
Likewise, ETCD_INITIAL_CLUSTER_STATE="new" can be left as it is, since it has no effect here.
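When the configuration file names the old IP explicitly, the edits described above amount to replacing one address with another throughout the file. A minimal sketch of that, assuming the file is a plain KEY="value" environment file and using a made-up helper name rewrite_ip (the path in the example is also an assumption):

```shell
#!/bin/sh
# Hypothetical helper (not from the original post): replace every occurrence
# of OLD_IP with NEW_IP in FILE, keeping the original as FILE.bak.
rewrite_ip() {
    file=$1
    old_ip=$2
    new_ip=$3

    # Escape the dots so they match literally instead of "any character".
    old_re=$(printf '%s' "$old_ip" | sed 's/\./\\./g')

    # Require a non-digit (or end of line) after the address so that, for
    # example, 192.168.149.6 does not also match 192.168.149.60.
    sed -E -i.bak "s/${old_re}([^0-9]|\$)/${new_ip}\\1/g" "$file"
}

# Example (the config path is an assumption; use wherever your file lives):
#   rewrite_ip /etc/etcd/etcd.conf 192.168.149.60 192.168.149.63
```

Entries set to 0.0.0.0 are left untouched, which matches the note in the parameter list. It is still worth reading the resulting file once before starting etcd, in particular the data directory setting.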
Start the etcd service.
Cluster status:
    > ETCDCTL_API=3 etcdctl --endpoints http://192.168.149.63:2379,http://192.168.149.62:2379,http://192.168.149.61:2379 endpoint status -w table
    +----------------------------+------------------+---------+---------+-----------+-----------+------------+
    |          ENDPOINT          |        ID        | VERSION | DB SIZE | IS LEADER | RAFT TERM | RAFT INDEX |
    +----------------------------+------------------+---------+---------+-----------+-----------+------------+
    | http://192.168.149.63:2379 | 1161d5b4260241e3 |  3.2.28 |   18 MB |     false |         7 |     128011 |
    | http://192.168.149.62:2379 | 4252aec339d438d9 |  3.2.28 |   18 MB |     false |         7 |     128011 |
    | http://192.168.149.61:2379 | e6f45ed7d9402b75 |  3.2.28 |   18 MB |      true |         7 |     128011 |
    +----------------------------+------------------+---------+---------+-----------+-----------+------------+

    > ETCDCTL_API=3 etcdctl --endpoints http://192.168.149.63:2379,http://192.168.149.62:2379,http://192.168.149.61:2379 member list -w table
    +------------------+---------+--------------------------+----------------------------+----------------------------+
    |        ID        | STATUS  |           NAME           |         PEER ADDRS         |        CLIENT ADDRS        |
    +------------------+---------+--------------------------+----------------------------+----------------------------+
    | 1161d5b4260241e3 | started | lv-etcd-research-alpha-1 | http://192.168.149.63:2380 | http://192.168.149.63:2379 |
    | 4252aec339d438d9 | started | lv-etcd-research-alpha-3 | http://192.168.149.62:2380 | http://192.168.149.62:2379 |
    | e6f45ed7d9402b75 | started | lv-etcd-research-alpha-2 | http://192.168.149.61:2380 | http://192.168.149.61:2379 |
    +------------------+---------+--------------------------+----------------------------+----------------------------+
The RAFT INDEX values across the "new cluster" are identical, which means the new node 192.168.149.63 has caught up with the cluster's data.
PEER ADDRS and CLIENT ADDRS also both show the correct addresses.
At this point the node migration has completed successfully.
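The RAFT INDEX comparison can also be done by a script rather than by eye. The sketch below parses the endpoint status -w table output and succeeds only when every endpoint reports the same RAFT INDEX; the function name raft_indexes_match is made up for illustration. On a cluster that is taking writes, the values can differ for a moment between endpoints, so one mismatch is a reason to run the check again, not proof of a problem.

```shell
#!/bin/sh
# Hypothetical helper (not from the original post): read the output of
# `etcdctl endpoint status -w table` on stdin; exit 0 if all endpoints show
# the same RAFT INDEX, 1 otherwise (or if no data rows were found).
raft_indexes_match() {
    awk -F'|' '
        # Until the header row is found, look for the RAFT INDEX column.
        col == 0 {
            for (i = 1; i <= NF; i++) {
                h = $i
                gsub(/^ +| +$/, "", h)
                if (h == "RAFT INDEX") col = i
            }
            next
        }
        # After the header, rows starting with "|" are data rows.
        /^ *\|/ {
            v = $col
            gsub(/ /, "", v)
            if (n++ == 0) first = v
            else if (v != first) bad = 1
        }
        END {
            if (n > 0 && !bad) exit 0
            exit 1
        }
    '
}

# Example:
#   ETCDCTL_API=3 etcdctl --endpoints http://192.168.149.63:2379,http://192.168.149.62:2379,http://192.168.149.61:2379 \
#       endpoint status -w table | raft_indexes_match && echo "in sync"
```

Locating the column by its header name instead of by position is a deliberate choice, since the set of columns in this table is not the same across etcd versions.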
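Summary: this migration approach is essentially only usable when you can still log in to the node being migrated and still read its data directory. If the machine is already dead and the original data cannot be reached, this approach is not a good fit. It is a migration, after all: everything has to be healthy for it to count as one, and when things are not healthy it is called disaster recovery 😆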