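There are two ways to move an etcd member to another machine: migrate its data directory, or add a new member, wait for it to finish syncing, and then remove the old one. This article covers the first approach: migrating an etcd node by moving its data directory.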
 
Test environment

Current etcd cluster information:

    > ETCDCTL_API=3 etcdctl --endpoints http://192.168.149.60:2379,http://192.168.149.62:2379,http://192.168.149.61:2379 member list
    1161d5b4260241e3, started, lv-etcd-research-alpha-1, http://192.168.149.60:2380, http://192.168.149.60:2379
    4252aec339d438d9, started, lv-etcd-research-alpha-3, http://192.168.149.62:2380, http://192.168.149.62:2379
    e6f45ed7d9402b75, started, lv-etcd-research-alpha-2, http://192.168.149.61:2380, http://192.168.149.61:2379

    > ETCDCTL_API=3 etcdctl --endpoints http://192.168.149.60:2379,http://192.168.149.62:2379,http://192.168.149.61:2379 endpoint status -w table
    +----------------------------+------------------+---------+---------+-----------+-----------+------------+
    |          ENDPOINT          |        ID        | VERSION | DB SIZE | IS LEADER | RAFT TERM | RAFT INDEX |
    +----------------------------+------------------+---------+---------+-----------+-----------+------------+
    | http://192.168.149.60:2379 | 1161d5b4260241e3 |  3.2.28 |   18 MB |     false |         7 |     124802 |
    | http://192.168.149.62:2379 | 4252aec339d438d9 |  3.2.28 |   18 MB |     false |         7 |     124802 |
    | http://192.168.149.61:2379 | e6f45ed7d9402b75 |  3.2.28 |   18 MB |      true |         7 |     124802 |
    +----------------------------+------------------+---------+---------+-----------+-----------+------------+
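The goal this time is to migrate node 192.168.149.60 to node 192.168.149.63.

Overall migration steps

1. Stop the etcd service on 192.168.149.60. If the process has already died, you can skip this step 😆, provided you are sure it is thoroughly dead and will not come back to life halfway through the migration…
2. Copy the data from the old machine into the corresponding directory on the new machine.
3. Run member update from any healthy node to change the member's peerURLs to the new machine's IP:Port.
4. Copy the config file from the old machine to the new one, change the addresses to the new machine's IP, make sure it points at the correct data directory, and start etcd.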
 
Step 1: Stop the service

On the node to be migrated, kill the etcd process. If at all possible, do not use kill -9; prefer a graceful shutdown.

Query the cluster status at this point:

    > ETCDCTL_API=3 etcdctl --endpoints http://192.168.149.60:2379,http://192.168.149.62:2379,http://192.168.149.61:2379 endpoint status -w table
    Failed to get the status of endpoint http://192.168.149.60:2379 (context deadline exceeded)
    +----------------------------+------------------+---------+---------+-----------+-----------+------------+
    |          ENDPOINT          |        ID        | VERSION | DB SIZE | IS LEADER | RAFT TERM | RAFT INDEX |
    +----------------------------+------------------+---------+---------+-----------+-----------+------------+
    | http://192.168.149.62:2379 | 4252aec339d438d9 |  3.2.28 |   18 MB |     false |         7 |     124940 |
    | http://192.168.149.61:2379 | e6f45ed7d9402b75 |  3.2.28 |   18 MB |      true |         7 |     124940 |
    +----------------------------+------------------+---------+---------+-----------+-----------+------------+

Node 192.168.149.60 is now unreachable.
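If you would rather have a script confirm that the old endpoint is really unreachable before you touch its data, one option is to pick the failed endpoints out of the status output. This is only a sketch: `failed_endpoints` is a made-up helper name, and it assumes the `Failed to get the status of endpoint ...` line looks exactly like the one printed by etcdctl 3.2.28 in this article. Other versions may word it differently.

```shell
# Hypothetical helper: read `etcdctl endpoint status` output on stdin and
# print the URL of every endpoint that etcdctl could not reach.
# Assumes the error line format shown in this article (etcdctl 3.2.28).
failed_endpoints() {
  sed -n 's/^Failed to get the status of endpoint \([^ ]*\) .*$/\1/p'
}

# Usage idea (the error line may be written to stderr, hence 2>&1):
#   ETCDCTL_API=3 etcdctl --endpoints <all endpoints> endpoint status -w table 2>&1 | failed_endpoints
```

If the old node's client URL shows up in the result, the rest of the cluster can no longer reach it. It does not prove the process is gone, so still check on the old machine itself.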
Step 2: Migrate the data directory

Run on 192.168.149.63 (create the data directory in advance):

    > mkdir -p /var/lib/etcd/default.etcd

Run on 192.168.149.60, from inside the etcd data directory that contains member (pack and send):

    > tar -cvzf member.tar.gz member
    > scp member.tar.gz root@192.168.149.63:/var/lib/etcd/default.etcd/

Run on 192.168.149.63 (unpack):

    > cd /var/lib/etcd/default.etcd/
    > tar -xvzf member.tar.gz
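Before moving on, it can be worth checking that the unpacked member directory is byte-for-byte the same as the original. The sketch below is not part of the original procedure: `dir_digest` is a hypothetical helper, and it assumes `sha256sum` (GNU coreutils) is available on both machines. It only makes sense while etcd is stopped, since a running member keeps writing to its files.

```shell
# Hypothetical helper: print a single SHA-256 digest that summarises the
# path and content of every regular file under the given directory.
dir_digest() {
  (
    cd "$1" || exit 1
    # Hash each file, sort by path for a stable order, then hash the list.
    find . -type f -exec sha256sum {} + \
      | LC_ALL=C sort -k 2 \
      | sha256sum \
      | cut -d ' ' -f 1
  )
}

# Usage idea: run this on both machines and compare the two digests by eye.
#   dir_digest /var/lib/etcd/default.etcd/member
```

Matching digests mean the file names and contents agree. Ownership and permissions are not covered, so check those separately if etcd runs as a dedicated user.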
Step 3: Update the member information

Run the following on any healthy node to update the original member's peerURLs:

    > ETCDCTL_API=3 etcdctl --endpoints http://192.168.149.61:2379 member update 1161d5b4260241e3 --peer-urls="http://192.168.149.63:2380"
    Member 1161d5b4260241e3 updated in cluster 2c25150e88501a13

--endpoints http://192.168.149.61:2379: node 192.168.149.60 has been stopped, so another endpoint has to be used to operate on the cluster.

1161d5b4260241e3 is the member ID of 192.168.149.60. If you have forgotten it, run member list to look it up.
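Looking up the ID by hand is easy to get wrong when the IDs are this long, so here is a small sketch that pulls it out of the `member list` output by peer URL. `member_id_for_peer` is a hypothetical helper name, and the parsing assumes the default comma-separated output shown in the test environment section, with exactly one peer URL per member.

```shell
# Hypothetical helper: read the default (comma-separated) `member list`
# output on stdin and print the ID of the member whose peer URL equals $1.
# Field layout assumed: ID, STATUS, NAME, PEER ADDRS, CLIENT ADDRS
member_id_for_peer() {
  awk -F', ' -v url="$1" '$4 == url { print $1 }'
}

# Usage idea:
#   ETCDCTL_API=3 etcdctl --endpoints http://192.168.149.61:2379 member list \
#     | member_id_for_peer http://192.168.149.60:2380
```

The match is an exact string comparison on the fourth field, so a member that advertises several peer URLs would not be found by this version.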
The output shows that the command succeeded. The member list now looks like this:

    > ETCDCTL_API=3 etcdctl --endpoints http://192.168.149.60:2379,http://192.168.149.62:2379,http://192.168.149.61:2379 member list -w table
    +------------------+---------+--------------------------+----------------------------+----------------------------+
    |        ID        | STATUS  |           NAME           |         PEER ADDRS         |        CLIENT ADDRS        |
    +------------------+---------+--------------------------+----------------------------+----------------------------+
    | 1161d5b4260241e3 | started | lv-etcd-research-alpha-1 | http://192.168.149.63:2380 | http://192.168.149.60:2379 |
    | 4252aec339d438d9 | started | lv-etcd-research-alpha-3 | http://192.168.149.62:2380 | http://192.168.149.62:2379 |
    | e6f45ed7d9402b75 | started | lv-etcd-research-alpha-2 | http://192.168.149.61:2380 | http://192.168.149.61:2379 |
    +------------------+---------+--------------------------+----------------------------+----------------------------+

In the first row, PEER ADDRS has been updated to http://192.168.149.63:2380, but CLIENT ADDRS is still the old http://192.168.149.60:2379. Don't worry about this: once the new node starts, the value changes to the correct address.
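Step 4: Start the service on the new node

Before starting the service on the new node, remember to copy the config file over from the old node. After copying it, be sure to update the parameters.

    # If the following two parameters use 0.0.0.0, no change is needed; if they name a specific IP, change .60 to .63
    ETCD_LISTEN_PEER_URLS
    ETCD_LISTEN_CLIENT_URLS

    # Change the IP address in the following two parameters to the new machine's IP
    ETCD_INITIAL_ADVERTISE_PEER_URLS
    ETCD_ADVERTISE_CLIENT_URLS

    # In the cluster information below, also replace the old IP with the new machine's IP
    ETCD_INITIAL_CLUSTER

Of the parameters above, ETCD_LISTEN_PEER_URLS, ETCD_LISTEN_CLIENT_URLS and ETCD_ADVERTISE_CLIENT_URLS matter most. They must match the new machine's IP address.

The member was reconfigured at runtime and starts from an existing data directory, so the other two INITIAL parameters no longer have any effect at startup. It is still best to change them to the new machine's IP so that nobody is confused when reading the config file later.

Likewise, ETCD_INITIAL_CLUSTER_STATE="new" can be left as it is, because it has no effect here.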
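The IP replacement can be done by hand, or with something like the sketch below. `rewrite_ip` is a hypothetical helper, not an etcd tool. It assumes the old address appears literally in the file and only replaces it where it is followed by a non-digit or the end of the line, so 192.168.149.61 and 192.168.149.62 in ETCD_INITIAL_CLUSTER are left alone.

```shell
# Hypothetical helper: copy stdin to stdout, replacing the old IP ($1)
# with the new IP ($2). Dots in the old IP are escaped so they match
# literally instead of acting as regex wildcards.
rewrite_ip() {
  old_re=$(printf '%s' "$1" | sed 's/\./\\./g')
  # First expression: old IP followed by a non-digit (":" in a URL, a quote, ...).
  # Second expression: old IP sitting at the very end of a line.
  sed "s/${old_re}\([^0-9]\)/$2\1/g; s/${old_re}\$/$2/"
}

# Usage idea (both file names are placeholders, use your real config path):
#   rewrite_ip 192.168.149.60 192.168.149.63 < etcd.conf.old > etcd.conf.new
```

Writing to a second file instead of editing in place makes it easy to diff the result before starting etcd with it. Listen parameters set to 0.0.0.0 are untouched, as described above.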
Start the etcd service.

Cluster status:

    > ETCDCTL_API=3 etcdctl --endpoints http://192.168.149.63:2379,http://192.168.149.62:2379,http://192.168.149.61:2379 endpoint status -w table
    +----------------------------+------------------+---------+---------+-----------+-----------+------------+
    |          ENDPOINT          |        ID        | VERSION | DB SIZE | IS LEADER | RAFT TERM | RAFT INDEX |
    +----------------------------+------------------+---------+---------+-----------+-----------+------------+
    | http://192.168.149.63:2379 | 1161d5b4260241e3 |  3.2.28 |   18 MB |     false |         7 |     128011 |
    | http://192.168.149.62:2379 | 4252aec339d438d9 |  3.2.28 |   18 MB |     false |         7 |     128011 |
    | http://192.168.149.61:2379 | e6f45ed7d9402b75 |  3.2.28 |   18 MB |      true |         7 |     128011 |
    +----------------------------+------------------+---------+---------+-----------+-----------+------------+

    > ETCDCTL_API=3 etcdctl --endpoints http://192.168.149.63:2379,http://192.168.149.62:2379,http://192.168.149.61:2379 member list -w table
    +------------------+---------+--------------------------+----------------------------+----------------------------+
    |        ID        | STATUS  |           NAME           |         PEER ADDRS         |        CLIENT ADDRS        |
    +------------------+---------+--------------------------+----------------------------+----------------------------+
    | 1161d5b4260241e3 | started | lv-etcd-research-alpha-1 | http://192.168.149.63:2380 | http://192.168.149.63:2379 |
    | 4252aec339d438d9 | started | lv-etcd-research-alpha-3 | http://192.168.149.62:2380 | http://192.168.149.62:2379 |
    | e6f45ed7d9402b75 | started | lv-etcd-research-alpha-2 | http://192.168.149.61:2380 | http://192.168.149.61:2379 |
    +------------------+---------+--------------------------+----------------------------+----------------------------+

RAFT INDEX is now the same on all three members of the "new" cluster, which means the new node 192.168.149.63 has caught up with the cluster's data.

PEER ADDRS and CLIENT ADDRS both show the correct addresses as well.

At this point, the node migration has completed successfully.
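The "RAFT INDEX is the same everywhere" check can also be scripted. The sketch below is an illustration only: `raft_indexes_agree` is a hypothetical helper that parses the `-w table` layout shown in this article and finds the column by its RAFT INDEX header rather than by position, since the set of columns may differ between etcdctl versions. On a cluster that is taking writes the indexes keep moving, so a brief mismatch is not necessarily a problem. Retry before drawing conclusions.

```shell
# Hypothetical helper: read `etcdctl endpoint status -w table` output on
# stdin. Exit 0 if at least one endpoint row was seen and every row has the
# same RAFT INDEX value, otherwise exit 1.
raft_indexes_agree() {
  awk -F'|' '
    # Until the header row is found, look for the RAFT INDEX column.
    col == 0 {
      for (i = 1; i <= NF; i++) {
        name = $i
        gsub(/^ +| +$/, "", name)
        if (name == "RAFT INDEX") col = i
      }
      next
    }
    # Data rows: the endpoint column holds an http or https URL.
    $2 ~ /^ *https?:/ {
      v = $col
      gsub(/ /, "", v)
      seen++
      if (seen == 1) first = v
      else if (v != first) differ = 1
    }
    END { exit ((seen > 0 && !differ) ? 0 : 1) }
  '
}

# Usage idea:
#   ETCDCTL_API=3 etcdctl --endpoints <all endpoints> endpoint status -w table \
#     | raft_indexes_agree && echo "all members at the same raft index"
```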
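Summary: this approach basically only works while you can still log in to the node being migrated and still read its data directory. If the machine is already dead and the original data cannot be reached, this method is not suitable. A migration, after all, needs everything to be working; a migration where things are not working is called disaster recovery 😆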