Merge pull request #18 from spidernet-io/pr/welan/op
update hoststatus
weizhoublue authored Dec 30, 2024
2 parents 14136f6 + 6e42e1c commit 8a4203b
Showing 12 changed files with 218 additions and 83 deletions.
2 changes: 1 addition & 1 deletion VERSION
@@ -1 +1 @@
v0.2.0
v0.3.0
4 changes: 2 additions & 2 deletions chart/Chart.yaml
@@ -3,10 +3,10 @@ name: bmc-operator
description: A Helm chart for BMC Operator

# This is the chart version, which will be taken from VERSION file
version: 0.2.0
version: 0.3.0

# This is the version number of the application being deployed, which will be taken from VERSION file
appVersion: "0.2.0"
appVersion: "0.3.0"

type: application

4 changes: 4 additions & 0 deletions chart/crds/bmc.spidernet.io_hoststatuses.yaml
@@ -55,6 +55,10 @@ spec:
properties:
basic:
properties:
activeDhcpClient:
description: ActiveDhcpClient specifies this host is an active
dhcp client when type is dhcp
type: boolean
https:
type: boolean
ipAddr:
5 changes: 3 additions & 2 deletions doc/usage/dhcp.md
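@@ -16,8 +16,9 @@ The DHCP server in the agent supports pinning a DHCP client's IP in the DHCP server
2. **Fixed IP binding** (`EnableBindDhcpIP = true`):
- All allocated DHCP IPs are persisted into the DHCP server configuration, which binds each IP address to its MAC address
- When a DHCP client on the network is allocated a new IP, the corresponding hoststatus object is created
- When a DHCP client on the network releases its IP, the corresponding hoststatus object is not deleted automatically
- To remove an IP binding from the DHCP server configuration, manually delete the corresponding hoststatus object; the backend then updates the DHCP server configuration automatically to remove the binding
- When a DHCP client on the network releases its IP, the corresponding hoststatus object is not deleted automatically
- To remove the binding between an IP and a MAC address from the DHCP server configuration, follow these steps:
First, enter the agent pod and inspect the DHCP server's live lease file /var/lib/dhcp/bmc-clusteragent-dhcpd.leases, then confirm and delete the IP address you want to unbind. Next, run `kubectl get hoststatus -l status.basic.ipAddr=<IP>` to view the hoststatus object, confirm that its IP and MAC address are the ones you intend to delete, and then manually delete it with `kubectl delete hoststatus -l status.basic.ipAddr=192.168.0.101`. Finally, the backend automatically updates the DHCP server configuration to unbind the IP and MAC address (you can confirm this by checking the file /etc/dhcp/dhcpd.conf inside the agent pod).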

### Pinning static IPs created through hostendpoint objects

48 changes: 31 additions & 17 deletions doc/usage/quickstart.md
@@ -49,7 +49,6 @@ clusterAgent:
# When enableDhcpServer is enabled, a data storage method must be configured
storage:
type: "pvc" # How the DHCP server stores the IP data allocated to clients; supports pvc (for production) and hostPath (for POC environments)
EOF

# Install the BMC components
@@ -59,9 +58,9 @@ helm install bmc bmc/bmc-operator \

# Verify the installation result
kubectl get pod -n bmc
NAME READY STATUS RESTARTS AGE
agent-bmc-clusteragent-6b9695698b-hphkj 1/1 Running 0 39m
bmc-bmc-operator-7b4986f89c-bd9j9 1/1 Running 0 40m
NAME READY STATUS RESTARTS AGE
agent-bmc-clusteragent-6b9695698b-hphkj 1/1 Running 0 39m
bmc-bmc-operator-7b4986f89c-bd9j9 1/1 Running 0 40m
```

### Multi-cluster management, deployed in macvlan mode
@@ -111,7 +110,6 @@ clusterAgent:
password: "password" # Default BMC password for all managed hosts
storage:
type: "pvc" # How the DHCP server stores the IP data allocated to clients
EOF

# Install the BMC components
@@ -121,9 +119,9 @@ helm install bmc bmc/bmc-operator \

# Verify the installation result
kubectl get pod -n bmc
NAME READY STATUS RESTARTS AGE
agent-bmc-clusteragent-6b9695698b-hphkj 1/1 Running 0 39m
bmc-bmc-operator-7b4986f89c-bd9j9 1/1 Running 0 40m
NAME READY STATUS RESTARTS AGE
agent-bmc-clusteragent-6b9695698b-hphkj 1/1 Running 0 39m
bmc-bmc-operator-7b4986f89c-bd9j9 1/1 Running 0 40m
```

## Onboarding hosts
@@ -135,14 +133,14 @@ The BMC component supports running multiple agents to manage multiple clusters. After installation, the
```bash
# Check the agent instance status
~# kubectl get clusteragent
NAME READY
bmc-clusteragent true
NAME READY
bmc-clusteragent true

# Check the Pod status of the agent and the operator
~# kubectl get pod -n bmc
NAME READY STATUS RESTARTS AGE
agent-bmc-clusteragent-6b9695698b-hphkj 1/1 Running 0 39m
bmc-bmc-operator-7b4986f89c-bd9j9 1/1 Running 0 40m
NAME READY STATUS RESTARTS AGE
agent-bmc-clusteragent-6b9695698b-hphkj 1/1 Running 0 39m
bmc-bmc-operator-7b4986f89c-bd9j9 1/1 Running 0 40m

# View the detailed configuration of the agent instance
~# kubectl get clusteragent bmc-clusteragent -o yaml
@@ -181,7 +179,7 @@ status:
```bash
# List the hoststatus instances; each one represents a managed BMC host
# A HEALTHY value of true means the host has been onboarded successfully
~# kubectl get hoststatus
~# kubectl get hoststatus -l bmc.spidernet.io/mode=dhcp
NAME CLUSTERAGENT HEALTHY IPADDR TYPE AGE
bmc-clusteragent-192-168-0-100 bmc-clusteragent true 192.168.0.100 dhcp 48m
bmc-clusteragent-192-168-0-101 bmc-clusteragent true 192.168.0.101 dhcp 48m
@@ -279,14 +277,30 @@ NAME CLUSTERAGENT HOSTIP
device10 bmc-clusteragent 10.64.64.42

# Check the status of all hosts and confirm that the newly added host is HEALTHY
~# kubectl get hoststatus
~# kubectl get hoststatus -l bmc.spidernet.io/mode=hostEndpoint
NAME CLUSTERAGENT HEALTHY IPADDR TYPE AGE
bmc-clusteragent-192-168-0-100 bmc-clusteragent true 192.168.0.100 dhcp 48m
bmc-clusteragent-192-168-0-101 bmc-clusteragent true 192.168.0.101 dhcp 48m
bmc-clusteragent-10-64-64-42 bmc-clusteragent true 10.64.64.42 hostEndpoint 1m
```

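## Host operations

Once hosts are onboarded, you can perform operations on them such as power management. See the [Host operations](./action.md) section for details.

## Troubleshooting

1. Check the HEALTHY status of the hoststatus object. If it is not healthy, the BMC of that host cannot be reached: the IP address may be wrong, the BMC username or password may be wrong, or the BMC host may not support the Redfish protocol. The fault has to be investigated manually.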

```bash
# kubectl get hoststatus
NAME CLUSTERAGENT HEALTHY IPADDR TYPE AGE
bmc-clusteragent-192-168-0-101 bmc-clusteragent true 192.168.0.101 dhcp 2d14h
device-safe bmc-clusteragent true 10.64.64.42 hostEndpoint 2d14h
gpu bmc-clusteragent true 10.64.64.94 hostEndpoint 2d14h
test-hostendpoint bmc-clusteragent true 192.168.0.50 hostEndpoint 2d14h
```

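2. For hosts onboarded through DHCP with IP and MAC binding enabled, remove the binding between an IP and a MAC address as follows:

1. Enter the agent pod and inspect the DHCP server's live lease file `/var/lib/dhcp/bmc-clusteragent-dhcpd.leases`, then confirm and delete the IP address you want to unbind
2. Run `kubectl get hoststatus -l status.basic.ipAddr=<IP>` to view the hoststatus object, confirm that its IP and MAC address are the ones you intend to delete, and then manually delete it with `kubectl delete hoststatus -l status.basic.ipAddr=192.168.0.101`
3. The backend automatically updates the DHCP server configuration to unbind the IP and MAC address (you can confirm this by checking the file `/etc/dhcp/dhcpd.conf` inside the agent pod)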
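
The first step of the procedure asks the operator to confirm an IP in the lease file before deleting anything. As a minimal sketch of that check, the Go program below looks up the MAC address recorded for an IP, assuming the file follows the usual ISC dhcpd lease syntax (`lease <ip> { ... hardware ethernet <mac>; }`). The function name `findLeaseMAC` and the sample lease data are illustrative and are not part of this project.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// findLeaseMAC scans dhcpd lease text and returns the MAC address from the
// last lease block recorded for ip. ok is false when no block for ip carries
// a "hardware ethernet" line. (Hypothetical helper, not from the project.)
func findLeaseMAC(leases, ip string) (mac string, ok bool) {
	scanner := bufio.NewScanner(strings.NewReader(leases))
	inTarget := false
	for scanner.Scan() {
		line := strings.TrimSpace(scanner.Text())
		switch {
		case strings.HasPrefix(line, "lease "):
			// A block header looks like: lease 192.168.0.101 {
			fields := strings.Fields(line)
			inTarget = len(fields) >= 2 && fields[1] == ip
		case line == "}":
			inTarget = false
		case inTarget && strings.HasPrefix(line, "hardware ethernet "):
			mac = strings.TrimSuffix(strings.TrimPrefix(line, "hardware ethernet "), ";")
			ok = true
		}
	}
	return mac, ok
}

func main() {
	// Made-up sample data in the assumed lease format.
	sample := `lease 192.168.0.100 {
  binding state active;
  hardware ethernet aa:bb:cc:dd:ee:01;
}
lease 192.168.0.101 {
  binding state active;
  hardware ethernet aa:bb:cc:dd:ee:02;
}`
	mac, ok := findLeaseMAC(sample, "192.168.0.101")
	fmt.Println(mac, ok)
}
```

Comparing the MAC found this way with the one stored in the hoststatus object is one way to make sure the right binding is being removed.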
4 changes: 4 additions & 0 deletions pkg/agent/hostendpoint/controller.go
@@ -126,6 +126,10 @@ func (r *HostEndpointReconciler) handleHostEndpoint(ctx context.Context, hostEnd
hostStatus := &bmcv1beta1.HostStatus{
ObjectMeta: metav1.ObjectMeta{
Name: name,
Labels: map[string]string{
bmcv1beta1.LabelIPAddr: hostEndpoint.Spec.IPAddr,
bmcv1beta1.LabelClientMode: bmcv1beta1.HostTypeEndpoint,
},
OwnerReferences: []metav1.OwnerReference{
{
APIVersion: bmcv1beta1.APIVersion,
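
The labels set in this hunk are what make commands such as `kubectl get hoststatus -l bmc.spidernet.io/mode=hostEndpoint` work. The sketch below illustrates equality-based label selection in plain Go. It assumes that `bmcv1beta1.LabelClientMode` resolves to the `bmc.spidernet.io/mode` key used in the quickstart; the key used for the IP label here is a placeholder, because the value of `bmcv1beta1.LabelIPAddr` is not shown in this commit.

```go
package main

import "fmt"

const (
	// Assumed to match bmcv1beta1.LabelClientMode (taken from the quickstart commands).
	modeLabel = "bmc.spidernet.io/mode"
	// Placeholder only: the real key behind bmcv1beta1.LabelIPAddr is not shown in the diff.
	ipLabel = "example.invalid/ip-addr"
)

// hostStatus is a simplified stand-in for the HostStatus object metadata.
type hostStatus struct {
	Name   string
	Labels map[string]string
}

// matches reports whether every key/value pair in selector is present in
// labels, which is how an equality-based selector like "-l k=v" behaves.
func matches(labels, selector map[string]string) bool {
	for k, v := range selector {
		if got, ok := labels[k]; !ok || got != v {
			return false
		}
	}
	return true
}

// filter returns the names of the items whose labels satisfy selector.
func filter(items []hostStatus, selector map[string]string) []string {
	var names []string
	for _, it := range items {
		if matches(it.Labels, selector) {
			names = append(names, it.Name)
		}
	}
	return names
}

func main() {
	items := []hostStatus{
		{Name: "bmc-clusteragent-192-168-0-100", Labels: map[string]string{modeLabel: "dhcp", ipLabel: "192.168.0.100"}},
		{Name: "device10", Labels: map[string]string{modeLabel: "hostEndpoint", ipLabel: "10.64.64.42"}},
	}
	fmt.Println(filter(items, map[string]string{modeLabel: "hostEndpoint"}))
}
```

Objects created before the labels were introduced would carry no such keys, so a selector of this kind would skip them.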
52 changes: 31 additions & 21 deletions pkg/agent/hoststatus/HostStatusReconcile.go
@@ -3,7 +3,6 @@ package hoststatus
import (
"context"
"fmt"
"reflect"
"sync"
"time"

@@ -26,7 +25,7 @@ var hostStatusLock = &sync.Mutex{}
// ------------------------------ update the spec.info of the hoststatus

// this is called by UpdateHostStatusAtInterval and UpdateHostStatusWrapper
func (c *hostStatusController) UpdateHostStatusInfo(name string, d *hoststatusdata.HostConnectCon) error {
func (c *hostStatusController) UpdateHostStatusInfo(name string, d *hoststatusdata.HostConnectCon) (bool, error) {

// local lock for updating each hostStatus
hostStatusLock.Lock()
@@ -50,7 +49,7 @@ func (c *hostStatusController) UpdateHostStatusInfo(name string, d *hoststatusda
err := c.client.Get(context.Background(), types.NamespacedName{Name: name}, existing)
if err != nil {
log.Logger.Errorf("Failed to get HostStatus %s: %v", name, err)
return err
return false, err
}
updated := existing.DeepCopy()

@@ -75,29 +74,29 @@ func (c *hostStatusController) UpdateHostStatusInfo(name string, d *hoststatusda
}

// Update the HostStatus
if !reflect.DeepEqual(updated.Status, existing.Status) {
if !compareHostStatus(updated.Status, existing.Status, log.Logger) {
log.Logger.Debugf("status changed, existing: %v, updated: %v", existing.Status, updated.Status)
updated.Status.LastUpdateTime = time.Now().UTC().Format(time.RFC3339)
if err := c.client.Status().Update(context.Background(), updated); err != nil {
log.Logger.Errorf("Failed to update status of HostStatus %s: %v", name, err)
return err
return true, err
}
log.Logger.Infof("Successfully updated HostStatus %s status", name)
} else {
log.Logger.Debugf("no need to updated HostStatus %s status", name)
return true, nil
}

return nil
return false, nil
}

// this is called by UpdateHostStatusAtInterval and processHostStatus
func (c *hostStatusController) UpdateHostStatusWrapper(name string) error {
func (c *hostStatusController) UpdateHostStatusInfoWrapper(name string) error {
syncData := make(map[string]hoststatusdata.HostConnectCon)

modeinfo := ""
if len(name) == 0 {
syncData = hoststatusdata.HostCacheDatabase.GetAll()
if len(syncData) == 0 {
return nil
}
modeinfo = " during periodic update"
} else {
d := hoststatusdata.HostCacheDatabase.Get(name)
if d != nil {
@@ -107,12 +106,18 @@ func (c *hostStatusController) UpdateHostStatusWrapper(name string) error {
log.Logger.Errorf("no cache data found for hostStatus %s ", name)
return fmt.Errorf("no cache data found for hostStatus %s ", name)
}
modeinfo = " during hoststatus reconcile"
}

for item, t := range syncData {
log.Logger.Debugf("update status of the hostStatus %s ", item)
if err := c.UpdateHostStatusInfo(item, &t); err != nil {
log.Logger.Errorf("failed to update HostStatus %s: %v", item, err)
if updated, err := c.UpdateHostStatusInfo(item, &t); err != nil {
log.Logger.Errorf("failed to update HostStatus %s %s: %v", item, modeinfo, err)
} else {
if updated {
log.Logger.Debugf("update status of the hostStatus %s %s", item, modeinfo)
} else {
log.Logger.Debugf("no need to update status of the hostStatus %s %s", item, modeinfo)
}
}
}

@@ -133,7 +138,7 @@ func (c *hostStatusController) UpdateHostStatusAtInterval() {
return
case <-ticker.C:
log.Logger.Debugf("update all hostStatus at interval ")
if err := c.UpdateHostStatusWrapper(""); err != nil {
if err := c.UpdateHostStatusInfoWrapper(""); err != nil {
log.Logger.Errorf("Failed to update host status: %v", err)
}
}
@@ -171,23 +176,27 @@ func (c *hostStatusController) processHostStatus(hostStatus *bmcv1beta1.HostStat
DhcpHost: hostStatus.Status.Basic.Type == bmcv1beta1.HostTypeDHCP,
})

// update the status.info of the hostStatus
if err := c.UpdateHostStatusWrapper(hostStatus.Name); err != nil {
logger.Errorf("failed to update HostStatus %s: %v", hostStatus.Name, err)
return err
if len(hostStatus.Status.Info) == 0 {
if err := c.UpdateHostStatusInfoWrapper(hostStatus.Name); err != nil {
logger.Errorf("failed to update HostStatus %s: %v", hostStatus.Name, err)
return err
}
} else {
logger.Debugf("HostStatus %s has already been processed, skipping the first time update", hostStatus.Name)
}

logger.Debugf("Successfully processed HostStatus %s", hostStatus.Name)
return nil
}

// Reconcile implements the reconcile.Reconciler interface
// It performs the first update of Info after a hoststatus is created (later updates are handled by UpdateHostStatusAtInterval)
func (c *hostStatusController) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
logger := log.Logger.With(
zap.String("hoststatus", req.Name),
)

logger.Info("Reconciling HostStatus")
logger.Debugf("Reconciling HostStatus %s", req.Name)

// Get the HostStatus
hostStatus := &bmcv1beta1.HostStatus{}
@@ -219,11 +228,12 @@ func (c *hostStatusController) Reconcile(ctx context.Context, req ctrl.Request)

// Process the HostStatus
if err := c.processHostStatus(hostStatus, logger); err != nil {
logger.Error(err, "Failed to process HostStatus")
logger.Error(err, "Failed to process HostStatus, will retry")
return ctrl.Result{
RequeueAfter: time.Second * 2,
}, err
}

logger.Debugf("Successfully processed HostStatus %s", hostStatus.Name)
return ctrl.Result{}, nil
}
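
This file's diff replaces `reflect.DeepEqual` with a call to `compareHostStatus`, whose body is not part of the commit page. Purely as an illustration of what such a helper could look like, the sketch below compares two status values field by field, ignores `LastUpdateTime`, and treats a nil `Info` map the same as an empty one (`reflect.DeepEqual` reports those two as different). The types are simplified stand-ins, the `Healthy` field name and the `int32` port type are assumptions, and `statusEqual` is a hypothetical name; the project's real helper may behave differently.

```go
package main

import "fmt"

// basicInfo and hostStatusStatus are simplified stand-ins for the project's
// types. Field names follow what is visible in the diff where possible.
type basicInfo struct {
	Type             string
	IpAddr           string
	Mac              string
	Port             int32 // assumed type
	Https            bool
	ActiveDhcpClient bool
}

type hostStatusStatus struct {
	Healthy        bool // assumed field name
	ClusterAgent   string
	LastUpdateTime string
	Basic          basicInfo
	Info           map[string]string
}

// statusEqual is a hypothetical comparison that ignores LastUpdateTime and
// returns the name of the first differing part to make debug logs useful.
func statusEqual(a, b hostStatusStatus) (bool, string) {
	if a.Healthy != b.Healthy {
		return false, "healthy"
	}
	if a.ClusterAgent != b.ClusterAgent {
		return false, "clusterAgent"
	}
	if a.Basic != b.Basic {
		return false, "basic"
	}
	// len(nil map) is 0, so nil and empty maps compare as equal here.
	if len(a.Info) != len(b.Info) {
		return false, "info"
	}
	for k, v := range a.Info {
		if other, ok := b.Info[k]; !ok || other != v {
			return false, "info[" + k + "]"
		}
	}
	return true, ""
}

func main() {
	existing := hostStatusStatus{Healthy: true, LastUpdateTime: "2024-12-30T00:00:00Z"}
	updated := hostStatusStatus{Healthy: true, Info: map[string]string{}}
	equal, field := statusEqual(existing, updated)
	fmt.Println(equal, field)

	updated.Info["Vendor"] = "example"
	equal, field = statusEqual(existing, updated)
	fmt.Println(equal, field)
}
```

Returning the differing field alongside the boolean is a design choice for this sketch; it lets the caller log why an update was issued without dumping both structs.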
21 changes: 12 additions & 9 deletions pkg/agent/hoststatus/dhcp.go
@@ -14,8 +14,7 @@ import (

const (
// retryDelay is the delay before retrying a failed operation
retryDelay = time.Second
dhcpBoundLabel = "bmc.spidernet.io/dhcp-ip-active"
retryDelay = time.Second
)

func shouldRetry(err error) bool {
@@ -109,7 +108,9 @@ func (c *hostStatusController) handleDHCPAdd(client dhcptypes.ClientInfo) error
ObjectMeta: metav1.ObjectMeta{
Name: name,
Labels: map[string]string{
dhcpBoundLabel: "true",
bmcv1beta1.LabelIPAddr: client.IP,
bmcv1beta1.LabelClientMode: bmcv1beta1.HostTypeDHCP,
bmcv1beta1.LabelClientActive: "true",
},
},
}
@@ -139,11 +140,12 @@ func (c *hostStatusController) handleDHCPAdd(client dhcptypes.ClientInfo) error
ClusterAgent: c.config.ClusterAgentName,
LastUpdateTime: time.Now().UTC().Format(time.RFC3339),
Basic: bmcv1beta1.BasicInfo{
Type: bmcv1beta1.HostTypeDHCP,
IpAddr: client.IP,
Mac: client.MAC,
Port: c.config.AgentObjSpec.Endpoint.Port,
Https: c.config.AgentObjSpec.Endpoint.HTTPS,
Type: bmcv1beta1.HostTypeDHCP,
IpAddr: client.IP,
Mac: client.MAC,
Port: c.config.AgentObjSpec.Endpoint.Port,
Https: c.config.AgentObjSpec.Endpoint.HTTPS,
ActiveDhcpClient: true,
},
Info: map[string]string{},
}
@@ -191,7 +193,8 @@ func (c *hostStatusController) handleDHCPDelete(client dhcptypes.ClientInfo) err
updated.Labels = make(map[string]string)
}
// Add or update the label
updated.Labels[dhcpBoundLabel] = "false"
updated.Labels[bmcv1beta1.LabelClientActive] = "false"
updated.Status.Basic.ActiveDhcpClient = false
// Update the object
if err := c.client.Update(context.Background(), updated); err != nil {
log.Logger.Errorf("Failed to update labels of HostStatus %s: %v", name, err)
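
Taken together, the add and delete handlers in this file keep a label and a status field in step: both are set to true when a lease is added, and both are flipped to false when the lease is released, while the object itself is kept. The sketch below models only that bookkeeping with plain Go types. The label key is a placeholder because the value of `bmcv1beta1.LabelClientActive` is not shown, and `setActive` is a hypothetical helper name.

```go
package main

import (
	"fmt"
	"strconv"
)

// Placeholder only: the real key behind bmcv1beta1.LabelClientActive is not shown in the diff.
const activeLabel = "example.invalid/dhcp-client-active"

// hostStatus is a simplified stand-in holding just the two pieces of state
// that the DHCP handlers keep in sync.
type hostStatus struct {
	Labels           map[string]string
	ActiveDhcpClient bool
}

// setActive records the lease state in both the label and the status field,
// creating the label map first if the object had none.
func setActive(hs *hostStatus, active bool) {
	if hs.Labels == nil {
		hs.Labels = make(map[string]string)
	}
	hs.Labels[activeLabel] = strconv.FormatBool(active)
	hs.ActiveDhcpClient = active
}

func main() {
	hs := &hostStatus{}

	setActive(hs, true) // lease added
	fmt.Println(hs.Labels[activeLabel], hs.ActiveDhcpClient)

	setActive(hs, false) // lease released: the object is kept, only marked inactive
	fmt.Println(hs.Labels[activeLabel], hs.ActiveDhcpClient)
}
```

Funnelling both writes through one helper is a choice made for this sketch so that the label and the field cannot drift apart; the commit itself sets them inline.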