Recently the YARN web UI listed several nodes under Unhealthy Nodes. Clicking through showed messages roughly like the following:
1/1 local-dirs are bad: xxxx(path)
1/1 log-dirs are bad: xxx(path)
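The official Apache documentation explains the disk checker as follows.
The NodeManager process checks the health of its disks, including free space, read/write permissions, and other custom cases. If the check fails, the node is marked unhealthy and that status is reported to the ResourceManager. No new containers are started on the node after that, so it receives no compute tasks.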
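A custom script can also be configured to check node state beyond the disks. The documentation I read treats a non-zero exit status as unhealthy; note that the documentation for newer Hadoop releases instead lists a timeout, an exception, or an output line starting with ERROR as the triggers, and says a non-zero exit code alone is not counted as a failure.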
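A minimal sketch of how such a script might be wired up in yarn-site.xml. The property names are the ones I believe the NodeManager health checker uses; the script path and the numeric values are hypothetical placeholders, so check them against the yarn-default.xml of your Hadoop version.

```xml
<!-- Goes inside the existing <configuration> element of yarn-site.xml. -->
<property>
  <!-- Hypothetical path: replace with the location of your own script. -->
  <name>yarn.nodemanager.health-checker.script.path</name>
  <value>/path/to/node_health_check.sh</value>
</property>
<property>
  <!-- How often the health script runs, in milliseconds (example value). -->
  <name>yarn.nodemanager.health-checker.interval-ms</name>
  <value>600000</value>
</property>
<property>
  <!-- How long the script may run before it is treated as timed out (example value). -->
  <name>yarn.nodemanager.health-checker.script.timeout-ms</name>
  <value>120000</value>
</property>
```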
The disk checker parameters, set in yarn-site.xml, are listed in the table below.
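Solution:
Checking the nodes showed that the disks configured for yarn.nodemanager.local-dirs and yarn.nodemanager.log-dirs were more than 90% full. As a temporary fix, I raised yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage on the affected nodes (this is the parameter whose default is 90, not min-healthy-disks). Once more storage disks are added and the data is rebalanced, it will go back to the default value of 90.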
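The temporary adjustment could look roughly like this in yarn-site.xml on an affected node. The value 95 is only an illustrative assumption; the original change did not record the exact number, so pick one that fits your actual disk usage.

```xml
<!-- Goes inside the existing <configuration> element of yarn-site.xml. -->
<property>
  <!-- Temporary override; the default is 90. The value 95 is an assumed example. -->
  <name>yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage</name>
  <value>95.0</value>
</property>
```

Raising the threshold only hides the symptom, which is why it is reverted to 90 after storage is added.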
Configuration Name | Allowed Values | Description |
---|---|---|
yarn.nodemanager.disk-health-checker.enable | true, false | Enable or disable the disk health checker service |
yarn.nodemanager.disk-health-checker.interval-ms | Positive integer | The interval, in milliseconds, at which the disk checker should run; the default value is 2 minutes |
yarn.nodemanager.disk-health-checker.min-healthy-disks | Float between 0-1 | The minimum fraction of disks that must pass the check for the NodeManager to mark the node as healthy; the default is 0.25 |
yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage | Float between 0-100 | The maximum percentage of disk space that may be utilized before a disk is marked as unhealthy by the disk checker service. This check is run for every disk used by the NodeManager. The default value is 90 i.e. 90% of the disk can be used. |
yarn.nodemanager.disk-health-checker.min-free-space-per-disk-mb | Integer | The minimum amount of free space that must be available on the disk for the disk checker service to mark the disk as healthy. This check is run for every disk used by the NodeManager. The default value is 0 i.e. the entire disk can be used. |