This is the documentation for Cloudera Manager 4.8.3.

ZooKeeper Health Checks

ZooKeeper Canary

This ZooKeeper service-level health check verifies that basic client operations work and complete in a reasonable amount of time. It reports the results of a periodic "canary" test that performs the following sequence of operations.

First, the canary test connects to and establishes a session (the root session) with the ZooKeeper service and creates a permanent znode to serve as the root of all canary operations. It then connects to and establishes sessions (the child sessions) with each ZooKeeper server of the service. Each child session is used to create an ephemeral child znode under the canary root. After the child znodes have been created, each child session registers watches that await deletion events on each of the child znodes. The canary test then deletes each child znode and verifies that every child session has received a deletion notification for every child znode. Finally, the canary test closes all the child sessions, deletes the root znode, and closes the root session.

The check returns "Bad" health if the establishment of the root session to the ZooKeeper service fails, the creation of znodes (permanent or ephemeral) fails, the deletion of znodes fails, or the retrieval of child znodes of the root znode fails. The check returns "Concerning" health when the canary test succeeds but one or more servers could not participate in the canary test operations, or when the canary test runs too slowly.

A failure of this health check may indicate that ZooKeeper is failing to satisfy client requests correctly or in a timely fashion. Check the status of the ZooKeeper servers, and look in the ZooKeeper server logs for more details. This test can be enabled or disabled using the ZooKeeper Canary Health Check setting, a ZooKeeper service monitoring setting. The ZooKeeper Canary Root Znode Path, ZooKeeper Canary Connection Timeout, ZooKeeper Canary Session Timeout, and ZooKeeper Canary Operation Timeout settings control the operation of the canary.
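The mapping from canary outcomes to health states described above can be sketched as a small function. This is only an illustration of the stated rules, not Cloudera Manager's implementation: the `CanaryRun` record, its field names, and the `slow_threshold_ms` parameter are hypothetical, and the documentation does not say what duration counts as "too slowly".

```python
from dataclasses import dataclass


@dataclass
class CanaryRun:
    """Hypothetical summary of one canary run (field names are illustrative)."""
    root_session_ok: bool       # root session was established
    znode_create_ok: bool       # permanent and ephemeral znodes were created
    znode_delete_ok: bool       # znodes were deleted
    list_children_ok: bool      # children of the root znode were retrieved
    servers_total: int          # ZooKeeper servers in the service
    servers_participating: int  # servers that took part in the canary
    elapsed_ms: int             # total duration of the run


def canary_health(run: CanaryRun, slow_threshold_ms: int) -> str:
    """Classify a canary run as "Good", "Concerning", or "Bad".

    slow_threshold_ms is an assumed cutoff for "runs too slowly".
    """
    # Any failed core operation makes the check "Bad".
    if not (run.root_session_ok and run.znode_create_ok
            and run.znode_delete_ok and run.list_children_ok):
        return "Bad"
    # The run succeeded, but a server was left out or the run was slow.
    if (run.servers_participating < run.servers_total
            or run.elapsed_ms > slow_threshold_ms):
        return "Concerning"
    return "Good"
```

For example, a run in which every operation succeeds but only 2 of 3 servers participate would be classified as "Concerning" under these assumptions.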

Short Name: ZooKeeper Canary

| Property Name | Description | Template Name | Default Value | Unit |
|---|---|---|---|---|
| ZooKeeper Canary Connection Timeout | Configures the timeout used by the canary for connection establishment with ZooKeeper servers | zookeeper_canary_connection_timeout | 10000 | MILLISECONDS |
| ZooKeeper Canary Health Check | Enables the health check that a client can connect to ZooKeeper and perform basic operations | zookeeper_canary_health_enabled | true | no unit |
| ZooKeeper Canary Operation Timeout | Configures the timeout used by the canary for ZooKeeper operations | zookeeper_canary_operation_timeout | 30000 | MILLISECONDS |
| ZooKeeper Canary Root Znode Path | Configures the path of the root znode under which all canary updates are performed | zookeeper_canary_root_path | /cloudera_manager_zookeeper_canary | no unit |
| ZooKeeper Canary Session Timeout | Configures the timeout used by the canary sessions with ZooKeeper servers | zookeeper_canary_session_timeout | 30000 | MILLISECONDS |

ZooKeeper Servers Health

This ZooKeeper service-level health check verifies that enough of the ZooKeeper servers in the cluster are healthy. The check returns "Concerning" health if the number of healthy ZooKeeper servers falls below a warning threshold, expressed as a percentage of the total number of ZooKeeper servers. The check returns "Bad" health if the number of healthy and "Concerning" ZooKeeper servers falls below a critical threshold, expressed as a percentage of the total number of ZooKeeper servers.

For example, suppose this check is configured with a warning threshold of 80% and a critical threshold of 60% for a cluster of 5 ZooKeeper servers. The check returns "Good" health if 4 or more ZooKeeper servers have "Good" health. It returns "Concerning" health if fewer than 4 servers have "Good" health but at least 3 servers have either "Good" or "Concerning" health. It returns "Bad" health if more than 2 ZooKeeper servers have "Bad" health.

A failure of this health check indicates unhealthy ZooKeeper servers. Check the status of the individual ZooKeeper servers for more information. This test can be configured using the Healthy ZooKeeper Server Monitoring Thresholds setting, a ZooKeeper service-wide monitoring setting.
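The threshold arithmetic in this check can be written out as a short function. The sketch below only restates the rules given here; the function name and its parameters are illustrative and are not part of Cloudera Manager.

```python
def servers_health(good: int, concerning: int, bad: int,
                   warning_pct: float, critical_pct: float) -> str:
    """Classify service health from per-server health counts.

    good, concerning, bad: number of ZooKeeper servers in each state.
    warning_pct, critical_pct: thresholds as percentages of all servers.
    """
    total = good + concerning + bad
    if total == 0:
        raise ValueError("no ZooKeeper servers to evaluate")
    # Compare without dividing: x / total * 100 < pct  <=>  100 * x < pct * total
    if 100 * (good + concerning) < critical_pct * total:
        return "Bad"
    if 100 * good < warning_pct * total:
        return "Concerning"
    return "Good"


# The 5-server example with warning=80% and critical=60%:
print(servers_health(4, 0, 1, 80, 60))  # Good
print(servers_health(3, 0, 2, 80, 60))  # Concerning
print(servers_health(2, 0, 3, 80, 60))  # Bad
```

With thresholds of warning 99% and critical 51%, the same function reports "Concerning" as soon as any one server of a 3-server ensemble is not in "Good" health, and "Bad" once fewer than 2 of the 3 servers are in "Good" or "Concerning" health.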

Short Name: ZooKeeper Servers Health

| Property Name | Description | Template Name | Default Value | Unit |
|---|---|---|---|---|
| Healthy ZooKeeper Server Monitoring Thresholds | The health check thresholds of the overall ZooKeeper service health. The check returns "Concerning" health if the percentage of "Healthy" ZooKeeper servers falls below the warning threshold. The check is unhealthy if the total percentage of "Healthy" and "Concerning" ZooKeeper servers falls below the critical threshold. | zookeeper_servers_healthy_thresholds | critical:51.000000, warning:99.000000 | PERCENT |

ZooKeeper ZXID Rollover

This ZooKeeper service-level health check monitors the current zxid to ensure that its xid component does not roll over. The zxid is a 64-bit number maintained by ZooKeeper and is made up of two parts: the higher-order 32 bits are the epoch and the lower-order 32 bits are the xid. This check concerns itself with the xid portion, which has a maximum possible value of 0xffffffff. If the xid reaches this value, a rollover can occur.

The check returns "Concerning" or "Bad" health if the current xid is above a warning threshold or a critical threshold, respectively. Each threshold is expressed as a percentage of the maximum possible xid. For example, if this check is configured with a warning percentage threshold of 80% and a critical percentage threshold of 95% for a ZooKeeper service, this check would return "Good" health if the current xid is less than 0xcccccccc. This check would return "Concerning" health if the current xid is between 0xcccccccc and 0xf3333333. If the current xid is above 0xf3333333, this check would return "Bad" health.

A failure of this health check indicates that an overflow of the xid may occur in the near future unless the corrective action of forcing a leader election is taken. This test is disabled by default because rollover of the xid is a concern only in releases prior to CDH3u4. For those releases, the test must be enabled explicitly. This test can be configured using the ZooKeeper Current Zxid Monitoring Percentage Thresholds setting, a ZooKeeper service-wide monitoring setting.
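The split of a zxid into epoch and xid, and the percentage comparison against the thresholds, can be illustrated with a few lines of code. This is a sketch of the arithmetic described here, not the check's actual implementation; the function names are illustrative, and `None` is used as an assumed stand-in for a threshold of "never".

```python
MAX_XID = 0xFFFFFFFF  # largest value the 32-bit xid can hold


def split_zxid(zxid: int) -> tuple:
    """Return (epoch, xid): the high and low 32 bits of a 64-bit zxid."""
    return zxid >> 32, zxid & MAX_XID


def zxid_rollover_health(zxid: int, warning_pct=None, critical_pct=None) -> str:
    """Classify the xid portion of zxid against percentage thresholds.

    A threshold of None means "never", so that level is not evaluated.
    """
    _, xid = split_zxid(zxid)
    # xid / MAX_XID * 100 > pct  <=>  100 * xid > pct * MAX_XID
    if critical_pct is not None and 100 * xid > critical_pct * MAX_XID:
        return "Bad"
    if warning_pct is not None and 100 * xid > warning_pct * MAX_XID:
        return "Concerning"
    return "Good"


# Epoch 7 with an xid well inside each band, thresholds 80% and 95%:
print(zxid_rollover_health((7 << 32) | 0x10000000, 80, 95))  # Good
print(zxid_rollover_health((7 << 32) | 0xD0000000, 80, 95))  # Concerning
print(zxid_rollover_health((7 << 32) | 0xF4000000, 80, 95))  # Bad
```

With both thresholds left at `None`, mirroring a configuration of "never", the function returns "Good" for any zxid, which corresponds to the check being effectively disabled.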

Short Name: ZXID Rollover

| Property Name | Description | Template Name | Default Value | Unit |
|---|---|---|---|---|
| ZooKeeper Current Zxid Monitoring Percentage Thresholds | The health check thresholds for monitoring of the xid portion of the current zxid of the service. Specified as a percentage of the maximum possible xid setting of 0xffffffff. | zookeeper_current_zxid_percentage_thresholds | critical:never, warning:never | PERCENT |