Configuring Monitoring Settings

There are several types of monitoring settings you can configure in Cloudera Manager:

  • For a service or role for which monitoring is provided, you can enable and disable selected health checks and events, configure how those health checks factor into the overall health of the service, and modify thresholds for the status of certain health checks. Cloudera Manager supports this type of monitoring configuration for HDFS, MapReduce, HBase, ZooKeeper, Flume, and Impala.
  • For individual hosts you can also disable or enable selected health checks, modify thresholds, and enable or disable health alerts.
  • Each of the Cloudera Management Services has its own parameters that control how much data is retained by that service. For some monitoring functions, the amount of retained data can grow very large, so it may become necessary to adjust the limits.
  • For the Cloudera Management Services, you can configure monitoring settings for the monitoring roles themselves: you can enable and disable health checks on the monitoring processes, and configure some general settings related to events and alerts (specifically for the Event Server and Alert Publisher).

In addition, you can configure the basic functions of Cloudera Manager's Management Services through the standard configuration settings for the various management roles. For example, the mail server and related properties for the Alert Publisher are set under the Default set of Alert Publisher configuration properties.

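This section covers the following topics:

  • Configuring Health Check Settings
  • Configuring Directory Monitoring
  • Configuring Activity Monitor Events
  • Configuring Log Events
  • Configuring Alerts
  • Enabling Health Checks for Cloudera Management Services
  • Configuring Cloudera Management Services Database Limits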

For general information about modifying configuration settings, see Changing Service Configurations.

Configuring Health Check Settings

The initial monitoring configuration is handled during the installation and configuration of your cluster, and most monitoring parameters have default settings. However, you can set or modify these at any time.

  Note:

If alerting is enabled for events, you will be able to search for and view alerts in the Events tab, even if you do not have email notification configured.

To configure a Service monitoring setting:

  1. Click the Services tab, and select the service instance you want to modify. (This can be any of the services for which monitoring is provided, or the Cloudera Management Service.)
  2. Click the Configuration tab.
  3. Click the Monitoring category at the bottom of the left-hand Category panel.
  4. Under the Monitoring category, select the category of properties you want to change (these are organized as Service-Wide or by role).

To configure a Host monitoring setting:

  1. Click the Hosts tab.
  2. To modify the settings for an individual host, select the host.
  3. Click the Configuration tab.
  4. Click the Monitoring category in the left-hand Category panel. Note that if you perform this from the Hosts page, rather than for an individual host, the settings will apply to all hosts.

Depending on the service or role you select, and the configuration category, you can enable or disable health checks, determine when health checks cause alerts, or determine whether specific health checks are used in computing the overall health of a role or service. In most cases you can disable these "roll-up" health checks separately from the individual health checks.

As a rule, a Health Check whose result is considered "Concerning" or "Bad" will be forwarded as an event to the Event Server. That includes Health Checks whose results are based on configured Warning or Critical thresholds, as well as pass/fail type health checks. An event will also be published when the Health Check result returns to normal.

You can control when an individual Health Check will be forwarded as an Event or an Alert by modifying the threshold values for the relevant Health Check.
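
The relationship between thresholds, health results, and events can be illustrated with a minimal Python sketch. It is not Cloudera Manager code: the function names, the sample values, and the choice to emit one event per change of result are assumptions made for illustration, and it assumes a metric where higher values are worse.

```python
from typing import List, Tuple

def classify(value: float, warning: float, critical: float) -> str:
    """Map a metric value to a health result (assumes higher values are worse)."""
    if value >= critical:
        return "Bad"
    if value >= warning:
        return "Concerning"
    return "Good"

def events_for(samples: List[float], warning: float, critical: float) -> List[Tuple[int, str]]:
    """Return (sample index, new result) each time the health result changes.

    Assumption: an event is emitted on every change of result, including
    the return to "Good"; the starting state is taken to be "Good".
    """
    events = []
    previous = "Good"
    for index, value in enumerate(samples):
        result = classify(value, warning, critical)
        if result != previous:
            events.append((index, result))
        previous = result
    return events

samples = [10, 60, 95, 95, 20]
# With the lower thresholds, the second sample already produces an event.
health_events = events_for(samples, warning=50, critical=90)
# Raising the thresholds delays the first event and removes the "Bad" one.
relaxed_events = events_for(samples, warning=70, critical=99)
print(health_events)
print(relaxed_events)
```

Comparing the two results shows how changing threshold values alters which samples lead to an event.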

Configuring Directory Monitoring

Cloudera Manager can perform threshold-based monitoring of free space in the various directories on the hosts it monitors, such as log directories or checkpoint directories (for the Secondary NameNode).

These thresholds can be set in one of two ways: as absolute thresholds (in terms of MiBs, GiBs, etc.) or as percentages of space. As with other threshold properties, you can set values that will trigger events at both the Warning and Critical levels.

If you set both an absolute threshold and a percentage threshold, the Absolute Threshold setting will be used.

These thresholds are set under the Monitoring section of the Configuration page for each service.
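
The precedence between absolute and percentage thresholds can be sketched as follows. This is an illustrative Python sketch only: the function name, parameter names, and the strict "less than" comparison are assumptions, not the actual Cloudera Manager implementation.

```python
GIB = 1024 ** 3

def free_space_status(free_bytes, total_bytes,
                      abs_warning=None, abs_critical=None,
                      pct_warning=None, pct_critical=None):
    """Return "Good", "Concerning", or "Bad" for a directory's free space.

    Hypothetical helper: absolute thresholds are in bytes, percentage
    thresholds are percentages of total space. If any absolute threshold
    is set, the percentage thresholds are ignored.
    """
    if abs_warning is not None or abs_critical is not None:
        measured, warning, critical = free_bytes, abs_warning, abs_critical
    else:
        measured = 100.0 * free_bytes / total_bytes
        warning, critical = pct_warning, pct_critical
    if critical is not None and measured < critical:
        return "Bad"
    if warning is not None and measured < warning:
        return "Concerning"
    return "Good"

# A 100 GiB volume with 8 GiB free.
absolute_only = free_space_status(8 * GIB, 100 * GIB,
                                  abs_warning=10 * GIB, abs_critical=5 * GIB)
percent_only = free_space_status(8 * GIB, 100 * GIB,
                                 pct_warning=20, pct_critical=10)
both_set = free_space_status(8 * GIB, 100 * GIB,
                             abs_warning=10 * GIB, abs_critical=5 * GIB,
                             pct_warning=20, pct_critical=10)
print(absolute_only, percent_only, both_set)
```

In this example the percentage thresholds alone would report the directory as "Bad", but once absolute thresholds are also set they take precedence and the result is "Concerning".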

Configuring Activity Monitor Events

The Activity Monitor monitors the MapReduce jobs running on your cluster. This also includes the higher-level activities, such as Pig, Hive, and Oozie workflows that eventually are run as MapReduce tasks. Currently the Activity Monitor does not support MapReduce v2 (YARN).

You can monitor for slow-running jobs or jobs that fail, and alert on these events. To detect jobs that are running too slowly, you must configure a set of Activity Duration Rules that specify which jobs to monitor and what the duration limits are for those jobs.

Activity Monitor-related events and alerts for MapReduce are configured via the Monitoring category under the Configuration tab of the MapReduce service page.

To configure Activity Monitor settings for MapReduce:

  1. Click the Services tab.
  2. Select the MapReduce service instance.
  3. Click the Configuration tab.
  4. Click the Monitoring category at the bottom of the left-hand Category panel.

A "slow activity" alert occurs when a job exceeds the duration limit configured for it in an Activity Duration Rule. Activity Duration Rules are not defined by default; you must configure these rules if you want to see alerts for jobs that exceed the duration defined by these rules.

An Activity Duration Rule is a regular expression, used to match an activity name (Job ID), combined with a run time limit that the job should not exceed. You can add as many rules as you like, one per line, in the Activity Duration Rules property.

The format of each rule is '<regex>=<number>' where the <regex> is a regular expression to match against the activity name, and <number> is the job duration limit, in minutes. When a new activity starts, each <regex> expression is tested against the name of the activity for a match.

The list of rules is tested in order, and the first match found is used. For example, if the rule set is:

foo=10
bar=20

Any activity named "foo" would be marked slow if it ran for more than 10 minutes. Any activity named "bar" would be marked slow if it ran for more than 20 minutes.

Full Java regular expressions can be used, so a single rule can match many activity names. For example, suppose the rule set is:

foo.*=10
bar=20

In this case, any activity with a name that starts with foo (e.g. fool, food, foot) will match the first rule. For more on Java regular expressions, see http://download.oracle.com/javase/tutorial/essential/regex/.

If no rule matches an activity, then that activity will not be monitored for job duration. However, you can add a "catch-all" as the last rule, which will always match any name:

foo.*=10
bar=20
baz=30
.*=60

In this case, any job that does not match one of the earlier rules and runs longer than 60 minutes will be marked slow and will generate an alert.
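
The first-match behavior of the rule list can be approximated with a short Python sketch. It is illustrative only: the function names are hypothetical, Python's re module stands in for Java regular expressions (the two agree for simple patterns like these), and it assumes that a rule must match the whole activity name, as the examples above suggest.

```python
import re
from typing import List, Optional, Tuple

def parse_rules(text: str) -> List[Tuple["re.Pattern", int]]:
    """Parse '<regex>=<number>' lines into (compiled regex, minutes) pairs."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Split on the last '=' so the regex itself may contain '='.
        regex, _, minutes = line.rpartition("=")
        rules.append((re.compile(regex), int(minutes)))
    return rules

def duration_limit(rules, activity_name: str) -> Optional[int]:
    """Return the limit from the first matching rule, or None if none match."""
    for pattern, minutes in rules:
        # Assumption: the regex must match the entire activity name.
        if pattern.fullmatch(activity_name):
            return minutes
    return None

def is_slow(rules, activity_name: str, elapsed_minutes: float) -> bool:
    """An activity is slow only if a rule matches and its limit is exceeded."""
    limit = duration_limit(rules, activity_name)
    return limit is not None and elapsed_minutes > limit

rules = parse_rules("foo.*=10\nbar=20\nbaz=30\n.*=60")
print(duration_limit(rules, "food"))   # matched by foo.*
print(duration_limit(rules, "barn"))   # falls through to the catch-all
print(is_slow(rules, "other", 45))     # under the catch-all limit
```

Because the rules are tested in order, the catch-all must come last; placed first, it would match every activity and the more specific rules would never be reached.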

Configuring Log Events

You can enable or disable the forwarding of selected log events to the Event Server. This is enabled by default, and is a service-wide setting (Enable Log Event Capture) for each service for which monitoring is provided. Alerting on log events is disabled by default.

To enable or disable log event capture:

  1. Click the Services tab, and select the service instance you want to modify. You can enable or disable event capture for CDH services or for the Cloudera Management Services.
  2. Pull down the Configuration tab and select Edit.
  3. Click the Monitoring category at the bottom of the left-hand Category panel.
  4. Under Service Wide > Events and Alerts, modify the Enable Log Event Capture setting.

You can also modify the rules that determine how log messages are turned into events. Editing these rules is not recommended.

For each role, there are rules that govern how its log messages are turned into events by the custom log4j appender for the role. These are defined in the Rules to Extract Events from Log Files property for each HDFS, MapReduce and HBase role, and for ZooKeeper, Flume agent, and monitoring roles as well.

To configure which log messages become events:

  1. Click the Services tab, and select the service instance you want to modify.
  2. Pull down the Configuration tab and select Edit.
  3. Click the Monitoring category at the bottom of the left-hand Category panel.
  4. Select the role group for the Role for which you want to configure log events, or search for "Rules to Extract Events from Log Files". Note that for some roles there may be more than one role group, and you may need to modify all of them. The easiest way to ensure that you have found all occurrences of the property you need to modify is to search for the property by name; Cloudera Manager will show all copies of the property that match the search filter.
  5. Edit these rules as needed.

A number of useful rules are defined by default, based on Cloudera's experience supporting Hadoop clusters. For example:

  • The line {"rate": 10, "threshold":"FATAL"}, means log entries with severity FATAL should be forwarded as events, up to 10 a minute.
  • The line {"rate": 0, "exceptiontype": "java.io.EOFException"}, means log entries with the exception java.io.EOFException should always be forwarded as an event.

The syntax for these rules is defined in the Description field for this property. Basically, the syntax lets you create rules that identify log messages based on log4j severity, message content matching, and/or the exception type. These rules must result in valid JSON. You can also specify that the event should generate an alert (by setting "alert":true in the rule). Note that if you specify a content match, the entire content must match. If you want to match on a partial string, you must provide wildcards as appropriate to allow matching the entire string.
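
Before pasting edited rules into the property, it can help to check them offline. The following Python sketch is illustrative only and does not reproduce the appender's actual logic. It assumes the rules are supplied as a JSON array, that "threshold" means "this severity or higher", and that the content-match key is named "content"; the authoritative syntax is the one given in the property's Description field.

```python
import json
import re

# log4j severities from lowest to highest.
SEVERITIES = ["TRACE", "DEBUG", "INFO", "WARN", "ERROR", "FATAL"]

def load_rules(text):
    """Parse a JSON array of rules; raises ValueError if the text is not valid JSON."""
    rules = json.loads(text)
    if not isinstance(rules, list):
        raise ValueError("expected a JSON array of rule objects")
    return rules

def rule_matches(rule, severity, message, exception_type=None):
    """Return True if a log entry satisfies every condition present in the rule."""
    if "threshold" in rule:
        # Assumption: entries at or above the threshold severity match.
        if SEVERITIES.index(severity) < SEVERITIES.index(rule["threshold"]):
            return False
    if "exceptiontype" in rule and rule["exceptiontype"] != exception_type:
        return False
    if "content" in rule:
        # The entire message must match, so partial matches need wildcards.
        if re.fullmatch(rule["content"], message) is None:
            return False
    return True

rules = load_rules("""[
  {"rate": 10, "threshold": "FATAL"},
  {"rate": 0, "exceptiontype": "java.io.EOFException"},
  {"alert": true, "content": ".*disk failure.*"}
]""")

message = "sda: disk failure detected"
print(rule_matches(rules[2], "ERROR", message))
print(rule_matches({"content": "disk failure"}, "ERROR", message))
```

The last two calls show the whole-string requirement: the pattern with wildcards on both sides matches the message, while the bare phrase does not.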

Editing these rules is not recommended. Cloudera Manager provides a default set of rules that should be sufficient for most users.

Configuring Alerts

You can configure alerts to be delivered by email, or sent as SNMP traps. These configurations are set under the Alert Publisher role of the Cloudera Manager management service. See Configuring Alert Delivery.

Note that if you just want to add to or modify the list of alert recipient email addresses, you can do this starting at the Alerts tab under the Administration page, accessed with the gear icon.

You can also send a test alert email from the Alerts tab under the Administration page.

Enabling Health Checks for Cloudera Management Services

The Cloudera Manager management service provides health checks for its own roles.

You can enable or disable these health checks for each management service. (Role-based health checks are enabled by default.) You can also set a variety of thresholds for specific roles, such as thresholds for log directory free space.

Configuring Cloudera Management Services Database Limits

Each Cloudera Management Service maintains a database for retaining the data it monitors. These databases (as well as the log files maintained by these services) can grow quite large. For example, the Activity Monitor maintains data at the service level, the activity level (MapReduce jobs and aggregate activities), and at the task attempt level. Limits on these data sets are configured when you install your management services, but you can modify these parameters through the Configuration settings in the Cloudera Manager Admin Console, for each management service.

For example, the Event Server lets you set a total number of events you want to store. Host Monitor and Service Monitor let you set data expiration thresholds (in hours), and Activity Monitor gives you "purge" settings (also in hours) for the data it stores. There are also settings for the logs that these various services create. You can throttle how big the logs are allowed to get and how many previous logs to retain.
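
The two kinds of limit described above, an age limit in hours and a cap on the total number of stored items, can be pictured with a small Python sketch. It is purely illustrative: the function names and data layout are hypothetical, and it assumes the oldest items are discarded first, which the management services may or may not do in exactly this way.

```python
from datetime import datetime, timedelta

def purge_expired(records, now, retention_hours):
    """Keep only (timestamp, payload) records newer than the retention window."""
    cutoff = now - timedelta(hours=retention_hours)
    return [record for record in records if record[0] >= cutoff]

def cap_events(events, max_events):
    """Keep only the newest max_events events (events are ordered oldest first)."""
    return events[-max_events:] if max_events > 0 else []

now = datetime(2024, 1, 2, 12, 0)
records = [(now - timedelta(hours=age), "metric-%d" % age) for age in (1, 30, 100)]

# With a 48-hour expiration, the 100-hour-old record is dropped.
kept = purge_expired(records, now, retention_hours=48)
# With a cap of two events, only the two newest are retained.
newest = cap_events(["e1", "e2", "e3", "e4"], max_events=2)
print([payload for _, payload in kept], newest)
```

Lowering either limit reduces how much data is kept, at the cost of a shorter history to search.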

To change any of the data retention or log size settings:

  1. From the Services tab, select the Cloudera Management Services service instance.
  2. Pull down the Configuration tab and click Edit.
  3. In the left-hand column, select the role group for the role whose configurations you want to modify. (Note that the management services are singleton roles so there will be only a Base role group for the role.)
  4. For some services, such as the Activity Monitor, Service Monitor, or Host Monitor, the purge or expiration period properties are found in the top-level settings for the role. Typically, log file size settings will be under the Logs category under the role group.