Welcome to new things

[Technical] [Electronic work] [Gadget] [Game] memo writing

How to notify Slack or email when an error occurs in a container with Google Kubernetes Engine (GKE)

This article shows how to send a notification to an external destination such as Slack or email when a program running on GCP's Kubernetes (GKE) hits an error.

You do not need to implement your own error detection mechanism; GCP settings alone are enough.

How it works

When an error occurs in a program, the program should output a message to stderr.

By default, GKE sends each container's standard output and standard error to Cloud Logging.

Cloud Monitoring watches the Cloud Logging logs and sends an external notification when such output appears.

How GCP works

GKE and Cloud Logging

When a cluster is created in GKE, various logs are output to Cloud Logging by default.

These logs include the containers' standard output and standard error. As long as the program writes to standard output and standard error, without any special awareness of Cloud Logging, the same text is sent to Cloud Logging and kept as a log.

Cloud Monitoring

Cloud Monitoring is a tool that accumulates various indicator (metric) data and creates graphs from it.

For example, memory usage data for instances can be periodically sent to Cloud Monitoring for graphing and monitoring.

One of the features of Cloud Monitoring is the ability to set conditions on indicators and send external notifications when conditions are met.

In this case, we will use that functionality to send an external notification when there is output from the container to stderr.

Cloud Monitoring can look similar to log monitoring, but its indicators are less like logs and more like database records registered in chronological order. Cloud Monitoring is a service that combines an indicator database with a visualization and monitoring tool for that database.

Combination of Cloud Monitoring and Cloud Logging

Cloud Logging only stores logs, so to monitor them with Cloud Monitoring you send data derived from the logs to Cloud Monitoring as an indicator.

Specifically, write a query to display the logs you want to monitor in Cloud Logging, and register that query as an indicator for Cloud Monitoring.

Cloud Logging will then periodically execute the query and send the number of logs resulting from the execution as an indicator to Cloud Monitoring.

After that, you set Cloud Monitoring to send an external notification when that indicator reports a count above zero, so a notification goes out whenever matching output appears in the log.

Procedure

Container

The container only needs to write to stderr.

Example: Standard error output application

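package main

import (
    "fmt"
    "os"
    "time"
)

// Every two minutes, write (current minute + 1) lines to stderr
// and a single line to stdout.
func main() {
    for {
        m := time.Now().Minute()
        // These lines go to stderr, so they show up as error logs.
        for i := 0; i <= m; i++ {
            fmt.Fprintf(os.Stderr, "stderr message : %+v/%+v\n", i, m)
        }
        // This line goes to stdout and shows up as normal output.
        fmt.Println("stdout message")
        time.Sleep(2 * time.Minute)
    }
}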
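In a real program, the stderr line would usually come from an error path rather than a loop. The following is a minimal sketch under that assumption; `doWork` and `formatError` are hypothetical names made up for illustration and are not part of the sample above.

```go
package main

import (
    "errors"
    "fmt"
    "os"
)

// doWork is a hypothetical task that fails for negative input.
func doWork(n int) error {
    if n < 0 {
        return errors.New("negative input")
    }
    return nil
}

// formatError builds the single line that will be written to stderr.
func formatError(err error) string {
    return fmt.Sprintf("error: %v\n", err)
}

func main() {
    if err := doWork(-1); err != nil {
        // Anything written to stderr ends up in Cloud Logging as an error log.
        fmt.Fprint(os.Stderr, formatError(err))
        return
    }
    // Normal progress goes to stdout.
    fmt.Println("done")
}
```

Keeping each error on one line makes it easier to count log entries later, since each stderr line becomes one log entry.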

Cloud Logging

Making Indicators

Kubernetes outputs various logs to Cloud Logging. In the Cloud Logging "Log Viewer", you can narrow a large number of logs down to specific ones just by clicking in the "Log Field Explorer".

For example, display the error logs for the container described above.

The filter you build this way is expressed as a query.

resource.type="k8s_container"
resource.labels.cluster_name="cluster-test-dev"
resource.labels.container_name="container-app-go"
severity=ERROR

The number of logs for this query is registered as an indicator.

  • Click on "Create Indicator"

The "Indicator Editor" will then open. Enter a "Name", set "Type" to "Counter", and click "Create Indicator" to create your own indicator that reports the number of logs matching this query.

Queries can use regular expressions, etc., so they can be specified flexibly, for example, to narrow down the error log by pod.
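As a rough sketch of how such a query can be assembled, the Go function below joins the same conditions as the query shown earlier and optionally adds a regular-expression match on the pod name. `buildFilter` and the pod pattern are illustrative assumptions, not part of any GCP API; the only Logging syntax relied on is that conditions on separate lines are combined with AND and that `=~` matches a regular expression.

```go
package main

import (
    "fmt"
    "strings"
)

// buildFilter returns a Cloud Logging query for error logs of one container.
// If podPattern is not empty, the query is narrowed to matching pods.
func buildFilter(cluster, container, podPattern string) string {
    lines := []string{
        `resource.type="k8s_container"`,
        fmt.Sprintf(`resource.labels.cluster_name=%q`, cluster),
        fmt.Sprintf(`resource.labels.container_name=%q`, container),
        "severity=ERROR",
    }
    if podPattern != "" {
        // =~ is the regular-expression match operator.
        lines = append(lines, fmt.Sprintf(`resource.labels.pod_name=~%q`, podPattern))
    }
    return strings.Join(lines, "\n")
}

func main() {
    // "^app-go-" is a made-up pod name prefix.
    fmt.Println(buildFilter("cluster-test-dev", "container-app-go", "^app-go-"))
}
```

Note that `%q` escapes backslashes, so this sketch is only suitable for simple patterns without them.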

Create an alert

Then, in Cloud Monitoring, create an alert based on the container error log count indicator you have just created.

Flow

Since this is a bit complicated, we will first describe the data flow from log to alert detection and alert termination.

  • A log-based indicator runs its query every minute and sends the number of matching logs to Cloud Monitoring. Even when the query matches nothing, a value of 0 is registered.
  • The recorded log counts can be aggregated (calculated) in any way you like.

    • For example, "the total count for every 5 minutes".
  • The aggregated result is checked, and an alert is generated if the condition is met for a specified period of time.

    • For example, "generate an alert if the aggregated result stays above 0 for one hour".
  • After the alert occurs, it ends when the condition is no longer met.

    • As long as the condition continues to be met, no new notifications are sent.
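The lifecycle above can be sketched in a few lines of Go. This is only a simplified model, assuming the duration is zero so the alert opens as soon as one aggregated value exceeds the threshold; `notifications` and the sample counts are made up for illustration.

```go
package main

import "fmt"

// notifications returns the indexes of the aggregated values at which a
// notification would be sent: only when the condition changes from
// "not met" to "met". While it stays met, the alert is already open.
func notifications(aggregated []int, threshold int) []int {
    var sent []int
    open := false
    for i, v := range aggregated {
        met := v > threshold
        if met && !open {
            sent = append(sent, i)
        }
        // The alert closes as soon as the condition is no longer met.
        open = met
    }
    return sent
}

func main() {
    // Error counts per aggregation period (made-up sample data).
    counts := []int{0, 2, 3, 1, 0, 0, 4, 0}
    fmt.Println(notifications(counts, 0)) // prints [1 6]
}
```

Three consecutive periods with errors produce a single notification, and a second one is sent only after the count has dropped back to zero in between.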

Creation

The alert is built on an aggregation of the indicator.

In [Cloud Logging]-[Log-based Metrics]-[User-defined Metrics], open the "..." menu to the right of the log-based indicator you created and select "Create notifications based on indicators".

You will then be redirected to the Cloud Monitoring notification policy creation.

Log-based metrics are registered under names of the form "logging/user/~", so set that name in the "Metric" field of the notification policy.

Aggregation settings

Cloud Monitoring records the log-based metric. As recorded, it is only raw log-count data, so the next step is to decide how to aggregate (calculate) it.

For example, count the number of cases every 5 minutes.

  • Set "Aggregator" to none
  • Set "Period" to 5 minute
  • Set [Advanced Aggregation]-[Aligner] to sum

Condition settings

Under "Configuration", set which state of the values aggregated above should trigger an alert.

For example, send a notification if the error count stays at one or more throughout a one-minute period.

  • Set [Configuration]-[Threshold] to 0

This will send a notification if there is always at least one error during a one-minute period.

As long as the condition is met, the alert is considered to continue and no new notification is made.

When the condition is no longer met, the alert is terminated and a new alert is generated and notified the next time the condition is met.

Notification destination

After that, you can specify the recipients of the alerts, and they will be notified when an alert occurs.

Example: Email notification

Cloud Monitoring supplementary notes

I got stuck on a few points with Cloud Monitoring, so here is a brief summary of how to use it.

About Aggregator

When you graph an indicator, a log-based indicator produces a single graph because there is only one series, whereas an indicator such as instance memory usage produces as many graphs as there are instances. Aggregator controls how those multiple series are combined. When there is only one graph (indicator), as in this case, set Aggregator to none.

About Aligner

Cloud Monitoring is designed to be flexible: it does not display the received indicator values as they are, but in whatever aggregated form you choose. The aggregation method is specified in [Advanced Aggregation]-[Aligner].

To obtain the number of logs (the indicator's value), as in this case, use sum; to obtain the maximum value during the period, use max.

There is also count, but it does not refer to the number of logs. It is the number of times the log-based indicator was sent, so registrations are counted even when the number of logs is zero.

Since log-based indicators are sent every minute, setting Period to 10 minute, for example, gives a count of 10.
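The difference between sum, max, and count is easy to see on sample data. The sketch below is only an illustration of the arithmetic, not Cloud Monitoring's implementation; `align` and the per-minute values are made up.

```go
package main

import "fmt"

// align splits per-minute values into windows of `period` points and
// reduces each window with the named aligner ("sum", "max" or "count").
// An unknown aligner name yields 0 for every window.
func align(points []int, period int, aligner string) []int {
    if period <= 0 {
        return nil
    }
    var out []int
    for start := 0; start+period <= len(points); start += period {
        window := points[start : start+period]
        v := 0
        switch aligner {
        case "sum":
            for _, p := range window {
                v += p
            }
        case "max":
            v = window[0]
            for _, p := range window {
                if p > v {
                    v = p
                }
            }
        case "count":
            // Number of data points, regardless of their values.
            v = len(window)
        }
        out = append(out, v)
    }
    return out
}

func main() {
    // Ten one-minute points: errors were logged in two of the minutes.
    perMinute := []int{0, 0, 3, 0, 0, 0, 5, 0, 0, 0}
    fmt.Println(align(perMinute, 10, "sum"))   // prints [8]
    fmt.Println(align(perMinute, 10, "max"))   // prints [5]
    fmt.Println(align(perMinute, 10, "count")) // prints [10]
}
```

With count, the result is 10 even if every point is zero, which is why sum is the aligner to use for counting error logs.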

About Configuration

Specify how long the aggregated result configured above must stay in the abnormal state before an alert is raised.

For example, to be alerted as soon as an error log is detected, alert when there is at least one error log in a one-minute period.

The period specified here is not the check interval, but the length of time for which the condition must continue to be met.

For example, if you want to check once an hour, set the aggregation period to one hour rather than setting the duration here to one hour.

If the aggregation period is 1 minute and the duration here is 1 hour, the alert fires only when errors keep occurring for 60 minutes. If even one minute within those 60 has no error, the condition is not satisfied.
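The "must hold for the whole duration" rule can be expressed as a small check. This is a sketch under the assumption that one aggregated value arrives per minute; `heldFor` is a hypothetical helper, not a Cloud Monitoring function.

```go
package main

import "fmt"

// heldFor reports whether the last n aggregated values are all above
// threshold, i.e. whether the condition was met for the whole duration.
func heldFor(values []int, threshold, n int) bool {
    if n <= 0 || len(values) < n {
        return false
    }
    for _, v := range values[len(values)-n:] {
        if v <= threshold {
            return false
        }
    }
    return true
}

func main() {
    // 60 one-minute sums where every minute has at least one error.
    sums := make([]int, 60)
    for i := range sums {
        sums[i] = 1
    }
    fmt.Println(heldFor(sums, 0, 60)) // prints true

    // A single quiet minute breaks the run.
    sums[30] = 0
    fmt.Println(heldFor(sums, 0, 60)) // prints false
}
```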

Log-based indicators send data every minute, but some indicators send data irregularly. In such cases, the condition is evaluated as if the last value received were still in effect.

For example, suppose the aggregation period is 1 hour and the duration here is 10 minutes. Once an error occurs and the aggregated count rises above 0, the value stays that way for 1 hour. After the condition has held for 10 minutes, an alert is raised. The error state then continues for the remaining 50 minutes, during which the alert stays open. When the next aggregation shows zero errors, the alert is closed.
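The "last value stays in effect" behaviour can be modelled by carrying each sample forward until the next one arrives. The sketch below assumes hourly aggregated values placed on a per-minute timeline; `carryForward` and the sample values are made up for illustration.

```go
package main

import "fmt"

// carryForward expands sparse samples (minute -> value) into one value per
// minute, repeating the most recent sample until a new one arrives.
// Minutes before the first sample are treated as 0.
func carryForward(samples map[int]int, minutes int) []int {
    out := make([]int, minutes)
    last := 0
    for m := 0; m < minutes; m++ {
        if v, ok := samples[m]; ok {
            last = v
        }
        out[m] = last
    }
    return out
}

func main() {
    // Hourly aggregation: 2 errors reported at minute 0, none at minute 60.
    values := carryForward(map[int]int{0: 2, 60: 0}, 120)

    // Count the minutes during which the condition "above 0" is met.
    above := 0
    for _, v := range values {
        if v > 0 {
            above++
        }
    }
    fmt.Println(above) // prints 60
}
```

The condition is met for the full hour between the two samples, which is why a 10-minute duration is satisfied early in that hour and the alert stays open until the next aggregated value arrives.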

Impressions, etc.

Cloud Logging and Cloud Monitoring are used in combination, so the documentation and UI are scattered and a bit confusing.

Points worth noting:

  • Cloud Logging itself only stores logs and has no alerting function; you create indicators from the logs and raise alerts from Cloud Monitoring via those indicators.
  • Log-based indicators are not registered at the moment a log is recorded; the query always runs at regular intervals and its result is registered.
  • Even if the query returns zero results, a value is registered as an indicator.
  • Cloud Monitoring does not use the indicators as they are, but uses results aggregated at regular intervals.
  • An alert is generated if the condition continues to be met for a specified period of time.
  • If no new indicator value is registered during the period specified in the alert, the last value continues to be used.

Cloud Monitoring is a bit complicated to set up.

Since this is a vendor-dependent technique, I would limit its use to what you can understand by doing a little research, and not go too deep into it.

www.ekwbtblog.com
