Marathon

Marathon Health Checks

2019-01-06  汤尼房
Introduction

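Health checks are application-specific: you have to implement them in your application and expose them, because only you know what counts as a healthy state for it. Once a health check is defined for the application, a check is considered to have passed when, for an HTTP check, the response code is between 200 and 399 inclusive and the response arrives within timeoutSeconds.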

Marathon-level health checks

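Marathon-level health checks (HTTP, HTTPS, TCP) are executed by Marathon itself, which judges a task's health by testing whether Marathon can reach the task. Their main limitations follow from that design: every check crosses the network between Marathon and the task, so a network failure between the two can make a healthy task look unhealthy, and all of the checking work falls on Marathon.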

Mesos-level health checks

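Mesos-level health checks (MESOS_HTTP, MESOS_HTTPS, MESOS_TCP, COMMAND) are executed by Mesos locally on the node (agent) that runs the task, so they test reachability from the Mesos executor. Compared with Marathon-level checks, their advantages are that they run as close to the task as possible, so they are not affected by network failures between Marathon and the task, and that the checking work is delegated to the agents, so it scales with the number of agents.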
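The split between the two levels can be sketched as a small lookup on the protocol name. This is only an illustration of the grouping described above; the function name `check_executor` and the return values are made up for the example and are not part of Marathon.

```python
# Protocol names as listed in the text above.
MARATHON_LEVEL = {"HTTP", "HTTPS", "TCP"}
MESOS_LEVEL = {"MESOS_HTTP", "MESOS_HTTPS", "MESOS_TCP", "COMMAND"}


def check_executor(protocol: str) -> str:
    """Return which component runs a health check with this protocol.

    Hypothetical helper for illustration only.
    """
    name = protocol.upper()
    if name in MARATHON_LEVEL:
        return "marathon"  # executed by Marathon over the network
    if name in MESOS_LEVEL:
        return "mesos"  # executed locally on the agent running the task
    raise ValueError(f"unknown health check protocol: {protocol}")


if __name__ == "__main__":
    for p in ("HTTP", "MESOS_HTTP", "COMMAND"):
        print(p, "->", check_executor(p))
```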

Source code of the TCP and HTTP health checks in Mesos and Marathon

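An HTTP health check verifies that a GET request to a specific route returns a successful HTTP status code. A successful status code can be a real success or a redirect, so the response code must fall in the range 200-399, inclusive of both 200 and 399. A TCP health check verifies that a TCP connection to the task can be opened. It does not send or receive any data; it simply tries to open a socket.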
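As a rough illustration of the TCP case, the sketch below opens a socket and closes it again without exchanging data. It is not how Mesos or Marathon implement the check; the function name `tcp_check` and its parameters are assumptions chosen for the example.

```python
import socket


def tcp_check(host: str, port: int, timeout_seconds: float) -> bool:
    """Return True if a TCP connection to host:port can be opened in time.

    Hypothetical helper: nothing is sent or received, the socket is only
    opened and then closed again.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout_seconds):
            return True
    except OSError:
        # Covers refused connections, unreachable hosts, and timeouts.
        return False


if __name__ == "__main__":
    # Assumed target; replace with the address of a real task.
    print(tcp_check("127.0.0.1", 8080, 2.0))
```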

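  1. Mesos TCP
// Excerpt from a switch over the HealthCheck type: the TCP health check is
// converted into a CheckInfo that carries only the target port.
case HealthCheck::TCP: {
  check.set_type(CheckInfo::TCP);
  check.mutable_tcp()->set_port(healthCheck.tcp().port());
  break;
}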
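  2. Mesos HTTP
// Same switch: the HTTP health check is converted into a CheckInfo that
// carries the port and the path to request.
case HealthCheck::HTTP: {
  check.set_type(CheckInfo::HTTP);

  CheckInfo::Http* http = check.mutable_http();
  http->set_port(healthCheck.http().port());
  http->set_path(healthCheck.http().path());
  break;
}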
  3. Marathon TCP
def tcp(
    instance: Instance,
    check: MarathonTcpHealthCheck,
    host: String,
    port: Int): Future[Option[HealthResult]] = {
    val address = s"$host:$port"
    val timeoutMillis = check.timeout.toMillis.toInt
    logger.debug(s"Checking the health of [$address] for instance=${instance.instanceId} via TCP")

    // The blocking connect runs on the I/O execution context.
    Future {
      val address = new InetSocketAddress(host, port)
      val socket = new Socket
      scala.concurrent.blocking {
        // Only open and close the connection; no data is sent or received.
        socket.connect(address, timeoutMillis)
        socket.close()
      }
      Some(Healthy(instance.instanceId, instance.runSpecVersion, Timestamp.now()))
    }(ThreadPoolContext.ioContext)
  }
  4. Marathon HTTP
def http(
    instance: Instance,
    check: MarathonHttpHealthCheck,
    host: String,
    port: Int): Future[Option[HealthResult]] = {
    // Normalize the configured path so that it always starts with "/".
    val rawPath = check.path.getOrElse("")
    val absolutePath = if (rawPath.startsWith("/")) rawPath else s"/$rawPath"
    val url = s"http://$host:$port$absolutePath"
    logger.debug(s"Checking the health of [$url] for instance=${instance.instanceId} via HTTP")

    singleRequest(
      RequestBuilding.Get(url),
      check.timeout
    ).map { response =>
        response.discardEntityBytes() // forget about the body
        if (acceptableResponses.contains(response.status.intValue())) {
          // A status in the acceptable set counts as healthy.
          Some(Healthy(instance.instanceId, instance.runSpecVersion))
        } else if (check.ignoreHttp1xx && (toIgnoreResponses.contains(response.status.intValue))) {
          // With ignoreHttp1xx set, an ignorable status yields no result at all.
          logger.debug(s"Ignoring health check HTTP response ${response.status.intValue} for instance=${instance.instanceId}")
          None
        } else {
          // Any other status marks the instance as unhealthy.
          logger.debug(s"Health check for instance=${instance.instanceId} responded with ${response.status}")
          Some(Unhealthy(instance.instanceId, instance.runSpecVersion, response.status.toString()))
        }
      }.recover {
        // A failed request is also reported as unhealthy.
        case NonFatal(e) =>
          logger.debug(s"Health check for instance=${instance.instanceId} did not respond due to ${e.getMessage}.")
          Some(Unhealthy(instance.instanceId, instance.runSpecVersion, e.getMessage))
      }
  }
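The three outcomes of the Marathon HTTP check (healthy, ignored, unhealthy) can be restated as a small decision function. This is a sketch, not Marathon code: `classify` is a hypothetical name, the acceptable range 200-399 comes from the text above, and the ignorable range 100-199 is an assumption inferred from the `ignoreHttp1xx` flag name.

```python
from typing import Optional

ACCEPTABLE = range(200, 400)  # 200-399 inclusive, as stated in the article
IGNORABLE = range(100, 200)   # assumption: what "ignoreHttp1xx" refers to


def classify(status: int, ignore_http_1xx: bool) -> Optional[bool]:
    """Map an HTTP status to True (healthy), False (unhealthy), or None (ignored)."""
    if status in ACCEPTABLE:
        return True
    if ignore_http_1xx and status in IGNORABLE:
        return None  # no health result is recorded for this response
    return False


if __name__ == "__main__":
    for code in (200, 302, 100, 404, 500):
        print(code, classify(code, ignore_http_1xx=True))
```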
Practice

Marathon-level health checks

Flow of a Marathon-level health check that starts when the app instance is created:
  1. Adding health check for app
  2. Starting health check actor for app
    Received health result for app(Unhealthy)
    Received health result for app(Unhealthy)
    Received health result for app(Unhealthy)
    Received health result for app(Unhealthy)
    Received health result for app(Unhealthy)
    (the 5 health checks that are allowed to fail)
    Received health result for app(Unhealthy)
    Received health result for app(Unhealthy)
    Received health result for app(Unhealthy)
    (3 consecutive failed health checks)
  3. Detected unhealthy instance (the app is now considered Unhealthy)
  4. Send kill request for instance (the instance is killed and re-created)
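The count of 5 tolerated failures followed by 3 counted failures can be reproduced with a small simulation. It is a simplified model, not Marathon's scheduler logic: it assumes the first check fires at 0s, that every check fails, and it uses the values from this walkthrough (60s interval, 300s grace period, 3 consecutive failures). The function name `simulate` is made up for the example.

```python
from typing import List, Tuple


def simulate(interval: int, grace: int, max_failures: int) -> Tuple[List[int], List[int]]:
    """Return (times of tolerated failures, times of counted failures).

    Assumes every check fails and the first check starts at t = 0.
    """
    tolerated: List[int] = []
    counted: List[int] = []
    t = 0
    while len(counted) < max_failures:
        if t < grace:
            tolerated.append(t)  # still inside the grace period
        else:
            counted.append(t)    # counts towards consecutive failures
        t += interval
    return tolerated, counted


if __name__ == "__main__":
    tolerated, counted = simulate(interval=60, grace=300, max_failures=3)
    print("tolerated:", tolerated)  # [0, 60, 120, 180, 240]
    print("counted:", counted)      # [300, 360, 420]
```

Under these assumptions the instance would be flagged as unhealthy at the 420s check.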
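Mesos-level health checks

The moment the app is scheduled is basically the same as the "Executor registered..." moment in the figure, and the health check runs roughly 16s after Executor registered. (Executor registered does not include the time needed to pull the image; it is the moment on the slave when the image has been pulled and the app is about to be created.)

Parameter interpretation: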
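intervalSeconds: 60s. The check interval is 60s, that is, a health check runs every 60s.
gracePeriodSeconds: 300s. Health check failures are tolerated for 300s. The checks started at 0s, 60s, 120s, 180s and 240s all fall inside this 300s window, which is why 5 checks are allowed to fail.
maxConsecutiveFailures: 3. After 3 consecutive failed health checks the app is considered Unhealthy, which is why 3 failed checks follow.
timeoutSeconds: 20s. Each health check has 20s to respond; a response that takes longer than 20s is treated as a failure even if it is otherwise normal.
delaySeconds: 15s. Health checking starts 15s after the app is created; only Mesos-level health checks have this parameter.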
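For reference, the parameters above could be put together into a health check definition roughly as follows. This is a sketch under assumptions: the app id `/example-app`, the path `/health`, and `portIndex` 0 are placeholders invented for the example, and `build_health_check` is a hypothetical helper. Only the five timing values come from this article.

```python
import json


def build_health_check(mesos_level: bool) -> dict:
    """Build a health check definition using the values discussed above."""
    check = {
        "protocol": "MESOS_HTTP" if mesos_level else "HTTP",
        "path": "/health",  # placeholder path
        "portIndex": 0,     # placeholder port index
        "gracePeriodSeconds": 300,
        "intervalSeconds": 60,
        "maxConsecutiveFailures": 3,
        "timeoutSeconds": 20,
    }
    if mesos_level:
        # Per the article, only Mesos-level checks have delaySeconds.
        check["delaySeconds"] = 15
    return check


if __name__ == "__main__":
    app = {"id": "/example-app", "healthChecks": [build_health_check(True)]}
    print(json.dumps(app, indent=2))
```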

Adjust the parameters and observe the results