Diagnostic API

The Diagnostic API is built on Corus' core REST support, and on Corus' port management functionality. It allows for robust automation by providing a single endpoint to validate deployments at scale, and diagnose/monitor applications later on in the course of their execution.

Here are some highlights:

  • Early failure detection through probing locally running processes after startup;
  • scanning of Corus' internal task manager log, for detecting deployment errors and minimizing the time required to reach a diagnostic;
  • probing of internal components to make sure concurrent activity that may impact diagnostic is taken into account (deployments are given time to complete, processes are given time to start, etc.);
  • clients need to connect to only a single Corus node in order to obtain cluster-wide diagnostic data;
  • diagnostic outcome for a whole cluster reduced to a single HTTP status code (without hiding underlying details), allowing for simple client-side logic and quick integration into existing CI/CD infrastructure.

The diagram below illustrates the diagnostic acquisition flow (implied are the processes controlled by each Corus node):

Even if diagnostic support is primarily meant to be used through the API, the Corus command-line interface offers a diags command also giving access to this feature.

Some Background

Before delving into the API spec itself, some useful background is provided. You do not need this background to start integrating the functionality and may jump straight to the API spec if you feel impatient. Having the background, though, will help you understand the internals, and might come in handy when the time comes to make sense of the diagnostic output provided by Corus (for investigation/remediation purposes).

How it Works

The API's main goal is to ease the job of automating deployment validation, by using Corus to perform the following:

  • Test for application responsiveness by having each Corus node in a cluster check for the status of the processes it controls;
  • aggregate such node-specific statuses into a single global status that is easy to interpret and results in one of three possible outcomes:
    • Success: all Corus nodes (and all processes running on these nodes) have been diagnosed as healthy.
    • Failure: one or more Corus nodes (and/or processes running on these nodes) have been diagnosed as down/unresponsive/unstable.
    • Incomplete: the status that could be obtained is only temporary, because one or more Corus instances are unable to perform diagnostic acquisition at the moment (due to concurrent activity that could yield erroneous results). Upon such an outcome, clients should simply retry a few seconds later - say, 10 seconds.

How the three statuses above map to HTTP status codes is explained further below.

The functionality relies largely on Corus' port management feature: for each port configured in the corus.xml descriptor, it is possible to add a diagnostic configuration that is used at runtime by a Corus node to check the status of corresponding processes.

In addition to checking applications state, Corus scans its own state to establish the proper diagnostic. This has been done in order to detect rapidly conditions resulting in a failed deployment, and signal such a failed deployment early on, in an automatic manner (this is described further below).

To top this off, when tags are configured in the corus.xml descriptor, Corus is able to determine if a process should be running or not on a given node. It will therefore not flag a process as being down (or even attempt checking it) if such a process is not expected to run.

The diagnostic functionality is also used internally by Corus in the context of process health-check, and the auto-restart of processes that are deemed unhealthy.

Application Diagnostic

The first thing to do in order for applications to be diagnosed by Corus is to configure them with a port (using a <port> element in the corus.xml descriptor) that will be checked at runtime. Currently, only HTTP/HTTPS-based checks are supported.

For more detailed information regarding port management in Corus, please refer to the Port Management page.

Configuration

Corus diagnoses processes by performing a simple HTTP GET call to a predefined endpoint provided by the application. The sample below illustrates how the corus.xml file has to be modified to take advantage of this functionality.

<distribution name="demo" version="1.0" xmlns="http://www.sapia-oss.org/xsd/corus/distribution-5.0.xsd">
  <process name="httpServer" maxKillRetry="3" shutdownTimeout="30000" invoke="true" tags="http-server">
    <port name="http.server">
      <http-diagnostic portPrefix="8" />
    </port>
    ...
  </process>
  ...
</distribution>

The <port> element supports both a <http-diagnostic> and a <https-diagnostic> element for connecting to processes over either HTTP or HTTPS, respectively. Only one such diagnostic element is supported under the <port> element. Using the provided configuration, Corus is able to recreate URLs that will be used to connect to the corresponding application processes and perform the HTTP/HTTPS GET call required to determine their status. Both <http-diagnostic> and <https-diagnostic> support the same attributes, documented below:

  • gracePeriod (optional, defaults to 45 seconds): the number of seconds given to the process to boot up and respond to health checks, after being started.
  • path (optional, defaults to /): indicates the context path (after the host:port) of the endpoint provided by the application for health check purposes.
  • portPrefix (optional): allows configuring one or more digits which, together with the port value passed to the process by Corus, will constitute the full port number to use for pinging the process.
  • successCode (optional, defaults to 200): indicates the HTTP status code to expect for the check to be considered a success.
  • readTimeout (optional, defaults to 3000 milliseconds): indicates the maximum amount of time to wait (in millis) for incoming response data.
  • connectionTimeout (optional, defaults to 3000 milliseconds): indicates the maximum amount of time to wait (in millis) for establishing an HTTP connection.

Using the information available, Corus is able to connect to the application process and determine its liveness: it will check only processes on the current host, using the port allocated to each, and connect to the endpoint indicated by the provided path (i.e.: the path attribute of the <http-diagnostic> or <https-diagnostic> element).

For this behavior to yield the expected results, applications must provide an endpoint for health check purposes, over HTTP or HTTPS. Such an endpoint will be invoked using the GET method, and success will be inferred from the endpoint returning an HTTP 200 - or a status corresponding to what has been configured as part of either the <http-diagnostic> or <https-diagnostic> element (as documented above).
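As an illustration of the application side, here is a minimal sketch of such a health-check endpoint, written in Python with the standard library only. The /ping path, the PingHandler and make_server names, and the 8100 port mentioned in the comment are assumptions made for the example; a real application would expose the endpoint through whatever HTTP stack it already uses.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer


class PingHandler(BaseHTTPRequestHandler):
    """Answers GET /ping with HTTP 200 (hypothetical health-check endpoint)."""

    def do_GET(self):
        if self.path == "/ping":
            body = b"OK"
            self.send_response(200)
            self.send_header("Content-Type", "text/plain")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_error(404)

    def log_message(self, format, *args):
        # Keep periodic health-check calls out of the console output.
        pass


def make_server(port, host=""):
    """Builds a server bound to the given port (0 lets the OS pick one)."""
    return HTTPServer((host, port), PingHandler)


# To run standalone on an assumed port of 8100 (prefix "8" + leased value 100):
#   make_server(8100).serve_forever()
```

With a path="/ping" attribute on the diagnostic element, a GET on this endpoint returning 200 would count as a successful check.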

An important point to remember is that Corus knows which processes are supposed to run or not on given hosts, based on their tags. Therefore, it will not attempt to erroneously perform health checks on processes that are not supposed to be running, which would create false positives.

More on the Port Prefix

To better clarify the example above, let's expand it a bit and examine how the configuration will be interpreted at runtime to build a URL. For starters, here's an upgraded <port> configuration:

<distribution name="demo" version="1.0" xmlns="http://www.sapia-oss.org/xsd/corus/distribution-5.0.xsd">
  <process name="httpServer" maxKillRetry="3" shutdownTimeout="30000" invoke="true" tags="http-server">
    <port name="http.server">
      <http-diagnostic portPrefix="8" path="/ping" successCode="200"  readTimeout="1000" connectionTimeout="2000" gracePeriod="30" />
    </port>
    ...
  </process>
  ...
</distribution>

The portPrefix attribute deserves a bit more explaining, and a refresher about port management in Corus. First, remember that port management is supported by Corus mainly to allow running multiple processes on the same box. In such a context, suppose it is our preference to use different digits for different protocols. For example, say we like to have the JMX server in the JVM listen on port 6xxx, and our HTTP service listen on port 8xxx, and so on, but we like to have the same port suffix digits for a given process (on any given host): process #1 would have 6100 and 8100, process #2 6101 and 8101, etc. This is entirely possible by configuring in Corus a port range whose allocated values would be used as port suffixes. The corresponding port add command in the CLI would look like this (given our above configuration):

port add -n http.server -min 100 -max 104

The above configures a range that allows for five ports. The values leased to processes by Corus will be the integers from 100 to 104, inclusive. In our corus.xml descriptor, port values passed to processes by Corus can be indirectly reused in config, with corresponding variables. The variable names consist of the port range name, prefixed by the corus.process.port string:

<java mainClass="org.sapia.corus.examples.HttpServer" profile="test" vmType="server">
  <property name="server.port" value="8${corus.process.port.http.server}" />
  <xoption  name="ms" value="16M" />
</java>

Note the 8 digit preceding the variable: it corresponds to the portPrefix attribute, and we must provide it in order for Corus to correctly generate the URL that will be used for performing the health check.
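The way the prefix and the leased port value combine can be sketched as follows. This is only an illustration of the concatenation described above, not Corus' actual code: the health_check_url function name is hypothetical, and the localhost host name is an assumption based on the fact that Corus only checks processes running on its own host.

```python
def health_check_url(leased_port, port_prefix="", path="/", scheme="http", host="localhost"):
    """Rebuilds a health-check URL from a leased port value and a portPrefix."""
    # The prefix digits and the leased value are concatenated as strings,
    # not added: prefix "8" and leased value 100 give port 8100.
    full_port = int(f"{port_prefix}{leased_port}")
    if not path.startswith("/"):
        path = "/" + path
    return f"{scheme}://{host}:{full_port}{path}"


# portPrefix="8", path="/ping", leased port value 100:
print(health_check_url(100, port_prefix="8", path="/ping"))
# prints http://localhost:8100/ping
```

If the application were started with a different prefix than the one declared in portPrefix, the rebuilt URL would point at the wrong port and the check would fail.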

Internal Diagnostic

As was mentioned earlier on, Corus also checks its internal state in order to determine the proper diagnostic:

  • It checks its own task log, and checks for those entries that have an ERROR level. If such entries are found since the last diagnostic check, they will result in a "failure" status. For example, a deployment might have been attempted, but failed due to an invalid Corus distribution archive. Or, in the context of a deployment, processes might have started and crashed immediately - and therefore might have been terminated automatically by Corus...
  • It checks its internal components to ensure that they're not performing activity that might contradict the diagnostic Corus might come to. For example, if deployments are currently occurring, or if processes are pending execution, Corus detects it and allocates time for the completion of such activities. When Corus is not ready to perform a diagnostic, it will yield an "incomplete" status.

API Doc

The Diagnostic API is built on the foundation of the core REST API. That is, the details regarding security, response format, etc., are exactly the same. The Diagnostic API in fact only adds two REST resources to the core API.

Implementation Notes

The Diagnostic API, from a client perspective, only requires interpreting the following HTTP statuses to start with (delving into the details of responses can be useful, but is not required to get going):

  • 200 (HTTP OK): Means that the diagnostic was positive.
  • 500 (HTTP Internal Server Error): Means that the diagnostic was negative - if in the context of a deployment, remediation should be considered (rollback, etc).
  • 503 (HTTP Service Unavailable): Means that a diagnostic could not be determined, due to ongoing concurrent activity, or to activity that has not yet completed. In such a case, clients should retry a few seconds later (10 seconds, as a rule of thumb).

Note that the internals have been designed to avoid infinite loop conditions on the client-side, when performing retries. A "Service Unavailable" status will not last forever, and eventually Corus will yield a "final" status (either 200 or 500, depending).
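The client-side logic implied by these three statuses can be sketched as a small polling loop. The wait_for_diagnostic function and its parameters are hypothetical names chosen for this example; fetch_status stands for any callable that performs the diagnostic request and returns the HTTP status code. The cap on attempts is a defensive assumption of the sketch, not something the API requires.

```python
import time


def wait_for_diagnostic(fetch_status, retry_delay=10.0, max_attempts=30, sleep=time.sleep):
    """Polls until a final diagnostic status (200 or 500) is obtained.

    Returns True for a positive diagnostic (200), False for a negative
    one (500). A 503 triggers a retry after retry_delay seconds.
    """
    for attempt in range(1, max_attempts + 1):
        status = fetch_status()
        if status == 200:
            return True
        if status == 500:
            return False
        if status != 503:
            raise RuntimeError(f"Unexpected HTTP status: {status}")
        if attempt < max_attempts:
            sleep(retry_delay)
    raise TimeoutError(f"No final diagnostic after {max_attempts} attempts")
```

In a CI/CD job, a False result would typically trigger remediation such as a rollback. Note that Python's urllib.request raises urllib.error.HTTPError for 500 and 503 responses, so a fetch_status built on it would need to catch that exception and return its code attribute.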

Just as in the case of the core REST API, a response corresponding to either a single Corus node, or to a whole cluster, can be obtained. A client only needs to connect to one Corus instance in the cluster and can obtain, through that instance, a diagnostic for that node only, or for all the nodes in the cluster. In the latter case, the node to which a client connects will play the role of aggregator, consolidating the responses from siblings into a single response.

For a "success" diagnostic to be determined, all nodes in the cluster must return a diagnostic which amounts to a success.

Specification

As was briefly mentioned above, only two REST resources have been added to support acquiring diagnostic data from Corus. These resources allow acquiring such data for a single host, and for the whole cluster, respectively. They are documented below.

GET
- Permission.....: ADMIN
- Behavior.......: SYNC
- Request headers:
  - Accept......: application/json
- Response Status:
  - 200: Indicates that all Corus process activity on all 
         Corus nodes is normal.
  - 500: Indicates that diagnostic acquisition failed at 
         one node at least.
  - 503: Indicates that diagnostic acquisition could not 
         be obtained for now. The client should retry in
         a few seconds, until a 200 or 500 is obtained.
         
- Resources:
  /clusters/{clusterName}/diagnostic
  /clusters/{clusterName}/hosts/{host}/diagnostic
  /clusters/{clusterName}/partitionsets/{partitionSetId}/partitions/{partitionIndex}/diagnostic

- Path variables:
  - clusterName...: The name of the cluster to access.
  - host..........: A host literal (taking the form: ip_address:port).
  - partitionSetId: The ID of the partition set to target.
  - partitionIndex: The index of the partition in the targeted partition set.
  
- Parameters:
  - contentLevel (optional, defaults to minimal - can be either 
                  minimal, summary or detail): allows specifying the level 
                  of detail (making the response more or less 
                  verbose, depending on what clients require).

Sample requests

http://saturn:33000/rest/clusters/app-01/diagnostic
http://saturn:33000/rest/clusters/app-01/hosts/192.168.0.104:33000/diagnostic
http://saturn:33000/rest/clusters/app-01/hosts/192.168.0.104:33000/diagnostic?contentLevel=detail
http://saturn:33000/rest/clusters/app-01/partitionsets/8d450417-cbee-4e58-9533-ab7a0966042a/partitions/0/diagnostic
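
The cluster-level and host-level requests above can be assembled programmatically, as in the following sketch. The diagnostic_request_url and diagnostic_request function names are hypothetical, the base URL is taken from the sample requests, and the authentication details of the core REST API are deliberately left out.

```python
from urllib.parse import quote, urlencode
from urllib.request import Request

CONTENT_LEVELS = ("minimal", "summary", "detail")


def diagnostic_request_url(base_url, cluster, host=None, content_level=None):
    """Builds the URL of a cluster-wide or host-specific diagnostic request."""
    url = f"{base_url}/clusters/{quote(cluster, safe='')}"
    if host is not None:
        # Keep the colon of the ip_address:port host literal unescaped.
        url += f"/hosts/{quote(host, safe=':')}"
    url += "/diagnostic"
    if content_level is not None:
        if content_level not in CONTENT_LEVELS:
            raise ValueError(f"Unknown content level: {content_level}")
        url += "?" + urlencode({"contentLevel": content_level})
    return url


def diagnostic_request(base_url, cluster, host=None, content_level=None):
    """Wraps the URL in a GET request asking for a JSON response."""
    url = diagnostic_request_url(base_url, cluster, host, content_level)
    return Request(url, headers={"Accept": "application/json"}, method="GET")


print(diagnostic_request_url("http://saturn:33000/rest", "app-01",
                             host="192.168.0.104:33000", content_level="detail"))
# prints http://saturn:33000/rest/clusters/app-01/hosts/192.168.0.104:33000/diagnostic?contentLevel=detail
```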

Sample Response

The diagnostic response below corresponds to the summary content level:

[
{
  "cluster": "default",
  "host": "192.168.1.12:33000",
  "dataType": "diagnostic",
  "data": {
    "classVersion": 1,
    "status": "SUCCESS",
    "processDiagnostics": [
    {
      "classVersion": 1,
      "status": "NO_DIAGNOSTIC_AVAILABLE",
      "suggestedAction": "REMEDIATE",
      "name": "httpServer",
      "distribution": {
        "name": "demo",
        "version": "1.0"
      },
      "results": [
      {
        "classVersion": 1,
        "status": "NO_DIAGNOSTIC_CONFIG",
        "message": "Process [distribution=demo,version=1.0,profile=test,process=httpServer,pid=150822379402,status=ACTIVE] ([distribution=demo,version=1.0,process=httpServer]) has no diagnostic config defined for port http.server",
        "diagnosticPort": {
          "name": "http.server",
          "value": 100
        },
        "process": {
          "classVersion": 2,
          "id": "150822379402",
          "name": "httpServer",
          "pid": "4937",
          "distribution": "demo",
          "version": "1.0",
          "profile": "test",
          "activePorts": [
          {
            "name": "http.server",
            "port": 100
          }
          ]
        }
      }
      ]
    }
    ],
    "progressDiagnostics": {
      "classVersion": 1,
      "errors": []
    }
  }
}
]

The response above gives an idea of how diagnostic data is provided. For each Corus host, the data section holds a JSON object corresponding to a global diagnostic result. For that global result, a status is provided, which is SUCCESS in the present case. The possible values for this status are the following:

  • SUCCESS: diagnostic was positive for the given Corus node (all processes running as expected, no error detected).
  • FAILURE: diagnostic check failed for one or more processes on the given node.
  • INCOMPLETE: diagnostic could not be obtained at this time on the given node. Client should retry.

The global result is subdivided into two sections, each corresponding to a different type of diagnostic. These two sections are in fact represented as two fields: processDiagnostics and progressDiagnostics.

Process Diagnostics Overview

The processDiagnostics field consists of an array holding diagnostic results for processes, grouped by process configuration. That is, each element of the array corresponds to a <process> element in the corus.xml descriptor. It holds the distribution name, version, and the name of the processes for which diagnostic data was obtained. Within such a process configuration result, there is another array, under the results field: each element in that array corresponds to the diagnostic data for a specific process instance "under" that process configuration.

For example, in the response output above, the overall diagnostic for the "httpServer" processes corresponding to the "demo" distribution (version 1.0) is NO_DIAGNOSTIC_AVAILABLE. Note that a "suggested action" is indicated: REMEDIATE. The possible values are:

  • NOOP: nothing wrong has been detected.
  • RETRY: diagnostic data could not be obtained at this time and the client should retry in a few seconds.
  • REMEDIATE: diagnostic resulted in one or more failures being detected, requiring remediation.

It should be noted that the status determined at the process level determines the status at "upper" levels: a status corresponding to a failure at the process level will result in a failure at the global level, and in an HTTP 500 sent to the client.

In the sample response above, you might wonder why the suggested action REMEDIATE does not result in a status corresponding to an error: in the present case, the diagnostic could not be obtained because the corus.xml descriptor is missing the configuration needed for Corus to perform an automatic diagnostic check against the corresponding processes. It was chosen to consider this case not as a failure per se, but as still justifying remediation in the form of fixing the corus.xml.
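
For investigation purposes, a client could walk the response and list the process configurations that call for attention, as in the sketch below. The actions_needed function name is hypothetical, and the field names are the ones visible in the sample response (summary content level); the shape of the response at other content levels is not assumed here.

```python
import json


def actions_needed(response_text):
    """Lists (host, process name, status, suggested action) for every
    process configuration whose suggested action is not NOOP."""
    findings = []
    for node in json.loads(response_text):
        data = node["data"]
        for diag in data.get("processDiagnostics", []):
            action = diag.get("suggestedAction")
            if action != "NOOP":
                findings.append((node["host"], diag["name"], diag["status"], action))
    return findings


# Trimmed-down version of the sample response shown earlier.
sample = """
[{"cluster": "default", "host": "192.168.1.12:33000", "dataType": "diagnostic",
  "data": {"status": "SUCCESS",
           "processDiagnostics": [
             {"status": "NO_DIAGNOSTIC_AVAILABLE", "suggestedAction": "REMEDIATE",
              "name": "httpServer",
              "distribution": {"name": "demo", "version": "1.0"},
              "results": []}],
           "progressDiagnostics": {"errors": []}}}]
"""
print(actions_needed(sample))
# prints [('192.168.1.12:33000', 'httpServer', 'NO_DIAGNOSTIC_AVAILABLE', 'REMEDIATE')]
```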

Progress Diagnostic Overview

The progressDiagnostics field holds an array of error messages (if any), under its errors field, coming from Corus' task manager component. Indeed, Corus does most of its work (when there is such work) through so-called tasks. For example, deployment/undeployment is done by separate tasks; so are process execution and termination; etc. Such tasks generate internal logs, according to different levels (debug, warning, error, etc.).

As part of determining a diagnostic, Corus checks for messages corresponding to the "error" level that tasks might have logged. If any such message is present, it is returned as part of the diagnostic response. It also means that the global diagnostic status will then be FAILURE.

Corus does not rescan the task log messages it has previously scanned. At the next diagnostic check, it will only scan the messages that have been generated since the last check. This logic has been put in place to avoid considering messages that might not reflect the current situation, given a remediation that might have occurred in the meantime.
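
The "scan only what is new" behavior can be pictured with the following sketch. It is purely illustrative and does not reflect Corus' internal implementation: the TaskLogScanner class, its methods and the in-memory list are all assumptions made to show the idea of a cursor that advances at each check.

```python
class TaskLogScanner:
    """Keeps a cursor so that each log entry is examined only once."""

    def __init__(self):
        self._entries = []  # (level, message) tuples, oldest first
        self._cursor = 0    # index of the first entry not yet scanned

    def log(self, level, message):
        self._entries.append((level, message))

    def scan_new_errors(self):
        """Returns ERROR messages logged since the previous scan."""
        new_entries = self._entries[self._cursor:]
        self._cursor = len(self._entries)
        return [message for level, message in new_entries if level == "ERROR"]


scanner = TaskLogScanner()
scanner.log("ERROR", "Deployment failed: invalid distribution archive")
print(scanner.scan_new_errors())  # the error is reported once
print(scanner.scan_new_errors())  # prints [] since nothing new was logged
```

An error that led to a FAILURE at one check therefore does not keep failing subsequent checks once it has been reported.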

Status Doc

The following sub-sections describe the statuses at the different levels (global, process configuration, process instance). Note that clients do not have to interpret all these statuses themselves. Such an interpretation is done by Corus, which returns, through an HTTP code, the result of that interpretation. Yet, having such details in hand can prove useful for investigation/remediation purposes.

Process-Level Diagnostic

The table below documents the statuses at the process level. These statuses are ultimately interpreted by Corus, in order to determine the status at the process configuration level.

Status | Description | Outcome/Suggested Action
CHECK_SUCCESSFUL | Process checked successfully. | No further action required.
CHECK_FAILED | Process check failed. | Remediation required.
RESTARTING | Process is currently restarting. | A retry should be attempted in a few seconds.
SHUTTING_DOWN | Process is currently shutting down. | A retry should be attempted in a few seconds.
STALE | Process is stale - this happens when process auto-restart is disabled: unresponsive processes are terminated by Corus, but not restarted. In such a case they're assigned the "stale" status. | Remediation required.
SUSPECT | Process check failed, but might not have finished booting up yet. | A retry should be attempted in a few seconds.
NO_DIAGNOSTIC_CONFIG | Process has no diagnostic configuration defined in the corus.xml descriptor, defeating the purpose of diagnostics. | This status does not result in a failure, but the corresponding process configuration should be modified in order to allow for automatic health check by Corus.
PROCESS_LOCKED | Process currently locked by a concurrent operation. | This does not correspond to a failure. A retry should be attempted in a few seconds.
NO_ACTIVE_PORT | The process does not have a port active (at least, not one configured through Corus). | This does not correspond to a failure and requires no retry.

Process Configuration-Level Diagnostic

The following statuses "fold up" into the global statuses - see next section.

Status | Description | Outcome/Suggested Action
SUCCESS | All processes responding. | No further action required.
FAILURE | One or more processes not responding. | Remediation required.
BUSY | The system is busy with concurrent activities that may defeat diagnostic acquisition, or lead to the wrong diagnostic. | A retry should be attempted in a few seconds.
NO_PROCESSES_EXPECTED | No process expected to be running at this time. | No further action required.
NO_DIAGNOSTIC_AVAILABLE | No process diagnostic could be acquired due to missing diagnostic configuration in the corus.xml descriptor. | This status does not result in a failure, but the corresponding process configuration should be modified in order to allow for automatic health check by Corus.
PENDING_EXECUTION | One or more processes pending execution. | A retry should be attempted in a few seconds.
MISSING_PROCESS_INSTANCES | One or more processes are likely down. | This status corresponds to a failure and requires remediation.

Global-Level Diagnostic

There is one global diagnostic per Corus node. From all such global diagnostics, a final HTTP status code is determined and returned to the client. The following logic is applied:

  • If all nodes have a SUCCESS status, the HTTP code returned by Corus will be 200;
  • if at least one node has a FAILURE status, an HTTP 500 will be returned;
  • if at least one node has a status indicating that diagnostic acquisition is not currently possible, or implying a retry, an HTTP 503 will be returned.
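
The rules above can be sketched as a small function. The http_status_for name is hypothetical, and one point is an assumption of this sketch rather than documented behavior: when some nodes report FAILURE while others report INCOMPLETE, the failure is given precedence here, following the order in which the rules are listed.

```python
def http_status_for(node_statuses):
    """Folds per-node global statuses into a single HTTP status code."""
    statuses = list(node_statuses)
    if not statuses:
        raise ValueError("At least one node status is required")
    unknown = set(statuses) - {"SUCCESS", "FAILURE", "INCOMPLETE"}
    if unknown:
        raise ValueError(f"Unknown statuses: {sorted(unknown)}")
    # Assumption: FAILURE takes precedence over INCOMPLETE.
    if "FAILURE" in statuses:
        return 500
    if "INCOMPLETE" in statuses:
        return 503
    return 200


print(http_status_for(["SUCCESS", "SUCCESS"]))     # prints 200
print(http_status_for(["SUCCESS", "INCOMPLETE"]))  # prints 503
print(http_status_for(["SUCCESS", "FAILURE"]))     # prints 500
```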

Status | Description | Outcome/Suggested Action
SUCCESS | All processes responding. | No further action required.
FAILURE | One or more processes not responding. | Remediation required.
INCOMPLETE | Diagnostic data could not be obtained at this time. | A retry should be attempted in a few seconds.