Log and Monitor GoodData.CN

Proper operation and maintenance of GoodData.CN are essential for ensuring its performance, reliability, and scalability. This guide provides best practices for log collection, monitoring, tracing, and capacity planning.

Log Collection

Setting up a centralized logging system is recommended to automatically collect logs from all Kubernetes workloads and components. GoodData.CN log messages are structured and JSON-formatted, making them easy to ingest and analyze. Logs contain tracing information (traceId and spanId), allowing you to link relevant logs from multiple sources for broader context.
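
For illustration, a single log entry might look like the following. Only the traceId and spanId fields come from the description above; the remaining field names and values are purely illustrative and can differ between components:

{
  "ts": "2024-05-14T09:21:33.512Z",
  "level": "INFO",
  "logger": "com.gooddata.example.SomeService",
  "message": "Request processed",
  "traceId": "4bf92f3577b34da6a3ce929d0e0e4736",
  "spanId": "00f067aa0ba902b7"
}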

Recommended Log Collection Systems

Consider using one of the following commercial or open-source systems for log collection:

  • Splunk
  • Sumo Logic
  • Graylog
  • Grafana Loki
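
For example, if you ship the logs to Grafana Loki, a LogQL query along the following lines pulls together every log line belonging to one trace. The namespace label and its value are assumptions and depend on how you deployed GoodData.CN:

{namespace="gooddata-cn"} | json | traceId="4bf92f3577b34da6a3ce929d0e0e4736"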

Monitoring and Tracing

Most GoodData.CN components expose Prometheus metrics. These metrics can be collected by a monitoring system, visualized, and used to trigger alerts when issues arise. If you use Prometheus Operator, the Helm chart can create PodMonitor objects for you.
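
For reference, a PodMonitor for GoodData.CN pods could look roughly like the sketch below. The namespace, label selector, and metrics port name are assumptions and must match the labels and ports of your installation:

apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: gooddata-cn
  namespace: gooddata-cn
spec:
  selector:
    matchLabels:
      # Assumption: adjust to the labels applied by your Helm release.
      app.kubernetes.io/instance: gooddata-cn
  podMetricsEndpoints:
    # Assumption: the name of the container port that exposes Prometheus metrics.
    - port: metrics
      path: /metrics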

Setting Up Distributed Tracing

GoodData.CN services can produce distributed tracing events in a Zipkin-compatible format. We recommend setting up infrastructure to collect these events so that you can analyze them and troubleshoot issues. Systems capable of receiving these events include:

  • Jaeger
  • Zipkin
  • Grafana (with Tempo storage)
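
As a minimal sketch for evaluation (not a production setup), a Jaeger all-in-one instance can be configured to accept Zipkin-format spans on port 9411. The namespace, labels, and image tag below are assumptions:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: jaeger
  namespace: tracing
spec:
  replicas: 1
  selector:
    matchLabels:
      app: jaeger
  template:
    metadata:
      labels:
        app: jaeger
    spec:
      containers:
        - name: jaeger
          image: jaegertracing/all-in-one:1.57
          env:
            # Enables the Zipkin-compatible receiver on port 9411.
            - name: COLLECTOR_ZIPKIN_HOST_PORT
              value: ":9411"
          ports:
            - containerPort: 9411   # Zipkin-compatible span ingestion
            - containerPort: 16686  # Jaeger UI

The services then need to be pointed at the Zipkin-compatible endpoint (port 9411 in this sketch).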

Capacity Planning

Each GoodData.CN microservice has its own resource allocation. In Kubernetes, resource consumption is controlled by parameters such as the replica count and the CPU and memory requests and limits. Below is an example configuration:

someApplication:
  replicaCount: 2
  resources:
    requests:
      cpu: 200m
      memory: 300Mi
    limits:
      cpu: 500m
      memory: 600Mi

For applications running on the JVM, additional JVM options also need to be set:

someJavaApplication:
  jvmOptions: "-XX:ReservedCodeCacheSize=100M -Xms320m -Xmx320m -XX:MaxMetaspaceSize=170M"

Adjusting Resource Allocations

The default values specified in the GoodData.CN Helm chart are minimal. If you expect many concurrent users, these values may need to be increased to handle the load. Consider scaling up CPU and/or memory limits. For JVM-based applications, remember to update the heap size parameter (-XmxNNNm) to utilize the added memory.
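
Continuing the earlier example, a scaled-up configuration for a JVM-based service could look like the sketch below; the numbers are illustrative and depend on your actual workload:

someJavaApplication:
  replicaCount: 3
  resources:
    requests:
      cpu: 400m
      memory: 600Mi
    limits:
      cpu: 1000m
      memory: 1200Mi
  # Keep the heap (-Xms/-Xmx) in step with the new memory limit and leave
  # headroom for metaspace, the code cache, and other off-heap memory.
  jvmOptions: "-XX:ReservedCodeCacheSize=100M -Xms700m -Xmx700m -XX:MaxMetaspaceSize=170M"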

Monitoring Resource Usage

Use your monitoring system to observe actual CPU and memory metrics, including RSS memory and CPU throttling. For JVM-based applications, monitor metrics related to various JVM memory regions and garbage collection. If needed, increase the replica count to distribute the workload across multiple pods.
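
For example, an alerting rule similar to the following sketch can flag sustained CPU throttling. It assumes that the Prometheus Operator and the kubelet/cAdvisor metrics are available and that GoodData.CN runs in the gooddata-cn namespace:

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: gooddata-cn-capacity
  namespace: gooddata-cn
spec:
  groups:
    - name: capacity
      rules:
        - alert: HighCpuThrottling
          # Fraction of CPU periods in which a pod was throttled over the last 5 minutes.
          expr: |
            sum by (pod) (rate(container_cpu_cfs_throttled_periods_total{namespace="gooddata-cn"}[5m]))
              /
            sum by (pod) (rate(container_cpu_cfs_periods_total{namespace="gooddata-cn"}[5m]))
              > 0.25
          for: 15m
          labels:
            severity: warning
          annotations:
            summary: "Pod {{ $labels.pod }} is heavily CPU-throttled; consider raising its CPU limit or the replica count."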

Refer to the default Helm chart values for detailed per-application resource settings.