Metrics & Monitoring

Core metrics

  • Query targeting

    • The ratio of documents scanned to documents returned; the ideal is 1, where one document is returned for every document read

    • A very high ratio negatively impacts performance

  • Storage

    • Writes are refused when storage reaches capacity, which can crash the process

    • Key metrics include disk space percent free, disk latency, disk IOPS, and disk queue depth

  • CPU

    • High CPU usage may call for query optimization with indexes or a hardware upgrade

  • Memory

    • The system should be sized to hold all indexes in memory

    • Key metrics include swap usage and memory usage

  • Replication lag

    • The delay between a write on the primary and its application on a secondary, in seconds
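A minimal sketch of how two of the core metrics above could be computed by hand. It assumes serverStatus-style counters (metrics.queryExecutor.scannedObjects, metrics.document.returned) and replica set member documents with an optimeDate field; the sample values and the function names are made up for illustration, and counters are treated as plain numbers.

```javascript
// Query targeting: documents scanned per document returned (ideal is 1).
function queryTargetingRatio(status) {
  const scanned = status.metrics.queryExecutor.scannedObjects;
  const returned = status.metrics.document.returned;
  // Avoid dividing by zero when nothing has been returned yet.
  return returned === 0 ? null : scanned / returned;
}

// Replication lag: seconds between the last write applied on the primary
// and the last write applied on a secondary.
function replicationLagSeconds(primary, secondary) {
  return (primary.optimeDate.getTime() - secondary.optimeDate.getTime()) / 1000;
}

// Hypothetical sample data, not taken from a real cluster.
const sampleStatus = {
  metrics: {
    queryExecutor: { scannedObjects: 50000 },
    document: { returned: 250 },
  },
};
const samplePrimary = { optimeDate: new Date("2024-01-01T00:00:10Z") };
const sampleSecondary = { optimeDate: new Date("2024-01-01T00:00:04Z") };

const ratio = queryTargetingRatio(sampleStatus); // 200 scanned per returned
const lag = replicationLagSeconds(samplePrimary, sampleSecondary); // 6 seconds
console.log({ ratio, lag });
```

A ratio of 200 like the one in this sample would be far from the ideal of 1.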

Additional Metrics:

  • opcounters

    • Rate of operations per second run on the MongoDB process (the underlying counters accumulate since startup)

      • It tracks: command, query, insert, delete, update, getMore

  • network traffic

    • Average rate of physical bytes

      • bytes in / bytes out (physical)

    • Number of requests sent to DB

      • numRequests

  • connections

    • Organized by application, shell client, and internal processes

    • A high connection count can affect system performance

    • A large connection count may indicate a suboptimal connection strategy

  • tickets available

    • When available tickets drop to zero, other operations must wait

    • This indicates an undersized cluster or poorly performing queries
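A rough sketch of how the opcounters and tickets metrics above might be derived from raw samples. It assumes two cumulative opcounters documents taken a known number of seconds apart, and a tickets document shaped like wiredTiger.concurrentTransactions in serverStatus (an assumption; the location of ticket counts varies by server version). All sample values and function names are hypothetical.

```javascript
// Turn two cumulative opcounters samples into per-second rates.
function opcounterRates(earlier, later, elapsedSeconds) {
  const rates = {};
  for (const op of Object.keys(later)) {
    rates[op] = (later[op] - (earlier[op] ?? 0)) / elapsedSeconds;
  }
  return rates;
}

// True when either the read or the write ticket pool has no tickets left,
// which means new operations of that kind must wait.
function ticketsExhausted(tickets) {
  return tickets.read.available === 0 || tickets.write.available === 0;
}

// Hypothetical samples taken 60 seconds apart.
const sampleEarlier = { insert: 1000, query: 5000, update: 200, delete: 50, getmore: 300, command: 9000 };
const sampleLater = { insert: 1600, query: 8000, update: 260, delete: 50, getmore: 420, command: 9600 };
const rates = opcounterRates(sampleEarlier, sampleLater, 60);

// Hypothetical ticket state: the write pool is fully used.
const sampleTickets = {
  read: { out: 3, available: 125 },
  write: { out: 128, available: 0 },
};

console.log(rates, ticketsExhausted(sampleTickets));
```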

Atlas CLI

  • atlas metrics processes <host_name>:<port>

    • You can also add parameters such as --period, --granularity, --output, and --type

  • Atlas also has real-time monitoring charts

  • You can kill long-running operations in this dashboard

atlas processes list

atlas metrics processes atlas-jj12z4-shard-00-00.p31f3ej.mongodb.net:27017 --period P1D --granularity PT1M

  • P1D is an ISO 8601 duration meaning a period of 1 day

  • PT1M means 1-minute intervals
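To make the period and granularity values concrete, here is a small sketch that converts simple ISO 8601 durations to seconds and works out how many intervals fit in the period. The parser is a hypothetical helper that only handles days, hours, minutes, and seconds; the number of data points Atlas actually returns may differ.

```javascript
// Convert a simple ISO 8601 duration (days, hours, minutes, seconds only) to seconds.
function durationToSeconds(iso) {
  const match = /^P(?:(\d+)D)?(?:T(?:(\d+)H)?(?:(\d+)M)?(?:(\d+)S)?)?$/.exec(iso);
  // Reject strings that do not match, and empty forms such as "P" or "PT".
  if (!match || iso === "P" || iso.endsWith("T")) {
    throw new Error(`Unsupported duration: ${iso}`);
  }
  const [days, hours, minutes, seconds] = match.slice(1).map((part) => Number(part ?? 0));
  return days * 86400 + hours * 3600 + minutes * 60 + seconds;
}

const periodSeconds = durationToSeconds("P1D"); // 86400
const granularitySeconds = durationToSeconds("PT1M"); // 60

// One day sampled at one-minute intervals gives 1440 intervals.
const intervals = periodSeconds / granularitySeconds;
console.log(intervals);
```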

Or, to return only the last 10 CONNECTIONS data points as JSON:

atlas metrics processes atlas-jj12z4-shard-00-00.p31f3ej.mongodb.net:27017 --period P1D --granularity PT1M --output json --type CONNECTIONS | jq '.measurements[0].dataPoints |= .[-10:]'
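The jq filter above keeps only the last 10 data points of the first measurement. A sketch of the same trimming in JavaScript, assuming the JSON output has already been parsed; the measurements[0].dataPoints path comes from the jq filter, while the timestamp and value field names and the sample data are assumptions.

```javascript
// Return a copy of the metrics document in which the first measurement
// keeps only its last `count` data points (count is assumed to be positive).
function keepLastDataPoints(metrics, count) {
  const [first, ...rest] = metrics.measurements;
  if (!first) return metrics;
  const trimmed = { ...first, dataPoints: first.dataPoints.slice(-count) };
  return { ...metrics, measurements: [trimmed, ...rest] };
}

// Hypothetical parsed output with 12 one-minute data points.
const sampleOutput = {
  measurements: [
    {
      name: "CONNECTIONS",
      dataPoints: Array.from({ length: 12 }, (_, i) => ({
        timestamp: new Date(Date.UTC(2024, 0, 1, 0, i)).toISOString(),
        value: 20 + i,
      })),
    },
  ],
};

const lastTen = keepLastDataPoints(sampleOutput, 10);
console.log(lastTen.measurements[0].dataPoints.length);
```

Unlike the jq update operator, this version leaves the input object unchanged and returns a new one.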

Configure Alerts

  • You can configure alert settings at the organization and project levels

  • You must have the Project Owner role to configure project alerts

  • Shared-tier clusters only trigger alerts for:

    • connections, logical size, opcounters, network

  • All Atlas projects come with default alerts, but they can be edited

    • Atlas alerts are under the bell icon in the top right

CLI

Responding to Alerts

Alerts are shown in the Atlas alerts view:

  • Notifications continue until an alert is acknowledged

    • After acknowledgement, no further notifications are sent until the acknowledgement period ends, you resolve the condition, or you unacknowledge the alert
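The acknowledgement rules above can be sketched as a small decision function. This is only an illustration of the logic, not how Atlas implements it; the status values and the acknowledgedUntil field are assumptions loosely modeled on an alert document, and unacknowledging is represented by the field being absent.

```javascript
// Decide whether a notification would still be sent for an alert at time `now`.
function shouldNotify(alert, now) {
  // A resolved condition stops notifications.
  if (alert.status !== "OPEN") return false;
  // An active acknowledgement period suppresses notifications.
  if (alert.acknowledgedUntil && now < new Date(alert.acknowledgedUntil)) return false;
  // Open and either never acknowledged, unacknowledged, or past the period.
  return true;
}

const now = new Date("2024-01-01T12:00:00Z");

// Hypothetical alerts covering each case.
const openAlert = { status: "OPEN" };
const acknowledgedAlert = { status: "OPEN", acknowledgedUntil: "2024-01-01T18:00:00Z" };
const expiredAckAlert = { status: "OPEN", acknowledgedUntil: "2024-01-01T06:00:00Z" };
const resolvedAlert = { status: "CLOSED" };

console.log(
  shouldNotify(openAlert, now),
  shouldNotify(acknowledgedAlert, now),
  shouldNotify(expiredAckAlert, now),
  shouldNotify(resolvedAlert, now)
);
```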

CLI

Integrations

Good for hybrid situations or when you are migrating from on-prem to cloud

Examples:

  • Prometheus, PagerDuty, Datadog, Sumo Logic, Splunk, custom webhooks, etc.

    • Prometheus and Datadog are only available on M10+ clusters

In the database dashboard, click the ellipsis and go to Integrations

  • Typically you fill in your credentials here for your connection

Self-Managed Monitoring

Cloud Manager or one of the hybrid solutions listed above can be used

  • Note: Prometheus, for example, cannot directly collect metrics from an on-prem deployment, but there are open source connectors such as those from Percona

  • The account collecting data will need the clusterMonitor role

Command Line Metrics

serverStatus

  • Diagnostic command that returns a document showing the current instance state

  • This command is used by monitoring platforms to collect valuable metrics

  • Ex: db.runCommand({ serverStatus: 1 })

  • Ex helper command: db.serverStatus()
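A minimal sketch of pulling one metric out of a serverStatus document, here connection usage. It assumes the connections.current and connections.available fields and treats them as plain numbers; the sample document and the function name are hypothetical.

```javascript
// Summarize connection usage from a serverStatus-style document.
function connectionUsage(status) {
  const { current, available } = status.connections;
  // current + available approximates the connection limit of the process.
  const limit = current + available;
  return { current, limit, percentUsed: (current * 100) / limit };
}

// Hypothetical excerpt of a serverStatus document.
const sampleServerStatus = {
  connections: { current: 150, available: 1350, totalCreated: 98231 },
};

const usage = connectionUsage(sampleServerStatus);
console.log(usage); // { current: 150, limit: 1500, percentUsed: 10 }
```

In mongosh, the same function could be called as connectionUsage(db.serverStatus()).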

currentOp

  • Admin command that returns a document describing active operations

  • Monitoring apps use this command to find slow operations

  • Ex: db.adminCommand({ currentOp: 1 })

killOp

  • Terminates an operation by its opId

  • Ex: db.adminCommand({ killOp: 1, op: <opId> })

  • Helper function: db.killOp(<opId>)
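A sketch of how currentOp and killOp fit together: filter the currentOp output for long-running active operations, then build a killOp command document for each. It assumes the output has an inprog array whose entries carry opid, active, and secs_running; the threshold, sample operations, and function names are hypothetical.

```javascript
// Pick active operations that have been running at least `thresholdSeconds`.
function findSlowOps(currentOpResult, thresholdSeconds) {
  return currentOpResult.inprog.filter(
    (op) => op.active && (op.secs_running ?? 0) >= thresholdSeconds
  );
}

// Build the command document that would terminate the given operation.
function killOpCommand(op) {
  return { killOp: 1, op: op.opid };
}

// Hypothetical excerpt of currentOp output.
const sampleCurrentOp = {
  inprog: [
    { opid: 101, active: true, secs_running: 2, op: "query", ns: "shop.orders" },
    { opid: 102, active: true, secs_running: 95, op: "query", ns: "shop.orders" },
    { opid: 103, active: false, op: "none", ns: "" },
  ],
};

const slowOps = findSlowOps(sampleCurrentOp, 30);
const killCommands = slowOps.map(killOpCommand);
console.log(killCommands); // [ { killOp: 1, op: 102 } ]
```

In mongosh, each command document could be passed to db.adminCommand(), ideally after reviewing the list, since killing an operation is not reversible.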
