Metrics & Monitoring

Core metrics

  • Query targeting

    • Ratio of documents scanned to documents returned; the ideal is 1, where every document read is also returned

    • A very high ratio means queries scan far more documents than they return, which hurts performance

  • Storage

    • Writes are refused when storage reaches capacity, and the instance can crash

    • Key metrics include disk space percent free, disk latency, disk IOPS, and disk queue depth

  • CPU

    • Sustained high CPU usage may call for optimizing queries with indexes or upgrading hardware

  • Memory

    • Memory should be sized to hold all indexes

    • Key metrics include swap usage and memory usage

  • Replication lag

    • Delay between the primary and a secondary, in seconds
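
To make the query targeting ratio concrete, here is a minimal sketch that computes it from two counters that serverStatus reports under metrics: queryExecutor.scannedObjects and document.returned. The helper name and the sample numbers are made up for illustration, and the counters are lifetime totals, so this will not match a chart that samples over an interval.

```javascript
// Hypothetical helper: ratio of documents scanned to documents returned.
// `status` is assumed to be shaped like the document db.serverStatus() returns.
function queryTargetingRatio(status) {
  const scanned = Number(status.metrics.queryExecutor.scannedObjects);
  const returned = Number(status.metrics.document.returned);
  if (returned === 0) return null; // ratio is undefined when nothing was returned
  return scanned / returned;
}

// Made-up sample values, not real measurements.
const sample = {
  metrics: {
    queryExecutor: { scanned: 1200, scannedObjects: 50000 },
    document: { returned: 500 },
  },
};

console.log(queryTargetingRatio(sample)); // 100
```

A result of 100 would mean 100 documents were read for each one returned, which usually points at a missing or poorly matched index.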

Additional Metrics:

  • opcounters

    • Number of operations per second run on the MongoDB process; the underlying counters are totals since the process last started

      • Tracks: command, query, insert, delete, update, getMore

  • network traffic

    • Average rate of physical bytes

      • bytes in / bytes out (physical)

    • Number of requests sent to DB

      • numRequests

  • connections

    • Includes connections from applications, shell clients, and internal processes

    • Can affect system performance

    • A large connection count may indicate a suboptimal connection strategy

  • tickets available

    • When the number of available tickets drops to zero, other operations must wait

    • Can indicate an undersized cluster or poorly performing queries
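
Because the opcounters in serverStatus are totals since startup, a per-second rate needs two samples. The sketch below shows one way to do that, assuming two snapshots of the opcounters sub-document taken a known number of seconds apart. The helper name and sample values are hypothetical; note that serverStatus spells the field getmore in lowercase.

```javascript
// Hypothetical helper: per-second rates from two opcounters snapshots.
// `before` and `after` are assumed to look like db.serverStatus().opcounters.
function opcounterRates(before, after, elapsedSeconds) {
  if (elapsedSeconds <= 0) throw new RangeError("elapsedSeconds must be positive");
  const rates = {};
  for (const op of ["insert", "query", "update", "delete", "getmore", "command"]) {
    // Difference between the cumulative counters, spread over the interval.
    rates[op] = (Number(after[op]) - Number(before[op])) / elapsedSeconds;
  }
  return rates;
}

// Made-up snapshots taken 60 seconds apart.
const t0 = { insert: 100, query: 2000, update: 50, delete: 5, getmore: 300, command: 9000 };
const t1 = { insert: 160, query: 2600, update: 80, delete: 5, getmore: 330, command: 9600 };

// With these samples: insert 1/s, query 10/s, update 0.5/s, delete 0/s.
console.log(opcounterRates(t0, t1, 60));
```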

Atlas CLI

  • atlas metrics processes <host_name>:<port>

    • Optional flags include --period, --granularity, --output, and --type

  • Atlas also has real-time monitoring charts

  • You can kill long-running operations from the real-time monitoring dashboard

atlas processes list

atlas metrics processes atlas-jj12z4-shard-00-00.p31f3ej.mongodb.net:27017 --period P1D --granularity PT1M

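  • P1D is an ISO 8601 duration meaning 1 day

  • PT1M is an ISO 8601 duration meaning 1-minute intervals

Or, to get JSON for a single metric type and keep only the last 10 data points: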

atlas metrics processes atlas-jj12z4-shard-00-00.p31f3ej.mongodb.net:27017 --period P1D --granularity PT1M --output json --type CONNECTIONS | jq '.measurements[0].dataPoints |= .[-10:]'
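
As an alternative to jq, the JSON can be summarized in a few lines of JavaScript. This is only a sketch: the measurements[0].dataPoints path comes from the jq filter above, while the timestamp and value field names, the possibility of null values, and the sample data are assumptions.

```javascript
// Hypothetical helper: summarize the last `n` data points of the first measurement.
function summarizeLatest(metricsOutput, n) {
  const points = metricsOutput.measurements[0].dataPoints
    .slice(-n)
    // Assumption: gaps in the series show up as null values, so skip them.
    .filter((p) => p.value !== null && p.value !== undefined);
  if (points.length === 0) return null;
  const values = points.map((p) => p.value);
  return {
    count: values.length,
    max: Math.max(...values),
    average: values.reduce((sum, v) => sum + v, 0) / values.length,
  };
}

// Made-up output in the assumed shape.
const output = {
  measurements: [
    {
      name: "CONNECTIONS",
      dataPoints: [
        { timestamp: "2024-01-01T00:00:00Z", value: 12 },
        { timestamp: "2024-01-01T00:01:00Z", value: null },
        { timestamp: "2024-01-01T00:02:00Z", value: 18 },
        { timestamp: "2024-01-01T00:03:00Z", value: 15 },
      ],
    },
  ],
};

// Last 3 points are null, 18, 15, so count is 2, max is 18, average is 16.5.
console.log(summarizeLatest(output, 3));
```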

Configure Alerts

  • You can configure alert settings at organization and project levels

  • Requires the Project Owner role

  • Shared-tier clusters only trigger alerts for:

    • connections, logical size, opcounters, network

  • All Atlas projects come with default alerts, which can be edited

    • Alerts appear under the bell icon in the top right of the Atlas UI

CLI

atlas alerts settings list --output json

Responding to Alerts

Alerts are shown here:

  • Notifications continue until the alert is acknowledged

    • Once acknowledged, no further notifications are sent until the acknowledgement period ends, you resolve the condition, or you unacknowledge the alert

CLI

atlas alerts list --output json
atlas alerts acknowledge <alertId> --comment <comment>
atlas alerts unacknowledge <alertId>
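
The acknowledgement rules can be expressed as a small check. The sketch below assumes each alert in the JSON carries a status field (with "OPEN" for unresolved alerts) and an optional acknowledgedUntil timestamp. Both field names are assumptions about the output shape, and the helper itself is hypothetical.

```javascript
// Hypothetical helper: would this alert still send notifications at `now`?
// Assumed fields: `status` ("OPEN" while unresolved) and `acknowledgedUntil` (ISO timestamp).
function isNotifying(alert, now) {
  if (alert.status !== "OPEN") return false; // condition resolved, nothing to send
  if (!alert.acknowledgedUntil) return true; // never acknowledged, or unacknowledged
  // Acknowledged: notifications resume only once the acknowledgement period ends.
  return new Date(alert.acknowledgedUntil) <= now;
}

const now = new Date("2024-06-01T12:00:00Z");
console.log(isNotifying({ status: "OPEN" }, now)); // true
console.log(isNotifying({ status: "OPEN", acknowledgedUntil: "2024-06-01T18:00:00Z" }, now)); // false
console.log(isNotifying({ status: "OPEN", acknowledgedUntil: "2024-06-01T06:00:00Z" }, now)); // true
console.log(isNotifying({ status: "CLOSED" }, now)); // false
```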

Integrations

Good for hybrid situations or for when you are migrating on-prem to cloud

Examples:

  • Prometheus, PagerDuty, Datadog, Sumo Logic, Splunk, custom webhooks, etc.

    • Prometheus and Datadog are only available on M10+ clusters

In the database dashboard, click the ellipsis and go to Integrations

  • Typically you fill in your credentials here for your connection

Self-Managed Monitoring

Cloud Manager or one of the hybrid solutions listed above can be used

  • Note: Prometheus, for example, cannot directly collect metrics from an on-prem deployment, but there are open source connectors such as the one from Percona

  • The account collecting data will need the clusterMonitor role

Command Line Metrics

serverStatus

  • diagnostic command that returns a document showing current instance state

  • This command is used by monitoring platforms to collect valuable metrics

  • Ex:

    db.runCommand(
       {
         serverStatus: 1
       }
    )
  • Ex helper command: db.serverStatus()
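
As an illustration of what a monitoring tool might pull out of this document, the sketch below computes connection utilization from the connections.current and connections.available fields of serverStatus. The helper name and sample values are hypothetical.

```javascript
// Hypothetical helper: fraction of the connection limit currently in use.
// `status` is assumed to be shaped like the document db.serverStatus() returns.
function connectionUtilization(status) {
  const current = Number(status.connections.current);
  const available = Number(status.connections.available);
  const limit = current + available; // in-use plus still-available connections
  return limit === 0 ? null : current / limit;
}

// Made-up sample values.
const sampleStatus = { connections: { current: 150, available: 1350, totalCreated: 4200 } };

console.log(connectionUtilization(sampleStatus)); // 0.1
```

In mongosh, the same function could be called with the result of db.serverStatus().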

currentOp

  • Admin command that returns a document describing active operations

  • Monitoring apps use this command to find slow operations

  • Ex.

    db.adminCommand(
       {
         currentOp: true,
         "$all": true
       }
    )
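
A rough sketch of how a monitoring script might pick slow operations out of the result: currentOp returns its operations in an inprog array, and each entry can carry opid, active, op, ns, and secs_running. The helper name, the threshold, and the sample document are illustrative assumptions.

```javascript
// Hypothetical helper: active operations that have run for at least `minSeconds`.
// `currentOpResult` is assumed to be shaped like the document currentOp returns.
function findSlowOps(currentOpResult, minSeconds) {
  return currentOpResult.inprog
    // Idle entries may have no secs_running, so treat a missing value as 0.
    .filter((op) => op.active && Number(op.secs_running ?? 0) >= minSeconds)
    .map((op) => ({ opid: op.opid, op: op.op, ns: op.ns, secs_running: Number(op.secs_running) }));
}

// Made-up result document.
const sampleResult = {
  inprog: [
    { opid: 101, active: true, op: "query", ns: "shop.orders", secs_running: 42 },
    { opid: 102, active: true, op: "update", ns: "shop.carts", secs_running: 1 },
    { opid: 103, active: false, op: "none", ns: "" },
  ],
  ok: 1,
};

// Only opid 101 has been running for 10 seconds or more.
console.log(findSlowOps(sampleResult, 10));
```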

killOp

  • Terminates an operation by its opid

  • Ex.

    db.adminCommand(
       {
         killOp: 1,
         op: <opid>,
         comment: <any>
       }
    )
  • Helper function: db.killOp()
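
For scripts that kill operations programmatically, the command document can be built by a small function. This is a sketch only; the helper name and the example comment text are hypothetical.

```javascript
// Hypothetical helper: build a killOp command document for one operation.
function buildKillOp(opid, comment) {
  if (opid === undefined || opid === null) throw new TypeError("opid is required");
  const command = { killOp: 1, op: opid };
  if (comment !== undefined) command.comment = comment; // comment is optional
  return command;
}

const killCommand = buildKillOp(101, "exceeded 10s threshold");
console.log(killCommand.op); // 101
```

In mongosh, the returned document could be passed to db.adminCommand().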
