Metrics & Monitoring
Core metrics
Query targeting
Ratio of documents scanned to documents returned; the ideal ratio is 1, where one document is returned for every document read
A very high ratio negatively impacts performance
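A minimal sketch of how the ratio could be computed from a serverStatus-style document. It assumes the `metrics.queryExecutor.scannedObjects` and `metrics.document.returned` counters; the function name and the sample values are hypothetical.

```javascript
// Query targeting: documents scanned per document returned.
// Assumes plain numbers; mongosh may report these counters as Long values.
function queryTargetingRatio(status) {
  const scanned = status.metrics.queryExecutor.scannedObjects;
  const returned = status.metrics.document.returned;
  // Nothing returned yet: the ratio is undefined, so report null.
  if (returned === 0) return null;
  return scanned / returned;
}

// Hypothetical sample shaped like part of a serverStatus result.
const sampleTargetingStatus = {
  metrics: {
    queryExecutor: { scannedObjects: 50000 },
    document: { returned: 500 },
  },
};

console.log(queryTargetingRatio(sampleTargetingStatus)); // 100 documents scanned per document returned
```

A result far above 1, as in this sample, usually points to queries that are not supported by an index.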
Storage
Writes are refused when storage reaches capacity, which can cause the process to crash
Key metrics include disk space percent free, disk latency, disk IOPS, and disk queue depth
CPU
High CPU usage may call for optimizing queries with indexes or upgrading hardware
Memory
System memory should be sized to hold all indexes
Key metrics include swap usage and memory usage
Replication lag
Delay, in seconds, between a write on the primary and its replication to a secondary
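One way to estimate the lag, sketched under the assumption that you have a replSetGetStatus-style document whose members carry `stateStr` and `optimeDate`. The function name, host names, and timestamps are hypothetical.

```javascript
// Lag per secondary = primary's last applied op time minus the secondary's.
function replicationLagSeconds(replStatus) {
  const primary = replStatus.members.find((m) => m.stateStr === "PRIMARY");
  if (!primary) return null; // no primary, e.g. during an election
  const lagByMember = {};
  for (const member of replStatus.members) {
    if (member.stateStr !== "SECONDARY") continue;
    lagByMember[member.name] =
      (primary.optimeDate.getTime() - member.optimeDate.getTime()) / 1000;
  }
  return lagByMember;
}

// Hypothetical three-member replica set.
const sampleReplStatus = {
  members: [
    { name: "host-a:27017", stateStr: "PRIMARY", optimeDate: new Date("2024-01-01T00:00:10Z") },
    { name: "host-b:27017", stateStr: "SECONDARY", optimeDate: new Date("2024-01-01T00:00:08Z") },
    { name: "host-c:27017", stateStr: "SECONDARY", optimeDate: new Date("2024-01-01T00:00:10Z") },
  ],
};

console.log(replicationLagSeconds(sampleReplStatus));
// host-b is 2 seconds behind, host-c is caught up
```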
Additional Metrics:
opcounters
Number of operations the MongoDB process has run since startup, typically charted as operations per second
Tracks: command, query, insert, delete, update, getmore
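Because the counters accumulate from startup, a per-second rate comes from the difference between two samples. A rough sketch, assuming two opcounters documents taken a known number of seconds apart; the function name and sample numbers are hypothetical.

```javascript
const OP_TYPES = ["command", "query", "insert", "delete", "update", "getmore"];

// Per-second rate for each operation type between two cumulative samples.
// Note: counters reset on restart, which would show up as negative rates here.
function opsPerSecond(earlier, later, elapsedSeconds) {
  if (elapsedSeconds <= 0) throw new RangeError("elapsedSeconds must be positive");
  const rates = {};
  for (const op of OP_TYPES) {
    rates[op] = (later[op] - earlier[op]) / elapsedSeconds;
  }
  return rates;
}

// Hypothetical samples taken 60 seconds apart.
const opsAtStart = { command: 1000, query: 400, insert: 200, delete: 10, update: 50, getmore: 5 };
const opsAtEnd = { command: 1600, query: 700, insert: 260, delete: 10, update: 80, getmore: 35 };

console.log(opsPerSecond(opsAtStart, opsAtEnd, 60));
// command: 10, query: 5, insert: 1, delete: 0, update: 0.5, getmore: 0.5
```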
network traffic
Average rate of physical bytes received and sent: bytesIn / bytesOut
Number of requests sent to the database: numRequests
connections
Counted across applications, shell clients, and internal processes
Connection count can affect system performance
A large connection count may indicate a suboptimal connection strategy
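A small sketch of a connection check, assuming the `connections` section of serverStatus with its `current` and `available` fields. The function name and the sample numbers are hypothetical.

```javascript
// Fraction of the connection limit currently in use (0 to 1).
function connectionUtilization(connections) {
  const limit = connections.current + connections.available;
  return limit === 0 ? 0 : connections.current / limit;
}

// Hypothetical sample shaped like serverStatus().connections.
const sampleConnections = { current: 300, available: 1200, totalCreated: 98765 };

console.log(connectionUtilization(sampleConnections)); // 300 of 1500 in use
```

A `totalCreated` value that climbs quickly while `current` stays flat can hint that clients are opening short-lived connections instead of pooling them.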
tickets available
When the number of available tickets drops to zero, other operations must wait
This indicates an undersized cluster or poorly performing queries
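A sketch of a ticket check. It assumes the read and write ticket counts appear under `wiredTiger.concurrentTransactions` in serverStatus; that path is an assumption and may differ between server versions. The function name and sample values are hypothetical.

```javascript
// Returns the ticket pools ("read", "write") that have no tickets left.
// Assumption: tickets live under wiredTiger.concurrentTransactions.
function exhaustedTicketPools(status) {
  const pools = status.wiredTiger.concurrentTransactions;
  return ["read", "write"].filter((kind) => pools[kind].available === 0);
}

// Hypothetical sample: write tickets are exhausted, read tickets are not.
const sampleTicketStatus = {
  wiredTiger: {
    concurrentTransactions: {
      read: { out: 3, available: 125, totalTickets: 128 },
      write: { out: 128, available: 0, totalTickets: 128 },
    },
  },
};

console.log(exhaustedTicketPools(sampleTicketStatus)); // [ 'write' ]
```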
Atlas CLI
atlas metrics processes <host_name>:<port>
You can also add flags such as --period, --granularity, --output, and --type
Atlas also has real-time monitoring charts
You can kill long running operations in this dashboard
atlas processes list
atlas metrics processes atlas-jj12z4-shard-00-00.p31f3ej.mongodb.net:27017 --period P1D --granularity PT1M
P1D is an ISO 8601 duration meaning a period of 1 day
PT1M means 1-minute intervals
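To get a feel for how much data a period and granularity pair implies, the durations can be converted to seconds. A rough sketch that handles only simple day, hour, minute, and second durations; the function names are hypothetical, and Atlas may return a slightly different number of points.

```javascript
// Converts a simple ISO 8601 duration (days, hours, minutes, seconds) to seconds.
function durationToSeconds(iso) {
  const match = /^P(?:(\d+)D)?(?:T(?:(\d+)H)?(?:(\d+)M)?(?:(\d+)S)?)?$/.exec(iso);
  if (!match) throw new RangeError(`Unsupported duration: ${iso}`);
  const [days, hours, minutes, seconds] = match.slice(1).map((part) => Number(part ?? 0));
  return days * 86400 + hours * 3600 + minutes * 60 + seconds;
}

// Rough number of data points for a period sampled at a given granularity.
function expectedDataPoints(period, granularity) {
  return Math.floor(durationToSeconds(period) / durationToSeconds(granularity));
}

console.log(expectedDataPoints("P1D", "PT1M")); // 1440
```

One day at 1-minute intervals is about 1440 points per measurement, which is why trimming the output is often useful.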
Or, to output JSON for a single metric type and keep only the last 10 data points:
atlas metrics processes atlas-jj12z4-shard-00-00.p31f3ej.mongodb.net:27017 --period P1D --granularity PT1M --output json --type CONNECTIONS | jq '.measurements[0].dataPoints |= .[-10:]'
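The jq filter in the command above keeps only the last 10 data points of the first measurement. The same trimming could be sketched in JavaScript as follows, assuming the JSON has a `measurements` array whose entries hold a `dataPoints` array (as the jq path implies). The function name and the sample data are hypothetical.

```javascript
// Mirrors `.measurements[0].dataPoints |= .[-10:]` without mutating the input.
function keepLastDataPoints(metrics, count = 10) {
  const copy = JSON.parse(JSON.stringify(metrics));
  const first = copy.measurements[0];
  first.dataPoints = first.dataPoints.slice(-count);
  return copy;
}

// Hypothetical output with 15 data points, values 0 through 14.
const sampleMetrics = {
  measurements: [
    {
      name: "CONNECTIONS",
      dataPoints: Array.from({ length: 15 }, (_, i) => ({
        timestamp: `2024-01-01T00:${String(i).padStart(2, "0")}:00Z`,
        value: i,
      })),
    },
  ],
};

const trimmed = keepLastDataPoints(sampleMetrics);
console.log(trimmed.measurements[0].dataPoints.length); // 10
console.log(trimmed.measurements[0].dataPoints[0].value); // 5
```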
Configure Alerts
You can configure alert settings at organization and project levels
Requires the Project Owner role
Shared-tier clusters only trigger alerts for:
connections, logical size, opcounters, network
All Atlas projects come with defaults, but they can be edited
Atlas alerts are found under the bell icon in the top right
CLI
Responding to Alerts
Alerts are shown here:
Notifications continue until an alert is acknowledged
Once acknowledged, no further notifications are sent until the acknowledgement period ends, you resolve the condition, or you unacknowledge the alert
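The acknowledgement rules above can be modeled as a small decision function. This is only an illustrative sketch, not the Atlas implementation; the `resolved` and `acknowledgedUntil` field names are hypothetical.

```javascript
// Decides whether a notification would be sent for an alert at time `now`.
function shouldNotify(alert, now) {
  if (alert.resolved) return false; // condition resolved: nothing to notify
  if (alert.acknowledgedUntil !== null && now < alert.acknowledgedUntil) {
    return false; // still inside the acknowledgement period
  }
  return true; // open and unacknowledged, or the acknowledgement expired
}

// Hypothetical alert acknowledged until noon.
const ackedAlert = { resolved: false, acknowledgedUntil: new Date("2024-01-01T12:00:00Z") };

console.log(shouldNotify(ackedAlert, new Date("2024-01-01T11:00:00Z"))); // false
console.log(shouldNotify(ackedAlert, new Date("2024-01-01T13:00:00Z"))); // true
```

Unacknowledging an alert corresponds to setting `acknowledgedUntil` back to `null` in this model.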
CLI
Integrations
Good for hybrid situations or for when you are migrating on-prem to cloud
Examples:
Prometheus, PagerDuty, Datadog, Sumo Logic, Splunk, custom webhooks, etc.
Prometheus and Datadog integrations are only available on M10+ clusters
In the database dashboard, click the ellipsis (...) and go to Integrations
Typically you fill in your credentials here for your connection
Self-Managed Monitoring
Cloud Manager or a hybrid solution listed above can be used
Note: Prometheus, for example, cannot directly collect metrics from an on-prem deployment, but there are open-source connectors such as Percona
The account collecting data needs the clusterMonitor role
Command Line Metrics
serverStatus
diagnostic command that returns a document showing current instance state
This command is used by monitoring platforms to collect valuable metrics
Ex: db.runCommand({ serverStatus: 1 })
Helper command: db.serverStatus()
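A monitoring script typically pulls only a few fields out of the large serverStatus document. A minimal sketch, assuming the `connections`, `mem`, `network`, and `uptime` sections of the result; the function name and the sample values are hypothetical.

```javascript
// Picks a handful of commonly watched fields from a serverStatus-style document.
function summarizeServerStatus(status) {
  return {
    uptimeSeconds: status.uptime,
    currentConnections: status.connections.current,
    residentMemoryMiB: status.mem.resident,
    bytesIn: status.network.bytesIn,
    bytesOut: status.network.bytesOut,
    numRequests: status.network.numRequests,
  };
}

// Hypothetical sample; in mongosh the input would come from db.serverStatus().
const sampleServerStatus = {
  uptime: 86400,
  connections: { current: 42, available: 1458 },
  mem: { resident: 2048, virtual: 4096 },
  network: { bytesIn: 1000000, bytesOut: 5000000, numRequests: 25000 },
};

console.log(summarizeServerStatus(sampleServerStatus));
```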
currentOp
admin command that returns document about active operations
monitoring apps use this command to find slow operations
Ex: db.adminCommand({ currentOp: 1 })
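Finding slow operations amounts to filtering the `inprog` array that currentOp returns. A sketch assuming each entry carries `opid`, `active`, `secs_running`, and `ns`; the function name, threshold, and sample operations are hypothetical.

```javascript
// Returns active operations that have run for at least thresholdSeconds.
function findSlowOperations(currentOpResult, thresholdSeconds) {
  return currentOpResult.inprog
    .filter((op) => op.active && op.secs_running >= thresholdSeconds)
    .map((op) => ({ opid: op.opid, ns: op.ns, secsRunning: op.secs_running }));
}

// Hypothetical currentOp-style result.
const sampleCurrentOp = {
  inprog: [
    { opid: 101, active: true, secs_running: 2, ns: "shop.orders" },
    { opid: 102, active: true, secs_running: 95, ns: "shop.products" },
    { opid: 103, active: false, secs_running: 0, ns: "" },
  ],
  ok: 1,
};

console.log(findSlowOperations(sampleCurrentOp, 30));
// only opid 102 has been running for 30 seconds or more
```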
killOp
Terminates an operation by its opid
Ex: db.adminCommand({ killOp: 1, op: <opId> })
Helper function: db.killOp(<opId>)
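A sketch of building the killOp command document from an operation id. It assumes a numeric opid as reported by a single mongod; other deployments may report ids in a different form. The function name and the sample id are hypothetical.

```javascript
// Builds the command document that would be passed to db.adminCommand().
function buildKillOpCommand(opId) {
  // Assumption: numeric opid; reject anything else to avoid killing the wrong op.
  if (!Number.isInteger(opId)) {
    throw new TypeError("opId must be an integer");
  }
  return { killOp: 1, op: opId };
}

console.log(buildKillOpCommand(102)); // { killOp: 1, op: 102 }
// In mongosh, the result would be sent with db.adminCommand(buildKillOpCommand(102)).
```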