DF300

Expression Stages

$project

Good for creating brand new fields or renaming fields
Only sends new fields you specify from one stage to the next

$set

$set is very similar to project, but the major difference is that $set will forward ALL of the fields of the document as well as the ones you $set
From a performance perspective, the instructor said that $project is not necessarily faster than $set even though you are likely returning less fields in $project

$group

Roughly equivalent to GROUP BY in SQL
Groups documents by a "group key"
Output is one document per group
Additional fields can contain results of accumulator expressions

Common $group accumulators
- $addToSet, $avg, $first, $last, $max, $min, $mergeObjects, $push, $stdDevSamp, $sum
- Also.. $bucket, $bucketAuto, $facet
  - $bucket is good to segregate data
    ex. Group data by month
- $sortByCount
  - Really common for DBAs and developers
  - Returns a frequency count of a field across all of your documents
- $unwind
  - Opposite of $group, it will unwind an array into multiple documents with the same _id
    Because this is a stage during the pipeline in memory, this does not cause any issues

$lookup

Like a left outer join, but much more closely aligns to a nested select in terms of performance
- MongoDB is NOT a relational database!
Good for analytic purposes mostly
- Their is one edge case that is acceptable in OLTP applications..
  - ex. When one collection is very static and another is extremely active (updates every few milliseconds)
No referential integrity is enforced with this operation
If their is more than one match on the right side of the join, it will put the results into an array
For each document in the left side of the join, you will have to iteratively scan the right side collection
- You MUST index the foreign field (right side)
- It is also recommended to index the local field (left side)
If their is no match, their is nothing to attach the join to and thus that field will not be attached
Be careful of cross joining NULL values as NULL is considered a value in Mongo

Output Stages

$out

Write results to a new collection

$merge

Update an existing collection
- Add new documents or edit existing
Updates existing fields or appends new ones

$set

Add extra fields without $projecting all of them

$replaceRoot

Create a whole new shape of top-level document
Very similar to $merge, but $replaceRoot will just fully replace the document
_id remains the same

$sample

Choose a random set of docs from the input

Database Internal Stats Stages

Requires elevated permissions to access diagnostic information

$collStats
- Describe collection statistics
$currentOp
- List ongoing database operations
$indexStats
- Are people actually using indexes since boot
$listSessions
- Show ho is connected and what are they doing
$planCacheStats
- Show what query shapes are cached and the query plans

Aggregation Optimization

Common mistakes:

Often developers will think that $project reduces the fields outputted into the next stage, but Mongo will actually use the entire document throughout the pipeline
- Generally speaking, $project should be towards the end
  - This doesn't actually have any performance impact, but it makes it clear that their is no advantage to putting it at the beginning
Stages can be streaming or blocking
- A blocking stage will block subsequent streaming stages

Aggregation and Memory Usage

Documents inside the pipeline can be up to 64mb in size
Final stage must be 16mb or less as they need to be BSON
- This can be broken during the pipeline/computation phase, but at the end of the process you have to be under 16mb
Blocking stages can use up to 100mb of heap RAM
- If you need more, you can write to disk
  - $sort, $group, $bucket, and $bucketAuto can use disk with allowDiskUse:true
    WARNING: THIS WILL INCREASE IOPs FOR THE ENTIRE DATBASE AND IT SHOULD BE AVOIDED AT ALL COSTS IN PRODUCTION
    You can see this in db.currentOp and in the explain plan look for that specific type of operation

Schema Validation

Moderate validation allows documents out of validation to stay out of validation
Strict validation does not retroactively reapply to everything on disk
The rules are applied on new updates

Change Streams

Reads from the op log
Listen for writes to a collection, database, or instance
- Filter notifications we want to receive
- On notification, receive the delta or whole document (before or after change)
- Can stop listening and resume where we left off
Drivers handle change stream differently
- Some block until an event with or without a timeout
  - For C#, the default is technically blocking, but if you have an async methodology you can change that
  - Atlas triggers uses this same technology
- Some allow to ask if there are any changes or not by setting timeout zero
- Provide a function to be called if there are changes

Sessions

Support retryable writes and causal consistency
- Diff than eventual consistency.. in mongo things have to occur in the correct order
Logical server sessions that can be synchronized between clients
- Sessions are kept track of on the database side
Allows them to have the same causal view of the world
- Allows read your own writes
  - findOneAndUpdate()
- One sessions knows another's writes have completed
- Allows retryable writes - ability to safely auto-retry a failover
- Underpin transaction handling
Ultimately sessions revolve around a defined timestamp in the internal operation log

Retryable Read and Writes

Retryable writes automatically retry a write in the event of HA failover
- Default for 4.2+ drivers and Atlas connection strings
- Operations can be made idempotent but a simple $inc:{a:1} isn't
- Retryable writes, with implicit sessions are able to ensure things happen once
- Waits until the cluster is available then repeats the single write
- Checks that the newly elected Primary has not already seen the operation
Retryable reads redo a read sent to a server if that if that server fails before responding
- Does not apply to getmore operations
If a secondary goes down and comes back up, it will automatically recover itself

Multi-Document Transactions

Most locking occurs at the document level
Enable full ACID properties in MonogDB
- Snapshot level isolation
- Require at least a replica set
- Work across sharded clusters from MongoDB 4.2 (distributed data)
Be aware transactions should not be the default approach
- Can introduce contention
- Might be used for <1% of update operations or less
Correct schema design often provides a simpler, more efficient solution
The majority of MongoDB developers do not use Transactions
Deadlocks do occur in Mongo
- The second transaction is always the one that is deadlocked
Writes block writes, but not other reads

Bulk Write Operations

Group together multiple write operations in a single network call
- insertMany is a bulk operation
- updateMany is a SINGLE operation
- bulkWrite API sends many changes in a single command
Avoids latency cost of network round trips
Can be ordered or unordered:
- Unordered is faster on a sharded cluster as parallel
- Ordered must stop on error
Can be used inside a transaction

Views

Views are virtual collections
- Same concept as an RDBMS
- Strictly read-only
- Defined by an aggregation pipeline
- Hard to index
  - You can not index the view itself
  - Views are not materialized
- Allow us to encapsulate aggregation pipelines on a server
- Have own security and can be used to redact fields
- To materialize a view, you basically just need to create a new collection and use MERGE to maintain it
  - Of course.. you are adding more data to disk

BSON Data Types

Using the correct data type is important
- Avoid data loss/rounding, e.g. double versus decimal
- Enable range queries
- Minimize data sizes
Default number in mongosh is int32 or double (if initialized with a decimal point)
- ObjectId() is a compact data type - never store it as a string
- Always store Dates as dates, not strings
- Binary() is a useful data type
- Booleans should not be strings
- Special types: MinKey, MaxKey, Timestamp, Regex

NULL Handling

Collation and Sort Ordering

All text in MongoDB is unicode stored as UTF8
- 1-9, A-Z, a-z, etc etc
Default sort order in MongoDB is the unicode code point order
The collation changes the sort order to be correct for a language
- Can define a collation for a collection or an index, or a query
- Can say whether to have case sensitivity in sort and query
- Can say whether to match diacritics or not in sort/query
MongoDB also explicitly defines how different data types order:
- NULL < Numbers < Strings < Objects
  - NULLs come first ^
If indexing text fields like a name, you need to make sure your queries match your index collations!

Sorting by Objects

Objects can be compared to each other
When objects are compared, the field order matters
- Some programming languages don't preserve the field order by default
- Objects are compared field by field
- If field names differ, then the order is by first changed field name
- If values differ, then the order is by value
  - {a : 1} < { a:1, b:2} < {b:1, a:1}
When ordering...
- 1) Fields first (not values)
- 2) If fields match, THEN you look at values
The order of the fields in each document MATTERS!

Sorting by Arrays

When querying an array, the semantics refers to the whole array or any member
So {$gt: 5} is true where any array member is greater than 5
When sorting, the highest or the lowest value in the array is used as the sort value
This also applies to strings

Ex.

Type Bracketing

MongoDB auto converts numeric data types when comparing
- Searching for 5 will find Integer, Long, double, or decimal versions
- It will NOT find "5" as a string
- Indexes are created and stored in a special format to facilitate this - not BSON
- Also works for range queries and sorting
Type bracketing prevents comparison of different type brackets in find *and $match)
- ALLOWS comparison in aggregation and $expr

Locking

Longer Locking Semantics

Database locks are very short lived
Your application may need to reserve a document whilst manually editing it
One thing you could do is update a field and set a "status" field to something like "being_edited"
- The problem with this is that other apps might not respect this behavior as it is not enforcable
- Another issue is that an app could abandon the document during a crash
One way to improve the above process is to add owners and timestamps
- Be careful of a race condition where new are prioritized over abandoned
- If using this without a specific _id, what function do you need and why?

Ex.

Developer Best Practices

Idempotent Operations

For scale, operations need to be independent of each other
An operation changing many things should not be in a transaction
Use roll-forward, not roll-back
These write operation are idempotent:
- Inserts with a supplied _id (2nd attempt fails)
- Deletes cannot delete a thing twice
- Updates is using explicitly set values
Set {paid:false} in 100,000 documents with updateMany() if it fails part way through, we can retry the whole operation
- For efficiency, we might only set it where it's not set already
non-updating updates do NOT lock the document

Object Document Mappers (ODM)

PreviousDF200 NextDF400

Last updated 1 year ago