DF300
Last updated
Last updated
Good for creating brand new fields or renaming fields
Only sends new fields you specify from one stage to the next
$set is very similar to project, but the major difference is that $set will forward ALL of the fields of the document as well as the ones you $set
From a performance perspective, the instructor said that $project is not necessarily faster than $set even though you are likely returning less fields in $project
Roughly equivalent to GROUP BY in SQL
Groups documents by a "group key"
Output is one document per group
Additional fields can contain results of accumulator expressions
Common $group accumulators
$addToSet, $avg, $first, $last, $max, $min, $mergeObjects, $push, $stdDevSamp, $sum
Also.. $bucket, $bucketAuto, $facet
$bucket is good to segregate data
ex. Group data by month
$sortByCount
Really common for DBAs and developers
Returns a frequency count of a field across all of your documents
$unwind
Opposite of $group, it will unwind an array into multiple documents with the same _id
Because this is a stage during the pipeline in memory, this does not cause any issues
Like a left outer join, but much more closely aligns to a nested select in terms of performance
MongoDB is NOT a relational database!
Good for analytic purposes mostly
Their is one edge case that is acceptable in OLTP applications..
ex. When one collection is very static and another is extremely active (updates every few milliseconds)
No referential integrity is enforced with this operation
If their is more than one match on the right side of the join, it will put the results into an array
For each document in the left side of the join, you will have to iteratively scan the right side collection
You MUST index the foreign field (right side)
It is also recommended to index the local field (left side)
If their is no match, their is nothing to attach the join to and thus that field will not be attached
Be careful of cross joining NULL values as NULL is considered a value in Mongo
Write results to a new collection
Update an existing collection
Add new documents or edit existing
Updates existing fields or appends new ones
Add extra fields without $projecting all of them
Create a whole new shape of top-level document
Very similar to $merge, but $replaceRoot will just fully replace the document
_id remains the same
Choose a random set of docs from the input
Requires elevated permissions to access diagnostic information
$collStats
Describe collection statistics
$currentOp
List ongoing database operations
$indexStats
Are people actually using indexes since boot
$listSessions
Show ho is connected and what are they doing
$planCacheStats
Show what query shapes are cached and the query plans
Common mistakes:
Often developers will think that $project reduces the fields outputted into the next stage, but Mongo will actually use the entire document throughout the pipeline
Generally speaking, $project should be towards the end
This doesn't actually have any performance impact, but it makes it clear that their is no advantage to putting it at the beginning
Stages can be streaming or blocking
A blocking stage will block subsequent streaming stages
Documents inside the pipeline can be up to 64mb in size
Final stage must be 16mb or less as they need to be BSON
This can be broken during the pipeline/computation phase, but at the end of the process you have to be under 16mb
Blocking stages can use up to 100mb of heap RAM
If you need more, you can write to disk
$sort, $group, $bucket, and $bucketAuto can use disk with allowDiskUse:true
WARNING: THIS WILL INCREASE IOPs FOR THE ENTIRE DATBASE AND IT SHOULD BE AVOIDED AT ALL COSTS IN PRODUCTION
You can see this in db.currentOp and in the explain plan look for that specific type of operation
Moderate validation allows documents out of validation to stay out of validation
Strict validation does not retroactively reapply to everything on disk
The rules are applied on new updates
Reads from the op log
Listen for writes to a collection, database, or instance
Filter notifications we want to receive
On notification, receive the delta or whole document (before or after change)
Can stop listening and resume where we left off
Drivers handle change stream differently
Some block until an event with or without a timeout
For C#, the default is technically blocking, but if you have an async methodology you can change that
Atlas triggers uses this same technology
Some allow to ask if there are any changes or not by setting timeout zero
Provide a function to be called if there are changes
Support retryable writes and causal consistency
Diff than eventual consistency.. in mongo things have to occur in the correct order
Logical server sessions that can be synchronized between clients
Sessions are kept track of on the database side
Allows them to have the same causal view of the world
Allows read your own writes
findOneAndUpdate()
One sessions knows another's writes have completed
Allows retryable writes - ability to safely auto-retry a failover
Underpin transaction handling
Ultimately sessions revolve around a defined timestamp in the internal operation log
Retryable writes automatically retry a write in the event of HA failover
Default for 4.2+ drivers and Atlas connection strings
Operations can be made idempotent but a simple $inc:{a:1} isn't
Retryable writes, with implicit sessions are able to ensure things happen once
Waits until the cluster is available then repeats the single write
Checks that the newly elected Primary has not already seen the operation
Retryable reads redo a read sent to a server if that if that server fails before responding
Does not apply to getmore operations
If a secondary goes down and comes back up, it will automatically recover itself
Most locking occurs at the document level
Enable full ACID properties in MonogDB
Snapshot level isolation
Require at least a replica set
Work across sharded clusters from MongoDB 4.2 (distributed data)
Be aware transactions should not be the default approach
Can introduce contention
Might be used for <1% of update operations or less
Correct schema design often provides a simpler, more efficient solution
The majority of MongoDB developers do not use Transactions
Deadlocks do occur in Mongo
The second transaction is always the one that is deadlocked
Writes block writes, but not other reads
Group together multiple write operations in a single network call
insertMany is a bulk operation
updateMany is a SINGLE operation
bulkWrite API sends many changes in a single command
Avoids latency cost of network round trips
Can be ordered or unordered:
Unordered is faster on a sharded cluster as parallel
Ordered must stop on error
Can be used inside a transaction
Views are virtual collections
Same concept as an RDBMS
Strictly read-only
Defined by an aggregation pipeline
Hard to index
You can not index the view itself
Views are not materialized
Allow us to encapsulate aggregation pipelines on a server
Have own security and can be used to redact fields
To materialize a view, you basically just need to create a new collection and use MERGE to maintain it
Of course.. you are adding more data to disk
Using the correct data type is important
Avoid data loss/rounding, e.g. double versus decimal
Enable range queries
Minimize data sizes
Default number in mongosh is int32 or double (if initialized with a decimal point)
ObjectId() is a compact data type - never store it as a string
Always store Dates as dates, not strings
Binary() is a useful data type
Booleans should not be strings
Special types: MinKey, MaxKey, Timestamp, Regex
All text in MongoDB is unicode stored as UTF8
1-9, A-Z, a-z, etc etc
Default sort order in MongoDB is the unicode code point order
The collation changes the sort order to be correct for a language
Can define a collation for a collection or an index, or a query
Can say whether to have case sensitivity in sort and query
Can say whether to match diacritics or not in sort/query
MongoDB also explicitly defines how different data types order:
NULL < Numbers < Strings < Objects
NULLs come first ^
If indexing text fields like a name, you need to make sure your queries match your index collations!
Objects can be compared to each other
When objects are compared, the field order matters
Some programming languages don't preserve the field order by default
Objects are compared field by field
If field names differ, then the order is by first changed field name
If values differ, then the order is by value
{a : 1} < { a:1, b:2} < {b:1, a:1}
When ordering...
1) Fields first (not values)
2) If fields match, THEN you look at values
The order of the fields in each document MATTERS!
When querying an array, the semantics refers to the whole array or any member
So {$gt: 5} is true where any array member is greater than 5
When sorting, the highest or the lowest value in the array is used as the sort value
This also applies to strings
Ex.
MongoDB auto converts numeric data types when comparing
Searching for 5 will find Integer, Long, double, or decimal versions
It will NOT find "5" as a string
Indexes are created and stored in a special format to facilitate this - not BSON
Also works for range queries and sorting
Type bracketing prevents comparison of different type brackets in find *and $match)
ALLOWS comparison in aggregation and $expr
Database locks are very short lived
Your application may need to reserve a document whilst manually editing it
One thing you could do is update a field and set a "status" field to something like "being_edited"
The problem with this is that other apps might not respect this behavior as it is not enforcable
Another issue is that an app could abandon the document during a crash
One way to improve the above process is to add owners and timestamps
Be careful of a race condition where new are prioritized over abandoned
If using this without a specific _id, what function do you need and why?
Ex.
For scale, operations need to be independent of each other
An operation changing many things should not be in a transaction
Use roll-forward, not roll-back
These write operation are idempotent:
Inserts with a supplied _id (2nd attempt fails)
Deletes cannot delete a thing twice
Updates is using explicitly set values
Set {paid:false} in 100,000 documents with updateMany() if it fails part way through, we can retry the whole operation
For efficiency, we might only set it where it's not set already
non-updating updates do NOT lock the document