DF100

Storage

MongoDB stores data as BSON:

  • In the above example, you can enforce uniqueness of the "Dental" value for the "type" key by creating a unique index

  • Additionally, you can put validation on the data itself, but there may be downsides to that

Terminology

  • A namespace is nothing more than the database name + collection name (database.collection); it helps you distinguish between two similarly named collections

Benefits of MongoDB

  • Agility

    • The documents within a given collection are not required to have the same schema

      • ex. Data types do not have to match across documents

    • MongoDB's take is that the application should be doing the validation, not the database (although it is possible to do so on the DB side)

  • Usability

    • Option to use a wide array of tools and languages to query MongoDB

  • Utility

    • Complex indexed queries, smart edits, aggregation frameworks

  • Availability and scalability

    • HA via replica sets, multiple copies of data, different hosts/locations, continuous replication

    • Scale via sharding

      • Partition data over multiple replica sets

      • Provides unlimited hardware scaling

      • Sharding can be complex so it is important to consider this as your data grows

        • Do not plan to add sharding on day 1, but around 2 TB is when you want to start having that conversation

    • Data compression

  • Enterprise tooling

    • Atlas, Ops Manager, Cloud Manager, Kubernetes (K8s), Terraform, etc.

    • Ops Manager - very useful once you start spinning up multiple replica sets and your environment gets more complicated. It is very difficult to do a point-in-time restore (PITR) correctly on a sharded database without this tool

When should MongoDB be used?

  • High speed access to complex objects

    • atomic partial updates

    • fast retrieval

    • secondary indexes

    • aggregation capabilities

  • When you want to store larger data structures together

    • Large arrays

      • Exception: keep arrays under 200 elements for performance

    • text fields

    • binary data

  • Rapid development

  • When you need to store structures of varying shapes

  • Large data volumes

  • Distributed data

Things to be aware of

  • Easy to get things wrong and performance can suffer

  • DBAs need to be trained and certified

    • Devs perform traditional DBA tasks, but DBAs have very important tasks as well

CRUD Operations

  • If multiple documents satisfy the query of a single-document command, MongoDB returns the first match it finds (typically the first one on disk), if any

    • If the query uses an index, the first match in index order is returned instead

    • The same document may not always be returned unless you specify additional criteria

db.customers.insertOne({
    _id: "[email protected]",
    name: "Bob Smith",
    spend: 0,
    orders: [],
    lastpurchase: null
})
  • In the above command, MongoDB will create the customers collection if it does not already exist

  • Every document must have an ID field; one will be generated for you if you do not specify it

    • It is ALWAYS called _id

    • This ID must be unique

    • You cannot change the ID of a document for any reason

    • MongoDB can and will generate a unique value for you

  • If you query a field repeatedly and you know it is going to be unique, it makes sense to use that field as the ID

    • For example, an email address, although if the value changes you must delete the document and insert it again

  • If you let MongoDB generate the ID, it will be unique across the entire database, but it only MUST be unique within the collection

    • i.e. you could easily have the same ID across multiple collections if you set it to something like "[email protected]"

let friends = [ 
    {_id: "joe" }, 
    {_id: "bob" }, 
    {_id: "joe" }, 
    {_id: "jen" } 
]

db.collection1.insertMany(friends)
  • In the above example, the first Joe and Bob are inserted into the database

    • The second Joe (the duplicate record) and Jen are NOT inserted

      • This is the default insertMany behavior: inserts are ordered, so processing stops at the first error

        • Atomicity is at the document level

        • You can do multi-document transactions, but there are tradeoffs!

        • It is also possible to modify the default behavior

          • db.collection2.insertMany(friends, {ordered: false} )

            • In the above snippet, 3 records would be inserted, which includes "Jen"

  • Tip: mongo is indexed at 0
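
The ordered vs unordered behavior above can be modelled in a few lines of plain JavaScript. This is only a sketch of the semantics, not how MongoDB implements it; insertManyModel and its return shape are hypothetical names made up for illustration.

```javascript
// Plain-JavaScript model of insertMany duplicate handling (illustrative only).
// insertManyModel is a hypothetical helper, not a MongoDB API.
function insertManyModel(docs, { ordered = true } = {}) {
  const stored = new Map(); // stands in for the collection, keyed by _id
  const errors = [];        // zero-based positions of rejected documents
  for (let i = 0; i < docs.length; i++) {
    if (stored.has(docs[i]._id)) {
      errors.push(i);
      if (ordered) break;   // ordered: stop at the first failure
      continue;             // unordered: skip the duplicate and keep going
    }
    stored.set(docs[i]._id, docs[i]);
  }
  return { insertedIds: [...stored.keys()], errors };
}

const friends = [{ _id: "joe" }, { _id: "bob" }, { _id: "joe" }, { _id: "jen" }];

const orderedResult = insertManyModel(friends);
const unorderedResult = insertManyModel(friends, { ordered: false });
console.log(orderedResult);   // joe and bob stored, failure at position 2
console.log(unorderedResult); // joe, bob and jen stored, failure at position 2
```

In both runs the rejected document is reported at position 2, which is the zero-based index of the second "joe".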

If you have to Insert 200 documents, is it better to use insertMany or insertOne in a FOR loop?

  • insertMany is better because it needs fewer network round trips

  • You can insert up to 100,000 documents or 48 MB in a single trip

Reading Documents

  • db.customers.findOne({})

    • Returns the first record

  • db.customers.findOne({name: "Andy Smith"})

    • Strings are case-sensitive

  • db.customers.findOne({name: "Andy Smith", spend: 0})

    • The above is an AND operation

  • db.customers.findOne({name: /smi/i, spend: 0})

    • Regular expressions work, but if you run this kind of search frequently you should use a text index (or Atlas Search if you are on Atlas)

  • db.customers.findOne({name: /smi/i, spend: 0.0000})

    • Numbers are compared by value, regardless of precision, so MongoDB still returns customers with a spend of 0

    • In application code, you should use the proper data types as they will be mapped to the corresponding types in MongoDB

Looking for multiple documents?

  • db.customers.find({})

    • Equivalent of SELECT *

  • db.customers.find({},{name: 1, spend: 1})

    • 1 specifies that the field will be returned

    • _id is returned by default

    • conversely, 0 excludes just those fields and returns everything else

    • you can NOT mix and match inclusion/exclusion, with ONE exception

      • You can explicitly exclude the _id field while also including others

  • db.customers.find({lastpurchase: null})

    • Also returns documents where the lastpurchase field does not exist, because a missing field is treated as null

  • db.customers.find({gibberish:null})

    • Returns every single document (assuming no document has a gibberish field)

    • find vs findOne

      • find returns a cursor that delivers results in batches: the first batch holds up to 101 documents by default, and a batch is capped at 16 MB in most drivers (48 MB was quoted for C#)

        • A lot of drivers hide this behavior and iterate through the cursor for you

      • If no documents are found, find still returns a cursor, but it is empty

  • db.customers.find({}).sort({age: -1}).skip(30).limit(10)

    • Commands can be chained

    • -1 is descending

    • the order in which these are chained does not matter: MongoDB always applies sort, then skip, then limit. Write them in that order so readers are not confused

      • You can change this order with an aggregation query, which has not been covered yet

  • db.customers.find({}).sort({age: -1, plate: -1}).skip(30).limit(10)

  • .count() vs .countDocuments()

    • they should almost always return the same number, but there is a VERY extreme edge case that matters for ultra-precise, high-frequency applications

      • .count() with no query cheats and uses the collection metadata stats, so it is faster, but it could be slightly off

      • .countDocuments() actually counts the matching documents, so it is accurate but slower

  • db.people.find({address: {city: "Houston"}})

    • This will work ONLY if the field names, field order, and values match exactly, because MongoDB compares the embedded document as a single blob

    • The reason is that MongoDB allows up to 100 levels of nesting, so it is more performant to just hash the document

    • In the real world, you would typically not use the above syntax, but rather the dot notation shown in the next query

    • VERY IMPORTANT: querying on an embedded document like above rehashes every single document on the fly; the hash is not saved on INSERT

    • Embedded documents include arrays and nested documents

  • db.people.find({"address.city": "Houston"})

    • when referencing a child field (nested document) with dot notation, you MUST put quotes around the key in your find operation

  • db.fun.find({hobbies: "rockets"})

    • MongoDB will walk an array and return a document if rockets exists in the array

  • db.fun.find({hobbies: ["rockets", "cars"]})

    • Since this compares against the whole array as an embedded document, it must be an exact match, including element order

  • db.taxis.find({age: {$gt: 37} } )

    • Operators are always prefixed with a $

  • db.taxis.find({age: {$gt: 37}, plate: {$lt: 20, $lte: 50, $ne: 38, $in: [40,44,45] } } )

    • This operates in AND behavior

    • operators can be on the same field as well

  • db.pets.find({$or: [{species: "cat", color: "black"},{species: "dog", color: "brown"}] })

    • nesting boolean logic is valid

    • ORDER within the query document does not matter unless you are referencing embedded documents

    • ORDER within the $or array does matter in the sense that evaluation stops as soon as the first condition is satisfied

      • In terms of performance, it may be faster to list the condition that matches most often first

  • db.customers.find({lastpurchase: {$exists: false}})

    • Returns any document where lastpurchase does not exist

  • db.customers.find({lastpurchase: {$exists: true, $eq: null}})

    • Returns any document where the lastpurchase field exists and its value is null

  • db.fun.find({hobbies: {$all: ["rockets", "cars"]}})

    • returns any documents where the array contains all of those values, but the array itself could contain more values

    • It is an AND operator

  • db.fun.find({hobbies: {$in: ["rockets", "cars"]}})

    • OR variant

  • db.ages.find({age: {$lt: 39, $gt: 21}})

    • when testing an array, the moment you use an operator document you are testing whether the SET of values in the array meets the criteria

    • age: [40,20,8] would evaluate as true for the above logic because a value less than 39 exists and a value greater than 21 exists

  • db.ages.find({age: {$elemMatch: {$lt: 39, $gt: 21}}})

    • $elemMatch is an array operator that walks the array until it finds a single element that matches ALL of the criteria

    • The caveat is that $elemMatch only runs against arrays and will not return documents where the field is a plain number
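
The difference between a plain operator document and $elemMatch on an array field can be sketched in plain JavaScript. The two helper functions below are hypothetical and only mirror the semantics described in these notes; they are not how the server evaluates queries.

```javascript
// Model of how a range filter treats an array field, with and without $elemMatch.
// Both helpers are hypothetical and exist only for illustration.
const inRange = { gt: 21, lt: 39 };

// {age: {$lt: 39, $gt: 21}}: each condition may be satisfied by a different element.
function matchesOperatorDocument(age, { gt, lt }) {
  const values = Array.isArray(age) ? age : [age];
  return values.some((v) => v > gt) && values.some((v) => v < lt);
}

// {age: {$elemMatch: {$lt: 39, $gt: 21}}}: one element must satisfy every
// condition, and the field has to be an array.
function matchesElemMatch(age, { gt, lt }) {
  return Array.isArray(age) && age.some((v) => v > gt && v < lt);
}

const samples = [[40, 20, 8], [30, 50], 30];
for (const age of samples) {
  console.log(
    JSON.stringify(age),
    matchesOperatorDocument(age, inRange),
    matchesElemMatch(age, inRange)
  );
}
```

With these samples, [40,20,8] matches only the plain operator document, [30,50] matches both, and the scalar 30 matches the plain operator document but not $elemMatch.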

Updating Documents

  • updateOne vs updateMany

    • updateOne(query, change) - changes only the first matching document

    • updateMany(query, change) - changes all matching documents

Operators

  • Ex. {$inc: {score: 50, numGames: 1}, $push: {gameId: 22, winLoss: "win"}}

    • Because all of this is being done in the same "mutation" document, you are guaranteed an atomic operation

  • $set - assign or replace a value on an existing document

    • use dot notation to set a field in an embedded document

      • be extra cautious when setting a NEW value inside an embedded document: unless you use dot notation, you erase the entire existing field and replace it with your new value

      • {$set : { staff: {principal: "jones"} } } vs {$set : { "staff.principal": "jones"} }

  • $unset - remove a field from a document

    • the value you supply can be a "" blank string; it does not matter

    • it does NOT set the value to NULL, it physically removes the field

    • Ex. {$unset: {"Singer": ""}} is the same as {$unset: {"Singer": "myrandomvalue"}}

      • A value is required only so that the update follows the standard field: value syntax used across MongoDB

  • $inc / $mul - self explanatory (increment / multiply)

    • There are no decrement or divide operators

      • Instead, use a negative value with $inc or a fractional value between 0 and 1 with $mul

  • $max / $min - can modify a field depending on its current value

    • you could do a read + update, but there are problems with that

      • problem #1, the value you are trying to update could change between your read and update

      • problem #2, you incur additional load on the database because it is two operations

    • $max / $min is essentially like adding a conditional to prevent an update: $max only writes if the new value is greater than the current one, $min only if it is less

    • $max lets you avoid an index on the field, which increases performance

      • This is more efficient than putting $gt in the query

      • MongoDB finds all documents that satisfy your query and, on each one, checks whether the current value is less than the new value; if so it updates the value, otherwise it does nothing

    • Most common use cases are dates and numbers

      • e.g. use $max on a "date changed" field so that it only ever moves forward
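
A rough plain-JavaScript model of two of the operators above: $set with and without dot notation, and $max. The helpers setField and maxField and the sample school document are hypothetical; the sketch only mirrors the behavior described in these notes.

```javascript
// Model of $set (whole field vs dot notation) and $max on a plain object.
// setField and maxField are hypothetical helpers written for this sketch.
function setField(doc, path, value) {
  const copy = JSON.parse(JSON.stringify(doc)); // deep copy, input stays untouched
  const keys = path.split(".");
  let target = copy;
  for (const key of keys.slice(0, -1)) {
    if (typeof target[key] !== "object" || target[key] === null) target[key] = {};
    target = target[key];
  }
  target[keys[keys.length - 1]] = value;
  return copy;
}

function maxField(doc, field, value) {
  // $max writes only when the field is missing or the new value is greater.
  return doc[field] === undefined || value > doc[field]
    ? { ...doc, [field]: value }
    : doc;
}

const school = { _id: 1, staff: { principal: "smith", nurse: "lee" }, highScore: 80 };

// {$set: {staff: {principal: "jones"}}} replaces the whole embedded document.
const replaced = setField(school, "staff", { principal: "jones" });
// {$set: {"staff.principal": "jones"}} changes one nested field only.
const patched = setField(school, "staff.principal", "jones");

console.log(replaced.staff); // nurse is gone
console.log(patched.staff);  // nurse is kept
console.log(maxField(school, "highScore", 75).highScore); // unchanged
console.log(maxField(school, "highScore", 90).highScore); // raised
```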

Deleting Documents

  • Recommended to find or findOne what you want to delete before deleting to verify the command is valid

  • deleteOne() and deleteMany()

    • $unset of the UPDATE command only removes a field whereas delete removes a document

  • replaceOne() is typically not used

    • It erases everything on the document except the _id field and replaces it with the content you are trying to set

    • Typically you would just use $set

Updating, Locking, and Concurrency

  • If two processes attempt to update the same document at the same time they are serialised

  • The conditions in the query must always match for the update to take place

  • In the example, if the two updates take place in parallel - the result is the same

  • Locks are at the document level

Ex. In below example, transaction B does nothing

  • This is where operations start to back up and you run out of CPU

  • Furthermore, every single time the lead blocker runs its operation, all of the queries in the queue must re-evaluate
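
A minimal sketch of why the second of two serialised updates can end up doing nothing, assuming a hypothetical flight document with one seat left. updateOneModel stands in for an updateOne whose query includes a condition on the field being changed; it is not a MongoDB API.

```javascript
// Model of two serialised conditional updates against the same document.
// The flight document and updateOneModel are hypothetical.
const flight = { _id: "AB123", seats: 1 };

// Stands in for updateOne({_id: "AB123", seats: {$gt: 0}}, {$inc: {seats: -1}}):
// the query condition is checked again at the moment the update actually runs.
function updateOneModel(doc) {
  if (doc.seats > 0) {
    doc.seats -= 1;
    return { matchedCount: 1, modifiedCount: 1 };
  }
  return { matchedCount: 0, modifiedCount: 0 };
}

const resultA = updateOneModel(flight); // runs first and takes the last seat
const resultB = updateOneModel(flight); // runs second, condition no longer matches
console.log(resultA, resultB, flight);
```

Because the condition is part of the query, the second update matches nothing and the seat count never goes below zero.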

Advanced Arrays

$push - append an element to the end of an array

  • Can be used in updateOne and updateMany

  • Fails if the field is not an array

  • Creates an array field if it does not already exist

  • Can be used with multiple modifiers

  • Ex. db.playlists.updateOne({name: "funky"}, {$push: {tracks: {artist: "AC/DC", track: "Thunderstruck"}}})

    • The target field must be an array; pushing to a string field such as name fails

$pop - removes last or first element from an array

  • Can be used in updateOne or updateMany commands

  • Fails if the field is not an array

  • Removing the first element renumbers all array elements

  • Ex. db.playlists.updateOne({name: "Funky"}, {$pop: {tracks: 1}})

    • 1 removes the last element; use -1 to remove the first element in the array

$pull - remove specified elements from an array

  • Elements can be specified by value or condition

  • Will throw an error if not an array

$addToSet - appends an element to an array if it does not already exist

  • Does not affect existing duplicates in the array

  • Elements in the modified array can have any order

  • Fails if the field is not an array

$each - if you use $push to add an array to an existing array, the new array is nested as a single element, so use $each to add multiple values

  • Faster than using a FOR loop in code

  • You can also combine $each and $sort to insert new data in an ordered fashion

    • To be clear, this re-sorts the entire array as the push occurs

    • Note: this only pays off if every application that modifies the document pushes with $sort; the benefit is that readers get the array already ordered and do not have to sort it themselves

$sort and $slice - sort and keep the top (or bottom) N elements

  • This is an example of a design pattern

  • Used for high/low lists - high scores, top 10 temperatures, etc

  • Order of operations matters here (the array is sorted first, then sliced), unlike the chained cursor methods discussed earlier
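
The high-score pattern can be sketched in plain JavaScript. The helper pushEachSortSlice is hypothetical and models an update of the form {$push: {scores: {$each: [...], $sort: -1, $slice: 3}}} on an array of numbers; the scores field name is an assumption.

```javascript
// Model of $push with the $each, $sort and $slice modifiers on a numeric array.
// pushEachSortSlice is a hypothetical helper; it is not a MongoDB API.
function pushEachSortSlice(array, { each, sort, slice }) {
  const combined = [...array, ...each];                   // $each appends every value
  combined.sort((a, b) => (sort === -1 ? b - a : a - b)); // $sort reorders the whole array
  return slice >= 0 ? combined.slice(0, slice) : combined.slice(slice); // $slice keeps first or last N
}

const highScores = [90, 70, 50];

// Without $each, pushing an array nests it as a single element.
const nested = [...highScores, [80, 95]];

// With $each, $sort: -1 and $slice: 3, only the top three scores survive.
const topThree = pushEachSortSlice(highScores, { each: [80, 95], sort: -1, slice: 3 });

console.log(nested);   // four elements, the last one is an array
console.log(topThree); // the three highest scores, descending
```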

Modifying a specific element in an array

  • Use the index of the element (e.g. "hrs.0") to target and modify the value

  • You can also use the "hrs.$" syntax to change the first element that matches the query condition

Modifying all matching elements

  • The query that finds the documents is not used to decide which elements to change

  • A separate arrayFilters option applies the update to the matching array elements

  • this example adds 2 to everything less than 1 hr

    • nohrs is like a variable in the below example

  • You must use arrayFilters when updating multiple matching items within an array
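
A plain-JavaScript sketch of the two approaches, assuming a hypothetical hrs array of numbers. The helpers only model the outcome of "hrs.$" (first match) and "hrs.$[nohrs]" with arrayFilters (all matches); the sample data and function names are not from the course material.

```javascript
// Model of the positional operator ("hrs.$") and arrayFilters ("hrs.$[nohrs]").
// The sample array and helper names are hypothetical.
const hrs = [0, 3, 0.5, 8];
const lessThanOne = (h) => h < 1;

// {$inc: {"hrs.$": 2}} with query {hrs: {$lt: 1}}:
// only the first matching element changes.
function incFirstMatch(values, predicate, amount) {
  const index = values.findIndex(predicate);
  return values.map((v, i) => (i === index ? v + amount : v));
}

// {$inc: {"hrs.$[nohrs]": 2}} with arrayFilters [{nohrs: {$lt: 1}}]:
// every element that passes the filter changes.
function incAllMatches(values, predicate, amount) {
  return values.map((v) => (predicate(v) ? v + amount : v));
}

const firstOnly = incFirstMatch(hrs, lessThanOne, 2);
const allMatches = incAllMatches(hrs, lessThanOne, 2);
console.log(firstOnly);  // only the leading 0 is raised
console.log(allMatches); // both 0 and 0.5 are raised
```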

Expressive Updates

  • Mongo, unlike SQL Server, will actually persist the value of area so that subsequent reads do not have to recalculate the value

  • Note: the persisted area is not recalculated automatically if somebody later modifies the w or h field; the update has to be run again
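
A small model of a computed-and-stored field, assuming a hypothetical shape document with w and h fields. It mirrors an update such as updateMany({}, [{$set: {area: {$multiply: ["$w", "$h"]}}}]); setArea is an illustrative helper, not a MongoDB function.

```javascript
// Model of an update that computes a field from other fields and stores it.
// The shape document and setArea are hypothetical.
const shape = { _id: 1, w: 4, h: 5 };

function setArea(doc) {
  return { ...doc, area: doc.w * doc.h }; // computed once, at update time
}

const stored = setArea(shape);        // area is now persisted on the document
const resized = { ...stored, w: 10 }; // a later {$set: {w: 10}} touches only w
const refreshed = setArea(resized);   // running the update again refreshes area

console.log(stored.area, resized.area, refreshed.area);
```

Reads of stored need no calculation, but resized still carries the old area until the update is run again.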

Upsert

  • Most MongoDB operations that update also accept the flag "upsert: true"

  • Upsert inserts a new document if none is found to update

  • Values in both the query and the update are used to create the new record
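
A rough model of upsert on an in-memory array, assuming a hypothetical page-view counter. upsertInc stands in for updateOne(query, {$inc: ...}, {upsert: true}) with a simple equality query; it is a sketch of the behavior, not the server logic.

```javascript
// Model of updateOne(query, {$inc: ...}, {upsert: true}) on an in-memory collection.
// upsertInc and the sample fields are hypothetical.
function upsertInc(collection, query, inc) {
  let doc = collection.find((d) =>
    Object.entries(query).every(([key, value]) => d[key] === value)
  );
  if (!doc) {
    doc = { ...query }; // equality fields from the query seed the new document
    collection.push(doc); // (MongoDB would also generate an _id)
  }
  for (const [field, amount] of Object.entries(inc)) {
    doc[field] = (doc[field] ?? 0) + amount; // $inc creates the field if missing
  }
  return doc;
}

const pageViews = [];
upsertInc(pageViews, { page: "/home" }, { views: 1 }); // nothing matches: insert
upsertInc(pageViews, { page: "/home" }, { views: 1 }); // matches: update in place
console.log(pageViews);
```

After both calls there is a single document whose page comes from the query and whose views comes from the update.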

findOneAndUpdate()

  • To understand this command, you must first understand updateOne()

    • updateOne() finds and changes a document atomically but does not return the updated document; you would need a findOne() afterwards (two separate operations)

  • Imagine getting the next number from a sequence

  • findOneAndUpdate() updates and returns the document in one atomic step, which prevents a potential race condition
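
The race can be illustrated with a deterministic plain-JavaScript model of two clients, A and B, asking for the next sequence number. The counter document and the three helper functions are hypothetical. The sketch also assumes findOneAndUpdate is asked to return the updated document, since by default it returns the document as it was before the update.

```javascript
// Model of handing out sequence numbers to two clients, A and B.
// The counter document and helper names are hypothetical.
const counter = { _id: "orderNumber", seq: 0 };

// updateOne + findOne: two steps, so another client's update can land in between.
const updateOneModel = () => { counter.seq += 1; };
const findOneModel = () => counter.seq;

updateOneModel(); // A increments (seq is 1)
updateOneModel(); // B increments before A has read (seq is 2)
const separate = [findOneModel(), findOneModel()]; // A and B both read 2

// findOneAndUpdate with {$inc: {seq: 1}}, returning the updated document:
// the increment and the read happen as one atomic step.
const findOneAndUpdateModel = () => {
  counter.seq += 1;
  return counter.seq;
};
const atomic = [findOneAndUpdateModel(), findOneAndUpdateModel()];

console.log(separate); // duplicate numbers handed out
console.log(atomic);   // distinct numbers handed out
```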
