🥒
Dill's Knowledge Base
  • Hello World
  • 💻SQL
    • ❌Error Handling
    • 🧀Parameter Sniffing
      • Indexes
      • Query Hints
      • RECOMPILE
      • Branching
      • Memory Grants
      • Summary
      • Bonus
    • SQL Server Buffer Pool
  • 🖱️MongoDB
    • Instructor Led Training
      • DF100
      • DF200
      • DF300
      • DF400
    • MongoDB DBA University
      • DBA Admin Tools
      • DBA Basics
      • Metrics & Monitoring
  • 💻Web Design
    • Oxygen Tips
    • Bricks Builder
      • Tips
      • Discovery Call
      • Utility vs Custom Classes
      • Math Functions
      • Static vs Relative Units
  • Azure
    • AZ-900
      • Benefit of Cloud Computing
      • CapEx, OpEx and Consumption-based
      • Differences Between Cloud Service Categories
      • Identify The Right Service Type
      • Differences Between Types of Cloud Computing
      • Reliability and Predictability
      • Regions and Region Pairs
      • Availability Zones
      • Resource Groups
      • Subscriptions
      • Management Groups
      • Azure Resource Manager
      • Azure ARC
      • Resources Required for VM
      • Benefits and Usage of Core Compute Resources
      • Benefits and Usage of Core Network Resources
      • Public/Private Endpoints
      • Benefits and Usage of Storage Accounts
      • Benefits and Usage of Database Resources
      • Data Movement and Migration Options
      • Benefits and Usage of IoT Services
      • Benefits and Usage of Big Data and Analytics Services
      • Benefits and Usage of AI Services
      • Benefits and Usage of Serverless Technologies
      • Benefits and Usage of DevOps Technologies
      • Functionality of Azure Management Solutions
      • Functionality and Usage of Azure Advisor
      • Functionality and Usage of ARM Templates
      • Functionality and Usage of Azure Monitor
      • Functionality and Usage of Azure Service Health
      • Functionality of Microsoft Defender for Cloud
      • Functionality and Usage of Key Vault
      • Functionality and Usage of Microsoft Sentinel
      • Azure Dedicated Host
      • Defense in Depth
      • Describe the Concept of Zero Trust
      • Functionality and Usage of NSGs
      • Functionality and Usage of Azure Firewall
      • Functionality and Usage of Azure DDoS Protection
      • Explain Authentication and Authorization
      • Functionality and Usage of Azure AD
      • Microsoft Entra Overview
      • Functionality of Conditional Access, MFA and SSO
      • Functionality and Usage of RBAC
      • Functionality and Usage of Resource Locks
      • Functionality and Usage of Tags
      • Functionality and Usage of Azure Policy
      • Governance Hierarchy Constructs
      • Azure Blueprints
      • Describe Microsoft Privacy Statement, OST and DPA
      • Purpose of Trust Center and Azure Compliance Documentation
      • Purpose of Azure Sovereign Regions
      • Factors That Affect Costs
      • Factors to Reduce Cost
      • Functionality and Usage of Azure Cost Management
      • Purpose of Service Level Agreements
    • DP-900
      • Study Cram
    • DP-300
      • Deploy IaaS Soluton with Azure SQL
  • 📦Kubernetes
    • Udemy: Kubernetes for Beginners
Powered by GitBook
On this page
  • BSON
  • Arrays vs Documents
  • When to Use Arrays
  • Document model vs Relational
  • Histogram Exercise
  • Schema Design in MongoDB
  • Document Design Basics
  • Linking
  • One-to-One Linked
  • One-to-One Embedded
  • Hybrid
  • One-to-Many
  • Many-to-Many
  • Payload vs Process Fields
  • Two Types of Fields
  • Evolutionary Dynamic Schema
  • Data Driven Dynamic Schema
  • Design Patterns
  • Attribute Pattern
  • Bucket Pattern
  • Computed Pattern
  • Subset Pattern
  • Outlier or Overspill Pattern
  • Anti-Patterns
  • Replication
  • Different Access Patterns
  • Replica Set component
  • Drivers and Replica Sets
  • Oplog
  • Oplog Window
  • Initial Sync
  • Majority / Elections
  • Rolling Upgrade for Maintenance
  • Write Concerns
  • Majority Commit Point
  • Read Concerns
  • Read Preferences
  • Best Practice
  • Arbiter
  • When to Shard?
  • Architecture
  1. MongoDB
  2. Instructor Led Training

DF400

BSON

  • Binary Serialized Object Notation

    • Cross language object -> binary -> object

    • Used internally and for network traffic

    • Converted to hash/dict objects in most drivers

    • Basically, a list of named, typed values:

      • Type, FieldName, <length>, Data

      • Type, FieldName, <length>, Data

        • <length> will tell a program how many bytes to skip to get to the next field, which improves performance

Arrays vs Documents

  • The diffrence is in how they map to a programming lanugage and what constraints

    • Can only $push to an array, not to a document

    • Arrays can contain duplicates

    • Arrays cannot be sparse, must store nulls for empty values

  • Large arrays or large flat documents are equally bad

    • Max 200 elements strongly recommended in either

      • Recall previous discussions about 200 element arrays

    • Large or unbounded arrays can have a significant speed impact

  • Arrays are often used where document would be a better choice!

Ex.

  • In the above, we redesigned the schema to be 30% smaller

  • However, the way you query the 2nd result is significantly different

    • Furthermore, indexing the 2nd example is very difficult

      • In the first example, you could index Results.Player

When to Use Arrays

  • Use arrays when:

    • Just need to $push to the end

    • Don't have unique identifiers

    • Having sparse data is not an issue

    • Need to index the values to search them

  • Otherwise, consider objects

    • Projection is faster than $filter

    • Querying a single member is faster

  • One big advantage of arrays over joined tables i not needing an indexed (foreign) key for each element

Document model vs Relational

Histogram Exercise

  • 24 hrs x 60 minutes gets you 1440

  • The question is, do you have a single array with 1440 elements or do a 2 dimensional array:

  • In regards to updating, one of the above will be significantly faster...

    • In the 1440 element array, on average, you will have to scan 720 places before finding the value you need

    • By switching to a 2 dimensional array, that number is 360

Schema Design in MongoDB

3 Considerations:

  • The data the application needs

  • Application's read usage of the data

  • Application's write usage of the data

Document Design Basics

  • To link or embed?

    • Do I want the embedded info a lot of the time?

      • Put it in the same collection!

    • Do I need to search with the embedded info?

    • Does the embedded info change often?

    • Do I need the latest version or the same versions?

    • Is the embedded info shared - really?

Linking

One-to-One Linked

  • It is very important to index the proper fields to make these lookups faster

One-to-One Embedded

  • The problem with embedding this info, is that the author could have changed their name

    • If you embed, it becomes difficult to keep up with updating the data

Hybrid

  • On a webpage, you can render all of the books with first name, last name.. if the user clicks on the auth, you can then use the _id to query the authors collection

  • Yes, you are duplicating data, but that is why you only include very basic information

One-to-Many

  • Scalar in child

Array in Parent

  • Could be vulnerable in a scale up situation

Many-to-Many

  • Arrays in either side..

Payload vs Process Fields

Two Types of Fields

  • Payload - Data we just store and retrieve

  • Processing - metadata we examine inside MongoDB

    • Querying, aggregating, filtering and projecting, using for worklow, using to implement patterns

  • Payload might be predefined, sometimes better to ignore that and add processing fields

  • For example, existing corporate data/object models, might be used only for storage

  • JSON files you want to store is Payload, fields you want to filter by are processing fields

  • Generally speaking, payload fields

Ex.

Evolutionary Dynamic Schema

  • Schema changes as application changes - big benefit top using MongoDB

  • Slow changes to the application, but no big conversion needed

    • Include a schema version number with the data as a field

    • Retain the ability to read previous schemas in application

    • Either convert in the background or on modification

    • Lossely coupled database objects and code objects

Data Driven Dynamic Schema

  • Field names are the data values

    • Uncomfortable new concept for many designers

    • Requires a truly dynamic coding approach

    • Clean and performant for many things

    • Can be used to do some amazing optimization of retrieval using Tries

Design Patterns

Attribute Pattern

  • Used where documents have a number of searchable named attributes

    • Not all values may be present in a given record

    • Any new record might have a new gield name not seen before

  • Ex. ecommerce, CRM, content management

  • The reason you use k/v, you can more easily index those fields for how they will be searched

  • Also, it would cover NEW attributes being added

  • If you are looking for a specific sku, then the attributes is a payload, but otherwise it could be used for processing so keep that in mind

Bucket Pattern

  • Most common design pattern - stringly matches document model

  • Used to store many small, related data items

    • Bank transactions - related by account and date

    • IoT Readings - related by sensor and date

  • Reduces index sizes up to 200x

  • Speeds up retrieval of related data

  • Enables computed pattern

  • MOngoDB 5.0 timeseries collections abstract this concept

Ex.

  • In the above, you fill up the ddocument with 200 readings, you then fail to find a document because in your query you have less than 200 defined, because upsert it set to true, a new document will be created

    • If you then turn around and query on sensor id, you can easily seek to those documents

Computed Pattern

  • In the above, you have a top 20 constantly updated list

  • You are precomputing vaslues to avoid computation during a subsequent reads

Subset Pattern

Outlier or Overspill Pattern

  • Creates a separate collection if too many array items

    • In this pattern, this is not the typical case

    • There is a main record and a flag to say if there is an overspill situation

    • Additional collections only store the extra array items

  • For example, in a movie database:

    • Main cast and crew in the parent document

    • Full cast and crew in a second additional document (can be the same collection)

Anti-Patterns

Replication

  • Multiple copies of each collection

  • On different physical servers

  • As far away from each other as possible

  • Nature is the best example of replication for resilience

Different Access Patterns

  • If 90% of the users look at the latest 1% of data - small enough to remain in cache =fast

  • If the rest (10%) look at all data (analysts)

    • Don't need such a fast response

    • Shouldn't be bad neighbors

    • Protect the cache and resources from these heavy users

Replica Set component

  • 1 primary at most

    • Responsible for accepting all writes

    • If primary goes down, a voting stage occurs

  • Up to 50 secondary members

    • Up to 7 can be voting members

    • Apps can read from a secondary

    • The largest mongo has seen is 9 in a replica set and that was a massive bank

  • 3 members is the most common replica set

    • 5 is for more critical processes

Secondary Server - more info

  • Priority 0 replica set member

  • Hidden replica set member

  • Delayed replica set member

    • Generally speaking, you should completely avoid hidden and delayed replica set members!

Drivers and Replica Sets

  • When coding with MongoDB, Drivers keep track of the topology

    • Drivers know where and how to route requests

    • You don't need to know what is a primary or secondary at any point

    • You can specify where you want a read to go logically

    • When upgrading the server, it is important to check driver compatibility

Oplog

  • each write is a new entry in the oplog:

  • Oplog is always idempotent

Oplog Window

  • Oplogs are capped collections (fixed size in bytes)

  • 990mb to 1 petabyte configurable

    • The bigger you make it, the less space you have for your data

  • Guarantee the preservation of insertion order

  • Support high-throughput operation

  • Once a collection fills its allocated space:

    • Makes room for new documents by deleting the oldest documents in the collection

    • Like circular buffers

  • If a secondary goes down for longer than the Oplog Window, the secondary will be out of sync once it comes back up

    • The DBA will need to manually trigger initial sync

Initial Sync

  • New/replacement replica set members need a full copy of the data

  • As do any that have been down too long where the oplog has rolled over

  • By default, initial sync pulls data from the primary, which is bad!

    • Their is a way to sync from the secondary

  • Can take a long time and be relatively fragile on large systems

  • Recent changes (4.2/4.4):

    • Can auto restart on non-transient error

    • Resumable on transient error

    • Copy Oplog at the same time

    • Overall more resilient

    • Can use a secondary as source

Majority / Elections

  • In distributed systems, the idea of a majority or quorum is important:

    • Group that together have MORE than half of the total members

  • How to choose a primary (leader)

    • Agreeing on a primary is surprisingly hard

    • There are many academi papers on how to do this

    • MongoDB uses the RAFT protocol

    • Only one primary is chosen at a time (the chosen one)

      • Must be able to communicate with a majority of nodes

      • Should be the most up to date with the latest information

      • May prefer some over the others due to geography

Elections Explained

  • A secondary determines:

    • Has not heard from the primary recently - primary might be up, but not getting through

    • Can contact a majority of secondaries, including itself

    • Can vote and is not specifically ineligible

  • Secondary then proposes an election - with itself as Primary

    • States its latest transaction time

    • States the election term - that goes up by one every election

    • Votes for itself by default

  • Any server that can vote, only votes for the candidate in an election if:

    • The candidate is at the same or a higher transaction time

    • Has not already voted in that election term

  • If it's currently the primary, stands down

  • Once a candidate received a majority of votes:

    • Converts itself from a secondary to a primary

    • Goes through an onboarding

    • Starts to accept writes

  • If it keeps tying, you basically just keep electing (flipping quarters) until a primary is chosen

Ex.

  • if a connection blip occurs with 4 nodes, the primary (top left) will turn itself from a primary to a secondary

    • Apps will still read, but fail to write

Ex 2:

  • If the entire data center 1 goes down, you can still read from data center 2, but you can't write

Ex 3:

  • This protects you against hurricane/tornado/etc, but it causes the app to fail to a new data center

  • Depending on where the application sits, reads could actually become faster

Ex 4:

  • Ideal setup assuming cost is not a concern

Stepping Down

Rolling Upgrade for Maintenance

  • Go into a secondary host, upgrade the version, restart the secondary

    • Apps will still write to the primary and the app will be fine

  • You will then go to your other secondary and repeat the process

  • Go to primary and do an rs.stepdown

  • Ops and cloud manager automate this process

    • This is a MUST if you have 3+ replicas with sharding all in the correct sequence

  • Versions are backwards/forwards compatible +- 1 version

    • You can always rolling upgrade up multiple versions, but rolling back down may be more difficult

Write Concerns

  • Devs are responsible for this!

  • On the DB side, you can only set the default

  • In the below example, if their is a failure in-between steps 3 and 4.. you may lose data!

  • When the old primary comes back up, it has writes that don't exist on the new primary

    • It will silently roll back the original committed record

Majority Commit Point

  • Primary knows what timestamp each secondary is asking for, there it knows:

    • They have everything before that durable

    • Up to what timestamp exists on a majority of nodes

    • Up to what point data is 100% safe

  • Until a change is 100% safe, the primary will keep it in memory

  • In a system with automated failover, "Safe" means "if it fails over, this won't be silently lost"

Read Concerns

  • ONLY devs can control read concerns

  • In Read Majority, that point always lags behind "real-time"

    • Read Linearizable will add latency to your query until the Majority points catches up to "real-time" at the point where the query began

Read Preferences

When do you think you would use each of the following?

  • Read from primary only (default)

  • Read from primary unless no primary exists (primary preferred)

  • Read from any secondary only

    • Otherwise read from primary if no secondary exists

  • Read from secondary unless no secondary exists

  • Read from nearest geographically

  • Read from a specific set of servers

    • Data analysts and BI users should be using

Best Practice

  • Generally speaking, always use write concern majority

    • Read preference, use secondary preferred

      • Read concern, it depends

Arbiter

  • Acts as a tie-breaker in elections

  • Stores no data

  • Cannot become a primary

  • Is strongly advised against in production systems

  • A system with arbiters can be highly available OR guarantee durability - but not both

When to Shard?

  • Sharding is a solution to Big Data problems

  • Sharding is needed when resource limitations require horizontal scale

  • How to tell if you need to Shard

    • Resources are maxed out but schema and code are already optimized

    • Need to increase overall throughpout or increase in-memory cache/storage with affordable costs

    • Need to meet a backup restore time (RTO) target and need parallelism to do so

Architecture

PreviousDF300NextMongoDB DBA University

Last updated 1 year ago

🖱️