Benefits and Usage of Big Data and Analytics Services

https://www.youtube.com/watch?v=LSVewE4mKfE&list=PLlVtbbG169nED0_vMEniWBQjSoxTsBYS3&index=24

Back in the day, ETL (extract, transform, load) was the most common approach because disk space was limited, so data was transformed before it was stored

  • In the modern world, we have ADLS Gen2, which sits on top of Blob Storage, so it makes sense to load the raw data into ADLS Gen2 first and transform it later (ELT: extract, load, transform)

    • If you ever want to come back and add more data to the ELT process, you can easily access everything stored in ADLS Gen2 in its raw format

  • In the final transform phase of ELT, you clean and wrangle the data before loading it into the destination where it can be analyzed

    • Azure Data Factory is the orchestrator that facilitates this data movement
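
A minimal sketch of the ELT order described above: land the raw data first, then transform from the raw copy. This is illustration only, not how Azure Data Factory works internally. It assumes a local temporary folder stands in for ADLS Gen2, and the sample data and the names `SOURCE_CSV`, `extract`, `load_raw`, `transform`, and `run_elt` are all hypothetical.

```python
import csv
import io
import json
import tempfile
from pathlib import Path

# Hypothetical source data; in practice this would come from an upstream system.
SOURCE_CSV = "id,city,temp_c\n1, Oslo ,4.5\n2,Lima,\n3, Cairo,31.0\n"


def extract(source: str) -> list[dict]:
    """Extract: read rows from the source exactly as they are."""
    return list(csv.DictReader(io.StringIO(source)))


def load_raw(rows: list[dict], lake_dir: Path) -> Path:
    """Load: land the untouched rows in the 'lake' (a local folder standing in for ADLS Gen2)."""
    raw_path = lake_dir / "raw" / "weather.json"
    raw_path.parent.mkdir(parents=True, exist_ok=True)
    raw_path.write_text(json.dumps(rows))
    return raw_path


def transform(raw_path: Path) -> list[dict]:
    """Transform: clean the raw copy; the raw file itself is left unchanged."""
    cleaned = []
    for row in json.loads(raw_path.read_text()):
        if not row["temp_c"]:  # drop rows with a missing reading
            continue
        cleaned.append({
            "id": int(row["id"]),
            "city": row["city"].strip(),
            "temp_c": float(row["temp_c"]),
        })
    return cleaned


def run_elt(lake_dir: Path) -> list[dict]:
    """Run extract, then load, then transform, in that order."""
    raw_path = load_raw(extract(SOURCE_CSV), lake_dir)
    return transform(raw_path)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        print(run_elt(Path(tmp)))
```

Because the raw file is kept as-is, a later transform can be rerun or changed (for example, to keep the rows with missing readings) without going back to the source system.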

HDInsight

  • HDInsight is Microsoft's managed service for open-source big data frameworks

    • Hadoop divides large jobs into smaller tasks that run in parallel across a cluster

      • Disk based

        • MapReduce breaks data down into key-value pairs that can be shuffled between nodes and then reduced to a result

    • Storm is real-time stream processing (for example, real-time analytics or online machine learning)

    • Spark is mostly batch jobs and data transformation

      • Memory based

    • Kafka is all about big data streaming

    • Hive LLAP provides interactive querying, for example over data in a data lake

    • HBase is NoSQL storage
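
The map, shuffle, and reduce steps mentioned under Hadoop can be sketched as a word count in plain Python. This is a single-process toy and does not use Hadoop's API. A real cluster would run the map and reduce steps on many nodes and spill intermediate data to disk. The function names here are hypothetical.

```python
from collections import defaultdict


def map_phase(line: str) -> list[tuple[str, int]]:
    """Map: emit a (word, 1) key-value pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs: list[tuple[str, int]]) -> dict[str, list[int]]:
    """Shuffle: group values by key so all values for a key end up together."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)


def reduce_phase(grouped: dict[str, list[int]]) -> dict[str, int]:
    """Reduce: collapse each key's list of values into a single count."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(lines: list[str]) -> dict[str, int]:
    """Run map over every line, then shuffle, then reduce."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return reduce_phase(shuffle_phase(pairs))


if __name__ == "__main__":
    print(word_count(["big data big jobs", "data lake"]))
    # prints {'big': 2, 'data': 2, 'jobs': 1, 'lake': 1}
```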

Databricks

  • Built on Apache Spark

  • Azure Databricks is a managed solution built in Azure

  • Has Delta Lake, a storage layer that sits on top of a data lake

Azure Synapse Analytics

  • Brings everything above under a single umbrella/workspace
