Azure Data Fundamentals (DP-900)
Whether you’re just starting your Azure journey or preparing for the DP-900 certification, this guide will provide you with a solid foundation on key data concepts and Azure services.
Understanding Data Formats
Data comes in various formats, and understanding the differences between these formats is essential for the certification. Here’s a breakdown:
-
Structured Data: This type of data fits neatly into tables, making it easy to query and analyze. This is what you find in most relational databases.
- Examples:
- Excel Files: Spreadsheets organize data into a table-like structure, which can be used for reporting and analysis.
- SQL Databases: Relational databases like Azure SQL Database store structured data in tables with rows and columns, making it easy to perform queries using SQL.
- Examples:
-
Semi-Structured Data: While not as rigid as structured data, semi-structured data includes tags or markers to create some organization.
- Examples:
- Avro: A row-based storage format that is compact and efficient, often used in data serialization and supported by Azure Data Lake Storage and Azure HDInsight.
- CSV: Comma-separated values, often used for exporting and importing tabular data.
- JSON: Commonly used in web applications for transferring data (Azure services like Cosmos DB and Azure Blob Storage support JSON).
- ORC: Optimized Row Columnar format, designed for high-performance data processing.
- Parquet: Optimized for storing large datasets, often used in Azure Data Lake Storage and Azure Synapse Analytics for analytics.
- Examples:
-
Unstructured Data: This data lacks any predefined structure, making it flexible but harder to manage.
- Examples:
- Video/Audio Files: Stored in Azure Blob Storage, which scales to accommodate large media files like video and audio.
- Examples:
Azure Services to Know:
- Azure Blob Storage: Ideal for storing unstructured data like text, binary data, images, and videos. It is highly scalable and cost-effective.
- Azure Data Lake Storage: Designed for handling large amounts of structured, semi-structured, and unstructured data, especially useful for big data analytics and compatible with HDFS.
OLAP vs. OLTP: Key Data Processing Models
-
OLAP (Online Analytical Processing): This model is optimized for running complex queries and analyzing large datasets, typically for reporting and decision-making. OLAP systems focus on read-heavy workloads and multidimensional analysis.
- Examples:
- Azure Synapse Analytics is a perfect example of an OLAP solution that integrates data warehousing and big data analytics to perform large-scale queries.
- Examples:
-
OLTP (Online Transaction Processing): OLTP systems are designed for real-time transaction management and are optimized for large numbers of short, write-heavy transactions, such as customer orders.
- Examples:
- Azure SQL Database is an OLTP-optimized service that handles transactional workloads efficiently, such as managing purchase records for an online store.
- Examples:
SQL Basics for Beginners
SQL (Structured Query Language) is the standard language for managing relational databases. Azure provides fully managed relational databases like Azure SQL Database. Here are the basic SQL command categories:
-
DML (Data Manipulation Language): Used to manipulate data within tables.
- Examples:
- SELECT: Retrieves data from one or more tables.
- INSERT: Adds new rows of data.
- UPDATE: Modifies existing records.
- DELETE: Removes records.
- Examples:
-
DDL (Data Definition Language): Used to define or modify the structure of a database.
- Examples:
- CREATE: Creates new database objects like tables.
- ALTER: Modifies the structure of an existing object (e.g., adding columns to a table).
- DROP: Deletes objects from the database.
- Examples:
-
DCL (Data Control Language): Used to control access to the database.
- Examples:
- GRANT: Provides permissions to users.
- REVOKE: Removes granted permissions.
- Examples:
Basic knowledge of SQL is necessary for the certification; readers are encouraged to practice.
Azure Services to Know:
- Azure SQL Database: A fully managed relational database service that supports SQL queries and offers features like high availability and automatic scaling.
- Azure Cosmos DB: A globally distributed NoSQL database that supports multiple data models, such as SQL, MongoDB, and Cassandra.
Introduction to Power BI
Power BI is a business intelligence tool that allows users to connect to various data sources, transform the data, and create reports and dashboards. For the DP-900 exam, you need to understand how Power BI Desktop, Power BI Mobile, and Power BI Service work together:
- Power BI Desktop: The main tool for creating reports and dashboards.
- Power BI Service: A cloud-based service for sharing and collaborating on reports.
- Power BI Mobile: A mobile app to view reports on the go.
Example:
Imagine you have customer data stored in Azure SQL Database. You can use Power BI Desktop to create visualizations that display sales trends, and then share those reports using Power BI Service for your team to collaborate on in real time.
Organizing Data in Azure: Key Concepts
Data Warehouse
- Data Warehouses: These are central data repositories optimized for querying and reporting. In Azure, Azure Synapse Analytics serves as a data warehouse solution that can store large amounts of structured data for analysis.
Data Lakes
- Data Lakes: These are storage systems that can store raw data in its native format (structured, semi-structured, or unstructured). Azure Data Lake Storage is optimized for storing and analyzing big data.
Data Lakehouses
- Data Lakehouses: A hybrid solution that combines the scalability of data lakes with the structure of data warehouses. With Azure Synapse Analytics, you can store both structured and unstructured data and query it with SQL for quick insights.
Database Normalization
Normalization is the process of organizing data in a database to minimize redundancy and ensure data integrity. This process involves dividing large tables into smaller, related tables and defining relationships between them.
Example:
Instead of storing customer data repeatedly in every order, you can normalize your database by separating it into a Customers table and an Orders table, linking them by a CustomerID. This reduces duplication and improves data accuracy.
Benefits of normalization include:
- Reduced Data Redundancy: Ensures that data is stored only once, reducing storage costs.
- Improved Data Integrity: Organizing data logically improves accuracy and consistency.
ETL vs. ELT: Understanding Data Pipelines
-
ETL (Extract, Transform, Load): This process involves extracting data from different sources, transforming it into a useful format, and loading it into a data warehouse for analysis.
- Example: Using Azure Data Factory to extract data from an on-premises SQL Server, clean and transform it, and load it into Azure Synapse Analytics for reporting.
-
ELT (Extract, Load, Transform): In this process, data is loaded into the target system first (such as a data lake) and transformed afterward. ELT is popular in cloud-based systems because the target (like Azure Synapse Analytics) can handle large-scale data transformations.
- Example: Ingesting raw data into Azure Data Lake Storage and then using Azure Synapse to transform it for analysis.
Azure Services to Know:
- Azure Data Factory: A fully managed service for orchestrating ETL/ELT processes in the cloud.
- Azure Synapse Analytics: Combines big data and data warehousing, enabling you to run complex queries on large datasets.
Azure Storage Services to Know
Azure offers a range of storage services to meet different needs:
- Azure Blob Storage: Scalable storage for unstructured data, such as text and binary data.
- Azure Files: Provides fully managed file shares that can be accessed via standard protocols like SMB.
- Azure Queue Storage: A service that stores large numbers of messages for asynchronous processing.
- Azure Table Storage: A NoSQL key-value store for semi-structured data.
Example:
If you are building a cloud app that requires storing large files and exchanging messages between components, Azure Blob Storage is perfect for the files, while Azure Queue Storage handles messaging.
Cosmos DB: Globally Distributed NoSQL
Azure Cosmos DB is a globally distributed, multi-model database service that supports APIs for SQL, MongoDB, Cassandra, and others. It ensures low-latency data access, high availability, and seamless data replication across regions.
Example:
An e-commerce platform could use Cosmos DB to store customer profiles and order histories, ensuring fast access to data for users anywhere in the world.
Batch vs. Streaming Data Processing
In Azure, you can process data in two main ways:
-
Batch Processing: Data is collected and processed at scheduled intervals. Azure Data Factory can be used for orchestrating large-scale data movement and transformation jobs.
- Example: End-of-day financial reports.
-
Streaming Processing: Data is processed in real-time as it arrives. Services like Azure Stream Analytics and Azure Event Hubs allow you to analyze live data streams for instant insights.
- Example: Real-time fraud detection for credit card transactions.
Azure Services to Know:
- Azure Stream Analytics: Processes real-time data streams for immediate analysis.
- Azure Event Hubs: Ingests large streams of data in real time.
- Azure Event Grid: Handles event-driven architectures, making it easier to build apps that react to real-time events.
Conclusion
The DP-900 certification introduces a broad range of data concepts and Azure services, from structured data to real-time streaming.