In this course, you will learn how to translate concepts in Amazon Athena to their analogous concepts in BigQuery and Dataproc. You will learn how the high-level storage and compute architectures of Amazon Athena compare to those of BigQuery and Dataproc, how to configure datasets and tables in BigQuery, and how to map schemas from Amazon Athena to BigQuery and optimize them there. You will also learn how to create ephemeral Dataproc clusters for Spark data processing jobs, along with best practices for managing resources for these jobs.
Course Objectives
- Comparing the architecture and resource provisioning of Amazon Athena with those of BigQuery and Dataproc
- Configuring datasets and tables in BigQuery
- Mapping and optimizing schemas from Amazon Athena to BigQuery
- Configuring ephemeral Dataproc clusters for executing Spark data processing jobs
- Best practices for using Dataproc and Spark with data stored in BigQuery
Audience
Current users of Amazon Athena (Data Engineers, Data Analysts, Data Scientists, Application Developers) migrating to BigQuery.
Prerequisites
Working knowledge of Amazon Athena as a data consumer, plus completion of an introductory BigQuery course (e.g., From Data to Insights with Google Cloud) or equivalent experience using BigQuery.
Course Outline
Module 1: Understanding BigQuery Architecture
- Quick reminder of Amazon Athena compute architecture
- Introduction to BigQuery
- Overview of BigQuery compute architecture
- Separation of compute and storage in BigQuery
- Slots and workload management in BigQuery
Module 2: Creating Datasets and Tables in BigQuery
- Resource hierarchy in Amazon Athena and Amazon S3
- Resource hierarchy in BigQuery
- Creating resources in BigQuery
- Sharing resources in BigQuery
- Data discovery using Data Catalog
- Lab: Provisioning and Managing Resources in BigQuery
Module 3: Schema Mapping and Optimization in BigQuery
- How data types map from Amazon Athena to BigQuery
- Data types unique to BigQuery
- Schema definitions in BigQuery
- Partitioning and clustering in BigQuery
- Quick comparison of Athena SQL and BigQuery SQL
- Lab: Schema Migration from Amazon Athena to BigQuery
Module 4: Introduction to Dataproc for Spark Workloads
- Introduction to Dataproc
- Comparison of EMR and Dataproc for Spark jobs
- Running Spark jobs on Dataproc
- Separation of compute and storage using Dataproc
- Autoscaling clusters in Dataproc
Module 5: Optimizing Spark Workloads on Dataproc
- The Spark BigQuery connector
- Spark SQL vs. BigQuery SQL
- Setting up roles and permissions to connect BigQuery to Spark
- Automation options for running Spark jobs using ephemeral clusters
- Best practices for optimizing Spark workloads using BigQuery
- Lab: Executing a PySpark ETL Pipeline Using Ephemeral Dataproc Clusters