Skip to product information
The Complete Guide to Google BigQuery
The Complete Guide to Google BigQuery
Description
Book Introduction
Everything you need to know about big data, data engineering, and machine learning for large-scale data analysis and processing.

Build a collaborative and agile workspace while processing petabyte-scale datasets.
This book is a definitive guide to Google BigQuery, a query engine that integrates data extracted from across the enterprise and enables interactive data analysis and machine learning on large datasets.
BigQuery enables businesses to efficiently store, query, collect, and learn from data in a single, convenient framework.


In this book, authors Baliappa Lakshmanan and Jordan Tigani present best practices for modern data warehousing based on a serverless architecture that scales automatically on the public cloud.
This guide will serve as a perfect fit for readers who are just starting out with BigQuery and want to get a general overview of its features, as well as for those who want to use BigQuery to solve specific tasks.
  • You can preview some of the book's contents.
    Preview

index
[Chapter 1] Google BigQuery
Data Processing Architecture
__Relational Database Management System
__MapReduce framework
BigQuery: A Serverless, Distributed SQL Engine
Working with BigQuery
__Derive insights from multiple datasets
__ETL, EL, ELT
__Powerful analysis
__Simplicity of management
How BigQuery Was Created
How BigQuery Could Be Implemented
__Computing and storage separation
__Storage and networking infrastructure
__Managed Storage
__Integration with Google Cloud Platform
__Security and Compliance
organize

[Chapter 2] Query Essentials
Simple query
Finding rows with __SELECT
Assigning an alias to a column name with __AS
Filtering with __WHERE
__SELECT *, EXCEPT, REPLACE
Subqueries using __WITH
Sorting with __ORDER BY
Aggregation
Aggregate with __GROUP BY
Counting records with __COUNT
Filter items grouped by __HAVING
Finding unique values ​​with __DISTINCT
Array and Structure Basics
Creating an array with __ARRAY_AGG
Array of __structures
__tuple
__Using arrays
__Unpacking the array
Table join
__How Join Works
__inner join
__Cross Join
__Outer join
Save and Share
__Query logging and caching
__Saved Queries
Comparison of Views and Shared Queries
organize

[Chapter 3] Data Types, Functions, and Operators
Numeric types and functions
__mathematical function
__Standard Specification Floating Point Division
__SAFE function
__comparison
Precise decimal calculations using __NUMERIC
Handling Bools
__Logical operations
__conditional expression
Handling NULL values ​​cleanly with __COALESCE
__Type conversion and type coercion
Using COUNTIF to avoid Boolean conversion
String functions
__Internationalization
__Output and Parsing
__String manipulation functions
__Conversion function
__Regular expression
__Summary of string functions
Handling timestamps
Parsing and formatting timestamp values
__Extract calendar information
__Calculating timestamps
__Date, Time, and DateTime
Using GIS functions
organize

[Chapter 4] Loading Data with BigQuery
The most basic method
__Loading data from local
__Specify schema
__Copy to new table
__Data Management (DDL and DML)
__Loading data efficiently
Integrated Queries and External Data Sources
__Using a unified query
__Use Cases for Unified Queries and External Data Sources
__Interactive exploration and querying of Google Sheets data
__SQL queries on data in Cloud Bigtable
Transmission and Export
__Data Transfer Service
__Export Stackdriver logs
__Reading and writing BigQuery data with Cloud Dataflow
Migration of on-premises data
__How to migrate data
organize

[Chapter 5] Development Using BigQuery
Development using programming methods
__Using REST API
__Google Cloud Client Library
Using BigQuery in Data Science Tools
__Notebook on Google Cloud Platform
__A combination of BigQuery, Pandas, and Jupyter
__Handling BigQuery in R
__Cloud Dataflow
__JDBC/ODBC driver
__Incorporating BigQuery data into Google Slides in G Suite
BigQuery and Bash Scripting
__Creating datasets and tables
__Execution of query
__BigQuery object
organize

[Chapter 6] BigQuery Architecture
Take a look at the architecture
__Lifetime of a query request
__BigQuery Upgrade
Query Engine (Dremel)
__Dremel Architecture
__Execute query
Storage
__Storage Data
__metadata
organize

[Chapter 7] Performance and Cost Optimization
Basic Principles of Performance Optimization
__Key elements of performance
__Controlling costs
Measurement and Troubleshooting
Measuring query speed with __REST API
Measuring query speed with BigQuery Workload Tester
__Troubleshooting Workload Problems with Stackdriver
__Read query execution plan information
__Get query plan information from job details
__Query plan information visualization
Speed ​​up your queries
__I/O minimization
__Cache previous query results
__Join efficiently
__Avoiding excessive work on the worker
__Using approximate aggregate functions
Optimizing data storage and access methods
__Minimizing network overhead
__Choosing an efficient storage format
Partitioning tables to reduce scan size
__Clustering tables based on high-cardinality keys
__Time-independent use cases
organize
__Checklist

[Chapter 8] Advanced Queries
Reusable queries
__Parameterized queries
__SQL user-defined functions
__Reusing parts of a query
Advanced SQL
__Handling arrays
__window function
__Table metadata
__Data Definition Language and Data Manipulation Language
Beyond SQL
__JavaScript user-defined functions
__Scripting
Advanced functions
BigQuery Geographic Information System
__Useful statistical functions
__hash algorithm
organize

[Chapter 9] BigQuery Machine Learning
What is machine learning?
__Formulating the Machine Learning Problem
__Types of machine learning problems
Creating a regression model
__Select a label
__Exploring the dataset to find features
__Create a learning dataset
__Model training and evaluation
Predicting with a __model
__Checking model weights
__More complex regression models
Creating a classification model
__learning
__evaluation
__prediction
__Choose a threshold value
Customizing BigQuery ML
__Controlling data partitioning
__Class Balancing
__Normalization
k-means clustering
__What to cluster
Clustering bike rental stations
__Perform clustering
__Understanding Clusters
__Data-driven decision making
Recommendation system
__MovieLens Dataset
__Matrix decomposition
__Create a recommendation
__Integrating user and movie information
Custom machine learning models on GCP
__Hyperparameter tuning
__AutoML
__TensorFlow support
organize

[Chapter 10] BigQuery Management and Security
Infrastructure Security
Account and Access Management
__account
__role
__Resource
BigQuery Management
__Task Management
__Grant permission to user
__Recover deleted records and tables
__Continuous Integration/Continuous Delivery
__Dashboards, monitoring, and audit logging
Availability, disaster recovery, and encryption
__Zone, Region, and Multi-Region
__BigQuery and Fault Handling
__Durability, Backup, and Disaster Recovery
__Privacy and Encryption
Compliance with regulations
__Data locality
__Restrict access to data services
__Remove all transactions related to an individual
__Data loss prevention
__CMEK
__Data Leakage Protection
organize

[Korean Edition Special Appendix] Building an ELT Pipeline with Cloud Composer and BigQuery
The Big Picture of the ELT Pipeline
What is Cloud Composer?
Creating and Setting Up Cloud Composer
Cloud Composer Web Server UI
Creating a DAG
Creating an ELT Pipeline

Detailed image
Detailed Image 1

Publisher's Review
What this book covers

ㆍ A detailed guide to BigQuery's high-level architecture and internal operations.
ㆍ Description of data types, functions, and operators supported by BigQuery
ㆍ Secrets to improving performance or reducing costs through query and schema optimization
ㆍLearn advanced technologies such as GIS, time travel, DDL/DML, user-defined functions, and scripting within standard SQL.
How to solve various machine learning problems with BigQuery ML
How to protect data, monitor activity, and authenticate users
ㆍUpdates with the latest technologies, including scripting, scheduling, materialized views, column-level security, dynamic SQL, machine learning, table-level access control, and unified queries.

Download sample code
https://github.com/onlybooks/bigquery

Contents and target audience of this book

As businesses become increasingly data-driven, data warehouses, centralized storage for all of a company's data, are emerging as a key part of their data strategy.
Traditionally, data warehouses have been used by data analysts to create analytical reports.
However, it is now widely used to create real-time dashboards, write ad-hoc queries, and provide decision-making guidance through predictive analytics.
The primary reasons many organizations are migrating to cloud-based data warehouses like Google BigQuery today are business demands for advanced analytics, along with cost management, agility, and access to data within their own services.

This book delves into Google Cloud's BigQuery, a serverless, highly scalable, low-cost enterprise data warehouse.
Without the need to manage infrastructure, businesses can focus on analyzing data to uncover meaningful insights using familiar SQL.

Our goal with BigQuery was to build a data platform that offered cutting-edge capabilities, leveraged the best technologies available in the cloud, and supported proven data technologies that can be trusted and used today.
For example, in terms of cutting-edge technology, Google BigQuery is a serverless computing architecture that separates computation and storage.
This allows different layers of the architecture to run and scale independently, giving data scientists greater flexibility in design and deployment.
Although a bit unique, BigQuery also natively supports machine learning and geospatial analysis.
BigQuery also integrates with a variety of third-party tools, including Cloud Pub/Sub, Cloud Dataflow, Cloud Bigtable, and Cloud AI platforms.
This enables interoperability with both legacy and modern systems while meeting a wide range of throughput and latency requirements.
In terms of proven data technologies, BigQuery supports ANSI standard SQL, column-based optimization, and integrated queries, which are key features for ad-hoc data exploration that many users have been requesting.

This book is for data analysts, data engineers, and data scientists who use BigQuery to derive insights from large datasets.


◆ Data Analyst
You can use BigQuery with dashboard tools like Looker, Data Studio, and Tableau, or with SQL.


◆ Data Engineer
You can integrate data pipelines written in Python or Java with BigQuery using frameworks such as Apache Spark and Apache Beam.


Data Scientist
Data scientists can build machine learning models in BigQuery, run TensorFlow models on data stored in BigQuery, and run large-scale distributed jobs in BigQuery using Jupyter Notebooks.
Special Preface to the Korean Edition

Korea has been using BigQuery since its early days.
In 2015, a developer from Lezhin Entertainment in Korea even held a seminar on using BigQuery (https://www.slideshare.net/modestjude/big-query-43974844).
As cloud computing continues to advance, companies across various fields in Korea are increasingly leveraging BigQuery for data warehouses, data analysis, and machine learning.
However, it wasn't easy for Korean developers to get started with BigQuery and find best practices.
You may have had to comb through reference documents and various blog posts to find what you were looking for.
That's why I can't help but congratulate the publication of the Korean version of "Google BigQuery: The Definitive Guide"!

I am truly proud that a Korean edition of a book published by a publisher that has consistently produced high-quality books is now being published.
I heard that Jang Hyun-hee, who has extensive translation experience, and Byun Seong-yoon, who has extensive experience in machine learning and data science, worked together to translate the book into an easy-to-read book that is faithful to the original text.
I don't know Korean, but I asked a trustworthy colleague at Google Korea and he said there's no need to worry about the translation quality.

Like other cloud technologies, BigQuery continues to evolve rapidly.
Just last year, we added scripting, persistent user-defined functions, new machine learning models, and more.
However, BigQuery has been around for over a decade, and its core architecture is very stable, so the recommended practices presented in this book remain valid.
In particular, I heard that this Korean edition also contains the contents of the original book, which I last updated in June 2020.
We will continue to update the original book on our GitHub website (https://github.com/GoogleCloudPlatform/bigquery-oreilly-book).
Welcome to the BigQuery user community!

- November 2020, Bellevue, Washington, USA / Baliappa Lakshmanan

Translator's Note

Data-related industries, such as data science and AI, are growing rapidly.
In particular, the field of machine learning/deep learning has made incredible progress in recent years.
In addition, the field of data processing is also developing rapidly.
Representative examples include Apache Hadoop and Apache Spark, and we are considering ways to quickly process large amounts of data.
Amidst these movements, Google is revolutionizing data processing by launching BigQuery using the Dremel engine in 2011.


BigQuery has many advantages.
Some of the most representative advantages are as follows:

ㆍ Since Google manages the infrastructure, users do not need to manage the infrastructure.
ㆍ Data can be quickly extracted and processed using internal distributed processing.

ㆍYou can use GIS functions for geographic data analysis, BigQuery ML for machine learning, etc.
ㆍ If you use Firebase, you can easily obtain app log data.

This book, "Google BigQuery: The Complete Guide," covers everything about BigQuery, the data warehouse of Google Cloud Platform.
BigQuery is designed to process large amounts of data quickly, and because it allows data extraction using SQL, it is easier to learn than tools like Apache Spark, which use specific programming languages ​​(e.g., Scala, Python, etc.).
Additionally, since there is no need to manage separate data infrastructure, users can focus on extracting data from BigQuery without worrying about infrastructure management.
Although there are already many data processing tools available, BigQuery has recently gained the most attention for these reasons.

Today, even those not necessarily data analysts study SQL in a variety of roles, including planners and marketers. SQL is a language used to extract data, and familiarity with it allows you to quickly and easily extract data stored within your company.
Regardless of your career field, being able to handle SQL can be a huge advantage, so it's a good idea to study it little by little and consistently.


I've also been using BigQuery for over 4 years.
Over the past four years of using BigQuery, we've created a number of resources to help people use it more easily.
I can't forget how I felt when I first encountered the original version of this book, "Google BigQuery: The Definitive Guide," while creating various materials.
This book is the most detailed of all books on BigQuery, and is a must-read for both data analysts and data engineers.
For our domestic readers, we have added some more content to the original text, so you can see the following in the Korean version:


1.
Special appendix to the Korean edition

After translating the entire book, I thought it would be helpful to include information on how to actually build pipelines, so I contributed an article titled "Building ELT Pipelines with Cloud Composer and BigQuery" as a special appendix to the Korean edition.
This appendix guides you through using Cloud Composer, a managed service for Apache Airflow, to create an ELT pipeline using BigQuery.

2.
Source code for this book

After checking the original book's GitHub, I found that all queries for each chapter were saved in a single file.
I thought this part might make it less convenient for those studying the book, so I organized and saved all the queries by example.
The source code for each example in the Korean version can be viewed and downloaded from the publisher's GitHub (https://github.com/onlybooks/bigquery).
If you have any questions or concerns while purchasing and studying the book, please post them directly on the book's official GitHub issue tracker (https://github.com/onlybooks/bigquery/issues) or tag my GitHub account (zzsza) and ask a question. I will respond as quickly as possible.
- Byun Seong-yoon

Google BigQuery is a cloud-based service that excels at storing and analyzing large amounts of data.
Terms like large-scale data and big data have been around for a long time, but implementing the infrastructure and applications to properly store and utilize them has been a challenge.
Processing large amounts of data on your own requires significant expertise, money, experienced developers to properly build it, and the infrastructure to support it. This may have been even more difficult because not all companies could afford these conditions.

However, as cloud technology and services have advanced, technologies that anyone can use easily and quickly at a very low cost without worrying about the underlying infrastructure, scalability, or performance have begun to emerge, and Google BigQuery is one of the services that provides this value.
In particular, it can be said that it is a great advantage that it supports a query language that is compatible with the existing ANSI SQL.
Google BigQuery's advantages over other big data platforms can be summarized in two main points.

1.
Ease of use

Unlike existing big data platforms such as Apache Spark or Hadoop, BigQuery supports a query language compatible with ANSI SQL, making it easy for developers who are new to large-scale data processing to quickly utilize it if they have experience using RDBMS.
Of course, you need to gain experience writing distributed queries to utilize distributed data efficiently, but it's a huge improvement over having to learn a completely unfamiliar language or implement logic.

2.
Cloud Infrastructure

Like other cloud services, BigQuery is cloud-based, so users don't need to manage their own infrastructure.
Therefore, the burden of expensive operating costs is also reduced.
BigQuery's pricing policy is quite efficient, so if users fully understand and pay attention to BigQuery's features, they can achieve the desired results much faster and at a much lower cost.


This book explains the knowledge developers need, from the birth of BigQuery to its architecture and utilization, with fun and practical examples.
The authors' extensive experience and expertise in BigQuery, developed for cloud data analysis and machine learning products at Google, will be of great help to readers of this book.
As this book covers cloud services, where new features are rapidly being added to the service, it was updated just seven months after its publication. Fortunately, with the author's cooperation, we were able to publish it with all of the content reflected.
This translation includes the latest updates and additional appendices written by co-translator Seongyoon Byun, ensuring you'll enjoy learning about the latest version of BigQuery.

We would like to express our gratitude to the authors, Baliappa Lakshmanan and Jordan Tigani, for sending us a wonderful book and a special preface for the Korean edition.
Finally, I encourage all of our readers who, even during difficult times, continue to strive to strengthen their individual capabilities and contribute to the advancement of the Korean software industry.
Thank you.
- Special Preface to the Korean Edition by Jang Hyun-hee

Korea has been using BigQuery since its early days.
In 2015, a developer from Lezhin Entertainment in Korea even held a seminar on using BigQuery (https://www.slideshare.net/modestjude/big-query-43974844).
As cloud computing continues to advance, companies across various fields in Korea are increasingly leveraging BigQuery for data warehouses, data analysis, and machine learning.
However, it wasn't easy for Korean developers to get started with BigQuery and find best practices.
You may have had to comb through reference documents and various blog posts to find what you were looking for.
That's why I can't help but congratulate the publication of the Korean version of "Google BigQuery: The Definitive Guide"!

I am truly proud that a Korean edition of a book published by a publisher that has consistently produced high-quality books is now being published.
I heard that Jang Hyun-hee, who has extensive translation experience, and Byun Seong-yoon, who has extensive experience in machine learning and data science, worked together to translate the book into an easy-to-read book that is faithful to the original text.
I don't know Korean, but I asked a trustworthy colleague at Google Korea and he said there's no need to worry about the translation quality.

Like other cloud technologies, BigQuery continues to evolve rapidly.
Just last year, we added scripting, persistent user-defined functions, new machine learning models, and more.
However, BigQuery has been around for over a decade, and its core architecture is very stable, so the recommended practices presented in this book remain valid.
In particular, I heard that this Korean edition also contains the contents of the original book, which I last updated in June 2020.
We will continue to update the original book on our GitHub website (https://github.com/GoogleCloudPlatform/bigquery-oreilly-book).
Welcome to the BigQuery user community!

- November 2020, Bellevue, Washington, USA / Baliappa Lakshmanan

Translator's Note

Data-related industries, such as data science and AI, are growing rapidly.
In particular, the field of machine learning/deep learning has made incredible progress in recent years.
In addition, the field of data processing is also developing rapidly.
Representative examples include Apache Hadoop and Apache Spark, and we are considering ways to quickly process large amounts of data.
Amidst these movements, Google is revolutionizing data processing by launching BigQuery using the Dremel engine in 2011.


BigQuery has many advantages.
Some of the most representative advantages are as follows:

ㆍ Since Google manages the infrastructure, users do not need to manage the infrastructure.
ㆍ Data can be quickly extracted and processed using internal distributed processing.

ㆍYou can use GIS functions for geographic data analysis, BigQuery ML for machine learning, etc.
ㆍ If you use Firebase, you can easily obtain app log data.

This book, "Google BigQuery: The Complete Guide," covers everything about BigQuery, the data warehouse of Google Cloud Platform.
BigQuery is designed to process large amounts of data quickly, and because it allows data extraction using SQL, it is easier to learn than tools like Apache Spark, which use specific programming languages ​​(e.g., Scala, Python, etc.).
Additionally, since there is no need to manage separate data infrastructure, users can focus on extracting data from BigQuery without worrying about infrastructure management.
Although there are already many data processing tools available, BigQuery has recently gained the most attention for these reasons.

Today, even those not necessarily data analysts study SQL in a variety of roles, including planners and marketers. SQL is a language used to extract data, and familiarity with it allows you to quickly and easily extract data stored within your company.
Regardless of your career field, being able to handle SQL can be a huge advantage, so it's a good idea to study it little by little and consistently.


I've also been using BigQuery for over 4 years.
Over the past four years of using BigQuery, we've created a number of resources to help people use it more easily.
I can't forget how I felt when I first encountered the original version of this book, "Google BigQuery: The Definitive Guide," while creating various materials.
This book is the most detailed of all books on BigQuery, and is a must-read for both data analysts and data engineers.
For our domestic readers, we have added some more content to the original text, so you can see the following in the Korean version:


1.
Special appendix to the Korean edition

After translating the entire book, I thought it would be helpful to include information on how to actually build pipelines, so I contributed an article titled "Building ELT Pipelines with Cloud Composer and BigQuery" as a special appendix to the Korean edition.
This appendix guides you through using Cloud Composer, a managed service for Apache Airflow, to create an ELT pipeline using BigQuery.

2.
Source code for this book

After checking the original book's GitHub, I found that all queries for each chapter were saved in a single file.
I thought this part might make it less convenient for those studying the book, so I organized and saved all the queries by example.
The source code for each example in the Korean version can be viewed and downloaded from the publisher's GitHub (https://github.com/onlybooks/bigquery).
If you have any questions or concerns while purchasing and studying the book, please post them directly on the book's official GitHub issue tracker (https://github.com/onlybooks/bigquery/issues) or tag my GitHub account (zzsza) and ask a question. I will respond as quickly as possible.
- Byun Seong-yoon

Google BigQuery is a cloud-based service that excels at storing and analyzing large amounts of data.
Terms like large-scale data and big data have been around for a long time, but implementing the infrastructure and applications to properly store and utilize them has been a challenge.
Processing large amounts of data on your own requires significant expertise, money, experienced developers to properly build it, and the infrastructure to support it. This may have been even more difficult because not all companies could afford these conditions.

However, as cloud technology and services have advanced, technologies that anyone can use easily and quickly at a very low cost without worrying about the underlying infrastructure, scalability, or performance have begun to emerge, and Google BigQuery is one of the services that provides this value.
In particular, it can be said that it is a great advantage that it supports a query language that is compatible with the existing ANSI SQL.
Google BigQuery's advantages over other big data platforms can be summarized in two main points.

1.
Ease of use

Unlike existing big data platforms such as Apache Spark or Hadoop, BigQuery supports a query language compatible with ANSI SQL, making it easy for developers who are new to large-scale data processing to quickly utilize it if they have experience using RDBMS.
Of course, you need to gain experience writing distributed queries to utilize distributed data efficiently, but it's a huge improvement over having to learn a completely unfamiliar language or implement logic.

2.
Cloud Infrastructure

Like other cloud services, BigQuery is cloud-based, so users don't need to manage their own infrastructure.
Therefore, the burden of expensive operating costs is also reduced.
BigQuery's pricing policy is quite efficient, so if users fully understand and pay attention to BigQuery's features, they can achieve the desired results much faster and at a much lower cost.


This book explains the knowledge developers need, from the birth of BigQuery to its architecture and utilization, with fun and practical examples.
The authors' extensive experience and expertise in BigQuery, developed for cloud data analysis and machine learning products at Google, will be of great help to readers of this book.
As this book covers cloud services, where new features are rapidly being added to the service, it was updated just seven months after its publication. Fortunately, with the author's cooperation, we were able to publish it with all of the content reflected.
This translation includes the latest updates and additional appendices written by co-translator Seongyoon Byun, ensuring you'll enjoy learning about the latest version of BigQuery.

We would like to express our gratitude to the authors, Baliappa Lakshmanan and Jordan Tigani, for sending us a wonderful book and a special preface for the Korean edition.
Finally, I encourage all of our readers who, even during difficult times, continue to strive to strengthen their individual capabilities and contribute to the advancement of the Korean software industry.
Thank you.
- Jang Hyun-hee
GOODS SPECIFICS
- Publication date: November 26, 2020
- Page count, weight, size: 616 pages | 185*240*30mm
- ISBN13: 9791189909239
- ISBN10: 1189909235

You may also like

카테고리