Content
- Understanding Partitioned Tables in BigQuery
- When to Use Partitioning
- Types of Fields for Partitioning BigQuery Tables
- How to Create Partitioned Tables
- Ingesting Data into Partitioned Tables
- Retrieving Data from Partitioned Tables in BigQuery
- Managing Partitioned Tables in BigQuery
- Advanced Techniques to Use When Partitioning Tables in BigQuery
- Converting Date-Sharded Tables to Ingestion-Time Partitioned Tables
- Limitations of BigQuery Partitioned Tables
- Best Practices to Follow When Partitioning Tables
- Explore the Power of Useful BigQuery Functions
- Enhance Your Reporting with the OWOX BI BigQuery Reports Extension
Partitioned Tables in BigQuery: Everything You Need to Know
Anna Panchenko, Senior Digital Analyst @ OWOX
With the amount of corporate data growing every single day - efficiently managing, and querying large datasets can often seem like navigating through uncharted waters. It might look complicated and expensive.
BigQuery, a Google Cloud's data warehouse, provides a great solution with its Partitioned Tables feature.
Partitioned tables in BigQuery allow users to divide a table into segments, each holding a subset of the data based on specific criteria like dates or ranges. This strategic division enables precise and swift queries over massive datasets, turning a potential data deluge into a streamlined flow of insights.
Whether you're looking to optimize your analytics processes, spend less costs, or both, understanding partitioned tables is crucial in utilizing BigQuery as a data storage to its full potential.
Let’s dive deep into the world of BigQuery partitioned tables, exploring how they work, why they are indispensable, and how you can use them to make your data queries both faster and more cost-effective.
Understanding Partitioned Tables in BigQuery
Partitioned tables in Google BigQuery help organize tabular data into segments, called partitions, based on specific column values, for example, years, days or months.
This arrangement enables more efficient data analysis and management, especially for large volumes of data, by allowing you to query smaller, relevant subsets of data, thereby reducing costs and increasing query speed.
When to Use Partitioning
Partitioning is essential when dealing with large or frequently updated datasets. It is useful in scenarios where data needs to be accessed by date or specific identifiers, as it can significantly speed up these accesses.
Moreover, partitioning is a strategic approach to managing historical data over time, allowing for easier data structure management through expiration settings that automatically handle old data.
Types of Fields for Partitioning BigQuery Tables
Choosing the right field to partition depends greatly on the nature of your data and the typical queries your system will handle.
Common fields for partitioning include:
- Date fields: Ideal for logs, transactions, and historical data where access patterns are predominantly time-based.
- Numeric fields: Used for range-based queries where data is often accessed via a numeric identifier, such as user IDs or geographic codes.
Time-unit Column
Partitioning by time units like date or timestamp allows you to break down your data into manageable, sequential chunks, making it easier to perform time-based analyses. This type of partitioning involves organizing tables based on a TIMESTAMP, DATE, or DATETIME column. Using a time-unit column for partitioning allows the system to automatically identify and direct data into the appropriate partition when writing new records.
For instance, if a new record is added, BigQuery determines the relevant partition by examining the date or timestamp provided in the time-unit column. BigQuery also offers various levels of granularity for partitions, adding flexibility to how data is segmented. For partitions based on the DATE column, you can segment the data into daily, monthly, or yearly.
Here's an example of how yearly granularity might work:
Value | Partition (yearly) |
DATE(2024, 12, 25) | 2024 |
DATE(2024, 11, 21) | 2024 |
DATE(2023, 4, 8) | 2023 |
This approach allows for efficient data management and query performance by grouping data into logically defined periods.
Ingestion Time
Ingestion-time partitioning utilizes the exact moment when data is ingested into BigQuery to organize the table into multiple segments. This method shares similarities with time-unit column partitioning but offers additional granularity options such as hourly, daily, monthly, or yearly.
BigQuery facilitates this by creating two pseudo columns, _PARTITIONTIME and _PARTITIONDATE, which store the truncated ingestion time according to the selected granularity.
For instance, using monthly granularity, the ingestion times would be segmented as follows:
Ingestion time | _PARTITIONTIME | Partition (monthly) |
2016-12-25 18:08:04 | 2016-12-25 18:08:04 | 201612 |
2018-11-21 09:24:54 | 2018-11-21 09:24:54 | 201811 |
2018-04-08 23:19:11 | 2018-04-08 23:19:11 | 201804 |
Integer Range
Integer-range partitioning involves dividing a table based on an integer column according to specified ranges.
To establish an integer-range partition, you must provide four key pieces of information:
- The name of the integer-type column to partition by.
- The starting value for the range (inclusive).
- The ending value for the range (exclusive).
- The interval between the starting and ending values defines the range of each partition.
Based on these parameters, BigQuery organizes the data into defined ranges. Any values falling below the starting value or equal to or exceeding the ending value are placed into a special partition called __UNPARTITIONED__.
Additionally, any records where the integer column is NULL are assigned to the __NULL__ partition. This method ensures that data is systematically categorized, enhancing the efficiency of range-specific queries.
How to Create Partitioned Tables
To create a partitioned table, you can specify the partitioning configuration during table creation through SQL or the BigQuery UI. The choice of partitioning method (e.g., by date, by range) will influence how the table is partitioned and how data is distributed across partitions.
Additionally, understanding the implications of each partitioning method is crucial for optimizing performance and cost.
Using BigQuery Console to Create New Partitioned Tables
To create a new partitioned table in BigQuery, the process is an extension of the standard table creation steps, with additional settings for partitioning:
1. Access the BigQuery Console: Navigate to the BigQuery interface through your Google Cloud Console.
2. Choose the Dataset: Select the dataset within which you want to create the new table.
3. Initiate Table Creation: Click on the "Create Table" button to start the setup process.
4. Add Table Details: Enter your table name and define the schema by specifying the columns. For instance, for a "customers" table, you might include:
- customer_id (INTEGER type)
- first_name (STRING type)
- last_name (STRING type)
- date_registered (DATE type)
5. Configure Partitioning Settings: Below the schema definition, you’ll find options for setting up partitioning.
- Partition by Ingestion Time: If selecting this option, specify the granularity of the partitioning (hourly, daily, monthly, or yearly).
- Partition by Integer Range: If using a column like customer_id for partitioning, define the starting value, ending value, and the interval between these values.
- Partition by Date: For a date field such as date_registered, choose the granularity of the partitions (daily, monthly, or yearly).
- Additional Clustering Options: If desired, you can also set up clustering alongside partitioning to further organize data within each partition by one or more secondary columns.
6. Create the Table: After configuring all necessary options, click on “Create table” at the bottom of the dialog to finalize the creation of your partitioned table.
Example from description above
Example:
Suppose you are managing a customer database for a retail company and need to create a new table in BigQuery to store customer registration data efficiently. The table will be frequently queried based on the registration date, so you decide to partition it by this field.
CREATE TABLE myDataset.customers_data
(
customer_id INT64,
first_name STRING,
last_name STRING,
date_registered DATE
)
PARTITION BY date_registered
OPTIONS(
description="A table to store customer registration data partitioned by registration date"
)
Ingesting Data into Partitioned Tables
When loading data into a partitioned table in BigQuery, it's crucial to ensure that each record includes the value for the partitioning column. This value dictates which partition the record will be stored in, optimizing data organization and query performance.
For data loading from external sources such as files, BigQuery provides tools like the bq load command. This command can include specific flags like _time_partitioning_field to direct the data into the correct partitions based on the values in the specified column.
Before diving into the example scenario, we have a few steps. We need to prepare the data for processing first and then use a specific tool in BigQuery known as bq command in-line tool.
Steps to Load Data into the Partitioned Table:
- Prepare the Data File: Ensure that the CSV file, daily_sales_report.csv, includes the sale_date column with dates formatted correctly.
- Use the bq load Command: To load the data into the daily_sales table and ensure it partitions by sale_date, use the following command in your Google Cloud Shell or local terminal with Google Cloud SDK installed.
Steps to use the bq command in-line tool:
- Set Up Your Google Cloud Project
- Access the Google Cloud Console
- On the project selector page, either select an existing project or create a new one. Creating a new project is recommended if you do not plan to retain the resources you will use.
- If you create a new project, you can delete it later to remove all associated resources, ensuring no ongoing charges.
- Enable BigQuery API
- BigQuery service is automatically enabled for new projects. For existing projects, you may need to manually enable it.
- Configure Billing (Optional)
- From the Google Cloud console, activate Cloud Shell by clicking on the Cloud Shell icon in the top right corner. Cloud Shell provides a command-line interface to your cloud resources.
- Once activated, you can start using bq commands directly in the Cloud Shell.
Example:
Imagine you are managing sales data for an online retailer and have a partitioned table in your retail_operations dataset called daily_sales. This table is partitioned by the sale_date column with daily granularity. You receive daily sales reports in CSV format stored in Google Cloud Storage and need to load this data into the appropriate partitions in BigQuery.
bq load --source_format=CSV --time_partitioning_field=sale_date retail_operations.daily_sales gs://your-bucket-name/daily_sales_report.csv
Here:
- --source_format=CSV: Indicates the format of the source data.
- --time_partitioning_field=sale_date: Directs BigQuery to partition the data based on the values in the sale_date column.
- retail_operations.daily_sales: The dataset and table where the data will be loaded.
- gs://your-bucket-name/daily_sales_report.csv: The path to the CSV file in Google Cloud Storage.
Make Your Corporate BigQuery Data Smarter in Sheets
Transform Google Sheets into a dynamic data powerhouse for BigQuery. Visualize your data for wise, efficient, and automated reporting
Retrieving Data from Partitioned Tables in BigQuery
Effective retrieval from partitioned tables means writing queries that specifically target the partitions relevant to the query. For example, if a table is partitioned by date, and you only need data from a particular month, your query should specify this range to avoid scanning unnecessary partitions, thus saving time and reducing cost.
Suppose you oversee the sales_data table within the retail_analysis dataset on BigQuery, partitioned by transaction_date, holding sales transactions over several years. Your task is to retrieve records from September 1, 2023, onward. Additionally, you manage the event_logs table, an ingestion-time partitioned table that records daily user activities. We will be using this scenario for the Query examples in this section.
Querying a Time-unit Partitioned Table
This involves writing SQL queries that target specific partitions based on the time unit, which reduces the amount of data scanned and speeds up the query.
We will retrieve sales records from September 1, 2023, onward to analyze recent sales trends.
SELECT * FROM `owox-analytics.myDataset.daily_sales`
WHERE sales_date >= '2023-09-01';
Here:
- SELECT *: This part of the query selects all columns from the specified table. The asterisk is a wildcard character that represents all columns in the table.
- FROM: Specifies the source of the data for the query.
- owox-analytics.myDataset.daily_sales: Indicates the fully qualified name of the table from which the data is being retrieved. Here, myDataset is the dataset and daily_sales is the table name within that dataset.
- WHERE transaction_date >= '2023-09-01': This condition filters records to include only those where the transaction_date is on or after September 1, 2023.
This query efficiently prunes the partitions by directly targeting data from the specified start date, minimizing the data scanned and speeding up the query process.
Querying an Ingestion-time partitioned table
Leveraging an ingestion-time partitioned table is highly effective when you utilize a pseudocolumn like _PARTITIONDATE or _PARTITIONTIME.
Based on the pre-set scenario, we will analyze user activities on August 16, 2024.
SELECT
_PARTITIONTIME AS pt,
*
FROM
`owox-analytics.myDataset.test_table`
WHERE
_PARTITIONTIME BETWEEN TIMESTAMP("2024-08-01 00:00:00") AND TIMESTAMP("2024-08-16 00:00:00");
Here:
- SELECT _PARTITIONTIME AS pt, *: Retrieves all columns (*) from the specified table. The _PARTITIONTIME column is also selected and aliased as pt for easier reference in the results.
- FROM owox-analytics.myDataset.test_table: Specifies the table test_table from which to retrieve the data, located within the myDataset dataset.
- WHERE _PARTITIONTIME BETWEEN TIMESTAMP("2024-08-01 00:00:00") AND TIMESTAMP("2024-08-16 00:00:00"): Filters the rows to include only those where _PARTITIONTIME falls within the specified date and time range. This is a precise way to query data from Agust 1, 2024, starting at midnight, up to but not including Agust 16, 2024, at midnight.
Querying an Integer-range partitioned table
Like time units, querying integer-range partitions requires SQL queries that specify the integer range, allowing faster and more efficient data retrieval.
For this scenario, we will retrieve all sales records for products with IDs between 100000 and 200000 to analyze sales performance for a specific range of products.
SELECT *
FROM `owox-analytics.myDataset.sales_report_partitioned`
WHERE product_id BETWEEN 100000 AND 200000;
Here:
- SELECT *: This command selects all columns from the specified table, providing a full view of the data records.
- FROM owox-analytics.myDataset.sales_report_partitioned: Indicates that the data is being retrieved from the sales_data table within the retail_analysis dataset.
- WHERE product_id BETWEEN 100000 AND 200000: Limits the rows returned by the query to those where the product_id falls within the range of 100000 to 200000, inclusive of both endpoints. This type of filter is useful for targeting a specific segment of records for detailed analysis or reporting.
Explore BigQuery Data in Google Sheets
Bridge the gap between corporate BigQuery data and business decisions. Simplify reporting in Google Sheets without manual data blending and relying on digital analyst resources availability
Managing Partitioned Tables in BigQuery
Effective management of partitioned tables involves regularly monitoring the performance and storage of each partition. Tools such as BigQuery's INFORMATION_SCHEMA can provide insights into partition usage and help optimize partitioning data. Additionally, setting expiration policies for partitions can automatically manage storage costs by deleting old data that is no longer needed.
Retrieving Partition Metadata with INFORMATION_SCHEMA
When managing and analyzing partitioned tables in BigQuery, querying the INFORMATION_SCHEMA.PARTITIONS view is crucial. This view provides metadata about each partition within a table, such as the partition ID, the number of rows, and other pertinent details.
Each row in the result represents a distinct partition, offering insights into the structure and utilization of your partitioned tables.
Example:
Imagine you are managing a dataset called myDataset in BigQuery, which contains several partitioned tables. To effectively monitor and optimize these tables, you need to understand the distribution and size of the partitions, especially focusing on active partitions that are frequently queried.
We will retrieve a list of all active table partitions in the sales_data dataset to analyze partition usage and data distribution.
SELECT table_name, partition_id, total_rows
FROM `myDataset.INFORMATION_SCHEMA.PARTITIONS`
WHERE partition_id IS NOT NULL;
Here:
- table_name: This column shows the name of the table to which the partition belongs.
- partition_id: The ID of the partition. Filtering out NULL values ensures that the query only returns partitions that exist and are assigned a specific ID.
- total_rows: Indicates the number of rows in each partition, providing insight into the data volume of each segment
Updating Partition Expiration with SQL
When you partition a table by time unit or ingestion time with BigQuery SQL, you can set an expiration time for each partition. This setting determines how long BigQuery retains the data in each partition before automatically deleting it.
The expiration setting is particularly useful for managing data lifecycle and storage costs.
Example:
Assume you are overseeing a dataset named financial_records in BigQuery, which includes a table daily_sales_partition partitioned by the transaction date. Given the regulatory requirements to retain transaction data for only 365 days, you need to set an expiration date for each partition to ensure compliance and manage storage efficiently.
ALTER TABLE `owox-analytics.myDataset.daily_sales_partition`
SET OPTIONS (
partition_expiration_days=365
);
Here:
- ALTER TABLE Command: This SQL command modifies the settings of an existing table.
- SET OPTIONS: This clause is used to specify or change table options.
- partition_expiration_days=365: This option sets the expiration time for each partition. Each partition’s data will automatically be deleted after 365 days from its creation or the specified date of the data in that partition.
Setting Partition Filter Requirements with SQL
In BigQuery, enabling the "Require partition filter" option for partitioned tables is crucial for enhancing query performance and reducing unnecessary full-table scans. This option forces queries against the table to include a filter on the partitioning column, ensuring that only relevant partitions are accessed during query execution.
If this option is not set during table creation, it can be added later to existing tables using the ALTER TABLE SET OPTIONS command.
Example:
You are managing a large dataset called myDataset in BigQuery, which includes a table daily_sales_partition partitioned by sale_date. Initially, the table was created without enabling the "Require partition filter" option. To improve query performance and control data access costs, you decide to update the table settings to require a partition filter for all queries.
You can use the following query to enable the "Require partition filter":
ALTER TABLE `owox-analytics.myDataset.daily_sales_partition`
SET OPTIONS (
require_partition_filter = true
);
Here:
- ALTER TABLE Command: Modifies the settings of the specified table.
- SET OPTIONS: Used to update table options.
- require_partition_filter = true: By setting this option to true, you enforce the inclusion of a filter on the partitioning column (sale_date) for all queries, preventing full-table scans and optimizing query performance.
Copy a Partitioned Table
The process of copying tables in BigQuery, whether they are partitioned or not, is generally the same. However, there are specific considerations to keep in mind when dealing with partitioned tables.
- Copying a Partitioned Table to a New Destination Table: When you copy a partitioned table to a new destination, all partitioning information, including the schema and partitions, is preserved in the new table.
- Copying a Non-Partitioned Table into an Existing Partitioned Table: This operation is feasible only for tables partitioned by ingestion time. BigQuery will assign the data to the partition corresponding to the current date. This does not apply to tables partitioned by a time-unit column or integer range.
- Copying a Partitioned Table into Another Partitioned Table: The partition specifications of both the source and destination tables must match. If they do not, the operation will not be successful.
- Copying a Partitioned Table into a Non-Partitioned Table: When copying data from a partitioned to a non-partitioned table, the resulting table will not retain any partitioning.
- Copying Multiple Partitioned Tables: If copying multiple source tables into a partitioned destination table in the same job, ensure all source tables have the same partitioning specifications. Mixing partitioned and non-partitioned tables is not supported.
Example:
You are managing a dataset named myDataset that includes a table named daily_sales_partition, which is partitioned by the sale_month column. To ensure more efficient querying, you decide to update the table to require a partition filter for all queries.
ALTER TABLE `owox-analytics.myDataset.daily_sales_partition`
SET OPTIONS (
require_partition_filter = true
);
Here:
- ALTER TABLE: This is the command used to modify an existing table's structure or settings.
- myDataset: This is the name of the dataset where the table resides.
- daily_sales_partition: This is the name of the table that is being altered.
- SET OPTIONS: This clause is used to change or set specific options for a table.
- require_partition_filter: This is the specific table option being set. It controls whether queries against the table must include a filter that restricts the data returned by using the partitioning column.
- true: This value is set to enable the option, meaning that every query executed against the monthly_sales table must include a condition that uses the partition column to filter the data.
Copy an Individual Partition
In BigQuery, there are scenarios where you might need to copy only a specific partition from one table to another. This is particularly useful for archiving, testing, or isolating specific data. However, it's important to note that this action cannot be performed directly through the BigQuery UI (Cloud Console) and requires the use of the bq command-line tool with partition decorators.
Example:
You manage a customer table in the sales_data dataset of your BigQuery project, which is partitioned by the date_registered field. You need to archive the customer data specifically for January 30, 2021, to a new table for a detailed audit and compliance review.
The following query can be run in bq command in-line tool; the setup steps have been mentioned before.
bq cp -a 'project-id.sales_data.customers$20210130' project-id.sales_data.archived_customers_20210130
Here:
- bq cp: This is the BigQuery command-line tool's command for copying datasets, tables, or partitions.
- -a: This flag stands for "append". It is used here to denote that the data should be appended to the destination. If the destination table does not exist, BigQuery will create it.
- project-id: Specifies the Google Cloud project ID.
- sales_data: The dataset within the project where the source table is located.
- customers: The name of the source table.
- $20210130: The partition decorator indicating the specific partition date (date_registered = January 30, 2021) to be copied.
- Project-id.sales_data.archived_customers_20210130: This specifies the destination path where the partition data will be copied. It indicates that the data should be archived into a new table named archived_customers_20210130 within the same dataset.
Delete a partition with SQL
This query helps remove specific partitions when they are no longer needed, maintaining your storage's efficiency and cost-effectiveness.
Example:
Suppose you manage a daily_sales table in the myDataset dataset of your BigQuery project, which is partitioned by sale_date. For data compliance and storage optimization, you need to delete sales records specifically for the dates September 1, 2023, and September 2, 2023, which are no longer relevant to your analysis.
DELETE FROM `owox-analytics.myDataset.daily_sales_partition`
WHERE sales_date IN ('2023-09-01', '2023-09-02');
Here:
- DELETE FROM: This statement is used to remove data from a specified table.
- owox-analytics.myDataset.daily_sales_partition: Specifies the fully qualified name of the table from which partitions will be deleted.
- WHERE sales_date IN ('2023-09-01', '2023-09-02'):
- sales_date : A special column in BigQuery that represents the partition's date.
- IN ('2023-09-01', '2023-09-02'): Specifies the dates of the partitions to be deleted. This filter targets all the partitions corresponding to these dates, enabling their complete removal without scanning other data.
Uncover in-depth insights
Modern Data Management Guide
Download nowBonus for readers
Advanced Techniques to Use When Partitioning Tables in BigQuery
Advanced partitioning techniques include using scripts to dynamically adjust partitions based on data growth or query performance, or implementing multi-level partitioning strategies that use both time and a secondary key. Another advanced technique is partitioning with clustering, which further organizes data within each partition by additional fields, enhancing query performance.
Partition of an Existing BigQuery table
Partitioning an existing table in BigQuery requires creating a new table with the desired partitioning setup. You cannot directly convert a non-partitioned table to a partitioned table without creating a new version of the table.
The process involves using an SQL query to copy the existing data into a new table that is defined with partitioning on a specific column. This method is commonly used for columns of integer-range or time-unit types.
Example:
You manage a dataset customer_records in BigQuery, containing a non-partitioned table customers_data that holds records of customer interactions. The table includes fields such as first_name, last_name, and date_registered. To enhance query performance and manage historical data more efficiently, you decide to create a new partitioned table based on the date_registered column.
CREATE TABLE `owox-analytics.myDataset.customers_data_partitioned`
(first_name STRING,
last_name STRING,
date_registered DATE)
PARTITION BY date_registered
AS
SELECT first_name, last_name, date_registered
FROM `owox-analytics.myDataset.customers_data`;
Here:
- CREATE TABLE: This statement is used to create a new table in the specified dataset.
- owox-analytics.myDataset.customers_data_partitioned: The fully qualified name of the new table to be created.
- Column Definitions: Lists the columns and their types that will be included in the new table.
- PARTITION BY date_registered: Specifies that the new table should be partitioned based on the date_registered column using the default day granularity. This means each partition will correspond to one day.
- AS SELECT ...: This part of the query copies the existing data from the customers_data table into the new customers_data_partitioned table.
Seamless BigQuery Integration in Sheets
Get real-time, automated, and deeply insightful reporting at your fingertips. Connect BigQuery, run reports, and prepare dashboards in your favorite Google Sheets
Converting Date-Sharded Tables to Ingestion-Time Partitioned Tables
Date-sharded tables refer to a common data management practice in which tables are split or sharded based on date segments. This typically results in multiple tables named or suffixed according to specific periods like days, months, or years.
Converting date-sharded tables into a single ingestion-time partitioned table can significantly simplify your data management and querying processes. This conversion involves combining multiple sharded tables, each corresponding to specific dates, into a single table with partitions organized by time. The bq command-line tool facilitates this process through specific partitioning commands.
Example:
You manage a dataset called sales_data in BigQuery. It contains multiple date-sharded tables for daily sales records, such as daily_sales_20200101, daily_sales_20200102, etc. To streamline queries and improve performance, you decide to consolidate these sharded tables into a single table partitioned by day.
You can open your command line interface and use the following command to create a Partitioned Table.
bq --location=us partition \
--time_partitioning_type=DAY \
--time_partitioning_expiration=259200 \
myproject:sales_data.daily_sales_ \
myproject:sales_data.consolidated_sales
Here:
- --location=us: Specifies the location of your BigQuery resources. This flag is optional but important for aligning with the location of your datasets.
- --time_partitioning_type=DAY: Sets the partition type to daily, meaning each partition corresponds to one day.
- --time_partitioning_expiration=259200: Sets the expiration for each partition to 259,200 seconds (or 3 days), after which the data in each partition will automatically be deleted.
- myproject:sales_data.daily_sales_: Indicates the prefix of the source sharded tables in the sales_data dataset of your project.
- myproject:sales_data.consolidated_sales: Specifies the destination dataset and the new partitioned table name.
Querying Data in Write-Optimized Storage
In BigQuery, the __UNPARTITIONED__ partition uniquely manages data streamed to partitioned tables. When data is directly streamed to a specific partition of a table, it bypasses the __UNPARTITIONED__ partition.
However, if data does not immediately fit into a designated partition because it's in write-optimized storage or lacks partition identifiers, it temporarily resides in the __UNPARTITIONED__ partition. The data in this temporary storage will have NULL values for _PARTITIONTIME and _PARTITIONDATE, distinguishing it from data that has been properly partitioned.
Example:
You manage a partitioned table test_table in the myDataset dataset on BigQuery, usually partitioned by activity_date. Occasionally, data without an sales_date ends up in the __UNPARTITIONED__ partition. Your objective is to identify and analyze these records to ensure they are correctly processed or to resolve any issues with missing date information.
SELECT *
FROM `owox-analytics.myDataset.test_table`
WHERE _PARTITIONTIME IS NULL;
Here:
- SELECT *: This part of the query selects all columns from the specified table. You might modify this to select specific columns based on your analysis needs.
- owox-analytics.myDataset.test_table: Specifies the fully qualified name of the table containing the data.
- WHERE _PARTITIONTIME IS NULL: This condition filters the data to include only those records that are in the __UNPARTITIONED__ partition, identified by _PARTITIONTIME being NULL.
Query an Externally Partitioned Table in BigQuery
BigQuery allows you to create partitioned tables directly from data stored in formats like Avro, Parquet, ORC, JSON, and CSV on Google Cloud Storage, using a Hive-compatible partitioning layout. This functionality is useful for efficiently managing and querying large datasets that are logically divided into distinct segments.
Example:
You oversee the event_logs dataset in BigQuery and need to create a new table called daily_events. This table will import JSON files from Google Cloud Storage, stored in a Hive-partitioned directory like gs://my_bucket/events/year=2021/month=09/day=15/*.json, and utilize their existing date-based partition structure.
Steps to Create a Partitioned Table from External Files Using BigQuery Console
1. Access the BigQuery Console and select the event_logs dataset where the new table will be created.
2. Click on "Create a table" then select "Cloud Storage" as the source.
3. Enter the Cloud Storage path using a wildcard to encompass all files within the partition structure, e.g., gs://my_bucket/events/*.
4. Check the box to enable partitioning based on the directory structure.
5. Enter gs://my_bucket/events to define the common prefix of the file paths.
6. Leave the "Partition inference mode" option set to "Automatically infer types" to let BigQuery determine the schema and partitioning from the files.
7. Then use the following code.
CREATE EXTERNAL TABLE event_logs.daily_events
OPTIONS (
format = 'JSON',
uris = ['gs://my_bucket/events/*'],
hive_partition_uri_prefix = 'gs://my_bucket/events',
enable_hive_partitioning = TRUE
);
Here:
- CREATE EXTERNAL TABLE: This statement is used to create a new external table in BigQuery. External tables allow you to query data stored in Google Cloud Storage without moving the data into BigQuery itself.
- event_logs.daily_events: Specifies the name of the new external table to be created, where event_logs is the dataset, and daily_events is the table name.
- OPTIONS: This clause specifies various settings and configurations for the external table.
- format: Defines the format of the source data files.
- 'JSON': Indicates that the data files are in JSON format. This setting helps BigQuery understand how to parse the data files.
- uris: A list of URIs (Uniform Resource Identifiers) that point to the data files or folders in Google Cloud Storage.
- ['gs://my_bucket/events/*']: Specifies the path to the data files. The wildcard * is used to include all files in the events folder, allowing BigQuery to access all JSON files under this path.
- hive_partition_uri_prefix: Sets the base URI prefix for hive partitioning.
- 'gs://my_bucket/events': The path prefix that precedes the actual hive-partitioned directories. This setting tells BigQuery the root directory from which the partitioned data structure begins.
- enable_hive_partitioning: A flag to enable or disable hive-style partitioning.
- TRUE: Setting this to TRUE enables BigQuery to recognize and manage data based on the hive-partitioned directory structure specified in the uris. This is crucial for correctly interpreting the partitioning scheme (year=YYYY/month=MM/day=DD) in the file paths.
Steps to Create a Partitioned Table from External Files Using SQL
1. Prepare Your External Data
Ensure your data is stored in Google Cloud Storage. The data does not need to be organized in a Hive partitioning layout for this method, as partitioning will be defined in the SQL query.
2. Open BigQuery in the Cloud Console
Go to the BigQuery section in Google Cloud Console, but instead of using the UI, navigate to the Query Editor to write SQL queries.
3. Write the SQL Query to Create a Partitioned Table
CREATE OR REPLACE EXTERNAL TABLE `owox-analytics.myDataset.test_table_partitioned`
WITH PARTITION
(
sale_date DATE,
order_id STRING,
order_amount INT64
)
OPTIONS (
format = 'CSV', -- Specify file format
uris = ['gs://owox-analytics/daily_sales.csv'],
skip_leading_rows = 1 -- Optional: Skip header rows
)
hive_partition_uri_prefix = 'gs://owox-analytics/daily_sales.csv'
;
Here, URIs are the URI that points directly to the external files in Google Cloud Storage, without considering any directory structure for partitions.
Combining Partitioning with Clustering
Combining both partitioning and clustering in a single BigQuery table is an effective way to optimize query performance and reduce costs. Partitioning organizes data into separate segments based on a specified column, such as a date or an integer range. Clustering further organizes data within those partitions based on one or more columns, helping to minimize the data scanned during queries.
Example:
You manage a dataset myDataset in BigQuery, containing comprehensive transaction records. To enhance query efficiency, you plan to create new tables combining partitioning by transaction year and clustering by regiond.
CREATE TABLE `owox-analytics.myDataset.partition_cluster_region`
PARTITION BY RANGE_BUCKET(year, GENERATE_ARRAY(2015, 2022, 1))
CLUSTER BY region_id
AS
SELECT * FROM `owox-analytics.myDataset.transaction_data`;
Here:
- CREATE TABLE: This command is used to create a new table within a dataset in BigQuery.
- owox-analytics.myDataset.partition_cluster_region: Specifies the full path of the new table:
- partition_cluster_region: The name given to the new table.
- PARTITION BY RANGE_BUCKET(year, GENERATE_ARRAY(2015, 2022, 1)): Specifies how the table is to be partitioned.
- CLUSTER BY region_id: Specifies that the table should be clustered by the region column.
- AS SELECT * FROM owox-analytics.myDataset.transaction_data: Defines the data source for populating the new table.
Access BigQuery Data at Your Fingertips
Make BigQuery corporate data accessible for business users. Easily query data, run reports, create pivots & charts, and enjoy automatic updates
Limitations of BigQuery Partitioned Tables
The limitations of partitioned tables include constraints on the number of partitions per table and the impact on performance when these limits are approached or exceeded. Understanding these limitations is crucial when designing systems that scale, as excessive partitioning can lead to increased management overhead and reduced performance.
Limits on Maximum Partitions per Table
Each partitioned table in BigQuery is limited to a maximum of 10,000 partitions. Should you reach this limit, it's advisable to integrate clustering alongside, or as an alternative to, partitioning to manage and query your data efficiently.
For example, a table spanning 27 years with daily partitions would exceed this limit; thus, switching to monthly partitions could be a strategic solution. This adjustment not only adheres to the partition limit but also optimizes query performance by reducing the granularity of the data.
Limits on Partitions Modified per Job
Each query or load operation in BigQuery can impact a maximum of 4,000 partitions. If an operation attempts to modify more than this limit, BigQuery will reject the job.
For example, when importing historical data, dividing the load into smaller batches that each impact less than 4,000 partitions can help avoid job failures and promote more effective data management.
Limit on Daily Partition Modifications for Ingestion
Your project is allowed up to 5,000 daily partition modifications, which may involve appending, updating, or truncating data in an ingestion-time partitioned table. It's important to note that DML (Data Manipulation Language) statements are not included in this daily modification count.
For instance, if you're managing a dataset that logs user activities, you could perform operations such as adding new user data, updating existing records, or clearing old data from specific partitions throughout the day. If these operations on ingestion-time partitions do not exceed 5,000 changes in a single day, they will comply with BigQuery's limits.
This cap helps manage the system's load effectively without counting any modifications made via DML statements like INSERT, UPDATE, or DELETE.
Limit on Daily Partition Modifications for Column
Your project is permitted to make up to 30,000 modifications each day to a column-partitioned table. It's worth noting that neither DML statements (like INSERT, UPDATE, or DELETE) nor streaming data entries count towards this daily limit of partition modifications.
For example, if your project involves updating sales data across various regions stored in a column-partitioned table, you can execute multiple batch updates throughout the day. As long as the total number of these modifications doesn't surpass 30,000, you'll remain within BigQuery's operational guidelines.
Range Partitioning Limits
A table partitioned by range can support up to 10,000 distinct ranges. This limit is relevant when defining the partitioning scheme during table creation and affects the number of partitions that can exist once the table is established.
For instance, if you're setting up a table to track monthly sales over several decades, you might consider using range partitioning by month. However, since the total number of months spanning multiple decades could potentially exceed 10,000, you'd need to ensure that the number of distinct monthly partitions does not surpass the 10,000 range limit set by BigQuery.
Maximum Rate of Metadata Update Operations for Partitioned Tables
Your project is allowed up to 50 metadata modifications per partitioned table every 10 seconds. This limit covers all types of metadata updates, whether performed through the Google Cloud Console, the bq command-line tool, BigQuery client libraries, or API methods like tables.insert, tables.patch, and tables.update.
It also encompasses operations from DDL statements on tables and the total of all load, copy, and query jobs that modify data in a table, including DML operations such as DELETE, INSERT, MERGE, TRUNCATE TABLE, or UPDATE.
If this limit is exceeded, BigQuery will return an error message indicating that the rate limits for partitioned table update operations have been surpassed. This type of error can be resolved by retrying the operation with exponential backoff.
Exponential back off is a computing strategy used to manage retry attempts for failed operations, particularly in network applications. It involves progressively increasing the delays between retries up to a maximum delay, which is usually capped.
This method helps to reduce the load on the system and the likelihood of continued failures that can occur when many retries are made in a short period.
For example, if you frequently update table schemas or descriptions through automated scripts or rapidly execute multiple data-loading tasks to the same partitioned table, you might hit this rate limit. Monitoring your logs will help you identify which operations contribute to this limit, allowing you to adjust your processes accordingly.
Unlock BigQuery Insights in Google Sheets
Report on what matters to you. Integrate corporate BigQuery data into a familiar spreadsheet interface. Get insightful, up-to-date reports with just a few clicks
Best Practices to Follow When Partitioning Tables
When implementing partitioning in BigQuery, it's essential to adopt best practices that optimize both performance and cost efficiency. Key strategies include selecting partition keys that align well with your most common query conditions, as this reduces the amount of data scanned during each query, enhancing query performance and reducing costs. Let's look into them in detail.
Use Constant Filter Expressions
To effectively limit which partitions are scanned during a query in BigQuery, it's essential to use constant expressions in your filters. Utilizing dynamic expressions leads to scanning all partitions, which can significantly impact performance and cost.
For instance, consider this query that efficiently prunes partitions due to its use of a constant expression in the filter:
SELECT
t1.name,
t2.category
FROM
table1 AS t1
INNER JOIN
table2 AS t2
ON t1.id_field = t2.field2
WHERE
t1.ts = CURRENT_TIMESTAMP();
In contrast, the query below fails to prune partitions effectively because it relies on a dynamic expression for its filter. The partition scan extends over all partitions as the filter's value changes based on the output of a subquery:
SELECT
t1.name,
t2.category
FROM
table1 AS t1
INNER JOIN
table2 AS t2
ON
t1.id_field = t2.field2
WHERE
t1.ts = (SELECT timestamp FROM table3 WHERE key = 2);
This example demonstrates the importance of carefully designing query filters to enhance query efficiency by reducing unnecessary data scans.
Isolate the Partition Column in Filters
To optimize partition pruning in BigQuery, it's essential to keep filters on the partition column isolated from computations involving other fields. Filters that require data processing across multiple fields, such as those using date comparisons with additional field manipulations or concatenations, do not effectively prune partitions.
For instance, consider the following query condition, which fails to prune partitions effectively because it involves a computation that combines the partitioning column ts with another field ts2:
WHERE TIMESTAMP_ADD(ts, INTERVAL 6 HOUR) > ts2
To optimize partition pruning in BigQuery, directly compare the partitioning column to a constant or a simple variable. This method effectively narrows down the data scan to relevant partitions.
For example, to efficiently compare timestamps, you can use a similar command.
WHERE ts > TIMESTAMP "2021-01-01 06:00:00"
Require Partition Filters in Queries
When setting up a partitioned table in BigQuery, you can mandate the inclusion of predicate filters by activating the "Require partition filter" option. This ensures that every query against the table includes a WHERE clause that facilitates partition elimination, enhancing query efficiency. Queries without a suitable partition-filtering clause will trigger an error, indicating the need for a filter targeting the partition column.
For a table partitioned by partition_id, valid queries might look like the following examples:
Example 1:
WHERE partition_id = "20221231"
Example 2:
WHERE partition_id = "20221231" AND f = "20221130"
However, a query using OR, such as the following, does not qualify for partition elimination.
WHERE (partition_id = "20221231" OR f = "20221130")
For tables partitioned by ingestion time, the pseudocolumns _PARTITIONTIME or _PARTITIONDATE should be used.
Choosing the Right Partitioning Column
Select a partitioning column that best matches the characteristics of your data and the typical queries you run.
For instance, if you frequently query transaction data by date, partitioning the table by the transaction_date column would optimize query performance by reducing the amount of data scanned.
Avoiding Over-Partitioning
Minimize excessive partitioning, as it can unnecessarily increase storage costs and impair query efficiency.
For example, partitioning a table by the hour for data that spans several years could create an excessive number of partitions, leading to higher storage expenses and slower query speeds due to the overhead of managing many small partitions.
Clustering Data Within Partitions
Enhance query performance by implementing clustering within partitions. Use the CLUSTER BY clause to organize data by specific columns inside each partition.
For instance, if you have a table partitioned by date, adding clustering on the customer_id column can optimize queries that filter both by date and customer. BigQuery can efficiently locate and retrieve data related to specific customers within each daily partition.
Explore the Power of Useful BigQuery Functions
Dive into the powerful features of BigQuery for complex data analysis and efficient handling of extensive datasets with these essential functions:
- Conditional Expressions: Utilize logic-based functions to perform actions based on specific conditions. These expressions are crucial for decision-making within queries.
- String Functions: Handle text data effectively with functions that allow searching, substring extraction, replacement, and manipulation of string values.
- Conversion Functions: Facilitate the transformation of data types, such as converting strings to integers or dates to strings, with CAST and CONVERT to ensure compatibility and correctness across different data formats.
- Array Functions: Manage and manipulate arrays through functions that enable array creation, access specific elements, flatten nested arrays, and verify the presence of elements within arrays.
- Numbering Functions : Assign unique identifiers or ranking numbers to rows within your query results using functions like ROW_NUMBER, RANK, and DENSE_RANK to organize and identify data more distinctly.
- Navigation Functions: Navigate through rows in a dataset effectively, typically used in conjunction with window functions to retrieve previous or subsequent rows like LEAD, LAG, FIRST_VALUE, and LAST_VALUE.
These functions enhance BigQuery’s analytical capabilities, making it easier to manage, transform, and analyze large volumes of data, ensuring that you can derive meaningful insights from your datasets efficiently.
Enhance Your Reporting with the OWOX BI BigQuery Reports Extension
Understanding partitioned tables in BigQuery is essential for optimizing your data architecture for better performance and cost efficiency. These tables allow you to organize data into meaningful partitions based on specific column values, such as dates or numeric ranges, facilitating faster queries and reducing costs by enabling more focused data scans.
For those looking to deepen their mastery of partitioned tables and enhance their BigQuery implementations, consider exploring advanced tools like the OWOX Reports Extension for Google Sheets.
This extension seamlessly integrates with BigQuery, offering sophisticated reporting and analytics capabilities. Utilizing OWOX BI can simplify complex data workflows, improve the accuracy of your insights, and empower you to make more informed decisions swiftly.
Simplify BigQuery Reporting in Sheets
Easily analyze corporate data directly into Google Sheets. Query, run, and automatically update reports aligned with your business needs
FAQ
-
What are partitioned tables in BigQuery?
Partitioned tables in BigQuery help manage large datasets by dividing them into smaller, manageable segments based on specific column values like dates or integers. This division enhances query performance and reduces costs by scanning only relevant partitions.
-
What types of fields can I partition BigQuery tables by?
You can partition BigQuery tables by date (`DATE`, `DATETIME`, `TIMESTAMP`), integer range, or you can use ingestion time for automatic partitioning based on the data's load time.
-
How can I manage partitioned tables in BigQuery?
Manage partitioned tables by setting partition expiration, using the BigQuery console or SQL commands to modify schema, update settings, and monitor partition usage and performance through the INFORMATION_SCHEMA views.
-
How do I delete a partition with SQL?
Delete a partition in SQL by using a `DELETE` statement where the partition filter matches the specific partitions you want to remove, ensuring only the targeted data within the partition is deleted.
-
How do I query a time-unit partitioned table?
Query a time-unit partitioned table by specifying a condition on the partitioning column in the `WHERE` clause, which allows BigQuery to scan only the relevant partitions, optimizing query performance.
-
How do I copy a partitioned table or an individual partition?
Copy a partitioned table or an individual partition using the `bq cp` command in the bq tool, specifying the source and destination tables, and using the partition decorator for individual partitions if necessary.
-
Can I query an externally partitioned table in BigQuery?
Yes, you can query externally partitioned tables stored in formats like Parquet or ORC in Google Cloud Storage by setting up external tables in BigQuery that reference the partitioned data.
-
Is there any additional pricing for partitioning tables?
Creating partitioned tables in BigQuery does not incur an additional cost; however, storage and query operations incur costs. Reducing the amount of data scanned during queries can often lead to cost savings.