All resources

A Guide to BigQuery ML (With Examples)

Wondering how to turn your data into decisions quickly and easily? BigQuery ML lets analysts build, train, and deploy machine learning models directly in their data environment, without complex setups. It offers interactive tutorials and enables SQL users to build models using familiar tools, making machine learning more accessible.

i-radius

By integrating with Vertex AI, BigQuery ML improves predictive accuracy, simplifies data workflows, and helps organizations efficiently manage machine learning projects at scale.

Note: This blog post was originally published in February 2025 and has been updated in May 2025 to reflect the latest BigQuery ML features, use cases, and best practices for streamlined machine learning.

What is BigQuery ML?

BigQuery ML integrates machine learning directly into Google BigQuery, allowing users to create and deploy models without needing to move data to external services. This setup eliminates the need for data movement, speeding up development by letting data analysts and machine learning practitioners work within the familiar BigQuery environment.

By eliminating data transfers to other Google Cloud AI services, BigQuery ML simplifies workflows, boosts productivity, and streamlines the process of developing and deploying machine learning models.

Key Advantages of BigQuery ML

BigQuery ML simplifies machine learning by enabling data analysts to build models within Google BigQuery using SQL, directly handling large data structures

  1. BigQuery ML lets data analysts build and run models using SQL. For advanced use cases, tools like Python or Vertex AI may still be needed for added functionality.
  2. Empowers data analysts to leverage AI and ML directly within their familiar business intelligence tools and spreadsheets.
  3. Aims to democratise machine learning by enabling predictive analytics to guide decision-making across the organization.
  4. Increases development speed by eliminating the need for data movement, speeding up model development and reducing complexity.
  5. Accelerates deployment by bringing machine learning to the data, eliminating the need for external tools and data transfers.

Understanding BigQuery ML Models and Their Types

BigQuery ML supports various model types for different use cases, including:

  1. Internally Trained Models: BigQuery ML lets businesses build models like Linear Regression, Binary Logistic Regression, K-Means Clustering, Matrix Factorization, and Time Series forecasting for faster insights. However, advanced algorithms like PCA or Contribution Analysis still require external tools or custom code.
  2. Externally Trained Models (via Vertex AI): BigQuery ML integrates with Vertex AI, offering flexibility for advanced models such as DNN, Wide & Deep, Autoencoder, Boosted Tree, Random Forest, and AutoML. These models, however, are exclusive to Vertex AI and cannot be directly utilized within BigQuery ML. They must be trained and deployed on Vertex AI, with predictions imported back into BigQuery for analysis.
  3. Remote Models: Created in Vertex AI, these models are referenced in BigQuery without incurring processing charges.
  4. Imported Models: Pre-trained models like ONNX, TensorFlow, and XGBoost can be imported into BigQuery for seamless predictions.

How to Create ML Models in BigQuery

A model in BigQuery ML represents learned patterns from training data to identify trends and make predictions. To create a model, BigQuery ML uses the CREATE MODEL command, which initiates a Query Job that runs asynchronously within the BigQuery environment.

Prerequisites for BigQuery ML:

#1: GCP Account: Sign up and create a project.

#2: BigQuery API: Enable the BigQuery API in your project via the API Library. 

Additionally, ensure that billing is active for your project and that you have the necessary roles – such as BigQuery Data Editor and BigQuery Job User, to create and manage models effectively. Once a model is created, BigQuery runs it as a Query Job. These jobs execute asynchronously, allowing users to continue working in the interface while the job runs in the background

These roles provide the permissions required to handle data and execute machine learning jobs within BigQuery. 

Step 1: Select a BigQuery Dataset

In the BigQuery console, start by selecting or creating a dataset to hold your data and machine learning models.

To add a new dataset, click 'Create dataset' and enter the required information.

BigQuery console displaying a list of datasets with the 'Create dataset' option highlighted for adding a new dataset. i-shadow

For this article, we'll use the pre-cleaned and processed 'Google Analytics Sample' dataset from the BigQuery Public Datasets, adding it to our project for convenience.

Google Analytics Sample dataset selected from BigQuery Public Datasets for machine learning model training. i-shadow

Step 2: Build Your First Machine Learning Model

To create a logistic regression model in BigQuery, use the CREATE MODEL statement. 

Example:

This example demonstrates how to define and train a model to predict whether a transaction occurred based on user session data from the Google Analytics sample dataset.

1CREATE OR REPLACE MODEL `bqml.new_sample_model` OPTIONS(model_type='logistic_reg') 
2AS SELECT IF(totals.transactions IS NULL, 0, 1) 
3AS purchase_made, -- Target variable (binary classification) IFNULL(device.browser, "") 
4AS browser, -- Input feature: Browser type device.deviceCategory 
5AS device_category, -- Input feature: Device category (mobile, desktop, tablet) IFNULL(geoNetwork.region, "") 
6AS region, -- Input feature: Region IFNULL(totals.hits, 0) 
7AS total_hits -- Input feature: Total hits count 
8FROM `bigquery-public-data.google_analytics_sample.ga_sessions_*` 
9WHERE _TABLE_SUFFIX BETWEEN '20170101' AND '20171231' LIMIT 150000;

Here:

  • CREATE OR REPLACE MODEL bqml.new_sample_model: Creates a logistic regression model in BigQuery.
  • purchase_made: Target variable, assigns 0 for no purchase, 1 for purchase.
  • browser, device_category, region, total_hits, operating_system: Input features for the model.
  • Dataset: Google Analytics public dataset.
  • Date Filter: Filters data between January 1, 2017, and December 31, 2017.
  • LIMIT 150000: Restricts data to 150,000 rows for model training.

Steps on How to Train and Deploy a BigQuery ML Model

To train and deploy a BigQuery ML model, first prepare and load your dataset. Then, use SQL to create and train the model. Reviewing training statistics is essential to understand how the model was trained and how well it performed.

The training data is split into training and evaluation data to prevent overfitting.

Step 1: Prepare Your Training Dataset

Before training, clean and organize your dataset. Create a BigQuery dataset, load relevant data (e.g., GA4), and set the correct data location (US/EU) for compliance. Use exploratory analysis or Google Data Studio to better understand the data before training.

Step 2: Feature Engineering using the TRANSFORM Statement

In BigQuery ML, you can enhance model training using the TRANSFORM clause for feature engineering and data preprocessing. Instead of relying solely on raw data, machine learning models achieve better accuracy with proper feature engineering and preprocessing. 

In BigQuery ML, the TRANSFORM clause enables these steps directly within SQL to enhance model training.

This enables both feature selection and engineering, providing the model with more relevant and refined features. The TRANSFORM clause helps data analysts and BI professionals ensure that their models use the most meaningful data, leading to improved predictions and analysis accuracy.

BigQuery ML CREATE MODEL statement using the TRANSFORM clause to perform feature engineering and data preprocessing, refining input features for improved model accuracy. i-shadow

Step 3: Train Your Model

To train a model in BigQuery ML, start by defining it with a standard SQL query. First, specify the model’s name, then define the features and any necessary data transformations using the TRANSFORM clause. Users can access detailed model information, including training statistics and options, to understand the model's capabilities.

BigQuery ML SQL query defining a machine learning model with a specified name, selected features, and data transformations using the TRANSFORM clause. i-shadow

‍Next, select the model type and set parameters, including the target column. Finally, execute an SQL query to select the input data from a table. The WHERE clause limits the number of rows scanned by the query, optimizing performance and reducing costs.

Model training time varies depending on complexity, from minutes for simple models to hours for more advanced models.

BigQuery ML training progress displayed in the TRAINING tab, showing estimated duration based on model complexity i-shadow

Track progress in the BigQuery UI's TRAINING tab.

BigQuery ML training process running, with progress tracked in the TRAINING tab. i-shadow

Step 4: Evaluate the Model Performance

BigQuery automatically calculates key performance metrics based on the model type, such as precision, recall, accuracy, F1 score, log loss, and ROC AUC. You can view these in the EVALUATION tab or via ML.EVALUATE queries.

  • Precision shows how often positive predictions are correct.
  • F1 score balances precision and recall.
  • Log loss measures prediction errors.
  • ROC AUC indicates the model's ability to distinguish between positive and negative classes.

Users can use the ML.EVALUATE function to compare the model's predicted values against actual data. Results from ML.EVALUATE can be further structured and analyzed using standard SQL queries.

BigQuery ML EVALUATION tab showing performance metrics such as accuracy, precision, recall, and F1 score. i-shadow

These metrics help assess model performance. If the results are unsatisfactory, you can return to the model training step to fine-tune and improve performance.

BigQuery ML model evaluation metrics displaying three performance graphs: (1) Precision-recall by threshold curve, (2) Precision-recall curve, and (3) ROC curve. i-shadow

Step 5: Model Deployment to Predict Outcomes

Once you’re satisfied with your model’s performance, it’s time to use it for predictions on new, unseen data. In BigQuery, prepare a dataset with new data that wasn’t used during training, such as using a date range outside the training period. 

Users can apply the trained model to predict metrics like purchases per user or transaction counts for new visitors using unseen data.

For example, predict the probability of the addedToCart event using data from December 31, 2020. Alternatively, you can export the model for deployment in production systems to generate online predictions.

BigQuery ML SQL query using ML.PREDICT to estimate the probability of an event during a session, based on a logistic regression model  i-shadow

Overview of BigQuery ML Pricing Models

BigQuery ML supports built-in models (e.g., linear regression, K-means) trained within BigQuery, and external models like deep neural networks via Vertex AI. By eliminating data movement, BigQuery ML speeds up development and reduces costs.

Pricing depends on the model type and usage, either through editions or on-demand. Prediction and evaluation functions are executed within BigQuery, with costs based on the model type and resources used.

BigQuery ML Editions Pricing

BigQuery ML offers Enterprise and Enterprise Plus Editions with slot-based pricing for predictable costs on dedicated resources. Pricing and resource management vary between built-in and external models. By using familiar SQL tools, teams can manage resources efficiently and reduce costs.

Built-in Models: Built-in models like linear regression and K-means use shared BigQuery slots, enabling efficient resource use across ML and analytical tasks.

External Models: External models require ML_EXTERNAL reservations to prevent conflicts and ensure cost parity with BigQuery and other Google Cloud services.

BigQuery ML On-Demand Pricing

BigQuery ML on-demand pricing is determined by the type of operation, including model creation, evaluation, inspection, and prediction. For iterative models, the CREATE MODEL statement is limited to 50 iterations, applying to both on-demand and edition pricing.

For time series models with auto-arima enabled, multiple candidate models are evaluated during training. The number of models can be controlled by the AUTO_ARIMA_MAX_ORDER setting. This impacts model creation costs, but for evaluation, inspection, and prediction, only the selected model is used, following standard query pricing.

BigQuery ML Remote Models

BigQuery ML enables customers to create remote models using Vertex AI, Cloud AI APIs (Vision, NLP, Translate), or custom prediction endpoints. It integrates with Google Cloud resources for remote training and predictions. Pricing includes BigQuery processing costs and additional charges for services like text generation, translation, or image annotation.

For supervised tuning with Vertex AI LLMs, costs include BigQuery data processing, but GPU/TPU usage is billed separately. Use billing labels like billing_service = ‘bigquery_ml’ to track these charges across services.

BigQuery ML Dry Run

For certain BigQuery ML model types, due to the complexity of their underlying algorithms, the bytes processed for billing may not be calculated until after the training process is complete.

After training, users can review resource usage and billing details through model statistics and metadata to understand any adjustments made.

This is because estimating the initial resource usage is more complex for these models, resulting in billing adjustments being made once training has finished.

Cost Optimization Tips for BigQuery ML

To manage costs in BigQuery ML, choose the right pricing model, use dry runs for cost estimates, and manage training efficiently. Editions can help maintain predictable pricing. Speeding up development by simplifying queries or limiting training iterations also reduces costs without affecting performance.

Use BigQuery ML Editions for Predictable Pricing

Choose BigQuery ML Enterprise or Enterprise Plus Editions for predictable, slot-based pricing over on-demand. These editions help avoid unexpected costs and speed up development by ensuring consistent resource availability. Built-in models share BigQuery slots, while external models use ML_EXTERNAL reservations for efficient resource management and cost parity with other Google Cloud services.

Take Advantage of Dry Runs

Use dry runs to estimate data processed and potential costs before running queries or training models. This helps catch costly mistakes early, manage resources efficiently, and avoid unexpected expenses. Make sure your project ID is set correctly to track costs and resource usage during dry runs.

Register your models in the Vertex AI Model Registry

Register models in the Vertex AI Model Registry to track, compare, and manage versions efficiently. This avoids unnecessary retraining, reduces costs, and improves model lifecycle control in BigQuery ML. Make sure your Google Cloud account has the right permissions and billing setup to manage registrations effectively.

What is Vertex AI for BigQuery?

Vertex AI is Google’s platform for developing, deploying, and managing AI/ML models. It integrates with BigQuery to prepare training data, manage features in the Vertex AI Feature Store, and support various training methods like AutoML and custom training. Models trained in BigQuery ML can be registered in Vertex AI for easier management and deployment.

Benefits of Managing BigQuery ML Models in Vertex AI

  1. Online Model Serving: Register models in Vertex AI to enable real-time predictions through endpoints.
  2. MLOps Capabilities: Automate model monitoring, retraining, and integration of BigQuery ML jobs for continuous accuracy.

Steps for Integrating BigQuery ML with Vertex AI

Register BigQuery ML models in the Vertex AI Model Registry to manage, version, and deploy them without exporting. This enables online predictions through a unified interface, with no separate serving container needed. The query below shows how to register a BigQuery ML model in Vertex AI.

Enable the Vertex AI API

Before registering BigQuery ML models in Vertex AI, enable the Vertex AI API in your Google Cloud project. You can do this via the APIs & Services page, the BigQuery UI, or by running the gcloud CLI command below.

1gcloud --project PROJECT_ID services enable aiplatform.googleapis.com

Grant IAM Permissions for Vertex AI

To register BigQuery ML models in the Vertex AI Model Registry, the service or user account must have the Vertex AI Administrator role (roles/aiplatform.admin). You can grant this role using the Google Cloud CLI.

For a service account, use the following command:

1gcloud projects add-iam-policy-binding PROJECT_ID 
2--member=serviceAccount:SERVICE_ACCOUNT_EMAIL 
3--role=roles/aiplatform.admin --condition=None

For a user account, use the following command:

1gcloud projects add-iam-policy-binding PROJECT_ID 
2--member=user:USER_ACCOUNT --role=roles/aiplatform.admin 
3--condition=None

Register BigQuery ML Model to Vertex AI Model Registry During Creation

Use the MODEL_REGISTRY option in CREATE MODEL to register a BigQuery ML model in Vertex AI. Add VERTEX_AI_MODEL_ID and VERTEX_AI_MODEL_VERSION_ALIASES for version control. Once registered, models can be deployed and monitored with support from Vertex Explainable AI.

The following example shows how to create a time series forecasting model and register it during creation. Here’s a step-by-step guide:

Step 1: Import Time Series Data

First, upload the dataset to BigQuery and create a table. You can also use an existing dataset.

  1. Navigate to BigQuery in the console.
  2. Create a dataset or use an existing one.
  3. Upload the CSV file and name the table as per the requirement. Here, we are naming it transportation_revenue_insights. Let the schema be auto-detected.

Step 2: Create a Time Series Model

Using BigQuery ML syntax, we will create an ARIMA time series model:

1CREATE OR REPLACE MODEL
2  `demo.transportation_revenue_insights` 
3OPTIONS(MODEL_TYPE='ARIMA',
4    TIME_SERIES_TIMESTAMP_COL='service_date',
5    TIME_SERIES_DATA_COL='total_rides',
6    HOLIDAY_REGION='us',
7    MODEL_REGISTRY='vertex_ai',
8    VERTEX_AI_MODEL_ID='cta_ridership_forecast') AS
9SELECT
10  service_date, total_rides
11FROM
12  `demo.transportation_revenue_insights`;

Here:

  • MODEL_REGISTRY='vertex_ai': Registers the model in the Vertex AI Model Registry.
  • VERTEX_AI_MODEL_ID: Specifies a custom Vertex AI Model ID for easy management.

Step 3: Evaluate the Model

Once the model is trained, evaluate it with this query:

1SELECT
2  *
3FROM
4  ML.EVALUATE(MODEL `transportation_revenue_insights`);

This returns model parameters, evaluation metrics, and the best ARIMA model based on AIC.

Step 4: Forecast with the Model

Use the ML.FORECAST function to forecast ridership for the next 7 days:

1SELECT
2  *
3FROM
4  ML.FORECAST(MODEL `transportation_revenue_insights`,
5    STRUCT(7 AS horizon));

This query generates a 7-day forecast along with confidence intervals.

By following these steps, you not only create and evaluate a time series model but also register it in Vertex AI for easy management and deployment.

Register an Existing BigQuery ML Model to Vertex AI

To register an existing BigQuery ML model in the Vertex AI Model Registry, use the ALTER MODEL statement.

1ALTER MODEL [IF EXISTS] <model_name> SET OPTIONS 
2(vertex_ai_model_id="VERTEX_AI_MODEL_ID");

The query shows the model options after it's registered.

  • If you use a new model ID, it registers as version 1. To add it as a new version of an existing model, specify the existing model's ID.
  • This allows you to manage and compare versions in the Model Registry.

Manage Time Series Models in Vertex AI Model Registry

Once your time series forecasting model is registered in the Vertex AI Model Registry, you can deploy it to predict outcomes like purchases, monitor its performance, and use Vertex Explainable AI to understand the key factors driving forecasts. This helps improve control and transparency over your model’s predictions.

Specify a Vertex AI Model ID and Aliases

With BigQuery ML models registered in Vertex AI, you can assign a model ID and version aliases for easier management and versioning.

Specify a Vertex AI Model ID

Use the VERTEX_AI_MODEL_ID option when creating a model. It must be lowercase. If not set, the BigQuery ML model ID is used by default. To change it later, you’ll need to delete and recreate the model.

Specify a Vertex AI Model Alias

Use VERTEX_AI_MODEL_VERSION_ALIASES to assign aliases to model versions. This lets you refer to versions easily without using exact IDs.

BigQuery ML vs. Traditional Machine Learning Approaches

BigQuery ML streamlines model building by letting users execute machine learning models directly within BigQuery using SQL. In contrast, traditional ML often involves complex workflows, data movement, and external tools.

Here is a detailed comparison of the two:

Feature/Aspect BigQuery ML Traditional Machine Learning
Data Processing Works directly within BigQuery; no need to export data to external environments Requires data export to external environments for processing
Model Development SQL-based model creation, accessible to data analysts Requires programming knowledge in languages like Python, R
Complexity Simplified with SQL queries and built-in functions Involves specialized libraries (e.g., TensorFlow, scikit-learn)
Environment Integrated within Google Cloud and BigQuery; models trained directly within BigQuery Requires external ML platforms for model training
Deployment Built-in deployment capabilities with Vertex AI integration Involves creating custom deployment pipelines
Speed to Production Faster, with no need to move data and minimal setup Slower due to multiple steps in the process
Prediction Type Batch and online predictions via Vertex AI Typically requires custom APIs for deployment and real-time predictions
User Skill Level Accessible for SQL users; no need for advanced ML skills Requires knowledge of ML frameworks and programming

By simplifying model creation and deployment within a familiar environment, BigQuery ML offers a more efficient, accessible approach to machine learning compared to traditional workflows.

Unlock In-Depth Insights with the OWOX Reports Extension for Google Sheets

The OWOX Reports Extension for Google Sheets empowers users to transform raw data into actionable insights with ease. By connecting Google Sheets directly to your data sources, such as Google BigQuery and Google Analytics, it enables users to automate report generation, streamline data analysis, and eliminate the need for manual data imports. 

With intuitive dashboards, charts, and reports, business users and data analysts can track key metrics, uncover trends, and make data-driven decisions without leaving Google Sheets. The extension simplifies data management, making it accessible to non-technical users while providing in-depth insights.

FAQ

What is BigQuery ML?
What types of machine learning models can you build with BigQuery ML?
Can I integrate BigQuery ML with Vertex AI for advanced use cases?
What is the difference between built-in and external models in BigQuery ML?
How is BigQuery ML priced?
How can I reduce costs when running machine learning models in BigQuery?

You might also like

2,000 companies rely on us

Oops! Something went wrong while submitting the form...