Google BigQuery Features — Detailed Review


The more information a business accumulates, the more acute the question of where to store it. If you don’t have the ability or desire to maintain your own servers, Google BigQuery will help out. BigQuery is a fast, economical, and scalable repository for working with big data. It lets you write queries using SQL-like syntax, with standard and user-defined functions.

In this article, we’ll look at the main functions of BigQuery. You’ll learn how to write basic queries and can try them out on demo data.

What is SQL and what are its dialects?

SQL (Structured Query Language) is a structured query language for working with databases. With SQL, you can retrieve data, add data to a database, and modify large data sets. Google BigQuery supports two dialects: Standard SQL and Legacy SQL.

Which dialect to choose depends on your preferences, but Google recommends using Standard SQL, as it has several advantages:

  • Flexibility and functionality when working with nested and repeating fields.
  • Support for DML and DDL languages, which allow you to change data in tables as well as manage tables and views in GBQ.
  • Faster processing of large amounts of data compared to Legacy SQL.
  • Support for all current and future updates to BigQuery.

You can find more information on the difference between these dialects in the Google BigQuery documentation.

By default, queries in Google BigQuery run on Legacy SQL. There are several ways to switch to Standard SQL:

1. In the BigQuery interface, in the query editing window, select Show Options and uncheck the box next to Use Legacy SQL:

2. Before the query, add the line #standardSQL, then start the query on a new line:
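
For example, a minimal query with the dialect prefix looks like this:

#standardSQL
SELECT
  1 AS test_value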

Where to begin

So that you can practice as you read this article and run queries along with us, we’ve prepared a table with demo data. Fill out the form below and we’ll send the table to you by email.

Get demo data

Download the demo data and upload it to your Google BigQuery project. The easiest way to do this is using the OWOX BI BigQuery Reports add-on.

1. Install the OWOX BI BigQuery Reports add-on.

2. Open the table that you saved to Google Sheets, and in the Add-ons section, select OWOX BI BigQuery Reports → Upload data to BigQuery:

3. In the window that appears, select your Google BigQuery project and data set, then enter a name for the table in which the uploaded data will be stored.

4. Specify the format of the uploaded data, as shown in the screenshot:

If you don’t have a project in GBQ, create one. To do this, you’ll need an active billing account in Google Cloud Platform. Don’t be put off by the credit card requirement: nothing will be charged without your knowledge. In addition, when you register you’ll receive $300 of credit for 12 months, which you can spend on storing and processing data.

Before moving on to Google BigQuery features, let’s recall what basic queries look like depending on the dialect:

| Operation | Legacy SQL | Standard SQL |
| --- | --- | --- |
| Select fields from a table | SELECT field1, field2 | SELECT field1, field2 |
| Choose the table to select fields from | FROM [projectID:dataSet.tableName] | FROM `projectID.dataSet.tableName` |
| Filter values by a parameter | WHERE field1=value | WHERE field1=value |
| Group results by fields | GROUP BY field1, field2 | GROUP BY field1, field2 |
| Order results | ORDER BY field1 ASC (ascending) or DESC (descending) | ORDER BY field1 ASC (ascending) or DESC (descending) |
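
To make this concrete, here’s a minimal Standard SQL sketch that combines these clauses; the table path and field names (field1, field2) are placeholders for your own:

#standardSQL
SELECT
  field1,
  SUM(field2) AS total_field2
FROM
  `projectID.dataSet.tableName`
GROUP BY
  field1
ORDER BY
  total_field2 DESC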

Google BigQuery features

When building queries, the following groups of functions are most commonly used: aggregate functions, date functions, string functions, and window functions. We’ll consider each group in detail.

Aggregate functions

Aggregate functions return summary values for the entire table. For example, you can use them to calculate the average order value or total income for the month, or to select the segment of users who made the most purchases.

These are the most popular aggregate functions:

| Legacy SQL | Standard SQL | What the function does |
| --- | --- | --- |
| AVG(field) | AVG([DISTINCT] field) | Returns the average value of the field column. In Standard SQL, adding DISTINCT computes the average only over unique (non-duplicate) values of the field column. |
| MAX(field) | MAX(field) | Returns the maximum value in the field column. |
| MIN(field) | MIN(field) | Returns the minimum value in the field column. |
| SUM(field) | SUM(field) | Returns the sum of the values in the field column. |
| COUNT(field) | COUNT(field) | Returns the number of rows in the field column. |
| EXACT_COUNT_DISTINCT(field) | COUNT(DISTINCT field) | Returns the number of unique rows in the field column. |

You can find a list of all functions in the help documentation for Legacy SQL and Standard SQL.

Let’s look at the demo data to see how these functions work. We’ll calculate the average revenue per transaction, the purchases for the largest and smallest amounts, total revenue, and total number of transactions. To check whether purchases are duplicated, we’ll also calculate the number of unique transactions. To do this, we’ll write a query in which we indicate the name of our Google BigQuery project, the data set, and the table.

#legacySQL

SELECT
  AVG(revenue) AS average_revenue,
  MAX(revenue) AS max_revenue,
  MIN(revenue) AS min_revenue,
  SUM(revenue) AS whole_revenue,
  COUNT(transactionId) AS transactions,
  EXACT_COUNT_DISTINCT(transactionId) AS unique_transactions
FROM
  [owox-analytics:t_kravchenko.Demo_data]

#standardSQL

SELECT
  AVG(revenue) AS average_revenue,
  MAX(revenue) AS max_revenue,
  MIN(revenue) AS min_revenue,
  SUM(revenue) AS whole_revenue,
  COUNT(transactionId) AS transactions,
  COUNT(DISTINCT transactionId) AS unique_transactions
FROM
  `owox-analytics.t_kravchenko.Demo_data`

We get the following results:

You can check the calculation results in the source table with demo data using standard Google Sheets functions (SUM, AVG, and others) or pivot tables.

As you can see from the screenshot above, the number of transactions and the number of unique transactions are different. This suggests that in our table there are two transactions for which the transactionId is duplicated:

If you’re interested in unique transactions, use a function that counts unique rows. Alternatively, you can group the data with the GROUP BY clause to remove duplicates before applying the aggregate function, as in the sketch below.
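
A minimal Standard SQL sketch of the GROUP BY approach, where the subquery keeps one row per transactionId before counting:

#standardSQL
SELECT
  COUNT(transactionId) AS unique_transactions
FROM (
  SELECT
    transactionId
  FROM
    `owox-analytics.t_kravchenko.Demo_data`
  GROUP BY
    transactionId)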

Date functions

These functions allow you to process dates: change their format, select the necessary part (day, month, or year), or shift the date by a certain interval. They may be useful in the following cases:

  • When configuring end-to-end analytics, to bring dates and times from different sources into a single format.
  • When creating automatically updated reports or trigger mailings; for example, when you need data for the last two hours, one week, or one month (see the sketch after this list).
  • When creating cohort reports in which it’s necessary to obtain data in the context of days, weeks, or months.
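
As an illustration, here’s a minimal Standard SQL sketch that selects transactions from the last seven days, assuming (as in the demo table) that date_UTC is a timestamp column:

#standardSQL
SELECT
  transactionId,
  revenue
FROM
  `owox-analytics.t_kravchenko.Demo_data`
WHERE
  DATE(date_UTC) >= DATE_SUB(CURRENT_DATE(), INTERVAL 7 DAY)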

These are the most commonly used functions for working with dates:

| Legacy SQL | Standard SQL | Function description |
| --- | --- | --- |
| CURRENT_DATE() | CURRENT_DATE() | Returns the current date in the format %YYYY-%MM-%DD. |
| DATE(timestamp) | DATE(timestamp) | Converts a date from the format %YYYY-%MM-%DD %H:%M:%S to the format %YYYY-%MM-%DD. |
| DATE_ADD(timestamp, interval, interval_units) | DATE_ADD(timestamp, INTERVAL interval interval_units) | Returns the timestamp date increased by the specified interval. In Legacy SQL, interval_units can be YEAR, MONTH, DAY, HOUR, MINUTE, or SECOND; in Standard SQL it can be YEAR, QUARTER, MONTH, WEEK, or DAY. |
| DATE_ADD(timestamp, -interval, interval_units) | DATE_SUB(timestamp, INTERVAL interval interval_units) | Returns the timestamp date decreased by the specified interval. |
| DATEDIFF(timestamp1, timestamp2) | DATE_DIFF(timestamp1, timestamp2, date_part) | Returns the difference between the dates timestamp1 and timestamp2. In Legacy SQL the difference is returned in days; in Standard SQL it depends on the specified date_part (DAY, WEEK, MONTH, QUARTER, or YEAR). |
| DAY(timestamp) | EXTRACT(DAY FROM timestamp) | Returns the day of the month from the timestamp date, from 1 to 31 inclusive. |
| MONTH(timestamp) | EXTRACT(MONTH FROM timestamp) | Returns the month from the timestamp date as a number from 1 to 12 inclusive. |
| YEAR(timestamp) | EXTRACT(YEAR FROM timestamp) | Returns the year from the timestamp date. |

A list of all functions can be found in the documentation for Legacy SQL and Standard SQL.

Let’s consider how each of the above functions works on the demo data. We’ll get the current date, convert the date from the source table to the %YYYY-%MM-%DD format, and subtract one day from it and add one day to it. Then we’ll calculate the difference between the current date and the date from the source table, and break the current date into its components: year, month, and day. To do this, copy the sample queries below and replace the project name, data set, and table with your own.

#legacySQL

SELECT
  CURRENT_DATE() AS today,
  DATE(date_UTC) AS date_UTC_in_YYYYMMDD,
  DATE_ADD(date_UTC, 1, 'DAY') AS date_UTC_plus_one_day,
  DATE_ADD(date_UTC, -1, 'DAY') AS date_UTC_minus_one_day,
  DATEDIFF(CURRENT_DATE(), date_UTC) AS difference_between_date,
  DAY(CURRENT_DATE()) AS the_day,
  MONTH(CURRENT_DATE()) AS the_month,
  YEAR(CURRENT_DATE()) AS the_year
FROM
  [owox-analytics:t_kravchenko.Demo_data]

#standardSQL

SELECT
  today,
  date_UTC_in_YYYYMMDD,
  DATE_ADD(date_UTC_in_YYYYMMDD, INTERVAL 1 DAY) AS date_UTC_plus_one_day,
  DATE_SUB(date_UTC_in_YYYYMMDD, INTERVAL 1 DAY) AS date_UTC_minus_one_day,
  DATE_DIFF(today, date_UTC_in_YYYYMMDD, DAY) AS difference_between_date,
  EXTRACT(DAY FROM today) AS the_day,
  EXTRACT(MONTH FROM today) AS the_month,
  EXTRACT(YEAR FROM today) AS the_year
FROM (
  SELECT
    CURRENT_DATE() AS today,
    DATE(date_UTC) AS date_UTC_in_YYYYMMDD
  FROM
    `owox-analytics.t_kravchenko.Demo_data`)

After running the query, you’ll receive this report:

String functions

String functions allow you to form a string, select and replace substrings, calculate the length of a string, and find the index number of a substring in the source string. For example, with the help of string functions you can:

  • Build report filters on the UTM tags passed in page URLs.
  • Bring data into a single format when source and campaign names are written in different letter cases.
  • Replace incorrect data in a report, for example when a campaign name was passed with a typo (see the sketch after this list).
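
As a quick illustration, here’s a Standard SQL sketch that normalizes case and fixes a hypothetical typo; the campaign column and the misspelled 'black_fridey' value are assumptions for the example, not fields of the demo table:

#standardSQL
SELECT
  REPLACE(LOWER(campaign), 'black_fridey', 'black_friday') AS campaign_clean  -- lowercase, then fix the typo
FROM
  `projectID.dataSet.tableName`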

These are the most popular functions for working with strings:

| Legacy SQL | Standard SQL | Function description |
| --- | --- | --- |
| CONCAT('str1', 'str2') or 'str1' + 'str2' | CONCAT('str1', 'str2') | Combines the strings 'str1' and 'str2' into one string. |
| 'str1' CONTAINS 'str2' | REGEXP_CONTAINS('str1', 'str2') or 'str1' LIKE '%str2%' | Returns true if the string 'str1' contains the string 'str2'. In Standard SQL, 'str2' can be written as a regular expression using the re2 library. |
| LENGTH('str') | CHAR_LENGTH('str') or CHARACTER_LENGTH('str') | Returns the length of the string 'str' (the number of characters). |
| SUBSTR('str', index [, max_len]) | SUBSTR('str', index [, max_len]) | Returns a substring of length max_len, starting at position index in the string 'str'. |
| LOWER('str') | LOWER('str') | Converts all characters in the string 'str' to lowercase. |
| UPPER('str') | UPPER('str') | Converts all characters in the string 'str' to uppercase. |
| INSTR('str1', 'str2') | STRPOS('str1', 'str2') | Returns the position of the first occurrence of 'str2' in 'str1', or 0 if there is no occurrence. |
| REPLACE('str1', 'str2', 'str3') | REPLACE('str1', 'str2', 'str3') | Replaces the substring 'str2' in the string 'str1' with the substring 'str3'. |

You can find more details about all of the string functions in the documentation for Legacy SQL and Standard SQL.

Let’s look at how to use the described functions on the demo data. Suppose we have three separate columns that contain day, month, and year values:

Working with a date in such a format isn’t very convenient, so we’ll merge it into one column. To do this, use the SQL queries below, and don’t forget to substitute the name of your project, data set, and table in Google BigQuery.

#legacySQL

SELECT
  CONCAT(the_day,'-',the_month,'-',the_year) AS mix_string1,
  the_day+'-'+the_month+'-'+the_year AS mix_string2
FROM (
  SELECT
    '31' AS the_day,
    '12' AS the_month,
    '2018' AS the_year
  FROM
    [owox-analytics:t_kravchenko.Demo_data])
GROUP BY
  mix_string1,
  mix_string2

#standardSQL

SELECT
  CONCAT(the_day,'-',the_month,'-',the_year) AS mix_string1
FROM (
  SELECT
    '31' AS the_day,
    '12' AS the_month,
    '2018' AS the_year
  FROM
    `owox-analytics.t_kravchenko.Demo_data`)
GROUP BY
  mix_string1

After executing the query, we’ll get the date in one column:

Often, when a particular page of a site loads, the values of the variables the user has selected are recorded in the URL. This may be a payment or delivery method, a transaction number, the ID of the physical store where the buyer wants to pick up the goods, and so on. Using an SQL query, you can extract these parameters from the page address. Let’s consider two examples of how and why to do this.

Example 1. Suppose we want to know the number of purchases for which users pick up goods from physical stores. To do this, we need to count the transactions sent from pages whose URLs contain the substring shop_id (a physical store ID). We can do this with the following queries:

#legacySQL

SELECT
  COUNT(transactionId) AS transactions,
  check
FROM (
  SELECT
    transactionId,
    page CONTAINS 'shop_id' AS check
  FROM
    [owox-analytics:t_kravchenko.Demo_data])
GROUP BY
  check

#standardSQL

SELECT
  COUNT(transactionId) AS transactions,
  check1,
  check2
FROM (
  SELECT
    transactionId,
    REGEXP_CONTAINS(page, 'shop_id') AS check1,
    page LIKE '%shop_id%' AS check2
  FROM
    `owox-analytics.t_kravchenko.Demo_data`)
GROUP BY
  check1,
  check2

From the resulting table, we see that 5,502 transactions were sent from pages containing shop_id (check = true):

Example 2. Suppose we’ve assigned each delivery method a delivery_id and pass the value of this parameter in the page URL. To find out which delivery method a user chose, we need to extract the delivery_id into a separate column. We can use the following queries for this:

#legacySQL

SELECT
  page_lower_case,
  page_length,
  index_of_delivery_id,
  selected_delivery_id,
  REPLACE(selected_delivery_id, 'selected_delivery_id=', '') AS delivery_id
FROM (
  SELECT
    page_lower_case,
    page_length,
    index_of_delivery_id,
    SUBSTR(page_lower_case, index_of_delivery_id) AS selected_delivery_id
  FROM (
    SELECT
      page_lower_case,
      LENGTH(page_lower_case) AS page_length,
      INSTR(page_lower_case, 'selected_delivery_id') AS index_of_delivery_id
    FROM (
      SELECT
        LOWER(page) AS page_lower_case,
        UPPER(page) AS page_upper_case
      FROM
        [owox-analytics:t_kravchenko.Demo_data])))
ORDER BY
  page_lower_case ASC

#standardSQL

SELECT
  page_lower_case,
  page_length,
  index_of_delivery_id,
  selected_delivery_id,
  REPLACE(selected_delivery_id, 'selected_delivery_id=', '') AS delivery_id
FROM (
  SELECT
    page_lower_case,
    page_length,
    index_of_delivery_id,
    SUBSTR(page_lower_case, index_of_delivery_id) AS selected_delivery_id
  FROM (
    SELECT
      page_lower_case,
      CHAR_LENGTH(page_lower_case) AS page_length,
      STRPOS(page_lower_case, 'selected_delivery_id') AS index_of_delivery_id
    FROM (
      SELECT
        LOWER(page) AS page_lower_case,
        UPPER(page) AS page_upper_case
      FROM
        `owox-analytics.t_kravchenko.Demo_data`)))
ORDER BY
  page_lower_case ASC

As a result, we get the following table in Google BigQuery:
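
As a side note, in Standard SQL the same parameter can usually be pulled out more compactly with the regular expression function REGEXP_EXTRACT; a sketch, assuming the parameter is passed as selected_delivery_id=<number> in the URL:

#standardSQL
SELECT
  REGEXP_EXTRACT(LOWER(page), r'selected_delivery_id=(\d+)') AS delivery_id  -- capture the numeric value
FROM
  `owox-analytics.t_kravchenko.Demo_data`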

Window functions

These functions are similar to the aggregate functions discussed above. The main difference is that window functions perform calculations not on the entire set of data selected by the query but on a part of it: a subset, or window.

Using window functions, you can aggregate data across groups without using the JOIN operator to combine multiple queries. For example, we can calculate the average income per advertising campaign or the number of transactions per device. By adding another field to the report, you can easily find out, say, the share of revenue from an advertising campaign on Black Friday or the share of transactions made from a mobile application, as in the sketch below.
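
The revenue-share idea can be sketched in Standard SQL like this; the campaign column here is an assumption for illustration, not a field of the demo table:

#standardSQL
SELECT
  campaign,
  revenue,
  revenue / SUM(revenue) OVER () AS share_of_total_revenue,  -- share of all revenue in the report
  revenue / SUM(revenue) OVER (PARTITION BY campaign) AS share_within_campaign  -- share within one campaign
FROM
  `projectID.dataSet.tableName`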

Together with each window function in the query, you must write an OVER clause, which defines the boundaries of the window. OVER can contain three components:

  • PARTITION BY — Defines the attribute by which the source data is divided into subsets, for example PARTITION BY clientId, DayTime.
  • ORDER BY — Determines the order of rows in a subset, for example ORDER BY hour DESC.
  • WINDOW FRAME — Lets you process a range of rows within a subset. For example, you can sum not all rows in the window but only the five rows before the current one (see the sketch after this list).
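
For instance, here’s a minimal Standard SQL sketch of a window frame that sums revenue over the current row and the five rows before it, ordered by hour:

#standardSQL
SELECT
  clientId,
  hour,
  SUM(revenue) OVER (
    PARTITION BY clientId
    ORDER BY hour
    ROWS BETWEEN 5 PRECEDING AND CURRENT ROW) AS rolling_revenue  -- current row plus the five preceding rows
FROM
  `owox-analytics.t_kravchenko.Demo_data`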

In this table, we’ve compiled the most commonly used window functions:

| Legacy SQL | Standard SQL | Function description |
| --- | --- | --- |
| AVG(field), COUNT(field), COUNT(DISTINCT field), MAX(field), MIN(field), SUM(field) | AVG([DISTINCT] field), COUNT(field), COUNT([DISTINCT] field), MAX(field), MIN(field), SUM(field) | Return the average, count, maximum, minimum, and total value of the field column within the selected subset. DISTINCT is used to count only unique (distinct) values. |
| DENSE_RANK() | DENSE_RANK() | Returns the rank of the row within a subset. |
| FIRST_VALUE(field) | FIRST_VALUE(field [{RESPECT/IGNORE} NULLS]) | Returns the value of the first row of the field column within a subset. By default, rows with empty values in the field column are included in the calculation. RESPECT or IGNORE NULLS determines whether to include or ignore rows with NULL values. |
| LAST_VALUE(field) | LAST_VALUE(field [{RESPECT/IGNORE} NULLS]) | Returns the value of the last row of the field column within a subset. By default, rows with empty values in the field column are included in the calculation. RESPECT or IGNORE NULLS determines whether to include or ignore rows with NULL values. |
| LAG(field) | LAG(field [, offset [, default_expression]]) | Returns the value of the previous row (relative to the current one) of the field column within the subset. offset is an integer that sets how many rows back from the current row to look. default_expression is the value the function returns if the required row does not exist within the subset. |
| LEAD(field) | LEAD(field [, offset [, default_expression]]) | Returns the value of the next row (relative to the current one) of the field column within the subset. offset is an integer that sets how many rows forward from the current row to look. default_expression is the value the function returns if the required row does not exist within the subset. |

You can see a list of all functions in the documentation for Legacy SQL and Standard SQL: aggregate analytic functions, navigation functions.

Example 1. Suppose we want to analyze the activity of customers during business and non-business hours. To do this, we need to divide transactions into two groups and calculate the metrics of interest to us:

  • Group 1 — Purchases during business hours from 9:00 to 18:00
  • Group 2 — Purchases after hours from 00:00 to 9:00 and from 18:00 to 00:00

In addition to business and non-business hours, another attribute for forming the window will be the clientId. That is, for each user we’ll have two windows:

| Subset (window) | clientId | DayTime |
| --- | --- | --- |
| Window 1 | clientId 1 | Business hours |
| Window 2 | clientId 1 | Non-business hours |
| Window 3 | clientId 2 | Business hours |
| Window 4 | clientId 2 | Non-business hours |
| Window N | clientId N | Business hours |
| Window N+1 | clientId N+1 | Non-business hours |

Let’s calculate the average, maximum, minimum, and total income, the number of transactions, and the number of unique transactions for each user during business and non-business hours in the demo data. The queries below will help us do this.

#legacySQL

SELECT
  date,
  clientId,
  DayTime,
  avg_revenue,
  max_revenue,
  min_revenue,
  sum_revenue,
  transactions,
  unique_transactions
FROM (
  SELECT
    date,
    clientId,
    DayTime,
    AVG(revenue) OVER (PARTITION BY date, clientId, DayTime) AS avg_revenue,
    MAX(revenue) OVER (PARTITION BY date, clientId, DayTime) AS max_revenue,
    MIN(revenue) OVER (PARTITION BY date, clientId, DayTime) AS min_revenue,
    SUM(revenue) OVER (PARTITION BY date, clientId, DayTime) AS sum_revenue,
    COUNT(transactionId) OVER (PARTITION BY date, clientId, DayTime) AS transactions,
    COUNT(DISTINCT(transactionId)) OVER (PARTITION BY date, clientId, DayTime) AS unique_transactions
  FROM (
    SELECT
      date,
      date_UTC,
      clientId,
      transactionId,
      revenue,
      page,
      hour,
      CASE
        WHEN hour>=9 AND hour<=18 THEN 'business hours'
        ELSE 'non-business hours'
      END AS DayTime
    FROM
      [owox-analytics:t_kravchenko.Demo_data]))
GROUP BY
  date,
  clientId,
  DayTime,
  avg_revenue,
  max_revenue,
  min_revenue,
  sum_revenue,
  transactions,
  unique_transactions
ORDER BY
  transactions DESC

#standardSQL

SELECT
  date,
  clientId,
  DayTime,
  avg_revenue,
  max_revenue,
  min_revenue,
  sum_revenue,
  transactions,
  unique_transactions
FROM (
  SELECT
    date,
    clientId,
    DayTime,
    AVG(revenue) OVER (PARTITION BY date, clientId, DayTime) AS avg_revenue,
    MAX(revenue) OVER (PARTITION BY date, clientId, DayTime) AS max_revenue,
    MIN(revenue) OVER (PARTITION BY date, clientId, DayTime) AS min_revenue,
    SUM(revenue) OVER (PARTITION BY date, clientId, DayTime) AS sum_revenue,
    COUNT(transactionId) OVER (PARTITION BY date, clientId, DayTime) AS transactions,
    COUNT(DISTINCT(transactionId)) OVER (PARTITION BY date, clientId, DayTime) AS unique_transactions
  FROM (
    SELECT
      date,
      date_UTC,
      clientId,
      transactionId,
      revenue,
      page,
      hour,
      CASE
        WHEN hour>=9 AND hour<=18 THEN 'business hours'
        ELSE 'non-business hours'
      END AS DayTime
    FROM
      `owox-analytics.t_kravchenko.Demo_data`))
GROUP BY
  date,
  clientId,
  DayTime,
  avg_revenue,
  max_revenue,
  min_revenue,
  sum_revenue,
  transactions,
  unique_transactions
ORDER BY
  transactions DESC

Let’s see the result, using the example of the user with clientId = 102041117.1428132012. In the source table for this user, we have the following data:

After running the query, we receive a report that contains the average, minimum, maximum, and total income for this user, as well as the number of transactions. As can be seen in the screenshot below, the user made two transactions, both during business hours:

Example 2. Now let’s complicate the task a little:

  • We’ll assign sequence numbers to all transactions in the window according to the time they were executed. Recall that we define a window by user and business/non-business hours.
  • We’ll display in the report the income of the next and previous transactions (relative to the current one) within the window.
  • We’ll display the income of the first and last transactions in the window.

To do this, use the following queries:

#legacySQL

SELECT
  date,
  clientId,
  DayTime,
  hour,
  rank,
  revenue,
  lead_revenue,
  lag_revenue,
  first_revenue_by_hour,
  last_revenue_by_hour
FROM (
  SELECT
    date,
    clientId,
    DayTime,
    hour,
    DENSE_RANK() OVER (PARTITION BY date, clientId, DayTime ORDER BY hour) AS rank,
    revenue,
    LEAD( revenue, 1) OVER (PARTITION BY date, clientId, DayTime ORDER BY hour) AS lead_revenue,
    LAG( revenue, 1) OVER (PARTITION BY date, clientId, DayTime ORDER BY hour) AS lag_revenue,
    FIRST_VALUE(revenue) OVER (PARTITION BY date, clientId, DayTime ORDER BY hour) AS first_revenue_by_hour,
    LAST_VALUE(revenue) OVER (PARTITION BY date, clientId, DayTime ORDER BY hour) AS last_revenue_by_hour
  FROM (
    SELECT
      date,
      date_UTC,
      clientId,
      transactionId,
      revenue,
      page,
      hour,
      CASE
        WHEN hour>=9 AND hour<=18 THEN 'business hours'
        ELSE 'non-business hours'
      END AS DayTime
    FROM
      [owox-analytics:t_kravchenko.Demo_data]))
GROUP BY
  date,
  clientId,
  DayTime,
  hour,
  rank,
  revenue,
  lead_revenue,
  lag_revenue,
  first_revenue_by_hour,
  last_revenue_by_hour
ORDER BY
  date,
  clientId,
  DayTime,
  hour,
  rank,
  revenue,
  lead_revenue,
  lag_revenue,
  first_revenue_by_hour,
  last_revenue_by_hour

#standardSQL

SELECT
  date,
  clientId,
  DayTime,
  hour,
  rank,
  revenue,
  lead_revenue,
  lag_revenue,
  first_revenue_by_hour,
  last_revenue_by_hour
FROM (
  SELECT
    date,
    clientId,
    DayTime,
    hour,
    DENSE_RANK() OVER (PARTITION BY date, clientId, DayTime ORDER BY hour) AS rank,
    revenue,
    LEAD( revenue, 1) OVER (PARTITION BY date, clientId, DayTime ORDER BY hour) AS lead_revenue,
    LAG( revenue, 1) OVER (PARTITION BY date, clientId, DayTime ORDER BY hour) AS lag_revenue,
    FIRST_VALUE(revenue) OVER (PARTITION BY date, clientId, DayTime ORDER BY hour) AS first_revenue_by_hour,
    LAST_VALUE(revenue) OVER (PARTITION BY date, clientId, DayTime ORDER BY hour) AS last_revenue_by_hour
  FROM (
    SELECT
      date,
      date_UTC,
      clientId,
      transactionId,
      revenue,
      page,
      hour,
      CASE
        WHEN hour>=9 AND hour<=18 THEN 'business hours'
        ELSE 'non-business hours'
      END AS DayTime
    FROM
      `owox-analytics.t_kravchenko.Demo_data`))
GROUP BY
  date,
  clientId,
  DayTime,
  hour,
  rank,
  revenue,
  lead_revenue,
  lag_revenue,
  first_revenue_by_hour,
  last_revenue_by_hour
ORDER BY
  date,
  clientId,
  DayTime,
  hour,
  rank,
  revenue,
  lead_revenue,
  lag_revenue,
  first_revenue_by_hour,
  last_revenue_by_hour

We’ll check the results using the user we’re already familiar with, clientId = 102041117.1428132012:

From the screenshot above, we see that:

  • The first transaction was at 15:00 and the second at 16:00.
  • After the current transaction at 15:00, there was a transaction at 16:00 for the amount of 25,066 (the lead_revenue column).
  • Before the current transaction at 16:00, there was a transaction at 15:00 for the amount of 3,699 (the lag_revenue column).
  • The first transaction in the window was the one at 15:00 for 3,699 (the first_revenue_by_hour column).
  • The query processes rows one at a time, so for the transaction in question the current row is the last one in the window, and the values in the last_revenue_by_hour and revenue columns are the same (see the note after this list).
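
A brief note on why that happens: when a window function has an ORDER BY, the default window frame ends at the current row, so LAST_VALUE returns the current row’s value. To get the true last value in the window, you can spell out the frame explicitly; a minimal sketch:

#standardSQL
SELECT
  clientId,
  hour,
  revenue,
  LAST_VALUE(revenue) OVER (
    PARTITION BY clientId
    ORDER BY hour
    ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS last_revenue  -- frame spans the whole window
FROM
  `owox-analytics.t_kravchenko.Demo_data`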

Wrapping up

In this article, we looked at the most popular aggregate, date, string, and window functions. However, Google BigQuery has many more useful functions:

  • Casting functions allow you to cast data to a specific format (see the sketch after this list).
  • Table wildcard functions allow you to access several tables from a data set.
  • Regular expression functions allow you to describe the search query model and not its exact value.
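
For example, a cast in Standard SQL can look like this minimal sketch:

#standardSQL
SELECT
  CAST('2018-12-31' AS DATE) AS parsed_date,    -- string to date
  CAST(25066 AS STRING) AS revenue_as_string    -- number to string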

We’ll definitely write more about these functions on our blog. For now, you can try all the functions described in this article on the demo data.
