The Three Most Useful Window Functions in BigQuery

Running Total. Moving Average. Percent of Total.

As an analyst, you'll find yourself calculating these metrics all the time, and yet they can be some of the most cumbersome to produce. You typically have to export raw data to CSV and then manipulate it in a spreadsheet to calculate them.

However, if you’re using Google BigQuery you can do it all inside the query composer.

This is especially helpful if you’re connecting your queries directly to a visualization tool like Google Data Studio or Tableau and want to pass the tool one of these calculated metrics directly.

A normal aggregate function in BigQuery will collapse an entire column or result set into a single value, but BigQuery also comes with a class of window functions that allow you to perform calculations over a specific subset (or “window”) of your results. The power here lies in the fact that you can do math on the data in a table, but do it only over certain rows, based on the instructions you provide in your window function.
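To make the distinction concrete, here is a minimal sketch of the difference (the table and column names are placeholders for illustration, not a real dataset):

SELECT
  SUM(revenue) AS total_revenue  -- a plain aggregate: collapses every row into one grand total
FROM
  my_table

SELECT
  order_date,
  revenue,
  SUM(revenue) OVER(ORDER BY order_date) AS running_total  -- a window function: every row is kept, each with its own running total
FROM
  my_table

The first query returns a single number; the second returns every row, each with a calculated value that only looks at the rows inside its window.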

Here are three of the most useful BigQuery window functions that you can use as an analyst:

  • Use SUM() OVER() to calculate a running total of a numeric column as time progresses, and see what the total is at each point in time
  • Use AVG() OVER() to calculate a moving average of a numeric column as time progresses, and see what the average is at each point in time
  • Use RATIO_TO_REPORT() OVER() to calculate, for each number in a column, what percent of the total column that number represents

We’ll walk through the first two examples in more depth using the public Wikipedia sample data inside BigQuery. This dataset contains a revision history of all Wikipedia articles (up through April 2010). It’s a fairly large table – just over 313M rows.

For the third example, we’ll use the public NYC yellow taxi trip data from 2016 inside BigQuery. This dataset contains a row for each ride that someone took in a yellow cab in New York City in 2016.

Example 1: Running Total

Each row inside the Wikipedia table represents a single revision to a single article.

If we wanted to see how many revisions had taken place per day (all the way from Wikipedia’s launch through April 2010) we could use this query:

SELECT
  DATE(SEC_TO_TIMESTAMP(timestamp)) AS date_of_revision,
  COUNT(*) AS num_revisions_on_day
FROM
  [bigquery-public-data:samples.wikipedia]
GROUP BY
  date_of_revision
ORDER BY
  date_of_revision

(Click here to open this query inside BigQuery and see the results.)

We could then dump the results into a Google sheet and calculate our running total with the formulas shown in column D:

To accomplish all of this spreadsheet work with a single query, we can instead write…

SELECT
  date_of_revision,
  SUM(num_revisions_on_day) OVER(ORDER BY date_of_revision) AS running_total_revisions
FROM (
  SELECT
    DATE(SEC_TO_TIMESTAMP(timestamp)) AS date_of_revision,
    COUNT(*) AS num_revisions_on_day
  FROM
    [bigquery-public-data:samples.wikipedia]
  GROUP BY
    date_of_revision
  ORDER BY
    date_of_revision )

(Click here to open this query inside BigQuery and see the results.)

…and BigQuery will return this table of results:

Nice! An exact match to our calculation from the spreadsheet.

Here is how that query works:

  • We start by taking our results table and placing it inside a subquery, i.e. SELECT ... FROM (first query)
  • We’re ultimately trying to create a new results table that has one line per day, so we need to SELECT date_of_revision
  • We’re getting a running total of the number of revisions per day, so we write SUM(num_revisions_on_day)
  • We want the total to accumulate one additional day at a time. And since a running total is, by definition, calculated in a specific order, we need to specify the order in which to accumulate it. The solution is the OVER clause, specifying that we will ORDER BY date_of_revision.

Using the SUM function here in combination with the OVER clause is what separates this from a simple addition of all numbers in the column. We use the OVER clause to tell BigQuery that we want to do a SUM, but only over a particular window.
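As an aside, the queries in this post are written in BigQuery’s legacy SQL dialect. If you prefer standard SQL, a roughly equivalent sketch of the same running total might look like this (swapping SEC_TO_TIMESTAMP for TIMESTAMP_SECONDS and using the backtick-quoted table name):

SELECT
  date_of_revision,
  SUM(num_revisions_on_day) OVER(ORDER BY date_of_revision) AS running_total_revisions
FROM (
  SELECT
    DATE(TIMESTAMP_SECONDS(timestamp)) AS date_of_revision,
    COUNT(*) AS num_revisions_on_day
  FROM
    `bigquery-public-data.samples.wikipedia`
  GROUP BY
    date_of_revision )
ORDER BY
  date_of_revision

The window function itself is unchanged; only the dialect-specific pieces around it differ.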

Example 2: Moving Average

If we wanted to get a simple look at how revision volume has changed over time, a great starting point would be to look at a moving (a.k.a. “rolling”) average.

We could plot just the raw volume that has taken place each day, but day-by-day variability can be quite high.

Plotting that data for Wikipedia revisions shows a noisy series that potentially hides valuable insights:

However, if we plot a 7-day moving average, much of the noise generated by “outlier” days will disappear and we’ll be left with a smoother curve that could potentially reveal additional trends and insights:

Here is the query that calculates the moving average:

SELECT
  date_of_revision,
  AVG(num_revisions_on_day) OVER(ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) AS moving_7_day_average
FROM (
  SELECT
    DATE(SEC_TO_TIMESTAMP(timestamp)) AS date_of_revision,
    COUNT(*) AS num_revisions_on_day
  FROM
    [bigquery-public-data:samples.wikipedia]
  GROUP BY
    date_of_revision
  ORDER BY
    date_of_revision )

(Click here to open this query inside BigQuery and see the results.)

This query is very similar to the query we used to calculate a running total, but there are a few key differences:

  1. Since we are calculating an average here, we use the AVG function, instead of the SUM function.
  2. We need to tell the AVG function which rows it will be averaging. So, we tell it to look at the num_revisions_on_day field in the inner results table (in the subquery), and for each row inside that table, take the average of that CURRENT ROW and the 6 PRECEDING rows. The final syntax looks like this: ROWS BETWEEN 6 PRECEDING AND CURRENT ROW.

Note: the inner subquery already includes an ORDER BY clause, so the results of that query are already sorted in ascending order. That’s perfect for our outer query, which looks at those inner results and uses AVG with an OVER clause to do math on them.

If that inner results table were not already nicely ordered, we would need to adjust the OVER clause to read OVER(ORDER BY date_of_revision ROWS BETWEEN 6 PRECEDING AND CURRENT ROW).
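One nice property of this frame syntax is that changing the size of the window is a one-line edit. For example, a sketch of a 30-day moving average would only change the frame bounds (shown here in the explicit ORDER BY form, which doesn’t rely on the inner query’s ordering):

  AVG(num_revisions_on_day) OVER(ORDER BY date_of_revision ROWS BETWEEN 29 PRECEDING AND CURRENT ROW) AS moving_30_day_average

Everything else in the query stays exactly the same.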

Example 3: Percent of Total

As analysts we’re often asked to present data in tabular format, with an additional column for the “percent of total” to help put raw volume into perspective.

Let’s jump over to the 2016 NYC yellow taxi dataset to see how we might implement this in BigQuery.

Suppose we wanted to see the spread of trip duration over all trips in the dataset to see what the most common trip durations were. Let’s also slice this data by the payment type (credit card or cash) to see if the payment type used is affected by the duration of a ride.

Here is a query that we can use:

SELECT
  trip_minutes,
  payment_type,
  num_trips,
  RATIO_TO_REPORT(num_trips) OVER(PARTITION BY trip_minutes) AS percent_of_total
FROM (
  SELECT
    ROUND((dropoff_datetime - pickup_datetime)/1000000/60) AS trip_minutes,
    payment_type,
    COUNT(*) AS num_trips
  FROM
    [bigquery-public-data:new_york.tlc_yellow_trips_2016]
  WHERE
    dropoff_datetime > pickup_datetime
    AND (payment_type = 1 OR payment_type = 2)
  GROUP BY
    trip_minutes,
    payment_type )
ORDER BY
  trip_minutes,
  payment_type

(Click here to open this query inside BigQuery and see the results.)

Once again, we’re using the OVER clause in combination with a window function – this time, it’s RATIO_TO_REPORT. But there’s a lot going on here, so let’s explore each piece of this query.

First, we have an inner query that totals up the number of trips in 2016 per payment type, per trip duration (by rounded minutes):

SELECT
  ROUND((dropoff_datetime - pickup_datetime)/1000000/60) AS trip_minutes,
  payment_type,
  COUNT(*) AS num_trips
FROM
  [bigquery-public-data:new_york.tlc_yellow_trips_2016]
WHERE
  dropoff_datetime > pickup_datetime
  AND (payment_type = 1 OR payment_type = 2)
GROUP BY
  trip_minutes,
  payment_type

(Click here to open this query inside BigQuery and see the results.)

That query produces a results table like this:

A few notes about different parts of this query:

  • To get trip_minutes, we calculate the difference between the pickup and dropoff datetimes. That returns a difference in microseconds, so we divide by 1,000,000 and again by 60 to get the number of minutes for the trip. We finish with ROUND to round to the nearest integer – this creates fewer rows of results and makes aggregate analysis easier. (There’s a quick sanity check of this arithmetic just after this list.)
  • We need to include WHERE dropoff_datetime > pickup_datetime just to clean the data up a bit. (There is a small amount of data in the table that seems nonsensical; for example, trips where the dropoff takes place before the pickup, or where the total_amount of the trip is a negative number of dollars.)
  • We only ask for payment_types of 1 and 2 because those represent credit card and cash, respectively. 99.5% of NYC yellow taxi trips are paid for via these two methods, so we’ll filter out the other 0.5% of payment methods. This will allow for easier analysis when we look at percent-of-total results by payment type.
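As that quick sanity check of the microseconds-to-minutes arithmetic (the 754-second trip here is just a made-up example): a trip of 754,000,000 microseconds works out to about 12.57 minutes, which ROUND turns into 13:

SELECT
  ROUND(754000000 / 1000000 / 60) AS trip_minutes  -- 754,000,000 µs = 754 s ≈ 12.57 min, rounds to 13.0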

The goal of this example is to add one more column to the right of that inner results table – a percent of total column. But how do we do that?

With the RATIO_TO_REPORT function and the OVER clause!

When combined, they calculate the ratio of each row to the rest of the report. In other words: the percent of the total that each row represents.

However, since we’re interested in whether the trip duration affects the payment type used, we’ll want to do our “percent of total” math within each trip_minutes value. In other words, we want to answer this question:

  • For each trip_minutes value, what percent of trips were paid for with a credit card, and what percent were paid for with cash?

This is where PARTITION BY comes in. To “partition” essentially means to “divide into parts”, or in BigQuery terms, to “do an action, but do it separately over different groups”.

Since we’re interested in the breakdown of payment_type for each trip_minutes value, we’re aiming to do the RATIO_TO_REPORT action over separate trip_minutes groups.

In BigQuery, the syntax is RATIO_TO_REPORT(num_trips) OVER(PARTITION BY trip_minutes)

The result is a table like this:

As we can see, there are two rows for each trip_minutes value, and the percent_of_total values in each pair of rows add up to 100%.

And now that we have our percent of total data, we can export to a spreadsheet and quickly see that different trip_minutes values tend towards different payment method ratios:
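One last note: RATIO_TO_REPORT is a legacy SQL function, so it isn’t available if you switch to standard SQL. In that case, a rough sketch of the same percent-of-total math divides each row’s count by a partitioned SUM window. (The trip_counts name below is just a placeholder for the inner query from earlier; that inner query would also need standard SQL table names and timestamp math.)

SELECT
  trip_minutes,
  payment_type,
  num_trips,
  num_trips / SUM(num_trips) OVER(PARTITION BY trip_minutes) AS percent_of_total  -- same ratio that RATIO_TO_REPORT produces
FROM
  trip_counts
ORDER BY
  trip_minutes,
  payment_type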
