Running Total. Moving Average. Percent of Total.
As an analyst, these are metrics that you commonly find yourself calculating – and yet they can be some of the most cumbersome to calculate. You typically have to export raw data to CSV and then manipulate the data in a spreadsheet to calculate them.
However, if you’re using Google BigQuery you can do it all inside the query composer.
Normal BigQuery functions will perform an action over an entire column or dataset, but BigQuery comes with a class of window functions that allow you to perform calculations over a specific subset of your results. The power here lies in the fact that you can do math on data in a table, but do it only on certain rows, based on the instructions you provide via your window function.
Here are three of the most useful BigQuery window functions that you can use as an analyst:
SUM() OVER()to calculate a running total of a numeric column as time progresses, and see what the total is at each point in time
AVG() OVER()to calculate a moving average of a numeric column as time progresses, and see what the average is at each point in time
RATIO_TO_REPORT() OVER()to calculate, for each number in a column, what percent of the total column that number represents
We’ll walk through the first two examples in more depth using the public Wikipedia sample data inside BigQuery. This dataset contains a revision history of all Wikipedia articles (up through April 2010). It’s a fairly large table – just over 313M rows.
For the third example, we’ll use the public NYC yellow taxi trip data from 2016 inside BigQuery. This dataset contains a row for each ride that someone took in a yellow cab in New York City in 2016.
Example 1: Running Total
Each row inside the Wikipedia table represents a single revision to a single article.
If we wanted to see how many revisions had taken place per day (all the way from Wikipedia’s launch through April 2010) we could use this query:
SELECT DATE(SEC_TO_TIMESTAMP(timestamp)) AS date_of_revision, COUNT(*) AS num_revisions_on_day FROM [bigquery-public-data:samples.wikipedia] GROUP BY date_of_revision ORDER BY date_of_revision
We could then dump the results into a Google sheet and calculate our running total with the formulas shown in column D:
To accomplish all of this spreadsheet work with a single query, we can instead write…
SELECT date_of_revision, SUM(num_revisions_on_day) OVER(ORDER BY date_of_revision) AS running_total_revisions FROM ( SELECT DATE(SEC_TO_TIMESTAMP(timestamp)) AS date_of_revision, COUNT(*) AS num_revisions_on_day FROM [bigquery-public-data:samples.wikipedia] GROUP BY date_of_revision ORDER BY date_of_revision )
…and BigQuery will return this table of results:
Nice! An exact match to our calculation from the spreadsheet.
Here is how that query works:
- We start by taking our results table and placing that inside a subquery. I.e.:
SELECT ... FROM (first query)
- We’re ultimately trying to create a new results table that has one line per day, so we need to
- We’re getting a running total of the number of revisions per day, so we’ll write
- We only want the total to be calculated over one additional day at a time. Also, since running totals are purposefully calculated in a specific order, we need to specify the order in which we’d like to calculate our running total. The solution is to use the
OVERfunction, specifying that we will
ORDER BY date_of_revision.
SUM function here in combination with the
OVER function is what separates this from a simple addition of all numbers in the column. We use the
OVER function to tell BigQuery that we want to do a
SUM, but only over a particular window.
Example 2: Moving Average
If we wanted to get a simple look at how revision volume has changed over time, a great starting point would be to look at a moving (a.k.a. “rolling”) average.
We could plot just the raw volume that has taken place each day, but day-by-day variability can be quite high.
Plotting that data for Wikipedia revisions shows a noisy series that potentially hides valuable insights:
However, if we plot a 7-day moving average, much of the noise generated by “outlier” days will disappear and we’ll be left with a smoother curve that could potentially reveal additional trends and insights:
Here is the query that calculates the moving average:
SELECT date_of_revision, AVG(num_revisions_on_day) OVER(ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) as moving_7_day_average FROM ( SELECT DATE(SEC_TO_TIMESTAMP(timestamp)) AS date_of_revision, COUNT(*) AS num_revisions_on_day FROM [bigquery-public-data:samples.wikipedia] GROUP BY date_of_revision ORDER BY date_of_revision )
This query is very similar to the query we used to calculate a running total, but there are a few key differences:
- Since we are calculating an average here, we use the
AVGfunction, instead of the
- We need to tell the AVG function which rows it will be averaging. So, we tell it to look at the
num_revisions_on_dayfield in the inner results table (in the subquery), and for each row inside that table, take the average of that
CURRENT ROWand the
6 PRECEDINGrows. The final syntax looks like this:
ROWS BETWEEN 6 PRECEDING AND CURRENT ROW.
Note: the inner subquery already includes an
ORDER BY clause, so the results of that query are already ordered in ascending order. That’s perfect for our outer query that looks at those inner results and uses the
OVER window function to do math on them.
If that inner results table was not already nicely ordered, we would need to adjust the
OVER function to read
OVER(ORDER BY date_of_revision ROWS BETWEEN 6 PRECEDING AND CURRENT ROW).
Example 3: Percent of Total
As analysts we’re often asked to present data in tabular format, with an additional column for the “percent of total” to help put raw volume into perspective.
Let’s jump over to the 2016 NYC yellow taxi dataset to see how we might implement this in BigQuery.
Suppose we wanted to see the spread of trip duration over all trips in the dataset to see what the most common trip durations were. Let’s also slice this data by the payment type (credit card or cash) to see if the payment type used is affected by the duration of a ride.
Here is a query that we can use:
SELECT trip_minutes, payment_type, num_trips, RATIO_TO_REPORT(num_trips) OVER(PARTITION BY trip_minutes) AS percent_of_total FROM ( SELECT ROUND((dropoff_datetime - pickup_datetime)/1000000/60) AS trip_minutes, payment_type, COUNT(*) AS num_trips FROM [bigquery-public-data:new_york.tlc_yellow_trips_2016] WHERE dropoff_datetime > pickup_datetime AND (payment_type = 1 OR payment_type = 2) GROUP BY trip_minutes, payment_type ) ORDER BY trip_minutes, payment_type
Once again, we’re using the
OVER function in combination with a window function – this time, it’s
RATIO_TO_REPORT. But there’s a lot going on here, so let’s explore each piece of this query.
First, we have an inner query that totals up the number of trips in 2016 per payment type, per trip duration (by rounded minutes):
SELECT ROUND((dropoff_datetime - pickup_datetime)/1000000/60) AS trip_minutes, payment_type, COUNT(*) AS num_trips FROM [bigquery-public-data:new_york.tlc_yellow_trips_2016] WHERE dropoff_datetime > pickup_datetime AND (payment_type = 1 OR payment_type = 2) GROUP BY trip_minutes, payment_type
That query produces a results table like this:
A few notes about different parts of this query:
- To get
trip_minutes, we calculate the difference in pickup and dropoff datetimes. That returns a difference in microseconds, so we divide by 1,000,000 and again by 60 to get the number of minutes for the trip. We finish with
ROUNDto round to the nearest integer – this creates fewer rows of results and makes aggregate analysis easier.
- We need to include
WHERE dropoff_datetime > pickup_datetimejust to clean the data up a bit. (There is a small amount of data in the table that seems nonsensical; for ex: trips where the dropoff takes place before the pickup, or where the
total_amountof the trip costs a negative number of dollars.)
- We only ask for
2because those represent credit card and cash, respectively. 99.5% of NYC yellow taxi trips are paid for via these methods, so we’ll filter out data from the other 0.5% of payment methods. This will allow for easier analysis when we look at percent-of-total results by payment type.
The goal of this example is to add one more column to the right of that inner results table – a percent of total column. But how do we do that?
When combined together, we’re able to calculate the ratio of each row to the rest of the report! In other words: the percent that each row consists of the total.
However, since we’re interested in whether the trip duration affects the payment type used, we’ll want to do our “percent of total” math within each
trip_minutes value. In other words, we want to answer this question:
- For each
trip_minutesvalue, what percent of trips were paid for with a credit card, and what percent were paid for with cash?
This is where the
PARTITION function comes in. As a function in BigQuery, “partition” is a verb that essentially means “divide into parts” or “do an action, but do it separately over different groups”.
Since we’re interested in the breakdown of
payment_type for each
trip_minutes value, we’re aiming to do the
RATIO_TO_REPORT action over separate
In BigQuery, the syntax is
RATIO_TO_REPORT(num_trips) OVER(PARTITION BY trip_minutes)
The result is a table like this:
As we can see, there are two rows for each
trip_minutes value, and for each pair of successive rows, the
percent_of_total value adds up to 100%.
And now that we have our percent of total data, we can export to a spreadsheet and quickly see that different
trip_minutes values tend towards different payment method ratios: