Analysts' Guide
Analysts are responsible for researching the stats and trends used throughout the Almanac. Analysts work closely with authors and reviewers during the planning phase to give direction on the types of stats that are possible from the dataset, and during the analyzing/writing phases to ensure that the stats are used correctly.
- Content planning: about 4 hours working with the content team to identify metrics from the HTTP Archive that could be used to support the chapter. This time is also spent preparing queries and augmenting the testing infrastructure as needed (custom metrics). This phase typically occurs in June-July.
- Data validation: about 6 hours finalizing queries, running them against the dataset, and building a spreadsheet of results for the rest of the content team to review. This phase typically occurs in July-September.
- Content writing: about 2 hours reviewing the drafts to ensure that the results are being interpreted correctly. This phase typically occurs in September-October.
The total commitment is approximately 12 hours of work per chapter, spread over several months.
The 2021 project is underway and we're actively looking for analysts! Browse the list of open chapters and comment in any chapter-specific issues that interest you.
The Analysts team will be given BigQuery coupon codes to offset the cost of analysis. To get your code, join the #web-almanac-analysts channel on the HTTP Archive Slack and ask @Paul Calvano.
There could be dozens of queries per chapter, so to help stay organized, here is a recommended workflow for analysts to get the queries written and the results shared.
- When the chapter is outlined, work with the content team to identify the quantifiable metrics that would be needed.
- Add each metric to the Metrics section using data-oriented descriptions of what it's measuring, for example "% of mobile pages that have images larger than 10 MB".
- Triage the feasibility of each metric based on the capabilities of the HTTP Archive dataset. Feel free to ask around in #web-albanac-analysts if you're not sure whether a metric is feasible.
- Create a git branch off main named <chapter>-sql-<yyyy> (or similar) and create a subdirectory for your chapter under the sql directory named <yyyy>/<chapter-number>_<chapter-name>. For example, 2020/01_CSS (use leading 0s for the chapter number to ensure that they appear in sorted order). See the sketch after this list for an example of these steps.
- Open up a pull request to merge with the main branch, but leave it in draft mode for now. Edit the PR description to contain a checklist of all of the metrics you triaged as feasible in the doc. See #1087, for example.
- As you write the queries for each metric, add a comment to the top of the query using the description from step 2, and give the file a short and descriptive name, like pct_large_images.sql. (Note that in 2019 we used a system of metric IDs like 01_22.sql, which made it hard to understand what a query did by its name.)
- When all of the queries have been written, request a review from the @HTTPArchive/analysts team.
- When the 2020_08_01 HTTP Archive crawl is complete at the end of August, run each query and save its results to your designated spreadsheet linked from the chapter issue. Each metric's results should be in its own tab, named using the file name (eg pct large images).
- To help others interpret the results, leave a comment on the top row of the results with the correct interpretation. For example: "Read as: 4.28% of mobile pages contain images larger than 10 MB".
- Authors and reviewers should check the results to make sure they're within expected limits, otherwise the query may need to be rewritten.
- When the authors have drafted the chapter, they should leave a placeholder for the data visualization. Generate a chart using the templates provided, publish it to get its embeddable HTML, and replace the placeholder.
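As a rough sketch of the branch and directory steps above, assuming a hypothetical 2021 CSS chapter (substitute your own chapter name and year):

# From the repository root (hypothetical chapter/year names):
git checkout main
git pull
git checkout -b css-sql-2021
mkdir -p sql/2021/01_CSS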
#standardSQL
# The number of unique websites in the July 2019 crawl.
SELECT
_TABLE_SUFFIX AS client,
COUNT(DISTINCT url) AS num_pages
FROM
`httparchive.summary_pages.2019_07_01_*`
GROUP BY
client
- The first line should be #standardSQL so the query is interpreted as Standard SQL syntax
- The next line(s) should be a comment describing what the query is analyzing and the metric ID(s) if relevant
- Use uppercase for all SQL keywords (SELECT, COUNT, AS, FROM, etc)
- Fields are descriptively named in snake_case (lowercase and underscore-delimited)
- Table names should always specify the project (httparchive) and include both desktop and mobile when possible by using a wildcard (*)
- Separate the desktop and mobile results by grouping by client. When using a wildcard, this will be the _TABLE_SUFFIX special value.
- Filenames should have a .sql extension for syntax highlighting in GitHub and other editors, and also to allow them to be recognised by the linter (see below).
- Filenames should not have spaces in them to make them easier to reference from the web.
Most of the above standards will be enforced by the linter (SQLFluff) when you open a Pull Request and the linting will fail if you deviate from these standards.
You can run the linter locally if you have a Python environment, which can be set up with the following commands if using virtualenv:
cd src
virtualenv --python python3.8 .venv
source .venv/bin/activate
pip install -r requirements.txt
Or for those on Windows:
cd src
virtualenv --python python3 .venv
.venv\Scripts\activate.bat
pip install -r requirements.txt
The environment will need to be activated each time, but the virtual env should only need to be set up once, and the requirements only need to be installed again if you want new versions.
To lint the 2020 resource-hints SQL files, for example, install the python environment as per above, and then issue the following command:
sqlfluff lint ../sql/2020/resource-hints
This will return errors like the following:
% sqlfluff lint ../sql/2020/resource-hints
== [../sql/2020/resource-hints/adoption_service_workers.sql] FAIL
L: 25 | P: 26 | L010 | Inconsistent capitalisation of keywords.
L: 26 | P: 37 | L010 | Inconsistent capitalisation of keywords.
L: 34 | P: 63 | L038 | Trailing comma in select statement forbidden
All Finished 📜 🎉!
This states that:
- On line 25, in position 26 you are using lowercase for keywords (e.g. as instead of AS), so failed rule L010.
- Similarly on line 26, position 37.
- And finally on line 34, position 63 you have an unnecessary comma (e.g. SELECT a,b, FROM table), and so failed rule L038. Remove the extra comma.
The list of rules can be found in the SQLFluff documentation, though we have turned a few of them off and configured others for our style (see our .sqlfluff file if curious).
If you see any "unparseable" or PRS errors, then this is either an error in your code, or perhaps you've discovered a bug. Reach out to Barry (@tunetheweb) for help if stuck.
To attempt to autofix the errors, you can use the fix command instead of lint:
sqlfluff fix ../sql/2020/resource-hints
This will produce similar output, but with an offer to fix the issues it thinks it can fix:
% sqlfluff fix ../sql/2020/resource-hints
==== finding fixable violations ====
== [../sql/2020/resource-hints/adoption_service_workers.sql] FAIL
L: 25 | P: 26 | L010 | Inconsistent capitalisation of keywords.
L: 26 | P: 37 | L010 | Inconsistent capitalisation of keywords.
L: 34 | P: 63 | L038 | Trailing comma in select statement forbidden
==== fixing violations ====
3 fixable linting violations found
Are you sure you wish to attempt to fix these? [Y/n]
If you lint again you should see that most of the errors are fixed. Note that not all errors can be autofixed and some will require manual intervention, but autofixing is useful for the simple errors. So while it's generally OK to run the fix command, run the lint command afterwards once everything is clean to make sure the fix command didn't miss any issues.
When returning a percent value, ensure that the number is between 0 and 1, not 0 and 100 (we used 0 to 100 in 2019 but switched in 2020 to allow easier formatting in Sheets). We also stopped rounding in BigQuery from 2020 onwards; we present the full decimal value and round in Sheets.
#standardSQL
# Take the average percent of external stylesheets over all requests.
SELECT
_TABLE_SUFFIX AS client,
COUNTIF(type = 'css') / COUNT(0) AS pct_stylesheets
FROM
`httparchive.summary_requests.2020_08_01_*`
GROUP BY
client
Usually you would sort by pct in descending (DESC) order. Try to reuse the column names rather than reimplement the percentage formula in the ORDER BY clause.
SELECT
_TABLE_SUFFIX AS client,
element,
COUNT(DISTINCT url) AS pages,
total,
COUNT(DISTINCT url) / total AS pct
FROM
`some_dataset`
GROUP BY
client,
total,
element
ORDER BY
pct DESC,
client
We generally want to present stats as a percentage of pages. The HTTP Archive dataset changes month by month (based on the latest available CrUX data) and has been growing over time, so presenting absolute values can make it difficult to understand the scale and to do year-on-year comparisons. As much as possible, we want data presented as percentages.
To do this we need to take the absolute frequency of stats and divide by the total. This means you need a total figure, and there are several ways of getting it.
You can use a CTE (Common Table Expression):
WITH totals AS (
SELECT
_TABLE_SUFFIX AS _TABLE_SUFFIX,
COUNT(0) AS total
FROM
`httparchive.pages.2021_07_01_*`
GROUP BY
_TABLE_SUFFIX
)
SELECT
_TABLE_SUFFIX AS client,
COUNT(0) AS freq,
total,
COUNT(0) / total AS pct
FROM
`httparchive.pages.2021_07_01_*`
JOIN
totals
USING (_TABLE_SUFFIX)
WHERE
...
GROUP BY
client
ORDER BY
pct DESC
You can use a subquery:
SELECT
_TABLE_SUFFIX AS client,
COUNT(0) AS freq,
total,
COUNT(0) / total AS pct
FROM
`httparchive.pages.2021_07_01_*`
JOIN
(
SELECT
_TABLE_SUFFIX AS _TABLE_SUFFIX,
COUNT(0) AS total
FROM
`httparchive.pages.2021_07_01_*`
GROUP BY
_TABLE_SUFFIX
)
USING (_TABLE_SUFFIX)
WHERE
...
GROUP BY
client
ORDER BY
pct DESC
Or you can use a windowing clause using OVER:
SELECT
_TABLE_SUFFIX AS client,
COUNTIF(...) AS freq,
SUM(COUNT(DISTINCT url)) OVER (PARTITION BY client) AS total,
COUNTIF(...) / SUM(COUNT(DISTINCT url)) OVER (PARTITION BY client) AS pct
FROM
`httparchive.pages.2021_07_01_*`
GROUP BY
client
ORDER BY
pct DESC
The last looks the simplest and is a very handy way of avoiding another join. However, care must be taken, as it's easy to get the wrong total with this approach. In particular:
- UNNEST queries change the totals, and even using DISTINCT url or similar will not remove this, as you are doing a SUM of those DISTINCT urls rather than a DISTINCT of the SUM. This often leads to a total much LARGER than expected.
- WHERE clauses are applied before the windowing function, so your total may already be filtered. This often leads to a total SMALLER than expected.
- HAVING clauses are applied before the windowing function, so your total may already be filtered. This often leads to a total SMALLER than expected. You can use the QUALIFY clause instead, which is basically the same as HAVING but is applied after the windowing function.
So for simple queries with no joins and no filters (using WHERE or HAVING), by all means use the windowing option, but be aware of its limitations.
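To illustrate QUALIFY, here is a minimal sketch in the same windowing style (grouping by the rank field, which is described further below; the 1% threshold is an arbitrary choice for illustration, not a real Almanac metric):

#standardSQL
# Sketch: percent of pages per rank bucket per client, keeping only
# buckets that cover at least 1% of pages. QUALIFY is applied after the
# window function, so the total reflects all buckets before filtering.
SELECT
  _TABLE_SUFFIX AS client,
  rank,
  COUNT(0) AS freq,
  COUNT(0) / SUM(COUNT(0)) OVER (PARTITION BY client) AS pct
FROM
  `httparchive.summary_pages.2021_07_01_*`
GROUP BY
  client,
  rank
QUALIFY
  pct > 0.01
ORDER BY
  pct DESC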
It is always best to include the total in your exports so you can validate the numbers (why is this out of 500 million URLs? I thought our crawl was 7 million URLs?).
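For example, a quick sanity-check sketch (assuming the 2021_07_01 crawl) to confirm what the per-client page totals should be:

#standardSQL
# Sanity check: total number of pages per client in the crawl, to
# compare against the totals appearing in your query results.
SELECT
  _TABLE_SUFFIX AS client,
  COUNT(0) AS total_pages
FROM
  `httparchive.summary_pages.2021_07_01_*`
GROUP BY
  client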
In the pages dataset, the payload column includes a JSON-encoded _almanac property containing an object with several custom metrics. See https://github.com/HTTPArchive/legacy.httparchive.org/blob/master/custom_metrics/almanac.js for the definitions of each metric.
#standardSQL
# 01_15: Percent of pages that include link[rel=modulepreload]
CREATE TEMP FUNCTION hasModulePreload(payload STRING)
RETURNS BOOLEAN LANGUAGE js AS '''
try {
var $ = JSON.parse(payload);
var almanac = JSON.parse($._almanac);
return !!almanac['link-nodes'].find(e => e.rel.toLowerCase() == 'modulepreload');
} catch (e) {
return false;
}
''';
SELECT
_TABLE_SUFFIX AS client,
COUNTIF(hasModulePreload(payload)) AS num_pages,
COUNTIF(hasModulePreload(payload)) / COUNT(0) AS pct_modulepreload
FROM
`httparchive.pages.2019_07_01_*`
GROUP BY
client
Sometimes the values in a custom metric are too complex to process using the JSON_EXTRACT functions in BigQuery, so you can do more complex analysis with a user-defined function (UDF). In the example above, the UDF parses both the payload JSON and the JSON-encoded _almanac custom metric.
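For comparison, when a custom metric value is simple enough, the built-in JSON functions may suffice without a UDF. A sketch (the $._almanac property is real, but treating link-nodes as a plain array is an assumption based on the UDF above):

#standardSQL
# Sketch: percent of pages whose _almanac custom metric has a non-empty
# link-nodes array, using only JSON functions (no UDF).
SELECT
  _TABLE_SUFFIX AS client,
  COUNTIF(JSON_EXTRACT(JSON_EXTRACT_SCALAR(payload, '$._almanac'), "$['link-nodes']") != '[]') / COUNT(0) AS pct_with_link_nodes
FROM
  `httparchive.pages.2019_07_01_*`
GROUP BY
  client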
The httparchive.almanac dataset contains 1k-subset tables for all of the standard datasets (lighthouse, pages, requests, response_bodies, summary_pages, summary_requests, technologies).
There are also several preprocessed tables to make querying more convenient and cost-effective:
- response_bodies: combination of desktop/mobile response bodies, clustered by the client, page, and url fields
- summary_requests: combination of desktop/mobile summary requests, clustered by the client, page, and url fields. Note that page is the URL of the corresponding pageid in the summary_pages table.
- summary_response_bodies: combination of desktop/mobile response bodies and summary requests data, joined by the page and url fields and clustered by the client, firstHtml, and type fields. This is the preferred table for querying response bodies of a certain type, for example:
#standardSQL
# Calculates the percent of pages that reference the word "almanac" (case-insensitive).
SELECT
client,
COUNTIF(REGEXP_CONTAINS(body, '(?i)almanac')) / COUNT(0) AS pct_almanac
FROM
`httparchive.almanac.summary_response_bodies`
WHERE
firstHtml
GROUP BY
client
Note that summary_response_bodies is an 18 TB table, but thanks to the clustering, this query is limited to only the response bodies that are marked as firstHtml. So the query actually completes in ~13 seconds and consumes only 850 GB.
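The same clustering benefit applies to the type field. For example, a sketch (assuming 'script' is one of the type values inherited from summary_requests):

#standardSQL
# Sketch: number of script response bodies per client; the clustered
# type field limits the scan to script responses only.
SELECT
  client,
  COUNT(0) AS num_script_bodies
FROM
  `httparchive.almanac.summary_response_bodies`
WHERE
  type = 'script'
GROUP BY
  client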

Since the June 2021 crawl, rank data is taken from the CrUX report, allowing queries on the top 1k origins, top 10k, top 100k, top 1M, etc. The rank field is available in the httparchive.summary_pages.* tables and will contain the values 1000, 10000, 100000, 1000000 and 10000000, representing the rank bucket for each URL crawled.
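For example, a sketch of restricting a query to the top 10k origins (because rank holds bucket ceilings, rank <= 10000 covers both the 1000 and 10000 buckets):

#standardSQL
# Sketch: number of crawled pages in the top 10k origins per client.
SELECT
  _TABLE_SUFFIX AS client,
  COUNTIF(rank <= 10000) AS top_10k_pages
FROM
  `httparchive.summary_pages.2021_07_01_*`
GROUP BY
  client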
Note that the ranks are exclusive, meaning that a URL may only have one rank. For example, the 10000 rank bucket contains 9,000 URLs because the top 1,000 is its own exclusive segment. To include all URLs you may combine the different ranks as follows:
SELECT
client,
rank,
COUNT(0) AS total
FROM (
SELECT
_TABLE_SUFFIX AS client,
rank AS _rank
FROM
`httparchive.summary_pages.2021_07_01_*`
),
UNNEST([1000, 10000, 100000, 1000000, 10000000]) AS rank
WHERE
_rank <= rank
GROUP BY
client,
rank
ORDER BY
client,
rank
When running the above query, you may notice that the total does not perfectly match the number of expected URLs, i.e. fewer than 1,000 URLs for the top 1000 rank. Since the CrUX rank for each origin is shared between desktop and mobile, it is possible that an origin which features in CrUX for one device type does not feature for the other. As a result, for the 1000 rank, the list of URLs to crawl may contain a maximum of 1,000 URLs (if all URLs feature on both desktop and mobile) but is likely to contain a subset.