Analysts' Guide

Analysts are responsible for researching the stats and trends used throughout the Almanac. Analysts work closely with authors and reviewers during the planning phase to give direction on the types of stats that are possible from the dataset, and during the analyzing/writing phases to ensure that the stats are used correctly.

Commitment summary

Content planning: about 4 hours working with the content team to identify metrics from the HTTP Archive that could be used to support the chapter. This time is also spent preparing queries and augmenting the testing infrastructure as needed (custom metrics). This phase typically occurs in June-July.
Data validation: about 6 hours finalizing queries, running them against the dataset, and building a spreadsheet of results for the rest of the content team to review. This phase typically occurs in July-September.
Content writing: about 2 hours reviewing the drafts to ensure that the results are being interpreted correctly. This phase typically occurs in September-October.

The total commitment is approximately 30 hours of work per chapter, spread over 6 months.

How to join

The 2025 project is underway and we're actively looking for analysts! Browse the list of open chapters and comment in any chapter-specific issues that interest you.

Expensing queries

The Analysts team will be given BigQuery access on behalf of HTTP Archive GCP project.

To get your access, request to join the httparchive-webalmanac-analysts Google Group (please mention your Github username).

Once the group managers approve your membership you will get a confirmation email and will be able to use 'httparchive' project in the GCP console for BigQuery analysis.

Make sure you run your queries within this project to avoid any personal costs. And check our query cost analysis best practices.

Analysis workflow

There could be dozens of queries per chapter, so to help stay organized here is a recommended workflow for analysts to get the queries written and the results shared.

When the chapter is outlined, work with the content team to identify the quantifiable metrics that would be needed.
Add each metric to the Metrics section using data-oriented descriptions of what it's measuring, for example "% of mobile pages that have images larger than 10 MB".
Triage the feasibility of each metric based on the capabilities of the HTTP Archive dataset. Feel free to ask around in #web-almanac-analysts if you're not sure whether a metric is feasible.
Create a git branch off main named <chapter>-sql-<yyyy> (or similar) and create a subdirectory for your chapter under the sql directory named <yyyy>/<chapter-name>. For example, 2022/css.
Open up a pull request to merge with the main branch, but leave it in draft mode for now. Edit the PR description to contain a checklist of all of the metrics you triaged as feasible in the doc. See #1087, for example.
As you write the queries for each metric, add a comment to the top of the query using the description from step 2, and give the file a short and descriptive name, like pct_large_images.sql. (Note that in 2019 we used a system of metric IDs like 01_22.sql, which made it hard to understand what a query did by its name.)
When all of the queries have been written, request a review from the @HTTPArchive/analysts team.
When the HTTP Archive crawl is complete at the end of July, run each query and save its results to your designated spreadsheet linked from the chapter issue. Each metric's results should be in its own tab, which is named using the file name (eg pct large images). See visualizations examples.
To help others interpret the results, leave a comment on the top row of the results with the correct interpretation. For example: "Read as: 4.28% of mobile pages contain images larger than 10 MB".
Authors and reviewers should check the results to make sure they're within expected limits, otherwise the query may need to be rewritten.
When the authors have drafted the chapter, they should leave a placeholder for the data visualization. Generate a chart using the templates provided, publish it to get its embeddable HTML, and replace the placeholder.

Dates to use

The Web Almanac is an annual publication and we try to enforce a common HTTP Archive run to query for consistency across the Almanac. There has been some suggestions and discussion about making it less annual, or publishing different chapters at different time of years but, for now, it's annual primarily based around the one month's dataset.

We try to use the July dataset for each year, but there have been exceptions, either due to issues with that particular dataset making it less reliable (2020), or when we decided to publish earlier than usual (2022). Even within years, there are some exceptions. Below are the dates for each "Almanac Year" and the exceptions.

Almanac Edition	Dataset to use	Exceptions
2019	`2019-07-01`
2020	`2020-08-01`	HTTP Archive performance stats were incorrect for Aug and 2020-09-01 should be used for this (CrUX data was fine, and should really be used in preference to lab-based results anyway).
2021	`2021-07-01`	WASM used `2021-09-01`
2022	`2022-06-01`	Parsed CSS data was not available for June and July data should be used. Also `2022-06-09` contains home AND secondary pages for those chapters wanting to use that (the home page match the data fro `2022-06-01` as the official run kicked off later in the month and we just marked them as `2022-06-01`).
2024	`2024-06-01`
2025	`2025-07-01`

We have a linter to attempt to block incorrect usage when copying queries from year to year. You can prevent the linter from flagging in these edge cases by adding --noqa: L063 to the end of the SQL line with the "incorrect" date.

Query conventions

#standardSQL
# The number of unique websites in the July 2019 crawl.
SELECT
  client,
  COUNT(DISTINCT page) AS num_pages
FROM
  `httparchive.crawl.pages`
WHERE
  date = '2024-07-01` AND
  is_root_page
GROUP BY
  client

The first line should be #standardSQL so the query is interpreted as Standard SQL syntax
The next line(s) should be a comment describing what the query is analyzing and the metric ID(s) if relevant
Use uppercase for all SQL keywords (SELECT, COUNT, AS, FROM, etc)
Fields, and tables, are descriptively named in snake_case (lowercase and underscore-delimited)
Function names are descriptively named in camelCase (lowercase in general with uppercase letter indicating new word) to differentiate them from fields and tables. Do not reuse the exact same function name for different purposes in different queries. See below for more information on user defined functions.
Table names should always specify the project (httparchive) and include both desktop and mobile when possible by using a wildcard (*)
Separate the desktop and mobile results by grouping by client.
Filenames should have a .sql extension for syntax highlighting in GitHub and other editors, and also allow them to be recognised by the linter (see below).
Filenames should not have spaces in them to make them easier to reference from the web.

Mobile or Desktop or Both?

In general we would like both data sets queried. It is tempting for some stats to only run one, whether there should not be any difference between the two (e.g. HTTP version supported, or looking at robots.txt), but that makes the assumption that we tests the same sites on both! There is a huge overlap alright, but not exactly the same because we base our crawl on CrUX's list which is based on usage. So https://m.facebook.com might appear on mobile but not on desktop. And maybe facebook forgot to add a robots.txt (or added a really weird one!) to their mobile site?

All stats are available to both. Historically Lighthouse was only available in mobile, but Desktop was added in May 2022

We've also been surprised before when we wouldn't expect there to be much difference, so unless there's a very good reason, we recommend to include both - you might be surprised in the differences which makes interesting information to report on! That's not to say that the author needs to include both stats if they are repetitive, but let's at least run the data so they can make that call.

If we really do only need to run (or report on) only one, we generally prefer mobile over desktop because 1) we've a bigger dataset for mobile and 2) we live in a mobile first world!

Home pages or all pages?

Each sites is crawled twice: once for the home page and with one secondary page as sometimes the home page is not representative of the site.

Consider whether to include:

Just home page results (WHERE is_root_page = true, or shortened to just WHERE is_root_page)
Both separated, and grouped by is_root_page so they can be reported separated to understand the differences
Or together as site-level stats (in which case use COUNT(root_page) rather than COUNT(page) to only count sites).

This can depend on chapter. For example for HTTP chapter it's unlikely to be different between pages so reporting only one is probably the right thing to do. For Page Weight, there can be large differences as the home page is often a "brochure page" with more media, so reporting both separately is likely more interesting. Finally, for the CMS chapter (or the JavaScript chapter) you probably interested whether a technology is used on any part of the site so want combined stats. Do be aware again that you can be surprised with results, so be careful with assumptions like those listed here.

Regardless of what you use, be sure to work appropriately ("home pages use" or "sites use").

SQL Linting

Most of the above standards will be enforced by the linter (SQLFluff) when you open a Pull Request and the linting will fail if you deviate from these standards.

You can run the linter locally if you have a python environment set up, which can be set up with the following commands if using Virtual Env:

cd src
virtualenv --python python3.8 .venv
source .venv/bin/activate
pip install -r requirements.txt

Or for those on Windows:

cd src
virtualenv --python python3 .venv
.venv\Scripts\activate.bat
pip install -r requirements.txt

The environment will need to be activated each time but the virtual env should only need set up once and the requirements only need to be installed again if you want new versions.

To lint the 2020 resource-hints SQL files, for example, install the python environment as per above, and then issue the following command:

sqlfluff lint ../sql/2020/resource-hints

This will return errors like the following:

% sqlfluff lint ../sql/2020/resource-hints
== [../sql/2020/resource-hints/adoption_service_workers.sql] FAIL
L:  25 | P:  26 | L010 | Inconsistent capitalisation of keywords.
L:  26 | P:  37 | L010 | Inconsistent capitalisation of keywords.
L:  34 | P:  63 | L038 | Trailing comma in select statement forbidden
All Finished 📜 🎉!

This states that:

On line 25, in position 26 you are using lowercase for keywords (e.g. as instead of AS) so failed rule L010.
Similarly on line 26, position 37.
And finally on line 34, position 63 you have an unnecessary comma (e.g. SELECT a,b, FROM table) and so failed rule L038. Remove the extra comma.

The list of rules can be found in the SQLFLuff documentation though we have turned a few of them off and configured others for our style (see the our .sqlfluff file if curious).

If you see any "unparseable" or PRS errors, then this is either an error in your code, or perhaps you've discovered a bug. Reach out to Barry (@tunetheweb) for help if stuck.

To attempt to autofix the errors you can use the fix command, instead of lint:

sqlfluff fix ../sql/2020/resource-hints

Which will produce similar output but with an offer to fix the issues it thinks it can fix:

% sqlfluff fix ../sql/2020/resource-hints
==== finding fixable violations ====
== [../sql/2020/resource-hints/adoption_service_workers.sql] FAIL
L:  25 | P:  26 | L010 | Inconsistent capitalisation of keywords.
L:  26 | P:  37 | L010 | Inconsistent capitalisation of keywords.
L:  34 | P:  63 | L038 | Trailing comma in select statement forbidden
==== fixing violations ====
3 fixable linting violations found
Are you sure you wish to attempt to fix these? [Y/n]

If you lint again you should see most of the errors are fixed. Note that not all errors can be autofixed and some will require manual intervention but autofixing is useful for the simple errors. So while it's generally OK to run the fix command, do run the lint command when all clean to make sure the fix command didn't miss any issues.

Formatting results

When returning a percent value, ensure that the number is between 0 and 1, not 0 and 100 (we used 0 to 100 in 2019 but switched in 2020 to allow easier formatting in Sheets). We also stopped rounding in BigQuery from 2020 onwards and present the full decimal value and round in Sheets.

#standardSQL
# Take the average percent of external stylesheets over all requests.
SELECT
  client,
  is_root_page,
  COUNTIF(type = 'css') / COUNT(0) AS pct_stylesheets
FROM
  `httparchive.crawl.requests`
WHERE
  date = '2024-07-01'
GROUP BY
  client,
  is_root_page
ORDER BY
  client,
  is_root_page;

Usually you would sort by pct in DESCing order. Try to reuse the column names rather than reimplement the percentage formula in the ORDER BY clause too.

 SELECT
  client,
  element,
  COUNT(DISTINCT page) AS pages,
  total,
 COUNT(DISTINCT page) / total, 2 AS pct
FROM
  `some_dataset`
GROUP BY
  client,
  total,
  element
ORDER BY
  pct DESC,
  client

A warning on totals

We generally want to present stats as a percentage of pages. The HTTP Archive data set changes month by month (based on the latest available CrUX data) and has been growing over time. Therefore presenting absolute values can make it difficult to understand the scale, and to do year on year comparison. So as much as possible we want data presented as percentages.

To do this we need to take the absolute frequency of stats and divide by the total number. This means you need a total number. There are a number of ways of getting this.

You can use CTE (Common Table Expression):

WITH totals AS (
  SELECT
    client,
    is_root_page,
    COUNT(0) AS total
  FROM
    `httparchive.crawl.pages`
  WHERE
    date = '2024-07-01'
  GROUP BY
    client,
    is_root_page
)

SELECT
  client,
  is_root_page,
  COUNT(0) AS freq,
  total,
  COUNT(0) / total AS pct
FROM
  `httparchive.crawl.pages`
JOIN
  totals
USING (client, is_root_page)
WHERE
  date = '2024-07-01'
  ...
GROUP BY
  client
ORDER BY
  pct DESC

You can use a subquery:

SELECT
  client,
  is_root_page,
  COUNT(0) AS freq,
  total,
  COUNT(0) / total AS pct
FROM
  `httparchive.crawl.pages`
JOIN
  (
    SELECT
      client,
      is_root_page,
      COUNT(0) AS total
    FROM
      `httparchive.crawl.pages`
    WHERE
      date = '2024-07-01'
    GROUP BY
      client,
      is_root_page,
  )
USING (client, is_root_page)
WHERE
  date - '2024-07-01'
  ...
GROUP BY
  client
ORDER BY
  pct DESC

Or you can use a windowing clause using OVER:

SELECT
  client,
  COUNTIF(...) AS freq,
  SUM(COUNT(DISTINCT page)) OVER (PARTITION BY client) AS total
  COUNT(0) / SUM(COUNT(DISTINCT page)) OVER (PARTITION BY client) AS pct
FROM
  `httparchive.crawl.pages`
WHERE
  date = '2024-07-01' AND
  is_roor_page
GROUP BY
  client
ORDER BY
  pct DESC

The last looks the simplest and is a very handy way of not having to do another join. However care must be taken with this, as it's easy to get the wrong total with this. In particular:

UNNEST queries change the totals and even using DISTINCT url or similar will not remove this as you are doing a SUM of those DISTINCT url's rather than a DISTINCT of the SUM. This often leads to a total much LARGER than expected.
WHERE clauses are applied before the windowing function, so your total may already be being filtered. This often leads to a total SMALLER than expected.
HAVING clauses are applied before the windowing function, so your total may already be being filtered. This often leads to a total SMALLER than expected. You can use the QUALIFY option instead which is basically the same as HAVING but is applied after.

So for simple functions with no joins, and no filters (using WHERE or HAVING) by all means use the windowing option, but be aware of its limitations.

It is always best to include the total in your exports so you can validate the numbers - why is this out of 500 million URLs? I thought our crawl was 7 million URLs?

Saving results to the sheets

Once all the queries are ready you can automate this step using this Colab Notebook.

To get all your queries results uploaded to the Google Sheets follow the steps:

Run 'Download repo' cell

This will create a local copy of the Almanac repository. It uses the original repo https://github.com/HTTPArchive/almanac.httparchive.org by default.
Adjust parameters and run 'Configure the chapter to process' cell
- GCP_PROJECT - the Google Cloud project ID that is used for billing of BigQuery queries (querying permissions required).
- year - the year of the Almanac edition.
- chapter - the name of the chapter. It's used to find branch and sql files. See naming conventions in Analysis Workflow.
- spreadsheet_url - the spreadsheet designated for your chapter (edit permission required). You can find it in the chapter issue.
Run 'Update local branch' cell

It switches to the designated chapter branch and fetches any new commits.
Run 'Authenticate' cell

It authenticates your Google account in Google Colab to gain access to BigQuery and Google Sheets.

configuration steps

Adjust parameters and run 'Upload query results' cell
- filename_match - regular expression to match the queries to run. Any query that doesn't match will not be run. Wrap the regex in quotes.
- filename_match_exclude - regular expression to skip the queries by the filename. No queries will be skipped if this left empty. Wrap the regex in quotes.
- dry_run - check this to run queries as dry run. It will give an overview of the queries to be processed and the estimated data processing volumes.
- overwrite_sheets - check this if the query results must overwrite the sheets that already exist.
- maximum_tb_billed - use the maximum bytes billed setting to limit query costs.

query results uploads

Almanac custom metrics and UDFs

In the pages dataset, there is a payload property of _almanac containing an object with several custom metrics. See https://github.com/HTTPArchive/legacy.httparchive.org/blob/master/custom_metrics/almanac.js for the definitions of each metric.

Sometimes the values in a custom metric are too complex to process using the JSON_EXTRACT functions in BigQuery, so you can do more complex analysis with a user-defined function (UDF). In the example below, the UDF parses both the payload JSON and the JSON-encoded _almanac custom metric:

#standardSQL
# 01_15: Percent of pages that include link[rel=modulepreload]
CREATE TEMP FUNCTION hasModulePreload(link_nodes JSON)
RETURNS BOOLEAN LANGUAGE js AS '''
  try {
    return !!link_nodes.find(e => e.rel.toLowerCase() == 'modulepreload');
  } catch (e) {
    return false;
  }

''';

SELECT
  client,
  COUNTIF(hasModulePreload(custom_metrics.other.almanac['link-nodes'])) AS num_pages,
  COUNTIF(hasModulePreload(custom_metrics.other.almanac['link-nodes'])) / COUNT(0) AS pct_modulepreload
FROM
  `httparchive.crawl.pages`
WHERE
  date = '2024-07-01' AND
  is_root_page
GROUP BY
  client

As per the example function names should be written in camelCase (hasModulePreload) rather than snake_case (has_module_preload) to help differentiate them from table names.

Additionally we have had reports that reusing the same function name in different queries being run at the same time (e.g. two tabs in your browser) in the same session, can mix up the results! Sounds like a BigQuery bug if this is the case, but either way it's easily avoided but not reusing the same function name for different functions across queries within the same set of chapter queries.

Rank

Since the June 2021 crawl, rank data is taken from the CrUX report allowing queries on the top 1k origins, top 10k, top 100k, top 1M...etc. The rank field is available in the httparchive.summary_pages.* tables and will contain the values 1000, 10000, 100000, 1000000 and 10000000 representing the rank for each url crawled.

Note that the ranks are exclusive, meaning that a URL may only have one rank. For example, the 100000 rank contains 99,000 URLs because the top 1,000 is its own exclusive segment. To include all URLs you may combine the different ranks as follows:

SELECT
  client,
  rank_grouping,
  COUNT(0) AS total
FROM (
  SELECT
    client,
    rank,
  FROM
    `httparchive.crawl.pages`
  WHERE
    date = '2024-07-01' AND
    is_root_page
  ),
UNNEST([1000, 10000, 100000, 1000000, 10000000]) AS rank_grouping
WHERE
  rank <= rank_grouping
GROUP BY
  client,
  rank_grouping
ORDER BY
  client,
  rank_grouping

When running the above query, you may notice that the total does not perfectly match the number of expected URLs, i.e. less than 1,000 URLs for the top 1000 rank. Since the CrUX rank for each origin is shared between desktop & mobile, it may be possible that an origin which features on CrUX for one device type, does not feature for the other (e.g. a mobile version of a site like https://m.example.com may not appear in desktop CrUX data if it doesn't get a lot of visits from desktop devices). As a result, for the 1000 rank, the list of URLs to crawl may contain a maximum of 1,000 URLs - if all URLs feature on both desktop and mobile - but likely to contain a subset.

Appendix

Scripts to analyze number of queries and lines of SQL for a given year.

Number of queries:

ls sql/2021/**/*.sql | wc -l

Total lines of SQL:

wc -l sql/2021/**/*.sql | tail -n 1

Uh oh!

Analysts' Guide

Table of Contents

Commitment summary

How to join

Expensing queries

Analysis workflow

Dates to use

Query conventions

Mobile or Desktop or Both?

Home pages or all pages?

SQL Linting

Formatting results

A warning on totals

Saving results to the sheets

Almanac custom metrics and UDFs

Rank

Appendix

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Contributor Guides

Content Team for each Chapter

Clone this wiki locally