Analysts' Guide
Analysts are responsible for researching the stats and trends used throughout the Almanac. Analysts work closely with authors and reviewers during the planning phase to give direction on the types of stats that are possible from the dataset, and during the analyzing/writing phases to ensure that the stats are used correctly.
- Content planning: about 4 hours working with the content team to identify metrics from the HTTP Archive that could be used to support the chapter. This time is also spent preparing queries and augmenting the testing infrastructure as needed (custom metrics). This phase typically occurs in June-July.
- Data validation: about 6 hours finalizing queries, running them against the dataset, and building a spreadsheet of results for the rest of the content team to review. This phase typically occurs in July-September.
- Content writing: about 2 hours reviewing the drafts to ensure that the results are being interpreted correctly. This phase typically occurs in September-October.
The total commitment is approximately 12 hours of work per chapter, spread over several months.
The 2021 project is underway and we're actively looking for analysts! Browse the list of open chapters and comment in any chapter-specific issues that interest you.
The Analysts team will be given BigQuery coupon codes to offset the cost of analysis. To get your code, join the #web-almanac-analysts channel on the HTTP Archive Slack and ask @Paul Calvano.
There could be dozens of queries per chapter, so to help stay organized, here is a recommended workflow for analysts to get the queries written and the results shared.
- When the chapter is outlined, work with the content team to identify the quantifiable metrics that would be needed.
- Add each metric to the Metrics section using data-oriented descriptions of what it's measuring, for example "% of mobile pages that have images larger than 10 MB".
- Triage the feasibility of each metric based on the capabilities of the HTTP Archive dataset. Feel free to ask around in #web-albanac-analysts if you're not sure whether a metric is feasible.
- Create a git branch off main named <chapter>-sql-<yyyy> (or similar) and create a subdirectory for your chapter under the sql directory named <yyyy>/<chapter-number>_<chapter-name>. For example, 2020/01_CSS (use leading 0s for the chapter number to ensure that they appear in sorted order). See the sketch after this list for an example of these steps.
- Open up a pull request to merge with the main branch, but leave it in draft mode for now. Edit the PR description to contain a checklist of all of the metrics you triaged as feasible in the doc. See #1087, for example.
- As you write the queries for each metric, add a comment to the top of the query using the description from step 2, and give the file a short and descriptive name, like pct_large_images.sql. (Note that in 2019 we used a system of metric IDs like 01_22.sql, which made it hard to understand what a query did by its name.)
- When all of the queries have been written, request a review from the @HTTPArchive/analysts team.
- When the 2020_08_01 HTTP Archive crawl is complete at the end of August, run each query and save its results to your designated spreadsheet linked from the chapter issue. Each metric's results should be in its own tab, named using the file name (eg pct large images).
- To help others interpret the results, leave a comment on the top row of the results with the correct interpretation. For example: "Read as: 4.28% of mobile pages contain images larger than 10 MB".
- Authors and reviewers should check the results to make sure they're within expected limits, otherwise the query may need to be rewritten.
- When the authors have drafted the chapter, they should leave a placeholder for the data visualization. Generate a chart using the templates provided, publish it to get its embeddable HTML, and replace the placeholder.
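As a rough sketch of the branch and directory steps above, assuming a hypothetical 2021 CSS chapter (substitute your own chapter name and year):

# From the repository root (hypothetical chapter/year names):
git checkout main
git pull
git checkout -b css-sql-2021
mkdir -p sql/2021/01_CSS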
#standardSQL
# The number of unique websites in the July 2019 crawl.
SELECT
_TABLE_SUFFIX AS client,
COUNT(DISTINCT url) AS num_pages
FROM
`httparchive.summary_pages.2019_07_01_*`
GROUP BY
client
- The first line should be #standardSQL so the query is interpreted as Standard SQL syntax
- The next line(s) should be a comment describing what the query is analyzing and the metric ID(s) if relevant
- Use uppercase for all SQL keywords (SELECT, COUNT, AS, FROM, etc)
- Fields are descriptively named in snake_case (lowercase and underscore-delimited)
- Table names should always specify the project (httparchive) and include both desktop and mobile when possible by using a wildcard (*)
- Separate the desktop and mobile results by grouping by client. When using a wildcard, this will be the _TABLE_SUFFIX special value.
- Filenames should have a .sql extension for syntax highlighting in GitHub and other editors, and also to allow them to be recognised by the linter (see below).
- Filenames should not have spaces in them to make them easier to reference from the web.
Most of the above standards will be enforced by the linter (SQLFluff) when you open a Pull Request and the linting will fail if you deviate from these standards.
You can run the linter locally if you have a Python environment, which can be set up with the following commands if using virtualenv:
cd src
virtualenv --python python3.8 .venv
source .venv/bin/activate
pip install -r requirements.txt
Or for those on Windows:
cd src
virtualenv --python python3 .venv
.venv\Scripts\activate.bat
pip install -r requirements.txt
The environment will need to be activated each time, but the virtual env should only need to be set up once, and the requirements only need to be installed again if you want new versions.
To lint the 2020 resource-hints SQL files, for example, install the python environment as per above, and then issue the following command:
sqlfluff lint ../sql/2020/resource-hints
This will return errors like the following:
% sqlfluff lint ../sql/2020/resource-hints
== [../sql/2020/resource-hints/adoption_service_workers.sql] FAIL
L: 25 | P: 26 | L010 | Inconsistent capitalisation of keywords.
L: 26 | P: 37 | L010 | Inconsistent capitalisation of keywords.
L: 34 | P: 63 | L038 | Trailing comma in select statement forbidden
All Finished 📜 🎉!
This states that:
- On line 25, in position 26 you are using lowercase for keywords (e.g. as instead of AS), so failed rule L010.
- Similarly on line 26, position 37.
- And finally on line 34, position 63 you have an unnecessary comma (e.g. SELECT a,b, FROM table), and so failed rule L038. Remove the extra comma.
The list of rules can be found in the SQLFluff documentation, though we have turned a few of them off and configured others for our style (see our .sqlfluff file if curious).
If you see any "unparseable" or PRS errors, then this is either an error in your code, or perhaps you've discovered a bug. Reach out to Barry (@tunetheweb) for help if stuck.
To attempt to autofix the errors, you can use the fix command instead of lint:
sqlfluff fix ../sql/2020/resource-hints
This will produce similar output, but with an offer to fix the issues it thinks it can fix:
% sqlfluff fix ../sql/2020/resource-hints
==== finding fixable violations ====
== [../sql/2020/resource-hints/adoption_service_workers.sql] FAIL
L: 25 | P: 26 | L010 | Inconsistent capitalisation of keywords.
L: 26 | P: 37 | L010 | Inconsistent capitalisation of keywords.
L: 34 | P: 63 | L038 | Trailing comma in select statement forbidden
==== fixing violations ====
3 fixable linting violations found
Are you sure you wish to attempt to fix these? [Y/n]
If you lint again you should see that most of the errors are fixed. Note that not all errors can be autofixed and some will require manual intervention, but autofixing is useful for the simple errors. So while it's generally OK to run the fix command, run the lint command afterwards once everything is clean to make sure the fix command didn't miss any issues.
When returning a percent value, ensure that the number is between 0 and 1, not 0 and 100 (we used 0 to 100 in 2019 but switched in 2020 to allow easier formatting in Sheets). We also stopped rounding in BigQuery from 2020 onwards; we present the full decimal value and round in Sheets.
#standardSQL
# Take the average percent of external stylesheets over all requests.
SELECT
_TABLE_SUFFIX AS client,
COUNTIF(type = 'css') / COUNT(0) AS pct_stylesheets
FROM
`httparchive.summary_requests.2020_08_01_*`
GROUP BY
client
Usually you would sort by pct in descending (DESC) order. Try to reuse the column names rather than reimplement the percentage formula in the ORDER BY clause.
SELECT
_TABLE_SUFFIX AS client,
element,
COUNT(DISTINCT url) AS pages,
total,
COUNT(DISTINCT url) / total AS pct
FROM
`some_dataset`
GROUP BY
client,
total,
element
ORDER BY
pct DESC,
client
We generally want to present stats as a percentage of pages. The HTTP Archive dataset changes month by month (based on the latest available CrUX data) and has been growing over time, so presenting absolute values can make it difficult to understand the scale and to do year-on-year comparisons. As much as possible, we want data presented as percentages.
To do this we need to take the absolute frequency of stats and divide by the total. This means you need a total figure, and there are several ways of getting it.
You can use a CTE (Common Table Expression):
WITH totals AS (
SELECT
_TABLE_SUFFIX AS _TABLE_SUFFIX,
COUNT(0) AS total
FROM
`httparchive.pages.2021_07_01_*`
GROUP BY
_TABLE_SUFFIX
)
SELECT
_TABLE_SUFFIX AS client,
COUNT(0) AS freq,
total,
COUNT(0) / total AS pct
FROM
`httparchive.pages.2021_07_01_*`
JOIN
totals
USING (_TABLE_SUFFIX)
WHERE
...
GROUP BY
client
ORDER BY
pct DESC
You can use a subquery:
SELECT
_TABLE_SUFFIX AS client,
COUNT(0) AS freq,
total,
COUNT(0) / total AS pct
FROM
`httparchive.pages.2021_07_01_*`
JOIN
(
SELECT
_TABLE_SUFFIX AS _TABLE_SUFFIX,
COUNT(0) AS total
FROM
`httparchive.pages.2021_07_01_*`
GROUP BY
_TABLE_SUFFIX
)
USING (_TABLE_SUFFIX)
WHERE
...
GROUP BY
client
ORDER BY
pct DESC
Or you can use a windowing clause using OVER:
SELECT
_TABLE_SUFFIX AS client,
COUNTIF(...) AS freq,
SUM(COUNT(DISTINCT url)) OVER (PARTITION BY client) AS total,
COUNTIF(...) / SUM(COUNT(DISTINCT url)) OVER (PARTITION BY client) AS pct
FROM
`httparchive.pages.2021_07_01_*`
GROUP BY
client
ORDER BY
pct DESC
The last looks the simplest and is a very handy way of avoiding another join. However, care must be taken, as it's easy to get the wrong total with this approach. In particular:
- UNNEST queries change the totals, and even using DISTINCT url or similar will not remove this, as you are doing a SUM of those DISTINCT urls rather than a DISTINCT of the SUM. This often leads to a total much LARGER than expected.
- WHERE clauses are applied before the windowing function, so your total may already be filtered. This often leads to a total SMALLER than expected.
- HAVING clauses are applied before the windowing function, so your total may already be filtered. This often leads to a total SMALLER than expected. You can use the QUALIFY clause instead, which is basically the same as HAVING but is applied after the windowing function.
So for simple queries with no joins and no filters (using WHERE or HAVING), by all means use the windowing option, but be aware of its limitations.
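To illustrate QUALIFY, here is a minimal sketch in the same windowing style (grouping by the rank field, which is described further below; the 1% threshold is an arbitrary choice for illustration, not a real Almanac metric):

#standardSQL
# Sketch: percent of pages per rank bucket per client, keeping only
# buckets that cover at least 1% of pages. QUALIFY is applied after the
# window function, so the total reflects all buckets before filtering.
SELECT
  _TABLE_SUFFIX AS client,
  rank,
  COUNT(0) AS freq,
  COUNT(0) / SUM(COUNT(0)) OVER (PARTITION BY client) AS pct
FROM
  `httparchive.summary_pages.2021_07_01_*`
GROUP BY
  client,
  rank
QUALIFY
  pct > 0.01
ORDER BY
  pct DESC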
It is always best to include the total in your exports so you can validate the numbers (why is this out of 500 million URLs? I thought our crawl was 7 million URLs?).
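For example, a quick sanity-check sketch (assuming the 2021_07_01 crawl) to confirm what the per-client page totals should be:

#standardSQL
# Sanity check: total number of pages per client in the crawl, to
# compare against the totals appearing in your query results.
SELECT
  _TABLE_SUFFIX AS client,
  COUNT(0) AS total_pages
FROM
  `httparchive.summary_pages.2021_07_01_*`
GROUP BY
  client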
In the pages dataset, the payload column includes a JSON-encoded _almanac property containing an object with several custom metrics. See https://github.com/HTTPArchive/legacy.httparchive.org/blob/master/custom_metrics/almanac.js for the definitions of each metric.
#standardSQL
# 01_15: Percent of pages that include link[rel=modulepreload]
CREATE TEMP FUNCTION hasModulePreload(payload STRING)
RETURNS BOOLEAN LANGUAGE js AS '''
try {
var $ = JSON.parse(payload);
var almanac = JSON.parse($._almanac);
return !!almanac['link-nodes'].find(e => e.rel.toLowerCase() == 'modulepreload');
} catch (e) {
return false;
}
''';
SELECT
_TABLE_SUFFIX AS client,
COUNTIF(hasModulePreload(payload)) AS num_pages,
COUNTIF(hasModulePreload(payload)) / COUNT(0) AS pct_modulepreload
FROM
`httparchive.pages.2019_07_01_*`
GROUP BY
client
Sometimes the values in a custom metric are too complex to process using the JSON_EXTRACT functions in BigQuery, so you can do more complex analysis with a user-defined function (UDF). In the example above, the UDF parses both the payload JSON and the JSON-encoded _almanac custom metric.
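For comparison, when a custom metric value is simple enough, the built-in JSON functions may suffice without a UDF. A sketch (the $._almanac property is real, but treating link-nodes as a plain array is an assumption based on the UDF above):

#standardSQL
# Sketch: percent of pages whose _almanac custom metric has a non-empty
# link-nodes array, using only JSON functions (no UDF).
SELECT
  _TABLE_SUFFIX AS client,
  COUNTIF(JSON_EXTRACT(JSON_EXTRACT_SCALAR(payload, '$._almanac'), "$['link-nodes']") != '[]') / COUNT(0) AS pct_with_link_nodes
FROM
  `httparchive.pages.2019_07_01_*`
GROUP BY
  client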
The httparchive.almanac dataset contains 1k-subset tables for all of the standard datasets (lighthouse, pages, requests, response_bodies, summary_pages, summary_requests, technologies).
There are also several preprocessed tables to make querying more convenient and cost-effective:
- response_bodies: combination of desktop/mobile response bodies, clustered by the client, page, and url fields
- summary_requests: combination of desktop/mobile summary requests, clustered by the client, page, and url fields. Note that page is the URL of the corresponding pageid in the summary_pages table.
- summary_response_bodies: combination of desktop/mobile response bodies and summary requests data, joined by the page and url fields and clustered by the client, firstHtml, and type fields. This is the preferred table for querying response bodies of a certain type, for example:
#standardSQL
# Calculates the percent of pages that reference the word "almanac" (case-insensitive).
SELECT
client,
COUNTIF(REGEXP_CONTAINS(body, '(?i)almanac')) / COUNT(0) AS pct_almanac
FROM
`httparchive.almanac.summary_response_bodies`
WHERE
firstHtml
GROUP BY
client
Note that summary_response_bodies is an 18 TB table, but thanks to the clustering, this query is limited to only the response bodies that are marked as firstHtml. So the query actually completes in ~13 seconds and consumes only 850 GB.
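The same clustering benefit applies to the type field. For example, a sketch (assuming 'script' is one of the type values inherited from summary_requests):

#standardSQL
# Sketch: number of script response bodies per client; the clustered
# type field limits the scan to script responses only.
SELECT
  client,
  COUNT(0) AS num_script_bodies
FROM
  `httparchive.almanac.summary_response_bodies`
WHERE
  type = 'script'
GROUP BY
  client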

Since the June 2021 crawl, rank data is taken from the CrUX report, allowing queries on the top 1k origins, top 10k, top 100k, top 1M, etc. The rank field is available in the httparchive.summary_pages.* tables and will contain the values 1000, 10000, 100000, 1000000 and 10000000, representing the rank bucket for each URL crawled.
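For example, a sketch of restricting a query to the top 10k origins (because rank holds bucket ceilings, rank <= 10000 covers both the 1000 and 10000 buckets):

#standardSQL
# Sketch: number of crawled pages in the top 10k origins per client.
SELECT
  _TABLE_SUFFIX AS client,
  COUNTIF(rank <= 10000) AS top_10k_pages
FROM
  `httparchive.summary_pages.2021_07_01_*`
GROUP BY
  client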
Note that the ranks are exclusive, meaning that a URL may only have one rank. For example, the 10000 rank bucket contains 9,000 URLs because the top 1,000 is its own exclusive segment. To include all URLs you may combine the different ranks as follows:
SELECT
client,
rank,
COUNT(0) AS total
FROM (
SELECT
_TABLE_SUFFIX AS client,
rank AS _rank
FROM
`httparchive.summary_pages.2021_07_01_*`
),
UNNEST([1000, 10000, 100000, 1000000, 10000000]) AS rank
WHERE
_rank <= rank
GROUP BY
client,
rank
ORDER BY
client,
rank
When running the above query, you may notice that the total does not perfectly match the number of expected URLs, i.e. fewer than 1,000 URLs for the top 1000 rank. Since the CrUX rank for each origin is shared between desktop and mobile, it is possible that an origin which features in CrUX for one device type does not feature for the other. As a result, for the 1000 rank, the list of URLs to crawl may contain a maximum of 1,000 URLs (if all URLs feature on both desktop and mobile) but is likely to contain a subset.