Closed
Description
One common use case for cheerio is to extract multiple values from a document and store them in an object. Doing so manually currently isn't a great experience. Several packages built on top of cheerio improve on this, for example https://github.com/matthewmueller/x-ray and https://github.com/IonicaBizau/scrape-it. Commercial scraping providers also offer bulk extraction: https://www.scrapingbee.com/documentation/data-extraction/
We should add an API to make this use-case easier. The API should be along the lines of:
$.extract({
  // To get the `textContent` of an element, we can just supply a selector
  title: "title",

  // If we want to get results for more than a single element, we can use an array
  headings: ["h1, h2, h3, h4, h5, h6"],

  // If an array has more than one child, all of them will be queried for.
  // Complication: This should follow document-order.
  headings: ["h1", "h2", "h3", "h4", "h5", "h6"], // (equivalent to the above)

  // To get a property other than the `textContent`, we can pass a string to `out`. This will be passed to the `prop` method.
  links: [{ selector: "a", out: "href" }],

  // We can map over the received elements to customise the behaviour
  links: [
    {
      selector: "a",
      out(el, key, obj) {
        const $el = $(el);
        return { name: $el.text(), href: $el.prop("href") };
      },
    },
  ],

  // To get nested elements, we can pass a nested extract object. Selectors inside the nested extract object will be relative to the current scope.
  posts: [
    {
      selector: ".post",
      out: {
        title: ".title",
        body: ".body",
        link: { selector: "a :has(> .title)", out: "href" },
      },
    },
  ],

  // We can skip the selector in nested extract objects to reference the current scope.
  links: [
    {
      selector: ".post a",
      out: {
        href: { out: "href" },
        name: { out: "textContent" },
        // Equivalent: get the text content of the current scope
        name: "*",
      },
    },
  ],
});
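
To make the shorthand rules above concrete, here is a rough sketch of how the different value shapes could be normalised into one uniform descriptor before any querying happens. The helper names `normalizeExtractValue` and `normalizeExtractMap` are hypothetical and not part of cheerio. The sketch also assumes that a non-array value means "first match only" and that an object without `out` falls back to `textContent`; neither is settled by the proposal.

```javascript
// Hypothetical helpers, not part of cheerio: they only normalise the
// shorthand forms of the proposed extract map and never touch the DOM.

function normalizeExtractValue(value) {
  // Assumption: an array means "collect every match", a bare value
  // means "first match only".
  const multiple = Array.isArray(value);
  const entries = multiple ? value : [value];

  const descriptors = entries.map((entry) => {
    if (typeof entry === "string") {
      // A bare selector string extracts the `textContent`.
      return { selector: entry, out: "textContent" };
    }
    // A missing `selector` refers to the current scope (null here).
    // Assumption: a missing `out` defaults to `textContent`.
    const { selector = null, out = "textContent" } = entry;
    if (out !== null && typeof out === "object") {
      // A nested extract object is normalised recursively.
      return { selector, out: normalizeExtractMap(out) };
    }
    // Strings (property names) and functions (mappers) pass through.
    return { selector, out };
  });

  return { multiple, descriptors };
}

function normalizeExtractMap(map) {
  return Object.fromEntries(
    Object.entries(map).map(([key, value]) => [key, normalizeExtractValue(value)])
  );
}

const normalized = normalizeExtractMap({
  title: "title",
  links: [{ selector: "a", out: "href" }],
  posts: [{ selector: ".post", out: { title: ".title", href: { out: "href" } } }],
});

console.log(JSON.stringify(normalized.title));
// {"multiple":false,"descriptors":[{"selector":"title","out":"textContent"}]}
```

Normalising up front would keep the actual query step small: it only has to handle `{ selector, out }` descriptors, where `out` is a property name, a function, or a nested map. Merging the results of several descriptors in document order, the complication noted above, is not covered by this sketch.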