Add an extract method #2523

@fb55

One common use case for cheerio is to extract multiple values from a document and store them in an object. Doing this manually is currently cumbersome. Several packages built on top of cheerio improve on it, for example https://github.com/matthewmueller/x-ray and https://github.com/IonicaBizau/scrape-it. Commercial scraping providers also offer bulk extraction: https://www.scrapingbee.com/documentation/data-extraction/

We should add an API to make this use case easier. The API should be along the lines of:

$.extract({
  // To get the `textContent` of an element, we can just supply a selector
  title: "title",

  // If we want to get results for more than a single element, we can use an array
  headings: ["h1, h2, h3, h4, h5, h6"],
  // If an array has more than one child, all of them will be queried for.
  // Complication: This should follow document-order.
  headings: ["h1", "h2", "h3", "h4", "h5", "h6"], // (equivalent to the above)

  // To get a property other than the `textContent`, we can pass a string to `out`. This will be passed to the `prop` method.
  links: [{ selector: "a", out: "href" }],

  // We can map over the received elements to customise the behaviour
  links: [
    {
      selector: "a",
      out(el, key, obj) {
        const $el = $(el);
        return { name: $el.text(), href: $el.prop("href") };
      },
    },
  ],

  // To get nested elements, we can pass a nested extract object. Selectors inside the nested extract object will be relative to the current scope.
  posts: [
    {
      selector: ".post",
      out: {
        title: ".title",
        body: ".body",
        link: { selector: "a:has(> .title)", out: "href" },
      },
    },
  ],

  // We can skip the selector in nested extract objects to reference the current scope.
  links: [
    {
      selector: ".post a",
      out: {
        href: { out: "href" },
        name: { out: "textContent" },
        // Equivalent: get the text content of the current scope
        name: "*",
      },
    },
  ],
});
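
To make the shorthand forms above concrete, here is a minimal sketch of how they might be normalised into one canonical shape before any querying happens. Everything in it is an assumption made for illustration: `normalizeEntry`, `normalizeOut`, `normalizeMap` and the `{ selector, out, multiple }` descriptor are hypothetical names, not part of cheerio. It also assumes that all items of an array share the `out` of the first item, and that merging array items into a single selector list is an acceptable way to get results in document order.

```javascript
// Hypothetical helpers: none of these names exist in cheerio.

// Normalise the `out` option: default to textContent, keep strings and
// functions as-is, and treat anything else as a nested extract map.
function normalizeOut(out) {
  if (out === undefined) return "textContent";
  if (typeof out === "string" || typeof out === "function") return out;
  return normalizeMap(out);
}

// Normalise one value of the extract map into { selector, out, multiple }.
function normalizeEntry(entry) {
  if (Array.isArray(entry)) {
    // An array means "collect every match". Items are merged into one
    // selector list so a single query can return them in document order.
    // Assumption: every item uses the same `out` as the first item.
    const items = entry.map((item) => normalizeEntry(item));
    const selectors = items.map((item) => item.selector).filter(Boolean);
    return {
      selector: selectors.length > 0 ? selectors.join(", ") : null,
      out: items.length > 0 ? items[0].out : "textContent",
      multiple: true,
    };
  }
  if (typeof entry === "string") {
    // A bare selector extracts the text content of the first match.
    return { selector: entry, out: "textContent", multiple: false };
  }
  // A descriptor object; a missing selector refers to the current scope.
  return {
    selector: entry.selector ?? null,
    out: normalizeOut(entry.out),
    multiple: false,
  };
}

// Normalise every key of an extract map.
function normalizeMap(map) {
  return Object.fromEntries(
    Object.entries(map).map(([key, value]) => [key, normalizeEntry(value)])
  );
}
```

Under these assumptions, `normalizeMap({ headings: ["h1", "h2"] })` and `normalizeMap({ headings: ["h1, h2"] })` produce the same descriptor, which matches the equivalence stated in the proposal. An actual implementation would still need to decide how arrays whose items have different `out` values should behave.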
