Improve microdata parsing when itemprop contains multiple space-separated properties #26

@iaincollins

Thanks for creating such a great project!

I ran into a bug parsing microdata where an itemprop attribute contains multiple space-separated properties, as in the examples below, and thought I'd share what I found:

<meta data-rh="true" property="article:published" itemprop="datePublished dateCreated" content="2019-07-21T09:00:06.000Z"/>
<span itemProp="publisher copyrightHolder provider sourceOrganization" itemscope="" itemType="http://schema.org/NewsMediaOrganization" itemID="https://www.nytimes.com">
<figure itemprop="associatedMedia image" itemscope itemtype="http://schema.org/ImageObject" data-component="image" class="element element-image img--landscape  fig--narrow-caption fig--has-shares " data-media-id="f82028d62b1edd7417d7d3773c4abf0d4fa86174" id="img-3">
  <meta itemprop="url" content="https://i.guim.co.uk/img/media/f82028d62b1edd7417d7d3773c4abf0d4fa86174/0_272_6435_3861/master/6435.jpg?width=700&amp;quality=85&amp;auto=format&amp;fit=max&amp;s=016df6a3f33eabe3cbca39eb389a60fb">
</figure>

Markup like this is parsed correctly by Google's Structured Data Testing Tool, but web-auto-extractor does not currently split itemprop values on whitespace.
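
To illustrate the splitting I mean, here is a minimal sketch. `splitItemprop` is a hypothetical helper name, not an existing web-auto-extractor function, and the sketch assumes itemprop should be treated as a set of unique tokens separated by ASCII whitespace:

```javascript
// Hypothetical helper: turn an itemprop attribute value into a list of
// property names. Splits on ASCII whitespace (space, tab, LF, FF, CR),
// drops empty tokens, and removes duplicates while keeping first-seen order.
function splitItemprop(value) {
  const tokens = String(value).split(/[ \t\n\f\r]+/).filter(Boolean)
  return [...new Set(tokens)]
}

const names = splitItemprop('publisher copyrightHolder  provider sourceOrganization')
console.log(names)
```

Each resulting name would then receive the same parsed value, rather than the whole attribute string being used as a single key.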

I worked around this in a project that uses web-auto-extractor by post-processing the parsed result like this:

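// Workaround: replace each microdata key that holds several
// whitespace-separated itemprop names with one key per name.
// Note: this mutates (and returns) the object passed in.
const __transformStructuredData = (structuredData) => {
  const result = structuredData
  // Guard against results that contain no microdata at all.
  Object.keys(result.microdata || {}).forEach(schema => {
    result.microdata[schema].forEach(object => {
      // Object.keys returns a snapshot, so adding and deleting keys
      // inside the loop is safe.
      Object.keys(object).forEach(key => {
        // Match any whitespace (tabs, newlines), not only a single space.
        if (/\s/.test(key)) {
          // filter(Boolean) drops empty names from leading or repeated whitespace.
          key.split(/\s+/).filter(Boolean).forEach(newKey => {
            object[newKey] = object[key]
          })
          delete object[key]
        }
      })
    })
  })
  return result
}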
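
The workaround above only looks at the top-level keys of each item, so a nested item (such as the ImageObject inside the figure example) would keep any combined keys of its own. A recursive variant could cover that case. This is only a sketch: `splitKeysDeep` is a hypothetical name, and the sample object is made up for illustration rather than taken from real web-auto-extractor output.

```javascript
// Sketch: return a copy of a parsed value in which every object key that
// contains whitespace-separated names is replaced by one key per name.
// Recurses into nested objects and arrays; does not mutate its input.
function splitKeysDeep(value) {
  if (Array.isArray(value)) return value.map(splitKeysDeep)
  if (value === null || typeof value !== 'object') return value

  const out = {}
  for (const [key, child] of Object.entries(value)) {
    const names = key.split(/\s+/).filter(Boolean)
    const converted = splitKeysDeep(child)
    // A key with no usable name (empty or whitespace only) is kept as-is.
    for (const name of names.length > 0 ? names : [key]) {
      out[name] = converted
    }
  }
  return out
}

// Made-up sample input, loosely modelled on the markup in this issue.
const item = {
  'datePublished dateCreated': '2019-07-21T09:00:06.000Z',
  'associatedMedia image': { 'url contentUrl': 'https://example.com/a.jpg' }
}
console.log(splitKeysDeep(item))
```

Names that came from the same combined key share a single converted value, so later code that mutates one of them would affect the others. If two keys produce the same name, the later one wins.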

I'm aware there are other open PRs related to whitespace trimming.

If an enhancement like this appeals, I'd be happy to raise a PR.
