[fandom/wikimedia] KeyError: 'metadata' caused by non-existent files #8388

@ClosedPort22

I have occasionally run into this error when scraping several Fandom wikis lately:

[fandom][error] An unexpected error occurred: KeyError - 'metadata'. Please run gallery-dl again with the --verbose flag, copy its output and report this issue on https://github.com/mikf/gallery-dl/issues .
[fandom][debug] 
Traceback (most recent call last):
  File "/home/[redacted]/programming/gallery-dl/gallery_dl/job.py", line 153, in run
    for msg in extractor:
               ^^^^^^^^^
  File "/home/[redacted]/programming/gallery-dl/gallery_dl/extractor/wikimedia.py", line 107, in items
    self.prepare_image(image)
    ~~~~~~~~~~~~~~~~~~^^^^^^^
  File "/home/[redacted]/programming/gallery-dl/gallery_dl/extractor/wikimedia.py", line 85, in prepare_image
    for m in image["metadata"] or ()}
             ~~~~~^^^^^^^^^^^^
KeyError: 'metadata'

Upon closer inspection, this appears to be expected behavior, though it is documented only in the MediaWiki 1.34 release notes:

In the response to queries that use prop=imageinfo, entries for non-existing files (indicated by the filemissing field) now omit the following fields, since they are meaningless in this context: timestamp, userhidden, user, userid, anon, size, width, height, pagecount, duration, commenthidden, parsedcomment, comment, thumburl, thumbwidth, thumbheight, thumbmime, thumberror, url, sha1, metadata, extmetadata, commonmetadata, mime, mediatype, bitdepth. Clients that process these fields should first check if filemissing is set. Fields that are supported even if the file is missing include: canonicaltitle, archivename (deleted files only), descriptionurl, descriptionshorturl.
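To illustrate the "check filemissing first" advice, here is a minimal standalone sketch. The helper name `flatten_metadata` and the sample entries are made up for illustration and are not part of gallery-dl; the only assumption about the API is what the release note states, namely that entries for missing files carry a filemissing key and omit metadata.

```python
def flatten_metadata(image):
    """Return an imageinfo entry's metadata as a dict.

    Returns None for entries flagged as missing, since the release note
    says 'metadata' (among other fields) is omitted for them.
    Hypothetical helper, not gallery-dl code.
    """
    if "filemissing" in image:
        return None
    # 'metadata' may be present but null, hence the 'or ()' fallback
    return {m["name"]: m["value"] for m in image.get("metadata") or ()}


# Made-up sample entries shaped like imageinfo results
existing = {
    "url": "https://example.org/a.png",
    "metadata": [{"name": "frameCount", "value": 0}],
}
missing = {"filemissing": "", "canonicaltitle": "File:A.png"}

print(flatten_metadata(existing))
print(flatten_metadata(missing))
```

Testing for the key rather than its value is deliberate here: depending on the response format, flag fields may be an empty string or a boolean.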

So far I have only seen this happen with image-revisions set to a value greater than 1. Here is the quick fix I have been using personally; it relies on my observation that entries with filemissing set always seem to be returned as the last element of imageinfo:

diff --git a/gallery_dl/extractor/wikimedia.py b/gallery_dl/extractor/wikimedia.py
index 2e8136f1..a103a06b 100644
--- a/gallery_dl/extractor/wikimedia.py
+++ b/gallery_dl/extractor/wikimedia.py
@@ -104,6 +104,12 @@ class WikimediaExtractor(BaseExtractor):
             yield Message.Directory, info
 
             for info["num"], image in enumerate(images, 1):
+                # https://www.mediawiki.org/wiki/Release_notes/1.34
+                if "filemissing" in image:
+                    self.log.warning(
+                        "File %s (or its revision) is missing",
+                        image["canonicaltitle"].partition(":")[2])
+                    continue
                 self.prepare_image(image)
                 image.update(info)
                 yield Message.Url, image["url"], image

This would break the continuity of the sequence number (info["num"]) if an invalid entry appeared in the middle of imageinfo, and I'm not sure whether that would be acceptable.
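If gaps in the numbering turn out to be unacceptable, one possible alternative is to drop the missing entries before enumerating, so the numbers stay contiguous. The following is only a standalone sketch of that idea under the same filemissing assumption; `number_existing` and the sample list are hypothetical and do not reflect how the extractor is actually structured.

```python
def number_existing(images):
    """Pair each existing imageinfo entry with a contiguous 1-based number.

    Entries carrying a 'filemissing' key are skipped before numbering,
    so a missing entry in the middle leaves no gap.
    Hypothetical helper, not gallery-dl code.
    """
    existing = (img for img in images if "filemissing" not in img)
    return list(enumerate(existing, 1))


# Made-up sample with a missing revision in the middle
revisions = [
    {"url": "https://example.org/rev3.png"},
    {"filemissing": "", "canonicaltitle": "File:Example.png"},
    {"url": "https://example.org/rev1.png"},
]

for num, image in number_existing(revisions):
    print(num, image["url"])
```

The trade-off is that num then no longer corresponds to the entry's position in the raw imageinfo list, which may or may not matter for existing filename formats.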
