[Bug] identify_bytes gives different results when running in Docker vs native #983

@afourney

Description

Well, this is an odd bug, but I thought I would report it since it breaks some of our GitHub CI tests.

The following code gives different results when running in a Docker environment vs. natively in Ubuntu:

import magika
from pathlib import Path

test_file = "test_mskanji.csv"

with open(test_file, "rb") as f:
    m = magika.Magika()

    test_bytes = f.read()

    print("Identify path:")
    print(m.identify_path(Path(test_file)))

    print("\n\nIdentify bytes:")
    print(m.identify_bytes(test_bytes))

Natively I get what you would expect: path and bytes yield the same result.

Identify path:
MagikaResult(path=test_mskanji.csv, status=ok, prediction=MagikaPrediction(dl=ContentTypeInfo(label=csv, mime_type='text/csv', group='code', description='CSV document', extensions=['csv'], is_text=True), output=ContentTypeInfo(label=csv, mime_type='text/csv', group='code', description='CSV document', extensions=['csv'], is_text=True), score=0.9990027546882629, overwrite_reason=<OverwriteReason.NONE: 'none'>))


Identify bytes:
MagikaResult(path=-, status=ok, prediction=MagikaPrediction(dl=ContentTypeInfo(label=csv, mime_type='text/csv', group='code', description='CSV document', extensions=['csv'], is_text=True), output=ContentTypeInfo(label=csv, mime_type='text/csv', group='code', description='CSV document', extensions=['csv'], is_text=True), score=0.9990027546882629, overwrite_reason=<OverwriteReason.NONE: 'none'>))

In Docker I get a random / unknown result. In the output below, both identify_path and identify_bytes are affected:

Identify path:
MagikaResult(path=/app/packages/test_mskanji.csv, status=ok, prediction=MagikaPrediction(dl=ContentTypeInfo(label=randombytes, mime_type='application/octet-stream', group='unknown', description='Random bytes', extensions=[], is_text=False), output=ContentTypeInfo(label=unknown, mime_type='application/octet-stream', group='unknown', description='Unknown binary data', extensions=[], is_text=False), score=0.9977052807807922, overwrite_reason=<OverwriteReason.OVERWRITE_MAP: 'overwrite-map'>))


Identify bytes:
MagikaResult(path=-, status=ok, prediction=MagikaPrediction(dl=ContentTypeInfo(label=randombytes, mime_type='application/octet-stream', group='unknown', description='Random bytes', extensions=[], is_text=False), output=ContentTypeInfo(label=unknown, mime_type='application/octet-stream', group='unknown', description='Unknown binary data', extensions=[], is_text=False), score=0.9977052807807922, overwrite_reason=<OverwriteReason.OVERWRITE_MAP: 'overwrite-map'>))

This is also the case when running as an action in the GitHub CI.
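
The two outputs above are long, so a compact side-by-side summary can make the differences easier to spot in CI logs. The following is a minimal sketch, not part of the original report. The attribute names (`prediction.dl.label`, `prediction.output.label`, `prediction.score`, `prediction.overwrite_reason`) are inferred from the printed `MagikaResult` repr and are an assumption about the installed Magika version. `fake_result` is a hypothetical stand-in so the sketch runs without Magika, and its values are copied from the outputs in this report.

```python
from types import SimpleNamespace


def summarize(result) -> dict:
    """Reduce a MagikaResult-like object to the fields worth comparing.

    Assumption: the attribute layout mirrors the repr shown in this report.
    """
    pred = result.prediction
    return {
        "dl_label": pred.dl.label,
        "output_label": pred.output.label,
        "score": round(pred.score, 4),
        "overwrite_reason": str(pred.overwrite_reason),
    }


def diff_summaries(a: dict, b: dict) -> dict:
    """Return only the fields whose values differ, as (a, b) pairs."""
    return {key: (a[key], b[key]) for key in a if a[key] != b[key]}


def fake_result(dl_label, output_label, score, overwrite_reason):
    """Hypothetical stand-in for a MagikaResult, used only for illustration."""
    return SimpleNamespace(
        prediction=SimpleNamespace(
            dl=SimpleNamespace(label=dl_label),
            output=SimpleNamespace(label=output_label),
            score=score,
            overwrite_reason=overwrite_reason,
        )
    )


if __name__ == "__main__":
    # Values copied from the native and Docker outputs in this report.
    native = fake_result("csv", "csv", 0.9990027546882629, "none")
    docker = fake_result("randombytes", "unknown", 0.9977052807807922, "overwrite-map")
    for field, (n, d) in diff_summaries(summarize(native), summarize(docker)).items():
        print(f"{field}: native={n!r} docker={d!r}")
```

With a real installation, `summarize(m.identify_bytes(test_bytes))` could be printed in each environment and the two dictionaries compared by hand or with `diff_summaries`.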

The test file I am using is available here:
https://github.com/microsoft/markitdown/raw/refs/heads/main/packages/markitdown/tests/test_files/test_mskanji.csv
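
One thing that may be worth ruling out is whether both environments actually see the same bytes for this file (for example, a checkout step could rewrite line endings). The report does not establish this as the cause; the sketch below is only a diagnostic aid. `fingerprint` is a hypothetical helper that uses the standard library only, and the file name is the one from the snippet above.

```python
import hashlib
from pathlib import Path


def fingerprint(data: bytes) -> dict:
    """Summarize a byte string so two environments can be compared."""
    return {
        "size": len(data),
        "sha256": hashlib.sha256(data).hexdigest(),
        # CRLF pairs are a common casualty of line-ending conversion.
        "crlf_count": data.count(b"\r\n"),
        # Bytes >= 0x80 indicate non-ASCII content such as Shift JIS text.
        "non_ascii_count": sum(1 for b in data if b >= 0x80),
    }


if __name__ == "__main__":
    path = Path("test_mskanji.csv")
    if path.exists():
        print(fingerprint(path.read_bytes()))
    else:
        print(f"{path} not found; run this next to the test file")
```

If the fingerprints match natively and in Docker, the input can be excluded and the difference lies elsewhere in the environment.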

Labels

add-to-faq, assets, bug