Closed
Labels: add-to-faq, assets (related to assets such as models, content types knowledge base), bug (something isn't working)
Description
Well, this is an odd bug, but I thought I would report it since it breaks some of our GitHub CI tests.
The following code gives different results when running in a Docker environment vs. natively in Ubuntu:
    from pathlib import Path

    import magika

    test_file = "test_mskanji.csv"

    m = magika.Magika()

    # Read the raw bytes so the same content can be passed to identify_bytes().
    with open(test_file, "rb") as f:
        test_bytes = f.read()

    print("Identify path:")
    print(m.identify_path(Path(test_file)))

    print("\nIdentify bytes:")
    print(m.identify_bytes(test_bytes))

Natively, I get what you would expect: path and bytes yield the same result.
Identify path:
MagikaResult(path=test_mskanji.csv, status=ok, prediction=MagikaPrediction(dl=ContentTypeInfo(label=csv, mime_type='text/csv', group='code', description='CSV document', extensions=['csv'], is_text=True), output=ContentTypeInfo(label=csv, mime_type='text/csv', group='code', description='CSV document', extensions=['csv'], is_text=True), score=0.9990027546882629, overwrite_reason=<OverwriteReason.NONE: 'none'>))
Identify bytes:
MagikaResult(path=-, status=ok, prediction=MagikaPrediction(dl=ContentTypeInfo(label=csv, mime_type='text/csv', group='code', description='CSV document', extensions=['csv'], is_text=True), output=ContentTypeInfo(label=csv, mime_type='text/csv', group='code', description='CSV document', extensions=['csv'], is_text=True), score=0.9990027546882629, overwrite_reason=<OverwriteReason.NONE: 'none'>))
In Docker I get a random / unknown result, but only for identify bytes:
Identify path:
MagikaResult(path=/app/packages/test_mskanji.csv, status=ok, prediction=MagikaPrediction(dl=ContentTypeInfo(label=randombytes, mime_type='application/octet-stream', group='unknown', description='Random bytes', extensions=[], is_text=False), output=ContentTypeInfo(label=unknown, mime_type='application/octet-stream', group='unknown', description='Unknown binary data', extensions=[], is_text=False), score=0.9977052807807922, overwrite_reason=<OverwriteReason.OVERWRITE_MAP: 'overwrite-map'>))
Identify bytes:
MagikaResult(path=-, status=ok, prediction=MagikaPrediction(dl=ContentTypeInfo(label=randombytes, mime_type='application/octet-stream', group='unknown', description='Random bytes', extensions=[], is_text=False), output=ContentTypeInfo(label=unknown, mime_type='application/octet-stream', group='unknown', description='Unknown binary data', extensions=[], is_text=False), score=0.9977052807807922, overwrite_reason=<OverwriteReason.OVERWRITE_MAP: 'overwrite-map'>))
This is also the case when running as an action in the GitHub CI.
The test file I am using is available here:
https://github.com/microsoft/markitdown/raw/refs/heads/main/packages/markitdown/tests/test_files/test_mskanji.csv