The tokenized results of the SentencePiece Java lib and Python lib are different #999

@thinkzhou

Description

I am using your Java lib and the original Python lib to load the xlm-roberta-base model and tokenize sentences, and I find that the Java and Python results are different. It looks like the Java lib handles emoji (e.g. 👋) incorrectly. Could this be a bug?

Expected Behavior

The tokenized results from the Java lib and the Python lib should be the same.

Error Message

No Error Message

How to Reproduce?

Java code:

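import ai.djl.sentencepiece.SpTokenizer;

import java.io.IOException;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;

public class TokenizeEmoji {
    public static void main(String[] args) {
        Path modelPath = Paths.get("path/to/sentencepiece.model");
        try (SpTokenizer tokenizer = new SpTokenizer(modelPath)) {
            // Two waving-hand emoji (U+1F44B), written as UTF-16 surrogate pairs.
            String s = "\uD83D\uDC4B\uD83D\uDC4B";
            List<String> tokens = tokenizer.tokenize(s);
            System.out.println(tokens);
        } catch (IOException exception) {
            exception.printStackTrace();
        }
    }
}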

Result:
[▁, ������������]

Python Code:

import sentencepiece as spm
processor = spm.SentencePieceProcessor(model_file="path/to/sentencepiece.model")
print(processor.tokenize("👋👋", out_type=str))

Result:
['▁', '👋', '👋']

Steps to reproduce

  1. Run the Java code and the Python code above.
  2. Compare the tokenized results.

What have you tried to solve it?

  1. https://stackoverflow.com/questions/32205446/getting-true-utf-8-characters-in-java-jni
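
One possible explanation, in line with the Stack Overflow question above but not confirmed against the DJL native code: JNI's GetStringUTFChars returns modified UTF-8, in which a supplementary character such as 👋 is encoded as two 3-byte surrogate sequences (6 bytes in total) instead of the 4-byte standard UTF-8 form. If the native tokenizer received such bytes, it would see input that is not valid standard UTF-8, which would be consistent with the run of replacement characters in the Java output. The sketch below only illustrates the byte-length difference in plain Java. The class and method names are hypothetical, and DataOutputStream.writeUTF is used as a stand-in for the JNI encoding.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper, not part of DJL: compares standard and modified UTF-8.
class Utf8Check {

    /** Encodes the string as standard UTF-8 (what the Python side passes to SentencePiece). */
    static byte[] standardUtf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    /**
     * Encodes the string as modified UTF-8 using DataOutputStream.writeUTF.
     * Assumption: this mirrors what JNI's GetStringUTFChars would hand to native code.
     */
    static byte[] modifiedUtf8(String s) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            out.writeUTF(s);
        } catch (IOException e) {
            // Not expected for an in-memory stream.
            throw new UncheckedIOException(e);
        }
        byte[] all = buffer.toByteArray();
        // writeUTF prefixes the data with a 2-byte length; drop it.
        return Arrays.copyOfRange(all, 2, all.length);
    }

    public static void main(String[] args) {
        String s = "\uD83D\uDC4B\uD83D\uDC4B"; // two waving-hand emoji
        System.out.println(standardUtf8(s).length); // prints 8
        System.out.println(modifiedUtf8(s).length); // prints 12
    }
}
```

If this is indeed the cause, passing the bytes from String.getBytes(StandardCharsets.UTF_8) across the JNI boundary as a byte array, rather than relying on GetStringUTFChars, is the approach discussed in the linked question.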

Environment Info

Please run the command ./gradlew debugEnv from the root directory of DJL (if necessary, clone DJL first). It will output information about your system, environment, and installation that can help us debug your issue. Paste the output of the command below:

----------- System Properties -----------
sun.cpu.isalist:
ftp.nonProxyHosts: local|*.local|169.254/16|*.169.254/16
socksNonProxyHosts: local|*.local|169.254/16|*.169.254/16
sun.io.unicode.encoding: UnicodeBig
sun.cpu.endian: little
java.vendor.url.bug: http://bugreport.sun.com/bugreport/
file.separator: /
java.vendor: Oracle Corporation
sun.boot.class.path: /Library/Java/JavaVirtualMachines/jdk1.8.0_171.jdk/Contents/Home/jre/lib/resources.jar:/Library/Java/JavaVirtualMachines/jdk1.8.0_171.jdk/Contents/Home/jre/lib/rt.jar:/Library/Java/JavaVirtualMachines/jdk1.8.0_171.jdk/Contents/Home/jre/lib/sunrsasign.jar:/Library/Java/JavaVirtualMachines/jdk1.8.0_171.jdk/Contents/Home/jre/lib/jsse.jar:/Library/Java/JavaVirtualMachines/jdk1.8.0_171.jdk/Contents/Home/jre/lib/jce.jar:/Library/Java/JavaVirtualMachines/jdk1.8.0_171.jdk/Contents/Home/jre/lib/charsets.jar:/Library/Java/JavaVirtualMachines/jdk1.8.0_171.jdk/Contents/Home/jre/lib/jfr.jar:/Library/Java/JavaVirtualMachines/jdk1.8.0_171.jdk/Contents/Home/jre/classes
java.ext.dirs: /Users/zhouyang/Library/Java/Extensions:/Library/Java/JavaVirtualMachines/jdk1.8.0_171.jdk/Contents/Home/jre/lib/ext:/Library/Java/Extensions:/Network/Library/Java/Extensions:/System/Library/Java/Extensions:/usr/lib/java
java.version: 1.8.0_171
java.vm.info: mixed mode
awt.toolkit: sun.lwawt.macosx.LWCToolkit
user.language: zh
java.specification.vendor: Oracle Corporation
sun.java.command: ai.djl.integration.util.DebugEnvironment
java.home: /Library/Java/JavaVirtualMachines/jdk1.8.0_171.jdk/Contents/Home/jre
sun.arch.data.model: 64
java.vm.specification.version: 1.8
java.class.path: /Users/zhouyang/work/github/djl/integration/build/classes/java/main:/Users/zhouyang/work/github/djl/integration/build/resources/main:/Users/zhouyang/.gradle/caches/modules-2/files-2.1/commons-cli/commons-cli/1.4/c51c00206bb913cd8612b24abd9fa98ae89719b1/commons-cli-1.4.jar:/Users/zhouyang/.gradle/caches/modules-2/files-2.1/org.apache.logging.log4j/log4j-slf4j-impl/2.13.3/7cca27a921a18645139cf651c04b83b1a19cfd76/log4j-slf4j-impl-2.13.3.jar:/Users/zhouyang/work/github/djl/basicdataset/build/libs/basicdataset-0.12.0-SNAPSHOT.jar:/Users/zhouyang/work/github/djl/model-zoo/build/libs/model-zoo-0.12.0-SNAPSHOT.jar:/Users/zhouyang/work/github/djl/testing/build/libs/testing-0.12.0-SNAPSHOT.jar:/Users/zhouyang/.gradle/caches/modules-2/files-2.1/org.testng/testng/7.1.0/b0bcea778fb2899aeb4014c558babea8833d180a/testng-7.1.0.jar:/Users/zhouyang/work/github/djl/mxnet/mxnet-model-zoo/build/libs/mxnet-model-zoo-0.12.0-SNAPSHOT.jar:/Users/zhouyang/.gradle/caches/modules-2/files-2.1/ai.djl.mxnet/mxnet-native-auto/1.8.0/e32265c03e27e1fb18c9c0904733b00f9acffaee/mxnet-native-auto-1.8.0.jar:/Users/zhouyang/work/github/djl/pytorch/pytorch-model-zoo/build/libs/pytorch-model-zoo-0.12.0-SNAPSHOT.jar:/Users/zhouyang/.gradle/caches/modules-2/files-2.1/ai.djl.pytorch/pytorch-native-auto/1.8.1/3cbb59c8b21c24cb368d296f6c4c6ef069d4d9b/pytorch-native-auto-1.8.1.jar:/Users/zhouyang/work/github/djl/tensorflow/tensorflow-model-zoo/build/libs/tensorflow-model-zoo-0.12.0-SNAPSHOT.jar:/Users/zhouyang/.gradle/caches/modules-2/files-2.1/ai.djl.tensorflow/tensorflow-native-auto/2.4.1/20b8c7a4e6d451e782d15dd30cebd4df0ad86c74/tensorflow-native-auto-2.4.1.jar:/Users/zhouyang/work/github/djl/mxnet/mxnet-engine/build/libs/mxnet-engine-0.12.0-SNAPSHOT.jar:/Users/zhouyang/work/github/djl/pytorch/pytorch-engine/build/libs/pytorch-engine-0.12.0-SNAPSHOT.jar:/Users/zhouyang/work/github/djl/tensorflow/tensorflow-engine/build/libs/tensorflow-engine-0.12.0-SNAPSHOT.jar:/Users/zhouyang/work/github/djl/api/
build/libs/api-0.12.0-SNAPSHOT.jar:/Users/zhouyang/.gradle/caches/modules-2/files-2.1/org.slf4j/slf4j-api/1.7.30/b5a4b6d16ab13e34a88fae84c35cd5d68cac922c/slf4j-api-1.7.30.jar:/Users/zhouyang/.gradle/caches/modules-2/files-2.1/org.apache.logging.log4j/log4j-core/2.13.3/4e857439fc4fe974d212adaaaa3b118b8b50e3ec/log4j-core-2.13.3.jar:/Users/zhouyang/.gradle/caches/modules-2/files-2.1/org.apache.logging.log4j/log4j-api/2.13.3/ec1508160b93d274b1add34419b897bae84c6ca9/log4j-api-2.13.3.jar:/Users/zhouyang/.gradle/caches/modules-2/files-2.1/org.apache.commons/commons-csv/1.8/37ca9a9aa2d4be2599e55506a6d3170dd7a3df4/commons-csv-1.8.jar:/Users/zhouyang/.gradle/caches/modules-2/files-2.1/com.beust/jcommander/1.72/6375e521c1e11d6563d4f25a07ce124ccf8cd171/jcommander-1.72.jar:/Users/zhouyang/.gradle/caches/modules-2/files-2.1/com.google.inject/guice/4.1.0/faf9ee8ac09eafd1128091426dd367a8c0085d55/guice-4.1.0-no_aop.jar:/Users/zhouyang/.gradle/caches/modules-2/files-2.1/org.yaml/snakeyaml/1.21/18775fdda48574784f40b47bf478ab0593f92e4d/snakeyaml-1.21.jar:/Users/zhouyang/.gradle/caches/modules-2/files-2.1/com.google.code.gson/gson/2.8.6/9180733b7df8542621dc12e21e87557e8c99b8cb/gson-2.8.6.jar:/Users/zhouyang/.gradle/caches/modules-2/files-2.1/net.java.dev.jna/jna/5.3.0/4654d1da02e4173ba7b64f7166378847db55448a/jna-5.3.0.jar:/Users/zhouyang/.gradle/caches/modules-2/files-2.1/org.apache.commons/commons-compress/1.20/b8df472b31e1f17c232d2ad78ceb1c84e00c641b/commons-compress-1.20.jar:/Users/zhouyang/.gradle/caches/modules-2/files-2.1/javax.inject/javax.inject/1/6975da39a7040257bd51d21a231b76c915872d38/javax.inject-1.jar:/Users/zhouyang/.gradle/caches/modules-2/files-2.1/aopalliance/aopalliance/1.0/235ba8b489512805ac13a8f9ea77a1ca5ebe3e8/aopalliance-1.0.jar:/Users/zhouyang/.gradle/caches/modules-2/files-2.1/com.google.guava/guava/19.0/6ce200f6b23222af3d8abb6b6459e6c44f4bb0e9/guava-19.0.jar:/Users/zhouyang/work/github/djl/tensorflow/tensorflow-api/build/libs/tensorflow-api-0.12.0-SNAPSHOT.jar:/
Users/zhouyang/.gradle/caches/modules-2/files-2.1/org.bytedeco/javacpp/1.5.5/92e1c31aaed15a3dc12008859a37ced45fa0b730/javacpp-1.5.5.jar:/Users/zhouyang/.gradle/caches/modules-2/files-2.1/org.tensorflow/tensorflow-core-api/0.3.1/954f292e85f4d2a587ede1b2e1a525e74ef96c97/tensorflow-core-api-0.3.1.jar:/Users/zhouyang/.gradle/caches/modules-2/files-2.1/com.google.protobuf/protobuf-java/3.8.0/b5f93103d113540bb848fe9ce4e6819b1f39ee49/protobuf-java-3.8.0.jar:/Users/zhouyang/.gradle/caches/modules-2/files-2.1/org.tensorflow/ndarray/0.3.1/3cdb825411a9de908cc3dac740f18628d6512260/ndarray-0.3.1.jar
user.name: zhouyang
ai.djl.logging.level: debug
file.encoding: UTF-8
java.specification.version: 1.8
java.awt.printerjob: sun.lwawt.macosx.CPrinterJob
user.timezone: Asia/Shanghai
user.home: /Users/zhouyang
library.jansi.path: /Users/zhouyang/.gradle/native/jansi/1.18/osx
http.nonProxyHosts: local|*.local|169.254/16|*.169.254/16
os.version: 10.15.7
sun.management.compiler: HotSpot 64-Bit Tiered Compilers
java.specification.name: Java Platform API Specification
java.class.version: 52.0
org.gradle.internal.http.connectionTimeout: 60000
java.library.path: /Users/zhouyang/Library/Java/Extensions:/Library/Java/Extensions:/Network/Library/Java/Extensions:/System/Library/Java/Extensions:/usr/lib/java:.
org.gradle.internal.publish.checksums.insecure: true
sun.jnu.encoding: UTF-8
os.name: Mac OS X
user.variant:
java.vm.specification.vendor: Oracle Corporation
org.gradle.appname: gradlew
java.io.tmpdir: /var/folders/zv/gqw522z179l_5zv1k2q7tblm0000gn/T/
line.separator:

java.endorsed.dirs: /Library/Java/JavaVirtualMachines/jdk1.8.0_171.jdk/Contents/Home/jre/lib/endorsed
os.arch: x86_64
java.awt.graphicsenv: sun.awt.CGraphicsEnvironment
java.runtime.version: 1.8.0_171-b11
java.vm.specification.name: Java Virtual Machine Specification
user.dir: /Users/zhouyang/work/github/djl/integration
org.gradle.internal.http.socketTimeout: 120000
user.country: CN
sun.java.launcher: SUN_STANDARD
sun.os.patch.level: unknown
java.vm.name: Java HotSpot(TM) 64-Bit Server VM
file.encoding.pkg: sun.io
path.separator: :
java.vm.vendor: Oracle Corporation
java.vendor.url: http://java.oracle.com/
gopherProxySet: false
sun.boot.library.path: /Library/Java/JavaVirtualMachines/jdk1.8.0_171.jdk/Contents/Home/jre/lib:/Library/Java/JavaVirtualMachines/jdk1.8.0_171.jdk/Contents/Home/jre/lib
java.vm.version: 25.171-b11
java.runtime.name: Java(TM) SE Runtime Environment


Labels: bug
