Closed
Labels
bug: Something isn't working
Description
I am using your Java library and the original Python library to load the xlm-roberta-base SentencePiece model and tokenize sentences, and the Java and Python results differ. It looks like the Java library handles emoji (e.g. 👋) incorrectly. Could this be a bug?
Expected Behavior
The tokenized results from the Java library and the Python library should be the same.
Error Message
No Error Message
How to Reproduce?
Java code:
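import ai.djl.sentencepiece.SpTokenizer;
import java.io.IOException;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;

public class EmojiTokenizeRepro { // wrapper class name chosen for this report
    public static void main(String[] args) {
        Path modelPath = Paths.get("path/to/sentencepiece.model");
        try (SpTokenizer tokenizer = new SpTokenizer(modelPath)) {
            // Two U+1F44B (waving hand) emoji, written as UTF-16 surrogate pairs
            String s = "\uD83D\uDC4B\uD83D\uDC4B";
            List<String> tokens = tokenizer.tokenize(s);
            System.out.println(tokens);
        } catch (IOException exception) {
            exception.printStackTrace();
        }
    }
}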
Result:
[▁, ������������]
Python code:
import sentencepiece as spm
processor = spm.SentencePieceProcessor(model_file="path/to/sentencepiece.model")
print(processor.tokenize("👋👋", out_type=str))
Result:
['▁', '👋', '👋']
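The report does not identify a cause. One possibility, offered here only as an assumption, is that the string reaches the native tokenizer in Java's "modified UTF-8" (the encoding used by `DataOutputStream.writeUTF` and by JNI string functions) rather than standard UTF-8. The sketch below uses only JDK classes, not DJL, and the class and method names are hypothetical. It shows that the two encodings produce different byte counts for the same input, which may be related to the run of replacement characters in the Java output above.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

public class EmojiEncodingSketch {

    /** Number of bytes the string occupies in standard UTF-8. */
    static int standardUtf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    /** Number of bytes the string occupies in Java's modified UTF-8. */
    static int modifiedUtf8Length(String s) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            // writeUTF emits a 2-byte length prefix followed by modified UTF-8,
            // which encodes each UTF-16 surrogate separately as 3 bytes.
            out.writeUTF(s);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.size() - 2; // drop the length prefix
    }

    public static void main(String[] args) {
        String s = "\uD83D\uDC4B\uD83D\uDC4B"; // same input as in the report
        System.out.println("code points:    " + s.codePointCount(0, s.length()));
        System.out.println("standard UTF-8: " + standardUtf8Length(s) + " bytes");
        System.out.println("modified UTF-8: " + modifiedUtf8Length(s) + " bytes");
    }
}
```

For this input the sketch reports 2 code points, 8 bytes in standard UTF-8, and 12 bytes in modified UTF-8. If the native side expects standard UTF-8, the surrogate-encoded bytes would not decode as valid characters. Whether this is what actually happens inside the library is not confirmed by the report.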
Steps to reproduce
- Run the Java code and the Python code.
- Compare the tokenized results.
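When the output contains unprintable or replacement characters, comparing by eye is unreliable. A minimal sketch of the comparison step follows, using only JDK classes. The class name, the helper names, and the placeholder `actual` list are hypothetical; the real Java output should be substituted for the placeholder.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class TokenDiffSketch {

    /** Renders a token as its Unicode code points, e.g. "U+2581". */
    static String codePoints(String token) {
        return token.codePoints()
                .mapToObj(cp -> String.format("U+%04X", cp))
                .collect(Collectors.joining(" "));
    }

    /** Returns the index of the first differing token, or -1 if the lists are equal. */
    static int firstMismatch(List<String> expected, List<String> actual) {
        int shared = Math.min(expected.size(), actual.size());
        for (int i = 0; i < shared; i++) {
            if (!expected.get(i).equals(actual.get(i))) {
                return i;
            }
        }
        return expected.size() == actual.size() ? -1 : shared;
    }

    public static void main(String[] args) {
        // Python output from the report: '▁', '👋', '👋'
        List<String> expected = Arrays.asList("\u2581", "\uD83D\uDC4B", "\uD83D\uDC4B");
        // Placeholder for the Java output; the exact characters are not known
        List<String> actual = Arrays.asList("\u2581", "\uFFFD\uFFFD");
        int i = firstMismatch(expected, actual);
        System.out.println("first mismatch at index " + i);
        System.out.println("expected: " + codePoints(expected.get(i)));
        System.out.println("actual:   " + codePoints(actual.get(i)));
    }
}
```

Printing code points rather than the tokens themselves makes the difference visible even when the console cannot render the characters.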
What have you tried to solve it?
Environment Info
Please run the command ./gradlew debugEnv from the root directory of DJL (if necessary, clone DJL first). It will output information about your system, environment, and installation that can help us debug your issue. Paste the output of the command below:
----------- System Properties -----------
sun.cpu.isalist:
ftp.nonProxyHosts: local|*.local|169.254/16|*.169.254/16
socksNonProxyHosts: local|*.local|169.254/16|*.169.254/16
sun.io.unicode.encoding: UnicodeBig
sun.cpu.endian: little
java.vendor.url.bug: http://bugreport.sun.com/bugreport/
file.separator: /
java.vendor: Oracle Corporation
sun.boot.class.path: /Library/Java/JavaVirtualMachines/jdk1.8.0_171.jdk/Contents/Home/jre/lib/resources.jar:/Library/Java/JavaVirtualMachines/jdk1.8.0_171.jdk/Contents/Home/jre/lib/rt.jar:/Library/Java/JavaVirtualMachines/jdk1.8.0_171.jdk/Contents/Home/jre/lib/sunrsasign.jar:/Library/Java/JavaVirtualMachines/jdk1.8.0_171.jdk/Contents/Home/jre/lib/jsse.jar:/Library/Java/JavaVirtualMachines/jdk1.8.0_171.jdk/Contents/Home/jre/lib/jce.jar:/Library/Java/JavaVirtualMachines/jdk1.8.0_171.jdk/Contents/Home/jre/lib/charsets.jar:/Library/Java/JavaVirtualMachines/jdk1.8.0_171.jdk/Contents/Home/jre/lib/jfr.jar:/Library/Java/JavaVirtualMachines/jdk1.8.0_171.jdk/Contents/Home/jre/classes
java.ext.dirs: /Users/zhouyang/Library/Java/Extensions:/Library/Java/JavaVirtualMachines/jdk1.8.0_171.jdk/Contents/Home/jre/lib/ext:/Library/Java/Extensions:/Network/Library/Java/Extensions:/System/Library/Java/Extensions:/usr/lib/java
java.version: 1.8.0_171
java.vm.info: mixed mode
awt.toolkit: sun.lwawt.macosx.LWCToolkit
user.language: zh
java.specification.vendor: Oracle Corporation
sun.java.command: ai.djl.integration.util.DebugEnvironment
java.home: /Library/Java/JavaVirtualMachines/jdk1.8.0_171.jdk/Contents/Home/jre
sun.arch.data.model: 64
java.vm.specification.version: 1.8
java.class.path: /Users/zhouyang/work/github/djl/integration/build/classes/java/main:/Users/zhouyang/work/github/djl/integration/build/resources/main:/Users/zhouyang/.gradle/caches/modules-2/files-2.1/commons-cli/commons-cli/1.4/c51c00206bb913cd8612b24abd9fa98ae89719b1/commons-cli-1.4.jar:/Users/zhouyang/.gradle/caches/modules-2/files-2.1/org.apache.logging.log4j/log4j-slf4j-impl/2.13.3/7cca27a921a18645139cf651c04b83b1a19cfd76/log4j-slf4j-impl-2.13.3.jar:/Users/zhouyang/work/github/djl/basicdataset/build/libs/basicdataset-0.12.0-SNAPSHOT.jar:/Users/zhouyang/work/github/djl/model-zoo/build/libs/model-zoo-0.12.0-SNAPSHOT.jar:/Users/zhouyang/work/github/djl/testing/build/libs/testing-0.12.0-SNAPSHOT.jar:/Users/zhouyang/.gradle/caches/modules-2/files-2.1/org.testng/testng/7.1.0/b0bcea778fb2899aeb4014c558babea8833d180a/testng-7.1.0.jar:/Users/zhouyang/work/github/djl/mxnet/mxnet-model-zoo/build/libs/mxnet-model-zoo-0.12.0-SNAPSHOT.jar:/Users/zhouyang/.gradle/caches/modules-2/files-2.1/ai.djl.mxnet/mxnet-native-auto/1.8.0/e32265c03e27e1fb18c9c0904733b00f9acffaee/mxnet-native-auto-1.8.0.jar:/Users/zhouyang/work/github/djl/pytorch/pytorch-model-zoo/build/libs/pytorch-model-zoo-0.12.0-SNAPSHOT.jar:/Users/zhouyang/.gradle/caches/modules-2/files-2.1/ai.djl.pytorch/pytorch-native-auto/1.8.1/3cbb59c8b21c24cb368d296f6c4c6ef069d4d9b/pytorch-native-auto-1.8.1.jar:/Users/zhouyang/work/github/djl/tensorflow/tensorflow-model-zoo/build/libs/tensorflow-model-zoo-0.12.0-SNAPSHOT.jar:/Users/zhouyang/.gradle/caches/modules-2/files-2.1/ai.djl.tensorflow/tensorflow-native-auto/2.4.1/20b8c7a4e6d451e782d15dd30cebd4df0ad86c74/tensorflow-native-auto-2.4.1.jar:/Users/zhouyang/work/github/djl/mxnet/mxnet-engine/build/libs/mxnet-engine-0.12.0-SNAPSHOT.jar:/Users/zhouyang/work/github/djl/pytorch/pytorch-engine/build/libs/pytorch-engine-0.12.0-SNAPSHOT.jar:/Users/zhouyang/work/github/djl/tensorflow/tensorflow-engine/build/libs/tensorflow-engine-0.12.0-SNAPSHOT.jar:/Users/zhouyang/work/github/djl/api/
build/libs/api-0.12.0-SNAPSHOT.jar:/Users/zhouyang/.gradle/caches/modules-2/files-2.1/org.slf4j/slf4j-api/1.7.30/b5a4b6d16ab13e34a88fae84c35cd5d68cac922c/slf4j-api-1.7.30.jar:/Users/zhouyang/.gradle/caches/modules-2/files-2.1/org.apache.logging.log4j/log4j-core/2.13.3/4e857439fc4fe974d212adaaaa3b118b8b50e3ec/log4j-core-2.13.3.jar:/Users/zhouyang/.gradle/caches/modules-2/files-2.1/org.apache.logging.log4j/log4j-api/2.13.3/ec1508160b93d274b1add34419b897bae84c6ca9/log4j-api-2.13.3.jar:/Users/zhouyang/.gradle/caches/modules-2/files-2.1/org.apache.commons/commons-csv/1.8/37ca9a9aa2d4be2599e55506a6d3170dd7a3df4/commons-csv-1.8.jar:/Users/zhouyang/.gradle/caches/modules-2/files-2.1/com.beust/jcommander/1.72/6375e521c1e11d6563d4f25a07ce124ccf8cd171/jcommander-1.72.jar:/Users/zhouyang/.gradle/caches/modules-2/files-2.1/com.google.inject/guice/4.1.0/faf9ee8ac09eafd1128091426dd367a8c0085d55/guice-4.1.0-no_aop.jar:/Users/zhouyang/.gradle/caches/modules-2/files-2.1/org.yaml/snakeyaml/1.21/18775fdda48574784f40b47bf478ab0593f92e4d/snakeyaml-1.21.jar:/Users/zhouyang/.gradle/caches/modules-2/files-2.1/com.google.code.gson/gson/2.8.6/9180733b7df8542621dc12e21e87557e8c99b8cb/gson-2.8.6.jar:/Users/zhouyang/.gradle/caches/modules-2/files-2.1/net.java.dev.jna/jna/5.3.0/4654d1da02e4173ba7b64f7166378847db55448a/jna-5.3.0.jar:/Users/zhouyang/.gradle/caches/modules-2/files-2.1/org.apache.commons/commons-compress/1.20/b8df472b31e1f17c232d2ad78ceb1c84e00c641b/commons-compress-1.20.jar:/Users/zhouyang/.gradle/caches/modules-2/files-2.1/javax.inject/javax.inject/1/6975da39a7040257bd51d21a231b76c915872d38/javax.inject-1.jar:/Users/zhouyang/.gradle/caches/modules-2/files-2.1/aopalliance/aopalliance/1.0/235ba8b489512805ac13a8f9ea77a1ca5ebe3e8/aopalliance-1.0.jar:/Users/zhouyang/.gradle/caches/modules-2/files-2.1/com.google.guava/guava/19.0/6ce200f6b23222af3d8abb6b6459e6c44f4bb0e9/guava-19.0.jar:/Users/zhouyang/work/github/djl/tensorflow/tensorflow-api/build/libs/tensorflow-api-0.12.0-SNAPSHOT.jar:/
Users/zhouyang/.gradle/caches/modules-2/files-2.1/org.bytedeco/javacpp/1.5.5/92e1c31aaed15a3dc12008859a37ced45fa0b730/javacpp-1.5.5.jar:/Users/zhouyang/.gradle/caches/modules-2/files-2.1/org.tensorflow/tensorflow-core-api/0.3.1/954f292e85f4d2a587ede1b2e1a525e74ef96c97/tensorflow-core-api-0.3.1.jar:/Users/zhouyang/.gradle/caches/modules-2/files-2.1/com.google.protobuf/protobuf-java/3.8.0/b5f93103d113540bb848fe9ce4e6819b1f39ee49/protobuf-java-3.8.0.jar:/Users/zhouyang/.gradle/caches/modules-2/files-2.1/org.tensorflow/ndarray/0.3.1/3cdb825411a9de908cc3dac740f18628d6512260/ndarray-0.3.1.jar
user.name: zhouyang
ai.djl.logging.level: debug
file.encoding: UTF-8
java.specification.version: 1.8
java.awt.printerjob: sun.lwawt.macosx.CPrinterJob
user.timezone: Asia/Shanghai
user.home: /Users/zhouyang
library.jansi.path: /Users/zhouyang/.gradle/native/jansi/1.18/osx
http.nonProxyHosts: local|*.local|169.254/16|*.169.254/16
os.version: 10.15.7
sun.management.compiler: HotSpot 64-Bit Tiered Compilers
java.specification.name: Java Platform API Specification
java.class.version: 52.0
org.gradle.internal.http.connectionTimeout: 60000
java.library.path: /Users/zhouyang/Library/Java/Extensions:/Library/Java/Extensions:/Network/Library/Java/Extensions:/System/Library/Java/Extensions:/usr/lib/java:.
org.gradle.internal.publish.checksums.insecure: true
sun.jnu.encoding: UTF-8
os.name: Mac OS X
user.variant:
java.vm.specification.vendor: Oracle Corporation
org.gradle.appname: gradlew
java.io.tmpdir: /var/folders/zv/gqw522z179l_5zv1k2q7tblm0000gn/T/
line.separator:
java.endorsed.dirs: /Library/Java/JavaVirtualMachines/jdk1.8.0_171.jdk/Contents/Home/jre/lib/endorsed
os.arch: x86_64
java.awt.graphicsenv: sun.awt.CGraphicsEnvironment
java.runtime.version: 1.8.0_171-b11
java.vm.specification.name: Java Virtual Machine Specification
user.dir: /Users/zhouyang/work/github/djl/integration
org.gradle.internal.http.socketTimeout: 120000
user.country: CN
sun.java.launcher: SUN_STANDARD
sun.os.patch.level: unknown
java.vm.name: Java HotSpot(TM) 64-Bit Server VM
file.encoding.pkg: sun.io
path.separator: :
java.vm.vendor: Oracle Corporation
java.vendor.url: http://java.oracle.com/
gopherProxySet: false
sun.boot.library.path: /Library/Java/JavaVirtualMachines/jdk1.8.0_171.jdk/Contents/Home/jre/lib:/Library/Java/JavaVirtualMachines/jdk1.8.0_171.jdk/Contents/Home/jre/lib
java.vm.version: 25.171-b11
java.runtime.name: Java(TM) SE Runtime Environment