|
19 | 19 | - https://huggingface.co/unsloth/Qwen3-VL-30B-A3B-Instruct-GGUF |
20 | 20 | description: | |
21 | 21 | Meet Qwen3-VL — the most powerful vision-language model in the Qwen series to date. |
22 | | - |
| 22 | + |
23 | 23 | This generation delivers comprehensive upgrades across the board: superior text understanding & generation, deeper visual perception & reasoning, extended context length, enhanced spatial and video dynamics comprehension, and stronger agent interaction capabilities. |
24 | | - |
| 24 | + |
25 | 25 | Available in Dense and MoE architectures that scale from edge to cloud, with Instruct and reasoning‑enhanced Thinking editions for flexible, on-demand deployment. |
26 | | - |
| 26 | + |
27 | 27 | #### Key Enhancements: |
28 | | - |
| 28 | + |
29 | 29 | * **Visual Agent**: Operates PC/mobile GUIs—recognizes elements, understands functions, invokes tools, completes tasks. |
30 | | - |
| 30 | + |
31 | 31 | * **Visual Coding Boost**: Generates Draw.io/HTML/CSS/JS from images/videos. |
32 | | - |
| 32 | + |
33 | 33 | * **Advanced Spatial Perception**: Judges object positions, viewpoints, and occlusions; provides stronger 2D grounding and enables 3D grounding for spatial reasoning and embodied AI. |
34 | | - |
| 34 | + |
35 | 35 | * **Long Context & Video Understanding**: Native 256K context, expandable to 1M; handles books and hours-long video with full recall and second-level indexing. |
36 | | - |
| 36 | + |
37 | 37 | * **Enhanced Multimodal Reasoning**: Excels in STEM/Math—causal analysis and logical, evidence-based answers. |
38 | | - |
| 38 | + |
39 | 39 | * **Upgraded Visual Recognition**: Broader, higher-quality pretraining is able to “recognize everything”—celebrities, anime, products, landmarks, flora/fauna, etc. |
40 | | - |
| 40 | + |
41 | 41 | * **Expanded OCR**: Supports 32 languages (up from 19); robust in low light, blur, and tilt; better with rare/ancient characters and jargon; improved long-document structure parsing. |
42 | | - |
| 42 | + |
43 | 43 | * **Text Understanding on par with pure LLMs**: Seamless text–vision fusion for lossless, unified comprehension. |
44 | | - |
| 44 | + |
45 | 45 | #### Model Architecture Updates: |
46 | | - |
| 46 | + |
47 | 47 | 1. **Interleaved-MRoPE**: Full‑frequency allocation over time, width, and height via robust positional embeddings, enhancing long‑horizon video reasoning. |
48 | | - |
| 48 | + |
49 | 49 | 2. **DeepStack**: Fuses multi‑level ViT features to capture fine-grained details and sharpen image–text alignment. |
50 | | - |
| 50 | + |
51 | 51 | 3. **Text–Timestamp Alignment:** Moves beyond T‑RoPE to precise, timestamp‑grounded event localization for stronger video temporal modeling. |
52 | | - |
| 52 | + |
53 | 53 | This is the weight repository for Qwen3-VL-30B-A3B-Instruct. |
54 | 54 | overrides: |
55 | 55 | mmproj: mmproj/mmproj-F16.gguf |
|
130 | 130 | - filename: mmproj/mmproj-Qwen3-VL-4B-Thinking-F16.gguf |
131 | 131 | sha256: 72354fcd3fc75935b84e745ca492d6e78dd003bb5a020d71b296e7650926ac87 |
132 | 132 | uri: huggingface://unsloth/Qwen3-VL-4B-Thinking-GGUF/mmproj-F16.gguf |
133 | | -- !!merge <<: *llama3 |
| 133 | +- !!merge <<: *qwen3vl |
134 | 134 | name: "qwen3-vl-2b-thinking" |
135 | 135 | urls: |
136 | 136 | - https://huggingface.co/unsloth/Qwen3-VL-2B-Thinking-GGUF |
|
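For context on the one substantive change in this hunk (`*llama3` to `*qwen3vl`): a YAML merge key copies every key from the anchored entry, so the anchor decides which defaults the `qwen3-vl-2b-thinking` entry inherits. The sketch below is illustrative only. It assumes a `&qwen3vl` anchor is defined earlier in the file, and the keys and values in the base entry are hypothetical placeholders, not the real gallery contents.

```yaml
# Hypothetical base entry: the key names and values here are placeholders.
- &qwen3vl
  name: "qwen3-vl-base-example"
  license: example-license
  tags:
    - multimodal

# Merging the anchor copies its keys into this entry.
# Keys set locally (name, urls) take precedence over the inherited ones.
- !!merge <<: *qwen3vl
  name: "qwen3-vl-2b-thinking"
  urls:
    - https://huggingface.co/unsloth/Qwen3-VL-2B-Thinking-GGUF
```

With `*llama3` the entry would instead have inherited whatever defaults the Llama 3 base entry carries, which is presumably why the anchor was corrected here.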