Commit 593e29c

Updated Aria model card (#38472)
* Update aria.md
* Update aria.md
* Suggested Updates - aria.md
1 parent 77cf493 commit 593e29c

1 file changed, +91 -31 lines changed

docs/source/en/model_doc/aria.md

Lines changed: 91 additions & 31 deletions
@@ -14,60 +14,119 @@ rendered properly in your Markdown viewer.
 
 -->
 
+<div style="float: right;">
+    <div class="flex flex-wrap space-x-1">
+        <img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white">
+        <img alt="FlashAttention" src="https://img.shields.io/badge/%E2%9A%A1%EF%B8%8E%20FlashAttention-eae0c8?style=flat">
+        <img alt="SDPA" src="https://img.shields.io/badge/SDPA-DE3412?style=flat&logo=pytorch&logoColor=white">
+    </div>
+</div>
+
 # Aria
 
-<div class="flex flex-wrap space-x-1">
-<img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white">
-<img alt="FlashAttention" src="https://img.shields.io/badge/%E2%9A%A1%EF%B8%8E%20FlashAttention-eae0c8?style=flat">
-<img alt="SDPA" src="https://img.shields.io/badge/SDPA-DE3412?style=flat&logo=pytorch&logoColor=white">
-</div>
+[Aria](https://huggingface.co/papers/2410.05993) is a multimodal mixture-of-experts (MoE) model. The goal of this model is to open-source a training recipe for creating a multimodal native model from scratch. Aria has 3.9B and 3.5B activated parameters per visual and text token, respectively. Text is handled by a MoE decoder and visual inputs are handled by a lightweight visual encoder. It is trained in 4 stages: language pretraining, multimodal pretraining, multimodal long-context pretraining, and multimodal post-training.
 
-## Overview
+You can find all the original Aria checkpoints under the [Aria](https://huggingface.co/rhymes-ai?search_models=aria) organization.
 
-The Aria model was proposed in [Aria: An Open Multimodal Native Mixture-of-Experts Model](https://huggingface.co/papers/2410.05993) by Li et al. from the Rhymes.AI team.
+> [!TIP]
+> Click on the Aria models in the right sidebar for more examples of how to apply Aria to different multimodal tasks.
 
-Aria is an open multimodal-native model with best-in-class performance across a wide range of multimodal, language, and coding tasks. It has a Mixture-of-Experts architecture, with respectively 3.9B and 3.5B activated parameters per visual token and text token.
+The example below demonstrates how to generate text based on an image with [`Pipeline`] or the [`AutoModel`] class.
 
-The abstract from the paper is the following:
+<hfoptions id="usage">
+<hfoption id="Pipeline">
 
-*Information comes in diverse modalities. Multimodal native AI models are essential to integrate real-world information and deliver comprehensive understanding. While proprietary multimodal native models exist, their lack of openness imposes obstacles for adoptions, let alone adaptations. To fill this gap, we introduce Aria, an open multimodal native model with best-in-class performance across a wide range of multimodal, language, and coding tasks. Aria is a mixture-of-expert model with 3.9B and 3.5B activated parameters per visual token and text token, respectively. It outperforms Pixtral-12B and Llama3.2-11B, and is competitive against the best proprietary models on various multimodal tasks. We pre-train Aria from scratch following a 4-stage pipeline, which progressively equips the model with strong capabilities in language understanding, multimodal understanding, long context window, and instruction following. We open-source the model weights along with a codebase that facilitates easy adoptions and adaptations of Aria in real-world applications.*
+```python
+import torch
+from transformers import pipeline
 
-This model was contributed by [m-ric](https://huggingface.co/m-ric).
-The original code can be found [here](https://github.com/rhymes-ai/Aria).
+pipeline = pipeline(
+    "image-to-text",
+    model="rhymes-ai/Aria",
+    device=0,
+    torch_dtype=torch.bfloat16
+)
+pipeline(
+    "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg",
+    text="What is shown in this image?"
+)
+```
 
-## Usage tips
+</hfoption>
+<hfoption id="AutoModel">
 
-Here's how to use the model for vision tasks:
 ```python
-import requests
 import torch
-from PIL import Image
+from transformers import AutoModelForCausalLM, AutoProcessor
+
+model = AutoModelForCausalLM.from_pretrained(
+    "rhymes-ai/Aria",
+    device_map="auto",
+    torch_dtype=torch.bfloat16,
+    attn_implementation="sdpa"
+)
+
+processor = AutoProcessor.from_pretrained("rhymes-ai/Aria")
 
-from transformers import AriaProcessor, AriaForConditionalGeneration
+messages = [
+    {
+        "role": "user", "content": [
+            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg"},
+            {"type": "text", "text": "What is shown in this image?"},
+        ]
+    },
+]
 
-model_id_or_path = "rhymes-ai/Aria"
+inputs = processor.apply_chat_template(messages, add_generation_prompt=True, tokenize=True, return_dict=True, return_tensors="pt")
+inputs = inputs.to(model.device, torch.bfloat16)
 
-model = AriaForConditionalGeneration.from_pretrained(
-    model_id_or_path, device_map="auto"
+output = model.generate(
+    **inputs,
+    max_new_tokens=15,
+    stop_strings=["<|im_end|>"],
+    tokenizer=processor.tokenizer,
+    do_sample=True,
+    temperature=0.9,
 )
+output_ids = output[0][inputs["input_ids"].shape[1]:]
+response = processor.decode(output_ids, skip_special_tokens=True)
+print(response)
+```
+
+</hfoption>
+</hfoptions>
 
-processor = AriaProcessor.from_pretrained(model_id_or_path)
+Quantization reduces the memory burden of large models by representing the weights in a lower precision. Refer to the [Quantization](../quantization/overview) overview for more available quantization backends.
+
+The example below uses [torchao](../quantization/torchao) to quantize only the weights to int4, using the [rhymes-ai/Aria-sequential_mlp](https://huggingface.co/rhymes-ai/Aria-sequential_mlp) checkpoint. This checkpoint replaces grouped GEMM with `torch.nn.Linear` layers for easier quantization.
 
-image = Image.open(requests.get("http://images.cocodataset.org/val2017/000000039769.jpg", stream=True).raw)
+```py
+# pip install torchao
+import torch
+from transformers import TorchAoConfig, AutoModelForCausalLM, AutoProcessor
+
+quantization_config = TorchAoConfig("int4_weight_only", group_size=128)
+model = AutoModelForCausalLM.from_pretrained(
+    "rhymes-ai/Aria-sequential_mlp",
+    torch_dtype=torch.bfloat16,
+    device_map="auto",
+    quantization_config=quantization_config
+)
+processor = AutoProcessor.from_pretrained(
+    "rhymes-ai/Aria-sequential_mlp",
+)
 
 messages = [
     {
-        "role": "user",
-        "content": [
-            {"type": "image"},
-            {"text": "what is the image?", "type": "text"},
-        ],
-    }
+        "role": "user", "content": [
+            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg"},
+            {"type": "text", "text": "What is shown in this image?"},
+        ]
+    },
 ]
 
-text = processor.apply_chat_template(messages, add_generation_prompt=True)
-inputs = processor(text=text, images=image, return_tensors="pt")
-inputs.to(model.device)
+inputs = processor.apply_chat_template(messages, add_generation_prompt=True, tokenize=True, return_dict=True, return_tensors="pt")
+inputs = inputs.to(model.device, torch.bfloat16)
 
 output = model.generate(
     **inputs,
@@ -79,6 +138,7 @@ output = model.generate(
 )
 output_ids = output[0][inputs["input_ids"].shape[1]:]
 response = processor.decode(output_ids, skip_special_tokens=True)
+print(response)
 ```
 
 
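The updated description frames Aria as a MoE text decoder paired with a lightweight visual encoder, activating 3.9B parameters per visual token and 3.5B per text token. A quick, weight-free way to look at the settings behind those numbers is to print the composite model config. This is only a sketch, not part of the committed card; it assumes the Aria config exposes `text_config` and `vision_config` sub-configs the way other Transformers vision-language models do.

```python
# Sketch (not part of the committed card): inspect the Aria config without
# downloading any weights. Assumes the composite config exposes `text_config`
# and `vision_config` sub-configs like other Transformers vision-language models.
from transformers import AutoConfig

config = AutoConfig.from_pretrained("rhymes-ai/Aria")  # fetches config.json only
print(config.text_config)    # MoE text decoder settings (experts, hidden sizes, ...)
print(config.vision_config)  # lightweight visual encoder settings
```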
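The card motivates torchao by the reduced memory burden of lower-precision weights. The sketch below, also not part of the commit, puts a rough number on that by comparing `get_memory_footprint()` for a bf16 load and an int4 weight-only load of the same `rhymes-ai/Aria-sequential_mlp` checkpoint; it assumes enough GPU/CPU memory to load each build once.

```python
# Sketch (not part of the committed card): compare memory footprints of the
# bf16 and int4 torchao builds used in the card's quantization example.
import gc
import torch
from transformers import AutoModelForCausalLM, TorchAoConfig

def footprint_gb(**kwargs):
    # Load the checkpoint, measure its footprint in GB, then free it.
    model = AutoModelForCausalLM.from_pretrained(
        "rhymes-ai/Aria-sequential_mlp",
        torch_dtype=torch.bfloat16,
        device_map="auto",
        **kwargs,
    )
    size = model.get_memory_footprint() / 1e9  # bytes -> GB
    del model
    gc.collect()
    torch.cuda.empty_cache()
    return size

int4_config = TorchAoConfig("int4_weight_only", group_size=128)
print(f"bf16 weights: {footprint_gb():.1f} GB")
print(f"int4 weights: {footprint_gb(quantization_config=int4_config):.1f} GB")
```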