
Commit 7a36786

Merge branch 'main' into compile-llava-enable
2 parents 370c9d2 + 7d4b3dd commit 7a36786

122 files changed

Lines changed: 2083 additions & 1381 deletions


.circleci/config.yml

Lines changed: 1 addition & 0 deletions
@@ -13,6 +13,7 @@ jobs:
     check_circleci_user:
         docker:
             - image: python:3.10-slim
+        resource_class: small
         parallelism: 1
         steps:
             - run: echo $CIRCLE_PROJECT_USERNAME

README.md

Lines changed: 23 additions & 3 deletions
@@ -255,17 +255,39 @@ You should install 🤗 Transformers in a [virtual environment](https://docs.python.org/3/library/venv.html)
 
 First, create a virtual environment with the version of Python you're going to use and activate it.
 
-Then, you will need to install at least one of Flax, PyTorch, or TensorFlow.
-Please refer to [TensorFlow installation page](https://www.tensorflow.org/install/), [PyTorch installation page](https://pytorch.org/get-started/locally/#start-locally) and/or [Flax](https://github.com/google/flax#quick-install) and [Jax](https://github.com/google/jax#installation) installation pages regarding the specific installation command for your platform.
+**macOS/Linux**
+
+```
+python -m venv env
+source env/bin/activate
+```
+
+**Windows**
+
+```
+python -m venv env
+env\Scripts\activate
+```
+
+To use 🤗 Transformers, you must install at least one of Flax, PyTorch, or TensorFlow. Refer to the official installation guides for platform-specific commands:
+
+[TensorFlow installation page](https://www.tensorflow.org/install/),
+[PyTorch installation page](https://pytorch.org/get-started/locally/#start-locally) and/or [Flax](https://github.com/google/flax#quick-install) and [Jax](https://github.com/google/jax#installation)
 
 When one of those backends has been installed, 🤗 Transformers can be installed using pip as follows:
 
-```bash
+```
 pip install transformers
 ```
 
 If you'd like to play with the examples or need the bleeding edge of the code and can't wait for a new release, you must [install the library from source](https://huggingface.co/docs/transformers/installation#installing-from-source).
 
+```
+git clone https://github.com/huggingface/transformers.git
+cd transformers
+pip install .
+```
+
 ### With conda
 
 🤗 Transformers can be installed using conda as follows:
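A quick way to confirm that the `pip install transformers` step shown in this hunk worked is a one-off pipeline call. This is a minimal sketch and not part of the commit; it assumes a PyTorch backend is installed and that the default sentiment-analysis checkpoint can be downloaded.

```python
# Minimal sketch (not part of this commit): smoke-test the installation.
# Assumes a PyTorch backend and network access for the default checkpoint.
from transformers import pipeline

classifier = pipeline("sentiment-analysis")
print(classifier("We are very happy to show you the 🤗 Transformers library."))
# Prints a list like [{'label': 'POSITIVE', 'score': ...}]
```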

docs/source/en/installation.md

Lines changed: 38 additions & 14 deletions
@@ -32,27 +32,18 @@ Install 🤗 Transformers for whichever deep learning library you're working with
 
 You should install 🤗 Transformers in a [virtual environment](https://docs.python.org/3/library/venv.html). If you're unfamiliar with Python virtual environments, take a look at this [guide](https://packaging.python.org/guides/installing-using-pip-and-virtual-environments/). A virtual environment makes it easier to manage different projects, and avoid compatibility issues between dependencies.
 
-Start by creating a virtual environment in your project directory:
+Now you're ready to install 🤗 Transformers with the following command:
 
 ```bash
-python -m venv .env
+pip install transformers
 ```
 
-Activate the virtual environment. On Linux and MacOs:
+For GPU acceleration, install the appropriate CUDA drivers for [PyTorch](https://pytorch.org/get-started/locally) and [TensorFlow](https://www.tensorflow.org/install/pip).
 
-```bash
-source .env/bin/activate
-```
-Activate Virtual environment on Windows
+Run the command below to check if your system detects an NVIDIA GPU.
 
 ```bash
-.env/Scripts/activate
-```
-
-Now you're ready to install 🤗 Transformers with the following command:
-
-```bash
-pip install transformers
+nvidia-smi
 ```
 
 For CPU-support only, you can conveniently install 🤗 Transformers and a deep learning library in one line. For example, install 🤗 Transformers and PyTorch with:

@@ -254,3 +245,36 @@ Once your file is downloaded and locally cached, specify it's local path to load
 See the [How to download files from the Hub](https://huggingface.co/docs/hub/how-to-downstream) section for more details on downloading files stored on the Hub.
 
 </Tip>
+
+## Troubleshooting
+
+See below for some of the more common installation issues and how to resolve them.
+
+### Unsupported Python version
+
+Ensure you are using Python 3.9 or later. Run the command below to check your Python version.
+
+```
+python --version
+```
+
+### Missing dependencies
+
+Install all required dependencies by running the following command. Ensure you’re in the project directory before executing the command.
+
+```
+pip install -r requirements.txt
+```
+
+### Windows-specific
+
+If you encounter issues on Windows, you may need to activate Developer Mode. Navigate to Windows Settings > For Developers > Developer Mode.
+
+Alternatively, create and activate a virtual environment as shown below.
+
+```
+python -m venv env
+.\env\Scripts\activate
+```
+
+
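As a programmatic complement to the `nvidia-smi` check added in this hunk, a short sketch (not part of the commit) can confirm that the installed PyTorch build actually sees a GPU:

```python
# Minimal sketch (not part of this commit): verify GPU visibility from PyTorch
# after installing the CUDA drivers mentioned above.
import torch

if torch.cuda.is_available():
    print(f"CUDA device detected: {torch.cuda.get_device_name(0)}")
else:
    print("No CUDA device detected; 🤗 Transformers will fall back to CPU.")
```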

docs/source/en/model_doc/grounding-dino.md

Lines changed: 12 additions & 11 deletions
@@ -56,25 +56,26 @@ Here's how to use the model for zero-shot object detection:
 >>> image_url = "http://images.cocodataset.org/val2017/000000039769.jpg"
 >>> image = Image.open(requests.get(image_url, stream=True).raw)
 >>> # Check for cats and remote controls
->>> text = "a cat. a remote control."
+>>> text_labels = [["a cat", "a remote control"]]
 
->>> inputs = processor(images=image, text=text, return_tensors="pt").to(device)
+>>> inputs = processor(images=image, text=text_labels, return_tensors="pt").to(device)
 >>> with torch.no_grad():
 ...     outputs = model(**inputs)
 
 >>> results = processor.post_process_grounded_object_detection(
 ...     outputs,
-...     inputs.input_ids,
-...     box_threshold=0.4,
+...     threshold=0.4,
 ...     text_threshold=0.3,
-...     target_sizes=[image.size[::-1]]
+...     target_sizes=[(image.height, image.width)]
 ... )
->>> print(results)
-[{'boxes': tensor([[344.6959,  23.1090, 637.1833, 374.2751],
-        [ 12.2666,  51.9145, 316.8582, 472.4392],
-        [ 38.5742,  70.0015, 176.7838, 118.1806]], device='cuda:0'),
- 'labels': ['a cat', 'a cat', 'a remote control'],
- 'scores': tensor([0.4785, 0.4381, 0.4776], device='cuda:0')}]
+>>> # Retrieve the first image result
+>>> result = results[0]
+>>> for box, score, text_label in zip(result["boxes"], result["scores"], result["text_labels"]):
+...     box = [round(x, 2) for x in box.tolist()]
+...     print(f"Detected {text_label} with confidence {round(score.item(), 3)} at location {box}")
+Detected a cat with confidence 0.479 at location [344.7, 23.11, 637.18, 374.28]
+Detected a cat with confidence 0.438 at location [12.27, 51.91, 316.86, 472.44]
+Detected a remote control with confidence 0.478 at location [38.57, 70.0, 176.78, 118.18]
 ```
 
 ## Grounded SAM
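Note that this hunk starts partway through the doc example, so `processor`, `model`, and `device` are defined on earlier lines the diff does not show. Below is a hedged sketch of that setup; the `IDEA-Research/grounding-dino-tiny` checkpoint is an assumption based on the rest of the Grounding DINO docs, not something this commit touches.

```python
# Sketch of the setup lines preceding the hunk above (not shown in this diff).
# The checkpoint name is assumed from the surrounding Grounding DINO docs.
import requests
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForZeroShotObjectDetection

model_id = "IDEA-Research/grounding-dino-tiny"
device = "cuda" if torch.cuda.is_available() else "cpu"

processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForZeroShotObjectDetection.from_pretrained(model_id).to(device)
```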

docs/source/en/model_doc/omdet-turbo.md

Lines changed: 45 additions & 42 deletions
@@ -44,37 +44,40 @@ One unique property of OmDet-Turbo compared to other zero-shot object detection
 Here's how to load the model and prepare the inputs to perform zero-shot object detection on a single image:
 
 ```python
-import requests
-from PIL import Image
-
-from transformers import AutoProcessor, OmDetTurboForObjectDetection
-
-processor = AutoProcessor.from_pretrained("omlab/omdet-turbo-swin-tiny-hf")
-model = OmDetTurboForObjectDetection.from_pretrained("omlab/omdet-turbo-swin-tiny-hf")
-
-url = "http://images.cocodataset.org/val2017/000000039769.jpg"
-image = Image.open(requests.get(url, stream=True).raw)
-classes = ["cat", "remote"]
-inputs = processor(image, text=classes, return_tensors="pt")
-
-outputs = model(**inputs)
-
-# convert outputs (bounding boxes and class logits)
-results = processor.post_process_grounded_object_detection(
-    outputs,
-    classes=classes,
-    target_sizes=[image.size[::-1]],
-    score_threshold=0.3,
-    nms_threshold=0.3,
-)[0]
-for score, class_name, box in zip(
-    results["scores"], results["classes"], results["boxes"]
-):
-    box = [round(i, 1) for i in box.tolist()]
-    print(
-        f"Detected {class_name} with confidence "
-        f"{round(score.item(), 2)} at location {box}"
-    )
+>>> import torch
+>>> import requests
+>>> from PIL import Image
+
+>>> from transformers import AutoProcessor, OmDetTurboForObjectDetection
+
+>>> processor = AutoProcessor.from_pretrained("omlab/omdet-turbo-swin-tiny-hf")
+>>> model = OmDetTurboForObjectDetection.from_pretrained("omlab/omdet-turbo-swin-tiny-hf")
+
+>>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
+>>> image = Image.open(requests.get(url, stream=True).raw)
+>>> text_labels = ["cat", "remote"]
+>>> inputs = processor(image, text=text_labels, return_tensors="pt")
+
+>>> with torch.no_grad():
+...     outputs = model(**inputs)
+
+>>> # convert outputs (bounding boxes and class logits)
+>>> results = processor.post_process_grounded_object_detection(
+...     outputs,
+...     target_sizes=[(image.height, image.width)],
+...     text_labels=text_labels,
+...     threshold=0.3,
+...     nms_threshold=0.3,
+... )
+>>> result = results[0]
+>>> boxes, scores, text_labels = result["boxes"], result["scores"], result["text_labels"]
+>>> for box, score, text_label in zip(boxes, scores, text_labels):
+...     box = [round(i, 2) for i in box.tolist()]
+...     print(f"Detected {text_label} with confidence {round(score.item(), 3)} at location {box}")
+Detected remote with confidence 0.768 at location [39.89, 70.35, 176.74, 118.04]
+Detected cat with confidence 0.72 at location [11.6, 54.19, 314.8, 473.95]
+Detected remote with confidence 0.563 at location [333.38, 75.77, 370.7, 187.03]
+Detected cat with confidence 0.552 at location [345.15, 23.95, 639.75, 371.67]
 ```
 
 ### Multi image inference

@@ -93,22 +96,22 @@ OmDet-Turbo can perform batched multi-image inference, with support for different
 
 >>> url1 = "http://images.cocodataset.org/val2017/000000039769.jpg"
 >>> image1 = Image.open(BytesIO(requests.get(url1).content)).convert("RGB")
->>> classes1 = ["cat", "remote"]
->>> task1 = "Detect {}.".format(", ".join(classes1))
+>>> text_labels1 = ["cat", "remote"]
+>>> task1 = "Detect {}.".format(", ".join(text_labels1))
 
 >>> url2 = "http://images.cocodataset.org/train2017/000000257813.jpg"
 >>> image2 = Image.open(BytesIO(requests.get(url2).content)).convert("RGB")
->>> classes2 = ["boat"]
+>>> text_labels2 = ["boat"]
 >>> task2 = "Detect everything that looks like a boat."
 
 >>> url3 = "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
 >>> image3 = Image.open(BytesIO(requests.get(url3).content)).convert("RGB")
->>> classes3 = ["statue", "trees"]
+>>> text_labels3 = ["statue", "trees"]
 >>> task3 = "Focus on the foreground, detect statue and trees."
 
 >>> inputs = processor(
 ...     images=[image1, image2, image3],
-...     text=[classes1, classes2, classes3],
+...     text=[text_labels1, text_labels2, text_labels3],
 ...     task=[task1, task2, task3],
 ...     return_tensors="pt",
 ... )

@@ -119,19 +122,19 @@ OmDet-Turbo can perform batched multi-image inference, with support for different
 >>> # convert outputs (bounding boxes and class logits)
 >>> results = processor.post_process_grounded_object_detection(
 ...     outputs,
-...     classes=[classes1, classes2, classes3],
-...     target_sizes=[image1.size[::-1], image2.size[::-1], image3.size[::-1]],
-...     score_threshold=0.2,
+...     text_labels=[text_labels1, text_labels2, text_labels3],
+...     target_sizes=[(image.height, image.width) for image in [image1, image2, image3]],
+...     threshold=0.2,
 ...     nms_threshold=0.3,
 ... )
 
 >>> for i, result in enumerate(results):
-...     for score, class_name, box in zip(
-...         result["scores"], result["classes"], result["boxes"]
+...     for score, text_label, box in zip(
+...         result["scores"], result["text_labels"], result["boxes"]
 ...     ):
 ...         box = [round(i, 1) for i in box.tolist()]
 ...         print(
-...             f"Detected {class_name} with confidence "
+...             f"Detected {text_label} with confidence "
 ...             f"{round(score.item(), 2)} at location {box} in image {i}"
 ...         )
 Detected remote with confidence 0.77 at location [39.9, 70.4, 176.7, 118.0] in image 0

docs/source/en/model_doc/owlv2.md

Lines changed: 15 additions & 10 deletions
@@ -50,20 +50,22 @@ OWLv2 is, just like its predecessor [OWL-ViT](owlvit), a zero-shot text-conditioned
 
 >>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
 >>> image = Image.open(requests.get(url, stream=True).raw)
->>> texts = [["a photo of a cat", "a photo of a dog"]]
->>> inputs = processor(text=texts, images=image, return_tensors="pt")
+>>> text_labels = [["a photo of a cat", "a photo of a dog"]]
+>>> inputs = processor(text=text_labels, images=image, return_tensors="pt")
 >>> outputs = model(**inputs)
 
 >>> # Target image sizes (height, width) to rescale box predictions [batch_size, 2]
->>> target_sizes = torch.Tensor([image.size[::-1]])
->>> # Convert outputs (bounding boxes and class logits) to Pascal VOC Format (xmin, ymin, xmax, ymax)
->>> results = processor.post_process_object_detection(outputs=outputs, target_sizes=target_sizes, threshold=0.1)
->>> i = 0  # Retrieve predictions for the first image for the corresponding text queries
->>> text = texts[i]
->>> boxes, scores, labels = results[i]["boxes"], results[i]["scores"], results[i]["labels"]
->>> for box, score, label in zip(boxes, scores, labels):
+>>> target_sizes = torch.tensor([(image.height, image.width)])
+>>> # Convert outputs (bounding boxes and class logits) to Pascal VOC format (xmin, ymin, xmax, ymax)
+>>> results = processor.post_process_grounded_object_detection(
+...     outputs=outputs, target_sizes=target_sizes, threshold=0.1, text_labels=text_labels
+... )
+>>> # Retrieve predictions for the first image for the corresponding text queries
+>>> result = results[0]
+>>> boxes, scores, text_labels = result["boxes"], result["scores"], result["text_labels"]
+>>> for box, score, text_label in zip(boxes, scores, text_labels):
 ...     box = [round(i, 2) for i in box.tolist()]
-...     print(f"Detected {text[label]} with confidence {round(score.item(), 3)} at location {box}")
+...     print(f"Detected {text_label} with confidence {round(score.item(), 3)} at location {box}")
 Detected a photo of a cat with confidence 0.614 at location [341.67, 23.39, 642.32, 371.35]
 Detected a photo of a cat with confidence 0.665 at location [6.75, 51.96, 326.62, 473.13]
 ```

@@ -103,6 +105,9 @@ Usage of OWLv2 is identical to [OWL-ViT](owlvit) with a new, updated image processor
 ## Owlv2Processor
 
 [[autodoc]] Owlv2Processor
+    - __call__
+    - post_process_grounded_object_detection
+    - post_process_image_guided_detection
 
 ## Owlv2Model
 

docs/source/en/model_doc/owlvit.md

Lines changed: 14 additions & 16 deletions
@@ -49,20 +49,22 @@ OWL-ViT is a zero-shot text-conditioned object detection model.
 
 >>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
 >>> image = Image.open(requests.get(url, stream=True).raw)
->>> texts = [["a photo of a cat", "a photo of a dog"]]
->>> inputs = processor(text=texts, images=image, return_tensors="pt")
+>>> text_labels = [["a photo of a cat", "a photo of a dog"]]
+>>> inputs = processor(text=text_labels, images=image, return_tensors="pt")
 >>> outputs = model(**inputs)
 
 >>> # Target image sizes (height, width) to rescale box predictions [batch_size, 2]
->>> target_sizes = torch.Tensor([image.size[::-1]])
+>>> target_sizes = torch.tensor([(image.height, image.width)])
 >>> # Convert outputs (bounding boxes and class logits) to Pascal VOC format (xmin, ymin, xmax, ymax)
->>> results = processor.post_process_object_detection(outputs=outputs, target_sizes=target_sizes, threshold=0.1)
->>> i = 0  # Retrieve predictions for the first image for the corresponding text queries
->>> text = texts[i]
->>> boxes, scores, labels = results[i]["boxes"], results[i]["scores"], results[i]["labels"]
->>> for box, score, label in zip(boxes, scores, labels):
+>>> results = processor.post_process_grounded_object_detection(
+...     outputs=outputs, target_sizes=target_sizes, threshold=0.1, text_labels=text_labels
+... )
+>>> # Retrieve predictions for the first image for the corresponding text queries
+>>> result = results[0]
+>>> boxes, scores, text_labels = result["boxes"], result["scores"], result["text_labels"]
+>>> for box, score, text_label in zip(boxes, scores, text_labels):
 ...     box = [round(i, 2) for i in box.tolist()]
-...     print(f"Detected {text[label]} with confidence {round(score.item(), 3)} at location {box}")
+...     print(f"Detected {text_label} with confidence {round(score.item(), 3)} at location {box}")
 Detected a photo of a cat with confidence 0.707 at location [324.97, 20.44, 640.58, 373.29]
 Detected a photo of a cat with confidence 0.717 at location [1.46, 55.26, 315.55, 472.17]
 ```

@@ -91,16 +93,12 @@ A demo notebook on using OWL-ViT for zero- and one-shot (image-guided) object detection
     - post_process_object_detection
     - post_process_image_guided_detection
 
-## OwlViTFeatureExtractor
-
-[[autodoc]] OwlViTFeatureExtractor
-    - __call__
-    - post_process
-    - post_process_image_guided_detection
-
 ## OwlViTProcessor
 
 [[autodoc]] OwlViTProcessor
+    - __call__
+    - post_process_grounded_object_detection
+    - post_process_image_guided_detection
 
 ## OwlViTModel
 

0 commit comments
