
Commit 7a36786

Merge branch 'main' into compile-llava-enable
2 parents 370c9d2 + 7d4b3dd commit 7a36786

122 files changed

Lines changed: 2083 additions & 1381 deletions


.circleci/config.yml

Lines changed: 1 addition & 0 deletions
@@ -13,6 +13,7 @@ jobs:
     check_circleci_user:
         docker:
             - image: python:3.10-slim
+        resource_class: small
         parallelism: 1
         steps:
             - run: echo $CIRCLE_PROJECT_USERNAME

README.md

Lines changed: 23 additions & 3 deletions
@@ -255,17 +255,39 @@ You should install 🤗 Transformers in a [virtual environment](https://docs.python.org/3/library/venv.html)
 
 First, create a virtual environment with the version of Python you're going to use and activate it.
 
-Then, you will need to install at least one of Flax, PyTorch, or TensorFlow.
-Please refer to [TensorFlow installation page](https://www.tensorflow.org/install/), [PyTorch installation page](https://pytorch.org/get-started/locally/#start-locally) and/or [Flax](https://github.com/google/flax#quick-install) and [Jax](https://github.com/google/jax#installation) installation pages regarding the specific installation command for your platform.
+**macOS/Linux**
+
+```
+python -m venv env
+source env/bin/activate
+```
+
+**Windows**
+
+```
+python -m venv env
+env\Scripts\activate
+```
+
+To use 🤗 Transformers, you must install at least one of Flax, PyTorch, or TensorFlow. Refer to the official installation guides for platform-specific commands:
+
+[TensorFlow installation page](https://www.tensorflow.org/install/),
+[PyTorch installation page](https://pytorch.org/get-started/locally/#start-locally) and/or [Flax](https://github.com/google/flax#quick-install) and [Jax](https://github.com/google/jax#installation)
 
 When one of those backends has been installed, 🤗 Transformers can be installed using pip as follows:
 
-```bash
+```
 pip install transformers
 ```
 
 If you'd like to play with the examples or need the bleeding edge of the code and can't wait for a new release, you must [install the library from source](https://huggingface.co/docs/transformers/installation#installing-from-source).
 
+```
+git clone https://github.com/huggingface/transformers.git
+cd transformers
+pip install .
+```
+
 ### With conda
 
 🤗 Transformers can be installed using conda as follows:
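A quick way to confirm that the `pip install transformers` step shown in this hunk worked is a one-off pipeline call. This is a minimal sketch and not part of the commit; it assumes a PyTorch backend is installed and that the default sentiment-analysis checkpoint can be downloaded.

```python
# Minimal sketch (not part of this commit): smoke-test the installation.
# Assumes a PyTorch backend and network access for the default checkpoint.
from transformers import pipeline

classifier = pipeline("sentiment-analysis")
print(classifier("We are very happy to show you the 🤗 Transformers library."))
# Prints a list like [{'label': 'POSITIVE', 'score': ...}]
```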

docs/source/en/installation.md

Lines changed: 38 additions & 14 deletions
@@ -32,27 +32,18 @@ Install 🤗 Transformers for whichever deep learning library you're working with
 
 You should install 🤗 Transformers in a [virtual environment](https://docs.python.org/3/library/venv.html). If you're unfamiliar with Python virtual environments, take a look at this [guide](https://packaging.python.org/guides/installing-using-pip-and-virtual-environments/). A virtual environment makes it easier to manage different projects, and avoid compatibility issues between dependencies.
 
-Start by creating a virtual environment in your project directory:
+Now you're ready to install 🤗 Transformers with the following command:
 
 ```bash
-python -m venv .env
+pip install transformers
 ```
 
-Activate the virtual environment. On Linux and MacOs:
+For GPU acceleration, install the appropriate CUDA drivers for [PyTorch](https://pytorch.org/get-started/locally) and [TensorFlow](https://www.tensorflow.org/install/pip).
 
-```bash
-source .env/bin/activate
-```
-Activate Virtual environment on Windows
+Run the command below to check if your system detects an NVIDIA GPU.
 
 ```bash
-.env/Scripts/activate
-```
-
-Now you're ready to install 🤗 Transformers with the following command:
-
-```bash
-pip install transformers
+nvidia-smi
 ```
 
 For CPU-support only, you can conveniently install 🤗 Transformers and a deep learning library in one line. For example, install 🤗 Transformers and PyTorch with:

@@ -254,3 +245,36 @@ Once your file is downloaded and locally cached, specify it's local path to load
 See the [How to download files from the Hub](https://huggingface.co/docs/hub/how-to-downstream) section for more details on downloading files stored on the Hub.
 
 </Tip>
+
+## Troubleshooting
+
+See below for some of the more common installation issues and how to resolve them.
+
+### Unsupported Python version
+
+Ensure you are using Python 3.9 or later. Run the command below to check your Python version.
+
+```
+python --version
+```
+
+### Missing dependencies
+
+Install all required dependencies by running the following command. Ensure you’re in the project directory before executing the command.
+
+```
+pip install -r requirements.txt
+```
+
+### Windows-specific
+
+If you encounter issues on Windows, you may need to activate Developer Mode. Navigate to Windows Settings > For Developers > Developer Mode.
+
+Alternatively, create and activate a virtual environment as shown below.
+
+```
+python -m venv env
+.\env\Scripts\activate
+```
+
+
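As a programmatic complement to the `nvidia-smi` check added in this hunk, a short sketch (not part of the commit) can confirm that the installed PyTorch build actually sees a GPU:

```python
# Minimal sketch (not part of this commit): verify GPU visibility from PyTorch
# after installing the CUDA drivers mentioned above.
import torch

if torch.cuda.is_available():
    print(f"CUDA device detected: {torch.cuda.get_device_name(0)}")
else:
    print("No CUDA device detected; 🤗 Transformers will fall back to CPU.")
```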

docs/source/en/model_doc/grounding-dino.md

Lines changed: 12 additions & 11 deletions
@@ -56,25 +56,26 @@ Here's how to use the model for zero-shot object detection:
 >>> image_url = "http://images.cocodataset.org/val2017/000000039769.jpg"
 >>> image = Image.open(requests.get(image_url, stream=True).raw)
 >>> # Check for cats and remote controls
->>> text = "a cat. a remote control."
+>>> text_labels = [["a cat", "a remote control"]]
 
->>> inputs = processor(images=image, text=text, return_tensors="pt").to(device)
+>>> inputs = processor(images=image, text=text_labels, return_tensors="pt").to(device)
 >>> with torch.no_grad():
 ...     outputs = model(**inputs)
 
 >>> results = processor.post_process_grounded_object_detection(
 ...     outputs,
-...     inputs.input_ids,
-...     box_threshold=0.4,
+...     threshold=0.4,
 ...     text_threshold=0.3,
-...     target_sizes=[image.size[::-1]]
+...     target_sizes=[(image.height, image.width)]
 ... )
->>> print(results)
-[{'boxes': tensor([[344.6959,  23.1090, 637.1833, 374.2751],
-        [ 12.2666,  51.9145, 316.8582, 472.4392],
-        [ 38.5742,  70.0015, 176.7838, 118.1806]], device='cuda:0'),
- 'labels': ['a cat', 'a cat', 'a remote control'],
- 'scores': tensor([0.4785, 0.4381, 0.4776], device='cuda:0')}]
+>>> # Retrieve the first image result
+>>> result = results[0]
+>>> for box, score, text_label in zip(result["boxes"], result["scores"], result["text_labels"]):
+...     box = [round(x, 2) for x in box.tolist()]
+...     print(f"Detected {text_label} with confidence {round(score.item(), 3)} at location {box}")
+Detected a cat with confidence 0.479 at location [344.7, 23.11, 637.18, 374.28]
+Detected a cat with confidence 0.438 at location [12.27, 51.91, 316.86, 472.44]
+Detected a remote control with confidence 0.478 at location [38.57, 70.0, 176.78, 118.18]
 ```
 
 ## Grounded SAM
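Note that this hunk starts partway through the doc example, so `processor`, `model`, and `device` are defined on earlier lines the diff does not show. Below is a hedged sketch of that setup; the `IDEA-Research/grounding-dino-tiny` checkpoint is an assumption based on the rest of the Grounding DINO docs, not something this commit touches.

```python
# Sketch of the setup lines preceding the hunk above (not shown in this diff).
# The checkpoint name is assumed from the surrounding Grounding DINO docs.
import requests
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForZeroShotObjectDetection

model_id = "IDEA-Research/grounding-dino-tiny"
device = "cuda" if torch.cuda.is_available() else "cpu"

processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForZeroShotObjectDetection.from_pretrained(model_id).to(device)
```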

docs/source/en/model_doc/omdet-turbo.md

Lines changed: 45 additions & 42 deletions
@@ -44,37 +44,40 @@ One unique property of OmDet-Turbo compared to other zero-shot object detection
 Here's how to load the model and prepare the inputs to perform zero-shot object detection on a single image:
 
 ```python
-import requests
-from PIL import Image
-
-from transformers import AutoProcessor, OmDetTurboForObjectDetection
-
-processor = AutoProcessor.from_pretrained("omlab/omdet-turbo-swin-tiny-hf")
-model = OmDetTurboForObjectDetection.from_pretrained("omlab/omdet-turbo-swin-tiny-hf")
-
-url = "http://images.cocodataset.org/val2017/000000039769.jpg"
-image = Image.open(requests.get(url, stream=True).raw)
-classes = ["cat", "remote"]
-inputs = processor(image, text=classes, return_tensors="pt")
-
-outputs = model(**inputs)
-
-# convert outputs (bounding boxes and class logits)
-results = processor.post_process_grounded_object_detection(
-    outputs,
-    classes=classes,
-    target_sizes=[image.size[::-1]],
-    score_threshold=0.3,
-    nms_threshold=0.3,
-)[0]
-for score, class_name, box in zip(
-    results["scores"], results["classes"], results["boxes"]
-):
-    box = [round(i, 1) for i in box.tolist()]
-    print(
-        f"Detected {class_name} with confidence "
-        f"{round(score.item(), 2)} at location {box}"
-    )
+>>> import torch
+>>> import requests
+>>> from PIL import Image
+
+>>> from transformers import AutoProcessor, OmDetTurboForObjectDetection
+
+>>> processor = AutoProcessor.from_pretrained("omlab/omdet-turbo-swin-tiny-hf")
+>>> model = OmDetTurboForObjectDetection.from_pretrained("omlab/omdet-turbo-swin-tiny-hf")
+
+>>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
+>>> image = Image.open(requests.get(url, stream=True).raw)
+>>> text_labels = ["cat", "remote"]
+>>> inputs = processor(image, text=text_labels, return_tensors="pt")
+
+>>> with torch.no_grad():
+...     outputs = model(**inputs)
+
+>>> # convert outputs (bounding boxes and class logits)
+>>> results = processor.post_process_grounded_object_detection(
+...     outputs,
+...     target_sizes=[(image.height, image.width)],
+...     text_labels=text_labels,
+...     threshold=0.3,
+...     nms_threshold=0.3,
+... )
+>>> result = results[0]
+>>> boxes, scores, text_labels = result["boxes"], result["scores"], result["text_labels"]
+>>> for box, score, text_label in zip(boxes, scores, text_labels):
+...     box = [round(i, 2) for i in box.tolist()]
+...     print(f"Detected {text_label} with confidence {round(score.item(), 3)} at location {box}")
+Detected remote with confidence 0.768 at location [39.89, 70.35, 176.74, 118.04]
+Detected cat with confidence 0.72 at location [11.6, 54.19, 314.8, 473.95]
+Detected remote with confidence 0.563 at location [333.38, 75.77, 370.7, 187.03]
+Detected cat with confidence 0.552 at location [345.15, 23.95, 639.75, 371.67]
 ```
 
 ### Multi image inference

@@ -93,22 +96,22 @@ OmDet-Turbo can perform batched multi-image inference, with support for different
 
 >>> url1 = "http://images.cocodataset.org/val2017/000000039769.jpg"
 >>> image1 = Image.open(BytesIO(requests.get(url1).content)).convert("RGB")
->>> classes1 = ["cat", "remote"]
->>> task1 = "Detect {}.".format(", ".join(classes1))
+>>> text_labels1 = ["cat", "remote"]
+>>> task1 = "Detect {}.".format(", ".join(text_labels1))
 
 >>> url2 = "http://images.cocodataset.org/train2017/000000257813.jpg"
 >>> image2 = Image.open(BytesIO(requests.get(url2).content)).convert("RGB")
->>> classes2 = ["boat"]
+>>> text_labels2 = ["boat"]
 >>> task2 = "Detect everything that looks like a boat."
 
 >>> url3 = "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
 >>> image3 = Image.open(BytesIO(requests.get(url3).content)).convert("RGB")
->>> classes3 = ["statue", "trees"]
+>>> text_labels3 = ["statue", "trees"]
 >>> task3 = "Focus on the foreground, detect statue and trees."
 
 >>> inputs = processor(
 ...     images=[image1, image2, image3],
-...     text=[classes1, classes2, classes3],
+...     text=[text_labels1, text_labels2, text_labels3],
 ...     task=[task1, task2, task3],
 ...     return_tensors="pt",
 ... )

@@ -119,19 +122,19 @@ OmDet-Turbo can perform batched multi-image inference, with support for different
 >>> # convert outputs (bounding boxes and class logits)
 >>> results = processor.post_process_grounded_object_detection(
 ...     outputs,
-...     classes=[classes1, classes2, classes3],
-...     target_sizes=[image1.size[::-1], image2.size[::-1], image3.size[::-1]],
-...     score_threshold=0.2,
+...     text_labels=[text_labels1, text_labels2, text_labels3],
+...     target_sizes=[(image.height, image.width) for image in [image1, image2, image3]],
+...     threshold=0.2,
 ...     nms_threshold=0.3,
 ... )
 
 >>> for i, result in enumerate(results):
-...     for score, class_name, box in zip(
-...         result["scores"], result["classes"], result["boxes"]
+...     for score, text_label, box in zip(
+...         result["scores"], result["text_labels"], result["boxes"]
 ...     ):
 ...         box = [round(i, 1) for i in box.tolist()]
 ...         print(
-...             f"Detected {class_name} with confidence "
+...             f"Detected {text_label} with confidence "
 ...             f"{round(score.item(), 2)} at location {box} in image {i}"
 ...         )
 Detected remote with confidence 0.77 at location [39.9, 70.4, 176.7, 118.0] in image 0

docs/source/en/model_doc/owlv2.md

Lines changed: 15 additions & 10 deletions
@@ -50,20 +50,22 @@ OWLv2 is, just like its predecessor [OWL-ViT](owlvit), a zero-shot text-conditioned
 
 >>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
 >>> image = Image.open(requests.get(url, stream=True).raw)
->>> texts = [["a photo of a cat", "a photo of a dog"]]
->>> inputs = processor(text=texts, images=image, return_tensors="pt")
+>>> text_labels = [["a photo of a cat", "a photo of a dog"]]
+>>> inputs = processor(text=text_labels, images=image, return_tensors="pt")
 >>> outputs = model(**inputs)
 
 >>> # Target image sizes (height, width) to rescale box predictions [batch_size, 2]
->>> target_sizes = torch.Tensor([image.size[::-1]])
->>> # Convert outputs (bounding boxes and class logits) to Pascal VOC Format (xmin, ymin, xmax, ymax)
->>> results = processor.post_process_object_detection(outputs=outputs, target_sizes=target_sizes, threshold=0.1)
->>> i = 0  # Retrieve predictions for the first image for the corresponding text queries
->>> text = texts[i]
->>> boxes, scores, labels = results[i]["boxes"], results[i]["scores"], results[i]["labels"]
->>> for box, score, label in zip(boxes, scores, labels):
+>>> target_sizes = torch.tensor([(image.height, image.width)])
+>>> # Convert outputs (bounding boxes and class logits) to Pascal VOC format (xmin, ymin, xmax, ymax)
+>>> results = processor.post_process_grounded_object_detection(
+...     outputs=outputs, target_sizes=target_sizes, threshold=0.1, text_labels=text_labels
+... )
+>>> # Retrieve predictions for the first image for the corresponding text queries
+>>> result = results[0]
+>>> boxes, scores, text_labels = result["boxes"], result["scores"], result["text_labels"]
+>>> for box, score, text_label in zip(boxes, scores, text_labels):
 ...     box = [round(i, 2) for i in box.tolist()]
-...     print(f"Detected {text[label]} with confidence {round(score.item(), 3)} at location {box}")
+...     print(f"Detected {text_label} with confidence {round(score.item(), 3)} at location {box}")
 Detected a photo of a cat with confidence 0.614 at location [341.67, 23.39, 642.32, 371.35]
 Detected a photo of a cat with confidence 0.665 at location [6.75, 51.96, 326.62, 473.13]
 ```

@@ -103,6 +105,9 @@ Usage of OWLv2 is identical to [OWL-ViT](owlvit) with a new, updated image processor
 ## Owlv2Processor
 
 [[autodoc]] Owlv2Processor
+    - __call__
+    - post_process_grounded_object_detection
+    - post_process_image_guided_detection
 
 ## Owlv2Model
 

docs/source/en/model_doc/owlvit.md

Lines changed: 14 additions & 16 deletions
@@ -49,20 +49,22 @@ OWL-ViT is a zero-shot text-conditioned object detection model.
 
 >>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
 >>> image = Image.open(requests.get(url, stream=True).raw)
->>> texts = [["a photo of a cat", "a photo of a dog"]]
->>> inputs = processor(text=texts, images=image, return_tensors="pt")
+>>> text_labels = [["a photo of a cat", "a photo of a dog"]]
+>>> inputs = processor(text=text_labels, images=image, return_tensors="pt")
 >>> outputs = model(**inputs)
 
 >>> # Target image sizes (height, width) to rescale box predictions [batch_size, 2]
->>> target_sizes = torch.Tensor([image.size[::-1]])
+>>> target_sizes = torch.tensor([(image.height, image.width)])
 >>> # Convert outputs (bounding boxes and class logits) to Pascal VOC format (xmin, ymin, xmax, ymax)
->>> results = processor.post_process_object_detection(outputs=outputs, target_sizes=target_sizes, threshold=0.1)
->>> i = 0  # Retrieve predictions for the first image for the corresponding text queries
->>> text = texts[i]
->>> boxes, scores, labels = results[i]["boxes"], results[i]["scores"], results[i]["labels"]
->>> for box, score, label in zip(boxes, scores, labels):
+>>> results = processor.post_process_grounded_object_detection(
+...     outputs=outputs, target_sizes=target_sizes, threshold=0.1, text_labels=text_labels
+... )
+>>> # Retrieve predictions for the first image for the corresponding text queries
+>>> result = results[0]
+>>> boxes, scores, text_labels = result["boxes"], result["scores"], result["text_labels"]
+>>> for box, score, text_label in zip(boxes, scores, text_labels):
 ...     box = [round(i, 2) for i in box.tolist()]
-...     print(f"Detected {text[label]} with confidence {round(score.item(), 3)} at location {box}")
+...     print(f"Detected {text_label} with confidence {round(score.item(), 3)} at location {box}")
 Detected a photo of a cat with confidence 0.707 at location [324.97, 20.44, 640.58, 373.29]
 Detected a photo of a cat with confidence 0.717 at location [1.46, 55.26, 315.55, 472.17]
 ```

@@ -91,16 +93,12 @@ A demo notebook on using OWL-ViT for zero- and one-shot (image-guided) object detection
     - post_process_object_detection
     - post_process_image_guided_detection
 
-## OwlViTFeatureExtractor
-
-[[autodoc]] OwlViTFeatureExtractor
-    - __call__
-    - post_process
-    - post_process_image_guided_detection
-
 ## OwlViTProcessor
 
 [[autodoc]] OwlViTProcessor
+    - __call__
+    - post_process_grounded_object_detection
+    - post_process_image_guided_detection
 
 ## OwlViTModel
 

0 commit comments
