Skip to content

Conversation

@williballenthin
Copy link
Collaborator

closes #2745

closes ida: embedded pe: offsets are virtual addresses rather than file offsets
Fixes #2746
closes ghidra: embedded pe: offsets are virtual addresses rather than file offsets
Fixes #2747
closes binary ninja: embedded pe: offsets are virtual addresses rather than file offsets
Fixes #2748
Copy link

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add bug fixes, new features, breaking changes and anything else you think is worthwhile mentioning to the master (unreleased) section of CHANGELOG.md. If no CHANGELOG update is needed add the following to the PR description: [x] No CHANGELOG update needed

@gemini-code-assist
Copy link
Contributor

Summary of Changes

Hello @williballenthin, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request addresses an inconsistency in how file offsets were being reported by various disassembler extractors. By implementing dedicated virtual address to file offset conversion logic for Binary Ninja and Ghidra, and leveraging existing IDA Pro functionality, the change ensures that all FileOffsetAddress instances accurately reflect positions within the original file. This improves the precision and reliability of feature extraction, particularly for characteristics like embedded PE files and strings, which are inherently tied to file-based locations.

Highlights

  • Corrected Offset Handling: Ensures that FileOffsetAddress objects consistently represent file offsets rather than virtual addresses across all supported disassemblers (Binary Ninja, Ghidra, IDA Pro).
  • New Helper Functions: Introduced va_to_file_offset for Binary Ninja and addr_to_file_offset for Ghidra to accurately map virtual addresses to file offsets.
  • Refactored Extractors: Updated the embedded pe and string extraction logic in Binary Ninja, Ghidra, and IDA Pro extractors to utilize these new or existing file offset conversion mechanisms.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature Command Description
Code Review /gemini review Performs a code review for the current pull request in its current state.
Pull Request Summary /gemini summary Provides a summary of the current pull request in its current state.
Comment @gemini-code-assist Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help /gemini help Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@github-actions github-actions bot dismissed their stale review November 3, 2025 12:27

CHANGELOG updated or no update needed, thanks! 😄

gemini-code-assist[bot]

This comment was marked as resolved.

williballenthin and others added 2 commits November 3, 2025 13:28
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR fixes a bug where embedded PE features and string features were incorrectly using virtual addresses instead of file offsets when returning FileOffsetAddress objects across IDA, Ghidra, and Binary Ninja extractors.

Key changes:

  • Introduced platform-specific helper functions to convert virtual addresses to file offsets
  • Updated embedded PE and string extraction to use actual file offsets instead of virtual addresses
  • Added CHANGELOG entry documenting the fix

Reviewed Changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
capa/features/extractors/ida/file.py Uses ida_loader.get_fileregion_offset() to convert virtual addresses to file offsets
capa/features/extractors/ghidra/helpers.py Adds addr_to_file_offset() helper function to map Ghidra addresses to file offsets
capa/features/extractors/ghidra/file.py Uses new addr_to_file_offset() helper for embedded PE and string features
capa/features/extractors/binja/helpers.py Adds va_to_file_offset() helper function to map Binary Ninja virtual addresses to file offsets
capa/features/extractors/binja/file.py Uses new va_to_file_offset() helper for embedded PE and string features
CHANGELOG.md Documents the bug fix
Comments suppressed due to low confidence (1)

capa/features/extractors/binja/helpers.py:113

  • The error message should be more descriptive by including context about why the mapping failed (e.g., 'Virtual address 0x{va:x} is not within any mapped segment or section'). This would help with debugging when addresses fall outside mapped regions.
    # This enforces strict mapping so callers must handle missing mappings explicitly.
    raise RuntimeError(f"unable to map virtual address to file offset: 0x{va:x}")


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@williballenthin
Copy link
Collaborator Author

fyi @xusheng6

Copy link
Collaborator

@mr-tz mr-tz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM
CC: @mike-hunhoff @colton-gabertan for Ghidra and
@xusheng6 for Binja

@mike-hunhoff
Copy link
Collaborator

Thanks @williballenthin ! We're failing code style checks from:

diff --git a/capa/features/extractors/ghidra/helpers.py b/capa/features/extractors/ghidra/helpers.py
index b4092f2..b6ddcf6 100644
--- a/capa/features/extractors/ghidra/helpers.py
+++ b/capa/features/extractors/ghidra/helpers.py
@@ -320,7 +320,7 @@ def addr_to_file_offset(addr: ghidra.program.model.address.Address) -> int:
       - compute file offset = block.getStartingOffset() + section-relative offset
       - if no block matches, fall back to subtracting program image base
     """
-    prog = currentProgram()  # type: ignore[name-defined] # noqa: F821 
+    prog = currentProgram()  # type: ignore[name-defined] # noqa: F821
     aoff = addr.getOffset()
 
     for block in prog.getMemory().getBlocks():  # type: ignore[name-defined] # noqa: F821
@@ -334,4 +334,3 @@ def addr_to_file_offset(addr: ghidra.program.model.address.Address) -> int:
     # if no block matched, fall back to image-base subtraction
     base = prog.getImageBase().getOffset()
     return int(aoff - base)

Copy link
Collaborator

@mike-hunhoff mike-hunhoff left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems reasonable to me, LGTM! 🚀

@williballenthin
Copy link
Collaborator Author

ghidra failures, the accessors aren't available as I anticipated. will have to take some more investigation.

@mike-hunhoff do you want to take on the Ghidra triage here? and/or should i decouple the Ghidra fixes from the other commits here? I don't have a Ghidra dev environment set up right now, so it might be a bit before I address this.

https://github.com/mandiant/capa/actions/runs/19063884923/job/54452534659?pr=2749

return llil[0]


def va_to_file_offset(bv: BinaryView, va: int) -> int:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is no need to use a helper here -- binja has an API to do that directly: get_data_offset_for_address. See https://github.com/Vector35/binaryninja-api/blob/01c9f9a2df334909fae97abdfe695b620c7161bb/python/binaryview.py#L9965

@mike-hunhoff
Copy link
Collaborator

ghidra failures, the accessors aren't available as I anticipated. will have to take some more investigation.

@mike-hunhoff do you want to take on the Ghidra triage here? and/or should i decouple the Ghidra fixes from the other commits here? I don't have a Ghidra dev environment set up right now, so it might be a bit before I address this.

https://github.com/mandiant/capa/actions/runs/19063884923/job/54452534659?pr=2749

@williballenthin let's decouple the Ghidra changes. We really need to update capa's Ghidra backend to use Ghidra's built-in Python 3 support. With that, we're going to be changing quite a bit of the code, so this change should wait. Let's create an issue in the meantime to track?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

embedded pe: offsets are virtual addresses rather than file offsets

5 participants