Observatory UI incorrectly detects data collections from any text mention #105

@aparkin

Problem

The _extract_collection_refs method in ui/app/dataloader.py (lines 519-521) uses a simple substring check to detect which BERDL collections a project uses:

def _extract_collection_refs(self, readme_content: str) -> list[str]:
    """Extract BERDL collection IDs mentioned in README text."""
    return [cid for cid in self._COLLECTION_IDS if cid in readme_content]

This scans the entire concatenated text of README.md, RESEARCH_PLAN.md, and REPORT.md (lines 453-458). Any mention of a collection ID, even in Future Directions, Literature Context, or a passing reference, causes that collection to appear as a "Data Collection" on the project's observatory page.

Example

The phb_granule_ecology project mentioned kescience_fitnessbrowser only in a Future Directions bullet ("Query the BERDL Fitness Browser (kescience_fitnessbrowser) for phaC mutant fitness phenotypes..."). The observatory displayed Fitness Browser as a data source even though the project never queried it.
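The false positive can be reproduced outside the UI with a few lines. This is a minimal sketch: `COLLECTION_IDS` and the `readme` text are hypothetical stand-ins (the real list is `_COLLECTION_IDS` in ui/app/dataloader.py, and the bullet is abbreviated as in the quote above), and the free function mirrors the substring check rather than calling the actual method.

```python
# Hypothetical stand-in for _COLLECTION_IDS in ui/app/dataloader.py.
COLLECTION_IDS = ["kescience_fitnessbrowser"]

# Hypothetical project text: the collection ID appears only under Future Directions.
readme = """\
# phb_granule_ecology

## Future Directions

- Query the BERDL Fitness Browser (`kescience_fitnessbrowser`) for phaC mutant fitness phenotypes...
"""


def extract_collection_refs(text: str) -> list[str]:
    """Same substring check as the current method, written as a free function."""
    return [cid for cid in COLLECTION_IDS if cid in text]


detected = extract_collection_refs(readme)
print(detected)  # ['kescience_fitnessbrowser'], although the project never queried it
```

The check has no notion of where in the document the ID occurs, so a planning note is indistinguishable from an actual data source.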

Workaround applied: Removed the backtick-quoted collection ID from the text (commit 49cb304).

Suggested Fix

Instead of scanning all text, restrict collection detection to the Data Sources section of README.md or the Data section of REPORT.md. For example:

import re

def _extract_collection_refs(self, all_text: str) -> list[str]:
    """Extract BERDL collection IDs listed in Data Sources / Data sections."""
    # Only look in Data Sources / Data sections, not the full text
    data_sections = []
    for section_name in ["Data Sources", "Data"]:
        # Capture the section body up to the next "## " heading or end of text
        match = re.search(
            rf"^## {re.escape(section_name)}\s*$\n(.*?)(?=^## |\Z)",
            all_text, re.MULTILINE | re.DOTALL,
        )
        if match:
            data_sections.append(match.group(1))
    # Fall back to the full text when neither section exists
    search_text = "\n".join(data_sections) if data_sections else all_text
    return [cid for cid in self._COLLECTION_IDS if cid in search_text]

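This would only flag collections that are explicitly listed as data sources, not those mentioned in passing. Two caveats: projects with neither section still fall back to the full-text scan, and re.search returns only the first match per section name, so a second "## Data" heading later in the concatenated text is ignored.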
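A stricter variant of the same idea is sketched below. It is not the fix proposed above: it gathers every matching section with `re.finditer` (the input is three files concatenated) and returns an empty list, rather than falling back to the full text, when no data section exists. `COLLECTION_IDS`, `example_collection`, and the sample document are hypothetical; whether dropping the fallback is acceptable depends on how many existing projects lack a Data Sources section.

```python
import re

# Hypothetical stand-in for _COLLECTION_IDS; "example_collection" is an invented ID.
COLLECTION_IDS = ["kescience_fitnessbrowser", "example_collection"]

# A level-2 heading named exactly "Data Sources" or "Data", followed by its body
# up to the next level-2 heading or the end of the text.
_DATA_SECTION = re.compile(
    r"^## (?:Data Sources|Data)[ \t]*\n(.*?)(?=^## |\Z)",
    re.MULTILINE | re.DOTALL,
)


def extract_collection_refs(all_text: str) -> list[str]:
    """Return collection IDs named inside Data Sources / Data sections only."""
    # finditer collects every matching section, not just the first one.
    search_text = "\n".join(m.group(1) for m in _DATA_SECTION.finditer(all_text))
    # No fallback: with no data section, search_text is empty and nothing matches.
    return [cid for cid in COLLECTION_IDS if cid in search_text]


doc = """\
## Data Sources

- `example_collection`: listed as an actual input

## Future Directions

- Query the BERDL Fitness Browser (`kescience_fitnessbrowser`) later

## Data

Derived tables only.
"""

result = extract_collection_refs(doc)
print(result)  # ['example_collection']
```

Headings such as "## Data Availability" are not matched, because the pattern requires the heading line to end right after the section name.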

Affected Code

  • ui/app/dataloader.py, lines 197-210 (_COLLECTION_IDS), 453-458 (text concatenation), 519-521 (_extract_collection_refs)
