Problem
The _extract_collection_refs method in ui/app/dataloader.py (lines 519-521) uses a simple substring check to detect which BERDL collections a project uses:
    def _extract_collection_refs(self, readme_content: str) -> list[str]:
        """Extract BERDL collection IDs mentioned in README text."""
        return [cid for cid in self._COLLECTION_IDS if cid in readme_content]
This scans the entire concatenated text of README.md, RESEARCH_PLAN.md, and REPORT.md (lines 453-458). Any mention of a collection ID, even in Future Directions, Literature Context, or a passing reference, causes that collection to appear as a "Data Collection" on the project's observatory page.
Example
The phb_granule_ecology project mentioned kescience_fitnessbrowser only in a Future Directions bullet ("Query the BERDL Fitness Browser (kescience_fitnessbrowser) for phaC mutant fitness phenotypes..."). The observatory displayed Fitness Browser as a data source even though the project never queried it.
Workaround applied: Removed the backtick-quoted collection ID from the text (commit 49cb304).
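The false positive can be reproduced outside the application with a minimal sketch. The COLLECTION_IDS tuple, the example_collection ID, and the README text below are illustrative stand-ins (assumptions), not the real _COLLECTION_IDS list or the project's actual README; only the substring check mirrors the current method.

```python
# Hypothetical stand-in for self._COLLECTION_IDS; the real list lives in
# ui/app/dataloader.py and is not reproduced here.
COLLECTION_IDS = ("kescience_fitnessbrowser", "example_collection")


def extract_collection_refs(text: str) -> list[str]:
    """Same substring check as the current method, written as a free function."""
    return [cid for cid in COLLECTION_IDS if cid in text]


# Illustrative README: the Fitness Browser ID appears only under
# Future Directions, never under Data Sources.
readme = """\
## Data Sources
- example_collection

## Future Directions
- Query the BERDL Fitness Browser (kescience_fitnessbrowser) for phaC mutant fitness phenotypes.
"""

refs = extract_collection_refs(readme)
print(refs)  # the Fitness Browser ID is reported although it is not a data source
```

Because the check has no notion of document structure, removing the ID from the prose (the workaround above) is the only way to suppress the match without changing the code.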
Suggested Fix
Instead of scanning all text, restrict collection detection to the Data Sources section of README.md or the Data section of REPORT.md. For example:
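    import re  # needed at module level if dataloader.py does not already import it

    def _extract_collection_refs(self, all_text: str) -> list[str]:
        # Only look in Data Sources / Data sections, not the full text
        data_sections = []
        for section_name in ["Data Sources", "Data"]:
            # Capture the body of the "## <section_name>" heading, up to the
            # next "## " heading or the end of the text
            match = re.search(
                rf"^## {re.escape(section_name)}\s*$\n(.*?)(?=^## |\Z)",
                all_text, re.MULTILINE | re.DOTALL,
            )
            if match:
                data_sections.append(match.group(1))
        # Fall back to the full text when neither section exists
        search_text = "\n".join(data_sections) if data_sections else all_text
        return [cid for cid in self._COLLECTION_IDS if cid in search_text]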
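This would flag only collections that are explicitly listed as data sources, not those mentioned in passing. Note that the sketch falls back to scanning the full text when neither section is found, so projects without those headings would keep the current behavior.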
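A standalone variant of the same idea may be easier to unit test. The sketch below is one possible shape, not the project's implementation: COLLECTION_IDS and example_collection are hypothetical stand-ins, the function is a free function rather than a method, and it makes two design choices of its own. It uses re.finditer so that every matching section in the concatenated text is considered (re.search returns only the first match), and it returns nothing instead of falling back to the full text when no data section exists.

```python
import re

# Hypothetical stand-in for self._COLLECTION_IDS.
COLLECTION_IDS = ("kescience_fitnessbrowser", "example_collection")

# Matches a level-2 "Data Sources" or "Data" heading and captures its body,
# stopping at the next level-1 or level-2 heading or at the end of the text.
# Level-3 and deeper subsections stay inside the captured body.
_SECTION_RE = re.compile(
    r"^## (?:Data Sources|Data)[ \t]*\n(.*?)(?=^#{1,2} |\Z)",
    re.MULTILINE | re.DOTALL,
)


def extract_collection_refs(all_text: str) -> list[str]:
    """Return collection IDs mentioned inside Data Sources / Data sections."""
    sections = [m.group(1) for m in _SECTION_RE.finditer(all_text)]
    search_text = "\n".join(sections)  # empty string when no section matched
    return [cid for cid in COLLECTION_IDS if cid in search_text]


# Illustrative concatenation of two documents (not real project files).
text = """\
# README
## Data Sources
- example_collection

## Future Directions
- Query the BERDL Fitness Browser (kescience_fitnessbrowser) for phaC mutant fitness phenotypes.
# REPORT
## Data
No additional collections.
"""

print(extract_collection_refs(text))  # ['example_collection']
```

Dropping the fallback trades false positives for possible false negatives: a project whose documents lack both headings would show no collections at all, so whether that is acceptable depends on how consistently projects follow the heading convention.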
Affected Code
ui/app/dataloader.py, lines 197-210 (_COLLECTION_IDS), 453-458 (text concatenation), 519-521 (_extract_collection_refs)