This repository is used to matching external funders—as identified in OpenAlex metadata associated with scholarly articles—to their corresponding 41 codes or internal Funders used in ScholarWorks. This process serves as a preprocessing step for downstream tasks such as linking academic articles to their respective grants in ScholarWorks.
Two key platforms(both have API) are integral to this matching process:
1. OpenAlex
OpenAlex is an open-access bibliographic catalog of scientific articles, authors, institutions, and funders. It provides structured metadata about scholarly outputs, including:
- Article-level details
- Associated funders and, often, grants
- Unique OpenAlex IDs for each funder (e.g.,
https://openalex.org/F123456789)
Note: Each funder listed in OpenAlex includes an OpenAlex ID that can be used to retrieve structured metadata about the funder.
ROR provides open, persistent identifiers for research organizations, and includes:
- Organization metadata
- Alternative names
- Parent-child organizational structures
Note: While OpenAlex sometimes provides ROR IDs for funders, this linkage is not always consistent or complete.
The matching pipeline consists of two main steps:
- Parse metadata of academic articles (sourced from OpenAlex) associated with a specific funder.
- Restrict results to a self-defined time interval (e.g., past 5 years).
- Collect a list of unique funders and their OpenAlex IDs.
- Use the OpenAlex and (optionally) ROR metadata to enrich funder names.
- Normalize names and aliases to identify matching entries in the Internet 41 code registry used by ScholarWorks.
- Output a mapping file linking each OpenAlex funder to its respective Internet 41 code.
An example of output csv file would look something like this:
| unique_funder | matched_funder | matched_funder_code | matched_object |
|---|---|---|---|
| National Science Foundation | National Science Foundation (United States, Alexandria) - NSF | 41___NATIONAL_SCIENCE_FOUNDATION_(ALEXANDRIA) | child/Exact |
Funder_41_Code_Matching/
├── internal_funders/ # Normalized internal funder reference provided by Scholarworks
├── resources/ # Stored supporting csv file (e.g. country code, internal funder 41 code)
├── unique_funders/ # Extracted unique funder lists from OpenAlex (step 1)
├── edge_cases.py # Handles special or ambiguous funder matching logic
├── environment.yml # Conda environment definition file
├── funder_match.py # Core matching logic between OpenAlex funders and Internet 41 codes
├── main.py # Main script to run the pipeline end-to-end (step 2)
├── new_match_output.csv # Output CSV with matched funders and codes
└── README.md # Project documentation (you’re here!)
To run this repository, follow these steps to set up the virtual environment and run the code.
Open your terminal in the project root directory and run:
# Create a new Conda environment with all the required dependencies
conda env create -f environment.yml
# Activate the newly created environment
conda activate funder41Code_envTo extract all unique funders and their associated OpenAlex funder IDs from articles, run the following function in:
📄 funder41Code/unique_funders/unique_funder_and_count.py
Call the function get_all_funder(start_year: int, end_year: int, schoolId: str) with:
start_year,end_year: the time interval for articles to be pulled from OpenAlexschoolId: the OpenAlex institution ID (you can find this by searching the institution on OpenAlex)
To extract funders for Brandeis University (OpenAlex ID: I6902469) from 2018 to 2024:
schoolId = "I6902469" # OpenAlex ID for Brandeis University
get_all_funder(2018, 2024, schoolId)This will collect and save a list of unique funders acknowledged in articles affiliated with the institution, along with their OpenAlex funder IDs, for the given time range.
The output of the code above is stored in:
unique_funders/output/sorted_funder_data.csv
It would look something like this:
| funder_name | count | openalex_id |
|---|---|---|
| National Science Foundation | 606 | F4320306076 |
Each row represents a unique funder extracted from OpenAlex-matched articles, including:
- funder_name: the display name of the funder
- count: how many times this funder appears across the matched articles
- openalex_id: the unique OpenAlex ID for this funder
The repository uses this CSV file for further analysis, reporting, or fuzzy matching in downstream tasks.
Note: This method can run range from 10 - 20 minutes depend on the volume of articles are inculded in the time interval.
Run the main matching script to perform fuzzy matching between the unique funders and the 41-code funder reference list.
python main.pyThis executes the main() method, which performs the following steps:
For each funder in unique_funders/output/sorted_funder_data.csv:
-
Extract Metadata
- Use the funder's OpenAlex ID to call OpenAlex API and retrieve:
- Alternative funder names (from OpenAlex)
- Country code (if available)
- ROR ID (if available)
- If a ROR ID is available:
- Call the ROR API to fetch:
- More alternative names (aliases, labels)
- Parent ROR ID (if the funder is a sub-unit)
- Call the ROR API to fetch:
- Use the funder's OpenAlex ID to call OpenAlex API and retrieve:
-
Fuzzy Match Against Internal 41 Code List
- Loop through all internal funders from:
internal_funders/output/new_internal_41.csv - For each record, perform fuzzy string matching using all available names for the external funder.
- Loop through all internal funders from:
-
Matching Rules The logic for determining a match is as follows:
-
✅ Direct Match(Exact)
fuzzy_score >= 95andcountry_code matches→ Match found
-
🧪 Edge Case(EdgeCase)
- If no match is found, this may be:
- A European or Chinese grant where the named funder is not the overseeing funder
- If no match is found, this may be:
-
🏛 Parent Institution Match
- If a parent ROR ID is available:
- Fetch parent institution info
- Retry the 3 matching conditions above using the parent’s metadata
- If a parent ROR ID is available:
-
💡 Fallback Suggestion(HighestScore)
- If still no match and no parent ror Id:
- Suggest the internal funder with the highest similarity score, which is >= 60
- Mark this as a non-confident suggestion for manual validation
- If still no match and no parent ror Id:
The "confidence level" of the matching rules: Exact = EdgeCase > HighestScore
-
The final matching results are saved to:
final_match_output.csv
With the following columns:
unique_funder,matched_funder,matched_funder_code,matched_object
Where:
unique_funder: Funder name from OpenAlexmatched_funder: Best-matched name from internal 41 listmatched_funder_code: Associated code from internal listmatched_object: includes a tag that summarizes how the match was made.
matched_objectFormat:child|parent/MatchType
There are four possibleMatchTypevalues:
Exact: Direct fuzzy match with country (score ≥ 95)EdgeCase: Fuzzy match without country OR a known ambiguous funder situation (e.g. EU or China)HighestScore: No match found, fallback suggestion based on highest fuzzy scorenot_found: No match or valid suggestion, even after using parent RORAnd the prefix:
child: The match was based on the original funder's dataparent: The match was based on the parent institution's data (via ROR)
matched_object |
Meaning |
|---|---|
child/Exact |
Funder matched using its own metadata with a direct (Exact) match |
parent/EdgeCase |
Funder matched using its parent’s metadata, under an edge-case fallback |
child/HighestScore |
No official match found, but suggests best guess using fuzzy score only |
child/not_found |
No match found, and no usable parent ROR ID available |
... |
... |
This structure allows downstream systems and reviewers to easily assess the match quality and determine whether manual verification is needed.
| unique_funder | matched_funder | matched_funder_code | matched_object |
|---|---|---|---|
| National Science Foundation | "National Science Foundation (United States, Alexandria) - NSF" | 41___NATIONAL_SCIENCE_FOUNDATION_(ALEXANDRIA) | child/Exact |
This folder preprocessed and normalized internal funder data:
The internal funder CSV file is normalized to improve fuzzy matching performance. It separates:
funder_name: Cleaned and standardized name of the fundercountry: Country code for location-aware matchingcity: Optional city data to help resolve duplicatesascyloym: Shortened aliases or acronyms used by the funder (if applicable)
These internal records represent the full set of accepted funders in the system and do not change over time.
✅ Note: There is no need to re-run this processing step. This list is fixed and included as part of the system configuration.
This script allows you to test the fuzzy matching logic on a single, user-defined funder rather than processing the full external list.
- Mainly intended for debugging or development
- You can safely ignore this file if you're only running the standard full-matching pipeline