-
Notifications
You must be signed in to change notification settings - Fork 2.8k
Contributing a custum eval to the repository. Updated Version. #1124
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Merged
andrew-openai
merged 1 commit into
openai:main
from
IGES-Institut:contrib_population_span_extraction
Jun 8, 2023
Merged
Contributing a custum eval to the repository. Updated Version. #1124
andrew-openai
merged 1 commit into
openai:main
from
IGES-Institut:contrib_population_span_extraction
Jun 8, 2023
Conversation
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
… is in the registry file.
Contributor
|
I suggest you give this PR a meaningful title, for example the title of the eval. |
Collaborator
|
This PR is updated version of #1087. |
usama-openai
approved these changes
Jun 8, 2023
Collaborator
usama-openai
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for submitting this eval! This PR looks good. I'm approving this PR.
Collaborator
|
You should see GPT-4 API access enabled in your account in the next few days. |
arbreton
pushed a commit
to arbreton/evals
that referenced
this pull request
Jul 8, 2023
…i#1124) # Thank you for contributing an eval!♥️ 🚨 Please make sure your PR follows these guidelines, **failure to follow the guidelines below will result in the PR being closed automatically**. Note that even if the criteria are met, that does not guarantee the PR will be merged nor GPT-4 access be granted. 🚨 **PLEASE READ THIS**: In order for a PR to be merged, it must fail on GPT-4. We are aware that right now, users do not have access, so you will not be able to tell if the eval fails or not. Please run your eval with GPT-3.5-Turbo, but keep in mind as we run the eval, if GPT-4 gets higher than 90% on the eval, we will likely reject it since GPT-4 is already capable of completing the task. We plan to roll out a way for users submitting evals to see the eval performance on GPT-4 soon. Stay tuned! Until then, you will not be able to see the eval performance on GPT-4. **Starting April 10, the minimum eval count is 15 samples, we hope this makes it easier to create and contribute evals.** Also, please note that we're using **Git LFS** for storing the JSON files, so please make sure that you move the JSON file to Git LFS before submitting a PR. Details on how to use Git LFS are available [here](https://git-lfs.com). ## Eval details 📑 ### Eval name The <eval_name> is **population_span_extraction** ID is **population_span_extraction.dev.v0** ### Eval description The model is shown abstracts of clinical drug trials and tasked with extracting the text spans that specify the population demographic of the shown abstract. The population demographic can be but is not necessarily specified in multiple seperate spans. A previous version included examples containing 'problem' as part of the population (as per PICO criteria labeling) as opposed to strictly population demographics. We are now resubmitting a different version, with different abstracts, which contains only demographics annotations. ### What makes this a useful eval? The Repository specifically asks for "Real-world use cases". Extracting population spans from clinical study trials is immensly useful to researchers who have to go over and compare large amounts of clinical drug trials. The eval dataset is generated with multiple different prompts and statisfies all further critera posed by Open AI. ## Criteria for a good eval ✅ Below are some of the criteria we look for in a good eval. In general, we are seeking cases where the model does not do a good job despite being capable of generating a good response (note that there are some things large language models cannot do, so those would not make good evals). Your eval should be: - [x] Thematically consistent: The eval should be thematically consistent. We'd like to see a number of prompts all demonstrating some particular failure mode. For example, we can create an eval on cases where the model fails to reason about the physical world. - [x] Contains failures where a human can do the task, but either GPT-4 or GPT-3.5-Turbo could not. - [x] Includes good signal around what is the right behavior. This means either a correct answer for `Basic` evals or the `Fact` Model-graded eval, or an exhaustive rubric for evaluating answers for the `Criteria` Model-graded eval. - [x] **Include at least 15 high-quality examples.** If there is anything else that makes your eval worth including, please document it below. ### Unique eval value > Insert what makes your eval high quality that was not mentioned above. (Not required) ## Eval structure 🏗️ Your eval should - [x] Check that your data is in `evals/registry/data/{name}` - [x] Check that your YAML is registered at `evals/registry/evals/{name}.yaml` - [x] Ensure you have the right to use the data you submit via this eval (For now, we will only be approving evals that use one of the existing eval classes. You may still write custom eval classes for your own cases, and we may consider merging them in the future.) ## Final checklist 👀 ### Submission agreement By contributing to Evals, you are agreeing to make your evaluation logic and data under the same MIT license as this repository. You must have adequate rights to upload any data used in an Eval. OpenAI reserves the right to use this data in future service improvements to our product. Contributions to OpenAI Evals will be subject to our usual Usage Policies (<https://platform.openai.com/docs/usage-policies>). - [x] I agree that my submission will be made available under an MIT license and complies with OpenAI's usage policies. ### Email address validation If your submission is accepted, we will be granting GPT-4 access to a limited number of contributors. Access will be given to the email address associated with the commits on the merged pull request. - [x] I acknowledge that GPT-4 access will only be granted, if applicable, to the email address used for my merged pull request. ### Limited availability acknowledgment We know that you might be excited to contribute to OpenAI's mission, help improve our models, and gain access to GPT-4. However, due to the requirements mentioned above and the high volume of submissions, we will not be able to accept all submissions and thus not grant everyone who opens a PR GPT-4 access. We know this is disappointing, but we hope to set the right expectation before you open this PR. - [x] I understand that opening a PR, even if it meets the requirements above, does not guarantee the PR will be merged nor GPT-4 access be granted. ### Submit eval - [x] I have filled out all required fields of this form - [x] I have used **Git LFS** for the Eval JSON data - [ ] (Ignore if not submitting code) I have run `pip install pre-commit; pre-commit install` and have verified that `black`, `isort`, and `autoflake` are running when I commit and push Failure to fill out all required fields will result in the PR being closed. ### Eval JSON data Since we are using Git LFS, we are asking eval submitters to add in as many Eval Samples (at least 5) from their contribution here: <details> <summary>View evals in JSON</summary> ### Eval ``` {"input": [{"role": "system", "content": "I want to know how this abstract defines the Population Demographics. Please extract the sections in the abstract that define the demographics."}, {"role": "user", "content": "Efficacy of the dorzolamide/timolol fixed combination versus latanoprost in the treatment of ocular hypertension or glaucoma: combined analysis of pooled data from two large randomized observer and patient-masked studies.\n\nIn previous analyses of primary efficacy data from two randomized clinical trials, standard dosing regimens of the dorzolamide/timolol fixed combination (COSOPT) and latanoprost (XALATAN) were shown to have equivalent efficacy with regard to reduction in mean daytime diurnal intraocular pressure (IOP). We performed additional post hoc analyses of pooled data from these studies to compare further the efficacy of the two treatments. The studies used identical 3-month, parallel group, randomized, observer-masked and patient-masked, multicenter designs. Patients with a baseline IOP > or = 24 mm Hg were randomized to either the 2% dorzolamide/0.5% timolol combination eye drops twice daily (n = 273) or 0.005% latanoprost eye drops once daily (n = 271). The IOP measurements were made at 8 AM, 10 AM, 2 PM, and 4 PM at the baseline visit and then on each of the 3 monthly assessment days. The following measures were analyzed on a post hoc basis: 1) percentages of patients meeting target levels of IOP reduction; 2) mean IOP reduction in those patients with high IOP (> or =30 mmHg) at baseline; 3) mean IOP at each of the assessment time points during a day. A total of 259 patients in the dorzolamide/timolol group and 268 patients in the latanoprost group were included in the efficacy analysis. At 3 months, both treatments showed similar efficacy with regard to the percentages of patients who achieved target levels of IOP reduction (e.g., 40% IOP reduction in 15% of dorzolamide/timolol combination patients and 13% of latanoprost patients), mean IOP reduction in those patients with high IOP at baseline (dorzolamide/ timolol combination, 12.5 mmHg, latanoprost, 12.6 mmHg), and mean IOP at each time point during the day. By the measures used in this analysis, the dorzolamide/timolol combination and latanoprost were equally effective at lowering IOP in patients with ocular hypertension or glaucoma."}], "ideal": "In the abstract, population demographics are defined by the following spans: 'Patients with a baseline IOP > or = 24 mm Hg '"} {"input": [{"role": "system", "content": "Extract the text spans containing information on the Population Demographic from the following abstract."}, {"role": "user", "content": "Twenty-four-hour control with latanoprost-timolol-fixed combination therapy vs latanoprost therapy.\n\nOBJECTIVE: To evaluate the 24-hour efficacy and safety of the latanoprost-timolol maleate-fixed combination vs latanoprost therapy in patients with primary open-angle glaucoma.\nMETHODS: A prospective, observer-masked, crossover, active-controlled, randomized comparison in which after a 6-week medicine-free period, patients were randomized to either latanoprost-timolol-fixed combination therapy or latanoprost therapy, both dosed once each evening, alone for 8 weeks. Patients were then switched to the opposite treatment for 8 weeks. At the end of the washout and treatment periods, a 24-hour diurnal curve was performed.\nRESULTS: The baseline untreated mean +/- SD diurnal curve in 37 patients who completed the study was 24.2 +/- 2.0 mm Hg. The mean diurnal curve was 19.2 +/- 2.6 mm Hg for those who received latanoprost therapy alone and 16.7 +/- 2.1 mm Hg for those who received the fixed combination therapy (P<.001). The fixed combination therapy also provided a lower absolute intraocular pressure level (1.5-2.9 mm Hg, P<.001) and a greater intraocular pressure reduction from the untreated baseline (P<.001). Stinging was statistically lower with latanoprost therapy alone (P = .04), but itching was statistically increased compared with the fixed combination therapy (P = .04).\nCONCLUSION: The result of this study suggests that the latanoprost-timolol-fixed combination compared with latanoprost therapy alone provides improved intraocular pressure reduction over the 24-hour diurnal curve and for each individual time point in patients with primary open-angle glaucoma."}], "ideal": "In the abstract, population demographics are defined by the following spans: ' patients with primary open-angle glaucoma.'"} {"input": [{"role": "system", "content": "Extract the text spans containing information on the Population Demographic from the following abstract."}, {"role": "user", "content": "A 12-week, randomized, double-masked, multicenter study of the fixed combination of latanoprost and timolol in the evening versus the individual components.\n\nPURPOSE: To compare the efficacy and tolerability of fixed-combination latanoprost and timolol applied in the evening with the concomitant use of the individual components.\nDESIGN: Twelve-week, randomized, double-masked, multicenter study.\nPARTICIPANTS: Five hundred seventeen randomized patients with ocular hypertension; open-angle, pigmentary, or exfoliation glaucoma; and baseline (after washout) intraocular pressure (IOP) levels between 23 and 33 mmHg.\nMETHODS: Patients received either the fixed combination of latanoprost and timolol once daily in the evening and a placebo in the morning and evening or the unfixed combination of latanoprost once daily in the evening and timolol in the morning and evening. Study visits were at weeks 2, 6, and 12. MAIN OUTCOME MEASURES: The primary efficacy end point was mean change from baseline to week 12 in diurnal IOP (mean IOPs of 8 am, 12 pm, and 4 pm). The fixed combination was considered noninferior to the unfixed combination if the upper limit of the 95% confidence interval (CI) of the difference was <1.5 mmHg (analysis of covariance). Adverse events were recorded at each visit.\nRESULTS: In all, 502 patients were included in intent-to-treat analyses (fixed combination, n = 255; unfixed combination, n = 247). For the fixed- and unfixed-combination groups, mean baseline diurnal IOP levels were 25.4 mmHg and 25.2 mmHg, respectively, and mean diurnal IOP reductions were 8.7 mmHg and 9.0 mmHg (between-treatment difference, 0.3 mmHg; 95% CI, -0.1 to 0.7 mmHg; P = 0.15). Both treatments were well tolerated.\nCONCLUSIONS: The fixed combination of latanoprost and timolol administered once daily in the evening is not inferior to the unfixed combination of latanoprost once daily in the evening and timolol twice daily. The fixed combination provides an effective and well-tolerated alternative to multiple instillations."}], "ideal": "In the abstract, population demographics are defined by the following spans: 'patients with ocular hypertension; open-angle, pigmentary, or exfoliation glaucoma; and baseline (after washout) intraocular pressure (IOP) levels between 23 and 33 mmHg'"} {"input": [{"role": "system", "content": "This is from a clinical drug trial abstract. Extract the parts specifying population demographics."}, {"role": "user", "content": "Efficacy of latanoprost or fixed-combination latanoprost-timolol in patients switched from a combination of timolol and a nonprostaglandin medication.\n\nPURPOSE: To compare latanoprost with the fixed-combination latanoprost-timolol in glaucoma or ocular hypertension patients switched from a combination glaucoma therapy with timolol and another nonprostaglandin medication.\nDESIGN: Prospective randomized clinical trial.\nMETHODS: Glaucoma or ocular hypertension patients receiving a combined treatment of timolol 0.5% and another nonprostaglandin medication (pilocarpine 4%, alpha-agonist, or a topical carbonic anhydrase inhibitor) underwent a 30-day washout of their medications. A masked observer then measured their intraocular pressure (IOP). The subjects were randomized to either latanoprost or fixed-combination latanoprost-timolol eyedrops to use once daily at 7 am. The IOP was measured again 30 days after the patients started using one of the study drugs by the same examiner at the same time. MAIN OUTCOME MEASURE: Comparison of the study medications' hypotensive effect.\nRESULTS: Fifty-three eyes (28 in the latanoprost group and 25 in the latanoprost-timolol group) from 28 patients were included in the study. The IOP reduction was greater in both study groups compared with the previous combination therapy with timolol and another nonprostaglandin medication in millimeters of mercury (7.7+/-2.3 vs. 5.5+/-2.3, P<0.001, for the latanoprost group; 8.5+/-3.5 vs. 6.3+/-2.7, P<0.001, for the latanoprost-timolol group) and percentage (35.8+/-8.2% vs. 25.6+/-8.9%, P<0.001, for the latanoprost group; 38.6+/-8.7% vs. 28.6+/-9.0%, P<0.001, for the latanoprost-timolol group). There was no statistical difference between latanoprost and fixed-combination latanoprost-timolol in reducing IOP, in either millimeters of mercury (P = 0.3) or percentage (P = 0.2).\nCONCLUSIONS: Both latanoprost and fixed-combination latanoprost-timolol may be viable substitutes for timolol and another nonprostaglandin medication in glaucoma or ocular hypertension patients."}], "ideal": "In the abstract, population demographics are defined by the following spans: 'Glaucoma or ocular hypertension patients receiving a combined treatment of timolol 0.5% and another nonprostaglandin medication (pilocarpine 4%, alpha-agonist, or a topical carbonic anhydrase inhibitor)'"} {"input": [{"role": "system", "content": "The Following text is an abstract of a clinical drug trial that specifies a population demographic. I want you to extract the text spans that contain these informations."}, {"role": "user", "content": "A 6-week, double-masked, parallel-group study of the efficacy and safety of travoprost 0.004% compared with latanoprost 0:005%/timolol 0.5% in patients with primary open-angle glaucoma or ocular hypertension.\n\nOBJECTIVE: The objective of this study was to directly compare the intraocular pressure (IOP)-lowering efficacy and safety of travoprost 0.004% eyedrops with the fixed combination of latanoprost 0.005%/timolol 0.5% eyedrops in patients with primary open-angle glaucoma or ocular hypertension.\nMETHODS: This was a randomized, double-masked, multicenter, parallel-group, active-controlled study. Adult subjects with open-angle glaucoma (with or without pseudoexfoliation or pigment dispersion component) or ocular hypertension were eligible to participate if their IOP was inadequately controlled with > or =4 weeks of beta-blocker monotherapy, as indicated by IOP of 22 to 36 mm Hg at 9 AM at screening. Patients were randomly assigned in a 1:1 ratio to receive placebo + travoprost or latanoprost/timolol + placebo. Patients in the travoprost group administered travoprost at 9 PM and placebo at 9 AM; patients in the latanoprost/timolol group administered latanoprost/timolol at 9 AM and placebo at 9 PM. IOP measurements were performed using Goldmann applanation tonometry at 9 AM and 5 PM at the week-2 and week-6 visits. Both volunteered and elicited reports of adverse events were collected; all patients who were randomized and received > or =1 dose of study drug were included in the safety analysis.\nRESULTS: One hundred ten patients were randomized, of whom 106 patients were evaluable (travoprost, n = 50; latanoprost/timolol, n = 56). There were no statistically significant differences at baseline between the treatment groups, based on age group, sex, race, iris color, or diagnosis. Mean IOP values were not statistically different between groups at baseline or during treatment. In the pooled results for 9 Am assessment at weeks 2 and 6, mean (SEM) IOP reductions for travoprost and latanoprost/timolol were 7.0 (0.5) and 6.4 (0.5) mm Hg, respectively (P = NS). Adverse events related to therapy were mild in nature, and there were no statistically significant differences between the 2 treatment groups. The most frequently experienced adverse events in the travoprost group were ocular hyperemia (9.3%), foreign body sensation (5.6%), abnormal vision (1.9%), allergic reaction (1.9%), conjunctivitis (1.9%), dacryocystitis (1.9%), eye discharge (1.9%), eye pruritus (1.9%), lid edema (1.9%), lid erythema (1.9%), and tearing (1.9%). In the latanoprost/timolol group, the most frequently experienced adverse events were cataract (1.8%), dry eyes (1.8%), eye pruritus (1.8%), foreign body sensation (1.8%), and ocular hyperemia (1.8%).\nCONCLUSIONS: Mean IOP changes from baseline for travoprost 0.004% and latanoprost 0.005%/timolol 0.5% fixed combination were not significantly different at follow-up in these patients. Both medications were well tolerated."}], "ideal": "In the abstract, population demographics are defined by the following spans: 'in patients with primary open-angle glaucoma or ocular hypertension.', 'Adult subjects with open-angle glaucoma (with or without pseudoexfoliation or pigment dispersion component) or ocular hypertension', 'IOP was inadequately controlled with > or =4 weeks of beta-blocker monotherapy'"} {"input": [{"role": "system", "content": "I want to know how this abstract defines the Population Demographics. Please extract the sections in the abstract that define the demographics."}, {"role": "user", "content": "Comparison of the efficacy and safety of travoprost with a fixed-combination of dorzolamide and timolol in patients with open-angle glaucoma or ocular hypertension.\n\nPURPOSE: The purpose of this study was to compare travoprost (TRAV; travoprost 0.004%) and the fixed-combination of dorzolamide/timolol (DTFC; dorzolamide 2.0%/timolol maleate 0.5%) ophthalmic solutions for reducing intraocular pressure (IOP) in patients with primary open-angle glaucoma (OAG) or ocular hypertension (OHT).\nMETHODS: This was a randomized single masked, study with parallel controls. The TRAV group (n = 29) dosed once daily at 9:00 PM while the DTFC group (n = 27) dosed twice daily at 9:00 AM and 9:00 PM. IOP was measured at baseline, and following 3 weeks and 6 weeks of treatment at 8:00 AM, 12:00 PM, 4:00 PM, and 8:00 PM.\nRESULTS: Mean average IOP reductions from baseline during the course of the day were 7.5 (32.7%) and 7.1 (30.7%) mmHg for TRAV and 4.8 (23.1%) and 4.5 (21.7%) mmHg for DTFC at 3 weeks and 6 weeks, respectively. The greater IOP reduction for patients receiving TRAV was statistically significant at both the 3 and 6 week visits when averaged across all four time points (p < 0.01). The two products were well-tolerated over the course of the 6 week study. Some factors such as taste perversion were reported more often in the DTFC group.\nCONCLUSIONS: Travoprost monotherapy provided better efficacy in terms of IOP reduction and percentage of IOP reduction compared to dorzolamide 2.0%/timolol maleate 0.5% fixed combination."}], "ideal": "In the abstract, population demographics are defined by the following spans: 'in patients with primary open-angle glaucoma (OAG)', 'ocular hypertension (OHT)'"} {"input": [{"role": "system", "content": "What is the Population Demographic for the following abstract? Extract the text spans that define it."}, {"role": "user", "content": "Efficacy and safety of latanoprost versus travoprost in exfoliative glaucoma patients.\n\nOBJECTIVE: To evaluate 24-hour intraocular pressure (IOP) efficacy of latanoprost versus travoprost, each given every evening, in exfoliative glaucoma patients.\nDESIGN: Prospective, observer-masked, crossover comparison.\nPARTICIPANTS: Forty patients with exfoliation glaucoma.\nMETHODS: Patients with a pressure of >24 mmHg were randomized to latanoprost or travoprost for an 8-week treatment period after a 6-week medicine-free period. Patients were then switched to the opposite treatment for the second period. At untreated baseline and at the end of each treatment period the IOP was measured at 6 am, 10 am, 2 pm, 6 pm, 10 pm, and 2 am. MAIN OUTCOME MEASURE: Diurnal IOP.\nRESULTS: The mean 24-hour IOP was 25.1+/-2.5 mmHg at baseline, 17.8+/-2.1 mmHg on latanoprost, and 17.3+/-2.2 mmHg on travoprost (P = 0.001). Individual time points were similar between treatments, except at 6 pm when travoprost provided lower IOP (16.7+/-2.6 vs 17.9+/-2.5 mmHg, P<0.001). Adverse events showed more conjunctival hyperemia with travoprost (n = 15) than latanoprost (n = 6; P = 0.03).\nCONCLUSIONS: Latanoprost and travoprost both significantly reduce the 24-hour IOP from baseline in exfoliative glaucoma, but travoprost may demonstrate a greater hypotensive efficacy in the late afternoon."}], "ideal": "In the abstract, population demographics are defined by the following spans: 'Patients with a pressure of >24 mmHg', 'exfoliative glaucoma patients'"} {"input": [{"role": "system", "content": "What is the Population Demographic for the following abstract? Extract the text spans that define it."}, {"role": "user", "content": "Comparison of the ocular hypotensive effects of bimatoprost and timolol-dorzolamide combination in patients with elevated intraocular pressure: a 6-month study.\n\nPURPOSE: To compare the ocular hypotensive efficacy and safety of topical bimatoprost and timolol-dorzolamide combination in patients with primary open-angle glaucoma (POAG) or ocular hypertension during 6 months of treatment.\nMETHODS: A sample of 65 patients with a diagnosis of POAG or ocular hypertension were randomized to receive either bimatoprost 0.03% once daily or timolol-dorzolamide combination twice daily. Study visits occurred at baseline and after 2 weeks and 1, 3 and 6 months of therapy. Intraocular pressure (IOP) measurements were performed at 12.00 hours at all study visits and also at 08.00 hours and 16.00 hours at baseline and 6-month visits. At each visit, local and systemic side-effects that occurred during the treatment period were recorded. Student's t-test was used to compare the differences between IOP values.\nRESULTS: Differences in IOP between the bimatoprost and timolol-dorzolamide groups were statistically insignificant at all study visits (p > 0.05). In the bimatoprost-treated group, the IOP reduction was 6.2 +/- 1.8 mmHg, whereas it was 6.5 +/- 2.3 mmHg in the timolol-dorzolamide group after 6 months of treatment. The difference was not statistically significant (p = 0.48).\nCONCLUSIONS: The IOP-lowering efficacies of bimatoprost and timolol-dorzolamide combination were similar over a 6-month follow-up. Both bimatoprost and the timolol-dorzolamide combination were well tolerated. Bimatoprost can be used as a longterm monotherapy agent in the treatment of POAG and ocular hypertension."}], "ideal": "In the abstract, population demographics are defined by the following spans: 'patients with primary open-angle glaucoma (POAG) or ocular hypertension'"} {"input": [{"role": "system", "content": "What is the Population Demographic for the following abstract? Extract the text spans that define it."}, {"role": "user", "content": "Comparing the fixed combination brimonidine-timolol versus fixed combination dorzolamide-timolol in patients with elevated intraocular pressure.\n\nPURPOSE: To evaluate the efficacy of fixed combination brimonidine-timolol (FCBT) versus fixed combination dorzolamide-timolol (FCDT) given twice daily in patients with primary open angle glaucoma (POAG) or ocular hypertension (OH).\nDESIGN: Prospective, multicentre, masked-observer, crossover comparison.\nPARTICIPANTS: Sixteen patients with POAG and 14 with OH.\nMETHODS: The participants of the study were washed out from their previous medication and randomized to fixed FCBT or FCDT for the first 4-week treatment period. Subjects then were washed for 4 weeks and started on the opposite medication for the second 4-week period. Intraocular pressure (IOP) was measured with a Goldmann applanation tonometer at 8:00 a.m., 12:00 noon and 4:00 p.m. at each baseline and at the end of each treatment period. Unsolicited ocular adverse events were also recorded. MAIN OUTCOME MEASURES: Comparison of the IOP lowering effect of FCBT and FCDT.\nRESULTS: The baseline mean diurnal IOP for all 30 subjects (30 eyes) was 22.9 +/- 1.6 mmHg. Both fixed combinations significantly reduced IOP compared with baseline (p < 0.00001). The mean diurnal IOP following 4 weeks of therapy was 15.0 +/- 2.1 mmHg for FCBT and 15.4 +/- 2.1 mmHg for FCDT (p = 0.510). The mean diurnal IOP reduction was 7.8 +/- 1.9 mmHg for FCBT and 7.4 +/- 1.8 mmHg for FCDT (p = 0.430). Overall, 14 subjects complained about ocular adverse events: two only for FCBT, seven only for FCDT and five for both drugs. Although there was no significant difference between the number of subjects that reported ocular adverse events with FCBT (n = 7) and FCDT (n = 12) (p = 0.359), FCDT caused more ocular stinging upon instillation (n = 9) than FCBT (n = 1) (p = 0.027).\nCONCLUSION: This study suggests that FCBT and FCDT, each given twice daily, have similar efficacy in patients with POAG or OH."}], "ideal": "In the abstract, population demographics are defined by the following spans: 'patients with primary open angle glaucoma (POAG) or ocular hypertension (OH)', 'patients with POAG', 'OH'"} {"input": [{"role": "system", "content": "I want to know how this abstract defines the Population Demographics. Please extract the sections in the abstract that define the demographics."}, {"role": "user", "content": "A comparison of the safety and intraocular pressure lowering of bimatoprost/timolol fixed combination versus latanoprost/timolol fixed combination in patients with open-angle glaucoma.\n\nPURPOSE: To compare the efficacy and tolerability of a once daily evening dose of the latanoprost/timolol fixed combination (LTFC) with that of a once-daily evening dose of the bimatoprost/timolol fixed combination (BTFC) in patients with open-angle glaucoma with elevated intraocular pressure (IOP) insufficiently responsive to monotherapy with prostaglandin analogues/prostamides.\nDESIGN: Prospective, randomized, evaluator masked, single-center study.\nPARTICIPANTS: 36 patients with a diagnosis of open-angle glaucoma, with or without pseudoexfoliation, and inadequate control of IOP, insufficiently responsive to monotherapy with prostaglandin analogues/prostamides. MAIN OUTCOME MEASURE: The primary end-points were the change in IOP at 9:00 am from baseline to week 4, and the difference between treatment groups in the mean diurnal IOP reduction from baseline to week 4.\nRESULTS: BTFC provided significantly greater mean diurnal IOP reduction [mean (standard deviation)] 2.8 (0.9) mmHg, compared with LTFC 2.1 (0.6) mmHg, p = 0.0214. Both treatments significantly reduced the IOP from baseline at each IOP time-point measured, p < 0.0001, and for the mean diurnal IOP; p = 0.0049 for the LTFC, and p < 0.0001 for the BTFC. There were no significant differences in average hyperemia scores among groups, 1.25 (0.5) vs. 1.62 (0.69), p = 0.3835, for the LTFC and the BTFC, respectively.\nCONCLUSIONS: The results of this study showed a significantly higher IOP-lowering effect of a once-daily evening dose of the BTFC compared to that of a once-daily evening administration of the LTFC."}], "ideal": "In the abstract, population demographics are defined by the following spans: 'patients with open-angle glaucoma with elevated intraocular pressure (IOP) insufficiently responsive to monotherapy with prostaglandin analogues/prostamides'"} ``` </details>
jacobbieker
pushed a commit
to withmartian/-ARCHIVED--router-evals
that referenced
this pull request
Jan 9, 2024
…i#1124) # Thank you for contributing an eval!♥️ 🚨 Please make sure your PR follows these guidelines, **failure to follow the guidelines below will result in the PR being closed automatically**. Note that even if the criteria are met, that does not guarantee the PR will be merged nor GPT-4 access be granted. 🚨 **PLEASE READ THIS**: In order for a PR to be merged, it must fail on GPT-4. We are aware that right now, users do not have access, so you will not be able to tell if the eval fails or not. Please run your eval with GPT-3.5-Turbo, but keep in mind as we run the eval, if GPT-4 gets higher than 90% on the eval, we will likely reject it since GPT-4 is already capable of completing the task. We plan to roll out a way for users submitting evals to see the eval performance on GPT-4 soon. Stay tuned! Until then, you will not be able to see the eval performance on GPT-4. **Starting April 10, the minimum eval count is 15 samples, we hope this makes it easier to create and contribute evals.** Also, please note that we're using **Git LFS** for storing the JSON files, so please make sure that you move the JSON file to Git LFS before submitting a PR. Details on how to use Git LFS are available [here](https://git-lfs.com). ## Eval details 📑 ### Eval name The <eval_name> is **population_span_extraction** ID is **population_span_extraction.dev.v0** ### Eval description The model is shown abstracts of clinical drug trials and tasked with extracting the text spans that specify the population demographic of the shown abstract. The population demographic can be but is not necessarily specified in multiple seperate spans. A previous version included examples containing 'problem' as part of the population (as per PICO criteria labeling) as opposed to strictly population demographics. We are now resubmitting a different version, with different abstracts, which contains only demographics annotations. ### What makes this a useful eval? The Repository specifically asks for "Real-world use cases". Extracting population spans from clinical study trials is immensly useful to researchers who have to go over and compare large amounts of clinical drug trials. The eval dataset is generated with multiple different prompts and statisfies all further critera posed by Open AI. ## Criteria for a good eval ✅ Below are some of the criteria we look for in a good eval. In general, we are seeking cases where the model does not do a good job despite being capable of generating a good response (note that there are some things large language models cannot do, so those would not make good evals). Your eval should be: - [x] Thematically consistent: The eval should be thematically consistent. We'd like to see a number of prompts all demonstrating some particular failure mode. For example, we can create an eval on cases where the model fails to reason about the physical world. - [x] Contains failures where a human can do the task, but either GPT-4 or GPT-3.5-Turbo could not. - [x] Includes good signal around what is the right behavior. This means either a correct answer for `Basic` evals or the `Fact` Model-graded eval, or an exhaustive rubric for evaluating answers for the `Criteria` Model-graded eval. - [x] **Include at least 15 high-quality examples.** If there is anything else that makes your eval worth including, please document it below. ### Unique eval value > Insert what makes your eval high quality that was not mentioned above. (Not required) ## Eval structure 🏗️ Your eval should - [x] Check that your data is in `evals/registry/data/{name}` - [x] Check that your YAML is registered at `evals/registry/evals/{name}.yaml` - [x] Ensure you have the right to use the data you submit via this eval (For now, we will only be approving evals that use one of the existing eval classes. You may still write custom eval classes for your own cases, and we may consider merging them in the future.) ## Final checklist 👀 ### Submission agreement By contributing to Evals, you are agreeing to make your evaluation logic and data under the same MIT license as this repository. You must have adequate rights to upload any data used in an Eval. OpenAI reserves the right to use this data in future service improvements to our product. Contributions to OpenAI Evals will be subject to our usual Usage Policies (<https://platform.openai.com/docs/usage-policies>). - [x] I agree that my submission will be made available under an MIT license and complies with OpenAI's usage policies. ### Email address validation If your submission is accepted, we will be granting GPT-4 access to a limited number of contributors. Access will be given to the email address associated with the commits on the merged pull request. - [x] I acknowledge that GPT-4 access will only be granted, if applicable, to the email address used for my merged pull request. ### Limited availability acknowledgment We know that you might be excited to contribute to OpenAI's mission, help improve our models, and gain access to GPT-4. However, due to the requirements mentioned above and the high volume of submissions, we will not be able to accept all submissions and thus not grant everyone who opens a PR GPT-4 access. We know this is disappointing, but we hope to set the right expectation before you open this PR. - [x] I understand that opening a PR, even if it meets the requirements above, does not guarantee the PR will be merged nor GPT-4 access be granted. ### Submit eval - [x] I have filled out all required fields of this form - [x] I have used **Git LFS** for the Eval JSON data - [ ] (Ignore if not submitting code) I have run `pip install pre-commit; pre-commit install` and have verified that `black`, `isort`, and `autoflake` are running when I commit and push Failure to fill out all required fields will result in the PR being closed. ### Eval JSON data Since we are using Git LFS, we are asking eval submitters to add in as many Eval Samples (at least 5) from their contribution here: <details> <summary>View evals in JSON</summary> ### Eval ``` {"input": [{"role": "system", "content": "I want to know how this abstract defines the Population Demographics. Please extract the sections in the abstract that define the demographics."}, {"role": "user", "content": "Efficacy of the dorzolamide/timolol fixed combination versus latanoprost in the treatment of ocular hypertension or glaucoma: combined analysis of pooled data from two large randomized observer and patient-masked studies.\n\nIn previous analyses of primary efficacy data from two randomized clinical trials, standard dosing regimens of the dorzolamide/timolol fixed combination (COSOPT) and latanoprost (XALATAN) were shown to have equivalent efficacy with regard to reduction in mean daytime diurnal intraocular pressure (IOP). We performed additional post hoc analyses of pooled data from these studies to compare further the efficacy of the two treatments. The studies used identical 3-month, parallel group, randomized, observer-masked and patient-masked, multicenter designs. Patients with a baseline IOP > or = 24 mm Hg were randomized to either the 2% dorzolamide/0.5% timolol combination eye drops twice daily (n = 273) or 0.005% latanoprost eye drops once daily (n = 271). The IOP measurements were made at 8 AM, 10 AM, 2 PM, and 4 PM at the baseline visit and then on each of the 3 monthly assessment days. The following measures were analyzed on a post hoc basis: 1) percentages of patients meeting target levels of IOP reduction; 2) mean IOP reduction in those patients with high IOP (> or =30 mmHg) at baseline; 3) mean IOP at each of the assessment time points during a day. A total of 259 patients in the dorzolamide/timolol group and 268 patients in the latanoprost group were included in the efficacy analysis. At 3 months, both treatments showed similar efficacy with regard to the percentages of patients who achieved target levels of IOP reduction (e.g., 40% IOP reduction in 15% of dorzolamide/timolol combination patients and 13% of latanoprost patients), mean IOP reduction in those patients with high IOP at baseline (dorzolamide/ timolol combination, 12.5 mmHg, latanoprost, 12.6 mmHg), and mean IOP at each time point during the day. By the measures used in this analysis, the dorzolamide/timolol combination and latanoprost were equally effective at lowering IOP in patients with ocular hypertension or glaucoma."}], "ideal": "In the abstract, population demographics are defined by the following spans: 'Patients with a baseline IOP > or = 24 mm Hg '"} {"input": [{"role": "system", "content": "Extract the text spans containing information on the Population Demographic from the following abstract."}, {"role": "user", "content": "Twenty-four-hour control with latanoprost-timolol-fixed combination therapy vs latanoprost therapy.\n\nOBJECTIVE: To evaluate the 24-hour efficacy and safety of the latanoprost-timolol maleate-fixed combination vs latanoprost therapy in patients with primary open-angle glaucoma.\nMETHODS: A prospective, observer-masked, crossover, active-controlled, randomized comparison in which after a 6-week medicine-free period, patients were randomized to either latanoprost-timolol-fixed combination therapy or latanoprost therapy, both dosed once each evening, alone for 8 weeks. Patients were then switched to the opposite treatment for 8 weeks. At the end of the washout and treatment periods, a 24-hour diurnal curve was performed.\nRESULTS: The baseline untreated mean +/- SD diurnal curve in 37 patients who completed the study was 24.2 +/- 2.0 mm Hg. The mean diurnal curve was 19.2 +/- 2.6 mm Hg for those who received latanoprost therapy alone and 16.7 +/- 2.1 mm Hg for those who received the fixed combination therapy (P<.001). The fixed combination therapy also provided a lower absolute intraocular pressure level (1.5-2.9 mm Hg, P<.001) and a greater intraocular pressure reduction from the untreated baseline (P<.001). Stinging was statistically lower with latanoprost therapy alone (P = .04), but itching was statistically increased compared with the fixed combination therapy (P = .04).\nCONCLUSION: The result of this study suggests that the latanoprost-timolol-fixed combination compared with latanoprost therapy alone provides improved intraocular pressure reduction over the 24-hour diurnal curve and for each individual time point in patients with primary open-angle glaucoma."}], "ideal": "In the abstract, population demographics are defined by the following spans: ' patients with primary open-angle glaucoma.'"} {"input": [{"role": "system", "content": "Extract the text spans containing information on the Population Demographic from the following abstract."}, {"role": "user", "content": "A 12-week, randomized, double-masked, multicenter study of the fixed combination of latanoprost and timolol in the evening versus the individual components.\n\nPURPOSE: To compare the efficacy and tolerability of fixed-combination latanoprost and timolol applied in the evening with the concomitant use of the individual components.\nDESIGN: Twelve-week, randomized, double-masked, multicenter study.\nPARTICIPANTS: Five hundred seventeen randomized patients with ocular hypertension; open-angle, pigmentary, or exfoliation glaucoma; and baseline (after washout) intraocular pressure (IOP) levels between 23 and 33 mmHg.\nMETHODS: Patients received either the fixed combination of latanoprost and timolol once daily in the evening and a placebo in the morning and evening or the unfixed combination of latanoprost once daily in the evening and timolol in the morning and evening. Study visits were at weeks 2, 6, and 12. MAIN OUTCOME MEASURES: The primary efficacy end point was mean change from baseline to week 12 in diurnal IOP (mean IOPs of 8 am, 12 pm, and 4 pm). The fixed combination was considered noninferior to the unfixed combination if the upper limit of the 95% confidence interval (CI) of the difference was <1.5 mmHg (analysis of covariance). Adverse events were recorded at each visit.\nRESULTS: In all, 502 patients were included in intent-to-treat analyses (fixed combination, n = 255; unfixed combination, n = 247). For the fixed- and unfixed-combination groups, mean baseline diurnal IOP levels were 25.4 mmHg and 25.2 mmHg, respectively, and mean diurnal IOP reductions were 8.7 mmHg and 9.0 mmHg (between-treatment difference, 0.3 mmHg; 95% CI, -0.1 to 0.7 mmHg; P = 0.15). Both treatments were well tolerated.\nCONCLUSIONS: The fixed combination of latanoprost and timolol administered once daily in the evening is not inferior to the unfixed combination of latanoprost once daily in the evening and timolol twice daily. The fixed combination provides an effective and well-tolerated alternative to multiple instillations."}], "ideal": "In the abstract, population demographics are defined by the following spans: 'patients with ocular hypertension; open-angle, pigmentary, or exfoliation glaucoma; and baseline (after washout) intraocular pressure (IOP) levels between 23 and 33 mmHg'"} {"input": [{"role": "system", "content": "This is from a clinical drug trial abstract. Extract the parts specifying population demographics."}, {"role": "user", "content": "Efficacy of latanoprost or fixed-combination latanoprost-timolol in patients switched from a combination of timolol and a nonprostaglandin medication.\n\nPURPOSE: To compare latanoprost with the fixed-combination latanoprost-timolol in glaucoma or ocular hypertension patients switched from a combination glaucoma therapy with timolol and another nonprostaglandin medication.\nDESIGN: Prospective randomized clinical trial.\nMETHODS: Glaucoma or ocular hypertension patients receiving a combined treatment of timolol 0.5% and another nonprostaglandin medication (pilocarpine 4%, alpha-agonist, or a topical carbonic anhydrase inhibitor) underwent a 30-day washout of their medications. A masked observer then measured their intraocular pressure (IOP). The subjects were randomized to either latanoprost or fixed-combination latanoprost-timolol eyedrops to use once daily at 7 am. The IOP was measured again 30 days after the patients started using one of the study drugs by the same examiner at the same time. MAIN OUTCOME MEASURE: Comparison of the study medications' hypotensive effect.\nRESULTS: Fifty-three eyes (28 in the latanoprost group and 25 in the latanoprost-timolol group) from 28 patients were included in the study. The IOP reduction was greater in both study groups compared with the previous combination therapy with timolol and another nonprostaglandin medication in millimeters of mercury (7.7+/-2.3 vs. 5.5+/-2.3, P<0.001, for the latanoprost group; 8.5+/-3.5 vs. 6.3+/-2.7, P<0.001, for the latanoprost-timolol group) and percentage (35.8+/-8.2% vs. 25.6+/-8.9%, P<0.001, for the latanoprost group; 38.6+/-8.7% vs. 28.6+/-9.0%, P<0.001, for the latanoprost-timolol group). There was no statistical difference between latanoprost and fixed-combination latanoprost-timolol in reducing IOP, in either millimeters of mercury (P = 0.3) or percentage (P = 0.2).\nCONCLUSIONS: Both latanoprost and fixed-combination latanoprost-timolol may be viable substitutes for timolol and another nonprostaglandin medication in glaucoma or ocular hypertension patients."}], "ideal": "In the abstract, population demographics are defined by the following spans: 'Glaucoma or ocular hypertension patients receiving a combined treatment of timolol 0.5% and another nonprostaglandin medication (pilocarpine 4%, alpha-agonist, or a topical carbonic anhydrase inhibitor)'"} {"input": [{"role": "system", "content": "The Following text is an abstract of a clinical drug trial that specifies a population demographic. I want you to extract the text spans that contain these informations."}, {"role": "user", "content": "A 6-week, double-masked, parallel-group study of the efficacy and safety of travoprost 0.004% compared with latanoprost 0:005%/timolol 0.5% in patients with primary open-angle glaucoma or ocular hypertension.\n\nOBJECTIVE: The objective of this study was to directly compare the intraocular pressure (IOP)-lowering efficacy and safety of travoprost 0.004% eyedrops with the fixed combination of latanoprost 0.005%/timolol 0.5% eyedrops in patients with primary open-angle glaucoma or ocular hypertension.\nMETHODS: This was a randomized, double-masked, multicenter, parallel-group, active-controlled study. Adult subjects with open-angle glaucoma (with or without pseudoexfoliation or pigment dispersion component) or ocular hypertension were eligible to participate if their IOP was inadequately controlled with > or =4 weeks of beta-blocker monotherapy, as indicated by IOP of 22 to 36 mm Hg at 9 AM at screening. Patients were randomly assigned in a 1:1 ratio to receive placebo + travoprost or latanoprost/timolol + placebo. Patients in the travoprost group administered travoprost at 9 PM and placebo at 9 AM; patients in the latanoprost/timolol group administered latanoprost/timolol at 9 AM and placebo at 9 PM. IOP measurements were performed using Goldmann applanation tonometry at 9 AM and 5 PM at the week-2 and week-6 visits. Both volunteered and elicited reports of adverse events were collected; all patients who were randomized and received > or =1 dose of study drug were included in the safety analysis.\nRESULTS: One hundred ten patients were randomized, of whom 106 patients were evaluable (travoprost, n = 50; latanoprost/timolol, n = 56). There were no statistically significant differences at baseline between the treatment groups, based on age group, sex, race, iris color, or diagnosis. Mean IOP values were not statistically different between groups at baseline or during treatment. In the pooled results for 9 Am assessment at weeks 2 and 6, mean (SEM) IOP reductions for travoprost and latanoprost/timolol were 7.0 (0.5) and 6.4 (0.5) mm Hg, respectively (P = NS). Adverse events related to therapy were mild in nature, and there were no statistically significant differences between the 2 treatment groups. The most frequently experienced adverse events in the travoprost group were ocular hyperemia (9.3%), foreign body sensation (5.6%), abnormal vision (1.9%), allergic reaction (1.9%), conjunctivitis (1.9%), dacryocystitis (1.9%), eye discharge (1.9%), eye pruritus (1.9%), lid edema (1.9%), lid erythema (1.9%), and tearing (1.9%). In the latanoprost/timolol group, the most frequently experienced adverse events were cataract (1.8%), dry eyes (1.8%), eye pruritus (1.8%), foreign body sensation (1.8%), and ocular hyperemia (1.8%).\nCONCLUSIONS: Mean IOP changes from baseline for travoprost 0.004% and latanoprost 0.005%/timolol 0.5% fixed combination were not significantly different at follow-up in these patients. Both medications were well tolerated."}], "ideal": "In the abstract, population demographics are defined by the following spans: 'in patients with primary open-angle glaucoma or ocular hypertension.', 'Adult subjects with open-angle glaucoma (with or without pseudoexfoliation or pigment dispersion component) or ocular hypertension', 'IOP was inadequately controlled with > or =4 weeks of beta-blocker monotherapy'"} {"input": [{"role": "system", "content": "I want to know how this abstract defines the Population Demographics. Please extract the sections in the abstract that define the demographics."}, {"role": "user", "content": "Comparison of the efficacy and safety of travoprost with a fixed-combination of dorzolamide and timolol in patients with open-angle glaucoma or ocular hypertension.\n\nPURPOSE: The purpose of this study was to compare travoprost (TRAV; travoprost 0.004%) and the fixed-combination of dorzolamide/timolol (DTFC; dorzolamide 2.0%/timolol maleate 0.5%) ophthalmic solutions for reducing intraocular pressure (IOP) in patients with primary open-angle glaucoma (OAG) or ocular hypertension (OHT).\nMETHODS: This was a randomized single masked, study with parallel controls. The TRAV group (n = 29) dosed once daily at 9:00 PM while the DTFC group (n = 27) dosed twice daily at 9:00 AM and 9:00 PM. IOP was measured at baseline, and following 3 weeks and 6 weeks of treatment at 8:00 AM, 12:00 PM, 4:00 PM, and 8:00 PM.\nRESULTS: Mean average IOP reductions from baseline during the course of the day were 7.5 (32.7%) and 7.1 (30.7%) mmHg for TRAV and 4.8 (23.1%) and 4.5 (21.7%) mmHg for DTFC at 3 weeks and 6 weeks, respectively. The greater IOP reduction for patients receiving TRAV was statistically significant at both the 3 and 6 week visits when averaged across all four time points (p < 0.01). The two products were well-tolerated over the course of the 6 week study. Some factors such as taste perversion were reported more often in the DTFC group.\nCONCLUSIONS: Travoprost monotherapy provided better efficacy in terms of IOP reduction and percentage of IOP reduction compared to dorzolamide 2.0%/timolol maleate 0.5% fixed combination."}], "ideal": "In the abstract, population demographics are defined by the following spans: 'in patients with primary open-angle glaucoma (OAG)', 'ocular hypertension (OHT)'"} {"input": [{"role": "system", "content": "What is the Population Demographic for the following abstract? Extract the text spans that define it."}, {"role": "user", "content": "Efficacy and safety of latanoprost versus travoprost in exfoliative glaucoma patients.\n\nOBJECTIVE: To evaluate 24-hour intraocular pressure (IOP) efficacy of latanoprost versus travoprost, each given every evening, in exfoliative glaucoma patients.\nDESIGN: Prospective, observer-masked, crossover comparison.\nPARTICIPANTS: Forty patients with exfoliation glaucoma.\nMETHODS: Patients with a pressure of >24 mmHg were randomized to latanoprost or travoprost for an 8-week treatment period after a 6-week medicine-free period. Patients were then switched to the opposite treatment for the second period. At untreated baseline and at the end of each treatment period the IOP was measured at 6 am, 10 am, 2 pm, 6 pm, 10 pm, and 2 am. MAIN OUTCOME MEASURE: Diurnal IOP.\nRESULTS: The mean 24-hour IOP was 25.1+/-2.5 mmHg at baseline, 17.8+/-2.1 mmHg on latanoprost, and 17.3+/-2.2 mmHg on travoprost (P = 0.001). Individual time points were similar between treatments, except at 6 pm when travoprost provided lower IOP (16.7+/-2.6 vs 17.9+/-2.5 mmHg, P<0.001). Adverse events showed more conjunctival hyperemia with travoprost (n = 15) than latanoprost (n = 6; P = 0.03).\nCONCLUSIONS: Latanoprost and travoprost both significantly reduce the 24-hour IOP from baseline in exfoliative glaucoma, but travoprost may demonstrate a greater hypotensive efficacy in the late afternoon."}], "ideal": "In the abstract, population demographics are defined by the following spans: 'Patients with a pressure of >24 mmHg', 'exfoliative glaucoma patients'"} {"input": [{"role": "system", "content": "What is the Population Demographic for the following abstract? Extract the text spans that define it."}, {"role": "user", "content": "Comparison of the ocular hypotensive effects of bimatoprost and timolol-dorzolamide combination in patients with elevated intraocular pressure: a 6-month study.\n\nPURPOSE: To compare the ocular hypotensive efficacy and safety of topical bimatoprost and timolol-dorzolamide combination in patients with primary open-angle glaucoma (POAG) or ocular hypertension during 6 months of treatment.\nMETHODS: A sample of 65 patients with a diagnosis of POAG or ocular hypertension were randomized to receive either bimatoprost 0.03% once daily or timolol-dorzolamide combination twice daily. Study visits occurred at baseline and after 2 weeks and 1, 3 and 6 months of therapy. Intraocular pressure (IOP) measurements were performed at 12.00 hours at all study visits and also at 08.00 hours and 16.00 hours at baseline and 6-month visits. At each visit, local and systemic side-effects that occurred during the treatment period were recorded. Student's t-test was used to compare the differences between IOP values.\nRESULTS: Differences in IOP between the bimatoprost and timolol-dorzolamide groups were statistically insignificant at all study visits (p > 0.05). In the bimatoprost-treated group, the IOP reduction was 6.2 +/- 1.8 mmHg, whereas it was 6.5 +/- 2.3 mmHg in the timolol-dorzolamide group after 6 months of treatment. The difference was not statistically significant (p = 0.48).\nCONCLUSIONS: The IOP-lowering efficacies of bimatoprost and timolol-dorzolamide combination were similar over a 6-month follow-up. Both bimatoprost and the timolol-dorzolamide combination were well tolerated. Bimatoprost can be used as a longterm monotherapy agent in the treatment of POAG and ocular hypertension."}], "ideal": "In the abstract, population demographics are defined by the following spans: 'patients with primary open-angle glaucoma (POAG) or ocular hypertension'"} {"input": [{"role": "system", "content": "What is the Population Demographic for the following abstract? Extract the text spans that define it."}, {"role": "user", "content": "Comparing the fixed combination brimonidine-timolol versus fixed combination dorzolamide-timolol in patients with elevated intraocular pressure.\n\nPURPOSE: To evaluate the efficacy of fixed combination brimonidine-timolol (FCBT) versus fixed combination dorzolamide-timolol (FCDT) given twice daily in patients with primary open angle glaucoma (POAG) or ocular hypertension (OH).\nDESIGN: Prospective, multicentre, masked-observer, crossover comparison.\nPARTICIPANTS: Sixteen patients with POAG and 14 with OH.\nMETHODS: The participants of the study were washed out from their previous medication and randomized to fixed FCBT or FCDT for the first 4-week treatment period. Subjects then were washed for 4 weeks and started on the opposite medication for the second 4-week period. Intraocular pressure (IOP) was measured with a Goldmann applanation tonometer at 8:00 a.m., 12:00 noon and 4:00 p.m. at each baseline and at the end of each treatment period. Unsolicited ocular adverse events were also recorded. MAIN OUTCOME MEASURES: Comparison of the IOP lowering effect of FCBT and FCDT.\nRESULTS: The baseline mean diurnal IOP for all 30 subjects (30 eyes) was 22.9 +/- 1.6 mmHg. Both fixed combinations significantly reduced IOP compared with baseline (p < 0.00001). The mean diurnal IOP following 4 weeks of therapy was 15.0 +/- 2.1 mmHg for FCBT and 15.4 +/- 2.1 mmHg for FCDT (p = 0.510). The mean diurnal IOP reduction was 7.8 +/- 1.9 mmHg for FCBT and 7.4 +/- 1.8 mmHg for FCDT (p = 0.430). Overall, 14 subjects complained about ocular adverse events: two only for FCBT, seven only for FCDT and five for both drugs. Although there was no significant difference between the number of subjects that reported ocular adverse events with FCBT (n = 7) and FCDT (n = 12) (p = 0.359), FCDT caused more ocular stinging upon instillation (n = 9) than FCBT (n = 1) (p = 0.027).\nCONCLUSION: This study suggests that FCBT and FCDT, each given twice daily, have similar efficacy in patients with POAG or OH."}], "ideal": "In the abstract, population demographics are defined by the following spans: 'patients with primary open angle glaucoma (POAG) or ocular hypertension (OH)', 'patients with POAG', 'OH'"} {"input": [{"role": "system", "content": "I want to know how this abstract defines the Population Demographics. Please extract the sections in the abstract that define the demographics."}, {"role": "user", "content": "A comparison of the safety and intraocular pressure lowering of bimatoprost/timolol fixed combination versus latanoprost/timolol fixed combination in patients with open-angle glaucoma.\n\nPURPOSE: To compare the efficacy and tolerability of a once daily evening dose of the latanoprost/timolol fixed combination (LTFC) with that of a once-daily evening dose of the bimatoprost/timolol fixed combination (BTFC) in patients with open-angle glaucoma with elevated intraocular pressure (IOP) insufficiently responsive to monotherapy with prostaglandin analogues/prostamides.\nDESIGN: Prospective, randomized, evaluator masked, single-center study.\nPARTICIPANTS: 36 patients with a diagnosis of open-angle glaucoma, with or without pseudoexfoliation, and inadequate control of IOP, insufficiently responsive to monotherapy with prostaglandin analogues/prostamides. MAIN OUTCOME MEASURE: The primary end-points were the change in IOP at 9:00 am from baseline to week 4, and the difference between treatment groups in the mean diurnal IOP reduction from baseline to week 4.\nRESULTS: BTFC provided significantly greater mean diurnal IOP reduction [mean (standard deviation)] 2.8 (0.9) mmHg, compared with LTFC 2.1 (0.6) mmHg, p = 0.0214. Both treatments significantly reduced the IOP from baseline at each IOP time-point measured, p < 0.0001, and for the mean diurnal IOP; p = 0.0049 for the LTFC, and p < 0.0001 for the BTFC. There were no significant differences in average hyperemia scores among groups, 1.25 (0.5) vs. 1.62 (0.69), p = 0.3835, for the LTFC and the BTFC, respectively.\nCONCLUSIONS: The results of this study showed a significantly higher IOP-lowering effect of a once-daily evening dose of the BTFC compared to that of a once-daily evening administration of the LTFC."}], "ideal": "In the abstract, population demographics are defined by the following spans: 'patients with open-angle glaucoma with elevated intraocular pressure (IOP) insufficiently responsive to monotherapy with prostaglandin analogues/prostamides'"} ``` </details>
Linmj-Judy
pushed a commit
to TablewareBox/evals
that referenced
this pull request
Feb 27, 2024
…i#1124) # Thank you for contributing an eval!♥️ 🚨 Please make sure your PR follows these guidelines, **failure to follow the guidelines below will result in the PR being closed automatically**. Note that even if the criteria are met, that does not guarantee the PR will be merged nor GPT-4 access be granted. 🚨 **PLEASE READ THIS**: In order for a PR to be merged, it must fail on GPT-4. We are aware that right now, users do not have access, so you will not be able to tell if the eval fails or not. Please run your eval with GPT-3.5-Turbo, but keep in mind as we run the eval, if GPT-4 gets higher than 90% on the eval, we will likely reject it since GPT-4 is already capable of completing the task. We plan to roll out a way for users submitting evals to see the eval performance on GPT-4 soon. Stay tuned! Until then, you will not be able to see the eval performance on GPT-4. **Starting April 10, the minimum eval count is 15 samples, we hope this makes it easier to create and contribute evals.** Also, please note that we're using **Git LFS** for storing the JSON files, so please make sure that you move the JSON file to Git LFS before submitting a PR. Details on how to use Git LFS are available [here](https://git-lfs.com). ## Eval details 📑 ### Eval name The <eval_name> is **population_span_extraction** ID is **population_span_extraction.dev.v0** ### Eval description The model is shown abstracts of clinical drug trials and tasked with extracting the text spans that specify the population demographic of the shown abstract. The population demographic can be but is not necessarily specified in multiple seperate spans. A previous version included examples containing 'problem' as part of the population (as per PICO criteria labeling) as opposed to strictly population demographics. We are now resubmitting a different version, with different abstracts, which contains only demographics annotations. ### What makes this a useful eval? The Repository specifically asks for "Real-world use cases". Extracting population spans from clinical study trials is immensly useful to researchers who have to go over and compare large amounts of clinical drug trials. The eval dataset is generated with multiple different prompts and statisfies all further critera posed by Open AI. ## Criteria for a good eval ✅ Below are some of the criteria we look for in a good eval. In general, we are seeking cases where the model does not do a good job despite being capable of generating a good response (note that there are some things large language models cannot do, so those would not make good evals). Your eval should be: - [x] Thematically consistent: The eval should be thematically consistent. We'd like to see a number of prompts all demonstrating some particular failure mode. For example, we can create an eval on cases where the model fails to reason about the physical world. - [x] Contains failures where a human can do the task, but either GPT-4 or GPT-3.5-Turbo could not. - [x] Includes good signal around what is the right behavior. This means either a correct answer for `Basic` evals or the `Fact` Model-graded eval, or an exhaustive rubric for evaluating answers for the `Criteria` Model-graded eval. - [x] **Include at least 15 high-quality examples.** If there is anything else that makes your eval worth including, please document it below. ### Unique eval value > Insert what makes your eval high quality that was not mentioned above. (Not required) ## Eval structure 🏗️ Your eval should - [x] Check that your data is in `evals/registry/data/{name}` - [x] Check that your YAML is registered at `evals/registry/evals/{name}.yaml` - [x] Ensure you have the right to use the data you submit via this eval (For now, we will only be approving evals that use one of the existing eval classes. You may still write custom eval classes for your own cases, and we may consider merging them in the future.) ## Final checklist 👀 ### Submission agreement By contributing to Evals, you are agreeing to make your evaluation logic and data under the same MIT license as this repository. You must have adequate rights to upload any data used in an Eval. OpenAI reserves the right to use this data in future service improvements to our product. Contributions to OpenAI Evals will be subject to our usual Usage Policies (<https://platform.openai.com/docs/usage-policies>). - [x] I agree that my submission will be made available under an MIT license and complies with OpenAI's usage policies. ### Email address validation If your submission is accepted, we will be granting GPT-4 access to a limited number of contributors. Access will be given to the email address associated with the commits on the merged pull request. - [x] I acknowledge that GPT-4 access will only be granted, if applicable, to the email address used for my merged pull request. ### Limited availability acknowledgment We know that you might be excited to contribute to OpenAI's mission, help improve our models, and gain access to GPT-4. However, due to the requirements mentioned above and the high volume of submissions, we will not be able to accept all submissions and thus not grant everyone who opens a PR GPT-4 access. We know this is disappointing, but we hope to set the right expectation before you open this PR. - [x] I understand that opening a PR, even if it meets the requirements above, does not guarantee the PR will be merged nor GPT-4 access be granted. ### Submit eval - [x] I have filled out all required fields of this form - [x] I have used **Git LFS** for the Eval JSON data - [ ] (Ignore if not submitting code) I have run `pip install pre-commit; pre-commit install` and have verified that `black`, `isort`, and `autoflake` are running when I commit and push Failure to fill out all required fields will result in the PR being closed. ### Eval JSON data Since we are using Git LFS, we are asking eval submitters to add in as many Eval Samples (at least 5) from their contribution here: <details> <summary>View evals in JSON</summary> ### Eval ``` {"input": [{"role": "system", "content": "I want to know how this abstract defines the Population Demographics. Please extract the sections in the abstract that define the demographics."}, {"role": "user", "content": "Efficacy of the dorzolamide/timolol fixed combination versus latanoprost in the treatment of ocular hypertension or glaucoma: combined analysis of pooled data from two large randomized observer and patient-masked studies.\n\nIn previous analyses of primary efficacy data from two randomized clinical trials, standard dosing regimens of the dorzolamide/timolol fixed combination (COSOPT) and latanoprost (XALATAN) were shown to have equivalent efficacy with regard to reduction in mean daytime diurnal intraocular pressure (IOP). We performed additional post hoc analyses of pooled data from these studies to compare further the efficacy of the two treatments. The studies used identical 3-month, parallel group, randomized, observer-masked and patient-masked, multicenter designs. Patients with a baseline IOP > or = 24 mm Hg were randomized to either the 2% dorzolamide/0.5% timolol combination eye drops twice daily (n = 273) or 0.005% latanoprost eye drops once daily (n = 271). The IOP measurements were made at 8 AM, 10 AM, 2 PM, and 4 PM at the baseline visit and then on each of the 3 monthly assessment days. The following measures were analyzed on a post hoc basis: 1) percentages of patients meeting target levels of IOP reduction; 2) mean IOP reduction in those patients with high IOP (> or =30 mmHg) at baseline; 3) mean IOP at each of the assessment time points during a day. A total of 259 patients in the dorzolamide/timolol group and 268 patients in the latanoprost group were included in the efficacy analysis. At 3 months, both treatments showed similar efficacy with regard to the percentages of patients who achieved target levels of IOP reduction (e.g., 40% IOP reduction in 15% of dorzolamide/timolol combination patients and 13% of latanoprost patients), mean IOP reduction in those patients with high IOP at baseline (dorzolamide/ timolol combination, 12.5 mmHg, latanoprost, 12.6 mmHg), and mean IOP at each time point during the day. By the measures used in this analysis, the dorzolamide/timolol combination and latanoprost were equally effective at lowering IOP in patients with ocular hypertension or glaucoma."}], "ideal": "In the abstract, population demographics are defined by the following spans: 'Patients with a baseline IOP > or = 24 mm Hg '"} {"input": [{"role": "system", "content": "Extract the text spans containing information on the Population Demographic from the following abstract."}, {"role": "user", "content": "Twenty-four-hour control with latanoprost-timolol-fixed combination therapy vs latanoprost therapy.\n\nOBJECTIVE: To evaluate the 24-hour efficacy and safety of the latanoprost-timolol maleate-fixed combination vs latanoprost therapy in patients with primary open-angle glaucoma.\nMETHODS: A prospective, observer-masked, crossover, active-controlled, randomized comparison in which after a 6-week medicine-free period, patients were randomized to either latanoprost-timolol-fixed combination therapy or latanoprost therapy, both dosed once each evening, alone for 8 weeks. Patients were then switched to the opposite treatment for 8 weeks. At the end of the washout and treatment periods, a 24-hour diurnal curve was performed.\nRESULTS: The baseline untreated mean +/- SD diurnal curve in 37 patients who completed the study was 24.2 +/- 2.0 mm Hg. The mean diurnal curve was 19.2 +/- 2.6 mm Hg for those who received latanoprost therapy alone and 16.7 +/- 2.1 mm Hg for those who received the fixed combination therapy (P<.001). The fixed combination therapy also provided a lower absolute intraocular pressure level (1.5-2.9 mm Hg, P<.001) and a greater intraocular pressure reduction from the untreated baseline (P<.001). Stinging was statistically lower with latanoprost therapy alone (P = .04), but itching was statistically increased compared with the fixed combination therapy (P = .04).\nCONCLUSION: The result of this study suggests that the latanoprost-timolol-fixed combination compared with latanoprost therapy alone provides improved intraocular pressure reduction over the 24-hour diurnal curve and for each individual time point in patients with primary open-angle glaucoma."}], "ideal": "In the abstract, population demographics are defined by the following spans: ' patients with primary open-angle glaucoma.'"} {"input": [{"role": "system", "content": "Extract the text spans containing information on the Population Demographic from the following abstract."}, {"role": "user", "content": "A 12-week, randomized, double-masked, multicenter study of the fixed combination of latanoprost and timolol in the evening versus the individual components.\n\nPURPOSE: To compare the efficacy and tolerability of fixed-combination latanoprost and timolol applied in the evening with the concomitant use of the individual components.\nDESIGN: Twelve-week, randomized, double-masked, multicenter study.\nPARTICIPANTS: Five hundred seventeen randomized patients with ocular hypertension; open-angle, pigmentary, or exfoliation glaucoma; and baseline (after washout) intraocular pressure (IOP) levels between 23 and 33 mmHg.\nMETHODS: Patients received either the fixed combination of latanoprost and timolol once daily in the evening and a placebo in the morning and evening or the unfixed combination of latanoprost once daily in the evening and timolol in the morning and evening. Study visits were at weeks 2, 6, and 12. MAIN OUTCOME MEASURES: The primary efficacy end point was mean change from baseline to week 12 in diurnal IOP (mean IOPs of 8 am, 12 pm, and 4 pm). The fixed combination was considered noninferior to the unfixed combination if the upper limit of the 95% confidence interval (CI) of the difference was <1.5 mmHg (analysis of covariance). Adverse events were recorded at each visit.\nRESULTS: In all, 502 patients were included in intent-to-treat analyses (fixed combination, n = 255; unfixed combination, n = 247). For the fixed- and unfixed-combination groups, mean baseline diurnal IOP levels were 25.4 mmHg and 25.2 mmHg, respectively, and mean diurnal IOP reductions were 8.7 mmHg and 9.0 mmHg (between-treatment difference, 0.3 mmHg; 95% CI, -0.1 to 0.7 mmHg; P = 0.15). Both treatments were well tolerated.\nCONCLUSIONS: The fixed combination of latanoprost and timolol administered once daily in the evening is not inferior to the unfixed combination of latanoprost once daily in the evening and timolol twice daily. The fixed combination provides an effective and well-tolerated alternative to multiple instillations."}], "ideal": "In the abstract, population demographics are defined by the following spans: 'patients with ocular hypertension; open-angle, pigmentary, or exfoliation glaucoma; and baseline (after washout) intraocular pressure (IOP) levels between 23 and 33 mmHg'"} {"input": [{"role": "system", "content": "This is from a clinical drug trial abstract. Extract the parts specifying population demographics."}, {"role": "user", "content": "Efficacy of latanoprost or fixed-combination latanoprost-timolol in patients switched from a combination of timolol and a nonprostaglandin medication.\n\nPURPOSE: To compare latanoprost with the fixed-combination latanoprost-timolol in glaucoma or ocular hypertension patients switched from a combination glaucoma therapy with timolol and another nonprostaglandin medication.\nDESIGN: Prospective randomized clinical trial.\nMETHODS: Glaucoma or ocular hypertension patients receiving a combined treatment of timolol 0.5% and another nonprostaglandin medication (pilocarpine 4%, alpha-agonist, or a topical carbonic anhydrase inhibitor) underwent a 30-day washout of their medications. A masked observer then measured their intraocular pressure (IOP). The subjects were randomized to either latanoprost or fixed-combination latanoprost-timolol eyedrops to use once daily at 7 am. The IOP was measured again 30 days after the patients started using one of the study drugs by the same examiner at the same time. MAIN OUTCOME MEASURE: Comparison of the study medications' hypotensive effect.\nRESULTS: Fifty-three eyes (28 in the latanoprost group and 25 in the latanoprost-timolol group) from 28 patients were included in the study. The IOP reduction was greater in both study groups compared with the previous combination therapy with timolol and another nonprostaglandin medication in millimeters of mercury (7.7+/-2.3 vs. 5.5+/-2.3, P<0.001, for the latanoprost group; 8.5+/-3.5 vs. 6.3+/-2.7, P<0.001, for the latanoprost-timolol group) and percentage (35.8+/-8.2% vs. 25.6+/-8.9%, P<0.001, for the latanoprost group; 38.6+/-8.7% vs. 28.6+/-9.0%, P<0.001, for the latanoprost-timolol group). There was no statistical difference between latanoprost and fixed-combination latanoprost-timolol in reducing IOP, in either millimeters of mercury (P = 0.3) or percentage (P = 0.2).\nCONCLUSIONS: Both latanoprost and fixed-combination latanoprost-timolol may be viable substitutes for timolol and another nonprostaglandin medication in glaucoma or ocular hypertension patients."}], "ideal": "In the abstract, population demographics are defined by the following spans: 'Glaucoma or ocular hypertension patients receiving a combined treatment of timolol 0.5% and another nonprostaglandin medication (pilocarpine 4%, alpha-agonist, or a topical carbonic anhydrase inhibitor)'"} {"input": [{"role": "system", "content": "The Following text is an abstract of a clinical drug trial that specifies a population demographic. I want you to extract the text spans that contain these informations."}, {"role": "user", "content": "A 6-week, double-masked, parallel-group study of the efficacy and safety of travoprost 0.004% compared with latanoprost 0:005%/timolol 0.5% in patients with primary open-angle glaucoma or ocular hypertension.\n\nOBJECTIVE: The objective of this study was to directly compare the intraocular pressure (IOP)-lowering efficacy and safety of travoprost 0.004% eyedrops with the fixed combination of latanoprost 0.005%/timolol 0.5% eyedrops in patients with primary open-angle glaucoma or ocular hypertension.\nMETHODS: This was a randomized, double-masked, multicenter, parallel-group, active-controlled study. Adult subjects with open-angle glaucoma (with or without pseudoexfoliation or pigment dispersion component) or ocular hypertension were eligible to participate if their IOP was inadequately controlled with > or =4 weeks of beta-blocker monotherapy, as indicated by IOP of 22 to 36 mm Hg at 9 AM at screening. Patients were randomly assigned in a 1:1 ratio to receive placebo + travoprost or latanoprost/timolol + placebo. Patients in the travoprost group administered travoprost at 9 PM and placebo at 9 AM; patients in the latanoprost/timolol group administered latanoprost/timolol at 9 AM and placebo at 9 PM. IOP measurements were performed using Goldmann applanation tonometry at 9 AM and 5 PM at the week-2 and week-6 visits. Both volunteered and elicited reports of adverse events were collected; all patients who were randomized and received > or =1 dose of study drug were included in the safety analysis.\nRESULTS: One hundred ten patients were randomized, of whom 106 patients were evaluable (travoprost, n = 50; latanoprost/timolol, n = 56). There were no statistically significant differences at baseline between the treatment groups, based on age group, sex, race, iris color, or diagnosis. Mean IOP values were not statistically different between groups at baseline or during treatment. In the pooled results for 9 Am assessment at weeks 2 and 6, mean (SEM) IOP reductions for travoprost and latanoprost/timolol were 7.0 (0.5) and 6.4 (0.5) mm Hg, respectively (P = NS). Adverse events related to therapy were mild in nature, and there were no statistically significant differences between the 2 treatment groups. The most frequently experienced adverse events in the travoprost group were ocular hyperemia (9.3%), foreign body sensation (5.6%), abnormal vision (1.9%), allergic reaction (1.9%), conjunctivitis (1.9%), dacryocystitis (1.9%), eye discharge (1.9%), eye pruritus (1.9%), lid edema (1.9%), lid erythema (1.9%), and tearing (1.9%). In the latanoprost/timolol group, the most frequently experienced adverse events were cataract (1.8%), dry eyes (1.8%), eye pruritus (1.8%), foreign body sensation (1.8%), and ocular hyperemia (1.8%).\nCONCLUSIONS: Mean IOP changes from baseline for travoprost 0.004% and latanoprost 0.005%/timolol 0.5% fixed combination were not significantly different at follow-up in these patients. Both medications were well tolerated."}], "ideal": "In the abstract, population demographics are defined by the following spans: 'in patients with primary open-angle glaucoma or ocular hypertension.', 'Adult subjects with open-angle glaucoma (with or without pseudoexfoliation or pigment dispersion component) or ocular hypertension', 'IOP was inadequately controlled with > or =4 weeks of beta-blocker monotherapy'"} {"input": [{"role": "system", "content": "I want to know how this abstract defines the Population Demographics. Please extract the sections in the abstract that define the demographics."}, {"role": "user", "content": "Comparison of the efficacy and safety of travoprost with a fixed-combination of dorzolamide and timolol in patients with open-angle glaucoma or ocular hypertension.\n\nPURPOSE: The purpose of this study was to compare travoprost (TRAV; travoprost 0.004%) and the fixed-combination of dorzolamide/timolol (DTFC; dorzolamide 2.0%/timolol maleate 0.5%) ophthalmic solutions for reducing intraocular pressure (IOP) in patients with primary open-angle glaucoma (OAG) or ocular hypertension (OHT).\nMETHODS: This was a randomized single masked, study with parallel controls. The TRAV group (n = 29) dosed once daily at 9:00 PM while the DTFC group (n = 27) dosed twice daily at 9:00 AM and 9:00 PM. IOP was measured at baseline, and following 3 weeks and 6 weeks of treatment at 8:00 AM, 12:00 PM, 4:00 PM, and 8:00 PM.\nRESULTS: Mean average IOP reductions from baseline during the course of the day were 7.5 (32.7%) and 7.1 (30.7%) mmHg for TRAV and 4.8 (23.1%) and 4.5 (21.7%) mmHg for DTFC at 3 weeks and 6 weeks, respectively. The greater IOP reduction for patients receiving TRAV was statistically significant at both the 3 and 6 week visits when averaged across all four time points (p < 0.01). The two products were well-tolerated over the course of the 6 week study. Some factors such as taste perversion were reported more often in the DTFC group.\nCONCLUSIONS: Travoprost monotherapy provided better efficacy in terms of IOP reduction and percentage of IOP reduction compared to dorzolamide 2.0%/timolol maleate 0.5% fixed combination."}], "ideal": "In the abstract, population demographics are defined by the following spans: 'in patients with primary open-angle glaucoma (OAG)', 'ocular hypertension (OHT)'"} {"input": [{"role": "system", "content": "What is the Population Demographic for the following abstract? Extract the text spans that define it."}, {"role": "user", "content": "Efficacy and safety of latanoprost versus travoprost in exfoliative glaucoma patients.\n\nOBJECTIVE: To evaluate 24-hour intraocular pressure (IOP) efficacy of latanoprost versus travoprost, each given every evening, in exfoliative glaucoma patients.\nDESIGN: Prospective, observer-masked, crossover comparison.\nPARTICIPANTS: Forty patients with exfoliation glaucoma.\nMETHODS: Patients with a pressure of >24 mmHg were randomized to latanoprost or travoprost for an 8-week treatment period after a 6-week medicine-free period. Patients were then switched to the opposite treatment for the second period. At untreated baseline and at the end of each treatment period the IOP was measured at 6 am, 10 am, 2 pm, 6 pm, 10 pm, and 2 am. MAIN OUTCOME MEASURE: Diurnal IOP.\nRESULTS: The mean 24-hour IOP was 25.1+/-2.5 mmHg at baseline, 17.8+/-2.1 mmHg on latanoprost, and 17.3+/-2.2 mmHg on travoprost (P = 0.001). Individual time points were similar between treatments, except at 6 pm when travoprost provided lower IOP (16.7+/-2.6 vs 17.9+/-2.5 mmHg, P<0.001). Adverse events showed more conjunctival hyperemia with travoprost (n = 15) than latanoprost (n = 6; P = 0.03).\nCONCLUSIONS: Latanoprost and travoprost both significantly reduce the 24-hour IOP from baseline in exfoliative glaucoma, but travoprost may demonstrate a greater hypotensive efficacy in the late afternoon."}], "ideal": "In the abstract, population demographics are defined by the following spans: 'Patients with a pressure of >24 mmHg', 'exfoliative glaucoma patients'"} {"input": [{"role": "system", "content": "What is the Population Demographic for the following abstract? Extract the text spans that define it."}, {"role": "user", "content": "Comparison of the ocular hypotensive effects of bimatoprost and timolol-dorzolamide combination in patients with elevated intraocular pressure: a 6-month study.\n\nPURPOSE: To compare the ocular hypotensive efficacy and safety of topical bimatoprost and timolol-dorzolamide combination in patients with primary open-angle glaucoma (POAG) or ocular hypertension during 6 months of treatment.\nMETHODS: A sample of 65 patients with a diagnosis of POAG or ocular hypertension were randomized to receive either bimatoprost 0.03% once daily or timolol-dorzolamide combination twice daily. Study visits occurred at baseline and after 2 weeks and 1, 3 and 6 months of therapy. Intraocular pressure (IOP) measurements were performed at 12.00 hours at all study visits and also at 08.00 hours and 16.00 hours at baseline and 6-month visits. At each visit, local and systemic side-effects that occurred during the treatment period were recorded. Student's t-test was used to compare the differences between IOP values.\nRESULTS: Differences in IOP between the bimatoprost and timolol-dorzolamide groups were statistically insignificant at all study visits (p > 0.05). In the bimatoprost-treated group, the IOP reduction was 6.2 +/- 1.8 mmHg, whereas it was 6.5 +/- 2.3 mmHg in the timolol-dorzolamide group after 6 months of treatment. The difference was not statistically significant (p = 0.48).\nCONCLUSIONS: The IOP-lowering efficacies of bimatoprost and timolol-dorzolamide combination were similar over a 6-month follow-up. Both bimatoprost and the timolol-dorzolamide combination were well tolerated. Bimatoprost can be used as a longterm monotherapy agent in the treatment of POAG and ocular hypertension."}], "ideal": "In the abstract, population demographics are defined by the following spans: 'patients with primary open-angle glaucoma (POAG) or ocular hypertension'"} {"input": [{"role": "system", "content": "What is the Population Demographic for the following abstract? Extract the text spans that define it."}, {"role": "user", "content": "Comparing the fixed combination brimonidine-timolol versus fixed combination dorzolamide-timolol in patients with elevated intraocular pressure.\n\nPURPOSE: To evaluate the efficacy of fixed combination brimonidine-timolol (FCBT) versus fixed combination dorzolamide-timolol (FCDT) given twice daily in patients with primary open angle glaucoma (POAG) or ocular hypertension (OH).\nDESIGN: Prospective, multicentre, masked-observer, crossover comparison.\nPARTICIPANTS: Sixteen patients with POAG and 14 with OH.\nMETHODS: The participants of the study were washed out from their previous medication and randomized to fixed FCBT or FCDT for the first 4-week treatment period. Subjects then were washed for 4 weeks and started on the opposite medication for the second 4-week period. Intraocular pressure (IOP) was measured with a Goldmann applanation tonometer at 8:00 a.m., 12:00 noon and 4:00 p.m. at each baseline and at the end of each treatment period. Unsolicited ocular adverse events were also recorded. MAIN OUTCOME MEASURES: Comparison of the IOP lowering effect of FCBT and FCDT.\nRESULTS: The baseline mean diurnal IOP for all 30 subjects (30 eyes) was 22.9 +/- 1.6 mmHg. Both fixed combinations significantly reduced IOP compared with baseline (p < 0.00001). The mean diurnal IOP following 4 weeks of therapy was 15.0 +/- 2.1 mmHg for FCBT and 15.4 +/- 2.1 mmHg for FCDT (p = 0.510). The mean diurnal IOP reduction was 7.8 +/- 1.9 mmHg for FCBT and 7.4 +/- 1.8 mmHg for FCDT (p = 0.430). Overall, 14 subjects complained about ocular adverse events: two only for FCBT, seven only for FCDT and five for both drugs. Although there was no significant difference between the number of subjects that reported ocular adverse events with FCBT (n = 7) and FCDT (n = 12) (p = 0.359), FCDT caused more ocular stinging upon instillation (n = 9) than FCBT (n = 1) (p = 0.027).\nCONCLUSION: This study suggests that FCBT and FCDT, each given twice daily, have similar efficacy in patients with POAG or OH."}], "ideal": "In the abstract, population demographics are defined by the following spans: 'patients with primary open angle glaucoma (POAG) or ocular hypertension (OH)', 'patients with POAG', 'OH'"} {"input": [{"role": "system", "content": "I want to know how this abstract defines the Population Demographics. Please extract the sections in the abstract that define the demographics."}, {"role": "user", "content": "A comparison of the safety and intraocular pressure lowering of bimatoprost/timolol fixed combination versus latanoprost/timolol fixed combination in patients with open-angle glaucoma.\n\nPURPOSE: To compare the efficacy and tolerability of a once daily evening dose of the latanoprost/timolol fixed combination (LTFC) with that of a once-daily evening dose of the bimatoprost/timolol fixed combination (BTFC) in patients with open-angle glaucoma with elevated intraocular pressure (IOP) insufficiently responsive to monotherapy with prostaglandin analogues/prostamides.\nDESIGN: Prospective, randomized, evaluator masked, single-center study.\nPARTICIPANTS: 36 patients with a diagnosis of open-angle glaucoma, with or without pseudoexfoliation, and inadequate control of IOP, insufficiently responsive to monotherapy with prostaglandin analogues/prostamides. MAIN OUTCOME MEASURE: The primary end-points were the change in IOP at 9:00 am from baseline to week 4, and the difference between treatment groups in the mean diurnal IOP reduction from baseline to week 4.\nRESULTS: BTFC provided significantly greater mean diurnal IOP reduction [mean (standard deviation)] 2.8 (0.9) mmHg, compared with LTFC 2.1 (0.6) mmHg, p = 0.0214. Both treatments significantly reduced the IOP from baseline at each IOP time-point measured, p < 0.0001, and for the mean diurnal IOP; p = 0.0049 for the LTFC, and p < 0.0001 for the BTFC. There were no significant differences in average hyperemia scores among groups, 1.25 (0.5) vs. 1.62 (0.69), p = 0.3835, for the LTFC and the BTFC, respectively.\nCONCLUSIONS: The results of this study showed a significantly higher IOP-lowering effect of a once-daily evening dose of the BTFC compared to that of a once-daily evening administration of the LTFC."}], "ideal": "In the abstract, population demographics are defined by the following spans: 'patients with open-angle glaucoma with elevated intraocular pressure (IOP) insufficiently responsive to monotherapy with prostaglandin analogues/prostamides'"} ``` </details>
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Thank you for contributing an eval!♥️
🚨 Please make sure your PR follows these guidelines, failure to follow the guidelines below will result in the PR being closed automatically. Note that even if the criteria are met, that does not guarantee the PR will be merged nor GPT-4 access be granted. 🚨
PLEASE READ THIS:
In order for a PR to be merged, it must fail on GPT-4. We are aware that right now, users do not have access, so you will not be able to tell if the eval fails or not. Please run your eval with GPT-3.5-Turbo, but keep in mind as we run the eval, if GPT-4 gets higher than 90% on the eval, we will likely reject it since GPT-4 is already capable of completing the task.
We plan to roll out a way for users submitting evals to see the eval performance on GPT-4 soon. Stay tuned! Until then, you will not be able to see the eval performance on GPT-4. Starting April 10, the minimum eval count is 15 samples, we hope this makes it easier to create and contribute evals.
Also, please note that we're using Git LFS for storing the JSON files, so please make sure that you move the JSON file to Git LFS before submitting a PR. Details on how to use Git LFS are available here.
Eval details 📑
Eval name
The <eval_name> is population_span_extraction
ID is population_span_extraction.dev.v0
Eval description
The model is shown abstracts of clinical drug trials and tasked with extracting the text spans that specify the population demographic of the shown abstract. The population demographic can be but is not necessarily specified in multiple seperate spans.
A previous version included examples containing 'problem' as part of the population (as per PICO criteria labeling) as opposed to strictly population demographics.
We are now resubmitting a different version, with different abstracts, which contains only demographics annotations.
What makes this a useful eval?
The Repository specifically asks for "Real-world use cases". Extracting population spans from clinical study trials is immensly useful to researchers who have to go over and compare large amounts of clinical drug trials.
The eval dataset is generated with multiple different prompts and statisfies all further critera posed by Open AI.
Criteria for a good eval ✅
Below are some of the criteria we look for in a good eval. In general, we are seeking cases where the model does not do a good job despite being capable of generating a good response (note that there are some things large language models cannot do, so those would not make good evals).
Your eval should be:
Basicevals or theFactModel-graded eval, or an exhaustive rubric for evaluating answers for theCriteriaModel-graded eval.If there is anything else that makes your eval worth including, please document it below.
Unique eval value
Eval structure 🏗️
Your eval should
evals/registry/data/{name}evals/registry/evals/{name}.yaml(For now, we will only be approving evals that use one of the existing eval classes. You may still write custom eval classes for your own cases, and we may consider merging them in the future.)
Final checklist 👀
Submission agreement
By contributing to Evals, you are agreeing to make your evaluation logic and data under the same MIT license as this repository. You must have adequate rights to upload any data used in an Eval. OpenAI reserves the right to use this data in future service improvements to our product. Contributions to OpenAI Evals will be subject to our usual Usage Policies (https://platform.openai.com/docs/usage-policies).
Email address validation
If your submission is accepted, we will be granting GPT-4 access to a limited number of contributors. Access will be given to the email address associated with the commits on the merged pull request.
Limited availability acknowledgment
We know that you might be excited to contribute to OpenAI's mission, help improve our models, and gain access to GPT-4. However, due to the requirements mentioned above and the high volume of submissions, we will not be able to accept all submissions and thus not grant everyone who opens a PR GPT-4 access. We know this is disappointing, but we hope to set the right expectation before you open this PR.
Submit eval
pip install pre-commit; pre-commit installand have verified thatblack,isort, andautoflakeare running when I commit and pushFailure to fill out all required fields will result in the PR being closed.
Eval JSON data
Since we are using Git LFS, we are asking eval submitters to add in as many Eval Samples (at least 5) from their contribution here:
View evals in JSON
Eval