Commit f077b82

14H034160212, qiming bao, and jorge-openai authored
A Larger Deep Multi-Step Deductive Reasoning Dataset over Natural Language with Multi-Step Deductive Reasoning Instruction For OpenAI EVAL (#651)
# Thank you for contributing an eval! ♥️

🚨 Please make sure your PR follows these guidelines, __failure to follow the guidelines below will result in the PR being closed automatically__. Note that even if the criteria are met, that does not guarantee the PR will be merged nor GPT-4 access granted. 🚨

__PLEASE READ THIS__:

In order for a PR to be merged, it must fail on GPT-4. We are aware that right now, users do not have access, so you will not be able to tell if the eval fails or not. Please run your eval with GPT-3.5-Turbo, but keep in mind as we run the eval, if GPT-4 gets higher than 90% on the eval, we will likely reject since GPT-4 is already capable of completing the task.

We plan to roll out a way for users submitting evals to see the eval performance on GPT-4 soon. Stay tuned! Until then, you will not be able to see the eval performance on GPT-4. **Starting April 10, the minimum eval count is 15 samples, we hope this makes it easier to create and contribute evals.**

## Eval details 📑

### Eval name

[pararule-plus-multi-step-deductive-reasoning]

### Eval description

[We propose a multi-step deductive reasoning instruction for the [PARARULE-Plus dataset](https://github.com/Strong-AI-Lab/PARARULE-Plus), a larger, deeper multi-step deductive reasoning dataset over natural language. We also submitted PARARULE-Plus to `Huggingface/Datasets`; here is the [link](https://huggingface.co/datasets/qbao775/PARARULE-Plus). PARARULE-Plus addresses the reasoning depth imbalance in the RuleTaker dataset by adding more examples at the deeper reasoning depths (depth = 2, 3, 4, and 5). In this pull request, we submit a dataset that includes `2708`, `2694`, `2704`, and `2692` questions for Depth=2, Depth=3, Depth=4, and Depth=5, respectively. We also evaluated ChatGPT, and it fails on this dataset. Here is the [tweet link](https://twitter.com/qiming_bao/status/1615510552088018944).]

### What makes this a useful eval?

[Logical reasoning ability is a fascinating topic in the NLP community. We hope to see whether ChatGPT and GPT-4 shed more light on this topic.]

## Criteria for a good eval ✅

Below are some of the criteria we look for in a good eval. In general, we are seeking cases where the model does not do a good job despite being capable of generating a good response (note that there are some things large language models cannot do, so those would not make good evals).

Your eval should be:

- [x] Thematically consistent: The eval should be thematically consistent. We'd like to see a number of prompts all demonstrating some particular failure mode. For example, we can create an eval on cases where the model fails to reason about the physical world.
- [x] Contains failures where a human can do the task, but either GPT-4 or GPT-3.5-Turbo could not.
- [x] Includes good signal around what is the right behavior. This means either a correct answer for `Basic` evals or the `Fact` Model-graded eval, or an exhaustive rubric for evaluating answers for the `Criteria` Model-graded eval.
- [x] **Include at least 15 high quality examples.**

If there is anything else that makes your eval worth including, please document it below.

### Unique eval value

> Insert what makes your eval high quality that was not mentioned above. (Not required)

## Eval structure 🏗️

Your eval should

- [x] Check that your data is in `evals/registry/data/{name}`
- [x] Check that your yaml is registered at `evals/registry/evals/{name}.yaml`
- [x] Ensure you have the right to use the data you submit via this eval

(For now, we will only be approving evals that use one of the existing eval classes. You may still write custom eval classes for your own cases, and we may consider merging them in the future.)

## Final checklist 👀

### Submission agreement

By contributing to Evals, you are agreeing to make your evaluation logic and data under the same MIT license as this repository. You must have adequate rights to upload any data used in an Eval. OpenAI reserves the right to use this data in future service improvements to our product. Contributions to OpenAI Evals will be subject to our usual Usage Policies (https://platform.openai.com/docs/usage-policies).

- [x] I agree that my submission will be made available under an MIT license and complies with OpenAI's usage policies.

### Email address validation

If your submission is accepted, we will be granting GPT-4 access to a limited number of contributors. Access will be given to the email address associated with the merged pull request.

- [x] I acknowledge that GPT-4 access will only be granted, if applicable, to the email address used for my merged pull request.

### Limited availability acknowledgement

We know that you might be excited to contribute to OpenAI's mission, help improve our models, and gain access to GPT-4. However, due to the requirements mentioned above and high volume of submissions, we will not be able to accept all submissions and thus not grant everyone who opens a PR GPT-4 access. We know this is disappointing, but we hope to set the right expectation before you open this PR.

- [x] I understand that opening a PR, even if it meets the requirements above, does not guarantee the PR will be merged nor GPT-4 access granted.

### Submit eval

- [x] I have filled out all required fields in the evals PR form
- [ ] (Ignore if not submitting code) I have run `pip install pre-commit; pre-commit install` and have verified that `black`, `isort`, and `autoflake` are running when I commit and push

Failure to fill out all required fields will result in the PR being closed.

### Eval JSON data

Since we are using Git LFS, we are asking eval submitters to add in as many Eval Samples (at least 5) from their contribution here:

<details>
<summary>View evals in JSON</summary>

### Eval
```
{"input": [{"role": "system", "content": "Instructions: You will be presented with a passage and a question about that passage. You need to answer true or false to the question. Read the question thoroughly and answer true or false. Read the passage thoroughly to ensure you know what the passage entails and you need to use 2 rules to answer the question."}, {"role": "user", "content": "\nPassage: The lion is slow. The lion is lazy. The wolf is strong. The wolf is fierce. The lion chases the mouse. The wolf likes the dog. The mouse is smart. The dog is smart. The dog is cute. The dog is small. If something is not smart then it needs the mouse. If something needs the mouse then it is rough. If something is not kind then it is strong. If something is not big then it is furry. If something is cute then it is small. If something is small and not awful then it is lovely. If something is strong and not kind then it is heavy. If something is slow and lazy then it is awful. If something is awful and not small then it is fierce. All furry animals are beautiful. Question: The lion is heavy. \nAnswer: "}], "ideal": "true", "id_string": "NegationRule-Animal-D2-11451"}
{"input": [{"role": "system", "content": "Instructions: You will be presented with a passage and a question about that passage. You need to answer true or false to the question. Read the question thoroughly and answer true or false. Read the passage thoroughly to ensure you know what the passage entails and you need to use 2 rules to answer the question."}, {"role": "user", "content": "\nPassage: The lion is slow. The lion is lazy. The wolf is strong. The wolf is fierce. The lion chases the mouse. The wolf likes the dog. The mouse is smart. The dog is smart. The dog is cute. The dog is small. 
If something is not smart then it needs the mouse. If something needs the mouse then it is rough. If something is not kind then it is strong. If something is not big then it is furry. If something is cute then it is small. If something is small and not awful then it is lovely. If something is strong and not kind then it is heavy. If something is slow and lazy then it is awful. If something is awful and not small then it is fierce. All furry animals are beautiful. Question: The lion is not heavy. \nAnswer: "}], "ideal": "false", "id_string": "NegationRule-Animal-D2-11452"} {"input": [{"role": "system", "content": "Instructions: You will be presented with a passage and a question about that passage. You need to answer true or false to the question. Read the question thoroughly and answer true or false. Read the passage thoroughly to ensure you know what the passage entails and you need to use 3 rules to answer the question."}, {"role": "user", "content": "\nPassage: The lion is slow. The lion is sleepy. The tiger is fierce. The tiger is big. The lion likes the dog. The tiger needs the mouse. The dog is smart. The mouse is smart. The mouse is small. The mouse is cute. If something is not smart then it sees the dog. If something sees the dog then it is lazy. If something is not kind then it is fierce. If something is not horrible then it is furry. If something is small then it is cute. If something is cute and not strong then it is beautiful. If something is fierce and not kind then it is awful. If something is slow and sleepy then it is strong. If something is strong and not cute then it is big. If something is furry then it is lovely. All lovely animals are round. All beautiful animals are quiet. All awful animals are heavy. All big animals are horrible. All lazy animals are rough. Question: The lion is rough. 
\nAnswer: "}], "ideal": "true", "id_string": "NegationRule-Animal-D3-10559"} {"input": [{"role": "system", "content": "Instructions: You will be presented with a passage and a question about that passage. You need to answer true or false to the question. Read the question thoroughly and answer true or false. Read the passage thoroughly to ensure you know what the passage entails and you need to use 3 rules to answer the question."}, {"role": "user", "content": "\nPassage: The lion is slow. The lion is sleepy. The tiger is fierce. The tiger is big. The lion likes the dog. The tiger needs the mouse. The dog is smart. The mouse is smart. The mouse is small. The mouse is cute. If something is not smart then it sees the dog. If something sees the dog then it is lazy. If something is not kind then it is fierce. If something is not horrible then it is furry. If something is small then it is cute. If something is cute and not strong then it is beautiful. If something is fierce and not kind then it is awful. If something is slow and sleepy then it is strong. If something is strong and not cute then it is big. If something is furry then it is lovely. All lovely animals are round. All beautiful animals are quiet. All awful animals are heavy. All big animals are horrible. All lazy animals are rough. Question: The lion is not rough. \nAnswer: "}], "ideal": "false", "id_string": "NegationRule-Animal-D3-105510"} {"input": [{"role": "system", "content": "Instructions: You will be presented with a passage and a question about that passage. You need to answer true or false to the question. Read the question thoroughly and answer true or false. Read the passage thoroughly to ensure you know what the passage entails and you need to use 4 rules to answer the question."}, {"role": "user", "content": "\nPassage: The snake is slow. The snake is lazy. The snake is rough. The snake chases the mouse. The crocodile sees the rabbit. The crocodile is fierce. The crocodile is big. 
The mouse is smart. The mouse is quiet. The mouse is nice. The rabbit is cute. The rabbit is small. The rabbit is adorable. Smart animals are cute. If something is lazy then it attacks the mouse. If something attacks the mouse then it is tired. If something is slow and lazy then it is rough. If something is cute and small then it is beautiful. If something is fierce and big then it is heavy. If something is rough then it is dull. If something is dull then it is sleepy. All sleepy animals are big. If something is cute then it is small. If something is small then it is adorable. If something is adorable then it is nice. All adorable animals are kind. If something is heavy then it is awful. All awful animals are obese. All obese animals are lazy. If something is beautiful then it is lovely. All lovely animals are furry. All furry animals are slow. If something is tired then it is strong. All strong animals are reckless. Question: The rabbit is not slow. \nAnswer: "}], "ideal": "false", "id_string": "NonNegationRule-Animal-D4-25898"} {"input": [{"role": "system", "content": "Instructions: You will be presented with a passage and a question about that passage. You need to answer true or false to the question. Read the question thoroughly and answer true or false. Read the passage thoroughly to ensure you know what the passage entails and you need to use 4 rules to answer the question."}, {"role": "user", "content": "\nPassage: The snake is slow. The snake is lazy. The snake is rough. The snake chases the mouse. The crocodile sees the rabbit. The crocodile is fierce. The crocodile is big. The mouse is smart. The mouse is quiet. The mouse is nice. The rabbit is cute. The rabbit is small. The rabbit is adorable. Smart animals are cute. If something is lazy then it attacks the mouse. If something attacks the mouse then it is tired. If something is slow and lazy then it is rough. If something is cute and small then it is beautiful. 
If something is fierce and big then it is heavy. If something is rough then it is dull. If something is dull then it is sleepy. All sleepy animals are big. If something is cute then it is small. If something is small then it is adorable. If something is adorable then it is nice. All adorable animals are kind. If something is heavy then it is awful. All awful animals are obese. All obese animals are lazy. If something is beautiful then it is lovely. All lovely animals are furry. All furry animals are slow. If something is tired then it is strong. All strong animals are reckless. Question: The snake is reckless. \nAnswer: "}], "ideal": "true", "id_string": "NonNegationRule-Animal-D4-25899"} {"input": [{"role": "system", "content": "Instructions: You will be presented with a passage and a question about that passage. You need to answer true or false to the question. Read the question thoroughly and answer true or false. Read the passage thoroughly to ensure you know what the passage entails and you need to use 5 rules to answer the question."}, {"role": "user", "content": "\nPassage: The snake is dull. The snake is slow. The bald eagle is awful. The bald eagle is powerful. The snake attacks the rabbit. The bald eagle likes the squirrel. The rabbit is quiet. The squirrel is quiet. The squirrel is beautiful. The squirrel is cute. If something is not quiet then it visits the rabbit. If something visits the rabbit then it is rough. If something is not kind then it is awful. If something is not fierce then it is furry. If something is beautiful then it is cute. If something is cute and not angry then it is small. If something is awful and not kind then it is horrible. If something is dull and slow then it is angry. If something is angry and not cute then it is powerful. If something is furry then it is lovely. If something is lovely then it is clever. If something is clever then it is kind. All kind animals are smart. All small animals are round. 
If something is round then it is nice. All nice animals are funny. If something is horrible then it is heavy. If something is heavy then it is tired. All tired animals are reckless. If something is powerful then it is fierce. If something is fierce then it is lazy. All lazy animals are boring. All rough animals are sleepy. If something is sleepy then it is strong. All strong animals are big. Question: The snake is big. \nAnswer: "}], "ideal": "true", "id_string": "NegationRule-Animal-D5-23709"} {"input": [{"role": "system", "content": "Instructions: You will be presented with a passage and a question about that passage. You need to answer true or false to the question. Read the question thoroughly and answer true or false. Read the passage thoroughly to ensure you know what the passage entails and you need to use 5 rules to answer the question."}, {"role": "user", "content": "\nPassage: The snake is dull. The snake is slow. The bald eagle is awful. The bald eagle is powerful. The snake attacks the rabbit. The bald eagle likes the squirrel. The rabbit is quiet. The squirrel is quiet. The squirrel is beautiful. The squirrel is cute. If something is not quiet then it visits the rabbit. If something visits the rabbit then it is rough. If something is not kind then it is awful. If something is not fierce then it is furry. If something is beautiful then it is cute. If something is cute and not angry then it is small. If something is awful and not kind then it is horrible. If something is dull and slow then it is angry. If something is angry and not cute then it is powerful. If something is furry then it is lovely. If something is lovely then it is clever. If something is clever then it is kind. All kind animals are smart. All small animals are round. If something is round then it is nice. All nice animals are funny. If something is horrible then it is heavy. If something is heavy then it is tired. All tired animals are reckless. 
If something is powerful then it is fierce. If something is fierce then it is lazy. All lazy animals are boring. All rough animals are sleepy. If something is sleepy then it is strong. All strong animals are big. Question: The snake is not big. \nAnswer: "}], "ideal": "false", "id_string": "NegationRule-Animal-D5-237010"} {"input": [{"role": "system", "content": "Instructions: You will be presented with a passage and a question about that passage. You need to answer true or false to the question. Read the question thoroughly and answer true or false. Read the passage thoroughly to ensure you know what the passage entails and you need to use 5 rules to answer the question."}, {"role": "user", "content": "\nPassage: Harry is huge. Harry is strong. Erin is small. Erin is tiny. Alan is nice. Anne is sad. Anne is poor. If someone is not big then they are sad. If someone is not bad then they are kind. If someone is nice then they are smart. If someone is smart and not rough then they are clever. If someone is sad and not big then they are dull. If someone is dull then they are little. If someone is little then they are thin. All thin people are bad. If someone is small and tiny then they are rough. If someone is rough and not smart then they are poor. If someone is poor then they are fashion. If someone is fashion then they are energetic. If someone is energetic then they are young. If someone is kind then they are wealthy. If someone is wealthy then they are quiet. If someone is quiet then they are smart. All smart people are wealthy. If someone is clever then they are famous. If someone is famous then they are old. All old people are experienced. Question: Harry is bad. \nAnswer: "}], "ideal": "true", "id_string": "NegationRule-D5-22331"} {"input": [{"role": "system", "content": "Instructions: You will be presented with a passage and a question about that passage. You need to answer true or false to the question. Read the question thoroughly and answer true or false. 
Read the passage thoroughly to ensure you know what the passage entails and you need to use 5 rules to answer the question."}, {"role": "user", "content": "\nPassage: Harry is huge. Harry is strong. Erin is small. Erin is tiny. Alan is nice. Anne is sad. Anne is poor. If someone is not big then they are sad. If someone is not bad then they are kind. If someone is nice then they are smart. If someone is smart and not rough then they are clever. If someone is sad and not big then they are dull. If someone is dull then they are little. If someone is little then they are thin. All thin people are bad. If someone is small and tiny then they are rough. If someone is rough and not smart then they are poor. If someone is poor then they are fashion. If someone is fashion then they are energetic. If someone is energetic then they are young. If someone is kind then they are wealthy. If someone is wealthy then they are quiet. If someone is quiet then they are smart. All smart people are wealthy. If someone is clever then they are famous. If someone is famous then they are old. All old people are experienced. Question: Harry is not bad. \nAnswer: "}], "ideal": "false", "id_string": "NegationRule-D5-22332"} {"input": [{"role": "system", "content": "Instructions: You will be presented with a passage and a question about that passage. You need to answer true or false to the question. Read the question thoroughly and answer true or false. Read the passage thoroughly to ensure you know what the passage entails and you need to use 5 rules to answer the question."}, {"role": "user", "content": "\nPassage: Harry is huge. Harry is strong. Erin is small. Erin is tiny. Alan is nice. Anne is sad. Anne is poor. If someone is not big then they are sad. If someone is not bad then they are kind. If someone is nice then they are smart. If someone is smart and not rough then they are clever. If someone is sad and not big then they are dull. If someone is dull then they are little. 
If someone is little then they are thin. All thin people are bad. If someone is small and tiny then they are rough. If someone is rough and not smart then they are poor. If someone is poor then they are fashion. If someone is fashion then they are energetic. If someone is energetic then they are young. If someone is kind then they are wealthy. If someone is wealthy then they are quiet. If someone is quiet then they are smart. All smart people are wealthy. If someone is clever then they are famous. If someone is famous then they are old. All old people are experienced. Question: Anne is wealthy. \nAnswer: "}], "ideal": "true", "id_string": "NegationRule-D5-22333"}
```
</details>

---------

Co-authored-by: qiming bao <[email protected]>
Co-authored-by: Jorge <[email protected]>
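
To make the "use N rules" instruction in the samples concrete, here is a minimal sketch of how the first Depth=2 sample ("The lion is heavy") can be derived by forward chaining. It assumes negated conditions such as "not kind" hold whenever the attribute is absent from the currently known facts (a closed-world reading); that reading, the `Rule` encoding, and the `forward_chain` helper are illustrative assumptions, not code from the PARARULE-Plus repository. A naive loop like this can misfire when a negated attribute is derived later, so it is only a reading aid.

```python
from typing import FrozenSet, List, Set, Tuple

# Hypothetical encoding: (positive conditions, negated conditions, conclusion).
Rule = Tuple[FrozenSet[str], FrozenSet[str], str]


def forward_chain(facts: Set[str], rules: List[Rule]) -> Set[str]:
    """Apply rules until no new attribute can be derived.

    Assumption: a negated condition holds when the attribute is absent
    from the currently known facts (closed-world reading).
    """
    known = set(facts)
    changed = True
    while changed:
        changed = False
        for positive, negated, conclusion in rules:
            if conclusion in known:
                continue
            if positive <= known and not (negated & known):
                known.add(conclusion)
                changed = True
    return known


# Facts about the lion from the first sample: "The lion is slow. The lion is lazy."
lion_facts = {"slow", "lazy"}

# The two rules the Depth=2 question relies on:
#   "If something is not kind then it is strong."
#   "If something is strong and not kind then it is heavy."
rules: List[Rule] = [
    (frozenset(), frozenset({"kind"}), "strong"),
    (frozenset({"strong"}), frozenset({"kind"}), "heavy"),
]

derived = forward_chain(lion_facts, rules)
print("heavy" in derived)  # matches the sample's "ideal": "true"
```

Under the same assumptions, the paired question "The lion is not heavy" is false, which agrees with the `ideal` value of the second sample.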
1 parent 74a9ea3 commit f077b82

2 files changed, +11 -0 lines changed
Lines changed: 3 additions & 0 deletions
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:89995dbc5c9968a2e8053c658b6fd498dc8f52cb885ebc26f796e53b7609e3e0
+size 2707256
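
The three added lines are a Git LFS pointer, not the dataset itself: they record the spec version, the SHA-256 of the real file, and its size in bytes. As a rough sketch, a downloaded `.jsonl` file could be checked against such a pointer as follows; `parse_lfs_pointer` and `matches_pointer` are hypothetical helper names introduced here for illustration.

```python
import hashlib
from typing import Dict, Union


def parse_lfs_pointer(text: str) -> Dict[str, Union[str, int]]:
    """Split a Git LFS pointer into its version, hash algorithm, digest, and size."""
    fields = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition(" ")
        fields[key] = value
    algo, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "algo": algo,
        "digest": digest,
        "size": int(fields["size"]),
    }


def matches_pointer(data: bytes, pointer: Dict[str, Union[str, int]]) -> bool:
    """Return True when the bytes have the size and SHA-256 digest the pointer records."""
    return (
        len(data) == pointer["size"]
        and hashlib.sha256(data).hexdigest() == pointer["digest"]
    )


pointer_text = """version https://git-lfs.github.com/spec/v1
oid sha256:89995dbc5c9968a2e8053c658b6fd498dc8f52cb885ebc26f796e53b7609e3e0
size 2707256
"""

pointer = parse_lfs_pointer(pointer_text)
print(pointer["algo"], pointer["size"])
```

Calling `matches_pointer(open(path, "rb").read(), pointer)` on the fetched dataset file would then confirm that the LFS object was downloaded intact.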
Lines changed: 8 additions & 0 deletions
@@ -0,0 +1,8 @@
+pararule-plus-multi-step-deductive-reasoning:
+  id: pararule-plus-multi-step-deductive-reasoning.dev.v0
+  description: multi-step deductive reasoning instruction for the PARARULE-Plus dataset
+  metrics: [accuracy]
+pararule-plus-multi-step-deductive-reasoning.dev.v0:
+  class: evals.elsuite.basic.fuzzy_match:FuzzyMatch
+  args:
+    samples_jsonl: pararule-plus-multi-step-deductive-reasoning/pararule-plus-multi-step-deductive-reasoning.jsonl
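
The registry entry selects the `FuzzyMatch` class and reports `accuracy`. As a rough illustration of what a fuzzy comparison between a model completion and the `ideal` string can look like, the sketch below normalizes both strings and checks whether one contains the other. The `normalize` and `fuzzy_match` helpers are assumptions written for this example, not the evals library's own code, which may differ in detail.

```python
import re
import string


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and English articles, and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def fuzzy_match(completion: str, ideal: str) -> bool:
    """Return True when either normalized string contains the other."""
    a, b = normalize(completion), normalize(ideal)
    if a == "" or b == "":
        return a == b
    return a in b or b in a


print(fuzzy_match("True.", "true"))                       # True
print(fuzzy_match("False", "true"))                       # False
print(fuzzy_match("The statement is not true.", "true"))  # True
```

The last call shows a caveat of containment-style matching for true/false answers: a verbose completion that negates the ideal word can still count as a match, so short, single-word completions give the cleanest accuracy signal.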
