Questions Regarding Dataset Annotation Process

Hi,

I have some questions about the process for annotating ground truth answers in your TAG benchmark. There seem to be quite a few questions that are inherently subjective, with no single correct 'ground truth' answer. In addition, I see some inconsistencies between the questions in `tag_queries.csv` and those in `hand_written.py`.

It would be very useful if you could share the exact outputs produced by your `hand_written.py` script, so that it is clear which versions of the questions and annotated ground truth answers were used for the results reported in the paper. Any help here would be greatly appreciated!


### Subjective Questions
- `pipeline_59()`: **Of the top 10 players taller than 180 ordered by average heading accuracy descending, what are the top 3 most unique sounding names?**
    - What is the criterion for saying, e.g., that 'Per Mertesacker' is a more unique name than 'Miroslav Klose'?
- `pipeline_50()`: **Among the magnet schools with SAT test takers of over 500, which school name sounds most futuristic?**
    - What defines 'most futuristic'? Choosing between, say, 'Millikan High' and 'Polytechnic High' feels subjective.
- `pipeline_51()`: **Of the 5 posts wih highest popularity, list their titles in order of most technical to least technical.**
- `pipeline_56()`: **Among the posts owned by a user over 65 with a score of over 10, what are the post id's of the top 2 posts made with the least expertise?**
    - How is 'least expertise' defined to the annotator?
- `pipeline_60()`: **Out of users that have obtained at least 200 badges, what are the top 2 display names that seem most based off a real name?**
    - Why is 'Glen_b' more based off of a real name than 'whuber'?
- `pipeline_107()`: **Of all the comments commented by the user with a username of Harvey Motulsky and with a score of 5, rank the post ids in order of most helpful to least helpful**
    - Was 'most helpful' defined in a specific way to the annotators?
- `pipeline_61()`: **Of the cities containing exclusively virtual schools which are the top 3 safest places to live?**
    - Is a measure of 'safest place to live' defined somewhere in the BIRD database or elsewhere?
- `pipeline_62()`: **List the cities containing the top 5 most enrolled schools in order from most diverse to least diverse.**
    - Similar question here: Is 'most diverse school' a criterion defined in the BIRD database?
- `pipeline_64()`: **Of the schools with the top 3 SAT excellence rate, order their counties by academic reputation from strongest to weakest.**
    - A couple of questions here: how is 'strongest academic reputation' defined? Additionally, while the question asks for an ordered list, the LOTUS program (and corresponding ground truth answer) returns a single item, 'Santa Clara'.
- `pipeline_65()`: **Among the cities with the top 10 lowest enrollment for students in grades 1 through 12, which are the top 2 most popular cities to visit?**
    - How is 'most popular cities to visit' defined? The ground truth chooses 'Shaver Lake' over 'Wawona', but a quick Google search suggests that [Wawona/Yosemite](https://www.nps.gov/yose/planyourvisit/traffic.htm#:~:text=Each%20year%2C%20Yosemite%20National%20Park,no%20lodging%20or%20campground%20availability) gets far more visitors than Shaver Lake.

### Dataset Inconsistencies 
- `pipeline_40()`: **Among the players whose height is over 180, how many of them have a volley score of over 70 and are taller than Bill Clinton?**
    - Judging from the variable `steph_height` and the example in Appendix A of the paper, it seems that this question was changed from 'Steph Curry' to 'Bill Clinton' at some point. Which version of the dataset is reported in the paper?
- `pipeline_952()`: **Of the constructors that have been ranked 1 in 2014, whose logo looks most like Secretariat?**
    - In `tag_queries.csv`, this is *Of the constructors that have been ranked 1 in 2014, which has the most prestige?*. Similar question here: which version of the question was used to report performance in your paper?
- `pipeline_5()`: **What are the two most common first names among the female school administrators?**
    - On [line 94](https://github.com/TAG-Research/TAG-Bench/blob/76d5795d6e35f770894d3f180af58b6638964fcf/tag/hand_written.py#L94), `.head(20)` is applied, I imagine to speed up query execution. However, this leads to a query that is no longer faithful to the original natural language question: nothing in the database guarantees that a female name is among the top 20 most common names. A faithful query would need to call `sem_filter()` over all names in the `schools_df` table.
- `pipeline_4()`: **What is the grade span offered in the school with the highest longitude in counties that are part of the 'Silicon Valley' region?**
    - In `tag_queries.csv`, 'cities' is used in place of 'counties'.
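
To make the `pipeline_5()` concern concrete, here is a minimal sketch of why truncating before filtering can change the answer. It assumes the truncation is applied to a frequency-ranked list of names, as my comment above supposes. Everything in it is hypothetical: the toy name list is invented (not from BIRD), `is_female_name` is a hard-coded stand-in for the LLM-backed `sem_filter()` call, and `Counter.most_common(k)` plays the role of a frequency ranking followed by `.head(k)`.

```python
from collections import Counter

# Hypothetical toy data, NOT taken from the BIRD schools table.
admin_first_names = (
    ["James"] * 6 + ["Robert"] * 5 + ["Michael"] * 4
    + ["Mary"] * 3 + ["Linda"] * 2 + ["Susan"] * 1
)

# Hard-coded stand-in for an LLM-backed semantic filter such as sem_filter().
FEMALE_NAMES = {"Mary", "Linda", "Susan"}


def is_female_name(name):
    return name in FEMALE_NAMES


def truncate_then_filter(names, k, n=2):
    """Rank by frequency, keep only the top k, then apply the filter."""
    top_k = Counter(names).most_common(k)  # analogous to ranking + .head(k)
    return [name for name, _ in top_k if is_female_name(name)][:n]


def filter_then_rank(names, n=2):
    """Apply the filter to every name first, then rank by frequency."""
    counts = Counter(name for name in names if is_female_name(name))
    return [name for name, _ in counts.most_common(n)]


# With k=3 the cutoff contains only male names, so the filter finds nothing.
print(truncate_then_filter(admin_first_names, k=3))  # []
# Filtering over all names recovers the two most common female names.
print(filter_then_rank(admin_first_names))  # ['Mary', 'Linda']
```

Whether the two orderings disagree on the real data depends on the actual name distribution, which is why sharing the script outputs would help.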
