Hi,
I have some questions about the process for annotating ground truth answers in your TAG benchmark. There seem to be quite a few questions that are inherently subjective, with no single correct 'ground truth' answer. In addition, some questions differ between tag_queries.csv and hand_written.py.
It would be very useful if you could share the exact outputs produced by your hand_written.py script, so I can see which versions of the questions and annotated ground truth answers were used to report the performance numbers in the paper. Any help here would be greatly appreciated!
Subjective Questions
pipeline_59(): Of the top 10 players taller than 180 ordered by average heading accuracy descending, what are the top 3 most unique sounding names?
- What are the criteria for saying that, e.g., 'Per Mertesacker' is a more unique name than 'Miroslav Klose'?
pipeline_50(): Among the magnet schools with SAT test takers of over 500, which school name sounds most futuristic?
- What defines 'most futuristic'? Choosing between, say, 'Millikan High' and 'Polytechnic High' feels subjective.
pipeline_51(): Of the 5 posts wih highest popularity, list their titles in order of most technical to least technical.
pipeline_56(): Among the posts owned by a user over 65 with a score of over 10, what are the post id's of the top 2 posts made with the least expertise?
- How is 'least expertise' defined to the annotator?
pipeline_60(): Out of users that have obtained at least 200 badges, what are the top 2 display names that seem most based off a real name?
- Why is 'Glen_b' more based off of a real name than 'whuber'?
pipeline_107(): Of all the comments commented by the user with a username of Harvey Motulsky and with a score of 5, rank the post ids in order of most helpful to least helpful
- Was 'most helpful' defined in a specific way to the annotators?
pipeline_61(): Of the cities containing exclusively virtual schools which are the top 3 safest places to live?
- Is a measure of 'safest place to live' defined somewhere in the BIRD database or elsewhere?
pipeline_62(): List the cities containing the top 5 most enrolled schools in order from most diverse to least diverse.
- Similar question here: Is 'most diverse school' a criterion defined in the BIRD database?
pipeline_64(): Of the schools with the top 3 SAT excellence rate, order their counties by academic reputation from strongest to weakest.
- A couple of questions here: how is 'strongest academic reputation' defined? Additionally, while the question asks for an ordered list, the LOTUS program (and corresponding ground truth answer) returns a single item, 'Santa Clara'.
pipeline_65(): Among the cities with the top 10 lowest enrollment for students in grades 1 through 12, which are the top 2 most popular cities to visit?
- How is 'most popular cities to visit' defined? The ground truth chooses 'Shaver Lake' over 'Wawona', but a quick Google search seems to indicate that Wawona/Yosemite gets far more visitors than Shaver Lake.
Dataset Inconsistencies
pipeline_40(): Among the players whose height is over 180, how many of them have a volley score of over 70 and are taller than Bill Clinton?
- Judging from the variable steph_height and the example in Appendix A of the paper, it seems as though this was switched from 'Steph Curry' -> 'Bill Clinton' at some point. Which version of the dataset is reported in the paper?
pipeline_952(): Of the constructors that have been ranked 1 in 2014, whose logo looks most like Secretariat?
- In tag_queries.csv, this is 'Of the constructors that have been ranked 1 in 2014, which has the most prestige?'. Similar question: which version of the question is used in reporting performance in your paper?
pipeline_5(): What are the two most common first names among the female school administrators?
- On line 94, .head(20) is applied, I imagine to speed up query execution. However, this leads to a query that is no longer faithful to the original natural language question: nothing in the database guarantees that a female name is among the top 20 most common names. A faithful query would need to call sem_filter() over all names in the schools_df table.
pipeline_4(): What is the grade span offered in the school with the highest longitude in counties that are part of the 'Silicon Valley' region?
- In tag_queries.csv, 'cities' is used in place of 'counties'.