Questions Regarding Dataset Annotation Process #7

@parkervg

Hi,

I have some questions about the process for annotating ground truth answers in your TAG benchmark. There seem to be quite a few questions that are inherently subjective, with no single correct 'ground truth' answer. In addition, I see some inconsistencies between the questions in tag_queries.csv and those in hand_written.py.

It would be very useful if you could share the exact outputs produced by your hand_written.py script, to see what versions of questions + annotated ground truth answers were used to report the paper performances. Any help here would be greatly appreciated!

Subjective Questions

  • pipeline_59(): Of the top 10 players taller than 180 ordered by average heading accuracy descending, what are the top 3 most unique sounding names?
    • What are the criteria for saying, e.g., that 'Per Mertesacker' is a more unique name than 'Miroslav Klose'?
  • pipeline_50(): Among the magnet schools with SAT test takers of over 500, which school name sounds most futuristic?
    • What defines 'most futuristic'? Choosing between, say, 'Millikan High' and 'Polytechnic High' feels subjective.
  • pipeline_51(): Of the 5 posts wih highest popularity, list their titles in order of most technical to least technical.
  • pipeline_56(): Among the posts owned by a user over 65 with a score of over 10, what are the post id's of the top 2 posts made with the least expertise?
    • How was 'least expertise' defined for the annotators?
  • pipeline_60(): Out of users that have obtained at least 200 badges, what are the top 2 display names that seem most based off a real name?
    • Why is 'Glen_b' more based off of a real name than 'whuber'?
  • pipeline_107(): Of all the comments commented by the user with a username of Harvey Motulsky and with a score of 5, rank the post ids in order of most helpful to least helpful
    • Was 'most helpful' defined in a specific way to the annotators?
  • pipeline_61(): Of the cities containing exclusively virtual schools which are the top 3 safest places to live?
    • Is a measure of 'safest place to live' defined somewhere in the BIRD database or elsewhere?
  • pipeline_62(): List the cities containing the top 5 most enrolled schools in order from most diverse to least diverse.
    • Similar question here: Is 'most diverse school' a criterion defined in the BIRD database?
  • pipeline_64(): Of the schools with the top 3 SAT excellence rate, order their counties by academic reputation from strongest to weakest.
    • A couple of questions here: how is 'strongest academic reputation' defined? Additionally, while the question asks for an ordered list, the LOTUS program (and corresponding ground truth answer) returns a single item, 'Santa Clara'.
  • pipeline_65(): Among the cities with the top 10 lowest enrollment for students in grades 1 through 12, which are the top 2 most popular cities to visit?
    • How is 'most popular cities to visit' defined? The ground truth chooses 'Shaver Lake' over 'Wawona', but a quick Google search seems to indicate that Wawona/Yosemite gets far more visitors than Shaver Lake.

Dataset Inconsistencies

  • pipeline_40(): Among the players whose height is over 180, how many of them have a volley score of over 70 and are taller than Bill Clinton?
    • Judging from the variable steph_height and the example in Appendix A from the paper, it seems as though this was switched from 'Steph Curry' -> 'Bill Clinton' at some point. Which version of the dataset is reported in the paper?
  • pipeline_952(): Of the constructors that have been ranked 1 in 2014, whose logo looks most like Secretariat?
    • In tag_queries.csv, this is 'Of the constructors that have been ranked 1 in 2014, which has the most prestige?'. Similar question: which version of the question is used in reporting performance in your paper?
  • pipeline_5(): What are the two most common first names among the female school administrators?
    • On line 94, .head(20) is applied, I imagine to speed up query execution. However, the resulting query is no longer faithful to the original natural language question: nothing in the database guarantees that a female name appears among the top 20 most common names. A faithful query would need to call sem_filter() over all names in the schools_df table.
  • pipeline_4(): What is the grade span offered in the school with the highest longitude in counties that are part of the 'Silicon Valley' region?
    • In tag_queries.csv, 'cities' is used in place of 'counties'
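
To illustrate the pipeline_5() concern about .head(20), here is a minimal sketch on made-up data. It is not taken from hand_written.py: the names, the counts, and the `is_female_name` helper (a hypothetical stand-in for the LLM-backed sem_filter() call) are all assumptions, and the cutoff is shrunk from 20 to 3 so the effect shows up on a toy example.

```python
import pandas as pd

# Made-up administrator first names; not data from the BIRD database.
names = pd.Series(
    ["Alex"] * 5 + ["Bob"] * 4 + ["Chris"] * 3 + ["Dana"] * 2 + ["Eve"] * 1
)

# Hypothetical stand-in for the semantic filter; a real pipeline would
# ask an LLM whether each name is female.
FEMALE_NAMES = {"Dana", "Eve"}


def is_female_name(name: str) -> bool:
    return name in FEMALE_NAMES


def top_female_names(names: pd.Series, k: int, head=None) -> list:
    """Return the k most common names that pass the filter.

    If `head` is given, the frequency table is truncated to its first
    `head` rows BEFORE filtering, mimicking a .head(n) shortcut.
    """
    counts = names.value_counts()  # sorted by frequency, descending
    if head is not None:
        counts = counts.head(head)
    return [n for n in counts.index if is_female_name(n)][:k]


# Truncating first drops every female name in this toy data.
truncated = top_female_names(names, k=2, head=3)

# Filtering over all names answers the question as asked.
faithful = top_female_names(names, k=2)

print(truncated)
print(faithful)
```

Whether the truncation changes the answer on the real data depends on where the female names fall in the frequency ranking, which is why sharing the exact script outputs would help.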
