Skip to content

Match multiple metadata values#520

Closed
bganglia wants to merge 103 commits intomasterfrom
match-multiple-metadata-values
Closed

Match multiple metadata values#520
bganglia wants to merge 103 commits intomasterfrom
match-multiple-metadata-values

Conversation

@bganglia
Copy link
Collaborator

@bganglia bganglia commented Jan 31, 2020

Describe your changes
This PR allows selecting more than one value within each metadata field. For example, this would make it possible to process images from only the first and last dates of an experiment, or some subset of the cameras used.

Type of update
Is this a:

  • New feature or feature enhancement
  • Work in progress

Associated issues
This will close #461

Additional context

  1. Data structures on the Python side
    In this PR, meta_filters is a dictionary with keys that can be single values or lists. If they are lists, metadata_parser checks whether at least one matches. So this is a valid group of filters:
meta_filters = {
    "id":["11", "32"],
    "plantbarcode":"A1"
}
  1. Syntax on the CLI side
    Two different syntax styles were mentioned in the issue:
    a. --match plantbarcode:A1,plantbarcode:B2,plantbarcode:C3 This way is easier for writing CLI scripts that wrap plantcv-workflow.py and simpler to parse in Python.
    b. --match plantbarcode:[A1,B2,C3] This way is much easier to read and probably better for experimentation, but harder to parse in Python.
    In this PR, metadata_parser accepts either/both. For example, the following would all be read in the same way:
    --match id:[1,2,3]
    --match id:1,id:2,id:3
    --match id:[1,2],id:3
    If this is too much flexibility, the code could be limited to one style.

To Do

  • Accept multiple metadata values
  • Support \ and quoting
  • Raise informative (fatal) error if invalid metadata filters are given
  • Refactor
  • Make sure that all current tests pass
  • Add new tests
  • Add documentation

@bganglia bganglia added the work in progress Mark work in progress label Jan 31, 2020
@coveralls
Copy link

coveralls commented Feb 2, 2020

Coverage Status

Coverage decreased (-0.1%) to 99.709% when pulling dd506eb on match-multiple-metadata-values into 6ad7e13 on master.

@dschneiderch
Copy link
Collaborator

since you've done both I dont know why we wouldn't implement both. this is going to be great! Thanks @bganglia

@bganglia bganglia force-pushed the match-multiple-metadata-values branch from b594237 to 8065281 Compare February 9, 2020 17:53
@bganglia
Copy link
Collaborator Author

@nfahlgren
I have a few questions about error handling before I wrap up the tests for this feature. Right now the code assumes the user meant something reasonable instead of throwing warnings or errors when the metadata filters look fishy. Specifically:

  1. Empty metadata values ("") are valid.
  2. Presently, : and , are not interpreted as special characters when that would create an error. For example, id:: would translate to {"id":":"}. Alternately, you could require backslashes, meaning the user would have to write id:\: to avoid an error.

Which style is more fitting for PlantCV - throwing more errors/warnings or making more assumptions?

@HaleySchuhl
Copy link
Contributor

@bganglia , thanks for all your work on this so far!

I think that the first example might make most sense to be handled as if no matching is needed for that type of metadata but also print a warning. For example if someone input --match treatment:[] I would think most people meant all treatments should be included rather than no images.

The second case with handling special characters such as : I like your idea about allowing slashes since it is similar to the regex style we've been moving towards regarding timestamps. A fatal error with an informative error message would be best in my opinion in the case where the escape character is missing before special characters. Do you agree @nfahlgren

@nfahlgren
Copy link
Member

Yeah, I think if there are any errors we think would be common/likely it's worth putting some error handling around them, but I think it's okay to make some assumptions for everything else. Just comes down to the amount of effort in writing checks for lots of probably unlikely inputs, of which there could be just about anything. I think this is where the documentation helps, plus users posting issues when they have data that breaks our assumptions that we can then improve the code around.

It is possible people will need to use characters like : since it's a valid character in filenames in Linux/Unix, and that's where most metadata is extracted from. Having the option to escape sounds good to me, but maybe an easier alternative is to instruct users to quote the strings?

@bganglia
Copy link
Collaborator Author

@HaleySchuhl @nfahlgren That makes sense. I will modify the checklist to include these ideas.

Just to be clear, should the syntax id:, which is currently equivalent to id:[], raise an error or also be interpreted as selecting all images?

@nfahlgren
Copy link
Member

In my thinking it's an error. The default behavior is to not filter on any metadata, so all id in this case are included automatically. So there's no need for a user to specify that all id should be matched.

@bganglia
Copy link
Collaborator Author

@nfahlgren OK. I will raise an error on empty filters instead.

@bganglia
Copy link
Collaborator Author

@nfahlgren @HaleySchuhl
This pull request is nearing completion, so I wanted to check in to make sure I am on the right track.

This PR adds the following:

  1. Matching multiple metadata values specified either as id:1,id:2 or id:[1,2]
  2. Support for quoting and escaping ,, [, ], and :
  3. Very specific and helpful error messages when syntax errors are present, e.g.
EmptyListError: Empty list
id:[]
    ^

Internally, the code splits the input into basic units like camera or : and has hard-coded rules to decide how to build the dictionary and recognize errors.

If I am on the right track, I will wrap up the documentation and mark this PR as completed. If not, I can try to change things or work on something else instead.

@nfahlgren nfahlgren linked an issue Jul 23, 2020 that may be closed by this pull request
@nfahlgren nfahlgren added this to the PlantCV v4.x milestone Jul 23, 2020
@nfahlgren nfahlgren linked an issue Jul 23, 2020 that may be closed by this pull request
8 tasks
@dschneiderch dschneiderch mentioned this pull request Aug 26, 2020
10 tasks
@bganglia
Copy link
Collaborator Author

I think that there is a better approach than the one that I was using here, so I will open a new pull request to replace this one.

@bganglia bganglia closed this Jun 18, 2021
@nfahlgren nfahlgren deleted the match-multiple-metadata-values branch October 16, 2022 17:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

work in progress Mark work in progress

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Overall improvements to workflow parallelization 'match' multiple values of same metadata field

5 participants