@peterdesmet
Member

Since we include the schemas verbosely, I think we should allow publishers to add a more rigorous type, format and constraints than those provided at rs.tdwg.org.

For type we have to be a bit careful, which is why I suggest using "type": "any" in our table schemas for terms that can have multiple types. That would differentiate:

eventDate: can be string or datetime

From

eventType: must always be a string

The implementation rule for "any" is that there must be no processing. For CSVs that means those values are interpreted as strings.
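As a rough illustration of the rule above (a sketch, not part of the PR): the schema fragment below is hypothetical, and `read_rows` is an invented helper. It only shows that a field declared as "any" gets no processing, so its CSV values remain strings even when they look like dates.

```python
import csv
import io

# Hypothetical table schema fragment: "any" marks a term whose values may
# be of several types; "string" marks a term that must always be a string.
schema = {
    "fields": [
        {"name": "eventDate", "type": "any"},
        {"name": "eventType", "type": "string"},
    ]
}


def read_rows(csv_text, schema):
    """Read CSV rows, leaving "any" and "string" fields as raw strings."""
    types = {f["name"]: f["type"] for f in schema["fields"]}
    rows = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        for name in row:
            # Undeclared columns are treated like "any" in this sketch.
            field_type = types.get(name, "any")
            if field_type not in ("any", "string"):
                raise NotImplementedError(f"no parser for type {field_type!r}")
            # "any" means no processing: the CSV value stays a string.
        rows.append(row)
    return rows


rows = read_rows("eventDate,eventType\n2024-05-01,Survey\n2024,Survey\n", schema)
```

Both `2024-05-01` and `2024` come back as plain strings; any interpretation is left to the consumer.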

@tucotuco you probably have a better overview of terms that can deviate from strings?

@tucotuco
Collaborator

I don't think that terms should allow multiple types. I imagine myself trying to load data into a strongly-typed database schema and finding that the table schemas on which I am basing an aggregation are changing from dataset to dataset.

@tucotuco
Collaborator

Conversely, I think formats and constraints can only be useful, but constraints shouldn't be broader than those provided in the schema definitions - that ultimately could change the semantics of some terms.

@peterdesmet
Member Author

> but constraints shouldn't be broader than those provided in the schema definitions

That is what is currently suggested for constraints in this PR:

The constraints provided in the table schema at rs.tdwg.org MAY be updated, but it MUST NOT relax the original constraints.
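One way a validator could check the MUST NOT relax rule, as a minimal sketch: `relaxed_constraints` is a hypothetical helper, the constraint values are made up, and only a subset of Table Schema constraints (`required`, `unique`, `minimum`, `maximum`, `minLength`, `maxLength`, `enum`) is covered. Comparing `pattern` constraints is not attempted here.

```python
def relaxed_constraints(original, updated):
    """Return the names of original constraints that `updated` relaxes."""
    relaxed = []
    # A constraint that was switched on must stay on.
    for key in ("required", "unique"):
        if original.get(key) and not updated.get(key):
            relaxed.append(key)
    # Lower bounds may only be raised, never lowered or dropped.
    for key in ("minimum", "minLength"):
        if key in original and (key not in updated or updated[key] < original[key]):
            relaxed.append(key)
    # Upper bounds may only be lowered, never raised or dropped.
    for key in ("maximum", "maxLength"):
        if key in original and (key not in updated or updated[key] > original[key]):
            relaxed.append(key)
    # An enum may only shrink to a subset of the original values.
    if "enum" in original:
        if "enum" not in updated or not set(updated["enum"]) <= set(original["enum"]):
            relaxed.append("enum")
    return relaxed


# Made-up constraints, for illustration only.
original = {"minimum": 0, "maximum": 90}
tightened = {"minimum": 0, "maximum": 60, "required": True}
loosened = {"minimum": -10, "maximum": 90}

tightened_result = relaxed_constraints(original, tightened)
loosened_result = relaxed_constraints(original, loosened)
```

Here `tightened` adds and narrows constraints, so nothing is reported, while `loosened` lowers `minimum` and would be rejected.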

@peterdesmet
Member Author

My reasoning for allowing more specific types was especially with datetime in mind:

  • As a consumer, it's really useful to know that all values in eventDate comply with datetime and a specific format (or that I can at least validate them against it). The combination of datetime and format is very powerful.
  • As a publisher, I can communicate that I made an effort to have all my values standardized.
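To make the consumer side concrete, here is a small sketch. The field descriptor is hypothetical (a publisher tightening eventDate to datetime with a strptime-style format), and `invalid_values` is an invented helper, not an existing library call.

```python
from datetime import datetime

# Hypothetical publisher-tightened descriptor for eventDate.
field = {
    "name": "eventDate",
    "type": "datetime",
    "format": "%Y-%m-%dT%H:%M:%SZ",
}


def invalid_values(values, field):
    """Return the values that do not parse with the field's declared format."""
    bad = []
    for value in values:
        try:
            datetime.strptime(value, field["format"])
        except ValueError:
            bad.append(value)
    return bad


bad = invalid_values(["2024-05-01T10:30:00Z", "2024-05", "1 May 2024"], field)
```

With only "type": "string" declared, none of these values could be rejected; with datetime and a format, the consumer can flag the last two before loading anything.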

I'm curious what others think. @timrobertson100 @mdoering @MattBlissett

@timrobertson100
Member

timrobertson100 commented Sep 11, 2025

I think I agree with @peterdesmet

> I imagine myself trying to load data into a strongly-typed database schema and finding that the table schemas on which I am basing an aggregation are changing from dataset to dataset.

If you are imagining doing e.g. a PostgreSQL COPY ... FROM ... some.csv then I'm not sure FD will be strict enough to accommodate all scenarios. I anticipate you'd have to assume strings and then use some functions/parsers to convert them into typed fields.

I'm no FD expert, but I believe even something like a number field in FD can use , or . as its decimal or grouping character, or set bareNumber to false to allow additions such as %.
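A sketch of what such a parser might look like on the consumer side. `parse_number` is a hypothetical helper whose keyword arguments mirror the Table Schema number properties `decimalChar`, `groupChar` and `bareNumber`; it is a simplification and not the behaviour of any particular FD library.

```python
import re
from decimal import Decimal


def parse_number(text, decimal_char=".", group_char=None, bare_number=True):
    """Convert a CSV string into a Decimal using the declared number format."""
    value = text.strip()
    if not bare_number:
        # Strip leading/trailing non-numeric characters such as "%" or "€".
        value = re.sub(r"^[^0-9+\-]+|[^0-9]+$", "", value)
    if group_char:
        # Drop grouping characters first, so "." can be reused below.
        value = value.replace(group_char, "")
    if decimal_char != ".":
        value = value.replace(decimal_char, ".")
    # Raises decimal.InvalidOperation if the value is still not numeric.
    return Decimal(value)


percent = parse_number("95%", bare_number=False)
grouped = parse_number("1.234,50", decimal_char=",", group_char=".")
```

The same string can therefore mean different numbers depending on the descriptor, which is why loading such CSVs straight into typed columns is fragile.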

I'd expect any consumer of a wide variety of DPs would need to deal with variation across them. Having the ability for a publisher to use String seems convenient and likely necessary for many and having the ability for them to declare stronger typing where possible seems helpful too.

(As a more general comment, if strong typing is really what is wanted, then CSV is not a format I'd promote, for all the reasons we're discussing. Avro, Parquet, etc. are better-suited mediaTypes.)
