Skip to content

Enhancement/validations on update#541

Merged
nasaul merged 13 commits intoNixtla:mainfrom
janrth:enhancement/validations_on_update
Jan 19, 2026
Merged

Enhancement/validations on update#541
nasaul merged 13 commits intoNixtla:mainfrom
janrth:enhancement/validations_on_update

Conversation

@janrth
Copy link
Copy Markdown
Contributor

@janrth janrth commented Dec 14, 2025

Tries to solve #358

Validates that the update df has the expected shape so that each unique_id starts from the last ds as seen in the previous df and contains the expected number of ds.

For each unique_id the number of ds date points are counted from the observed update df and this is then compared to the expected number of date points, which is calculated by the estimated start and end date. The estimated start date is observed based on the stored series last date + offset(freq).

There is an option to turn off the validate_input step. While overall the performance is pretty fast, it might be a bit annoying if one has hundreds of millions of rows.

Initially I started just checking if the first ds of the update is in the future for each unique_id, but then I felt this is not checking much really and started to implement a stronger logic. The issue itself is a bit vague and I am open for any changes as I implemented based on my interpretation of the task.

Description

Tries to implement more checks on update df

Checklist:

  • This PR has a meaningful title and a clear description.
  • The tests pass.
  • All linting tasks pass.
  • The notebooks are clean.

@janrth
Copy link
Copy Markdown
Contributor Author

janrth commented Dec 14, 2025

@jmoralez I tried to tackle #358

@codspeed-hq
Copy link
Copy Markdown

codspeed-hq Bot commented Dec 28, 2025

Merging this PR will not alter performance

✅ 12 untouched benchmarks


Comparing janrth:enhancement/validations_on_update (f4cc87e) with main (e1f281a)

Open in CodSpeed

Copy link
Copy Markdown
Contributor

@nasaul nasaul left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a good start for this issue, however it requires some more thinking about what validation is actually being done in the function.

  1. Should we validate on the aggregate level or on a individual series approach? If we are doing on an individual series we should be using something like:
for uid in df[self.id_col].unique():
      if uid in self.uids:
          expected_start = self.last_dates[uid] + offset(self.freq)
          actual_start = df[df[self.id_col] == uid][self.time_col].min()
          if actual_start != expected_start:
              raise ValueError(f"Series {uid} starts at {actual_start}, expected {expected_start}")
  1. I think that we should focus right on the tests and build up functionality from there, the test should include:
  • Valid continuous updates
  • Invalid gaps in data
  • Invalid starting dates
  • New series
  • Different frequencies
  • Both pandas and polars DataFrames

Feel free to discuss if the proposed test are good enough or if we should focus on other validations also.

Comment thread mlforecast/core.py Outdated
Comment thread mlforecast/core.py Outdated
Comment thread mlforecast/core.py Outdated
Comment thread mlforecast/core.py
Comment thread mlforecast/core.py
Copy link
Copy Markdown
Contributor

@nasaul nasaul left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall looks good, however in order to merge you should address the following:

  • Polars categorical encoding mismatch (core.py) - All 5 Polars tests fail. The join operation fails because the categorical columns have different encodings. Needs string casting before join, like the pandas branch does.
  • Type hint is wrong (core.py) - Says pd.DataFrame but should be DataFrame since it handles both pandas and polars
  • Add docstring for validate_input parameter in both both forecast.py and core.py

Copy link
Copy Markdown
Contributor

@nasaul nasaul left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey Jan, I've actually updated the tests in order to capture different frequencies and the solution doesn't hold. We have to use the offset from utilsforecast ufp.offset_times in order to make it work.

Copy link
Copy Markdown
Contributor

@nasaul nasaul left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@nasaul nasaul merged commit ea35835 into Nixtla:main Jan 19, 2026
21 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants