
WIP: collect metadata for each task #32

Merged

rudokemper merged 4 commits into main from collect-task-metadata on Feb 7, 2024

Conversation

@rudokemper
Member

Goal

This is an approach to collecting metadata about each job run by map-gl-renderer, whether through the CLI or via a queue message. For the latter environment, we can later write this metadata to a database table. I would like to get some quick feedback on this approach before proceeding further: does the way in which I've implemented this, and the data I am collecting, make sense to others? Anything else to add?

Log output

This is the metadata content upon a successful job:

{
  style: 'self',
  status: 'success',
  errorCode: null,
  errorMessage: null,
  filename: 'output.mbtiles',
  filesize: 200704,
  numberOfTiles: 6,
  numberOfAttempts: 1,
  workBegun: '2024-02-05T16:09:37.338Z',
  workEnded: '2024-02-05T16:09:37.812Z',
  expiration: '2025-02-04T21:58:49.812Z'
}

This is the metadata content when I have intentionally broken something in generateMBTiles:

{
  style: 'self',
  status: 'failed',
  errorCode: '500',
  errorMessage: 'Error writing to MBTiles file: calculateeTileRangeForBounds is not defined',
  filename: undefined,
  filesize: undefined,
  numberOfTiles: undefined,
  numberOfAttempts: 1,
  workBegun: '2024-02-05T16:10:39.625Z',
  workEnded: '2024-02-05T16:10:39.628Z',
  expiration: null
}

What I changed

These are the variables I am collecting:

  • style: The type of style requested (mapbox, bing, etc.).
  • status: Success or failure.
  • errorCode: The idea here is to document an error code that clarifies whether the failure was the result of an internal error or of erroneous user input. Although this is a headless tool, we can use HTTP response codes for client (400) or server (500) errors as a way to codify this.
  • errorMessage: The content of the specific message being thrown.
  • filename: The name of the file (e.g. the value of output).
  • filesize: Size of the file in bytes.
  • numberOfTiles: The number of tiles stored in the mbtiles db. Figured this could be a helpful thing to show on the front end.
  • numberOfAttempts: Right now this is just set to 1, since we haven't implemented any iterative attempts to reprocess upon failure yet.
  • workBegun: Timestamp captured at the very beginning of the job.
  • workEnded: Timestamp captured upon success.
  • expiration: Timestamp + 1 year, captured upon success. (Thinking we could use this to eventually purge files that have expired.)

This is implemented by returning these values instead of throwing an error directly. I have added this at every likely failure point in the main function initiateRendering, as well as in generateMBTiles (where much of the success metadata is generated).
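
As a rough sketch of that pattern (the signatures and the shape of generateMBTiles's return value here are assumptions, not the PR's actual code), both the happy path and the failure path resolve to a metadata object rather than letting an exception propagate:

const initiateRendering = async (style, output) => {
  const workBegun = new Date().toISOString();
  try {
    // generateMBTiles is assumed to report back what it wrote
    const { filename, filesize, numberOfTiles } = await generateMBTiles(style, output);
    return {
      style,
      status: "success",
      errorCode: null,
      errorMessage: null,
      filename,
      filesize,
      numberOfTiles,
      numberOfAttempts: 1,
      workBegun,
      workEnded: new Date().toISOString(),
      // roughly one year from now
      expiration: new Date(Date.now() + 31556952000).toISOString(),
    };
  } catch (error) {
    // handleError returns a failure-shaped metadata object instead of rethrowing
    return {
      ...handleError(error, "generating MBTiles"),
      style,
      workBegun,
      workEnded: new Date().toISOString(),
    };
  }
};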

What I am not doing

Right now I'm just console logging the metadata upon completion. I am not passing or writing this anywhere as of yet.

I have not considered timeout errors. I haven't seen one, but it's possible the flow could hang somewhere.

@rudokemper requested a review from IamJeffG February 5, 2024 16:28
Contributor

@IamJeffG left a comment


I mostly agree with your proposed fields to communicate a task result. I left some ideas for small-and-easy changes in the comments, if you agree with them.

Am I correct that the idea is that azure_queue_service.js will write the metadata contents back to various columns in the SQL row?

Comment thread src/initiate.js Outdated
workBegun,
workEnded: new Date().toISOString(),
// if status = success, set expiration for one year
expiration:
Contributor


Do you have any mechanism (as of now) to actually clean up expired records?

If not, I'd be inclined to leave expiration out of this, and maybe add it later once we know how that cleanup is going to work. For example, I could envision an implementation in which some "garbage collector" runs periodically, and it knows the business rules of when to clean up a record. For that matter, we may want to change those rules at some point. In both those cases, the expiration need not be set at task finish time, but rather at expiration time.

The one reason I can think of to include this right now is if the expiration will need to be surfaced in a UI well in advance.

Member Author


No mechanism as of yet. I was thinking of our discussion on the MapPacker design doc to expire offline maps, and that the best opportunity for us to set an expiration timestamp would be at this stage of task finishing. But we could definitely leave this out until we're ready to tackle cleanup as a separate batch of work.

Comment thread src/initiate.js Outdated
? new Date(Date.now() + 31556952000).toISOString()
: null,
};
} catch (error) {
Contributor


I am not super experienced with running node scripts from the command line, but my guess is that the user experience (for a technical user, anyway) might actually be better if you never catch the exception, if one was thrown. My guess is that, when running from the CLI, the script will print the entire stack trace as well as the error message, and this will be much more helpful for a technical developer to spot what went wrong and how to fix it.

In other words, I'm suggesting: what if handleError was not called in initiate, but instead only in azure_queue_service.js? And the try/catch likewise lives in azure_queue_service.js, not here in initiateRendering(). I'm suggesting that each entrypoint has different optimal behavior for how to share the error back with the caller.

(I am assuming that CLI will not ever write back to the SQL DB)
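
A minimal sketch of that split (processMessage and the message fields are hypothetical; the real azure_queue_service.js wiring may differ):

// azure_queue_service.js (sketch): the queue entrypoint owns the try/catch,
// so initiateRendering can throw freely and the CLI path still surfaces a
// full stack trace.
const processMessage = async (message) => {
  try {
    const metadata = await initiateRendering(message.style, message.output);
    // later: write metadata back to the task's SQL row
    console.log(metadata);
  } catch (error) {
    // only the queue path converts the exception into failure metadata
    console.log(handleError(error, "rendering from queue message"));
  }
};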

Member Author


I fully agree with you on this point. I'll push up a commit with a revision of when handleError is called (or not).

Comment thread src/utils.js Outdated
export const handleError = (error, message) => {
return {
status: "failed",
errorCode: "500",
Contributor


I know I alluded to 4xx and 5xx codes being used to differentiate between "caller's fault" vs "my fault", but I was thinking of it as an analogy. Since there is no HTTP call, "code 500" feels a bit too literal?

You could probably even omit errorCode and overload status with possible values of:

* BadRequest
* InternalError
* Pending
* Success

Note that I would expect the webapp's /status/<taskid> endpoint to always return HTTP 200 (as long as the taskid is valid), even if the task failed or is pending, because it's successfully returning the task status. So the errorCode set here wouldn't actually end up exposed there.

That said, what you have seems clear enough that I'm also fine if you want to leave it.
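
If you went that route, the overloaded status could just be a small frozen map (a sketch; the value names are taken from the list above):

// Possible task statuses if errorCode is folded into status (sketch)
const TaskStatus = Object.freeze({
  BAD_REQUEST: "BadRequest",       // caller's fault; retrying won't help
  INTERNAL_ERROR: "InternalError", // our fault; a retry might succeed
  PENDING: "Pending",
  SUCCESS: "Success",
});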

Member Author


Yeah, I too was thinking of these codes as an analogy, and wasn't sure yet where to actually go with the error codes. I would likely have rewritten them to be something more accurate on the front end anyway, and I think your proposal to expand the range of possible values in status is a better solution 👍

Comment thread src/initiate.js Outdated
numberOfTiles: metadata.numberOfTiles,
numberOfAttempts: 1,
workBegun,
workEnded: new Date().toISOString(),
Contributor


I'd report workBegun and workEnded even if there's an error.

Contributor


You know, maybe azure_queue_service.js should handle workBegun and workEnded. That way it could write workBegun to the DB even before the work is finished; plus you get the workEnded addition for free even in case of an error.

OTOH this would not be so good if it's important that the CLI keep track of these too.
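
One way to get both properties (a sketch; updateTaskRow is a hypothetical DB helper, not something in this PR):

// In azure_queue_service.js: owning the timestamps here means workBegun is
// visible in the DB while the task runs, and workEnded is recorded even if
// initiateRendering throws.
const workBegun = new Date().toISOString();
await updateTaskRow(taskId, { workBegun });
try {
  const metadata = await initiateRendering(message.style, message.output);
  await updateTaskRow(taskId, metadata);
} finally {
  await updateTaskRow(taskId, { workEnded: new Date().toISOString() });
}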

Member Author


I do think it could be interesting to return these for the CLI too. I say this from first-hand experience of running similar Node tools overnight and not being sure when they actually finished.

Comment thread src/initiate.js Outdated
status: metadata.status,
errorCode: metadata.errorCode,
errorMessage: metadata.errorMessage,
filename: metadata.filename,
Contributor


Is this going to be an absolute or relative path?

If absolute, is it relative to the local volume mount? A URL?

I think I might return a relative path from under the volume mount (i.e. not including the volume mount top level dir); that way the webapp server and this task worker might use different local mount locations, but below that the relative path works in both.

An Azure storage URL might be more ideal except that technically the task worker doesn't know about Azure Storage; that's why I might lean away from it.
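
In Node terms, a sketch of that portability (the mount paths here are made up for illustration):

import path from "node:path";

// The task worker and the webapp may mount the same share at different local
// paths; storing a path relative to the mount keeps the filename portable.
const workerMount = "/mnt/worker-data";   // hypothetical
const webappMount = "/srv/webapp-data";   // hypothetical
const absoluteOnWorker = "/mnt/worker-data/outputs/output.mbtiles";

const storedPath = path.relative(workerMount, absoluteOnWorker);
// -> "outputs/output.mbtiles" (what would go into the metadata)

const absoluteOnWebapp = path.join(webappMount, storedPath);
// -> "/srv/webapp-data/outputs/output.mbtiles"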

Member Author


Yep, relative path. GCV works the same way - we provide the mount location as an env var, and we can do similar for the MapPacker front end.

Comment thread src/initiate.js Outdated
const msg =
"Stylesheet has local mbtiles file sources, but no sourceDir is set";
throw new Error(msg);
return handleError(new Error(msg), "checking for local mbtiles sources");
Contributor


This is an example of a bad request, right? The caller has to change something before this will ever succeed.

I might define (in utils.js) two separate error handlers, one for bad client request, one for unexpected server error. The two would set different errorCode (as you have it now) or different status (per an idea I share in src/utils.js).

A minor variation of this is to define two different Error subclasses and only throw BadRequestError(msg) here, leaving the entry point (clip.js or azure_queue_service.js) to catch that error if it wants.
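
A minimal sketch of that subclass pattern (BadRequestError is the hypothetical name used above):

// Two error classes let each entrypoint decide how to map a failure:
// BadRequestError means the caller's fault; anything else is an internal error.
class BadRequestError extends Error {
  constructor(message) {
    super(message);
    this.name = "BadRequestError";
  }
}

// In initiateRendering:
//   throw new BadRequestError(
//     "Stylesheet has local mbtiles file sources, but no sourceDir is set",
//   );

// In azure_queue_service.js:
//   const status = error instanceof BadRequestError ? "BadRequest" : "InternalError";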

Comment thread src/azure_queue_service.js Outdated
// Pass the extracted values to initiateRendering
await initiateRendering(
// Pass the extracted values to initiateRendering, and receive the metadata
const metadata = await initiateRendering(
Contributor


nitpick: choose a more specific word. "metadata" can be used for all sorts of stuff.

idea: taskResult, renderResult

Comment thread src/initiate.js Outdated
filename: metadata.filename,
filesize: metadata.filesize,
numberOfTiles: metadata.numberOfTiles,
numberOfAttempts: 1,
Contributor


I'm thinking your code does not need to handle retries explicitly; rather, the queuing system will retry delivering the same message if it hasn't succeeded (i.e., been deleted from the queue by the task worker) within the lease time.

All your task worker code has to do is read DequeuedMessageItem.dequeueCount:

const response = await sourceQueueClient.receiveMessages({
  visibilityTimeout: 2 * 60 * 60, // lease time, in seconds
});
const message = response.receivedMessageItems[0];
// dequeueCount tells you how many times this message has been delivered
const numberOfAttempts = message.dequeueCount;

@rudokemper
Member Author

Thanks @IamJeffG, helpful comments and feedback all around. I have pushed up changes reflecting all of the above points. The most notable change is implementing the error handling in azure_queue_service.js, and updating status to BadRequest or InternalServerError depending on where the exception was caught. Exceptions encountered via the CLI return a stack trace as before.

Contributor

@IamJeffG left a comment


looking good!

@rudokemper marked this pull request as ready for review February 7, 2024 19:45
@rudokemper merged commit e56fa36 into main Feb 7, 2024
@rudokemper deleted the collect-task-metadata branch February 7, 2024 19:46


Development

Successfully merging this pull request may close these issues.

For Azure Storage Queue, return status of request (and legible 4xx/5xx errors for failed requests)
