
WIP: collect metadata for each task #32

Merged

rudokemper merged 4 commits into main from collect-task-metadata on Feb 7, 2024

Conversation

@rudokemper
Member

Goal

This is an approach to collecting metadata about each job run by map-gl-renderer, whether through the CLI or via a queue message. For the latter environment, we can later write this metadata to a database table. I would like to get some quick feedback on this approach before proceeding further: does the way in which I've implemented this, and the data I am collecting, make sense to others? Anything else to add?

Log output

This is the metadata content upon a successful job:

{
  style: 'self',
  status: 'success',
  errorCode: null,
  errorMessage: null,
  filename: 'output.mbtiles',
  filesize: 200704,
  numberOfTiles: 6,
  numberOfAttempts: 1,
  workBegun: '2024-02-05T16:09:37.338Z',
  workEnded: '2024-02-05T16:09:37.812Z',
  expiration: '2025-02-04T21:58:49.812Z'
}

This is the metadata content when I have intentionally broken something in generateMBTiles:

{
  style: 'self',
  status: 'failed',
  errorCode: '500',
  errorMessage: 'Error writing to MBTiles file: calculateeTileRangeForBounds is not defined',
  filename: undefined,
  filesize: undefined,
  numberOfTiles: undefined,
  numberOfAttempts: 1,
  workBegun: '2024-02-05T16:10:39.625Z',
  workEnded: '2024-02-05T16:10:39.628Z',
  expiration: null
}

What I changed

These are the variables I am collecting:

  • style: The type of style requested (mapbox, bing, etc.).
  • status: Success or failure.
  • errorCode: The idea here is to document an error code that clarifies whether the failure was the result of an internal error or of erroneous user input. Although this is a headless tool, we can use HTTP response codes for client (400) or server (500) errors as a way to codify this.
  • errorMessage: The content of the specific message being thrown.
  • filename: The name of the file (e.g. the value of output).
  • filesize: Size of the file in bytes.
  • numberOfTiles: The number of tiles stored in the mbtiles db. Figured this could be a helpful thing to show on the front end.
  • numberOfAttempts: Right now this is just set to 1, since we haven't implemented any iterative attempts to reprocess upon failure yet.
  • workBegun: Timestamp captured at the very beginning of the job.
  • workEnded: Timestamp captured upon success.
  • expiration: Timestamp + 1 year, captured upon success. (Thinking we could use this to eventually purge files that have expired.)

This is implemented by returning these values instead of throwing an error directly. I have added this at every likely failure point in the main function initiateRendering, as well as in generateMBTiles (where much of the success metadata is generated).
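
As a rough sketch of that pattern (the signatures and the shape of generateMBTiles's return value here are assumptions, not the PR's actual code), both the happy path and the failure path resolve to a metadata object rather than letting an exception propagate:

const initiateRendering = async (style, output) => {
  const workBegun = new Date().toISOString();
  try {
    // generateMBTiles is assumed to report back what it wrote
    const { filename, filesize, numberOfTiles } = await generateMBTiles(style, output);
    return {
      style,
      status: "success",
      errorCode: null,
      errorMessage: null,
      filename,
      filesize,
      numberOfTiles,
      numberOfAttempts: 1,
      workBegun,
      workEnded: new Date().toISOString(),
      // roughly one year from now
      expiration: new Date(Date.now() + 31556952000).toISOString(),
    };
  } catch (error) {
    // handleError returns a failure-shaped metadata object instead of rethrowing
    return {
      ...handleError(error, "generating MBTiles"),
      style,
      workBegun,
      workEnded: new Date().toISOString(),
    };
  }
};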

What I am not doing

Right now I'm just console logging the metadata upon completion. I am not passing or writing this anywhere as of yet.

I have not considered timeout errors. I haven't seen one, but it's possible the flow could hang somewhere.

@rudokemper requested a review from IamJeffG February 5, 2024 16:28
Contributor

@IamJeffG left a comment


I mostly agree with your proposed fields to communicate a task result. I left some ideas for small-and-easy changes in the comments, if you agree with them.

Am I correct that the idea is that azure_queue_service.js will write the metadata contents back to various columns in the SQL row?

Comment thread src/initiate.js Outdated
workBegun,
workEnded: new Date().toISOString(),
// if status = success, set expiration for one year
expiration:
Contributor


Do you have any mechanism (as of now) to actually clean up expired records?

If not, I'd be inclined to leave expiration out of this, and maybe add it later once we know how that cleanup is going to work. For example, I could envision an implementation in which some "garbage collector" runs periodically, and it knows the business rules of when to clean up a record. For that matter, we may want to change those rules at some point. In both those cases, the expiration need not be set at task finish time, but rather at expiration time.

The one reason I can think of to include this right now is if the expiration will need to be surfaced in a UI well in advance.

Member Author


No mechanism as of yet. I was thinking of our discussion on the MapPacker design doc to expire offline maps, and that the best opportunity for us to set an expiration timestamp would be at this stage of task finishing. But we could definitely leave this out until we're ready to tackle cleanup as a separate batch of work.

Comment thread src/initiate.js Outdated
? new Date(Date.now() + 31556952000).toISOString()
: null,
};
} catch (error) {
Contributor


I am not super experienced with running node scripts from the command line, but my guess is that the user experience (for a technical user, anyway) might actually be better if you never catch the exception, if one was thrown. My guess is that, when running from the CLI, the script will print the entire stack trace as well as the error message, and this will be much more helpful for a technical developer to spot what went wrong and how to fix it.

In other words, I'm suggesting: what if handleError was not called in initiate, but instead only in azure_queue_service.js? And the try/catch likewise lives in azure_queue_service.js, not here in initiateRendering(). I'm suggesting that each entrypoint has different optimal behavior for how to share the error back with the caller.

(I am assuming that CLI will not ever write back to the SQL DB)
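
A minimal sketch of that split (processMessage and the message fields are hypothetical; the real azure_queue_service.js wiring may differ):

// azure_queue_service.js (sketch): the queue entrypoint owns the try/catch,
// so initiateRendering can throw freely and the CLI path still surfaces a
// full stack trace.
const processMessage = async (message) => {
  try {
    const metadata = await initiateRendering(message.style, message.output);
    // later: write metadata back to the task's SQL row
    console.log(metadata);
  } catch (error) {
    // only the queue path converts the exception into failure metadata
    console.log(handleError(error, "rendering from queue message"));
  }
};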

Member Author


I fully agree with you on this point. I'll push up a commit with a revision of when handleError is called (or not).

Comment thread src/utils.js Outdated
export const handleError = (error, message) => {
return {
status: "failed",
errorCode: "500",
Contributor


I know I alluded to 4xx and 5xx codes being used to differentiate between "caller's fault" vs "my fault", but I was thinking of it as an analogy. Since there is no HTTP call, "code 500" feels a bit too literal?

You could probably even omit errorCode and overload status with possible values of:

* BadRequest
* InternalError
* Pending
* Success

Note that I would expect the webapp's /status/<taskid> endpoint to always return HTTP 200 (as long as the taskid is valid), even if the task failed or is pending, because it's successfully returning the task status. So the errorCode set here wouldn't actually end up exposed there.

That said, what you have seems clear enough that I'm also fine if you want to leave it.
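
If you went that route, the overloaded status could just be a small frozen map (a sketch; the value names are taken from the list above):

// Possible task statuses if errorCode is folded into status (sketch)
const TaskStatus = Object.freeze({
  BAD_REQUEST: "BadRequest",       // caller's fault; retrying won't help
  INTERNAL_ERROR: "InternalError", // our fault; a retry might succeed
  PENDING: "Pending",
  SUCCESS: "Success",
});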

Member Author


Yeah, I too was thinking of these codes as an analogy, and wasn't sure yet where to actually go with the error codes. I would likely have rewritten them to be something more accurate on the front end anyway, and I think your proposal to expand the range of possible values in status is a better solution 👍

Comment thread src/initiate.js Outdated
numberOfTiles: metadata.numberOfTiles,
numberOfAttempts: 1,
workBegun,
workEnded: new Date().toISOString(),
Contributor


I'd report workBegun and workEnded even if there's an error.

Contributor


You know, maybe azure_queue_service.js should handle workBegun and workEnded. That way it could write workBegun to the DB even before the work is finished; plus you get the workEnded addition for free even in case of an error.

OTOH this would not be so good if it's important that the CLI keep track of these too.
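
One way to get both properties (a sketch; updateTaskRow is a hypothetical DB helper, not something in this PR):

// In azure_queue_service.js: owning the timestamps here means workBegun is
// visible in the DB while the task runs, and workEnded is recorded even if
// initiateRendering throws.
const workBegun = new Date().toISOString();
await updateTaskRow(taskId, { workBegun });
try {
  const metadata = await initiateRendering(message.style, message.output);
  await updateTaskRow(taskId, metadata);
} finally {
  await updateTaskRow(taskId, { workEnded: new Date().toISOString() });
}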

Member Author


I do think it could be interesting to return these for the CLI too. I say this from first-hand experience of running similar Node tools overnight and not being sure when they actually finished.

Comment thread src/initiate.js Outdated
status: metadata.status,
errorCode: metadata.errorCode,
errorMessage: metadata.errorMessage,
filename: metadata.filename,
Contributor


Is this going to be an absolute or relative path?

If absolute, is it relative to the local volume mount? A URL?

I think I might return a relative path from under the volume mount (i.e. not including the volume mount top level dir); that way the webapp server and this task worker might use different local mount locations, but below that the relative path works in both.

An Azure storage URL might be more ideal except that technically the task worker doesn't know about Azure Storage; that's why I might lean away from it.
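
In Node terms, a sketch of that portability (the mount paths here are made up for illustration):

import path from "node:path";

// The task worker and the webapp may mount the same share at different local
// paths; storing a path relative to the mount keeps the filename portable.
const workerMount = "/mnt/worker-data";   // hypothetical
const webappMount = "/srv/webapp-data";   // hypothetical
const absoluteOnWorker = "/mnt/worker-data/outputs/output.mbtiles";

const storedPath = path.relative(workerMount, absoluteOnWorker);
// -> "outputs/output.mbtiles" (what would go into the metadata)

const absoluteOnWebapp = path.join(webappMount, storedPath);
// -> "/srv/webapp-data/outputs/output.mbtiles"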

Member Author


Yep, relative path. GCV works the same way - we provide the mount location as an env var, and we can do similar for the MapPacker front end.

Comment thread src/initiate.js Outdated
const msg =
"Stylesheet has local mbtiles file sources, but no sourceDir is set";
throw new Error(msg);
return handleError(new Error(msg), "checking for local mbtiles sources");
Contributor


This is an example of a bad request, right? The caller has to change something before this will ever succeed.

I might define (in utils.js) two separate error handlers, one for bad client request, one for unexpected server error. The two would set different errorCode (as you have it now) or different status (per an idea I share in src/utils.js).

A minor variation of this is to define two different Error subclasses and only throw BadRequestError(msg) here, leaving the entry point (clip.js or azure_queue_service.js) to catch that error if it wants.
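
A minimal sketch of that subclass pattern (BadRequestError is the hypothetical name used above):

// Two error classes let each entrypoint decide how to map a failure:
// BadRequestError means the caller's fault; anything else is an internal error.
class BadRequestError extends Error {
  constructor(message) {
    super(message);
    this.name = "BadRequestError";
  }
}

// In initiateRendering:
//   throw new BadRequestError(
//     "Stylesheet has local mbtiles file sources, but no sourceDir is set",
//   );

// In azure_queue_service.js:
//   const status = error instanceof BadRequestError ? "BadRequest" : "InternalError";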

Comment thread src/azure_queue_service.js Outdated
// Pass the extracted values to initiateRendering
await initiateRendering(
// Pass the extracted values to initiateRendering, and receive the metadata
const metadata = await initiateRendering(
Contributor


nitpick: choose a more specific word. "metadata" can be used for all sorts of stuff.

idea: taskResult, renderResult

Comment thread src/initiate.js Outdated
filename: metadata.filename,
filesize: metadata.filesize,
numberOfTiles: metadata.numberOfTiles,
numberOfAttempts: 1,
Contributor


I'm thinking your code does not need to handle retries explicitly; rather, the queuing system will retry delivering the same message if it hasn't succeeded (i.e., been deleted from the queue by the task worker) within the lease time.

All your task worker code has to do is read DequeuedMessageItem.dequeueCount:

const response = await sourceQueueClient.receiveMessages({
  visibilityTimeout: 2 * 60 * 60, // lease time, in seconds
});
const message = response.receivedMessageItems[0];
// dequeueCount tells you how many times this message has been delivered
const numberOfAttempts = message.dequeueCount;

@rudokemper
Member Author

Thanks @IamJeffG, helpful comments and feedback all around. I have pushed up changes reflecting all of the above points. The most notable change is implementing the error handling in azure_queue_service.js, and updating status to BadRequest or InternalServerError depending on where the exception was caught. Exceptions encountered via the CLI return a stack trace as before.

Contributor

@IamJeffG left a comment


looking good!

@rudokemper marked this pull request as ready for review February 7, 2024 19:45
@rudokemper merged commit e56fa36 into main Feb 7, 2024
@rudokemper deleted the collect-task-metadata branch February 7, 2024 19:46


Development

Successfully merging this pull request may close these issues.

For Azure Storage Queue, return status of request (and legible 4xx/5xx errors for failed requests)
