New tool: ingest_email.py #111
gvanrossum left a comment:
Let's treat this as a draft PR
    """Ingest email files into a database."""

    # Collect all .eml files
    if verbose:
Is there a cleaner way to do this verbose printing? Could there be a printVerbose() method that we call instead of print() and then it decides to print the message or not based on the verbose flag?
It would feel a little "clever", so I'll skip it.
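For reference, the helper suggested above could look roughly like the sketch below. This is not code from the PR: `make_verbose_printer` and `print_verbose` are hypothetical names, and the message text is made up for the demonstration.

```python
import io
from contextlib import redirect_stdout


def make_verbose_printer(verbose: bool):
    """Return a print-like function that is silent unless verbose is True.

    Hypothetical helper; the PR itself keeps plain `if verbose: print(...)`.
    """

    def print_verbose(*args, **kwargs) -> None:
        if verbose:
            print(*args, **kwargs)

    return print_verbose


# Capture stdout to show that only the verbose printer emits text.
buf = io.StringIO()
with redirect_stdout(buf):
    make_verbose_printer(True)("Found 3 .eml files")
    make_verbose_printer(False)("this line is suppressed")
captured = buf.getvalue()
```

The trade-off raised in the reply still applies: the wrapper saves one `if` per call site but hides the condition from the reader.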
    print(f" {preview}")

    # Pass source_id to mark as ingested atomically with the message
    source_ids = [email_id] if email_id else None
Maybe emit a warning when the email id doesn't exist.
Do we try to create one based on the from/to/timestamp/subject hash if we don't have one otherwise?
We could. I'll defer doing that until there's demand.
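If that demand ever shows up, the fallback could be sketched as follows. Everything here is an assumption for illustration: the function name `fallback_source_id`, the `hash:` prefix, the choice of SHA-256, and the exact set of fields are not part of the PR.

```python
import hashlib


def fallback_source_id(
    sender: str, recipients: list[str], timestamp: str, subject: str
) -> str:
    """Derive a stable id from header fields (hypothetical helper)."""
    # Sort recipients so that their order in the header does not change the id.
    # Join with the ASCII unit separator so adjacent fields cannot run together.
    parts = [sender, ",".join(sorted(recipients)), timestamp, subject]
    digest = hashlib.sha256("\x1f".join(parts).encode("utf-8")).hexdigest()
    # The prefix makes synthesized ids distinguishable from real ones.
    return f"hash:{digest}"


id_a = fallback_source_id("a@example.com", ["b@example.com", "c@example.com"],
                          "2025-01-01T00:00:00Z", "Hello")
id_b = fallback_source_id("a@example.com", ["c@example.com", "b@example.com"],
                          "2025-01-01T00:00:00Z", "Hello")
id_c = fallback_source_id("a@example.com", ["b@example.com", "c@example.com"],
                          "2025-01-01T00:00:00Z", "Hello again")
```

Here `id_a` and `id_b` are equal because only the recipient order differs, while `id_c` differs because the subject changed.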
    def mark_source_ingested(self, source_id: str) -> None:
        """Mark a source as ingested.
        This performs an INSERT but does NOT commit. It should be called within
I haven't tried this, but the docs say you can do cursor.in_transaction to see if there's an active transaction. That could keep someone from calling this if there's no active transaction.
Meh, this function only exists so add_messages_with_indexing can call it. I'm not too worried about users calling it wrongly.
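For anyone who does want the guard, here is a minimal sketch, assuming the storage provider sits on Python's built-in `sqlite3` module (the thread does not say so explicitly). Note that in `sqlite3` the `in_transaction` attribute lives on the connection object, not on the cursor. The `Store` class and the `IngestedSources` table are hypothetical stand-ins for the real provider.

```python
import sqlite3


class Store:
    """Hypothetical stand-in for the storage provider."""

    def __init__(self) -> None:
        self.db = sqlite3.connect(":memory:")
        self.db.execute("CREATE TABLE IngestedSources (source_id TEXT PRIMARY KEY)")
        self.db.commit()

    def mark_source_ingested(self, source_id: str) -> None:
        # Refuse to run outside a transaction: the caller owns commit/rollback.
        if not self.db.in_transaction:
            raise RuntimeError("mark_source_ingested() must be called inside a transaction")
        self.db.execute(
            "INSERT OR IGNORE INTO IngestedSources VALUES (?)", (source_id,)
        )


store = Store()

# Without an open transaction the guard trips.
try:
    store.mark_source_ingested("msg-1")
    guard_tripped = False
except RuntimeError:
    guard_tripped = True

# Inside an explicit transaction the insert goes through.
store.db.execute("BEGIN")
store.mark_source_ingested("msg-1")
store.db.commit()
row_count = store.db.execute("SELECT COUNT(*) FROM IngestedSources").fetchone()[0]
```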
This does the same as @add_messages in test_gmail.py does, but now the storage provider interface has changed to allow storing the "is it ingested" flag per message id in the database, so it is transactionally safe.
(Almost all by Claude Opus 4.5 Preview.)
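To illustrate what "transactionally safe" buys here, the sketch below stores a message and its ingested flag in one transaction, so a failure on either insert rolls back both. It assumes a `sqlite3` backend; the table names and the `add_message` function are hypothetical and much simpler than the real add_messages_with_indexing.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE Messages (body TEXT NOT NULL)")
db.execute("CREATE TABLE IngestedSources (source_id TEXT PRIMARY KEY)")


def add_message(body: str, source_id: str) -> None:
    """Insert the message and its ingested flag as one unit (hypothetical)."""
    with db:  # commits on success, rolls back if an exception escapes
        db.execute("INSERT INTO Messages VALUES (?)", (body,))
        db.execute("INSERT INTO IngestedSources VALUES (?)", (source_id,))


add_message("hello", "msg-1")

# A second ingest of the same source id violates the primary key, so the
# message insert made in the same transaction is rolled back as well.
try:
    add_message("hello again", "msg-1")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

message_count = db.execute("SELECT COUNT(*) FROM Messages").fetchone()[0]
source_count = db.execute("SELECT COUNT(*) FROM IngestedSources").fetchone()[0]
```

After the failed second call, both tables still hold exactly one row, so the flag and the message never disagree.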
I'm not in a hurry to remove test_email.py -- we need to make sure that all its use cases are now incorporated into either ingest_email.py or query.py. I think @parse_messages is the only one left.