feat: incremental download using lockfile #141
Conversation
Force-pushed from 21ab38b to 90f211c
```python
LockfileManager.init()
for page in pages:
    override_output_path_config(output_path)
    _page = Page.from_id(int(page)) if page.isdigit() else Page.from_url(page)
```
The `pages` command downloads the page body directly here, so it's difficult to filter using version information.
```python
from confluence_markdown_exporter.confluence import Page

with measure(f"Export pages {', '.join(pages)} with descendants"):
    override_output_path_config(output_path)
```
`override_output_path_config` does not need to be called multiple times, so I moved this line.
Force-pushed from 15987a9 to 0b5dbd3
```python
"homepage_title": sanitize_filename(Page.from_id(self.space.homepage).title),
"ancestor_ids": "/".join(str(a) for a in self.ancestors),
"ancestor_titles": "/".join(
    sanitize_filename(Page.from_id(a).title) for a in self.ancestors
```
In the `_template_vars` property, although the results are cached, `Page.from_id(a).title for a in self.ancestors` unnecessarily calls `from_id`. This gets in the way of speeding things up with the incremental option.
So I added an `Ancestor` type to store ancestor information.
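A minimal sketch of such a type (the field names and use of a dataclass are assumptions; the actual implementation may differ):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Ancestor:
    """Caches the ancestor data that _template_vars needs, so it can be
    built once instead of calling Page.from_id per ancestor on every access."""

    id: int
    title: str
```

Storing the title alongside the id means template rendering never has to touch the API or the page cache again.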
Hi @Spenhouet, I use confluence-markdown-exporter regularly, and it’s a tool I really like and rely on, so thank you for maintaining it.
@naoki-tateyama thanks for your work on this. I'm currently pretty busy, I hope I can find some time soon. Sorry if this will take some time. Maybe it gives some time for you to battle test this implementation.
Force-pushed from 391a8cb to 2516f41
Force-pushed from 7b52e61 to 5029d3d
Force-pushed from c7519f8 to 4201ccf
```diff
 return (
-    " > ".join([self.convert_page_link(ancestor) for ancestor in self.page.ancestors])
+    " > ".join(
+        [self.convert_page_link(ancestor.id) for ancestor in self.page.ancestors]
```
`convert_page_link` takes an integer while `ancestor.id` is a string. Can you check which it is and either convert `ancestor.id` to `int` or ensure that `convert_page_link` can handle the string?
```diff
-for page_id in (pbar := tqdm(page_ids, smoothing=0.05)):
-    pbar.set_postfix_str(f"Exporting page {page_id}")
-    export_page(page_id)
+for page in (pbar := tqdm(pages, smoothing=0.05)):
```
We could prefilter before starting the tqdm to only show a progress bar for pages to be exported:

```python
pages_to_export = [page for page in pages if LockfileManager.should_export(page)]
if not pages_to_export:
    logger.info("No pages to export based on lockfile state.")
    return

for page in (pbar := tqdm(pages_to_export, smoothing=0.05)):
```
```python
existing.pages.update(self.pages)
existing.last_export = datetime.now(timezone.utc).isoformat()

json_str = existing.model_dump_json(indent=2)
```
The pages list should be sorted by key (page id) so that the file is more stable and diffs make sense when the file is tracked via git.
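A sketch of the idea, with a plain dict standing in for the pydantic model (names and entry shape are hypothetical):

```python
import json

# Hypothetical lockfile mapping: page id -> entry.
pages = {
    "5678": {"title": "B", "version": 2},
    "1234": {"title": "A", "version": 7},
}

# Sort by numeric page id so repeated exports serialize entries in a
# stable order and git diffs of the lockfile stay meaningful.
stable = dict(sorted(pages.items(), key=lambda kv: int(kv[0])))
json_str = json.dumps(stable, indent=2)
```

Since Python dicts preserve insertion order, rebuilding the dict from sorted items is enough to make the serialized output deterministic.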
`confluence_markdown_exporter/main.py` (outdated):
```python
    ),
] = None,
*,
incremental: Annotated[
```
I'd like to have this as a default config/setting instead. So no parameter option for it, just a config option where by default this is turned on.
`confluence_markdown_exporter/main.py` (outdated):
```python
@app.command(help="Delete exported files that are not tracked in the lockfile.")
def prune(
```
Why do this as a separate command? Because it could be destructive?
Why not always run this? Could also make this a config option which someone can disable if this is not desired.
@Spenhouet Thank you for your comments! I'll work on them soon.
Regarding this comment: for users coming from prior versions of cme, files that were already downloaded are not tracked in `confluence-lock.json`. If prune were always executed, files not in the lockfile would be deleted.
This may be inconvenient for users who run several export commands in a row, like

```shell
cme pages 1234
cme pages 5678
```

The export of 1234 would delete the files for 5678.
Since the 5678 files would be re-downloaded by the second command this may not be a real problem, but the deletion may not be what users intended.
To let users control the behavior explicitly, I intentionally made `prune` a separate command.
What do you think?
Maybe we need to rethink how the pruning works. IMO we need to detect which files previously tracked in the lock file are now gone, i.e. an entry removed from the lock file results in removal of the file on disk. Similar for renamed or moved pages.
Some brainstorming here:

- For every page a new scan hits and compares against the lock file (`record_page`), we can tell if the `export_path` changed, and we know the previous `export_path`, so we can delete the file at the previous export path at that point in time while executing `record_page`. This should already cover moved and renamed pages.
- That leaves us with deleted pages. These are a bit more tricky. Note that we are not guaranteed that the command executed is always against the whole space (or against the scope of the lock file), which means we can not simply "delete everything that is in the lock file but wasn't in the sync". We could use that info to narrow down how many pages of the lock file we might need to check. At the end of a run we could get the list of pages which are in the lock file but were not in the sync results. Then we could query all these pages and see if they still exist. For all pages which are in the lock file but no longer exist in Confluence, we delete the old page file on disk. This way we only perform the deletion for previously synced pages and nothing else. That check is a bit expensive, but I don't yet have a better idea.
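A minimal sketch of the first idea, assuming a hypothetical `record_page` helper and lockfile shape (names are illustrative, not the actual implementation):

```python
from pathlib import Path


def record_page(lockfile: dict, page_id: str, export_path: str) -> None:
    """Record a page in the lockfile; if its export path changed, the page
    was renamed or moved, so delete the stale file at the old path."""
    previous = lockfile.get(page_id)
    if previous and previous["export_path"] != export_path:
        old_file = Path(previous["export_path"])
        if old_file.exists():
            old_file.unlink()  # remove the file at the previous location
    lockfile[page_id] = {"export_path": export_path}
```

Handling the move at record time keeps the cleanup local to the page being synced, with no extra scan needed for renames.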
Addressed in 9e4369c. Replaced the separate `prune` command with automatic cleanup during export. Each lockfile entry now records `command` and `args` to define its scope. On cleanup, pages no longer present in the current scope are automatically deleted from disk and the lockfile. Moved pages (changed `export_path`) also have their old files removed.
I checked if we can run batch requests and Sonnet provided this script, which uses the v2 API and a fallback via CQL for instances without v2 API:
```python
# Helper functions (_load_lock, _get_credentials, _make_session, _get_v2_base)
# are defined elsewhere in the full script; the imports and batch-size
# constants they rely on are shown here for completeness.
import argparse
import math
import sys
from pathlib import Path

import requests

V2_BATCH_SIZE = 250  # v2 API accepts up to 250 ids per request
CQL_BATCH_SIZE = 25  # smaller batches stay within CQL aggregator limits


# ---------------------------------------------------------------------------
# Confluence existence check — batched requests
# ---------------------------------------------------------------------------
def _fetch_existing_ids_v2(
    session: requests.Session,
    v2_base: str,
    all_ids: list[str],
) -> set[str]:
    """Atlassian Cloud: GET /wiki/api/v2/pages?id=X&id=Y&...&limit=250.

    One request per batch of V2_BATCH_SIZE IDs. IDs present in the response
    exist; IDs absent from the response are deleted.
    """
    existing: set[str] = set()
    n_batches = math.ceil(len(all_ids) / V2_BATCH_SIZE)
    for batch_num, start in enumerate(range(0, len(all_ids), V2_BATCH_SIZE), 1):
        batch = all_ids[start : start + V2_BATCH_SIZE]
        print(
            f"  Batch {batch_num}/{n_batches} ({len(batch)} IDs) ...",
            end="\r",
            flush=True,
        )
        params: list[tuple[str, str | int]] = [("id", pid) for pid in batch]
        params.append(("limit", len(batch)))
        r = session.get(f"{v2_base}/pages", params=params)
        if not r.ok:
            print(
                f"\nERROR: v2 pages request failed (HTTP {r.status_code}).\n"
                f"Response: {r.text[:400]}",
                file=sys.stderr,
            )
            sys.exit(1)
        for item in r.json().get("results", []):
            existing.add(str(item["id"]))
    print(" " * 60, end="\r")
    return existing


def _fetch_existing_ids_cql(
    session: requests.Session,
    api_base: str,
    all_ids: list[str],
) -> set[str]:
    """Self-hosted fallback: CQL id in (...) in batches of CQL_BATCH_SIZE.

    Smaller batches (25) stay well within the CQL aggregator limits.
    """
    existing: set[str] = set()
    n_batches = math.ceil(len(all_ids) / CQL_BATCH_SIZE)
    for batch_num, start in enumerate(range(0, len(all_ids), CQL_BATCH_SIZE), 1):
        batch = all_ids[start : start + CQL_BATCH_SIZE]
        print(
            f"  Batch {batch_num}/{n_batches} ({len(batch)} IDs) ...",
            end="\r",
            flush=True,
        )
        cql = "id in (" + ",".join(batch) + ")"
        r = session.get(
            f"{api_base}/content/search",
            params={"cql": cql, "limit": len(batch), "fields": "id"},
        )
        if not r.ok:
            print(
                f"\nERROR: CQL query failed (HTTP {r.status_code}).\n"
                f"Response: {r.text[:400]}",
                file=sys.stderr,
            )
            sys.exit(1)
        for item in r.json().get("results", []):
            existing.add(str(item["id"]))
    print(" " * 60, end="\r")
    return existing


# ---------------------------------------------------------------------------
# Entry point
# ---------------------------------------------------------------------------
def main() -> None:
    parser = argparse.ArgumentParser(
        description="Find pages in a .confluence-lock.json that no longer exist in Confluence.",
    )
    parser.add_argument(
        "--lock",
        default=str(Path(__file__).parent / ".confluence-lock.json"),
        help="Path to .confluence-lock.json (default: <script-dir>/.confluence-lock.json)",
    )
    args = parser.parse_args()

    lock_path = Path(args.lock)
    pages = _load_lock(lock_path)
    all_ids = list(pages.keys())
    print(f"Lock file: {lock_path.resolve()}")
    print(f"Total pages in lock: {len(all_ids)}")

    api_base, username, api_token, pat = _get_credentials()
    session = _make_session(username, api_token, pat)

    # Derive raw URL from api_base to build v2 URL
    raw_url = api_base.replace("/wiki/rest/api", "").replace("/rest/api", "")
    v2_base = _get_v2_base(raw_url)

    if v2_base:
        n_batches = math.ceil(len(all_ids) / V2_BATCH_SIZE)
        print(
            f"Confluence API (v2): {v2_base}\n"
            f"Checking {len(all_ids)} pages in {n_batches} batch(es) "
            f"of up to {V2_BATCH_SIZE} IDs ..."
        )
        existing_ids = _fetch_existing_ids_v2(session, v2_base, all_ids)
    else:
        n_batches = math.ceil(len(all_ids) / CQL_BATCH_SIZE)
        print(
            f"Confluence API (v1 CQL): {api_base}\n"
            f"Checking {len(all_ids)} pages in {n_batches} batch(es) "
            f"of up to {CQL_BATCH_SIZE} IDs ..."
        )
        existing_ids = _fetch_existing_ids_cql(session, api_base, all_ids)

    deleted_ids = sorted(set(all_ids) - existing_ids, key=int)
    print(f"Result: {len(existing_ids)} existing, {len(deleted_ids)} deleted.\n")
    if not deleted_ids:
        print("No deleted pages found.")
        return

    print(f"Deleted pages ({len(deleted_ids)}):")
    print(f"{'ID':<15} {'Title':<50} export_path")
    print("-" * 120)
    for page_id in deleted_ids:
        entry = pages[page_id]
        title = entry.get("title", "")
        export_path = entry.get("export_path", "")
        print(f"{page_id:<15} {title:<50} {export_path}")


if __name__ == "__main__":
    main()
```
> Each lockfile entry now records `command` and `args` to define its scope
Not a fan of recording the scope and args in the lock file. Also, it should be possible to e.g. only sync a single page without all other previously synced pages being deleted. We should only delete pages which were synced before and truly no longer exist on Confluence.
I'd prefer to do the batch scan at the end of the sync for all pages that were not within the sync results but are in the lock file.
We might want to add config options for the auto prune (default on) and for the v2 and CQL batch sizes.
Addressed in ed2f5a1. Replaced the scope-based approach with a v2 API batch check (`wiki/api/v2/pages`) during cleanup. Unseen lockfile pages are checked against Confluence in batches of 250, and only pages confirmed to no longer exist are deleted from disk and the lockfile. Old files for moved pages (changed `export_path`) are also cleaned up.
Added an `export.cleanup_stale` config option (default: `True`) to enable/disable this behavior.
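For illustration, assuming a JSON config format (the file location and surrounding schema are assumptions; only the `export.cleanup_stale` key and its default come from this PR), disabling the automatic cleanup could look like:

```json
{
  "export": {
    "cleanup_stale": false
  }
}
```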
@naoki-tateyama thanks for the implementation. I like it. It's clean and minimal and works well. See my review remarks above. When they are addressed I'm happy to merge this.
Co-authored-by: Sebastian Penhouet <Spenhouet@users.noreply.github.com>
`convert_page_link` expects `int`, so align the `Ancestor` type accordingly. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Only show the progress bar for pages that will actually be exported, instead of including skipped pages in the count. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace the `--incremental` CLI flag with an `export.skip_unchanged` config setting. Defaults to true so lockfile-based incremental export is always on. Uses an init() pattern with a helper function `init_lockfile_if_enabled()`. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Force-pushed from 9538fe3 to 9e4369c
After export, batch-check unseen lockfile pages against the Confluence API (CQL `id in ...`) and delete local files for pages no longer on Confluence. Also delete old files when a page's export path changes. Add a configurable `export.cleanup_stale` option (default: True). Remove the `prune` command. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Force-pushed from 9e4369c to ed2f5a1
naoki-tateyama left a comment
@Spenhouet Thank you for your follow-up commits. I reviewed them and it's LGTM. I don't have any updates for this PR, so if you are ready, please merge this and publish the newer version! 🙏
@naoki-tateyama Thanks for your work on this!
Summary
Introduced incremental download using a lockfile.
When exporting multiple docs with descendants, the program only downloads new and updated files by matching their versions and file paths.
This reduces unnecessary download time.
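The version-and-path matching can be sketched like this (the lockfile entry shape and helper name are assumptions, not the actual implementation):

```python
def should_export(lockfile: dict, page_id: str, remote_version: int, export_path: str) -> bool:
    """Export only pages that are new, have a changed version, or moved."""
    entry = lockfile.get(page_id)
    if entry is None:
        return True  # never exported before
    # Re-export when Confluence reports a different version or the
    # target path changed (renamed/moved page).
    return entry["version"] != remote_version or entry["export_path"] != export_path
```

Unchanged pages fail all three conditions and are skipped without any page-body download.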
- `incremental` option: downloads only new and updated files with the `pages`, `pages_with_descendants`, `spaces`, and `all_spaces` commands.
- `prune` command: deletes files which are not tracked by the lockfile. A `dry-run` mode shows the target files to delete. I intentionally created this as a separate command because users may run export commands multiple times at one time; if every command deleted untracked files, it could cause undesirable deletions.

Related issues
Performance improvement
Although the `from_id` method results are cached, `_template_vars` calls `from_id` for every ancestor. It's very costly.
I introduced an `Ancestor` type to store ancestor information.
Test Plan
I tried the `pages_with_descendants` command and the `spaces` command in the local environment, and they worked well. After fetching 884 files in a space with the `spaces` command, the second execution of the `spaces` command took 18 s to finish.