feat: incremental download using lockfile #141
Conversation
Force-pushed from 21ab38b to 90f211c
```python
LockfileManager.init()
for page in pages:
    override_output_path_config(output_path)
    _page = Page.from_id(int(page)) if page.isdigit() else Page.from_url(page)
```
The `pages` command downloads the page body directly here, so it's difficult to filter using version information.
```python
from confluence_markdown_exporter.confluence import Page

with measure(f"Export pages {', '.join(pages)} with descendants"):
    override_output_path_config(output_path)
```
`override_output_path_config` does not need to be called multiple times, so I moved this line.
Force-pushed from 15987a9 to 0b5dbd3
```python
"homepage_title": sanitize_filename(Page.from_id(self.space.homepage).title),
"ancestor_ids": "/".join(str(a) for a in self.ancestors),
"ancestor_titles": "/".join(
    sanitize_filename(Page.from_id(a).title) for a in self.ancestors
```
In the `_template_vars` property, although the results are cached, `Page.from_id(a).title for a in self.ancestors` unnecessarily calls `from_id`. This gets in the way of speeding things up with the incremental option.
So I added an `Ancestor` type to store ancestor information.
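A minimal sketch of such a type (the field names and use of a dataclass are assumptions; the actual implementation may differ):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Ancestor:
    """Caches the ancestor data that _template_vars needs, so it can be
    built once instead of calling Page.from_id per ancestor on every access."""

    id: int
    title: str
```

Storing the title alongside the id means template rendering never has to touch the API or the page cache again.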
Hi @Spenhouet, I use confluence-markdown-exporter regularly, and it’s a tool I really like and rely on, so thank you for maintaining it.
@naoki-tateyama thanks for your work on this. I'm currently pretty busy, I hope I can find some time soon. Sorry if this will take some time. Maybe it gives some time for you to battle test this implementation.
Force-pushed from 391a8cb to 2516f41
Force-pushed from 7b52e61 to 5029d3d
Force-pushed from c7519f8 to 4201ccf
```diff
 return (
-    " > ".join([self.convert_page_link(ancestor) for ancestor in self.page.ancestors])
+    " > ".join(
+        [self.convert_page_link(ancestor.id) for ancestor in self.page.ancestors]
```
`convert_page_link` takes an integer while `ancestor.id` is a string. Can you check which it is and either convert `ancestor.id` to `int` or ensure that `convert_page_link` can handle the string?
```diff
-for page_id in (pbar := tqdm(page_ids, smoothing=0.05)):
-    pbar.set_postfix_str(f"Exporting page {page_id}")
-    export_page(page_id)
+for page in (pbar := tqdm(pages, smoothing=0.05)):
```
We could prefilter before starting the tqdm to only show a progress bar for pages to be exported:

```python
pages_to_export = [page for page in pages if LockfileManager.should_export(page)]
if not pages_to_export:
    logger.info("No pages to export based on lockfile state.")
    return

for page in (pbar := tqdm(pages_to_export, smoothing=0.05)):
```
```python
existing.pages.update(self.pages)
existing.last_export = datetime.now(timezone.utc).isoformat()

json_str = existing.model_dump_json(indent=2)
```
The pages list should be sorted by key (page id) so that the file is more stable and diffs make sense when the file is tracked via git.
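A sketch of the idea, with a plain dict standing in for the pydantic model (names and entry shape are hypothetical):

```python
import json

# Hypothetical lockfile mapping: page id -> entry.
pages = {
    "5678": {"title": "B", "version": 2},
    "1234": {"title": "A", "version": 7},
}

# Sort by numeric page id so repeated exports serialize entries in a
# stable order and git diffs of the lockfile stay meaningful.
stable = dict(sorted(pages.items(), key=lambda kv: int(kv[0])))
json_str = json.dumps(stable, indent=2)
```

Since Python dicts preserve insertion order, rebuilding the dict from sorted items is enough to make the serialized output deterministic.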
`confluence_markdown_exporter/main.py` (outdated):
```python
    ),
] = None,
*,
incremental: Annotated[
```
I'd like to have this as a default config/setting instead. So no parameter option for it, just a config option where by default this is turned on.
`confluence_markdown_exporter/main.py` (outdated):
```python
@app.command(help="Delete exported files that are not tracked in the lockfile.")
def prune(
```
Why do this as a separate command? Because it could be destructive?
Why not always run this? Could also make this a config option which someone can disable if this is not desired.
@Spenhouet Thank you for your comments! I'll work on them soon.
Regarding this comment: for users coming from prior versions of cme, files that were already downloaded are not tracked in `confluence-lock.json`. If prune were always executed, files not in the lockfile would be deleted.
This may be inconvenient for users who run several export commands in a row, like

```shell
cme pages 1234
cme pages 5678
```

The export of 1234 would delete the files for 5678.
Since the 5678 files would be re-downloaded by the second command this may not be a real problem, but the deletion may not be what users intended.
To let users control the behavior explicitly, I intentionally made `prune` a separate command.
What do you think?
Maybe we need to rethink how the pruning works. IMO we need to detect which files previously tracked in the lock file are now gone, i.e. an entry removed from the lock file results in removal of the file on disk. Similar for renamed or moved pages.
Some brainstorming here:

- For every page a new scan hits and compares against the lock file (`record_page`), we can tell if the `export_path` changed, and we know the previous `export_path`, so we can delete the file at the previous export path at that point in time while executing `record_page`. This should already cover moved and renamed pages.
- That leaves us with deleted pages. These are a bit more tricky. Note that we are not guaranteed that the command executed is always against the whole space (or against the scope of the lock file), which means we can not simply "delete everything that is in the lock file but wasn't in the sync". We could use that info to narrow down how many pages of the lock file we might need to check. At the end of a run we could get the list of pages which are in the lock file but were not in the sync results. Then we could query all these pages and see if they still exist. For all pages which are in the lock file but no longer exist in Confluence, we delete the old page file on disk. This way we only perform the deletion for previously synced pages and nothing else. That check is a bit expensive, but I don't yet have a better idea.
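A minimal sketch of the first idea, assuming a hypothetical `record_page` helper and lockfile shape (names are illustrative, not the actual implementation):

```python
from pathlib import Path


def record_page(lockfile: dict, page_id: str, export_path: str) -> None:
    """Record a page in the lockfile; if its export path changed, the page
    was renamed or moved, so delete the stale file at the old path."""
    previous = lockfile.get(page_id)
    if previous and previous["export_path"] != export_path:
        old_file = Path(previous["export_path"])
        if old_file.exists():
            old_file.unlink()  # remove the file at the previous location
    lockfile[page_id] = {"export_path": export_path}
```

Handling the move at record time keeps the cleanup local to the page being synced, with no extra scan needed for renames.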
Addressed in 9e4369c. Replaced the separate `prune` command with automatic cleanup during export. Each lockfile entry now records `command` and `args` to define its scope. On cleanup, pages no longer present in the current scope are automatically deleted from disk and the lockfile. Moved pages (changed `export_path`) also have their old files removed.
I checked if we can run batch requests and Sonnet provided this script, which uses the v2 API and a fallback via CQL for instances without v2 API:
```python
# Helper functions (_load_lock, _get_credentials, _make_session, _get_v2_base)
# are defined elsewhere in the full script; the imports and batch-size
# constants they rely on are shown here for completeness.
import argparse
import math
import sys
from pathlib import Path

import requests

V2_BATCH_SIZE = 250  # v2 API accepts up to 250 ids per request
CQL_BATCH_SIZE = 25  # smaller batches stay within CQL aggregator limits


# ---------------------------------------------------------------------------
# Confluence existence check — batched requests
# ---------------------------------------------------------------------------
def _fetch_existing_ids_v2(
    session: requests.Session,
    v2_base: str,
    all_ids: list[str],
) -> set[str]:
    """Atlassian Cloud: GET /wiki/api/v2/pages?id=X&id=Y&...&limit=250.

    One request per batch of V2_BATCH_SIZE IDs. IDs present in the response
    exist; IDs absent from the response are deleted.
    """
    existing: set[str] = set()
    n_batches = math.ceil(len(all_ids) / V2_BATCH_SIZE)
    for batch_num, start in enumerate(range(0, len(all_ids), V2_BATCH_SIZE), 1):
        batch = all_ids[start : start + V2_BATCH_SIZE]
        print(
            f"  Batch {batch_num}/{n_batches} ({len(batch)} IDs) ...",
            end="\r",
            flush=True,
        )
        params: list[tuple[str, str | int]] = [("id", pid) for pid in batch]
        params.append(("limit", len(batch)))
        r = session.get(f"{v2_base}/pages", params=params)
        if not r.ok:
            print(
                f"\nERROR: v2 pages request failed (HTTP {r.status_code}).\n"
                f"Response: {r.text[:400]}",
                file=sys.stderr,
            )
            sys.exit(1)
        for item in r.json().get("results", []):
            existing.add(str(item["id"]))
    print(" " * 60, end="\r")
    return existing


def _fetch_existing_ids_cql(
    session: requests.Session,
    api_base: str,
    all_ids: list[str],
) -> set[str]:
    """Self-hosted fallback: CQL id in (...) in batches of CQL_BATCH_SIZE.

    Smaller batches (25) stay well within the CQL aggregator limits.
    """
    existing: set[str] = set()
    n_batches = math.ceil(len(all_ids) / CQL_BATCH_SIZE)
    for batch_num, start in enumerate(range(0, len(all_ids), CQL_BATCH_SIZE), 1):
        batch = all_ids[start : start + CQL_BATCH_SIZE]
        print(
            f"  Batch {batch_num}/{n_batches} ({len(batch)} IDs) ...",
            end="\r",
            flush=True,
        )
        cql = "id in (" + ",".join(batch) + ")"
        r = session.get(
            f"{api_base}/content/search",
            params={"cql": cql, "limit": len(batch), "fields": "id"},
        )
        if not r.ok:
            print(
                f"\nERROR: CQL query failed (HTTP {r.status_code}).\n"
                f"Response: {r.text[:400]}",
                file=sys.stderr,
            )
            sys.exit(1)
        for item in r.json().get("results", []):
            existing.add(str(item["id"]))
    print(" " * 60, end="\r")
    return existing


# ---------------------------------------------------------------------------
# Entry point
# ---------------------------------------------------------------------------
def main() -> None:
    parser = argparse.ArgumentParser(
        description="Find pages in a .confluence-lock.json that no longer exist in Confluence.",
    )
    parser.add_argument(
        "--lock",
        default=str(Path(__file__).parent / ".confluence-lock.json"),
        help="Path to .confluence-lock.json (default: <script-dir>/.confluence-lock.json)",
    )
    args = parser.parse_args()

    lock_path = Path(args.lock)
    pages = _load_lock(lock_path)
    all_ids = list(pages.keys())
    print(f"Lock file: {lock_path.resolve()}")
    print(f"Total pages in lock: {len(all_ids)}")

    api_base, username, api_token, pat = _get_credentials()
    session = _make_session(username, api_token, pat)

    # Derive raw URL from api_base to build v2 URL
    raw_url = api_base.replace("/wiki/rest/api", "").replace("/rest/api", "")
    v2_base = _get_v2_base(raw_url)

    if v2_base:
        n_batches = math.ceil(len(all_ids) / V2_BATCH_SIZE)
        print(
            f"Confluence API (v2): {v2_base}\n"
            f"Checking {len(all_ids)} pages in {n_batches} batch(es) "
            f"of up to {V2_BATCH_SIZE} IDs ..."
        )
        existing_ids = _fetch_existing_ids_v2(session, v2_base, all_ids)
    else:
        n_batches = math.ceil(len(all_ids) / CQL_BATCH_SIZE)
        print(
            f"Confluence API (v1 CQL): {api_base}\n"
            f"Checking {len(all_ids)} pages in {n_batches} batch(es) "
            f"of up to {CQL_BATCH_SIZE} IDs ..."
        )
        existing_ids = _fetch_existing_ids_cql(session, api_base, all_ids)

    deleted_ids = sorted(set(all_ids) - existing_ids, key=int)
    print(f"Result: {len(existing_ids)} existing, {len(deleted_ids)} deleted.\n")
    if not deleted_ids:
        print("No deleted pages found.")
        return

    print(f"Deleted pages ({len(deleted_ids)}):")
    print(f"{'ID':<15} {'Title':<50} export_path")
    print("-" * 120)
    for page_id in deleted_ids:
        entry = pages[page_id]
        title = entry.get("title", "")
        export_path = entry.get("export_path", "")
        print(f"{page_id:<15} {title:<50} {export_path}")


if __name__ == "__main__":
    main()
```
> Each lockfile entry now records `command` and `args` to define its scope
Not a fan of recording the scope and args in the lock file. Also, it should be possible to e.g. only sync a single page without all other previously synced pages being deleted. We should only delete pages which were synced before and truly no longer exist on Confluence.
I'd prefer to do the batch scan at the end of the sync for all pages that were not within the sync results but are in the lock file.
We might want to add config options for the auto prune (default on) and for the v2 and CQL batch sizes.
Addressed in ed2f5a1. Replaced the scope-based approach with a v2 API batch check (`wiki/api/v2/pages`) during cleanup. Unseen lockfile pages are checked against Confluence in batches of 250, and only pages confirmed to no longer exist are deleted from disk and the lockfile. Old files for moved pages (changed `export_path`) are also cleaned up.
Added an `export.cleanup_stale` config option (default: `True`) to enable/disable this behavior.
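For illustration, assuming a JSON config format (the file location and surrounding schema are assumptions; only the `export.cleanup_stale` key and its default come from this PR), disabling the automatic cleanup could look like:

```json
{
  "export": {
    "cleanup_stale": false
  }
}
```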
@naoki-tateyama thanks for the implementation. I like it. It's clean and minimal and works well. See my review remarks above. When they are addressed I'm happy to merge this.
Co-authored-by: Sebastian Penhouet <Spenhouet@users.noreply.github.com>
`convert_page_link` expects `int`, so align the `Ancestor` type accordingly. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Only show the progress bar for pages that will actually be exported, instead of including skipped pages in the count. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace the `--incremental` CLI flag with an `export.skip_unchanged` config setting. Defaults to true so lockfile-based incremental export is always on. Uses an init() pattern with a helper function `init_lockfile_if_enabled()`. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Force-pushed from 9538fe3 to 9e4369c
After export, batch-check unseen lockfile pages against the Confluence API (CQL `id in ...`) and delete local files for pages no longer on Confluence. Also delete old files when a page's export path changes. Add a configurable `export.cleanup_stale` option (default: True). Remove the `prune` command. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Force-pushed from 9e4369c to ed2f5a1
naoki-tateyama left a comment
@Spenhouet Thank you for your follow-up commits. I reviewed them and it's LGTM. I don't have any updates for this PR, so if you are ready, please merge this and publish the newer version! 🙏
@naoki-tateyama Thanks for your work on this!
Summary
Introduced incremental download using a lockfile.
When exporting multiple docs with descendants, the program only downloads new and updated files by matching their versions and file paths.
This reduces unnecessary download time.
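The version-and-path matching can be sketched like this (the lockfile entry shape and helper name are assumptions, not the actual implementation):

```python
def should_export(lockfile: dict, page_id: str, remote_version: int, export_path: str) -> bool:
    """Export only pages that are new, have a changed version, or moved."""
    entry = lockfile.get(page_id)
    if entry is None:
        return True  # never exported before
    # Re-export when Confluence reports a different version or the
    # target path changed (renamed/moved page).
    return entry["version"] != remote_version or entry["export_path"] != export_path
```

Unchanged pages fail all three conditions and are skipped without any page-body download.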
- `incremental` option: downloads only new and updated files with the `pages`, `pages_with_descendants`, `spaces`, and `all_spaces` commands.
- `prune` command: deletes files which are not tracked by the lockfile. A `dry-run` mode shows the target files to delete. I intentionally created this as a separate command because users may run export commands multiple times at one time; if every command deleted untracked files, it could cause undesirable deletions.

Related issues
Performance improvement
Although the `from_id` method results are cached, `_template_vars` calls `from_id` for every ancestor. It's very costly.
I introduced an `Ancestor` type to store ancestor information.
Test Plan
I tried the `pages_with_descendants` command and the `spaces` command in the local environment, and they worked well. After fetching 884 files in a space with the `spaces` command, the second execution of the `spaces` command took 18 s to finish.