
feat: incremental download using lockfile #141

Merged
Spenhouet merged 23 commits into Spenhouet:main from naoki-tateyama:feature/incremental-download
Mar 8, 2026

Conversation


@naoki-tateyama naoki-tateyama commented Feb 3, 2026

Summary

Introduced incremental download using a lockfile.
When exporting multiple docs with their descendants, the program only downloads new and updated files by matching their versions and file paths.
This reduces unnecessary download time.

  • incremental option: downloads only new and updated files with the pages, pages_with_descendants, spaces, and all_spaces commands (see the sketch below).
  • prune command: deletes files that are not tracked by the lockfile. A dry-run mode shows the files that would be deleted. I intentionally made this a separate command because users may run export commands several times in a row; if every export deleted untracked files, it could cause undesirable deletions.
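
For illustration, the version check behind the incremental option works roughly like this (a minimal sketch; the lockfile layout and the should_export name are assumptions, not the exact implementation):

    import json
    from pathlib import Path

    LOCKFILE = Path(".confluence-lock.json")

    def should_export(page_id: str, current_version: int) -> bool:
        """Skip pages whose recorded version already matches Confluence."""
        if not LOCKFILE.exists():
            return True  # no lockfile yet: export everything
        entry = json.loads(LOCKFILE.read_text()).get("pages", {}).get(page_id)
        # Export when the page is new or its Confluence version has advanced.
        return entry is None or entry.get("version", -1) < current_version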

Related issues

Performance improvement

Although from_id results are cached, _template_vars calls from_id for every ancestor, which is very costly.
I introduced an Ancestor type to store ancestor information.
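
For example, such an Ancestor type only needs to carry the id and title captured from the page response, so later lookups avoid repeated from_id calls (a sketch; anything beyond these two fields is an assumption):

    from typing import NamedTuple

    class Ancestor(NamedTuple):
        """Ancestor info captured once, avoiding repeated Page.from_id calls."""
        id: int
        title: str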

Test Plan

I tried the pages_with_descendants and spaces commands in my local environment, and they worked well.
After fetching 884 files in a space with the spaces command, a second run of the same spaces command took 18 s.

time uv run confluence-markdown-exporter spaces {space_key} --incremental
  100%|██████████████████████████| 884/884 [00:00<00:00, 8128.21it/s, Skipping page {page_id} (no changes)]
  uv run confluence-markdown-exporter spaces space_key --incremental  0.62s user 0.14s system 4% cpu 18.222 total

@naoki-tateyama naoki-tateyama force-pushed the feature/incremental-download branch from 21ab38b to 90f211c Compare February 3, 2026 14:33
LockfileManager.init()
for page in pages:
    override_output_path_config(output_path)
    _page = Page.from_id(int(page)) if page.isdigit() else Page.from_url(page)

Contributor Author

The pages command downloads the page body directly here, so it's difficult to filter using version information.

from confluence_markdown_exporter.confluence import Page

with measure(f"Export pages {', '.join(pages)} with descendants"):
    override_output_path_config(output_path)

Contributor Author

override_output_path_config does not need to be called multiple times, so I moved this line.

@naoki-tateyama naoki-tateyama force-pushed the feature/incremental-download branch from 15987a9 to 0b5dbd3 Compare February 8, 2026 04:38
"homepage_title": sanitize_filename(Page.from_id(self.space.homepage).title),
"ancestor_ids": "/".join(str(a) for a in self.ancestors),
"ancestor_titles": "/".join(
sanitize_filename(Page.from_id(a).title) for a in self.ancestors

Contributor Author

In the _template_vars property, although the results are cached, Page.from_id(a).title for a in self.ancestors unnecessarily calls from_id. This gets in the way of speeding things up with the incremental option, so I added an Ancestor type to store ancestor information.
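
With such a type in place, _template_vars can read titles directly instead of refetching each ancestor; roughly (a sketch, not the exact diff):

    "ancestor_ids": "/".join(str(a.id) for a in self.ancestors),
    "ancestor_titles": "/".join(
        sanitize_filename(a.title) for a in self.ancestors
    ),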

@naoki-tateyama
Contributor Author

Hi @Spenhouet, I use confluence-markdown-exporter regularly, and it’s a tool I really like and rely on, so thank you for maintaining it.
I’ve opened this PR and would really appreciate it if you could take a look when you have time.
After opening the initial PR, I noticed a potential performance issue and pushed an additional commit to address it. The current version includes that fix.

@Spenhouet
Owner

Spenhouet commented Feb 8, 2026

@naoki-tateyama thanks for your work on this. I'm currently pretty busy, I hope I can find some time soon. Sorry if this will take some time. Maybe it gives some time for you to battle test this implementation.

@naoki-tateyama naoki-tateyama force-pushed the feature/incremental-download branch from 391a8cb to 2516f41 Compare February 9, 2026 00:39
@naoki-tateyama naoki-tateyama force-pushed the feature/incremental-download branch 3 times, most recently from 7b52e61 to 5029d3d Compare February 10, 2026 23:55
@naoki-tateyama naoki-tateyama force-pushed the feature/incremental-download branch from c7519f8 to 4201ccf Compare February 11, 2026 06:40
return (
" > ".join([self.convert_page_link(ancestor) for ancestor in self.page.ancestors])
" > ".join(
[self.convert_page_link(ancestor.id) for ancestor in self.page.ancestors]

Owner

convert_page_link takes an integer while ancestor.id is a string. Can you check what it is and either adjust ancestor.id to int or ensure that convert_page_link can handle the string?

Contributor Author

Addressed in 6f15581.

for page_id in (pbar := tqdm(page_ids, smoothing=0.05)):
    pbar.set_postfix_str(f"Exporting page {page_id}")
    export_page(page_id)
for page in (pbar := tqdm(pages, smoothing=0.05)):

Owner

We could prefilter before starting the tqdm to only show a progress bar for pages to be exported:

    pages_to_export = [page for page in pages if LockfileManager.should_export(page)]

    if not pages_to_export:
        logger.info("No pages to export based on lockfile state.")
        return

    for page in (pbar := tqdm(pages_to_export, smoothing=0.05)):

Contributor Author

Addressed in 02562ff.

existing.pages.update(self.pages)
existing.last_export = datetime.now(timezone.utc).isoformat()

json_str = existing.model_dump_json(indent=2)

Owner

The pages list should be sorted by key (page id) so that the file is more stable and diffs make sense when the file is tracked via git.
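
One way to get a stable ordering, assuming pages is a dict keyed by page id (a sketch, not the merged change):

    # Sort entries numerically by page id so the serialized order is
    # deterministic and git diffs stay minimal.
    existing.pages = dict(
        sorted(existing.pages.items(), key=lambda item: int(item[0]))
    )
    json_str = existing.model_dump_json(indent=2)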

Contributor Author

Addressed in 509d3e6.

),
] = None,
*,
incremental: Annotated[

Owner

I'd like to have this as a default config/setting instead. So no parameter option for it, just a config option where by default this is turned on.
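
A config field along these lines would cover it (a sketch assuming a pydantic config model; the setting eventually landed as export.skip_unchanged):

    from pydantic import BaseModel, Field

    class ExportConfig(BaseModel):
        # Lockfile-based incremental export, on by default; set to False
        # to force a full re-download of every page.
        skip_unchanged: bool = Field(default=True)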

Contributor Author

Addressed in 4253132.



@app.command(help="Delete exported files that are not tracked in the lockfile.")
def prune(

Owner

Why do this as a separate command? Because it could be destructive?

Why not always run this? Could also make this a config option which someone can disable if this is not desired.

Contributor Author

@Spenhouet Thank you for your comments! I'll work on them soon.

Regarding this comment: for users coming from prior versions of cme, already-downloaded files are not tracked in confluence-lock.json. If the prune command were always executed, files not in the lockfile would be deleted.
This may be inconvenient for users who run multiple export commands, like

cme pages 1234
cme pages 5678

Running the export for 1234 would delete the files for 5678. Since the second command would download 5678 again, this may not be a real problem, but the deletion may not be what users intend.
To let users explicitly control the behavior, I intentionally made prune a separate command.
What do you think?

Owner

Maybe we need to rethink how the pruning works. IMO we need to detect which files previously tracked in the lock file are now gone, i.e. an entry removed from the lock file results in removal of the file on disk. Similar for renamed or moved pages.

Owner

Some brainstorming here:

  1. For every page a new scan hits and compares against the lock file (record_page), we can tell if the export_path changed, and we know the previous export_path, so we can delete the file at the previous export path while executing record_page. This should already cover moved and renamed pages.
  2. That leaves us with deleted pages. These are a bit more tricky. Note that we are not guaranteed that the command executed is always against the whole space (or against the scope of the lock file), which means we cannot simply "delete everything that is in the lock file but wasn't in the sync". We could still use that info to narrow down how many pages of the lock file we need to check: at the end of a run we could get the list of pages which are in the lock file but were not in the sync results, then query all of these pages and see if they still exist. For all pages which are in the lock file but no longer exist in Confluence, we delete the old page file on disk (see the sketch below). This way we only perform the deletion for previously synced pages and for nothing else. That check is a bit expensive, but I don't yet have a better idea.
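
A minimal sketch of option 2 at the end of a run (lockfile_pages, synced_ids, and page_exists are hypothetical names, not the actual implementation):

    from pathlib import Path

    def delete_stale_pages(lockfile_pages: dict[str, dict], synced_ids: set[str]) -> None:
        """Delete files only for previously synced pages that are truly gone."""
        # Pages tracked in the lock file but absent from this run's results.
        for page_id in set(lockfile_pages) - synced_ids:
            # Re-query Confluence: delete only if the page no longer exists,
            # not merely because it was outside the current command's scope.
            if not page_exists(page_id):  # hypothetical existence check
                Path(lockfile_pages[page_id]["export_path"]).unlink(missing_ok=True)
                del lockfile_pages[page_id]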

Contributor Author

@naoki-tateyama naoki-tateyama Mar 4, 2026

Addressed in 9e4369c. Replaced the separate prune command with automatic cleanup during export. Each lockfile entry now records command and args to define its scope. On cleanup, pages no longer present in the current scope are automatically deleted from disk and the lockfile. Moved pages (changed export_path) also have their old files removed.

Owner

I checked if we can run batch requests, and Sonnet provided this script, which uses the v2 API and a fallback via CQL for instances without the v2 API:

# ---------------------------------------------------------------------------
# Confluence existence check — batched requests
# ---------------------------------------------------------------------------
# Helpers _load_lock, _get_credentials, _make_session, and _get_v2_base are
# defined elsewhere in the script and omitted here.

import argparse
import math
import sys
from pathlib import Path

import requests

V2_BATCH_SIZE = 250  # Atlassian Cloud v2 API accepts up to 250 IDs per request
CQL_BATCH_SIZE = 25  # smaller batches stay well within CQL aggregator limits

def _fetch_existing_ids_v2(
    session: requests.Session,
    v2_base: str,
    all_ids: list[str],
) -> set[str]:
    """Atlassian Cloud: GET /wiki/api/v2/pages?id=X&id=Y&...&limit=250.

    One request per batch of V2_BATCH_SIZE IDs.  IDs present in the response
    exist; IDs absent from the response are deleted.
    """
    existing: set[str] = set()
    n_batches = math.ceil(len(all_ids) / V2_BATCH_SIZE)

    for batch_num, start in enumerate(range(0, len(all_ids), V2_BATCH_SIZE), 1):
        batch = all_ids[start : start + V2_BATCH_SIZE]
        print(
            f"  Batch {batch_num}/{n_batches} ({len(batch)} IDs) ...",
            end="\r",
            flush=True,
        )
        params: list[tuple[str, str | int]] = [("id", pid) for pid in batch]
        params.append(("limit", len(batch)))
        r = session.get(f"{v2_base}/pages", params=params)
        if not r.ok:
            print(
                f"\nERROR: v2 pages request failed (HTTP {r.status_code}).\n"
                f"Response: {r.text[:400]}",
                file=sys.stderr,
            )
            sys.exit(1)
        for item in r.json().get("results", []):
            existing.add(str(item["id"]))

    print(" " * 60, end="\r")
    return existing


def _fetch_existing_ids_cql(
    session: requests.Session,
    api_base: str,
    all_ids: list[str],
) -> set[str]:
    """Self-hosted fallback: CQL id in (...) in batches of CQL_BATCH_SIZE.

    Smaller batches (25) stay well within the CQL aggregator limits.
    """
    existing: set[str] = set()
    n_batches = math.ceil(len(all_ids) / CQL_BATCH_SIZE)

    for batch_num, start in enumerate(range(0, len(all_ids), CQL_BATCH_SIZE), 1):
        batch = all_ids[start : start + CQL_BATCH_SIZE]
        print(
            f"  Batch {batch_num}/{n_batches} ({len(batch)} IDs) ...",
            end="\r",
            flush=True,
        )
        cql = "id in (" + ",".join(batch) + ")"
        r = session.get(
            f"{api_base}/content/search",
            params={"cql": cql, "limit": len(batch), "fields": "id"},
        )
        if not r.ok:
            print(
                f"\nERROR: CQL query failed (HTTP {r.status_code}).\n"
                f"Response: {r.text[:400]}",
                file=sys.stderr,
            )
            sys.exit(1)
        for item in r.json().get("results", []):
            existing.add(str(item["id"]))

    print(" " * 60, end="\r")
    return existing


# ---------------------------------------------------------------------------
# Entry point
# ---------------------------------------------------------------------------

def main() -> None:
    parser = argparse.ArgumentParser(
        description="Find pages in a .confluence-lock.json that no longer exist in Confluence.",
    )
    parser.add_argument(
        "--lock",
        default=str(Path(__file__).parent / ".confluence-lock.json"),
        help="Path to .confluence-lock.json (default: <script-dir>/.confluence-lock.json)",
    )
    args = parser.parse_args()

    lock_path = Path(args.lock)
    pages = _load_lock(lock_path)
    all_ids = list(pages.keys())

    print(f"Lock file: {lock_path.resolve()}")
    print(f"Total pages in lock: {len(all_ids)}")

    api_base, username, api_token, pat = _get_credentials()
    session = _make_session(username, api_token, pat)

    # Derive raw URL from api_base to build v2 URL
    raw_url = api_base.replace("/wiki/rest/api", "").replace("/rest/api", "")
    v2_base = _get_v2_base(raw_url)

    if v2_base:
        n_batches = math.ceil(len(all_ids) / V2_BATCH_SIZE)
        print(
            f"Confluence API (v2): {v2_base}\n"
            f"Checking {len(all_ids)} pages in {n_batches} batch(es) "
            f"of up to {V2_BATCH_SIZE} IDs ..."
        )
        existing_ids = _fetch_existing_ids_v2(session, v2_base, all_ids)
    else:
        n_batches = math.ceil(len(all_ids) / CQL_BATCH_SIZE)
        print(
            f"Confluence API (v1 CQL): {api_base}\n"
            f"Checking {len(all_ids)} pages in {n_batches} batch(es) "
            f"of up to {CQL_BATCH_SIZE} IDs ..."
        )
        existing_ids = _fetch_existing_ids_cql(session, api_base, all_ids)

    deleted_ids = sorted(set(all_ids) - existing_ids, key=int)

    print(f"Result: {len(existing_ids)} existing, {len(deleted_ids)} deleted.\n")

    if not deleted_ids:
        print("No deleted pages found.")
        return

    print(f"Deleted pages ({len(deleted_ids)}):")
    print(f"{'ID':<15}  {'Title':<50}  export_path")
    print("-" * 120)
    for page_id in deleted_ids:
        entry = pages[page_id]
        title = entry.get("title", "")
        export_path = entry.get("export_path", "")
        print(f"{page_id:<15}  {title:<50}  {export_path}")

Owner

> Each lockfile entry now records command and args to define its scope

Not a fan of recording the scope and args in the lock file. Also, it should be possible to e.g. only sync a single page without all other previously synced pages being deleted. We should only delete pages which were synced before and truly no longer exist on Confluence.

I'd prefer to do the batch scan at the end of the sync for all pages that were not within the sync results but are in the lock file.

Owner

We might want to add config options for the auto prune (default: on) and for the v2 and CQL batch sizes.

Contributor Author

Addressed in ed2f5a1. Replaced the scope-based approach with a v2 API batch check (wiki/api/v2/pages) during cleanup. Unseen lockfile pages are checked against Confluence in batches of 250, and only pages confirmed to no longer exist are deleted from disk and the lockfile. Old files for moved pages (changed export_path) are also cleaned up.

Added export.cleanup_stale config option (default: True) to enable/disable this behavior.

@Spenhouet
Owner

@naoki-tateyama thanks for the implementation. I like it. It's clean and minimal and works well.

See my review remarks above. When they are addressed I'm happy to merge this.

naoki-tateyama and others added 5 commits March 4, 2026 22:23
Co-authored-by: Sebastian Penhouet <Spenhouet@users.noreply.github.com>
convert_page_link expects int, so align the Ancestor type accordingly.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Only show progress bar for pages that will actually be exported,
instead of including skipped pages in the count.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace --incremental CLI flag with export.skip_unchanged config setting.
Defaults to true so lockfile-based incremental export is always on.
Uses init() pattern with a helper function init_lockfile_if_enabled().

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@naoki-tateyama naoki-tateyama force-pushed the feature/incremental-download branch from 9538fe3 to 9e4369c Compare March 4, 2026 15:44
After export, batch-check unseen lockfile pages against Confluence API
(CQL id in ...) and delete local files for pages no longer on Confluence.
Also delete old files when a page's export path changes. Add configurable
export.cleanup_stale option (default: True). Remove prune command.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@naoki-tateyama naoki-tateyama force-pushed the feature/incremental-download branch from 9e4369c to ed2f5a1 Compare March 4, 2026 16:33
Contributor Author

@naoki-tateyama naoki-tateyama left a comment

@Spenhouet Thank you for your follow-up commits. I reviewed them and they LGTM. I don't have any updates for this PR, so if you are ready, please merge this and publish the new version! 🙏

@Spenhouet Spenhouet merged commit d302023 into Spenhouet:main Mar 8, 2026
1 check passed
@Spenhouet
Owner

@naoki-tateyama Thanks for your work on this!
