Error in user YAML: (<unknown>): did not find expected alphabetic or numeric character while scanning an alias at line 2 column 14

---
title: Local File Sync Strategy with GitHub Delta Updates
description: **GitHub Repository**: 250,000+ markdown files **Previous approach**: GitHub API pagination → would take hours, hit rate limits **New approach**: Local baseline + SHA1 delta updates → instant initi...
category: how-to
tags:
  - tutorial
  - api
  - docker
  - database
  - backend
updated-date: 2025-12-18
difficulty-level: advanced
---

Local File Sync Strategy with GitHub Delta Updates

Problem Solved

GitHub Repository: 250,000+ markdown files Previous approach: GitHub API pagination → would take hours, hit rate limits New approach: Local baseline + SHA1 delta updates → instant initial load, incremental updates

Architecture

GitHub Sync Pipeline

flowchart LR
    GitHub[GitHub\njjohare/logseq\nmainKnowledgeGraph/pages\n998 MD files] --> SHA1{SHA1 check\nor FORCE_FULL_SYNC?}
    SHA1 -->|"FORCE_FULL_SYNC=1\n(bypass filter)"| Parser
    SHA1 -->|"Changed files\n(local SHA1 ≠ GitHub SHA1)"| Parser[GitHubSyncService\nOntologyParser\nKnowledgeGraphParser]
    SHA1 -->|Unchanged files| LocalDisk[Local disk\n/app/data/pages\nuse as-is]
    Parser -->|"public:: true"| KG[KG Page Nodes\nNeo4j :GraphNode\ntype = knowledge]
    Parser -->|"### OntologyBlock"| OWL[OWL Nodes\nNeo4j :OwlClass\nSUBCLASS_OF rels]
    Parser -->|"[[wikilinks]]"| Linked[LinkedPage Nodes\nNeo4j :GraphNode\ntype = linked_page]
    LocalDisk --> Neo4j
    KG --> Neo4j[(Neo4j\nGraph store)]
    OWL --> Neo4j
    Linked --> Neo4j
    Neo4j --> GSA[GraphStateActor\nIn-memory graph\nclassified node sets]
    GSA --> WS[Binary WebSocket\nclient broadcast]

Full sync pipeline: local SHA1 delta filter avoids redundant processing; FORCE_FULL_SYNC=1 bypasses the filter to reprocess all files. Only public:: true files produce KG page nodes; OntologyBlocks are extracted from all files regardless.

Data Flow

┌─────────────────────────────────────────────────────────────┐
│ Host Filesystem: /mnt/mldata/githubs/AR-AI-Knowledge-Graph │
│                  /data/pages (1001 .md files)                │
└──────────────────────────┬──────────────────────────────────┘
                           │ (Docker volume mount :rw)
                           ↓
┌─────────────────────────────────────────────────────────────┐
│ Container: /app/data/pages (persistent, writable)           │
└──────────────────────────┬──────────────────────────────────┘
                           │
                           ↓
┌─────────────────────────────────────────────────────────────┐
│ LocalFileSyncService                                         │
│  1. Scan local files → Vec<PathBuf>                        │
│  2. Calculate SHA1 for each file                            │
│  3. Fetch GitHub SHA1 map (metadata only, paginated)        │
│  4. Compare: local_sha != github_sha?                       │
│  5a. If match: Use local file ✅                            │
│  5b. If mismatch: Download from GitHub, update local 🔄     │
│  6. Process files (parse, enrich, batch save to Neo4j)      │
└─────────────────────────────────────────────────────────────┘

File Type Detection Flow

flowchart TD
    All[All markdown files\n/app/data/pages/*.md] --> DFT[detect_file_type]
    DFT --> Public{"Has\npublic:: true\ntag?"}
    DFT --> Onto{"Contains\n### OntologyBlock?"}
    DFT --> Journal{"Path starts with\njournals/?"}

    Public -->|Yes| KGNode[KG Page Node\nparse wikilinks\n→ Neo4j :GraphNode\ntype=knowledge]
    Public -->|No| Ignored[Not a KG node\nskip for graph]

    Onto -->|Yes, any file| OWLNode[OWL Nodes\nextract axioms\n→ Neo4j :OwlClass\nSUBCLASS_OF rels]

    Journal -->|Yes| Skip[Skip\nnot ingested\ninto graph]

    KGNode -->|"[[wikilinks]]"| Linked[LinkedPage Nodes\n→ Neo4j :GraphNode\ntype=linked_page]

detect_file_type() routes each file to the correct parser: public:: true tag gates KG page node creation; OntologyBlocks are extracted from all files (including non-public ones); journal files are excluded entirely.

Implementation

1. Docker Mount (DONE)

File: docker-compose.yml:16

volumes:
  - ./data/pages:/app/data/pages:rw  # Mount local pages for persistent baseline

Benefits:

Changes persist across container restarts
Can bulk-update pages directory on host
Read/write access for delta updates

2. Local File Sync Service (DONE)

File: src/services/local_file_sync_service.rs

Key Methods:

sync_with_github_delta() - Main entry point
scan_local_pages() - Read local .md files
fetch_github_sha_map() - Get SHA1 hashes from GitHub
calculate_file_sha1() - Compute local file hash
fetch_and_update_file() - Download changed files from GitHub
process_file_content() - Parse KG/ontology data
save_batch() - Persist to Neo4j in batches of 50

Features:

✅ Instant startup (local files available immediately)
✅ Incremental updates (only changed files from GitHub)
✅ Graceful degradation (works offline with local baseline)
✅ SHA1 verification (ensures data integrity)
✅ Batch processing (efficient Neo4j writes)

3. Usage

// In main.rs or sync binary
use crate::services::local_file_sync_service::LocalFileSyncService;

let sync_service = LocalFileSyncService::new(
    content_api,
    kg_repo,
    onto_repo,
    enrichment_service,
);

let stats = sync_service.sync_with_github_delta().await?;

println!("Synced {} files from local baseline", stats.files_synced_from_local);
println!("Updated {} files from GitHub", stats.files_updated_from_github);
println!("Processed {} knowledge graph files", stats.kg_files_processed);

Performance Comparison

Old Approach (GitHub API Only)

Pagination: 250,000 files / 100 per page = 2,500 API calls
Time: ~2,500 * 0.5s = 1,250 seconds (20+ minutes)
Rate limits: High risk
Network dependency: Critical

New Approach (Local + Delta)

Local scan: 1,001 files = ~0.1 seconds
GitHub SHA1 fetch: 2,500 API calls (metadata only) = ~5 minutes (one-time)
Delta updates: ~10-50 files/day = <10 seconds
Total initial sync: ~5 minutes (vs 20+ minutes)
Daily updates: <10 seconds (vs 20+ minutes)

Maintenance Workflow

Initial Baseline Setup

# On host machine
cd /mnt/mldata/githubs/AR-AI-Knowledge-Graph/data/pages

# Option 1: Git clone (recommended for bulk download)
git clone --depth 1 --filter=blob:none \
  https://github.com/jjohare/logseq.git temp
cp temp/mainKnowledgeGraph/pages/*.md .
rm -rf temp

# Option 2: Existing baseline (you already have 1001 files)
# Just keep them, they'll be delta-updated automatically

Daily Sync

# Inside container or via API endpoint
cargo run --bin sync_local_github

# Or via HTTP API
curl -X POST http://localhost:4000/api/admin/sync

Force Re-download All

# Clear local cache, re-sync everything
rm /mnt/mldata/githubs/AR-AI-Knowledge-Graph/data/pages/*.md

# On next sync, all files will be downloaded from GitHub
cargo run --bin sync_local_github

Configuration

Environment Variables

# .env file
GITHUB_OWNER=jjohare
GITHUB_REPO=logseq
GITHUB_BRANCH=main
GITHUB_BASE_PATH=mainKnowledgeGraph/pages
GITHUB_TOKEN=github_pat_...

# Local pages directory (inside container)
LOCAL_PAGES_DIR=/app/data/pages

# Batch size for Neo4j writes
SYNC_BATCH_SIZE=50

Advantages Over GitHub-Only Approach

Performance
- Initial load: Instant (reads from local disk)
- Updates: Only changed files (typically <50/day)
- No pagination overhead for 250k files
Reliability
- Works offline with local baseline
- GitHub API failures don't block operations
- Rate limit resilient
Persistence
- Volume mount survives container restarts
- Can pre-populate on host machine
- Easy bulk updates (git clone, rsync, etc.)
Efficiency
- SHA1 comparison is cheap (metadata-only API calls)
- Only downloads changed files
- Batched database writes
Scalability
- Handles millions of files (limited by disk, not API)
- Incremental sync scales linearly
- No GitHub API pagination bottleneck

Future Enhancements

Git Tree API (More Efficient SHA1 Fetching)

// Instead of paginating contents API, use git tree API
// GET /repos/{owner}/{repo}/git/trees/{branch}?recursive=1
// Returns ALL files + SHA1 in single response (up to 100k files)

async fn fetch_github_tree_sha_map(&self) -> Result<HashMap<String, String>, String> {
    let url = format!(
        "https://api.github.com/repos/{}/{}/git/trees/{}?recursive=1",
        owner, repo, branch
    );

    let response = self.client.get(&url).send().await?;
    let tree: GitTree = response.json().await?;

    let sha_map = tree.tree
        .into_iter()
        .filter(|entry| entry.path.starts_with("mainKnowledgeGraph/pages/")
                     && entry.path.ends_with(".md"))
        .map(|entry| (entry.path.split('/').last().unwrap().to_string(), entry.sha))
        .collect();

    Ok(sha_map)
}

Benefits: Single API call for entire repository tree (up to 100k entries)

Parallel Downloads

use tokio::task::JoinSet;

// Download multiple changed files in parallel
let mut tasks = JoinSet::new();

for file_name in files_to_update {
    let service = self.clone();
    tasks.spawn(async move {
        service.fetch_and_update_file(file_name).await
    });
}

while let Some(result) = tasks.join_next().await {
    match result {
        Ok(Ok(_)) => stats.files_updated += 1,
        Ok(Err(e)) => stats.errors.push(e),
        Err(e) => stats.errors.push(format!("Task error: {}", e)),
    }
}

Watch Mode

use notify::{Watcher, RecursiveMode};

// Watch local directory for changes, auto-sync to Neo4j
let watcher = notify::recommended_watcher(|event| {
    if let Ok(Event::Modify(..)) = event {
        // File changed, re-process and update Neo4j
    }
})?;

watcher.watch(LOCAL_PAGES_DIR, RecursiveMode::NonRecursive)?;

Integration with Existing Code

Replace GitHubSyncService Calls

Before:

let github_sync = GitHubSyncService::new(...);
let stats = github_sync.sync_graphs().await?;

After:

let local_sync = LocalFileSyncService::new(...);
let stats = local_sync.sync_with_github_delta().await?;

Add to services/mod.rs

pub mod local_file_sync_service;
pub use local_file_sync_service::LocalFileSyncService;

Create Sync Binary

// bin/sync_local.rs
#[tokio::main]
async fn main() -> Result<()> {
    let local_sync = LocalFileSyncService::new(...);
    let stats = local_sync.sync_with_github_delta().await?;

    println!("✅ Sync complete:");
    println!("  - {} files from local", stats.files_synced_from_local);
    println!("  - {} files updated from GitHub", stats.files_updated_from_github);
    println!("  - Duration: {:?}", stats.duration);

    Ok(())
}

Testing

# 1. Verify mount
docker exec visionclaw_container ls -la /app/data/pages | head -20

# 2. Check file count
docker exec visionclaw_container find /app/data/pages -name "*.md" | wc -l

# 3. Test SHA1 calculation
docker exec visionclaw_container sha1sum /app/data/pages/ADAS.md

# 4. Run sync service
docker exec visionclaw_container cargo run --bin sync_local

# 5. Verify Neo4j data
docker exec visionclaw-neo4j cypher-shell -u neo4j -p visionclaw-dev-password \
  "MATCH (n:GraphNode) RETURN count(n) AS total"

Rollback Plan

If issues arise, revert to GitHub-only sync:

# docker-compose.yml - Comment out local mount
volumes:
  # - ./data/pages:/app/data/pages:rw  # DISABLED

// Use GitHubSyncService instead of LocalFileSyncService
let github_sync = GitHubSyncService::new(...);

Status: Implemented and ready for testing Next Steps:

Add to src/services/mod.rs
Create bin/sync_local.rs
Test with mounted volume
Monitor delta updates
Document in main README

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Local File Sync Strategy with GitHub Delta Updates

Problem Solved

Architecture

GitHub Sync Pipeline

Data Flow

File Type Detection Flow

Implementation

1. Docker Mount (DONE)

2. Local File Sync Service (DONE)

3. Usage

Performance Comparison

Old Approach (GitHub API Only)

New Approach (Local + Delta)

Maintenance Workflow

Initial Baseline Setup

Daily Sync

Force Re-download All

Configuration

Environment Variables

Advantages Over GitHub-Only Approach

Future Enhancements

Git Tree API (More Efficient SHA1 Fetching)

Parallel Downloads

Watch Mode

Integration with Existing Code

Replace GitHubSyncService Calls

Add to services/mod.rs

Create Sync Binary

Testing

Rollback Plan

FilesExpand file tree

local-file-sync-strategy.md

Latest commit

History

local-file-sync-strategy.md

File metadata and controls

Local File Sync Strategy with GitHub Delta Updates

Problem Solved

Architecture

GitHub Sync Pipeline

Data Flow

File Type Detection Flow

Implementation

1. Docker Mount (DONE)

2. Local File Sync Service (DONE)

3. Usage

Performance Comparison

Old Approach (GitHub API Only)

New Approach (Local + Delta)

Maintenance Workflow

Initial Baseline Setup

Daily Sync

Force Re-download All

Configuration

Environment Variables

Advantages Over GitHub-Only Approach

Future Enhancements

Git Tree API (More Efficient SHA1 Fetching)

Parallel Downloads

Watch Mode

Integration with Existing Code

Replace GitHubSyncService Calls

Add to services/mod.rs

Create Sync Binary

Testing

Rollback Plan