A fully automated archive for Hugging Face Daily Papers.
It crawls papers from https://huggingface.co/papers/date/YYYY-MM-DD, stores normalized JSON records, generates bilingual summaries (EN + ZH), builds search indexes, and publishes a static Next.js website on GitHub Pages.
- Tagged releases: `v0.1`, `v0.3.1`, `v0.3.2`
- Main branch includes additional hardening after `v0.3.2` (dedupe + release-time visibility gating)
- Daily crawler (Python 3.11, `requests` + `beautifulsoup4` + `lxml`)
- Per-paper archival JSON under date folders
- Fields per paper: `title`, `authors`, `abstract`, `summary_en`, `summary_zh`, `hf_url`, `arxiv_url`, `arxiv_pdf_url`, `github_url`, `upvotes`, `fetched_at`
- Translation pipeline:
  - `dummy` translator (works without API keys)
  - `openrouter` translator (default model: `moonshotai/kimi-k2.5`)
  - Auto-synthesize `summary_en` from `abstract` when `summary_en` is missing
- Search and index generation: `data/index.json`, `data/search_index.json`, `data/dates/<date>.json`
- Static website (Next.js export) with:
  - date-based browsing
  - full-text search (title/authors/abstract/summaries)
  - bilingual summary toggle
  - per-paper detail pages
  - daily Overview panel
- Automation via GitHub Actions + GitHub Pages
- Weekend skip: crawler skips Saturday/Sunday by default (can be overridden)
- Rate-limit friendly crawling:
  - randomized sleep between papers
  - retries with backoff
  - skip already complete records
- Cross-date dedupe in index build:
  - one visible record per `paper_id`
  - keeps earliest date as canonical display date
  - merges richer fields from duplicates
- Release-time visibility gate (default):
  - timezone: `Asia/Shanghai` (GMT+8)
  - release time: `08:00`
  - delay days: `1`
  - effect: papers for `YYYY-MM-DD` are not exposed in site indexes until `YYYY-MM-(DD+1) 08:00`
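The cross-date dedupe and release-time gate described in the list above could be sketched roughly as follows. This is a minimal illustration, not the code in `scripts/build_index.py`: the function names `dedupe_by_paper_id` and `is_visible` are hypothetical, and "merges richer fields" is assumed here to mean "fill empty fields from later duplicates without overwriting existing values".

```python
from datetime import datetime, time, timedelta
from zoneinfo import ZoneInfo


def dedupe_by_paper_id(records):
    """Keep one record per paper_id, dated at its earliest appearance."""
    empty = (None, "", [])
    merged = {}
    # Visit records oldest-first so the first one seen carries the canonical date.
    for rec in sorted(records, key=lambda r: r["date"]):
        kept = merged.setdefault(rec["paper_id"], dict(rec))
        for key, value in rec.items():
            # Assumption: only fill gaps, never overwrite a populated field.
            if key != "date" and kept.get(key) in empty and value not in empty:
                kept[key] = value
    return list(merged.values())


def is_visible(paper_date, now, tz=None, release=time(8, 0), delay_days=1):
    """True once `now` (timezone-aware) has reached the release moment."""
    if tz is None:
        tz = ZoneInfo("Asia/Shanghai")  # documented default timezone
    release_day = paper_date + timedelta(days=delay_days)
    release_at = datetime.combine(release_day, release, tzinfo=tz)
    return now >= release_at
```

With the defaults, papers dated `2026-02-18` would become visible at `2026-02-19 08:00` Asia/Shanghai, which is the same instant as `2026-02-19 00:00 UTC`.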
hf-papers-archive/
  scripts/
    fetch_daily.py
    translate.py
    build_index.py
    backfill_range.py
    migrate_paper_layout.py
  data/
    papers/
    dates/
    index.json
    search_index.json
  site/
    app/
    components/
    lib/
    scripts/sync-data.mjs
    package.json
  .github/workflows/
    daily.yml
  requirements.txt
  README.md
python3 -m venv .venv
source .venv/bin/activate
python -m pip install -r requirements.txt
npm install --prefix site
python3 scripts/fetch_daily.py --date 2026-02-18 --skip-existing-complete
Optional (weekend override):
python3 scripts/fetch_daily.py --date 2026-02-15 --allow-weekend
Auto mode (uses OpenRouter only if key is present):
export OPENROUTER_API_KEY=<your_key>
export OPENROUTER_MODEL=moonshotai/kimi-k2.5
python3 scripts/translate.py --provider auto --date 2026-02-18 --workers 6
python3 scripts/build_index.py
Optional visibility overrides:
export ARCHIVE_TIMEZONE=Asia/Shanghai
export ARCHIVE_RELEASE_HOUR=8
export ARCHIVE_RELEASE_MINUTE=0
export ARCHIVE_RELEASE_DELAY_DAYS=1
python3 scripts/build_index.py
npm run build --prefix site
Output directory: site/out
Dev mode:
npm run dev --prefix site
Static preview example:
cd site/out && python3 -m http.server 4173
Run full day-by-day pipeline (fetch -> translate -> build index):
python3 scripts/backfill_range.py \
--start 2026-02-01 \
--end 2026-02-19 \
--provider auto \
--workers 6 \
--skip-existing-complete
Example: data/papers/2026-02-18/2602.14111.json
{
"date": "2026-02-18",
"paper_id": "2602.14111",
"title": "Paper Title",
"authors": ["Author A", "Author B"],
"abstract": "...",
"summary_en": "...",
"summary_zh": "...",
"hf_url": "https://huggingface.co/papers/2602.14111",
"arxiv_url": "https://arxiv.org/abs/2602.14111",
"arxiv_pdf_url": "https://arxiv.org/pdf/2602.14111",
"github_url": "https://github.com/org/repo",
"upvotes": 42,
"fetched_at": "2026-02-19T00:12:34.123456+00:00"
}
Workflow: .github/workflows/daily.yml
- `schedule`: daily at 00:00 UTC (which is 08:00 GMT+8)
- `workflow_dispatch`: manual run
- Resolve target date (default: previous day in selected timezone)
- Run single-day pipeline with skip-existing optimization
- Rebuild indexes
- Commit `data/` only if changed
- Build Next.js static site
- Deploy to GitHub Pages
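The first workflow step (resolve the target date as the previous day in the selected timezone) combined with the crawler's weekend skip might look like the sketch below. The helper names `resolve_target_date` and `should_crawl` are hypothetical and not taken from the repository, and a fixed GMT+8 offset stands in for the configured timezone.

```python
from datetime import datetime, timedelta, timezone


def resolve_target_date(now, tz):
    """Return the calendar day before `now`, as seen in timezone `tz`."""
    return now.astimezone(tz).date() - timedelta(days=1)


def should_crawl(target, allow_weekend=False):
    """Skip Saturday (5) and Sunday (6) unless explicitly overridden."""
    return allow_weekend or target.weekday() < 5


# The scheduled 00:00 UTC run, viewed from a fixed GMT+8 offset.
gmt8 = timezone(timedelta(hours=8))
run_time = datetime(2026, 2, 19, 0, 0, tzinfo=timezone.utc)
target = resolve_target_date(run_time, gmt8)
print(target, should_crawl(target))  # 2026-02-18 True
```

The `allow_weekend` parameter mirrors the `--allow-weekend` flag of `scripts/fetch_daily.py`; how the real script wires that flag internally is not shown here.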
- Settings -> Pages -> Source: GitHub Actions
- Settings -> Actions -> General -> Workflow permissions: Read and write permissions
- `OPENROUTER_API_KEY` (required for real translation and AI Overview)
- `OPENROUTER_MODEL` (optional)
- `OPENROUTER_SUMMARY_MODEL` (optional)
- Initial end-to-end MVP
- Daily fetch, JSON archive, index build, static site, Pages workflow
- OpenRouter-based translation flow stabilized
- Translation concurrency introduced (`--workers`, default 6)
- Prompt quality improvements for translation and overview generation
- UI/visual refinements and icon updates
- Home/detail page usability improvements
- Deployment flow polished for GitHub Pages
- Date alignment fixes for display and daily scheduling
- Per-day folder layout adopted for paper JSON files
- Backfill and migration utilities added
- Deduplication across dates in index build
- Release-time visibility gate to prevent early exposure of next-date content
- Additional skip-existing logic to reduce redundant crawling/translation and lower rate-limit risk
- If source page structure changes, parser fallbacks try to keep the pipeline resilient.
- Missing fields are allowed; the pipeline should not crash because of partial extraction.
- For production automation, keep translation worker count conservative (2-6) to reduce API throttling risk.
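As an illustration of the last note, a bounded worker pool with retry and backoff could be structured like this. It is a sketch under assumptions, not the logic in `scripts/translate.py`: `with_retries` and `translate_all` are hypothetical names, the cap of 6 workers simply follows the range suggested above, and the exponential-backoff-plus-jitter formula is one common choice rather than the project's documented behavior.

```python
import random
import time
from concurrent.futures import ThreadPoolExecutor


def with_retries(fn, arg, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fn(arg); on failure wait base_delay * 2**n plus jitter, then retry."""
    for attempt in range(attempts):
        try:
            return fn(arg)
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt + random.uniform(0, base_delay))


def translate_all(items, translate_fn, workers=6, max_workers_cap=6, **retry_kwargs):
    """Translate items concurrently, never using more than the capped worker count."""
    workers = max(1, min(workers, max_workers_cap))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves input order in its results.
        return list(pool.map(lambda item: with_retries(translate_fn, item, **retry_kwargs), items))
```

Here `translate_fn` is any callable that takes one item and returns its translation; the `sleep` parameter is injectable so the backoff schedule can be exercised without real waiting.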