Commit 778a140

DOC-3373: Add TinyMCE 8-specific llms.txt files for AI/LLM discoverability (#3989)
* DOC-3373: Add TinyMCE 8-specific llms.txt files for AI/LLM discoverability.
* DOC-3373: refined txt files, organized by category instead of alphabetical. Added new landing pages for supported frameworks. Updated link on installation section to point to new landing pages.
* DOC-3373: Update all titles in the Frontend Frameworks section so they are unique.
* DOC-3373: Revert new landing pages, add automation script to build txt files locally.
* DOC-3373: Fix CodeQL flagged issues.
* DOC-3373: Revert bootstrap landing page.
* DOC-3373: More CodeQL fixes.
* Potential fix for code scanning alert no. 65: Incomplete multi-character sanitization
  Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>
* DOC-3373: More CodeQL fixes.
* Potential fix for code scanning alert no. 67: Bad HTML filtering regexp
  Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>
* Potential fix for code scanning alert no. 68: Incomplete multi-character sanitization
  Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>
* Potential fix for code scanning alert no. 71: Incomplete multi-character sanitization
  Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>
* Potential fix for code scanning alert no. 72: Incomplete multi-character sanitization
  Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>
* Potential fix for code scanning alert no. 74: Incomplete multi-character sanitization
  Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>
* Potential fix for code scanning alert no. 76: Incomplete multi-character sanitization
  Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>
* Potential fix for code scanning alert no. 77: Incomplete multi-character sanitization
  Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>
* Potential fix for code scanning alert no. 79: Incomplete multi-character sanitization
  Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>
* DOC-3373: Delete stray tiny-docs-ai.adoc found during runs.
* DOC-3373: Update README-llm-files.md.
* DOC-3373: Refinements and improvements based on feedback.

---------

Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>
1 parent 4b7cb6b commit 778a140

8 files changed: +2201 -12 lines changed

-scripts/README-llm-files.md

Lines changed: 131 additions & 0 deletions
@@ -0,0 +1,131 @@
# Generating LLM Files

This directory contains scripts to automatically generate `llms.txt` and `llms-full.txt` files for LLM consumption.

## Overview

The LLM files provide structured documentation references that help AI assistants:
- Find the correct documentation pages
- Understand the documentation structure
- Reduce hallucinations by providing accurate URLs
- Discover all available integration options

## Files

- `generate-llm-files.js` - Node.js script that generates the LLM files
- `generate-llm-files.sh` - Shell wrapper script for easier execution

## Usage

### Option 1: After Local Build

1. Build the documentation site:
   ```bash
   yarn antora ./antora-playbook.yml
   ```

2. Generate LLM files from local sitemap:
   ```bash
   yarn generate-llm-files
   # or
   ./-scripts/generate-llm-files.sh
   ```

### Option 2: From Remote Sitemap

Generate directly from the published sitemap (useful for syncing with production):

```bash
yarn generate-llm-files-from-url
# or
node ./-scripts/generate-llm-files.js https://www.tiny.cloud/docs/antora-sitemap.xml
```

### Option 3: Custom Sitemap Source

```bash
node ./-scripts/generate-llm-files.js /path/to/sitemap.xml
# or
node ./-scripts/generate-llm-files.js https://example.com/sitemap.xml
```

## Workflow

### Manual Regeneration (Current Approach)

**After major/minor/patch releases:**
1. Run the script to regenerate the files from the production sitemap:
   ```bash
   yarn generate-llm-files-from-url
   ```
   This ensures the LLM files match what's actually published on the live site.

   Alternatively, if you need to generate from a local build:
   ```bash
   yarn generate-llm-files
   ```
2. Review the generated files in a PR
3. Commit and merge

**Why not automated in CI/CD?**
- The script makes 400+ HTTP requests to fetch H1 titles (~4-5 minutes)
- Running it on every build would be slow and resource-intensive
- Manual review ensures quality before committing
- Validates no 404s are listed and titles match actual page content

### File Locations

The files are generated in `modules/ROOT/attachments/`:
- `llms.txt` - Simplified, curated documentation index (~105 lines)
- `llms-full.txt` - Complete documentation index with all pages (~700 lines)

**Post-build:** Files are moved to the root directory (handled in a separate PR) and are accessible at:
- `https://www.tiny.cloud/docs/tinymce/latest/llms.txt`
- `https://www.tiny.cloud/docs/tinymce/latest/llms-full.txt`

## How It Works

1. **Reads sitemap.xml** - Extracts all documentation URLs from the sitemap (only `/latest/` URLs)
2. **Fetches H1 titles** - Makes an HTTP request to each page to get the actual H1 title (validates no 404s)
3. **Generates titles** - Uses the fetched H1 titles and falls back to URL-based titles if a fetch fails
4. **Categorizes pages** - Groups by topic (integrations, plugins, API, etc.) based on URL patterns
5. **Deduplicates** - Removes duplicate URLs and makes titles unique within categories
6. **Generates structured markdown** - Creates both simplified (`llms.txt`) and complete (`llms-full.txt`) indexes
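
As a rough illustration of step 1 and the URL half of step 5, the sketch below pulls `<loc>` entries out of a sitemap string, keeps only `/latest/` URLs, and drops repeats. The helper name `extractLatestUrls`, the regular expression, and the sample sitemap are assumptions made for this example; the actual script may parse the sitemap differently.

```javascript
// Sketch only: extractLatestUrls is a hypothetical helper, not the script's own code.
function extractLatestUrls(sitemapXml) {
  // Capture the text content of every <loc> element.
  const locPattern = /<loc>\s*([^<\s]+)\s*<\/loc>/g;
  const seen = new Set();
  const urls = [];
  for (const match of sitemapXml.matchAll(locPattern)) {
    const url = match[1];
    // Keep only /latest/ pages and skip URLs that were already collected.
    if (url.includes('/latest/') && !seen.has(url)) {
      seen.add(url);
      urls.push(url);
    }
  }
  return urls;
}

// Illustrative sitemap: one versioned URL and one duplicated /latest/ URL.
const sampleSitemap = `
<urlset>
  <url><loc>https://www.tiny.cloud/docs/tinymce/latest/react-cloud/</loc></url>
  <url><loc>https://www.tiny.cloud/docs/tinymce/6/react-cloud/</loc></url>
  <url><loc>https://www.tiny.cloud/docs/tinymce/latest/react-cloud/</loc></url>
</urlset>`;

console.log(extractLatestUrls(sampleSitemap));
// prints [ 'https://www.tiny.cloud/docs/tinymce/latest/react-cloud/' ]
```

A regular expression is enough here because sitemap `<loc>` values are plain URLs with no nested markup; a full XML parser would be the more defensive choice.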

## Customization

The script uses hardcoded categorization logic. To customize:

1. Edit `generate-llm-files.js`
2. Modify the `categorizeUrl()` function to adjust categorization
3. Update `generateLLMsTxt()` and `generateLLMsFullTxt()` to change output format
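
To give a feel for what adjusting the categorization might involve, here is a minimal sketch of a pattern-based `categorizeUrl()`. Only the function name comes from this README; the `CATEGORY_PATTERNS` table, the category names, and the URL patterns are hypothetical and will not match the real script exactly.

```javascript
// Hypothetical pattern table: the real categories and patterns live in generate-llm-files.js.
const CATEGORY_PATTERNS = [
  { category: 'Integrations', pattern: /\/(react|vue|angular)-/ },
  { category: 'API Reference', pattern: /\/apis\// },
];

// Return the first category whose pattern matches the URL path, or 'Other'.
function categorizeUrl(url) {
  const { pathname } = new URL(url);
  for (const { category, pattern } of CATEGORY_PATTERNS) {
    if (pattern.test(pathname)) return category;
  }
  return 'Other';
}

console.log(categorizeUrl('https://www.tiny.cloud/docs/tinymce/latest/react-cloud/'));
// prints Integrations
console.log(categorizeUrl('https://www.tiny.cloud/docs/tinymce/latest/release-notes/'));
// prints Other
```

With a table like this, supporting a new page type would mean adding one entry rather than editing the control flow. Order matters, since the first matching pattern wins.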

## Notes

- The script requires Node.js and the `sanitize-html` package (installed via `yarn install`)
- Generated files are written to `modules/ROOT/attachments/`
- Uses only the sitemap (no dependency on `nav.adoc`)
- Fetches actual H1 titles from pages (validates no 404s)
- Rate-limited fetching: 10 concurrent requests with 100ms delay between batches
- Request timeout: 10 seconds per page
- Security: Validates URLs to prevent SSRF attacks (only allows tiny.cloud domains)
- Handles HTML entity decoding (`&#8217;` → `'`)
- Filters out error pages and duplicate URLs
- Makes titles unique within categories (e.g., "ES6 and npm (Webpack)", "ES6 and npm (Rollup)")
- Falls back to URL-based title generation if H1 fetch fails
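
The domain check and the batched fetching described in the notes above can be sketched roughly as follows. The helper names `isAllowedUrl` and `processInBatches` are invented for this example, and the rule "tiny.cloud or any subdomain of it" is an assumption about what "only allows tiny.cloud domains" means. The worker is a stand-in, so no real HTTP requests are made, and the 10-second per-page timeout is left out.

```javascript
// Assumption: a URL is allowed when its host is tiny.cloud or a subdomain of it.
function isAllowedUrl(rawUrl) {
  let parsed;
  try {
    parsed = new URL(rawUrl);
  } catch {
    return false; // not a parseable absolute URL
  }
  if (parsed.protocol !== 'https:' && parsed.protocol !== 'http:') return false;
  const host = parsed.hostname;
  return host === 'tiny.cloud' || host.endsWith('.tiny.cloud');
}

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Run `worker` over `items` in batches of `batchSize`, pausing `delayMs` between batches.
async function processInBatches(items, worker, batchSize = 10, delayMs = 100) {
  const results = [];
  for (let i = 0; i < items.length; i += batchSize) {
    const batch = items.slice(i, i + batchSize);
    // Items within one batch run concurrently; results keep input order.
    results.push(...(await Promise.all(batch.map(worker))));
    if (i + batchSize < items.length) await sleep(delayMs);
  }
  return results;
}

// Usage with a stand-in worker in place of a real page fetch.
const candidates = ['https://www.tiny.cloud/docs/tinymce/latest/', 'https://example.com/'];
const allowed = candidates.filter(isAllowedUrl);
processInBatches(allowed, async (url) => url.length).then((lengths) => console.log(lengths));
```

Comparing the parsed hostname, instead of searching the raw string for `tiny.cloud`, avoids accepting hosts such as `evil-tiny.cloud` or `tiny.cloud.example.com`.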

## Troubleshooting

**Error: "Source not found"**
- Make sure the sitemap path is correct
- For remote URLs, check your internet connection
- For local files, ensure Antora has generated the site first

**Missing page titles**
- If H1 fetch fails, the script uses URL-based title generation as fallback
- Check that pages return valid HTML with H1 tags
- 404 pages are automatically filtered out
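
When checking why a title looks odd, it can help to know roughly what a URL-based fallback produces. The sketch below derives a title from the last path segment of a URL. `titleFromUrl` is a hypothetical name, and the capitalization rules are an assumption, so the real script's fallback titles may differ.

```javascript
// Hypothetical fallback: build a readable title from the last path segment of a URL.
function titleFromUrl(url) {
  const segments = new URL(url).pathname.split('/').filter(Boolean);
  const slug = segments.length > 0 ? segments[segments.length - 1] : '';
  return slug
    .replace(/\.html?$/, '') // drop a trailing .html or .htm if present
    .split(/[-_]+/) // split the slug into words
    .filter(Boolean)
    .map((word) => word.charAt(0).toUpperCase() + word.slice(1))
    .join(' ');
}

console.log(titleFromUrl('https://www.tiny.cloud/docs/tinymce/latest/react-cloud/'));
// prints React Cloud
```

A fallback like this cannot recover punctuation or product casing, which is one reason a title generated from the URL may differ from the page's actual H1.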

**Incorrect categorization**
- Review the `categorizeUrl()` function (note: function name is singular, not plural)
- Add custom patterns for new page types
