Commit 9fd7799

kennethkalmer and claude committed
docs: markdown generation documentation and tooling
⚠️ NOTE: This commit will be dropped before final PR merge.

Experimental markdown generation with React hydration. Currently disabled via MARKDOWN_SIMPLE_MODE=true.

🤖 Generated with Claude Code

Co-Authored-By: Claude <[email protected]>
1 parent af98228 commit 9fd7799

5 files changed: +1283 -8 lines changed

LANGUAGE_MARKDOWN_GENERATION.md

Lines changed: 369 additions & 0 deletions
@@ -0,0 +1,369 @@
# Language-Specific Markdown Generation

This implementation generates language-specific markdown files from HTML pages with React-based language selectors.

## Overview

The system can operate in two modes:

1. **Simple Mode** (legacy): Converts static HTML to markdown without language awareness
2. **Advanced Mode** (new): Hydrates React, switches languages, and generates separate markdown files per language

## How It Works

### Advanced Mode (Default)

1. **Load HTML**: Reads built HTML files from `./public`
2. **Setup JSDOM**: Creates a browser-like environment with React support
3. **Asset Rewriting**: Rewrites `ASSET_PREFIX` URLs to local paths (since assets aren't deployed yet)
4. **React Hydration**: Loads and executes Gatsby bundles (webpack-runtime, framework, app, page bundles)
5. **Language Detection**: Identifies available languages from:
   - Language selector DOM elements
   - Page metadata
   - Product-based language data (`src/data/languages/languageData.ts`)
6. **Language Switching**: For each language:
   - Updates URL search params (`?lang=javascript`)
   - Triggers React re-render
   - Waits for content to update
7. **Content Extraction**: Extracts main content and converts to markdown
8. **File Generation**: Saves as `page.{language}.md` (e.g., `docs/realtime/channels.javascript.md`)
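
To make step 3 more concrete, here is a minimal sketch of asset rewriting as a plain string transformation. The signature and behavior are assumptions for illustration, not the actual `rewriteAssetUrls()` implementation, and the CDN URL in the usage note is a made-up example.

```typescript
// Hypothetical sketch: strips the asset prefix so that URLs resolve
// relative to the local ./public directory instead of the CDN.
function rewriteAssetUrls(html: string, assetPrefix: string | undefined): string {
  if (!assetPrefix) {
    return html; // nothing to rewrite when no prefix is configured
  }
  // Ignore trailing slashes so "https://cdn/x/" and "https://cdn/x" behave the same
  const prefix = assetPrefix.replace(/\/+$/, '');
  // split/join replaces every occurrence without any regex escaping concerns
  return html.split(prefix).join('');
}
```

With an assumed prefix of `https://cdn.example.com/assets/`, a tag such as `<script src="https://cdn.example.com/assets/app.js">` would become `<script src="/app.js">`.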

### File Naming Convention

- **With languages**: `/docs/foo/index.html` → `/docs/foo.javascript.md`, `/docs/foo.python.md`, etc.
- **Without languages**: `/docs/foo/index.html` → `/docs/foo.md` (current behavior)
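
The naming convention above can be expressed as a small path-mapping function. This is an illustrative sketch; `markdownPathFor` is a hypothetical name, and the real implementation may handle more cases (for example the site root).

```typescript
// Hypothetical helper: maps a built HTML path to its markdown output path.
function markdownPathFor(htmlPath: string, language?: string): string {
  // "/docs/foo/index.html" -> "/docs/foo"; "/docs/foo.html" -> "/docs/foo"
  const base = htmlPath.replace(/\/index\.html$/, '').replace(/\.html$/, '');
  // Language-specific files get the language inserted before the extension
  return language ? `${base}.${language}.md` : `${base}.md`;
}
```

For example, `markdownPathFor('/docs/foo/index.html', 'python')` yields `/docs/foo.python.md`, while omitting the language yields `/docs/foo.md`.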

## Usage

### During Build (Automatic)

Advanced mode runs automatically after each build:

```bash
yarn build
```

To force simple mode:

```bash
MARKDOWN_SIMPLE_MODE=true yarn build
```

### Standalone Script

Generate markdown without rebuilding the site:

```bash
# Default (advanced mode, all pages, all languages)
yarn generate-markdown

# Simple mode (static HTML conversion)
yarn generate-markdown:simple

# Verbose logging
yarn generate-markdown:verbose

# Custom options (a .ts entry point needs a TypeScript-aware runner such as
# ts-node or tsx, or a Node version that supports type stripping)
node scripts/generate-language-markdown.ts --mode=advanced --verbose
```

#### CLI Options

```
--mode=<mode>         Export mode: "simple" or "advanced" (default: advanced)
--env=<environment>   Environment to load (.env.<environment>)
--pages=<pattern>     Glob pattern to filter pages (e.g., "docs/realtime/*")
--languages=<list>    Comma-separated languages (e.g., "javascript,python")
--site-url=<url>      Site URL for absolute links
--verbose, -v         Enable verbose logging
--help, -h            Show help message
```
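
As a rough illustration of how the flags listed above could be turned into a typed options object, here is a dependency-free sketch. `CliOptions` and `parseCliArgs` are hypothetical names; the actual script may parse its arguments differently.

```typescript
// Hypothetical option shape mirroring the documented flags.
interface CliOptions {
  mode: 'simple' | 'advanced';
  env?: string;
  pages?: string;
  languages?: string[];
  siteUrl?: string;
  verbose: boolean;
  help: boolean;
}

function parseCliArgs(argv: string[]): CliOptions {
  // Defaults follow the documented behavior: advanced mode, quiet output
  const opts: CliOptions = { mode: 'advanced', verbose: false, help: false };
  for (const arg of argv) {
    if (arg === '--verbose' || arg === '-v') { opts.verbose = true; continue; }
    if (arg === '--help' || arg === '-h') { opts.help = true; continue; }
    // Everything else must look like --key=value
    const match = /^--([a-z-]+)=(.*)$/.exec(arg);
    if (!match) throw new Error(`Unrecognized argument: ${arg}`);
    const [, key, value] = match;
    switch (key) {
      case 'mode':
        if (value !== 'simple' && value !== 'advanced') throw new Error(`Invalid mode: ${value}`);
        opts.mode = value;
        break;
      case 'env': opts.env = value; break;
      case 'pages': opts.pages = value; break;
      case 'languages':
        // "javascript, python" -> ["javascript", "python"]
        opts.languages = value.split(',').map((l) => l.trim()).filter(Boolean);
        break;
      case 'site-url': opts.siteUrl = value; break;
      default: throw new Error(`Unknown option: --${key}`);
    }
  }
  return opts;
}
```

A call such as `parseCliArgs(process.argv.slice(2))` would be the typical entry point in a Node script.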

### Examples

```bash
# Generate for specific pages
yarn generate-markdown --pages="docs/realtime/*"

# Generate specific languages only
yarn generate-markdown --languages="javascript,python" --verbose

# Use different environment
yarn generate-markdown --env=staging
```

## Environment Variables

- `ASSET_PREFIX`: Asset CDN URL (automatically rewritten to local paths)
- `MARKDOWN_SIMPLE_MODE`: Set to `'true'` to force simple mode
- `VERBOSE`: Set to `'true'` for detailed logging
- `GATSBY_ABLY_MAIN_WEBSITE`: Site URL for absolute links
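
A small sketch of how these variables might be read into a config object. `EnvConfig` and `readEnvConfig` are hypothetical names, and the sketch assumes that only the literal string `'true'` enables a flag, as the list above suggests.

```typescript
type Env = Record<string, string | undefined>;

// Hypothetical config shape derived from the documented variables.
interface EnvConfig {
  mode: 'simple' | 'advanced';
  verbose: boolean;
  assetPrefix?: string;
  siteUrl?: string;
}

function readEnvConfig(env: Env): EnvConfig {
  return {
    // Only the exact string 'true' switches to simple mode
    mode: env.MARKDOWN_SIMPLE_MODE === 'true' ? 'simple' : 'advanced',
    verbose: env.VERBOSE === 'true',
    // Treat empty strings as "not set"
    assetPrefix: env.ASSET_PREFIX || undefined,
    siteUrl: env.GATSBY_ABLY_MAIN_WEBSITE || undefined,
  };
}
```

In a Node script this would typically be called as `readEnvConfig(process.env)`; taking the environment as a parameter keeps the function easy to test.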

## Implementation Details

### File Structure

```
data/onPostBuild/
├── markdownOutput.ts              # Mode switcher and simple implementation
├── markdownOutputWithLanguages.ts # Advanced mode with React hydration
└── index.ts                       # Post-build hook orchestration

scripts/
└── generate-language-markdown.ts  # Standalone CLI script
```

### Key Components

#### 1. JSDOM Setup (`markdownOutputWithLanguages.ts`)

```typescript
import { promises as fs } from 'fs';
import * as path from 'path';
import { ResourceLoader } from 'jsdom';

class LocalAssetResourceLoader extends ResourceLoader {
  // assetPrefix is assumed to be supplied by the caller (e.g. from ASSET_PREFIX)
  constructor(private readonly assetPrefix?: string) {
    super();
  }

  // Rewrites ASSET_PREFIX URLs to local ./public paths
  async fetch(url: string, options: any) {
    if (this.assetPrefix && url.startsWith(this.assetPrefix)) {
      const localPath = url.slice(this.assetPrefix.length);
      return fs.readFile(path.join('./public', localPath));
    }
    return super.fetch(url, options);
  }
}
```

#### 2. Language Detection

```typescript
function detectAvailableLanguages(document: Document, htmlFile: string): string[] {
  // 1. Try DOM selectors
  const options = document.querySelectorAll<HTMLOptionElement>('[data-language-selector] option');
  if (options.length > 0) {
    return Array.from(options).map((opt) => opt.value);
  }

  // 2. Fallback to product-based data
  const product = extractProductFromPath(htmlFile); // e.g., 'realtime' → 'pubsub'
  // Unknown products yield no languages instead of throwing
  return Object.keys(languageData[product] ?? {});
}
```
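
#### 3. Language Switching

```typescript
async function switchLanguage(dom: JSDOM, language: string): Promise<boolean> {
  const { window } = dom;
  const { document } = window;

  // Update URL search params. Assigning window.location.search would trigger a
  // navigation, which jsdom does not implement, so use the History API instead
  // (this requires the JSDOM instance to have been created with a `url`).
  window.history.pushState({}, '', `?lang=${encodeURIComponent(language)}`);

  // Trigger events
  window.dispatchEvent(new window.Event('popstate'));
  window.dispatchEvent(new window.Event('hashchange'));

  // Manipulate selector (skipped when the page has none)
  const selector = document.querySelector<HTMLSelectElement>('[data-language-selector] select');
  if (selector) {
    selector.value = language;
    selector.dispatchEvent(new window.Event('change', { bubbles: true }));
  }

  // Wait for content to update; assumes waitFor resolves to true once
  // contentChanged() holds and to false on timeout
  return waitFor(() => contentChanged(), 5000);
}
```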
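
The `waitFor` helper used above is not shown in this document. A minimal polling sketch, assuming it resolves to `true` when the predicate passes and `false` on timeout (the real helper may behave differently, and the `intervalMs` parameter is an addition for illustration):

```typescript
// Hypothetical helper: polls `predicate` until it returns true or `timeoutMs` elapses.
async function waitFor(
  predicate: () => boolean,
  timeoutMs: number,
  intervalMs = 50,
): Promise<boolean> {
  const deadline = Date.now() + timeoutMs;
  while (Date.now() < deadline) {
    if (predicate()) return true;
    // Yield to the event loop so React effects and timers can run
    await new Promise<void>((resolve) => setTimeout(resolve, intervalMs));
  }
  // One last check so a change that lands right at the deadline is not missed
  return predicate();
}
```

Polling is used here because it needs no hooks into React; a `MutationObserver` on the content container would be an alternative if the environment provides one.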

### Frontmatter Schema

```yaml
---
title: "Channel Lifecycle"
url: "/docs/realtime/channels"
generated_at: "2025-11-18T10:30:00Z"
description: "Learn about channel lifecycle and state management"
language: "javascript"
language_version: "2.11"
---
```
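
For illustration, frontmatter in this shape could be produced by a function like the one below. The `Frontmatter` interface and `renderFrontmatter` name are assumptions based on the example fields above, and which fields are optional is a guess.

```typescript
// Hypothetical shape based on the example above; optionality is assumed.
interface Frontmatter {
  title: string;
  url: string;
  generated_at: string;
  description?: string;
  language?: string;
  language_version?: string;
}

function renderFrontmatter(meta: Frontmatter): string {
  const lines = Object.entries(meta)
    .filter(([, value]) => value !== undefined)
    // JSON string escaping is compatible with YAML double-quoted scalars,
    // so titles containing quotes or colons stay valid
    .map(([key, value]) => `${key}: ${JSON.stringify(value)}`);
  return ['---', ...lines, '---', ''].join('\n');
}
```

Fields that are `undefined` are omitted, so a page without language data would simply have no `language` or `language_version` keys.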

## Supported Languages

Languages are defined per product in `src/data/languages/languageData.ts`:

- **Pub/Sub**: javascript, nodejs, typescript, react, csharp, flutter, java, kotlin, objc, php, python, ruby, swift, go, laravel
- **Chat**: javascript, react, swift, kotlin
- **Spaces**: javascript, react
- **Asset Tracking**: javascript, swift, kotlin

## Troubleshooting

### React Hydration Fails

**Symptom**: Falls back to simple mode

**Causes**:
- Missing Gatsby bundles
- JavaScript errors during hydration
- Timeout (default: 30s)

**Solution**: Check browser console logs, increase timeout in `CONFIG.hydrationTimeout`

### Language Switching Doesn't Work

**Symptom**: All language files have identical content

**Causes**:
- Language selector not found
- React state not updating
- Content not conditional on language

**Solution**:
- Verify language selector exists: `document.querySelector('[data-language-selector]')`
- Check if content actually changes by language in browser
- Increase `CONFIG.languageSwitchTimeout`
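
The "identical content" symptom can be checked mechanically. The sketch below groups languages whose generated markdown is the same once the frontmatter (which always differs in its `language` field) is ignored. Both function names are hypothetical and not part of the tooling.

```typescript
// Drops a leading YAML frontmatter block, if present.
function stripFrontmatter(markdown: string): string {
  return markdown.replace(/^---\n[\s\S]*?\n---\n/, '');
}

// Returns groups of languages whose markdown bodies are byte-for-byte identical.
function groupIdenticalOutputs(outputs: Record<string, string>): string[][] {
  const byBody = new Map<string, string[]>();
  for (const [language, markdown] of Object.entries(outputs)) {
    const body = stripFrontmatter(markdown).trim();
    const group = byBody.get(body) ?? [];
    group.push(language);
    byBody.set(body, group);
  }
  // Only groups with more than one language indicate a possible switching failure
  return Array.from(byBody.values()).filter((group) => group.length > 1);
}
```

A single group containing every language suggests switching never took effect. Smaller groups may be legitimate when a page has no language-specific content.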

### Asset Loading Errors

**Symptom**: Scripts fail to load, 404 errors

**Causes**:
- `ASSET_PREFIX` not properly rewritten
- Assets not built yet
- Incorrect path resolution

**Solution**:
- Ensure `./public` directory exists with all assets
- Check `ASSET_PREFIX` value matches expected URL
- Verify `rewriteAssetUrls()` is working correctly

### Memory Issues

**Symptom**: Process crashes with OOM

**Causes**:
- Too many JSDOM instances
- Large pages
- Memory leaks

**Solution**:
- Process files sequentially (current implementation)
- Reduce `CONFIG.hydrationTimeout`
- Use `--max-old-space-size=4096` Node flag

## Performance Considerations

### Simple Mode
- **Speed**: ~50-100ms per page
- **Memory**: ~50MB for 100 pages
- **Use Case**: No language selectors, static content

### Advanced Mode
- **Speed**: ~2-5 seconds per page (per language)
- **Memory**: ~200-500MB for 100 pages
- **Use Case**: Language selectors, conditional content
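
To get a feel for what the advanced-mode figures mean for a full run, the sketch below multiplies them out. It assumes strictly sequential processing and that the 2-5 second cost applies to every page and language combination; `estimateAdvancedModeTime` is a hypothetical helper.

```typescript
interface TimeRange {
  minSeconds: number;
  maxSeconds: number;
}

// Rough estimate: every (page, language) pair costs between 2 and 5 seconds.
function estimateAdvancedModeTime(pages: number, languagesPerPage: number): TimeRange {
  const renders = pages * languagesPerPage;
  return { minSeconds: renders * 2, maxSeconds: renders * 5 };
}
```

For example, 100 pages with 4 languages each is 400 renders, or roughly 800 to 2000 seconds (about 13 to 33 minutes), compared with 5 to 10 seconds for the same 100 pages in simple mode.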

### Optimization Strategies

1. **Parallel Processing** (future): Use worker threads for multiple pages
2. **Caching**: Reuse JSDOM environment for same template types
3. **Selective Generation**: Only regenerate changed pages
4. **Hybrid Mode**: Use simple mode for pages without language selectors

## Future Enhancements

### 1. Smart Detection
- Detect which pages actually need language processing
- Skip pages where content doesn't change by language

### 2. Incremental Generation
```typescript
interface IncrementalOptions {
  changedFiles?: string[]; // Only regenerate these
  compareHash?: boolean;   // Skip if content hash unchanged
}
```
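
One way the `compareHash` idea could work is to hash each page's HTML and regenerate only pages whose hash differs from the previous run. This is a sketch of that idea, not a planned design; SHA-256 and the function names are choices made for illustration.

```typescript
import { createHash } from 'node:crypto';

// Stable fingerprint of a page's built HTML.
function contentHash(content: string): string {
  return createHash('sha256').update(content).digest('hex');
}

// Returns the pages whose current HTML differs from the hash recorded last run.
function selectPagesToRegenerate(
  pages: Record<string, string>,          // page path -> current HTML
  previousHashes: Record<string, string>, // page path -> hash from the last run
): string[] {
  return Object.keys(pages).filter(
    (pagePath) => previousHashes[pagePath] !== contentHash(pages[pagePath]),
  );
}
```

New pages have no recorded hash and are therefore always selected. Persisting the hash map between builds (for example as a JSON file) is left out of the sketch.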

### 3. Parallel Processing
```typescript
import { Worker } from 'worker_threads';

async function processInParallel(files: string[], workers: number) {
  // Distribute files across worker threads
}
```
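
The "distribute files" step could be as simple as a round-robin split, sketched below. `partitionFiles` is a hypothetical helper; it only computes the per-worker batches and leaves spawning the workers to the caller.

```typescript
// Splits files into at most `workers` batches using round-robin assignment.
function partitionFiles(files: string[], workers: number): string[][] {
  if (!Number.isInteger(workers) || workers < 1) {
    throw new RangeError('workers must be a positive integer');
  }
  // Never create more batches than files, but always return at least one batch
  const batchCount = Math.min(workers, files.length) || 1;
  const batches: string[][] = Array.from({ length: batchCount }, () => []);
  files.forEach((file, index) => batches[index % batchCount].push(file));
  return batches;
}
```

Round-robin keeps batch sizes within one file of each other. Since each worker would hold its own JSDOM and React instance, the worker count also multiplies the memory figures given under Performance Considerations.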

### 4. Page Filtering
The `--pages` and `--languages` options are already defined in the CLI, but the filtering itself is not yet implemented:

```bash
yarn generate-markdown --pages="docs/realtime/*"
yarn generate-markdown --languages="javascript,python"
```
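
A sketch of how `--pages` filtering might be implemented without extra dependencies follows. The glob semantics here are an assumption (`*` matches within one path segment, `**` matches across segments, everything else is literal), as is the page path format used in the usage note.

```typescript
// Converts a simple glob into an anchored regular expression.
function globToRegExp(pattern: string): RegExp {
  const source = pattern
    .replace(/[.+^${}()|[\]\\?]/g, '\\$&') // escape regex metacharacters except '*'
    .replace(/\*\*/g, '\u0000')            // temporarily mark '**'
    .replace(/\*/g, '[^/]*')               // '*' stays within one path segment
    .replace(/\u0000/g, '.*');             // '**' may cross '/' boundaries
  return new RegExp(`^${source}$`);
}

// Keeps only the pages whose path matches the glob pattern.
function filterPages(pages: string[], pattern: string): string[] {
  const matcher = globToRegExp(pattern);
  return pages.filter((page) => matcher.test(page));
}
```

With the pattern `docs/realtime/*`, a page at `docs/realtime/channels` would be kept while `docs/realtime/channels/states` and `docs/chat/rooms` would be skipped. An established glob library may be preferable if richer patterns are needed.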

## Testing

### Manual Testing

```bash
# 1. Build the site
yarn build

# 2. Check generated files
ls public/docs/realtime/*.md

# 3. Verify content differs by language
diff public/docs/realtime/channels.javascript.md public/docs/realtime/channels.python.md

# 4. Test CLI
yarn generate-markdown:verbose
```

### Test Cases

1. **Pages with language selector**: Should generate multiple `.{lang}.md` files
2. **Pages without language selector**: Should generate single `.md` file
3. **Invalid HTML**: Should fall back to simple mode
4. **Missing assets**: Should handle gracefully
5. **ASSET_PREFIX**: Should rewrite URLs correctly

### Debugging

Enable verbose logging:

```bash
VERBOSE=true yarn generate-markdown
```

Or use the Node debugger (the `.ts` entry point needs a TypeScript-aware runner or a Node version that supports type stripping):

```bash
node --inspect-brk scripts/generate-language-markdown.ts
```

## Known Limitations

1. **Server-Side Only**: Cannot run in browser
2. **Sequential Processing**: One page at a time (slow for large sites)
3. **React Dependency**: Requires React to be fully functional
4. **Limited Language Detection**: Relies on DOM or product mapping
5. **No Incremental Updates**: Regenerates all files every time
6. **Memory Intensive**: JSDOM + React uses significant RAM

## Contributing

When modifying the language generation:

1. Test both simple and advanced modes
2. Verify ASSET_PREFIX handling for staging/production
3. Check memory usage for large page sets
4. Update this documentation
5. Add tests for new features

## Related Files

- `src/components/Layout/LanguageSelector.tsx` - Language selector component
- `src/data/languages/languageData.ts` - Language versions per product
- `gatsby-config.ts` - Asset prefix configuration
- `data/onPostBuild/index.ts` - Post-build hook orchestration

## Questions?

For issues or questions:
1. Check the troubleshooting section above
2. Review JSDOM and Gatsby documentation
3. Examine browser console for client-side behavior
4. Contact the documentation team
