# 🎓 Lecture Voice-to-Notes

An intelligent AI-powered educational tool that transforms lecture audio into comprehensive study materials, with automatic transcription, summarization, quiz generation, and flashcard creation. End to end, it:

- Transcribes lecture audio (upload MP3/WAV or live microphone) using Whisper, with automatic backend selection (faster-whisper → transformers → openai-whisper fallback).
- Summarizes long transcripts into structured notes (adjustable length) using T5 or BART.
- Generates quizzes (MCQs) and flashcards (JSON export) via keyword extraction (KeyBERT, with a spaCy fallback).
- Exports PDF, DOCX, and flashcards JSON.
- Runs as a simple Streamlit web UI (tabs: Upload Mode, Live Mode, Results & Export), ready for deployment on Hugging Face Spaces or Streamlit Community Cloud.

## 📑 Table of Contents

- Overview
- Key Features
- Tech Stack
- System Architecture
- Installation
- Usage
- Project Structure
- Performance Optimization
- Troubleshooting
- License

### Highlights

- Automatic language detection (Whisper)
- Adjustable summarization detail: short / medium / detailed
- Continuous pseudo-streaming transcription (buffered time chunks with incremental diff updates)
- Chunked parallel transcription for large files (configurable chunk & overlap seconds)
- Optional simple energy-based VAD (skips silent spans when chunking)
- Progress bar during long transcription jobs
- Mixed precision (float16) on GPU paths when available
- Caching of embeddings & QA answers to accelerate repeated MCQ / flashcard regeneration
- MCQ (5) and flashcard (5) generation per run
- Download buttons for PDF, DOCX, and JSON
- Graceful fallbacks if KeyBERT or spaCy are missing
---

## 📖 Overview

An end-to-end AI EdTech solution that converts lecture audio into structured study materials:

1. **🎙️ Transcription** - Upload audio files (MP3/WAV) or record live lectures with automatic speech recognition
2. **📝 Summarization** - Generate concise, structured notes with adjustable detail levels
3. **❓ Quiz Generation** - Create multiple-choice questions with semantic distractor generation
4. **📇 Flashcard Creation** - Automatically generate study flashcards with context-aware definitions
5. **📤 Multi-format Export** - Export to PDF, DOCX, and JSON formats

---
## ✨ Key Features

### 🎙️ **Advanced Transcription**
- **Multi-backend ASR**: Automatic fallback chain (Faster-Whisper → Transformers → OpenAI-Whisper), sketched below
- **GPU Acceleration**: CUDA support with float16 mixed precision
- **Parallel Processing**: Chunked transcription with configurable overlap for accuracy at chunk boundaries
- **Voice Activity Detection (VAD)**: Skips silent segments to reduce compute time
- **Live Recording**: Real-time microphone capture with buffered streaming (the rolling buffer is re-transcribed each chunk, so this is not true streaming)
- **Language Auto-detection**: Automatically identifies the spoken language
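
A minimal sketch of the fallback chain, assuming the three backends are interchangeable for a given model size (the real selection logic lives in `utils/transcribe.py` and may differ in names and details):

```python
# Illustrative backend selection: try the fastest backend first and fall
# back on ImportError. Function and variable names are assumptions.
def load_asr_backend(model_size: str = "base"):
    try:
        from faster_whisper import WhisperModel  # optimized CTranslate2 backend
        return "faster-whisper", WhisperModel(model_size)
    except ImportError:
        pass
    try:
        from transformers import pipeline  # Hugging Face ASR pipeline
        return "transformers", pipeline(
            "automatic-speech-recognition",
            model=f"openai/whisper-{model_size}",
        )
    except ImportError:
        pass
    import whisper  # openai-whisper as the last resort
    return "openai-whisper", whisper.load_model(model_size)
```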
### 📝 **Intelligent Summarization**
- **Adjustable Length**: Short, Medium, or Detailed summaries
- **Chunked Processing**: Handles long transcripts (30+ minutes); see the sketch below
- **Structured Notes**: Optional topic segmentation with semantic embeddings
- **Model Selection**: T5-small/base with BART fallback
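
Chunked summarization can be sketched as follows, assuming a character-based splitter (the real chunking in `utils/summarize.py` may split on sentence or token boundaries instead):

```python
# Illustrative map-reduce summarization over fixed-size chunks.
from transformers import pipeline

summarizer = pipeline("summarization", model="t5-small")

def summarize_long(text: str, chunk_chars: int = 2000) -> str:
    chunks = [text[i:i + chunk_chars] for i in range(0, len(text), chunk_chars)]
    partials = [
        summarizer(chunk, max_length=120, min_length=30)[0]["summary_text"]
        for chunk in chunks
    ]
    return " ".join(partials)
```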
### 🧠 **Smart Quiz & Flashcard Generation**
- **Semantic Understanding**: Uses Sentence-Transformers for context-aware content (distractor selection is sketched below)
- **Keyword Extraction**: Stop-word filtering, NER prioritization, length-based ranking
- **MCQ Generation**: Unique distractor validation; prevents answer duplication
- **Flashcard Definitions**: 3-tier strategy (QA model → sentence extraction → context fallback)
- **Answer Validation**: Ensures logical consistency
- **Caching**: Embedding & QA answer caching for faster regeneration
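
A sketch of semantic distractor selection, assuming candidate keywords are already extracted (`utils/quiz.py` owns the real validation and caching):

```python
# Illustrative: rank distractor candidates by embedding similarity to the
# correct answer, excluding the answer itself to prevent duplication.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

def pick_distractors(answer: str, candidates: list[str], k: int = 3) -> list[str]:
    pool = [c for c in candidates if c.lower() != answer.lower()]
    scores = util.cos_sim(model.encode(answer), model.encode(pool))[0]
    ranked = sorted(zip(pool, scores.tolist()), key=lambda pair: -pair[1])
    return [term for term, _ in ranked[:k]]
```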
### 🎨 **Enhanced User Interface**
- **Card-based Display**: Visual flashcard layout with color-coded sections
- **Interactive MCQs**: Option labels (A/B/C/D), expandable answers, correct-answer highlighting
- **Progress Tracking**: Real-time progress bars during transcription
- **Responsive Design**: Clean, intuitive Streamlit interface
### 📤 **Flexible Export Options**
- **PDF**: Professionally formatted notes with ReportLab
- **DOCX**: Microsoft Word-compatible documents
- **JSON**: Flashcard export (Anki-compatible format); a minimal sketch follows
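
The JSON builder is the simplest of the three exporters; a rough sketch of its shape (the real builders live in `utils/exporters.py`):

```python
# Illustrative flashcards export: each card maps to Front/Back in Anki.
import json

def export_flashcards(cards: list[dict], path: str = "flashcards.json") -> None:
    # cards look like [{"term": ..., "definition": ...}, ...]
    with open(path, "w", encoding="utf-8") as f:
        json.dump(cards, f, ensure_ascii=False, indent=2)
```

---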
## 🛠️ Tech Stack

### **Core Technologies**

| Component | Technology | Purpose |
|-----------|------------|---------|
| **Frontend** | Streamlit | Interactive web UI |
| **Backend** | Python 3.13 | Core application logic |
| **Deep Learning** | PyTorch 2.7+ (CUDA 11.8) | GPU acceleration |
### **AI/ML Models**

| Model | Library | Use Case |
|-------|---------|----------|
| **Whisper** | Faster-Whisper / Transformers / OpenAI-Whisper | Speech-to-text transcription |
| **T5-small** | Hugging Face Transformers | Text summarization |
| **DistilBERT** | Hugging Face Transformers | Question answering for definitions |
| **Sentence-BERT** | Sentence-Transformers | Semantic embeddings & similarity |
| **KeyBERT** | KeyBERT | Keyword extraction |
| **spaCy** | spaCy (en_core_web_sm) | Named entity recognition |
### **Key Libraries**

```
streamlit              # Web UI framework
torch                  # Deep learning backend
transformers           # Hugging Face models
faster-whisper         # Optimized Whisper inference
sentence-transformers  # Semantic embeddings
keybert                # Keyword extraction
spacy                  # NLP processing
sounddevice            # Audio recording
reportlab              # PDF generation
python-docx            # DOCX generation
numpy                  # Numerical computing
```
---
## 🏗️ System Architecture

### **Processing Pipeline**

```
┌──────────────────────────────┐
│         Audio Input          │
│        (File / Live)         │
└──────────────┬───────────────┘
               ▼
┌──────────────────────────────┐
│        Transcription         │
│  - Faster-Whisper (GPU)      │
│  - Chunked parallel process  │
│  - VAD filtering             │
└──────────────┬───────────────┘
               ▼
┌──────────────────────────────┐
│        Summarization         │
│  - T5/BART models            │
│  - Chunked processing        │
│  - Structured notes          │
└──────────────┬───────────────┘
               ▼
┌──────────────────────────────┐
│       Quiz/Flashcards        │
│  - KeyBERT keywords          │
│  - Semantic MCQs             │
│  - QA definitions            │
└──────────────┬───────────────┘
               ▼
┌──────────────────────────────┐
│        Export Outputs        │
│        PDF/DOCX/JSON         │
└──────────────────────────────┘
```
---
## 📦 Installation
### **Prerequisites**
- Python 3.13+
- NVIDIA GPU (optional, but recommended for a 5-10x speed improvement)
- CUDA 11.8+ (for GPU support)
- 4GB+ RAM (8GB+ recommended)
---
### **Step 1: Clone Repository**
```bash
git clone https://github.com/yourusername/lecture-voice-to-notes.git
cd lecture-voice-to-notes
```
### **Step 2: Create Virtual Environment**
```bash
python -m venv .venv
# Windows
.\.venv\Scripts\Activate.ps1
# Linux/Mac
source .venv/bin/activate
```
### **Step 3: Install Dependencies**
**For CPU-only:**
```bash
pip install -r requirements.txt
```
**For GPU (NVIDIA CUDA 11.8):**
```bash
# Install PyTorch with CUDA support
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
# Install remaining dependencies
pip install -r requirements.txt
```
**Verify GPU Installation:**
```bash
python -c "import torch; print('CUDA available:', torch.cuda.is_available())"
# Expected output: CUDA available: True
```
### **Step 4: Download spaCy Model (Optional)**
```bash
python -m spacy download en_core_web_sm
```
> Improves keyword extraction and named entity recognition
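
If KeyBERT or the spaCy model is missing, keyword extraction degrades gracefully. A sketch of the fallback chain under that assumption (the real logic lives in `utils/quiz.py`):

```python
# Illustrative keyword extraction: KeyBERT → spaCy NER → word frequency.
def extract_keywords(text: str, top_n: int = 10) -> list[str]:
    try:
        from keybert import KeyBERT
        pairs = KeyBERT().extract_keywords(text, top_n=top_n)
        return [kw for kw, _score in pairs]
    except ImportError:
        pass
    try:
        import spacy
        nlp = spacy.load("en_core_web_sm")
        ents = [ent.text for ent in nlp(text).ents]
        if ents:
            return ents[:top_n]
    except (ImportError, OSError):  # OSError: model not downloaded
        pass
    # Last resort: most frequent longer words
    from collections import Counter
    words = [w for w in text.split() if len(w) > 5]
    return [w for w, _ in Counter(words).most_common(top_n)]
```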
### **Step 5: Run Application**
```bash
streamlit run app.py
```
> Opens browser at `http://localhost:8501`
---
## 🚀 Usage
### **1. Upload Mode (Audio Files)**
1. Navigate to **"Upload Audio"** tab
2. Upload an audio file (MP3, WAV, M4A, FLAC)
3. Wait for transcription to complete (progress bar shown)
4. Click **"Process Transcript"** to generate summary, MCQs, and flashcards
### **2. Live Recording Mode**
1. Navigate to **"Live Recording"** tab
2. Click **"Start Live Recording"**
3. Speak into your microphone (audio is captured in 5-second chunks; see the sketch below)
4. Click **"Stop Recording"**
5. Click **"Process Live Transcript"**
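
A rough sketch of the buffered capture, assuming `sounddevice` and 16 kHz mono input (the Streamlit UI drives the real loop in `utils/transcribe.py`):

```python
# Illustrative 5-second buffered microphone capture.
import numpy as np
import sounddevice as sd

SAMPLE_RATE = 16000   # Whisper models expect 16 kHz mono audio
CHUNK_SECONDS = 5     # matches the capture interval described above

def record_chunk() -> np.ndarray:
    frames = sd.rec(int(CHUNK_SECONDS * SAMPLE_RATE),
                    samplerate=SAMPLE_RATE, channels=1, dtype="float32")
    sd.wait()  # block until the chunk is fully recorded
    return frames.squeeze()
```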
### **3. View Results**
Navigate to **"Results & Export"** tab to see:
- 📄 **Full Transcript** - Complete transcribed text
- 📝 **Summary** - Structured notes (adjustable length in sidebar)
- ❓ **MCQs** - Multiple choice questions with expandable answers
- 📇 **Flashcards** - Term/definition pairs with visual card layout
### **4. Export Options**
- **📄 Download PDF** - Professionally formatted document
- **📝 Download DOCX** - Microsoft Word document
- **📇 Download Flashcards JSON** - Anki-compatible format (see format below)
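
The flashcards JSON has this shape (import mapping for Anki: term → Front, definition → Back):

```json
[
  { "term": "Backpropagation", "definition": "First sentence in transcript mentioning Backpropagation" }
]
```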
### **5. Configuration (Sidebar)**
- **Model Size**: tiny / base / small / medium / large
- **Summarization Length**: Short / Medium / Detailed
- **Chunking**: Enable parallel processing (recommended for >5 min audio)
- **VAD**: Skip silent segments
- **Fast Mode**: Trade quality for speed (distilled summarizer, smaller ASR model hint, reduced beam size)
- **Warmup Models**: Pre-load models to reduce first-run latency
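
Optional environment variables override the defaults picked up by the app:

```bash
# Windows
set WHISPER_MODEL_SIZE=base
set USE_FAST_WHISPER=1

# Linux/Mac
export WHISPER_MODEL_SIZE=base
export USE_FAST_WHISPER=1
```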
---
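## ☁️ Deployment

### **Hugging Face Spaces (Recommended)**
1. Create a new Space (SDK: Streamlit).
2. Upload the repo files (or connect Git). Ensure `requirements.txt` is present.
3. The first build will download the models.

### **Streamlit Community Cloud**
1. Create a new app pointing to `app.py`.
2. Add `requirements.txt`.
3. Deploy; models are cached after the first run.

### **Caching & Model Size**
`openai/whisper-small` is ~1.4 GB, so ensure the environment has sufficient storage. If resource-constrained, swap to `openai/whisper-base` in `utils/transcribe.py`.

---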
## 📁 Project Structure
```
Voice/
├── app.py                 # Main Streamlit application
├── requirements.txt       # Python dependencies
├── README.md              # Project documentation
├── .gitignore             # Git exclusion rules
│
├── utils/                 # Core modules
│   ├── __init__.py
│   ├── transcribe.py      # Whisper ASR backends (faster-whisper/transformers/openai-whisper)
│   ├── summarize.py       # T5/BART summarization with chunking
│   ├── quiz.py            # Keyword extraction, MCQ generation, flashcard creation
│   ├── exporters.py       # PDF/DOCX/JSON export utilities
│   ├── structure.py       # Topic segmentation & structured notes
│   └── models.py          # Model loading & caching
│
└── .venv/                 # Virtual environment (excluded from Git)
```
### **File Responsibilities**
| File | Lines | Purpose |
|------|-------|---------|
| `app.py` | ~400 | Streamlit UI, session state management, tab navigation |
| `utils/transcribe.py` | ~380 | ASR model loading, file/live transcription, chunked parallel processing |
| `utils/summarize.py` | ~200 | Chunked summarization, model selection, structured notes |
| `utils/quiz.py` | ~280 | Keyword extraction, semantic MCQ generation, 3-tier flashcard definitions |
| `utils/exporters.py` | ~150 | PDF/DOCX formatting, JSON export |
| `utils/structure.py` | ~120 | Topic segmentation using sentence embeddings |
| `utils/models.py` | ~80 | Shared model loading, caching utilities |
---
## ⚡ Performance Optimization
### **GPU Acceleration**
✅ **Enabled** - RTX 4060 / 3060 / A100 / V100 / etc. Mixed precision (float16) is applied automatically on the transformers path (torch autocast); faster-whisper already uses efficient kernels.
**Expected Speed Improvements:**
| Audio Length | CPU (base) | GPU (small) | Speedup |
|--------------|------------|-------------|---------|
| 5 minutes | ~180s | ~25s | **7.2x** |
| 15 minutes | ~540s | ~70s | **7.7x** |
| 30 minutes | ~1080s | ~140s | **7.7x** |
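
A sketch of the mixed-precision path, assuming a callable `pipe` from the transformers ASR pipeline (the real wiring is in `utils/transcribe.py`):

```python
# Illustrative: wrap inference in torch autocast on CUDA for float16 speedups.
import torch

def run_asr(pipe, audio_path: str):
    if torch.cuda.is_available():
        with torch.autocast("cuda", dtype=torch.float16):
            return pipe(audio_path)
    return pipe(audio_path)
```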
### **Optimization Strategies**
#### **For Speed**
```text
# Sidebar Settings
Model Size: base or tiny
Fast Mode: Enabled
Beam Size: 1
Chunking: Enabled (30s chunks, 4s overlap)
VAD: Enabled
Parallel Workers: 4-8 (based on CPU cores)
```
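
The VAD referenced above is a simple energy gate; a sketch with an assumed threshold and frame shape (see `utils/transcribe.py` for the real filter):

```python
# Illustrative energy-based VAD: frames below the RMS threshold are treated
# as silence and skipped during chunked transcription.
import numpy as np

def is_speech(frame: np.ndarray, threshold: float = 0.01) -> bool:
    rms = float(np.sqrt(np.mean(np.square(frame))))
    return rms > threshold
```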
#### **For Accuracy**
```text
# Sidebar Settings
Model Size: small or medium
Fast Mode: Disabled
Beam Size: 5
Chunking: Enabled (60s chunks, 8s overlap)
VAD: Disabled
Structured Notes: Enabled
```
#### **For Memory Efficiency**
```text
# Sidebar Settings
Model Size: tiny
Chunking: Enabled (15s chunks, 2s overlap)
Structured Notes: Disabled
```
### **Caching Mechanisms**
- ✅ **Embedding Cache**: Stores sentence embeddings for MCQ distractor generation in memory in `utils/quiz.py` (sketched below)
- ✅ **QA Answer Cache**: Caches DistilBERT answers for flashcard definitions
- ✅ **Model Cache**: Keeps loaded models in memory across sessions
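
A minimal sketch of the embedding cache, assuming simple memoization (the real cache in `utils/quiz.py` may key and evict differently):

```python
# Illustrative: memoize sentence embeddings so MCQ/flashcard reruns skip
# re-encoding keywords they have already seen.
from functools import lru_cache
from sentence_transformers import SentenceTransformer

_model = SentenceTransformer("all-MiniLM-L6-v2")

@lru_cache(maxsize=4096)
def embed(text: str):
    return _model.encode(text)
```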
---
## 🔧 Troubleshooting
### **Common Issues**
#### **1. GPU Not Detected**
**Problem:** `Device set to use cpu`
**Solution:**
```bash
# Check GPU availability
python -c "import torch; print(torch.cuda.is_available())"
# If False, reinstall PyTorch with CUDA
pip uninstall torch torchvision torchaudio -y
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
```
#### **2. Poor Quiz/Flashcard Quality**
**Problem:** Generic questions, weak definitions
**Causes:**
- Audio too short (<5 minutes) or lacks educational content
- Generic conversational audio (not lecture content)
**Solutions:**
- Use 15-30 minute educational lectures
- Ensure audio has technical terms and concepts
- Check keyword extraction output (should have domain-specific terms)
#### **3. Slow Transcription**
**Problem:** Takes >10 minutes for 5-minute audio
**Solutions:**
```text
# Enable optimizations in sidebar:
✅ Fast Mode
✅ Chunking (30s, 4s overlap)
✅ VAD enabled
✅ Model size: base or tiny
✅ Beam size: 1
✅ Parallel workers: 4+
```
#### **4. Memory Errors**
**Problem:** `CUDA out of memory` or `RAM exceeded`
**Solutions:**
- Reduce model size: `small` → `base` → `tiny`
- Enable chunking with smaller chunk size (15-30s)
- Disable structured notes
- Close other GPU-intensive applications
- For extremely long lectures, split the audio externally; memory use grows with transcript size
#### **5. No Keywords Found**
**Problem:** MCQs/flashcards generation fails
**Solutions:**
```bash
# Install optional dependencies
pip install keybert
python -m spacy download en_core_web_sm
```
#### **6. Microphone Not Working**
**Problem:** Live recording fails
**Solutions:**
- Check microphone permissions (Windows Settings → Privacy)
- Try a different audio input device (sidebar dropdown)
- Recording uses `sounddevice`; if `pyaudio` fails to install on Windows, it is not required and can be removed from `requirements.txt`
- Install the audio library if missing:
```bash
pip install sounddevice
```
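#### **7. BytesIO Input Error**
**Problem:** `We expect a numpy ndarray or torch tensor as input, got <class '_io.BytesIO'>`
**Solution:** Fixed in the latest version: uploaded audio is written to a temp file before calling ASR. If the error persists, ensure `transcribe_file` uses the temp-path logic.
#### **8. Large PDF Truncated**
**Problem:** Long notes are cut off in the PDF export
**Solution:** The summary is written first; consider generating shorter summaries.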
### **Performance Benchmarks**
**Test System:** RTX 4060 Laptop GPU, 16GB RAM, i7-12700H
| Configuration | 5-min Audio | 15-min Audio | 30-min Audio |
|---------------|-------------|--------------|--------------|
| **CPU (base)** | 180s | 540s | 1080s |
| **GPU (small, beam=1)** | 25s | 70s | 140s |
| **GPU (small, beam=5)** | 40s | 115s | 230s |
| **GPU (small, chunked+VAD)** | 20s | 55s | 110s |
### **Quality Validation**
**Recommended Test:**
1. Use 15-30 minute educational lecture (e.g., Khan Academy, MIT OpenCourseWare)
2. Check keyword extraction: Should have 10+ domain-specific terms
3. Validate MCQs: Questions should be answerable from transcript
4. Review flashcards: Definitions should be contextually accurate
---
## 🎨 Recent Improvements
### **v2.0 - UI Enhancement (Oct 2025)**
- ✨ Card-based flashcard display with visual hierarchy
- ✨ Interactive MCQ layout with expandable answers
- ✨ Color-coded correct answers (green checkmarks)
- ✨ A/B/C/D option labeling for quiz questions
### **v1.9 - Quality Improvements (Oct 2025)**
- ✨ Enhanced keyword extraction (stop words, NER prioritization)
- ✨ Improved MCQ distractor validation (unique answers)
- ✨ 3-tier flashcard definition strategy (QA → sentence → context)
- ✨ Answer caching for faster regeneration
### **v1.8 - Bug Fixes (Oct 2025)**
- 🐛 Fixed tensor stacking error in semantic distractor generation
- 🐛 Simplified embedding cache logic (removed index errors)
- 🐛 Improved error handling for edge cases
### **v1.7 - GPU Support (Oct 2025)**
- ⚡ Added CUDA 11.8 support for PyTorch
- ⚡ Enabled float16 mixed precision on GPU
- ⚡ 7-10x speed improvement with GPU acceleration
---
## 🎓 Use Cases
### **Students**
- 📚 Convert recorded lectures into study notes
- ❓ Generate practice quizzes for exam preparation
- 📇 Create flashcards for spaced-repetition learning
- 📄 Export notes to PDF for offline studying
### **Educators**
- 📝 Generate supplementary materials from lecture recordings
- ✍️ Create quiz banks from course content
- 📤 Provide students with structured study guides
- 🔍 Quickly review and summarize long lectures
### **Content Creators**
- 🎙️ Transcribe podcast episodes
- 📝 Generate show notes and summaries
- 🔍 Extract key topics and concepts
- 📋 Create episode guides for audiences
### **Researchers**
- 🎤 Transcribe conference talks and seminars
- 📝 Summarize research presentations
- 🔍 Extract key findings and methodologies
- 📇 Create reference flashcards for literature review
---
## 🔮 Future Enhancements
- [ ] True streaming ASR with partial segment updates
- [ ] Multi-language support (Spanish, French, German, etc.)
- [ ] User authentication and session persistence
- [ ] Cloud storage integration (Google Drive, Dropbox)
- [ ] Collaborative note-sharing features
- [ ] Advanced theming and customization options
- [ ] Mobile app (React Native / Flutter)
- [ ] Integration with Notion, Obsidian, Roam Research
- [ ] Diagram/image extraction from lecture slides
- [ ] Video lecture support (YouTube, Zoom recordings)
---
## 🤝 Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
1. Fork the repository
2. Create your feature branch (`git checkout -b feature/AmazingFeature`)
3. Commit your changes (`git commit -m 'Add some AmazingFeature'`)
4. Push to the branch (`git push origin feature/AmazingFeature`)
5. Open a Pull Request
---
## 📄 License
This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.
---
## 🙏 Acknowledgments
- **OpenAI Whisper** - Robust speech recognition models
- **Hugging Face** - Transformers library and model hub
- **Streamlit** - Elegant web app framework
- **KeyBERT** - Efficient keyword extraction
- **spaCy** - Industrial-strength NLP
---
## 📧 Contact
For questions, suggestions, or issues, please open an issue on GitHub.
---
**Made with ❤️ by the Lecture Voice-to-Notes Team**

**⭐ If you find this project useful, please consider giving it a star!**