Questions generator & validator #446
Open
kiyro7 wants to merge 21 commits into master from questions_generator
Changes from 3 commits
Commits
- 835b5ce initial commit (kiyro7)
- 39d54cf first prototype (kiyro7)
- d813c88 added LLM questions marker (kiyro7)
- 8e42c8e removed methodology (kiyro7)
- 48ed43c requirements.txt added versions (kiyro7)
- 1bcf046 simplified docker (kiyro7)
- 8a54af1 heuristic patterns update (kiyro7)
- d7a57d7 updated questions ranking and added examples (kiyro7)
- e20a3e0 docker-compose finally done (kiyro7)
- 5694ae7 interactive mode (kiyro7)
- 6ec4877 logging added (kiyro7)
- 0b28da7 logging update (kiyro7)
- 39f7626 docker fix (builds aprox 40 mins) (kiyro7)
- bee9a7a fixed heuristic questions generation (kiyro7)
- a16784b clearing (kiyro7)
- c2df6e4 created static folder (kiyro7)
- 666535d full logs refactor and translation to russian (kiyro7)
- e92b6ac stashed new question generator code for future updates (kiyro7)
- e7e72da docker update - llm & stuff and code separated (kiyro7)
- 21b960b global question generator refactor (kiyro7)
- a89eb4d added instructions (kiyro7)
    @@ -0,0 +1,30 @@
    FROM python:3.10-slim

    # 1. System deps
    RUN apt-get update && apt-get install -y --no-install-recommends \
        git wget gcc g++ \
        libprotobuf-dev protobuf-compiler \
        && rm -rf /var/lib/apt/lists/*

    # 2. Workdir
    WORKDIR /app

    # 3. Python deps
    COPY requirements.txt .
    RUN pip install --no-cache-dir --upgrade pip \
        && pip install --no-cache-dir torch --index-url https://download.pytorch.org/whl/cpu \
        && pip install --no-cache-dir -r requirements.txt

    # 4. NLTK
    RUN python -m nltk.downloader punkt stopwords

    RUN python -m nltk.downloader punkt

kiyro7 marked this conversation as resolved (outdated).

    # 5. Copy local model
    COPY rut5-base/ /app/rut5-base/

    # 6. Copy project
    COPY . .

    # 7. Run
    CMD ["python", "run.py"]
    @@ -0,0 +1,12 @@
    # Running

    ## Downloading the model locally (one-time)
    - `powershell -ExecutionPolicy ByPass -c "irm https://hf.co/cli/install.ps1 | iex"` (Windows)
    - `curl -LsSf https://hf.co/cli/install.sh | bash` (Linux/macOS)
    - `cd app\questions_generator`
    - `hf download cointegrated/rut5-base-multitask --local-dir rut5-base`

kiyro7 marked this conversation as resolved (outdated).

    ## Choosing the thesis (VKR) file
    - in `run.py`, replace the thesis file path in the `main` function
    ## Run (after any changes)
    - `docker build -t vkr-generator .`
    - `docker run -it --rm vkr-generator`
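The README asks for the thesis path to be edited inside `main` in `run.py`. As a possible alternative, here is a minimal sketch that reads the path from the command line. The helper `parse_args` and the constant `DEFAULT_VKR_PATH` are hypothetical names that are not part of this PR; the default value is the path hard-coded in `run.py`.

```python
import argparse
from typing import List, Optional

# Path currently hard-coded in run.py; used here as the fallback value.
DEFAULT_VKR_PATH = "vkr_examples/VKR1.docx"


def parse_args(argv: Optional[List[str]] = None) -> argparse.Namespace:
    """Parse an optional positional path to the thesis .docx file."""
    parser = argparse.ArgumentParser(
        description="Generate questions for a thesis (.docx)."
    )
    parser.add_argument(
        "path",
        nargs="?",  # the argument may be omitted
        default=DEFAULT_VKR_PATH,
        help="path to the thesis .docx file",
    )
    # argv=None makes argparse fall back to sys.argv[1:].
    return parser.parse_args(argv)


args = parse_args(["vkr_examples/other.docx"])
print(args.path)
```

With a helper like this, `main` could call `load_vkr_text(parse_args().path)`, and the path could be passed as `docker run -it --rm vkr-generator python run.py <path>`. The file would still have to be present in the image or mounted into the container.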
    @@ -0,0 +1,151 @@
    import re
    from typing import List

    from nltk.corpus import stopwords
    from nltk.tokenize import sent_tokenize, word_tokenize
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer


    class VkrQuestionGenerator:
        """
        Question generator for the text of a thesis (VKR).
        Uses a hybrid approach: NLTK + rut5-base-multitask.
        """

        def __init__(self, vkr_text: str, model_path: str = "./rut5-base"):
            self.vkr_text = vkr_text
            # The thesis is in Russian, so use the Russian Punkt model.
            self.sentences = sent_tokenize(vkr_text, language="russian")
            self.stopwords = set(stopwords.words("russian"))

            # ---- rut5 model ----
            self.tokenizer = AutoTokenizer.from_pretrained(model_path, use_fast=False)
            self.model = AutoModelForSeq2SeqLM.from_pretrained(model_path)

        # ---------------------------------------------------------
        # --- 1. HEURISTIC: extracting key parts of the thesis  ---
        # ---------------------------------------------------------

        def extract_section(self, title: str) -> str:
            """
            Generic method for extracting a section by its heading.
            """
            # Case-insensitivity is limited to the title via (?i:...); with a
            # global re.IGNORECASE the lookahead would accept a line starting
            # with any letter. A "heading" here is any line that starts with
            # an uppercase letter; \Z also covers the last section.
            pattern = rf"(?i:{re.escape(title)}).*?(?=\n[A-ZА-ЯЁ][^\n]*\n|\Z)"
            m = re.search(pattern, self.vkr_text, re.DOTALL)
            return m.group(0) if m else ""

        def extract_intro(self) -> str:
            return self.extract_section("Введение")  # "Introduction"

        def extract_conclusion(self) -> str:
            return self.extract_section("Заключение")  # "Conclusion"

        def extract_methodology(self) -> str:
            return self.extract_section("Методолог")  # stem of "Methodology"

        # ---------------------------------------------------------
        # --- 2. HEURISTIC: finding key concepts                 ---
        # ---------------------------------------------------------

        def extract_keywords(self, text: str) -> List[str]:
            tokens = word_tokenize(text.lower(), language="russian")
            return [
                t for t in tokens
                if t.isalnum() and t not in self.stopwords and len(t) > 4
            ]

        # ---------------------------------------------------------
        # --- 3. Question generation via rut5 (ask mode)         ---
        # ---------------------------------------------------------

        def llm_generate_question(self, text_fragment: str) -> str:
            """
            Generate a question from a text fragment via the rut5 "ask" task.
            """
            prompt = f"ask: {text_fragment}"
            enc = self.tokenizer(prompt, return_tensors="pt", truncation=True)
            out = self.model.generate(
                **enc,
                max_length=64,
                num_beams=5,
                early_stopping=True
            )
            return self.tokenizer.decode(out[0], skip_special_tokens=True)

        # ---------------------------------------------------------
        # --- 4. HEURISTIC TEMPLATES (from the document)         ---
        # ---------------------------------------------------------

        def heuristic_questions(self) -> List[str]:
            """
            Generate questions using the heuristics from the uploaded PDFs.
            The templates stay in Russian because the theses are in Russian.
            """
            intro = self.extract_intro()
            conc = self.extract_conclusion()
            meth = self.extract_methodology()
            keywords = self.extract_keywords(self.vkr_text)

            q = []

            # --- Links between sections ---
            if intro and conc:
                # "How do the tasks stated in the introduction relate to the conclusions?"
                q.append("Как сформулированные во введении задачи связаны с выводами работы?")

            # --- Methodology ---
            if meth:
                for kw in keywords[:3]:
                    # "Why was the method {kw} chosen and where is it applied in the work?"
                    q.append(f"Почему был выбран метод {kw} и где он применён в работе?")

            # --- Conclusions ---
            if conc:
                # "What data is the key finding in the conclusion based on?"
                q.append("На основании каких данных был сделан ключевой вывод в заключении?")

            # --- General questions (from the document) ---
            q.extend([
                # "Are there open-source analogues of the solutions mentioned?"
                "Есть ли опенсорс аналоги упомянутых решений?",
                # "What is the practical significance of the presented method?"
                "В чем практическая значимость представленного метода?",
                # "What limitations does the developed approach have?"
                "Какие ограничения имеет разработанный подход?",
                # "For which additional tasks can the results be applied?"
                "Для каких дополнительных задач можно применить полученные результаты?",
            ])

            return q

        # ---------------------------------------------------------
        # --- 5. Hybrid generation: LLM + heuristics             ---
        # ---------------------------------------------------------

        def generate_llm_questions(self, count: int = 5) -> List[str]:
            """
            Generate N questions via rut5 from key fragments of the document.
            """
            q = []
            fragments = self.sentences[:40]  # first ~40 sentences for context

            step = max(1, len(fragments) // count)

            for i in range(0, len(fragments), step):
                frag = fragments[i]
                try:
                    llm_q = self.llm_generate_question(frag)
                    if len(llm_q) > 10:
                        q.append(llm_q)
                except Exception:
                    # Skip fragments the model fails on, but do not swallow
                    # KeyboardInterrupt or SystemExit as a bare except would.
                    continue

                if len(q) >= count:
                    break

            return q

        # ---------------------------------------------------------
        # --- 6. Main method                                     ---
        # ---------------------------------------------------------

        def generate_all(self) -> List[str]:
            """
            Generates the full set of questions:
            - heuristic
            - model-based (LLM)
            """
            result = []
            result.extend(self.heuristic_questions())
            # Marker: "Start of the rut5-base-multitask questions"
            result.append("Начало rut5-base-multitask вопросов")
            result.extend(self.generate_llm_questions(count=10))
            return list(dict.fromkeys(result))  # remove duplicates, keep order
kiyro7 marked this conversation as resolved.
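Three small pieces of the generator can be tried without loading the model: section extraction by heading, picking evenly spaced sentences, and order-preserving deduplication. The sketch below is a standalone illustration, not the PR's code. The function names `pick_fragments` and `dedupe` and the sample `TOY_TEXT` are hypothetical, and scoping the case-insensitive match to the title with `(?i:...)` is an assumption about the intended behavior.

```python
import re
from typing import List


def extract_section(text: str, title: str) -> str:
    """Return the text from `title` up to the next heading-like line."""
    # (?i:...) keeps case-insensitivity local to the title, so the lookahead
    # still requires a following line that starts with an uppercase letter.
    pattern = rf"(?i:{re.escape(title)}).*?(?=\n[A-ZА-ЯЁ][^\n]*\n|\Z)"
    match = re.search(pattern, text, re.DOTALL)
    return match.group(0) if match else ""


def pick_fragments(sentences: List[str], count: int, window: int = 40) -> List[str]:
    """Take up to `count` evenly spaced sentences from the first `window`."""
    fragments = sentences[:window]
    step = max(1, len(fragments) // count)
    return fragments[::step][:count]


def dedupe(items: List[str]) -> List[str]:
    """Drop repeats while keeping first-seen order (dicts keep insertion order)."""
    return list(dict.fromkeys(items))


# Toy thesis text: headings start with an uppercase letter, body lines do not.
TOY_TEXT = (
    "Введение\n"
    "цель работы: построить генератор.\n"
    "Методология\n"
    "используется модель rut5.\n"
    "Заключение\n"
    "генератор построен."
)

print(extract_section(TOY_TEXT, "введение"))
print(pick_fragments([f"s{i}" for i in range(100)], count=10))
print(dedupe(["a", "b", "a"]))
```

Note that the heading test is only as good as the input: in a real thesis most paragraphs start with an uppercase letter, so a stricter heading pattern would likely be needed.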
    @@ -0,0 +1,5 @@
    transformers
    sentencepiece
    nltk
    huggingface_hub
    python-docx
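Since `huggingface_hub` is already listed as a dependency, the one-time model download could in principle also be scripted in Python instead of using the `hf` CLI. The following is a minimal sketch under that assumption. The helper name `download_model` is hypothetical; the repository id and target directory are the ones used in the README.

```python
from pathlib import Path


def download_model(
    repo_id: str = "cointegrated/rut5-base-multitask",
    local_dir: str = "rut5-base",
) -> Path:
    """Download the model snapshot into `local_dir` and return its path."""
    # Imported lazily so that defining this helper does not itself require
    # huggingface_hub to be installed.
    from huggingface_hub import snapshot_download

    # snapshot_download fetches all files of the repository and returns
    # the folder they were written to.
    return Path(snapshot_download(repo_id=repo_id, local_dir=local_dir))
```

Calling `download_model()` from the directory that holds the Dockerfile would place the files where `COPY rut5-base/ /app/rut5-base/` expects them. It needs network access and downloads the full model, so it is meant to be run once, outside the image build.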
    @@ -0,0 +1,58 @@
    import os
    import sys

    import nltk
    from docx import Document

    from generator import VkrQuestionGenerator
    from validator import VkrQuestionValidator


    def load_vkr_text(path: str) -> str:
        """Read a .docx thesis and return its paragraphs joined by newlines."""
        if not os.path.exists(path):
            print(f"[ERROR] File '{path}' not found.")
            sys.exit(1)

        document = Document(path)
        return '\n'.join(paragraph.text for paragraph in document.paragraphs)


    def main():
        # Download the NLTK tokenizer data on first run if it is missing.
        try:
            nltk.data.find('tokenizers/punkt_tab/english')
        except LookupError:
            print("Downloading required NLTK data...")
            nltk.download('punkt_tab')

        print("=== Loading thesis text ===")
        text = load_vkr_text("vkr_examples/VKR1.docx")

        print("=== Initializing generator ===")
        gen = VkrQuestionGenerator(text, model_path="/app/rut5-base")

        print("=== Initializing validator ===")
        validator = VkrQuestionValidator(text)

        print("=== Generating questions ===")
        questions = gen.generate_all()

        print("\n=== Results ===")
        for q in questions:
            rel = validator.check_relevance(q)
            clr = validator.check_clarity(q)
            diff = validator.check_difficulty(q)

            # A question passes only if all three checks succeed.
            status = "✔ OK" if (rel and clr and diff) else "✖ FAIL"

            print(f"\n[{status}] {q}")
            print(f"  - relevance: {rel}")
            print(f"  - clarity: {clr}")
            print(f"  - difficulty: {diff}")

        print("\n=== Done ===")


    if __name__ == "__main__":
        main()
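The validator module is not visible in this view, so the only thing known about `VkrQuestionValidator` is the three methods that `main` calls. The sketch below records that interface as a `Protocol` and applies the same all-three-checks rule to split questions into passed and failed. Everything in it is an assumption for illustration: the boolean return types, the names `QuestionValidator`, `split_by_status` and `ToyValidator`, and the toy checks, which say nothing about how the real validator works.

```python
from typing import Iterable, List, Protocol, Tuple


class QuestionValidator(Protocol):
    """Interface inferred from the calls in main(); return types are assumed."""

    def check_relevance(self, question: str) -> bool: ...
    def check_clarity(self, question: str) -> bool: ...
    def check_difficulty(self, question: str) -> bool: ...


def split_by_status(
    questions: Iterable[str], validator: QuestionValidator
) -> Tuple[List[str], List[str]]:
    """Return (passed, failed); a question passes only if all checks succeed."""
    passed: List[str] = []
    failed: List[str] = []
    for q in questions:
        ok = (
            validator.check_relevance(q)
            and validator.check_clarity(q)
            and validator.check_difficulty(q)
        )
        (passed if ok else failed).append(q)
    return passed, failed


class ToyValidator:
    """Stand-in used only to exercise split_by_status; not the PR's validator."""

    def check_relevance(self, question: str) -> bool:
        return "rut5" in question

    def check_clarity(self, question: str) -> bool:
        return question.endswith("?")

    def check_difficulty(self, question: str) -> bool:
        return True


passed, failed = split_by_status(
    ["Why rut5?", "Why NLTK?", "rut5 rocks"], ToyValidator()
)
print(passed, failed)
```

Separating the pass/fail decision from printing in this way would make the rule easy to unit-test without a model or a `.docx` file.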