Skip to content

Conversation

@doggy8088
Copy link
Contributor

Purpose

This pull request introduces support for translating Jupyter Notebook (.ipynb) files in the co_op_translator project.

Description

The changes include adding a new JupyterNotebookTranslator class, integrating it into the project translation workflow, and updating relevant configurations and methods to handle notebook files.

Related Issue

Fixes #51

Does this introduce a breaking change?

When developers merge from main and run the server, azd up, or azd deploy, will this produce an error?
If you're not sure, try it out on an old environment.

  • Yes
  • No

Type of change

  • Bugfix
  • Feature
  • Code style update (e.g., formatting, local variables)
  • Refactoring (no functional or API changes)
  • Documentation content changes
  • Other... Please describe:

Checklist

Before submitting your pull request, please confirm the following:

  • I have thoroughly tested my changes: I confirm that I have run the code and manually tested all affected areas.
  • All existing tests pass: I have run all tests and confirmed that nothing is broken.
  • I have added new tests (if applicable): I have written tests that cover the new functionality introduced by my code changes.
  • I have followed the Co-op Translators coding conventions: My code adheres to the style guide and coding conventions outlined in the repository.
  • I have documented my changes (if applicable): I have updated the documentation to reflect the changes where necessary.

Additional context

Notebook Translation Support:

  • New JupyterNotebookTranslator class: Added a dedicated class in src/co_op_translator/core/llm/jupyter_notebook_translator.py to handle the translation of Jupyter Notebook files. It extracts markdown cells for translation while preserving code cells and metadata.
  • Constants update: Introduced SUPPORTED_NOTEBOOK_EXTENSIONS in src/co_op_translator/config/constants.py to define supported notebook file extensions.

Integration into Translation Workflow:

  • Core module updates:

    • Imported JupyterNotebookTranslator in src/co_op_translator/core/llm/__init__.py and added it to the __all__ list.
    • Updated src/co_op_translator/core/project/project_translator.py to initialize and use the notebook translator in the ProjectTranslator class.
  • Translation manager enhancements:

    • Integrated notebook translation into the TranslationManager class in src/co_op_translator/core/project/translation_manager.py. Added methods translate_notebook and translate_all_notebook_files to handle individual and bulk notebook translations.
    • Updated translate_project_async to include notebook translation when processing markdown files.
    • Modified get_outdated_translations to account for outdated notebook translations.

These changes collectively enable seamless translation of Jupyter Notebook files within the existing project translation workflow.

Copilot AI review requested due to automatic review settings June 14, 2025 17:10
@github-actions github-actions bot added core Related to any changes in core source files tests labels Jun 14, 2025
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR adds support for translating Jupyter Notebook (.ipynb) files by introducing a JupyterNotebookTranslator and weaving it into the existing translation workflow.

  • Added JupyterNotebookTranslator to extract and translate markdown cells while preserving code and metadata.
  • Introduced SUPPORTED_NOTEBOOK_EXTENSIONS and integrated notebook handling in both ProjectTranslator and TranslationManager.
  • Updated async workflows to include notebook translation alongside markdown.

Reviewed Changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
tests/co_op_translator/core/project/test_jupyter_notebook_integration.py New integration tests for notebook translation
tests/co_op_translator/core/llm/test_jupyter_notebook_translator.py New unit tests for JupyterNotebookTranslator
src/co_op_translator/config/constants.py Defined SUPPORTED_NOTEBOOK_EXTENSIONS
src/co_op_translator/core/llm/jupyter_notebook_translator.py Implemented notebook translation logic
src/co_op_translator/core/llm/init.py Exposed JupyterNotebookTranslator in __all__
src/co_op_translator/core/project/project_translator.py Initialized notebook translator and updated workflow
src/co_op_translator/core/project/translation_manager.py Added methods to discover and translate notebooks

@skytin1004 skytin1004 self-requested a review June 16, 2025 03:28
@skytin1004 skytin1004 changed the title Add Jupyter Notebook (.ipynb) translation support Core: Add Jupyter Notebook (.ipynb) translation support Jun 16, 2025
Copy link
Collaborator

@skytin1004 skytin1004 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @doggy8088, great work on adding Jupyter notebook support! 👍

I noticed there's a small issue with the filter_files that's causing some test failures.

Note

File: src/co_op_translator/utils/common/file_utils.py
Lines: 178-204

The function currently only accepts 2 parameters, but in translation_manager.py it's being called with 3 arguments (including the file extension). This is throwing a TypeError.

Could you please update the function signature to include the optional extension parameter and update the logic ? Something like:

def filter_files(directory: str | Path, excluded_dirs, extension: str = None) -> list:
    """
    Filter and return only the files in the given directory, excluding specified directories.
    Optionally filter by file extension.

    Args:
        directory (str | Path): The directory path to search for files.
        excluded_dirs (set): A set of directory names to exclude from the search.
        extension (str, optional): File extension to filter by (e.g., '.ipynb').
                                If None, all file types are included.

    Returns:
        list: A list of Path objects representing only the files (excluding specified directories).
    """
    directory = Path(directory)
    files = []

    # Recursively traverse the directory
    for path in directory.rglob("*"):
        # Check if the path is a file, matches extension if specified, and doesn't contain excluded dirs
        if (
            path.is_file()
            and (extension is None or path.suffix.lower() == extension.lower())
            and not any(excluded_dir in path.parts for excluded_dir in excluded_dirs)
        ):
            files.append(path)

    return files

I've tested the fixes locally and they work perfectly

Copy link
Collaborator

@skytin1004 skytin1004 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, the test file needs to be updated to properly mock the JupyterNotebookTranslator and avoid test failures.

Note

File: tests/co_op_translator/core/project/test_project_translator.py

I've tested the fixes locally and they work perfectly! Here are the complete updated files to resolve all test failures:

Could you please replace the entire file content (test_project_translator.py) with this tested version?

Here's what needs to be added to support Jupyter notebook extension:

import pytest
import asyncio
from unittest.mock import AsyncMock, MagicMock, patch
from co_op_translator.core.project.project_translator import ProjectTranslator
from co_op_translator.config.llm_config.provider import LLMProvider


@pytest.fixture
async def temp_project_dir(tmp_path):
    """Create a temporary project directory structure."""
    # Create project structure
    docs_dir = tmp_path / "docs"
    docs_dir.mkdir()
    (docs_dir / "test.md").write_text("# Test Document\nThis is a test.")

    images_dir = tmp_path / "images"
    images_dir.mkdir()
    (images_dir / "test.png").touch()

    translations_dir = tmp_path / "translations"
    translations_dir.mkdir()

    return tmp_path


@pytest.fixture
async def project_translator(temp_project_dir):
    """Create a ProjectTranslator instance with mocked dependencies."""
    with (
        patch(
            "co_op_translator.core.llm.text_translator.TextTranslator"
        ) as mock_text_translator,
        patch(
            "co_op_translator.core.llm.markdown_translator.MarkdownTranslator"
        ) as mock_markdown_translator,
        patch(
            "co_op_translator.core.vision.image_translator.ImageTranslator"
        ) as mock_image_translator,
        patch(
            "co_op_translator.core.llm.jupyter_notebook_translator.JupyterNotebookTranslator"
        ) as mock_jupyter_translator,
        patch(
            "co_op_translator.config.llm_config.config.LLMConfig.get_available_provider"
        ) as mock_get_provider,
    ):
        # Setup mock translators and config
        mock_text_translator.create.return_value = MagicMock()
        mock_markdown_translator.create.return_value = MagicMock()
        mock_image_translator.create.return_value = MagicMock()
        mock_jupyter_translator.create.return_value = MagicMock()
        mock_get_provider.return_value = LLMProvider.AZURE_OPENAI  # Mock LLM provider

        translator = ProjectTranslator("ko ja", root_dir=temp_project_dir)
        translator.translation_manager.translate_all_markdown_files = AsyncMock(
            return_value=(2, [])
        )
        translator.translation_manager.translate_all_image_files = AsyncMock(
            return_value=(0, [])
        )
        # Mock translate_project_async to avoid unawaited coroutine warning
        translator.translation_manager.translate_project_async = AsyncMock(
            return_value=None
        )

        yield translator


@pytest.mark.asyncio
async def test_check_and_retry_translations(project_translator, temp_project_dir):
    """Test checking and retrying translations."""
    # Setup
    md_file = temp_project_dir / "docs" / "test.md"
    translated_file = temp_project_dir / "translations" / "ko" / "docs" / "test.md"
    translated_file.parent.mkdir(parents=True)
    translated_file.write_text(
        "# Test\nBroken translation"
    )  # Create a "broken" translation

    # Mock translation methods with proper async behavior
    project_translator.translation_manager.check_outdated_files = AsyncMock(
        return_value=(1, [])
    )
    project_translator.translation_manager.translate_all_markdown_files = AsyncMock(
        return_value=(2, [])
    )
    project_translator.translation_manager.translate_all_image_files = AsyncMock(
        return_value=(0, [])
    )

    # Execute and verify
    total_count, errors = await project_translator.check_and_retry_translations()

    # Verify results
    assert total_count == 3  # 1 (outdated) + 2 (markdown) + 0 (images)
    assert errors == []  # No errors expected
    assert project_translator.translation_manager.check_outdated_files.called
    assert project_translator.translation_manager.translate_all_markdown_files.called
    assert project_translator.translation_manager.translate_all_image_files.called


def test_translate_project(project_translator):
    """Test the synchronous translate_project method."""
    # Setup
    with patch.object(asyncio, "run", side_effect=lambda x: None) as mock_run:
        # Execute
        project_translator.translate_project(images=True, markdown=True)
        # Verify
        mock_run.assert_called_once()


@pytest.mark.asyncio
async def test_markdown_only_mode(temp_project_dir):
    """Test ProjectTranslator in markdown-only mode."""
    with (
        patch(
            "co_op_translator.core.llm.text_translator.TextTranslator"
        ) as mock_text_translator,
        patch(
            "co_op_translator.core.llm.markdown_translator.MarkdownTranslator"
        ) as mock_markdown_translator,
        patch(
            "co_op_translator.core.llm.jupyter_notebook_translator.JupyterNotebookTranslator"
        ) as mock_jupyter_translator,
        patch(
            "co_op_translator.config.llm_config.config.LLMConfig.get_available_provider"
        ) as mock_get_provider,
    ):
        # Setup translator mocks
        mock_text_translator_instance = AsyncMock()
        mock_text_translator.create.return_value = mock_text_translator_instance

        mock_markdown_translator_instance = AsyncMock()
        mock_markdown_translator.create.return_value = mock_markdown_translator_instance

        mock_jupyter_translator_instance = AsyncMock()
        mock_jupyter_translator.create.return_value = mock_jupyter_translator_instance

        mock_get_provider.return_value = LLMProvider.AZURE_OPENAI

        # Create translator in markdown-only mode
        translator = ProjectTranslator(
            "ko ja", root_dir=temp_project_dir, markdown_only=True
        )

        # Mock the async methods after initialization
        translator.translation_manager.translate_project_async = AsyncMock()
        translator.translation_manager.check_outdated_files = AsyncMock(
            return_value=(0, [])
        )
        translator.translation_manager.translate_all_markdown_files = AsyncMock(
            return_value=(0, [])
        )
        translator.translation_manager.translate_all_image_files = AsyncMock(
            return_value=(0, [])
        )

        # Verify markdown-only mode configuration
        assert translator.markdown_only is True
        assert translator.image_translator is None

        # Test async operation to ensure all coroutines are properly handled
        await translator.translation_manager.translate_project_async(markdown=True)
        assert (
            translator.translation_manager.translate_project_async.call_args.kwargs[
                "markdown"
            ]
            is True
        )

@skytin1004
Copy link
Collaborator

I just ran Co-op Translator with a notebook locally, and it worked perfectly, awesome work, @doggy8088!

One thing I noticed is that every Markdown cell includes metadata and a disclaimer.
I'd suggest updating the translate_markdown function to make this behavior configurable, so we can choose whether to include the disclaimer and metadata.

Copy link
Collaborator

@skytin1004 skytin1004 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One suggestion I have is to make the addition of metadata and disclaimers configurable in the translate_markdown function. Currently, these are added to every Markdown cell, which may not be necessary in all contexts.

Note

File: src\co_op_translator\core\llm\markdown_translator.py
Lines: 87-173

Here's a proposed refactor of the function to support optional metadata and disclaimers via parameters:

I tested the code below, and it worked well!

    async def translate_markdown(
        self,
        document: str,
        language_code: str,
        md_file_path: str | Path,
        markdown_only: bool = False,
        add_metadata: bool = True,
        add_disclaimer: bool = True,
    ) -> str:
        """Translate markdown document to target language.

        Handles complex documents by splitting into manageable chunks while
        preserving formatting, links, and code blocks.

        Args:
            document: Content of the markdown file
            language_code: Target language code
            md_file_path: Path to the markdown file
            markdown_only: Skip embedded image translation if True
            add_metadata: Whether to add metadata comment at the beginning
            add_disclaimer: Whether to add disclaimer at the end

        Returns:
            str: The translated content with optional metadata and disclaimer.
        """
        md_file_path = Path(md_file_path)

        # Create and format metadata (only if requested)
        metadata_comment = ""
        if add_metadata:
            metadata = self.create_metadata(md_file_path, language_code)
            metadata_comment = self.format_metadata_comment(metadata)

        # Step 1: Replace code blocks and inline code with placeholders
        (
            document_with_placeholders,
            placeholder_map,
        ) = replace_code_blocks_and_inline_code(document)

        # Step 2: Split the document into chunks and generate prompts
        link_limit = 30
        if count_links_in_markdown(document_with_placeholders) > link_limit:
            logger.info(
                f"Document contains more than {link_limit} links, splitting the document into chunks."
            )
            document_chunks = process_markdown_with_many_links(
                document_with_placeholders, link_limit
            )
        else:
            logger.info(
                f"Document contains {link_limit} or fewer links, processing normally."
            )
            document_chunks = process_markdown(document_with_placeholders)

        # Step 3: Generate translation prompts and translate each chunk
        language_name = self.font_config.get_language_name(language_code)
        is_rtl = self.font_config.is_rtl(language_code)
        prompts = [
            generate_prompt_template(language_code, language_name, chunk, is_rtl)
            for chunk in document_chunks
        ]
        results = await self._run_prompts_sequentially(prompts)
        translated_content = "\n".join(results)

        # Step 4: Restore the code blocks and inline code from placeholders
        translated_content = restore_code_blocks_and_inline_code(
            translated_content, placeholder_map
        )

        # Step 5: Update links
        updated_content = update_links(
            md_file_path,
            translated_content,
            language_code,
            self.root_dir,
            markdown_only=markdown_only,
        )
        
        # Step 6: Add metadata and disclaimer (only if requested)
        result = updated_content
        if add_metadata:
            result = metadata_comment + result
        if add_disclaimer:
            disclaimer = await self.generate_disclaimer(language_code)
            result = result + "\n\n" + disclaimer

        return result

@skytin1004
Copy link
Collaborator

skytin1004 commented Jun 29, 2025

Hi @doggy8088, As the implementation is mostly complete and has been inactive for some time, I'll proceed with merging it within the next few days. I'll handle the remaining items (tests, cleanup) myself.
Thank you again for your valuable contribution!

@skytin1004
Copy link
Collaborator

Thanks again @doggy8088 for the contribution!
I'm merging this now. Appreciate your work on it!

@skytin1004 skytin1004 merged commit 9bd0715 into Azure:main Jul 5, 2025
2 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

core Related to any changes in core source files tests

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Support for translating .ipynb (Jupyter Notebook) files

2 participants