
Conversation

@moutasem1989
Contributor

I created a new pull request for DMR.
This adds Docker Compose file preconfigured for Docker Model Runner (DMR). The YAML includes direct links to official Docker documentation to guide users through installing and enabling the docker-model-plugin, making it easier to run AI models locally using an OpenAI-compatible API.
The configuration uses the lightweight ai/qwen3:0.6B-Q4_0 model (~500 MB), chosen for its small resource footprint, ability to run on integrated graphics, and surprisingly reliable performance—ideal for testing or low-power systems.
From my personal experience, a tmpfs mount for /app/temp_audio works far better than a Docker volume in this case; it keeps temporary audio files entirely in RAM. This significantly reduces disk I/O and improves workflow responsiveness, and it is safe given the small size of typical audio snippets; I don't expect temporary audio files to exceed 1 GB.
Finally, the README.md has been updated to reference the new DMR-enabled Docker Compose file, helping users easily discover and enable the local model runner setup.

I tried DMR with the OpenAI integration and I am getting this error using Instant Playlist:

--- Attempt 1 of 3 ---
Processing with OPENAI model: ai/qwen3:0.6B-Q4_0 (at http://172.17.0.1:12434/engines/llama.cpp/v1/chat/completions).
OpenAI API Error: Error: AI service is currently unavailable.

--- Attempt 2 of 3 ---
Retrying due to previous issue: Error: AI service is currently unavailable.
Processing with OPENAI model: ai/qwen3:0.6B-Q4_0 (at http://172.17.0.1:12434/engines/llama.cpp/v1/chat/completions).
OpenAI API Error: Error: AI service is currently unavailable.

--- Attempt 3 of 3 ---
Retrying due to previous issue: Error: AI service is currently unavailable.
Processing with OPENAI model: ai/qwen3:0.6B-Q4_0 (at http://172.17.0.1:12434/engines/llama.cpp/v1/chat/completions).
OpenAI API Error: Error: AI service is currently unavailable.

Failed to generate and execute a valid query that returns results after 3 attempt(s). Last issue: Error: AI service is currently unavailable.

Logs had this to say:

audiomuse-ai-flask-app | [ERROR]-[19-11-2025 20-09-19]-An unexpected error occurred in get_openai_compatible_playlist_name
audiomuse-ai-flask-app | Traceback (most recent call last):
audiomuse-ai-flask-app | File "/app/ai.py", line 146, in get_openai_compatible_playlist_name
audiomuse-ai-flask-app | full_raw_response_content += choice['delta']['content']
audiomuse-ai-flask-app | TypeError: can only concatenate str (not "NoneType") to str
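The traceback above can be reproduced in isolation: when a streamed chunk carries "content": null (as DMR's llama.cpp backend does while it is emitting reasoning_content), json.loads turns that into None, the `'content' in choice['delta']` check still passes, and string concatenation blows up. A minimal sketch (the chunk payload below is illustrative, not captured from DMR):

```python
import json

# A streaming chunk as DMR's llama.cpp backend may emit it during the
# "Thinking" phase: the 'content' key is present but explicitly null.
chunk = json.loads(
    '{"choices": [{"delta": {"content": null, "reasoning_content": "..."}}]}'
)

full_raw_response_content = ""
choice = chunk['choices'][0]

# The original check: the key exists, so this branch is taken...
if 'content' in choice['delta']:
    try:
        # ...and concatenating None raises the TypeError from the logs.
        full_raw_response_content += choice['delta']['content']
    except TypeError as e:
        print(f"TypeError: {e}")  # can only concatenate str (not "NoneType") to str
```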

Using the configuration I added to the Docker Compose file and exec-ing into the container, I was able to get an answer from DMR by running this:

root@4fc6ef2cbdad:/workspace# curl -X POST "http://172.17.0.1:12434/engines/llama.cpp/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer test-api-key" \
  -d '{
        "model": "ai/qwen3:0.6B-Q4_0",
        "messages": [
          { "role": "user", "content": "say hi" }
        ],
        "stream": false
      }'
{"choices":[{"finish_reason":"stop","index":0,"message":{"role":"assistant","reasoning_content":"Okay, the user said \"say hi\". I need to respond to that. Since the user is probably trying to say hello, I should respond with a friendly greeting. Let me make sure to keep it simple and cheerful. Maybe say \"Hi there!\" to keep the conversation going. That should be good enough.","content":"Hi there! 😊  How can I help you today?"}}],"created":1763583884,"model":"ai/qwen3:0.6B-Q4_0","system_fingerprint":"b1-c22473b","object":"chat.completion","usage":{"completion_tokens":80,"prompt_tokens":10,"total_tokens":90},"id":"chatcmpl-YaO3BTmXLv1Nsp1kDjXounLq8a8KLHch","timings":{"cache_n":0,"prompt_n":10,"prompt_ms":740.572,"prompt_per_token_ms":74.0572,"prompt_per_second":13.503076000712962,"predicted_n":80,"predicted_ms":11300.522,"predicted_per_token_ms":141.256525,"predicted_per_second":7.07931
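For reference, the same non-streaming request can be issued from Python using only the standard library; the endpoint, model name, and the test-api-key bearer token are taken directly from the curl call above:

```python
import json
import urllib.request

# Mirror of the curl call above: a non-streaming chat completion against DMR.
url = "http://172.17.0.1:12434/engines/llama.cpp/v1/chat/completions"
payload = {
    "model": "ai/qwen3:0.6B-Q4_0",
    "messages": [{"role": "user", "content": "say hi"}],
    "stream": False,
}
req = urllib.request.Request(
    url,
    data=json.dumps(payload).encode("utf-8"),
    headers={
        "Content-Type": "application/json",
        "Authorization": "Bearer test-api-key",
    },
)

if __name__ == "__main__":
    # Only runs when DMR is reachable at the address above.
    with urllib.request.urlopen(req, timeout=60) as resp:
        body = json.load(resp)
    print(body["choices"][0]["message"]["content"])
```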

I tested the function under ai.py and this was the output:

DEBUG:llm-test:POST http://172.17.0.1:12434/engines/llama.cpp/v1/chat/completions
DEBUG:urllib3.connectionpool:Starting new HTTP connection (1): 172.17.0.1:12434
DEBUG:urllib3.connectionpool:http://172.17.0.1:12434 "POST /engines/llama.cpp/v1/chat/completions HTTP/1.1" 200 None
LLM Response:
Error: can only concatenate str (not "NoneType") to str

I am not sure if it is timing out or simply not extracting the content correctly.

@moutasem1989
Contributor Author

moutasem1989 commented Nov 19, 2025

I tried changing the ai.py OpenAI function to catch the error and tested it separately in test_llm_access.py. It worked and gave me an answer:

=== Testing Local OpenAI-Compatible LLM ===
DEBUG:llm-test:POST http://172.17.0.1:12434/engines/llama.cpp/v1/chat/completions
DEBUG:urllib3.connectionpool:Starting new HTTP connection (1): 172.17.0.1:12434
DEBUG:urllib3.connectionpool:http://172.17.0.1:12434 "POST /engines/llama.cpp/v1/chat/completions HTTP/1.1" 200 None
LLM Response:
Hi there! How are you today?

So I don't really understand why it keeps raising TypeError: can only concatenate str (not "NoneType") to str.

@rendyhd
Contributor

rendyhd commented Nov 20, 2025

Check ai.py, lines 134-158, under the header "Extract content based on format". Below is a potential solution, but I can't test it.

Potential reason: The issue arises because DMR (using a llama.cpp backend) and OpenRouter structure their streaming JSON chunks differently. While OpenRouter typically omits the content key entirely when it is empty, DMR explicitly sends "content": null—particularly while the model is generating reasoning_content (the "Thinking" phase). The code's check, if 'content' in choice['delta'], evaluated to True for DMR because the key existed, causing the script to crash when it attempted to concatenate that None value to the string.

                # Extract content based on format
                if is_openai_format:
                    # OpenAI/OpenRouter format
                    if 'choices' in chunk and len(chunk['choices']) > 0:
                        choice = chunk['choices'][0]

                        # Check for finish
                        if choice.get('finish_reason') == 'stop':
                            break

                        # Extract text from delta.content or text field
                        if 'delta' in choice:
                            content = choice['delta'].get('content')
                            if content is not None:
                                full_raw_response_content += content
                        elif 'text' in choice:
                            text = choice.get('text')
                            if text is not None:
                                full_raw_response_content += text
                else:
                    # Ollama format
                    if 'response' in chunk:
                        full_raw_response_content += chunk['response']
                    if chunk.get('done'):
                        break

            except json.JSONDecodeError:
                logger.debug("Could not decode JSON line from stream: %s", line_str)
                continue
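A quick sanity check of the extraction logic above, fed simulated chunks in both styles (a DMR-style chunk with "content": null and an OpenRouter-style chunk omitting the key entirely), shows the loop collecting only the real text. This is a minimal sketch with the OpenAI-format branch extracted into a standalone function; the sample payloads are illustrative:

```python
import json

def collect_stream(lines):
    """Accumulate text from OpenAI-style streaming chunks, tolerating
    DMR's explicit "content": null during the reasoning phase."""
    full = ""
    for line_str in lines:
        try:
            chunk = json.loads(line_str)
        except json.JSONDecodeError:
            continue
        if 'choices' in chunk and len(chunk['choices']) > 0:
            choice = chunk['choices'][0]
            if choice.get('finish_reason') == 'stop':
                break
            if 'delta' in choice:
                content = choice['delta'].get('content')
                if content is not None:
                    full += content
    return full

# DMR-style: key present but null; OpenRouter-style: key omitted; then text.
sample = [
    '{"choices": [{"delta": {"content": null, "reasoning_content": "hmm"}}]}',
    '{"choices": [{"delta": {"role": "assistant"}}]}',
    '{"choices": [{"delta": {"content": "Hi there!"}}]}',
    '{"choices": [{"finish_reason": "stop", "delta": {}}]}',
]
print(collect_stream(sample))  # prints: Hi there!
```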

@moutasem1989
Contributor Author

moutasem1989 commented Nov 20, 2025

I applied your changes to the test_llm_access.py I created inside the container to check whether the LLM model is reachable, and it worked for me: I was able to capture a response without crashing. I tested this with both DMR and an externally self-hosted server called Jan.AI that also exposes OpenAI-compatible APIs.
This should work.

@rendyhd
Contributor

rendyhd commented Nov 20, 2025

That's great! I've included it in my PR #196, where I also applied a fix for the OpenRouter rate limit.
