Dev: New deployment for new Server #226

AlbertoCasasOrtiz · 2025-01-21T23:14:44Z

It is now possible to deploy multiple instances of OpenCap-Core on the same machine by running:

make run INSTANCE_ID=<id-instance> CPU_SET=<cpu-set>

Where:

<id-instance>: An integer number from 0 to max_gpu. Assigns a specific GPU for this instance.
<cpu-set>: Set of CPU threads assigned to this instance in the format n-m (e.g., 0-13).

Added some helper scripts:

./start-script.sh <n-instances>: Starts n opencap instances, each with 1 GPU and 14 threads assigned (can be changed in code). An instance that is already running is left unchanged.
./stop-script.sh <instance-id>: Stops a specific opencap instance.
./stop-all-script.sh: Stops all instances of opencap.
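
To make the start-script behaviour concrete, here is a minimal Python sketch of how n instances might be mapped to GPUs and CPU thread ranges. It assumes that each instance receives a contiguous, non-overlapping block of threads (instance 0 gets 0-13, instance 1 gets 14-27, and so on); the actual script may allocate threads differently. The function name plan_instances and its parameters are hypothetical and do not come from the repository.

```python
from typing import Iterable, List, Tuple


def plan_instances(
    n_instances: int,
    threads_per_instance: int = 14,
    running_ids: Iterable[int] = (),
) -> List[Tuple[int, str]]:
    """Return (instance_id, cpu_set) pairs for the instances that need starting.

    Assumption: instance i uses GPU i and a contiguous block of CPU threads.
    Instances listed in running_ids are skipped, mirroring the rule that a
    running instance is not modified.
    """
    already_running = set(running_ids)
    plan = []
    for instance_id in range(n_instances):
        if instance_id in already_running:
            continue  # leave running instances untouched
        first = instance_id * threads_per_instance
        last = first + threads_per_instance - 1
        plan.append((instance_id, f"{first}-{last}"))
    return plan


if __name__ == "__main__":
    for instance_id, cpu_set in plan_instances(3, running_ids=[1]):
        print(f"make run INSTANCE_ID={instance_id} CPU_SET={cpu_set}")
```

Keeping the planning step separate from the step that actually launches containers makes it easy to check the GPU and thread assignment before anything is started.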

I am working on making these parameters optional, so that a plain make run automatically assigns 1 GPU and places no limit on CPU threads.
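
As an illustration of optional parameters, the sketch below assembles the make invocation and leaves out any variable that was not supplied, so the Makefile defaults apply. The helper build_make_command is hypothetical; only the make run INSTANCE_ID=... CPU_SET=... form and the n-m CPU set format come from this description.

```python
import re
from typing import List, Optional

# CPU sets are written as n-m, for example 0-13.
CPU_SET_PATTERN = re.compile(r"(\d+)-(\d+)")


def build_make_command(
    instance_id: Optional[int] = None,
    cpu_set: Optional[str] = None,
) -> List[str]:
    """Build the make command, omitting variables that were not provided."""
    command = ["make", "run"]
    if instance_id is not None:
        if instance_id < 0:
            raise ValueError(f"instance id must be >= 0, got {instance_id}")
        command.append(f"INSTANCE_ID={instance_id}")
    if cpu_set is not None:
        match = CPU_SET_PATTERN.fullmatch(cpu_set)
        if match is None or int(match.group(1)) > int(match.group(2)):
            raise ValueError(f"CPU set must look like n-m with n <= m, got {cpu_set!r}")
        command.append(f"CPU_SET={cpu_set}")
    return command


if __name__ == "__main__":
    print(" ".join(build_make_command()))
    print(" ".join(build_make_command(1, "14-27")))
```

The returned list can be passed to subprocess.run without shell quoting concerns.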


carmichaelong · 2025-02-07T00:59:13Z

Changing from WIP to ready for review. Additional commits are aimed at the following:

make sure a different GPU is used for each instance
optional logging to json file to capture errors that are not passed back to the database
ensure make run still works, defaulting to GPU 0 and all CPU threads
add retries to the test session to help deal with multiple machines running the test session at the same time and then getting errors from the API. while this was a problem before, this has become a bigger problem with starting many instances at the same time.
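
The retry idea in the last item can be sketched as follows: when a call fails, wait a random amount of time before trying again, so that instances started together do not all hit the API at the same moment. This is a generic illustration and not the code from this PR; the name retry_with_random_wait, its parameters, and the default values are assumptions.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_random_wait(
    func: Callable[[], T],
    max_attempts: int = 5,
    min_wait: float = 1.0,
    max_wait: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call func until it succeeds or max_attempts is reached.

    Between attempts, sleep for a random duration in [min_wait, max_wait]
    so that concurrent callers spread out their retries.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            sleep(random.uniform(min_wait, max_wait))
    raise RuntimeError("unreachable")  # the loop always returns or raises
```

The sleep argument is injectable only so the helper can be exercised in tests without real delays.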

@AlbertoCasasOrtiz I know you worked on some of this, but there are enough changes that I think you could still test and review, so I'm adding you as a reviewer. If you think you'd like a different person to review, we could ping Antoine.

@antoinefalisse Tagging you here as an FYI for the changes here. If you'd like to give a review since both Alberto and I worked on it, please feel free to do so and add yourself, but no worries if you don't think it's necessary.

AlbertoCasasOrtiz · 2025-02-10T19:11:04Z

> make sure a different GPU is used for each instance

Tested while the server was processing trials; different GPUs showed usage.

> optional logging to json file to capture errors that are not passed back to the database

Tested by activating logging and checking that the log is saved in all containers.

> ensure make run still works, defaulting to GPU 0 and all CPU threads

Tested by running make run, which launched the correct container with GPU 0 assigned.

> add retries to the test session to help deal with multiple machines running the test session at the same time and then getting errors from the API. while this was a problem before, this has become a bigger problem with starting many instances at the same time.

It's been running for 3 days with no errors.
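
For readers unfamiliar with the optional JSON error log discussed in this thread, one possible shape is an append-only file with one JSON object per line. The sketch below is an assumption about how such a logger could look, not the implementation in app.py; the function name log_error_json, the record fields, and the convention that passing no path disables logging are all hypothetical.

```python
import json
import traceback
from datetime import datetime, timezone
from typing import Any, Dict, Optional


def log_error_json(
    error: BaseException,
    log_path: Optional[str] = None,
    context: Optional[Dict[str, Any]] = None,
) -> Optional[Dict[str, Any]]:
    """Append one JSON record describing error to log_path.

    Logging is optional: when log_path is None, nothing is written.
    """
    if log_path is None:
        return None
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "error_type": type(error).__name__,
        "message": str(error),
        "traceback": "".join(
            traceback.format_exception(type(error), error, error.__traceback__)
        ),
        "context": context or {},
    }
    # One JSON object per line keeps the file easy to append to and to parse.
    with open(log_path, "a", encoding="utf-8") as handle:
        handle.write(json.dumps(record) + "\n")
    return record
```

A caller would typically invoke it inside an except block, for example log_error_json(e, log_path, {"instance_id": 0}), where the context keys are again only illustrative.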

AlbertoCasasOrtiz and others added 10 commits January 14, 2025 16:10

Allowing running multiple instances of opencap in one server.

0cb0372

Modified Makefile and docker-compose to allow multiple instances in same machine.

f807e25

Renamed scripts. Added script to stop all.

98a385d

Added script to check container status and start single containers.

26fa709

Added ugly logging to docker containers. Added try catch to finally in app.py

666e165

catch exception if patch fails in except block

f42a602

use different gpu for each instance

b14f5bc

add error logging json file (optional) for machines running app.py

638807e

set defaults for INSTANCE_ID and CPU_SET to restore make run usage

7d7e215

retry test session with random waiting time, simplify makeRequestWithRetry error handling

116e8d4

carmichaelong changed the title ~~WIP - New deployment for new Server~~ Dev: New deployment for new Server Feb 7, 2025

AlbertoCasasOrtiz changed the base branch from dev to main February 11, 2025 23:01

AlbertoCasasOrtiz changed the base branch from main to dev February 11, 2025 23:02

AlbertoCasasOrtiz merged commit cc976fa into dev Feb 13, 2025
