@AlbertoCasasOrtiz
Member

It is now possible to deploy multiple instances of OpenCap-Core on the same machine by running:

make run INSTANCE_ID=<id-instance> CPU_SET=<cpu-set>

Where:

  • <id-instance>: An integer from 0 to max_gpu. Selects the GPU assigned to this instance.
  • <cpu-set>: Set of CPU threads assigned to this instance in the format n-m (e.g., 0-13).
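
As a rough illustration of the two parameters above, the following Python sketch validates a CPU set in the n-m format and assembles the corresponding make run invocation. The helper names parse_cpu_set and build_run_command are hypothetical and are not part of this PR; only the INSTANCE_ID and CPU_SET variable names come from the command above.

```python
import re


def parse_cpu_set(cpu_set: str) -> tuple[int, int]:
    """Parse a CPU set written as 'n-m' into (first, last) thread indices."""
    match = re.fullmatch(r"(\d+)-(\d+)", cpu_set)
    if match is None:
        raise ValueError(f"expected a CPU set in the form n-m, got {cpu_set!r}")
    first, last = int(match.group(1)), int(match.group(2))
    if first > last:
        raise ValueError(f"CPU set {cpu_set!r} is empty: {first} > {last}")
    return first, last


def build_run_command(instance_id: int, cpu_set: str) -> list[str]:
    """Build the argument list for one `make run` call (hypothetical helper)."""
    if instance_id < 0:
        raise ValueError("instance id must be a non-negative integer")
    parse_cpu_set(cpu_set)  # called only to validate the format
    return ["make", "run", f"INSTANCE_ID={instance_id}", f"CPU_SET={cpu_set}"]
```

The returned list could be passed to subprocess.run; the upper bound on the instance id (max_gpu) is not checked here because it depends on the machine.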

Added some helper scripts:

  • ./start-script.sh <n-instances>: Starts n OpenCap instances, each with 1 GPU and 14 threads assigned (the thread count can be changed in the code). Instances that are already running are left untouched.
  • ./stop-script.sh <instance-id>: Stops a specific OpenCap instance.
  • ./stop-all-script.sh: Stops all OpenCap instances.
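
To make the start script's behavior concrete, here is a minimal Python sketch of how per-instance resources could be planned. It assumes that instance i receives GPU i and a contiguous block of 14 threads starting at 14*i, which matches the 0-13 example but is not confirmed by the script itself. The function name plan_instances and the running argument are hypothetical.

```python
def plan_instances(n_instances, threads_per_instance=14, running=frozenset()):
    """Return the instances that still need to be started.

    Assumption: instance i uses GPU i and threads
    [i * threads_per_instance, (i + 1) * threads_per_instance - 1].
    Instance ids listed in `running` are skipped, not modified.
    """
    plans = []
    for instance_id in range(n_instances):
        if instance_id in running:
            continue  # leave an already running instance untouched
        first = instance_id * threads_per_instance
        last = first + threads_per_instance - 1
        plans.append({"instance_id": instance_id, "cpu_set": f"{first}-{last}"})
    return plans
```

For example, plan_instances(3, running={1}) yields plans for instances 0 and 2 only, with CPU sets 0-13 and 28-41.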

I am working on making these parameters optional, so that a plain make run automatically assigns 1 GPU and places no limit on CPU threads.

@carmichaelong carmichaelong changed the title from "WIP - New deployment for new Server" to "Dev: New deployment for new Server" Feb 7, 2025
@carmichaelong
Contributor

Changing from WIP to ready for review. Additional commits are aimed at the following:

  • make sure a different GPU is used for each instance
  • optional logging to json file to capture errors that are not passed back to the database
  • ensure make run still works, defaulting to GPU 0 and all CPU threads
  • add retries to the test session to handle API errors that occur when multiple machines run the test session at the same time. This was already a problem before, but it has become a bigger one now that many instances start at the same time.
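
For the optional JSON logging item in the list above, one possible shape is sketched below using Python's standard logging module. This is an illustration only: the one-JSON-object-per-line layout, the field names, and the names JsonFormatter and make_json_logger are assumptions, not the format used in this PR.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Format each log record as a single JSON object on one line."""

    def format(self, record):
        entry = {
            "time": self.formatTime(record),
            "level": record.levelname,
            "message": record.getMessage(),
        }
        if record.exc_info:
            # Keep the traceback so errors that never reach the database are not lost.
            entry["exception"] = self.formatException(record.exc_info)
        return json.dumps(entry)


def make_json_logger(path, name="opencap-instance"):
    """Create a logger that appends ERROR-level records to `path` (hypothetical helper)."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.ERROR)
    handler = logging.FileHandler(path)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    return logger
```

Calling logger.exception("processing failed") inside an except block would then write the message together with its traceback.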

@AlbertoCasasOrtiz I know you worked on some of this, but there are enough changes that I think you could still test and review, so I'm adding you as a reviewer. If you think you'd like a different person to review, we could ping Antoine.

@antoinefalisse Tagging you here as an FYI for the changes here. If you'd like to give a review since both Alberto and I worked on it, please feel free to do so and add yourself, but no worries if you don't think it's necessary.

@AlbertoCasasOrtiz
Member Author

AlbertoCasasOrtiz commented Feb 10, 2025

  • make sure a different GPU is used for each instance

  • Tested while the server was processing trials; usage showed up on different GPUs.

  • optional logging to json file to capture errors that are not passed back to the database

  • Tested by activating logging and checking that the logs are saved in all containers.

  • ensure make run still works, defaulting to GPU 0 and all CPU threads

  • Tested by running make run, which launched the correct container with GPU 0 assigned.

  • add retries to the test session to handle API errors that occur when multiple machines run the test session at the same time. This was already a problem before, but it has become a bigger one now that many instances start at the same time.

  • It's been running for 3 days with no errors.
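
The retry behavior discussed above could look roughly like the following Python sketch. The thread does not say how the retries are implemented, so the exponential backoff with random jitter, the default values, and the name call_with_retries are all assumptions chosen for illustration.

```python
import random
import time


def call_with_retries(func, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call `func` until it succeeds or `max_attempts` is reached.

    Assumption: waits base_delay * 2**(attempt - 1) seconds plus random
    jitter between attempts, so that many instances starting at the same
    time do not all retry the API in lockstep.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            delay = base_delay * 2 ** (attempt - 1)
            sleep(delay + random.uniform(0, base_delay))
```

The sleep parameter exists only so the waiting can be replaced in tests; in normal use the default time.sleep applies.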

@AlbertoCasasOrtiz AlbertoCasasOrtiz changed the base branch from dev to main February 11, 2025 23:01
@AlbertoCasasOrtiz AlbertoCasasOrtiz changed the base branch from main to dev February 11, 2025 23:02
@AlbertoCasasOrtiz AlbertoCasasOrtiz merged commit cc976fa into dev Feb 13, 2025