Use virtualgl egl backend to run simulation without X server. #747

zbynekwinkler
Member

To test, please run:

$ docker pull osrf/subt-virtual-testbed:cloudsim_sim_latest
$ ./subt/docker/build.bash cloudsim
$ python -m subt.tools.run <your_favorite_config.toml>

It should be possible to get a successful run even on a computer with no running X server or when the X server runs as a different user.

@jisa
Collaborator

jisa commented Nov 14, 2020

I tried, but I cannot confirm (yet) that this works.

$ ./subt/docker/build.bash cloudsim
Sending build context to Docker daemon  10.53MB
Step 1/4 : FROM osrf/subt-virtual-testbed:cloudsim_sim_latest
 ---> 85ca64df8c0a
Step 2/4 : RUN cd /tmp &&     curl -sSO https://s3.amazonaws.com/virtualgl-pr/dev/linux/virtualgl_2.6.80_amd64.deb &&     sudo apt-get install ./virtualgl_*
 ---> Running in 5dec903c0a0e
curl: (6) Could not resolve host: s3.amazonaws.com
The command '/bin/sh -c cd /tmp &&     curl -sSO https://s3.amazonaws.com/virtualgl-pr/dev/linux/virtualgl_2.6.80_amd64.deb &&     sudo apt-get install ./virtualgl_*' returned a non-zero code: 6
Error response from daemon: No such image: cloudsim:2020-11-14-1452

Built cloudsim:2020-11-14-1452 and tagged as cloudsim:latest

However, ./subt/docker/build.bash robotika succeeds. For some reason that I have not resolved yet, docker build has access to network in one case, but not in the other.

Once I fix the network issue, there is one more problem ahead of me:

$ python3 -m subt.tools.run subt/runs/cs1_drone.toml
circuit: cave
world:   simple_cave_01
image:   robotika:latest   
logdir:  /tmp/osgar/subt/runs/2020-11-14T12.04.23-cs1_drone
robots:
       X120L: SSCI_X4_SENSOR_CONFIG_2

Creating/attaching 'sim-net'
Starting 'sim' container...
Traceback (most recent call last):
  File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/tmp/osgar/subt/tools/run.py", line 391, in <module>
    main(sys.argv[1:])
  File "/tmp/osgar/subt/tools/run.py", line 321, in main
    sim = _run_sim(client, circuit, logdir, world, robots)
  File "/tmp/osgar/subt/tools/run.py", line 173, in _run_sim
    sim = _create_docker(client, "sim", "cloudsim:latest", command, mounts, environment)
  File "/tmp/osgar/subt/tools/run.py", line 111, in _create_docker
    docker.types.DeviceRequest(count=-1, capabilities=[['gpu']])
AttributeError: module 'docker.types' has no attribute 'DeviceRequest'

$ sudo aptitude install python3-docker
python3-docker is already installed at the requested version (4.1.0-1.2)
python3-docker is already installed at the requested version (4.1.0-1.2)
No packages will be installed, upgraded, or removed.

subt.tools.run needs python3-docker>=4.3.0, while only 4.1 is available system-wide (both in Debian and in Ubuntu). 4.3.1 can be installed with pip3 though, so this is resolvable, even if not great.
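
A minimal sketch of a version guard for this, assuming the 4.3.0 threshold stated above is correct. `parse_version` and `supports_device_request` are hypothetical helpers, not part of subt.tools.run; they only compare a version string against that minimum so the problem can be reported before the container is created.

```python
MIN_DOCKER_PY = (4, 3, 0)  # threshold taken from the comment above, not verified independently


def parse_version(version):
    """Turn '4.1.0' or '4.1.0-1.2' into (4, 1, 0); suffixes after the digits are ignored."""
    parts = []
    for component in version.split(".")[:3]:
        digits = ""
        for ch in component:
            if not ch.isdigit():
                break
            digits += ch
        parts.append(int(digits or 0))
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def supports_device_request(version):
    """True when the given docker-py version is at least MIN_DOCKER_PY."""
    return parse_version(version) >= MIN_DOCKER_PY
```

In practice this would presumably be called with the installed library's version string, for example `supports_device_request(docker.__version__)`, before building the GPU device request.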

@zbynekwinkler
Member Author

> For some reason that I have not resolved yet, docker build has access to network in one case, but not in the other.

My offer to help fix this on your system still holds. The best time for this would be during our telemeetings.

> subt.tools.run needs python3-docker>=4.3.0, while only 4.1 is available system-wide (both in Debian and in Ubuntu). 4.3.1 can be installed with pip3 though, so this is resolvable, even if not great.

System-packaged libraries are primarily for system-packaged applications. The preferred way to work is to use a Python virtual environment. That is my setup, the setup for CI, and also the setup of our solution Docker image, so that is the setup I can provide support for.

@zbynekwinkler
Member Author

@jisa Please see #749 for a possible fix for the build issue.

@jisa
Collaborator

jisa commented Nov 18, 2020

That is one of the things I tried. It made no difference.

I added RUN /sbin/ifconfig -a as the first command to both the cloudsim and the robotika Dockerfiles and added the --no-cache flag to the docker build command to make sure it gets properly re-invoked every time.

For robotika, I get a whole lot of interfaces: br-6315d94dae1f, docker0, enp2s0 and lo. For cloudsim, I only get lo. This makes me believe that the problem comes from the base image, not from the Docker configuration or command-line parameters.

When I docker inspect osrf/subt-virtual-testbed:cloudsim_sim_latest, I see ContainerConfig:NetworkDisabled=true in two places. The robotika base image does not have that.
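
A rough way to automate that check, as a sketch only: walk the JSON that `docker inspect` prints and report every place where `NetworkDisabled` is true. `find_key_paths` is a hypothetical helper, and the `sample` below is a made-up fragment shaped like inspect output, not the real output for this image.

```python
import json


def find_key_paths(node, key, value, path=""):
    """Yield the path of every occurrence of `key` equal to `value` in nested JSON data."""
    if isinstance(node, dict):
        for k, v in node.items():
            child = f"{path}.{k}" if path else k
            if k == key and v == value:
                yield child
            yield from find_key_paths(v, key, value, child)
    elif isinstance(node, list):
        for i, item in enumerate(node):
            yield from find_key_paths(item, key, value, f"{path}[{i}]")


# Made-up fragment for illustration; real input would be the text printed by `docker inspect <image>`.
sample = json.loads(
    '[{"Config": {"NetworkDisabled": true}, "ContainerConfig": {"NetworkDisabled": true}}]'
)
hits = list(find_key_paths(sample, "NetworkDisabled", True))
print(hits)
```

An empty result for one base image and a non-empty one for the other would match the difference described above.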

Would you mind checking your cloudsim_sim_latest? I have a locally modified version because of as-yet unofficial modifications of K2. It is possible I messed it up somewhere along the way.

@jisa
Collaborator

jisa commented Nov 19, 2020

For posterity: The networking issue was caused by local modifications of the base image saved with docker commit while the respective container was running without network access.

@jisa
Collaborator

jisa commented Nov 19, 2020

I can make it work. I had to:

  1. sudo vglserver_config. Enabling only EGL or enabling both EGL and GLX works. Restricting access to vglusers only does not work, and we may want to figure out who needs to be added to vglusers.
  2. sudo modprobe nvidia-drm on the host machine. This module somehow does not load despite being specified in /etc/modules-load.d/nvidia.conf. I will need to solve this separately.
  3. Replace /dev/dri/card1 in subt/docker/cloudsim/run_wrapped.bash with /dev/dri/card0, rebuild the new cloudsim image.
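
For step 3, a small sketch of how one might see which card nodes exist on a host before editing the wrapper script. `list_dri_cards` is a hypothetical helper; it only lists entries named `card<N>` and says nothing about which one belongs to the NVIDIA GPU.

```python
from pathlib import Path


def list_dri_cards(dev_dir="/dev/dri"):
    """Return the card* nodes in dev_dir as sorted path strings; empty if dev_dir is missing."""
    root = Path(dev_dir)
    if not root.is_dir():
        return []
    # Plain string sort: with ten or more cards, card10 would sort before card2.
    return sorted(str(p) for p in root.glob("card[0-9]*"))


print(list_dri_cards())
```

On a machine where the NVIDIA card is the only one, this would presumably show just /dev/dri/card0.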

@zbynekwinkler
Member Author

> I can make it work

🍾 Great! About vglusers: I just disabled it and forgot about it. My thinking was, why would I want a user that cannot access this? Anyway, at least root needs to be there, and Docker's story about users is not that good anyway. As for card0: you have a system where the nvidia card is the only card, right? I'll have to think about how to best handle selecting the proper card (suggestions welcome).

@zbynekwinkler
Member Author

> I'll have to think about how to best handle selecting the proper card

What about an environment variable, EGL_DEVICE, that would contain the path to the device to use?
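
A minimal sketch of that idea, under the assumption that an unset or empty variable should fall back to a default. The `egl_device` helper and the `/dev/dri/card0` fallback are illustrative choices, not something the PR implements.

```python
import os

DEFAULT_EGL_DEVICE = "/dev/dri/card0"  # assumed fallback, chosen only for this sketch


def egl_device(environ=None):
    """Return the device path from EGL_DEVICE, or the fallback when it is unset or empty."""
    env = os.environ if environ is None else environ
    return env.get("EGL_DEVICE") or DEFAULT_EGL_DEVICE


print(egl_device({"EGL_DEVICE": "/dev/dri/card1"}))
```

Whoever starts the container would then set EGL_DEVICE on hosts where the default does not point at the right card.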

@@ -170,7 +170,7 @@ def _run_sim(client, circuit, logdir, world, robots):
 f"robotName{n}:={name}",
 f"robotConfig{n}:={kind}"
 ]
-sim = _create_docker(client, "sim", "osrf/subt-virtual-testbed:cloudsim_sim_latest", command, mounts, environment)
+sim = _create_docker(client, "sim", "cloudsim:latest", command, mounts, environment)
Member

m3d commented Dec 18, 2020

I wonder how we can avoid forgetting to rebuild the local cloudsim image when the external one is upgraded.

Member Author

The solution is the same as we are using for our base image. We can start using specific versions instead of "latest".
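
One way to keep that honest, sketched as a hypothetical `is_pinned` check that could run before the simulation starts. Treating any tag that ends in "latest" (such as cloudsim_sim_latest) as unpinned is a heuristic assumed for this example, not a Docker rule.

```python
def image_tag(reference):
    """Return the tag of an image reference, or None when no tag is given."""
    last = reference.rsplit("/", 1)[-1]  # drop any registry host:port and namespace prefix
    _name, sep, tag = last.partition(":")
    return tag if sep else None


def is_pinned(reference):
    """True for digest references and for explicit tags that do not end in 'latest'."""
    if "@" in reference:  # e.g. name@sha256:..., fixed by content
        return True
    tag = image_tag(reference)
    return tag is not None and not tag.endswith("latest")


for ref in ("cloudsim:latest", "cloudsim:2020-11-14-1452"):
    print(ref, is_pinned(ref))
```

The dated tag that build.bash already produces, as seen in the build log earlier in this thread, would pass such a check, while cloudsim:latest would not.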
