prepare a version of PloverDB for "turnkey" deployment by NCATS #1460

Closed
saramsey opened this issue May 12, 2021 · 13 comments

@saramsey
Member

saramsey commented May 12, 2021

NCATS wants to be able to deploy the KG2 KP via Docker. I think it makes sense to re-interpret this as "PloverDB", since there is a natural separation (in code, hosted instance, etc.) between PloverDB and ARAX. So, we will need to write a Dockerfile for it.

@saramsey saramsey changed the title make KG2 KP Dockerfile prepare a version of PloverDB for "turnkey" deployment by NCATS May 12, 2021
@amykglen
Member

amykglen commented May 13, 2021

adding some notes after today's discussion about this:

Plover's current Dockerfile is here and how-to steps for building an image/running it are here.

(I just moved the running of that one script I mentioned into the Dockerfile.)

so I think the main piece missing is perhaps automating the grabbing of the KG file (and specifying which Biolink model version to use)? anything else?

also suggested by Eric on today's call: tweak Plover's test suite a bit so that NCATS could easily run the tests against whatever endpoint they put their Plover at (so they can verify it indeed seems to be working).

@amykglen amykglen added the kg2c label May 15, 2021
@amykglen amykglen added this to Working in KG2c dev May 15, 2021
@saramsey
Member Author

saramsey commented May 17, 2021

> tweak Plover's test suite a bit so that NCATS could easily run the tests against whatever endpoint they put their Plover at (so they can verify it indeed seems to be working).

This would likely be welcomed by NCATS ITRB. What are your thoughts about feasibility, @amykglen ?

@saramsey
Member Author

See also the NCATS ITRB Standard CI CD Policy document.

@edeutsch
Collaborator

There was some grumbling about SQLite databases at today's deployment call. It seemed that ITRB was trying to steer people away from SQLite and instead replace it with an external enterprise DB. They didn't close the door on it, but it seemed like they were not so happy about it (this was in the context of MolePro's deployment, not ours).

It seems like we could include our databases in the Docker container, but this would cause some very large containers. Or alternatively we could download the databases at launch time, but this incurs a substantial delay in launch and also high network requirements for each launch.

@saramsey
Member Author

saramsey commented May 17, 2021

> There was some grumbling about SQLite databases at today's deployment call. It seemed that ITRB was trying to steer people away from SQLite and instead replace it with an external enterprise DB. They didn't close the door on it, but it seemed like they were not so happy about it (this was in the context of MolePro's deployment, not ours).
>
> It seems like we could include our databases in the Docker container, but this would cause some very large containers. Or alternatively we could download the databases at launch time, but this incurs a substantial delay in launch and also high network requirements for each launch.

Thank you @edeutsch for bringing this to the team's attention.

I feel like a pragmatic approach is to specify that we need a host volume (to hold our sqlite files) in the Dockerfile. We don't have to care what the absolute path is to the volume in the host OS, that's transparent to the container. We just specify a within-container path for the volume, and it appears to us like a directory where we can write/read data without bloating the container size. https://docs.docker.com/storage/volumes/
Perhaps NCATS IT is objecting to bind-mounting, and if so, perhaps they have a point. As I understand it, bind-mounting has some complexities (like permissions and file-ownership issues) that the "volume" approach is specifically designed to avoid. Our job is just to tell NCATS IT "we need this much space in a volume" and insert a directive to create the volume in the Dockerfile. Then we can have an initialization process (shell script, Python module, whatever) that is designed to run inside the container the first time the container is set up. That script should HTTPS GET (cURL, whatever) the SQLite files into the volume so that they are available to ARAX.

As for the concept of a central managed database: having to update schemas and database contents in a NCATS IT-managed RDBMS would really really slow down updates and feature enhancements to ARAX. And I think it is not necessary if we just use the volume approach and initialize the sqlite files into the volume after the container is born, not as a part of the docker image. I think this should avoid bloating the image or the container.
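
The first-run initialization step described above could look roughly like the sketch below. It assumes the databases are reachable at plain URLs and that the volume is mounted at some within-container path; the function name `ensure_files` and any paths or URLs passed to it are hypothetical, not taken from the ARAX or PloverDB code.

```python
import shutil
import urllib.request
from pathlib import Path


def ensure_files(volume_dir, urls):
    """Download each URL into volume_dir unless the file is already there.

    Returns the list of file names fetched on this call.
    """
    volume = Path(volume_dir)
    volume.mkdir(parents=True, exist_ok=True)
    fetched = []
    for url in urls:
        # Use the last path component of the URL as the local file name.
        target = volume / url.rsplit("/", 1)[-1]
        if target.exists():
            continue  # already initialized on an earlier start
        partial = target.with_name(target.name + ".part")
        # Download under a temporary name first so an interrupted download
        # is never mistaken for a complete database file.
        with urllib.request.urlopen(url) as response, open(partial, "wb") as out:
            shutil.copyfileobj(response, out)
        partial.replace(target)
        fetched.append(target.name)
    return fetched
```

Because files that already exist are skipped, the same script can run on every container start: only the first start (with an empty volume) pays the download cost, and the image itself stays small.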

@amykglen
Member

> > tweak Plover's test suite a bit so that NCATS could easily run the tests against whatever endpoint they put their Plover at (so they can verify it indeed seems to be working).
>
> This would likely be welcomed by NCATS ITRB. What are your thoughts about feasibility, @amykglen ?

nice - it was pretty trivial to add; just did so and added some 'how to test' steps to the README here.
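
For readers wondering what "run the tests against whatever endpoint" can look like in practice, here is a minimal sketch that reads the target from an environment variable. The variable name `PLOVER_ENDPOINT`, the default URL, and the `query` path are assumptions made for illustration; the README's actual 'how to test' steps may use a different mechanism.

```python
import os
from urllib.parse import urljoin

# Hypothetical names: PLOVER_ENDPOINT and the default below are assumptions
# for illustration, not taken from the PloverDB repository.
DEFAULT_ENDPOINT = "http://localhost:9990"


def get_endpoint(environ=os.environ):
    """Return the base URL of the Plover instance under test."""
    endpoint = environ.get("PLOVER_ENDPOINT", DEFAULT_ENDPOINT)
    # Normalize to exactly one trailing slash so urljoin appends the path
    # instead of replacing the last path segment.
    return endpoint.rstrip("/") + "/"


def query_url(environ=os.environ):
    """Build the URL that test queries would be sent to (path is assumed)."""
    return urljoin(get_endpoint(environ), "query")
```

With something like this in place, the same test file can be run unchanged against a local container or a deployed instance by setting the variable before invoking the tests.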

@amykglen
Member

amykglen commented May 19, 2021

so for building a plover docker image, @saramsey and I were figuring that it'd make the most sense for the KG2c JSON file to be scp-ed from arax.ncats.io (which means NCATS' host machine that they build images on will need to have its RSA public key on arax.ncats.io, under 'rtxconfig'.)

because the plover Dockerfile only needs scp access when building an image (and not when running a container), I see a few options for how to make this scp work, and I'm wondering if anyone has input about which of these seems better/worse for NCATS (@saramsey, @edeutsch or others):

  1. NCATS has to put copies of their host id_rsa.pub and id_rsa into their clone of the repo (i.e., in the PloverDB/ directory)
  2. NCATS has to feed the contents of their host id_rsa.pub and id_rsa into the docker build command (as args, kinda like this)
  3. NCATS has to do bash build.sh instead of a docker build command to build the docker image (using a build script means the part that scps the JSON file can be run before doing the docker build command, and thus nothing fancy with RSA keys needs to happen)

one downside of options 1 and 2 is that the RSA key pair persists in the image, but I suppose this is what the ARAX dockerfile will have to do anyway(?)

anyway, I've figured out how to make any of these options work technically (with the help of Steve's key install Dockerfile code :)), but just wondering if anyone has input about which they think NCATS would like best...
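
As a rough illustration of option 3 (fetch first, then build), the wrapper below only assembles the two commands in order. The user and host come from the discussion above; the remote file path and image tag are placeholders, and the function names are hypothetical. It is written in Python rather than bash purely for the sake of the sketch.

```python
import shlex


def build_commands(remote_file, image_tag, user="rtxconfig", host="arax.ncats.io"):
    """Return the commands a build wrapper would run, in order."""
    return [
        # Step 1: copy the KG file into the build context, using the
        # host's own SSH key (nothing key-related enters the image).
        ["scp", f"{user}@{host}:{remote_file}", "."],
        # Step 2: build the image; the Dockerfile would pick the file up
        # from the build context.
        ["docker", "build", "-t", image_tag, "."],
    ]


def render(commands):
    """Format the commands as shell lines, e.g. for a dry run."""
    return "\n".join(shlex.join(cmd) for cmd in commands)
```

Since the copy happens on the host before `docker build`, the RSA key pair never needs to be passed into or stored in the image, which is the main difference from options 1 and 2.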

@edeutsch
Collaborator

I just don't understand the details of Docker container building and deployment well enough to offer any informed input here.

My only concern that might be worth considering is: suppose the keys were compromised somehow, under this scheme the perp would potentially have access to lots more on arax.ncats.io (any file with world read?) unless we are super careful about locking things down.

Might it make more sense to put the data in an S3 bucket and give NCATS read-only keys to the S3 bucket to be used for Docker image building? Should the keys be compromised somehow, there is only access to the contents of the bucket.

@amykglen
Member

fair point - I think I'm actually favoring the S3 option at this point... realized that I don't think it'd be too bad to configure aws-cli in the Dockerfile - we could just give NCATS their access_key and secret_access_key and they could pass them as args in the docker build command. seems like it might even be a bit easier for NCATS compared to what they'd have to do to set up ssh key access.

@amykglen
Member

ok, I got things working (locally) so that awscli is installed and configured via the Dockerfile and can successfully grab files from S3. so the command NCATS would need to use to build the image is:

docker build -t myimage --build-arg aws_access_key=XXXX --build-arg aws_secret_key=YYYY .

where XXXX and YYYY are the keys we give them that allow them to download from a particular S3 bucket (probably would want to create a new S3 bucket for this?)

I think I favor this option vs. the scp option for a few reasons:

  1. NCATS can build this image from whatever host they want and we never have to put their host RSA key anywhere
  2. they also don't have to do any setup steps such as moving their id_rsa.pub and id_rsa files into a certain location
  3. it seems more secure (as @edeutsch pointed out) since even if keys were somehow compromised, their access_key/secret_key would only be allowed to download files from a particular S3 bucket, so I can't imagine much damage could really be done

@saramsey - does this S3 method seem ok to you?
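
To make point 3 concrete, the sketch below generates the kind of IAM policy document that would limit a keypair to listing and downloading from one bucket. The bucket name is a placeholder and the helper name is hypothetical; the policy actually attached for NCATS may differ.

```python
import json


def read_only_bucket_policy(bucket_name):
    """Build an IAM policy allowing only listing and downloading from one bucket."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Listing applies to the bucket itself.
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": [f"arn:aws:s3:::{bucket_name}"],
            },
            {
                # Downloading applies to the objects inside the bucket.
                "Effect": "Allow",
                "Action": ["s3:GetObject"],
                "Resource": [f"arn:aws:s3:::{bucket_name}/*"],
            },
        ],
    }


if __name__ == "__main__":
    # "example-plover-bucket" is a placeholder bucket name.
    print(json.dumps(read_only_bucket_policy("example-plover-bucket"), indent=2))
```

With no write or delete actions granted and no other resources named, a leaked keypair would expose only the files already placed in that bucket.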

@amykglen amykglen added sar-look Marks an issue that Steve needs to examine and removed sar-look Marks an issue that Steve needs to examine labels May 20, 2021
@amykglen
Member

posting for the record: Steve and I discussed a bit and agreed to start with this S3 strategy, perhaps with some minor tweaks. but I also reached out to Amit to see if NCATS has any preferences here, since it's difficult to know what would work best for them and their automation plans.

@amykglen
Member

amykglen commented May 22, 2021

so I haven't heard anything from NCATS on preferences about the above, but I pushed code that uses the S3 method, with a minor tweak so that the AWS keypair is copied into the image (they just have to put a copy of their .aws directory into the repo before building the image). seems preferable to passing via the command line, although still not sure what would work best for them. (but whatever they end up wanting, it should be a minor tweak.)
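
Since the build now depends on a `.aws` directory being present in the repo, a small pre-build check can catch a missing or incomplete credentials file early. This is a sketch only: it assumes the standard AWS credentials file layout (`.aws/credentials` with a `[default]` profile), and the function name is hypothetical rather than something in the PloverDB repo.

```python
import configparser
from pathlib import Path

# Key names used by the standard AWS shared credentials file.
REQUIRED_KEYS = ("aws_access_key_id", "aws_secret_access_key")


def missing_credentials(repo_dir, profile="default"):
    """Return the required keys missing from <repo_dir>/.aws/credentials."""
    path = Path(repo_dir) / ".aws" / "credentials"
    parser = configparser.ConfigParser()
    parser.read(path)  # yields an empty config if the file is absent
    if not parser.has_section(profile):
        return list(REQUIRED_KEYS)
    return [
        key
        for key in REQUIRED_KEYS
        if not parser.get(profile, key, fallback="").strip()
    ]
```

An empty return value means both keys are present for the chosen profile; anything else names what still needs to be added before running the image build.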

I updated the README (here) and tested to verify that everything works as expected (the KG file is successfully downloaded from S3 and everything builds fine).

I think the preferred plan is that NCATS would give us a keypair that we can then grant read-only access to the S3 bucket.

so I think we can call this done for now?

@amykglen amykglen moved this from Working to Done in KG2c dev May 22, 2021
@amykglen
Member

I'll go ahead and close this issue - we can reopen or create a new issue if NCATS has any tweaks.
