Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use Apache Mina, remove trilead for ssh connectivity #1127

Merged
merged 1 commit into from
Jun 14, 2024

Conversation

olamy
Copy link
Member

@olamy olamy commented Apr 10, 2024

Use Apache Mina, remove trilead for ssh connectivity

  • Replace trilead-api with Apache Mina ssh

Checklist

  • I have read the CONTRIBUTING doc
  • I have referenced the Jira issue related to my changes in one or more commit messages
  • I have added tests that verify my changes
  • Unit tests pass locally with my changes
  • I have added documentation as necessary
  • No Javadoc warnings were introduced with my changes
  • No spotbugs warnings were introduced with my changes
  • I have interactively tested my changes

Types of changes

What types of changes does your code introduce?

  • New feature (non-breaking change which adds functionality)

@github-actions github-actions bot added the tests Automated test addition or improvement label Apr 10, 2024
pom.xml Outdated Show resolved Hide resolved
@olamy olamy marked this pull request as ready for review June 7, 2024 06:22
@olamy olamy requested a review from a team as a code owner June 7, 2024 06:22
@olamy olamy changed the title [WIP] Use Apache Mina, remove trilead for ssh connectivity Use Apache Mina, remove trilead for ssh connectivity Jun 7, 2024
@olamy olamy requested a review from jtnord June 7, 2024 07:51
@olamy
Copy link
Member Author

olamy commented Jun 7, 2024

currently 3 new tests are failing because using testcontainers and need some changes in Jenkinsfile but it's not taken into account.

@MarkEWaite MarkEWaite added the enhancement Improvement or new feature label Jun 7, 2024
@MarkEWaite MarkEWaite self-assigned this Jun 7, 2024
@MarkEWaite
Copy link
Contributor

Thanks very much @olamy !

I've pushed the Jenkinsfile change to the master branch so that the plugin can use testcontainers.

Copy link
Member

@jtnord jtnord left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have 3 failures locally

[ERROR]   AuthzTest.httpWithPassword:93->testRepo:125 » IllegalArgument Invalid value: gpg.format=ssh
[ERROR]   AuthzTest.sshKeyAuth:57->testRepo:125 » IllegalArgument Invalid value: gpg.format=ssh
[ERROR]   AuthzTest.sshWithPassword:76->testRepo:125 » IllegalArgument Invalid value: gpg.format=ssh
[ERROR] org.jenkinsci.plugins.gitclient.jgit.AuthzTest.httpWithPassword -- Time elapsed: 7.125 s <<< ERROR!
java.lang.IllegalArgumentException: Invalid value: gpg.format=ssh
        at org.eclipse.jgit.lib.DefaultTypedConfigGetter.getEnum(DefaultTypedConfigGetter.java:103)
        at org.eclipse.jgit.lib.Config.getEnum(Config.java:454)
        at org.eclipse.jgit.lib.GpgConfig.<init>(GpgConfig.java:86)
        at org.eclipse.jgit.api.CommitCommand.processOptions(CommitCommand.java:662)
        at org.eclipse.jgit.api.CommitCommand.call(CommitCommand.java:189)
        at org.jenkinsci.plugins.gitclient.JGitAPIImpl.commit(JGitAPIImpl.java:555)
        at org.jenkinsci.plugins.gitclient.jgit.AuthzTest.testRepo(AuthzTest.java:125)
        at org.jenkinsci.plugins.gitclient.jgit.AuthzTest.httpWithPassword(AuthzTest.java:93)
        at java.base/java.lang.reflect.Method.invoke(Method.java:566)

see #1124 for previous bad assumptions and how they where fixed

pom.xml Outdated Show resolved Hide resolved
pom.xml Outdated Show resolved Hide resolved
pom.xml Outdated Show resolved Hide resolved
pom.xml Outdated Show resolved Hide resolved
pom.xml Outdated Show resolved Hide resolved
@MarkEWaite MarkEWaite removed the tests Automated test addition or improvement label Jun 8, 2024
@github-actions github-actions bot added the tests Automated test addition or improvement label Jun 8, 2024
@MarkEWaite MarkEWaite removed the tests Automated test addition or improvement label Jun 8, 2024
@github-actions github-actions bot added the tests Automated test addition or improvement label Jun 8, 2024
pom.xml Outdated Show resolved Hide resolved
pom.xml Outdated Show resolved Hide resolved
pom.xml Outdated Show resolved Hide resolved
pom.xml Show resolved Hide resolved
@MarkEWaite
Copy link
Contributor

I've installed an incremental build of this pull request in my test environment and run my hosting providers tests. There are several new failure cases compared to the released 4.7.0 version of the plugin. Unfortunately, I've run out of time for today and need to do other things.

I hope to capture the details of the new failures and will post them here when I have confirmed that there is a behavior change between the released code and the pull request. Sometimes the failures are entirely the fault of my test infrastructure.

@olamy
Copy link
Member Author

olamy commented Jun 10, 2024

@MarkEWaite I managed to make the test failure repeatable (a bit flaky to async io aspect of mina)
just need to figure out the problem now.

MarkEWaite added a commit to MarkEWaite/bom that referenced this pull request Jun 11, 2024
jenkinsci/git-client-plugin#1127 has been tested
interactively and has been reviewed.  Initial tests with the plugin bill
of materials detected an incompatibility.  That incompatibility has been
fixed and further interactive testing has been done successfully.

Run the weekly test to see if it detects other incompatibilities.
@MarkEWaite
Copy link
Contributor

MarkEWaite commented Jun 12, 2024

During some testing I found that the known hosts file stored in the temporary directory was corrupted or damaged on my one Windows agent that allows more than one executor. I'm too tired to continue the testing, but I am suspicious enough of the thread safety of the known hosts file handling that I'll do some further testing the next time I have a chance. That won't be tomorrow because tomorrow is filled with other things.

I was running all my hosting provider tests at the same time and queuing multiple runs. Most of my Windows agents only have a single executor assigned, but one machine has 3 executors. That machine is the one that scrambled the contents of the known hosts file.

@olamy
Copy link
Member Author

olamy commented Jun 12, 2024

During some testing I found that the known hosts file stored in the temporary directory was corrupted or damaged on my one Windows agent that allows more than one executor. I'm too tired to continue the testing, but I am suspicious enough of the thread safety of the known hosts file handling that I'll do some further testing the next time I have a chance. That won't be tomorrow because tomorrow is filled with other things.

I was running all my hosting provider tests at the same time and queuing multiple runs. Most of my Windows agents only have a single executor assigned, but one machine has 3 executors. That machine is the one that scrambled the contents of the known hosts file.

I will need more information such:

  • what verifier strategy is used?
  • did you start with an empty known hosts file
  • any error in jenkins log or node logs?

This would be surprising to see concurrent access as this should be really better now with JGit as it is using a LockFile implementation (see https://github.com/eclipse-jgit/jgit/blob/master/org.eclipse.jgit/src/org/eclipse/jgit/internal/storage/file/LockFile.java) compared to trilead which was using simply no sync mechanism (see https://github.com/jenkinsci/trilead-ssh2/blob/15dfef32ff14a4c3eba66552f3636f886ad340c4/src/com/trilead/ssh2/KnownHosts.java#L648)

@MarkEWaite
Copy link
Contributor

MarkEWaite commented Jun 12, 2024

Thanks for asking the questions. I haven't investigated enough to give really good answers yet, but here are some of the things that I can answer.

I will need more information such:

  • what verifier strategy is used?

I'm using manually provided key verification strategy. Details are in the configuration as code of my controller.

  • did you start with an empty known hosts file

I believe that I started with a non-existent known hosts file because the one agent log reported that a temporary file that included the word "knownhosts" in its name did not contain the host key for bitbucket.org. Since it was a file in a temporary directory with "knownhosts" and a number in the file name, I assume it was created by the agent some time after the agent started.

Unfortunately, I don't have the exact message but I believe it was related to the data in the known hosts file being somehow incorrect or incomplete.

  • any error in jenkins log or node logs?

Yes, there were multiple messages in the agent log that indicated the temporary file with "knownhosts" in its name did not contain the host key for several different Git providers. I believe that Bitbucket and Gitea were both mentioned and I think that one other was mentioned.

I'll experiment further with it on my next testing round.

It may also be a remnant from some previous problem that I only discovered because this is the first time that I've tried high volume tests of hosting providers. I previously would click each job from the user interface to start it. Now I start the jobs from the Jenkins script console with a system groovy script:

// Schedule every job in each folder of the Hosting-Providers folder
import hudson.model.Cause.UserIdCause
def hostingProviders = Jenkins.instance.getJob('Hosting-Providers')
def hostingFolders = hostingProviders.getItems()
hostingFolders.each {
  def jobs = it.getItems()
  jobs.each {
    it.scheduleBuild(0, new UserIdCause())
  }
}

That script starts 96 jobs, with at least a few of the jobs as Pipeline jobs that perform checkout on multiple agents. It immediately loads my 47 agents with jobs. When I run that script a second time, it doubles the number of jobs.

@olamy
Copy link
Member Author

olamy commented Jun 12, 2024

That script starts 96 jobs, with at least a few of the jobs as Pipeline jobs that perform checkout on multiple agents. It immediately loads my 47 agents with jobs. When I run that script a second time, it doubles the number of jobs.

does this work with current code (master) using trilead implementation?

TBH the goal of this PR is to change from Trilead to Apache Mina.
Same for @jtnord comment of directory permissions (btw there is no requirement to have special permissions of the directory ~/.ssh according to https://linux.die.net/man/1/ssh)
Do we really want to add out of context code changes?
I usually get comments on non-related changes (such as extra lines or spaces added somewhere), and now you are asking about behavior changes. so I'm lost 😆
At this stage, this PR has reached a status where it works (including bug backward compatibility :)).
I don't mind adding extra nonrelated changes, but personally, I would prefer having this merged and released to get some more feedback early, as there will obviously be some bugs. However, we cannot retain this for ages and delay all the next feedbacks from more users with different configuration setups, etc.

@olamy
Copy link
Member Author

olamy commented Jun 13, 2024

@MarkEWaite, I looked again at the manually provided verification strategy. It's using his own file per client instance (same implementation as master with trilead or even the cli one, so there shouldn't be any concurrent access issues.

@olamy
Copy link
Member Author

olamy commented Jun 13, 2024

@MarkEWaite do you have other plugins installed such https://github.com/jenkinsci/eddsa-api-plugin?

@MarkEWaite
Copy link
Contributor

does this work with current code (master) using trilead implementation?

An excellent question. I should have thought of that myself. Further exploration shows that my initial theory of a concurrency issue was wrong. I added extra executors to all of the Windows machines in my environment, some with processors that are faster than the problematic machine, some with processors that are slower. None of the other machines showed any issue. I think the problem I was seeing is specific to that machine and is not worth delaying this pull request.

Do we really want to add out of context code changes?

No, I definitely don't want to add code changes that are out of context. You've done great work and my testing confirms that great work.

I think we should squash merge this change and release immediately. Does that work for you?

@MarkEWaite
Copy link
Contributor

@MarkEWaite do you have other plugins installed such https://github.com/jenkinsci/eddsa-api-plugin?

Yes, I have eddsa plugin installed in my plugins.txt. I believe that it is installed as a dependency of ssh server plugin and trilead API plugin.

@olamy
Copy link
Member Author

olamy commented Jun 14, 2024

does this work with current code (master) using trilead implementation?

An excellent question. I should have thought of that myself. Further exploration shows that my initial theory of a concurrency issue was wrong. I added extra executors to all of the Windows machines in my environment, some with processors that are faster than the problematic machine, some with processors that are slower. None of the other machines showed any issue. I think the problem I was seeing is specific to that machine and is not worth delaying this pull request.

Do we really want to add out of context code changes?

No, I definitely don't want to add code changes that are out of context. You've done great work and my testing confirms that great work.

I think we should squash merge this change and release immediately. Does that work for you?

I just fixed some merge conflict with master.
Sounds definitely good to squash/merge -> release -> celebrate (until the first user's bug :) )

@MarkEWaite
Copy link
Contributor

I think we should squash merge this change and release immediately. Does that work for you?

I just fixed some merge conflict with master. Sounds definitely good to squash/merge -> release -> celebrate (until the first user's bug :) )

I think we should increment the version number to 5.0.0 because this is intentionally deleting some public classes. I've checked that they are unused public classes, but I think that consumers will be better served by knowing that the public API has intentionally changed to remove many of the trilead classes. I'll do that in a separate pull request after this is merged, then will run a quick test to confirm. I hope to have that completed before the weekend is done.

@MarkEWaite
Copy link
Contributor

MarkEWaite commented Jun 14, 2024

I just fixed some merge conflict with master.

There wasn't a merge conflict with master as far as I can tell. The pull request had 8176eab as a merge commit that brought in the most recent master branch. It looks like your merge resolution added many more commits to the history. It seems to have created a duplicate of all the commits that had been in the pull request. For example:

The duplication does no harm so long as we squash merge the changes. I don't think we should try to repair the duplication. Better to squash merge it and prepare for the 5.0.0 release.

I compared the tip of the current branch with copies that I had pushed to my fork, and they are the same file content.

@olamy
Copy link
Member Author

olamy commented Jun 14, 2024

I just fixed some merge conflict with master.

There wasn't a merge conflict with master as far as I can tell. The pull request had 8176eab as a merge commit that brought in the most recent master branch. It looks like your merge resolution added many more commits to the history.

glursp that's weird. I just used a standard git rebase origin/master
I bumper version to 5.0.0
anyway what do you prefer squashing when you merge via UI or I can do it locally.
I don't really have strong preference at the end of the day it's the same result.

@MarkEWaite
Copy link
Contributor

anyway what do you prefer squashing when you merge via UI or I can do it locally.

If you're willing to do the squash merge, that's great for me. That allows you to have local control of it and be confident of the content of the squash.

@MarkEWaite
Copy link
Contributor

It has been a long day for me. I'll let you complete the squash and the merge and I'll plan to do more testing tomorrow. I need some sleep now.

Signed-off-by: Olivier Lamy <olamy@apache.org>
@olamy
Copy link
Member Author

olamy commented Jun 14, 2024

It has been a long day for me. I'll let you complete the squash and the merge and I'll plan to do more testing tomorrow. I need some sleep now.

squash done.
no worries. have some rest. Thanks for your help.

@MarkEWaite MarkEWaite merged commit 7d038eb into jenkinsci:master Jun 14, 2024
15 checks passed
@olamy olamy deleted the apache-mina-ssh branch June 16, 2024 11:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement Improvement or new feature tests Automated test addition or improvement
Projects
None yet
3 participants