Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CI LAB fails when running more than one instance on the same host #37

Closed
jphickey opened this issue Feb 8, 2020 · 30 comments · Fixed by #41 or #44
Closed

CI LAB fails when running more than one instance on the same host #37

jphickey opened this issue Feb 8, 2020 · 30 comments · Fixed by #41 or #44
Assignees
Labels
bug Something isn't working
Milestone

Comments

@jphickey
Copy link
Contributor

jphickey commented Feb 8, 2020

Is your feature request related to a problem? Please describe.
With a CFE project that uses more than one target (multi-cpu), it is common during development to run more than one CFE core process on the same physical machine (i.e. when simulating on native).

But CI_LAB unconditionally binds to port 1234 so if more than one of those CFE core process is running CI_LAB then it fails to bind because the port is in use.

Describe the solution you'd like
CI_LAB should at least modify its port number based on the CPU number from the PSP. This would allow all cfe-core instances from the same mission to avoid conflict, and not require a custom build of CI_LAB for each CPU.

Describe alternatives you've considered
Containers would also theoretically solve this too but it is complicated to set up and also makes debugging more complex which is generally the objective of doing a native build in the first place.

Additional context
Suggest using (base_port + cpu_number) as the port to which CI_LAB binds, so each CPU gets a different port.

Also suggest setting the base port to 1233 instead of the current value of 1234, so that CPU1 ends up getting the same number. So this way the "cmdUtil" still works with the same default port and will go to CPU1.

Requester Info
Joseph Hickey, Vantage Systems, Inc.

@jphickey jphickey self-assigned this Feb 8, 2020
@jphickey
Copy link
Contributor Author

jphickey commented Feb 8, 2020

This is a simple change but it depends on pull #35 first

@skliper
Copy link
Contributor

skliper commented Feb 10, 2020

How about a similar approach as JSC, just pass in the port number at startup? Avoids being CPU based.
-p portNum

@skliper
Copy link
Contributor

skliper commented Feb 10, 2020

Note the JSC approach also plays nicely with the cFS Test Framework.

@jphickey
Copy link
Contributor Author

No, I'm talking about the other side of the connection (the listening socket), not cmdUtil (the sending socket). One would use the -p option on cmdUtil to send to something other than CPU1.

The problem is on the CFE FSW side - when you have more than one CPU running CI_LAB, the second one fails to start because it cannot bind to port 1234.

@skliper
Copy link
Contributor

skliper commented Feb 10, 2020

No, I'm talking about the listening socket not cmdUtil. JSC passes in the port number at startup for ci_lab to listen on.

@jphickey
Copy link
Contributor Author

Through what mechanism do they do this? CI_LAB currently hard codes the listening port as 1234:

#define cfgCI_PORT 1234

Assuming they have some sort of extension to allow passing in a port number, that's fine, but I still think the default should be such that it does not conflict (and fail) when running more than one CPU.

@skliper
Copy link
Contributor

skliper commented Feb 10, 2020

I'm not a fan of basing it on the CPU. I am a BIG fan of being able to make it unique, either passing in or defining as part of target configuration, or whatever other scheme allows the user to select a unique port for each instance. Basically I prefer an explicit definition of port that can be easily set as unique per target.

I haven't dug into the different JSC implementations (there are numerous CI's out there), but it looks to me like there's a configuration in CTF that allows passing in the port number when starting cFS. So I assume there's support for it somewhere (although maybe it's just on a wish list).

@jphickey
Copy link
Contributor Author

I tend to agree for a real FLIGHT app that every config should be explicitly set. This would typically be done via a configuration table file where each instance gets its own config, and everything (like the port number) gets assigned in this configuration file.

But for the the LAB apps (CI_LAB, TO_LAB, SCH_LAB) these are not flight apps, they are debug apps, and should be as debugger-friendly as possible. That means minimizing the external dependencies so they can ideally "just work" using reasonable defaults whenever possible. This way when you run the CFE within your debugger/IDE, one doesn't have to worry about lots of command line options or configuration options to pass in just to make it run.

Again, the subject here is just the default listen port for a debug/lab app - an explicit config (if one were to exist) would always take precedence, but right now there is no explicit config for CI_LAB as far as I know, and adding such a thing would constitute significant added complexity for what is supposed to be a simple debug app.

The real problem with the code today is a guaranteed failure when debugging more than one ci_lab instance on the same host. I will adjust the title of this ticket to not suggest a particular solution/workaround but adding the CPU number did seem like a fairly trivial/easy workaround to get the job done.

@jphickey jphickey changed the title CI LAB should modify its listening port based on CPU number CI LAB fails when running more than one instance on the same host Feb 10, 2020
@skliper
Copy link
Contributor

skliper commented Feb 11, 2020

How about adding a command to change ports. Doesn't mess with other layers (like passing in), and isn't arbitrarily based on CPU. Always starts up at a configuration defined default.

Allows the same image to be used, easy one liner from cmdUtil to deconflict after startup (easily scriptable).

@jphickey
Copy link
Contributor Author

Couple issues there - CI_LAB binds before it gets to its command loop, and also it has to be bound to a port before one can even send a command into it (assuming the command is sent externally).

Not sure why the resistance - the proposed default change is trivial code-wise, and does not affect the "typical" use-case of a single CPU and solves the problem reasonably well.

@skliper
Copy link
Contributor

skliper commented Feb 11, 2020

Yes, bind to default at startup. Then if you want to deconflict send a command, unbind, bind to new port. Then start the next instance. User could default to different ports at startup if they don't want to use this technique (and have a unique ci_lab build for each) if they don't like the default collision.

Associating port with CPU is arbitrary. What if I want to run 2 instances on the same CPU? Broken all over again based on this design choice. Also adds a layer of complexity to debugging a connection that isn't working since port assignment now has an additional dependency. I don't like it. Lets come up with a different solution.

@jphickey
Copy link
Contributor Author

But the problem is you CANNOT bind to the default at startup. Only one will work, the second one will not. When the second one fails, it bails out of the initialization and the app fails to start. There is no loop to accept a command and re-bind.

Even if it were feasible, it is a FAR more complex solution, introducing a number of "state" variables into CI_LAB that were not there before, and would almost constitute an entire rewrite of the app.

The problem of running two entirely separate missions on the same CPU is not that big of a problem - it is easily solvable by changing the base port on one of them at compile-time. One would always build a separate CI_LAB binary for a separate mission and therefore this is relatively easy to do. The configuration header file for CI_LAB on the two separate missions is always separate and can independently change.

The problem is when you run more than one instance of CI_LAB within the same mission (i.e. on CPU1 and CPU2) where these always use the same config. Even if the CI_LAB is built separately for each CPU they both use the same header file which defines the port number, so they will still conflict.

I wouldn't call starting at a well-defined "base port" and incrementing with each successive instance to be arbitrary - it is a logical and has well defined/predictable result. Picking a random port number would be arbitrary.

Random is the other possible solution; just pick a random port, and bind to it, then display in a log what the port is so that it can be supplied as a "-p" option to cmdUtil - but I think that is a bad solution as it necessitates ALWAYS using the -p option where we do not need it in most cases now. Whereas using the CPU number as an offset is transparent in the typical use-case.

@jphickey
Copy link
Contributor Author

We could even change the "base port" to be something like 10000 instead of 1234. That would mean CPU1 gets UDP port 10001, CPU2 gets 10002, CPU3 gets 10003, and so on. Very logical and easy to remember.

@jphickey
Copy link
Contributor Author

And of course, that "base port" goes in a separate config file, where it is easily settable/overridable within the defs directory so missions can set the port however they like, and if you need to run more than one mission simultaneously, easy to choose a non-overlapping port range.

@skliper
Copy link
Contributor

skliper commented Feb 11, 2020

But the problem is you CANNOT bind to the default at startup. Only one will work, the second one will not. When the second one fails, it bails out of the initialization and the app fails to start. There is no loop to accept a command and re-bind.

Did you read my suggestion? Doesn't sound like it... Start instance one, command to new port, start instance 2. No conflict.

Even if it were feasible, it is a FAR more complex solution, introducing a number of "state" variables into CI_LAB that were not there before, and would almost constitute an entire rewrite of the app.

Simpler software doesn't mean simpler maintenance/debugging. Adding a command to switch ports is not an entire rewrite.

Feel free to suggest a different alternative. I'm not in favor of based on CPU, so lets move on.

@skliper
Copy link
Contributor

skliper commented Feb 11, 2020

Or pass in as a table that can be used at startup... change tables for each instance.

@skliper
Copy link
Contributor

skliper commented Feb 11, 2020

I don't like the table solution, but at least it's an example that can avoid the conflict if running two instances on the same CPU.

@skliper
Copy link
Contributor

skliper commented Feb 11, 2020

Or if bind fails increment to next port number... report on startup

@skliper
Copy link
Contributor

skliper commented Feb 11, 2020

Add a limit or range... bail if they all fail.

jphickey added a commit to jphickey/ci_lab that referenced this issue Feb 11, 2020
Avoid port conflicts when having more than one CPU within the same
mission load CI_LAB app.  Instead of binding to the port directly,
add the CPU number as an offset.  Subtract 1 so that CPU1 remains
at the same port (1234 by default) and cmdUtil is not affected.
@jphickey
Copy link
Contributor Author

The reality is that I have implemented the CPU offset port number conflict-avoidance long ago in the EDS branch. So it is already done, trivial, and solves the problem at hand in a developer-friendly way which I think is totally reasonable for a LAB app.

The suggestion to start one CPU, send a command to re-bind, then start the second CPU is way too complicated to do every time you want to restart the software (which happens hundreds of times as you are debugging apps throughout the day, which are the users of CI_LAB). The solution to this should just work without continuous intervention by the developer.

I pushed the existing, already-implemented, trivial change to my github fork. I'll leave the over-complicated solution to a future date as I don't really have the time budget to do some major rewrite of CI_LAB to deal with unbinding/rebinding ports right now.

Or if bind fails increment to next port number... report on startup

That's basically what commit 321b4a7 is, minus the randomness, by pre-incrementing by a consistent/predictable amount. This way if you kill the CFE core and restart it, it comes back at the same port. It won't randomly change to a different port on the next boot - which would be bad/annoying in a debug cycle.

@skliper
Copy link
Contributor

skliper commented Feb 11, 2020

The significant difference is it doesn't solve the conflict with two instances on one CPU. I'm not in favor such a behavior change without also addressing that issue (without requiring a completely different mission build).

@jphickey
Copy link
Contributor Author

When you say "two instances" please clarify the specific situation, as there are different ways of arriving at "two instance" of ci_lab:

  1. Two CPUs (or "core" instances) defined within one CFE project mission running the ci_lab app
  2. Two completely independent CFE missions individually running ci_lab on at least one of the CPUs within them

The first is what is specifically addressed here. The second is solvable my modifying the cfgCI_LAB_PORT value.

(Side note - Yes, it would be better to pull that config parameter out into the defs directory so it is easier to tune. But that is really a separate issue and there is already issue #2 which covers that - this is more of a convenience/visibility issue really, not a functional one)

I would also argue that the item 2 above is unlikely to really be a stumbling block for real users, as it requires running both missions at the same time, and generally most developers would shut the other instance of CFE down before they started a new one. Whereas if the two are intended to work together, they should probably be item 1 above, not item 2. Not arguing its not an issue, just a considerably less common one, and not one that justifies a substantial change to make it work.

I honestly didn't think this would be a point of contention, right now I'm working to resolve the differences between the framework and the EDS branch as we have some users waiting on it. Hence why I'll need to move forward with commit 321b4a7 in our builds for now. Hopefully there can eventually be some solution in the framework for this.

@jphickey
Copy link
Contributor Author

I will also point out that the item 2 in my previous comment is not limited to CFE. For instance when I run lighttpd locally on my debug host I have to tweak the configuration for each instance so the listening ports do not overlap, but the "default" config from the vendor will try to bind to port 80 and fail to start, and this is fully expected.

So basically, any developer who has a complex-enough setup to be theoretically running multiple separate instances of the software at the same time and on the same host should also be wholly capable of changing the port configuration on sufficiently to make it run. I don't think this is an issue worthy of a major app change, aside from perhaps making that config option a little more visible.

@skliper
Copy link
Contributor

skliper commented Feb 12, 2020

Use case - ~parallel/batch run of CTF tests locally where the only conflict is the port. It'd be trivial to automate/waterfall the starts and script a command to switch ports, or extract from the startup information the port CI selects to automate the runs, or automate a table swap (really any of the suggestions I provided). Requiring build for a separate CPU for each instance just to avoid port conflict is not a scalable, general solution to port conflict and does not work for this use case. There's no need to force users to choose between build for separate CPUs or fail due to port conflicts.

I'm not sure what you mean by "our build", or "we". I'm not part of whatever group you are referring to (I don't have a customer providing resources for integration of your EDS implementation). I do have resources related to CTF and the associated testing, which is why I bring it up as a consideration.

@jphickey
Copy link
Contributor Author

But my point is that is a separate problem - it is not the same problem as what this is addressing, even though the symptom is similar (port conflict).

The "parallel run" you are describing is pretty easy to solve by running each of your parallel tests in a separate docker container (which is generally a good idea anyway) and/or by fixing #2 and using that mechanism to configure a different port on one or both of the configurations under test so they don't overlap.

This addresses a problem that even #2 will not fix, which where a single mission (and single config) is being used, but that contains multiple targets (TGT1, TGT2, TGT3, etc) and more than one of those need to run ci_lab in a debug environment where containerization isn't practical.

Again, although the result/symptom is similar (port conflict) the fundamental issue is a little different and they are covered by different issues. I think #2 is what you need for the parallel build problem.

@skliper
Copy link
Contributor

skliper commented Feb 12, 2020

Agreed that separate containers is a better approach for isolation, but requires more infrastructure. It's trivial to batch/parallel run now with no additional overhead, if there was a mechanism to at run time (or startup) resolve port conflicts. I also agree we are attempting to solve different problems.

#2 does not solve my problem since it requires separate builds of ci_lab, your solution requires separate targets (and building for each target). Why not consider a solution that solves both problems, rather than add complexity that just solves one of them?

@jphickey
Copy link
Contributor Author

I was assuming that this parallel scenario would already be a separate configuration that was being CI-tested in parallel, and as such would already be a separate build. It sounds like you are referring to two test cases for the exact same build which are starting concurrent instances of the SAME binary?

If so, that is potentially problematic in other ways, as I think they'll both end up sharing the same CDS and other PSP memory segments that are based on SysV IPC (i.e. stuff created via ftok and shmget), and none of that code is set up for concurrent access. Containerization would be strongly recommended to ensure these don't step on each other.

Nonetheless, if that is the sticking point, then this actually does work anyway, because you CAN override the runtime CPU number on the command line (at least in pc-Linux) via the --cpuid option.
So just start one instance as CPU1 and the other instance as CPU2 using the command line switch and it will result in CI_LAB binding to different ports.

(but again, I would strongly recommend against running the exact same build of CFE concurrently due to the shm segment issue and maybe others I haven't thought of).

@jphickey
Copy link
Contributor Author

Confirmed that 321b4a7 actually does allow this. In a build that only has one target/cfe core instance, I can still run it twice, concurrently:

The first one:

$ ./core-cpu1 --cpuid 1
CFE_PSP: CPU ID: 1
--snipped---
1980-012-14:03:20.25328 ES Startup: Loading file: /cf/ci_lab.so, APP: CI
1980-012-14:03:20.25338 ES Startup: CI loaded and created
1980-012-14:03:20.25350 CI_LAB listening on UDP port: 1235
EVS Port1 42/1/CI 3: CI Lab Initialized.  Version 2.3.1.0

And the second one:

$ ./core-cpu1 --cpuid 2           
CFE_PSP: CPU ID: 2
--snipped--
1980-012-14:03:20.25305 ES Startup: Loading file: /cf/ci_lab.so, APP: CI
1980-012-14:03:20.25316 ES Startup: CI loaded and created
1980-012-14:03:20.25326 CI_LAB listening on UDP port: 1236
EVS Port1 42/2/CI 3: CI Lab Initialized.  Version 2.3.1.0

And again, the big caveat is that if you actually want to do this you need to make sure your PSP memories don't end up being shared between the two.

@skliper
Copy link
Contributor

skliper commented Feb 12, 2020

I agree there are likely other issues with parallel tests of the same binary (or even different binaries with the current code base since they all use the same port), but port collision doesn't have to be one of them. It definitely wouldn't be for production or official verification tests (or CI, etc).

The --cpuid parameter along with your change addresses both use cases (of port collision). So now I'm on board as long as we sufficiently communicate the change. I'd still consider it an arbitrary association, but given the command line option it seems fine as long as other's don't have an issue, and it's quick/easy.

@jphickey
Copy link
Contributor Author

Great! In that case I will submit the pull request once #35 is reviewed.

I'd still consider it an arbitrary association,

The CPU ID, in a very generic sense, is just a number that is intended as means to identify different concurrently-running instances of the CFE core. This value is used in other places to allocate resources that will not conflict with other instances, so I would contend that using it to choose a different UDP port is a totally appropriate and logical application of this number.

jphickey added a commit to jphickey/ci_lab that referenced this issue Mar 12, 2020
Avoid port conflicts when having more than one CPU within the same
mission load CI_LAB app.  Instead of binding to the port directly,
add the CPU number as an offset.  Subtract 1 so that CPU1 remains
at the same port (1234 by default) and cmdUtil is not affected.

NOTE: the "cpu number" can be overridden on the command line when
using the pc-linux PSP, thereby also permitting the listen point
to be tuned for each launch config, if necessary.
jphickey added a commit to jphickey/ci_lab that referenced this issue Mar 12, 2020
Avoid port conflicts when having more than one CPU within the same
mission load CI_LAB app.  Instead of binding to the port directly,
add the CPU number as an offset.  Subtract 1 so that CPU1 remains
at the same port (1234 by default) and cmdUtil is not affected.

NOTE: the "cpu number" can be overridden on the command line when
using the pc-linux PSP, thereby also permitting the listen point
to be tuned for each launch config, if necessary.
@astrogeco astrogeco linked a pull request Mar 18, 2020 that will close this issue
astrogeco added a commit that referenced this issue Mar 23, 2020
Fix #37, Offset the UDP base port per-CPU
@astrogeco astrogeco added the bug Something isn't working label Oct 1, 2020
@astrogeco astrogeco added this to the 2.4.0 milestone Oct 1, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
3 participants