Threads assume they will record leave events in LIFO order (can be violated for tasks) #12

Open
adamtuft opened this issue Jul 21, 2021 · 0 comments
Labels
limitation This issue is a consequence of a deliberate design choice

Limitation

Threads assume that they will always record leave events for the regions they visit in LIFO order, because each thread maintains a stack of OTF2 region definitions for the regions it visits. Any callback that corresponds to entering or leaving a region invokes trace_event_enter or trace_event_leave.

Signatures:

void trace_event_enter(trace_location_def_t *self, trace_region_def_t *region);
void trace_event_leave(trace_location_def_t *self);

In trace_event_enter:

/* Push region onto location's region stack */
stack_push(self->rgn_stack, (data_item_t) {.ptr = region});

In trace_event_leave:

/* For the region-end event, the region was previously pushed onto the 
   location's region stack so should now be at the top (as long as regions
   are correctly nested) */
trace_region_def_t *region = NULL;
stack_pop(self->rgn_stack, (data_item_t*) &region);

Problem

This is a problem because threads can switch between partially complete tasks. For example, suppose thread x is executing the untied task p. When p enters a task-scheduling region, x records a region-enter event, pushes the region definition onto its own stack and suspends the task. If thread y then resumes and completes p, y records the leave event for the task-scheduling region that x entered. The region-leave event is therefore recorded by a different thread from the one that recorded the region-enter event, and against whichever region definition happens to be on top of y's stack rather than the correct one. Both threads will also appear to have entered a different number of regions than they left.
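
The scenario can be replayed with a minimal model of the per-thread stack. This is only a sketch: region_t, location_t, event_enter, event_leave and replay_untied_migration are hypothetical stand-ins for illustration, not Otter's actual types or functions, and the stack is a fixed-size array rather than Otter's stack_push/stack_pop.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a region definition: just a name. */
typedef struct { const char *name; } region_t;

#define MAX_DEPTH 16

/* Hypothetical stand-in for a thread (location) that owns its region stack. */
typedef struct {
    const region_t *stack[MAX_DEPTH];
    size_t depth;
} location_t;

/* Modelled on trace_event_enter: push the region onto the thread's own stack. */
static void event_enter(location_t *self, const region_t *region) {
    assert(self->depth < MAX_DEPTH);
    self->stack[self->depth++] = region;
}

/* Modelled on trace_event_leave: the region being left is assumed to be
   whatever sits on top of the thread's own stack. Returns NULL if empty. */
static const region_t *event_leave(location_t *self) {
    if (self->depth == 0) return NULL;
    return self->stack[--self->depth];
}

/* Replays the untied-task scenario and returns the region that y's leave
   event ends up attributed to. */
static const region_t *replay_untied_migration(location_t *x, location_t *y,
                                               const region_t *y_region,
                                               const region_t *sched_region) {
    event_enter(y, y_region);     /* y is already inside some other region  */
    event_enter(x, sched_region); /* x: task p enters the task-scheduling
                                     region and is then suspended           */
    return event_leave(y);        /* y resumes p and leaves that region, but
                                     pops from its own stack                */
}
```

In this model, calling replay_untied_migration with two empty locations returns y_region rather than sched_region, leaves x with one unmatched entry and leaves y with none, which is the mismatch described above.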

A similar error is possible with tied tasks, where region-enter and region-leave events can become mismatched in the trace. A thread will eventually record a region-leave event for every region-enter event (since it must eventually complete all the tasks it started), but task scheduling means the order of these events is not fixed. I suspect a workaround is possible for tied tasks during post-processing: break up each thread's event sequence at task-switch events, then stitch each task's sub-sequences back together in the correct order.
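
One way the post-processing idea might look is sketched below, under strong assumptions: a single thread's events are available as a flat array, task-switch events carry the id of the task being switched to, and task ids are small integers. The event_t layout, the EV_* constants and match_leaves are all hypothetical and do not reflect Otter's trace format. Each enter and leave is attributed to the task that was current at the time, and leaves are matched LIFO within that task only.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { EV_ENTER, EV_LEAVE, EV_TASK_SWITCH } ev_kind_t;

/* Hypothetical event record for one thread's event stream. */
typedef struct {
    ev_kind_t kind;
    int id; /* region id for EV_ENTER, next task id for EV_TASK_SWITCH,
               unused for EV_LEAVE */
} event_t;

#define MAX_TASKS 8
#define MAX_DEPTH 16

/* For every EV_LEAVE at index i, writes the inferred region id to
   region_of[i]; all other entries are set to -1. Returns 0 on success, or
   -1 on an unmatched leave, an out-of-range task id or a full stack. */
static int match_leaves(const event_t *events, size_t n, int initial_task,
                        int *region_of) {
    int stack[MAX_TASKS][MAX_DEPTH]; /* one region stack per task */
    size_t depth[MAX_TASKS] = {0};
    int task = initial_task;
    if (task < 0 || task >= MAX_TASKS) return -1;

    for (size_t i = 0; i < n; i++) {
        region_of[i] = -1;
        switch (events[i].kind) {
        case EV_TASK_SWITCH: /* sequence boundary: change the current task */
            if (events[i].id < 0 || events[i].id >= MAX_TASKS) return -1;
            task = events[i].id;
            break;
        case EV_ENTER:
            if (depth[task] == MAX_DEPTH) return -1;
            stack[task][depth[task]++] = events[i].id;
            break;
        case EV_LEAVE: /* match against the current task's stack only */
            if (depth[task] == 0) return -1;
            region_of[i] = stack[task][--depth[task]];
            break;
        }
    }
    return 0;
}
```

For a stream in which task 0 enters region 10, the thread switches to task 1 which enters region 20, and the thread then switches back to task 0 which leaves, this attributes the leave to region 10. A single per-thread stack would have popped region 20 instead.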

Possible Fixes

As this limitation stems from a low-level design decision, I think it will need a fairly significant rewrite of Otter. Ideas include:

  • Have tasks, rather than threads, maintain the stack of regions encountered. This should be possible because there is always a task being executed (an implicit task if not an explicit one), so threads can query the current task's stack to record events against the correct region definition.
  • Represent all regions by singleton definitions, except those which can be given persistent definitions (parallel and task regions only, AFAIK). I don't like this idea, as it might look strange in the trace if a program appears to have exactly 1 instance of each region...
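
The first idea could be sketched roughly as follows. This is an illustration under assumptions, not a proposed patch: task_t, location_t, task_switch, event_enter and event_leave are hypothetical names, and the stack is a fixed-size array. The thread keeps only a pointer to the task it is currently executing, and the region stack moves into the task.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a region definition: just a name. */
typedef struct { const char *name; } region_t;

#define MAX_DEPTH 16

/* Hypothetical per-task data: the region stack lives with the task. */
typedef struct {
    const region_t *stack[MAX_DEPTH];
    size_t depth;
} task_t;

/* Hypothetical per-thread data: only the task currently being executed
   (implicit or explicit). */
typedef struct { task_t *current_task; } location_t;

/* Called when the thread starts or resumes a task. */
static void task_switch(location_t *self, task_t *next) {
    self->current_task = next;
}

/* Same shape as trace_event_enter, but pushes onto the current task's stack. */
static void event_enter(location_t *self, const region_t *region) {
    task_t *t = self->current_task;
    assert(t != NULL && t->depth < MAX_DEPTH);
    t->stack[t->depth++] = region;
}

/* Same shape as trace_event_leave, but pops from the current task's stack,
   so the region follows the task if it migrates. Returns NULL if empty. */
static const region_t *event_leave(location_t *self) {
    task_t *t = self->current_task;
    if (t == NULL || t->depth == 0) return NULL;
    return t->stack[--t->depth];
}
```

With this layout, if thread x enters a region while running task p and thread y later resumes p, y's event_leave pops from p's stack and returns the region x pushed, while any region y had entered in its previous task remains on that task's stack.
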
@adamtuft adamtuft added the limitation This issue is a consequence of a deliberate design choice label Jul 21, 2021