
Add data segments to binary format #301

Merged: 3 commits merged into master, Aug 18, 2015

Conversation

@titzer (Author) commented Aug 17, 2015

Add a description of data segments, which are a way that the binary module can load initialized data into memory, similar to a .data section.

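Conceptually, a loader applies each segment by copying its bytes to a destination offset in linear memory before the program starts, just as an OS loader maps a .data section. A minimal Python sketch of that step (the `(offset, bytes)` tuple layout is an illustration of the semantics, not the actual binary encoding):

```python
def apply_data_segments(memory: bytearray, segments):
    """Copy each (dest_offset, data) segment into linear memory.

    Mirrors what a wasm loader would do at instantiation time:
    initialized data is blasted into place before the start
    function runs. Offsets are plain indices into linear memory.
    """
    for dest_offset, data in segments:
        if dest_offset + len(data) > len(memory):
            raise ValueError("data segment does not fit in linear memory")
        memory[dest_offset : dest_offset + len(data)] = data
    return memory

# One 64 KiB page of linear memory with two segments, e.g. a string
# constant at offset 16 and a small lookup table at offset 1024.
mem = apply_data_segments(
    bytearray(65536),
    [(16, b"hello wasm\x00"), (1024, bytes(range(8)))],
)
```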
@lukewagner (Member)

lgtm. It'd be nice to link to and from Modules.md#initial-state-of-linear-memory.

@jfbastien (Member)

IIUC this means no addition to AST semantics, since the toolchain provides the address (or addresses) that the data segment is loaded at. Code then loads directly from that address, without any indirection / relocation / symbol. Correct?

@lukewagner (Member)

In the MVP, that makes sense. With dynamic linking, though, I think we'll need to have global variables that are immutable, load-time initialized pointers into the heap where data sections are loaded (I explained this more in #154). This could be achieved by specifying that, in a shared module (in the sense of -shared), data sections don't get to name an absolute address but, rather, each segment declares a new global variable that is initialized with the address of that section. With a patching implementation, this should be equivalent in performance to non-shared global data; without patching, it'd be equivalent to -fPIC.
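A toy illustration of the scheme above, in which the loader, rather than the module, picks each segment's address and binds it to an immutable global (the function name and the bump-pointer placement policy are hypothetical, not part of any proposal):

```python
def load_shared_module(memory: bytearray, next_free: int, segments):
    """Load a shared (-shared style) module's data segments.

    Instead of naming absolute addresses, each segment gets a fresh
    immutable global bound to wherever the loader placed it; code in
    the module addresses its data through that global, analogous to
    -fPIC. Returns (globals, new next_free watermark).
    """
    globals_ = []
    for data in segments:
        base = next_free                      # loader picks the address
        memory[base : base + len(data)] = data
        globals_.append(base)                 # immutable global = base address
        next_free += len(data)
    return globals_, next_free

mem = bytearray(65536)
g, top = load_shared_module(mem, 4096, [b"dso data", b"\x01\x02"])
```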

@titzer (Author) commented Aug 17, 2015

Yes, these data segments would basically be a way to initialize an area of memory before the program starts.

We could explore program control of loading of data segments as a further step; e.g. after the program has been linked, it then issues "load data segment" commands to blast bytes into memory at particular addresses. Those "load data segment" commands could also be otherwise useful, IMO.


@lukewagner (Member)

@titzer If we have the ability to efficiently copy into linear memory from outside memory and from files (map_file), what is the remaining use case for a dynamic "load data segment"? The main difference I see is that a dynamic "load data segment" would allow you to bundle some binary data in your .wasm file, but:

  • bundling/packaging to minimize the number of resource fetches is a general Web problem that is being attacked in a number of ways, so it seems like we might be attacking at the wrong level here
  • a naive wasm engine will keep the wasm binary in memory (negating any benefits from just loading the data eagerly); it'll take non-trivial work to keep this dynamically-loadable data segment out of memory and so it seems better to leverage existing support for this (File API).

@jfbastien (Member)

@lukewagner are you proposing that main modules be able to have a data section, but not dynamically loaded modules? I'm not sure I'm clear.

I agree that this interacts tightly with dynamic linking, and it would be good to have a nicely unified solution.

@lukewagner (Member)

@jfbastien Nope: both would be able to have data sections: the difference is that main modules would be able to absolutely position their data sections in linear memory while dynamically-loaded modules would need to rely on const-global-ptrs that were declared by the data section.

@jfbastien (Member)

In that case it kind of seems like doing addressof on a global symbol is easier and more consistent, regardless of whether the symbol comes from the main module or a dso.

@lukewagner (Member)

We could force main modules to do the same thing as shared modules, but that would effectively take away useful functionality from main modules:

  • being able to place data sections anywhere in the address space
  • having the address be an a priori constant that can be transitively folded.

It does make sense that, for symmetry, we could allow main modules to use symbolic globals to refer to data sections, but until we have dynamic linking, that will be a superfluous feature.

@jfbastien (Member)

You may be right. I would however like us to try to avoid designing two features when we know up front we could design one that'll serve both purposes. Could we let dynamically-linked modules decide where their data section is loaded? That would address your first point.

Constant folding: relocations and/or patching could take care of this?

@lukewagner (Member)

I think there's just fundamental asymmetry between main and shared modules. A main module knows it has the whole [0, memory_size) range to itself and can put anything anywhere infallibly (an invariant that could be leveraged for interesting optimizations). For a shared module, since wasm semantics don't say what memory in [0, memory_size) is already in use, I've been assuming that we'd want to specify an allocate_global_data_section(length) callback that is specified to be called by the engine when loading a shared module and allows the app to decide exactly how it wants to lay out global data. FWIW, the same issue comes up with aliased thread-local state (which needs to go in the heap... but where?) and could have the same callback solution. (There's a lot of symmetry between thread-local state and dynamically linked global state.)

For constant folding: I'm thinking compound expression trees that include global addresses at the leaves that could otherwise be folded at compile-to-wasm time.
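The callback idea described above might fit together as in the following sketch (the name `allocate_global_data_section` comes from the comment; the engine/app split and every other name here are a hypothetical illustration, not a specified API):

```python
class App:
    """Application-side allocator the engine calls back into when
    loading a shared module, so the app controls data-section layout."""

    def __init__(self, heap_top: int):
        self.heap_top = heap_top

    def allocate_global_data_section(self, length: int) -> int:
        base = self.heap_top        # app decides the placement policy
        self.heap_top += length
        return base

def engine_load_shared(app: App, memory: bytearray, segment: bytes) -> int:
    """Engine asks the app where to put the segment, copies it there,
    and returns the base address to bind to the module's globals."""
    base = app.allocate_global_data_section(len(segment))
    memory[base : base + len(segment)] = segment
    return base

app = App(heap_top=8192)
memory = bytearray(65536)
base = engine_load_shared(app, memory, b"shared globals")
```

The same shape would work for the thread-local-state case mentioned above: the engine asks the app for a placement, and the app's layout policy stays entirely on the application side.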

@titzer (Author) commented Aug 18, 2015

The main use case I see is that a module has a complete and efficient specification of its initial state of memory. Maybe that only makes sense for "main" modules, but it nevertheless has the nice property that such a module has no dependencies on an outside linking process.


@lukewagner (Member)

@titzer Totally agreed on that use case; maybe I misunderstood what you were asking. To be clear, I think dynamically-linked modules should have their own data segments that are copied into memory when the module is dynamically linked (see discussion with @jfbastien above). I thought you were asking for some sort of API to load data segments at arbitrary times (not just dynamic link time).

@jfbastien (Member)

@titzer could you clarify what you mean by "outside linking"? Main and dynamic modules are inherently relying on a loader.

A few thoughts:

Say a user wants some basic ASLR for their in-app data, and only has a single module (no dso). How would they achieve this? IIUC the current proposal is that they'd manually copy the automatically loaded data, and then use it as regular heap memory?

How does user code implement the basic allocator for heap space? The allocator has to figure out where data starts / ends, and stay clear of that? That means that the generic allocator we auto-link into user code has to know this. This is resolvable, but I want to make sure we design this knowingly.
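For what it's worth, one plausible answer is that the toolchain bakes the end of static data into the module as a link-time constant, and the allocator linked into the module simply bumps upward from there. A hypothetical sketch (the `DATA_END` constant and the allocator itself are illustrative, not part of this PR):

```python
DATA_END = 4096  # hypothetical constant the toolchain bakes in at link time

class BumpAllocator:
    """Trivial heap allocator compiled into the module: it starts
    allocating just past the initialized data segments, so it never
    has to discover the data layout at runtime."""

    def __init__(self, data_end: int, memory_size: int):
        self.top = data_end
        self.limit = memory_size

    def malloc(self, size: int, align: int = 8) -> int:
        base = (self.top + align - 1) & ~(align - 1)  # round up to alignment
        if base + size > self.limit:
            raise MemoryError("out of linear memory")
        self.top = base + size
        return base

a = BumpAllocator(DATA_END, 65536)
p = a.malloc(100)
q = a.malloc(4)
```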

@titzer (Author) commented Aug 18, 2015


I agree that dynamic linking will need to deal with initialized data in some way. I don't want to propose a general API for loading data segments in this PR, just initialized data segments for the initial contents of memory.



@titzer (Author) commented Aug 18, 2015

> Say a user wants some basic ASLR for their in-app data, and only has a single module (no dso). How would they achieve this? IIUC the current proposal is that they'd manually copy the automatically loaded data, and then use it as regular heap memory?

I'm not sure how the wasm engine could help for ASLR, since pointers are just offsets into the linear memory, so I'd hazard a guess that yes, they should manually copy the automatically loaded data segments.

> How does user code implement the basic allocator for heap space? The allocator has to figure out where data starts / ends, and stay clear of that? That means that the generic allocator we auto-link into user code has to know this. This is resolvable, but I want to make sure we design this knowingly.

I'm assuming that until we solve dynamic linking, the allocator, if any, would be compiled into the single (main) module and would inherently know where the initialized data segments lie.

@jfbastien (Member)

I think we're getting into more design than what a PR should contain. Would you mind committing this, and moving the discussion to an issue instead?

> I'm not sure how the wasm engine could help for ASLR, since pointers are just offsets into the linear memory, so I'd hazard a guess that yes, they should manually copy the automatically loaded data segments.

I wasn't suggesting the wasm engine help for ASLR so much as avoid getting in the way. I think this is all stuff we should offer on the toolchain side, but it would be nice if the basic mechanism wasm exposes were designed with the current state of the art in mind.

> I'm assuming that until we solve dynamic linking, the allocator, if any, would be compiled into the single (main) module and would inherently know where the initialized data segments lie.

That's indeed what my question leads to. So how does it figure this out? :-)

@titzer (Author) commented Aug 18, 2015

Merging based on above LGTM from @lukewagner

titzer pushed a commit that referenced this pull request Aug 18, 2015
Add data segments to binary format
@titzer titzer merged commit 80378e1 into master Aug 18, 2015
@jfbastien jfbastien deleted the add_data_se branch August 18, 2015 16:35
3 participants