[ORT 1.17.0 Release] Cherry-pick Final Round #19327

YUNQIUGUO · 2024-01-30T07:32:55Z

Description

Cherry-pick Final Round

Motivation and Context

### Description

Adds the ability to specify general session configuration entries via the `-C` command-line option.

Example: `-C "session.disable_cpu_ep_fallback|1 ep.context_enable|1"`

Some session config entries can already be set via dedicated command-line options. If the user uses multiple command-line options to set the same session config entry, we'll print a warning. Note that the dedicated command-line options will take precedence.

### Motivation and Context

Allows setting session configurations when testing EPs. QNN EP, for example, uses the `session.disable_cpu_ep_fallback` and `ep.context_*` options.
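The `-C` format described above (whitespace-separated `key|value` pairs, with dedicated options taking precedence and a warning on conflicts) can be illustrated with a minimal Python sketch. The helper names `parse_session_configs` and `merge_with_dedicated` are hypothetical and are not taken from the test tool's source; this only models the described behavior.

```python
from typing import Dict, List, Tuple


def parse_session_configs(arg: str) -> Dict[str, str]:
    """Parse 'key|value key|value ...' into a dict (hypothetical helper)."""
    entries: Dict[str, str] = {}
    for token in arg.split():
        key, sep, value = token.partition("|")
        if not sep or not key:
            raise ValueError(f"expected 'key|value', got {token!r}")
        entries[key] = value
    return entries


def merge_with_dedicated(
    generic: Dict[str, str], dedicated: Dict[str, str]
) -> Tuple[Dict[str, str], List[str]]:
    """Merge entries; dedicated options win, and each overlap yields a warning."""
    merged = dict(generic)
    warnings: List[str] = []
    for key, value in dedicated.items():
        if key in merged:
            warnings.append(
                f"'{key}' is set by both -C and a dedicated option; "
                "using the dedicated option's value"
            )
        merged[key] = value
    return merged, warnings


if __name__ == "__main__":
    generic = parse_session_configs(
        "session.disable_cpu_ep_fallback|1 ep.context_enable|1"
    )
    # Assumption for illustration: a dedicated option also set ep.context_enable.
    merged, warnings = merge_with_dedicated(generic, {"ep.context_enable": "0"})
    print(merged)
    for message in warnings:
        print("warning:", message)
```

In the Python API, entries like these would typically be applied one at a time with `SessionOptions.add_session_config_entry(key, value)`.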

…lines (#19293) To fix a pipeline issue.

Given that InferenceSession::Run() is guaranteed to be thread-safe, meaning multiple threads can call it concurrently, TRT EP needs to handle concurrency carefully here. Otherwise, the following concurrency issue might happen:

- To perform inference concurrently in multiple streams, it is suggested to use one TRT execution context per stream. In the current TRT EP design (which does not apply a per-thread context implementation), when multiple threads call InferenceSession::Run() concurrently, the TRT execution context instance is shared by all the threads and each thread acquires a different stream from ORT. TRT EP therefore ends up with one TRT execution context using multiple streams, which is not suggested. However, since the whole compute_func() is protected by the lock, enforcing cudaStreamSynchronize() there guarantees one TRT execution context per stream.

Therefore, TRT EP needs to call cudaStreamSynchronize() in compute_func(), which waits until the stream has completed all operations, to prevent the concurrency issue reported in GitHub issue #19275.
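The reasoning above can be modeled with a toy Python sketch. It is not TRT EP code: `FakeStream`, `SharedContext`, and `compute_func` are hypothetical stand-ins, and `synchronize()` only imitates the blocking role of `cudaStreamSynchronize()`. The point is that if the stream is drained before the lock is released, the shared context never has work outstanding on more than one stream.

```python
import threading


class FakeStream:
    """Stand-in for a CUDA stream: holds enqueued work until synchronized."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.pending = []

    def enqueue(self, op: str) -> None:
        self.pending.append(op)

    def synchronize(self) -> None:
        # Analogue of cudaStreamSynchronize(): all enqueued work is finished.
        self.pending.clear()


class SharedContext:
    """Stand-in for the single execution context shared by all threads."""

    def __init__(self) -> None:
        self.active_streams = set()
        self.max_concurrent_streams = 0

    def enqueue(self, stream: FakeStream) -> None:
        self.active_streams.add(stream.name)
        self.max_concurrent_streams = max(
            self.max_concurrent_streams, len(self.active_streams)
        )
        stream.enqueue("inference")

    def release(self, stream: FakeStream) -> None:
        self.active_streams.discard(stream.name)


def compute_func(ctx: SharedContext, lock: threading.Lock, stream: FakeStream) -> None:
    with lock:
        ctx.enqueue(stream)
        stream.synchronize()  # drain the stream before releasing the lock
        ctx.release(stream)


def run_demo(num_threads: int = 8) -> int:
    ctx = SharedContext()
    lock = threading.Lock()
    threads = [
        threading.Thread(
            target=compute_func, args=(ctx, lock, FakeStream(f"stream-{i}"))
        )
        for i in range(num_threads)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return ctx.max_concurrent_streams


if __name__ == "__main__":
    print(run_demo())
```

Without the `synchronize()` call inside the locked region, a real stream could still be executing when the next thread enqueues on the same context, which is the situation the fix is meant to rule out.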

…9311)

### Description

Updates to only include the iOS archs framework in the artifacts included in the NuGet package.

### Motivation and Context

Related issue: #19295 (comment)

---------

Co-authored-by: rachguo <rachguo@rachguos-Mini.attlocal.net>
Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>

YUNQIUGUO · 2024-01-30T16:54:53Z

Waiting for two final PRs:

#19322 <- no further issues, waiting for the required CI to pass.

#19332 <- necessary for the new neural-speed dependency.

### Description

This PR updates the Whisper export with beam search by adding the following.

- Fixes a bug when running `DecoderMaskedMultiHeadAttention` in the Whisper with beam search model
- Sets the default PyTorch attention implementation to `eager` to allow existing attention fusions to continue working
- Re-uses the cache directory when loading the PyTorch model to reduce memory used on disk
- Adds `--disable_auto_mixed_precision` to the example FP16 export command

### Motivation and Context

- [This PR](#19112) added the `is_unidirectional` parameter to `CheckInputs`, but it was not provided when checking the inputs in `DecoderMaskedMultiHeadAttention`.
- [This PR](#19200) explains the reasoning behind why `eager` is used to load the `WhisperAttention` class.
- By re-using the cache directory for loading the PyTorch model, only one copy of the PyTorch model is saved on disk instead of two copies.
- By providing this flag, there will be fewer Cast nodes in the Whisper with beam search model to switch between FP16 and FP32 precision.

Add Intel neural-speed to ThirdPartyNotices.txt because it will be shipped in the default build in most of our packages.

tianleiwu · 2024-01-30T21:10:13Z

This one is missed in cherry-pick: #18906

YUNQIUGUO · 2024-01-30T21:14:07Z

> This one is missed in cherry-pick: #18906

OK. Looks like the label was just added last Friday.

But to confirm, it seems like a large change. Would that impact the RC, or pose any risk of breaks, revalidations, etc.?

### Description

These changes add rotary embedding and packed QKV input to GQA. As of now, the changes are only supported with Flash-Attention (SM >= 80) but should soon be supported with Memory Efficient Attention as well.

### Motivation and Context

With the fusion of rotary embedding into this Attention op, we hope to observe some perf gain. The packed QKV should also provide some perf gain in the context of certain models, like Llama2, that would benefit from running ops on the fused QKV matrix, rather than the separate Q, K, and V.

---------

Co-authored-by: Yufeng Li <liyufeng1987@gmail.com>
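To make the packed QKV input concrete, here is a minimal Python sketch that splits one token's packed row into its Q, K, and V parts. It assumes the layout is Q, then K, then V along the last dimension, with width `(num_heads + 2 * kv_num_heads) * head_size`; the helper name `split_packed_qkv` is hypothetical and the sketch does not reproduce the operator's kernel.

```python
from typing import List, Sequence, Tuple


def split_packed_qkv(
    row: Sequence[float], num_heads: int, kv_num_heads: int, head_size: int
) -> Tuple[List[float], List[float], List[float]]:
    """Split one packed [Q | K | V] row (assumed layout) into Q, K and V."""
    q_size = num_heads * head_size
    kv_size = kv_num_heads * head_size
    expected = q_size + 2 * kv_size
    if len(row) != expected:
        raise ValueError(f"expected {expected} values, got {len(row)}")
    q = list(row[:q_size])
    k = list(row[q_size:q_size + kv_size])
    v = list(row[q_size + kv_size:])
    return q, k, v


if __name__ == "__main__":
    # 2 query heads sharing 1 KV head, head_size 2: packed width is 4 + 2 + 2.
    q, k, v = split_packed_qkv(list(range(8)), num_heads=2, kv_num_heads=1, head_size=2)
    print(q, k, v)
```

With grouped-query attention, K and V are narrower than Q because several query heads share each KV head, which is why the packed width is not simply three times the Q width.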

### Description

Cherry-pick Final Round

### Motivation and Context

---------

Co-authored-by: Adrian Lizarraga <adlizarraga@microsoft.com>
Co-authored-by: Changming Sun <chasun@microsoft.com>
Co-authored-by: Chi Lo <54722500+chilo-ms@users.noreply.github.com>
Co-authored-by: rachguo <rachguo@rachguos-Mini.attlocal.net>
Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>
Co-authored-by: kunal-vaishnavi <115581922+kunal-vaishnavi@users.noreply.github.com>
Co-authored-by: aciddelgado <139922440+aciddelgado@users.noreply.github.com>
Co-authored-by: Yufeng Li <liyufeng1987@gmail.com>

adrianlizarraga and others added 4 commits January 29, 2024 23:30

Update OneBranch.Nuget-WindowsAI-Pipeline.Official.yml for Azure Pipe…

107e679

…lines (#19293) To fix a pipeline issue.

YUNQIUGUO marked this pull request as ready for review January 30, 2024 16:52

YUNQIUGUO requested a review from a team as a code owner January 30, 2024 16:52

YUNQIUGUO requested review from adrianlizarraga and chilo-ms January 30, 2024 16:53

adrianlizarraga previously approved these changes Jan 30, 2024

chilo-ms previously approved these changes Jan 30, 2024

snnn previously approved these changes Jan 30, 2024

kunal-vaishnavi and others added 2 commits January 30, 2024 12:42

Update ThirdPartyNotices.txt: Add Intel neural-speed (#19332)

d101450

Add Intel neural-speed to ThirdPartyNotices.txt because it will be shipped in the default build in most of our packages.

YUNQIUGUO dismissed stale reviews from snnn, chilo-ms, and adrianlizarraga via d101450 January 30, 2024 20:43

YUNQIUGUO requested review from kunal-vaishnavi and snnn January 30, 2024 20:43

snnn previously approved these changes Jan 30, 2024

YUNQIUGUO dismissed snnn’s stale review via 9343dac January 30, 2024 22:15

YUNQIUGUO requested a review from aciddelgado January 30, 2024 22:33

snnn approved these changes Jan 30, 2024

YUNQIUGUO requested a review from tianleiwu January 30, 2024 23:01

kunal-vaishnavi approved these changes Jan 30, 2024

tianleiwu approved these changes Jan 31, 2024

YUNQIUGUO merged commit 5f0b62c into rel-1.17.0 Jan 31, 2024
105 of 110 checks passed

YUNQIUGUO deleted the yguo/cherry-pick-final-round branch January 31, 2024 00:51
