
ggml-qnn: fix build issue and update latest code from upstream and prepare for refine ggml-qnn backend #211

Merged · 3 commits · May 29, 2024
154 changes: 154 additions & 0 deletions README-qnn.md
@@ -0,0 +1,154 @@
# llama.cpp for QNN

- [Background](#background)
- [News](#news)
- [OS](#os)
- [Hardware](#hardware)
- [Android](#android)
- [Windows over ARM](#windows)
- [Q&A](#qa)
- [TODO](#todo)

## Background

Android maintained its position as the leading mobile operating system worldwide in the fourth quarter of 2023 with <b><a href="https://www.statista.com/statistics/272698/global-market-share-held-by-mobile-operating-systems-since-2009/">a market share of 70.1 percent</a></b>. Qualcomm is currently the No. 1 mobile SoC semiconductor company in the world.


The **QNN** (Qualcomm Neural Network, aka Qualcomm AI Engine Direct) SDK is verified to work with the following versions of the ML frameworks:

<ul>
<li>TensorFlow: tf-1.15.0, or tf-2.10.1 </li>
<li>TFLite: tflite-2.3.0 </li>
<li> PyTorch: torch-1.13.1</li>
<li> ONNX: onnx-1.11.0 </li>
</ul>


The Qualcomm® AI Engine Direct architecture is designed to be modular and allows for a clean separation in software between the different hardware cores/accelerators, such as the CPU, GPU and DSP, which are designated as backends. More details on the Qualcomm® AI Engine Direct backends are available in the QNN SDK documentation.

![Screenshot from 2024-04-14 11-42-14](https://github.com/zhouwg/kantv/assets/6889919/5d8de93a-7b02-4d6b-8b7f-19d2f829dd4d)

The Qualcomm® AI Engine Direct backends for different hardware cores/accelerators are compiled into individual core-specific libraries that come packaged with the SDK.


One of the key highlights of Qualcomm® AI Engine Direct is that it provides a unified API to delegate operations such as graph creation and execution across all hardware accelerator backends. This allows users to treat Qualcomm® AI Engine Direct as a hardware abstraction API and port applications easily to different cores.


The Qualcomm® AI Engine Direct API is designed to support an efficient execution model, with capabilities such as graph optimization handled internally. At the same time, however, it leaves broader functionality such as model parsing and network partitioning to higher-level frameworks.

The Qualcomm® AI Engine Direct API and the associated software stack provide all the constructs required by an application to construct, optimize and execute network models on the desired hardware accelerator core. Key constructs are illustrated in the Qualcomm AI Engine Direct Components - High Level View diagram.
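As an analogy for this separation (not the actual QNN SDK API), the C++ sketch below shows the "unified API, per-core backend" pattern the paragraphs above describe: application code targets one abstract interface, and a core-specific implementation is selected at runtime. All class and method names here are hypothetical.

```cpp
// illustrative analogy only (not the QNN SDK API): the "unified API + per-core
// backend" idea, where the same graph-level calls are delegated to whichever
// core-specific backend was selected at runtime.
#include <cstdio>
#include <memory>
#include <string>

struct Backend {                                  // the unified API surface
    virtual ~Backend() = default;
    virtual void create_graph(const std::string & name) = 0;
    virtual void execute_graph() = 0;
};

struct CpuBackend : Backend {
    void create_graph(const std::string & name) override { printf("CPU: build %s\n", name.c_str()); }
    void execute_graph() override { printf("CPU: execute\n"); }
};

struct HtpBackend : Backend {                     // HTP (aka DSP) accelerator
    void create_graph(const std::string & name) override { printf("HTP: build %s\n", name.c_str()); }
    void execute_graph() override { printf("HTP: execute\n"); }
};

int main() {
    // the application codes against Backend only; swapping cores is a one-line
    // change, which is the portability property described above
    std::unique_ptr<Backend> backend = std::make_unique<HtpBackend>();
    backend->create_graph("mulmat_graph");
    backend->execute_graph();
    return 0;
}
```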


![qnn-arch](https://github.com/zhouwg/kantv/assets/6889919/4f4881a6-9a91-4477-aeb2-193591375d75)



### Llama.cpp + QNN

The llama.cpp QNN backend is intended to support **Qualcomm mobile SoCs** first.
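
For readers new to the ggml backend interface, the following minimal sketch shows how a graph containing one of the currently supported ops (GGML_OP_MUL_MAT) is built and dispatched through a backend. It uses the CPU backend so it stays self-contained; on a Qualcomm device the QNN backend initialization function added by this PR (name and signature assumed, not shown) would take its place.

```cpp
// minimal sketch: build a ggml graph with GGML_OP_MUL_MAT and run it on a backend
#include "ggml.h"
#include "ggml-alloc.h"
#include "ggml-backend.h"
#include <cstdio>
#include <vector>

int main() {
    // on a Qualcomm device with the QNN backend built in, the QNN backend init
    // call added by this PR would be used here instead (assumed name/signature)
    ggml_backend_t backend = ggml_backend_cpu_init();

    // context that only holds tensor/graph metadata; data lives in backend buffers
    struct ggml_init_params params = {
        /*.mem_size   =*/ ggml_tensor_overhead() * 8 + ggml_graph_overhead(),
        /*.mem_buffer =*/ nullptr,
        /*.no_alloc   =*/ true,
    };
    struct ggml_context * ctx = ggml_init(params);

    struct ggml_tensor * a = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, 4, 2); // 4x2
    struct ggml_tensor * b = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, 4, 3); // 4x3
    struct ggml_tensor * c = ggml_mul_mat(ctx, a, b);                      // 2x3 result

    struct ggml_cgraph * gf = ggml_new_graph(ctx);
    ggml_build_forward_expand(gf, c);

    // allocate all graph tensors in the backend's buffer type
    ggml_gallocr_t galloc = ggml_gallocr_new(ggml_backend_get_default_buffer_type(backend));
    ggml_gallocr_alloc_graph(galloc, gf);

    std::vector<float> a_data(4 * 2, 1.0f);
    std::vector<float> b_data(4 * 3, 2.0f);
    ggml_backend_tensor_set(a, a_data.data(), 0, a_data.size() * sizeof(float));
    ggml_backend_tensor_set(b, b_data.data(), 0, b_data.size() * sizeof(float));

    ggml_backend_graph_compute(backend, gf);

    std::vector<float> c_data(2 * 3);
    ggml_backend_tensor_get(c, c_data.data(), 0, c_data.size() * sizeof(float));
    printf("c[0] = %f (expected 8.0)\n", c_data[0]);

    ggml_gallocr_free(galloc);
    ggml_free(ctx);
    ggml_backend_free(backend);
    return 0;
}
```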


## News

- 2024.5.28
  - re-launched the <a href="https://github.com/ggerganov/llama.cpp/pull/6869">PR in the upstream ggml community</a>

- 2024.4.26
  - refined the PR according to the coding style and principles of the upstream ggml community
  - added a command-line test using <a href="https://github.com/ggerganov/llama.cpp/blob/master/tests/test-backend-ops.cpp">test-backend-ops.cpp</a>
  - refined the PR according to reviewer comments
- 2024.4.24
  - submitted an initial <a href="https://github.com/ggerganov/llama.cpp/pull/6869">PR to the upstream ggml community</a>
  - the data path works as expected with whisper.cpp and llama.cpp using the QNN backend, verified on both low-end and high-end Android phones based on Qualcomm mobile SoCs
  - supported OPs
    - GGML_OP_ADD
    - GGML_OP_MUL
    - GGML_OP_MUL_MAT

- 2024.3.29
  - launched <a href="https://github.com/zhouwg/kantv/issues/121">PoC: add QNN backend for Qualcomm mobile SoC</a>

## OS

| OS               | Status    | Verified               |
|------------------|-----------|------------------------|
| Android          | Supported | Android 10, Android 14 |
| Windows over ARM | TBD       | TBD                    |


## Hardware

### Qualcomm mobile SoC based Android phone

**Verified devices**

| Qualcomm mobile SoC                   | Status    | Verified Vendor |
|---------------------------------------|-----------|-----------------|
| Qualcomm SM8650-AB Snapdragon 8 Gen 3 | Supported | Xiaomi 14       |
| Qualcomm low-end mobile SoC series    | Supported | Vivo            |

### Qualcomm SoC based Windows

TBD

## Android

### 1. Setup Environment

Any **mainstream** Android phone based on a Qualcomm mobile SoC should be supported by llama.cpp + QNN; a Qualcomm SM8650-AB Snapdragon 8 Gen 3 based Android phone is preferred.

### 2. Run QNN backend in command line mode on Android phone

- for QNN backend developers: download and install the QNN SDK from Qualcomm's official website

```
https://qpm.qualcomm.com/#/main/tools/details/qualcomm_ai_engine_direct

https://developer.qualcomm.com/software/hexagon-dsp-sdk/tools

```

The default installation path is `/opt/qcom/aistack/qnn/2.20.0.240223/`.


- for programmers: use test-backend-ops.cpp to verify the QNN backend on a Qualcomm mobile SoC based Android phone (a sketch of the underlying check follows the commands below)

```
cd core/ggml/llamacpp/tests/ggml-qnn/

./build-ggml-qnn.sh

./run-ggml-qnn.sh

```
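
What test-backend-ops.cpp does for each operator, in essence, is run the same test case on the backend under test (here: QNN) and on the CPU backend, then compare the two outputs within a small tolerance. The stand-alone sketch below illustrates only that comparison step with placeholder data and a simplified NMSE threshold; it is not the actual test harness.

```cpp
// minimal sketch of the per-op check test-backend-ops performs: compare the
// output of the backend under test against the CPU reference within a tolerance
#include <cstdio>
#include <vector>

// normalized mean squared error between the reference and the backend output
static double nmse(const std::vector<float> & ref, const std::vector<float> & out) {
    double err = 0.0, den = 0.0;
    for (size_t i = 0; i < ref.size(); i++) {
        const double d = out[i] - ref[i];
        err += d * d;
        den += (double) ref[i] * ref[i];
    }
    return den > 0.0 ? err / den : err;
}

int main() {
    // placeholder data standing in for the CPU and QNN backend outputs of one op
    std::vector<float> cpu_out = {1.0f, 2.0f, 3.0f};
    std::vector<float> qnn_out = {1.0f, 2.0001f, 3.0f};

    const double max_err = 1e-7; // illustrative threshold; actual values vary per op
    printf("%s\n", nmse(cpu_out, qnn_out) < max_err ? "OK" : "FAIL");
    return 0;
}
```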

### 3. Run QNN backend in Android APK on Android phone

Please refer to <a href="./README.md">README.md</a>.


![504893116](https://github.com/zhouwg/kantv/assets/6889919/51f0b277-eca4-4938-86f5-415dbf5897e7)


## Windows

TBD

## Q&A

TBD

### GitHub contribution
Please add the **[ggml-qnn]** prefix/tag to issue/PR titles so the community can triage and address them without delay.

## TODO

- only FP32/FP16 are supported, and the input and output tensors must be of the <b>same data type</b> (see the sketch after this list)

- other GGML OPs still lack an [implementation using the QNN API](./core/ggml/llamacpp/ggml-qnn.cpp#L3560)

- multithreading does not work with the QNN GPU and HTP (aka DSP) backends

- QNN's RPC feature (which is useful for the QNN HTP (aka DSP) backend) is not used yet

- using multiple QNN backends (CPU/GPU/DSP) simultaneously is not supported
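
To make the first limitation concrete, a ggml backend typically gates offloading with a per-op capability check. The sketch below is illustrative only, it is not the actual check in ggml-qnn.cpp, but it expresses the constraints listed above.

```cpp
// illustrative only: a capability check expressing the current limitations
// (three ops, FP32/FP16 only, inputs and output share one data type)
#include "ggml.h"

static bool qnn_can_offload_op(const struct ggml_tensor * op) {
    // only the three ops listed in the News section are currently wired up
    switch (op->op) {
        case GGML_OP_ADD:
        case GGML_OP_MUL:
        case GGML_OP_MUL_MAT:
            break;
        default:
            return false;
    }

    const struct ggml_tensor * src0 = op->src[0];
    const struct ggml_tensor * src1 = op->src[1];

    // FP32 / FP16 only, and all tensors must share the same data type
    const bool type_ok = (op->type == GGML_TYPE_F32 || op->type == GGML_TYPE_F16);
    return type_ok && src0 && src1 && src0->type == op->type && src1->type == op->type;
}
```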
70 changes: 64 additions & 6 deletions core/ggml/jni/ggml-jni-impl-external.cpp
@@ -605,6 +605,64 @@ bool ggml_jni_is_valid_utf8(const char * string) {
}


//similar to the original llama_print_timings() but dedicated to project kantv, so the latest llama.cpp source code can be merged/updated more easily and quickly
void ggml_jni_llama_print_timings(struct llama_context * ctx) {
const llama_timings timings = llama_get_timings(ctx);
std::ostringstream timing;
timing << "llama-timings:\t";

LOGGV("\n");
LOGGV("%s: load time = %10.2f ms\n", __func__, timings.t_load_ms);
LOGGV("%s: sample time = %10.2f ms / %5d runs (%8.2f ms per token, %8.2f tokens per second)\n",
__func__, timings.t_sample_ms, timings.n_sample, timings.t_sample_ms / timings.n_sample,
1e3 / timings.t_sample_ms * timings.n_sample);
LOGGV("%s: prompt eval time = %10.2f ms / %5d tokens (%8.2f ms per token, %8.2f tokens per second)\n",
__func__, timings.t_p_eval_ms, timings.n_p_eval, timings.t_p_eval_ms / timings.n_p_eval,
1e3 / timings.t_p_eval_ms * timings.n_p_eval);
LOGGV("%s: eval time = %10.2f ms / %5d runs (%8.2f ms per token, %8.2f tokens per second)\n",
__func__, timings.t_eval_ms, timings.n_eval, timings.t_eval_ms / timings.n_eval,
1e3 / timings.t_eval_ms * timings.n_eval);
LOGGV("%s: total time = %10.2f ms / %5d tokens\n", __func__,
(timings.t_end_ms - timings.t_start_ms), (timings.n_p_eval + timings.n_eval));


timing << " load time = " << std::setw(10) << std::fixed << std::setprecision(2)
<< (timings.t_load_ms) << " ms";

timing << "\n";
timing << " sample time = " << std::setw(10) << std::fixed << std::setprecision(2)
<< (timings.t_sample_ms) << " ms / "
<< timings.n_sample << " runs ("
<< (timings.t_sample_ms / timings.n_sample) << " ms per token, "

<< (1e3 / timings.t_sample_ms * timings.n_sample) << " tokens per second)";
timing << "\n";

timing << "prompt eval time = " << std::setw(10) << std::fixed << std::setprecision(2)
<< timings.t_p_eval_ms << " ms / "
<< timings.n_p_eval << " tokens ("
<< (timings.t_p_eval_ms / timings.n_p_eval) << " ms per token, "
<< (1e3 / timings.t_p_eval_ms * timings.n_p_eval)
<< " tokens per second)";
timing << "\n";

timing << "eval time " << std::setw(10) << std::fixed << std::setprecision(2)
<< timings.t_eval_ms << " ms / "
<< timings.n_eval << " runs"
<< "(" << timings.t_eval_ms / timings.n_eval << " ms per token, "
<< (1e3 / timings.t_eval_ms * timings.n_eval) << " tokens per second)";
timing << "\n";

timing << " total time = " << std::setw(10) << std::fixed << std::setprecision(2)
<< ((timings.t_end_ms - timings.t_start_ms)) << " ms / "
<< (timings.n_p_eval + timings.n_eval)
<< " tokens\n";

std::string result = timing.str();
kantv_asr_notify_benchmark(result);
}


/**
*
* @param sz_model_path
@@ -1253,17 +1311,17 @@ int llama_inference(const char * model_path, const char * prompt, int bench_typ
}
}

llama_print_timings(ctx);
ggml_jni_llama_print_timings(ctx);
if (ctx_guidance) {
LOGGD("here");
llama_free(ctx_guidance);
}
LOGGD("here");
//TODO:crash here on Xiaomi 14
//llama_free(ctx); //TODO: memory leak after comment it
//TODO: crashes here on Xiaomi 14; memory leak while it stays commented out
//llama_free(ctx);
LOGGD("here");
//TODO:crash here on Xiaomi 14
//llama_free_model(model);//TODO: memory leak after comment it
//TODO: crashes here on Xiaomi 14; memory leak while it stays commented out
//llama_free_model(model);
LOGGD("here");

llama_sampling_free(ctx_sampling);
@@ -8826,8 +8884,8 @@ int qnn_ggml_op_automation_ut(const char *model_path, int num_threads, int n_bac
return 0;
}

extern int llama_inference_main(int argc, char *argv[], int backend);

extern int llama_inference_main(int argc, char *argv[], int backend);
int llama_inference_ng(const char * sz_model_path, const char * sz_user_data, int bench_type, int n_threads, int n_backend_type) {
int ret = 0;
LOGGD("model path:%s\n", sz_model_path);
3 changes: 3 additions & 0 deletions core/ggml/jni/ggml-jni.h
@@ -100,6 +100,9 @@ enum ggml_jni_backend_type {

bool ggml_jni_is_valid_utf8(const char * string);

//similar to the original llama_print_timings() but dedicated to project kantv, so the latest llama.cpp source code can be merged/updated more easily and quickly
void ggml_jni_llama_print_timings(struct llama_context * ctx);


// =================================================================================================
// PoC#64:Add/implement realtime AI subtitle for online English TV using whisper.cpp from 03-05-2024 to 03-16-2024
4 changes: 2 additions & 2 deletions core/ggml/jni/llm-inference.cpp
@@ -105,7 +105,7 @@ static void sigint_handler(int signo) {
} else {
console::cleanup();
printf("\n");
llama_print_timings(*g_ctx);
//llama_print_timings(*g_ctx);
write_logfile(*g_ctx, *g_params, *g_model, *g_input_tokens, g_output_ss->str(), *g_output_tokens);
_exit(130);
}
@@ -985,7 +985,7 @@ int llama_inference_main(int argc, char ** argv, int backend) {
llama_state_save_file(ctx, path_session.c_str(), session_tokens.data(), session_tokens.size());
}

llama_print_timings(ctx);
ggml_jni_llama_print_timings(ctx);
write_logfile(ctx, params, model, input_tokens, output_ss.str(), output_tokens);

if (ctx_guidance) { llama_free(ctx_guidance); }
5 changes: 3 additions & 2 deletions core/ggml/jni/minicpmv-cli.cpp
@@ -175,10 +175,11 @@ int main(int argc, char ** argv) {
#ifdef ANDROID
kantv_asr_notify_benchmark_c("\n[end of text]\n");
#endif
llama_print_timings(ctx_llava->ctx_llama);
ggml_jni_llama_print_timings(ctx_llava->ctx_llama);

ctx_llava->model = NULL;
llava_free(ctx_llava);
//TODO: crashes here on Xiaomi 14; memory leak while it stays commented out
//llava_free(ctx_llava);
}

return 0;
2 changes: 1 addition & 1 deletion core/ggml/llamacpp/diff-with-upstream-llamacpp.sh
@@ -32,7 +32,7 @@ echo -e "upstream llamacpp path: ${UPSTREAM_LLAMACPP_PATH}\n"
echo -e "local llamacpp path: ${LOCAL_LLAMACPP_PATH}\n"

#the following method borrow from bench-all.sh in GGML's project whisper.cpp
LLAMACPP_SRCS=(ggml-alloc.c ggml-alloc.h ggml-backend.c ggml-backend.h ggml-backend-impl.h ggml.c ggml.h ggml-impl.h ggml-sycl.cpp ggml-sycl.h ggml-metal.h ggml-metal.m ggml-metal.metal ggml-quants.c ggml-quants.h llama.cpp llama.h unicode.h unicode.cpp unicode-data.h unicode-data.cpp ggml-common.h sgemm.h sgemm.cpp examples/main/main.cpp ggml-vulkan.h ggml-vulkan.cpp ggml-vulkan-shaders.hpp)
LLAMACPP_SRCS=(ggml-alloc.c ggml-alloc.h ggml-backend.c ggml-backend.h ggml-backend-impl.h ggml.c ggml.h ggml-impl.h ggml-sycl.cpp ggml-sycl.h ggml-metal.h ggml-metal.m ggml-metal.metal ggml-quants.c ggml-quants.h llama.cpp llama.h unicode.h unicode.cpp unicode-data.h unicode-data.cpp ggml-common.h sgemm.h sgemm.cpp examples/main/main.cpp tests/test-backend-ops.cpp ggml-vulkan.h ggml-vulkan.cpp ggml-vulkan-shaders.hpp)
for file in "${LLAMACPP_SRCS[@]}"; do
echo "diff $file ${UPSTREAM_LLAMACPP_PATH}/$file"
diff ${LOCAL_LLAMACPP_PATH}/$file ${UPSTREAM_LLAMACPP_PATH}/$file