C Code detected as LESS Code #3184

Closed
elaine-jackson opened this issue May 11, 2021 · 5 comments
Labels: bug, help welcome, language

Comments

@elaine-jackson

Describe the issue
I have an application which uses the JavaScript fetch() API to fetch some data and render it on a page. After the data is fetched and decrypted, highlight.js init is called and the code is highlighted as LESS instead of C. I've narrowed this down to a single-line comment at the top of the file. With such a comment, the code is usually erroneously detected as CSS, unless the comment is a link.

Example Page: https://paste.is/p/v/264ce2ae-c9c0-40f9-9863-61f1b0c3fd1b#NXZNeWVpM3lEV2xpNXZENldOOE5qaENUYlE5eHpoajJYTUJUblViQlZXWHY2dTJTYTFrWmVKeXZON0NHZ2haUQ==

Which language seems to have the issue?
The issue is with C.

Are you using highlight or highlightAuto?
I believe I am using highlightAuto
...

Sample Code to Reproduce

The code using highlightjs is

<!DOCTYPE html>

<html>

<head>
<title>Paste.is</title>
<meta name="viewport" content="width=device-width, initial-scale=1.0">

<link rel="stylesheet" href="/global.css" crossorigin="anonymous">
<link rel="stylesheet" href="/lib/bootstrap.min.css" crossorigin="anonymous">
<link rel="stylesheet" href="/lib/fontawesome-free-5.15.3-web/css/all.css" crossorigin="anonymous">

<script src="/lib/jquery-3.3.1.slim.min.js" integrity="sha384-q8i/X+965DzO0rT7abK41JStQIAqVgRVzpbzo5smXKp4YfRvH+8abtTE1Pi6jizo" crossorigin="anonymous"></script>
<script src="/lib/bootstrap.min.js" integrity="sha384-B4gt1jrGC7Jh4AgTPSdUtOBvfO8shuf57BaghqFfPlYxofvL8/KUEfYiJOMMV+rV" crossorigin="anonymous"></script>
<script src="/lib/bootstrap.bundle.min.js" integrity="sha384-LtrjvnR4Twt/qOuYxE721u19sVFLVSA4hf/rRt6PrZTmiPltdZcI7q7PXQBYTKyf" crossorigin="anonymous"></script>
<script src="/lib/crypto-js.min.js" integrity="sha384-0DrKBsfUuJe/vqjia1HviapRn4mR1BYfCpQ9gT7qjSKu8TrzTe2tlbK3cI9i9EwV" crossorigin="anonymous"></script>
<script src="/lib/highlight.min.js"></script>
<script src="/lib/highlightjs-line-numbers.min.js"></script>

<script src="/paste.js?v=1620658068"></script>
</head>

<body>

<nav class="navbar navbar-expand-lg navbar-dark bg-dark">
<a class="navbar-brand" href="/">Paste.is</a>
<button class="navbar-toggler" type="button" data-toggle="collapse" data-target="#navbarText" aria-controls="navbarText" aria-expanded="true" aria-label="Toggle navigation">
<span class="navbar-toggler-icon"></span>
</button>
<div class="navbar-collapse" id="navbarText">
<ul class="navbar-nav mr-auto">
<li class="nav-item active">
<a class="nav-link" href="/"><span class="fas fa-home"></span> Home <span class="sr-only">(current)</span></a>
</li>
<li class="nav-item active">
<a class="nav-link" href="mailto:abuse@paste.is"><span class="fas fa-exclamation-triangle"></span> Report Abuse <span class="sr-only">(current)</span></a>
</li>
</ul>
</div>
</nav>
<div class="alert alert-primary">
<strong>Introducing Encrypted Pastes:</strong> Encrypted Pastes are now in public beta there may still be bugs but we hope you enjoy the feature 😎.
</div>

<div class="card">
<div class="card-body">
<h5 class="card-title">View Paste</h5>
<pre><code id="pasteContent"></code></pre>
<hr />
<h5 class="card-title">Additional Information</h5>
<ul>
<li><strong>UUID:</strong> <span id="uuid"></span></li>
<li><strong>Published On:</strong> <span id="pub"></span></li>
<li><strong>Expires On:</strong> <span id="exp"></span></li>
<li><strong>Will Self Destruct:</strong> <span id="will_self_destruct"></span></li>
<li><strong>Raw:</strong> <span id="raw_uuid_url"></span></li>
</ul>
</div>
</div>

<div style="text-align: center">
<hr /><i>Copyright 2021 | Developed with 💜 by <a href="https://hacked.is/">Hacked LLC</a></i>
</div>

</body>

</html>

// Set page title
document.getElementsByTagName("title")
    .item(0)
    .innerText = `${document.getElementsByTagName("title")
    .item(0)
    .innerText} | View Paste`

// Get UUID
let uuid = window.location.href.split('/p/v/')[1];

// Get Key if exists
let isEncrypted = false;
if (window.location.href.split('#')[1]) {
    isEncrypted = true;
}

// Get Paste JSON
    fetch('/api/v1/paste?dataType=json&uuid='+uuid)
        .then(resp => resp.text())
        .then((json) => {
            document.getElementById("will_self_destruct").innerText = JSON.parse(json)['will_self_destruct'];
            document.getElementById("pub").innerText = JSON.parse(json)['published_on'];
            document.getElementById("exp").innerText = JSON.parse(json)['expired_on'];
            document.getElementById("raw_uuid_url").innerHTML = "<a href=\"" + window.location.href.split('/p/v/')[0] + "/api/v1/paste?dataType=text&uuid=" + uuid + "\">" + window.location.href.split('/p/v/')[0] + "/api/v1/paste?dataType=text&uuid=" + uuid + "</a>";
            document.getElementById("uuid").innerText = uuid;
        })
        .then(() => {
            // Get Paste Text
            fetch('/api/v1/paste?dataType=text&uuid='+uuid)
                .then(resp => resp.text())
                .then((text) => {
                    if (isEncrypted) {
                        let bytes  = CryptoJS.AES.decrypt(atob(text), atob(window.location.href.split('#')[1]));
                        let originalText = bytes.toString(CryptoJS.enc.Utf8);
                        document.getElementById("pasteContent").innerText = originalText;
                        document.getElementById("raw_uuid_url").innerHTML = "<a href=\"" + window.location.href.split('/p/v/')[0] + "/api/v1/paste?dataType=text&uuid=" + uuid.split("#")[0] + "\">" + window.location.href.split('/p/v/')[0] + "/api/v1/paste?dataType=text&uuid=" + uuid.split("#")[0] + "</a>";
                        document.getElementById("uuid").innerText = uuid.split("#")[0];
                        return originalText;
                    } else {
                        document.getElementById("pasteContent").innerText = text;
                        return text;
                    }
                })
                .then((text) => {
                    hljs.highlightAll();
                    hljs.initLineNumbersOnLoad({
                        singleLine: true,
                    });
                })
        })

The code I need to highlight is:

/*
 * Copyright (c) 2001, Oracle and/or its affiliates. All rights reserved.
 * DO NOT ALTER OR REMOVE COPYRIGHT NOTICES OR THIS FILE HEADER.
 *
 * This code is free software; you can redistribute it and/or modify it
 * under the terms of the GNU General Public License version 2 only, as
 * published by the Free Software Foundation.  Oracle designates this
 * particular file as subject to the "Classpath" exception as provided
 * by Oracle in the LICENSE file that accompanied this code.
 *
 * This code is distributed in the hope that it will be useful, but WITHOUT
 * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
 * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License
 * version 2 for more details (a copy is included in the LICENSE file that
 * accompanied this code).
 *
 * You should have received a copy of the GNU General Public License version
 * 2 along with this work; if not, write to the Free Software Foundation,
 * Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA.
 *
 * Please contact Oracle, 500 Oracle Parkway, Redwood Shores, CA 94065 USA
 * or visit www.oracle.com if you need additional information or have any
 * questions.
 */

#include "jni_util.h"


JNIEXPORT jlong JNICALL
Java_jdk_internal_misc_VM_getuid(JNIEnv *env, jclass thisclass) {

    /* -1 means function not available. */
    return -1;
}

JNIEXPORT jlong JNICALL
Java_jdk_internal_misc_VM_geteuid(JNIEnv *env, jclass thisclass) {

    /* -1 means function not available. */
    return -1;
}

JNIEXPORT jlong JNICALL
Java_jdk_internal_misc_VM_getgid(JNIEnv *env, jclass thisclass) {

    /* -1 means function not available. */
    return -1;
}

JNIEXPORT jlong JNICALL
Java_jdk_internal_misc_VM_getegid(JNIEnv *env, jclass thisclass) {

    /* -1 means function not available. */
    return -1;
}

I did some research on this issue. First I tried turning off the paste encryption, just to make sure the fetch and decrypt weren't causing some weird issue. Then I found something interesting: putting the comment // https://raw.githubusercontent.com/openjdk/jdk/master/src/java.base/windows/native/libjava/VM_md.c at the top of the file is what causes the code to be detected as LESS. If I remove that comment, the code is detected as JavaScript instead of C, which is also wrong but more accurate than LESS. See: https://paste.is/p/v/8ad68734-1054-4f4e-b787-22d9e5158c6f

Expected behavior
C Code with a // comment at the top should be detected as C and not as LESS.

Additional context

  1. When I have my element set up, does highlightjs add those
    tags, or is it an HTML feature of the Element.innerText property?
const myElement = document.getElementById("myElement");
fetch('/my/api')
    .then((resp) => resp.text())
    .then((text) => {
        // text = "This \n is \n a \n string \n with \n linebreaks\n"
        myElement.innerText = text;
    });
  2. By turning off the feature, and based on my usage of JavaScript promises, I was able to confirm that the fetch and decrypt is not the cause of the improper syntax highlighting.
  3. While it's not simple HTML, many web applications use ReactJS with Axios (promise-based, like the fetch() API). highlightjs should (if it does not already) support dynamically fetched content when it is initialized after a promise resolves.
@joshgoebel
Member

joshgoebel commented May 11, 2021

Worth noting the latest version (11 beta) does correctly identify this as C++ code, but it's close. There is simply not enough actual code here (signal to noise). Our language detection is merely our best effort (based on counting keywords, etc), not best in class. #1213

Often auto-detect does a good job, but not always. When auto-detect is confused, tiny changes to the code can affect which language is identified; trying to read too much into those changes typically isn't helpful.
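
To illustrate why a short snippet can flip between languages, here is a toy keyword-count scorer. It is only a sketch and is not highlight.js's actual algorithm; the keyword lists and the function names scoreLanguages and detect are made up for this example.

```javascript
// Toy keyword-count scorer. This is NOT highlight.js's real algorithm;
// the keyword lists below are hypothetical and deliberately tiny.
const KEYWORDS = {
  c: ["#include", "return", "int", "void", "struct"],
  javascript: ["function", "return", "const", "let", "var"],
  less: ["@import", "@media", "url", "when", "and"],
};

// Split the text into word-like tokens and count matches per language.
function scoreLanguages(code) {
  const tokens = code.split(/[^#@\w]+/).filter(Boolean);
  const scores = {};
  for (const [language, words] of Object.entries(KEYWORDS)) {
    scores[language] = tokens.filter((t) => words.includes(t)).length;
  }
  return scores;
}

// Pick the highest score; ties resolve to whichever language is listed first.
function detect(code) {
  const scores = scoreLanguages(code);
  return Object.keys(scores).reduce((best, lang) =>
    scores[lang] > scores[best] ? lang : best
  );
}
```

In this toy, every matching token counts the same wherever it appears, so a few words of prose can outweigh the handful of real code tokens in a short file. That is the signal-to-noise problem in miniature.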

To always highlight this correctly as C++, manually specify the language with class=language-cpp rather than relying on auto-detection.
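
For content that is inserted dynamically, as in the paste page above, a minimal sketch of this might look like the following. It assumes highlight.js 10.7 or later, where hljs.highlightElement is available; the helper name highlightAs is made up for this example.

```javascript
// Mark a <code> element with an explicit language, then highlight it.
// `hljs` is passed in so the sketch does not depend on how the library is loaded.
function highlightAs(element, code, language, hljs) {
  element.textContent = code;                 // insert the code as plain text
  element.className = `language-${language}`; // e.g. "language-cpp"
  hljs.highlightElement(element);             // assumes highlight.js 10.7+ / 11
  return element;
}
```

On the page above this could be called as highlightAs(document.getElementById("pasteContent"), text, "cpp", hljs) once the paste text has been fetched.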

@joshgoebel
Member

joshgoebel commented May 11, 2021

2b9ca9f might also help slightly, but core problem again is auto-detect is simply not perfect and shouldn't be relied on if accuracy is required.

@elaine-jackson
Author

Thanks for the commit. So accuracy might just not be possible in my case, since all user content is end-to-end encrypted. Because of the length of these strings, putting another field in the database for the language could compromise user privacy: even when encrypted, the byte length would be telling as to what language a paste is in, or at a minimum would narrow it down to a 4 / 8 character language. It's additional data I don't want to ask for or store.

@joshgoebel
Member

joshgoebel commented May 11, 2021

don't want to ... store.

Sounds like a problem specific to your use case - if the code itself can be encrypted/decrypted then I'm unsure why the language can't be also... no one mandates that you store it in a second field - or that you encrypt it such that the length is easily guessable. Those would both seem to be issues of implementation rather than issues inherent in the problem itself.
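
One possible way to do that, sketched under assumptions: put the language name in a fixed-width header in front of the code and encrypt the two together, so the plaintext length does not depend on which language was chosen. The names HEADER_WIDTH, packPaste and unpackPaste are hypothetical, and the width of 16 is arbitrary.

```javascript
// Hypothetical fixed width for the language header; the value is arbitrary.
const HEADER_WIDTH = 16;

// Prefix the code with the language name, space-padded to a fixed width,
// so "c" and "javascript" add the same number of characters.
function packPaste(language, code) {
  if (language.length > HEADER_WIDTH) {
    throw new RangeError("language name too long for header");
  }
  return language.padEnd(HEADER_WIDTH, " ") + code;
}

// Reverse of packPaste: split the header from the code after decryption.
function unpackPaste(plaintext) {
  return {
    language: plaintext.slice(0, HEADER_WIDTH).trimEnd(),
    code: plaintext.slice(HEADER_WIDTH),
  };
}
```

The packed string would be what gets encrypted in place of the bare code, and unpackPaste would run on the decrypted text before highlighting. The total length still reveals the size of the paste, as it does today.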

don't want to ask for

Then indeed accuracy may not be possible. Or you can use some other entirely different heuristic for detecting the language and then ask us to highlight the language you detect via that heuristic...
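
As a rough sketch of that second option: the regex checks below are a made-up heuristic, not part of highlight.js, and the names guessLanguage and highlightWithHeuristic are hypothetical. The sketch assumes highlight.js 10.7 or later for the hljs.highlight(code, { language }) call and falls back to auto-detection when the heuristic has no answer.

```javascript
// Hypothetical heuristic: a few regex checks that are NOT part of highlight.js.
function guessLanguage(code) {
  if (/^\s*#\s*include\s*[<"]/m.test(code)) return "c";
  if (/^\s*(import|export)\s.+from\s+['"]/m.test(code)) return "javascript";
  return null; // no confident guess
}

// Highlight with the guessed language when it is registered,
// otherwise fall back to highlight.js auto-detection.
function highlightWithHeuristic(code, hljs) {
  const language = guessLanguage(code);
  if (language && hljs.getLanguage(language)) {
    return hljs.highlight(code, { language });
  }
  return hljs.highlightAuto(code);
}
```

For the sample file in this issue, the #include line would be enough for this heuristic to pick C regardless of what the leading comment contains.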

@joshgoebel
Member

Closing via 2b9ca9f and "auto detect isn't perfect, sadly".
