Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[STAL-2195] Initial implementation of intra-method taint analysis in Java #493

Merged
merged 19 commits into from
Sep 3, 2024

Conversation

jasonforal
Copy link
Collaborator

@jasonforal jasonforal commented Aug 15, 2024

What problem are you trying to solve?

Some significant security vulnerabilities (e.g. SQL injection) can be caught with taint analysis.

Currently, writing accurate rules to detect these vulnerabilities is difficult because we can currently only very crudely approximate the flow of variables through a program. To address this, we will build "native support" for (presently: intra-method) taint analysis for Java.

What is your solution?

Note

This is a large PR. All commits are intentionally organized and purely additive to each other. It's likely easier to review commits individually, in sequential order.

Preface

Our goal is to implement simple taint analysis in a way that builds on our current CST-based technique.

From a theoretical perspective, a CST-based analysis hits an accuracy ceiling pretty quickly because CST nodes only describe syntactic structure. More traditional approaches first construct an AST, lower that AST to an intermediate form like SSA, construct a control flow graph CFG from basic blocks, and use that to simulate abstract program states (essential for symbol/name resolution, type resolution, constant propagation, among other techniques).

Doing this would require re-architecting and re-implementing some core assumptions of the analyzer, and so that's an out-of-scope consideration.

Methodology

Note

Terminology:
Source: A place in a source code where data originates (e.g. the headers from an HTTP request).
Sink: A place in a source code where that data can end up (and particularly that this data could have harmful side effects -- e.g. a SQL statement).

  1. Rule author defines tree-sitter query only for a "sink". Sink captures are sent to the JavaScript runtime and a standard analysis begins.
  2. The sink node's containing method is located.
  3. The MethodFlow class recursively traverses that containing method, iteratively constructing a directed graph that represents a backwards flow analysis. This graph has CST nodes for vertices, and assignment/dependence relationships for edges.
  4. The digraph is then traversed and all paths from the sink to any "source" (i.e a vertex with an out-degree of 0) are collected.
  5. We expose these paths to the rule author as a TaintFlow, and so the rule can perform its own logic. This unit test shows how these APIs are used and what they expose.

Limitations

Some limitations are artificial and were chosen for simplicity. These can be addressed in future iterations:

  • Mutually exclusive control-flow branches are treated as if they were sequential. Addressing this requires a refactor in how we construct the initial graph.
  • Variable scoping is not supported (and thus shadowing will be confused for outer scope usage/definition).
  • Some Java language constructs have not been implemented, as they are dependent on other limitations we've chosen (e.g. field_access)
  • We hardcoded some heuristics to propagate taint that aren't always true (one, two). We should eventually implement hooks either we or the user can specify custom "propagator" patterns that are aware of, e.g. specific standard library methods.

Out of Scope

Features/abstractions this PR does not implement:

  • Support for multiple physical locations (both in a Rust RuleResult and in the SARIF output.
  • Taint sanitizers
  • Fine-grained definition of taint propagators (detailed in "Limitations")
  • Inter-method graph building

Alternatives considered

What the reviewer should know

  • Because our methodology uses a lot of known simplifications that we eventually might want to revisit, these are organized into separate modules so the behavior is both a) easily seen as intentional and b) easy to see what needs to be fixed.
  • The digraph created is a mostly unfiltered interpretation of the relationship between CST nodes -- it will likely contain superfluous syntax nodes that rules will almost always never care about (e.g. parenthesized_expression).
  • Renaming from TreeSitterNode.type to TreeSitterNode.cstType was overdue and so was squeezed in here.
  • Despite the expected bad performance of using string comparisons for cstType, we still do it for implementation simplicity.
    • We should eventually refactor this to use the numeric "kind id" that the tree-sitter grammar assigns (which would replace a map lookup and string comparison with just an integer comparison).
    • This was omitted because, to do robustly, the grammar ids used in JavaScript need to be guaranteed to be in sync with the compiled grammar. (This is logic our TsLanguageContext implements, but the abstraction is not 1:1 for this use case, so it would require a small refactor -- deemed out of scope for this PR).

@jasonforal jasonforal requested a review from juli1 August 15, 2024 00:43
@jasonforal jasonforal requested a review from a team as a code owner August 15, 2024 00:43
@jasonforal jasonforal force-pushed the jf/STAL-2635 branch 2 times, most recently from 1e9bcb9 to c8023ba Compare August 27, 2024 14:47
@DataDog DataDog deleted a comment from datadog-datadog-prod-us1 bot Aug 27, 2024
@jasonforal jasonforal requested review from juli1 and removed request for juli1 August 27, 2024 17:21
Base automatically changed from jf/STAL-2635 to main September 3, 2024 15:39
@DataDog DataDog deleted a comment from datadog-datadog-prod-us1 bot Sep 3, 2024
@jasonforal jasonforal merged commit beac9be into main Sep 3, 2024
64 checks passed
@jasonforal jasonforal deleted the jf/STAL-2195 branch September 3, 2024 16:04
* @param {TreeSitterNode} parent An ancestor of {@link nodes} which should receive any tainted return values.
* @param {Array<TreeSitterNode>} nodes
*/
_visitExprStmtList(parent, nodes) {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Quality Violation

Suggested change
_visitExprStmtList(parent, nodes) {
Visitexprstmtlist(parent, nodes) {
Use 'camelCase' and not 'snake_case' (...read more)

Ensure that method names use camelCase and not snake_case or PascalCase.

View in Datadog  Leave us feedback  Documentation

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants