This repository has been archived by the owner on Jul 18, 2022. It is now read-only.

Bonfire/SecurityWhale

SecurityWhale

Predicting Security Vulnerabilities using Code Metrics and Machine Learning. This project was created for the University of Central Florida's Computer Science Senior Design class and ran from Spring 2019 to Fall 2019.

Documents detailing the project, including our Design Document, Conference Paper, and Final Committee Presentation, are available to view and download in the "Docs" folder.

Project Summary

Code is inherently insecure: security vulnerabilities and other faults in production systems reduce the reliability and performance of the software that runs the modern world. Our project pairs a front-end application with a machine learning backend trained on publicly available metadata about the development cycle, so that developers can predict where potential code faults are likely to be found within a software package. This informs development teams, both small and large, about where problematic areas in their software projects are likely to occur. Resolving issues with this tool before release will save time and money.

We use machine learning for our statistical model, expanding on existing work in the field by Dr. Elaine Weyuker, a contributor to this project. Key to the project are comprehensive data acquisition methods and custom utilities that let our model operate efficiently. Our product requires a robust front end to interface with the end user and communicate results in a meaningful way, showing how the development cycle can be improved. Finally, a flexible backend infrastructure is necessary to reliably deliver the data required at each step and to accommodate shifting requirements at an early stage in the project.

Project Objectives

  • Predict security vulnerabilities within code using code metrics from public GitHub repos and the project’s file structure
  • Provide users with informative security analytics regarding their code
  • Improve upon existing work in the field by utilizing:
    • Machine Learning statistical modeling
    • Data acquisition using publicly available APIs and file metadata
  • Determine the feasibility of using metrics from the development cycle to predict vulnerabilities and faults

Technologies Used

Data Acquisition:

  • Data Source:
    • GitHub
    • CVE Database
  • Programming Language:
    • Python (v3.7)
  • Libraries Used:
    • PyGithub
    • GitPython
    • os
    • mysql.connector
  • Integrated Development Environment (IDE):
    • PyCharm
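
As a rough illustration of the kind of per-file development metrics that can be pulled from a repository's history, the sketch below parses the text produced by `git log --numstat` and tallies, for each file, how many commits touched it and how many lines were added or deleted. This is not the project's actual acquisition code: the names `FileMetrics`, `parse_numstat`, and `SAMPLE_LOG` are hypothetical, the sample log is made up, and the sketch uses only the standard library rather than PyGithub or GitPython.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict


@dataclass
class FileMetrics:
    """Hypothetical container for per-file change metrics."""
    commits: int = 0
    lines_added: int = 0
    lines_deleted: int = 0


def parse_numstat(log_text: str) -> Dict[str, FileMetrics]:
    """Aggregate per-file metrics from `git log --numstat` output.

    Each numstat line has the form "<added>\t<deleted>\t<path>".
    Commit headers and blank lines do not have three tab-separated
    fields and are skipped. Renamed files are not merged in this sketch.
    """
    metrics: Dict[str, FileMetrics] = defaultdict(FileMetrics)
    for line in log_text.splitlines():
        parts = line.split("\t")
        if len(parts) != 3:
            continue  # commit header, blank line, or other non-numstat text
        added, deleted, path = parts
        entry = metrics[path]
        entry.commits += 1  # one numstat line per file per commit
        if added != "-":  # git reports "-" for binary files
            entry.lines_added += int(added)
            entry.lines_deleted += int(deleted)
    return dict(metrics)


# Made-up sample resembling `git log --numstat --format="commit %H"` output.
SAMPLE_LOG = (
    "commit a1\n\n"
    "10\t2\tsrc/auth.py\n"
    "3\t0\tREADME.md\n"
    "commit b2\n\n"
    "5\t5\tsrc/auth.py\n"
    "-\t-\tlogo.png\n"
)

if __name__ == "__main__":
    for path, m in sorted(parse_numstat(SAMPLE_LOG).items()):
        print(path, m.commits, m.lines_added, m.lines_deleted)
```

In a real pipeline the log text would come from running git against a cloned repository, and the resulting counts would be one input among several to the model.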

Backend:

  • Ubuntu VM hosted by DigitalOcean:
    • MySQL database connecting the data acquisition and modeling roles
      • Local database connections through mysql.connector
    • Bootstrap website running on Apache, SSL cert through Let’s Encrypt
    • Anaconda - Python Sandboxing
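
To show how a shared database can connect the data acquisition and modeling roles, here is a minimal sketch in which one side writes per-file metrics and the other reads them back as training rows. It deliberately swaps in Python's built-in `sqlite3` for MySQL so that it runs without a server. The table name `file_metrics`, its columns, and the helper functions are all assumptions for illustration, not the project's actual schema.

```python
import sqlite3
from typing import Iterable, List, Tuple

# Hypothetical row layout: (repo, path, commits, lines_changed, has_vulnerability)
Row = Tuple[str, str, int, int, int]


def create_schema(conn: sqlite3.Connection) -> None:
    """Create the (assumed) table shared by the data and modeling roles."""
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS file_metrics (
            repo              TEXT    NOT NULL,
            path              TEXT    NOT NULL,
            commits           INTEGER NOT NULL,
            lines_changed     INTEGER NOT NULL,
            has_vulnerability INTEGER NOT NULL,
            PRIMARY KEY (repo, path)
        )
        """
    )


def store_metrics(conn: sqlite3.Connection, rows: Iterable[Row]) -> None:
    """Data acquisition side: insert or refresh metrics for each file.

    INSERT OR REPLACE is SQLite syntax; MySQL would need a different upsert.
    """
    conn.executemany(
        "INSERT OR REPLACE INTO file_metrics VALUES (?, ?, ?, ?, ?)", rows
    )
    conn.commit()


def load_training_rows(conn: sqlite3.Connection) -> List[Tuple[int, int, int]]:
    """Modeling side: read (commits, lines_changed, label) tuples."""
    cur = conn.execute(
        "SELECT commits, lines_changed, has_vulnerability "
        "FROM file_metrics ORDER BY repo, path"
    )
    return cur.fetchall()


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    create_schema(conn)
    store_metrics(conn, [("demo/repo", "src/auth.py", 12, 340, 1)])
    print(load_training_rows(conn))
```

Keeping the two roles decoupled behind a table like this means either side can change its internals without the other needing to know.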

Machine Learning:

  • Programming Languages:
    • Python
  • Libraries:
    • NumPy
    • Keras
    • TensorFlow
    • scikit-learn

Application:

  • Programming Languages:
    • C#
  • Libraries:
    • LibGit2Sharp
    • Octokit
    • LiveCharts
  • CI/CD:
    • Travis CI
  • Integrated Development Environment (IDE):
    • Visual Studio 2019
  • Testing Framework:
    • Microsoft Unit Testing Framework for Managed Code
    • Arrange, Act, Assert testing paradigm
    • Six unit tests
    • Two integration tests

Existing Work

  • Predicting the Location and Number of Faults in Large Software Systems:
    • Thomas J. Ostrand, Elaine J. Weyuker, and Robert M. Bell
  • Negative binomial regression model that predicts the number of faults in a file
  • Predictions based on code faults and modification history of previous releases
  • Applied to two large industrial systems at AT&T:
    • The top 20% of predicted problematic files contained between 71% and 92% of the faults that were actually detected, with the overall average being 83%
  • Showed that faults can be predicted, given intensive effort to map faults to metadata
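
The "top 20% of files" figure above can be computed for any model that assigns each file a predicted fault count. The sketch below ranks files by prediction, takes the top fraction, and reports what share of the actually detected faults fall in those files. The function name `fault_capture_rate` and the example data are hypothetical and only illustrate the evaluation measure; they are not taken from the cited study or from this project's code.

```python
import math
from typing import Dict


def fault_capture_rate(
    predicted: Dict[str, float],
    actual: Dict[str, int],
    fraction: float = 0.2,
) -> float:
    """Share of actual faults found in the top `fraction` of files,
    where files are ranked by predicted fault count (highest first)."""
    if not predicted:
        raise ValueError("predicted must contain at least one file")
    # Number of files to inspect; always look at at least one.
    k = max(1, math.ceil(fraction * len(predicted)))
    ranked = sorted(predicted, key=predicted.get, reverse=True)
    total_faults = sum(actual.values())
    if total_faults == 0:
        return 0.0
    captured = sum(actual.get(path, 0) for path in ranked[:k])
    return captured / total_faults


if __name__ == "__main__":
    # Made-up predictions for ten files and the faults later found in them.
    predicted = {
        "a": 5.0, "b": 4.0, "c": 1.0, "d": 0.5, "e": 0.4,
        "f": 0.3, "g": 0.2, "h": 0.1, "i": 0.05, "j": 0.01,
    }
    actual = {"a": 6, "b": 2, "c": 1, "f": 1}
    print(fault_capture_rate(predicted, actual))
```

A measure like this rewards a model for ranking files well rather than for predicting exact counts, which matches how a developer would use the predictions to decide where to look first.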

Authors

Baran Barut - Front-End Application, Testing Plan

Michael Harris - Project Manager, Backend

Curtis Helsel - Machine Learning

Kyle Reid - Data Acquisition

Thomas Serrano - Data Acquisition, Application-Backend Interface

Other Contributors

Dr. Paul Gazzillo - Project Sponsor and Advisor

Dr. Elaine Weyuker - Previous Work and Advising

Dr. Mark Heinrich - Senior Design II Professor and Advising
