Skip to content

isgursoy/rolling-hash

Repository files navigation

Rolling Hash Weekend

This repository is playing with eqlabs recruitment exercise of Rolling Hash Algorithm.

Fork of palucki/yardiff for boilerplate base. Simplified and refactored with these improvements:

  • Generalized IO stream to accept also string stream.
  • Ability to cleanup stream buffer to reuse IO object.
  • Replaced Qt based MD4 strong checksum with independent header only MD5 implementation.
  • Rolling checksum collision warning.
  • Skipping unchanged segments to have hunk view like print.
  • Reporting block coordinates and compared chunks.
  • Alignment of matched reference and modified segments.
  • Reporting edit operations of difference as addition/removal/substitution details.
  • Producing metadata alongside terminal print.
  • Builtin test.

Documentation

You are there.

Environment

C++20 C++23

GNU: GCC 11.3.0 (Ubuntu 11.3.0-1ubuntu1~22.04)

Dependencies

Project Where to get Notes
RapidFuzz maxbachmann/rapidfuzz-cpp Included amalgamated header. Needed for levenshtein edit ops to report add/remove/substitution details.
Hash Chocobo1/Hash Included md5 header. Needed for fast blockwise decision before rolling sequence.

Building

QMake pro file and generic Makefile are provided for your taste.

Here typical qmake based build-install procedure:

cd builDir
qmake path/to/root/pro/file.pro CONFIG+=release CONFIG+=THIS_IS_A_SWITCH "ARGUMENT=THIS_IS_A_VALUE"
make
make install

Remember that you should clear builDir before recompiling program and additionally YOU MUST PASS "*." filename pattern to "rm" to be able to delete .qmake.super file which contains old qmake flags.

Structure

├── main.cpp | Everything is here.
├── Makefile
├── md5.h
├── rapidfuzz_amalgamated.hpp
├── README.MD
├── rolling-hash.pro
└── test_material.hpp | Reference and modified test input strings. Just inputs.

0 directories, 7 files

App Flow

  1. Have two file handles of A and B in main().
  2. Block by block walk over B and store rolling and strong checksums in SignatureCalculator::calculate().
  3. Shift block sized window with 1 hop size on A and compare stored strong checksums first and rolling hashes of B in case of mismatch in DeltaCalculator::calculate().
  4. Keep bytes of A between matched segments as the modification range. We are still in DeltaCalculator::calculate().
  5. Locate and align corresponding raw data chunk of B by looking at the segment of A in DeltaCalculator::align_cross_diff().
  6. Have Levenshtein based modification report in terms of add/remove/replace in DeltaCalculator::get_levenshtein_distance_ops_report().
  7. Return hunk metadata and commit information to the print message. Finally leaving the DeltaCalculator::calculate().

Test

Just run as is. Builtin test will always run, even if a/b paths omitted.

Here what to expect:

Reference Output Modified

Rolling Hash Algorithm
_Spec v4 (2021-03-09)_

Make a rolling hash based file diffing algorithm. When comparing original and an updated version of an input, it should return a description ("delta") which can be used to upgrade an original version of the file into the new file. The description contains the chunks which:
- Can be reused from the original file
- have been added or modified and thus would need to be synchronized

The real-world use case for this type of construct could be a distributed file storage system. This reduces the need for bandwidth and storage. If many people have the same file stored on Dropbox, for example, there's no need to upload it again.

A library that does a similar thing is [rdiff](https://linux.die.net/man/1/rdiff). You don't need to fulfill the patch part of the API, only signature and delta.

## Requirements
- Hashing function gets the data as a parameter. Separate possible filesystem operations.
- Chunk size can be fixed or dynamic, but must be split to at least two chunks on any sufficiently sized data.
- Should be able to recognize changes between chunks. Only the exact differing locations should be added to the delta.
- Well-written unit tests function well in describing the operation, no UI necessary.

## Checklist
1. Input/output operations are separated from the calculations
2. detects chunk changes and/or additions
3. detects chunk removals
4. detects additions between chunks with shifted original chunk
Signature: 48 blocks
### Delta ###
____________________________
| |
| Begin Diff |
|__________________________|
File block:0 (0-31)
||||||||||||||||||||||||||||
Rolling Hash Algorithm
_Spe
||||||||| +1 -0 ~0 |||||||||
aRolling Hash Algorithm
_Spe
||||||||||||||||||||||||||||
____________________________
| |
| Same So Far... |
|__________________________|
File block:29 (862-928)
||||||||||||||||||||||||||||

## Requirements
- Hashing function gets the data as a pa
||||||||| +0 -1 ~1 |||||||||

## Requirements
- Bashing function gets the dat as a pa
||||||||||||||||||||||||||||
____________________________
| |
| Same So Far... |
|__________________________|
File block:39 (1247-1279)
||||||||||||||||||||||||||||
nction well in describing the o
||||||||| +1 -0 ~0 |||||||||
nction welll in describing the o
||||||||||||||||||||||||||||
____________________________
| |
| Bye |
|__________________________|
-----------------------------------------------
Test passed
-----------------------------------------------

aRolling Hash Algorithm
_Spec v4 (2021-03-09)_

Make a rolling hash based file diffing algorithm. When comparing original and an updated version of an input, it should return a description ("delta") which can be used to upgrade an original version of the file into the new file. The description contains the chunks which:
- Can be reused from the original file
- have been added or modified and thus would need to be synchronized

The real-world use case for this type of construct could be a distributed file storage system. This reduces the need for bandwidth and storage. If many people have the same file stored on Dropbox, for example, there's no need to upload it again.

A library that does a similar thing is [rdiff](https://linux.die.net/man/1/rdiff). You don't need to fulfill the patch part of the API, only signature and delta.

## Requirements
- Bashing function gets the dat as a parameter. Separate possible filesystem operations.
- Chunk size can be fixed or dynamic, but must be split to at least two chunks on any sufficiently sized data.
- Should be able to recognize changes between chunks. Only the exact differing locations should be added to the delta.
- Well-written unit tests function welll in describing the operation, no UI necessary.

## Checklist
1. Input/output operations are separated from the calculations
2. detects chunk changes and/or additions
3. detects chunk removals
4. detects additions between chunks with shifted original chunk

License

As free as dependencies.

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published