Merge pull request #122 from Alexhuszagh/safety

Initial Litany of Safety Enhancements. This removes a lot of unsafe code, documents the cases where removing it would have a significant performance impact but the safety invariants can be easily guaranteed, and makes other enhancements to remove potentially unsafe behavior. This also reworks some of the architecture so that more code is wrapped in safe variants: where applicable, rather than checking, say, if x.get(0) == b'0' and then doing an unchecked index, a single function both peeks and steps. This also simplifies the code base a lot. Part of many commits to address #100.
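
The "peek and step in a single function" idea can be illustrated with a minimal sketch. This is not lexical's actual iterator API: the `Cursor` type and the `peek`, `read_if`, and `skip_leading_zeros` names are hypothetical, chosen only to show how combining the bounds-checked peek with the advance removes the need for a separate unchecked index.

```rust
/// Hypothetical byte cursor, for illustration only (not lexical's real type).
struct Cursor<'a> {
    bytes: &'a [u8],
    index: usize,
}

impl<'a> Cursor<'a> {
    fn new(bytes: &'a [u8]) -> Self {
        Cursor { bytes, index: 0 }
    }

    /// Look at the next byte without consuming it (bounds-checked).
    fn peek(&self) -> Option<u8> {
        self.bytes.get(self.index).copied()
    }

    /// Peek and step in one call: consume the next byte only if it
    /// equals `expected`. The caller never indexes the buffer directly,
    /// so no `unsafe` unchecked access is needed after the check.
    fn read_if(&mut self, expected: u8) -> bool {
        if self.peek() == Some(expected) {
            self.index += 1;
            true
        } else {
            false
        }
    }
}

/// Count the leading `b'0'` bytes using only the safe combined operation.
fn skip_leading_zeros(bytes: &[u8]) -> usize {
    let mut cursor = Cursor::new(bytes);
    while cursor.read_if(b'0') {}
    cursor.index
}
```

In this sketch the check and the advance cannot drift apart, which is the property that makes the safety invariant easy to verify: `index` only grows after a successful bounds-checked read.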
Alexhuszagh · Sep 11, 2024 · 19bf353
2 parents 5611efb + 13194a5
Showing 44 changed files with 968 additions and 1,825 deletions.
diff --git a/CHANGELOG b/CHANGELOG
@@ -30,6 +30,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 ### Removed
 
 - Support for mips (MIPS), mipsel (MIPS LE), mips64 (MIPS64 BE), and mips64el (MIPS64 LE) on Linux.
+- All `_unchecked` API methods, since the performance benefits are dubious and it makes safety invariant checking much harder.
 
 ## [0.8.5] 2022-06-06
 

diff --git a/CODE_OF_CONDUCT.md b/CODE_OF_CONDUCT.md
@@ -21,29 +21,29 @@ In the interest of fostering an open and welcoming environment, we as contributo
 
 Examples of behavior that contributes to creating a positive environment include:
 
-  * Using welcoming and inclusive language.
-  * Being respectful of differing viewpoints and experiences.
-  * Gracefully accepting constructive feedback.
-  * Focusing on what is best for the community.
-  * Showing empathy and kindness towards other community members.
-  * Encouraging and raising up your peers in the project so you can all bask in hacks and glory.
+- Using welcoming and inclusive language.
+- Being respectful of differing viewpoints and experiences.
+- Gracefully accepting constructive feedback.
+- Focusing on what is best for the community.
+- Showing empathy and kindness towards other community members.
+- Encouraging and raising up your peers in the project so you can all bask in hacks and glory.
 
 Examples of unacceptable behavior by participants include:
 
-  * The use of sexualized language or imagery and unwelcome sexual attention or advances, including when simulated online. The only exception to sexual topics is channels/spaces specifically for topics of sexual identity.
-  * Casual mention of slavery or indentured servitude and/or false comparisons of one's occupation or situation to slavery. Please consider using or asking about alternate terminology when referring to such metaphors in technology.
-  * Making light of/making mocking comments about trigger warnings and content warnings.
-  * Trolling, insulting/derogatory comments, and personal or political attacks.
-  * Public or private harassment, deliberate intimidation, or threats.
-  * Publishing others' private information, such as a physical or electronic address, without explicit permission. This includes any sort of "outing" of any aspect of someone's identity without their consent.
-  * Publishing private screenshots or quotes of interactions in the context of this project without all quoted users' *explicit* consent.
-  * Publishing of private communication that doesn't have to do with reporting harrassment.
-  * Any of the above even when [presented as "ironic" or "joking"](https://en.wikipedia.org/wiki/Hipster_racism).
-  * Any attempt to present "reverse-ism" versions of the above as violations. Examples of reverse-isms are "reverse racism", "reverse sexism", "heterophobia", and "cisphobia".
-  * Unsolicited explanations under the assumption that someone doesn't already know it. Ask before you teach! Don't assume what people's knowledge gaps are.
-  * [Feigning or exaggerating surprise](https://www.recurse.com/manual#no-feigned-surprise) when someone admits to not knowing something.
-  * "[Well-actuallies](https://www.recurse.com/manual#no-well-actuallys)"
-  * Other conduct which could reasonably be considered inappropriate in a professional or community setting.
+- The use of sexualized language or imagery and unwelcome sexual attention or advances, including when simulated online. The only exception to sexual topics is channels/spaces specifically for topics of sexual identity.
+- Casual mention of slavery or indentured servitude and/or false comparisons of one's occupation or situation to slavery. Please consider using or asking about alternate terminology when referring to such metaphors in technology.
+- Making light of/making mocking comments about trigger warnings and content warnings.
+- Trolling, insulting/derogatory comments, and personal or political attacks.
+- Public or private harassment, deliberate intimidation, or threats.
+- Publishing others' private information, such as a physical or electronic address, without explicit permission. This includes any sort of "outing" of any aspect of someone's identity without their consent.
+- Publishing private screenshots or quotes of interactions in the context of this project without all quoted users' *explicit* consent.
+- Publishing of private communication that doesn't have to do with reporting harrassment.
+- Any of the above even when [presented as "ironic" or "joking"](https://en.wikipedia.org/wiki/Hipster_racism).
+- Any attempt to present "reverse-ism" versions of the above as violations. Examples of reverse-isms are "reverse racism", "reverse sexism", "heterophobia", and "cisphobia".
+- Unsolicited explanations under the assumption that someone doesn't already know it. Ask before you teach! Don't assume what people's knowledge gaps are.
+- [Feigning or exaggerating surprise](https://www.recurse.com/manual#no-feigned-surprise) when someone admits to not knowing something.
+- "[Well-actuallies](https://www.recurse.com/manual#no-well-actuallys)"
+- Other conduct which could reasonably be considered inappropriate in a professional or community setting.
 
 ## Scope
 
@@ -70,12 +70,12 @@ You may get in touch with the maintainer team through any of the following metho
 
 ### Further Enforcement
 
-If you've already followed the [initial enforcement steps](#enforcement), these are the steps maintainers will take for further enforcement, as needed:
+If you've already followed the [initial enforcement steps](#maintainer-enforcement-process), these are the steps maintainers will take for further enforcement, as needed:
 
-  1. Repeat the request to stop.
-  2. If the person doubles down, they will have offending messages removed or edited by a maintainers given an official warning. The PR or Issue may be locked.
-  3. If the behavior continues or is repeated later, the person will be blocked from participating for 24 hours.
-  4. If the behavior continues or is repeated after the temporary block, a long-term (6-12mo) ban will be used.
+1. Repeat the request to stop.
+2. If the person doubles down, they will have offending messages removed or edited by a maintainers given an official warning. The PR or Issue may be locked.
+3. If the behavior continues or is repeated later, the person will be blocked from participating for 24 hours.
+4. If the behavior continues or is repeated after the temporary block, a long-term (6-12mo) ban will be used.
 
 On top of this, maintainers may remove any offending messages, images, contributions, etc, as they deem necessary.
 

diff --git a/README.md b/README.md
@@ -1,5 +1,4 @@
-lexical
-=======
+# lexical
 
 High-performance numeric conversion routines for use in a `no_std` environment. This does not depend on any standard library features, nor a system allocator.
 
@@ -26,7 +25,7 @@ If you want a minimal, stable, and compile-time friendly version of lexical's fl
 - [License](#license)
 - [Contributing](#contributing)
 
-# Getting Started
+## Getting Started
 
 Add lexical to your `Cargo.toml`:
 
@@ -67,7 +66,7 @@ where
 }
 ```
 
-# Partial/Complete Parsers
+## Partial/Complete Parsers
 
 Lexical has both partial and complete parsers: the complete parsers ensure the entire buffer is used while parsing, without ignoring trailing characters, while the partial parsers parse as many characters as possible, returning both the parsed value and the number of parsed digits. Upon encountering an error, lexical will return an error indicating both the error type and the index at which the error occurred inside the buffer.
 
@@ -88,7 +87,7 @@ let x: i32 = lexical_core::parse(b"123 456")?;
 let (x, count): (i32, usize) = lexical_core::parse_partial(b"123 456")?;
 ```
 
-# no_std
+## no_std
 
 `lexical-core` does not depend on a standard library, nor a system allocator. To use `lexical-core` in a `no_std` environment, add the following to `Cargo.toml`:
 
@@ -120,7 +119,7 @@ let d: f64 = lexical_core::parse(b"3.5")?;    // Ok(3.5), error checking parse.
 let d: f64 = lexical_core::parse(b"3a")?;     // Err(Error(_)), failed to parse.
 ```
 
-# Features
+## Features
 
 Lexical feature-gates each numeric conversion routine, resulting in faster compile times if certain numeric conversions. These features can be enabled/disabled for both `lexical-core` (which does not require a system allocator) and `lexical`. By default, all conversions are enabled.
 
@@ -149,7 +148,7 @@ To ensure the safety when bounds checking is disabled, we extensively fuzz the a
 
 Lexical also places a heavy focus on code bloat: with algorithms both optimized for performance and size. By default, this focuses on performance, however, using the `compact` feature, you can also opt-in to reduced code size at the cost of performance. The compact algorithms minimize the use of pre-computed tables and other optimizations at the cost of performance.
 
-# Customization
+## Customization
 
 > ⚠ **WARNING:** If changing the number of significant digits written, disabling the use of exponent notation, or changing exponent notation thresholds, `BUFFER_SIZE` may be insufficient to hold the resulting output. `WriteOptions::buffer_size` will provide a correct upper bound on the number of bytes written. If a buffer of insufficient length is provided, lexical-core will panic.
 
@@ -176,7 +175,7 @@ Due the high variability in the syntax of numbers in different programming and d
 
 A limited subset of functionality is documented in examples below, however, the complete specification can be found in the API reference documentation.
 
-## Number Format API
+### Number Format API
 
 The number format class provides numerous flags to specify number syntax when parsing or writing. When the `power-of-two` feature is enabled, additional flags are added:
 
@@ -213,7 +212,7 @@ const FORMAT: u128 = lexical_core::NumberFormatBuilder::new()
 debug_assert!(lexical_core::format_is_valid::<FORMAT>());
 ```
 
-## Options API
+### Options API
 
 The options API allows customizing number parsing and writing at run-time, such as specifying the maximum number of significant digits, exponent characters, and more.
 
@@ -239,7 +238,7 @@ let options = lexical_core::WriteFloatOptions::builder()
     .unwrap();
 ```
 
-# Documentation
+## Documentation
 
 Lexical's API reference can be found on [docs.rs](https://docs.rs/lexical), as can [lexical-core's](lexical-core). Detailed descriptions of the algorithms used can be found here:
 
@@ -250,7 +249,7 @@ Lexical's API reference can be found on [docs.rs](https://docs.rs/lexical), as c
 
 In addition, descriptions of how lexical handles [digit separators](https://github.com/Alexhuszagh/rust-lexical/blob/main/docs/DigitSeparators.md) and implements [big-integer arithmetic](https://github.com/Alexhuszagh/rust-lexical/blob/main/lexical-parse-float/docs/BigInteger.md) are also documented.
 
-# Validation
+## Validation
 
 **Float-Parsing**
 
@@ -264,7 +263,7 @@ Float parsing is difficult to do correctly, and major bugs have been found in im
 
 Although lexical may contain bugs leading to rounding error, it is tested against a comprehensive suite of random-data and near-halfway representations, and should be fast and correct for the vast majority of use-cases.
 
-# Metrics
+## Metrics
 
 Various benchmarks, binary sizes, and compile times are shown here:
 
@@ -305,13 +304,13 @@ A benchmark on writing floats generated via a random-number generator and parsed
 
 ![Random Data](https://raw.githubusercontent.com/Alexhuszagh/rust-lexical/main/lexical-write-float/assets/json.svg)
 
-# Safety
+## Safety
 
 Due to the use of memory unsafe code in the integer and float writers, we extensively fuzz our float writers and parsers. The fuzz harnesses may be found under [fuzz](https://github.com/Alexhuszagh/rust-lexical/tree/main/fuzz), and are run continuously. So far, we've parsed and written over 72 billion floats.
 
 Due to the simple logic of the integer writers, and the lack of memory safety in the integer parsers, we minimally fuzz both, and test it with edge-cases, which has shown no memory safety issues to date.
 
-# Platform Support
+## Platform Support
 
 lexical-core is tested on a wide variety of platforms, including big and small-endian systems, to ensure portable code. Supported architectures include:
 - x86_64 Linux, Windows, macOS, Android, iOS, FreeBSD, and NetBSD.
@@ -326,7 +325,7 @@ lexical-core is tested on a wide variety of platforms, including big and small-e
 
 lexical-core should also work on a wide variety of other architectures and ISAs. If you have any issue compiling lexical-core on any architecture, please file a bug report.
 
-# Versioning and Version Support
+## Versioning and Version Support
 
 **Version Support**
 
@@ -349,15 +348,15 @@ Please report any errors compiling a supported lexical-core version on a compati
 
 lexical uses [semantic versioning](https://semver.org/). Removing support for Rustc versions newer than the latest stable Debian or Ubuntu version is considered an incompatible API change, requiring a major version change.
 
-# Changelog
+## Changelog
 
 All changes are documented in [CHANGELOG](https://github.com/Alexhuszagh/rust-lexical/blob/main/CHANGELOG).
 
-# License
+## License
 
 Lexical is dual licensed under the Apache 2.0 license as well as the MIT license. See the [LICENSE.md](LICENSE.md) file for full license details.
 
-# Contributing
+## Contributing
 
 Unless you explicitly state otherwise, any contribution intentionally submitted for inclusion in lexical by you, as defined in the Apache-2.0 license, shall be dual licensed as above, without any additional terms or conditions. Contributing to the repository means abiding by the [code of conduct](https://github.com/Alexhuszagh/rust-lexical/blob/main/CODE_OF_CONDUCT.md).
 

diff --git a/docs/BinarySize.md b/docs/BinarySize.md
@@ -6,7 +6,7 @@ Each binary is generated using all optimization levels, and includes the result
 
 All these binaries sizes are *relative* to the size of an empty Rust binary: that is, the size of the empty executable is subtracted from the total binary's size. For some cases, this leads to results of 0 bytes, which isn't real, but in practice leads to no additional size in the resulting executable.
 
-# Default
+## Default
 
 **Optimization Level "0"**
 
@@ -50,7 +50,7 @@ All these binaries sizes are *relative* to the size of an empty Rust binary: tha
 ![Parse Stripped - Optimization Level "z"](https://raw.githubusercontent.com/Alexhuszagh/rust-lexical/main/assets/size_parse_stripped_optz_posix.svg)
 ![Write Stripped - Optimization Level "z"](https://raw.githubusercontent.com/Alexhuszagh/rust-lexical/main/assets/size_write_stripped_optz_posix.svg)
 
-# Compact
+## Compact
 
 **Optimization Level "0"**
 

diff --git a/docs/Development.md b/docs/Development.md
@@ -7,7 +7,7 @@ cargo +nightly build
 cargo +nightly test
 ```
 
-# Code Structure
+## Code Structure
 
 Lexical is broken up into compact, relatively isolated workspaces to separate functionality based on the numeric conversion, minimizing compile times and simplifying testing feature-dependent code. The workspaces are:
 
@@ -26,7 +26,7 @@ Furthermore, any unsafe code uses the following conventions:
 1. Each unsafe function must contain a `# Safety` section.
 2. Unsafe operations/calls in unsafe functions must be marked as unsafe, with their safety guarantees clearly documented via a `// SAFETY:` section.
 
-# Dependencies
+## Dependencies
 
 In order to fully test and develop lexical, a recent, nightly compiler along with following Rust dependencies is required:
 
@@ -57,7 +57,7 @@ In addition, the following non-Rust dependencies must be installed:
 - python-magic (python-magic-win64 on Windows)
 - Valgrind
 
-# Development Process
+## Development Process
 
 The [scripts](https://github.com/Alexhuszagh/rust-lexical/tree/main/scripts) directory contains numerous scripts for testing, fuzzing, analyzing, and formatting code. Since many development features are nightly-only, this ensures the proper compiler features are used. This requires a recent version of a nightly compiler (1.65.0+) installed via Rustup, which can be invoked as `cargo +nightly`.
 
@@ -87,7 +87,7 @@ scripts/check.sh
 SKIP_MIRI=1 scripts/test.sh
 ```
 
-# Safety
+## Safety
 
 In order to ensure memory safety even when using unsafe features, we have the following requirements.
 
@@ -106,6 +106,6 @@ RUSTFLAGS="--deny warnings" cargo +nightly build --features=lint
 cargo +nightly clippy --all-features -- --deny warnings
 ```
 
-# Algorithm Changes
+## Algorithm Changes
 
 Each workspace has a "docs" directory containing detailed descriptions of algorithms and benchmarks. If you make any substantial changes to an algorithm, you should both update the algorithm description and the provided benchmarks.
diff --git a/docs/DigitSeparators.md b/docs/DigitSeparators.md
@@ -1,5 +1,4 @@
-Digit Separators
-================
+# Digit Separators
 
 Supporting performant parsers using digit separators in a no-allocator context is difficult to support correctly with adequate performance. One of the major issues is that the syntax of numbers that accept digit separators varies between implementations.
 
@@ -25,11 +24,11 @@ double x = 1._0;        // invalid
 
 This means any parser must be context-aware, and also understand control characters: a digit separator followed by a decimal point is a trailing digit separator, while one followed by a digit is an internal one.
 
-# Defining Grammar
+## Defining Grammar
 
 Due to the context-aware nature, it's important to define the grammar on how digit separators work:
 
-1. Leading digit separators come before any other input, or after control characters. Any digit separators after a leading digit separator are considered leading, even if consecutive digit separators are not allowed.
+- Leading digit separators come before any other input, or after control characters. Any digit separators after a leading digit separator are considered leading, even if consecutive digit separators are not allowed.
 
 Examples therefore include:
 
@@ -44,7 +43,7 @@ __1.0
 1.0e__5
 ```
 
-2. Trailing digit separators come after any other input, or before control characters. Any digit separators before another trailing digit separator are considered trailing, even if consecutive digit separators are not allowed.
+- Trailing digit separators come after any other input, or before control characters. Any digit separators before another trailing digit separator are considered trailing, even if consecutive digit separators are not allowed.
 
 Examples therefore include:
 
@@ -59,7 +58,7 @@ Examples therefore include:
 1.0e5__
 ```
 
-3. Internal digit separators therefore are any digit separators that cannot be classified as leading or trailing. Likewise, any digit separators that are adjacent to another internal digit separator are considered internal, even if consecutive digit separators are not allowed.
+- Internal digit separators therefore are any digit separators that cannot be classified as leading or trailing. Likewise, any digit separators that are adjacent to another internal digit separator are considered internal, even if consecutive digit separators are not allowed.
 
 Examples therefore include:
 
@@ -78,7 +77,7 @@ Examples therefore include:
 
 This opens up a lot of possibilities: what is a valid control character? In practice, it's much easier to define control characters as every character that's not a valid digit, and therefore to handle parsing we just need to check against valid digits and the digit separator.
 
-# Iterator Design
+## Iterator Design
 
 The iterator is therefore a generic based on the format specification: this allows the iterator to resolve all unnecessary branching at compile time.
 

diff --git a/fuzz/README.md b/fuzz/README.md
@@ -1,4 +1,3 @@
-lexical-fuzz
-============
+# lexical-fuzz
 
 Fuzzing routines to minimize the risk of any memory unsafety. See [scripts/fuzz.sh](/scripts/fuzz.sh) for use.
diff --git a/lexical-asm/README.md b/lexical-asm/README.md
@@ -1,5 +1,4 @@
-lexical-asm
-===========
+# lexical-asm
 
 Utilities to carefully monitor the assembly generation of lexical's numeric conversion routines. See [scripts/asm.sh](/scripts/asm.sh) for use.
 

diff --git a/lexical-benchmark/README.md b/lexical-benchmark/README.md
@@ -1,9 +1,8 @@
-lexical-benchmark
-=================
+# lexical-benchmark
 
 Benchmarks comparing lexical to other numeric conversion routines.
 
-# Running the Benchmark
+## Running the Benchmark
 
 The benchmark requires the following: