[ucd/category] Provisionally add unic-ucd-category #31

CAD97 · 2017-07-14T18:55:34Z

My use cases mainly draw from DerivedCoreProperties, one of which the main builders for is General_Category. And the best way to learn is to dive right in, so:

This updates gen_ucd_tables.py and adds the crate with enum for ucd-category. The design was mostly ad-hoc based on other crates in UNIC, but I expect that something can be improved. At the very least, my placeholder version 0.0.1 will probably be changed, and unic-ucd and unic bumped as well.

I skipped over name in the list of planned ucd crates because there are more interesting compression opportunites available there, such as a prefix trie or such, and I'm not quite prepared to dive into how other crates have done it quite yet.

CAD97 · 2017-07-14T19:01:18Z

components/ucd/category/src/lib.rs

+
+use self::GeneralCategory::*;
+
+// TODO: potentially change these to match pattern matching.


I'd love to see a comparison here. There's a large possibility that the two could be optimized to the same thing or that the pattern match is more performant. I feel that this form is much clearer; for comparison:

fn is_cased_letter(&self) -> bool { [UppercaseLetter, LowercaseLetter, TitlecaseLetter].contains(self) }

versus

fn is_cased_letter(&self) -> bool { match *self { UppercaseLetter | LowercaseLetter | TitlecaseLetter => true, _ => false, } }

There's an even bigger difference when more symbols are used, such as in is_punctuation.

The later is kind of guaranteed to be the fastest in Rust. The former may optimize to the same performance, but that's taking a chance.

If you don't like the second form, then maybe this is a good time that we start using matches crate. It gives you the match feature in a macro call. It's a very stable little package, so is fine to depend on.

matches has a very nice representation. I've pulled it in for this sub-crate.

CAD97 · 2017-07-14T19:03:20Z

components/ucd/category/src/lib.rs

+impl GeneralCategory {
+    /// Lu | Ll | Lt
+    ///
+    /// Abbreviated: LC


These comments are probably close to useless, but they're how tr44 describes them.

Not useless at all! Very good for documentation! And maybe one day we find a way to let people use the shorter names alongside the longer ones.

use UppercaseLetter as Lu? In any case it's probably best to "force" use of the descriptive names.

CAD97 · 2017-07-14T19:05:50Z

components/ucd/category/src/lib.rs

+    }
+}
+
+fn bsearch_range_value_table(


This fn is copied almost verbaitim from ucd/normal/decomposition_type.rs. Is it a good candidate for extracting into a helper? I suspect something like a ucd-util that contains the most commonly used impls for searching the commonly used table formats would be useful.

Well, that's the plan. If you want to put them in the utils crates, please do so in a different diff to make it easier to review.

That said, we're thinking of replacing the python code with https://github.com/BurntSushi/rucd, so we can use other Rust packages during table generation, like https://github.com/behnam/rust-phf. That's why I haven't spent much time on the cleaning up the current table code helpers.

I'm actually experimenting with porting the python scripts into Rust (to use in cargo buildscripts), partially because I'm just interested, and partially because it could potentially work. And I'm enjoying it (lets me exercise understanding Python code and get more practice writing Rust, so it's not wasted effort. And it's a problem space I wanted to dip my toes into.

Assume rucd works for the generation, it'd definitely be best to use that, once its ready.

Note there's one really big small problem with using PHF macros: rust-lang/cargo#1359

PHF is really slow without release optimizations for large maps.

Other than that, it'd definitely be a plus to be able to use a PHF map where it would be appropriate.

CAD97 · 2017-07-14T19:06:33Z

components/ucd/category/tests/unicode_version_tests.rs

+extern crate unic_ucd_category;
+
+
+//#[test]


Whoops, need to re-enable this (had it disabled before the python build script was edited).

CAD97 · 2017-07-14T19:07:47Z

tools/gen_ucd_tables.py

        # default to ET
-        (0x20A0, 0x20CF, "ET"),     # Currency Symbols
+        (0x20A0, 0x20CF, "ET"),  # Currency Symbols


These format changes must have been done automatically by my ide (bad IDE! only touch VCS changed text unless told otherwise!); I can change them back, and probably should.

Yeah, would be great to have them in one column. Don't worry if the IDE is bothering too much.

CAD97 · 2017-07-14T19:08:51Z

tools/gen_ucd_tables.py

+    general_category_mapping = get_general_category_mapping()
+
+    with open(join(dir, 'general_category.rsv'), "w") as values_file:
+        rustout.emit_table(


This variation of rustout.emit_table feels like it will be used a lot for the simple properties; would it be worth making a version with this default print_fun?

Fine either way. (See my comment above about dropping the Python code completely.)

behnam

This is great, @CAD97! Thanks for working on it!

Generally, everything's in good order. There are few nits and organization matters that I've commented inline. We can land it soon, afterwards.

About the API, the design looks great. https://github.com/swgillespie/unicode-categories/blob/master/src/lib.rs is taking a different approach, which is not that performant.

behnam · 2017-07-16T02:15:34Z

components/ucd/category/Cargo.toml

+repository = "https://github.com/behnam/rust-unic/"
+license = "MIT/Apache-2.0"
+keywords = ["text", "unicode"]
+description = "UNIC - Unicode Character Database"


Don't forget to update the description field.

behnam · 2017-07-16T02:16:16Z

components/ucd/category/Cargo.toml

+license = "MIT/Apache-2.0"
+keywords = ["text", "unicode"]
+description = "UNIC - Unicode Character Database"
+readme = "README.md"


Drop this if we don't have a README.md on this crate. (The small crates don't have one yet.)

behnam · 2017-07-16T02:21:49Z

components/ucd/category/src/lib.rs

@@ -0,0 +1,313 @@
+// Copyright 2012-2015 The Rust Project Developers.


This line is only needed ifyou are copying content from the Rust project, in which case should have the same years as the original file. Is this the case here?

Whoops, I don't actually know how that sneaked in.

behnam · 2017-07-16T02:22:25Z

components/ucd/category/src/lib.rs

+//!
+//! Unicode General Category.
+//!
+//!


This is the first thing people see on the package page on crates.io, and it's a very basic and useful character property. Would be great to have a bit more text here about it.

And don't forget to link to the definition of the property on unicode.org. (See links on other packages for formatting.)

behnam · 2017-07-16T02:26:32Z

components/ucd/category/src/lib.rs

+///
+/// * <http://unicode.org/reports/tr44/#General_Category_Values>
+#[derive(Clone, Copy, Debug, PartialEq, Eq)]
+pub enum GeneralCategory {


I recently started putting all type-level abstraction in their own module. So, Age enum and its various implementations and tests are in age.rs. And, char/str helper traits go into traits.rs, and only a couple of needed things get pub used here in lib.rs from different modules.

It has shown to be a good way to keep unrelated matters separate. Would be great to do the same here.

PS. I'll start writing these in CONTRIBUTING.md. (#36)

Yeah, I think I've seen that guideline in a few places (and mostly agree with it): lib.rs should really just be a list of [pub] mod and pub use.

behnam · 2017-07-16T02:51:13Z

components/ucd/category/tests/unicode_version_tests.rs

+extern crate unic_ucd_category;
+
+
+//#[test]


behnam · 2017-07-16T02:52:29Z

tools/gen_ucd_tables.py

@@ -28,12 +28,12 @@
 from common import path, memoize
 from unicode_utils import is_surrogate

-


nit: Python's PEP-8 recommends two empty lines here, so would be great to keep it.

You can just run the .cargo/format.sh script to get it fixed if you have the CLI tool installed.

Didn't have autopep8 installed previously (don't write too much python and PyCharm/IDEA usually do a decent job of formatting along pep8). Now I do.

Strangely, it's not changing this. I went ahead and changed it back manually.

behnam · 2017-07-16T02:53:40Z

tools/gen_ucd_tables.py

        # default to ET
-        (0x20A0, 0x20CF, "ET"),     # Currency Symbols
+        (0x20A0, 0x20CF, "ET"),  # Currency Symbols


Yeah, would be great to have them in one column. Don't worry if the IDE is bothering too much.

behnam · 2017-07-16T02:55:02Z

tools/gen_ucd_tables.py

+    general_category_mapping = get_general_category_mapping()
+
+    with open(join(dir, 'general_category.rsv'), "w") as values_file:
+        rustout.emit_table(


Fine either way. (See my comment above about dropping the Python code completely.)

behnam · 2017-07-16T02:57:49Z

components/ucd/Cargo.toml

@@ -18,3 +18,4 @@ unic-ucd-bidi = { path = "bidi/", version = "0.4.0" }
 unic-ucd-core = { path = "core/", version = "0.4.0" }
 unic-ucd-normal = { path = "normal/", version = "0.4.0" }
 unic-ucd-utils = { path = "utils/", version = "0.4.0" }
+unic-ucd-category = { path = "category/", version = "0.0.1" }


Right. I think you can start with 0.4.0. There won't be a release of it, but it allows the code work and we will update it to 0.5.0 for the next general release.

... such that it appears first in the docs

CAD97 · 2017-07-16T07:02:08Z

👌 we should be good now!

CAD97 · 2017-07-17T00:15:09Z

Wait, I need to add myself to update_ucd_data.py don't I?
EDIT: Wait no, I'm using UnicodeData.txt which is already on the list

(And I'm pretty sure we should be fetching from http://www.unicode.org/Public/10.0.0/ucd/ and not http://www.unicode.org/Public/UCD/latest/ucd/ )

behnam · 2017-07-17T00:23:08Z

(And I'm pretty sure we should be fetching from http://www.unicode.org/Public/10.0.0/ucd/ and not http://www.unicode.org/Public/UCD/latest/ucd/ )

This is another issue we need to address sometime soon. I want the update script to take the version of unicode to download the files for, instead of always getting the latest. The same for the IDNA update script.

behnam

Awesome! All looks good! Let's get it in!

CAD97 added 3 commits July 14, 2017 05:07

Initial General_Category enum draft

54e287e

Add Category to the python table gen

049add2

GeneralCategory::of

9a9c0ec

CAD97 commented Jul 14, 2017

View reviewed changes

behnam requested changes Jul 16, 2017

View reviewed changes

CAD97 added 10 commits July 16, 2017 02:02

Review pass 1

764e1de

Copy in the description of this property

fc69536

Move the heavy lifting out of the root module

38a7d04

Move GeneralCategory::of before other fn ...

63a9a58

... such that it appears first in the docs

Format pass

55ed6ba

Include license header

a698e73

Tests for some corners

412d4e1

format python

710d8f7

Bump version and expose

fcde22f

Link General_Category from crate root

30acb72

Prefer extracting reference in pattern

120601a

behnam approved these changes Jul 17, 2017

View reviewed changes

behnam merged commit 8048bb4 into open-i18n:master Jul 17, 2017

CAD97 deleted the category branch July 17, 2017 02:30


		use self::GeneralCategory::*;

		// TODO: potentially change these to match pattern matching.

		@@ -0,0 +1,313 @@
		// Copyright 2012-2015 The Rust Project Developers.

		@@ -28,12 +28,12 @@
		from common import path, memoize
		from unicode_utils import is_surrogate

		extern crate unic_ucd_category;


		//#[test]

		extern crate unic_ucd_category;


		//#[test]

[ucd/category] Provisionally add unic-ucd-category #31

[ucd/category] Provisionally add unic-ucd-category #31

Conversation

CAD97 commented Jul 14, 2017 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

CAD97 Jul 16, 2017 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

behnam left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

CAD97 commented Jul 16, 2017

CAD97 commented Jul 17, 2017 • edited Loading

behnam commented Jul 17, 2017

behnam left a comment

Choose a reason for hiding this comment

CAD97 commented Jul 14, 2017 •

edited

Loading

CAD97 Jul 16, 2017 •

edited

Loading

CAD97 commented Jul 17, 2017 •

edited

Loading