Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ucd/category] Provisionally add unic-ucd-category #31

Merged
merged 14 commits into from
Jul 17, 2017
Merged

[ucd/category] Provisionally add unic-ucd-category #31

merged 14 commits into from
Jul 17, 2017

Conversation

CAD97
Copy link
Collaborator

@CAD97 CAD97 commented Jul 14, 2017

My use cases mainly draw from DerivedCoreProperties, one of which the main builders for is General_Category. And the best way to learn is to dive right in, so:

This updates gen_ucd_tables.py and adds the crate with enum for ucd-category. The design was mostly ad-hoc based on other crates in UNIC, but I expect that something can be improved. At the very least, my placeholder version 0.0.1 will probably be changed, and unic-ucd and unic bumped as well.

I skipped over name in the list of planned ucd crates because there are more interesting compression opportunites available there, such as a prefix trie or such, and I'm not quite prepared to dive into how other crates have done it quite yet.


use self::GeneralCategory::*;

// TODO: potentially change these to match pattern matching.
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd love to see a comparison here. There's a large possibility that the two could be optimized to the same thing or that the pattern match is more performant. I feel that this form is much clearer; for comparison:

fn is_cased_letter(&self) -> bool {
    [UppercaseLetter, LowercaseLetter, TitlecaseLetter].contains(self)
}

versus

fn is_cased_letter(&self) -> bool {
    match *self {
        UppercaseLetter | LowercaseLetter | TitlecaseLetter => true,
        _ => false,
    }
}

There's an even bigger difference when more symbols are used, such as in is_punctuation.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The later is kind of guaranteed to be the fastest in Rust. The former may optimize to the same performance, but that's taking a chance.

If you don't like the second form, then maybe this is a good time that we start using matches crate. It gives you the match feature in a macro call. It's a very stable little package, so is fine to depend on.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

matches has a very nice representation. I've pulled it in for this sub-crate.

impl GeneralCategory {
/// Lu | Ll | Lt
///
/// Abbreviated: LC
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These comments are probably close to useless, but they're how tr44 describes them.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not useless at all! Very good for documentation! And maybe one day we find a way to let people use the shorter names alongside the longer ones.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use UppercaseLetter as Lu? In any case it's probably best to "force" use of the descriptive names.

}
}

fn bsearch_range_value_table(
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This fn is copied almost verbaitim from ucd/normal/decomposition_type.rs. Is it a good candidate for extracting into a helper? I suspect something like a ucd-util that contains the most commonly used impls for searching the commonly used table formats would be useful.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, that's the plan. If you want to put them in the utils crates, please do so in a different diff to make it easier to review.

That said, we're thinking of replacing the python code with https://github.com/BurntSushi/rucd, so we can use other Rust packages during table generation, like https://github.com/behnam/rust-phf. That's why I haven't spent much time on the cleaning up the current table code helpers.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm actually experimenting with porting the python scripts into Rust (to use in cargo buildscripts), partially because I'm just interested, and partially because it could potentially work. And I'm enjoying it (lets me exercise understanding Python code and get more practice writing Rust, so it's not wasted effort. And it's a problem space I wanted to dip my toes into.

Assume rucd works for the generation, it'd definitely be best to use that, once its ready.

Copy link
Collaborator Author

@CAD97 CAD97 Jul 16, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note there's one really big small problem with using PHF macros: rust-lang/cargo#1359

PHF is really slow without release optimizations for large maps.

Other than that, it'd definitely be a plus to be able to use a PHF map where it would be appropriate.

extern crate unic_ucd_category;


//#[test]
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Whoops, need to re-enable this (had it disabled before the python build script was edited).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

# default to ET
(0x20A0, 0x20CF, "ET"), # Currency Symbols
(0x20A0, 0x20CF, "ET"), # Currency Symbols
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These format changes must have been done automatically by my ide (bad IDE! only touch VCS changed text unless told otherwise!); I can change them back, and probably should.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, would be great to have them in one column. Don't worry if the IDE is bothering too much.

general_category_mapping = get_general_category_mapping()

with open(join(dir, 'general_category.rsv'), "w") as values_file:
rustout.emit_table(
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This variation of rustout.emit_table feels like it will be used a lot for the simple properties; would it be worth making a version with this default print_fun?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fine either way. (See my comment above about dropping the Python code completely.)

Copy link
Member

@behnam behnam left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is great, @CAD97! Thanks for working on it!

Generally, everything's in good order. There are few nits and organization matters that I've commented inline. We can land it soon, afterwards.

About the API, the design looks great. https://github.com/swgillespie/unicode-categories/blob/master/src/lib.rs is taking a different approach, which is not that performant.

repository = "https://github.com/behnam/rust-unic/"
license = "MIT/Apache-2.0"
keywords = ["text", "unicode"]
description = "UNIC - Unicode Character Database"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't forget to update the description field.

license = "MIT/Apache-2.0"
keywords = ["text", "unicode"]
description = "UNIC - Unicode Character Database"
readme = "README.md"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Drop this if we don't have a README.md on this crate. (The small crates don't have one yet.)

@@ -0,0 +1,313 @@
// Copyright 2012-2015 The Rust Project Developers.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This line is only needed ifyou are copying content from the Rust project, in which case should have the same years as the original file. Is this the case here?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Whoops, I don't actually know how that sneaked in.

//!
//! Unicode General Category.
//!
//!
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the first thing people see on the package page on crates.io, and it's a very basic and useful character property. Would be great to have a bit more text here about it.

And don't forget to link to the definition of the property on unicode.org. (See links on other packages for formatting.)

///
/// * <http://unicode.org/reports/tr44/#General_Category_Values>
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum GeneralCategory {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I recently started putting all type-level abstraction in their own module. So, Age enum and its various implementations and tests are in age.rs. And, char/str helper traits go into traits.rs, and only a couple of needed things get pub used here in lib.rs from different modules.

It has shown to be a good way to keep unrelated matters separate. Would be great to do the same here.

PS. I'll start writing these in CONTRIBUTING.md. (#36)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I think I've seen that guideline in a few places (and mostly agree with it): lib.rs should really just be a list of [pub] mod and pub use.

extern crate unic_ucd_category;


//#[test]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@@ -28,12 +28,12 @@
from common import path, memoize
from unicode_utils import is_surrogate


Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Python's PEP-8 recommends two empty lines here, so would be great to keep it.

You can just run the .cargo/format.sh script to get it fixed if you have the CLI tool installed.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Didn't have autopep8 installed previously (don't write too much python and PyCharm/IDEA usually do a decent job of formatting along pep8). Now I do.

Strangely, it's not changing this. I went ahead and changed it back manually.

# default to ET
(0x20A0, 0x20CF, "ET"), # Currency Symbols
(0x20A0, 0x20CF, "ET"), # Currency Symbols
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, would be great to have them in one column. Don't worry if the IDE is bothering too much.

general_category_mapping = get_general_category_mapping()

with open(join(dir, 'general_category.rsv'), "w") as values_file:
rustout.emit_table(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fine either way. (See my comment above about dropping the Python code completely.)

@@ -18,3 +18,4 @@ unic-ucd-bidi = { path = "bidi/", version = "0.4.0" }
unic-ucd-core = { path = "core/", version = "0.4.0" }
unic-ucd-normal = { path = "normal/", version = "0.4.0" }
unic-ucd-utils = { path = "utils/", version = "0.4.0" }
unic-ucd-category = { path = "category/", version = "0.0.1" }
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right. I think you can start with 0.4.0. There won't be a release of it, but it allows the code work and we will update it to 0.5.0 for the next general release.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👌

@CAD97
Copy link
Collaborator Author

CAD97 commented Jul 16, 2017

👌 we should be good now!

@CAD97
Copy link
Collaborator Author

CAD97 commented Jul 17, 2017

Wait, I need to add myself to update_ucd_data.py don't I?
EDIT: Wait no, I'm using UnicodeData.txt which is already on the list

(And I'm pretty sure we should be fetching from http://www.unicode.org/Public/10.0.0/ucd/ and not http://www.unicode.org/Public/UCD/latest/ucd/ )

@behnam
Copy link
Member

behnam commented Jul 17, 2017

(And I'm pretty sure we should be fetching from http://www.unicode.org/Public/10.0.0/ucd/ and not http://www.unicode.org/Public/UCD/latest/ucd/ )

This is another issue we need to address sometime soon. I want the update script to take the version of unicode to download the files for, instead of always getting the latest. The same for the IDNA update script.

Copy link
Member

@behnam behnam left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome! All looks good! Let's get it in!

@behnam behnam merged commit 8048bb4 into open-i18n:master Jul 17, 2017
@CAD97 CAD97 deleted the category branch July 17, 2017 02:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants