[Discussion] string encoding #5

Closed
jedwards1211 opened this issue Sep 23, 2020 · 8 comments

@jedwards1211
Contributor

jedwards1211 commented Sep 23, 2020

Hi there, Ć looks awesome and I really want to use it, but for some of my use cases (file parsing), dealing with string encoding would be hard...

I'm not experienced in working with Unicode string encodings in C/C++ and I don't know if you are either, but have you had any thoughts about what it would take to make Ć strings Unicode (or maybe a pragma to turn on Unicode strings)?

It's something I might look into contributing if you're open to it and would like to give me tips on working with the codebase.
In a quest to make cross-language APIs I've determined that Haxe definitely won't work, and SWIG/Emscripten seem like they would be workable, but a huge hassle compared to using Ć.

@pfusik
Collaborator

pfusik commented Sep 23, 2020

Ć is already Unicode-capable. The actual string encoding varies between the target languages. Are you concerned with the C or C++ interface? The C and C++ strings are expected to be UTF-8-encoded, which is the default encoding on modern GNU/Linux and macOS.
UTF-8 is also widely used for text file encoding on Windows. However, the Windows API historically uses UTF-16. You can convert between UTF-8 and UTF-16, of course.
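
The UTF-8/UTF-16 conversion mentioned above can be sketched in Python, one of Ć's target languages. This is a minimal illustration using only Python's built-in codecs, not Ć code; the helper names `utf8_to_utf16le` and `utf16le_to_utf8` are hypothetical, and little-endian UTF-16 without a byte order mark is an assumption chosen to match what Windows typically uses.

```python
def utf8_to_utf16le(data: bytes) -> bytes:
    """Decode UTF-8 bytes and re-encode them as UTF-16 little-endian (no BOM)."""
    return data.decode("utf-8").encode("utf-16-le")


def utf16le_to_utf8(data: bytes) -> bytes:
    """Decode UTF-16 little-endian bytes and re-encode them as UTF-8."""
    return data.decode("utf-16-le").encode("utf-8")


# "Ć" is U+0106: two bytes in UTF-8, one 16-bit code unit in UTF-16.
utf8_bytes = "Ć".encode("utf-8")
utf16_bytes = utf8_to_utf16le(utf8_bytes)

# A character outside the Basic Multilingual Plane needs four bytes in UTF-8
# and a surrogate pair (two 16-bit code units) in UTF-16.
emoji_utf8 = "😃".encode("utf-8")
emoji_utf16 = utf8_to_utf16le(emoji_utf8)

# The conversion is lossless in both directions for valid input.
round_trip = utf16le_to_utf8(utf16_bytes)
```

Both helpers raise `UnicodeDecodeError` on malformed input, which is usually what a file parser wants rather than silently substituting characters.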

@jedwards1211
Contributor Author

jedwards1211 commented Sep 23, 2020

Oh, I didn't realize that the C/C++ strings would be UTF-8, that's great news! I had assumed otherwise because somewhere in the docs you said that if you're planning to do a bunch of string manipulation, you should use Perl, and I know string encoding in C/C++ is kind of crazy.

In that case, I'll give Ć a try soon 😃

@jedwards1211
Contributor Author

There's not currently a Ć-native regex that abstracts the differences between target languages, is there? Feel free to let me know your thoughts on that.

@pfusik
Collaborator

pfusik commented Sep 24, 2020

I started adding regular expressions today. So far it's just one method, see its test.
Next up: retrieving the position and contents of the match, then the captures.

@jedwards1211
Contributor Author

Wow cool! I'm gonna start playing around with Ć this weekend.

@pfusik
Collaborator

pfusik commented Sep 25, 2020

Match location and captures. It's implemented for C#, Java, JavaScript and Python.
Next up: documentation, Regex object with pre-processed expression for improved performance of repeated searches.
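
For readers unfamiliar with the features listed above, here is a rough sketch of what match location, captures, and a pre-processed (compiled) expression look like in Python, one of the targets mentioned. It uses Python's standard `re` module directly and does not show Ć's own regex API; the pattern and the sample line are made up for illustration.

```python
import re

# Compile once so repeated searches reuse the pre-processed expression.
date_pattern = re.compile(r"(\d{4})-(\d{2})-(\d{2})")

line = "opened 2020-09-23 by jedwards1211"
match = date_pattern.search(line)

if match is not None:
    start = match.start()              # index where the match begins
    end = match.end()                  # index just past the match
    whole = match.group(0)             # full matched text
    year, month, day = match.groups()  # the three captures, in order

# The same compiled object can be reused on other inputs; search returns
# None when nothing matches.
no_match = date_pattern.search("no date here")
```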

@pfusik pfusik self-assigned this Oct 1, 2020
@pfusik
Collaborator

pfusik commented Oct 23, 2020

Unicode capabilities explained. Regexes implemented and documented. Can we close this?

@jedwards1211
Contributor Author

Yup!

@pfusik pfusik closed this as completed Oct 23, 2020