feature request - fixed length ascii string data type #132

Closed
jswhit opened this issue Oct 7, 2015 · 14 comments
Comments

@jswhit

jswhit commented Oct 7, 2015

I know I've asked for this before, but it seems like every couple of months a python user requests this feature, and I have to tell them that the netcdf C library doesn't support it.

In python/numpy, there is a fixed-length ASCII string data type (type S#, e.g. S10 for 10-character ASCII strings). Fortran has this too, with character(len=10). To store these arrays in a netCDF file, they have to be converted either to arrays of characters or to variable-length strings. VLEN strings don't map nicely onto numpy or Fortran arrays. I'm pretty sure HDF5 has a fixed-length ASCII string data type, which is used by h5py (http://docs.h5py.org/en/latest/strings.html). This would map directly onto numpy and Fortran string arrays, and I think it would be a very popular feature if added to netcdf-c.
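
As a rough illustration of the character-array workaround mentioned above, the following sketch uses only the Python standard library (no netCDF or numpy calls) to turn a list of ASCII strings into fixed-width rows of single characters and back. The helper names `to_char_rows` and `from_char_rows`, the sample station names, and the choice of NUL padding are assumptions made for this example, not part of any library.

```python
def to_char_rows(strings, width):
    """Encode ASCII strings as rows of `width` single-byte characters.

    Shorter strings are padded with NUL bytes; longer ones raise an error.
    """
    rows = []
    for s in strings:
        raw = s.encode("ascii")
        if len(raw) > width:
            raise ValueError(f"{s!r} does not fit in {width} bytes")
        padded = raw.ljust(width, b"\0")
        # Split into one-byte pieces, like a (n_strings, width) char array.
        rows.append([padded[i:i + 1] for i in range(width)])
    return rows


def from_char_rows(rows):
    """Join each row of characters and strip the trailing NUL padding."""
    return [b"".join(row).rstrip(b"\0").decode("ascii") for row in rows]


names = ["KDEN", "KSFO", "PANC"]          # hypothetical sample data
rows = to_char_rows(names, width=10)      # 3 rows of 10 one-byte characters
restored = from_char_rows(rows)           # back to the original strings
```

The extra dimension (`width`) and the pad/strip step are the bookkeeping that a native fixed-length string type would make unnecessary.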

@WardF
Member

WardF commented Oct 14, 2015

I'm responding so you know this isn't being ignored; I'm going to need to discuss the history behind this with Russ and Dennis, to get some context regarding why we haven't implemented fixed-length strings yet. It could simply have been a resource issue, or there could have been something else. I guess I need to find out what I don't know. I will follow up once I have further info.

@WardF
Member

WardF commented Dec 2, 2015

Commenting here to let you know this is still alive; we're still resource constrained, especially with AGU around the corner. But this has also come up as an issue in the last week.

@DennisHeimbigner
Collaborator

I guess one thing to do is to tightly define what would be proposed.

  1. Netcdf-4 only, I presume.
  2. I would propose that this basically be a string version of the opaque type.
  3. It would be defined as a type named, say, fixedstring, and would
    be declared (in ncgen cdl) as fixedstring(n)
    where n is the fixed length.
  4. The api functions would be essentially identical to the corresponding
    opaque api functions e.g.
    nc_def_fixedstring
    ncv_inq_fixedstring

Comments?

@DennisHeimbigner
Collaborator

Additional note
How does UTF-8 fit into this, since what appears to be a single character
when printed can be up to 4 bytes long in UTF-8? One solution would be
to (under the hood) convert 'fixedstring(n)' to fixedstring(4*n). But the
problem needs to be addressed.
How does Python handle UTF-8 with respect to its fixed-length strings?
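
For reference, UTF-8 encodes a single code point in 1 to 4 bytes. A quick check in Python 3 (the sample characters are arbitrary picks for this sketch):

```python
samples = {
    "A": "Latin capital letter A (ASCII)",
    "\u00e9": "e with acute accent",
    "\u20ac": "euro sign",
    "\U0001F600": "emoji outside the Basic Multilingual Plane",
}

byte_lengths = {}
for ch, description in samples.items():
    encoded = ch.encode("utf-8")
    byte_lengths[ch] = len(encoded)
    # Each sample is one code point, but the encoded size varies.
    print(f"{description}: {len(ch)} code point, {len(encoded)} byte(s) in UTF-8")
```

So a byte-sized field that must hold any n code points needs room for the worst case per code point, while most text would use far less.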

@jswhit
Author

jswhit commented Dec 2, 2015

for numpy fixed-length string arrays, the 'length' means bytes for ASCII strings and characters for unicode.

@jswhit
Author

jswhit commented Dec 2, 2015

I think you can set the size of a string data type in HDF5 with H5Tset_size

e.g. http://stackoverflow.com/questions/29528674/how-to-write-fixed-length-strings-in-hdf5

This may only work for ASCII though. I think this is what h5py does for their fixed length string datatype (http://docs.h5py.org/en/latest/strings.html) - no fixed length unicode data type is supported.

@DennisHeimbigner
Collaborator

Restricting to ascii is IMO out of the question because
netcdf-c has committed to utf-8 as its character set.

Also this seems odd:

"for numpy fixed-length string arrays, the 'length' means ...
characters for unicode."

I do not know what that means. Does Python use UTF-32 internally,
or is a UTF-8 fixed-length string actually variable length?

@jswhit
Author

jswhit commented Dec 3, 2015

The unicode encoding for Python is configurable: it's ASCII by default in Python 2, and I believe it's UTF-8 in Python 3.

@DennisHeimbigner
Collaborator

So my speculation is that a UTF-8 fixed string in Python
is represented under the covers as a variable-length string.
Assuming so, and assuming UTF-8, there is no obvious reason
to use a fixed-length UTF-8 string in Python.
In any case, what Python does is a bit off-topic; sorry for the digression.

@DennisHeimbigner
Collaborator

Current situation:
Assuming UTF-8 encoding, the notion of a fixed-length string type seems
to be far too complicated to include in netcdf-c. To reiterate, a UTF-8 fixed-length
string must be represented as a variable-length string because UTF-8 characters
are not all a single byte in length.
So, I vote against this proposal unless someone can provide a detailed specification
of how a fixed-length string in UTF-8 encoding would be represented, and until a compelling
use case is presented.
At that point, we can reconsider this proposal.
I vote to
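
One concrete source of the complication: a fixed byte budget can fall in the middle of a multi-byte UTF-8 sequence. The sketch below uses a hypothetical helper, `fit_utf8`, to show one possible policy (truncate at a code point boundary, then NUL-pad). It is an illustration only, not a proposed or existing netcdf-c behavior.

```python
def fit_utf8(text, nbytes):
    """Return exactly `nbytes` bytes holding as much of `text` as fits.

    Truncation happens at a code point boundary so the result is still
    valid UTF-8; the remainder is padded with NUL bytes.
    """
    raw = text.encode("utf-8")[:nbytes]
    # errors="ignore" drops an incomplete trailing sequence left by the slice.
    safe = raw.decode("utf-8", errors="ignore").encode("utf-8")
    return safe.ljust(nbytes, b"\0")


text = "na\u00efve caf\u00e9"     # 10 characters, 12 bytes in UTF-8
record = fit_utf8(text, 11)       # the 11th byte would split the final character
```

Here the 11-byte field ends up holding only 9 of the 10 characters plus one pad byte, so the number of characters that fit depends on the content, which is the specification gap described above.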

@jswhit
Author

jswhit commented Dec 4, 2015

I guess a fixed-length unicode data type only really makes sense for UTF-32. I think numpy represents unicode data internally with UTF-32 (UCS4). I don't think HDF5 supports UTF-32 though (sigh).
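
To illustrate why UTF-32 is the natural fit here: every code point occupies exactly 4 bytes, so a field of n characters is always 4*n bytes, whereas the UTF-8 size varies with content. A small standard-library sketch (the strings are arbitrary examples):

```python
# Three 3-character strings: "abc", French "ete" with accents, Japanese "nihongo".
words = ["abc", "\u00e9t\u00e9", "\u65e5\u672c\u8a9e"]

for w in words:
    utf8 = w.encode("utf-8")
    utf32 = w.encode("utf-32-le")  # "-le" avoids the 4-byte byte-order mark
    print(f"{w!a}: {len(w)} chars, {len(utf8)} UTF-8 bytes, {len(utf32)} UTF-32 bytes")

utf8_sizes = [len(w.encode("utf-8")) for w in words]
utf32_sizes = [len(w.encode("utf-32-le")) for w in words]
```

All three strings take 12 bytes in UTF-32, while their UTF-8 sizes differ.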

I still think having an ASCII fixed-length string array datatype would be very useful though.

@DennisHeimbigner
Collaborator

As I said, give us a detailed proposal including use case(s)
and we will re-open the issue.

@edhartnett
Contributor

I agree with Dennis that there should be no fixed-length ASCII type in netCDF.

Also it is not clear to me why the user cannot use a fixed array of NC_CHAR for this.

@jswhit
Author

jswhit commented Nov 2, 2017

Of course you can, but it's a convenience thing. I think this ticket can be closed.

4 participants