feature request - fixed length ascii string data type #132

jswhit · 2015-10-07T21:04:05Z

I know I've asked for this before, but it seems like every couple of months a python user requests this feature, and I have to tell them that the netcdf C library doesn't support it.

In Python/numpy, there is a fixed-length ASCII string data type (type S#, e.g. S10 for 10-character ASCII strings). Fortran has this too, with character(len=10). To store these arrays in a netCDF file, they have to be converted either to arrays of characters or to variable-length strings. VLEN strings don't map nicely onto numpy or Fortran arrays. I'm pretty sure HDF5 has a fixed-length ASCII string data type, which is used by h5py (http://docs.h5py.org/en/latest/strings.html). This would map directly onto numpy and Fortran string arrays, and I think it would be a very popular feature if added to netcdf-c.
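
To make the "arrays of characters" workaround concrete, here is a minimal sketch in plain Python (standard library only, no netCDF calls). The helper names `to_char_rows` and `from_char_rows` are hypothetical, and NUL padding is an assumption; space padding is an equally common convention.

```python
def to_char_rows(strings, width):
    """Pad each ASCII string to `width` bytes and split it into single characters."""
    rows = []
    for s in strings:
        raw = s.encode("ascii")
        if len(raw) > width:
            raise ValueError(f"{s!r} is longer than {width} bytes")
        padded = raw.ljust(width, b"\x00")  # assumed padding convention: NUL bytes
        rows.append([padded[i:i + 1] for i in range(width)])
    return rows


def from_char_rows(rows):
    """Join each row of single characters and strip the NUL padding."""
    return [b"".join(row).rstrip(b"\x00").decode("ascii") for row in rows]


names = ["KDEN", "KBOS", "KSFO-2"]
rows = to_char_rows(names, 10)   # 3 rows of 10 one-byte characters each
restored = from_char_rows(rows)  # round-trips back to the original strings
```

Each string becomes one row of a two-dimensional character array with an extra "string length" dimension, which is the bookkeeping a fixed-length string type would hide from the user.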

WardF · 2015-10-14T19:22:20Z

I'm responding so you know this isn't being ignored; I'm going to need to discuss the history behind this with Russ and Dennis, to get some context regarding why we haven't implemented fixed-length strings yet. It could simply have been a resource issue, or there could have been something else. I guess I need to find out what I don't know. I will follow up once I have further info.

WardF · 2015-12-02T18:20:23Z

Commenting here so you know this is still alive; we're still resource constrained, especially with AGU around the corner. But this has come up in the last week as an issue as well.

DennisHeimbigner · 2015-12-02T20:04:48Z

I guess one thing to do is to tightly define what would be proposed.

Netcdf-4 only, I presume.
I would propose that this basically be a string version of the opaque type.
It would be defined as a type named, say, fixedstring, and would
be declared (in ncgen cdl) as fixedstring(n)
where n is the fixed length.
The api functions would be essentially identical to the corresponding
opaque api functions e.g.
nc_def_fixedstring
nc_inq_fixedstring

Comments?

DennisHeimbigner · 2015-12-02T22:11:51Z

Additional note
How does utf-8 fit into this, since what appears to be a single character
when printed can be up to 4 bytes long in utf-8? One solution would be
to (under the hood) convert 'fixedstring(n)' to fixedstring(4*n). But the
problem needs to be addressed.
How does python handle utf8 with respect to its fixed length strings?
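
As a quick illustration of the byte-count question, the following sketch uses only the Python standard library to measure how many bytes individual characters occupy in UTF-8. The sample characters are arbitrary choices for illustration; code points outside the Basic Multilingual Plane need 4 bytes.

```python
# One character from each UTF-8 length class (arbitrary examples).
samples = ["a", "\u00e9", "\u20ac", "\U0001F600"]  # a, e-acute, euro sign, emoji

# Number of bytes each single character needs when encoded as UTF-8.
utf8_lengths = [len(ch.encode("utf-8")) for ch in samples]

# A string's character count and byte count can therefore differ.
word = "na\u00efve"
n_chars = len(word)
n_bytes = len(word.encode("utf-8"))
```

Here `utf8_lengths` comes out as 1, 2, 3 and 4 bytes, and `word` has 5 characters but 6 bytes, so a declared length of n characters would need a worst-case buffer of 4*n bytes.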

jswhit · 2015-12-02T22:16:08Z

For numpy fixed-length string arrays, the 'length' means bytes for ASCII strings and characters for unicode.

jswhit · 2015-12-02T22:27:07Z

I think you can set the size of a string data type in HDF5 with H5Tset_size

e.g. http://stackoverflow.com/questions/29528674/how-to-write-fixed-length-strings-in-hdf5

This may only work for ASCII though. I think this is what h5py does for their fixed length string datatype (http://docs.h5py.org/en/latest/strings.html) - no fixed length unicode data type is supported.

DennisHeimbigner · 2015-12-02T22:43:54Z

Restricting to ascii is IMO out of the question because
netcdf-c has committed to utf-8 as its character set.

Also this seems odd:

"for numpy fixed-length string arrays, the 'length' means ... characters for unicode."

I do not know what that means. Does python use utf-32 internally,
or is a utf8 fixed length string actually variable length?

jswhit · 2015-12-03T00:01:02Z

The unicode encoding for python is configurable: it's ASCII by default in python 2, and I believe it's UTF-8 in python 3.

DennisHeimbigner · 2015-12-03T01:33:35Z

So my speculation is that a utf-8 fixed string in python
is represented under the covers as a variable length string.
Assuming so and assuming utf-8 there is no obvious reason
to use a fixed length utf-8 string in python.
In any case, what python does is a bit off-topic, sorry for the digression.

DennisHeimbigner · 2015-12-03T23:16:51Z

Current situation:
Assuming UTF-8 encoding, the notion of a fixed-length string type seems
to be far too complicated to include in netcdf-c. To reiterate, a utf-8 fixed length
string must be represented as a variable length string because utf-8 characters
are not single bytes in length.
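
A small sketch of why a fixed byte budget is awkward for UTF-8, using only the Python standard library. The helper `fit_utf8` is hypothetical and is only one possible policy (truncate on a character boundary, then pad with NUL bytes); it is not something netcdf-c or HDF5 provides.

```python
def fit_utf8(text, nbytes):
    """Encode `text` as UTF-8 into exactly `nbytes` bytes.

    Truncates on a character boundary, then pads with NUL bytes.
    """
    raw = text.encode("utf-8")[:nbytes]
    # Drop any partial trailing character so the result stays decodable.
    raw = raw.decode("utf-8", errors="ignore").encode("utf-8")
    return raw.ljust(nbytes, b"\x00")


text = "caf\u00e9s"  # 5 characters, 6 bytes in UTF-8

# Naive truncation at 4 bytes cuts the 2-byte character in half.
naive = text.encode("utf-8")[:4]
try:
    naive.decode("utf-8")
    naive_ok = True
except UnicodeDecodeError:
    naive_ok = False

slot = fit_utf8(text, 4)  # keeps only "caf" and pads the remaining byte
```

The naive slice is not valid UTF-8, while the boundary-aware version stores fewer characters than the byte length suggests. Either way, the number of characters a fixed-size slot can hold depends on the content.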
So, I vote against this proposal unless someone can provide a detailed specification
of how a fixed length string in utf8 encoding would be represented and until a compelling
use case is presented.
At that point, we can reconsider this proposal.
I vote to

jswhit · 2015-12-04T04:23:33Z

I guess a fixed-length unicode data type only really makes sense for UTF-32. I think numpy represents unicode data internally with UTF-32 (UCS4). I don't think HDF5 supports UTF-32 though (sigh).
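
To illustrate the UTF-32 point, this standard-library sketch compares encoded sizes for a few strings that all have the same number of code points. The sample strings and the helper name `encoded_sizes` are arbitrary choices for illustration.

```python
def encoded_sizes(text):
    """Return (code points, UTF-8 bytes, UTF-32 bytes) for `text`."""
    return (
        len(text),
        len(text.encode("utf-8")),
        len(text.encode("utf-32-le")),  # "-le" variant writes no byte-order mark
    )


# Three strings of 3 code points each (arbitrary examples).
words = ["abc", "\u00e9t\u00e9", "\U0001F600ok"]
sizes = [encoded_sizes(w) for w in words]
```

Every entry in `sizes` has a UTF-32 size of 12 bytes (4 bytes per code point), while the UTF-8 sizes are 3, 5 and 6 bytes. That constant width is what makes a fixed-length unicode type straightforward in UTF-32 and not in UTF-8.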

I still think having an ASCII fixed-length string array datatype would be very useful though.

DennisHeimbigner · 2015-12-04T04:39:42Z

As I said, give us a detailed proposal including use case(s)
and we will re-open the issue.

edhartnett · 2017-11-02T11:35:34Z

I agree with Dennis that there should be no fixed length ascii type in netCDF.

Also it is not clear to me why the user cannot use a fixed array of NC_CHAR for this.

jswhit · 2017-11-02T12:24:55Z

Of course you can, but it's a convenience thing. I think this ticket can be closed.

WardF added the status/more information needed and type/feature request labels on Oct 14, 2015

jswhit mentioned this issue Dec 4, 2015

Fixed length string datatype Unidata/netcdf4-python#494

Open

WardF closed this as completed Nov 2, 2017

jswhit mentioned this issue Mar 7, 2018

Numpy string types in compound type Unidata/netcdf4-python#773

Closed
