Numpy string types in compound type #773

jacklovell · 2018-03-07T09:00:42Z

I have the following Numpy dtype for an array:

dtype([('Object_type', 'S30'), ('ID', 'S30'), ('Version', '<i4'), 
('basis_1', [('x', '<f8'), ('y', '<f8'), ('z', '<f8')]),
('basis_2', [('x', '<f8'), ('y', '<f8'), ('z', '<f8')]), 
('centre_point', [('x', '<f8'), ('y', '<f8'), ('z', '<f8')]), 
('width', '<f8'), ('height', '<f8'),
('slit_id', 'S30'), ('slit_no', '<i4')])

When I try to create a NetCDF compound type from this dtype, I get the following error:

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
netCDF4/_netCDF4.pyx in netCDF4._netCDF4._def_compound (netCDF4/_netCDF4.c:51191)()

KeyError: 'S30'

During handling of the above exception, another exception occurred:

ValueError                                Traceback (most recent call last)
<ipython-input-247-c1ab3d42293b> in <module>()
     20     nc_coord = bolo_group.createCompoundType(cartesian_coord, "COORDINATE")
---> 21     nc_detector = bolo_group.createCompoundType(aperture_dtype_full, "DETECTOR")
     22     foil_no = bolo_group.createDimension("foil_no", foils.size)

netCDF4/_netCDF4.pyx in netCDF4._netCDF4.Dataset.createCompoundType (netCDF4/_netCDF4.c:16268)()

netCDF4/_netCDF4.pyx in netCDF4._netCDF4.CompoundType.__init__ (netCDF4/_netCDF4.c:49971)()

netCDF4/_netCDF4.pyx in netCDF4._netCDF4._def_compound (netCDF4/_netCDF4.c:51245)()

ValueError: Unsupported compound type element

During handling of the above exception, another exception occurred:

RuntimeError                              Traceback (most recent call last)
<ipython-input-247-c1ab3d42293b> in <module>()
     27     nc_slits = sxd_group.createVariable("slits", nc_detector, "slit_no")
     28     nc_slits[:] = slits_full
---> 29     nc_slits.units = "mm"

netCDF4/_netCDF4.pyx in netCDF4._netCDF4.Dataset.__exit__ (netCDF4/_netCDF4.c:13090)()

netCDF4/_netCDF4.pyx in netCDF4._netCDF4.Dataset.close (netCDF4/_netCDF4.c:15045)()

RuntimeError: NetCDF: HDF error

I understand that it is not possible to create compound types which include variable-length arrays, but the Numpy '<S30' type should always be 30 bytes, not a variable length. This is actually why I'm using S30 and not U30, since the latter may have multi-byte characters.
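The fixed-size point can be checked with plain numpy. This is a minimal sketch, not taken from the thread; the variable names are only for illustration. It compares the per-element storage of 'S30' (one byte per character) with 'U30' (which numpy stores as four bytes per character).

```python
import numpy as np

# 'S30' is a fixed-width byte string: always 30 bytes per element.
bytes_dtype = np.dtype('S30')
# 'U30' is a fixed-width unicode string: numpy stores 4 bytes per character.
unicode_dtype = np.dtype('U30')

s_size = bytes_dtype.itemsize
u_size = unicode_dtype.itemsize
print(s_size, u_size)
```

Both are fixed-length from numpy's point of view; the difference is only in how many bytes each character occupies.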

Is it possible to add support for the S30 dtype in netcdf4-python? Otherwise, is there another way that I can include fixed length string types in a NetCDF compound type?

jswhit · 2018-03-07T20:25:58Z

Problem is there is no fixed length string datatype in netcdf. The only workaround I know of is to create a character array ('S1') with a length of 30, and then use the stringtoarr and chartostring utilities to convert back and forth from strings to arrays of characters. There's an example of this for compound types at https://github.com/Unidata/netcdf4-python/blob/master/examples/tutorial.py.

In your case, you would use

dtype([('Object_type', 'S1',30), ('ID', 'S1',30), ('Version', '<i4'), 
('basis_1', [('x', '<f8'), ('y', '<f8'), ('z', '<f8')]),
('basis_2', [('x', '<f8'), ('y', '<f8'), ('z', '<f8')]), 
('centre_point', [('x', '<f8'), ('y', '<f8'), ('z', '<f8')]), 
('width', '<f8'), ('height', '<f8'),
('slit_id', 'S1',30), ('slit_no', '<i4')])

I've asked for fixed length strings to be added to netcdf-c in the past, but the idea never gained any traction (Unidata/netcdf-c#132). If you really want this, please comment on that ticket.
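The 'S30' to ('S1', 30) rewrite shown above can also be done programmatically rather than by hand. The following is a rough sketch using only numpy; `strings_to_char_arrays` is a hypothetical helper name (not part of netcdf4-python), and it assumes a packed dtype, so any custom field offsets or alignment padding would be lost.

```python
import numpy as np

def strings_to_char_arrays(dtype):
    """Hypothetical helper: rewrite each 'S<n>' field as ('S1', (n,)).

    Nested structured fields are handled recursively. Offsets are not
    preserved, so this assumes a packed (unaligned) dtype.
    """
    fields = []
    for name in dtype.names:
        field_dtype = dtype.fields[name][0]
        if field_dtype.names is not None:
            # Nested compound field: convert its members too.
            fields.append((name, strings_to_char_arrays(field_dtype)))
        elif field_dtype.kind == 'S' and field_dtype.itemsize > 1:
            # Fixed-width byte string: split into single characters.
            fields.append((name, 'S1', (field_dtype.itemsize,)))
        else:
            fields.append((name, field_dtype))
    return np.dtype(fields)

coord = [('x', '<f8'), ('y', '<f8'), ('z', '<f8')]
string_dtype = np.dtype([('ID', 'S30'), ('Version', '<i4'),
                         ('basis_1', coord), ('slit_id', 'S30')])
char_dtype = strings_to_char_arrays(string_dtype)
print(char_dtype)
```

Because both dtypes have the same item size, data built with the string version can be reinterpreted as the character-array version without copying.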

jswhit · 2018-03-08T19:14:01Z

BTW - it is technically possible to support vlen strings inside compound data types. I just haven't devoted the time to implement it (yet).

jacklovell · 2018-03-09T13:09:52Z

Thanks for this. Using the stringtoarr and chartostring utilities works well, although it does add some boilerplate to the code.

It would be nice if there was at least an option in netcdf4-python to perform this step internally. Xarray does this (http://xarray.pydata.org/en/stable/io.html#string-encoding) for its netcdf IO, although that doesn't seem to work for compound data types yet (pydata/xarray#1977). Enabling this at the netcdf4-python level would make it easier for other packages to support what would be a very useful feature.

jswhit · 2018-03-09T20:31:28Z

We already do this for character arrays if the _Encoding attribute is set (that is what xarray is using under the hood), so it may not be too hard to support for compound types. I'll look into it.

shoyer · 2018-03-10T01:10:54Z

One option is to use .view() to convert the data from repeated single characters to strings. Consider:

In [9]: data = [('a', 'bb'), ('ccc', 'dddd')]

In [10]: arr1 = np.array(data, dtype=[('f0', 'S1', 3), ('f1', 'S1', 4)])

In [11]: arr1
Out[11]:
array([([b'a', b'a', b'a'], [b'b', b'b', b'b', b'b']),
       ([b'c', b'c', b'c'], [b'd', b'd', b'd', b'd'])],
      dtype=[('f0', 'S1', (3,)), ('f1', 'S1', (4,))])

In [12]: arr2 = np.array(data, dtype='S3,S4')

In [13]: arr2
Out[13]:
array([(b'a', b'bb'), (b'ccc', b'dddd')],
      dtype=[('f0', 'S3'), ('f1', 'S4')])

In [14]: arr3 = arr1.view(arr2.dtype)

In [15]: arr3
Out[15]:
array([(b'aaa', b'bbbb'), (b'ccc', b'dddd')],
      dtype=[('f0', 'S3'), ('f1', 'S4')])

shoyer · 2018-03-10T01:11:52Z

OK, numpy is doing something horrible with arr1 -- those elements should not be repeated like that.

But hopefully you get the idea that view() can be used to convert between data types.
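One way to sidestep the repeated-element behaviour is to build the data with the string dtype first and then view it as character arrays. This is a minimal numpy-only sketch of that idea, reusing the field names from the session above; the variable names are illustrative.

```python
import numpy as np

str_dtype = np.dtype([('f0', 'S3'), ('f1', 'S4')])
char_dtype = np.dtype([('f0', 'S1', 3), ('f1', 'S1', 4)])

# Build the records as fixed-width strings, then reinterpret the bytes.
strings = np.array([(b'a', b'bb'), (b'ccc', b'dddd')], dtype=str_dtype)
chars = strings.view(char_dtype)

# Short strings are padded with empty bytes instead of being repeated.
print(chars['f0'].tolist())

# Viewing back with the string dtype recovers the original records.
round_trip = chars.view(str_dtype)
print(round_trip.tolist())
```

The view only works because both dtypes describe the same number of bytes per record (7 here).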

jswhit · 2018-03-11T21:44:48Z

Yes, I think views are the simplest and safest way to do this. You just have to create two numpy datatypes, one with the string components represented as character arrays, and one with numpy string arrays. Just use the first (dtype1) to create the netcdf variable, and the second (dtype2) to create your data, and write the data to the variable using v[:] = data[:].view(dtype1). To read the data use data[:] = v[:].view(dtype2).

jswhit · 2018-03-11T21:54:27Z

An example

from netCDF4 import Dataset
import numpy as np
f = Dataset('compound_example.nc','w')
dtype1 = np.dtype([('observation', 'f4'),
                    ('station_name','S1',80)])
dtype2 = np.dtype([('observation', 'f4'),
                    ('station_name','S80')])
station_data_t = f.createCompoundType(dtype1,'station_data')
f.createDimension('station',None)
statdat = f.createVariable('station_obs', station_data_t, ('station',))
data = np.empty(2,dtype2)
data['observation'][:] = (123.,3.14)
data['station_name'][:] = ('Boulder','New York')
statdat[:] = data.view(dtype1)
print statdat[:]
print
print statdat[:].view(dtype2)
f.close()

[(123.  , ['B', 'o', 'u', 'l', 'd', 'e', 'r', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', ''])
 (  3.14, ['N', 'e', 'w', ' ', 'Y', 'o', 'r', 'k', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', ''])]

[(123.  , 'Boulder') (  3.14, 'New York')]

ncdump compound_example.nc

netcdf compound_example {
types:
  compound station_data {
    float observation ;
    char station_name(80) ;
  }; // station_data
dimensions:
	station = UNLIMITED ; // (2 currently)
variables:
	station_data station_obs(station) ;
data:

 station_obs = {123, {"Boulder"}}, {3.14, {"New York"}} ;

jswhit · 2018-03-12T19:11:07Z

Pull request #778 enables this automatically, so now this works

from netCDF4 import Dataset
import numpy as np
f = Dataset('compound_example.nc','w')
dtype = np.dtype([('observation', 'f4'),
                  ('station_name','S1',80)])
station_data_t = f.createCompoundType(dtype,'station_data')
f.createDimension('station',None)
statdat = f.createVariable('station_obs', station_data_t, ('station',))
statdat._Encoding = 'ascii'
data = np.empty(2,station_data_t.dtype_view)
print data.dtype
data['observation'][:] = (123.,3.14)
data['station_name'][:] = ('Boulder','New York')
statdat[:] = data
print statdat.dtype
print statdat[:].dtype
print statdat[:]
f.close()

{'names':['observation','station_name'], 'formats':['<f4','S80'], 'offsets':[0,4], 'itemsize':84, 'aligned':True}
{'names':['observation','station_name'], 'formats':['<f4',('S1', (80,))], 'offsets':[0,4], 'itemsize':84, 'aligned':True}
{'names':['observation','station_name'], 'formats':['<f4','S80'], 'offsets':[0,4], 'itemsize':84, 'aligned':True}
[(123.  , 'Boulder') (  3.14, 'New York')]

jswhit · 2018-03-12T20:43:53Z

I'm a little nervous about adding this extra magic. The pros are:

- it's only done if the _Encoding variable attribute is set, so the behavior is similar to what happens with character array variables except that a view is returned instead of a copy.
- simplifies user code a bit, since the user almost always wants numpy strings and not character arrays. If the user really wants the character arrays, they can just not set _Encoding.

Cons:

- may be confusing to get numpy data back that is not the same type as the netcdf variable.
- may break existing code (doubtful, since probably no one is setting _Encoding on compound types).
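The "view instead of a copy" distinction can be seen with plain numpy, independent of netcdf4-python. This is a small sketch under that assumption, using the two station dtypes from the earlier example: writing through the string view changes the underlying character-array data, because both arrays share one buffer.

```python
import numpy as np

char_dtype = np.dtype([('observation', 'f4'), ('station_name', 'S1', 80)])
str_dtype = np.dtype([('observation', 'f4'), ('station_name', 'S80')])

chars = np.zeros(2, dtype=char_dtype)
strings = chars.view(str_dtype)  # reinterprets the same buffer, no copy

# Assigning through the string view also modifies the character array.
strings['station_name'] = [b'Boulder', b'New York']

shared = np.shares_memory(chars, strings)
first_letters = chars['station_name'][:, 0].tolist()
print(shared, first_letters)
```

This is also why the returned type can look surprising: the bytes match the netcdf variable exactly, but the dtype describing them does not.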

jswhit · 2018-03-13T18:39:19Z

@jacklovell and @shoyer, I really would like your feedback on this proposed change.

shoyer · 2018-03-13T20:28:30Z

Remind me -- does the _Encoding attribute get set automatically?

I do think this is probably a win for usability. Most users want NumPy strings, not arrays of characters.

jswhit · 2018-03-13T20:46:47Z

Pull request was updated so that if you specify 'S#' in a structured dtype when creating a netcdf compound type it automatically gets converted to ('S1',#). So now in the above example

dtype = np.dtype([('observation', 'f4'),
                  ('station_name','S1',80)])
station_data_t = f.createCompoundType(dtype,'station_data')

can be changed to

dtype = np.dtype([('observation', 'f4'),
                 ('station_name','S80')])
station_data_t = f.createCompoundType(dtype,'station_data')

_Encoding does not get set automatically to preserve backward compatibility. To get the new behavior, you have to explicitly set it.

jacklovell · 2018-03-14T09:14:57Z

@jswhit This looks good. Under-the-hood conversion is better in my opinion, since it removes the need for boilerplate in users' codes (particularly having to create 2 similar-but-not-quite-identical dtypes). Being able to pass in a Numpy dtype with strings and read back a Numpy dtype with strings, without having to manually convert 'S#' to ('S1', #), does make this more user friendly, I think.

jswhit · 2018-03-14T13:59:54Z

Should we require the use of the _Encoding attribute to trigger the conversion (as we do for netcdf character arrays), or just make it the default for compound types?

jacklovell · 2018-03-14T14:33:32Z

Well, if _Encoding is not present we will get ValueError: Unsupported compound type element, won't we? This is the problem I initially encountered when I created this issue. So I'd suggest making it the default for compound types, if there isn't any way of doing it at the moment without the conversion to character arrays.

jswhit · 2018-03-14T15:18:46Z

Right now the conversion is done if the _Encoding attribute is set - we could just remove that check and always do it. It can still be disabled using set_auto_chartostring(False).

jacklovell · 2018-03-14T16:24:46Z

I would suggest that the check should be removed. That way, the default usage (i.e. no _Encoding set by the user) would just work. It would be less user friendly to raise an Exception unless _Encoding was set, I think, as there doesn't seem to be any benefit to not setting _Encoding.

jswhit · 2018-03-14T16:33:59Z

The only problem would be that if there is user code out there that is expecting numpy structured arrays with character array subtypes to be returned from the netcdf file, they will all of a sudden get structured arrays with strings back (unless they set set_auto_chartostring(False)).

shoyer · 2018-03-14T16:39:58Z

The work-around here is easy enough that it's probably worth adding this to set_auto_chartostring().

Note that xarray does explicitly call set_auto_chartostring(False), so we'll need a parallel fix there.

jswhit · 2018-03-14T16:42:49Z

Not quite clear on what you mean by 'adding this to set_auto_chartostring' - do you mean don't check for _Encoding at all for compound types and rely on the flag set by set_auto_chartostring to control the behavior? BTW - set_auto_chartostring(True) is the library default...

shoyer · 2018-03-14T16:48:53Z

Not quite clear on what you mean by 'adding this to set_auto_chartostring' - do you mean don't check for _Encoding at all for compound types and rely on the flag set by set_auto_chartostring to control the behavior?

Yes, that's what I meant. If we only start writing _Encoding for structured arrays now, then I suspect the backwards incompatibility impact will be minimal.

jswhit · 2018-03-14T17:17:14Z

@dopplershift, if you have time to read over this issue I'd appreciate your input.

jswhit · 2018-03-14T22:31:21Z

I updated the pull request to remove the check for _Encoding for compound types, so the conversion is always done unless it's turned off with set_auto_chartostring. Also added a section to the docs on dealing with strings.

dopplershift · 2018-03-14T23:19:38Z

I think the changes seem reasonable--especially since you can turn it off with set_auto_chartostring. I have no idea how much use compound types actually see in the wild, though.

jswhit · 2018-03-15T22:01:49Z

OK, I'm going to go ahead and merge now.

return views with numpy strings in compound types (issue #773)

jacklovell · 2018-03-16T09:23:55Z

Thanks. Just tested with the up-to-date master branch, and it works very nicely. I'll close this issue now.

jswhit added a commit that referenced this issue Mar 12, 2018

return views with numpy strings in compound types (issue #773)

98be00c

jswhit added a commit that referenced this issue Mar 15, 2018

Merge pull request #778 from Unidata/issue773

a693a18

return views with numpy strings in compound types (issue #773)

jacklovell closed this as completed Mar 16, 2018

kmuehlbauer mentioned this issue Feb 2, 2024

add VLType and CompoundType, commit complex compound type to file h5netcdf/h5netcdf#227