`read_csv` does not return columns in the order specified by `columns` parameter #13066

mcrumiller · 2023-12-15T19:48:35Z

Checks

I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of Polars.

Reproducible example

import polars as pl
from io import StringIO

csv = (
    "a,b,c\n"
    "1,2,3\n"
    "1,2,3\n"
)

df = pl.read_csv(StringIO(csv), columns=["b", "a", "c"])
print(df)

shape: (2, 3)
┌─────┬─────┬─────┐
│ a   ┆ b   ┆ c   │
│ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 │
╞═════╪═════╪═════╡
│ 1   ┆ 2   ┆ 3   │
│ 1   ┆ 2   ┆ 3   │
└─────┴─────┴─────┘

Issue description

When the columns parameter is specified to select a subset of columns, the returned column order follows the original frame's order, not the order specified by columns.

Related Issues: #11186, #11535.

Expected behavior

read_csv should return the columns in the requested order (b, a, c in the example above).

Installed versions

--------Version info---------
Polars:               0.19.19
Index type:           UInt32 
Platform:             Windows-10-10.0.19045-SP0
Python:               3.11.2 (tags/v3.11.2:878ead1, Feb  7 2023, 16:38:35) [MSC v.1934 64 bit (AMD64)]

----Optional dependencies----
adbc_driver_manager:  0.4.0
cloudpickle:          <not installed>
connectorx:           0.3.2
deltalake:            <not installed>
fsspec:               <not installed>
gevent:               <not installed>
matplotlib:           3.7.1
numpy:                1.26.1
openpyxl:             3.1.2
pandas:               2.1.1
pyarrow:              11.0.0
pydantic:             <not installed>
pyiceberg:            <not installed>
pyxlsb:               <not installed>
sqlalchemy:           2.0.7
xlsx2csv:             0.8.1
xlsxwriter:           3.0.9


stinodego · 2023-12-16T05:41:07Z

Related to #13040

Happy to accept a fix for this one.

romanovacca · 2023-12-22T13:50:03Z

I want to pick this one up!
Do we want a simple solution on the Python side? Something like

df.select(columns)

inside read_csv would solve it.

But I believe fixing it on the Rust side would be better, agreed?

stinodego · 2023-12-22T23:30:53Z

But I believe fixing it on the Rust side would be better, agreed?

Yes! If this code path leads to the Rust side, it would have to be fixed there.

stinodego · 2024-02-14T22:29:35Z

As I noted in the related PR, the offending code seems to be here:

polars/crates/polars-io/src/csv/read_impl/mod.rs

Lines 227 to 240 in 11c6f9b

    
if let Some(cols) = columns {
    let mut prj = Vec::with_capacity(cols.len());
    for col in cols {
        let i = schema.try_index_of(&col)?;
        prj.push(i);
    }
    // update null values with projection
    if let Some(nv) = null_values.as_mut() {
        nv.apply_projection(&prj);
    }
    projection = Some(prj);
}
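The snippet builds `prj` in the requested order, so the ordering is presumably lost somewhere further down the read path. Below is a pure-Python sketch of that hypothesis. It assumes (this is not confirmed in the thread) that the reader ends up emitting projected columns by ascending file position; all function names are made up for illustration.

```python
def build_projection(schema, columns):
    """Map requested names to file positions, keeping the requested order."""
    return [schema.index(name) for name in columns]


def read_in_file_order(schema, projection):
    """Assumed reader behavior: emit projected columns by ascending position."""
    return [schema[i] for i in sorted(projection)]


def restore_requested_order(read_names, columns):
    """Reorder what was read so it matches the order the caller asked for."""
    available = set(read_names)
    return [name for name in columns if name in available]


schema = ["a", "b", "c"]
columns = ["b", "a", "c"]

projection = build_projection(schema, columns)     # requested order is intact here
observed = read_in_file_order(schema, projection)  # order lost under the assumption
fixed = restore_requested_order(observed, columns)
print(projection, observed, fixed)
```

If the assumption holds, a fix needs either to keep the unsorted projection all the way through, or to reorder the resulting columns once at the end.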

stinodego · 2024-06-08T09:31:56Z

This should ideally be fixed by making read_csv actually call scan_csv under the hood, and then simply doing .select(columns).collect().

cbrnr · 2024-08-09T15:15:11Z

Fixing this would actually make selecting and renaming columns during reading much easier, because with pl.scan_csv() I have to do:

cols = ["A", "F", "C", "E", "D", "B"]
cols_new = ["a", "f", "c", "e", "d", "b"]
df = (
    pl.scan_csv(logfile)
    .select(cols)
    .rename({k: v for k, v in zip(cols, cols_new)})
)

But this could be much simpler:

df = pl.read_csv(logfile, columns=cols, new_columns=cols_new)

Unless of course I'm missing something.

mcrumiller · 2024-08-09T15:19:36Z

@cbrnr you are not missing anything; that's the intent, it's just a bit messy right now.

Furthermore, the order shouldn't matter, i.e. you should be able to do:

cols = ["D", "B", "A"]  # notice out of order
cols_new = ["d", "b", "a"]
df = pl.read_csv(logfile, columns=cols, new_columns=cols_new)

...but this is currently broken.

cbrnr · 2024-08-09T15:45:14Z

Yes, I know that the column order is currently broken in pl.read_csv(), and it would be very nice if it worked! I just wasn't sure whether the pl.scan_csv() way was maybe too verbose, because I literally started using Polars yesterday (so I'm never sure whether I'm just missing the idiomatic way).

mcrumiller · 2024-08-09T15:50:14Z

@stinodego was saying this is how it will/should work "under the hood"--meaning you could continue to use read_csv, but the underlying logic would be more consistent/reliable.

cbrnr · 2024-08-09T16:03:35Z

Yes, this would be great indeed! Meanwhile, I've switched to scan_csv with the more verbose selecting and renaming, not a big deal, but I just wanted to mention another use case which would be enabled by fixing read_csv.

mcrumiller added bug Something isn't working python Related to Python Polars labels Dec 15, 2023

stinodego added the accepted Ready for implementation label Dec 16, 2023

romanovacca mentioned this issue Dec 24, 2023

fix: Fix read_csv to respect the order specified by the columns argument #13240

Closed

stinodego added P-medium Priority: medium and removed accepted Ready for implementation labels Jan 12, 2024

alexander-beedie added the A-io-csv Area: reading/writing CSV files label Jan 23, 2024

mcrumiller mentioned this issue Mar 13, 2024

when "columns =" is specified, "pl.read_csv()" doesn't import columns based on the specified order of "columns = " #15027

Closed

2 tasks

This was referenced Mar 27, 2024

fix(python, rust): read_csv column order did not follow the columns parameter #15317

Closed

Proposal: Re-design columns, new_columns, schema, dtypes in read_csv #15431

Closed

stinodego mentioned this issue Jun 8, 2024

CSV reader column projection does not respect order #10572

Closed

2 tasks
