Spurious pytest failure #6499

Closed
2 tasks done
ritchie46 opened this issue Jan 27, 2023 · 4 comments · Fixed by #6508 or #6532
Labels
bug Something isn't working python Related to Python Polars

Comments

@ritchie46
Member

Polars version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of Polars.

Issue description

I believe that the parallelization led to a race condition on a file.

@stinodego I think we must ensure that all file creators and consumers end up on the same workers.


    @pytest.mark.xfail(sys.platform == "win32", reason="Does not work on Windows")
    def test_parquet_struct_categorical() -> None:
        df = pl.DataFrame(
            [
                pl.Series("a", ["bob"], pl.Categorical),
                pl.Series("b", ["foo"], pl.Categorical),
            ]
        )
        df.write_parquet("/tmp/tmp.pq")
        with pl.StringCache():
>           out = pl.read_parquet("/tmp/tmp.pq").select(pl.col("b").value_counts())

tests/unit/io/test_lazy_parquet.py:207: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
polars/internals/dataframe/frame.py:5603: in select
    self.lazy().select(exprs).collect(no_optimization=True)._df
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <polars.LazyFrame object at 0x7F8EBD9BBDF0>

    def collect(
        self,
        *,
        type_coercion: bool = True,
        predicate_pushdown: bool = True,
        projection_pushdown: bool = True,
        simplify_expression: bool = True,
        no_optimization: bool = False,
        slice_pushdown: bool = True,
        common_subplan_elimination: bool = True,
        streaming: bool = False,
    ) -> pli.DataFrame:
        """
        Collect into a DataFrame.
    
        Note: use :func:`fetch` if you want to run your query on the first `n` rows
        only. This can be a huge time saver in debugging queries.
    
        Parameters
        ----------
        type_coercion
            Do type coercion optimization.
        predicate_pushdown
            Do predicate pushdown optimization.
        projection_pushdown
            Do projection pushdown optimization.
        simplify_expression
            Run simplify expressions optimization.
        no_optimization
            Turn off (certain) optimizations.
        slice_pushdown
            Slice pushdown optimization.
        common_subplan_elimination
            Will try to cache branching subplans that occur on self-joins or unions.
        streaming
            Run parts of the query in a streaming fashion (this is in an alpha state)
    
        Returns
        -------
        DataFrame
    
        Examples
        --------
        >>> df = pl.DataFrame(
        ...     {
        ...         "a": ["a", "b", "a", "b", "b", "c"],
        ...         "b": [1, 2, 3, 4, 5, 6],
        ...         "c": [6, 5, 4, 3, 2, 1],
        ...     }
        ... ).lazy()
        >>> df.groupby("a", maintain_order=True).agg(pl.all().sum()).collect()
        shape: (3, 3)
        ┌─────┬─────┬─────┐
        │ a   ┆ b   ┆ c   │
        │ --- ┆ --- ┆ --- │
        │ str ┆ i64 ┆ i64 │
        ╞═════╪═════╪═════╡
        │ a   ┆ 4   ┆ 10  │
        │ b   ┆ 11  ┆ 10  │
        │ c   ┆ 6   ┆ 1   │
        └─────┴─────┴─────┘
    
        """
        if no_optimization:
            predicate_pushdown = False
            projection_pushdown = False
            slice_pushdown = False
            common_subplan_elimination = False
    
        if streaming:
            common_subplan_elimination = False
    
        ldf = self._ldf.optimization_toggle(
            type_coercion,
            predicate_pushdown,
            projection_pushdown,
            simplify_expression,
            slice_pushdown,
            common_subplan_elimination,
            streaming,
        )
>       return pli.wrap_df(ldf.collect())
E       exceptions.NotFoundError: b
E       
E       > Error originated just after operation: '  DF ["name", "amount"]; PROJECT */2 COLUMNS; SELECTION: "None"
E       '
E       This operation could not be added to the plan.

polars/internals/lazyframe/frame.py:1147: NotFoundError

Reproducible example

None

Expected behavior

Run tests successfully.

Installed versions

~

@ritchie46 ritchie46 added bug Something isn't working python Related to Python Polars labels Jan 27, 2023
@stinodego
Member

stinodego commented Jan 27, 2023

> @stinodego I think we must ensure that all file creators and consumers end up on the same workers.

Hm, I'm not sure that's required. But this test is still writing to disk which is not intended - I thought I had gotten all tests to write to a TemporaryDirectory.

In this case, I think it's clashing with test_streaming_categorical, which writes to and reads from the same directory on disk. If those tests run simultaneously, that will obviously lead to problems.

I'll make that change, and if we're still running into issues, I'll do some tuning of the test distribution over workers.
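The traceback reports a plan over the columns "name" and "amount", while the failing test wrote columns "a" and "b". That is consistent with another test overwriting the shared file between the write and the read. Below is a minimal, deterministic sketch of that interleaving. It uses CSV files from the standard library as a stand-in for Parquet, and the helpers `write_table` and `read_columns` are hypothetical names introduced only for illustration.

```python
from __future__ import annotations

import csv
import tempfile
from pathlib import Path


def write_table(path: Path, columns: list[str], row: list[str]) -> None:
    """Write a one-row CSV file, overwriting whatever is already at `path`."""
    with path.open("w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(columns)
        writer.writerow(row)


def read_columns(path: Path) -> list[str]:
    """Return the header row of the CSV file at `path`."""
    with path.open(newline="") as f:
        return next(csv.reader(f))


with tempfile.TemporaryDirectory() as root:
    # Shared path: both "tests" use the same file name, like /tmp/tmp.pq.
    shared = Path(root) / "tmp.csv"
    write_table(shared, ["a", "b"], ["bob", "foo"])      # test A writes
    write_table(shared, ["name", "amount"], ["x", "1"])  # test B overwrites
    shared_columns = read_columns(shared)                # test A reads B's file

    # Isolated paths: each "test" gets its own directory.
    dir_a = Path(tempfile.mkdtemp(dir=root))
    dir_b = Path(tempfile.mkdtemp(dir=root))
    write_table(dir_a / "tmp.csv", ["a", "b"], ["bob", "foo"])
    write_table(dir_b / "tmp.csv", ["name", "amount"], ["x", "1"])
    isolated_columns = read_columns(dir_a / "tmp.csv")

print(shared_columns)    # ['name', 'amount']
print(isolated_columns)  # ['a', 'b']
```

With the shared path, test A no longer finds column "b". With one directory per test, the order in which the workers run does not matter.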

@stinodego stinodego self-assigned this Jan 27, 2023
@thomasfrederikhoeck
Contributor

Using tempfile.TemporaryFile or tempfile.TemporaryDirectory should also allow it to run on Windows.
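A minimal sketch of why this helps on Windows: `tempfile.TemporaryDirectory` creates the directory under the platform's own temp location instead of a hard-coded `/tmp`, and removes it when the `with` block exits. The function name `roundtrip_in_tempdir` and the placeholder payload are assumptions made for illustration, not code from the Polars test suite.

```python
from __future__ import annotations

import tempfile
from pathlib import Path


def roundtrip_in_tempdir(payload: bytes) -> tuple[bytes, Path]:
    """Write `payload` to a file in a fresh temporary directory and read it back.

    Returns the bytes read and the (by then deleted) file path.
    """
    with tempfile.TemporaryDirectory() as tmpdir:
        # tmpdir lives under the platform's temp location, so "/tmp" is not assumed.
        file_path = Path(tmpdir) / "tmp.pq"
        file_path.write_bytes(payload)
        data = file_path.read_bytes()
    # Leaving the `with` block removes the directory and everything in it.
    return data, file_path


payload = b"placeholder bytes, not a real parquet file"
data, old_path = roundtrip_in_tempdir(payload)
print(data == payload)    # True
print(old_path.exists())  # False
```

In the test from this issue, the `write_bytes` and `read_bytes` calls would correspond to `df.write_parquet(file_path)` and `pl.read_parquet(file_path)`.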

@stinodego
Member

Exactly, I already noticed that we were getting some xpasses on Windows. I was just looking at that 😄

@stinodego
Member

stinodego commented Jan 28, 2023

Still seeing some intermittent issues, for example:
https://github.com/pola-rs/polars/actions/runs/4031534151/jobs/6930879460

I'm looking into it.
