It would be nice if the regexp_extract function supported lazy quantifiers together with a specified group index.
For example:
Create a Hive table and insert some rows:
hive> show create table datavalid3;
OK
CREATE EXTERNAL TABLE `datavalid3`(
`col1` int,
`col2` bigint,
`col3` tinyint,
`col4` string,
`col5` date,
`col6` map<string,string>,
`col7` map<string,array<string>>,
`col8` decimal(5,3))
PARTITIONED BY (
`col9` string)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
Run the SQL below:
spark.sql("select regexp_extract(col4,'\"xst\":\"(.*?)\"') from datavalid3").show
Fallback info:
!Exec <ProjectExec> cannot run on GPU because not all expressions can be replaced
@Expression <Alias> regexp_extract(col4#3, "xst":"(.*?)", 1) AS regexp_extract(col4, "xst":"(.*?)", 1)#21 could run on GPU
!Expression <RegExpExtract> regexp_extract(col4#3, "xst":"(.*?)", 1) cannot run on GPU because regex group count is 0, but the specified group index is 1; Lazy quantifier *? not supported near index 10
I think the group count issue is a side effect of not supporting the lazy quantifier. We probably stopped parsing the regexp at the lazy quantifier and so never counted the capture group around it.
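As a quick sanity check of that reading, the pattern itself does contain one capture group. The sketch below uses Python's standard re module as a stand-in for the Java regex engine. That is an assumption made for illustration: the two engines agree on this simple pattern but are not identical in general.

```python
import re

# The pattern from the failing query, after SQL/Scala unescaping.
pattern = r'"xst":"(.*?)"'

compiled = re.compile(pattern)

# Number of capture groups a full parse of the pattern finds.
group_count = compiled.groups
print(group_count)

# Index of the '?' that makes the '*' quantifier lazy.
lazy_index = pattern.index('?')
print(lazy_index)
```

A full parse reports one group, and the '?' sits at index 10, which lines up with the "near index 10" in the fallback message. That is consistent with the group count of 0 coming from a parse that gave up before finishing the group.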
It looks like cuDF already supports this feature.
>>> import cudf
>>> s = cudf.Series(['a:"hello"', 'a:"there"'])
>>> s
0 a:"hello"
1 a:"there"
dtype: object
>>> s.str.extract('a:"(.*?)"')
0
0 hello
1 there
So this is probably mostly a matter of figuring out how to transpile .*? into something that matches exactly what Java/Spark does with the regexp.
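For reference, here is a minimal sketch of the semantics a transpiled pattern would need to preserve, again using Python's re module as a stand-in for Java (an assumption). The sample string is hypothetical, since the contents of col4 are not shown in this issue. The negated character class is only one commonly used rewrite, not necessarily what the plugin should emit.

```python
import re

# Hypothetical sample value; the real contents of col4 are not shown in the issue.
text = '{"xst":"abc","other":"def"}'

# Lazy: the group stops at the first closing quote.
lazy = re.search(r'"xst":"(.*?)"', text).group(1)

# Greedy: the group runs to the last quote on the line.
greedy = re.search(r'"xst":"(.*)"', text).group(1)

# Negated character class: same capture as the lazy form for this input.
negated = re.search(r'"xst":"([^"]*)"', text).group(1)

print(lazy)
print(greedy)
print(negated)
```

Note that the negated class is not a drop-in replacement: [^"] also matches line terminators, whereas . does not by default, so the two forms can differ on multi-line input.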