Column(s) .. not found in data error on sample AWS CUR file #161

I tried to convert the AWS example CUR file within the project, but I'm getting a "Column(s) .. not found in data" error. Is there something I forgot to make the conversion work? Here is the full command/output:

Comments
I ran into the same issue when testing the tool on a real-world dataset, before the anonymized sample was uploaded. I think the reason is that while the columns in the dataset are flattened, they do not match the column names the tool expects (see the sketch below). On the anonymized sample, I still get an error; attached is the output.
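A minimal sketch of the kind of mismatch being described, assuming the converter's plans look up the flattened Parquet-style names; the two spellings below follow real AWS CUR conventions, but which columns a given plan actually needs is illustrative:

```python
# Hypothetical illustration: AWS CUR CSV headers use a "category/Attribute"
# style, while the Parquet export (and the converter's plans) use flattened
# snake_case names, so a lookup for the snake_case name fails on CSV input.
csv_header = "lineItem/UsageStartDate"          # header as it appears in the CSV
expected_column = "line_item_usage_start_date"  # name a conversion plan would expect
assert csv_header != expected_column            # hence "Column(s) .. not found in data"
```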
It really is a problem related to the AWS CUR input format. I managed to make it work by choosing some specific attributes when creating the CUR: the converter appears to be prepared to receive only Parquet files from AWS, and only with the "Include resource IDs" option turned on. Would it be worth adding this to the documentation? Or even increasing the variability of possible inputs? I could contribute if so.
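For reference, a report definition with those attributes can also be created via the Cost and Usage Reports API; a hedged sketch using boto3, where the report name, bucket, and prefix are placeholders:

```python
import boto3

# Sketch of a CUR definition matching the settings described above:
# Parquet format/compression and resource IDs included (the "RESOURCES"
# schema element is what the console calls "Include resource IDs").
cur = boto3.client("cur", region_name="us-east-1")  # the CUR API lives in us-east-1
cur.put_report_definition(
    ReportDefinition={
        "ReportName": "focus-converter-input",      # placeholder
        "TimeUnit": "HOURLY",
        "Format": "Parquet",
        "Compression": "Parquet",
        "AdditionalSchemaElements": ["RESOURCES"],  # include resource IDs
        "S3Bucket": "my-cur-bucket",                # placeholder
        "S3Prefix": "cur/",                         # placeholder
        "S3Region": "us-east-1",
        "ReportVersioning": "OVERWRITE_REPORT",
    }
)
```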
Hi @stoiev, thank you for finding the fix. We realized this is an issue with the CUR part of the converter, which is why I added https://github.com/finopsfoundation/focus_converters/blob/dev/focus_converter_base/focus_converter/conversion_configs/aws/0_dimension_dtypes_S001.yaml. This plan has the column names and types that the converter expects; columns that are not present can be added as NULL values with the right data types, so that the SQL plans do not fail. Could you check, for your example, whether the pipeline works if you add an entry here?

Hi @davidschneider2W, the plan I pasted above essentially does the dataframe column add that you had to do prior to running the container. Could you also give it a try on the data source you have access to?
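In dataframe terms, the behavior described above amounts to something like the following polars sketch (polars >= 1.0 assumed); the column names and dtypes listed are illustrative, not the converter's actual expected schema:

```python
import polars as pl

# Illustrative only: a few CUR-style columns with plausible dtypes. The
# converter's real expected schema lives in the YAML plan linked above.
EXPECTED_COLUMNS = {
    "line_item_resource_id": pl.Utf8,
    "line_item_usage_start_date": pl.Datetime,
    "line_item_unblended_cost": pl.Float64,
}

def add_missing_columns(lf: pl.LazyFrame) -> pl.LazyFrame:
    """Add any expected column that is absent as NULLs of the right dtype,
    so downstream SQL/expression plans do not fail on a missing column."""
    present = set(lf.collect_schema().names())
    missing = [
        pl.lit(None, dtype=dtype).alias(name)
        for name, dtype in EXPECTED_COLUMNS.items()
        if name not in present
    ]
    return lf.with_columns(missing) if missing else lf
```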
Thanks for the clarification, @varunmittal91, but since all columns differ between the Parquet and CSV formats, @davidschneider2W will need to change all mappings. Since this seems a usual case (CSV seems more popular than Parquet), would support for AWS CUR CSV files be within the scope of this project as a built-in feature? If yes, we need to incorporate the other AWS column names in the main code (by writing a new mappings folder or by adapting @davidschneider2W's column conversion behind the scenes). If no, we could document somewhere how to generate compatible AWS CUR files.
Hi @stoiev, thank you for the feedback. CSV, as you mentioned, is popular and definitely part of the scope. To fix this, I think we need more plans specific to the data format (CSV in this case) that can reduce the source data into a common format. This way we can extend support for more formats and for things that might come in the future, like CUR 2.0. Also, an advantage of using polars here is that everything is lazily evaluated, so any such plan should not have a huge impact on compute either. Do you have some ideas on how this can be bootstrapped?
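To make the "reduce to a common format" idea concrete, a hedged sketch: scan either input lazily and rename CSV-style headers to the flattened Parquet-style names the existing plans expect. The two-entry mapping is a hypothetical excerpt, not the full CUR schema:

```python
import polars as pl

# Hypothetical excerpt of a CSV-header -> Parquet-name mapping; the real
# CUR schema has many more columns than these two.
CSV_TO_PARQUET = {
    "lineItem/UsageStartDate": "line_item_usage_start_date",
    "lineItem/UnblendedCost": "line_item_unblended_cost",
}

def scan_cur(path: str) -> pl.LazyFrame:
    """Lazily scan a CUR file, normalizing CSV inputs to the common
    (Parquet-style) column names; nothing is materialized until .collect()."""
    if path.endswith(".csv"):
        lf = pl.scan_csv(path)
        present = set(lf.collect_schema().names())
        return lf.rename({k: v for k, v in CSV_TO_PARQUET.items() if k in present})
    return pl.scan_parquet(path)
```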
Hi @varunmittal91, we are facing the same issue while converting AWS CUR.

Problem statement:

Proposed solution:
Thank you @spriharani, that sounds like a good idea. To get started and make some progress on the new conversion rules, I was wondering whether it would make sense to split the rules into a common core plus variations. So there could be an aws:cur-core, and then variations aws:cur2 and aws:cur. This way we will eventually be able to expand into --report-type, and the same would apply for CSV and Parquet. I am curious to hear your feedback.
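One possible shape for that split, sketched as a rule-set registry; the layering (variants inheriting the core rules) is one reading of the proposal rather than an agreed design, and the rule names are placeholders:

```python
# Hypothetical registry: each provider variant layers its own conversion
# rules on top of a shared aws:cur-core rule set.
PROVIDER_RULESETS: dict[str, list[str]] = {
    "aws:cur-core": ["dimension dtypes", "shared dimension mappings"],
    "aws:cur": ["aws:cur-core", "legacy CUR column names"],
    "aws:cur2": ["aws:cur-core", "CUR 2.0 column names"],
}

def resolve_rules(provider: str) -> list[str]:
    """Flatten a variant's rule list, expanding references to other variants."""
    rules: list[str] = []
    for entry in PROVIDER_RULESETS.get(provider, []):
        rules.extend(resolve_rules(entry) if entry in PROVIDER_RULESETS else [entry])
    return rules
```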
I've changed the current AWS plans to process CSV CUR files as a POC (changes can be seen here). I could convert the example AWS CUR file that was added to the project and compare it to a Parquet version of the same data (just changing column names using @davidschneider2W's logic), and it seems to produce the same results. Maybe it could help a future PR that allows multiple AWS CUR formats. Some workarounds that I had to implement:

Regarding the process:
Thanks @varunmittal91, I appreciate your insights, and I agree with your thoughts. It's great to have flexibility covering the various variations like aws:cur-core, aws:cur2, and aws:cur, among others. To accommodate these variations, I've made some updates and introduced a new provider, aws:cur, along with its dedicated set of conversion rules. Changes can be found here. With these enhancements, we can now process AWS CUR using the following command:
Thanks for adding AWS CSV CUR support, @spriharani! There is one more issue, related to input date generation, that affects both CSV and Parquet: the optional field. Is there an easy way to leave this field optional in the original data file? I've tried some plan configurations without success.
@stoiev Committed code is here.
Great! Is it worth a PR? I think that resolves this issue completely.
Good to know that it resolved the issue. I have opened a PR for this.