Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Column(s) .. not found in data error on sample AWS CUR file #161

Closed
stoiev opened this issue Nov 27, 2023 · 13 comments
Closed

Column(s) .. not found in data error on sample AWS CUR file #161

stoiev opened this issue Nov 27, 2023 · 13 comments

Comments

@stoiev
Copy link
Collaborator

stoiev commented Nov 27, 2023

I tried to convert the AWS example CUR file within the project, but I'm getting a Column(s) .. not found in data error.
Is there something I forgot to make the conversion work?

Here is the full command/output:

$ poetry run python3 -m focus_converter.main convert --provider aws --data-format csv --data-path tests/provider_config_tests/aws/sample-anonymous-aws-export-dataset.csv --export-path test 
╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ /home/ivan/Projects/focus_converters/focus_converter_base/focus_converter/main.py:65 in main     │
│                                                                                                  │
│    62 │   │   export_include_source_columns=export_include_source_columns,                       │
│    63 │   )                                                                                      │
│    64 │   converter.prepare_horizontal_conversion_plan(provider=provider)                        │
│ ❱  65 │   converter.convert()                                                                    │
│    66 │                                                                                          │
│    67 │   if validate:                                                                           │
│    68 │   │   for segment_file_name in os.listdir(export_path):                                  │
│                                                                                                  │
│ ╭─────────────────────────────────────────── locals ───────────────────────────────────────────╮ │
│ │                 column_prefix = '(None,)'                                                    │ │
│ │       converted_column_prefix = '((None,),)'                                                 │ │
│ │                     converter = <focus_converter.converter.FocusConverter object at          │ │
│ │                                 0x7f08cd572050>                                              │ │
│ │                   data_format = <DataFormats.CSV: 'csv'>                                     │ │
│ │                     data_path = 'tests/provider_config_tests/aws/sample-anonymous-aws-expor… │ │
│ │ export_include_source_columns = True                                                         │ │
│ │                   export_path = 'test'                                                       │ │
│ │           parquet_data_format = None                                                         │ │
│ │                      provider = 'aws'                                                        │ │
│ │                      validate = False                                                        │ │
│ ╰──────────────────────────────────────────────────────────────────────────────────────────────╯ │
│                                                                                                  │
│ /home/ivan/Projects/focus_converters/focus_converter_base/focus_converter/converter.py:317 in    │
│ convert                                                                                          │
│                                                                                                  │
│   314 │   │   │   if self.data_exporter:                                                         │
│   315 │   │   │   │   self.data_exporter.close()                                                 │
│   316 │   │   │                                                                                  │
│ ❱ 317 │   │   │   raise error                                                                    │
│   318                                                                                            │
│                                                                                                  │
│ ╭─────────────────────────────────────────── locals ───────────────────────────────────────────╮ │
│ │ error = ValueError("Column(s) 'bill_bill_type, bill_billing_period_end_date,                 │ │
│ │         bill_billing_period_start_date, bill_invoicing_entity, bill_payer_account_id,        │ │
│ │         line_item_availability_zone, line_item_currency_code,                                │ │
│ │         line_item_line_item_description, line_item_line_item_type, line_item_product_code,   │ │
│ │         line_item_resource_id, line_item_usage_amount, line_item_usage_end_date,             │ │
│ │         line_item_usage_start_date, pricing_public_on_demand_cost, pricing_rate_code,        │ │
│ │         pricing_rate_id, pricing_unit, product_region, product_sku,                          │ │
│ │         reservation_effective_cost, reservation_total_reserved_units,                        │ │
│ │         savings_plan_savings_plan_effective_cost' not found in data")                        │ │
│ │    lf = <LazyFrame [94 cols, {"identity/LineItemId": Utf8 …                                  │ │
│ │         "savingsPlan/RecurringCommitmentForBillingPeriod": Utf8}] at 0x7F08CD5BEA10>         │ │
│ │  self = <focus_converter.converter.FocusConverter object at 0x7f08cd572050>                  │ │
│ ╰──────────────────────────────────────────────────────────────────────────────────────────────╯ │
│                                                                                                  │
│ /home/ivan/Projects/focus_converters/focus_converter_base/focus_converter/converter.py:304 in    │
│ convert                                                                                          │
│                                                                                                  │
│   301 │   │                                                                                      │
│   302 │   │   for lf in self.data_loader.data_scanner():                                         │
│   303 │   │   │   try:                                                                           │
│ ❱ 304 │   │   │   │   lf = self.__process_lazy_frame__(lf=lf)                                    │
│   305 │   │   │   │   self.data_exporter.collect(                                                │
│   306 │   │   │   │   │   lf=lf, collected_columns=list(set(self.h_collected_columns))           │
│   307 │   │   │   │   )                                                                          │
│                                                                                                  │
│ ╭─────────────────────────────────────────── locals ───────────────────────────────────────────╮ │
│ │ error = ValueError("Column(s) 'bill_bill_type, bill_billing_period_end_date,                 │ │
│ │         bill_billing_period_start_date, bill_invoicing_entity, bill_payer_account_id,        │ │
│ │         line_item_availability_zone, line_item_currency_code,                                │ │
│ │         line_item_line_item_description, line_item_line_item_type, line_item_product_code,   │ │
│ │         line_item_resource_id, line_item_usage_amount, line_item_usage_end_date,             │ │
│ │         line_item_usage_start_date, pricing_public_on_demand_cost, pricing_rate_code,        │ │
│ │         pricing_rate_id, pricing_unit, product_region, product_sku,                          │ │
│ │         reservation_effective_cost, reservation_total_reserved_units,                        │ │
│ │         savings_plan_savings_plan_effective_cost' not found in data")                        │ │
│ │    lf = <LazyFrame [94 cols, {"identity/LineItemId": Utf8 …                                  │ │
│ │         "savingsPlan/RecurringCommitmentForBillingPeriod": Utf8}] at 0x7F08CD5BEA10>         │ │
│ │  self = <focus_converter.converter.FocusConverter object at 0x7f08cd572050>                  │ │
│ ╰──────────────────────────────────────────────────────────────────────────────────────────────╯ │
│                                                                                                  │
│ /home/ivan/Projects/focus_converters/focus_converter_base/focus_converter/converter.py:295 in    │
│ __process_lazy_frame__                                                                           │
│                                                                                                  │
│   292 │   │   lf = self.__deferred_column_plans__.apply_dtype_plan(lf=lf)                        │
│   293 │   │                                                                                      │
│   294 │   │   # validate all source columns exist in the lazy frame                              │
│ ❱ 295 │   │   self.__column_validator__.validate_lazy_frame_columns(lf=lf)                       │
│   296 │   │                                                                                      │
│   297 │   │   return self.apply_plan(lf=lf)                                                      │
│   298                                                                                            │
│                                                                                                  │
│ ╭─────────────────────────────────────────── locals ───────────────────────────────────────────╮ │
│ │   lf = <LazyFrame [104 cols, {"identity/LineItemId": Utf8 … "pricing_public_on_demand_rate": │ │
│ │        Float64}] at 0x7F08CD64F3A0>                                                          │ │
│ │ self = <focus_converter.converter.FocusConverter object at 0x7f08cd572050>                   │ │
│ ╰──────────────────────────────────────────────────────────────────────────────────────────────╯ │
│                                                                                                  │
│ /home/ivan/Projects/focus_converters/focus_converter_base/focus_converter/conversion_functions/v │
│ alidations.py:134 in validate_lazy_frame_columns                                                 │
│                                                                                                  │
│   131 │   │                                                                                      │
│   132 │   │   columns_missing = sorted(set(source_columns) - set(lf.columns))                    │
│   133 │   │   if columns_missing:                                                                │
│ ❱ 134 │   │   │   raise ValueError(                                                              │
│   135 │   │   │   │   f"Column(s) '{', '.join(columns_missing)}' not found in data"              │
│   136 │   │   │   )                                                                              │
│   137                                                                                            │
│                                                                                                  │
│ ╭─────────────────────────────────────────── locals ───────────────────────────────────────────╮ │
│ │ columns_missing = [                                                                          │ │
│ │                   │   'bill_bill_type',                                                      │ │
│ │                   │   'bill_billing_period_end_date',                                        │ │
│ │                   │   'bill_billing_period_start_date',                                      │ │
│ │                   │   'bill_invoicing_entity',                                               │ │
│ │                   │   'bill_payer_account_id',                                               │ │
│ │                   │   'line_item_availability_zone',                                         │ │
│ │                   │   'line_item_currency_code',                                             │ │
│ │                   │   'line_item_line_item_description',                                     │ │
│ │                   │   'line_item_line_item_type',                                            │ │
│ │                   │   'line_item_product_code',                                              │ │
│ │                   │   ... +13                                                                │ │
│ │                   ]                                                                          │ │
│ │              lf = <LazyFrame [104 cols, {"identity/LineItemId": Utf8 …                       │ │
│ │                   "pricing_public_on_demand_rate": Float64}] at 0x7F08CD64F3A0>              │ │
│ │            self = <focus_converter.conversion_functions.validations.ColumnValidator object   │ │
│ │                   at 0x7f08cd572770>                                                         │ │
│ │  source_columns = [                                                                          │ │
│ │                   │   'line_item_unblended_cost',                                            │ │
│ │                   │   'savings_plan_used_commitment',                                        │ │
│ │                   │   'savings_plan_total_commitment_to_date',                               │ │
│ │                   │   'reservation_unused_amortized_upfront_fee_for_billing_period',         │ │
│ │                   │   'reservation_unused_recurring_fee',                                    │ │
│ │                   │   'pricing_public_on_demand_rate',                                       │ │
│ │                   │   'line_item_line_item_type',                                            │ │
│ │                   │   'line_item_net_unblended_cost',                                        │ │
│ │                   │   'reservation_reservation_a_r_n',                                       │ │
│ │                   │   'reservation_reservation_arn',                                         │ │
│ │                   │   ... +23                                                                │ │
│ │                   ]                                                                          │ │
│ ╰──────────────────────────────────────────────────────────────────────────────────────────────╯ │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
ValueError: Column(s) 'bill_bill_type, bill_billing_period_end_date, bill_billing_period_start_date, bill_invoicing_entity, bill_payer_account_id, line_item_availability_zone,
line_item_currency_code, line_item_line_item_description, line_item_line_item_type, line_item_product_code, line_item_resource_id, line_item_usage_amount, 
line_item_usage_end_date, line_item_usage_start_date, pricing_public_on_demand_cost, pricing_rate_code, pricing_rate_id, pricing_unit, product_region, product_sku, 
reservation_effective_cost, reservation_total_reserved_units, savings_plan_savings_plan_effective_cost' not found in data
@stoiev stoiev changed the title Column(s) .. not found in data error on sample AWS CUR data Column(s) .. not found in data error on sample AWS CUR file Nov 27, 2023
@oll-davidschneider
Copy link
Contributor

I ran into the same issue when testing the tool on a real-world dataset before the anonymized sample was uploaded. I think the reason is that while the columns in the dataset are flattened, they do not fit the values expected by the tool, ie. bill/BillType should actually be bill_bill_type.
I was able to fix the error in my own dataset by renaming the columns using the following regex, then writing to a new CSV file and running the tool against that:

df = df.rename(columns=lambda x: re.sub('/', '_', x))  # replace separating slashes with uscores
df = df.rename(columns=lambda x: re.sub(r'(?<![\_A-Z])([A-Z])', r'_\1', x)  # uscores before capitals, not more than once
df = df.rename(columns=lambda x: x.lower())

On the anonymized sample, I still get an error: ValueError: Column(s) 'line_item_resource_id' not found in data. The test data does seem to actually be missing the line_item_resource_id column expected by focus_converter convert. By adding a dummy column to the dataset, I was able to silence the error:

df['line_item_resource_id'] = 'test'

Attached is the output.
db5db97d593b411ebe9d0777cb292d73-0.parquet.zip

@stoiev
Copy link
Collaborator Author

stoiev commented Nov 28, 2023

It really is a problem related to the AWS CUR input format.

I managed to make it work by choosing some specific attributes when creating the CUR:

image
image

The converter appears to be prepared to receive only parquet files from AWS, and with the "Include resource IDs" option turned on.

Would it be interesting to add this to the documentation? Or even increase the variability of possible inputs?

I could contribute if so.

@varunmittal91
Copy link
Collaborator

varunmittal91 commented Dec 6, 2023

Hi @stoiev, thank you for finding the fix, we realized this is an issue with the cur part of the converter, which is why I added https://github.com/finopsfoundation/focus_converters/blob/dev/focus_converter_base/focus_converter/conversion_configs/aws/0_dimension_dtypes_S001.yaml.

This plan has column names and types that the converter is expecting and if are not present can be added with NULL values but right data types so that the sql plans do not fail, would you see for your given example if you added an entry here if the pipeline works.

Hi @davidschneider2W, this plan that I pasted above essentially does the dataframe column add that you had to do prior to running the container. Could you also give it a try on the data source you have access to.

@stoiev
Copy link
Collaborator Author

stoiev commented Dec 10, 2023

Thanks for the clarificarion, @varunmittal91, but since all columns are different from Parquet to CSV format, @davidschneider2W will need to change all mappings column fields from category_column_name to category/ColumnName.

Since this seems a usual case (CSV seems more popular than parquet), would support to AWS CUR CSV file be within the scope of this project as a build-in feature?

If yes, we need to incorporate other AWS columns names in main code (by writing down new mappings folder or by adapting @davidschneider2W columns conversion behind the scenes).

If no, we could document on how to generate compatible AWS CUR files somewhere.

@varunmittal91
Copy link
Collaborator

Hi @stoiev, thank you for the feedback, CSV as you mentioned is popular and definitely part of the scope.

To fix this I think we need more plans specific to data format(CSV in this case) that can reduce the source data into a common format. This way we can extend support for more formats and things that might come in the future like CUR 2.0.

Also advantage of using polars here is that everything is lazy-eval so any such plan should not have a huge impact on the compute as well.

Do you have some ideas on how this can be bootstrapped?

@spriharani
Copy link
Collaborator

Hi @varunmittal91,

We are facing the same issue while converting AWS CUR.

Problem statement:

  • Most of AWS CUR (csv) have columns that have prefixes like identity/, bill/, and lineItem/, while AWS CUR (parquet) have columns like line_item_unblended_cost etc. But in AWS CUR 2.0, "/" been removed and it's in snake case. Ref: https://docs.aws.amazon.com/cur/latest/userguide/table-dictionary-cur2-bill.html.
  • Also AWS have its legacy report format - which is called DBR report (in csv only), where columes have different name from CUR type reports.

Proposed Solution::

  • Introduce an additional parameter, "--report-type," to the converter command line interface, allowing users to specify the type of report being processed (e.g., dbr, cur, cur2). This enhancement aims to provide flexibility in handling diverse report structures.
    python -m focus_converter.main convert --provider aws --report-type dbr/cur/cur2 --data-format parquet/csv
  • Develop distinct mapping files for each report type to accurately interpret and transform the columns based on the corresponding naming conventions.
  • Dynamically load the appropriate mapping file based on the specified "--report-type" parameter

@varunmittal91
Copy link
Collaborator

Thank you @spriharani, that sounds like a good idea.

I was wondering for us to get started and make some progress on the new conversion rules, would it make sense
for us to create a new provider for now for each of these variations and then we can find a way to reduce all formats
in a common scheme so that we can reuse some of the core rules.

So there could be a aws:cur-core, and then variations aws:cur2 and aws:cur. This way we will be eventually able to expand into --report-type. And the same would apply for CSV and parquet. I am curious to hear your feedback.

@stoiev
Copy link
Collaborator Author

stoiev commented Dec 12, 2023

I've changed actual AWS plans to process CSV cur files as a POC (changes can be seen here).

I could convert the example AWS CUR file that was added to the project, and compare it to a "parquet" version of same data (just changing columns names using @davidschneider2W logic), and it seems to produce same results.

Maybe it could help a future PR that allow multiple AWS CUR formats.

Some workarounds that I've to implement:

  • since table names with '/' character are not standard, I had to quote them on a non-uniform way to get it working (column names need be set without quotes, sql related plans variables need quotes.
  • AWS CUR CSV sample does not have lineItem/ResourceId field, and I could not let it "optional" in any way (tried to use apply_default_if_column_missing and sql with null default_value), but I always was blocked in validation step. The solution was simply remove this plan (not ideal).

Regarding the process:
First, I just used try-and-error to identify and convert plans to new column names. After first successful conversion, I noticed that some fields was keep with parquet convention, so I identified them and used this ref to choose new names. In sum: there are cases that my manual tests with sample AWS CUR CSV file could not emulate.

@spriharani
Copy link
Collaborator

spriharani commented Dec 12, 2023

Thanks @varunmittal91

I appreciate your insights, and I agree with your thoughts. It's great to have flexibility considering the various variations like aws:cur-core, aws:cur2, and aws:cur, among others.

To accommodate these variations, I've made some updates and introduced a new provider, aws:cur, along with its dedicated set of conversion rules. Changes can be found here

With these enhancements, we can now process AWS CUR using the following command:
python -m focus_converter.main convert --provider aws:cur ..
Let me know if you have any feedback or if there's anything else you'd like to discuss.

@stoiev
Copy link
Collaborator Author

stoiev commented Dec 13, 2023

Thanks for adding AWS CSV CUR support, @spriharani !

There is one more issue related to input date generation that affects either CSV and parquet: the optional ResourceId (as pointed on screenshots above).

Is there an easy way to let this field optional in original data file? I've tried some plans configurations without success.

@spriharani
Copy link
Collaborator

spriharani commented Dec 13, 2023

@stoiev
I tried to replicate your scenario and adding the column name lineItem/ResourceId in this file worked for me.
focus_converter_base/focus_converter/conversion_configs/aws-cur/0_dimension_dtypes_S001.yaml

Commited code is here
Please let me know if it helps.

@stoiev
Copy link
Collaborator Author

stoiev commented Dec 13, 2023

@stoiev I tried to replicate your scenario and adding the column name lineItem/ResourceId in this file worked for me. focus_converter_base/focus_converter/conversion_configs/aws-cur/0_dimension_dtypes_S001.yaml

Commited code is here Please let me know if it helps.

Great! Does it worth a PR? I think that resolves this issue completelly.

@spriharani
Copy link
Collaborator

@stoiev I tried to replicate your scenario and adding the column name lineItem/ResourceId in this file worked for me. focus_converter_base/focus_converter/conversion_configs/aws-cur/0_dimension_dtypes_S001.yaml
Commited code is here Please let me know if it helps.

Great! Does it worth a PR? I think that resolves this issue completelly.

Good to know that it resolved the issue. I have opened a PR for this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants