
Observability delivery


Observability delivery (06.10.2023)

IMPORTANT NOTICE: The solution described here was developed by the Observability team and has been delivered to the FinOps team. The FinOps team owns the solution and has full responsibility for further development and bug fixes. The FinOps solution is therefore not part of the Observability solution, but we have chosen to include documentation of the solution design on our internal Wiki page in case we are asked about the solution we helped build in the future.

The solution is found in the S037 subscription in Azure.

Other relevant documentation on the solution:

  • End User Documentation has been sent to David Flood for use on Omnia Docs. It has also been stored on the Observability team's SharePoint site, in the FinOps folder.
  • The FinOps team does not have a GitHub repo or other CI/CD tools as of September 2023, so we store a copy of the code in our Observability repo as a backup. The copy of the Synapse code can be found here. IMPORTANT TO NOTE: This is just a copy and should not be considered production code. For production code, refer to the Synapse solution in production.
  • Technical documentation on the parts the Observability team has worked on can be found on this Wiki page.
  • The architecture drawing of the FinOps solution, including the components the Observability team has worked on, can be found in our Observability repo, here.

Background

Why was the solution made?

The FinOps team needed a solution to visualize costs related to different WBS codes, subscriptions and other parameters relevant for following up on Azure costs. David Flood in the FinOps team needed help to develop the solution in Synapse and Power BI, and Joakim Torsvik and Jack Chiu from the Observability team helped out during the summer of 2023.

Architecture drawing

Finops_architecture

LAST UPDATED 05.10.2023

As the drawing illustrates, there is currently only one environment (a production environment) for the FinOps solution.

The architecture drawing displays all components that the Observability team worked on from June 2023 to September 2023. However, not all components were created by the Observability team, as much of the architecture was already in place before the team started working on the solution.

How to access and make changes to the solution?

How to access the FinOps environment in Azure:

  • In order to access the Azure environment of the FinOps subscription S037, users must have received access to the subscription from its owner (David Flood).

  • When permissions have been granted, users can activate temporary access to the subscription through Privileged Identity Management (PIM).

    • Click on My Roles in the left-hand menu, then click on Azure Resources. A Contributor role should be available for the S037-Cost-Management-reservations subscription.
  • After activating the Contributor access, the user can go to the subscription. Observability has for the most part worked in Azure Synapse, which is located in a resource group within S037 named "synapse".

DevOps environment

At the time of writing, no environment has been set up for Git integration, version control or separate DevOps environments (dev, test, prod). Only the workspaces in the Power BI service have both test and production environments.

All changes are therefore made directly in production.

Data

Data Source

  • Daily cost-data from Microsoft
  • Monthly curated cost-data from Microsoft
  • Monthly Subscription-data from Solum (It may be replaced by tags data and SAP data.)
  • Monthly Application-data from ServiceNow
  • Azure AD Graph-data regarding users

(To be implemented)

  • SAP data from SAP SQL Warehouse

Sink Storage (Data Lake)

Data Formats

Currently, the team only works with structured data. The preferred data format in Azure Synapse is Apache Parquet, an open-source, column-oriented data storage format from the Apache Hadoop ecosystem, built to compress data efficiently and to use encoding schemes that perform well on large datasets.

Partitioning

On occasion we apply partitioning when datasets become too large to handle. When using the Apache Spark pools we can rely on Spark's automated partitioning, which decides the number of partitions based on file size and which columns to partition by based on the data distribution. Manually set partitioning is explained in the respective Python notebooks; a minimal example is sketched below.
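
As a rough illustration, a manual partitioning step in one of the Spark notebooks could look like the sketch below. The storage paths, partition count and column name are placeholders, not the values used in the production notebooks.

```python
# Minimal sketch of manual partitioning in a Synapse Spark notebook.
# Storage paths, the partition count and the column name are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

cost_df = spark.read.parquet(
    "abfss://curated@storageaccount.dfs.core.windows.net/cost/amortized/"
)

# Explicitly control the number of partitions when the dataset grows too large
# for Spark's automatic choice to perform well.
cost_df = cost_df.repartition(64)

# Write back partitioned by a date column so downstream reads can prune
# partitions instead of scanning the whole dataset.
(
    cost_df.write
    .mode("overwrite")
    .partitionBy("UsageDate")
    .parquet("abfss://curated@storageaccount.dfs.core.windows.net/cost/amortized_partitioned/")
)
```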

Slowly Changing Dimensions (SCD)

At the time of writing there is no plan in place for SCD. All notebook changes are either overwritten or applied in new, versioned notebooks.

Backend solution

ADF orchestrators/pipelines in Synapse

Run Notebook on Previous Dates_v1: This pipeline triggers the Python notebook with the same name. It has no triggers attached, scheduled or otherwise.

Microsoft Graph API/Azure AD Users_v1: This pipeline first copies data from the Graph API into the ADLS, then runs the two transformation notebooks in sequence to map business levels to users. This pipeline is triggered to run once a month.

Management API (New)/Convert Monthly CSV Cost Files to Parquet: This pipeline gets the latest Actual Cost and Amortized Cost files, converts them from CSV to Parquet, and then runs the "Monthly Extend" transformation notebook. This pipeline is triggered to run once a month, on the 5th day of the month.
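
A minimal sketch of the CSV-to-Parquet step is shown below, assuming illustrative file paths; the production pipeline resolves the latest Actual and Amortized Cost files itself.

```python
# Sketch: convert a monthly cost CSV file to Parquet in a Spark notebook.
# Paths are illustrative placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

csv_path = "abfss://raw@storageaccount.dfs.core.windows.net/cost/ActualCost_2023-09.csv"
parquet_path = "abfss://curated@storageaccount.dfs.core.windows.net/cost/ActualCost_2023-09/"

# Read the CSV with a header row and let Spark infer the column types.
df = spark.read.csv(csv_path, header=True, inferSchema=True)

# Persist as Parquet so downstream notebooks get compressed, columnar input.
df.write.mode("overwrite").parquet(parquet_path)
```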

Daily Cleanup: This pipeline deletes the old daily cost data that we receive from Microsoft and imports the newest data available. After this, it runs the "Daily Extend AI column and WBS tags" notebook. This pipeline is triggered to run every day at 4am.

Update PBI Datasets and Run AnomalyDetection_v1: This pipeline runs three notebooks. It runs the Prod_Extend_Amortized_Cost-Oneyear and Prod_Extend_Amortized_Cost-Threeyears notebooks simultaneously, and when the OneYear notebook is complete, it runs the Anomaly Detection notebook. This pipeline is triggered to run every day at 5am.

Notebooks in Synapse

Anomaly_Detection_v1: This notebook runs an anomaly detection model on the last 90 days of data to find unusually high costs attributed to subscriptions, resource groups, Azure services and WBS codes.

Run Notebook on Previous Dates_v1: Reruns the "Monthly Extend.._v2" notebook on all historical data, so that all historical data follows the new schema that the Observability team implemented in the "Monthly Extend" notebook.

Monthly Extend AI columns and Tags_v2: v1 was written with Pandas, whereas v2 is written with PySpark, and additional functionality has been added. This notebook transforms the cost data by extending two JSON columns, Additional Info (AI) and Tags. It also adds WBS codes to all costs and records why the specific WBS was applied.
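
The sketch below shows roughly how JSON columns such as Additional Info and Tags can be extended into ordinary columns with PySpark. The column names, JSON keys and fallback logic are illustrative assumptions; the production notebook handles more fields and the full WBS logic.

```python
# Illustrative sketch of extending JSON columns in PySpark.
# Column names, JSON keys and the fallback WBS value are assumptions.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

cost_df = spark.read.parquet(
    "abfss://curated@storageaccount.dfs.core.windows.net/cost/monthly/"
)

extended_df = (
    cost_df
    # Pull individual keys out of the JSON strings into their own columns.
    .withColumn("MeterSubCategory", F.get_json_object("AdditionalInfo", "$.MeterSubCategory"))
    .withColumn("WBS_Tag", F.get_json_object("Tags", "$.wbs"))
    # Use the tag when present, otherwise fall back, and record why the WBS was applied.
    .withColumn("WBS", F.coalesce(F.col("WBS_Tag"), F.lit("UNKNOWN")))
    .withColumn(
        "WBS_Source",
        F.when(F.col("WBS_Tag").isNotNull(), F.lit("tag")).otherwise(F.lit("fallback")),
    )
)
```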

Daily Extend AI columns and Tags_v1: This notebook is the same as Monthly Extend, with the small difference that it reads CSV files rather than Parquet.

Prod_Extend_Amortized_Cost-Oneyear: This notebook accumulates the outputs generated by the Monthly Extend notebooks over a one-year timeframe and maintains all columns.

Prod_Extend_Amortized_Cost-Threeyears: This notebook accumulates the outputs generated by the Monthly Extend notebooks over a three-year timeframe and aggregates various valuable columns in relation to the cost field.

Prod_import_most_recent: Identifies the newest file in the file system. This is useful for notebooks that need to fetch the latest raw file from a collection of similarly named files, each with a distinct timestamp.
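
A possible way to do this in a Synapse notebook is sketched below, assuming the timestamp in the file name sorts lexicographically; the folder path is a placeholder.

```python
# Sketch: pick the newest file from a folder of similarly named, timestamped files.
# Assumes the timestamp in the file name sorts lexicographically; the path is a placeholder.
from notebookutils import mssparkutils

folder = "abfss://raw@storageaccount.dfs.core.windows.net/cost/daily/"

# List the folder and keep only files (not sub-directories).
files = [f for f in mssparkutils.fs.ls(folder) if f.isFile]

# With timestamped names, the lexicographically largest name is the newest file.
newest = max(files, key=lambda f: f.name)
print(newest.path)
```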

Prod_ServiceNowApplication: This notebook retrieves application information from ServiceNow.

Prod_subscription_json: Converts the Subscription-file from json into Parquet.

Anomaly Detection Model

In order to identify when costs are unusually high, we need a model for detecting increased cost. To do this we have implemented a Moving Average model, which calculates the average cost over a given window period.

$$ MA(n) = \frac{\sum_{i=1}^{n} c_i}{n} $$

The Moving Average (MA) is determined by the window size (n). In order to determine which costs are outliers / anomalies we need to set a threshold. This is determined by a moving standard deviation:

$$ Mstd(n) = \sqrt{\frac{\sum_{i=1}^n (c_i - MA(n))^2}{n}} $$

The detection model uses a scaling function to lower the threshold as costs increase. The function was determined by curve fitting three target scales at different cost levels:

  • Scale of 3 when costs are NOK 1.000,-
  • Scale of 2 when costs are NOK 10.000,-
  • Scale of 1 when costs are NOK 100.000,-

NB! Although these were the target scales, the resulting scales are slightly different.

The parameters are set by the curve fitting algorithm in scipy.optimize, which finds the parameters (a and b) of the function

$$ f(x) = a * x^{-b} $$

where f(x) is the scale and x is the cost. The fitted scaling function is:

$$ Scaler(c) = 13.6 * c^{-0.22} $$
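
For reference, a small sketch of how such a fit could be produced with scipy.optimize.curve_fit, using the three target (cost, scale) pairs listed above:

```python
# Sketch: fit the scaling function f(x) = a * x^(-b) to the three target scales.
import numpy as np
from scipy.optimize import curve_fit

def scale_fn(x, a, b):
    # x is the cost, f(x) is the scale applied to the moving standard deviation.
    return a * np.power(x, -b)

costs = np.array([1_000.0, 10_000.0, 100_000.0])   # NOK
target_scales = np.array([3.0, 2.0, 1.0])

(a, b), _ = curve_fit(scale_fn, costs, target_scales, p0=(10.0, 0.2))
print(a, b)  # roughly the a = 13.6 and b = 0.22 used in Scaler(c) above
```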

This sets a high threshold when costs are low and a low threshold when costs are high. For example: when costs are NOK 100,- with a standard deviation of 20%, an anomalous cost of NOK 200,- won't be very damaging for the company. However, when the daily costs are NOK 100.000,- with the same relative standard deviation, the potential daily loss of NOK 20.000,- has a much greater impact on the bottom line. The code is described in more detail inside the anomaly detection algorithm in Synapse.

$$ Threshold = MA(n) + (Mstd(n) * Scaler(c)) $$
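
Putting the formulas together, a minimal pandas sketch of the threshold calculation could look like the following; the window size, column names and sample data are illustrative only, and the production logic lives in the Anomaly_Detection_v1 notebook.

```python
# Illustrative pandas version of the moving average, moving standard deviation
# and threshold defined above. Window size and data are made up for the example.
import pandas as pd

window = 7  # n, the window size in days

daily = pd.DataFrame({"cost": [900, 1100, 950, 1000, 1050, 980, 5000]})

daily["ma"] = daily["cost"].rolling(window, min_periods=1).mean()
# ddof=0 matches the population standard deviation used in the Mstd(n) formula.
daily["mstd"] = daily["cost"].rolling(window, min_periods=1).std(ddof=0)
daily["scale"] = 13.6 * daily["cost"].pow(-0.22)
daily["threshold"] = daily["ma"] + daily["mstd"] * daily["scale"]
daily["is_anomaly"] = daily["cost"] > daily["threshold"]
```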

Notification of Anomalies

After identifying unusually high costs with the anomaly detection algorithm, we need a notification system to alert relevant owners and stakeholders of the unusually high cost that has been incurred. To do this we have set a number of conditions for which anomalies to alert on, filtered out irrelevant data, and then merged the anomaly data with the relevant stakeholders from the Subscription.json file from Solum.

Anomaly Alerting Conditions

The conditions for alerting stakeholders of high costs are as follows.

  • An anomaly has occurred for three straight days (n, n-1 and n-2, where n is the current day). This is done in order to only alert on recent, recurring high costs.
  • The daily costs are at least NOK 10.000,-. In Azure, many services incur low costs (e.g. 1, 10 or 100 NOK per day). These costs are insignificant in terms of Equinor cost management, so we only want to alert on unusual increases where the daily costs are already significant; we therefore set a lower limit and only alert where costs are at least NOK 10.000,-.
  • The anomaly in question occurred no earlier than yesterday's date. The daily data we receive from Microsoft covers the current month up until yesterday. We don't want to send notifications on old anomalies because they are no longer relevant, so the anomaly must have occurred on yesterday's date at the earliest.

If all of these conditions are met, we send an alert to the owner of the subscription. We currently use the subscription.json files from Solum to determine which owner is attached to the subscription; however, this dataset will be replaced by a more reliable source in the future.
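
As a rough sketch, the three conditions and the owner lookup could be expressed in pandas as below. The column names and the merge key are assumptions for illustration; the production notebook works on the output of the anomaly detection step and the Subscription.json data from Solum.

```python
# Sketch of the alerting conditions as a pandas filter.
# Column names and the merge key are illustrative assumptions.
import pandas as pd

def select_alerts(anomalies: pd.DataFrame, subscriptions: pd.DataFrame,
                  today: pd.Timestamp) -> pd.DataFrame:
    yesterday = today - pd.Timedelta(days=1)

    # Condition 1: anomalies on three consecutive days (n, n-1 and n-2).
    last_three_days = anomalies[anomalies["date"] >= today - pd.Timedelta(days=2)]
    days_per_sub = last_three_days.groupby("subscription_id")["date"].nunique()
    three_day_subs = days_per_sub[days_per_sub >= 3].index

    candidates = anomalies[
        anomalies["subscription_id"].isin(three_day_subs)
        # Condition 2: only alert when the daily cost is already significant.
        & (anomalies["daily_cost_nok"] >= 10_000)
        # Condition 3: the anomaly is recent (yesterday's date at the earliest).
        & (anomalies["date"] >= yesterday)
    ]

    # Attach the subscription owner from the Solum subscription data.
    return candidates.merge(
        subscriptions[["subscription_id", "owner_email"]],
        on="subscription_id", how="left",
    )
```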

Applying Azure Logic Apps to Send E-mail Notifications

In order to contact the relevant stakeholders, we use Azure Logic Apps to read the data processed by the anomaly detection algorithm and then send an e-mail for each service in each subscription where anomalies have occurred.

Technologies

At the time of writing, only Azure Synapse, Power BI and an Azure Storage Account are being used.

Azure Synapse Analytics

Azure Synapse is a cloud-based analytics service by Microsoft that integrates big data and data warehousing capabilities. It allows organizations to analyze and process large volumes of data from various sources. Azure Synapse provides a unified platform for data storage, data preparation, data exploration, and advanced analytics.

In the current set-up, FinOps uses a Data Lake outside of the Synapse environment, but inside the same resource group, to store their data. It is possible to access the storage account from the Storage Explorer and Azure Portal, but not directly from Synapse. As seen in the architecture image, FinOps uses Synapse to transform data and store it in a SQL Database so that Power BI can ingest the data.

Azure Logic Apps

(still in testing)

Azure Logic Apps is an Azure service that enables users to design and automate workflows and integrate various applications and services seamlessly. It provides a visual designer for creating workflows, supports a wide range of connectors to interact with different systems, and offers robust monitoring and management capabilities, making it a powerful tool for streamlining business processes and improving productivity in the Azure ecosystem. Logic Apps can be used for tasks like data integration, notification automation, and event-driven workflows, making it a versatile solution for various enterprise scenarios.

The scenario where FinOps will use Logic Apps is related to Anomaly Detection, where it is used to automatically send out e-mails to stakeholders when costs increase exponentially.

Frontend solution (Power BI)

We have harnessed the PowerBI service to effectively visualize Azure Cost data. The primary advantages of using PowerBI for visualizing large datasets lie in its robust data processing capabilities and intuitive, interactive visualization tools, enabling us to gain valuable insights with ease and efficiency.

Workspaces and applications

Workspaces:

Only the FinOps workspace possesses a premium license that can support PowerBI reports with large datasets.

Application for end users:

  • There exists one application for end users, named "FinOps", displaying reports from the FinOps workspace. Link to the FinOps Application.

Access control:

  • Reports are tailored for four distinct persona groups. One access group will be needed for the 3 main reports, i.e. WBS owner, subscription/product owner and executive/management. The internal FinOps report will be separated from the 3 main reports. To enable this access group, we may need to register our FinOps PBI application in Accessit so that end users can apply for access to the application through Accessit. The specifics will be established at a later time.

Reports and datasets

Dataset:

  • FinOps_DataFoundamental

    Fact tables:

    • ACMMonthlyAmortizedCost_Detailed_Oneyear: The detailed amortized cost data for a one-year period.
    • ACMMonthlyAmortizedCost_overview_Threeyear: The aggregated amortized cost data for a three-year period.

    Dimension tables:

    • BusinessAreaLevel: Optimized Azure AD data used, for example, to display which business area level a WBS owner belongs to.
    • Subscription: Provides data on the WBS, WBS owner, product owner and technical owner, sourced from the Solum team.
    • ServiceNowApplication: Application data sourced from ServiceNow and collected through the ServiceNow API.

Reports:

  • Report_For_Executive_Management_Prod
  • Report_For_FinOps_team
  • Report_For_SubscriptionOwnerAndProductOwner_Prod
  • Report_For_WBS_Owner_Prod

Methods Adopted for the Project:

  • The dataset we're working with comprises millions of rows and can extend up to 100 columns. To manage such a vast dataset while ensuring optimal performance in Power BI reports, we have adopted the hybrid data model in Power BI, which is elaborated upon in subsequent sections. A notable limitation that shapes our reporting approach is the unavailability of direct dataset downloads from the Power BI service. While you can download reports, this is restricted to those relying on a live online data connection (.pbix), which means that backend modifications in Power BI Desktop for these reports are not feasible. The reports are displayed in the Power BI service and all utilize a common dataset (FinOps_DataFoundamental). To manage this situation effectively, refer to the latest version of the Power BI dataset (.pbix) located in the FinOps channel files on Teams. Should you wish to modify the Power BI dataset, always use this version, implement your changes, and then publish it to the Power BI service. After confirming your changes, ensure that the dataset in the FinOps channel is updated accordingly.

Reload of the dataset

  • Microsoft supplies cost data monthly, and the other relevant data also only require a monthly update. Therefore, once the reports are finalized, we update the entire dataset in the Power BI service on a monthly basis. Typically, we receive the cost data on the fourth of each month and update the dataset on the fifth. This refresh has been scheduled, and the configuration details can be viewed in the Power BI service. No manual refresh from Power BI Desktop (Data_Foundamental) is needed unless the dataset schema has been altered.

  • When updating the Power BI dataset (.pbix) in Power BI Desktop, ensure that the data relationships are correctly configured and that incremental refresh is enabled.

Special features being used

To optimize performance for the frontend reports, specific features have been implemented. Without these features activated, report opening times could exceed 5 minutes, and refresh rates might reach up to 4 hours.

Large dataset storage format:

  • This setting was enabled for the dataset to speed up the response time of visualizations in the Power BI reports. The setting is found when accessing Power BI Service --> Workspace --> Go to the settings for the dataset --> Enable Large dataset storage format.

Query Caching:

  • In order to further speed up the response time of the visualizations in the Power BI service, we enabled the Query Caching setting. The setting is found when accessing Power BI Service --> Workspace --> Go to the settings for the dataset --> Enable Query Caching.

Hybrid data model:

  • A hybrid data model with incremental refresh in PowerBI refers to a setup where you combine both DirectQuery and Import data connection modes. With DirectQuery, data remains in the source location and is queried in real-time, whereas with Import mode, data is imported into PowerBI's in-memory storage. Using a hybrid model allows you to leverage the advantages of both methods, such as real-time access to critical data and faster performance for frequently accessed data. This approach provides flexibility and efficiency, catering to specific needs and scenarios in report building and analysis.

    • To implement this feature in PowerBI:

      1. Create two date parameters to indicate the start and end dates in Power Query. (Prerequisite for the hybrid data model)
      2. Implement incremental refresh for your fact tables, which involves setting the archive and incremental refresh ranges.
      3. Set all dimension tables to dual mode, while ensuring fact tables are in import mode. (Fundamental for the hybrid data model)

      Regarding the second point: if you find that the Power BI reports are already performing well, it's advised to select "Get the latest data in real time with DirectQuery (Premium only)" within the incremental refresh options. This lets you display visuals with real-time data in the reports.

      For detailed instructions, please refer to this link. You may need to adjust a few things based on your Power BI version and your Power BI license.

How to make changes to the Power BI solution

Utilizing a hybrid data model coupled with an incremental refresh has introduced certain limitations to our current workflow. Specifically, once reports, inclusive of their datasets, are published to the PowerBI service, they cannot be directly downloaded as mentioned earlier. If you wish to modify the PowerBI solutions, here's a structured guide based on the nature of your changes:

Scenario 1: Modifying the Reports without Altering the Core Dataset

  • Navigate to the FinOps workspace.
  • Enter the report you want to alter and click on "File".
  • Choose "Download this file" and select "A copy of your report with a live connection to the data online". (This is the only option available for reports based on the hybrid data model.)
  • Open the downloaded report on your local device and implement the necessary changes. Ensure the report's name remains consistent with the one in the Power BI service for seamless integration.
  • Publish the report to the FinOps workspace once you have completed your modifications.

Things to note:

  • The data model settings in the FinOps workspace that allow workspace users to edit data models in the Power BI service are not supported for reports based on incremental refresh.
  • You can directly modify data within the Power BI service. However, this method isn't always reliable, as there's a possibility that your changes might not save correctly.

Scenario 2: Updating the Core Dataset

This involves changes to measures, PowerQuery transformations, and refreshing the data in light of backend schema alterations.

  • Navigate to "Files" FinOps channel in Teams and download the file named "FinOps_DataFoundametnal.pbix".
  • Open the file in your local device and make the changes to the dataset either in PowerQuery or in data model.
  • Publish the file to FinOps workspace once you have completed your modify. Remember to refresh the dataset before publishing.
  • Upon publishing, the dataset in the PowerBI service will initiate its refresh process. Please note that this initial refresh may take several hours to complete.

Please choose the scenario that fits your requirements and proceed with the respective steps.

A few things to be considered:

  • We have the FinOps workspace as the production workspace and the FinOps testing workspace as the development/testing workspace. Only the production workspace has a premium license.
  • Datasets are sourced from the Synapse SQL database using SQL Server connections, captured as snapshots.

Other important aspects related to the solution

GDPR

GDPR has not yet been addressed by the Observability team in their work on the FinOps solution. It will be addressed as part of the work on the architecture contract (David Flood has initiated and is responsible for this process).

Secrets/keys

All secrets are stored in the Azure Key Vault "ACM-Toolkit-kv" in the resource group "ACM-Toolkit" in subscription "S037". Pipelines created by Observability use secrets from this Key Vault, though it is not Observability's responsibility to create and manage secrets.