dataEngineering
Price Crawler - Tracking Price Inflation
Learn how to design, develop, deploy and iterate on production-grade ML applications.
Data Engineering pipeline hosted entirely in the AWS ecosystem utilizing DocumentDB as the database
An end-to-end GoodReads Data Pipeline for Building Data Lake, Data Warehouse and Analytics Platform.
Columnar storage extension for Postgres built as a foreign data wrapper. Check out https://github.com/citusdata/citus for a modernized columnar storage implementation built as a table access method.
A cross-tenant, metadata-driven processing framework for Azure Data Factory and Azure Synapse Analytics, achieved by coupling orchestration pipelines with a SQL database and a set of Azure Functions.
Always know what to expect from your data.
Deequ is a library built on top of Apache Spark for defining "unit tests for data", which measure data quality in large datasets.
A framework for moving data into a data warehouse.
Metadata Driven Development (m3d) is a cloud and platform agnostic framework for the automated creation, management and governance of data lakes.
Project files for the post: Running PySpark Applications on Amazon EMR using Apache Airflow: Using the new Amazon Managed Workflows for Apache Airflow (MWAA) on AWS.
Guides and docs to help you get up and running with Apache Airflow.
My Insight Data Engineering Fellowship project. I implemented a big data processing pipeline based on lambda architecture that aggregates Twitter and US stock market data for user sentiment anal…
Project files for the post: Running PySpark Applications on Amazon EMR: Methods for Interacting with PySpark on Amazon Elastic MapReduce.
Example repo for creating end-to-end tests for a data pipeline.