Data streaming is part of a data engineer's job. In this project, I am going to build a simple, highly scalable data streaming pipeline using Python. Data streaming is the process of transmitting a continuous flow of data.
One side of the pipeline has at least one data producer continuously generating data, and the other side has at least one data consumer continuously consuming it.
I will be using Redis as the data pipeline, a simple Scrapy-based scraping microservice as an independent data producer, and a separate microservice as the data consumer.
Install Redis and run it locally.
Clone the repository.
git clone git@github.com:chilubagh/scalable-streaming-data-pipline.git
Install the requirements.
pip install -r requirements.txt
You are good to go!
Quick start
Start the producer quotes_spider:
cd producer
scrapy crawl quotes

Start the consumer quotes_consumer:
cd consumer
python quotes_consumer.py
I am building a simple Python project with Scrapy in a virtual environment. I run the command below to create an empty Scrapy project.
scrapy startproject producer
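The spider itself is not shown here, so below is a minimal sketch of what quotes_spider could look like, assuming the quotes.toscrape.com site from the Scrapy tutorial; the URL, file path, and CSS selectors are assumptions, not taken from the repository.

# producer/producer/spiders/quotes_spider.py
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ["https://quotes.toscrape.com/"]

    def parse(self, response):
        # Yield one item per quote on the page
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
                "tags": quote.css("div.tags a.tag::text").getall(),
            }
        # Follow pagination so the producer keeps emitting data
        next_page = response.css("li.next a::attr(href)").get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)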
Our data producer side is ready, but we want to put the scraped data into a data pipeline instead of a file. Before we can do that, we need to build the pipeline itself.
With Redis installed and running, we create wrappers around the Redis functions to make them friendlier to use. Let's start by creating a directory named pipeline in the project root and adding a new file, redis_client.py, to it. This client implements a FIFO queue for the pipeline.
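Here is a minimal sketch of what redis_client.py could look like, assuming the redis-py package and a plain Redis list as the queue; the class and method names (RedisClient, push, pop) are illustrative, not taken from the repository. RPUSH at the tail plus BLPOP at the head gives the FIFO ordering described above.

# pipeline/redis_client.py
import json

import redis

DATA_PIPELINE_KEY = "DATA-PIPELINE-KEY"


class RedisClient:
    """Thin wrapper that treats a Redis list as a FIFO queue."""

    def __init__(self, host="localhost", port=6379):
        self.connection = redis.Redis(host=host, port=port, decode_responses=True)

    def push(self, item):
        # RPUSH appends the serialized item to the tail of the list
        self.connection.rpush(DATA_PIPELINE_KEY, json.dumps(item))

    def pop(self, timeout=0):
        # BLPOP blocks until an item is available at the head of the list,
        # so tail-in/head-out gives FIFO ordering
        result = self.connection.blpop(DATA_PIPELINE_KEY, timeout=timeout)
        if result is not None:
            _, value = result
            return json.loads(value)
        return None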
With the pipeline created, we can start feeding it data from the producer side. For that, we create an item pipeline in Scrapy that adds every scraped item to Redis so we can consume it later; the code goes into the pipelines.py file of the Scrapy project. This starts sending data to Redis. To verify, open redis-cli and run LLEN 'DATA-PIPELINE-KEY' to see the number of quotes currently in the data pipeline.
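A sketch of such an item pipeline is shown below, reusing the hypothetical RedisClient from above; the class name and import path are assumptions, and they presume the project root (containing the pipeline directory) is on PYTHONPATH.

# producer/producer/pipelines.py
from pipeline.redis_client import RedisClient


class RedisPipeline:
    """Scrapy item pipeline that pushes every scraped item to Redis."""

    def open_spider(self, spider):
        self.client = RedisClient()

    def process_item(self, item, spider):
        # Push a plain dict so it can be JSON-serialized, then let
        # the item continue through any remaining pipelines
        self.client.push(dict(item))
        return item

Scrapy only runs a pipeline once it is enabled, so we also register it in the ITEM_PIPELINES setting of settings.py, for example: ITEM_PIPELINES = {"producer.pipelines.RedisPipeline": 300}.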
With a pipeline and a producer in place that keeps pushing data independently of consumption, we are more than halfway through; all that remains to call this a project is to read data from the pipeline and consume it according to our needs. We create a new directory named consumer in the root and add a new file, quotes_consumer.py, to it. After this step, we can run the Scrapy spider and the consumer independently, which lets us stream data at high speed because production and consumption are decoupled.
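A minimal sketch of quotes_consumer.py could look like the following, again reusing the hypothetical RedisClient; it blocks on the pipeline and prints each quote as it arrives.

# consumer/quotes_consumer.py
from pipeline.redis_client import RedisClient


def main():
    client = RedisClient()
    print("Waiting for quotes...")
    while True:
        # pop() blocks (BLPOP with timeout=0) until the producer pushes a quote
        quote = client.pop()
        if quote is not None:
            print(f"{quote['author']}: {quote['text']}")


if __name__ == "__main__":
    main()

Because the consumer only talks to Redis, you can run several copies of it in parallel; BLPOP hands each item to exactly one consumer, which is what lets the pipeline scale horizontally.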