This documentation is also available in Russian.
The sftp-bq-integration module lets you automatically upload data from a regularly updated file on an SFTP server to Google BigQuery using Google Cloud Functions.
An HTTP POST request invokes a Cloud Function that gets the file from the SFTP server and uploads it to a BigQuery table. If the table already exists in the selected dataset, it will be overwritten.
- A Google Cloud Platform project with an activated billing account;
- Read access to the data source;
- WRITER access to the dataset and the Job User role for the Cloud Functions service account in the BigQuery project you are going to upload the table to (see the Access part of this doc);
- An HTTP client for POST requests invoking the Cloud function.
- Go to the Google Cloud Platform Console and sign in with your Google account, or sign up if you don't have one yet.
- Go to a project with billing activated, or create a new billing account for a project that doesn't have one.
- Go to Cloud Functions and click CREATE FUNCTION. Note that using Cloud Functions is billed according to this pricing.
- Fill in these fields:
  - Name: sftp-bq-integration, or any other name you see fit;
  - Memory allocated: 2 GB or less, depending on the file being processed;
  - Trigger: HTTP;
  - Source code: Inline editor;
  - Runtime: Python 3.X.
- Copy the contents of the main.py file into the inline editor, on the main.py tab.
- Copy the contents of the requirements.txt file into the inline editor, on the requirements.txt tab.
- As the Function to execute, specify sftp.
- In the Advanced options, set the Timeout to 540 seconds or less, depending on the file being processed.
- Finish creating the Cloud Function by clicking Create.
Acquire the username and password for the SFTP server where the file you have read access to is located. Make sure the file can be retrieved via the standard SFTP port 22.
If the created Cloud Function and the BigQuery project are located in the same Google Cloud Platform Console project, then you don’t need to take any additional actions.
If they are located in different projects, then:
- Go to Cloud Functions and click on the function you created to open the Function details.
- On the General tab, find the Service account field and copy the email from there.
- In Google Cloud Platform, go to IAM & admin > IAM and select the project you are going to upload the BigQuery table to.
- Click the +ADD button at the top to add a new member. Paste the service account email into the New members field, select the Job User role, and click Save.
- Go to your BigQuery dataset and share it with the service account email, granting WRITER access to the dataset.
The file you need to acquire from the SFTP server can have any appropriate extension: .json, .txt, .csv. However, it must be in the JSON (newline-delimited) or CSV (Comma-separated values) format.
The file schema is automatically defined in BigQuery.
For the DATE data type to be defined correctly, the values of the field must use the “-” delimiter and have the “YYYY-MM-DD” format.
For the TIMESTAMP data type to be defined correctly, the values of the field must use the “-” delimiter for the date and the “:” delimiter for time. The format must be “YYYY-MM-DD hh:mm:ss”. Here’s the list of possible timestamp formats.
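As a quick sanity check, the DATE and TIMESTAMP layouts described above can be verified with Python's `datetime.strptime` (a sketch only; BigQuery also accepts other timestamp variants):

```python
from datetime import datetime

def is_valid_date(value):
    """True if value matches the YYYY-MM-DD layout detected as DATE."""
    try:
        datetime.strptime(value, "%Y-%m-%d")
        return True
    except ValueError:
        return False

def is_valid_timestamp(value):
    """True if value matches the YYYY-MM-DD hh:mm:ss layout detected as TIMESTAMP."""
    try:
        datetime.strptime(value, "%Y-%m-%d %H:%M:%S")
        return True
    except ValueError:
        return False

print(is_valid_date("2024-01-31"))                # True
print(is_valid_timestamp("2024-01-31 23:59:59"))  # True
print(is_valid_date("31/01/2024"))                # False
```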
The JSON file contents must look as follows:
{"column1": "my_product" , "column2": 40.0}
{"column1": , "column2": 54.0}
…
The CSV file contents must look as follows:
column1,column2
my_product,40.0
,54.0
…
For the table schema to be defined correctly, the first line of the CSV file — the header — must contain only STRING values. In the rest of the lines, at least one column must contain numerical values.
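A minimal compliant CSV can be generated with the standard `csv` module (hypothetical column names; note the header row holds only strings, while the data rows include a numeric column):

```python
import csv
import io

# Build an in-memory CSV whose header contains only STRING values
# and whose data rows include numeric columns, as required above.
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["column1", "column2"])  # header: strings only
writer.writerow(["my_product", 40.0])    # data row with a numeric column
writer.writerow(["", 54.0])              # an empty value is allowed
print(buf.getvalue())
```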
- Go to Cloud Functions and click on the function you’ve created to open the Function details.
- On the Trigger tab, copy the URL address.
- Using an HTTP client, send a POST request to this URL address. The request body must be in the JSON format:
{
"sftp":
{
"user": "sftp.user_name",
"psswd": "sftp.password",
"path_to_file": "sftp://server_host/path/to/file/"
},
"bq":
{
"project_id": "my_bq_project",
"dataset_id": "my_bq_dataset",
"table_id": "my_bq_table",
"delimiter": ",",
"source_format": "CSV",
"location": "US"
}
}
Property name | Object | Description |
---|---|---|
Required properties | | |
user | sftp | Name of the user on the SFTP server who has read access. |
psswd | sftp | The user's password on the SFTP server. |
path_to_file | sftp | Full path to the file on the SFTP server. It must always look like this: "sftp://host/path/to/file/". |
project_id | bq | Name of the BigQuery project the table will be uploaded to. The project may differ from the one the Cloud Function was created in. |
dataset_id | bq | Name of the BigQuery dataset the table will be uploaded to. |
table_id | bq | Name of the BigQuery table the file from the SFTP server will be uploaded to. |
delimiter | bq | Field delimiter in the CSV file. The property is required if source_format is CSV. Supported delimiters: "," and "\|". |
source_format | bq | Format of the file uploaded to BigQuery. Supported formats: "NEWLINE_DELIMITED_JSON", "CSV". |
Optional properties | | |
location | bq | Geographical location of the table. Default: "US". |
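Before sending a request, the payload can be checked against the required properties listed above. A minimal sketch (not part of the module itself):

```python
# Required properties per object, taken from the table above.
REQUIRED = {
    "sftp": ["user", "psswd", "path_to_file"],
    "bq": ["project_id", "dataset_id", "table_id", "source_format"],
}

def missing_properties(payload):
    """Return a list of 'object.property' names absent from the request payload."""
    missing = []
    for obj, props in REQUIRED.items():
        section = payload.get(obj, {})
        for prop in props:
            if prop not in section:
                missing.append(f"{obj}.{prop}")
    # delimiter is required only for the CSV source format
    bq = payload.get("bq", {})
    if bq.get("source_format") == "CSV" and "delimiter" not in bq:
        missing.append("bq.delimiter")
    return missing

payload = {
    "sftp": {"user": "u", "psswd": "p", "path_to_file": "sftp://host/path/to/file/"},
    "bq": {"project_id": "pr", "dataset_id": "d", "table_id": "t",
           "source_format": "CSV", "delimiter": ","},
}
print(missing_properties(payload))  # []
```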
Each Cloud Function invocation is logged. You can view the logs in Google Cloud Platform:
- Go to Cloud Functions and click on the function you created to open the Function details.
- Click View logs in the top bar to see the latest logs at the Error and Critical levels.
Usually, these errors are caused by issues with SFTP server access, BigQuery access, or errors in the imported file.
- The size of the processed file must not exceed 2 GB.
- The Cloud Function invocation timeout must not exceed 540 seconds.
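The two limits above can be expressed as a small pre-flight check (illustrative helper, not part of the module):

```python
MAX_FILE_BYTES = 2 * 1024 ** 3   # 2 GB file-size limit
MAX_TIMEOUT_SECONDS = 540        # Cloud Functions timeout limit

def within_limits(file_size_bytes, timeout_seconds):
    """True if both values respect the documented limits."""
    return (file_size_bytes <= MAX_FILE_BYTES
            and 0 < timeout_seconds <= MAX_TIMEOUT_SECONDS)

print(within_limits(500 * 1024 ** 2, 540))  # True: 500 MB, max timeout
print(within_limits(3 * 1024 ** 3, 540))    # False: file too large
```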
Invoke the function via the Linux terminal:
curl -X POST https://REGION-PROJECT_ID.cloudfunctions.net/sftp/ -H "Content-Type: application/json" -d \
'{
"sftp":
{
"user": "sftp.user_name",
"psswd": "sftp.password",
"path_to_file": "sftp://server_host/path/to/file/"
},
"bq":
{
"project_id": "my_bq_project",
"dataset_id": "my_bq_dataset",
"table_id": "my_bq_table",
"delimiter": ",",
"source_format": "CSV",
"location": "US"
}
}'
from httplib2 import Http
import json

trigger_url = "https://REGION-PROJECT_ID.cloudfunctions.net/sftp/"
headers = {"Content-Type": "application/json; charset=UTF-8"}
payload = {
"sftp":
{
"user": "sftp.user_name",
"psswd": "sftp.password",
"path_to_file": "sftp://server_host/path/to/file/"
},
"bq":
{
"project_id": "my_bq_project",
"dataset_id": "my_bq_dataset",
"table_id": "my_bq_table",
"delimiter": ",",
"source_format": "CSV",
"location": "US"
}
}
response, content = Http().request(uri=trigger_url, method="POST", body=json.dumps(payload), headers=headers)
Paste this code with your parameters and launch the function:
function runstp() {
  var trigger_url = "https://REGION-PROJECT_ID.cloudfunctions.net/sftp/";
var payload = {
"sftp":
{
"user": "sftp.user_name",
"psswd": "sftp.password",
"path_to_file": "sftp://server_host/path/to/file/"
},
"bq":
{
"project_id": "my_bq_project",
"dataset_id": "my_bq_dataset",
"table_id": "my_bq_table",
"delimiter": ",",
"source_format": "CSV",
"location": "US"
}
};
var options = {
"method": "post",
"contentType": "application/json",
"payload": JSON.stringify(payload)
};
var request = UrlFetchApp.fetch(trigger_url, options);
}