This documentation is also available in Russian.
The sftp-bq-integration module lets you automatically upload data from a regularly updated file on an SFTP server to Google BigQuery using Google Cloud Functions.
An HTTP POST request invokes a Cloud Function that gets the file from the SFTP server and uploads it to a BigQuery table. If the table already exists in the selected dataset, it will be overwritten.
- A Google Cloud Platform project with an activated billing account;
- Read access to the data source;
- WRITER access to the dataset and the Job User role for the Cloud Functions service account in the BigQuery project you are going to upload the table to (see the Access part of this doc);
- An HTTP client for POST requests invoking the Cloud function.
- Go to the Google Cloud Platform Console and sign in with your Google account, or sign up if you don't have one yet.
- Go to a project with billing activated, or create a new billing account for a project that doesn't have one.
- Go to Cloud Functions and click CREATE FUNCTION. Note that using Cloud Functions is billed according to this pricing.
- Fill in these fields:
  - Name: sftp-bq-integration, or any other name you see fit;
  - Memory allocated: 2 GB or less, depending on the file being processed;
  - Trigger: HTTP;
  - Source code: Inline editor;
  - Runtime: Python 3.X.
- Copy the contents of the main.py file into the inline editor, on the main.py tab.
- Copy the contents of the requirements.txt file into the inline editor, on the requirements.txt tab.
- As the Function to execute, specify sftp.
- In the Advanced options, set the Timeout to 540 seconds or less, depending on the file being processed.
- Finish creating the Cloud Function by clicking Create.
Acquire the username and password for the SFTP server where the file you have read access to is located. Make sure the file can be retrieved via the standard SFTP port 22.
If the created Cloud Function and the BigQuery project are located in the same Google Cloud Platform Console project, then you don’t need to take any additional actions.
If they are located in different projects, then:
- Go to Cloud Functions and click on the function you created to open the Function details.
- On the General tab, find the Service account field and copy the email from there.
- In Google Cloud Platform, go to IAM & admin > IAM and select the project you are going to upload the BigQuery table to.
- Click the +ADD button at the top to add a new member. Paste the service account email into the New members field, select the Job User role, and click Save.
- Go to your BigQuery dataset and share it with the service account email, granting WRITER access to the dataset.
The file you need to acquire from the SFTP server can have any appropriate extension: .json, .txt, .csv. However, it must be in the JSON (newline-delimited) or CSV (Comma-separated values) format.
The file schema is automatically defined in BigQuery.
For the DATE data type to be defined correctly, the values of the field must use the “-” delimiter and have the “YYYY-MM-DD” format.
For the TIMESTAMP data type to be defined correctly, the values of the field must use the “-” delimiter for the date and the “:” delimiter for time. The format must be “YYYY-MM-DD hh:mm:ss”. Here’s the list of possible timestamp formats.
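As a quick sanity check, the DATE and TIMESTAMP layouts described above can be verified with Python's `datetime.strptime` (a sketch only; BigQuery also accepts other timestamp variants):

```python
from datetime import datetime

def is_valid_date(value):
    """True if value matches the YYYY-MM-DD layout detected as DATE."""
    try:
        datetime.strptime(value, "%Y-%m-%d")
        return True
    except ValueError:
        return False

def is_valid_timestamp(value):
    """True if value matches the YYYY-MM-DD hh:mm:ss layout detected as TIMESTAMP."""
    try:
        datetime.strptime(value, "%Y-%m-%d %H:%M:%S")
        return True
    except ValueError:
        return False

print(is_valid_date("2024-01-31"))                # True
print(is_valid_timestamp("2024-01-31 23:59:59"))  # True
print(is_valid_date("31/01/2024"))                # False
```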
The JSON file contents must look as follows:
{"column1": "my_product" , "column2": 40.0}
{"column1": , "column2": 54.0}
…
The CSV file contents must look as follows:
column1,column2
my_product,40.0
,54.0
…
For the table schema to be defined correctly, the first line of the CSV file — the header — must contain only STRING values. In the rest of the lines, at least one column must contain numerical values.
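A minimal compliant CSV can be generated with the standard `csv` module (hypothetical column names; note the header row holds only strings, while the data rows include a numeric column):

```python
import csv
import io

# Build an in-memory CSV whose header contains only STRING values
# and whose data rows include numeric columns, as required above.
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["column1", "column2"])  # header: strings only
writer.writerow(["my_product", 40.0])    # data row with a numeric column
writer.writerow(["", 54.0])              # an empty value is allowed
print(buf.getvalue())
```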
- Go to Cloud Functions and click on the function you’ve created to open the Function details.
- On the Trigger tab, copy the URL address.
- Using an HTTP client, send a POST request to this URL address. The request body must be in the JSON format:
{
"sftp":
{
"user": "sftp.user_name",
"psswd": "sftp.password",
"path_to_file": "sftp://server_host/path/to/file/"
},
"bq":
{
"project_id": "my_bq_project",
"dataset_id": "my_bq_dataset",
"table_id": "my_bq_table",
"delimiter": ",",
"source_format": "CSV",
"location": "US"
}
}
Property name | Object | Description |
---|---|---|
Required properties | | |
user | sftp | Name of the user on the SFTP server who has read access. |
psswd | sftp | The user's password on the SFTP server. |
path_to_file | sftp | Full path to the file on the SFTP server. It must always look like this: "sftp://host/path/to/file/". |
project_id | bq | Name of the BigQuery project the table will be uploaded to. The project may differ from the one the Cloud Function was created in. |
dataset_id | bq | Name of the BigQuery dataset the table will be uploaded to. |
table_id | bq | Name of the BigQuery table the file from the SFTP server will be uploaded to. |
delimiter | bq | Field delimiter in the CSV file. The property is required if source_format is CSV. Supported delimiters: "," and "\|". |
source_format | bq | Format of the file uploaded to BigQuery. Supported formats: "NEWLINE_DELIMITED_JSON", "CSV". |
Optional properties | | |
location | bq | Geographical location of the table. Default: "US". |
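Before sending a request, the payload can be checked against the required properties listed above. A minimal sketch (not part of the module itself):

```python
# Required properties per object, taken from the table above.
REQUIRED = {
    "sftp": ["user", "psswd", "path_to_file"],
    "bq": ["project_id", "dataset_id", "table_id", "source_format"],
}

def missing_properties(payload):
    """Return a list of 'object.property' names absent from the request payload."""
    missing = []
    for obj, props in REQUIRED.items():
        section = payload.get(obj, {})
        for prop in props:
            if prop not in section:
                missing.append(f"{obj}.{prop}")
    # delimiter is required only for the CSV source format
    bq = payload.get("bq", {})
    if bq.get("source_format") == "CSV" and "delimiter" not in bq:
        missing.append("bq.delimiter")
    return missing

payload = {
    "sftp": {"user": "u", "psswd": "p", "path_to_file": "sftp://host/path/to/file/"},
    "bq": {"project_id": "pr", "dataset_id": "d", "table_id": "t",
           "source_format": "CSV", "delimiter": ","},
}
print(missing_properties(payload))  # []
```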
Each Cloud Function invocation is logged. You can view the logs in Google Cloud Platform:
- Go to Cloud Functions and click on the function you created to open the Function details.
- Click View logs in the top bar to see the latest logs at the Error and Critical levels.
Usually, these errors are caused by issues with SFTP server access, BigQuery access, or errors in the imported file.
- The size of the processed file must not exceed 2 GB.
- The Cloud Function invocation timeout must not exceed 540 seconds.
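The two limits above can be expressed as a small pre-flight check (illustrative helper, not part of the module):

```python
MAX_FILE_BYTES = 2 * 1024 ** 3   # 2 GB file-size limit
MAX_TIMEOUT_SECONDS = 540        # Cloud Functions timeout limit

def within_limits(file_size_bytes, timeout_seconds):
    """True if both values respect the documented limits."""
    return (file_size_bytes <= MAX_FILE_BYTES
            and 0 < timeout_seconds <= MAX_TIMEOUT_SECONDS)

print(within_limits(500 * 1024 ** 2, 540))  # True: 500 MB, max timeout
print(within_limits(3 * 1024 ** 3, 540))    # False: file too large
```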
Invoke the function via the Linux terminal:
curl -X POST https://REGION-PROJECT_ID.cloudfunctions.net/sftp/ -H "Content-Type: application/json" -d \
'{
"sftp":
{
"user": "sftp.user_name",
"psswd": "sftp.password",
"path_to_file": "sftp://server_host/path/to/file/"
},
"bq":
{
"project_id": "my_bq_project",
"dataset_id": "my_bq_dataset",
"table_id": "my_bq_table",
"delimiter": ",",
"source_format": "CSV",
"location": "US"
}
}'
from httplib2 import Http
import json

trigger_url = "https://REGION-PROJECT_ID.cloudfunctions.net/sftp/"
headers = {"Content-Type": "application/json; charset=UTF-8"}
payload = {
"sftp":
{
"user": "sftp.user_name",
"psswd": "sftp.password",
"path_to_file": "sftp://server_host/path/to/file/"
},
"bq":
{
"project_id": "my_bq_project",
"dataset_id": "my_bq_dataset",
"table_id": "my_bq_table",
"delimiter": ",",
"source_format": "CSV",
"location": "US"
}
}
response, content = Http().request(uri=trigger_url, method="POST", body=json.dumps(payload), headers=headers)
Paste this code with your parameters and launch the function:
function runstp() {
  var trigger_url = "https://REGION-PROJECT_ID.cloudfunctions.net/sftp/";
var payload = {
"sftp":
{
"user": "sftp.user_name",
"psswd": "sftp.password",
"path_to_file": "sftp://server_host/path/to/file/"
},
"bq":
{
"project_id": "my_bq_project",
"dataset_id": "my_bq_dataset",
"table_id": "my_bq_table",
"delimiter": ",",
"source_format": "CSV",
"location": "US"
}
};
var options = {
"method": "post",
"contentType": "application/json",
"payload": JSON.stringify(payload)
};
var request = UrlFetchApp.fetch(trigger_url, options);
}