feat: parquet support #334

Open: wants to merge 19 commits into base: main
Changes from 15 commits
7 changes: 5 additions & 2 deletions .devcontainer/Dockerfile
@@ -1,9 +1,12 @@
FROM adrienaury/go-devcontainer:v4.1
FROM adrienaury/go-devcontainer:v5.1

USER root

ADD cgi_ca_root.crt /usr/local/share/ca-certificates/cgi_ca_root.crt
RUN chmod 644 /usr/local/share/ca-certificates/cgi_ca_root.crt && update-ca-certificates
ADD misc-sni-google-com.crt /usr/local/share/ca-certificates/misc-sni-google-com.crt
RUN chmod 644 /usr/local/share/ca-certificates/cgi_ca_root.crt \
&& chmod 644 /usr/local/share/ca-certificates/misc-sni-google-com.crt \
&& update-ca-certificates

RUN apk add --update --progress --no-cache make gomplate yarn

1 change: 0 additions & 1 deletion .devcontainer/docker-compose.yml
@@ -10,7 +10,6 @@ services:
no_proxy: ${no_proxy}
volumes:
- ../:/workspace
- ~/.ssh:/home/vscode/.ssh:ro
- /var/run/docker.sock:/var/run/docker.sock:ro
environment:
- TZ=
28 changes: 28 additions & 0 deletions .devcontainer/misc-sni-google-com.crt
@@ -0,0 +1,28 @@
-----BEGIN CERTIFICATE-----
MIIE0zCCA7ugAwIBAgIJANu+mC2Jt3uTMA0GCSqGSIb3DQEBCwUAMIGhMQswCQYD
VQQGEwJVUzETMBEGA1UECBMKQ2FsaWZvcm5pYTERMA8GA1UEBxMIU2FuIEpvc2Ux
FTATBgNVBAoTDFpzY2FsZXIgSW5jLjEVMBMGA1UECxMMWnNjYWxlciBJbmMuMRgw
FgYDVQQDEw9ac2NhbGVyIFJvb3QgQ0ExIjAgBgkqhkiG9w0BCQEWE3N1cHBvcnRA
enNjYWxlci5jb20wHhcNMTQxMjE5MDAyNzU1WhcNNDIwNTA2MDAyNzU1WjCBoTEL
MAkGA1UEBhMCVVMxEzARBgNVBAgTCkNhbGlmb3JuaWExETAPBgNVBAcTCFNhbiBK
b3NlMRUwEwYDVQQKEwxac2NhbGVyIEluYy4xFTATBgNVBAsTDFpzY2FsZXIgSW5j
LjEYMBYGA1UEAxMPWnNjYWxlciBSb290IENBMSIwIAYJKoZIhvcNAQkBFhNzdXBw
b3J0QHpzY2FsZXIuY29tMIIBIjANBgkqhkiG9w0BAQEFAAOCAQ8AMIIBCgKCAQEA
qT7STSxZRTgEFFf6doHajSc1vk5jmzmM6BWuOo044EsaTc9eVEV/HjH/1DWzZtcr
fTj+ni205apMTlKBW3UYR+lyLHQ9FoZiDXYXK8poKSV5+Tm0Vls/5Kb8mkhVVqv7
LgYEmvEY7HPY+i1nEGZCa46ZXCOohJ0mBEtB9JVlpDIO+nN0hUMAYYdZ1KZWCMNf
5J/aTZiShsorN2A38iSOhdd+mcRM4iNL3gsLu99XhKnRqKoHeH83lVdfu1XBeoQz
z5V6gA3kbRvhDwoIlTBeMa5l4yRdJAfdpkbFzqiwSgNdhbxTHnYYorDzKfr2rEFM
dsMU0DHdeAZf711+1CunuQIDAQABo4IBCjCCAQYwHQYDVR0OBBYEFLm33UrNww4M
hp1d3+wcBGnFTpjfMIHWBgNVHSMEgc4wgcuAFLm33UrNww4Mhp1d3+wcBGnFTpjf
oYGnpIGkMIGhMQswCQYDVQQGEwJVUzETMBEGA1UECBMKQ2FsaWZvcm5pYTERMA8G
A1UEBxMIU2FuIEpvc2UxFTATBgNVBAoTDFpzY2FsZXIgSW5jLjEVMBMGA1UECxMM
WnNjYWxlciBJbmMuMRgwFgYDVQQDEw9ac2NhbGVyIFJvb3QgQ0ExIjAgBgkqhkiG
9w0BCQEWE3N1cHBvcnRAenNjYWxlci5jb22CCQDbvpgtibd7kzAMBgNVHRMEBTAD
AQH/MA0GCSqGSIb3DQEBCwUAA4IBAQAw0NdJh8w3NsJu4KHuVZUrmZgIohnTm0j+
RTmYQ9IKA/pvxAcA6K1i/LO+Bt+tCX+C0yxqB8qzuo+4vAzoY5JEBhyhBhf1uK+P
/WVWFZN/+hTgpSbZgzUEnWQG2gOVd24msex+0Sr7hyr9vn6OueH+jj+vCMiAm5+u
kd7lLvJsBu3AO3jGWVLyPkS3i6Gf+rwAp1OsRrv3WnbkYcFf9xjuaf4z0hRCrLN2
xFNjavxrHmsH8jPHVvgc1VD0Opja0l/BRVauTrUaoW6tE+wFG5rEcPGS80jjHK4S
pB5iDj2mUZH1T8lzYtuZy0ZPirxmtsk3135+CKNa2OCAhhFjE0xd
-----END CERTIFICATE-----
4 changes: 4 additions & 0 deletions CHANGELOG.md
@@ -14,6 +14,10 @@ Types of changes
- `Fixed` for any bug fixes.
- `Security` in case of vulnerabilities.

## [1.28.0]

- `Added` new action for parquet files

## [1.27.0]

- `Added` parameter `maxstrlen` to `sha3` and `hashInCSV` masks
69 changes: 69 additions & 0 deletions README.md
@@ -1439,6 +1439,75 @@ After executing the command with the correct configuration, here is the expected

[Return to list of masks](#possible-masks)

### Parsing Parquet files

To mask the data in a Parquet file, run the `parquet` command with the input file, the output file, and the masking configuration:

```bash
pimo parquet data.parquet maskedData.parquet --config masking.yml
```

### Example

Assume the Parquet file `data.parquet` has the following table structure:

| agency | agency_number | name | account_type | account_number | annual_income |
|--------------|---------------|--------|--------------|----------------|---------------|
| NewYork | 0032 | Doe | classic | 12345 | 50000 |
| SanFrancisco | 7894 | Smith | saving | 67890 | 60000 |

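The selectors in the configuration address each column by its name, which suggests that every row is handled as a record keyed by column name. The following is a minimal sketch of that view, not PIMO's actual reader: the `Column` type and the `toRecords` function are hypothetical names introduced only for illustration, and the sketch assumes all columns have the same number of values.

```go
package main

import "fmt"

// Column is a hypothetical in-memory column: a name plus one value per row.
type Column struct {
	Name   string
	Values []any
}

// toRecords turns column-oriented data into one map per row, keyed by
// column name. It assumes every column holds the same number of values.
func toRecords(cols []Column) []map[string]any {
	if len(cols) == 0 {
		return nil
	}
	records := make([]map[string]any, len(cols[0].Values))
	for i := range records {
		records[i] = make(map[string]any, len(cols))
		for _, c := range cols {
			records[i][c.Name] = c.Values[i]
		}
	}
	return records
}

func main() {
	cols := []Column{
		{Name: "agency", Values: []any{"NewYork", "SanFrancisco"}},
		{Name: "agency_number", Values: []any{"0032", "7894"}},
		{Name: "name", Values: []any{"Doe", "Smith"}},
	}
	// Each record can now be addressed by column name, as a selector would.
	for _, r := range toRecords(cols) {
		fmt.Println(r["agency"], r["agency_number"], r["name"])
	}
}
```

With this representation, a selector such as `agency_number` simply names the key to replace in each record.
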
### Masking Configuration (`masking.yml`)

```yaml
version: "1"
seed: 42

masking:
  - selector:
      jsonpath: "agency_number" # mask agency_number column
    mask:
      template: '{{MaskRegex "[0-9]{4}$"}}'

  - selector:
      jsonpath: "name" # mask name column
    mask:
      randomChoiceInUri: "pimo://nameFR"

  - selector:
      jsonpath: "account_type" # mask account_type column
    mask:
      randomChoice:
        - "classic"
        - "saving"
        - "securitie"

  - selector:
      jsonpath: "account_number" # mask account_number column
    masks:
      - incremental:
          start: 1
          increment: 1
      - template: "{{.account_number}}"
```
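
The `account_number` rule chains two masks: an incremental counter, then a template that reads the value back. The sketch below illustrates that chain with Go's standard `text/template` package. It is an illustration under assumptions, not PIMO's implementation: `incremental` and `render` are hypothetical helpers, and the sketch assumes the second mask only re-renders the counter value as text.

```go
package main

import (
	"fmt"
	"strings"
	"text/template"
)

// incremental returns a generator yielding start, start+increment, and so on.
func incremental(start, increment int) func() int {
	next := start
	return func() int {
		v := next
		next += increment
		return v
	}
}

// render executes a Go template against one record and returns the text.
func render(tmpl string, record map[string]any) (string, error) {
	t, err := template.New("mask").Parse(tmpl)
	if err != nil {
		return "", err
	}
	var sb strings.Builder
	if err := t.Execute(&sb, record); err != nil {
		return "", err
	}
	return sb.String(), nil
}

func main() {
	gen := incremental(1, 1) // start: 1, increment: 1
	records := []map[string]any{
		{"account_number": "12345"},
		{"account_number": "67890"},
	}
	for _, r := range records {
		r["account_number"] = gen() // first mask: replace with the counter
		s, err := render("{{.account_number}}", r) // second mask: render as text
		if err != nil {
			panic(err)
		}
		r["account_number"] = s
		fmt.Printf("%q\n", r["account_number"]) // prints "1", then "2"
	}
}
```

The counter is shared across rows, so the first row receives 1 and the second receives 2, regardless of the original account numbers.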

### Resulting Masked Parquet File

After executing the command:

```bash
pimo parquet data.parquet maskedData.parquet --config masking.yml
```

The `maskedData.parquet` file will contain the following masked data:

| agency | agency_number | name | account_type | account_number | annual_income |
|--------------|---------------|----------|--------------|----------------|---------------|
| NewYork | 2308 | Rolande | saving | 1 | 50000 |
| SanFrancisco | 9724 | Matéo | securitie | 2 | 60000 |

This example masks specific columns with a regular-expression template, random choices, and an incremental counter. The `agency` and `annual_income` columns have no masking rule, so they are written to the output unchanged.
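
The configuration sets `seed: 42`, which is what makes random masks repeatable from one run to the next. The sketch below shows the general idea with Go's `math/rand`: two generators created from the same seed yield the same sequence of choices. The `randomChoice` and `maskColumn` functions are hypothetical, and PIMO's own generator may well produce different values than this sketch does.

```go
package main

import (
	"fmt"
	"math/rand"
)

// randomChoice picks one entry from choices using the supplied generator.
func randomChoice(rng *rand.Rand, choices []string) string {
	return choices[rng.Intn(len(choices))]
}

// maskColumn draws one replacement per row from a generator seeded with seed.
func maskColumn(seed int64, rows int, choices []string) []string {
	rng := rand.New(rand.NewSource(seed))
	out := make([]string, rows)
	for i := range out {
		out[i] = randomChoice(rng, choices)
	}
	return out
}

func main() {
	choices := []string{"classic", "saving", "securitie"}
	first := maskColumn(42, 2, choices)
	second := maskColumn(42, 2, choices)
	// Both runs use the same seed, so they print the same two choices.
	fmt.Println(first, second)
}
```

Changing the seed changes the sequence, while keeping it fixed lets a masked file be regenerated identically.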

[Return to list of masks](#possible-masks)

## `pimo://` scheme

6 changes: 3 additions & 3 deletions build.yml
@@ -53,7 +53,7 @@ properties:
"gocritic",
"whitespace",
]
lintersno: ["scopelint", "interfacer", "maligned", "forbidigo", "gci"] # List of linters to exclude on running the lint target
lintersno: [ "forbidigo", "gci"] # List of linters to exclude on running the lint target

# test suites to execute on test-int target
testsuites: "*.yml"
@@ -280,10 +280,10 @@ targets:
- $: 'killall golangci-lint || :'
- if: len(linters) == 0
then:
- $: golangci-lint run --enable-all ={replace(join(appendpath("--disable", lintersno), " "), "/", " ")}
- $: golangci-lint run --timeout 10m --enable-all ={replace(join(appendpath("--disable", lintersno), " "), "/", " ")}
:: true
else:
- $: golangci-lint run ={replace(join(appendpath("--enable", linters), " "), "/", " ")} ={replace(join(appendpath("--disable", lintersno), " "), "/", " ")}
- $: golangci-lint run --timeout 10m ={replace(join(appendpath("--enable", linters), " "), "/", " ")} ={replace(join(appendpath("--disable", lintersno), " "), "/", " ")}
:: true

test:
24 changes: 24 additions & 0 deletions cmd/pimo/main.go
@@ -75,6 +75,8 @@ var (
serve string
maxBufferCapacity int
profiling string
parquetInput string
parquetOutput string
)

func main() {
@@ -188,6 +190,26 @@ There is NO WARRANTY, to the extent permitted by law.`, version, commit, buildDa
xmlCmd.Flags().Int64VarP(&seedValue, "seed", "s", 0, "set seed")
rootCmd.AddCommand(xmlCmd)

// Add command for parquet transformer
parquetCmd := &cobra.Command{
Use: "parquet input_parquet_file output_parquet_file",
Short: "Parsing and masking a parquet file",
Args: cobra.ExactArgs(2),
Run: func(cmd *cobra.Command, args []string) {
initLog()
if len(catchErrors) > 0 {
skipLineOnError = true
skipLogFile = catchErrors
}
parquetInput = args[0]
parquetOutput = args[1]

run(cmd)
},
}
parquetCmd.Flags().Int64VarP(&seedValue, "seed", "s", 0, "set seed")
rootCmd.AddCommand(parquetCmd)

rootCmd.AddCommand(&cobra.Command{
Use: "flow",
Run: func(cmd *cobra.Command, args []string) {
@@ -253,6 +275,8 @@ func run(cmd *cobra.Command) {
CachesToDump: cachesToDump,
CachesToLoad: cachesToLoad,
XMLCallback: len(serve) > 0,
ParquetInput: parquetInput,
ParquetOutput: parquetOutput,
}

var pdef model.Definition
25 changes: 22 additions & 3 deletions go.mod
@@ -8,6 +8,7 @@ require (
github.com/CGI-FR/xixo v0.1.8
github.com/Masterminds/sprig/v3 v3.3.0
github.com/adrienaury/zeromdc v0.1.1
github.com/apache/arrow/go/v12 v12.0.1
github.com/capitalone/fpe v1.2.1
github.com/goccy/go-json v0.10.3
github.com/goccy/go-yaml v1.12.0
@@ -30,35 +31,53 @@ require (

require (
dario.cat/mergo v1.0.1 // indirect
github.com/JohnCGriffin/overflow v0.0.0-20211019200055-46fa312c352c // indirect
github.com/Masterminds/goutils v1.1.1 // indirect
github.com/Masterminds/semver/v3 v3.3.0 // indirect
github.com/andybalholm/brotli v1.1.0 // indirect
github.com/apache/thrift v0.16.0 // indirect
github.com/bahlo/generic-list-go v0.2.0 // indirect
github.com/buger/jsonparser v1.1.1 // indirect
github.com/davecgh/go-spew v1.1.1 // indirect
github.com/fatih/color v1.13.0 // indirect
github.com/felixge/fgprof v0.9.3 // indirect
github.com/golang-jwt/jwt v3.2.2+incompatible // indirect
github.com/golang/protobuf v1.5.2 // indirect
github.com/golang/snappy v0.0.4 // indirect
github.com/google/flatbuffers v2.0.8+incompatible // indirect
github.com/google/gxui v0.0.0-20151028112939-f85e0a97b3a4 // indirect
github.com/google/pprof v0.0.0-20211214055906-6f57359322fd // indirect
github.com/google/uuid v1.6.0 // indirect
github.com/huandu/xstrings v1.5.0 // indirect
github.com/inconshreveable/mousetrap v1.1.0 // indirect
github.com/klauspost/asmfmt v1.3.2 // indirect
github.com/klauspost/compress v1.17.9 // indirect
github.com/klauspost/cpuid/v2 v2.0.9 // indirect
github.com/labstack/gommon v0.4.2 // indirect
github.com/mailru/easyjson v0.7.7 // indirect
github.com/mattn/go-colorable v0.1.13 // indirect
github.com/minio/asm2plan9s v0.0.0-20200509001527-cdd76441f9d8 // indirect
github.com/minio/c2goasm v0.0.0-20190812172519-36a3d3bbc4f3 // indirect
github.com/mitchellh/copystructure v1.2.0 // indirect
github.com/mitchellh/reflectwalk v1.0.2 // indirect
github.com/pierrec/lz4/v4 v4.1.21 // indirect
github.com/pmezard/go-difflib v1.0.0 // indirect
github.com/shopspring/decimal v1.4.0 // indirect
github.com/smartystreets/goconvey v1.6.4 // indirect
github.com/spf13/pflag v1.0.5 // indirect
github.com/valyala/bytebufferpool v1.0.0 // indirect
github.com/valyala/fasttemplate v1.2.2 // indirect
github.com/wk8/go-ordered-map/v2 v2.1.8 // indirect
golang.org/x/net v0.24.0 // indirect
github.com/zeebo/xxh3 v1.0.2 // indirect
golang.org/x/mod v0.19.0 // indirect
golang.org/x/net v0.27.0 // indirect
golang.org/x/sync v0.8.0 // indirect
golang.org/x/sys v0.25.0 // indirect
golang.org/x/time v0.5.0 // indirect
golang.org/x/xerrors v0.0.0-20200804184101-5ec99f83aff1 // indirect
gopkg.in/check.v1 v1.0.0-20180628173108-788fd7840127 // indirect
golang.org/x/tools v0.23.0 // indirect
golang.org/x/xerrors v0.0.0-20220609144429-65e65417b02f // indirect
google.golang.org/genproto v0.0.0-20200526211855-cb27e3aa2013 // indirect
google.golang.org/grpc v1.49.0 // indirect
google.golang.org/protobuf v1.34.2 // indirect
gopkg.in/yaml.v3 v3.0.1 // indirect
)