go-sql-export

Export 1 million records from SQL to CSV or Fixed-Length Format Files with high performance

Exporting large volumes of data efficiently from SQL into a CSV or fixed-length file requires careful attention to memory management, I/O performance, and database interactions. This article describes an approach to optimizing the process in Go.

Export data from SQL to CSV or fixed-length format files

Test Info

  • RAM: 12 GB
  • Disk: SSD KINGSTON SA400S37240G ATA Device
  • Exec File Size: 10 MB
  • Database: PostgreSQL 16
  • Total Rows: 1,018,584
  • Total Columns: 76
  • Power Usage: Very High
  • CPU: 15%
| Type | File Size | Rows | RAM | Disk | Duration | Description |
| --- | --- | --- | --- | --- | --- | --- |
| Fixed Length | 1.15 GB | 1,018,584 | 15 MB | 10.1 MB/s | 1 min 45 sec | Full table scan |
| CSV | 975 MB | 1,018,584 | 15 MB | 10.1 MB/s | 1 min 12 sec | Full table scan |
| Fixed Length | 1.02 GB | 905,408 | 15 MB | 10.1 MB/s | 1 min 33 sec | Filter by index on 1 field |
| CSV | 863 MB | 905,408 | 15 MB | 10.1 MB/s | 1 min 3 sec | Filter by index on 1 field |
| Fixed Length | 890 MB | 792,232 | 14 MB | 9.9 MB/s | 1 min 23 sec | Filter by index on 1 field |
| CSV | 764 MB | 792,232 | 14 MB | 9.9 MB/s | 55 sec | Filter by index on 1 field |
| Fixed Length | 254 MB | 226,352 | 14 MB | 9.9 MB/s | 24 sec | Filter by index on 1 field |
| CSV | 220 MB | 226,352 | 14 MB | 9.9 MB/s | 16 sec | Filter by index on 1 field |

Batch jobs

Batch jobs differ from online processing:

  • Long-running, often at night or after working hours.
  • Non-interactive, often including logic for handling errors.
  • Large volumes of data.

Common Mistakes

  • Inefficient Writing to I/O: writing each record to the file immediately, without buffering, is slow because of the high overhead of repeated I/O operations.
    • Solution: use "bufio.Writer" for buffered, more efficient writing.
  • Loading All Data Into Memory: fetching all records at once can consume a lot of memory, causing the program to slow down or crash.
    • Solution: stream the result set with a cursor and, for each row, write to the file through "bufio.Writer" (see the sketch after this list).
  • Inefficient Query: scanning the full table without filtering on an index.
    • Solution: if you export the whole table, a full scan is fine; otherwise, filter on an indexed column.
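
A minimal sketch of the first two fixes, assuming the rows come from database/sql; the column names, the naive unquoted CSV line, and the 64 KB buffer size are illustrative only:

package export

import (
    "bufio"
    "database/sql"
    "io"
)

// streamToCSV shows both fixes together: iterate the cursor row by row, so
// nothing is loaded into memory up front, and write through bufio.Writer so
// the file only sees large, infrequent writes instead of one write per record.
func streamToCSV(rows *sql.Rows, out io.Writer) error {
    w := bufio.NewWriterSize(out, 64*1024)
    for rows.Next() {
        var id, username, email string
        if err := rows.Scan(&id, &username, &email); err != nil {
            return err
        }
        // Naive CSV line for brevity; real CSV output needs quoting/escaping.
        if _, err := w.WriteString(id + "," + username + "," + email + "\n"); err != nil {
            return err
        }
    }
    if err := rows.Err(); err != nil {
        return err
    }
    return w.Flush() // push out whatever is still buffered
}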

Implementation

Data Reader for SQL

  1. Build Query: For an efficient query, filter on an indexed column and avoid a full table scan. In my sample, I created an index on the createdDate field; 4 of my 6 use cases filter on it.

  2. Scan the SQL row into an appropriate Go struct:

    We provide a function to map a row to a Go struct. We use gorm tags, so the struct can be reused with gorm later, with these benefits:

    • Simplifies the process of converting database rows into Go objects.
    • Reduces repetitive code and potential errors in manual data mapping.
    • Enhances code readability and maintainability.
type User struct {
    Id          string     `gorm:"column:id;primary_key" format:"%011s" length:"11"`
    Username    string     `gorm:"column:username" length:"10"`
    Email       string     `gorm:"column:email" length:"31"`
    Phone       string     `gorm:"column:phone" length:"20"`
    Status      bool       `gorm:"column:status" true:"1" false:"0" format:"%5s" length:"5"`
    CreatedDate *time.Time `gorm:"column:createdDate" length:"10" format:"dateFormat:2006-01-02"`
}
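
For illustration only, here is a hand-written mapper for the struct above. The project derives this mapping from the gorm tags via reflection, so scanUser is not its real API; the sketch also assumes the SELECT returns the six columns in the same order as the struct fields:

package export

import "database/sql"

// scanUser maps one cursor row onto the User struct above.
func scanUser(rows *sql.Rows) (User, error) {
    var u User
    var created sql.NullTime // createdDate may be NULL
    if err := rows.Scan(&u.Id, &u.Username, &u.Email, &u.Phone, &u.Status, &created); err != nil {
        return User{}, err
    }
    if created.Valid {
        t := created.Time
        u.CreatedDate = &t
    }
    return u, nil
}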

Transformer

Transform a Go struct into a single line of text (CSV or fixed-length format). We provide 2 transformers:

  • CSV Transformer: reads the Go struct tags to build a CSV line.
  • Fixed-Length Transformer: reads the Go struct tags to build a fixed-length line.

To improve performance, we cache the CSV and fixed-length layouts derived from the struct tags, so they are computed once per type rather than once per row.
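
As an illustration of the caching idea, here is a simplified fixed-length transformer. It only honors the length tag (padding or truncating each field to its declared width) and ignores the format, true/false, and date handling of the real transformer; all names are mine, not the library's:

package export

import (
    "fmt"
    "reflect"
    "strconv"
    "strings"
    "sync"
)

// fixedField is the cached layout of one struct field: its index and the
// width declared in the `length` tag.
type fixedField struct {
    index  int
    length int
}

var layoutCache sync.Map // reflect.Type -> []fixedField

// fixedLayout parses the struct tags once per type and caches the result,
// so per-row transformation does no tag parsing at all.
func fixedLayout(t reflect.Type) []fixedField {
    if cached, ok := layoutCache.Load(t); ok {
        return cached.([]fixedField)
    }
    fields := make([]fixedField, 0, t.NumField())
    for i := 0; i < t.NumField(); i++ {
        length, _ := strconv.Atoi(t.Field(i).Tag.Get("length"))
        fields = append(fields, fixedField{index: i, length: length})
    }
    layoutCache.Store(t, fields)
    return fields
}

// ToFixedLength renders a tagged struct as one fixed-length line by padding
// or truncating each field to its declared width.
func ToFixedLength(v interface{}) string {
    rv := reflect.ValueOf(v)
    var b strings.Builder
    for _, f := range fixedLayout(rv.Type()) {
        s := fmt.Sprintf("%v", rv.Field(f.index).Interface())
        if f.length > 0 {
            if len(s) > f.length {
                s = s[:f.length]
            } else {
                s += strings.Repeat(" ", f.length-len(s))
            }
        }
        b.WriteString(s)
    }
    return b.String() + "\n"
}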

File Writer

  • It is a wrapper around "bufio.Writer" that buffers writes to the file, reducing the number of I/O operations.
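
A sketch of such a wrapper; the type name and the 64 KB buffer size are illustrative, not the project's actual writer:

package export

import (
    "bufio"
    "os"
)

// FileWriter wraps bufio.Writer so callers write whole lines while the disk
// only sees large, buffered writes.
type FileWriter struct {
    file *os.File
    buf  *bufio.Writer
}

func NewFileWriter(path string) (*FileWriter, error) {
    f, err := os.Create(path)
    if err != nil {
        return nil, err
    }
    return &FileWriter{file: f, buf: bufio.NewWriterSize(f, 64*1024)}, nil
}

func (w *FileWriter) WriteLine(line string) error {
    _, err := w.buf.WriteString(line)
    return err
}

// Close flushes anything still buffered before closing the file.
func (w *FileWriter) Close() error {
    if err := w.buf.Flush(); err != nil {
        w.file.Close()
        return err
    }
    return w.file.Close()
}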

Key Aspects to improve performance:

  • Streaming: The code uses db.QueryContext to fetch records in a streaming manner. This prevents loading all records into memory at once.
  • Memory Management: Since rows are processed one by one, memory usage remains low, even when handling a large number of records.
  • Cache Scanning: based on the gorm tags, the column layout is cached when scanning SQL rows into Go structs, so reflection and tag parsing happen once per type.
  • Cache Transforming: likewise, the CSV or fixed-length layout is cached when transforming a Go struct into a CSV or fixed-length line.
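
Putting the sketches from this article together (exportFixedLength, scanUser, ToFixedLength, and NewFileWriter are the illustrative helpers above, not the project's actual API):

package export

import (
    "context"
    "database/sql"
)

// exportFixedLength streams the result set with db.QueryContext, maps each
// row to a User, transforms it to a fixed-length line, and writes it through
// the buffered FileWriter, so memory stays flat regardless of row count.
func exportFixedLength(ctx context.Context, db *sql.DB, query, path string) error {
    rows, err := db.QueryContext(ctx, query)
    if err != nil {
        return err
    }
    defer rows.Close()

    w, err := NewFileWriter(path)
    if err != nil {
        return err
    }
    defer w.Close()

    for rows.Next() {
        u, err := scanUser(rows)
        if err != nil {
            return err
        }
        if err := w.WriteLine(ToFixedLength(u)); err != nil {
            return err
        }
    }
    return rows.Err()
}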

Conclusion

In the sample, I tested with 1 million records: Postgres used less than 14 MB of RAM, and my program used about 15 MB.

So, for this case, we don't need LIMIT/OFFSET paging; as long as we loop over the cursor and write each row to the file stream, memory usage stays flat.

In the past, I also tested with 4 million records (about a 4 GB export), and it still worked.
