Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Mangled file content when multipart-POSTing a file with a "text/*" content type #403

Open
mciasuen opened this issue May 7, 2022 · 2 comments · May be fixed by #405
Open

Mangled file content when multipart-POSTing a file with a "text/*" content type #403

mciasuen opened this issue May 7, 2022 · 2 comments · May be fixed by #405

Comments

@mciasuen
Copy link

mciasuen commented May 7, 2022

Reading a UTF-8 CSV and attempting to upload it with needle via a multipart POST can cause non-ASCII characters inside the CSV data to be replaced by other characters.

Steps to Reproduce:

# node --version
v14.18.1
'use strict';

const childproc = require('child_process');
const needle = require('needle');
const net = require('net');

async function postAndCapture(buffer, content_type) {
	const port = 49152 + Math.floor(Math.random() * 16384);
	const proxy = net.createServer();
	await new Promise((res, rej) => {
		proxy.on('error', rej);
		proxy.listen(port, res);
	});
	proxy.on('connection', sock => {
		const buffs = [];
		let pending;
		sock.on('data', x => {
			buffs.push(x);
			if (!pending)
				pending = setImmediate(() => sock.end('HTTP/1.1 200 OK\r\n\r\n'));
		});
		sock.on('close', () => {
			const proc = childproc.spawn('hexdump', ['-C'], {
				stdio: ['pipe', 'inherit', 'inherit']
			});
			proc.stdin.end(Buffer.concat(buffs));
		});
	});
	await needle('post', `http://localhost:${port}/`, {
		file: {
			buffer,
			content_type,
			filename: new Date().getTime() + '.csv'
		}
	}, {
		multipart: true,
	});
	proxy.close();
}

(async () => {
	// File content as a buffer, as it would be read directly from a file.
	const csvFile = Buffer.from('77u/Ikh5cGhlbiIsIkVtIERhc2giDQoiLSIsIuKAlCINCg==', 'base64');

	// Send with CSV type, as some web API might require.
	// -> needle heuristically detects a "text" type, tries to re-encode the
	// CSV payload, causing the UTF-8 em dash to be replaced by an ASCII
	// control character, and corrupting the payload.
	await postAndCapture(csvFile, 'text/csv');

	// Sending with application/octet-stream works around the issue, but prevents
	// us from sending the correct Content-Type, which might not work for all use-cases.
	await postAndCapture(csvFile, 'application/octet-stream');
})();

Expected

  • Changing the Content-Type of the uploaded file should not affect the file content.

Observed

With application/octet-stream...

...the relevant section of the hex dump shows:

00000110  54 50 43 4c 49 45 4e 54  0d 0a 43 6f 6e 74 65 6e  |TPCLIENT..Conten|
00000120  74 2d 44 69 73 70 6f 73  69 74 69 6f 6e 3a 20 66  |t-Disposition: f|
00000130  6f 72 6d 2d 64 61 74 61  3b 20 6e 61 6d 65 3d 22  |orm-data; name="|
00000140  66 69 6c 65 22 3b 20 66  69 6c 65 6e 61 6d 65 3d  |file"; filename=|
00000150  22 31 36 35 31 39 33 37  34 32 38 39 38 38 2e 63  |"1651937428988.c|
00000160  73 76 22 0d 0a 43 6f 6e  74 65 6e 74 2d 54 72 61  |sv"..Content-Tra|
00000170  6e 73 66 65 72 2d 45 6e  63 6f 64 69 6e 67 3a 20  |nsfer-Encoding: |
00000180  62 69 6e 61 72 79 0d 0a  43 6f 6e 74 65 6e 74 2d  |binary..Content-|
00000190  54 79 70 65 3a 20 61 70  70 6c 69 63 61 74 69 6f  |Type: applicatio|
000001a0  6e 2f 6f 63 74 65 74 2d  73 74 72 65 61 6d 0d 0a  |n/octet-stream..|
000001b0  0d 0a ef bb bf 22 48 79  70 68 65 6e 22 2c 22 45  |.."Hyphen","E|
000001c0  6d 20 44 61 73 68 22 0d  0a 22 2d 22 2c 22 e2 80  |m Dash".."-",".|
000001d0  94 22 0d 0a 0d 0a 2d 2d  2d 2d 2d 2d 2d 2d 2d 2d  |."....----------|

The em dash is encoded as 0x80 0x94, a valid UTF-8 code sequence.

With text/csv...

...the relevant section of the hex dump shows:

00000110  54 50 43 4c 49 45 4e 54  0d 0a 43 6f 6e 74 65 6e  |TPCLIENT..Conten|
00000120  74 2d 44 69 73 70 6f 73  69 74 69 6f 6e 3a 20 66  |t-Disposition: f|
00000130  6f 72 6d 2d 64 61 74 61  3b 20 6e 61 6d 65 3d 22  |orm-data; name="|
00000140  66 69 6c 65 22 3b 20 66  69 6c 65 6e 61 6d 65 3d  |file"; filename=|
00000150  22 31 36 35 31 39 33 37  34 32 38 39 36 36 2e 63  |"1651937428966.c|
00000160  73 76 22 0d 0a 43 6f 6e  74 65 6e 74 2d 54 79 70  |sv"..Content-Typ|
00000170  65 3a 20 74 65 78 74 2f  63 73 76 0d 0a 0d 0a ff  |e: text/csv....|
00000180  22 48 79 70 68 65 6e 22  2c 22 45 6d 20 44 61 73  |"Hyphen","Em Das|
00000190  68 22 0d 0a 22 2d 22 2c  22 14 22 0d 0a 0d 0a 2d  |h".."-","."....-|

The em dash is encoded as 0x14, an obscure ASCII control character, which apparently chokes some CSV parsers.


Uploading the CSV file as application/octet-stream may work as a workaround for some APIs but may not work in all cases, e.g. where a provider accepts multiple formats and uses the Content-Type header to actually differentiate which parser to use.

According to https://github.com/tomas/needle/blob/master/lib/multipart.js#L45, needle tries to heuristically determine if it should process and re-encode the payload data based on the content-type; there is apparently no way to instruct it to skip this re-encoding and send the data exactly as-is while still using a text content-type.

@tomas
Copy link
Owner

tomas commented May 7, 2022

I see. So what should we do in this case? Allow passing something like multipart: 'raw' to skip re-encoding, or include some other method of preventing these weird conversions to happen?

Thanks for the very detailed bug report, by the way!

@mciasuen
Copy link
Author

mciasuen commented May 7, 2022

multipart: 'raw' could be technically workable, but feels very unsatisfying.

If you receive a binary Buffer for that part (or a file, which readFile then turns into a Buffer), wouldn't it make sense to just always send that as binary without re-encoding? In that case, I might just do something like this:

(code example moved to PR #405)

That seemed to work for the test case in this issue, at least: I get the content-type I want, a binary transfer-encoding, and the correct UTF-8 bytes, in both cases.

mciasuen added a commit to mciasuen/tomas-needle that referenced this issue May 9, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants