Improve support for PDF/A and PDF/UA #664

qligier · 2021-03-03T17:51:25Z

Hi,

This is a PR to improve support for PDF/A and PDF/UA with these modifications:

it merges the methods PdfBoxRenderer.addPdfUaXMPSchema() and PdfBoxRenderer.addPdfASchema() because both set the XMP metadata, which makes it possible to generate a PDF that is both PDF/A and PDF/UA. It also aligns the behavior of the slow and fast modes;
it improves the PDF/A support by fixing the mandatory translation from the information dictionary to the XMP metadata and by adding the mandatory pdfaExtension;
the call to PdfRendererBuilder.usePdfAConformance() now sets the PDF version because a conformance level is linked to a specific PDF version.
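
To illustrate the last point, here is a minimal sketch of the idea, not the code from this PR. The names ConformanceSketch and ConformanceSketchDemo are hypothetical. The version numbers assume that PDF/A-1 is based on PDF 1.4 and that PDF/A-2 and PDF/A-3 are based on PDF 1.7.

```java
import java.util.Locale;

/**
 * Hypothetical sketch: each PDF/A conformance level carries the PDF version
 * it implies, so a caller never has to set the version separately.
 */
enum ConformanceSketch {
    PDFA_1_A(1, "a", 1.4f),
    PDFA_1_B(1, "b", 1.4f),
    PDFA_2_A(2, "a", 1.7f),
    PDFA_2_B(2, "b", 1.7f),
    PDFA_2_U(2, "u", 1.7f),
    PDFA_3_A(3, "a", 1.7f),
    PDFA_3_B(3, "b", 1.7f),
    PDFA_3_U(3, "u", 1.7f);

    final int part;         // PDF/A part: 1, 2 or 3
    final String level;     // conformance level: a, b or u
    final float pdfVersion; // PDF version implied by the part

    ConformanceSketch(int part, String level, float pdfVersion) {
        this.part = part;
        this.level = level;
        this.pdfVersion = pdfVersion;
    }
}

class ConformanceSketchDemo {
    public static void main(String[] args) {
        // List every level together with the PDF version it implies.
        for (ConformanceSketch c : ConformanceSketch.values()) {
            System.out.println(String.format(Locale.ROOT,
                    "PDF/A-%d%s implies PDF %.1f", c.part, c.level, c.pdfVersion));
        }
    }
}
```

With a mapping like this, choosing the conformance level is enough and a separate call to set the PDF version becomes redundant.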

I've been unable to properly generate a valid XMP string (edit: because of a bad transformer), so I've used a quick hack:

XmpSerializer serializer = new XmpSerializer();
ByteArrayOutputStream baos = new ByteArrayOutputStream();
serializer.serialize(metadata, baos, true);
String xmp = baos.toString("UTF-8");
// Fix for bad XML generation by some transformers
xmp = xmp.replace(" lang=\"x-default\"", " xml:lang=\"x-default\"");
//xmp = xmp.replace("pdfaExtension:pdfuaid:part", "pdfuaid:part");
metadataStream.importXMPMetadata(xmp.getBytes(StandardCharsets.UTF_8));

The two issues are:

In the Dublin Core, the 'lang' attribute (as serialized by the library) is not accepted by the validators. Adding the xml: prefix fixes the issue.
~~In the PDF Extension, when adding the UA part (pdfuaid:part), using the qualified call doesn't prevent the global prefix (pdfaExtension) from being added.~~

I don't know if it comes from the XmpSerializer (XmpBox) or the models (PdfBox), but it might well be an issue on their side.
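
The first workaround can be tried in isolation. This is a minimal sketch using only the JDK; the class name XmlLangFixDemo, the method name fixLangAttribute and the sample rdf:li element are made up for illustration. The leading space in the search string is what keeps an already correct xml:lang attribute untouched.

```java
class XmlLangFixDemo {

    /**
     * Adds the missing "xml:" prefix to unprefixed lang="x-default" attributes.
     * The search string starts with a space, so " xml:lang=..." never matches
     * and the replacement is safe to apply twice.
     */
    static String fixLangAttribute(String xmp) {
        return xmp.replace(" lang=\"x-default\"", " xml:lang=\"x-default\"");
    }

    public static void main(String[] args) {
        // Hypothetical fragment as a misbehaving transformer might emit it.
        String broken = "<rdf:li lang=\"x-default\">Sample title</rdf:li>";
        // The same fragment with the attribute already correct.
        String valid = "<rdf:li xml:lang=\"x-default\">Sample title</rdf:li>";

        System.out.println(fixLangAttribute(broken));
        System.out.println("unchanged when already valid: "
                + fixLangAttribute(valid).equals(valid));
    }
}
```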

The generated PDFs (PDF/A with or without PDF/UA) have successfully been tested against several validators:

The following changes could also be brought to the wiki page 'PDF A Standards Compliance':

The project is also capable of generating PDFs compliant with PDF/A3a, PDF/A3b and PDF/A3u.
In the example code, the call to builder.usePdfVersion(float) should be removed, as the PDF version is now set by the method call builder.usePdfAConformance(conform). (There was also a typo: it should have been 1.7f, not 1.5f.)
The guidelines from the PDF/UA wiki page don't necessarily apply when generating a PDF/UA. It's always good to follow them but it's not required for the strict adherence to the PDF/A specifications.

Thanks,
Quentin

qligier · 2021-03-04T21:35:23Z

I've played a bit more with this and discovered that the first issue (the xml:lang attribute) is caused by a bad transformer: it happens when using the implementation org.apache.xalan.transformer.TransformerIdentityImpl but not when using com.sun.org.apache.xalan.internal.xsltc.trax.TransformerImpl (JDK 11). It may be a bad implementation or a bad configuration.
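
To check which implementation is picked up in a given environment, something like the following can help. It is a minimal sketch using only the JDK; the class name TransformerProbe and its helper method are hypothetical, and TransformerFactory.newDefaultInstance() requires Java 9 or later.

```java
import javax.xml.transform.TransformerConfigurationException;
import javax.xml.transform.TransformerFactory;

class TransformerProbe {

    /** Returns the class name of the identity transformer the factory creates. */
    static String identityTransformerClass(TransformerFactory factory) {
        try {
            return factory.newTransformer().getClass().getName();
        } catch (TransformerConfigurationException e) {
            throw new IllegalStateException("Could not create a transformer", e);
        }
    }

    public static void main(String[] args) {
        // Resolved through the usual JAXP lookup (system property, service
        // loader, ...), so a Xalan jar on the classpath can change the result.
        System.out.println("JAXP lookup:  "
                + identityTransformerClass(TransformerFactory.newInstance()));

        // Always the implementation bundled with the JDK.
        System.out.println("JDK built-in: "
                + identityTransformerClass(TransformerFactory.newDefaultInstance()));
    }
}
```

If the two lines differ, a third-party transformer is overriding the JDK one, which would be consistent with the behavior described above.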

I've finally been able to fix the second issue by using a pdfAExt.setPrefix("pdfuaid") call.

danfickle · 2021-03-08T12:11:50Z

Hi @qligier,

Firstly, huge thanks, you have obviously done a lot of research to get this right. The code looks excellent and I have no problems with merging, which I'll do now.

Feel free to modify the wiki page or leave a note here and I'll make the changes you suggest. The only one I'm not sure about is the last:

The guidelines from the PDF/UA wiki page don't necessarily apply when generating a PDF/UA. It's always good to follow them but it's not required for the strict adherence to the PDF/A specifications.

I was under the impression that the last "a" in PDF/A3a for example stood for accessible and required a tagged PDF. I could be wrong though as I haven't bought/read all the relevant standards.

Anyway, thanks again!

qligier · 2021-03-08T13:43:33Z

Thanks for merging!

The last "a" does indeed stand for "accessible", but PDF/A and PDF/UA don't have exactly the same requirements/guidelines (and I'm particularly unsure about which ones are hard requirements and which ones are guidelines in PDF/UA). I've been able to generate a valid PDF/A-1a file without following the accessibility guidelines, so they aren't hard requirements. That's why I proposed to clarify that it's a recommended SHOULD, not a SHALL.
I've edited the wiki as proposed, feel free to review/revert/clarify the changes if needed.

Thanks,
Quentin

@syjer

Fixes regression for PDF/UA introduced in #664. Thanks to @syjer for tracking down. Also add meta subject to PDF/UA samples which is now used as the Dublin Core description.

qligier added 2 commits March 3, 2021 18:16

Improve support for PDF/A and PDF/UA

1437eb8

Fix PDF/UA property prefix and improve the bad XML generation fix

19e9ec7

danfickle merged commit 138b5b9 into danfickle:open-dev-v1 Mar 8, 2021

danfickle mentioned this pull request Mar 19, 2021

Upload to maven central via bintray. #7

Open
