Improve support for PDF/A and PDF/UA #664

Merged
merged 2 commits into from
Mar 8, 2021


qligier
Contributor

@qligier qligier commented Mar 3, 2021

Hi,

This is a PR to improve support for PDF/A and PDF/UA with these modifications:

  • it merges the methods PdfBoxRenderer.addPdfUaXMPSchema() and PdfBoxRenderer.addPdfASchema() because both set the XMP metadata, which makes it possible to generate a PDF that is both A and UA. It also aligns the behavior of the slow and fast modes;
  • it improves the PDF/A support by fixing the mandatory translation from the information dictionary to the XMP metadata and by adding the mandatory pdfaExtension;
  • the call to PdfRendererBuilder.usePdfAConformance() now sets the PDF version because a conformance level is linked to a specific PDF version.
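To illustrate the last point, here is a minimal sketch of how a conformance level can carry its own PDF version. The names SketchPdfAConformance and SketchBuilder are hypothetical stand-ins for illustration, not the project's actual classes; the version mapping assumes PDF/A-1 is based on PDF 1.4 and PDF/A-2 and PDF/A-3 on PDF 1.7.

```java
/** Hypothetical stand-in for a PDF/A conformance setting (not the library's enum). */
enum SketchPdfAConformance {
    // PDF/A-1 is defined on top of PDF 1.4 and has no "u" level.
    PDFA_1_A(1, "A", 1.4f),
    PDFA_1_B(1, "B", 1.4f),
    // PDF/A-2 and PDF/A-3 are defined on top of PDF 1.7.
    PDFA_2_A(2, "A", 1.7f),
    PDFA_2_B(2, "B", 1.7f),
    PDFA_2_U(2, "U", 1.7f),
    PDFA_3_A(3, "A", 1.7f),
    PDFA_3_B(3, "B", 1.7f),
    PDFA_3_U(3, "U", 1.7f);

    private final int part;
    private final String level;
    private final float pdfVersion;

    SketchPdfAConformance(int part, String level, float pdfVersion) {
        this.part = part;
        this.level = level;
        this.pdfVersion = pdfVersion;
    }

    int part() { return part; }
    String level() { return level; }
    float pdfVersion() { return pdfVersion; }
}

/** Hypothetical builder: choosing a conformance level also fixes the PDF version. */
final class SketchBuilder {
    private float pdfVersion = 1.7f;
    private SketchPdfAConformance conformance;

    SketchBuilder usePdfAConformance(SketchPdfAConformance conformance) {
        this.conformance = conformance;
        // The caller no longer needs a separate "use PDF version" call.
        this.pdfVersion = conformance.pdfVersion();
        return this;
    }

    float pdfVersion() { return pdfVersion; }
    SketchPdfAConformance conformance() { return conformance; }
}
```

With this shape, a caller who picks a level cannot end up with a mismatched version by accident.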

I've been unable to properly generate a valid XMP string (edit: because of a bad transformer), so I've used a quick hack:

XmpSerializer serializer = new XmpSerializer();
ByteArrayOutputStream baos = new ByteArrayOutputStream();
serializer.serialize(metadata, baos, true);
String xmp = baos.toString("UTF-8");
// Fix for bad XML generation by some transformers
xmp = xmp.replace(" lang=\"x-default\"", " xml:lang=\"x-default\"");
//xmp = xmp.replace("pdfaExtension:pdfuaid:part", "pdfuaid:part");
metadataStream.importXMPMetadata(xmp.getBytes(StandardCharsets.UTF_8));

The two issues are:

  • In the Dublin Core, the 'lang' attribute (as serialized by the library) is not accepted by the validators. Adding the xml: prefix fixes the issue.
  • In the PDF Extension, when adding the UA part (pdfuaid:part), using the qualified call doesn't prevent the global prefix (pdfaExtension) from being added.
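As a small illustration of the first workaround, the string replacement from the hack can be isolated and checked on its own. This is only a sketch: the class name XmpLangFix and the sample rdf:li fragments are made up for the example. Because the search string starts with a space, an attribute that already carries the xml: prefix is left untouched, so applying the fix twice is harmless.

```java
/** Hypothetical helper isolating the 'lang' -> 'xml:lang' workaround. */
final class XmpLangFix {
    private XmpLangFix() { }

    /**
     * Adds the xml: prefix to unprefixed lang="x-default" attributes.
     * " xml:lang=" does not contain " lang=" (the character before "lang"
     * is a colon, not a space), so already-correct input is not changed.
     */
    static String fixLang(String xmp) {
        return xmp.replace(" lang=\"x-default\"", " xml:lang=\"x-default\"");
    }

    public static void main(String[] args) {
        String broken = "<rdf:li lang=\"x-default\">Title</rdf:li>";
        System.out.println(fixLang(broken));
    }
}
```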

I don't know if it comes from the XmpSerializer (XmpBox) or the models (PdfBox), but it might well be an issue on their side.

The generated PDFs (PDF/A with or without PDF/UA) have successfully been tested against several validators:

The following changes could also be brought to the wiki page 'PDF A Standards Compliance':

  • The project is also capable of generating PDFs compliant with PDF/A3a, PDF/A3b and PDF/A3u.
  • In the example code, the call to builder.usePdfVersion(float) should be removed, as the PDF version is now set by the method call builder.usePdfAConformance(conform). (There also was a typo, it should have been 1.7f, not 1.5f).
  • The guidelines from the PDF/UA wiki page don't necessarily apply when generating a PDF/A. It's always good to follow them, but they are not required for strict adherence to the PDF/A specifications.

Thanks,
Quentin

@qligier
Contributor Author

qligier commented Mar 4, 2021

I've played a bit more with this and discovered that the first issue (the xml:lang attribute) is caused by a bad transformer: it happens when using the implementation org.apache.xalan.transformer.TransformerIdentityImpl but not when using com.sun.org.apache.xalan.internal.xsltc.trax.TransformerImpl (JDK 11). It may be a bad implementation or a bad configuration.
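For anyone who wants to see which transformer their classpath selects, the following sketch prints the implementation class and runs an identity transform on a small fragment. It assumes Java 9 or later for TransformerFactory.newDefaultInstance(), which returns the JDK's built-in factory regardless of what else is on the classpath. The class name TransformerCheck and the sample XML are hypothetical, and the check goes through a stream source rather than the DOM path used by the serializer, so it is a quick diagnostic, not a reproduction of the bug.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

/** Hypothetical diagnostic for the transformer implementation in use. */
final class TransformerCheck {
    private TransformerCheck() { }

    /** Returns the class name of the identity transformer the factory creates. */
    static String implementationName(TransformerFactory factory) {
        try {
            return factory.newTransformer().getClass().getName();
        } catch (TransformerException e) {
            throw new IllegalStateException("Could not create a transformer", e);
        }
    }

    /** Copies the XML through an identity transform and returns the serialized result. */
    static String identityTransform(TransformerFactory factory, String xml) {
        try {
            Transformer transformer = factory.newTransformer();
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            transformer.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("Identity transform failed", e);
        }
    }

    public static void main(String[] args) {
        String sample = "<rdf:li xmlns:rdf=\"http://www.w3.org/1999/02/22-rdf-syntax-ns#\""
                + " xml:lang=\"x-default\">Title</rdf:li>";

        // Whatever the classpath and system properties select.
        TransformerFactory lookedUp = TransformerFactory.newInstance();
        System.out.println("Looked-up implementation: " + implementationName(lookedUp));
        System.out.println(identityTransform(lookedUp, sample));

        // The JDK's own implementation (Java 9+).
        TransformerFactory builtIn = TransformerFactory.newDefaultInstance();
        System.out.println("Built-in implementation: " + implementationName(builtIn));
        System.out.println(identityTransform(builtIn, sample));
    }
}
```

If the two implementation names differ, a third-party transformer (such as a standalone Xalan jar) is taking precedence over the JDK one.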

I've finally been able to fix the second issue by using a pdfAExt.setPrefix("pdfuaid") call.

@danfickle
Owner

Hi @qligier,

Firstly, huge thanks, you have obviously done a lot of research to get this right. The code looks excellent and I have no problems with merging, which I'll do now.

Feel free to modify the wiki page or leave a note here and I'll make the changes you suggest. The only one I'm not sure about is the last:

The guidelines from the PDF/UA wiki page don't necessarily apply when generating a PDF/A. It's always good to follow them, but they are not required for strict adherence to the PDF/A specifications.

I was under the impression that the last "a" in PDF/A3a for example stood for accessible and required a tagged PDF. I could be wrong though as I haven't bought/read all the relevant standards.

Anyway, thanks again!

@danfickle danfickle merged commit 138b5b9 into danfickle:open-dev-v1 Mar 8, 2021
@qligier
Contributor Author

qligier commented Mar 8, 2021

Thanks for merging!

The last "a" does indeed stand for "accessible", but PDF/A and PDF/UA don't have exactly the same requirements/guidelines (and I'm particularly unsure about which ones are hard requirements and which ones are guidelines in PDF/UA). I've been able to generate a valid PDF/A-1a file without following the accessibility guidelines, so they aren't hard requirements. That's why I proposed to clarify that it's a recommended SHOULD, not a SHALL.
I've edited the wiki as proposed, feel free to review/revert/clarify the changes if needed.

Thanks,
Quentin

danfickle added a commit that referenced this pull request Jul 12, 2021
Fixes regression for PDF/UA introduced in #664. Thanks to @syjer for tracking down.

Also add meta subject to PDF/UA samples which is now used as the Dublin Core description.