PDF disassembly
20160516

I need some tools for deconstructing PDF files.
Usage examples:
 - Technical manuals as PDFs; breaking down and extracting parts.
   Eg the Tek manual with the Ver 3 Readout board info, which is a 'semi-OCR' mess. This is the immediate need.
 - Pulling original max-resolution images out of PDFs.
 - Checking for and removing malware in PDFs.
 - Finding/embedding/extracting steganography-info in PDFs.
 - Dissecting PDFs suspected of holding 'hidden info' (eg O's BC.)
 - Building a list of shortcomings, flaws and omissions of PDF as a document standard.

This:
  F:\Personal\__Projects\__Learning_stuff\Software_languages\PDF_dissection
See also:
  \\N2200plus\NAS_Public\Archives\__Util\Nut_crak\PDF_deconstruction
  F:\Personal\__Web_my_sites\__staging_area\everist\__Hindsight_review\topics\obama\

Documents & Specs about PDF
---------------------------

From Adobe
----------
http://www.adobe.com/devnet/acrobat.html

Adobe's excellent PDF Reference (it's an 8MB PDF file)
  http://partners.adobe.com/asn/acrobat/sdk/public/docs/PDFReference15_v6.pdf   gone -->
  http://wwwimages.adobe.com/www.adobe.com/content/dam/Adobe/en/devnet/pdf/pdfs/pdf_reference_archives/PDFReference15_v6.pdf
  (One of a series of versions; see the Library of Congress list below.)

Library of Congress index to doc sources:
http://www.digitalpreservation.gov/formats/fdd/browse_list.shtml
  Huge list of file formats, incl sequence of PDF.
http://www.digitalpreservation.gov/formats/fdd/fdd000123.shtml
  PDF
  PDF, PDF (Portable Document Format) Family
  PDF_1_3, PDF, Versions 1.0-1.3
  PDF_1_4, PDF, Version 1.4
  PDF_1_5, PDF, Version 1.5
  PDF_1_6, PDF, Version 1.6
  PDF_1_7, PDF, Version 1.7, (ISO 32000-1)
  PDF_1_7_ext03, PDF, Version 1.7, ExtensionLevel 3
  PDF, Geospatial Encoding, PDF Version 1.7, ExtensionLevel 3
  PDF_1_7_ext05, PDF, Version 1.7, ExtensionLevel 5

From ISO  (list at same URL as above)
--------
Once PDF became an ISO standard, the ref docs from ISO are behind a paywall. (outrageous!)
They cost a LOT, eg PDF/A-3 = 158 CHF (Swiss francs) = AU$222
http://www.digitalpreservation.gov/formats/fdd/fdd000123.shtml
  PDF/A, PDF for Long-term Preservation
  PDF/A-1, PDF for Long-term Preservation, Use of PDF 1.4
  PDF/A-1a, PDF for Long-term Preservation, Use of PDF 1.4, Level A Conformance
  PDF/A-1b, PDF for Long-term Preservation, Use of PDF 1.4, Level B Conformance
  PDF/A-2, PDF for Long-term Preservation, Use of ISO 32000-1 (PDF 1.7)
  PDF/A-2a, PDF for Long-term Preservation, Use of ISO 32000-1 (PDF 1.7), Level A Conformance
  PDF/A-2u, PDF for Long-term Preservation, Use of ISO 32000-1 (PDF 1.7), Level U Conformance
  PDF/A-2b, PDF for Long-term Preservation, Use of ISO 32000-1 (PDF 1.7), Level B Conformance
  PDF/A-3, PDF for Long-term Preservation, Use of ISO 32000-1, With Embedded Files
  PDF/UA-1, PDF Enhancement for Accessibility, Use of ISO 32000-1 (PDF 1.7)
  PDF/X, PDF for Prepress Graphics File Interchange
  Also
  GeoPDF_2_2, GeoPDF (TerraGo) Encoding, Version 2.2

Articles about PDF and dissecting PDF files
-------------------------------------------
http://securityxploded.com/pdf_vuln_exploits.php
  PDF - Vulnerabilities, Exploits and Malwares
  GOOD ARTICLE on PDF

http://www.records.nsw.gov.au/digitalarchives/pathways/formats/47
  Acrobat PDF/A - Portable Document Format
  Description:
  The Portable Document Format/Archive is a format designed for long term preservation by Adobe Systems.
  PDF/A is an ISO standardised version of PDF, with all of the features from PDF that would impede long term preservation removed.
  A major principle of PDF/A is that it is self contained and not reliant on externalities thus all font and colour information is encoded into the file.
  PDF/A files are larger than other types of PDF files due to the need for embedded information.
  PDF/A-3 supports three levels of compliance: PDF/A-3a (Accessible), PDF/A-3b (Basic), and PDF/A-3u (Unicode).
  PDF/A-3 is based on ISO 32000-1 – PDF 1.7 and is defined by ISO 19005-3:2012, which has the formal name
  'Document management - Electronic document file format for long-term preservation -
  Part 3: Use of ISO 32000-1 with support for embedded files (PDF/A-3)'.
  The primary new feature in PDF/A-3 is the ability to embed any source format within a PDF/A file.

Very concise PDF overview:
  https://blog.idrsolutions.com/2013/01/understanding-the-pdf-file-format-overview/
  http://blog.idrsolutions.com/2013/01/understanding-the-pdf-file-format-bugs-gotchas-and-tips/

PDF Analysis Utilities
----------------------
Google: pdf file structure dissector

==== PDF Dissector ====
(Was a briefly available commercial product, no longer available, can't find any copy)
What happened to PDF-Dissector:
  http://forum.exetools.com/showthread.php?s=ea0e722bf080559351039e62583fe571&t=16094
  "March 01, 2011: Google acquires zynamics
   We're pleased to announce that zynamics has been acquired by Google!
   If you're an existing customer and do not receive our email announcement within the next 48 hours,
   please contact us at zynamics-info@google.com.
   All press inquiries should be sent to press@google.com."
For more info, and the hunt for a copy, see Utils_orig\PDF_dissector folder & info file.
Also Archives\__Util\Nut_crak\PDF_deconstruction\PDF_dissector
----
http://stackoverflow.com/questions/3549541/best-tool-tool-for-inspecting-pdf-files
  Best tool for inspecting PDF files?

  http://www.cheapimpostor.com/PDFInspector/  (for macs)
  PDF Document Inspector - A PDF Document structure browser.

  Adobe Acrobat has a very cool but rather well hidden mode allowing you to inspect PDF files.
  I wrote a blog article explaining it at http://www.jpedal.org/PDFblog/2009/04/viewing-pdf-objects/

  Stackoverflow...
  The object viewer in Acrobat is good but Windjack Solution's PDF Canopener allows better inspection with an eyedropper for selecting objects on page. Also permits modifications to be made to PDF.
  http://www.windjack.com/products/pdfcanopener.html
  Useful PDF tools – PDF CanOpener
  […] to work okay with 8 and 9. I suspect the reason for not listing 8 and 9 may be that Acrobat added similar functionality in Acrobat. This may well have reduced the commercial market for PDF CanOpener as Acrobat 8/9 provides lots of […]
------------
I use iText RUPS (Reading and Updating PDF Syntax) in Linux. Since it's written in Java, it works on Windows, too.
You can browse all the objects in PDF file in a tree structure. It can also decode Flate encoded streams on-the-fly to make inspecting easier.
  http://github.com/itext/rups/

  How are you supposed to run this thing?
  Edit: Figured it out. You should not download the default file offered by SourceForge, you need to download the .jar which includes dependencies.

  - all iText related software, including RUPS, was already on GitHub for more than 6 months. There is also the official iText website, itextpdf.com – Amedee Van Gasse
  http://itextpdf.com/
------------
via https://blog.idrsolutions.com/2010/09/useful-pdf-tools-pdfedit/
Useful PDF tools – pdfedit
Pdfedit is a free tool for Unix and Linux systems (it can also run under Windows) which can be downloaded from sourceforge.
  http://sourceforge.net/projects/pdfedit/
------------
Enfocus PitStop Pro, StatusCheck
via https://blog.idrsolutions.com/2010/09/useful-pdf-tools-pdfedit/
Enfocus also has a free tool to examine PDFs at a low level. You can get it both for Mac and Windows as a plug-in into Acrobat or as a standalone tool.
This should link to the page where you have more explanation and with a download link on the left:
  http://www.enfocus.com/product.php?id=4530  (gone? all products cost money)
Unless it's StatusCheck or PitStopPro
  https://www.enfocus.com/en/products/statuscheck  (free, downloaded)
  https://www.enfocus.com/en/products/pitstop-pro  (very expensive, not dl)
------------
Moved here from F:\Personal\__Web_my_sites\__staging_area\everist\__Hindsight_review ...
refs_deconstructing_pdf

goog: deconstructing pdf

https://answers.yahoo.com/question/index?qid=20090306020740AANzftK
  There are several "extractors" available. However, most will show the results of the embedded image and not the original.
    http://www.rlvision.com/pdfwiz/about.asp
    http://www.download.com/Some-PDF-Image-Extract/3000-2079_4-10836441.html
    http://www.shareup.com/PDF_Image_Extractor-download-31621.html
  and if you want to unpdf a file try pdf to word
    http://www.hellopdf.com/download.php
  The mileage of good images from a PDF may vary. It usually is best and gives better images when those used in the pdf creation have been resized in an image editor before they are saved/used in the pdf. (just like web images)
----------
http://ubuntuforums.org/showthread.php?t=1578373
  How can I deconstruct a PDF to a point where I can determine the actions taken when the buttons in the PDF get clicked?
  Are there any native Linux tools to help me do this, FOSS or otherwise?
  I've tried pdfedit, but it's very confusing, and I can't figure out how to convince it to do what I need it to do.
  pdftk doesn't seem to have the functionality I need, either.
  --a month later--
  I'll be closing out this ticket. I've discovered that the PDF in question has some crazy ColdFusion scripting embedded in it for processing, and there's someone else who has permissions to edit it.
  This isn't my problem anymore. Thanks, though.
---------
google: pdf internal file structure --> lots!
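
Rough sketch (mine, not from any of the pages above) of what the low-level tools are doing, for two of the needs in these notes: getting the original embedded JPEG bytes rather than a re-rendered image, and spotting script/action entries such as the button actions asked about in the ubuntuforums thread. Assumptions: it uses only Python's standard re and zlib modules; it finds objects by pattern matching instead of walking the xref table; it only handles streams whose sole filter is FlateDecode or DCTDecode; it ignores encryption and names written with #xx hex escapes. The function names and the ACTION_KEYS list are made up for illustration.

```python
import re
import zlib

# "N G obj << ... >> stream <data> endstream". The dictionary part refuses to
# cross an "endobj", so a non-stream object cannot swallow the next one.
STREAM_RE = re.compile(
    rb"(\d+)\s+(\d+)\s+obj\s*(<<(?:(?!endobj).)*?>>)\s*stream\r?\n"
    rb"(.*?)\r?\n?endstream",
    re.DOTALL,
)
FILTER_RE = re.compile(rb"/Filter\s*(\[[^\]]*\]|/\w+)")
# Illustrative list of names worth a second look (scripts, auto-run actions).
ACTION_KEYS = ("/OpenAction", "/AA", "/JavaScript", "/JS", "/Launch", "/EmbeddedFile")


def iter_streams(pdf_bytes):
    """Yield (object_number, filter_names, raw_data) for each stream found."""
    for m in STREAM_RE.finditer(pdf_bytes):
        found = FILTER_RE.search(m.group(3))
        names = re.findall(rb"/(\w+)", found.group(1)) if found else []
        yield int(m.group(1)), [n.decode("ascii") for n in names], m.group(4)


def decode_stream(filters, raw):
    """Return decoded bytes for unfiltered or Flate-only streams, else None."""
    if not filters:
        return raw
    if filters == ["FlateDecode"]:
        try:
            # decompressobj tolerates a stray EOL byte or a clipped checksum.
            return zlib.decompressobj().decompress(raw)
        except zlib.error:
            return None
    return None  # filter chains, LZW, etc. are left alone in this sketch


def extract_jpegs(pdf_bytes):
    """Map object number -> stored JPEG bytes (streams filtered by DCTDecode)."""
    return {num: raw for num, filters, raw in iter_streams(pdf_bytes)
            if filters == ["DCTDecode"]}


def count_action_keys(pdf_bytes):
    """Count action-related names in the raw file plus any inflated streams."""
    parts = [pdf_bytes]
    for _, filters, raw in iter_streams(pdf_bytes):
        if filters == ["FlateDecode"]:
            parts.append(decode_stream(filters, raw) or b"")
    haystack = b"\n".join(parts)
    return {
        key: len(re.findall(re.escape(key.encode()) + rb"(?![A-Za-z0-9])", haystack))
        for key in ACTION_KEYS
    }
```

Usage would be along the lines of reading a file with open(path, "rb").read(), writing each value from extract_jpegs() to NNN.jpg, and printing count_action_keys(). A DCTDecode stream is the JPEG file as stored, which is why no re-encoding is needed; images stored under other filters would need real decoding, so this is no substitute for a proper parser such as RUPS.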
================================================================================
Books on PDF structure
----------------------
http://shop.oreilly.com/product/0636920021483.do
http://www.abebooks.com/servlet/SearchResults?an=John+Whitington&sts=t&tn=PDF+Explained   US$14.52 free post

PDF Explained - The ISO Standard for Document Exchange
By John Whitington
Publisher: O'Reilly Media
Final Release Date: December 2011
Pages: 142
Print ISBN: 978-1-4493-1002-8 | ISBN 10: 1-4493-1002-8
Ebook ISBN: 978-1-4493-1001-1 | ISBN 10: 1-4493-1001-X

At last, here’s an approachable introduction to the widely used Portable Document Format. PDFs are everywhere, both online and in printed form, but few people take advantage of the useful features or grasp the nuances of this format.
This concise book provides a hands-on tour of the world’s leading page-description language for programmers, power users, and professionals in the search, electronic publishing, and printing industries. Illustrated with lots of examples, this book is the documentation you need to fully understand PDF.
 - Build a simple PDF file from scratch in a text editor
 - Learn the layout and content of a PDF file, as well as the syntax of its objects
 - Examine the logical structure of PDF objects, and learn how pages and their resources are arranged into a document
 - Create vector graphics and raster images in PDF, and deal with transparency, color spaces, and patterns
 - Explore PDF operators for building and showing text strings
 - Get up to speed on bookmarks, metadata, hyperlinks, annotations, and file attachments
 - Learn how encryption and document permissions work in PDF
 - Use the pdftk program to process PDF files from the command line

http://shop.oreilly.com/product/0636920025269.do
http://www.abebooks.com/servlet/SearchResults?an=Leonard+Rosenthol&sts=t&tn=Developing+with+PDF   US$21.32 free post

Developing with PDF - Dive Into the Portable Document Format
By Leonard Rosenthol
Publisher: O'Reilly Media
Final Release Date: October 2013
Pages: 218
Print ISBN: 978-1-4493-2791-0 | ISBN 10: 1-4493-2791-5
Ebook ISBN: 978-1-4493-2786-6 | ISBN 10: 1-4493-2786-9

PDF is becoming the standard for digital documents worldwide, but it’s not easy to learn on your own. With capabilities that let you use a variety of images and text, embed audio and video, and provide links and navigation, there’s a lot to explore.
This practical guide helps you understand how to work with PDF to construct your own documents, troubleshoot problems, and even build your own tools. You’ll also find best practices for producing, manipulating, and consuming PDF documents.
In addition, this highly approachable reference will help you navigate the official (and complex) ISO documentation.
 - Learn how to combine PDF objects into a cohesive whole
 - Use PDF’s imaging model to create vector and raster graphics
 - Integrate text, and become familiar with fonts and glyphs
 - Provide navigation within and between documents
 - Use annotations to overlay or incorporate additional content
 - Build interactive forms with the Widget annotation
 - Embed related files such as multimedia, 3D content, and XML files
 - Use optional content to enable non-printing graphics
 - Tag content with HTML-like structures, including paragraphs and tables
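
Note to self: the "build a simple PDF file from scratch" exercise listed under PDF Explained is a quick way to see the file layout (header, numbered objects, xref table, trailer, startxref). The sketch below is my own guess at a minimal version, not taken from either book. Assumptions: the function name build_minimal_pdf is made up, the page is US Letter (612 x 792 points), the font is the standard Helvetica, and the text must fit in Latin-1. I have checked the byte offsets in code but not the output in every viewer.

```python
def build_minimal_pdf(text="Hello, PDF"):
    """Return the bytes of a one-page PDF showing `text` in Helvetica."""
    # Backslash and parentheses are special inside a PDF literal string.
    safe = text.replace("\\", r"\\").replace("(", r"\(").replace(")", r"\)")
    content = f"BT /F1 24 Tf 72 720 Td ({safe}) Tj ET".encode("latin-1")

    # Object bodies in order; object numbers start at 1.
    objects = [
        b"<< /Type /Catalog /Pages 2 0 R >>",
        b"<< /Type /Pages /Kids [3 0 R] /Count 1 >>",
        b"<< /Type /Page /Parent 2 0 R /MediaBox [0 0 612 792] "
        b"/Contents 4 0 R /Resources << /Font << /F1 5 0 R >> >> >>",
        b"<< /Length %d >>\nstream\n" % len(content) + content + b"\nendstream",
        b"<< /Type /Font /Subtype /Type1 /BaseFont /Helvetica >>",
    ]

    out = bytearray(b"%PDF-1.4\n")
    offsets = []
    for number, body in enumerate(objects, start=1):
        offsets.append(len(out))  # byte offset of "N 0 obj", needed by xref
        out += b"%d 0 obj\n" % number + body + b"\nendobj\n"

    xref_pos = len(out)
    out += b"xref\n0 %d\n" % (len(objects) + 1)
    out += b"0000000000 65535 f \n"  # entry 0 is always the free-list head
    for off in offsets:
        out += b"%010d 00000 n \n" % off  # each entry is exactly 20 bytes
    out += b"trailer\n<< /Size %d /Root 1 0 R >>\n" % (len(objects) + 1)
    out += b"startxref\n%d\n%%%%EOF\n" % xref_pos
    return bytes(out)
```

Writing the result with open("hello.pdf", "wb") and then opening it in a text editor shows every part of the structure in readable form, since nothing is compressed. Offsets are computed from the buffer length as it grows, so editing the file by hand afterwards would invalidate the xref table.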