Thursday, May 15, 2014

PDF Manipulation

I also have a car I have been rebuilding for 20 years.  When I got the car, it was in bad shape and in the middle of being dismantled.  I've used a hard copy of the factory assembly engineering/instruction sheets, the ones used to put it together the first time, to figure out where things should go and what parts I was missing.  Now, I've been lucky enough to have access to an office "printer" for a little bit, so I decided to scan the factory documents into a PDF to use on the tablets.

The printer will scan to a USB disk, and can scan entire documents, front and back.  One problem, though, is that I didn't know how to use it.  I ended up with a document that had the first 20 pages correct in a portrait layout, and the rest in a landscape layout with every other page rotated the wrong direction.  I had also scanned it in sections, and then realized I had moved some pages into another, temporary notebook for actual use (nothing like trying to lug around a thick, 500-page book of paper in sheet protectors).  When I scanned in those missing pages, I got them in reverse order.  Yeah, obviously, I didn't know what I was doing.  But it was correctable.  I needed to get everything oriented the right direction, add the missing pages in the right place, reverse the page order where it was backwards, and rotate every other page of engineering drawings 180 degrees to fix the "tops".  I had 4 scans (called scan1.pdf, scan2.pdf, scan3.pdf, scan4.pdf), and two scans with the missing pages (pages089-095.pdf and pages123-126.pdf).

First, I needed to split the PDF into two different files (or logical "sections"), each with a different orientation.  The first would be the "table of contents", which was portrait-oriented.  This was done with :
    pdftk scan1.pdf cat 1-26 output table_of_contents.pdf
    pdftk scan1.pdf cat 27-end output book-1.pdf
    mv book-1.pdf scan1.pdf
    
This essentially left me with a new file called table_of_contents.pdf, and scan1.pdf didn't have that table in it any more.  The rest of the book was in landscape orientation.  Next, I had to break the first scan once again to make room for the missing pages.
    pdftk scan1.pdf cat 1-88 output pages001-088.pdf
    pdftk scan1.pdf cat 89-117 output pages096-122.pdf
    pdftk scan1.pdf cat 118-end output pages127-143.pdf
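
After a split like this, it can help to confirm that each piece holds the number of pages you expect before deleting anything.  Here is a minimal sketch, assuming pdftk's dump_data report contains a "NumberOfPages:" line; page_count is just a helper name made up for this example.

```shell
# page_count: read `pdftk FILE dump_data` output on stdin and print the page total.
# Assumption: the report contains a line of the form "NumberOfPages: N".
page_count() {
    awk '$1 == "NumberOfPages:" { print $2 }'
}

# Example usage (requires pdftk and the files produced by the split):
#   pdftk pages001-088.pdf dump_data | page_count
```

For pages001-088.pdf, which was cut from pages 1-88, this should report 88.  A different number for any piece is a hint that a range was off.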
    
As I was checking, I realized that pages089-095.pdf actually held pages 95 down to 89 (i.e., backwards).  I needed to reverse the order of pages in the PDF file.  I did this using the pdf2ps tool, the psselect tool, and the ps2pdf tool in the following way :
    pdf2ps pages089-095.pdf pages089-095.ps
    rm pages089-095.pdf
    psselect -r pages089-095.ps pages089-095a.ps
    ps2pdf pages089-095a.ps pages089-095.pdf
    rm pages089-095.ps pages089-095a.ps
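
As an aside, pdftk can reverse a file on its own, since a page range may run from high to low.  The sketch below assumes a pdftk build that accepts the range "end-1"; reverse_pdf is a hypothetical helper name, not part of pdftk.

```shell
# reverse_pdf SRC DST: write DST with the pages of SRC in reverse order.
# Assumption: pdftk accepts the descending range "end-1".
reverse_pdf() {
    src=$1
    dst=$2
    # pdftk cannot safely write over the file it is reading
    if [ "$src" = "$dst" ]; then
        echo "reverse_pdf: output must differ from input" >&2
        return 1
    fi
    pdftk "$src" cat end-1 output "$dst"
}

# Example usage:
#   reverse_pdf pages089-095.pdf pages089-095a.pdf
```

This skips the round trip through PostScript, which may matter for scanned pages because the images never get re-encoded.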
    
Next, I had scanned in the next missing pages with the pages face-down instead of face up.  That left me with a PDF containing pages 124, 123, 126, and then 125.  I went back to the pdftk tool to modify PDF page order in a granular fashion (you can reverse one or two pages with this technique).
    pdftk pages123-126.pdf cat 2 1 4 3 output pages123-126a.pdf
    rm pages123-126.pdf
    mv pages123-126a.pdf pages123-126.pdf
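
Typing the page list by hand is fine for four pages, but gets tedious for a longer face-down scan.  A small sketch that generates the swapped-pair list for any page count; swap_pairs is a made-up helper name, and it assumes every adjacent pair needs swapping.

```shell
# swap_pairs N: print page numbers 1..N with each adjacent pair swapped,
# e.g. "2 1 4 3" for N=4.  With an odd N the last page stays where it is.
swap_pairs() {
    n=$1
    i=1
    out=""
    while [ "$i" -le "$n" ]; do
        next=$((i + 1))
        if [ "$next" -le "$n" ]; then
            out="$out $next $i"
        else
            out="$out $i"   # odd page count: last page has no partner
        fi
        i=$((i + 2))
    done
    echo "${out# }"         # drop the leading space
}
```

The list can then be dropped straight into the cat operation (left unquoted on purpose so the shell splits it into separate page numbers):

    pdftk pages123-126.pdf cat $(swap_pairs 4) output pages123-126a.pdf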
    
With pages 123 through 126 back in order, I had to merge all of the PDF files together.  I used the Ghostscript tool gs to do this one instead of the pdftk tool (though I know pdftk will do this well).
    gs -q -sPAPERSIZE=letter -dNOPAUSE -dBATCH -sDEVICE=pdfwrite -sOutputFile=factory_drawings.pdf pages001-088.pdf pages089-095.pdf pages096-122.pdf pages123-126.pdf pages127-143.pdf scan2.pdf scan3.pdf scan4.pdf
    
This created a file called factory_drawings.pdf that had all of the pages in the correct order.  I needed to clean up the extra files now :
    rm pages* scan*
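
Because the cleanup deletes the only copies of the pieces, it may be worth making the merge fail loudly when an input is missing, and only cleaning up when it succeeds.  A sketch wrapping the same gs command; merge_pdfs is a hypothetical helper name.

```shell
# merge_pdfs OUT IN...: merge the input PDFs into OUT, in the order given.
# Refuses to run if any input file is missing.
merge_pdfs() {
    out=$1
    shift
    for f in "$@"; do
        if [ ! -f "$f" ]; then
            echo "merge_pdfs: missing input: $f" >&2
            return 1
        fi
    done
    gs -q -sPAPERSIZE=letter -dNOPAUSE -dBATCH -sDEVICE=pdfwrite \
       -sOutputFile="$out" "$@"
}
```

Chaining the cleanup with && means nothing is deleted unless the merge reported success:

    merge_pdfs factory_drawings.pdf pages001-088.pdf pages089-095.pdf pages096-122.pdf pages123-126.pdf pages127-143.pdf scan2.pdf scan3.pdf scan4.pdf && rm pages* scan*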
    
I was now left with two files, factory_drawings.pdf and table_of_contents.pdf.  However, since I had scanned this in using a 2-sided booklet format, every other page of the factory_drawings.pdf file had the top to the left and the rest of them had the top to the right.  I needed to rotate every other page of the PDF file.  For this, I resorted to the pdftk tool again.  The first step of this task is to split out the odd pages and rotate them one way :
    pdftk factory_drawings.pdf cat 1-endoddeast output factory_drawings-odd.pdf
    
Next, split out the even pages and rotate them the other way :
    pdftk factory_drawings.pdf cat 1-endevenwest output factory_drawings-even.pdf
    
And finally, "shuffle" them back together :
    rm factory_drawings.pdf
    pdftk factory_drawings-odd.pdf factory_drawings-even.pdf shuffle output factory_drawings.pdf
    rm factory_drawings-odd.pdf factory_drawings-even.pdf
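
The three steps can be wrapped into one function that works in a temporary directory and leaves the original file alone until you are happy with the result.  This is only a sketch of the same odd/even/shuffle technique; rotate_alternate is a made-up name, and the east/west directions are the ones that suited this particular scan.

```shell
# rotate_alternate SRC DST: rotate odd pages east and even pages west,
# then interleave them again into DST.  SRC is not modified.
rotate_alternate() {
    src=$1
    dst=$2
    tmp=$(mktemp -d) || return 1
    pdftk "$src" cat 1-endoddeast output "$tmp/odd.pdf" &&
    pdftk "$src" cat 1-endevenwest output "$tmp/even.pdf" &&
    pdftk "$tmp/odd.pdf" "$tmp/even.pdf" shuffle output "$dst"
    status=$?
    rm -rf "$tmp"           # remove the intermediate files either way
    return $status
}

# Example usage:
#   rotate_alternate factory_drawings.pdf factory_drawings-fixed.pdf
```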
    
SUCCESS!  At this point, I still have two files, factory_drawings.pdf and table_of_contents.pdf, and their orientation is where it should be (with the tops at the top).  The last thing I needed to do was merge them into one PDF file again (PDFs can handle different orientations in one file).
    gs -q -sPAPERSIZE=letter -dNOPAUSE -dBATCH -sDEVICE=pdfwrite -sOutputFile=1977_corvette-factory_assembly_drawings.pdf table_of_contents.pdf factory_drawings.pdf
    rm table_of_contents.pdf factory_drawings.pdf
    
I instantly had a perfect PDF file for use on the tablet as I sat next to the car trying to assemble it.

NOTE -

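If you use a custom font in your PDF, you can fix the PDF and embed the fonts using the command below.  The FONTPATH list is separated with ":" on Linux and other Unix-like systems, and with ";" on Windows :

    gs -o file-with-embedded-fonts.pdf -sDEVICE=pdfwrite -dEmbedAllFonts=true -sFONTPATH="/path/to/ttf:/other/path/to/ttf" input-without-embedded-fonts.pdf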

Another note: convert JPEG images to PDF using the jpeg2eps tool, found at :
    jpeg2eps ../*.jpg

This will create EPS files with the same names as the JPEGs.  Next, convert those to PDF.  Note that eps2eps only rewrites EPS as EPS; ps2pdf with -dEPSCrop produces an actual PDF cropped to each image's bounding box :
    for x in ../*.eps; do ps2pdf -dEPSCrop "$x" "$x.pdf"; done

At this point, you can combine them into a book :
    gs -q -sPAPERSIZE=letter -dNOPAUSE -dBATCH -sDEVICE=pdfwrite -sOutputFile=newbook.pdf ../*.eps.pdf

Edit as of December 2023.  I was REALLY lazy and didn't want to go download the jpeg2eps code, compile it, etc.  So, I used a few more mainstream command line apps.  First, each image needs to be converted to a PDF.  I used ImageMagick's "convert" command, once per image.  You might first need to edit ImageMagick's policy config to allow it to write PDF files (the relevant entry is at the bottom of the /etc/ImageMagick-6/policy.xml configuration file) :
    convert dowel_jig-1.jpg dowel_jig-1.pdf

Then, I could compile them into a new PDF (install pdftk-java) :
    pdftk dowel_jig-1.pdf dowel_jig-2.pdf dowel_jig-3.pdf dowel_jig-4.pdf cat output dowl_jig.pdf

And finally, I could run an OCR tool (install ocrmypdf) :
    ocrmypdf dowl_jig.pdf dowel_jig.pdf

I have my PDF ready to view!
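
For more than a handful of images, the convert, merge, and OCR steps can be looped.  The following is a rough sketch, assuming convert, pdftk, and ocrmypdf are all installed and behave as in the commands above; images_to_pdf is a hypothetical helper name.

```shell
# images_to_pdf DST IMG...: convert each image to a one-page PDF, merge the
# pages in the order given, then OCR the merged file into DST.
images_to_pdf() {
    dst=$1
    shift
    tmp=$(mktemp -d) || return 1
    n=0
    for img in "$@"; do
        n=$((n + 1))
        # zero-padded names keep the glob below in page order
        page=$(printf '%s/page-%04d.pdf' "$tmp" "$n")
        if ! convert "$img" "$page"; then
            rm -rf "$tmp"
            return 1
        fi
    done
    pdftk "$tmp"/page-*.pdf cat output "$tmp/merged.pdf" &&
    ocrmypdf "$tmp/merged.pdf" "$dst"
    status=$?
    rm -rf "$tmp"
    return $status
}

# Example usage:
#   images_to_pdf dowel_jig.pdf dowel_jig-*.jpg
```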