How to convert images to PDF with paper and image size, without ImageMagick

ImageMagick’s convert tool is handy for converting a series of images into a PDF. Just for future reference, here is one method how you can achieve the same without convert. It is useful if you have 1-bit PBM images (e.g. scanned text) at hand:

This command concatenates all .pbm files, pipes the data to pnmtops to create an intermediary PostScript file on-the-fly, and pipes this PostScript file to ps2pdf to create the final pdf.

With the -width and -height parameter you specify the paper size of the generated PDF in inches, which should correspond to the paper size of the original book. With the -imagewidth parameter you specify the width that your scan/photo takes up on the page.

Digitize books: Searchable OCR PDF with text overlay from scanned or photographed books on Linux

Here is my method to digitize books. It is a tutorial about how to produce searchable, OCR (Optical Character Recognition) PDFs from a hardcopy book using free software tools on Linux distributions. You probably can find more convenient proprietary software, but that’s not the objective of this post.

Important: I should not need to mention that depending on the copyright attached to a particular work, you may not be allowed to digitize it. Please inform yourself before, so that you don’t break copyright laws!!!

Digitize books

To scan a book, you basically have 2 choices:

  1. Scan each double page with a flatbed scanner
  2. Take a good photo camera, mount it on a tripod, have it point vertically down on the book, and then take photos of each double page. Professional digitizers use this method due to less strain on the originals.

No matter which method, the accuracy of OCR increases with the resolution and contrast of the images. The resolution should be high enough so that each letter is at least 25 pixels tall.

Since taking a photo is almost instant, you can be much faster with the photographing method than using a flatbed scanner. This is especially true for voluminous books which are hard to repeatedly take on and off a scanner. However, getting sharp high-resolution images with a camera is more difficult than using a flatbed scanner. So it’s a tradeoff that depends on your situation, equitpment and your skills.

Using a flatbed scanner doesn’t need explanation, so I’ll only explain the photographic method next.

Photographing each page

Digitizing books (International Dunhuang Project, CC BY-SA 3.0)
Digitizing books
(International Dunhuang Project, CC BY-SA 3.0)

If you use a camera, and you don’t have some kind of remote trigger or interval-trigger at hand, you would need 2 people: someone who operates the camera, and another one who flips the pages. You can easily scan 1 double page every 2 seconds once you get more skilled in the process.

Here are the steps:

  • Set the camera on a tripod and have it point vertically down. The distance between camera and book should be at least 1 meter to approximate orthagonal projection (imitates a flatbed scanner). Too much perspective projection would skew the text lines.
  • Place the book directly under the camera – avoid pointing the camera at any non-90-degree angles that would cause perspective skewing of the contents. Later we will unskew the images, but the less skewing you get at this point, the better.
  • Set up uniform lighting, as bright as you are able. Optimize lighting directions to minimize possible shadows (especially in the book fold). Don’t place the lights near the camera or it will cause reflections on paper or ink.
  • Set the camera to manual mode. Use JPG format. Turn the camera flash off. All pictures need to have uniform exposure characteristics to make later digital processing easier.
  • Maximize zoom so that a margin of about 1 cm around the book is still visible. This way, aligning of the book will take less time. The margin will be cropped later.
  • Once zoom and camera position is finalized, mark the position of the book on the table with tape. After moving the book, place it back onto the original position with help of these marks.
  • Take test pictures. Inspect and optimize the results by finding a balance between the following camera parameters:
    • Minimize aperture size (high f/value) to get sharper images.
    • Maximize ISO value to minimize exposure time so that wiggling of the camera has less of an effect. Bright lighting helps lowering ISO which helps reducing noise.
    • Maximize resolution so that the letter size in the photos is at least 25 pixels tall. This will be important to increase the quality of the OCR step below, and you’ll need a good camera for this.
  • Take one picture of each double page.
One double page of a book that will be digitized. This is actually a scan, but you also can use a good photo camera. Note that the right page is slighty rotated.
One double page of a book that will be digitized. This is actually a scan, but you also can use a good photo camera. Make sure that letters are at least 25 pixels tall. Note that the right page is slighty rotated.

Image Preprocessing

Let’s remember our goal: We want a PDF …

  • which is searchable (the text should be selectable)
  • whose file size is minimized
  • has the same paper size as the original
  • is clearly legible

The following steps are the preprocessing steps to accomplish this. We will use ImageMagick command line tools (available for all platforms) and a couple of other software, all available for Linux distributions.

A note on image formats

Your input files can be JPG or TIFF, or whatever format your scanner or camera support. However, this format must also be supported by ImageMagick. We’ll convert these images to black-and-white PBM images to save space and speed up further processing. PBM is a very simple, uncompressed image format that only stores 1 bit per pixel (2 colors). This image format can be embedded into the PDF directly, and it will be losslessly compressed extremely well, resulting in the smallest possible PDF size.

Find processing parameters by using just a single image

Before we will process all the images as a batch, we’ll just pick one image and find the right processing parameters. Copy one photograph into a new empty folder and do the following steps.

Converting to black and white

Suppose we have chosen one image in.JPG. Run:

Inspect the generated 1blackwhite.pbm file. Optimize the parameters threshold (50% in above example), brightness (0 in above example), and contrast (10 in above example) for best legibiligy of the text.

Black-white conversion of the original image. Contrast and resolution is important.
Black-white conversion of the original image. Contrast and resolution is important.

Cropping away the margins

Next we will crop away the black borders so that the image will correspond to the paper size.

In this example, the cropped image will be a rectangle of 2400×2000 pixels, taken from the offset 760,250 of the input image. Inspect 2cropped.pbm until you get the parameters right, it will be some trial-and-error. The vertical book fold should be very close to the horizontal middle of the cropped image (important for next step).

The cropped image
The cropped image

Split double pages into single pages

This will generate 2 images. Inspect  split0001.pbm and split0002.pbm. You only can use 50% of horizontal cut, otherwise you’ll get more than 2 images.

Left split page
Left split page
Right split page
Right split page

Deskewing the image

Your text lines are probably not exactly horizontal (page angles, camera angles, perspective distortion, etc.). However, having exactly horizontal text lines is very important for accuracy of OCR software. We can deskew an image with the following command:

Inspect the output file 3deskewed.pbm for best results.

The deskewed left page
The deskewed left page
The deskewed right page. Notice that the text lines are now perfectly horizontal. However, deskewing can have its limits, so the original image should already be good!
The deskewed right page. Notice that the text lines are now perfectly horizontal. However, deskewing can have its limits, so the original image should already be good!

Process all the images

Now that you’ve found the paramters that work for you, it’s simple to convert all of your images as a batch, by passing all the paramters at the same time to convert. Run the following in the folder where you stored all the JPG images (not in the folder where you did the previous single-image tests):

Now, for each .JPG input file, we’ll have two  .pbm output files. Inspect all .pbm files and make manual corrections if needed.

Note: If you have black borders on the pages, consider using unpaper to remove them. I’ll save writing about using unpaper for a later time.

 

Producing OCR PDFs with text overlay

The tesseract OCR engine can generate PDFs with a selectable text layer directly from our PBM images. Since OCR is CPU intensive, we’ll make use of parallel processing on all of our CPU cores with the parallel tool. You can install both by running

 

For each PBM file, create one PDF file:

To merge all the PDF files into one, run pdfunite from the poppler-utils package:

Success! And this is our result:

The PDF has been OCR'd and a selectable text layer has been generated
The PDF has been OCR’d and a selectable text layer has been generated

 

Hashing passwords: SHA-512 can be stronger than bcrypt (by doing more rounds)

'Hashed' brown potatoes. Hashing is important on more than just one level (picture by Jamie Davids, CC-BY-2.0)
‘Hashed’ brown potatoes. Hashing is important on more than just one level (picture by Jamie Davids, CC-BY-2.0)

On a server, user passwords are usually stored in a cryptographically secure way, by running the plain passwords through a one-way hashing function and storing its output instead. A good hash function is irreversible. Leaving dictionary attacks aside and by using salts, the only way to find the original input/password which generated its hash, is to simply try all possible inputs and compare the outputs with the stored hash (if the hash was stolen). This is called bruteforcing.

Speed Hashing

With the advent of GPU, FPGA and ASIC computing, it is possible to make bruteforcing very fast – this is what Bitcoin mining is all about, and there’s even a computer hardware industry revolving around it. Therefore, it is time to ask yourself if your hashes are strong enough for our modern times.

bcrypt was designed with GPU computing in mind and due to its RAM access requirements doesn’t lend itself well to parallelized implementations. However, we have to assume that computer hardware will continue to become faster and more optimized, therefore we should not rely on the security margin that bcrypt’s more difficult implementation in hardware and GPUs affords us for the time being.

For our real security margin, we should therefore only look at the so-called “variable cost factor” that bcrypt and other hashing functions support. With this feature, hashing function can be made arbitrarily slow and therefore costly, which helps deter brute-force attacks upon the hash.

This article will investigate how you can find out if bcrypt is available to you, and if not, show how to increase the variable cost factor of the more widely available SHA-512 hashing algorithm.

 

Do you have bcrypt?

If you build your own application (web, desktop, etc.) you can easily find and use bcrypt libraries (e.g. Ruby, Python, C, etc.).

However, some important 3rd party system programs (Exim, Dovecot, PAM, etc.) directly use glibc’s crypt() function to authenticate a user’s password. And glibc’s crypt(), in most Linux distributions except BSD, does NOT implement bcrypt! The reason for this is explained here.

If you are not sure if your Linux distribution’s glibc supports bcrypt ( man crypt won’t tell you much), simply try it by running the following C program. The function  crypt() takes two arguments:

For testing, we’ll firstly generate a MD5 hash because that is most certainly available on your platform. Add the following to a file called  crypttest.c:

Suppose xyz is our very short input/password. The $1$ option in the salt argument means: generate a MD5 hash. Type man crypt for an explanation.

Compile the C program:

Run it:

The output:

Next, in the C program change $1$ to $6$ which means SHA-512 hash. Recompile, run, and it will output:

Next, change $6$ to $2a$ whch means bcrypt. Recompile, run, and the output on my distribution (Debian) is:

So, on my Debian system, I’m out of luck. Am I going to change my Linux distribution just for the sake of bcrypt? Not at all. I’ll simply use more rounds of SHA-512 to make hash bruteforcing arbitrarily costly. It turns out that SHA-512 (bcrypt too, for that matter) supports an arbitrary number of rounds.

More rounds of SHA-512

glibc’s crypt() allows specification of the number of rounds of the main loop in the algorithm.  This feature is strangely absent in the man crypt documentation, but is documented here. glibc’s default of rounds for a SHA-512 hash is 5000. You can specify the number of rounds as an option in the salt argument. We’ll start with 100000 rounds.

In the C program, pass in as argument salt the string $6$rounds=100000$salthere . Recompile and measure the execution time:

It took 48 milliseconds to generate this hash. Let’s increase the rounds by a factor of 10, to 1 million:

We see that generation of one hash took 10 times longer, i.e. about half a second. It seems to be a linear relationship.

Glibc allows setting the cost factor (rounds) from 1000 to 999,999,999.

The Proof lies in the Bruteforcing

Remember that above we have chosen the input/password xyz (length 3). If we only allow for the characters [a-z], we get 26³ possibilities, or 17576 different passwords to try. With 48 ms per hash (100000 rounds) we expect the bruteforcing to have an upper limit of about 17576 x 0.048 s = 843 seconds (approx. 14 minutes).

We will use a hash solver called hashcat to confirm the expected bruteforcing time. Hashcat can use GPU computing, but I’m missing proper OpenCL drivers, so my results come from slower CPU computing only. But that doesn’t matter. The crux of the matter is that the variable cost factor (rounds) directly determines the cracking time, and it will have to be adapted to supposed computer hardware that might do the cracking.

Now we’re going to solve above SHA-512 hash (password/input was xyz) which was calculated with 100000 rounds:

Note that it took about 14 minutes to find the plaintext password xyz from the hash, which confirms above estimation.

Now let’s try to crack a bcrypt hash of the same input xyz which I generated on this website:

Note that bcrypt’s cost factor in this example is 08 (see bcrypt’s documentation for more info on this). The bruteforce cracking time of the same password took only 3 minutes and 30 seconds. This is about 3 times faster than the SHA-512 example, even though bcrypt is frequently described as being slow. This variable cost factor is often overlooked by bloggers who too quickly argue for one or the other hash function.

 

Conclusion

bcrypt is suggested in many blog posts in favor of other hashing algorithms. I have shown that by specifying a variable cost factor (rounds) to the SHA-512 algorithm, it is possible to arbitrarily increase the cost of bruteforcing the hash. Both SHA-512 and bcrypt can therefore be not said to be faster or slower than the other.

The variable cost factor (rounds) should be chosen in such a way that even resourceful attackers will not be able to crack passwords in a reasonable time, and that the number of authenticating users to your server won’t consume too much CPU resources.

When a hash is stolen, the salt may be stolen too, because they are usually stored together. Therefore, a salt won’t protect against too short or dictionary-based passwords. The importance of choosing long and random passwords with lower, upper and special symbols can’t be emphasized enough. This is also true when no hashes are stolen, and attackers simply try to authenticate directly with your application by simply trying every single password. With random passwords, and assuming that the hash function implementations are indeed non-reversible, it is trivial to calculate the cost of brute-forcing their hashes.