Digitize books: Searchable OCR PDF with text overlay from scanned or photographed books on Linux

Here is my method for digitizing books. It is a tutorial on how to produce searchable OCR (Optical Character Recognition) PDFs from a hardcopy book using free software tools on Linux distributions. You can probably find more convenient proprietary software, but that's not the objective of this post.

Important: I should not need to mention that depending on the copyright attached to a particular work, you may not be allowed to digitize it. Please inform yourself beforehand so that you don't break copyright law!

Digitize books

To scan a book, you basically have 2 choices:

  1. Scan each double page with a flatbed scanner
  2. Take a good photo camera, mount it on a tripod, have it point vertically down on the book, and then take photos of each double page. Professional digitizers use this method because it puts less strain on the originals.

No matter which method, the accuracy of OCR increases with the resolution and contrast of the images. The resolution should be high enough so that each letter is at least 25 pixels tall.

Since taking a photo is almost instant, you can be much faster with the photographing method than with a flatbed scanner. This is especially true for voluminous books which are hard to repeatedly take on and off a scanner. However, getting sharp high-resolution images with a camera is more difficult than with a flatbed scanner. So it's a tradeoff that depends on your situation, equipment and skills.

Using a flatbed scanner doesn’t need explanation, so I’ll only explain the photographic method next.

Photographing each page

Digitizing books (International Dunhuang Project, CC BY-SA 3.0)

If you use a camera, and you don’t have some kind of remote trigger or interval-trigger at hand, you would need 2 people: someone who operates the camera, and another one who flips the pages. You can easily scan 1 double page every 2 seconds once you get more skilled in the process.

Here are the steps:

  • Set the camera on a tripod and have it point vertically down. The distance between camera and book should be at least 1 meter to approximate orthogonal projection (imitating a flatbed scanner). Too much perspective projection would skew the text lines.
  • Place the book directly under the camera – avoid pointing the camera at any non-90-degree angles that would cause perspective skewing of the contents. Later we will deskew the images, but the less skewing you get at this point, the better.
  • Set up uniform lighting, as bright as you are able. Optimize lighting directions to minimize possible shadows (especially in the book fold). Don’t place the lights near the camera or it will cause reflections on paper or ink.
  • Set the camera to manual mode. Use JPG format. Turn the camera flash off. All pictures need to have uniform exposure characteristics to make later digital processing easier.
  • Maximize zoom so that a margin of about 1 cm around the book is still visible. This way, aligning of the book will take less time. The margin will be cropped later.
  • Once zoom and camera position is finalized, mark the position of the book on the table with tape. After moving the book, place it back onto the original position with help of these marks.
  • Take test pictures. Inspect and optimize the results by finding a balance between the following camera parameters:
    • Minimize aperture size (high f/value) to get sharper images.
    • Maximize the ISO value to minimize exposure time, so that camera shake has less of an effect. Bright lighting helps keep the ISO lower, which reduces noise.
    • Maximize resolution so that the letter size in the photos is at least 25 pixels tall. This will be important to increase the quality of the OCR step below, and you’ll need a good camera for this.
  • Take one picture of each double page.
One double page of a book that will be digitized. This is actually a scan, but you can also use a good photo camera. Make sure that letters are at least 25 pixels tall. Note that the right page is slightly rotated.

Image Preprocessing

Let’s remember our goal: We want a PDF …

  • which is searchable (the text should be selectable)
  • whose file size is minimized
  • has the same paper size as the original
  • is clearly legible

The following steps are the preprocessing steps to accomplish this. We will use ImageMagick command line tools (available for all platforms) and a couple of other programs, all available in Linux distributions.

A note on image formats

Your input files can be JPG or TIFF, or whatever format your scanner or camera supports. However, this format must also be supported by ImageMagick. We'll convert these images to black-and-white PBM images to save space and speed up further processing. PBM is a very simple, uncompressed image format that stores only 1 bit per pixel (2 colors). This image format can be embedded into the PDF directly, and it will be losslessly compressed extremely well, resulting in the smallest possible PDF size.

Find processing parameters by using just a single image

Before we process all the images as a batch, we'll pick just one image and find the right processing parameters. Copy one photograph into a new empty folder and do the following steps.

Converting to black and white

Suppose we have chosen one image in.JPG. Run:
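One way to do this is with ImageMagick's convert, along these lines (the exact values are only a starting point; you will tune them in a moment):

    convert in.JPG -colorspace Gray -brightness-contrast 0x10 -threshold 50% 1blackwhite.pbm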

Inspect the generated 1blackwhite.pbm file. Optimize the parameters threshold (50% in the above example), brightness (0 in the above example), and contrast (10 in the above example) for best legibility of the text.

Black-white conversion of the original image. Contrast and resolution is important.

Cropping away the margins

Next we will crop away the black borders so that the image will correspond to the paper size.
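Again with convert, a crop of this form does the job (the geometry values here are the ones discussed below; adapt them to your photos):

    convert 1blackwhite.pbm -crop 2400x2000+760+250 +repage 2cropped.pbm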

In this example, the cropped image will be a rectangle of 2400×2000 pixels, taken from the offset 760,250 of the input image. Inspect 2cropped.pbm until you get the parameters right; it will take some trial and error. The vertical book fold should be very close to the horizontal middle of the cropped image (important for the next step).

The cropped image

Split double pages into single pages
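The vertical cut down the middle can be done with convert's tile cropping, for example:

    convert 2cropped.pbm -crop 50%x100% +repage -scene 1 split%04d.pbm

(-scene 1 only makes the output numbering start at 0001 instead of 0000.)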

This will generate 2 images. Inspect split0001.pbm and split0002.pbm. You have to cut at exactly 50% of the width; any other value will produce more than 2 images.

Left split page
Right split page

Deskewing the image

Your text lines are probably not exactly horizontal (page angles, camera angles, perspective distortion, etc.). However, having exactly horizontal text lines is very important for accuracy of OCR software. We can deskew an image with the following command:
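For example, using convert's -deskew operator (the 40% threshold is a common choice; adjust if needed):

    convert split0001.pbm -deskew 40% 3deskewed.pbm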

Inspect the output file 3deskewed.pbm for best results.

The deskewed left page
The deskewed right page. Notice that the text lines are now perfectly horizontal. However, deskewing can have its limits, so the original image should already be good!

Process all the images

Now that you've found the parameters that work for you, it's simple to convert all of your images as a batch, by passing all the parameters at the same time to convert. Run the following in the folder where you stored all the JPG images (not in the folder where you did the previous single-image tests):
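A small shell loop that chains all of the steps from above could look like this (reuse the parameter values you found earlier; the output naming is only a suggestion):

    for f in *.JPG; do
        convert "$f" -colorspace Gray \
            -brightness-contrast 0x10 -threshold 50% \
            -crop 2400x2000+760+250 +repage \
            -crop 50%x100% +repage \
            -deskew 40% \
            -scene 1 "${f%.JPG}_%04d.pbm"
    done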

Now, for each .JPG input file, we’ll have two  .pbm output files. Inspect all .pbm files and make manual corrections if needed.

Note: If you have black borders on the pages, consider using unpaper to remove them. I’ll save writing about using unpaper for a later time.

 

Producing OCR PDFs with text overlay

The tesseract OCR engine can generate PDFs with a selectable text layer directly from our PBM images. Since OCR is CPU intensive, we’ll make use of parallel processing on all of our CPU cores with the parallel tool. You can install both by running
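On Debian and its derivatives, for example:

    sudo apt-get install tesseract-ocr parallel

Add a language pack such as tesseract-ocr-deu if your book is not in English.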

 

For each PBM file, create one PDF file:
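For example, letting GNU parallel start one tesseract process per CPU core (add -l <lang> for non-English books):

    parallel 'tesseract {} {.} pdf' ::: *.pbm

The trailing pdf argument tells tesseract to write a searchable PDF next to each input file.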

To merge all the PDF files into one, run pdfunite from the poppler-utils package:
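For example (writing the result one directory up so the wildcard doesn't pick it up on a re-run):

    pdfunite *.pdf ../book.pdf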

Success! And this is our result:

The PDF has been OCR'd and a selectable text layer has been generated

 

Hashing passwords: SHA-512 can be stronger than bcrypt (by doing more rounds)

'Hashed' brown potatoes. Hashing is important on more than just one level (picture by Jamie Davids, CC-BY-2.0)

On a server, user passwords are usually stored in a cryptographically secure way, by running the plain passwords through a one-way hashing function and storing its output instead. A good hash function is irreversible. Leaving dictionary attacks aside and by using salts, the only way to find the original input/password which generated its hash, is to simply try all possible inputs and compare the outputs with the stored hash (if the hash was stolen). This is called bruteforcing.

Speed Hashing

With the advent of GPU, FPGA and ASIC computing, it is possible to make bruteforcing very fast – this is what Bitcoin mining is all about, and there’s even a computer hardware industry revolving around it. Therefore, it is time to ask yourself if your hashes are strong enough for our modern times.

bcrypt was designed with GPU computing in mind and due to its RAM access requirements doesn’t lend itself well to parallelized implementations. However, we have to assume that computer hardware will continue to become faster and more optimized, therefore we should not rely on the security margin that bcrypt’s more difficult implementation in hardware and GPUs affords us for the time being.

For our real security margin, we should therefore only look at the so-called "variable cost factor" that bcrypt and other hashing functions support. With this feature, the hashing function can be made arbitrarily slow and therefore costly, which helps deter brute-force attacks on the hash.

This article will investigate how you can find out if bcrypt is available to you, and if not, show how to increase the variable cost factor of the more widely available SHA-512 hashing algorithm.

 

Do you have bcrypt?

If you build your own application (web, desktop, etc.) you can easily find and use bcrypt libraries (e.g. Ruby, Python, C, etc.).

However, some important 3rd party system programs (Exim, Dovecot, PAM, etc.) directly use glibc's crypt() function to authenticate a user's password. And glibc's crypt(), in most Linux distributions (unlike the BSDs), does NOT implement bcrypt! The reason for this is explained here.

If you are not sure whether your Linux distribution's glibc supports bcrypt ( man crypt won't tell you much), simply try it by running the following C program. The function crypt() takes two arguments: the plaintext password and a salt string whose prefix selects the hashing algorithm.

For testing, we'll first generate an MD5 hash because that is most certainly available on your platform. Add the following to a file called  crypttest.c:
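A minimal version of such a program could look like this (the original may differ in details):

    #define _XOPEN_SOURCE   /* for crypt() in unistd.h with glibc */
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        /* crypt(password, salt): the prefix of the salt string selects
           the algorithm -- "$1$" means MD5 */
        char *hash = crypt("xyz", "$1$salthere");
        printf("%s\n", hash ? hash : "crypt() failed");
        return 0;
    }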

Suppose xyz is our very short input/password. The $1$ option in the salt argument means: generate a MD5 hash. Type man crypt for an explanation.

Compile the C program:
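With gcc, linking against the crypt library:

    gcc crypttest.c -o crypttest -lcrypt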

Run it:

The output:

Next, in the C program change $1$ to $6$ which means SHA-512 hash. Recompile, run, and it will output:

Next, change $6$ to $2a$ which means bcrypt. Recompile, run, and the output on my distribution (Debian) is:

So, on my Debian system, I’m out of luck. Am I going to change my Linux distribution just for the sake of bcrypt? Not at all. I’ll simply use more rounds of SHA-512 to make hash bruteforcing arbitrarily costly. It turns out that SHA-512 (bcrypt too, for that matter) supports an arbitrary number of rounds.

More rounds of SHA-512

glibc's crypt() allows specification of the number of rounds of the main loop in the algorithm. This feature is strangely absent from the man crypt documentation, but is documented here. glibc's default number of rounds for a SHA-512 hash is 5000. You can specify the number of rounds as an option in the salt argument. We'll start with 100000 rounds.

In the C program, pass in as argument salt the string $6$rounds=100000$salthere . Recompile and measure the execution time:
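The shell's time builtin is enough for a rough measurement:

    time ./crypttest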

It took 48 milliseconds to generate this hash. Let’s increase the rounds by a factor of 10, to 1 million:

We see that generation of one hash took 10 times longer, i.e. about half a second. It seems to be a linear relationship.

Glibc allows setting the cost factor (rounds) from 1000 to 999,999,999.

The Proof lies in the Bruteforcing

Remember that above we have chosen the input/password xyz (length 3). If we only allow for the characters [a-z], we get 26³ possibilities, or 17576 different passwords to try. With 48 ms per hash (100000 rounds) we expect the bruteforcing to have an upper limit of about 17576 x 0.048 s = 843 seconds (approx. 14 minutes).

We will use a hash solver called hashcat to confirm the expected bruteforcing time. Hashcat can use GPU computing, but I’m missing proper OpenCL drivers, so my results come from slower CPU computing only. But that doesn’t matter. The crux of the matter is that the variable cost factor (rounds) directly determines the cracking time, and it will have to be adapted to supposed computer hardware that might do the cracking.

Now we’re going to solve above SHA-512 hash (password/input was xyz) which was calculated with 100000 rounds:
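With the hash saved to a file (here called hash.txt), a hashcat mask attack over three lowercase letters looks roughly like this (-m 1800 is hashcat's mode for sha512crypt, -a 3 selects a brute-force mask attack):

    hashcat -m 1800 -a 3 hash.txt ?l?l?l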

Note that it took about 14 minutes to find the plaintext password xyz from the hash, which confirms above estimation.

Now let’s try to crack a bcrypt hash of the same input xyz which I generated on this website:
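The corresponding hashcat invocation for bcrypt is mode 3200, again with the hash saved to a file:

    hashcat -m 3200 -a 3 bcrypt-hash.txt ?l?l?l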

Note that bcrypt's cost factor in this example is 08 (see bcrypt's documentation for more info on this). The brute-force cracking of the same password took only 3 minutes and 30 seconds, about four times faster than the SHA-512 example, even though bcrypt is frequently described as being slow. This variable cost factor is often overlooked by bloggers who too quickly argue for one or the other hash function.

 

Conclusion

bcrypt is suggested in many blog posts in favor of other hashing algorithms. I have shown that by specifying a variable cost factor (rounds) for the SHA-512 algorithm, it is possible to arbitrarily increase the cost of bruteforcing the hash. Neither SHA-512 nor bcrypt can therefore be said to be categorically faster or slower than the other.

The variable cost factor (rounds) should be chosen in such a way that even resourceful attackers will not be able to crack passwords in a reasonable time, while the authentication of legitimate users won't consume too much of your server's CPU resources.

When a hash is stolen, the salt may be stolen too, because they are usually stored together. Therefore, a salt won’t protect against too short or dictionary-based passwords. The importance of choosing long and random passwords with lower, upper and special symbols can’t be emphasized enough. This is also true when no hashes are stolen, and attackers simply try to authenticate directly with your application by simply trying every single password. With random passwords, and assuming that the hash function implementations are indeed non-reversible, it is trivial to calculate the cost of brute-forcing their hashes.

100% HTTPS in the internet? Non-Profit makes it possible!

HTTPS on 100% of websites in the internet? This just got a lot easier! Let's Encrypt is a free, automated, and open certificate authority (CA), run for the public's benefit. Let's Encrypt is a service provided by the Internet Security Research Group (ISRG), a Section 501(c)(3) Non-Profit entity dedicated to reducing financial, technological, and educational barriers to secure communication over the Internet.

Let's Encrypt offers free-of-cost certificates that can be used for HTTPS websites, even when these websites are run for commercial purposes. Unlike traditional CAs, they don't require cumbersome registration, paperwork, set-up and payment. The certificates are fetched in an automated way through an API (the ACME Protocol — Automatic Certificate Management Environment), which includes steps to prove that you have control over a domain.

Dedicated to transparency, generated certificates are registered and submitted to Certificate Transparency logs. Here is the generous legal Subscriber Agreement.

Automated API? This sounds too complicated! It is actually not. There are a number of API libraries and clients available that do the work for you. One of them is Certbot. It is a regular command-line program written in Python and the source code is available on Github.

After downloading the certbot-auto script (see their documentation), fetching certificates consists of just one command line (in this example certs for 3 domains are fetched in one command with the -d switch):
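The invocation looks roughly like this (the webroot path is just an example; use whatever directory your web server serves for these domains):

    ./certbot-auto certonly --webroot -w /var/www/example \
        -d example.com -d www.example.com -d blah.example.com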

With the -w  flag you tell the script where to put temporary static files (a sub-folder .well-known  will be created) that, during the API control flow, serve as proof to the CA’s server that you have control over the domain. This is identical to Google’s method of verifying a domain for Google Analytics or Google Webmaster Tools by hosting a static text file.

Eventually, the (already chained, which is nice!) certificate and private key are copied into /etc/letsencrypt/live/example.com/ :
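Typically these are the following two files (cert.pem and chain.pem are left in the same directory as well):

    /etc/letsencrypt/live/example.com/fullchain.pem
    /etc/letsencrypt/live/example.com/privkey.pem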

Then it is only a matter of pointing your web server (Nginx, Apache, etc.) to these two files, and that’s trivial.

Let’s Encrypt certificates are valid for 90 days. The automatic renewal of ALL certificates that you have loaded to your machine is as easy as …
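With the certbot-auto client used above, that is a single command:

    ./certbot-auto renew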

… which they suggest should be put into a Cron job, run twice daily. It will renew the certificates just in time. No longer do you have to set a reminder in your calendar to renew a certificate, and then copy-paste it manually!

A bit of a downside is that Let's Encrypt unfortunately doesn't support wildcard domain certificates. For these, you still have to pay money to other CAs that support them. But in the code example shown above, you would generate only 1 certificate for the domain example.com and its two subdomains www.example.com and blah.example.com. The two subdomains are listed in the Subject Alternative Name field of the certificate, which is as close to wildcard subdomains as it gets. But except for SaaS providers and other specialized businesses, not having wildcard certificates should not be too big of an issue, especially when one can automate the certificate setup.

On the upside, they even made sure that their certificates work down to Windows XP!

Today, I set up 3 sites with Let’s Encrypt (one of them had several subdomains), and it was a matter of a few minutes. It literally took me longer to configure proper redirects in Nginx (no fault of Nginx, I just keep forgetting how it’s done properly) than to fetch all the certificates. And it even gave me time to write this blog post!

Honestly, I never agreed with the fact that for commercial certificate authorities, one has to pay 1000, 100 or even 30 bucks per certificate per year. Where’s the work invested into such a certificate that is worth so much? The generation of a certificate is automated, and is done in a fraction of a second on the CPU. Anyway, that now seems to be a thing of the past.

A big Thumbs-up and Thanks go to the Let’s Encrypt CA, the ISRG, and to Non-Profit enterprises in general! I believe that Non-Profits are the Magic Way of the Future!

Icon made by Freepik from www.flaticon.com 

DIY Piezo-Electric Touch Probe with high sensitivity

I built this mechanical piezo-electric touch probe recently as an extension to a CNC machine to sense the depth of surfaces. It works with the grbl CNC controller and probably others.

The goal was to flip a TTL (5V) signal whenever the probe was touching a surface mechanically. Unlike classical mechanical switches which have moving parts, this probe is ‘solid state’ and must emit a signal as early as possible when there is mechanical contact, even when the probe is touched ever so slightly, and even on non-conducting materials.

The idea of piezoceramic touch probes is not new. They are based on the bending of piezo ceramic, which is constructed behind the surface of the probe. The benefit of using a piezo touch switch is its ability to interact with surfaces of virtually any type of material.

The cheapest professional touch probes cost several hundred Euros, and upwards. That was out of the question for my purposes. However, with a bit of ingenuity, I managed to construct a probe from scrap metal and electronic parts that cost less than 10 euros.

The heart of the sensor is a piezoceramic loudspeaker/beeper. However, it can also act as a "microphone". When the piezoceramic surface is deformed mechanically (either via sound or by direct pressure), a voltage is generated (up to 2 or 3 volts).

The tricky part is that upon pressing, charge is generated, which dissipates quickly via the piezoceramic material itself. This means that upon releasing the mechanical pressure, charge of the opposite sign is generated. This negative voltage is a problem: one cannot simply feed this signal into a TTL-level microcontroller (µC) without risk of damaging it. Also, slight touches would only generate millivolts, below the threshold voltage of a µC pin. So, electronics have to be made that act as an analog-to-digital converter. Specifically, I implemented an inverting Schmitt-Trigger (see below).

Here is the piezoceramic loudspeaker/beeper that I used (less than 1 Euro):

Piezoceramic beeper

 

I simply removed the plastic back plate from the beeper to expose the piezoceramic element and, in its center, glued a short wood screw onto it with a bit of epoxy glue. Then I bent some metal and drilled some holes to create a stable holder for the beeper (this will be mounted directly on the CNC machine). The nice feature of this particular beeper is that the piezoceramic element is suspended by its plastic packaging, which means that I could clamp it into the metal holder without deforming the piezoceramics:

Piezoceramic beeper as touch probe

 

Now that the mechanical part is done, on to the electronics!

Electronics

Goal: The electronics will receive the analog (positive or negative) voltage of the beeper, and output a well-defined TTL level between 0 and 5V that can be directly and safely consumed by a microcontroller. The values of the electrical components have to be chosen such that the slightest mechanical deformation of the piezo element will immediately flip the TTL signal.

As I have already finished and tested the probe while writing this, I have tried to apply such a gentle pressure so as to NOT make the TTL signal flip. It is hard to do, which means that it is a very sensitive sensor!

My implementation will flip the TTL output from high to low when the input surpasses 50mV. It will flip the TTL output from low to high when the input goes below 0V.

The schematic below shows:

Terminal T1: 5V supply voltage for the OpAmp
Terminal T2: The TTL output (0V or 5V) that can be directly connected with a µC
Terminal T3: The analog input from the piezo element
Terminal T4: GND

Touch Probe analog to digital conversion circuit

 

 

It consists of the following components:

Inverting Schmitt trigger (U1, R1, R2)

I chose an easily available OpAmp, the LM358N. I chose the inverting configuration to completely isolate input from output. I chose R1 and R2 such that the input threshold voltage would be approx. 50 mV (a very light touch of the probe generates this easily). The threshold voltage is simply the voltage at the midpoint of the voltage divider formed by R1 and R2. I chose R1 = 5.4 kΩ and R2 = 55 Ω.

V_threshold = Vcc × R2 / (R1 + R2) = 5 V × 55 Ω / (5400 Ω + 55 Ω) ≈ 50 mV

Pulldown resistor (R3)

The negative input of the OpAmp and the piezo element on T3 are very high impedance and would float if not pulled down. In addition, the real OpAmp leaks a bit of charge into this terminal, bringing it up to 60mV which is even above the switching threshold. To bring the voltage much closer to 0V, I added a rather strong pull-down resistor R3 = 87k. The piezo has no problems with this load.

Low-pass filter (R4, C4)

In my wiring, T3 will be connected to a long line subject to interference. R4 together with C4 forms a simple low-pass filter with a time constant t = 0.24 µs, which is shorter than the time scale expected from the signal of the piezo element.
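As an illustration (these component values are examples, not necessarily the ones on my board): R4 = 2.4 kΩ and C4 = 100 pF give t = R4 · C4 = 2400 Ω × 100 pF = 0.24 µs.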

LED (R5, D1)

The LED and its limiting resistor R5 = 480 Ohms (for approx 10mA) is just there for immediate optical feedback.

 

Implementation on a development board

The above schematic can be transferred into the following equivalent simple development board wiring layout (C4 is labelled C1):

 

Touch Probe implementation

 

The resistors and capacitors can be easily exchanged for trial-and-error fine-tuning:

Touch Probe electronics on a development board

 

Verification with Oscilloscope

The following probes were connected:

Yellow: The input of the piezo element before low-pass filter
Blue: The input of the piezo element after low-pass filter
Pink: The output of the circuit

Touch Probe test setup

 

 

 

Input voltage without pull down resistor

Input voltage without pull-down resistor (yellow graph). The OpAmp is leaking some charge into the input terminal, causing the input voltage to rise to about 60 mV. Connecting a pull-down resistor will rectify this.

 

 

Input voltage with pulldown resistor

Input with pull-down resistor applied. Input voltage is now down to less than 5mV.

 

 

 

Triggering1

Touching the piezo probe increases voltage on the input. At 50mV, the output switches to LOW.

 

 

 

Triggering2

A faster, oscillating signal. When going over 50mV, the output switches to LOW. When going below 0V, the output switches back to HIGH.

 

Triggering3

An even faster oscillating signal (I hit the probe holder with a screwdriver). Behavior is as expected.

OpenGL programming in Python: pyglpainter

This was a recent hobby programming project of mine for use in a CNC application, using Python and OpenGL. The source code is available at https://github.com/michaelfranzl/pyglpainter.

Simple OpenGL output using pyglpainter library

This Python module provides the class PainterWidget, extending PyQt5's QGLWidget class with boilerplate code necessary for applications which want to build a classical orthogonal 3D world in which the user can interactively navigate with the mouse via the classical (and expected) Pan-Zoom-Rotate paradigm implemented via a virtual trackball (using quaternions for rotations).

This class is especially useful for technical visualizations in 3D space. It provides a simple Python API to draw raw OpenGL primitives (LINES, LINE_STRIP, TRIANGLES, etc.) as well as a number of useful composite primitives rendered by this class itself (Grid, Star, CoordSystem, Text, etc.; see the files in classes/items). As a bonus, all objects/items can either be drawn as real 3D world entities which optionally support "billboard" mode (fully camera-aligned or arbitrary-axis aligned), or as a 2D overlay.

It uses the “modern”, shader-based, OpenGL API rather than the deprecated “fixed pipeline” and was developed for Python version 3 and Qt version 5.

Model, View and Projection matrices are calculated on the CPU, and then utilized in the GPU.

Qt has been chosen not only because it provides the GL environment but also vector, matrix and quaternion math. A port of this Python code into native Qt C++ is therefore trivial.

Look at example.py, part of this project, to see how this class can be used. If you need more functionality, consider subclassing.

Most of the time, calls to item_create() are enough to build a 3D world with interesting objects in it (the name for these objects here is “items”). Items can be rendered with different shaders.

This project was originally created for a CNC application, but then extracted from this application and made multi-purpose. The author believes it contains the simplest and shortest code to quickly utilize the basic and raw powers of OpenGL. To keep code simple and short, the project was optimized for technical, line- and triangle based primitives, not the realism that game engines strive for. The simple shaders included in this project will draw aliased lines and the output therefore will look more like computer graphics of the 80’s. But “modern” OpenGL offloads many things into shaders anyway.

This class can either be used for teaching purposes, experimentation, or as a visualization backend for production-class applications.

Mouse Navigation

Left Button drag left/right/up/down: Rotate camera left/right/up/down

Middle Button drag left/right/up/down: Move camera left/right/up/down

Wheel rotate up/down: Move camera ahead/back

Right Button drag up/down: Move camera ahead/back (same as wheel)

The FOV (Field of View) is held constant. "Zooming" is implemented by moving the camera forward along its look axis, which is more natural than changing the FOV of the camera. Even cameras in movies and TV series nowadays very, very rarely zoom.

 

Resume rsync transfers with the --partial switch

Recently I wanted to rsync a 16GB file to a remote server. The ETA was calculated as 23 hours and, as it usually happens, the file transfer aborted after about 20 hours due to a changing dynamic IP address of the modem. I thought: “No problem, I just re-run the rsync command and the file transfer will resume where it left off.” Wrong! Because by default, on the remote side, rsync creates a temporary file beginning with a dot, and once rsync is aborted during the file transfer, this temporary file is deleted without a trace! Which, in my case, meant that I wasted 20 hours of bandwidth.

It turns out that one can tell rsync to keep temporary files by passing the argument --partial . I tested it, and the temporary file indeed is kept around even after the file transfer aborts prematurely. Then, when simply re-running the same rsync command, the file transfer is resumed.
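For example (host and paths made up):

    rsync -av --partial bigfile.img user@server:/backups/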

In my opinion, rsync should adopt this behavior by default. Simple thing to fix, but definitely an argument that should be passed every time!

Added: Simply use -P  ! This implies --partial  and you’ll also see a nice progress output for free!

Unprivileged Unix Users vs. Untrusted Unix Users. How to harden your server security by confining shell users into a minimal jail

As a server administrator, I recently discovered a severe oversight of mine, one that was so big that I didn’t consciously see it for years.

What can Unprivileged Unix Users do on your server?

Any so-called "unprivileged Unix user" who has SSH access to a server (be it simply for the purpose of rsync'ing files) is not really "unprivileged" as the word may suggest. Due to the world-readable permissions of many system directories, set by default in many Linux distributions, such a user can read a large percentage of the directories and files existing on the server, many of which can and should be considered secret. For example, on my Debian system, the default permissions are:

/etc: world-readable including most configuration files, amongst them passwd which contains plain-text names of other users
/boot: world-readable including all files
/home: world-readable including all subdirectories
/mnt: world-readable
/src: world-readable
/srv: world-readable
/var: world-readable
etc.

Many questions are asked about how to lock a particular user into their home directory. User "zwets" on askubuntu.com explained that this is beside the point and even "silly":

A user … is trusted: he has had to authenticate and runs without elevated privileges. Therefore file permissions suffice to keep him from changing files he does not own, and from reading things he must not see. World-readability is the default though, for good reason: users actually need most of the stuff that's on the file system. To keep users out of a directory, explicitly make it inaccessible. […]

Users need access to commands and applications. These are in directories like /usr/bin, so unless you copy all commands they need from there to their home directories, users will need access to /bin and /usr/bin. But that’s only the start. Applications need libraries from /usr/lib and /lib, which in turn need access to system resources, which are in /dev, and to configuration files in /etc. This was just the read-only part. They’ll also want /tmp and often /var to write into. So, if you want to constrain a user within his home directory, you are going to have to copy a lot into it. In fact, pretty much an entire base file system — which you already have, located at /.

I agree with this assessment. Traditionally, on shared machines, users needed to have at least read access to many things. Thus, once you give someone even just “unprivileged” shell access, this user — not only including his technical knowledge but also the security of the setup of his own machine which might be subject to exploits — is still explicitly and ultimately trusted to see and handle confidentially all world-readable information.

The problem: Sometimes, being “unprivileged” is not enough. The expansion towards “untrusted” users.

As a server administrator you sometimes have to give shell access to a user or machine that is not ultimately trusted. This happens in the very common case where you transfer files via rsync (backups, anyone?) to/from machines that do not belong to you (e.g. a regular backup service for clients), or even machines which do belong to you but which are not physically guarded 24/7 against untrusted access. rsync, however, requires shell access, period. And if that shell access is granted, rsync can read out all world-readable information, even when you have put rbash ("restricted bash") or rssh ("restricted ssh") in place as a login shell. So now, in our scenario, we are facing a situation where someone ultimately not trusted can rsync all world-readable information from the server to anywhere he wants, simply by doing:
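Something along these lines (host name made up):

    rsync -av untrusted@server:/ ./server-copy/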

One may suggest to simply harden the file permissions for those untrusted users, and I agree that this is a good practice in any case. But is it practical? Hardening the file permissions of dozens of configuration files in /etc  alone is not an easy task and is likely to break things. For one obvious example: Do I know, without investing a lot of research (including trial-and-error), which consequences chmod o-rwx /etc/passwd  will have? Which programs for which users will it break? Or worse, will I even be able to reboot the system?

And what if you have a lot of trusted users working on your server, all having created many personal files, all relying on the world-readable nature as a way to share those files, and all assuming that world-readable does not literally mean ‘World readable’? Grasping the full extent of the user collaboration and migrating towards group-readable instead of world-readable file permissions likely will be a lot of work, and again, may break things.

In my opinion, for existing server machines, this kind of work is too expensive to be justified by the benefits.

So, no matter from which angle you look at this problem, having ultimately non-trusted users on the system is a problem that can only be satisfactorily solved by jailing them into some kind of chroot directory, and allowing only those tasks that are absolutely necessary for them (in our scenario, rsync only). Notwithstanding that, and to repeat, users who are not jailed must be considered as ultimately trusted.

The solution: Low-level system utilities and a minimal jail

For the above reasons regarding untrusted users, 'hardening' shell access via rbash or even rssh is just a cosmetic measure that still doesn't prevent world-readable files from being literally readable by the World (you have to assume that untrusted users will share data liberally). rssh has a built-in feature for chroot'ing, but it was originally written for RedHat and the documentation about it is vague, and it wouldn't even accept a chroot environment created by debootstrap.

Luckily, there is a low-level solution, directly built into the Linux kernel and core packages. We will utilize the ability of PAM to 'jailroot' an SSH session on a per-user basis, and we will manually create a very minimal chroot jail for this purpose. We will jail two untrusted system users called "jailer" and "inmate" into the same jail. Each user will be able to rsync files, but will neither be able to escape the jail nor see the files of the other.

The following diagram shows the directory structure of the jail that we will create:
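Roughly, it will look like this (the exact library paths depend on your distribution and architecture):

    /home/minjail
    ├── bin
    │   └── ls
    ├── home
    │   ├── inmate
    │   └── jailer
    ├── lib
    │   └── (the shared libraries that ls and rsync need)
    └── usr
        └── bin
            └── rsync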

The following commands are based on Debian and have been tested in Debian Wheezy.

First, create user accounts for two ultimately untrusted users, called “jailer” and “inmate” (note that the users are members of the host operating system, not the jail):
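For example, with Debian's adduser:

    adduser jailer
    adduser inmate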

Their home directories will be /home/jailer  and /home/inmate  respectively. They need home directories so that you can set up SSH keys (via ~/.ssh/authorized_keys ) for passwordless-login later.

Second, install the PAM module that allows chroot’ing an authenticated session:
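On Debian, the package is called libpam-chroot:

    apt-get install libpam-chroot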

The installed configuration file is /etc/security/chroot.conf . Into this configuration file, add
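The format is a user name followed by the chroot directory, so we add:

    jailer    /home/minjail
    inmate    /home/minjail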

These two lines mean that after completing the SSH authentication, the users jailer and inmate will be jailed into the directory /home/minjail of the host system. Note that both users will share the same jail.

Third, we have to enable the “chroot” PAM module for SSH sessions. For this, edit /etc/pam.d/sshd  and add to the end
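The line to add is of this form (required means the session is refused if the chroot cannot be entered):

    session    required    pam_chroot.so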

After saving the file, the changes are immediately valid for the next initiated SSH session — thankfully, there is no need to restart any service.

Making a minimal jail for chroot

All that is missing now is a minimal jail, to be made in /home/minjail . We will do this manually, but it would be easy to make a script that does it for you. In our jail, and for our described scenario, we only need to provide rsync to the untrusted users. And just for experimentation, we will also add the ls command. However, the policy is: only add to the jail what is absolutely necessary. The more binaries and libraries you make available, the higher the likelihood that bugs may be exploited to break out of the jail. We do the following as the root user:
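First the skeleton of the jail (the lib directory must mirror wherever your distribution keeps its shared libraries, e.g. lib/i386-linux-gnu or lib/x86_64-linux-gnu):

    mkdir -p /home/minjail/bin
    mkdir -p /home/minjail/usr/bin
    mkdir -p /home/minjail/lib
    mkdir -p /home/minjail/home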

Next, create the home directories for both users:
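For example:

    mkdir -p /home/minjail/home/jailer
    mkdir -p /home/minjail/home/inmate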

Next, for each binary you want to make available, repeat the following steps (we do it here for rsync , but repeat the steps for ls ):

1. Find out where a particular program lives:
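The which command answers that:

    which rsync
    # e.g. /usr/bin/rsync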

2. Copy this program into the same location in the jail:
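Assuming it lives in /usr/bin as in the example above:

    cp /usr/bin/rsync /home/minjail/usr/bin/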

3. Find out which libraries are used by that binary
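ldd lists them:

    ldd /usr/bin/rsync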

4. Copy these libraries into the corresponding locations inside of the jail (linux-gate.so.1 is a virtual file in the kernel and doesn’t have to be copied):
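You can copy them one by one, or let a small loop pick up every absolute path that ldd printed (this automatically skips linux-gate.so.1, which has no path):

    for lib in $(ldd /usr/bin/rsync | grep -o '/[^ ]*'); do
        mkdir -p "/home/minjail$(dirname "$lib")"
        cp -v "$lib" "/home/minjail$(dirname "$lib")/"
    done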

After these 4 steps have been repeated for each program, finish the minimal jail with proper permissions and ownership:
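To arrive at the permissions described in the next paragraph:

    chown -R root:root /home/minjail
    chmod 751 /home/minjail/home
    chown jailer:jailer /home/minjail/home/jailer
    chown inmate:inmate /home/minjail/home/inmate
    chmod 750 /home/minjail/home/jailer /home/minjail/home/inmate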

The permission 751 of the ./home  directory ( drwxr-x--x  root  root) will allow any user to enter the directory, but forbids them from seeing which subdirectories it contains (information about other users is considered private). The permission 750 of the user directories ( drwxr-x--- ) makes sure that only the corresponding user will be able to enter.

We are all done!

Test the jail from another machine

As stated above, our scenario is to allow untrusted users to rsync  files (e.g. as a backup solution). Let’s try it, in both directions!

Testing file upload via rsync

Both users “jailer” and “inmate” can rsync a file into their respective home directory inside the jail. See:
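For instance (server name made up; note that inside the jail the home directory is simply /home/jailer):

    rsync -v testfile.txt jailer@server:/home/jailer/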

To allow password-less transfers, set up a public key in /home/jailer/.ssh/authorized_keys  of the host operating system.

Testing file download via rsync

This is the real test. We will attempt to download as much as possible with rsync (we will try to get the root directory recursively):
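Again run from the remote machine:

    rsync -av jailer@server:/ ./jail-dump/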

Here you see that all world-readable files were transferred (the programs ls and rsync and their libraries), but nothing from the home directory inside of the jail.

However, rsync succeeds in grabbing the user's own home directory. This is expected and desired behavior:

 

Testing shell access

We have seen that we cannot do damage or reveal sensitive information with rsync . But as stated above, rsync  cannot be had without shell access. So now, we’ll log in to a bash shell and see which damage we can do:
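We pass the shell as an explicit argument (see the note below for why):

    ssh jailer@server "/bin/bash -i"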

Put "/bin/bash -i"  as an argument to use the host system's bash in interactive mode; otherwise you would have to set up special device nodes for the terminal inside of the jail, which makes it more vulnerable to exploits.

We are now dumped to a primitive shell:

At this point, you can explore the jail. Try to do some damage (Careful! Make sure you're not in your live host system; prefer an experimental virtual machine instead!) or try to read the other user's files. However, you will likely not succeed, since all you have available are Bash's builtin commands plus rsync and ls, all chroot'ed by a system call to the host's kernel.

If any reader of this article should discover exploits of this method, please leave a comment.

Conclusion

I have argued that the term "unprivileged user" on a Unix-like operating system can be misunderstood, and that the term "untrusted user" should be introduced in certain use cases for clarity. I have presented an outline of an inexpensive method to accommodate untrusted users on a shared machine for various purposes with the help of the low-level Linux kernel system call chroot(), through a PAM module called pam_chroot.so, as well as a minimal, manually created jail. This method is still experimental and has not been entirely vetted by security specialists.

 

 

Obscure error: dd: failed to open: No medium found

I got this error message when I wanted to use dd  to copy a raw image to a MicroSD card. You will see this message if you have unmounted the card with a file manager like Nautilus. I guess it doesn't unmount the card fully, or unmounts it in a special way.

The fix is easy:

Unmount the card via the terminal:
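For example (the device name is only an example; check dmesg or lsblk for yours):

    umount /dev/mmcblk0p1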

The dd  command will work now.

Another variant of this error can be seen when using bmaptool :

Exim and Spamassassin: Rewriting headers, adding SPAM and Score to Subject

This tutorial is a follow-up to my article Setting up Exim4 Mail Transfer Agent with Anti-Spam, Greylisting and Anti-Malware.

I finally got around solving this problem: If an email has a certain spam score, above a certain threshold, Exim should rewrite the Subject header to contain the string  *** SPAM (x.x points) *** {original subject}

Spamassassin has a configuration option to rewrite a subject header in its configuration file /etc/spamassassin/local.cf  …
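The option in question looks like this (_SCORE_ is a SpamAssassin template tag that expands to the numeric score):

    rewrite_header Subject *** SPAM (_SCORE_ points) ***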

… but this is misleading, because it is used only when Spamassassin is used stand-alone. If used in combination with a MTA (Mail Transfer Agent) like Exim, the MTA is ultimately responsible for modifying emails. So, the solution lies in the proper configuration of Exim. To modify an already accepted message, the Exim documentation suggests a System Filter. You can set it up like this:

Enable the system filter in your main Exim configuration file. Add to it:
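That is the system_filter main option, pointing at the filter file we create in the next step:

    system_filter = /etc/exim4/system.filter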

Then create the file  /etc/exim4/system.filter , set proper ownership and permission, then insert:

This means: If the header  $header_X-Spam_score_int  is present (has been added by Exim in the acl_check_data  ACL section, see my previous tutorial), and is more than 50 (this is 5.0), rewrite the Subject header. The regular expression checks if the spam score is valid and not negative.

Note that in the acl_check_data section of the Exim config, you can deny a message above a certain spam score threshold. This means, in combination with this System Filter, you can do the following:

  • If spam score is above 10, reject/bounce email from the ACL.
  • If spam score is above 5, rewrite the Subject.

XeLaTeX: Unicode font fallback for unsupported characters

Traditionally I used LaTeX to typeset documents, and it works perfectly when you have a single language script (e.g. only English or German). But as soon as you want to typeset Unicode text in multiple languages, you're quickly out of luck. LaTeX is just not made for Unicode, and you need a lot of helper packages, documentation reading, and complicated configuration in your document to get it all right.

All I wanted was to typeset the following Unicode text. It contains regular latin characters, chinese characters, modern greek and polytonic (ancient) greek.

Latin text. Chinese text: 紫薇北斗星  Modern greek: Διαμ πριμα εσθ ατ, κυο πχιλωσοπηια Ancient greek: Μῆνιν ἄειδε, θεά, Πηληϊάδεω Ἀχιλῆος. And regular latin text.

I thought it was a simple task. I thought: let’s just use XeLaTeX, which has out-of-the-box Unicode support. In the end, it was a simple task, but only after struggling to solve a particular problem. To show you the problem, I ran the following straightforward code through XeLaTeX…
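In essence, the document was nothing more than this (a bare article class plus the test text):

    \documentclass{article}
    \begin{document}
    Latin text. Chinese text: 紫薇北斗星 Modern greek: Διαμ πριμα εσθ ατ,
    κυο πχιλωσοπηια Ancient greek: Μῆνιν ἄειδε, θεά, Πηληϊάδεω Ἀχιλῆος.
    And regular latin text.
    \end{document}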

… and the following PDF was produced:

XeLaTeX rendering Computer Modern font with unsupported unicode characters

It turns out that the missing unicode characters are not XeLaTeX's fault. The problem is that the font used (XeLaTeX by default uses a slightly more encompassing Computer Modern font) does not have all Unicode characters implemented. To implement all Unicode characters in a single font (about 1.1 million code points) is a monumental task, and there are only a small handful of fonts whose maintainers aim for full coverage (one of them is GNU FreeFont, which is already part of the Debian distribution, and therefore available to XeLaTeX).

So, I thought, let’s just use a font which is dedicated to unicode. I selected in my document the pretty Junicode font:
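With the fontspec package, selecting it is a one-liner in the preamble:

    \usepackage{fontspec}
    \setmainfont{Junicode}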

The result was:

XeLaTex and Junicode font with chinese and greek characters

Now, greek worked, but still no chinese characters. It turned out that even fonts which are dedicated to unicode do not yet have all possible characters implemented. Because it’s a lot of work to produce high-quality fonts with matching styles for millions of possible characters.

So, how do regular web browsers or office applications do it? They use a mechanism called font fallback. When a particular character is not implemented in the chosen main font, another font is silently used which does have this particular character implemented. XeLaTeX can do the same with a package called ucharclasses, and it gives you full control over the fallback font selection process. The ucharclasses documentation gives an example using the \fontspec  font selection. I decided to use the font IPAexMincho which supports chinese characters. So I added to my document:
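Roughly the following (the transition macros and Unicode block names are the ones from the ucharclasses documentation; double-check them for your version):

    \usepackage{ucharclasses}
    \setTransitionTo{CJKUnifiedIdeographs}{\fontspec{IPAexMincho}}
    \setTransitionFrom{CJKUnifiedIdeographs}{\fontspec{Junicode}}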

… but when running XeLaTeX with this addition, ucharclasses somehow entered an endless loop with high CPU usage for the TexLive 2014 distribution (part of Debian). It hung at the line:

Endless googling didn’t bring up any useful hints. Something must have changed in the internals, and the ucharclasses documentation needs updating. In any event, it took me 4 hours to find a fix. The solution was to use a font selection other than  \fontspec{} — because it doesn’t seem to be compatible with ucharclasses any more. Instead, I used fontspec‘s suggested  \newfontfamily  mechanism. Here is the final working code:
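In outline it looks like this (again, check the ucharclasses documentation for the exact block names):

    \documentclass{article}
    \usepackage{fontspec}
    \setmainfont{Junicode}
    \newfontfamily\cjkfont{IPAexMincho}
    \usepackage{ucharclasses}
    \setTransitionTo{CJKUnifiedIdeographs}{\cjkfont}
    \setTransitionFrom{CJKUnifiedIdeographs}{\normalfont}
    \begin{document}
    Latin text. Chinese text: 紫薇北斗星 Modern greek: Διαμ πριμα εσθ ατ,
    κυο πχιλωσοπηια Ancient greek: Μῆνιν ἄειδε, θεά, Πηληϊάδεω Ἀχιλῆος.
    And regular latin text.
    \end{document}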

Here is the result: Mixed latin, chinese, and greek scripts with two different fonts: Junicode and IPAexMincho:

XeLaTeX with unicode font fallbacks

Pretty!
