DRAFT, still far from complete, for comment

On Scanning

The objectives, techniques and pitfalls of scanning historic technical documents

20140603 TerraHertz http://everist.org NobLog home Updated: 20190101

"For all you know, your electronic representation of a rare book or manual you are scanning today may in time become the sole surviving copy of that work. The only known record, for the rest of History. Bear that in mind."

Introduction

The history of technology can be fascinating. Technological history is often ignored by so-called historians, who concentrate on political and social developments. Yet politics and social systems can be considered constructs of deception and often unfounded, even irrational, beliefs, whereas technological history is an evolution of practical, engineering-based truths (or failures, in some cases). Even then, the underlying intention was usually to produce something that worked as intended, based on sound reasoning and scientific theory, with a clearly and honestly defined purpose.

It can be argued that technology predominantly steers political history, both through the weaponising of technology and through the social changes technology enables. In that sense, is Technological History not at least as important as the resulting Political History, or even more so? Or at least more fundamental. Personally I find it much more interesting, and so it is quite saddening to see details of technological history being lost, discarded without thought as 'obsolete' trivia.

It is said those who forget the past are doomed to relive it. Since the dawn of the Industrial Revolution we have been riding a continuous wave of technological development, and we are not widely aware of any precedent of humankind having forgotten such knowledge. So we assume regression can't happen and there is no need to carefully preserve old technological knowledge.

But there have been relatively high technology periods in human history that are now entirely forgotten. For instance, who produced the Antikythera Mechanism[1] and how? This highly sophisticated precision-machined mechanical computing device, built around 150 BC, simply does not fit in our understanding of history at all. Also the Baghdad Batteries and who used them, for what?[2] Another loss within only the last thousand years, was the method of creating Damascus Steel.[3]

All knowledge of such devices, and probably of other past technologies, has been lost, leaving only mysterious ancient relics, sometimes with no social context whatsoever. How can we be sure that in another thousand years, people won't be mystified to find a calcified smartphone, wondering who produced it, how, and what it was for?

In modern times, appalling loss of technological knowledge continues. These days we don't (usually) burn libraries, they merely get progressively purged of irreplaceable old technical books, into dumpsters in the rear lane. The entire theoretical works of the genius Nikola Tesla[4] in the latter stages of his life are lost (or at least vanished from public view), Philo Farnsworth's work in the 1960s on nuclear Fusors[5] is mostly gone, and the majority of all books published through the middle years of the 20th Century are well on their inevitable path to illegible dust due to acid-paper disintegration[6].

More recently it's become fashionable to dispose of all technical data books and manuals from the early days of the electronics revolution, because "who needs that bulky old stuff? It's obsolete, and all we need is online now."

As a species, we don't seem to have much common sense, caution, or ability to take lessons from history. It's not impossible to imagine circumstances where the 'old' technical knowledge might become critically useful again. Archaeologists will tell you that it's the rule, not the exception, for civilizations to eventually collapse. Also that social complexity is a primary causative factor in such collapses, and that the process is always unexpected and far from pleasant.[7] Unpredictable external events too can bring abrupt global disruption; for instance the meteorite that made the 30 kilometer wide Hiawatha impact crater in Greenland, perhaps as recently as 12,000 years ago.[8]
Allowing our present knowledge to fade away would be very unwise.

As someone with an interest in several aspects of history, particularly the history of electronics, I often find myself seeking old books and manuals. Before the Internet and with limited personal means this was usually a futile exercise, but with resources like ebay and abebooks, web search, the many online archives of digitized works, and forums of others with similar interests, the situation has greatly changed.

These days one typically has a choice between obtaining an original paper copy of old manuals, or downloading (often for free) an electronic copy of the manual. Personally I find the physical copies vastly more readable, practical, convenient and satisfying than electronic copies, so I generally buy physical copies when available and affordable. But of course this won't be possible for ever, as these things age and become more rare. It may take a few centuries before there are no more paper copies, but ultimately the only available forms will be digital - or reproductions from the digital copies. Which means there had better be digital copies, or the knowledge will be lost like Damascus Steel, or the skills that made the Antikythera Mechanism.

Fortunately there are many who recognise the problem we have with technological data loss, and make efforts to rectify the situation. They scan old technical documents, and upload them to the many online archives.[9]

Some of these are private individuals, sometimes forming groups to work together preserving such history. Others are for-profit operations, demanding payment for access to their digital copies. There are a few good examples of corporations taking the trouble to preserve public archives of their own past technologies, but mostly businesses do not care about such things, since the effort does not mesh with the corporate first commandment: to maximize profit above all else. Or they exhibit undesirable behaviors such as mass-culling all circuit schematics from their archives of digitized manuals 'for legal reasons', or deliberately ensuring the digitizing resolution is too poor to be usable. (Their rationale is a lie; it's actually about ensuring old equipment can't be maintained, thus slightly increasing sales of new equipment.)

As for governments, they seem to be too busy archiving and crimethink-analyzing our emails and cross-referencing a billion Twitter messages to give any thought to the preservation of humankind's technical and literary cultural heritage. The deliberate destruction of the engineering plans for the Saturn V moon rocket is just one example of many, that illustrate typical government thinking.

%%%

Judging by the large numbers of quite poor quality document scans online, there does seem to be a need for a 'how to' like this. This work is the result of my own learning experiences, involving %%% plenty of mistakes. Hopefully it will help you avoid repeating them. Critiques and suggestions for improvement of this work are welcome.

Things You'll Need

  1. A scanner, and/or hi-res digital camera (DSLR preferably) plus fixed camera mount and light table setup. Their relative merits and limitations are discussed later.
  2. Computer, plus a good quality colour screen. It's not critical how fast it is, which CPU, OS, etc. So long as it can run your scanning capture software and image editing software of choice with fairly large image file sizes, it will do. Personally I have not bought a PC for over two decades, but rather just pick up street tosses and reinstall an OS to fix the 'geriatric Windows' and malware problems that likely caused their previous technically naive owner to toss the machine.

    Perfectly good LCD monitors are also free, since many people chase the 'bigger is better' mirage, and discard their old one — from a couple of years ago. The screen should be 1280 x 1024 or better and have good colour accuracy, but most importantly it should be clean! Put up a pure white background, or maximize a text editor, and check there are no blemishes, specks or dud pixels on the screen. A speck on the screen and a speck in the document look the same, until you pan the document image. You are going to be looking at a lot of spaces between text while removing scanning imperfections, and having a spotless screen will save you a lot of trouble. Other reasons this is important for your final document files will become evident later.

    You can clean screens with typical window spray cleaners, keeping the spray very light and not letting any run under the edge molding. (If it does, the fluid can corrode the metal frame of the LCD, and at worst damage the electronics.) Placing the screen flat while cleaning can help avoid that. Wipe dry with a soft tissue while looking at slant reflection in the surface, then the white background again. Never rub anything across the screen that could leave scratches. The surface is a relatively tough plastic film, not glass and not as hard as glass.

  3. Lots of free hard disk space. Don't start scanning projects if you are down to a last few Gigs of free space. This work is going to really eat disk space, what with high res original images that have to be kept, backups of those, working copies at the same resolution (and multiple backups of those as you work), intermediate processing stages of the images (and backups...) and final result document files (and their backups!)
    It's highly recommended to use swappable drive trays in your machine for workspace drive(s), and also to have USB external hard disk docks and multiple bare drives for doing running backups. Keep all this work on its own dedicated partition or physical drive if you can. Life will be much easier if you don't mix these files into your system drive, C: or whatever.
  4. Patience. Like hard disk space, you'll need plenty of this too. Scanning documents is tedious, no escaping it. Also it's fairly typical to put in a large block of work on something, then realise you've messed up somehow and have to do it all over again. Especially during learning stages. Hopefully this text is going to help you avoid some of those disasters, but don't be upset when stuff-ups do happen. Perfection is an ideal to strive for, not something we ever actually achieve.
  5. Knowledge. Anyone can scan a document to some kind of electronic representation. The horribleness of the result will generally be inversely proportional to the depth of their understanding of the underlying processes, editing utilities, graphics file formats and encodings, and the aesthetic tradeoffs between the numerous choices. Hopefully this document will help you produce better results with less effort.

Useful software utilities

Mostly, people who've been using computers for years will have their own set of preferred utilities. In any case, here are the tools I currently use in relation to scanning documents:

Capture: Scanner or Camera

There's only one crucial criterion regarding initial document digitization — when you look at the resulting raw images, would you say that any significant information or stylistic aspect of the original document has been lost, beyond recovery by post-processing? And by 'look at' I mean zoom into the image and compare fine details to the original document. We'll be discussing image quality in depth later, but in general one can judge quality by comparing the appearance of flat and slightly shaded colour areas, and high-contrast edges (straight and curved), between the image and the physical original.

If you can't produce raw capture images preserving every significant detail of the physical document (including blank paper areas being blank in the captured image, and photos retaining all original detail), then the capture device is not good enough. You may decide that some levels of detail can be dropped from the final online document copy, but that must be your aesthetic decision, not the result of hardware shortcomings. If you're being forced to accept quality loss due to equipment limitations, then it's better not to waste your time doing the scanning work at all. Not until you get better equipment.

Generally though, quality is not a problem even with low cost scanners today. The scanner does not have to be anything great, so long as it produces reasonable quality images — and most do. If it can achieve up to 600 dpi (dots per inch), has color and gray-scale capability, can capture a reasonably linear grayscale, and has facility in the software to optimize capture response to suit the actual document tonal quality, it will do the job for most books and technical documents. Note that this completely excludes all 'FAX mode' black and white scanners.

The maximum page size you'll need it to handle depends loosely on the kinds of documents you expect to scan. If you will only need to handle a limited number of sheets bigger than the scanner bed, then stitching sections together in post-processing won't kill you, so a small (cheap) scanner can be adequate. If you want to attempt manuals with hundreds of large fold-out schematics, then you're going to need something more up-market.

Ergonomic issues like whether there is auto sheet feeding, how many keystrokes or mouse clicks are required per page scanned, scanning carriage speed at the chosen resolution, etc, translate to how much time will be required, not the quality of the end result. Whether time is a primary issue depends on your situation. For some it's not particularly important, for others it's crucial. That choice is yours. With bound volumes you're going to be manually handling each page scan anyway, so autofeed is irrelevant. Also the page setup takes long enough that a few more keystrokes aren't going to make much difference either.

Another factor is that page autofeed mechanisms bring some small but real risk of misfeeds that may damage the original document, with consequences dependent on the rarity and value of the document. Weighing this risk is your responsibility.

I currently use an old USB Canon Lide 20, that I literally found among some tossed out junk. Downloaded drivers from the net, and it worked fine. It has an A4 sized bed, so with foldout schematics I have to scan them in sections and stitch. The stitching is typically only a small portion of the post-processing work required. Most of that post-processing would still be required no matter how expensive a scanner I used.

If you have a high end digital camera, for some document types it can be used instead of a scanner. The criterion is whether the images must be exactly linear and consistently, accurately scaled across the page, or not. For some things (for instance engineering drawings, and anything you are going to try stitching together) exact rectilinearity is indispensable, while for others (eg text pages) it generally doesn't matter. With cameras you'll get some perspective scaling variations and barrel distortion no matter what, while scanners produce a reasonably accurate rectangular scaling grid across the image.

You'd need a setup with a camera stand, diffuse side lighting and black shrouding, so a sheet of glass can be placed on the surface to hold it flat without introducing reflections. I've so far only tried this with a large schematic foldout, but found my camera resolution was inadequate for that detail level. I'll be trying it again for the chore of scanning one particular large old book. It should solve the 'thick spine can't spread flat' problem, which is an intrinsic obstacle with cheap flatbed scanners. There are scanners where the imaging bed extends right to one edge, but they are expensive. Google 'book edge scanner'.

Another thing I've yet to try is a vacuum hold-down system, for forcing documents with crinkles and folds to lay very flat. This should be workable for both scanners and camera setups.

The reason the scanner is not the most critical element in the process, is that no matter how good the scan quality, you are still going to have to do post-processing of the images. During post-processing many imperfections in the images can be corrected — it takes a very bad scanner to produce images so poor they can't be used. Also a large reduction in the final file sizes can be achieved by optimizing the images to remove 'noise' in the data, that contributes nothing to the visual quality or historical accuracy.

Of course, where the line lies between unwanted noise, and document blemishes of significance to its character and history, is up to you.

Overall, your post-processing skill and the document encapsulation stage will have the most significant effects on quality, utility and compactness of the final digital document.
Fundamentally the results depend on your skills, not the tools. That shouldn't be any surprise.

The choice of scanner vs camera is complex, and depends on the needs of your anticipated scanning work, and subtle issues of image quality — that will be discussed in depth later. I'm not going to recommend one or the other. The following table lists general pros and cons of each; some are simple facts, others are matters of opinion.

Comparison of Flatbed Scanners to Cameras

Resolution
  Scanner: Specified per inch/cm of the scanner face. Typically scanning is done at 300 to 600 dots per inch (dpi), but most scanners can go up to over a thousand dpi. At say 1200 dpi, an A4 sheet (8.27" x 11.69") scans to roughly 9,900 x 14,000 pixels, about 139 megapixels. And while scanning it's usual to overscan, with an extra border around the page, so the total resolution will be higher still. (See the worked example after this table.)
  Camera: Specified as X by Y pixels for the overall image. This has to include any border allowed around the page, so the page resolution will be a bit lower. Taking the Canon EOS range as typical, image resolutions range from 10 to 50 megapixels. For an EOS 5DS at 51 megapixels, the maximum resolution is 8688 x 5792 (aspect ratio 1.5:1). Spread over an A4 page (aspect ratio 1.41:1) that gives about 700 dpi at best; with a border it will be a bit less. This is just enough to adequately capture printed images with tonal screening.

File size
  Scanner: Can be configured for low file size, but generally the files are large; 50MB is common for a page. Scanner files are usually intended only for post-processing work, with the final product being noise-reduced, scaled to a much lower resolution, compressed in a non-lossy format, and probably bundled into a single document wrapper file.
  Camera: Cameras are designed to produce manageable file sizes as a primary requirement, so compression is enabled by default. Unless the user selects an uncompressed format, files are generally a few MB or less. Even with uncompressed formats, the lower overall image resolution means file sizes will be smaller than typical scanner images. Camera image files are usually expected to be archived as-is, and distributed as-is or at reduced resolution. For document capture they will undergo similar post-processing and bundling to scanner images.

Image coding format
  Scanner: Scanner utilities generally offer a wide range of image codings, both lossy and non-lossy, with JPG and PNG typically available. For document capture non-lossy PNG is highly preferred. Never use a lossy format (JPG) for original scan files.
  Camera: Cameras tend to have fewer file type options, and some lack any non-lossy format. JPG is always present, and some cameras offer non-lossy forms such as RAW. I've never seen a camera with PNG as an option.

Scale linearity
  Scanner: Intrinsically linear across the entire scan. At least it should be; poor quality scanners or mechanical faults can introduce non-linearity.
  Camera: Varies across the document due to perspective and sphericity. Can be minimized via camera alignment, focal distance and lens selection.

Stitchability
  Scanner: Manual Photoshop image stitching is possible, due to consistent scale, linearity and illumination.
  Camera: Perspective and illumination variations require software tools to stitch images. Without custom utilities it's generally impossible to do adequately, and even with the best tools the final product's linearity won't be quite true to the original.

Illumination
  Scanner: Intrinsically highly uniform.
  Camera: Achieving uniformity requires a careful lighting setup, with significant extra cost.

Repeatability
  Scanner: Very high, due to the absence of external lighting variations, and to saved configuration files.
  Camera: Difficult to achieve without a complex physical setup and care with camera settings.

Precise quantization curve control
  Scanner: Yes, always, via the config screen. Easily accessible, since this is a primary requirement of scanners.
  Camera: Depends on the camera. Typically not, or if present it will require digging down through setup menus, since this is not a feature commonly required for general camera use.

Config files
  Scanner: Yes, always.
  Camera: Depends on the camera. Typically not.

Intrinsic defects
  Scanner: Linear streaks in the direction of carriage travel, due to poor calibration or dirt on the sensor. Smudges, dirt, hairs and scratches on the glass repeat in each image. Focus and luminance defects wherever the paper was not flat against the glass.
  Camera: Perspective and sphericity linearity defects. Luminance variation across the page.

Reflections
  Scanner: Never.
  Camera: When a glass sheet is used to flatten the document, suppressing reflections of the nearby environment requires active measures such as dark drapes, lighting frames, a darkroom, etc.

Page back printing bleed-through
  Scanner: Can be a noticeable problem due to high intrinsic capture sensitivity. Requires measures such as black felt backing to suppress.
  Camera: Tends not to be noticeable, partly due to lower overall capture sensitivity.

Print screening moire pattern control
  Scanner: Since scanner resolution can be set close to the screening pitch, moire patterns have to be deliberately avoided by selecting an adequately higher resolution.
  Camera: Not generally a problem, since camera image resolution when capturing full pages with printed screening is typically too low to generate moire effects. However this means the screening pattern is lost, and cannot be precisely dealt with in post-processing.

Page setup effort
  Scanner: Depends on the document. Stiff-spined books can be difficult or impossible. Single sheets are easy. Sheets larger or much smaller than the scanner face can be a pain to align. Some scanners can auto-feed pages.
  Camera: Depends on the setup. The primary attraction of camera-based systems is the feasibility of capturing thick-spined books, using frames to hold the book partially open, and/or capturing two pages at once with minimal page turning and little manipulation of the whole book.

Capture speed
  Scanner: Not instant. Speed depends on the scanner and the selected resolution; it can be quite slow, up to several minutes per scan for large pages at high resolution.
  Camera: Instant. But there's still the issue of transferring images to the computer, which can also be rapid if the camera is directly connected by wire or wifi.

Document flatness
  Scanner: Must be flat against the glass. With some documents this can be difficult to achieve. Folds, book spine bends, and edge lift from sheets larger than the scanner face bezel aperture are common problems.
  Camera: Not critical, since the camera is distant from the document surface. Usually document flatness irregularities are too small to be significant within the focal depth of the camera.

Space requirements
  Scanner: With scanners now being so small, setup space requirements are minimal. A small scanner can easily be stored away when not in use. Larger ones can require dedicated floorspace.
  Camera: The setup required to achieve good results with cameras can be quite bulky, and tends to require permanent allocation of space, or even a room.
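
As a quick sanity check of the resolution comparison above, here's a minimal worked example (plain Python arithmetic; the camera numbers are the EOS 5DS figures from the table, the page is standard A4):

    # Compare a scanner's fixed dpi against a camera frame spread over an A4 page.
    page_w_in, page_h_in = 8.27, 11.69        # A4 sheet in inches

    # Flatbed scanner: dpi is simply whatever you set.
    scan_dpi = 1200
    scan_w, scan_h = round(page_w_in * scan_dpi), round(page_h_in * scan_dpi)
    print(f"{scan_dpi} dpi scan: {scan_w} x {scan_h} px, about {scan_w * scan_h / 1e6:.0f} megapixels")

    # Camera: effective dpi follows from the sensor pixels spread over the page (ignoring borders).
    cam_long, cam_short = 8688, 5792          # Canon EOS 5DS sensor resolution
    camera_dpi = min(cam_long / page_h_in, cam_short / page_w_in)
    print(f"Camera effective resolution: about {camera_dpi:.0f} dpi at best")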

Scanning Software

Typically a scanner will be accompanied by two kinds of software. There's the hardware drivers, that are required to handle the process of making the scanner do its job. Then there is 'everything else' the manufacturer chose to bundle with the scanner. Things like photo editors, photo album organizers, maybe a PDF converter, an OCR package, etc.

It's necessary to use the hardware drivers, but in my experience most of the extra bundled software tends to be rubbish and is best ignored. If you want to save the time it takes to verify that they are rubbish, don't even install them. Photo album 'organizers' will try to hide the underlying filesystem (it's a Microsoft/Windows corporate paradigm that users can't manage/organize their own files, so the OS & Apps should 'relieve them of that responsibility'). OCR-ing the document would be nice, if there were a file format for containing both the original imagery and the OCR text in an integrated manner. But there isn't. As for PDF converters — well, I'll discuss PDF later.

Most importantly though, bundled Apps tend to be cheap/crippleware versions; often wanting you to pay to upgrade to full versions, still not as capable as easily available better utilities, and in any case requiring a fair amount of learning curve effort to use at all. Then if you change your scanner you may find those Apps don't work any more, and you have to learn the ones bundled with the new scanner.
It's far better to find a set of good independent (preferably Freeware and Portable) utilities that can do the things you need, learn how to use them well, and stick with them.

Regarding the scanner driver software that you will need to use, here's a list of a few essential features:

For instance, here's the colour adjustment tab for my Canon scanner. It does the job.

At left there's a lo-res scan preview, with dashed outline for the area to scan at selected resolution. The dashed outline can be moved and resized with the mouse. At right the response curve can be adjusted via mouse drags or numeric entry. The graph shows statistical weights of luminance values in the scanned image. The preview image changes on screen interactively with adjustment of the curve.

Note that the example shows the tool as it initially appears, unadjusted. The fine straight diagonal line and the numbers describing its shape and position are not how they would be after user adjustment for that preview image. We'll get to that later. For now the important point is that once adjusted, those values should normally remain the same for all pages of a document being scanned. If these values change between pages, the appearance of the scanned pages will be inconsistent.

That's why facilities to save and restore the scanner profile are here — because this is the most important element of the scanner settings. More important even than resolution. Settings for colour mode and resolution are in other tabs, but they are saved together with these values.

Here the save/load action is represented by the small folder icons, and it will be different in every scanner's software. But it's crucial that the ability exists. Also that the saved profiles can be saved with user-specified names, and in folder locations specified by the user. You must be able to easily save the profiles in specific folder locations because these files must remain associated with each scanning project.

That's pretty much all that's needed from the scanning software. You'll see many extras like noise reduction, despeckle, and so on, but they aren't essential and you'll have to experimentally verify whether they are useful or a hindrance in practice. Such 'automations' that the scanner and its software offer to do for you will mostly just limit your range of choices regarding the final product of your efforts. Once you are familiar with the nuances of post-processing you'll generally leave all the scanner frills turned off. That will remain true at least until we have software AI with aesthetic capabilities at human level — and that probably won't end well for anyone.

Yes, I prefer a simple legacy Windows style, as I find flashy visual bling adds nothing but distraction, and can detract from accurate assessment of the image on which I'm working.

See also Useful software utilities.

Portable Utilities

In most operating system environments, the customary method of software setup involves running an 'installer' program that spreads (entangles) multiple components of the software throughout the structure of the operating system.

MS Windows is particularly prone to this syndrome, with the Microsoft-recommended program structure involving things like the Windows Registry, splitting of software components across different locations like the System folders, Programs folder tree, user-specific folders, and so on.
However Apple and Linux are also guilty of poor choices in software install structures.

The result of such fragmentation into parts entangled throughout the operating system is utilities that cannot simply be cloned as a unit (complete with all user configurations) to another machine. When one has very many software tools, each of which took time to learn and may not get used very often, this non-clonability becomes a serious problem when transferring from one machine to another, or when keeping multiple machines with identical tool setups. It is simply not feasible to repeatedly go through the install process for long lists of utilities every time one moves to new computing hardware.

In recent years there has been a developing movement to overcome this revolting, stupid problem, by creating 'portable utilities'. The idea is very simple — the entire concept of 'installing' is discarded, and software is restructured to reside and operate entirely within a single folder. All executables, defaults and user configuration files are kept in that one place, without any external dependencies.

Which is how it should be, and always should have been. It's a fair question to ask why it wasn't. The answer to that one comes down to the usual 'do we assume it was just incompetence? Or was it intentional, for reasons to do with insidious corporate intentions deriving more from ideology than profit motive?' It's your choice what you believe.

A highly desirable consequence of choosing 'portable utilities' is that you can create a folder tree containing all those you use. Also one folder (in that tree) of shortcuts to all the utilities. Then the entire folder tree can be copied to other machines as-is, resulting in all your customary tools being instantly available on the other machine(s). Also already configured just the way you like, since all the config information was present in the tree too.

When selecting software to add to your toolset, it's advisable to first check to see if there is a portable version available. And if not, hassle the authors to make their software portable.

One place to start: http://portableapps.com/

Scanning Process

The first rule when scanning, is to always save in a non-lossy image format such as PNG. At this first stage in the process you should not care about file size. What you want is all the resolution you can get, and to eliminate as many sources of 'noise' as possible. Lossy formats like JPG effectively inject noise into the image, that can never be removed.

If you find yourself worrying about disk space used for hundreds of very large image files, then just buy an external HD and dedicate that to scan images. It will simplify your backup options anyway.
And for heaven's sake don't even think of using patch compression encodings such as JBIG2.

The original scans should capture even the finest detail on the pages. This includes fully resolving the fine dots of offset screened images. Anything less and you'll never get rid of the resulting moire patterns and other sampling alias effects, no matter what you try in post-processing.

A common conceptual error is the assumption that if a page is black ink on white paper, the scan only needs to be bi-level ('lineart' mode), ie each saved pixel need only be one bit, encoding full black or full white. The result is actually extremely lossy, since all detail of edge curves finer than the pixel matrix is lost. In the worst case there will also be massive levels of noise introduced by scanner software attempting to portray fine shading as dithered patterns of black pixel dots. In reality, so-called 'black and white' pages actually require accurate gray-scale scanning at minimum, and perhaps even full colour scanning if there are historic artefacts on the page worth preserving, for example watermarks and hand-written notes.

Yes, the raw scan file sizes will be pretty big; easily 10 to 50 MB per page. Tough. Just remember that the final document file size you achieve will have little relation to these large intermediate file sizes, but the quality will be vastly improved by not discarding resolution until near the end of the process. At that stage any compromise (file size vs document accuracy) can be the result of careful experiments and deliberate aesthetic choice.

All intermediate processing and saves subsequent to scanning must also be in a non-lossy format. Keep the working images in PNG or the native lossless format of your image editing utility. Under no circumstances should you do multiple load-edit-save cycles on one image using the JPG file format, since every save as JPG throws away more image detail.
Only the final output may be in a lossy format, and preferably not even then.
Ideally, the only resolution reduction in the whole process should occur when you deliberately scale the images down to the lower resolution and encoding you've chosen for the final document.
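
If you want to see the damage for yourself, here's a minimal sketch assuming Python with the Pillow library ('page.png' stands in for one of your own scans) that repeatedly re-saves an image as JPG and measures how far it drifts from the original:

    # Re-save a scan as JPG over several generations and report the cumulative error.
    from PIL import Image, ImageChops, ImageStat

    original = Image.open("page.png").convert("RGB")
    work = original.copy()

    for generation in range(1, 11):
        work.save("generation.jpg", quality=85)             # one lossy save...
        work = Image.open("generation.jpg").convert("RGB")  # ...reloaded, as an edit cycle would
        diff = ImageChops.difference(original, work)
        rms = ImageStat.Stat(diff).rms                      # per-channel RMS error vs the original
        print(f"generation {generation}: RMS error {[round(v, 2) for v in rms]}")

The error never returns to zero, and real edit cycles (where the image changes between saves) accumulate damage faster than this idle re-save loop suggests.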

It may be an obvious thing to say, but do clean the scanner glass carefully before beginning, and regularly through the process. Smudges, spots, hairs, even small specks of dust, all mar the images and can make a lot of extra work for you in post-processing.

For documents with a large number of pages it's helpful to work out a page setup procedure that is as simple as possible. If you're working with loose sheets, have the in and out stacks oriented to minimize page turns, rotations, etc. The out stack should end up in the correct sort order, not backwards. Avoid having to lean over the scanner face, or you will find hairs in the scans. Ha ha... at least I do. Your body hair shedding rate may vary.

It's helpful to have some aids for achieving uniform alignment and avoiding skew of the pages. But don't stress about this, as you'll find that page geometries will vary enough that you'll probably want to unify the formats during post processing anyway. Also surprisingly often the printing will be slightly skew on the paper, so your efforts to get the paper straight were kind of wasted, and de-skewing in post processing is necessary after all.

Often you will find the document has physical characteristics that make pages resist achieving flatness against the scanner glass. Any areas of the paper that are slightly away from the glass will be out of focus in the scanned image, or cause tonal changes across the page. Folds in the paper, areas at the edge of the scanner glass adjacent to the bezel, and so on, tend to degrade glass contact.

Sometimes there's simply nothing that can be done about it, for instance with thick bound books. In other cases (eg folds or crinkling of the paper) the answer is more force, in the right places. The flimsy plastic lid and foam-backed sheet of typical scanners tends to be too weak to apply real force to something that needs 'persuading' to go flat. For my setup I have an assortment of metal weights (up to one that is hard to lift one-handed), various sizes and thickness of cardboard rectangles to use as packing shims where needed, sheets of more resilient foam, and some pieces of rigid board to take the place of the scanner lid when it isn't enough.

Ultimately the limit of how much force you can apply is determined by the scanner body and glass. The optical sensor carriage will be very close to the glass, and warping the glass significantly will jam the carriage, or cause focus problems. Obviously dropping a big weight on the glass is to be avoided too. If you are repeating the same motions over and over, perhaps late at night, that's a real risk. (No, I haven't had this accident. Yet.)

If you are working with a scanner with auto-feed, lucky you. As long as it does actually manage to feed and align all the pages correctly. If your system claims to automate the entire scanning and document encapsulation (to PDF) process, then I suggest you take a close look at the quality of the output, for all pages. If you consider it is OK, then you don't need this 'how to'.

%%%
modes.  B&W, grayscale, color, 
Image resolution: XY (pixels) and shading resolution: Indexed colour table size, indexed grayscale, and RGB colour.
turn off all 'optimizations' - sharpen, moire reduction, dither, etc.
capture area - keep it consistent, so images of the same kind are all the same size.
OCR
profile curves, corrections, etc. Saving them. Ref to workflow...

DPI, image size in pixels, and the concept of 'physical size of digital image'

For non-lossy formats (PNG, TIFF, etc) Ref to file format characteristics section.

EXIF information in camera, scanner and photo-editor util images. How to remove it and why you should.

Workflow

The document digitization process has three basic stages:
  1. Scan the original pages.
  2. Process the images to obtain the desired final appearance.
  3. Encapsulate into the desired final file format.
But of course it's not that simple. For one thing there's usually some iteration required — you won't know exactly what scanning parameters and processing steps will be required, until you've tried running right through the entire sequence with some pages that form a representative sample of the types of document content present.

Here's a more realistic and detailed procedure:

First Phase: Preparation, Experiment
This stage can take as little as a few minutes, up to days. It depends on the scale of effort you know you're going to be putting into the main scanning stage later, and how complex that work will be. The harder and longer the real work, the more worthwhile it is to expend effort on preparation, if that may ease the process.
If you've never done any of this before, don't agonize over trying to perfectly optimize everything. You're going to make mistakes and have to redo work. An unavoidable learning experience, as with any complex job done for the first time. Hopefully this text will help you avoid some of the more painful common goofs.
  1. Organise.    Create a project folder. Create subfolders like: Experiments, Original_scans, Processing, Reduced, Final.
    Like me, after you get used to the workflow you'll probably use short folder names, like raw, crop, descreen, etc. The post-processing will often have enough independent steps that it's worth forcing an intrinsic display order, to keep these working folders in a clear sequence of flow, by naming them like 01_exp, 02_raw, 03_rot, 04_align, 05_crop, 06_clean, 07_descreen, 08_scale, 09_encode, 10_wrap, etc. (Yes, that is a typical sequence. There may be even more steps.) A short sketch of creating such a tree follows this item.

    In the project root, keep a text file for recipe and progress notes. You may find yourself looking at your notes again many years later, so it's a good idea to start the text with a heading stating exactly what you were doing here, and when. Remember you won't remember!
    Don't rely on file date attributes - these can get altered during copy operations. Put the date in the file. Personally I put it in many file and folder names as well, like YYYYMMDD_description. That way they sort sequentially.
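
    For example, here's a minimal Python sketch that sets up such a project tree and starts the recipe notes file (the project name is a made-up placeholder; the stage folders are the example names from above):

        # Create a date-stamped project root with numbered workflow stage folders.
        import os
        from datetime import date

        project = f"{date.today():%Y%m%d}_my_scan_project"   # hypothetical project name
        stages = ["01_exp", "02_raw", "03_rot", "04_align", "05_crop",
                  "06_clean", "07_descreen", "08_scale", "09_encode", "10_wrap"]

        for stage in stages:
            os.makedirs(os.path.join(project, stage), exist_ok=True)

        # Start the recipe notes file with a heading saying what this is and when it began.
        with open(os.path.join(project, "recipe_notes.txt"), "w") as notes:
            notes.write(f"Scanning project '{project}', started {date.today():%Y-%m-%d}\n")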

  2. Survey.    Examine every page in the document you're going to work on. Are there multiple kinds of distinctly different content that might have to be treated differently? For instance if most pages are B&W text and simple line drawings, but some pages have very fine detail drawings, others have offset-screen toned photos (or color), there are a few large foldout sheets, some errata notes printed on yellow paper, and a faded roneoed page someone stapled inside the front cover, then you have six different page types, and will need six separate workflows. Even things like different quality paper may require separate methodologies.

    Give each page type a name. These become headings in the recipe file, where you keep notes on the settings that work (and those which don't).

    If there are many pages of some simple kind like B&W text, but a few containing screened color images, it's worth scanning them at different resolutions and encodings, just for the time saved by faster scans of the text pages at lower resolution and in grayscale. But to confirm that the results can be manipulated to look the same in the final product, you'll need to do some trial scanning and processing experiments.

    And yes, this means your Original_scans folder will probably have sub-folders; one for each identified page type that has different scan parameters. At some stage in post processing they'll merge to a common image format, but it may not be for several steps.

  3. Trial Scans.    In general you should do some trial scans of the various page types in a document, experimenting with scanner modes and profiles to get the best possible results. Pick a representative page for each kind. For each one, experiment with the scanner to get the clearest possible scan. Keep notes of the settings! And of course save the optimal scanner profile for each one, making a note of their names in the recipe file.
    Zoom in down to pixel level in the images. Are you capturing all the finest detail of the printed page? In post-processing you may choose to discard some resolution, but if you lost detail in the original scans you no longer have full freedom of choice.

    Also never forget to view the overall page image, to check that scan quality is consistent across the entire area. No spots where the paper wasn't flat against the glass, blurring or shading the detail?

    How about bleed-through of items on the other side of pages? Sometimes this may only occur where there are particularly high contrast blocky elements on the other side, and you don't notice those few failures till much later. Have a look through the document for heavy black-white contrasts, do trial scans of the other side of such pages to check for bleed through.

    If necessary, add 'black backing required' to your scan recipe. Note that if you must do it for some pages, you have to do it for all, otherwise the page tones will be inconsistent. The variation in total reflectance of the pages with and without black backing will shift the tonal histogram of the images — the one you configured the scanner to optimise.

    If a page type has many pages to scan, then also make sure you can see how to get them all positioned uniformly on the scanner. Are you going to have to treat odd and even pages differently? For consistent framing it's best to do all of one kind, then the other, and keep the files separate until after you have normalized the text position within the page image area. In which case 'odd' and 'even' pages are actually different page types, and should be treated as such in your recipee.

    Because this stage is preliminary, don't hope to use these scans as final copies, to avoid the effort of doing them again in the main scan-marathon of all the pages. For one thing, you won't know what the actual sequence number of these pages will be, until you are doing them all. It's very unlikely to be the literal page numbers! Also when doing the scan-marathon, it's more trouble than it's worth to be constantly checking to see whether each page is one you've done before, and adjusting scanner sequence numbers, etc. So keep these first trial scans in 'experiments' and leave them there.

  4. Trial Post-Processing.    Once you have a minimal set of scan images that seem good quality representatives for each page type, it's worth going through the whole post-processing sequence with copies of them, before you put in the considerable effort of scanning hundreds of pages. This verifies that your chosen scan parameters produce files that are adequate to achieve your intended end result.

    This is where you develop a sequence of quantified processing steps to achieve the desired quality of resolution in the final document. You must have all that completely worked out before you begin the grunt work, because if part way through the job you decide you don't like something, and change the process, there will probably be a noticeable difference in the end result between pages already done, and subsequent ones.

    For documents with many pages, the 'scan them all' stage can require a very large investment of effort. You want to be completely sure you're not going to have to do it over again due to some flaw you didn't notice with the samples.

    Make sure you are capturing and can reproduce all the parameters. For instance if you shut down your PC, go away for several days and forget what the settings were, can you resume where you left off and get an identical result? This is why you have a recipe notes file and saved scanner config profiles.

    One decision you'll need to make at an early stage, based on your initial experiments, is the page resolution for the final product. This is important to know, because it also gives you an aspect ratio for the pages. Especially if you are scanning some pages at different resolutions, you have to be certain you can crop and scale all pages to the same eventual size and aspect ratio, exactly. Don't just assume you can; actually try it with the utilities you will be using. Some tools don't provide any facility to specify crop sizes exactly in pixels while also positioning the crop frame on the page by eye, so it can be hard to crop a large image exactly so that when scaled down in a later step, it ends up at exactly the right X,Y size to match pages from different workflows. A small arithmetic sketch after this item illustrates the calculation.

    Once you are absolutely sure you're happy with the results, are sure they are repeatable, and have a detailed recipe that can be either automated or at least followed reliably and repeatably by manual process, then it's safe to begin the real work. Just one last thing:
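
    Here is the kind of arithmetic involved, as a minimal Python sketch (all the numbers are made-up examples, not recommendations):

        # Work out the exact crop size in a high-resolution scan so that downscaling
        # lands precisely on the chosen final page size.
        final_w, final_h = 1700, 2200          # chosen final page size in pixels (example)
        scan_dpi, final_dpi = 600, 200         # this page type was scanned at 600 dpi (example)

        scale = scan_dpi / final_dpi           # 3.0 in this example
        crop_w, crop_h = round(final_w * scale), round(final_h * scale)

        print(f"Crop the {scan_dpi} dpi scan to exactly {crop_w} x {crop_h} pixels,")
        print(f"then scale by 1/{scale:g} to reach {final_w} x {final_h}.")
        print(f"Final aspect ratio: {final_w / final_h:.4f}")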

  5. Clear the decks.    All the trial processing work you did should be moved into your '01_experiments' folder. You don't want that stuff getting mixed in with...
Second Phase: The Grind
Now you're going to be doing a lot of very repetitive tasks, possibly for a long time. Turn on some nice music, have your clearly written recipe notes in sight, and sink into the routine. Take breaks, don't pull all-nighters or become a hermit. It's going to be finished when it's finished. Frequent file backups never hurt either.

The list below is a typical sequence of steps, from scanning to finished document. It's a loose guideline only. Your own procedure may vary, depending on equipment capabilities, software tools and your objectives. Some of the stage orderings are just preferences, and you may encounter small details and special cases omitted from this walkthrough.

Summary in one line:
    Scan, Backup, Mode, De-Skew, Align, Crop, Cleanup, De-Screen, Final Scale, Encode, Wrap, Online. And Relax.

  1. Scanning.    Nothing but scanning, all the pages, with no distractions from the minimal repetitive scan actions. Save to folder 02_raw (Original_Scans) or its subfolders, all using your determined recipe and scan profile, checking that the file sequence numbering is working right, saving in a non-lossy format, etc. Different page types can be kept in different subfolders, or given distinctly different names. Sometimes I keep them all in one folder in a unified numeric sequence, but with name-tails showing page type.

    Once you have these original files, never alter them! Make them read only. If you stuff something up during processing, you can go back to these. Having these files means you can put the scanner and the original document safely away, and probably not need to get them out again.
    If there are any file naming and sequencing issues you want to fix, you could do it now. If you are 100% sure you won't goof and irretrievably mess things up, thus having to rescan. But why not more safely do it after...

  2. Backup.    Make a working copy set of the scan originals, in a new folder for your first stage of post-processing. Yes, you just made a duplicate copy of many, many megabytes. Maybe even gigs. You'll be making several more copies too. I did warn you this would chew disk space.
    Now, in this working copy, would be a good time to fix those file naming issues.
  3. RGB/8 Mode.   Ensure the work copies are all in RGB/8 format, before doing anything more. Convert any that are not. All processing stages should be in RGB/8 format, regardless of whether the images are gray-scale or color. This mode is 24 bits/pixel, but file size is not a concern at this time.
    Why: Many processing operations don't work well or at all in indexed color. In the worst case they can appear to work, but introduce subtle but non-recoverable image defects.
    CS6: Image ► Mode: RGB Color, 8 Bits/channel. Displays as (RGB/8) in image window title bar.
    You ought to be able to do an automated batch conversion of all the files. But be very sure the utility you use does this in a non-lossy form, and output is in a non-lossy form such as PNG or TIFF.

    Because it's an automated step with no adjustments, applied to all files, there's no point creating another processing stage folder. Just overwrite these files.
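
    As one way to automate this step, here's a minimal sketch assuming Python with the Pillow library (the working-copy folder name is hypothetical). Pillow's PNG output is lossless, but verify the same is true of whatever tool you actually use:

        # Batch-convert all working copies to 8-bit RGB, overwriting them in place as PNG.
        import glob, os
        from PIL import Image

        WORK_DIR = "02_work"                       # hypothetical working-copy folder

        for path in sorted(glob.glob(os.path.join(WORK_DIR, "*.png"))):
            img = Image.open(path)
            if img.mode != "RGB":                  # indexed, grayscale, RGBA, etc.
                img.convert("RGB").save(path)      # 8 bits per channel, saved losslessly
                print("converted:", path)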

  4. De-skew.    (Preferably a new fileset in a new folder.) With each page, put an alignment guide in the image close to some linear element that is supposed to be horizontal or vertical, and rotate the image so it appears exactly straight to the eye. Save it. (There may be an automatic text de-skew facility in some scanner Apps, but if you are using Photoshop CS6 there isn't one.)
    CS6: View ► Rulers, drag a guide line from the ruler onto the image. Eyedropper the paper color, put in background color, Select ► All, Edit ► Transform ► Rotate.

    It can be difficult to use the manual rotate facility to get very fine adjustments. An alternative is to use the numeric entry rotate tool. After a little practice you can guess the required decimal rotation number to be right first time.

    Do not iteratively perform multiple sequential rotations to approach a correct result. Each rotation transform introduces a small amount of image loss. Instead, try an amount, judge the result, then undo (or reload) and try an adjusted value. The end result should be equivalent to a single cycle load, rotate-once, save.

    Note: CS6's Edit ► Transform ► Skew provides a freeform trapezoid adjustment suited for photographic perspective correction. It's not suitable for correcting scanner skew errors.
    CS6 also has File ► Automate ► Crop and Straighten Photos, but this only works for images with an identifiable border. A white page on a white scanner background doesn't work so well. Plus the printed text may be slightly skew on the paper page anyway.

    Incidentally, small-angle rotation is one of the operations that subtly but hopelessly screws up with indexed colour files. Try it and see. Zoom in on what were originally straight lines, but become stepped after the rotate. An effect due to the index table not containing ideal colour values to achieve adequate edge shading. It's intrinsic, unavoidable, and unrecoverable.

    If you're working on large pages that had to be scanned in parts, this is where you'll be rotating them as required and stitching them together in photoshop. Leave other processing steps till later if there are many to do.
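
    If you'd rather script the rotation than drive Photoshop by hand, here's a minimal sketch assuming Pillow (the angle, paper colour and file path are made-up examples); note it still performs only a single rotation per page:

        # Apply the de-skew as one rotation, filling exposed corners with paper colour.
        from PIL import Image

        ANGLE_DEGREES = -0.35                       # found by trial on this page (example)
        PAPER_COLOUR = (247, 244, 236)              # eyedroppered paper tone (example)

        img = Image.open("03_rot/page_017.png")     # hypothetical working file
        fixed = img.rotate(ANGLE_DEGREES,
                           resample=Image.BICUBIC,  # smooth interpolation, single pass
                           expand=False,
                           fillcolor=PAPER_COLOUR)
        fixed.save("03_rot/page_017.png")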

  5. Align.    (Definitely a new fileset in a new folder!) Printed pages generally have a common content border, with blank margins to the paper edge. When you turn pages in a book, the eye can't judge if the next page content borders are exactly positioned in the same place. And in fact in printed books, with variations in the page binding and knife cropping of the page sheaves, there is some randomness in content placement. Doesn't matter since readers don't notice it.

    But on a screen, with potentially instant (blink) page switching, you certainly will notice any slight misalignment of page contents relative to each other. The effect can be quite annoying, and usually considered a failure to reproduce the true nature of the original work. The authors didn't expect you to see 'jiggling pages.'

    There's only one way to fix this. The pages all have to be 'content aligned' to the limit of human visual perception with blink page changes. That means very, very well aligned.

    • Pick a representative page that has elements at important peripheral locations, eg page number, title, header, footer, a text block with good edges. This will be your reference page.
    • Load the reference page into photoshop as a semi-transparent foreground layer. Lock it in position.
    • Sequentially load every other page (one at a time) as a layer behind the see-through reference page.
      Manually shift each page to closely align with the position of the reference page. You'll get some 'no content' outer edges formed as you move the page. No matter. You can fill them now, or have another background layer visible to avoid it, or just leave them to be cropped off later.
      You can test the alignment by clicking the visibility of your reference page on and off repeatedly. Do your eyes tell you there's a 'sideways shift' of the content elements happening? Yes? Then keep adjusting till that effect is gone. Though sometimes you'll find works that are just poorly typeset, and some elements do actually move around.
      Another method is to use guide lines in Photoshop, and position page content relative to them. For repeatability you'd have to save a .PSD file to include the guide positions while working.
      Save the now aligned page over itself. In CS6 you'd do a 'save for web' with the reference page invisible, in PNG format.
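
    For pages where you've already measured the required shift (by the blink test or against guide lines), here is a minimal scripted alternative, assuming Pillow (the offsets, colour and path are made-up examples):

        # Shift page content by a known offset, padding the exposed edges with paper colour.
        from PIL import Image

        dx, dy = 6, -3                              # shift right 6 px and up 3 px (example)
        PAPER_COLOUR = (247, 244, 236)              # example paper tone

        page = Image.open("04_align/page_018.png")  # hypothetical working file
        canvas = Image.new("RGB", page.size, PAPER_COLOUR)
        canvas.paste(page, (dx, dy))                # content moves; exposed edges become paper colour
        canvas.save("04_align/page_018.png")
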
  6. Crop.    (New folder!) With the original printed work, there were blank margins to the paper edge around the page. But on screen, the content is what matters, and there is no 'paper edge.' Commonly the user will want to display the content to fill the screen, with only small margins. Maybe you decide the digital document should include original margins, maybe not.
    Are the margins important to the nature of the text?
    Are there any 'notes in the margin' you want to preserve? (If so, you'd best include margins of all pages. You did make a note of this when doing your initial survey, right?)

    If readability and minimized file size are a higher priority than perfect reproduction of the original page layout (blank paper included), then at some point you'll be deciding on a suitable cropping of the page image files. You can crop quite close to the content. Margins on screen can be added programmatically, or just omitted.
    Bear in mind that with lossless compression schemes like PNG's, blank page areas (with perfectly uniform colour, eg full #ffffff white everywhere) compress to almost nothing. So the file size reduction from removing blank margins may not be as large as you'd expect. It's mostly about best use of screen pixels for human readability.

    Obviously, to preserve the content position, this crop has to be identical for all pages.
    There are various ways to do this. One is to load all pages into photoshop as layers, crop the entire layer stack, then write the individual layers out as pages again.
    Another way is via a batch mode operation in Irfanview, applying a fixed crop to all images.
    Most image editing tools can automate this one way or another. But this is definitely a case where you want to be sure you have a backup of the prior stage files before committing.
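
    For instance, a minimal batch-crop sketch assuming Pillow (folder names follow the example workflow above; the crop box is a made-up example):

        # Apply one identical pixel crop to every aligned page, preserving content position.
        import glob, os
        from PIL import Image

        CROP_BOX = (210, 180, 5310, 6780)           # left, top, right, bottom in pixels (example)

        os.makedirs("05_crop", exist_ok=True)
        for path in sorted(glob.glob("04_align/*.png")):
            Image.open(path).crop(CROP_BOX).save(os.path.join("05_crop", os.path.basename(path)))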

  7. Cleanup.    (New folder!) The original pages will have a variety of imperfections such as surface texture, blemishes, inking flaws, creases, discoloured paper, and so on. Predominantly by area, these are visual details occurring in what would otherwise be 'blank background' to the actual content. Some of these will be desirable to remove, while others may form part of the character of the document and be preferable to retain in some form. This is where judgement of aesthetic quality and historical accuracy comes into play.

    All good image compression schemes (including non-lossy ones like PNG) are able to reduce truly featureless areas of an image to relatively few bytes. Background areas are usually the great majority of a page's area, hence the nature of the page 'background content' very greatly influences final file size. If the background is actually a uniform colour, file size will be mostly taken up by the actual page content. If the background has texture, that is 'not blank' and requires a lot of bytes to encode; typically far more than the smaller area of actual page content consumes.

    The cleanup stage of post-processing involves several quite different tasks which just happen to be carried out in the same way, and require similar kinds of aesthetic judgement. All of them require close, zoomed-in visual inspection of all areas of the image. This is a part of the process that can take up a great deal of time and effort. Or be quite brief, if the scans were of good quality.

    Ideally, during scanning you were able to adjust parameters to get all (or most) of the page background to be white (or some relatively clean colour) and most of the inked areas to be solid black (or whatever ink colours were used.) Remember, with ink-on-paper printing there are no 'variable ink shades.' There's only ink or no ink. Tonal photos are achieved by 'screening' the ink(s) in regular arrays of variable sized fine dots. At this stage of post-processing those dot arrays should still be intact; we're not dealing with them yet, other than perhaps to fix any obvious ink smudges.

    In the worst case the scans may include a prominent paper texture of some kind, but usually there will be just a few blemishes, spots, speckles, hairs, missing ink, ink blots, etc. This is where you decide what to do about all of them. Were any of them intended by the document author and printer? If not, you should remove them if possible, with the additional incentive that their removal will reduce final file size. Your primary rule should be 'how did the author want this page to appear?'

    Remember this? "For all you know, your electronic representation of a rare book or manual you are scanning today may in time become the sole surviving copy of that work. The only known record, for the rest of History."

    You're going to be sitting there late at night, grinding through pages in photoshop, and will think "Screw this. I don't care about these specks or that dirt smudge. Who would ever care?"
    Well, do you? What if this copy you are constructing ends up on a starship a million lightyears from Earth, the only copy in existence in the entire universe, after the planet and humankind are long gone. Was it important how the author wanted their work to appear?

    %%% Processing at this stage will greatly influence the compressibility of the image files, hence the final document size. Fundamentally it is a tradeoff between retained detail vs file size. Is a type of detail significant, or is it unwanted noise? All retained detail costs bytes. If it isn't significant, remove it. For instance simply removing blank paper texture (which may not even be visible to the eye unless the image saturation is turned way up) and replacing it with a perfectly flat white or other uniform tone can massively reduce the total final file size.
    See %%%

  8. DeScreen.    (New folder!) Remove print screening. For pages that contain elements of offset-printed screening, you must convert the regular dot mesh of the screening to smooth tone gradations before doing pretty much any other processing for noise removal or scaling. Virtually any kind of image processing applied to a page image that still contains screening is guaranteed to produce a horrible moire-pattern mess, and/or lose a great deal of image detail.
    Depending on the format of screening in the document this can be a major pain in the arse, and in some cases I still don't know of a workable method — it's an unsolved problem.
    See Offset Printed Screening below for a detailed discussion.
  9. Final Scale.   
    See %%% notes in to_add.txt: Uniform page content placement (part of 'Crop and Scale')
  10. Encode.   
  11. Wrap.    Construct Navigation Framework. Hotlinks in index, return to index, front cover with identifier as RAR-book, who scanned it and when, etc.
    See Encapsulation, PDF, and why it sucks.
    See %%%
  12. OnLine.    There wasn't any point to all that work unless you share the result. Upload it to Bitsavers and whatever other archives you can.
  13. Relax.    Take a break, go for a walk somewhere nice. Talk to people. Look at clouds, listen to music. You're not going to do that scanning madness again for at least... until you see the next old document that needs preserving.

Keep a Recipe

Very often page contents within a work vary enough that you have to use different scan resolutions, giving different final pixel dimensions. Or some pages require different post processing than others. Overall the workflow can become quite involved. Even for simple documents there are a lot of details and parameters that can vary.

So there's a need to keep detailed notes of the exact workflow in a 'recipe' text file. This should include both the scan parameters and the post-processing details: all tool settings, profiles, resolutions, scale factors, crop sizes, etc. If you saved profiles for the scanner, Photoshop or other tools, list the filenames and what each one is for.

Be meticulous with keeping these recipe notes. It sometimes happens that much later you discover flaws in your final document, and you need to redo some stages, possibly even re-scanning original page(s). If you don't have the full recipe, you may not be able to reproduce the exact same page characteristics again.
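
As a sketch, a recipe file might look something like this. Every name and value below is invented, purely to show the level of detail worth recording:

  recipe.txt  (hypothetical example)

  Scan profile:    'docname_text_300dpi' saved in the scanner software
  Resolution:      300 dpi for text pages, 600 dpi for the foldout schematics
  Mode:            greyscale, auto-exposure OFF
  De-skew:         Photoshop, convert to RGB/8 first, manual rotate using ruler tool
  Crop:            2480 x 3300 px, same box for every page
  Final scale:     75%, bicubic
  Encode:          PNG for text and line-art pages, JPG quality 85 for photo pages
  Notes:           pages 23-26 rescanned after finding a crease; cover scanned at 600 dpi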

Generally, once you have done sample scans and have a workable scan profile, scan the entire document in one go. Even if it takes days, don't alter any tool settings.

Even if you line pages up perfectly you'll find that once the image is on screen any slight skew in the original printed page really stands out. So it's best to scan a little oversize, to simplify later de-skew and re-crop processing stages.

Once you have all the pages scanned, in the best quality you can achieve, ensure you have a complete, sequentially numbered set of files, one for each page, in a single folder. These are the raw scan files, with NO post-processing yet.

Make these files read-only. The folder name should be something obvious like 'original_scans'. These are your source material, and you won't ever modify them. It's a good idea to keep them permanently, in case you later find errors in your post-processing that need fixing. That folder should also contain a text file with the recipe.

File numbering

The scan files need to be kept sequential, so they have to be numbered. The number should be the first numeric part of the filename, so they sort correctly in a folder. Be sure to set the number format with enough digits to fit all the pages. For example, for over a hundred pages you'd better start the numbers like 001, 002, 003, etc, or you will be sorry.

Typically your scanner software will be able to automatically number sequences of scans, with you just telling it the number of digits and starting number. Usually you can also include fixed strings in the name too.

But of course, the file sequence numbers won't match the page numbers in the document. Not ever, because what printed work ever starts numbering with 'page 1, front cover'? Also page numbers tend to be a mix of different formats.

Yet while working with a large set of page scan files it's important to be able to find page '173', 'xii', '4.15' or 'C-9' when you need it. While you're working on them the file names should contain BOTH the image sequence number, and the physical page number. They should also contain enough of the document title that they are unique to that project, and you can't accidentally overwrite scan images from some other project if you goof with drag & drop or something.

I typically use filenames like docname_nnn_ppp.png, where:
   docname is an abbreviation of the document title. In the final html/zip I may omit this from the image files.
   nnn is a sequence number, zero-padded to however many digits are required for the highest number. They run 001 to whatever, in a constant width so the images sort correctly in directory listings.
   ppp is an exact quote of the actual page number as printed in the document. It may be absent, or be a description, roman numerals, a section-style number, etc.
   .png is the filetype suffix, and PNG is a non-lossy format.

Once you've finished processing the page images and are constructing the final document, if the wrapper is html you might want to avoid wasting bytes on repetitive long file names, and reduce the names back to just the plain sequential numerics. The names will all occur multiple times in the html, and any more bytes than absolutely needed are a waste.
But do retain the same basic numbering as the original scan and work files, in case you need to fix errors in a few pages.
A useful freeware utility for bulk file renaming is http://www.1-4a.com/rename/
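
If you'd rather script that final rename than drive a GUI tool, a few lines of Python will do it. This is only a sketch, and it assumes work-file names of the hypothetical docname_nnn_ppp.png form described above:

  # Sketch: copy finished page images into a bundle folder, renamed to just their
  # zero-padded sequence number. Assumes (hypothetical) names like docname_nnn_ppp.png
  # where the docname itself contains no _digits_ run.
  import os
  import re
  import shutil

  SRC = "processed_pages"      # hypothetical folder of finished page images
  DST = "html_bundle"          # where the renamed copies go

  os.makedirs(DST, exist_ok=True)
  pattern = re.compile(r"^.+?_(\d+)_.*\.png$", re.IGNORECASE)

  for name in sorted(os.listdir(SRC)):
      m = pattern.match(name)
      if not m:
          print("skipped (name doesn't match the expected pattern):", name)
          continue
      seq = m.group(1)         # e.g. '001' -- the zero padding is preserved
      shutil.copy2(os.path.join(SRC, name), os.path.join(DST, seq + ".png"))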

OCR - Conversion to Text

Images of text can be converted to encoded text using software tools generically known as OCR (optical character recognition). When preparing old printed works for electronic publication, in the current state of the art you have to decide whether you want the file to contain the original text as encoded characters (typically one byte per character), or images of the text. Images of text are always much larger files than the text itself, since it takes many bytes to represent the picture of each character, but only one to three bytes to convey the character code.

There are several advantages to having a character-based copy of a document, rather than images of it.

  1. Smaller file size. Text is a much more compact data form than storing images of text.
    For example a page of ASCII text that saves as a 3640 byte TXT file takes 63,756 bytes in graphical form as a PNG image, 217,285 bytes as a JPG at quality 90, or 106,488 bytes as a JPG at quality 50. (If you try that as an experiment your image file sizes will vary somewhat depending on many factors.)

    In general an image of text will be 10 to 100+ times bigger in filesize than the plain text — depending on the image coding scheme and resolution, and whether the text is plain ASCII/Unicode, or a styled document format.

  2. Searchable. The text can be indexed on the net, and you can search online or local copies for words or phrases. The importance of this can't be emphasized enough. Searchable text equates to quickly accessible knowledge; useful for maintaining and advancing Civilization and stuff.
  3. Quotable. Sections of the text can be easily extracted for quoting, using select, copy, paste.
However there are downsides to OCR'd text.
  1. Subtle visual elements of the pages are typically lost. If the OCR system doesn't recognise something as a character it will be omitted from the output. That means losing things like ink blots, stains, watermarks, handwritten notes, deliberate distortions of characters, etc.
  2. Unique and even some common font features will be lost. The OCR software is not a skilled human typographer, and won't recognise many features that might have been considered important by the original author/typesetter. Even if the OCR utility could recognise and understand such things, there may not be any facility to encode them in the output, other than falling back to purely graphical inclusions.
  3. Outright errors are common in OCR'd text. Blemishes or missing spots of ink on the original pages can fool the OCR software, causing it to misread characters. A single letter can become a different letter, or two letters, or a couple of letters can become merged into one, and so on. Sometimes these errors will pass a spellcheck, and a few will even survive a careful proofreading by a human.
  4. Archaic and subtle font features will be degraded or even lost entirely. The OCR system will be generating either plain ASCII/Unicode (no font information retained at all), or a styled document form in which 'best guess' known fonts and adornments (bold, italic, etc) will be selected to represent the original page appearance.
  5. Soul. Dead. Overall an OCR'd document just doesn't capture the stylistic spirit of the original work. If the original didn't have any soul, then that's OK. (Or maybe not, if you wanted to demonstrate that it was soulless.) Yet many old documents are alive via the idiosyncrasies of their visual style, production technology, aging effects and original blemishes. Losing all that detail is not acceptable.
Ideally, there would be an electronic document format that allowed inclusion of both the page images and the encoded text, with each word interlinked between the text and images. A viewer that allowed alternate or overlaid presentation, searching the coded text and jumping to the right place in the images for found search terms, etc, would be great. But so far as I know this doesn't yet exist. (I'm working on it, ha ha, don't hold your breath.)

And so at present a choice must still be made: OCR or images?

Since this article is about preserving rare and historical technical documents, the answer is obvious. One must choose the path that does not result in data loss. This means images must be used, since they capture both the soul and detail of the original work.

When a better document representation format is developed in future (combining the benefits of both searchable text and the accuracy of images), documents saved now as images can be retrofitted to the better format with no further data loss, retaining the original images plus added searchable encoded text.
On the other hand a document OCR'd and saved as text now, can never be restored to its original appearance. That was permanently lost in the OCR process.

Encapsulation

The purpose of all this effort is to produce a single file in some format, that contains a digital representation of the original printed document. But what format is that?
What we want is a file that:
  1. Directly represents itself as what it is. It should automatically appear as an image, for instance of the book cover.
    (See RARbook)
  2. Has as small a file size as possible.
  3. Reproduces the visual appearance of the document in a pleasing way, at the highest feasible resolution. Ideally at comfortable reading scale, there should be no perception of loss of resolution of the characters at all.
  4. Is convenient to access on a computer. This implies there's a functioning index system, text searching, ease of scrolling through the document, ability to view images in their full resolution, ability to extract text and images for use elsewhere, etc.
  5. Can be viewed by anyone without needing to buy or install specialized software. Or become dependent on untrusted or outright known-evil megacorporations, even if their software is freeware.
  6. Contains any appropriate metadata - what the file is, information about the capture of the original physical document, when, by whom, how, and anything else of relevance.
Obviously points 2 & 3 are in direct conflict. The higher the image resolution, the larger the total file size is going to be. This is where much of the art of document capture lies - in understanding the factors in images that contribute to filesize, and how to get the best tradeoff between resolution and filesize.

Furthermore it's not just the final image X,Y size and encoding method that determines the file size. One of the most important factors is the nature of the image content itself, and how it has been optimized during post-processing. Here the art lies in removing superfluous 'noise' from the hi-res image, even before it is scaled to the final size and compression format. For example a large area of pure white, in which every pixel has the value 0xFFFFFF (or any uniform value), will compress to almost nothing. On the other hand that same area of the image, if it consists of noisy dither near white, will not compress very well and may require many thousands of additional bytes in the final file, even though on screen it may appear to the eye to be 'pure white'.
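
As a rough way to see that effect for yourself, the sketch below (Python with Pillow and NumPy; the filename and threshold are invented) forces every near-white pixel to pure white and reports the change in PNG size:

  # Sketch: push every 'nearly white' pixel to pure white and compare PNG file sizes.
  # The 245 threshold is a guess -- tune it so text and line edges are not touched.
  import os
  import numpy as np
  from PIL import Image

  src = "page_042.png"                      # hypothetical page image
  px  = np.asarray(Image.open(src).convert("RGB")).copy()

  background = (px > 245).all(axis=2)       # pixels where all three channels are near white
  px[background] = 255

  Image.fromarray(px).save("page_042_flat.png")
  print("before:", os.path.getsize(src), "bytes")
  print("after: ", os.path.getsize("page_042_flat.png"), "bytes")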

When bundling the whole thing into a single archive file for distribution, be aware that JPG/PNG files are already very close to the maximum theoretical data compression ratio possible, and the archiving compression can't improve significantly on that. In fact the result will likely be larger than the sum of original filesizes, since there are overheads due to inclusion of file indexes, etc.

Potential encapsulation formats

Consequently, I prefer the combination of a RARbook wrapper around an html-and-images document. At least until something better turns up.
But your choice depends on your intended audience. If you are catering to unsophisticated computer users, PDF may be your only option.
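
For the html-and-images route, the wrapper itself can be almost trivially simple. As a sketch (the folder and file names are invented), something like this generates a bare-bones index page that shows every page image in order; a real wrapper would also carry the title and scan metadata mentioned above:

  # Sketch: write a minimal index.html that displays every page image in order.
  # Folder and file names are hypothetical; a real wrapper would also include the
  # document title, scan metadata, and per-page anchors for a hotlinked index.
  import html
  import os

  PAGE_DIR = "pages"            # e.g. 001.png, 002.png, ... (hypothetical)
  pages = sorted(f for f in os.listdir(PAGE_DIR) if f.lower().endswith(".png"))

  with open("index.html", "w", encoding="utf-8") as out:
      out.write("<html><head><title>Document title here</title></head><body>\n")
      for name in pages:
          out.write('<p id="%s"><img src="%s/%s"></p>\n'
                    % (html.escape(os.path.splitext(name)[0]), PAGE_DIR, html.escape(name)))
      out.write("</body></html>\n")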

RARbooks

   The RAR-book
link to description, history and how-to image.
      Benefits
          - Allows large data objects to present immediately as a cover illustration in standard filesystem viewers.
          - Not necessary to parse the file beyond the illustration, to see the cover.
          - The attached data objects can be of any kind, including unknown formats, and any number.
      Problems - only WinRAR seems to enable this. Why?
   Why the hell don't other file compression utils do the same 'search for archive header' thing?
   7-ZIP does!

PDF, and why it sucks

   why I hate PDF and won't use it. (More Adobe bashing)

Adobe Postscript was nice. A relatively sensible page description language.
PDF is just Postscript wrapped up in a proprietary obfuscation and data-hiding bundle.
It started out proprietary. Now _some_ of the standards are freely available, but the later
'archival PDF' ones require payment to obtain. Not acceptable.
The PDF document readers are always pretty horrible. Many flaws:
  - The single page vs continuous scroll mess. 
  - When saving pics from PDF pages, the resulting file is the image as you see it, 
    NOT the full resolution original. There should be an option to save at full res, 
    but there isn't. So image saving is unavoidably lossy. Unacceptable. 
    Also, strongly suspect it's deliberate.
    You can use Photoshop to open the original images from a PDF, but the process is painful.
  - The horrible mess usually resulting if you try to ctl-C copy text from PDF.
  - Moving to page n in long PDFs is ghastly slow and flakey. 
  - PDF page numbers almost never match actual page numbers.
  - Zoom is f*cked. Does not sensibly maintain focal point.
  - Usually very poor handling of variable size pages. Eg schematic foldouts.
  - Document structure not accessible for modification without special utilities.

Any encapsulation method MUST provide easy extraction of the exact image files as
originally bundled. This is why schemes like html+images are better than PDF -
once the document is extracted, the original images are right there in the native file system.

Post-Processing

All post processing should be done with copies of the hi-res originals, in a work folder. Once you have a complete set of processed pages (de-skewed, color-patched, cropped to all same size, etc, but still high-res) then generate a scaled and compressed final set in a third folder, using a batch process such as Irfanview.
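
If you prefer scripting that batch pass to a GUI tool, the same fixed crop / scale / save loop is only a few lines of Python with Pillow. This is a sketch; the folder names, crop box and scale factor are all invented:

  # Sketch: apply one fixed crop and one scale factor to every work page,
  # writing the results to a separate output folder. All values are examples only.
  import os
  from PIL import Image

  SRC, DST = "work_pages", "final_pages"    # hypothetical folder names
  CROP  = (120, 90, 2600, 3500)             # left, top, right, bottom -- identical for all pages
  SCALE = 0.5                               # final size relative to the cropped image

  os.makedirs(DST, exist_ok=True)
  for name in sorted(os.listdir(SRC)):
      if not name.lower().endswith(".png"):
          continue
      img = Image.open(os.path.join(SRC, name)).crop(CROP)
      w, h = img.size
      img = img.resize((int(w * SCALE), int(h * SCALE)), Image.LANCZOS)
      img.save(os.path.join(DST, name))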

%%%

Topics
For each, describe typical sequence of graphics editor actions. Illustrate with commands and before & after pics.

 * Descreening. If any of this is to be done, it MUST be done before any other processing.
 * Deskewing pages
 * Unifying page content placement  (see notes in 'to_add')
 * Stitching together scans of large sheets
 * Improving saturation (B&W and colour) of ink
 * Image noise.
 * Cleaning up page background
     specks
     creases/folds
     shadows, eg at spine.
     scribbles
     area tonal flaws, off-white, etc.
 * Repairing print flaws in characters, symbols, etc.
     copy/paste identical char/word of same font.

-automation-  I wish. Need to find ways to script all these actions.
     

Cleaning up tonal flaws in scans

Need to be able to convert all near-white to pure white, near-black to pure black, etc.
BUT... this has to be applied selectively: manually, in areas where there are imperfections, but not in areas where the irregularities are required to remain. Also the shading on the edges of text and line drawings must not be compromised.

%%% describe method of creating a selection of all the 'wanted' tones, shrinking then expanding it (to remove isolated specks), inverting it (now the selection should include only the 'blank' areas), then filling with white, to get a clean background white. (A scripted sketch of the basic idea follows the steps below.)

  Need to be able to:
   1. select a colour range (eg near-white)
   2. Invert selection - selects all 'wrong' colours - which includes lots of other pic elements too.
   3. Create new 'Flag' layer, fill selection on that layer with some bright colour eg green, to show up the imperfections. 
   4. Deselect all.
   5. With paintbrush and block select+fill, manually wipe imperfections where desired, but NOT in other areas.
   PROBLEM: how to over-paint both the (invisible) imperfection on the picture layer AND the marker colour on the flag layer at the same time.
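
A scripted sketch of the basic 'flatten the background' idea (Python with NumPy and SciPy; the filename and thresholds are invented). It cannot make the manual judgement calls described above; it simply keeps anything dark enough to look like ink, drops isolated specks, and forces everything else to pure white:

  # Sketch only: keep 'ink' (anything darker than near-white), drop isolated specks,
  # and force the rest of the page to pure white. The threshold and the structuring
  # element sizes are guesses -- tune per document and check edges at high zoom.
  import numpy as np
  from PIL import Image
  from scipy import ndimage

  px = np.asarray(Image.open("page_042.png").convert("L"))   # hypothetical greyscale page

  ink  = px < 230                                        # the 'wanted' tones
  ink  = ndimage.binary_opening(ink, np.ones((2, 2)))    # shrink then expand: removes isolated specks
  keep = ndimage.binary_dilation(ink, np.ones((3, 3)))   # grow back a little, so shaded edges survive

  out = px.copy()
  out[~keep] = 255                                       # everything else becomes clean white
  Image.fromarray(out).save("page_042_clean.png")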

to add:
id="fax-jaggies"

Common Traps in Post-Processing

Don't de-skew scans in Indexed Color Mode

When scanning pages in gray-scale and saving in PNG format, commonly the encoding will be '8-bit indexed color'. You look at the resulting image files and they look good, except with the usual small skews. So you load them into Photoshop and straighten them up. Save the files and go on to further post-processing stages. A while and perhaps much work later, you notice some lines in the images are very slightly serrated.

Sigh. Going to have to go back and do it all again. (True story.)

It turns out that while Photoshop does a great job of rotating images in RGB/8 mode, images in Indexed color mode will be slightly but irrevocably stuffed up by rotation. It appears to be related to the colour index table not containing the values required to achieve a smoothly shaded line edge, resulting in that nasty 'stepped' effect, which seriously destroys the visual linearity of line edges and the perceived overall line positioning. As seen in the center example. Notice that the font is corrupted as well.

It's non-recoverable, and carries through scaling operations. So any work you did on the images after de-skewing them is wasted.
The lesson is, always ensure scan images are converted to RGB/8 mode before you do anything else with them.
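
The same trap applies if you de-skew by script rather than in Photoshop. As a sketch with Pillow (the filename and rotation angle are invented), the safe order is: convert to RGB first, then rotate:

  # Sketch: promote an indexed-colour (palette) PNG to RGB *before* rotating it.
  # Rotating in palette mode cannot create the in-between shades a smooth
  # anti-aliased edge needs, which is what produces the serrated lines above.
  from PIL import Image

  img = Image.open("page_017.png")     # hypothetical scan saved as 8-bit indexed PNG
  if img.mode == "P":                  # 'P' is Pillow's palette / indexed-colour mode
      img = img.convert("RGB")

  img = img.rotate(-0.7, resample=Image.BICUBIC, expand=True, fillcolor="white")
  img.save("page_017_deskewed.png")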

Unsolved Problems
You'd assume that capturing old and sometimes rare printed works to digital format, to make them available to all as part of the cultural heritage of humankind, would be high on computing science's priority list. And that therefore, all the practical issues of how to actually do this would have been solved, with widely known and available tools.

But no. There are some unsolved issues still. Or at least if solutions do exist they are obscure enough that so far they haven't turned up in my searches. If anyone knows of solutions, please let me know.

These:

Offset Printed Screening

(major topic)
Explanation of what print screening is, and why it's used.
  (link to my article 'Disconnecting the Dots')
The problem of converting offset printing screened shading, to solid tones.
   problems dealing with offset screening. (egs from the HP manual and others)
     (detailed description, with lots of pics. Include section critical of Adobe for not implementing
     this greatly needed feature in photoshop and requesting that they do so.)
   Software tools, and comments
      - Ref 'magic filter' in CS6, and its use & shortcomings.
      - http://www.descreen.net
      - Other descreening utilities.
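
For what it's worth, the crude first pass that most of these tools build on is just a blur at roughly the screen pitch followed by a mild re-sharpen. The sketch below (Pillow; the radii are invented) is emphatically not a solution to the hard cases, only a starting point:

  # Sketch of the naive blur-and-resharpen approach. The blur radius must roughly
  # match the halftone dot pitch in pixels; both radii here are invented, and this
  # will NOT cope with coarse or irregular screens.
  from PIL import Image, ImageFilter

  img = Image.open("photo_page.png").convert("RGB")    # hypothetical screened page
  img = img.filter(ImageFilter.GaussianBlur(radius=2.0))
  img = img.filter(ImageFilter.UnsharpMask(radius=2.0, percent=80, threshold=2))
  img.save("photo_page_descreened.png")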

Combining searchable text and images of the text

%%% On the need to present a document that includes both clear images of the original pages, plus the raw text of the information content. The two should be word-by-word interlinked. Discuss implications for underlying text and image coding schemes, and utilities to create & display such documents.

Gallery of Flaws

The following flaws are all commonly found in digital documents derived from original printed works. For each one the flaw is illustrated, the cause explained, and the correct methods for avoiding it detailed.

 %%% still to add graphic examples of each case.
  Also improve the topic/header styling.