Use Google OCR and resubmit all pages through it.
Google OCR gives an almost flawless image-to-text conversion. I think you are using the hopeless Tesseract. I suggest you collaborate with Google and resubmit all the pages already done; the improvements would be enormous. As an example, the Keith Waterhouse column in the Daily Mirror 19780810, page 8, gives utterly garbled output. The first section, as it comes out of Google OCR without any manual corrections, is:
My hit that missed..
THAT splendid actor Jack Hedley, in an interview about the making of the BBC serial "Who Pays the
Ferryman?" tells of the mysterious influence that the island of Crete had over him and other members of
"It had a profound effect on me," he reports. "I have changed considerably since I came back. One of
the fundamental changes is that I don't take The Times any more."
As well as being inscrutable this is a great shame, for Mr. Hedley is a thoughtful man, and The Times
has just commenced a series that will cause furrows on many a ruminative forehead.
These articles are about historical events which never took place, and the first one poses the
question: "What would have happened it Hitler had been assassinated in July, 1944 7" **
Your version reads (in part):
THAT splendid actor Jack Hedley, in an interview . ' 959/ 5 -. WashingMachine7l42A Sale about the making of the BB C serial " Who Pays the 9 Programmes rpm spn speed ary wort capably. PRICE Ferryman?” tells of the mysterious influence that the ','• 17 5 as island of Crete had over him and other members of the cast. -- "It had a ofound effect on me,” he reports. " I have \ changed considerably since I came back. One of the la 1 f 35 TRADE-IN f ra u o n re
This is really not acceptable. As a result, a search never finds articles it ought to find (say, when searching for a relative or a place).
I've just tried it on the article that brought me here: about 95% success, compared with less than 10%. The BNA OCR attempt was utter garbage, with a multitude of stray symbols and other non-normal characters, such as:
"cultivator y® •®f honours. "
"continued to . ~ * • -ijii rr*, live "
"of the dung, and PJ trick : ft . r * v ett - was fc *• , . .. . , , , , Westminster School,"
Running a portion of the downloaded image through Google Keep gave spelling errors and wrong letters, but nothing like this waste of space.
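For anyone who wants to put a number on a comparison like this rather than eyeball it, here is a rough sketch in Python. It assumes you have a hand-checked transcript of the passage to compare against; the function names `words` and `word_recall` are my own, not anything the BNA or Google provides, and the score is only a crude word-level measure, not a standard OCR metric.

```python
import difflib
import re

def words(text):
    """Lower-case the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def word_recall(reference, ocr_output):
    """Fraction of reference words that also appear, in order, in the OCR output."""
    ref, out = words(reference), words(ocr_output)
    if not ref:
        return 0.0
    # autojunk=False stops difflib from ignoring very common words in long texts
    matcher = difflib.SequenceMatcher(None, ref, out, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / len(ref)

# Hand-checked sentence versus the same sentence as it appears in the garbled text
reference = "It had a profound effect on me, he reports."
garbled = "It had a ofound effect on me, he reports."
print(round(word_recall(reference, garbled), 2))
```

Punctuation and capitalisation are ignored here on purpose, since a search only needs the words themselves to be right.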
The BNA also has utterly strange ways of selecting which parts of the page to scan together: in every article I have seen so far, the sections are a complete mix of different things. Mine is a farming report merged (the BNA didn't spot the column divider!) with the Death of a Marquis, plus the Intelligence from Cambridge University, with an Arrest Report intermingled among them.
How on earth dare the BNA suggest that their OCR is accurate?