Improve Optical Character Recognition
The existing OCR is poor. Many words are garbled, rendering search results meaningless.
The font is no excuse.
I've just used Google Keep and "extract text" on an image downloaded from the BNA site - the article that made me look here. That gave about 99% success, compared to less than 10% from the BNA. The BNA OCR attempt was utter garbage, with a multitude of non-normal characters such as
"cultivator y® •®f honours. "
"continued to . ~ * • -ijii rr*, live "
"of the dung, and PJ trick : ft . r * v ett - was fc *• , . .. . , , , , Westminster School,"
Google Keep gave spelling errors and wrong letters, but not this waste of space.
The BNA also has utterly strange ways of selecting parts of the page to scan together, because in every article I have seen so far, the sections are a complete mix of different things. Mine is a farming report merged (BNA didn't spot the column divider!) with the Death of a Marquis, plus the Intelligence from Cambridge University, with an Arrest Report intermingled with both bits.
How on earth dare the BNA suggest that their OCR is accurate?
When performing OCR, it's quite remarkable how the quality of the output is driven by the fonts which the software used supports/understands. For example, I was asked to digitise a printed document for a family history society (which owned the copyright to the document). The first attempt at OCR-ing the document produced absolute rubbish. I then spent some time identifying the font (not long - at most 1 hour). The quality of the OCR output then rose to 99% ...
I have no detailed knowledge of the history of newspapers but I suspect that many used the same small set of fonts which are no longer standard today. I reckon that a newspaper historian would be able to point to documents which chart the history of newspaper fonts, and that this would allow The BNA to install the appropriate fonts into its OCR software. (Don't forget to include the bold and italic flavours as well.)
While I sympathise with the difficulties there must be with digitising many old publications, there are many times when I can see no reason why the OCR text is as bad as it is. I imagine that, as with some other software, a choice is being made between speed and accuracy; too often (on the Wiltshire Times, which I have used most recently) speed has been chosen, resulting in long sections of gobbledegook. There is so much OCR text that is so bad, I don't have time to correct it - if I want to use it, I have to transcribe it from scratch. I don't do that for BNA to benefit, while paying for it!
I use Trove from the NLA a lot and their OCR is vastly superior to the OCR results from the BNA.
This short example shows how bad the OCR can be!
b^*T. fad to ala t.o* unfit r *f b* f aaapar.aj b* nvglMt or d«la> Th* 1«44i u*4l •u'kjntb «iU rtrw >• •oiiiMbiog • hi* L will |>ut them ,n 4 batuw tb*y lit«id iu u*for* •*» I tii4»*a* •urkinn. *f» 4l(*Adj m Uua bat tar pu«itA<4 in <i v«rk lOtalitVo* <jf lhe wvrti"g Ulna 1* *bonar condition* .J labcvr 4r« oaora aatiafactory; mwii cannot t» dri»#i» 4iid harAaaad tliay •ar» Tf.at ha* a trad# »n.wn •bt«b abiatda a«aiiiat tnanr abtiaaa •OK-h fortoarlr lb a* anbnut Bur* •* Worth But tb*M improtovitoou or* port
I am referring to recently scanned pages. The Lincs Chronicle of 1919 is almost useless. I would love to post here the text as read by OCR to illustrate this.
Sometimes it seems that the newspapers you are scanning are of poor quality. They do fade and get damaged, and their storage affects their survival. It must be difficult to get hold of good copies quickly, so some rescanning may have to take place in the future.
If we could use wildcards it would get round some of these problems.
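To illustrate what a wildcard search could do for misread letters, here is a minimal sketch in Python. It is not how the BNA search works; the function name `wildcard_search` and the sample text are made up for illustration, and `?` is assumed to stand for exactly one character.

```python
import re
from fnmatch import fnmatchcase

def wildcard_search(pattern, text):
    """Return the words in text that match a wildcard pattern.

    '?' matches exactly one character and '*' matches any run of
    characters. Matching ignores case. (Hypothetical helper.)
    """
    words = re.findall(r"[A-Za-z]+", text)
    return [w for w in words if fnmatchcase(w.lower(), pattern.lower())]

# A surname with one misread letter is still found, while a longer
# word that merely contains similar letters is not.
sample = "Mr Ivin and Mr Iuin and Giving"
print(wildcard_search("I?IN", sample))
```

With a search like this, one query for `I?IN` would catch both the correct spelling and a single-letter misread.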
John Woolman commented
There are some quite common misreads, such as 'tbe' instead of 'the'. Would it be possible to do a find-all and replace-all search?
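As a rough idea of what such a find-and-replace could look like, here is a small Python sketch. The table `MISREADS` and the function `fix_misreads` are hypothetical; only the 'tbe' entry comes from this thread, and a real list would need to be built from the actual OCR output.

```python
import re

# Hypothetical table of common OCR misreads and their corrections.
# Only 'tbe' -> 'the' is taken from the comment above.
MISREADS = {"tbe": "the"}

def fix_misreads(text, table=MISREADS):
    """Replace whole-word misreads only, leaving longer words untouched."""
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(word) for word in table) + r")\b"
    )
    return pattern.sub(lambda m: table[m.group(1)], text)

print(fix_misreads("tbe hotbed of tbe town"))
```

Matching on whole words matters here: a blind replace of the letters "tbe" would also damage a word like "hotbed". This sketch is case-sensitive, so a capitalised "Tbe" would need its own entry.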
I find that searching with 'EXACT' for a surname like IVIN picks up Giving and other such words. This shouldn't be so with good OCR.
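For comparison, this is roughly how a strict whole-word match would behave, sketched in Python. I don't know how the BNA 'EXACT' search is implemented, so the function `exact_word_search` is purely an assumption for illustration.

```python
import re

def exact_word_search(term, text):
    """Return matches of term as a complete word, ignoring case.

    The word boundaries (\\b) stop the term matching inside a
    longer word. (Hypothetical helper.)
    """
    pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
    return [m.group(0) for m in pattern.finditer(text)]

# 'Giving' contains the letters i-v-i-n but is not returned.
print(exact_word_search("IVIN", "Giving thanks, Mr Ivin left."))
```

A plain substring search would find "ivin" inside "Giving"; the whole-word version only returns the surname itself.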