Improve Optical Character Recognition

The existing OCR is poor. Many words are garbled, rendering search results meaningless.

132 votes

Anonymous shared this idea · May 21, 2014 · Report… · Admin →

Open - Ongoing process · Oct 14, 2014


J commented · October 22, 2023 2:13 AM · Report

Dear ADMIN
If this has supposedly been improving since 2014, why is there no evidence? Is it because the BNA is rushing into NEW digitisations when there is already such a huge amount of unsearchable information out there? It is pot luck whether the OCR produced when a page was digitised is even remotely related to what we as humans can read in the image scan, or even to what Google can transcribe from the same image.
For goodness' sake, they can now read information in dead languages on papyrus rolls that were incinerated in the pyroclastic flows from the Vesuvius destruction of Pompeii and Herculaneum!

Magpie commented · March 27, 2023 1:43 PM · Report

It's seriously the most useless OCR I've ever seen. Anyone who is a British citizen should be furious that this is the only way these newspapers are kept digitally. Crazy that the British Library would even sponsor this.

Lue William commented · January 19, 2023 9:37 PM · Report

Artificial intelligence is a process that is used to create smart machines that are able to perform tasks and solve problems without being explicitly programmed this link use https://www.digitfeast.com/2021/03/how-to-use-artificial-intelligence-in-mobile-apps.html AI engines use data-driven algorithms to solve problems, making mobile app development easier and faster. becomes more effective.

J commented · May 23, 2022 7:42 AM · Report

Not only that, but even the FREE Google text recognition in Google Keep on my phone is better!

J commented · July 6, 2021 8:26 AM · Report

It is.
Can I suggest you add your votes to this post which has a lot of votes already - maybe if it grows enough some notice will be taken.
https://help-and-advice.britishnewspaperarchive.co.uk/forums/243685-search-improvements/suggestions/42821181-replace-the-ocr-engine

johntob commented · June 30, 2021 4:29 AM · Report

Extremely disappointed with the Optical Character Recognition (OCR) software. Makes me suspicious. Is the British Newspaper Archive just a front for some more sinister organisation, set up to hide information by wasting the searchers' time and making them give up out of sheer frustration? Is it just a front for Orwell's Ministry of Information?

Graham Wootten commented · April 26, 2021 2:02 PM · Report

Quite agree. It is absolutely useless!

Peter Johnson commented · March 3, 2021 4:51 AM · Report

It's well past time that the OCR engine was replaced with one fit for the 21st Century.
The present one omits thousands, if not millions, of random words for no obvious reason, and not because the image is flawed, and often ignores the column boundaries. On many pages the OCR result is missing completely.
A modern fit-for-purpose OCR engine would make a massive improvement.

Anonymous commented · January 26, 2021 2:18 PM · Report

I am doing research into the early years of The Vegan Society. If you search for the word "vegan", you get so many poor OCR-based matches that it becomes extremely arduous to get results actually about veganism.

I would wager this is the same for a great many other words.

J commented · September 25, 2020 4:59 PM · Report

The font is no excuse.

I've just used Google Keep and "extract text" on an image downloaded from the BNA site - the article that made me look here. About 99% success, compared to less than 10%. The BNA OCR attempt was utter garbage, with a multitude of non-normal characters such as
"cultivator y® •®f honours. "
"continued to . ~ * • -ijii rr*, live "
"of the dung, and PJ trick : ft . r * v ett - was fc *• , . .. . , , , , Westminster School,"

Google Keep gave spelling errors, and wrong letters but not this waste of space.

The BNA also has utterly strange ways of selecting parts of the page to scan together, because in every article I have seen so far, the sections are a complete mix of different things. Mine is a farming report merged (BNA didn't spot the column divider!) with the Death of a Marquis, plus the Intelligence from Cambridge University, with an Arrest Report intermingled with both bits.

How on earth dare the BNA suggest that their OCR is accurate?

J commented · September 25, 2020 4:41 PM · Report

This is already a top suggestion - vote for it!
https://help-and-advice.britishnewspaperarchive.co.uk/forums/243685-search-improvements/suggestions/5953136-improve-optical-character-recognition

J commented · September 25, 2020 4:19 PM · Report

Absolutely. Sadly I've now used my votes on similar suggestions.

J commented · September 25, 2020 4:17 PM · Report

I've just tried it on the article that made me look here. About 95% success, compared to less than 10%. The BNA OCR attempt was utter garbage, with a multitude of non-normal characters such as
"cultivator y® •®f honours. "
"continued to . ~ * • -ijii rr*, live "
"of the dung, and PJ trick : ft . r * v ett - was fc *• , . .. . , , , , Westminster School,"

Running a portion of the image as downloaded through Google Keep gave spelling errors, and wrong letters but not this waste of space.

The BNA also has utterly strange ways of selecting parts of the page to scan together, because in every article I have seen so far, the sections are a complete mix of different things. Mine is a farming report merged (BNA didn't spot the column divider!) with the Death of a Marquis, plus the Intelligence from Cambridge University, with an Arrest Report intermingled with both bits.

How on earth dare the BNA suggest that their OCR is accurate?

Fortean commented · August 2, 2020 9:58 AM · Report

Replace your garbage OCR with something that actually works! A search for “Shower of Ants in Cambridgeshire” failed because the OCR rendered the text into this gibberish: "Pr.ov Lm CA A phenomenon lin jun occu rredat the • tars VM S orr A closet was &served to be peeling over, w anddeinis to bo eatoulatment of the villagers. It wee .eve to be auto and stattlar winged leant& Pe. pis and the grand snietheved with then, awl they swarmed in milli se. glory step taint .gushed hued ells of than."

CF commented · March 10, 2020 12:59 AM · Report

When performing OCR, it's quite remarkable how the quality of the output is driven by the fonts which the software used supports/understands. For example, I was asked to digitise a printed document for a family history society (which owned the copyright to the document). The first attempt at OCR-ing the document produced absolute rubbish. I then spent some time identifying the font (not long - at most 1 hour). The quality of the OCR output then rose to 99% ...

I have no detailed knowledge of the history of newspapers but I suspect that many used the same small set of fonts which are no longer standard today. I reckon that a newspaper historian would be able to point to documents which chart the history of newspaper fonts, and that this would allow The BNA to install the appropriate fonts into its OCR software. (Don't forget to include the bold and italic flavours as well.)

Anonymous commented · November 11, 2019 4:43 AM · Report

I agree with this idea. Additionally, this can be solved using "Regular Expression" (regex) technology, which is free.
I explain more in my suggestion on this site at https://help-and-advice.britishnewspaperarchive.co.uk/forums/243749-website-improvements/suggestions/38737225-allow-searches-to-be-specified-using-regular-expr
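
As a rough illustration of the regular-expression idea, here is a minimal Python sketch that builds a pattern tolerating a few common OCR confusions. The confusion table and the function name `ocr_tolerant_pattern` are assumptions made up for this example; nothing here reflects how the BNA search actually works.

```python
import re

# Hypothetical table: each letter maps to the characters an OCR engine
# might plausibly output in its place (an assumption for illustration).
CONFUSIONS = {
    "c": "ce",
    "e": "ec",
    "i": "il1",
    "l": "l1I",
    "n": "nu",
    "o": "o0",
    "s": "sf5",
    "u": "un",
}

def ocr_tolerant_pattern(word: str) -> re.Pattern:
    """Return a whole-word regex that also matches common OCR misreadings."""
    parts = []
    for ch in word.lower():
        alternatives = CONFUSIONS.get(ch, ch)
        # One character class per letter, e.g. "l" becomes "[l1I]".
        parts.append("[" + re.escape(alternatives) + "]")
    return re.compile(r"\b" + "".join(parts) + r"\b", re.IGNORECASE)

if __name__ == "__main__":
    pattern = ocr_tolerant_pattern("closet")
    print(bool(pattern.search("A c1oset was observed")))
```

This only helps with one-for-one character swaps; it would not rescue text where whole words are missing or columns have been merged.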

Geoff. commented · April 14, 2019 6:30 AM · Report

Google OCR gives an almost flawless image-to-text conversion. I think you are using the hopeless Tesseract. I suggest you collaborate with Google and resubmit all the pages already done; the improvements will be enormous. As an example, the Keith Waterhouse column in the Daily Mirror, 1978_08_10, page 8, gives utterly garbled output. The first section that comes out of Google OCR, without any manual corrections, is:
My hit that missed..
THAT splendid actor Jack Hedley, in an interview about the making of the BBC serial "Who Pays the Ferryman?" tells of the mysterious influence that the island of Crete had over him and other members of the cast.
"It had a profound effect on me," he reports. "I have changed considerably since I came back. One of the fundamental changes is that I don't take The Times any more."
As well as being inscrutable this is a great shame, for Mr. Hedley is a thoughtful man, and The Times has just commenced a series that will cause furrows on many a ruminative forehead.
These articles are about historical events which never took place, and the first one poses the question: "What would have happened it Hitler had been assassinated in July, 1944 7" **
------------------------------------------------------------
Your version reads (partially):
THAT splendid actor Jack Hedley, in an interview . ' 959/ 5 -. WashingMachine7l42A Sale about the making of the BB C serial " Who Pays the 9 Programmes rpm spn speed ary wort capably. PRICE Ferryman?” tells of the mysterious influence that the ','• 17 5 as island of Crete had over him and other members of the cast. -- "It had a ofound effect on me,” he reports. " I have \ changed considerably since I came back. One of the la 1 f 35 TRADE-IN f ra u o n re

-----
This is really not acceptable. As a result, a search never finds the articles it should (say, when searching for a relative or a place).
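
To put a rough number on a comparison like this, here is a minimal Python sketch using the standard library's `difflib`. The function name `similarity` is an assumption for this example, the two strings are short excerpts from the outputs quoted in this comment, and the ratio is only a crude similarity score, not a formal character error rate.

```python
from difflib import SequenceMatcher

def similarity(reference: str, candidate: str) -> float:
    """Return a 0..1 similarity score between two transcriptions."""
    return SequenceMatcher(None, reference, candidate).ratio()

# Excerpts taken from the two OCR outputs quoted in this comment.
google_text = (
    "THAT splendid actor Jack Hedley, in an interview "
    "about the making of the BBC serial"
)
bna_text = (
    "THAT splendid actor Jack Hedley, in an interview . ' 959/ 5 -. "
    "WashingMachine7l42A Sale about the making of the BB C serial"
)

if __name__ == "__main__":
    print(f"similarity: {similarity(google_text, bna_text):.2f}")
```

A proper evaluation would compare each engine's output against a hand-checked transcription of the page rather than against another engine's output.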

Anonymous commented · March 18, 2019 5:44 PM · Report

In the older newspapers it's difficult to find keywords that include the letter s.
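
One likely cause, offered as an assumption rather than anything the BNA has confirmed, is the long s (ſ) of older typography, which OCR engines often read as f. A minimal Python sketch of a workaround is to generate the f-for-s variants of a keyword and search for each. The function name `long_s_variants` is made up for this example, and it simplifies the historical rules down to "the long s was not used as the last letter of a word".

```python
from itertools import product

def long_s_variants(word: str) -> list[str]:
    """Return the spellings of `word` with each non-final s kept or read as f."""
    last = len(word) - 1
    options = []
    for i, ch in enumerate(word):
        if ch == "s" and i != last:
            # A long s in this position may have been misread as f.
            options.append(("s", "f"))
        else:
            options.append((ch,))
    return ["".join(combo) for combo in product(*options)]

if __name__ == "__main__":
    for variant in long_s_variants("mistress"):
        print(variant)
```

The number of variants doubles with every non-final s, so this is only practical for short keywords.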

VB commented · August 16, 2018 12:44 AM · Report

While I sympathise with the difficulties there must be with digitising many old publications, there are many times when I can see no reason why the OCR text is as bad as it is. I imagine that, as with some other software, a choice is being made between speed and accuracy; too often (on the Wiltshire Times, which I have used most recently) speed has been chosen, resulting in long sections of gobbledegook. There is so much OCR text that is so bad, I don't have time to correct it - if I want to use it, I have to transcribe it from scratch. I don't do that for BNA to benefit, while paying for it!

Anonymous commented · June 4, 2018 6:18 PM · Report

I use Trove from the NLA a lot and their OCR is vastly superior to the OCR results from the BNA.
