Improve Optical Character Recognition
The existing OCR is poor. Many words are garbled, rendering search results meaningless.
-
J commented
Dear ADMIN
If this was supposed to be improving since 2014, why is there no evidence? Is it because BNA are rushing into NEW digitisations, when there is such a huge amount of unsearchable information out there? It is pot luck whether the image scan and the OCR at the time the page was digitised are even remotely related to what we as humans can read.
Or even to what Google can transcribe from the same image.
For goodness' sake, they can now read information in dead languages on papyrus rolls that were incinerated in the pyroclastic flows from the Vesuvius destruction of Pompeii and Herculaneum!
-
Magpie commented
It's seriously the most useless OCR I've ever seen. Anyone who is a British citizen should be furious that this is the only way these newspapers are kept digitally. Crazy that the British Library would even sponsor this.
-
J commented
Not only that, but even the FREE Google text recognition in Google Keep on my phone is better!
-
J commented
It is.
Can I suggest you add your votes to this post, which has a lot of votes already? Maybe if it grows enough some notice will be taken.
https://help-and-advice.britishnewspaperarchive.co.uk/forums/243685-search-improvements/suggestions/42821181-replace-the-ocr-engine
-
johntob commented
Extremely disappointed with the Optical Character Recognition (OCR) software. Makes me suspicious. Is the British Newspaper Archive just a front for some more sinister organisation set up to hide information by wasting the searchers' time and making them give up out of sheer frustration? Is it just a front for Orwell's Ministry of Information?
-
Graham Wootten commented
Quite agree. It is absolutely useless!
-
Peter Johnson commented
It's well past time that the OCR engine was replaced with one fit for the 21st Century.
The present one omits thousands, if not millions, of random words for no obvious reason, and not because the image is flawed, and often ignores the column boundaries. On many pages the OCR result is missing completely.
A modern fit-for-purpose OCR engine would make a massive improvement.
-
Anonymous commented
I am doing research into the early years of The Vegan Society. If you search for the word "vegan", you get so many poor OCR-based matches that it becomes extremely arduous to get results actually about veganism.
I would wager this is the same for a great many other words.
-
J commented
The font is no excuse.
I've just used Google Keep and "extract text" on an image downloaded from the BNA site - the article that made me look here. About 99% success, compared to less than 10%. The BNA OCR attempt was utter garbage, with a multitude of non-normal characters such as
"cultivator y® •®f honours. "
"continued to . ~ * • -ijii rr*, live "
"of the dung, and PJ trick : ft . r * v ett - was fc *• , . .. . , , , , Westminster School,"
Google Keep gave spelling errors and wrong letters, but not this waste of space.
The BNA also has utterly strange ways of selecting parts of the page to scan together, because in every article I have seen so far, the sections are a complete mix of different things. Mine is a farming report merged (BNA didn't spot the column divider!) with the Death of a Marquis, plus the Intelligence from Cambridge University, with an Arrest Report intermingled with both bits.
How on earth dare the BNA suggest that their OCR is accurate?
-
J commented
This is already a top suggestion - vote for it!
https://help-and-advice.britishnewspaperarchive.co.uk/forums/243685-search-improvements/suggestions/5953136-improve-optical-character-recognition
-
J commented
Absolutely. Sadly I've now used my votes on similar suggestions.
-
J commented
I've just tried it on the article that made me look here. About 95% success, compared to less than 10%. The BNA OCR attempt was utter garbage, with a multitude of non-normal characters such as
"cultivator y® •®f honours. "
"continued to . ~ * • -ijii rr*, live "
"of the dung, and PJ trick : ft . r * v ett - was fc *• , . .. . , , , , Westminster School,"
Running a portion of the image as downloaded through Google Keep gave spelling errors and wrong letters, but not this waste of space.
The BNA also has utterly strange ways of selecting parts of the page to scan together, because in every article I have seen so far, the sections are a complete mix of different things. Mine is a farming report merged (BNA didn't spot the column divider!) with the Death of a Marquis, plus the Intelligence from Cambridge University, with an Arrest Report intermingled with both bits.
How on earth dare the BNA suggest that their OCR is accurate?
-
Fortean commented
Replace your garbage OCR with something that actually works! A search for “Shower of Ants in Cambridgeshire” failed because the OCR rendered the text into this gibberish: "Pr.ov Lm CA A phenomenon lin jun occu rredat the • tars VM S orr A closet was &served to be peeling over, w anddeinis to bo eatoulatment of the villagers. It wee .eve to be auto and stattlar winged leant& Pe. pis and the grand snietheved with then, awl they swarmed in milli se. glory step taint .gushed hued ells of than."
-
CF commented
When performing OCR, it's quite remarkable how the quality of the output is driven by the fonts which the software used supports/understands. For example, I was asked to digitise a printed document for a family history society (which owned the copyright to the document). The first attempt at OCR-ing the document produced absolute rubbish. I then spent some time identifying the font (not long - at most 1 hour). The quality of the OCR output then rose to 99% ...
I have no detailed knowledge of the history of newspapers but I suspect that many used the same small set of fonts which are no longer standard today. I reckon that a newspaper historian would be able to point to documents which chart the history of newspaper fonts, and that this would allow The BNA to install the appropriate fonts into its OCR software. (Don't forget to include the bold and italic flavours as well.)
-
Anonymous commented
I agree with this idea. Additionally, this can be solved using "Regular Expression" (regex) technology, which is free.
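As a rough sketch of the regex idea in Python: a search term can be expanded into a pattern that tolerates some common OCR misreads. The confusion table and the function name `tolerant_pattern` are invented for illustration; this is an assumption about how such a search might work, not anything the BNA site offers.

```python
import re

# Hypothetical table of letters that OCR often confuses (illustrative only).
CONFUSIONS = {
    "a": "aeo",
    "e": "eco",
    "n": "nu",
    "l": "l1I",
    "o": "o0",
}


def tolerant_pattern(term):
    """Build a regex that matches `term` despite some OCR misreads."""
    parts = []
    for ch in term:
        if ch.isspace():
            parts.append(r"\s+")
        else:
            alternatives = CONFUSIONS.get(ch.lower(), ch)
            # One character class per letter, listing its assumed misreadings.
            parts.append("[" + "".join(re.escape(c) for c in alternatives) + "]")
    # Allow one stray space between letters, since garbled OCR often inserts them.
    return re.compile(r"\s?".join(parts), re.IGNORECASE)


pattern = tolerant_pattern("vegan")
print(bool(pattern.search("The Vcgau Society")))  # True: matches despite two misreads
print(bool(pattern.search("a wagon load")))       # False
```

A pattern this loose will also produce false matches, so it helps recall at the cost of precision.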
I explain more in my suggestion on this site at https://help-and-advice.britishnewspaperarchive.co.uk/forums/243749-website-improvements/suggestions/38737225-allow-searches-to-be-specified-using-regular-expr
-
Geoff. commented
Using Google OCR gives an almost flawless image-to-text conversion. I think you are using the hopeless Tesseract. I suggest you collaborate with Google and resubmit all the pages already done; the improvements will be enormous. As an example, the Keith Waterhouse column in Daily Mirror 1978_08_10 page 8 gives an utterly garbled output. The first section that comes out of Google OCR without any manual corrections is:
My hit that missed..
THAT splendid actor Jack Hedley, in an interview about the making of the BBC serial "Who Pays theFerryman?" tells of the mysterious influence that the island of Crete had over him and other members of
the cast.
"It had a profound effect on me," he reports. "I have changed considerably since I came back. One ofthe fundamental changes is that I don't take The Times any more."
As well as being inscrutable this is a great shame, for Mr. Hedley is a thoughtful man, and The Timeshas just commenced a series that will cause furrows on many a ruminative forehead.
These articles are about historical events which never took place, and the first one poses thequestion: "What would have happened it Hitler had been assassinated in July, 1944 7" **
------------------------------------------------------------
Your version reads (partially):
THAT splendid actor Jack Hedley, in an interview . ' 959/ 5 -. WashingMachine7l42A Sale about the making of the BB C serial " Who Pays the 9 Programmes rpm spn speed ary wort capably. PRICE Ferryman?” tells of the mysterious influence that the ','• 17 5 as island of Crete had over him and other members of the cast. -- "It had a ofound effect on me,” he reports. " I have \ changed considerably since I came back. One of the la 1 f 35 TRADE-IN f ra u o n re-----
This is really not acceptable. As a result, a search never finds the articles it should find (say, when searching for a relative or a place).
-
Anonymous commented
In the older newspapers it's difficult to find keywords that include the letter s.
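One likely cause (an assumption; it has not been confirmed for the BNA engine) is the long s (ſ) used in older print, which OCR commonly reads as f. A rough Python sketch of a workaround, with a hypothetical function name, that lets every s in a keyword also match ſ or f:

```python
import re


def long_s_pattern(keyword):
    """Build a regex where each 's' may also appear as long s or 'f'."""
    parts = []
    for ch in keyword:
        if ch.lower() == "s":
            # Assumed misreadings of s in old print: long s, or f.
            parts.append("[sſf]")
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts), re.IGNORECASE)


pattern = long_s_pattern("Westminster")
print(bool(pattern.search("Weftminfter School")))  # True
```

This only works where the search tool accepts a regular expression, or where the OCR text has been copied out and searched locally.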
-
VB commented
While I sympathise with the difficulties there must be with digitising many old publications, there are many times when I can see no reason why the OCR text is as bad as it is. I imagine that, as with some other software, a choice is being made between speed and accuracy; too often (on the Wiltshire Times, which I have used most recently) speed has been chosen, resulting in long sections of gobbledegook. There is so much OCR text that is so bad, I don't have time to correct it - if I want to use it, I have to transcribe it from scratch. I don't do that for BNA to benefit, while paying for it!
-
Anonymous commented
I use Trove from the NLA a lot and their OCR is vastly superior to the OCR results from the BNA.