What is Optical Character Recognition (OCR)?
Posted on 24/06/21
In its simplest terms, Optical Character Recognition (OCR) is the conversion of an image to text.
OCR uses software to identify text characters within a digital image of a physical document and convert them into a format that can be used for data processing. This allows a scanned document – usually treated as an image by computers – to be edited and searched as though it were a standard word-processing document.
OCR solutions use pattern recognition and feature detection to identify text characters within an image and then convert them into character codes that can be read and manipulated by a computer.
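To make the idea of pattern recognition concrete, here is a minimal sketch in Python. It assumes tiny 3x5 bitmaps and a hypothetical `TEMPLATES` table invented for this illustration, and simply picks the stored character whose pixels agree most with the input. Real OCR engines are far more sophisticated, so treat this as a toy rather than a description of how any particular product works.

```python
# Toy 3x5 glyph templates ('#' = ink, '.' = background).
# These bitmaps are made up for illustration only.
TEMPLATES = {
    "I": ["###", ".#.", ".#.", ".#.", "###"],
    "L": ["#..", "#..", "#..", "#..", "###"],
    "T": ["###", ".#.", ".#.", ".#.", ".#."],
}


def similarity(glyph, template):
    """Return the fraction of pixels on which glyph and template agree."""
    pairs = [
        (g, t)
        for glyph_row, template_row in zip(glyph, template)
        for g, t in zip(glyph_row, template_row)
    ]
    return sum(g == t for g, t in pairs) / len(pairs)


def recognise(glyph):
    """Return the template character that best matches the glyph."""
    return max(TEMPLATES, key=lambda ch: similarity(glyph, TEMPLATES[ch]))


# A slightly noisy "T": one pixel differs from the clean template.
noisy_t = ["###", ".#.", ".#.", "##.", ".#."]
print(recognise(noisy_t))  # prints "T"
```

Even with one stray pixel, the noisy glyph still agrees with the "T" template on 14 of 15 pixels, which is why it is recognised correctly.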
Without OCR, scanned documents sit as images whose content cannot be searched or understood unless they are read and indexed by a human. Introducing OCR, however, brings the following benefits:
- Quick search: all instances of a single term can be found within thousands of pages in seconds, rather than the tens or hundreds of hours it would take an individual.
- No need to retype documents: rather than taking the time to manually retype physical documents into an electronic format, OCR can do this for you, avoiding the errors that manual typing introduces. From here, the data can be edited and searched.
- Centralise data & save on physical storage: historical physical records can be scanned and read by OCR so they can be stored electronically in a single location and you can finally get rid of bulky filing cabinets.
- Save money by saving time: ultimately OCR saves money by performing a laborious task in a fraction of the time it would take you or me to do the same. By slashing the time it takes to retype or sift through large physical datasets or scanned documents, you’ll cut the cost of the work hours involved.
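The quick search benefit is easy to picture once pages exist as text. The sketch below assumes the OCR step has already produced a plain-text string per page, held in a hypothetical `pages` dictionary with made-up sample content, and finds every line that mentions a term regardless of capitalisation.

```python
import re


def find_term(pages, term):
    """Return (page_number, line_number, line) for every line containing term."""
    pattern = re.compile(re.escape(term), re.IGNORECASE)
    hits = []
    for page_number, text in pages.items():
        for line_number, line in enumerate(text.splitlines(), start=1):
            if pattern.search(line):
                hits.append((page_number, line_number, line.strip()))
    return hits


# Made-up sample output from an OCR step: page number -> recognised text.
pages = {
    1: "Patient record\nName: Jane Doe",
    2: "Referral letter\nRe: jane doe, appointment 12 May",
    3: "Invoice 0042",
}

hits = find_term(pages, "Jane Doe")
for page_number, line_number, line in hits:
    print(f"page {page_number}, line {line_number}: {line}")
```

The same loop works whether there are three pages or thousands, which is where the time saving over manual reading comes from.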
Why is OCR important in Subject Access Requests?
The subject of a request may have data contained within physical records as well as electronic ones. Historical medical records, for example, are often still stored physically and may not yet be digitised. Even if you know exactly where the information within this data is located, manually photocopying, redacting and then scanning the data can take many hours and is prone to human error.
Not knowing where – or even whether – the subject’s information is contained within physical records is an even greater challenge. Identifying the relevant information within tens, hundreds or thousands of documents can take days or even weeks. Even then, a manual search can miss sensitive information that you are not actively looking for, which could leave you open to risk.
Here are just a few of the numerous uses for OCR:
- Electronically archiving historical records and files
- Helping to convert documents from text to speech for those who are visually impaired
- Text recognition within photos – such as number plates
- Backing up physical records
OCR solves these challenges by providing digital copies of your documents that enable you to instantly search for information relevant to the request. This allows information to be identified and processed in a fraction of the time and with a greater degree of accuracy than traditional methods.
OCR also facilitates the centralisation of data. This could be used to:
- Gather a subject’s data from multiple sources – physical and digital – into a single location so it can be worked on more efficiently and disclosed more easily
- Digitise an entire data estate so that information can be quickly identified during future requests
Smartbox.ai and OCR
Smartbox.ai uses OCR to apply its powerful AI-search and bulk-redaction to scanned documents. By converting a scanned document into machine-readable text, Smartbox is able to read the entire content. This enables you to identify not only information relating to the subject, but also non-subject information that should not be disclosed as part of the request. Once the data is indexed by our AI, bulk redaction can be run across the entire dataset in minutes, rather than the hours or days it would take a human to carry out the same task.
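As a generic illustration of why machine-readable text makes bulk redaction fast, the sketch below masks every occurrence of a list of terms across a set of documents. It is not Smartbox's implementation: the `redact` helper, the mask string and the sample documents are all assumptions made for this example, and it only matches exact terms rather than using any AI-based detection.

```python
import re


def redact(text, terms, mask="[REDACTED]"):
    """Replace every case-insensitive occurrence of each term with the mask."""
    for term in terms:
        text = re.sub(re.escape(term), mask, text, flags=re.IGNORECASE)
    return text


# Made-up OCR output in which "John Smith" is a third party, not the subject.
documents = [
    "Jane Doe attended with her neighbour John Smith.",
    "Letter to Jane Doe, copied to john smith.",
]

redacted = [redact(doc, ["John Smith"]) for doc in documents]
for doc in redacted:
    print(doc)
```

Because the operation is a simple pass over text, it scales to a whole dataset in a way that manual redaction of paper copies cannot.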