OCR Basics – Converting Images to Searchable Text
Optical Character Recognition, commonly known as OCR, is a term you hear frequently in the legal space when it comes to processing, reviewing, and producing documents. While OCR can seem straightforward on the surface, there are still some points to keep in mind to better understand what it is and when it should be used.
As with any tool, OCR should be used strategically and not blindly applied to entire data populations. Applying it to massive datasets will increase both the time and the cost of your litigation, and you can also run into ESI issues from overwriting or omitting native metadata.
So, what exactly is Optical Character Recognition? In simple terms, OCR is the process of converting images and flat, non-searchable documents into searchable text documents. You might use it when you need to pull text from an image such as an infographic, or when you need to convert physically scanned documents into searchable documents so you can run keyword searches.
When it comes to OCR technology, most industry-standard eDiscovery processing tools and review platforms already have an OCR option built in. You can also use third-party OCR tools and then load the resulting data into your preferred platform. It’s important to note, however, that OCR Text is not the same as Extracted Text.
Extracted Text should always be the preferred method, as it comes directly from the native file and is extracted by ESI processing tools. Extracted Text is 100% accurate, while OCR Text is not always 100% accurate. You should strive to use OCR to complement Extracted Text, and work to maintain native files and original metadata whenever possible. Extracted Text should be used on data that is already searchable. This includes data sources such as email, interactive PDFs, web-based content, and other items you can search in native format.
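The preference described above (use Extracted Text where it exists, and reserve OCR for documents that have none) can be sketched as a simple triage step. This is only an illustration under stated assumptions: the `Document` record, the `needs_ocr` and `split_for_ocr` helpers, and the `min_chars` threshold are hypothetical names, not part of any particular eDiscovery platform.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Document:
    """Hypothetical record for one processed file."""
    doc_id: str
    # Text pulled from the native file during processing, if any.
    extracted_text: Optional[str] = None


def needs_ocr(doc: Document, min_chars: int = 1) -> bool:
    """Return True when the document has no usable Extracted Text.

    min_chars is an assumed threshold: anything shorter (after trimming
    whitespace) is treated as "no text" and routed to OCR.
    """
    text = (doc.extracted_text or "").strip()
    return len(text) < min_chars


def split_for_ocr(docs: List[Document]) -> Tuple[List[Document], List[Document]]:
    """Split documents into (keep Extracted Text, send to OCR)."""
    keep: List[Document] = []
    ocr: List[Document] = []
    for doc in docs:
        (ocr if needs_ocr(doc) else keep).append(doc)
    return keep, ocr


if __name__ == "__main__":
    population = [
        Document("EMAIL-001", "Please see the attached contract."),
        Document("SCAN-002", None),      # scanned paper, no text layer
        Document("IMG-003", "   "),      # image with only whitespace extracted
    ]
    keep, ocr = split_for_ocr(population)
    print("Use Extracted Text:", [d.doc_id for d in keep])
    print("Send to OCR:", [d.doc_id for d in ocr])
```

Routing only the text-less documents to OCR keeps the native files and their Extracted Text untouched for everything that is already searchable, which is the strategic use the article recommends.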
When used strategically, OCR has a range of benefits including:
- Converting non-searchable documents into searchable text documents
- Quickly finding relevant information: keywords, dates, phrases, and more
- Converting paper files into a searchable digital repository
- Editing and redacting original documents
- Speeding up document review and making it more efficient
- Archiving and storing information from paper documents, which are vulnerable to being lost or destroyed
- Meeting the text searchability requirements of some courts