OCR Basics – Converting Images to Searchable Text

Optical Character Recognition, commonly known as OCR, is a term you hear frequently in the legal space when it comes to processing, reviewing, and producing documents. While OCR can seem straightforward on the surface, there are a few points to keep in mind to better understand what it is and when it should be used.

As with any tool, OCR should be used strategically and not applied blindly to entire data populations. Applying it to massive datasets increases the time and cost of your litigation, and you can also run into ESI issues from overwriting or omitting native metadata.

So, what exactly is Optical Character Recognition? In basic terms, OCR is the process of converting images and flat, non-searchable documents into searchable text documents. For example, you may need to pull text from an image such as an infographic, or convert physically scanned documents into a searchable document so you can search for keywords.

When it comes to OCR technology, most industry-standard eDiscovery processing tools and review platforms already have an OCR option built in. You can also use third-party OCR tools and then load the resulting data into your preferred platform. It’s important to note, however, that OCR Text is not the same as Extracted Text.

Extracted Text should always be the preferred method, as it comes directly from the native file and its metadata and is extracted by ESI processing tools. Extracted Text mirrors the text stored in the native file, while OCR Text is not always 100% accurate. You should strive to use OCR to complement Extracted Text rather than replace it, and work to maintain native files and original metadata whenever possible. Extracted Text should be used on data that is already searchable. This includes data sources such as email, interactive PDFs, web-based content, and other items you can search in native format.
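The preference described above can be sketched in a few lines of Python. This is a minimal illustration, not how any particular processing tool or review platform works: the `Document` record, its field names, the `needs_ocr` and `searchable_text` helpers, and the 20-character threshold are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Document:
    # Hypothetical record; field names are illustrative, not from any platform.
    doc_id: str
    extracted_text: Optional[str] = None  # text pulled from the native file during processing
    ocr_text: Optional[str] = None        # text produced by an OCR engine, if one was run


# Assumed threshold: extracted text shorter than this is treated as effectively empty.
MIN_CHARS = 20


def needs_ocr(doc: Document, min_chars: int = MIN_CHARS) -> bool:
    """Return True when the document has no usable extracted text."""
    text = (doc.extracted_text or "").strip()
    return len(text) < min_chars


def searchable_text(doc: Document, min_chars: int = MIN_CHARS) -> Tuple[str, str]:
    """Pick the text to index, preferring Extracted Text over OCR Text.

    Returns (source, text), where source is "extracted", "ocr", or "none".
    """
    if not needs_ocr(doc, min_chars):
        return "extracted", (doc.extracted_text or "").strip()
    if doc.ocr_text and doc.ocr_text.strip():
        return "ocr", doc.ocr_text.strip()
    return "none", ""
```

In a workflow built this way, only the documents for which `needs_ocr` returns True would be queued for OCR, which keeps OCR from being applied to data that is already searchable.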

When used strategically, OCR has a range of benefits including:

Converting non-searchable documents into searchable text documents
Quickly finding relevant information: keywords, dates, phrases, and more
Converting paper files into a searchable digital repository
Editing and redacting original documents
Speeding up and increasing the efficiency of document review
Archiving and storing information from paper documents, which are vulnerable to being lost or destroyed
Meeting the text searchability requirements that some courts impose

Josh Markarian:
