Using Data Lakes to Structure Data

Companies are increasingly finding themselves with large quantities of aging and diverse unstructured data. Each time these companies seek to understand why in the heck they are keeping it all, they discover that it serves some nebulous purpose, depending on who they talk to. It’s time to consider how to make real use of all that stuff.

Extract from an article by: Michael Lappin via NUIX Blog


We’ve seen an increase in the popularity of data lakes. According to TechTarget, a data lake is “a storage repository that holds a vast amount of raw data in its native format until it is needed.”

Taking that a step further, a Nuix data lake is a large collection of unstructured (and some structured) data, indexed using Nuix to answer multiple use cases. Building one means fitting it to your specific business vision, understanding the cost-benefit of having it in place, and managing the size and reach of that data lake.

It’s important to note that calling it a ‘Nuix data lake’ doesn’t imply housing the data with Nuix in any fashion. It’s solely a reflection of using Nuix to index and search the data collection within your stated use cases.

Sometimes very large databases full of structured data are used to form data lakes for a variety of data analytics. In this case, we are concerned with centralizing large quantities of emails, archive data, file shares, SharePoint data, and more into one indexed repository.
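To make the "one indexed repository" idea concrete, here is a minimal sketch of centralizing text files from scattered locations into a single searchable inverted index. This is purely illustrative: a platform like Nuix handles vastly more formats, metadata, and scale, and the function names here (`build_index`, `search`) are our own, not any product's API.

```python
import re
from collections import defaultdict

def build_index(paths):
    """Build a simple inverted index: token -> set of file paths.

    Each file's text is tokenized into lowercase alphanumeric terms,
    so a single query can span files from any source directory.
    """
    index = defaultdict(set)
    for path in paths:
        with open(path, "r", encoding="utf-8", errors="ignore") as fh:
            text = fh.read()
        for token in re.findall(r"[a-z0-9]+", text.lower()):
            index[token].add(path)
    return index

def search(index, term):
    """Return the sorted list of files containing the (lowercased) term."""
    return sorted(index.get(term.lower(), set()))
```

Once files from email archives, file shares, and other repositories are fed through one indexing step like this, a query no longer cares which silo a document originally lived in.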

[Image: Elephant in server room. Based on photos by Manuel Geissinger from Pexels and Stickophobe]


After helping several large companies form Nuix data lakes, we’ve noticed the emergence of some trends and requirements that are important to understand if you opt to start down the path to your own data lake.

Fractured Infrastructure

Every customer we’ve worked with to create a data lake has held a large quantity and variety of unstructured data in various, disconnected repositories. It’s a natural evolution as data storage requirements increase and new technologies become available; however, it can lead to increased costs and difficulty accessing the data at any meaningful scale.

Cost-Benefit Balance

Conducting a formal cost-benefit analysis is an absolute must, and it often helps identify multiple use cases where you can build efficiency and get greater value out of your data. The customers we’ve worked with mostly had a well-conceived vision and data federation plan, knowing they’d find a ton of duplication and near-duplication in a centralized repository.

