Discovering Dark Data

by Nick Patience.

21st November 2013

Dark data, otherwise known as unstructured, unmanaged, and uncategorized information, is a major problem for many organisations (and many don’t even know it). Many organisations don’t have the will, systems or processes in place to automatically index and categorize their rapidly growing unstructured dark data, and instead rely on employees to manually manage their own information. This reliance on employees is a no-win situation because employees have neither the incentive nor the time to actively manage their information, so dark data continues to pile up all over the organisation. This accumulation of dark data has several obvious problems associated with it:

Dark data consumes costly storage space and resources – Most medium to large organisations provide terabytes of file share storage space for employees and departments to use. Employees drag and drop all kinds of work-related files (and personal files such as photos, MP3 music files, and personal communications) as well as PSTs and workstation backup files. The vast majority of these files are unmanaged and are never looked at again by the employee or anyone else.
Dark data consumes IT resources – IT personnel are needed to run nightly backups, plan for disaster recovery (DR), and find or restore files that employees cannot locate.
Dark data masks security risks – File shares act as “catch-alls” for employees, and sensitive company information regularly finds its way into these repositories. These file shares are almost never secure, so sensitive information such as personally identifiable information (PII), protected health information (PHI), and intellectual property can be inadvertently leaked.
Dark data raises eDiscovery costs – Organisations find themselves trying to figure out what to do with huge amounts of dark data, particularly when they’re anticipating litigation. Almost everything is discoverable in litigation if it pertains to the case, and reviewing gigabytes or terabytes of dark data can push the cost of eDiscovery up substantially.

Dark Data…it’s a good thing?

Many organisations have begun to look at uncontrolled dark data growth and reason that, as Martha Stewart used to say, “it’s a good thing”. They believe they can run big data analytics on it and uncover insights that will help them market and sell better. This strategy misses the point of information governance, which is defined as:

a cross-departmental framework consisting of the policies, procedures and technologies designed to optimise the value of information while simultaneously managing the risks and controlling the associated costs, which requires the coordination of eDiscovery, records management and privacy/security disciplines.

Data carries risks as well as costs beyond its daily cost of storage. Let’s consider the legal implications of dark data.

Dragging dark data out of the legal shadows

Almost everything is discoverable in litigation if it’s potentially relevant to the case. When tens or hundreds of terabytes of unindexed and unmanaged content are sitting on file shares, any of those files might contain relevant content, so they may all have to be reviewed to determine whether they are relevant in a given legal case. That can add hundreds of thousands or millions of dollars of additional cost to a single eDiscovery request.

For example, according to a CGOC survey in 2012, on average 1% of data is subject to legal hold, 5% is subject to regulatory retention and 25% has some value to the business, leaving 69% with no real legal, regulatory or business reason to be kept. So for a given 20 TB file share, on average 1%, or 200 GB, is potentially relevant to a given eDiscovery request. 200 GB of content can conservatively hold 2 million pages that might have to be reviewed to determine relevancy to the case. These same 2 million pages of content would cost $1.5 million to review using standard manual review processes. The big question that has to be asked is how many of these 2 million pages were considered irrelevant to the business and should not have been kept. Applying the same 69% figure from the survey mentioned above: 2 million pages × 69% = 1.38 million pages that should have been deleted and would never have had to be reviewed for the case.
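The arithmetic in this example can be sketched in a few lines of Python for anyone who wants to plug in their own file share size. This is only an illustration: the function name and parameters are hypothetical, the 10,000 pages per GB and $0.75 per page rates are not quoted anywhere but are simply what the figures above imply (200 GB holding 2 million pages, and 2 million pages costing $1.5 million), and a terabyte is taken as 1,000 GB.

```python
def estimate_review(share_tb, hold_pct=1, no_value_pct=69,
                    pages_per_gb=10_000, cost_per_page=0.75):
    """Rough eDiscovery review estimate for a file share of `share_tb` terabytes.

    hold_pct and no_value_pct are the CGOC 2012 survey percentages cited in
    the article. pages_per_gb and cost_per_page are assumptions derived from
    the article's own numbers, not published rates.
    """
    relevant_gb = share_tb * 1000 * hold_pct / 100   # data potentially relevant
    pages = relevant_gb * pages_per_gb               # pages that may need review
    review_cost = pages * cost_per_page              # manual review cost in dollars
    avoidable_pages = pages * no_value_pct / 100     # pages with no reason to be kept
    return {
        "relevant_gb": relevant_gb,
        "pages": pages,
        "review_cost": review_cost,
        "avoidable_pages": avoidable_pages,
    }


if __name__ == "__main__":
    result = estimate_review(20)  # the 20 TB file share from the example
    print(f"Potentially relevant: {result['relevant_gb']:,.0f} GB")
    print(f"Pages to review:      {result['pages']:,.0f}")
    print(f"Review cost:          ${result['review_cost']:,.0f}")
    print(f"Avoidable pages:      {result['avoidable_pages']:,.0f}")
```

For the 20 TB share this reproduces the figures above: 200 GB, 2 million pages, $1.5 million and 1.38 million avoidable pages. Changing the per-GB or per-page assumptions shifts the totals proportionally, so they are worth replacing with an organisation's own rates.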

Dark data equals higher discovery costs, so make dark data visible so that you can find it, manage it, and act on it.

About the Author:

Nick is Recommind’s director of product marketing and product strategy. He leads a global team tasked with developing marketing strategy across Recommind’s products. Nick joined Recommind from The 451 Group, a technology industry analyst company he co-founded. He started and ran its information management practice and was known as a thought leader in areas such as e-Discovery, enterprise search, text analytics and unstructured data management. Nick has a BA in Philosophy from Middlesex University and an MSc in Computing Science from the University of London.

Source: Recommind