"Lies, damned lies, and statistics" is a phrase describing the persuasive power of numbers, particularly the use of statistics to bolster weak arguments, and the tendency of people to disparage statistics that do not support their positions. It is a phrase sometimes used colloquially to doubt statistics used to prove an opponent's point.
It was also the opening slide of an excellent webinar I recently attended, “Statistical Sampling in E-Discovery”, put on by Catalyst Repository Systems. It reminded me of a statistics class I took in graduate school that was based on Darrell Huff's book, “How to Lie with Statistics”. The heart of the message in both the class and the book was that statistics are important, that proper sampling technique must be learned, and that common misunderstandings about sampling techniques can lead to dramatically inaccurate statements.
Why is this relevant to eDiscovery?
Recent federal cases teach a hard lesson: failure to use sampling during e-discovery review and production is shortsighted at best and perhaps grounds for severe sanctions at worst. When done correctly, sampling protects against inadvertent disclosure, strengthens defensibility and controls the high cost of eDiscovery. When practiced incorrectly or not at all, the consequences can be waiver of privilege, monetary sanctions, or even dismissal.
Sampling is the part of statistical practice concerned with selecting a subset of individual documents from a total population to yield knowledge about the whole population, especially for the purpose of making predictions based on statistical inference. Its three main advantages are that it lowers the cost of eDiscovery, it speeds up data collection, and it provides a defensible way to demonstrate the accuracy of your production.
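To make the idea concrete, here is a minimal sketch in Python of drawing a simple random sample from a document population and using it to estimate a rate for the whole population. Everything in it is an assumption made for illustration: the function names, the population of 100,000 numbered documents, and the stand-in rule that marks every tenth document as responsive. In a real matter the responsive call would come from a human reviewer, not a formula.

```python
import random

def draw_sample(doc_ids, sample_size, seed=42):
    """Return a simple random sample of document IDs, drawn without replacement."""
    # A fixed seed (an illustrative choice) lets the same draw be reproduced and documented.
    rng = random.Random(seed)
    return rng.sample(doc_ids, sample_size)

def estimate_rate(sample_ids, is_responsive):
    """Fraction of the sampled documents that were marked responsive."""
    hits = sum(1 for doc_id in sample_ids if is_responsive(doc_id))
    return hits / len(sample_ids)

# Hypothetical population: 100,000 document IDs, with every 10th one responsive.
population = list(range(100_000))
sample = draw_sample(population, 1_000)
rate = estimate_rate(sample, lambda doc_id: doc_id % 10 == 0)
print(f"Estimated responsive rate: {rate:.1%}")
```

Reviewing 1,000 documents instead of 100,000 gives an estimate that should land close to the true 10% rate, which is the cost saving that sampling is meant to deliver.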
In United States v. O’Keefe, Magistrate Judge Facciola questioned whether lawyers are qualified to search document populations, since doing so involves the sciences of computer technology, statistics and linguistics, all fields that are generally beyond the purview of the average attorney. Judge Grimm, in Victor Stanley v. Creative Pipe, went even further in his statement about the need for sampling: “All keyword searches are not created equal…. The only prudent way to test the reliability of the keyword search is to perform some appropriate sampling…”
The landmark electronic discovery case, Zubulake v. UBS Warburg, used sampling to reduce discovery cost and establish whether additional discovery was necessary. Similarly, courts have accepted statistical sampling in a wide variety of other endeavors as a scientific means of estimating results in a population too large to examine in full, which describes many discovery matters today.
How does Statistical Sampling Work with eDiscovery?
The whole point of using a sample (vs. inspecting the entire population) is to save money while not sacrificing the accuracy of your production. Therefore, the goal of integrating a statistical sampling approach is to have the smallest sample that truly represents the characteristics of the entire population. The larger the sample size, the higher the confidence level and, therefore, the lower the probability of error. Sound confusing? It can be, but in the hands of experts, it can be applied across any document database and defended in court. In reality, there are two major types of sampling that should be considered in eDiscovery, based on where you are in the process: Low Legal Consequence and High Legal Consequence. Examples of each situation follow, and the factors that determine the sample size in each are described after the list:
Low Legal Consequence – you can usually apply a less stringent set of conditions
• Early Case Assessment – when trying to determine the conditions of the case and the potential impact of the litigation
• Testing Search Terms – during the iterative process of testing your search terms to see what documents turn up
• Managing Review Teams – during the review process, when you are doing quality control procedures on the team's performance
High Legal Consequence – you should apply the most stringent conditions when dealing with:
• Privilege Waivers (to make sure that you protect all the documents you can, and have proof that if a document is produced inadvertently, it is truly an accident)
• Defensibility of Production (to protect against sanctions)
• Avoiding Legal Malpractice (a goal we should always shoot for)
Sample size is determined by the following factors, which the person designing the sample gets to set (at least initially):
1. Precision – How close do you want your estimate to be? The more precise you want the estimate, the larger the sample size will need to be. When applied to this situation, precision refers to the percentage of relevant documents that can be tolerably missed.
2. Confidence Level – How confident do you want to be that the estimate is within the precision range described above? The higher the level of confidence, the larger the sample size will need to be. In scientific endeavors, confidence is routinely expressed as either 99% or 95%. For purposes of ensuring that relevant documents have been produced, one would normally want this high level of confidence.
3. Margin of Error – Often expressed as a confidence interval (not to be confused with the confidence level above). It is a range around a measurement that conveys how precise the estimate is. A related question is how many exceptions we expect to occur in the population. In electronic discovery sampling, we anticipate having practically no exceptions (i.e., documents that should have been produced but were not identified). Of course, if our actual results do not confirm this, it is back to the drawing board, but that is what the testing was endeavoring to determine. It is often an iterative process that is applied throughout the course of the production.
4. A one-tail vs. a two-tail test – Are we concerned with both understatements AND overstatements of our estimate? Or is it acceptable for us to estimate merely overstatements or understatements, but not both? In the context of eDiscovery, our concern should be with both overstatements and understatements, so a two-tail test is a more defensible position and should increase the accuracy of your production, satisfying the most discriminating judge.
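The way these factors trade off against each other can be sketched with a standard textbook formula. The Python below is an illustration under stated assumptions, not a protocol from the webinar or from any case: it uses the normal approximation for estimating a proportion, applies a finite population correction, and defaults the expected rate to 50%, which is the conservative worst case. The function name and parameters are hypothetical.

```python
import math
from statistics import NormalDist

def sample_size(population, confidence=0.95, margin=0.02,
                expected_rate=0.5, two_tailed=True):
    """Approximate number of documents to sample (normal approximation)."""
    alpha = 1 - confidence
    # A two-tail test splits the allowed error between both ends of the range.
    tail = alpha / 2 if two_tailed else alpha
    z = NormalDist().inv_cdf(1 - tail)
    # Sample size for an effectively unlimited population.
    n0 = (z ** 2) * expected_rate * (1 - expected_rate) / margin ** 2
    # Finite population correction: smaller collections need fewer documents.
    n = n0 / (1 + (n0 - 1) / population)
    return math.ceil(n)

# Hypothetical collection of 1,000,000 documents.
for confidence in (0.95, 0.99):
    for margin in (0.05, 0.02):
        n = sample_size(1_000_000, confidence=confidence, margin=margin)
        print(f"confidence {confidence:.0%}, margin ±{margin:.0%}: sample {n}")
```

Running it shows the pattern described in the list: raising the confidence level or tightening the margin of error increases the sample, and a two-tail test requires more documents than a one-tail test at the same settings.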
So what does all of this mean?
Fortunately, there are experts in this field who can consult with your team to quickly design a sampling protocol that will work for your litigation, reducing your costs and increasing the accuracy of what you produce. The point of this post is to make you aware of the issues and supply the working attorney with a basic vocabulary and understanding of the math that is required in today’s eDiscovery litigation. Search and analytic consulting, along with sampling techniques, are all required skills; math is here to stay. Those experts are available to guide attorneys through this often unfamiliar territory.

