Open Source TAR . . . FOR FREE!

I originally titled this post “Open Source E-Discovery and Document Review”, but thanks to the advice of a very insightful commenter and e-discovery veteran, Andy Wilson, CEO of Logikcull, I retitled it to more accurately reflect what it is. Check out Andy’s comment following the post!

Note: Code available on GitHub.

Consumers of e-discovery and document management services and products are at an inherent disadvantage. Costs are unpredictable, information is asymmetric, and technology is constantly changing. At the same time, electronic data, both structured and unstructured, is growing at an exponential rate, and so is the need for faster, more efficient, and more intelligent information management and e-discovery solutions. Whether you are General Counsel at a large organization or a solo practitioner, an understanding of the identification, production, storage, and deletion of electronic information, and the ability to analyze that information, are crucial for assessing potential risk or litigating a case, i.e., being a lawyer.

But some lawyers are at a greater disadvantage than others.

For instance, a few weeks ago, a criminal defense attorney spoke at MSU Law and discussed how e-discovery influences his work. To me, his message and its implications for his clients were bleak. E-discovery not only impacts his work, but more importantly, can mean the difference between a “guilty” and “not guilty” verdict for his clients. In cases where electronic information is an issue, this defense attorney does his best to work with his resources, the client, and the court to obtain any relevant evidence that may help his client’s case. Yet, access to a lot of potential electronic evidence, a deleted text message for example, requires the expensive work of a computer forensics expert or some other burdensome method of retrieval.

Even more common and perhaps more taxing to this attorney, a solo practitioner, is receiving hundreds if not thousands of documents related to a case and manually reviewing them. Even something as simple as locating duplicate documents can take hours, if not days. The majority of clients this attorney represents cannot afford technology assisted review (“TAR”), leaving their case to the mercy of human error and time limitations. (For more on e-discovery and criminal law, check out E-Discovery in Criminal Cases: A Need for Specific Rules, by Daniel B. Garrie & Daniel K. Gelb).

I assume that some of the problems encountered by this criminal defense attorney are also common for small-firm and solo practitioners on the civil side, who are often buried in electronic information without the resources to contract a vendor or implement a TAR platform. In sum, the proliferation of electronic information is creating a divide between attorneys: those who have the tools and resources to handle and effectively analyze it, and those who do not.

This past year, I’ve had the opportunity to experiment with a few different document review platforms, including Concordance, Clustify, and Backstop, through my e-discovery class at MSU. Although I do not have much to compare these programs to, I thought they all worked well. Clustify and Backstop were the more intuitive and powerful of the group, but each allowed me to perform, within minutes, tasks that would otherwise have been time-consuming and expensive: de-duping, clustering, searching, predictive coding, etc. That said, I do not know the exact costs of these platforms (one of the joys/detriments of being a student using legal technology), but I would guess that each is prohibitively expensive for the average solo or small-firm attorney.

So is there a way that lawyers without the resources to use TAR software or hire vendors can harness the power of technology to improve their efficiency and results for clients? In the long-run, the price of e-discovery/document management solutions will likely drop and products will improve and become more user-friendly. But, in the meantime, is there a solution?

I believe there is, and it comes in the form of open source software. Open source software is software whose source code is available for anyone to modify or enhance. It is usually free and maintained by large communities of developers who collaborate to improve the product and support its users. There are many examples of open source software out there (Firefox, Linux, OpenOffice, etc.), but the one that I think has real potential for use in e-discovery is the R Project for Statistical Computing.

According to RevolutionAnalytics.com:

R is the world’s most powerful programming language for statistical computing, machine learning and graphics as well as a thriving global community of users, developers and contributors. R includes virtually every data manipulation, statistical model, and chart that the modern data scientist could ever need. As a thriving open-source project, R is supported by a community of more than 2 million users and thousands of developers worldwide. Whether you’re using R to optimize portfolios, analyze genomic sequences, or to predict component failure times, experts in every domain have made resources, applications and code available for free, online.

And it is applicable to e-discovery. How? E-discovery platforms and providers are essentially doing the same thing R is doing: statistics. Whether locating ESI, using predictive coding to search for relevant documents, or doing something as simple as de-duping a corpus, statistics is the driving force behind e-discovery solutions . . . unless your solution is having your firm’s associate(s) read through documents or outsourcing the work, offshore or otherwise.

So let’s look at a very simple example of how R might be used to analyze and manage documents:

Document clustering is a popular method of TAR, especially when dealing with a large corpus. In particular, clustering allows for more efficient review by quickly revealing relationships and connections within a set that may not otherwise be apparent and by identifying duplicate documents within a set. Within the broader context of data science, clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense or another) to each other than to those in other groups (clusters).
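To make the idea concrete, here is a minimal, self-contained sketch in base R. It is my own illustration, not part of the Enron walkthrough below, and the toy matrix and object names are invented: five “documents” are described by two made-up term counts and grouped with hierarchical clustering.

# Toy illustration of clustering in base R (invented data, not the Enron example below)
toy <- matrix(c(1, 1,
                1, 2,
                8, 9,
                9, 8,
                9, 9),
              ncol = 2, byrow = TRUE,
              dimnames = list(paste0("doc", 1:5), c("term_a", "term_b")))

d <- dist(toy)       # pairwise distances between the "documents"
hc <- hclust(d)      # hierarchical clustering
plot(hc)             # dendrogram: doc1/doc2 form one group, doc3-doc5 another
cutree(hc, k = 2)    # explicit cluster labels when cut into two groups

Documents with similar term profiles land in the same cluster, which is exactly what we want to happen with real text.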

An example of R’s cluster analysis on a data set involving mice.

R does this (shown above), and it can be applied to documents. Let’s take a look at a simple example using R (downloadable here) and RStudio (an IDE that makes R more user-friendly – downloadable here). I am going to get straight into the code, but if you are new to R, check out Professor Katz’s “R Boot Camp Slides” – that’s how I learned!

First, I’m going to create a simple test corpus consisting of nine short email excerpts from the original Enron corpus. The samples will pertain to three semantically diverse topics:

Topic 1

(1) “To Mr. Ken Lay, I’m writing to urge you to donate the millions of dollars you made from selling Enron stock before the company declared bankruptcy.”

(2) “while you netted well over a $100 million, many of Enron’s employees were financially devastated when the company declared bankruptcy and their retirement plans were wiped out”

(3) “you sold $101 million worth of Enron stock while aggressively urging the company’s employees to keep buying it”

Topic 2

(4) “This is a reminder of Enron’s Email retention policy. The Email retention policy provides as follows . . . ”

(5) “Furthermore, it is against policy to store Email outside of your Outlook Mailbox and/or your Public Folders. Please do not copy Email onto floppy disks, zip disks, CDs or the network.”

(6) “Based on our receipt of various subpoenas, we will be preserving your past and future email. Please be prudent in the circulation of email relating to your work and activities.”

Topic 3

(7) “We have recognized over $550 million of fair value gains on stocks via our swaps with Raptor.”

(8) “The Raptor accounting treatment looks questionable. a. Enron booked a $500 million gain from equity derivatives from a related party.”

(9) “In the third quarter we have a $250 million problem with Raptor 3 if we don’t “enhance” the capital structure of Raptor 3 to commit more ENE shares.”

Now that we have our documents, we can create our corpus in RStudio.


We will do so by first placing all of our sample documents into a vector, “text”, and then cleaning the text within the corpus (removing stopwords, stemming, etc.). This is accomplished with the following code:

# Load requisite packages
library(tm)
library(ggplot2)
library(lsa)

# Place Enron email snippets into a single vector.
text <- c(
 "To Mr. Ken Lay, I’m writing to urge you to donate the millions of dollars you made from selling Enron stock before the company declared bankruptcy.",
 "while you netted well over a $100 million, many of Enron's employees were financially devastated when the company declared bankruptcy and their retirement plans were wiped out",
 "you sold $101 million worth of Enron stock while aggressively urging the company’s employees to keep buying it",
 "This is a reminder of Enron’s Email retention policy. The Email retention policy provides as follows . . .",
 "Furthermore, it is against policy to store Email outside of your Outlook Mailbox and/or your Public Folders. Please do not copy Email onto floppy disks, zip disks, CDs or the network.",
 "Based on our receipt of various subpoenas, we will be preserving your past and future email. Please be prudent in the circulation of email relating to your work and activities.",
 "We have recognized over $550 million of fair value gains on stocks via our swaps with Raptor.",
 "The Raptor accounting treatment looks questionable. a. Enron booked a $500 million gain from equity derivatives from a related party.",
 "In the third quarter we have a $250 million problem with Raptor 3 if we don’t “enhance” the capital structure of Raptor 3 to commit more ENE shares.")
view <- factor(rep(c("view 1", "view 2", "view 3"), each = 3))
df <- data.frame(text, view, stringsAsFactors = FALSE)

# Prepare mini-Enron corpus
corpus <- Corpus(VectorSource(df$text))
corpus <- tm_map(corpus, tolower)  # note: newer versions of tm require wrapping base functions in content_transformer()
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, function(x) removeWords(x, stopwords("english")))
corpus <- tm_map(corpus, stemDocument, language = "english")  # stemming relies on the SnowballC package
corpus  # check corpus

# Output: mini-Enron corpus with 9 text documents

Now that we have our mini-Enron corpus, we can create a “term-document matrix” that records how often each term occurs in each document. Then we can use a statistical technique known as multidimensional scaling, a means of visualizing the level of similarity of individual documents within a corpus, to view our “clusters”. We will create our graph using ggplot2, a data visualization package for R:

# Compute a term-document matrix that counts the occurrence of each term in each email
td.mat <- as.matrix(TermDocumentMatrix(corpus))
dist.mat <- dist(t(td.mat))  # distance between pairs of documents (documents as rows after transposing)
dist.mat  # check distance matrix

# Multidimensional scaling (MDS): project the document distances onto two dimensions
fit <- cmdscale(dist.mat, eig = TRUE, k = 2)
points <- data.frame(x = fit$points[, 1], y = fit$points[, 2])
ggplot(points, aes(x = x, y = y)) +
    geom_point(aes(color = df$view)) +
    geom_text(aes(y = y - 0.2, label = row.names(df)))

MDS plot of the nine sample documents, colored by topic.

As you can see, the “documents” are positioned on the graph according to semantic similarity. In other words, they are “clustered”. Topic 1 documents are tightly grouped, while Topic 2 and 3 documents are more loosely grouped or clustered. It’s not perfect, but it’s a start.
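The plot gives a visual grouping, but you can also pull out explicit cluster assignments. As a rough sketch of my own (not part of the original workflow), k-means can be run on the two MDS coordinates we just computed; k = 3 is chosen only because we already know the corpus has three topics.

# Assign each document to one of three clusters using the MDS coordinates
# (k = 3 is assumed here because we know there are three topics)
set.seed(42)  # k-means uses random starting points
km <- kmeans(points[, c("x", "y")], centers = 3, nstart = 25)
data.frame(doc = row.names(df), topic = df$view, cluster = km$cluster)

On a real corpus you would have to choose k, or use a method that estimates it, but the idea is the same.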

Another R package, lsa, might improve our accuracy. LSA, which stands for latent semantic analysis, is an algorithm that approximates the meaning of texts, thereby exposing their semantic structure to computation. LSA is commonly used by e-discovery platforms and is an effective way of grouping documents. It should produce better results here because we can “weight” the terms and compute distances more meaningfully. Again, we can display the results graphically.

# MDS with LSA
td.mat.lsa <- lw_bintf(td.mat) * gw_idf(td.mat)  # local (binary tf) and global (idf) weighting
lsaSpace <- lsa(td.mat.lsa)  # create LSA space
dist.mat.lsa <- dist(t(as.textmatrix(lsaSpace)))  # compute distance matrix in the LSA space
dist.mat.lsa  # check distance matrix

# MDS
fit <- cmdscale(dist.mat.lsa, eig = TRUE, k = 2)
points <- data.frame(x = fit$points[, 1], y = fit$points[, 2])
ggplot(points, aes(x = x, y = y)) +
    geom_point(aes(color = df$view)) +
    geom_text(aes(y = y - 0.2, label = row.names(df)))

MDS plot of the nine sample documents after LSA weighting, colored by topic.

This outcome is dramatically different from the first analysis. As the graph shows, the documents in Topic 3 were pulled down the y-axis quite substantially, while the Topic 2 documents spread farther apart. Topic 3 is not really clustered, although it is clearly distinct from the other topics. I suspect that my mini-Enron corpus does not have the kind of variety LSA really needs to show its power and produce robust clusters.
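To see what the LSA step actually produced, the lsa object itself can be inspected. This is a small exploratory sketch of my own, using the components the lsa package returns (tk, dk, sk) and its cosine() helper; none of it is required for the plots above.

# Peek inside the LSA space (exploratory only)
lsaSpace$sk  # singular values of the reduced semantic space
head(sort(abs(lsaSpace$tk[, 1]), decreasing = TRUE), 10)  # terms loading most heavily on dimension 1
round(cosine(as.textmatrix(lsaSpace)), 2)  # document-by-document cosine similarity in LSA space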

If the results of this mini-Enron test do not persuade you that R is a viable alternative to e-discovery software or vendors, that is my fault. I have only been using R for a couple of months and am still learning how to use it effectively. But if my example is not persuasive, there is a huge community of developers out there working with R on text-mining and document-analysis solutions who can sell it better than I can. For example:

  • Here is an example of a person using R to mine the complete works of William Shakespeare;
  • Here is an example and video using R to predict the author of a document based on machine learning from previous documents; and
  • Here is a much more persuasive example of clustering using R with, again, the complete works of William Shakespeare.

These are just a few of the hundreds, if not thousands, of analyses that can be performed on a document set in R and, because it is open source, new techniques are constantly in development.
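And to circle back to the de-duping problem mentioned earlier, here is one more hedged sketch of my own: flagging near-duplicate documents by the cosine similarity of their raw term vectors, reusing the td.mat matrix and the lsa package’s cosine() function. The 0.8 threshold is arbitrary, and this tiny corpus probably contains no near-duplicates, but the pattern scales to larger document sets.

# Flag candidate near-duplicate pairs by cosine similarity of term vectors
sim <- cosine(td.mat)  # columns of td.mat are documents, so this is a 9 x 9 similarity matrix
pairs <- which(sim > 0.8 & upper.tri(sim), arr.ind = TRUE)  # each pair counted once
data.frame(doc_a = pairs[, "row"], doc_b = pairs[, "col"],
           similarity = round(sim[pairs], 2))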

At the end of the day, R comes with a serious learning curve and it simply is not as effective (at least in my hands) as a traditional TAR platform. But, as it continues to develop and lawyers and programmers alike realize its potential in the e-discovery and document management space, I bet that it will become an effective tool for lawyers, even outside of the context of information management. Oh, and one more thing:

It’s free.

So for attorneys out there with limited financial resources, R may be an excellent tool for overcoming the divide that electronic information is creating. I look forward to continuing to learn R and exploring its potential to make legal processes more affordable, more efficient, and more intelligent.


14 thoughts on “Open Source TAR . . . FOR FREE!”

  1. Pingback: Three Benefits of Open Source Legal Docs | Patrick Ellis

  2. Andy Wilson (logikcull.com)

    “So is there a way that lawyers without the resources to use TAR software or hire vendors can harness the power of technology to improve their efficiency and results for clients? In the long-run, the price of e-discovery/document management solutions will likely drop and products will improve and become more user-friendly. But, in the meantime, is there a solution?”

    Yes. And the answer is not free. Pat, this is a great analysis of using something like R to better understand text in an eDiscovery case, but using R for eDiscovery is not something I’d advise. Let me explain…

    (DISCLAIMER: Why bother listening to me about this? Since 2002 I’ve done nothing but eDiscovery, real, complex eDiscovery. And from 2008 till today I’ve been intimately involved with building a next-generation eDiscovery SaaS app: logikcull.com, so I have first-hand knowledge of the perils of building such things. It is neither easy nor free to do.)

    True, the cost of eDiscovery should come down. The old school technology of today requires expensive software, hardware, and support installations that can easily reach into the hundreds of thousands of dollars (and sometimes millions). Not for the faint of heart. But this too is starting to change with the rise of SaaS (Software as a Service) where no software or hardware is required. Because there is no capital expense needed, SaaS companies can deliver powerful software over the internet on a subscription or pay-as-you-go basis. Perfect for eDiscovery and orders of magnitude cheaper, but better, than today’s tech.

    Back to free eDiscovery. eDiscovery is more than free text analytics using R. Much more. Although very attractive, free is not the solution to eDiscovery and I doubt it ever will be. Or at least it won’t be until we reach a Star Trek-like state of government when money no longer matters (so in 100 years?).

    Here are 3 reasons free eDiscovery is generally not a good solution:

    1. Lack of audit trail, which could wreak havoc on your defensible process of DIYing it with something like R. Also, take into consideration the amount of money at risk in the matter. Choosing free when hundreds of thousands or millions are at stake is incredibly risky. And what do you do when your eDiscovery process is called into question? Can you easily produce reports of all actions taken and justify them? Your client’s metadata, is that properly preserved throughout the process?

    2. Lack of security and redundancy. If you do your own eDiscovery on your own infrastructure, is your client’s sensitive data encrypted at rest? Are you monitoring for intrusions? Do you have a detailed access log to see who did what, when, and from where? Are you making daily and weekly backups in case of disaster? Are those backups also encrypted? Where are your encryption keys stored? Do you hire ethical hackers to PEN test your system on a yearly basis and expose potential security holes? Your servers, are they in a closet in your office or are they behind a secure datacenter guarded by armed guards? Etc. etc.

    3. Lack of support when shit hits the fan. This is probably the biggest reason doing your own eDiscovery on your own infrastructure using a free software approach is ill advised. Who helps you when your free software breaks? Who helps you when the developer that wrote the free software takes a job at Google and no longer supports updating the software? (I realize R is different here, but you see my point). Sure, there could be a company out there one day that provides paid support for your eDiscovery software, but even that isn’t free or cheap. Just ask any JBOSS or Red Hat customer (Red Hat, a publicly traded [$10B market cap] company that sells support for free, Linux-based open source software (OSS) and acquired JBoss, also a free OSS company, for $300M many years ago).

    We’re just scratching the surface here. eDiscovery is not text analytics using R or Clustify or some other text-analytics tool. On the data side…it’s data collection, data processing, file type verification, password protection handling, text extraction, OCR (using a GOOD OCR engine), virus scanning, deep embedded file extraction, attachment extraction, metadata extraction and analysis, text and metadata cleanup so it’s usable, formatting complex documents (like spreadsheets), rendering documents to PDF, accessing uncommon file types (i.e. Lotus Notes, Handysoft, etc.), rendering documents to TIFF (gasp!), and building a searchable index that can scale to millions of records, etc. On the review side…it’s presenting the data in easy to search/sort/tag view. It’s making tagging documents and their families easy, fast, and searchable. It’s making review from a mobile device possible without a bunch of crap in your UI. It’s making it easy to share documents with your teammates and your client. It’s so much more than analyzing text using OSS. And then you have all the insane requirements for producing your eDiscovery results (i.e. single-page TIFFs black and white group IV with natives named by original file name produced in a Concordance load file with 65 unique metadata fields). It adds up and doing it for free is near impossible.

    But, eDiscovery should most definitely be more affordable to the eDiscovery have-nots. Anyone should be able to do eDiscovery, from anywhere (shameless tagline plug).

    1. Pat Ellis (post author)

      Andy, thanks a lot for the comment and insight.

      There is certainly a lot of information that I had not considered and I think I may have ineptly titled this post when I associated it with eDiscovery, and not just document review. I still believe that R can effectively be used for text analysis and document management (in the hands of a skilled user), but certainly not full-blown eDiscovery.

      Regardless of open source software’s potential in the space, I believe that the more attorneys, developers, and law students that think about these issues, the better. Hopefully the proliferation and growth of data will not bury those who do not have the resources to spend tens or hundreds of thousands of dollars on eDiscovery solutions before a more affordable product or service becomes available.

      1. Andy Wilson (logikcull.com)

        More attorneys, paralegals, etc. learning more efficient ways to handle discovery? Totally agree! Here’s an example of why your post can be super helpful in achieving this:

        Let’s say a paralegal (or litigation support project manager) has a Concordance database of 50,000 records that she recently received from a prior law firm (that no longer exists). There’s a good amount of metadata in the Concordance DCB and 90% of the documents have text (also referred to as the OCR fields in most Concordance databases). But this paralegal has NO CLUE what this data really has. Sure, she could run some keyword searches in Concordance, but why in the hell would she do that when….she could use your R post!?

        So, the paralegal reads your post. Extracts all the text into document level text files with a unique file numbering/naming convention, likely the bates_number (i.e. DOC00000001.txt) for easy cross reference. Then loads the text files into R Studio. Runs your step by step analysis, and WHAMMY! (sorry, I just saw Anchorman 2, couldn’t resist). Now she’s got concepts, clusters, keywords, themes etc. She knows sooooo much more about her data set. Best of all, she can now make her crummy Concordance database an analytical powerhouse by exporting the results as an overlay file and then overlaying the categories, concepts, keywords etc. into her Concordance database.

        The paralegal is now the super star of her office. Everyone loves her. She gets a raise. And then…she realizes…”Hey, maybe I should start my own company doing this stuff for others in need?” Five years later she’s an Inc 500 company because of your article =)

        Keep up the posts, Pat. Good stuff!

  3. Bill Dimm

    Hi Pat,

    Thank you for mentioning Clustify. Regarding pricing, Clustify has a pay-per-use option where the client pays based on the volume of text analyzed, so small clients with small cases pay a small fee. It works out to about a third of a penny per full page of text (pricing may change when version 4 comes out) with a minimum of $45 in any month where new calculations are run (no charge for months where there is no new data, and no charge for running additional calculations on data you’ve already paid for). Compared to the cost of having a human reviewer read a page of text, it is really quite negligible. There are certainly tools that have a high minimum price because they don’t want to be bothered with small clients, but Clustify isn’t one of them. Of course, the price of the software isn’t the only cost — you do have to keep in mind the cost of the time spent learning the tool and the cost of any other tools that you may need to use with it (e.g. you will need to OCR any paper documents, and may need other tools to extract text from file formats not handled by Clustify) when deciding whether or not it makes sense for you.

    1. Pat Ellis (post author)

      Bill,

      First, I apologize that your comment had to go through moderation. I have fixed that and comments can now flow freely! Thank you for your comment and insight. Like I mentioned in the post, one of the benefits (and potential dangers) of being a law student is having access to all of this great technology and not having to pay for it! I am thinking more about Westlaw and Lexis, but I certainly enjoyed using Clustify (for free) in class. It was intuitive and clearly powerful – I can’t imagine doing some of those tasks manually. Maybe all of this technology is spoiling the lawyers of the future, but it certainly is better for clients! It sounds like Clustify is quite affordable, much more than I would have guessed. From my (limited) research, I get the sense that not all providers are quite so transparent, but I am sure that is due to the often complex nature of the tasks that are involved in this line of work.

      To your final point, I am confident that the “learning curve” associated with platforms like Clustify will decrease as attorneys and students become more accustomed to this kind of technology. Like I mentioned earlier, Clustify was very intuitive. It seems like UX will be one of the most important differentiators between products in the future. Thanks again Bill.

  4. Alex

    Hi,

    I’m getting errors on replicating your code within RStudio upon entering ggplot(points, aes(x = x, y = y)) + geom_point(data = points, aes(x = x, y = y, color = df$view)) + geom_text(data = points, aes(x = x, y = y - 0.2, label = row.names(df))) + theme_bw()

    Getting Error: ggplot2 doesn’t know how to deal with data of class dist.

    Not used R since university and didn’t use ggplot2 at uni, so not entirely sure how to fix this?

    Thanks

      1. Alex

        Hi Pat,

        The error comes up as “ggplot2 doesn’t know how to deal with data of class dist.”

        Have you missed steps out in your code dumps above in using ggplot? I followed your code to the dist.mat matrix, getting the same elements outputted as above but after loading the ggplot2 library can’t get your next line to work. Should I have declared ‘points’ elsewhere at an intermediate step that you missed out?

        Thanks,
        Alex

