Three Benefits of Open Source Legal Docs

In my last post, I discussed how open-source software could potentially be used as an alternative to expensive TAR platforms or, even worse, teams of associate attorneys to review and organize large bodies of documents. Open source principles are not, however, limited to software or computers. In fact, “the concept of free sharing of technological information existed long before computers. For example, cooking recipes have been shared and remixed since the beginning of human culture.” So it should come as no surprise that organizations are applying open source principles to disciplines beyond software and computers.

The legal industry, despite its often cited resistance to change, is no exception.


Take Fenwick & West, the Am Law 200, Mountain View-based law firm that specializes in providing legal services to high-tech clients in emerging industries. Fenwick & West provides free early-stage funding documents online, posted to GitHub, a web-based platform and storage space for collaborative works (generally programming projects). The firm originally released its Series Seed documents in 2010, and they have since been used by several companies and startups. The documents can be accessed here.

Fenwick & West aren’t alone in this endeavor. Wilson Sonsini Goodrich & Rosati, a Palo Alto-based firm, also provides open source convertible equity documents, which can be downloaded and used for free (thanks to Joel Jacobson for the tip). So what does the emergence and provision of open source documents mean for the delivery of legal services?

In my mind, law firms, no matter what size, that take the initiative to create and provide open source documents reap three important and distinct benefits:

(1) Expertise

Open source is as much a culture as it is an ideal. By posting open source documents on GitHub, which is one of the not-so-secret clubhouses of software developers everywhere, a law firm is basically saying, “I get it.” In other words, a law firm like Fenwick & West is directly participating in the communities that its clients and potential clients are a part of. Moreover, if a client is seeking assistance with, say, an open source warranty, a licensing agreement, or the acquisition of an open source organization, the firm that actually uses open source in its practice is likely a more attractive choice for representation than the firm that advertises open source expertise but does not practice by it.

(2) Client Empowerment

In addition to conveying expertise, firms that provide free and open source documentation are empowering the client. Instead of paying an associate by the hour to draft up Series Seed documents that the firm already has in template form, a firm can simply make these documents available for free. Thus, the client or potential client can do this work on their own. I think the idea here is that if a potential client has questions about the document or needs counseling with more complex matters, who are they most likely to turn to? The firm that provided the *free* document, or the firm that did not? This not only empowers clients, but is also a step toward solving clients’ often-cited “more-for-less” challenge. This is especially true for nascent startups, whose legal needs may not seem important enough to justify hiring a law firm for work that can otherwise be found online.

(3) Collaboration

The final major benefit I see in open source legal documents is collaboration. This one is probably a long-term benefit. In the future, companies (I’m thinking GCs) will expect legal service providers to work more collaboratively on issues, especially when it comes to standardized documentation. Open source platforms or projects will provide an excellent mode for collaboration between firms and providers, which, again, gets at the more-for-less challenge. This is where firms that initiate open source projects and document sets would see the greatest reward, because I imagine that the initiating firm would be the quarterback of the project, or at least get its name at the top of the list of contributors. Further, this should foster higher-quality legal documents, as two (or hundreds of) contributors are better than one!

These ideas are a bit rough and certainly need refinement, but I think the point remains: Firms that provide open source documents to clients and potential clients are not only conveying expertise in the space, but also getting at the heart of most clients’ biggest problem: How can we get more-for-less?

Open Source TAR . . . FOR FREE!

I originally titled this post “Open Source E-Discovery and Document Review”, but thanks to the advice of a very insightful commenter and e-discovery veteran, Andy Wilson, CEO of Logikcull, I decided to give this post its current, more accurate title. Check out Andy’s comment following the post!

Note: Code available on GitHub.

Consumers of e-discovery and document management services and products are at an inherent disadvantage. Costs are unpredictable, information is asymmetric, and technology is constantly changing. At the same time, electronic data, both structured and unstructured, is growing at an exponential rate and so is the need for faster, more efficient, and intelligent information management and e-discovery solutions. Whether you are General Counsel with a large organization or a solo practitioner, an understanding of the identification, production, storage, and deletion of electronic information and the ability to analyze that information is crucial for assessing potential risk or litigating a case moving forward, i.e. being a lawyer.

But some lawyers are at a greater disadvantage than others.

For instance, a few weeks ago, a criminal defense attorney spoke at MSU Law and discussed how e-discovery influences his work. To me, his message and its implications for his clients were bleak. E-discovery not only impacts his work, but more importantly, can mean the difference between a “guilty” and “not guilty” verdict for his clients. In cases where electronic information is an issue, this defense attorney does his best to work with his resources, the client, and the court to obtain any relevant evidence that may help his client’s case. Yet, access to a lot of potential electronic evidence, a deleted text message for example, requires the expensive work of a computer forensics expert or some other burdensome method of retrieval.

Even more common and perhaps more taxing to this attorney, a solo practitioner, is receiving hundreds if not thousands of documents related to a case and manually reviewing them. Even something as simple as locating duplicate documents can take hours, if not days. The majority of clients this attorney represents cannot afford technology assisted review (“TAR”), leaving their case to the mercy of human error and time limitations. (For more on e-discovery and criminal law, check out E-Discovery in Criminal Cases: A Need for Specific Rules, by Daniel B. Garrie & Daniel K. Gelb).

I assume that some of the problems encountered by this criminal defense attorney are also common for small-firm attorneys and solo practitioners on the civil side, who are often buried in electronic information without the resources to contract a vendor or implement a TAR platform. In sum, the proliferation of electronic information is creating a divide for attorneys: those who have the tools and resources to handle and effectively analyze it, and those who do not.

This past year, I’ve had the opportunity to experiment with a few different document review platforms, including Concordance, Clustify, and Backstop, through my e-discovery class at MSU. Although I do not have much to compare these programs to, I thought they all worked well. Clustify and Backstop were much more intuitive and powerful, but each allowed me to perform, within minutes, tasks that would otherwise have been time-consuming and expensive: de-duping, clustering, searches, predictive coding, etc. That said, I do not know the exact costs of these platforms (one of the joys/detriments of being a student using legal technology), but I would guess that each is prohibitively expensive for the average solo or small-firm attorney.

So is there a way that lawyers without the resources to use TAR software or hire vendors can harness the power of technology to improve their efficiency and results for clients? In the long run, the price of e-discovery/document management solutions will likely drop, and products will improve and become more user-friendly. But, in the meantime, is there a solution?

I believe there is and it is in the form of open source software. Open source software is software whose source code is available for modification or enhancement by anyone. It is usually free and worked on by large communities of developers working collaboratively to constantly improve the product and offer support to the software’s users. There are a lot of examples of open source software out there (Firefox, Linux, OpenOffice, etc.), but the one that I think has a lot of potential for use in e-discovery is the R Project for Statistical Computing.

According to

R is the world’s most powerful programming language for statistical computing, machine learning and graphics as well as a thriving global community of users, developers and contributors. R includes virtually every data manipulation, statistical model, and chart that the modern data scientist could ever need. As a thriving open-source project, R is supported by a community of more than 2 million users and thousands of developers worldwide. Whether you’re using R to optimize portfolios, analyze genomic sequences, or to predict component failure times, experts in every domain have made resources, applications and code available for free, online.

And it is applicable to e-discovery. How? E-discovery platforms and providers are essentially doing the same thing R is doing: statistics. Whether locating ESI, using predictive coding to search for relevant documents, or doing something as simple as de-duping a corpus, statistics is the driving force behind e-discovery solutions . . . unless your solution is having your firm’s associate(s) read through documents or outsourcing the work, offshore or otherwise.

So let’s look at a very simple example of how R might be used to analyze and manage documents:

Document clustering is a popular method of TAR, especially when dealing with a large corpus. In particular, clustering allows for more efficient review by quickly revealing relationships and connections within a set that may not otherwise be apparent and by identifying duplicate documents within a set. Within the broader context of data science, clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense or another) to each other than to those in other groups (clusters).

An example of R’s clustering analysis on a data set involving mice.

R does this (shown above) and it can be applied to documents. Let’s take a look at a simple example using R (downloadable here) and RStudio (an IDE interface to make R more user-friendly – downloadable here). I am going to just get straight into the code, but if you are new to R, check out Professor Katz’s “R Boot Camp Slides” – that’s how I learned!

First, I’m going to create a simple test corpus consisting of nine short email excerpts from the original Enron corpus. The samples will pertain to three semantically diverse topics:

Topic 1

(1) “To Mr. Ken Lay, I’m writing to urge you to donate the millions of dollars you made from selling Enron stock before the company declared bankruptcy.”

(2) “while you netted well over a $100 million, many of Enron’s employees were financially devastated when the company declared bankruptcy and their retirement plans were wiped out”

(3) “you sold $101 million worth of Enron stock while aggressively urging the company’s employees to keep buying it”

Topic 2

(4) “This is a reminder of Enron’s Email retention policy. The Email retention policy provides as follows . . . ”

(5) “Furthermore, it is against policy to store Email outside of your Outlook Mailbox and/or your Public Folders. Please do not copy Email onto floppy disks, zip disks, CDs or the network.”

(6) “Based on our receipt of various subpoenas, we will be preserving your past and future email. Please be prudent in the circulation of email relating to your work and activities.”

Topic 3

(7) “We have recognized over $550 million of fair value gains on stocks via our swaps with Raptor.”

(8) “The Raptor accounting treatment looks questionable. a. Enron booked a $500 million gain from equity derivatives from a related party.”

(9) “In the third quarter we have a $250 million problem with Raptor 3 if we don’t “enhance” the capital structure of Raptor 3 to commit more ENE shares.”

Now that we have our documents, we can create our corpus in RStudio.

We will do so by first placing all of our sample documents into a vector, “text”, and then cleaning the text within the corpus (removing stopwords, stemming, etc.). This is accomplished with the following code:

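# Load requisite packages
library(tm)       # corpus handling and term-document matrices
library(ggplot2)  # plotting
library(lsa)      # latent semantic analysis (used further below)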

# Place Enron email snippets into a single vector.
text <- c(
 "To Mr. Ken Lay, I’m writing to urge you to donate the millions of dollars you made from selling Enron stock before the company declared bankruptcy.",
 "while you netted well over a $100 million, many of Enron's employees were financially devastated when the company declared bankruptcy and their retirement plans were wiped out",
 "you sold $101 million worth of Enron stock while aggressively urging the company’s employees to keep buying it",
 "This is a reminder of Enron’s Email retention policy. The Email retention policy provides as follows . . .",
 "Furthermore, it is against policy to store Email outside of your Outlook Mailbox and/or your Public Folders. Please do not copy Email onto floppy disks, zip disks, CDs or the network.",
 "Based on our receipt of various subpoenas, we will be preserving your past and future email. Please be prudent in the circulation of email relating to your work and activities.",
 "We have recognized over $550 million of fair value gains on stocks via our swaps with Raptor.",
 "The Raptor accounting treatment looks questionable. a. Enron booked a $500 million gain from equity derivatives from a related party.",
 "In the third quarter we have a $250 million problem with Raptor 3 if we don’t “enhance” the capital structure of Raptor 3 to commit more ENE shares.")
view <- factor(rep(c("view 1", "view 2", "view 3"), each = 3))
df <- data.frame(text, view, stringsAsFactors = FALSE)

# Prepare mini-Enron corpus
corpus <- Corpus(VectorSource(df$text))
corpus <- tm_map(corpus, content_transformer(tolower))  # wrap base functions so the result stays a corpus
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus, stemDocument, language = "english")  # stemming relies on the SnowballC package
corpus # check corpus

# Mini-Enron corpus with 9 text documents

Now that we have our mini-Enron corpus, we can create a “term-document matrix” that contains and measures the occurrence of certain terms within each document. Then we can use a statistical technique known as multidimensional scaling, a means of visualizing the level of similarity of individual documents within a corpus, to view our “clusters”. We will create our graph using ggplot2, a data visualization package for R:

# Compute a term-document matrix that contains the occurrence of terms in each email
td.mat <- as.matrix(TermDocumentMatrix(corpus))
dist.mat <- dist(t(td.mat))  # Euclidean distance between pairs of documents
dist.mat  # check distance matrix

# Use multidimensional scaling (MDS) to project the document distances onto two dimensions
fit <- cmdscale(dist.mat, eig = TRUE, k = 2)
points <- data.frame(x = fit$points[, 1], y = fit$points[, 2],
    view = df$view, label = row.names(df))
ggplot(points, aes(x = x, y = y)) +
    geom_point(aes(color = view)) +
    geom_text(aes(y = y - 0.2, label = label))


As you can see, the “documents” are positioned on the graph according to semantic similarity. In other words, they are “clustered”. Topic 1 documents are tightly grouped, while Topic 2 and 3 documents are more loosely grouped or clustered. It’s not perfect, but it’s a start.
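The plot only shows the documents; it does not assign them to groups. A minimal sketch of how explicit cluster labels could be obtained is below. It uses only base R, a handful of made-up, pre-cleaned toy documents (not the Enron snippets), and average-linkage hierarchical clustering, which is my choice for illustration rather than what any particular TAR platform does.

```r
# Hypothetical toy documents, already lower-cased and stripped of punctuation
docs <- c(
  "enron stock bankruptcy employees",
  "enron stock employees bankruptcy retirement",
  "email retention policy",
  "email policy outlook mailbox",
  "raptor swaps gains",
  "raptor accounting gains"
)

# Build a term-document matrix by hand: rows are terms, columns are documents
tokens <- strsplit(docs, " ", fixed = TRUE)
terms  <- sort(unique(unlist(tokens)))
tdm <- vapply(
  tokens,
  function(tok) as.integer(table(factor(tok, levels = terms))),
  integer(length(terms))
)
dimnames(tdm) <- list(terms, paste0("doc", seq_along(docs)))

# Euclidean distance between documents, then hierarchical clustering
d  <- dist(t(tdm))
hc <- hclust(d, method = "average")

# Cut the tree into three groups to get one cluster label per document
clusters <- cutree(hc, k = 3)
print(clusters)
```

Here the number of clusters (k = 3) is supplied by hand because I know the toy set has three topics; on a real corpus, choosing k is part of the problem.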

Another R package, called lsa, might improve our accuracy. LSA, which stands for latent semantic analysis, is an algorithm applied to approximate the meaning of texts, thereby exposing semantic structure to computation. LSA is commonly used by e-discovery platforms and is an effective way of grouping documents. LSA should produce superior results because we can “weight” the text and compute distance more effectively. Again, we can display these results graphically.

# MDS with LSA
td.mat.lsa <- lw_bintf(td.mat) * gw_idf(td.mat)  # weighting
lsaSpace <- lsa(td.mat.lsa)  # create LSA space
dist.mat.lsa <- dist(t(as.textmatrix(lsaSpace)))  # compute distance matrix
dist.mat.lsa  # check distance matrix

fit <- cmdscale(dist.mat.lsa, eig = TRUE, k = 2)
points <- data.frame(x = fit$points[, 1], y = fit$points[, 2],
    view = df$view, label = row.names(df))
ggplot(points, aes(x = x, y = y)) +
    geom_point(aes(color = view)) +
    geom_text(aes(y = y - 0.2, label = label))


This outcome is dramatically different from the first analysis. As the graph shows, the documents in Topic 3 were pulled down the y-axis quite substantially, while Topic 2 expanded in distance. Topic 3 is not really clustered, although it is obviously distinct from the other Topics. I suspect that my mini-Enron corpus did not have the kind of variety that is really required for the LSA to truly show its computing power and produce robust clusters.
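For readers wondering what LSA is doing under the hood, here is a rough base-R sketch of the general idea: weight the term counts, take a singular value decomposition, and keep only the strongest few dimensions. The tiny matrix, the IDF-style weight, and the choice of k = 2 are all assumptions made for illustration; this is not a reimplementation of the lsa package.

```r
# Hypothetical toy term-document matrix (rows = terms, columns = documents)
tdm <- matrix(
  c(2, 1, 0, 0,
    1, 2, 0, 0,
    0, 0, 1, 2,
    0, 0, 2, 1,
    1, 1, 1, 1),
  nrow = 5, byrow = TRUE,
  dimnames = list(c("stock", "bankruptcy", "email", "policy", "enron"),
                  paste0("doc", 1:4))
)

# IDF-style weighting: terms that appear in every document count for less
n_docs   <- ncol(tdm)
doc_freq <- rowSums(tdm > 0)
idf      <- log2(n_docs / doc_freq) + 1
weighted <- tdm * idf  # each row i is multiplied by idf[i]

# Truncated SVD: keep only the k strongest latent dimensions
k <- 2
s <- svd(weighted)
approx_k <- s$u[, 1:k, drop = FALSE] %*%
  diag(s$d[1:k], nrow = k) %*%
  t(s$v[, 1:k, drop = FALSE])
dimnames(approx_k) <- dimnames(weighted)

# Distances between documents in the reduced space
d_lsa <- as.matrix(dist(t(approx_k)))
print(round(d_lsa, 3))
```

In this toy case the two "stock/bankruptcy" documents collapse onto each other in the reduced space, as do the two "email/policy" documents, which is the kind of grouping LSA is meant to surface.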

If the results of this mini-Enron test do not persuade you that R is a viable alternative to e-discovery software or vendors, that is my fault. I have only been using R for a couple of months now and am still learning how to use it effectively. But, if my example is not persuasive, there is a huge community of developers out there working with R to develop text-mining and document analysis solutions who can sell it better than I can. For example:

  • Here is an example of a person using R to mine the complete works of William Shakespeare;
  • Here is an example and video using R to predict the author of a document based on machine learning from previous documents; and
  • Here is a much more persuasive example of clustering using R with, again, the complete works of William Shakespeare.

These are just a few of the hundreds if not thousands of functions that can be performed on a document set in R and, because it’s open source, there are constantly new techniques in development.
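Even the de-duping task mentioned earlier can be sketched in a few lines of base R. The snippet below is a minimal example on made-up documents, and it only catches duplicates that are identical after normalizing case, punctuation, and whitespace; near-duplicate detection would need a similarity measure instead.

```r
# Hypothetical documents; three are the same text with cosmetic differences
docs <- c(
  "Please retain all email.",
  "please retain all email",
  "Quarterly results attached.",
  "PLEASE  RETAIN ALL EMAIL!"
)

# Normalize so that cosmetic differences do not hide a duplicate
normalize <- function(x) {
  x <- tolower(x)
  x <- gsub("[[:punct:]]", "", x)
  x <- gsub("[[:space:]]+", " ", x)
  trimws(x)
}

keys   <- normalize(docs)
is_dup <- duplicated(keys)  # TRUE for every repeat after the first occurrence

unique_docs <- docs[!is_dup]
print(unique_docs)
```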

At the end of the day, R comes with a serious learning curve and it simply is not as effective (at least in my hands) as a traditional TAR platform. But, as it continues to develop and lawyers and programmers alike realize its potential in the e-discovery and document management space, I bet that it will become an effective tool for lawyers, even outside of the context of information management. Oh, and one more thing:

It’s free.

So for attorneys out there with limited financial resources, R may be an excellent tool for overcoming the divide that electronic information is creating. I look forward to continuing to learn R and exploring its potential to make legal processes more affordable, more efficient, and more intelligent.

Notes on LPO

At our ReInvent Law Lab meeting tonight, we had Jonathan Goldstein, formerly of Pangea3, call in and discuss the ethics of legal process outsourcing. Some takeaways that I found particularly interesting:

An LPO Framework

In his presentation, Mr. Goldstein discussed the framework for developing and implementing an LPO provider, which can be summed up in three ideals:

  1. Value: How will LPO add value to the client’s organization (efficiency, savings, innovation, etc.)?
  2. Peace of Mind: How is this safe, compliant with regulatory requirements, privacy concerns, etc.?
  3. Quality: How can we deliver and ensure a quality product or service?

Without these three drivers, your LPO operation is worthless.

Fear of Change

Whether out of a fear of change, the unknown, or competition, large, U.S. law firms (Big Law) were the most resistant to LPO before the ABA gave its stamp of approval in 2008. I imagine they still are.

Why India?

Pangea3 outsources to India because, among other things, the Indian legal system is based on common law. Therefore, the Indian attorneys and employees approach legal issues and problems the same way U.S.- and British-trained lawyers do. The Indian code of legal ethics is also very similar, particularly with regard to confidentiality. In addition, India has approximately 1 million lawyers, with 100,000 in Mumbai (Pangea3’s Indian HQ). By outsourcing to India, Pangea3 is able to overcome scalability and quality problems because of the sheer number and quality of lawyers. According to Mr. Goldstein, some of India’s top law grads line up to work for Pangea3. I also believe India’s relative political stability and strong tech infrastructure make India the ideal choice for an LPO.


Keeping Secrets

In addition to the benefits of the Indian view on client confidentiality, Pangea3 also protects its clients with non-disclosure agreements. If GE is the client, employees of Pangea3 would sign an NDA in a jurisdiction favorable to GE. In addition, Pangea3 builds a very strong culture of confidentiality. From U.S. management to employees in Mumbai, Pangea3 values confidentiality and inculcates this value throughout the organization.

The Future of LPO

Finally, I asked Mr. Goldstein a question: What kind of work is best suited for LPO? His response, in short: Doc review. No surprise there. Following up, I suggested that TAR and intelligent discovery methods would be consuming a lot of this work, and he agreed. But he thought (and I am paraphrasing) that before software eats LPO, the combination of LPO and tech will eat up the traditional doc review methods and jobs here in the States. Maybe not the best news for a third-year law student like me, but a reality that, if embraced, could provide great opportunity.

It seems to me that people, technology, collaboration, and process are the keys to a strong LPO operation. Especially people. Having the best people working for you in this business, on this and the other side of the pond (wherever that may be), is crucial to providing the value, peace of mind, and quality that these companies must be founded on. Finally, Mr. Goldstein noted that Pangea3 tries to infuse legal DNA at every level of the organization. A for-lawyers, by-lawyers mentality. I like that.

Thank you, Mr. Goldstein, for speaking to the Lab. More to come . . .

Visualizing SCOTUS


Today in Quantitative Methods for Lawyers, taught by Professor Dan Katz at MSU Law, we experimented with the Supreme Court Database and learned how to plot graphs using R, an open-source statistical computing environment. The graph above depicts dissents filed since the 1950s, color-coded by the filing justice. All of the credit for the good stuff in this graph goes to the Supreme Court Database, Professor Katz, and Mr. Michael Bommarito (who visited our class today). I tried to put a few personal touches on this graph, so any mistakes are my own.

For more on the software used to produce this graph and its uses in law, visit the Computational Legal Studies Blog.

Practice Notes on Predictive Coding and eDiscovery

In eDiscovery class today, Bruce Ellis Fein, the co-founder and legal director of Backstop LLP, a discovery software and services firm, visited our class and discussed the role of predictive coding in modern discovery.

A former associate at Sullivan & Cromwell, Mr. Fein observed that traditional doc review was not only a waste of paper, but also time and effort. Enter Backstop, which uses predictive coding methods to allow the user of the Backstop software to take information entered by human reviewers and generalize it to a larger group of documents, making the sorting process dramatically less taxing.

Among the many points in Mr. Fein’s presentation, his thoughts on recall and precision were especially interesting:

  • Recall: Recall is the fraction of relevant documents that are identified as relevant by a search or review effort. According to Mr. Fein, recall is the most critical measure of accuracy. If you err in the measure of recall, you can get hammered with sanctions. That said, most documents that are responsive to a subpoena end up not being relevant to the case. In fact, one person could have a different opinion of a document’s relevancy on two different days. Human recall has been measured at around 50%. So if we can beat that even marginally using predictive coding, we will be better off. Computers can do this. Practice tip: If you’re going to use predictive coding to analyze a corpus, get a written agreement regarding the recall level signed by opposing counsel. This will prevent issues from arising later in the discovery process. Further, predictive software is able to measure recall more accurately, which gives an advantage to those who use it correctly.
  • Precision: Precision is the fraction of documents identified as relevant by a search or review effort that are in fact relevant. If a collection has low precision, a number of non-responsive documents have been categorized as responsive, which shows that the computer’s decisions are not yet very accurate. Mr. Fein suggests that if you are the producing party, you should avoid any discussion of precision because it will require you to do more work in the form of review. If you are on the receiving end of production, you should insist on a specific level of precision.
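To make the two definitions concrete, here is a small worked example in R. The ten "documents" and their labels are invented purely for illustration; `truth` stands for what a perfect reviewer would decide and `predicted` for what the search or software flagged.

```r
# Hypothetical review of ten documents
truth     <- c(TRUE, TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE)  # actually relevant
predicted <- c(TRUE, TRUE, TRUE, FALSE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE)   # flagged as relevant

true_pos  <- sum(predicted & truth)    # relevant and flagged
false_pos <- sum(predicted & !truth)   # flagged but not relevant
false_neg <- sum(!predicted & truth)   # relevant but missed

# Recall: share of the relevant documents that were found
recall <- true_pos / (true_pos + false_neg)

# Precision: share of the flagged documents that are actually relevant
precision <- true_pos / (true_pos + false_pos)

cat("recall:", recall, " precision:", precision, "\n")
```

With these made-up labels, three of the four relevant documents are found (recall of 0.75), and three of the five flagged documents are relevant (precision of 0.6).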

Thanks to Mr. Fein for coming in and speaking to our class. It was an informative presentation, and it sounds like Backstop has an exciting product and service, one of the few that truly uses predictive coding to make the eDiscovery process more efficient.

An Experiment in Legal Research


The Michigan State Law Review requires its members to produce a scholarly Note addressing a legal issue. After working on this topic for just over a year, I am proud to say that a working draft of my paper is now available on SSRN. That said, this project is by no means complete and I would greatly appreciate any comments or feedback. Here is the abstract:

In 1995, Robert Ambrogi, former columnist for Legal Technology News, wrote about the Internet’s potential to revolutionize the accessibility and delivery of legal information. Almost 20 years later, Ambrogi now describes his initial optimism as a “pipe dream.” Perhaps one of the greatest problems facing the legal industry today is the sheer inaccessibility of legal information. Not only does this inaccessibility prevent millions of Americans from obtaining reliable legal information, but it also prevents many attorneys from adequately providing legal services to their clients. Whether locked behind government paywalls or corporate cash registers, legal information is simply not efficiently and affordably attainable through traditional means.

There may, however, be an answer. Although the legal industry appears to just be warming up to social media for marketing purposes, social media platforms, like Twitter, may have the untapped potential to help solve the accessibility problem. This Note attempts to prove that assertion by showing an iteration of social media’s potential alternative use, as an effective and free information sharing mechanism for legal professionals and the communities and clients they serve.

Generally speaking, law review editors and other academicians demand that authors support every claim with a citation, or, at the very least, require extensive research to support claims or theses.  This Note seeks to fulfill this requirement, with a variation on conventional legal scholarship. Almost all of the sources in this Note were obtained via Twitter.  Thus, this somewhat experimental piece should demonstrate social media’s potential as an emerging and legitimate source of legal information. By perceiving and using social media as something more than a marketing tool, lawyers, law schools, and, most importantly, clients, may be able to tap into a more diverse and more accessible well of information. This redistribution of information accessibility may not only solve some of the problems facing the legal industry, but also has the capability to improve society at large.

Baker McKenzie: Twitter Outcast

Baker McKenzie is the Steven Glansberg of the world’s top law firms . . . on Twitter.

I recently pulled the Twitter accounts of the ten highest-grossing law firms in the world and imported them into Gephi, an open-source data visualization tool, which yielded this:

BigLaw Top Ten

This is a graphical representation of Twitter relationships between the top ten law firms. Applying a modularity filter to the firms, the graph also displays “communities” between the firms. This is demonstrated by the green and yellow connectors, or edges, between the firms. (Note: I am not sure if this function worked all that well given the limited data points provided to the algorithm. But it looks nice.)

So the question is: Can we gain any insight from this graph? First, I am surprised at the unilateral relationships between the firms. There does not appear to be any follow-back reciprocation amongst the firms. Second, Clifford Chance isn’t afraid to follow its competitors! Maybe they are keeping an “eye” on their competitors’ Twitter activity! Finally, Baker McKenzie is a Twitter outcast. I’m picturing Heath Ledger in Ten Things I Hate About You . . . or Steven Glansberg.
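The two observations above (one-way follows and an isolated account) can also be checked without a visualization tool. Below is a minimal base-R sketch on a made-up edge list with placeholder account names; it is not the data behind the graph and does not reproduce Gephi’s modularity calculation.

```r
# Hypothetical accounts and directed "follows" edges (from follows to)
accounts <- c("firm_a", "firm_b", "firm_c", "firm_d", "firm_e")
edges <- data.frame(
  from = c("firm_a", "firm_a", "firm_b", "firm_d"),
  to   = c("firm_b", "firm_c", "firm_a", "firm_c"),
  stringsAsFactors = FALSE
)

# An edge is reciprocated when the reverse edge is also in the list
forward <- paste(edges$from, edges$to)
reverse <- paste(edges$to, edges$from)
edges$reciprocated <- forward %in% reverse

# Count followers (in-degree) and follows (out-degree) for every account
in_degree  <- tabulate(match(edges$to, accounts), nbins = length(accounts))
out_degree <- tabulate(match(edges$from, accounts), nbins = length(accounts))

# An "outcast" here is an account with no connections in either direction
isolated <- accounts[in_degree + out_degree == 0]

print(edges)
print(isolated)
```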