Software forensics lay MS-DOS, CP/M controversy to rest
It’s often the case that success breeds controversy, particularly in the technology industry and especially with regard to software. Given the amount of code sharing and reuse that has occurred since the beginnings of the tech industry itself, it’s no longer uncommon to hear about legal actions being taken around copyright infringement or licensing issues.
One of the earliest and highest-profile of these cases was a lawsuit filed by Caldera against Microsoft that involved CP/M, the operating system developed by DRI founder Gary Kildall in 1974. The private antitrust suit was settled out of court in 2000 with .. Caldera had acquired Digital Research Inc. (DRI) by way of Novell Inc., and with it the rights to
What’s the beef?
$275 million, even by 2000 standards, is a relative pittance for what became the backbone (or at least the jumping-off point) of the Microsoft empire. MS-DOS would form the foundation for Windows 95 and 98, which helped cement Microsoft’s dominance in the enterprise and eventually the datacenter, while CP/M (later marketed as DR-DOS) has been relegated to the annals of tech history with its last release coming some 33 years ago.
The point of contention is that, at the time of the alleged copying of source code, commands, and system calls, DRI and CP/M held roughly 20 percent of the market, a share that potentially could have grown into the one DOS and subsequent versions of Windows enjoyed after MS-DOS was licensed to IBM in 1981. As such, rumors have circulated ever since about the true origins of MS-DOS, which some claim is the direct product of an unauthorized clone of CP/M. Some myths even go so far as to suggest that a secret command exists within MS-DOS that, when executed, prints something along the lines of: COPYRIGHT (C) GARY A. KILDALL JUNE, 1975 *.
Aside from adamant, impassioned, and opinionated advocates on both sides of the argument, these rumors have persisted for more than 30 years because of limitations in the field of software forensics. Software forensics (or software plagiarism detection), a discipline widely employed in cases such as the one between Microsoft and Caldera, deals with the analysis of source code and binaries to determine matters of intellectual property infringement. However, for most of its existence, including at the time Caldera filed its lawsuit against Microsoft, software forensics typically relied on academics and software consultants manually comparing huge amounts of code in different programs in search of evidence of copying, and then sharing their findings with a court as expert witnesses. Like the US legal system as a whole, the process was fallible, time consuming, and expensive.
Coincidentally, though, around the time of the Microsoft/Caldera litigation, Bob Zeidman, an electrical engineer by education and a software developer by trade, began developing a utility program to optimize those methods. Over the years, that simple utility program evolved into CodeSuite, a set of tools for comparing source code and executables that has become a mainstay in IP litigation cases.
Five rules of CodeSuite & DOS v. CP/M
The CodeSuite toolset supports about 40 languages, each of which has its own parser or definition file that is used when a set of code is being reviewed. The code being tested is then divided into segments that are analyzed separately so that alterations to certain code elements do not disguise plagiarism across an entire body of work. These segments are:
- Statements, comments, and strings
- Instruction sequences
Once these segments are analyzed, a normalized correlation score is produced, which is weighted so that direct copying stands out. For example, if a copied algorithm makes up 10 percent of a code sample being analyzed, CodeSuite would return a score of “100” as opposed to other tools that might report “10 percent” or “0.1,” which can be misleading when core IP is at stake. The CodeSuite correlation score then directs analysts to areas of potential copying for further review.
However, it’s important to note that the correlation score is not a conclusive judgment on infringement or theft in and of itself. Rather, it is the starting point for analysts to review highly correlated code in more depth, following a well-defined procedure to determine whether software was taken nefariously (Figure 1). This procedure considers the following innocent explanations for correlation:
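To see why normalization matters, consider the rough sketch below. It is not CodeSuite's actual formula, which the article does not describe; it simply assumes a score that divides the number of shared lines by the size of the smaller sample, and contrasts it with a naive percentage of the larger sample. The function names and the sample data are hypothetical.

```python
def _unique_lines(lines):
    """Collect the distinct, non-blank, whitespace-trimmed lines of a sample."""
    return {line.strip() for line in lines if line.strip()}


def normalized_score(lines_a, lines_b):
    """Assumed scoring: shared lines relative to the SMALLER sample, on a 0-100 scale."""
    set_a, set_b = _unique_lines(lines_a), _unique_lines(lines_b)
    if not set_a or not set_b:
        return 0.0
    return 100.0 * len(set_a & set_b) / min(len(set_a), len(set_b))


def naive_score(lines_a, lines_b):
    """Naive scoring: shared lines relative to the LARGER sample, on a 0-100 scale."""
    set_a, set_b = _unique_lines(lines_a), _unique_lines(lines_b)
    if not set_a or not set_b:
        return 0.0
    return 100.0 * len(set_a & set_b) / max(len(set_a), len(set_b))


# Hypothetical data: a 10-line algorithm copied into a 100-line program.
algorithm = [f"step_{i}()" for i in range(10)]
program = algorithm + [f"other_{i}()" for i in range(90)]

print(normalized_score(algorithm, program))  # 100.0
print(naive_score(algorithm, program))       # 10.0
```

Under these assumptions, the wholesale copy of a small but possibly critical routine is reported as a full match rather than being diluted by the unrelated 90 percent of the program.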
- Common algorithms – Many fundamental algorithms are taught in schools for basic operations, which could appear as copying
- Common identifiers – Common industry and programming terms (such as “index,” “count,” and “matrix”) as well as human-readable language (for comments) can appear to be copying
- Common author – If the same author wrote programs at two different companies, they will invariably use the same fundamental coding style
- Automatic code generation – If two programs use the same or a similar tool to automatically generate code, it will likely look the same
- Third-party code – Code, particularly open-source code for basic operations, will appear as plagiarism, a fact that is particularly relevant today
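As a loose illustration of how part of this review could be mechanized, the sketch below sorts matched identifiers into those with an innocent explanation and those that still need a human analyst. It is an assumption for illustration only, not a description of CodeSuite; the function name, the third-party list, and the unusual identifier are hypothetical, while the three common terms come from the examples above.

```python
# Common terms taken from the article's examples of innocent matches.
COMMON_IDENTIFIERS = frozenset({"index", "count", "matrix"})


def triage_matches(matches, common=COMMON_IDENTIFIERS, third_party=frozenset()):
    """Split matched identifiers into (explained, needs_review) lists.

    A match is 'explained' if it is a common term or is known to come from
    third-party code; everything else is left for manual review.
    """
    explained, needs_review = [], []
    for name in matches:
        key = name.lower()
        if key in common or key in third_party:
            explained.append(name)
        else:
            needs_review.append(name)
    return explained, needs_review


# Hypothetical matches reported between two programs.
matches = ["index", "Count", "zq_frobnicate_buf"]
explained, needs_review = triage_matches(matches)
print(explained)     # ['index', 'Count']
print(needs_review)  # ['zq_frobnicate_buf']
```

Only the unusual identifier survives the filter, which is the kind of match an analyst would then trace back to a common author, a code generator, or actual copying.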
Zeidman began his code comparison by cleaning CP/M, removing everything that was not source code, reformatting the code, running comments as instructions and instructions as comments, and searching for unusual identifiers, among other things. He then proceeded to perform global searches of both CP/M and MS-DOS for “copyright,” company names, author names and initials, and other terms of relevance.
Interestingly, the term CP/M was found in MS-DOS. However, after some research Zeidman found that early DOS programs did in fact read CP/M files, which is not surprising given CP/M’s early market presence (Figure 2). Matching comments, strings, and identifiers were also discovered, but they appeared infrequently and the terms were general enough to fall under the category of “Common Identifier Names,” as mentioned previously (Figure 3). With regard to the source code, the conclusion was that no copying had occurred.
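A global search of this kind is simple to approximate. The sketch below is a generic, hypothetical stand-in rather than the tooling Zeidman used: it scans a set of in-memory source files for any of a list of terms, ignoring case, and reports the file, line number, and line for each hit. The file names and contents are invented for the example.

```python
import re


def global_search(sources, terms):
    """Return (filename, line_number, line) for every line containing any term.

    `sources` maps a file name to its text; matching is case-insensitive.
    """
    if not terms:
        return []
    pattern = re.compile("|".join(re.escape(t) for t in terms), re.IGNORECASE)
    hits = []
    for name, text in sources.items():
        for lineno, line in enumerate(text.splitlines(), start=1):
            if pattern.search(line):
                hits.append((name, lineno, line.strip()))
    return hits


# Hypothetical file contents, for illustration only.
sources = {
    "a.asm": "; Copyright notice here\nMOV AX, 1\n",
    "b.asm": "; reads CP/M files\nRET\n",
}

for hit in global_search(sources, ["copyright", "CP/M"]):
    print(hit)
# ('a.asm', 1, '; Copyright notice here')
# ('b.asm', 1, '; reads CP/M files')
```

Each hit is only a lead: as with the correlation score, someone still has to read the surrounding code to decide whether the term is there for an innocent reason.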
Again, for the command line interfaces high correlations were found, but as shown in Figure 4, these too fall under the umbrella of “common identifiers” that span multiple operating systems.
The greatest possibility of copyright infringement came in the system calls, where, as seen in Figure 5, the numbering of the calls is almost identical across programs. For instance, in both MS-DOS and CP/M, 15 and 16 are the calls to open and close a file, respectively. Here we come to a gray area in that, while the system calls were implemented and used differently in each program, the similarities are undeniable. That said, this is most likely a case of Microsoft not reinventing the wheel and sticking with the recognizable calls of the day. In copyright terms this is referred to as fair use. Did anyone buy MS-DOS as a result of the CP/M system calls? Probably not.
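The comparison of call numbering can be pictured with a small sketch. The two tables below are deliberately tiny: entries 15 and 16 come from the example above, while entry 200 is a made-up placeholder included only to show a mismatch. The function name is hypothetical, and nothing here reproduces the full tables in Figure 5.

```python
def matching_calls(table_a, table_b):
    """Return the sorted call numbers that map to the same operation in both tables."""
    shared_numbers = table_a.keys() & table_b.keys()
    return sorted(n for n in shared_numbers if table_a[n] == table_b[n])


# 15 and 16 are from the article; 200 is a hypothetical placeholder entry.
cpm_calls = {15: "open file", 16: "close file", 200: "placeholder A"}
msdos_calls = {15: "open file", 16: "close file", 200: "placeholder B"}

same = matching_calls(cpm_calls, msdos_calls)
print(same)  # [15, 16]
```

A match in numbering says nothing about how each call is implemented, which is why the article treats the overlap as a gray area rather than proof of copying.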
Oh, and Zeidman didn’t find any secret commands in MS-DOS that print a Kildall copyright.
So why did Microsoft settle?
The conclusions above reveal little if any significant evidence for copyright infringement, so why did Microsoft budge?
As mentioned, $275 million was and is a relative pittance for Microsoft, and after almost five years of litigation, more important than attorney fees or bruised egos was the company’s market and brand reputation. Rather than get stuck in the weeds over a controversy that could have gotten out of hand and raised public questions about Microsoft’s integrity, they kept their head in the clouds, which is exactly where they find themselves today.
As for the CodeSuite findings, Zeidman presented them earlier this month at the Vintage Computer Festival that took place at the Computer History Museum in Mountain View, CA. He’s confident enough in his work that he has offered a $100,000 reward to anyone who can disprove his finding that Microsoft did not copy source code from CP/M, and an additional $100,000 to anyone who can find Kildall’s mysterious secret function.
Until someone does, I think it’s time to put these rumors to bed.