Principle of Immutability for Election Data
Citizens Oversight (2017-03-03) Ray Lutz
This Page: http://copswiki.org/Common/M1733
More Info: Election Integrity, Open Ballot Initiative
Note: This article refers to the CVR (Cast Vote Record) standard being developed under the umbrella of NIST, the National Institute of Standards and Technology.
I have been working on defining a separate data-set standard for ballot image data. In that process, I think the concept of IMMUTABILITY is key to making the right decisions about any standard for election data, so that it is compatible with robust audits and also less subject to cybersecurity threats. (Regarding this proposed ballot image data-set standard, I am getting closer to having a document that can be reviewed, but initially I have to dance around for a while to let things settle into place.)
Understanding and applying the principle of IMMUTABILITY throughout elections standards is key to reducing vulnerabilities and encouraging robust auditing.
Immutability means that data is never changed once it is produced. Appending is allowed, but changes to older entries are disallowed. Log files generally fit this model, while a spreadsheet of incrementing vote totals does not. In accounting, going back and changing entries (especially in periods that are closed) is strictly forbidden. Instead, you add a new correcting entry. When you do it that way, you preserve the history of what occurred earlier AND document the correction. "Accountants don't use erasers," as they say.
In elections data, we need to be mindful of this principle and develop data that is immutable as a rule.
Thus CVR data will look more like log files and less like a JSON data structure. In fact, JSON is probably a bad model to use because it matches the internal data structures of programs, which are generally based on mutable programming practices. I assert that a better model for CVR data is a set of CSV tables with entries that are never changed once they are appended, where each is protected by a checksum (a SHA digest, for example) and the tables are grouped into working units that are also protected as a group. Once a table is done, it is never, ever changed.
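A minimal sketch of that pattern (the file name and row layout here are hypothetical, not part of any standard): a CSV table that is only ever appended to, protected by a SHA-256 digest that would be published once the table is closed.

```python
import csv
import hashlib
import os
import tempfile

def append_rows(path, rows):
    # Appending only: existing lines in the table are never rewritten.
    with open(path, "a", newline="") as f:
        csv.writer(f).writerows(rows)

def table_digest(path):
    # SHA-256 over the whole table; published once the table is closed.
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()

path = os.path.join(tempfile.mkdtemp(), "marks.csv")  # illustrative table name
append_rows(path, [["b-0001", "contest-1", "option-A", "filled"]])
d1 = table_digest(path)
append_rows(path, [["b-0002", "contest-1", "option-B", "filled"]])
d2 = table_digest(path)  # appending changes the digest; earlier rows are intact
```

Any after-the-fact change to an older row changes the digest, so a published digest makes the closed table tamper-evident.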
I can tell you I am one who would like to get away from the concept that we need to carefully review code, because I know how hard that is. I've done a lot of programming, and I learned long ago that I can't always sit down and understand all the issues that may crop up. If we define the data produced so that it adheres to the principle of immutability and also break down all processing into a series of discrete steps, then each process can be reviewed and fully audited by observing input and output, i.e. redundantly checked by another, separate set of processes. But in general, if we still have to deal with checking code, immutable data should be a goal of these reviews.
For example, in the CVR as currently defined, I see state variables that apparently can be reversed, such as "Needs Adjudication." Presumably, in one CVR record, you can have this flag set and people will know that sometime in the future, the data for that record will be adjudicated. Let's say it is then reviewed. Do you unset the "needs adjudication" bit and produce a new CVR without it, but with the result of the vote changed (and the unadjudicated version lost forever)? The principle of immutability says no. You need to document that adjudication was deemed necessary at time x because of reason y as determined by process-id z. Then later, a NEW entry is added that says what was adjudicated at time x because of reason y as determined by process-id z, and what was changed and how. I admit that perhaps a new CVR record could be produced which is linked from the old one somehow and which documents the new state of the vote based on the adjudication, but the point is that the old CVR must still exist and be available for review.
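A minimal sketch of that append-only pattern, assuming a hypothetical record layout (field names like `needs_adjudication` and `supersedes` are illustrative, not the NIST CVR schema): adjudication adds a NEW record linked back to the old one, and the old one is never touched.

```python
import hashlib
import json

log = []  # append-only list of CVR records; entries are never modified in place

def record_digest(rec):
    # Canonical JSON so the digest is reproducible.
    return hashlib.sha256(json.dumps(rec, sort_keys=True).encode()).hexdigest()

def append(rec):
    rec = dict(rec)                    # never mutate the caller's object
    rec["digest"] = record_digest(rec)
    log.append(rec)
    return rec["digest"]

# Original entry: the mark is ambiguous, so the record is flagged, not edited.
orig = append({"ballot": "b-0173", "contest": "Mayor", "selection": None,
               "needs_adjudication": True, "reason": "ambiguous mark",
               "process_id": "scan-03"})

# Later, a NEW entry records the adjudication outcome and links back;
# the original record above still exists, untouched, for review.
append({"ballot": "b-0173", "contest": "Mayor", "selection": "Candidate A",
        "adjudicated": True, "reason": "voter intent clear on image",
        "process_id": "adjud-01", "supersedes": orig})
```

An auditor can follow the `supersedes` link to see both the unadjudicated state and the correction, just as a correcting journal entry preserves the original in accounting.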
Also, I see embedded in a single JSON wrapper a set of contests and then ballot selections. In actual processing, the data furthest inside these structures is developed first, by detecting marks on the ballots, and then it is resolved into the contest vote. Also inside that wrapper is the ballot image. It seems the concept is that the CVR is produced by one all-in-one device that can do all the steps. Even if it can, it is bad practice to report the data this way, because if you ever do break the work up into separate processes, you have to modify the CVR, and right now there is no concept of a reference to an old CVR that it replaces. The way the CVR structure is defined violates the principle of immutability, and that can only be recovered by keeping separate CVR records at each step, if stepwise processing is used. Alternatively, the CVR can be cut up into separate structures, each of which is completely immutable. This is probably the better course.
Consider this set of steps as an example:
Step 1. Image Scan Ballots to create ballot images ==> immutable ballot image data set.
Step 2. Process immutable ballot image data + ballot mark location data ==> immutable table of marks as detected.
Step 3. Process immutable mark data + meaning of the marks ==> immutable ballot selection data.
Step 4. Process ballot selection data + race rules ==> immutable contest data.
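The steps above can be sketched as a chain of pure functions, each taking immutable input and producing a new immutable output. The data formats here are invented for illustration only; real mark-location and contest data would be far richer.

```python
# Each step is a pure function: immutable input -> new immutable output (tuples).

def detect_marks(images, mark_locations):
    # Step 2: find which known target areas are filled on each ballot image.
    return tuple((ballot_id, loc)
                 for ballot_id, filled in images
                 for loc in mark_locations if loc in filled)

def resolve_selections(marks, mark_meaning):
    # Step 3: map each detected mark to a (ballot, contest, choice) selection.
    return tuple((ballot_id, *mark_meaning[loc]) for ballot_id, loc in marks)

def tally_contests(selections):
    # Step 4: fold the selections into per-contest totals.
    totals = {}
    for _ballot, contest, choice in selections:
        totals[(contest, choice)] = totals.get((contest, choice), 0) + 1
    return tuple(sorted(totals.items()))

# Stand-in for the step 1 output: ballot id -> set of filled target areas.
images = (("b1", {"t1"}), ("b2", {"t2"}), ("b3", {"t1"}))
mark_locations = ("t1", "t2")
mark_meaning = {"t1": ("Mayor", "A"), "t2": ("Mayor", "B")}

marks = detect_marks(images, mark_locations)
selections = resolve_selections(marks, mark_meaning)
contests = tally_contests(selections)
```

Because each step's output is a fixed value derived only from its inputs, an independent implementation run on the same inputs must produce byte-identical output, which is what makes the 100% redundant audit possible.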
Each one of these separate data sets can then be compared. After step 1, all steps can be redundantly processed in a 100% audit. It is very robust and resistant to cybersecurity threats or compromised employees because you can't easily change the immutable data produced at each step.
If you want to create those data outputs with one device, you can. But it should also be possible to use different devices (probably better to use the term "service") at each step. Either way, the same data is produced. The current CVR definition constrains the types of products that can be used. This approach does not (and it respects immutability).
We can even look at the concept of immutability as it applies to paper ballots. They are not fully immutable because you CAN add a mark to a ballot option and that may even invalidate the vote for that ballot. A robust chain of custody is required to make sure ballots are not changed, added or removed from the set. That is an attempt to enforce immutability.
Once ballot data is cast into a digital image file and protected by published message digests, it becomes immutable. Of course, we must inspect the image data to make sure it matches the original ballot data. But this scanning process can be isolated fairly easily. Document image scanning is very different from recognizing the content of the document, such as determining the vote. Really good document image scanners are even available as commercial off-the-shelf equipment, as they have been used in the document-management industry for years. Taking snapshots of ballots is a completely different processing activity from recognizing what is on them. "Eagles have great vision but still can't read" is one way to think about this; it is something of a law of information processing.
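A sketch of how a published digest protects an image, assuming SHA-256 as the message digest (the image bytes here are a stand-in for a real scanned file):

```python
import hashlib

def verify_image(image_bytes, published_digest):
    # True iff the image still hashes to the digest published at scan time.
    return hashlib.sha256(image_bytes).hexdigest() == published_digest

# Stand-in for one scanned ballot image's raw bytes.
image = b"raw scanned ballot image bytes"
published = hashlib.sha256(image).hexdigest()  # published immediately after scanning
```

Once the digest is published, anyone holding a copy of the image can confirm it is the same file that came off the scanner, and any alteration, however small, is detectable.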
After this step is accurately completed and the images cast into an immutable form, the rest of the processing can occur anywhere. Vast arrays of computers now exist (Hadoop, Google) and can be harnessed to perform the rest of the processing in record time, in a redundant and competitive fashion. Or the ballots can be reviewed by teams of "hand counters" without ever touching the mutable original ballots.
I hope this principle can help guide our thinking on cybersecurity, both as it relates to the actual processing of the election and to the CVR record. I cross-posted this email because I do believe that, in this case, security of the election data is vastly different from typical computer-security paradigms, and it can be appreciated through the principle of immutability.