Analyzing One Million Voter Records in Manhattan
A Unique Look into the Composition of Voters in Manhattan
What if you could instantly visualize the political affiliation of an entire city, down to every single apartment and human registered to vote? Somewhat surprisingly, the City of New York made this a reality in early 2019, when the NYC Board of Elections decided to release 4.6 million voter records online, as reported by the New York Times. These records included each voter's full name, home address, political affiliation, and whether they had registered in the past 2 years. The reason, according to the article, was:
Board officials said their print vendor was not able to produce enough copies of the voter enrollment book in time for candidates to begin gathering petition signatures in February, which is why they posted the information online.
Posting the records online became the ‘stopgap’ in the meantime. As NYC BOE spokeswoman Valerie Vazquez-Diaz said in the same article:
Whether you were aware of it or not, NY voter data is public record.
While the idea of a city making 4.6 million personal records of voters publicly available on a whim is deeply troubling, and I plan to discuss that in another article, the opportunity to visualize this data intrigued me. With most public datasets, the most granular you tend to have access to is maybe census tract level, but New York City just laid out one of the most unique datasets in the world! For the first time there now existed a dataset to visualize a vast array of hyper-granular voter data, including political affiliation concentration by area, whether there are concerted efforts in certain areas to register new voters, and how dominant parties are at a scale that people can understand.
In the end I decided to attempt this analysis and publish it for two reasons:
1. Public officials need to better consider the potential ramifications of posting a massive dataset like this online
In this case it took a single, semi-technical person a few weekends of their free time to download, transform, and search through millions of records. Not only did I do that, but I was able to geocode over a million records for about $30 after credits on Google. Privacy needs to be top of mind as opposed to an afterthought in this digital age.
2. At an aggregated level this data offers new and unique insights that are valuable to a broad spectrum of people
I do believe this dataset (visualized/analyzed at the building-level) can offer unique insights while maintaining the privacy of myself and other New York residents.
In aggregate, I found that concentrated, party-dominant areas do indeed exist, and that these party-dominant clusters occur across the entire city. I also found that new voter registrations were highly concentrated in areas like Alphabet City and the Upper East Side, with some clusters having up to 5x more new registrants than others. The remainder of the article covers the methodology for transforming ~600 PDFs into a usable dataset, three visualizations of the comparisons observed, and a conclusion with suggestions for further research.
Now just because data is available doesn’t mean it’s accessible, which was demonstrated by the way the data was released. Let’s take a redacted look at the raw PDF below:
From a data science perspective, this is an absolute nightmare. A few of the many reasons:
- It is displayed as three ‘tables’ per page in a PDF, so Excel and other programs do not consume it as you would expect. It more or less explodes and generates random entries in random cells, so simple copy/paste is a no-go
- Data overlaps (street names overlap with voter names), so there is no easy way to delimit columns
- Data points like street number are initially stated only once
- Columns/new pages can start with no data (e.g., no street number at the top of the 2nd column), so there needs to be a way to remember the ‘most recent’ elements
- It only has addresses, and you need latitude/longitude data in order to visualize on a map
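The ‘most recent’ problem in the list above is essentially a carry-forward: when a cell is blank, reuse the last non-blank value seen. Below is a minimal sketch of that idea, not the code used for this article. The field names (`street_number`, `street_name`, `name`) and the sample rows are made up for illustration.

```python
def carry_forward(rows, keys=("street_number", "street_name")):
    """Fill blank values in `keys` with the most recent non-blank value seen."""
    last_seen = {}
    filled = []
    for row in rows:
        new_row = dict(row)  # copy so the input rows are left untouched
        for key in keys:
            value = new_row.get(key)
            if value:
                last_seen[key] = value        # remember the latest real value
            elif key in last_seen:
                new_row[key] = last_seen[key]  # blank cell: reuse the last one
        filled.append(new_row)
    return filled


# Placeholder rows (not real voter data): the second row has no address at all,
# the third row has a new street number but no street name.
rows = [
    {"street_number": "12", "street_name": "EXAMPLE ST", "name": "VOTER A"},
    {"street_number": "", "street_name": "", "name": "VOTER B"},
    {"street_number": "14", "street_name": "", "name": "VOTER C"},
]
filled = carry_forward(rows)
```

Because `last_seen` lives outside the per-row loop, the same function could be fed rows from consecutive columns or pages in order, which is what the ‘remember across a page break’ requirement needs.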
Since I’m a Python person, I immediately began trying existing libraries like PyPDF2 to read the data, but it became clear that the formatting of the PDF would not allow for this. Troubled after sinking my first few hours into a dead-end, I began to re-think my strategy. If I couldn’t use the existing way the data was structured on the page, I needed to find a way to completely isolate and rebuild every word on the page from the ground up.
Enter Optical Character Recognition (‘OCR’). Using the Pytesseract library, I changed my strategy by telling my program each page was an image instead of a PDF page. In fact, all it takes is a single line of code:
text = pytesseract.image_to_data(page)
To give you coordinates for virtually every element on the page:
Once I was able to confirm this, it was a matter of working out how to use those positions to construct a custom delimiter and loop through each page. Unfortunately, most of the first few columns wouldn’t work due to the way Pytesseract chose to group things, but the values for ‘left’, ‘top’, ‘width’, and ‘height’ were accurate from what I could tell. So for my program, I basically utilized the following logic:
- Read page as an image
- Isolate data table on left side
- Use left/top positions to determine row/column groupings
- Make into a DataFrame with separate columns for street number, name, zip, etc.
- Perform above steps on middle then right data table
- Move to next page
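The steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the program used for this article: the helper names, the pixel ranges, the row tolerance, and the sample words are all hypothetical. The input is assumed to be the dictionary form of the OCR output (parallel `text`, `left`, and `top` lists), which Pytesseract can return when `image_to_data` is called with `output_type=pytesseract.Output.DICT`.

```python
import bisect


def words_in_band(ocr, x_min, x_max):
    """Keep non-empty OCR words whose left edge falls inside [x_min, x_max)."""
    words = []
    for text, left, top in zip(ocr["text"], ocr["left"], ocr["top"]):
        if text.strip() and x_min <= left < x_max:
            words.append({"text": text.strip(), "left": left, "top": top})
    return words


def group_into_rows(words, row_tolerance=8):
    """Cluster words into rows: a word joins the current row when its 'top'
    is within row_tolerance pixels of that row's first (topmost) word."""
    rows = []
    for word in sorted(words, key=lambda w: (w["top"], w["left"])):
        if rows and abs(word["top"] - rows[-1][0]["top"]) <= row_tolerance:
            rows[-1].append(word)
        else:
            rows.append([word])
    return [sorted(row, key=lambda w: w["left"]) for row in rows]


def split_columns(row, boundaries):
    """Assign each word to the column whose starting x offset it falls after.
    `boundaries` is an ascending list of column start positions."""
    cells = [[] for _ in boundaries]
    for word in row:
        index = max(bisect.bisect_right(boundaries, word["left"]) - 1, 0)
        cells[index].append(word["text"])
    return [" ".join(cell) for cell in cells]


# With Tesseract installed, the dictionary would come from something like:
# ocr = pytesseract.image_to_data(page, output_type=pytesseract.Output.DICT)
# Placeholder data standing in for that output (not real voter data):
sample_ocr = {
    "text": ["12", "EXAMPLE", "ST", "DOE", "", "JANE", "99"],
    "left": [10, 60, 130, 200, 0, 202, 700],
    "top": [100, 101, 99, 100, 0, 120, 100],
}

left_table = words_in_band(sample_ocr, x_min=0, x_max=400)  # left-hand table only
table = [split_columns(row, boundaries=[0, 50, 190])
         for row in group_into_rows(left_table)]
```

The resulting list of rows could then be handed to a DataFrame constructor, and the same functions rerun with different `x_min`/`x_max` values for the middle and right tables.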
To be fair, there was a lot of nuance and edge case handling I had to deal with, and the results of OCR are not perfect (which I’ll discuss more in the conclusion). However, after a day or two of tinkering around, I was able to successfully iterate through my first 250-page PDF in about 30 minutes and transform it into a fully usable data table!
Now came the fun part: how do I translate this onto an actual map? I had addresses, but I wouldn’t be able to actually map them unless I could somehow find the lat/lon coordinates for each address. When I looked at the Manhattan addresses I had successfully converted, there were almost 100,000 unique addresses, so doing this part manually was out of the question. Luckily, not only does Google have a Geocoding API (which converts addresses into coordinates), but I found a fantastic Python library called Geopy that let me easily create a function to geocode a list of addresses in under 20 lines of code.
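For a sense of what such a function might look like, here is a minimal sketch rather than the exact code used for this article. It assumes Geopy's `GoogleV3` geocoder and a valid Google API key. The helper names `make_google_geocoder` and `geocode_addresses`, the `pause_seconds` parameter, and the choice to record failures as `None` are my own for illustration.

```python
import time


def make_google_geocoder(api_key):
    """Build a geocode callable backed by Geopy's GoogleV3 geocoder.
    Requires the geopy package and a Google API key."""
    from geopy.geocoders import GoogleV3  # lazy import: only needed for live lookups
    return GoogleV3(api_key=api_key).geocode


def geocode_addresses(addresses, geocode, pause_seconds=0.0):
    """Return {address: (lat, lon)}; addresses with no match map to None.
    `geocode` is any callable that takes an address string and returns an
    object with .latitude/.longitude, or None when nothing is found."""
    results = {}
    for address in dict.fromkeys(addresses):  # de-duplicate, keep original order
        try:
            location = geocode(address)
        except Exception:
            location = None  # network or quota error: record the miss, keep going
        if location is None:
            results[address] = None
        else:
            results[address] = (location.latitude, location.longitude)
        time.sleep(pause_seconds)  # optional pause between requests
    return results


# Live usage (needs network access and a real key), shown as a comment:
# coords = geocode_addresses(unique_addresses,
#                            make_google_geocoder("YOUR_API_KEY"),
#                            pause_seconds=0.05)
```

Passing the geocoder in as a callable keeps the loop testable without network access, and de-duplicating first matters because every request is billed.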
Since I had never used Google’s cloud platform, I started with a $300 free credit, which helped offset the total cost of ~$330. Emboldened by my free credits and whimsy, I let my program execute over the next ~12 hours to return a beautiful set of lat/lon coordinates for almost all the corresponding addresses. After joining that with the aggregated dataset and doing some data cleanup (see conclusion), it was time to determine how to visualize it.
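The aggregation and join step might look something like the sketch below. It is an assumption-laden illustration, not the pipeline used for this article: the field names, the party codes `DEM` and `REP`, and the coordinates are placeholders, and addresses that failed to geocode are simply dropped.

```python
from collections import Counter, defaultdict


def aggregate_by_building(voters):
    """Count registered voters per party for each address."""
    buildings = defaultdict(Counter)
    for voter in voters:
        buildings[voter["address"]][voter["party"]] += 1
    return buildings


def join_coordinates(buildings, coords):
    """Attach (lat, lon) to each building and compute each party's share.
    Buildings without coordinates are skipped."""
    rows = []
    for address, parties in buildings.items():
        if coords.get(address) is None:
            continue  # geocoding failed or the address was never looked up
        lat, lon = coords[address]
        total = sum(parties.values())
        row = {"address": address, "lat": lat, "lon": lon, "total": total}
        for party, count in parties.items():
            row[f"pct_{party}"] = count / total
        rows.append(row)
    return rows


# Placeholder records and coordinates (not real voter data):
voters = [
    {"address": "12 EXAMPLE ST", "party": "DEM"},
    {"address": "12 EXAMPLE ST", "party": "DEM"},
    {"address": "12 EXAMPLE ST", "party": "REP"},
    {"address": "14 EXAMPLE ST", "party": "REP"},
]
coords = {"12 EXAMPLE ST": (40.0, -74.0), "14 EXAMPLE ST": None}

building_rows = join_coordinates(aggregate_by_building(voters), coords)
```

Only building-level counts and shares leave this step, which is what keeps individual names out of the mapped dataset.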
Visualizing massive datasets is no small task. Beyond finding a visualization library that can handle several hundred thousand rows, you also need to choose a technique that makes the end result meaningful and truthful to the underlying data. I initially began using a fantastic new project in the Machine Learning community called Streamlit, but ultimately decided I didn’t want to create a dashboard since too much clutter detracted from the focus (mapping the data).
Ultimately I went with Kepler.gl, the graphing library made by the Uber Visualization team. Kepler uses Mapbox as the underlying map provider, does an amazing job of handling large datasets, and is extremely accessible even for people who don’t code. After experimenting with a few different techniques, filters, and layer designs, I decided on the three most impactful below.
One of the first things I wanted to see was what each party’s concentration looks like in isolation. I wanted to do this for two reasons:
- To see if the data I pulled roughly approximates reported party dominance (Democrats have made up ~68% of the voting population)
- To clearly identify concentration areas without one party dominating the map due to skew
To do this I created a dual view with one map per party, sized each point’s radius by the concentration of points in a given area (the intensity of registered voters there), then created a sequential quantized color scale based on the average percentage of buildings in the area that each party occupies.
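To make the ‘quantized’ part concrete: a quantized scale cuts a continuous range into a fixed number of equal-width bins and gives each bin one color. The sketch below is a generic illustration of that binning, not Kepler.gl’s own implementation. The five-step palette and the 0 to 1 range are assumptions chosen for the example.

```python
def quantize(value, lo, hi, n_bins):
    """Return the index (0 .. n_bins - 1) of the equal-width bin holding value.
    Values outside [lo, hi] are clamped to the nearest end."""
    if hi <= lo:
        raise ValueError("hi must be greater than lo")
    clamped = min(max(value, lo), hi)
    index = int((clamped - lo) / (hi - lo) * n_bins)
    return min(index, n_bins - 1)  # value == hi belongs to the last bin


# Hypothetical sequential palette, light to dark (not the colors used in the maps).
PALETTE = ["#eff3ff", "#bdd7e7", "#6baed6", "#3182bd", "#08519c"]


def color_for_share(share):
    """Map a party's share of an area (0.0 to 1.0) to one palette color."""
    return PALETTE[quantize(share, 0.0, 1.0, len(PALETTE))]


low_color = color_for_share(0.05)
mid_color = color_for_share(0.5)
high_color = color_for_share(1.0)
```

Fewer, wider bins make clusters easier to read at city scale, at the cost of hiding small differences between neighboring areas.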