10 million data requests: How a Times team tracked Covid
Times Insider explains who we are and what we do, and provides behind-the-scenes information on how our journalism comes together.
This morning, programs written by New York Times developers made more than 10 million requests for Covid-19 data from websites around the world. The data we collect are daily snapshots of the ebb and flow of the virus, covering every U.S. state and thousands of U.S. counties, cities and ZIP codes.
You may have seen slices of this data in the daily charts and graphs we publish at The Times. These pages, which have involved more than 100 journalists and engineers from across the organization, are the most viewed collection in the history of nytimes.com and a key part of the Covid reporting package that earned The Times the 2021 Pulitzer Prize for public service.
The Times Coronavirus Tracking Project was one of many efforts that have helped fill the void in public understanding of the pandemic left by the lack of a coordinated government response. The Coronavirus Resource Center at Johns Hopkins University has collected data on national and international cases. And the Covid Tracking Project at The Atlantic has mobilized an army of volunteers to collect data on U.S. states, in addition to testing, demographics, and health facility data.
At The Times, our work began with a single spreadsheet.
At the end of January 2020, Monica Davey, an editor on the National desk, asked Mitch Smith, a correspondent based in Chicago, to start collecting information on each American case of Covid-19. One row per case, meticulously sourced from public announcements and entered by hand, with details such as age, location, gender and condition.
In mid-March, the explosive growth of the virus overwhelmed that workflow. The spreadsheet grew so large it became unresponsive, and reporters no longer had time to both report and manually enter data for the ever-growing list of U.S. states and counties we had to track.
Around this time, many state and local health departments began standing up Covid-19 reporting efforts and websites to inform their constituents about local spread. The federal government, meanwhile, struggled early on to provide a single, reliable national data set.
The available local data was all over the map, literally and figuratively. The formatting and methodology varied considerably from place to place.
Within The Times, a group of newsroom-based software developers was quickly tasked with building tools to automate as much of the data-collection work as possible. The two of us – Tiff is a newsroom developer and Josh is a graphics editor – would end up joining this growing team.
By March 16, the main scraping app was mostly working, but we needed help retrieving data from many other sources. To tackle this colossal project, we recruited developers from across the company, many of whom had no newsroom experience, to temporarily join in writing scrapers.
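The scrapers themselves were simple in concept: fetch a source, then normalize whatever it returns into one shared row format. Here is a minimal, self-contained sketch of that idea; the payload shape and field names are invented for illustration, not any real agency's schema or The Times's actual code.

```python
import json
from datetime import date

# Hypothetical payload shaped like a typical health-department API response.
# Every field name here is illustrative only.
SAMPLE_PAYLOAD = """
{
  "updated": "2020-04-28",
  "counties": [
    {"name": "Example County", "fips": "99001", "cases": 120, "deaths": 4},
    {"name": "Sample County",  "fips": "99003", "cases": 87,  "deaths": 2}
  ]
}
"""

def parse_county_counts(raw: str) -> list:
    """Normalize one source's payload into flat rows for a shared database."""
    data = json.loads(raw)
    as_of = date.fromisoformat(data["updated"])
    rows = []
    for county in data["counties"]:
        rows.append({
            "date": as_of.isoformat(),
            "county": county["name"],
            "fips": county["fips"],
            "cases": int(county["cases"]),
            "deaths": int(county["deaths"]),
        })
    return rows

rows = parse_county_counts(SAMPLE_PAYLOAD)
print(rows[0]["county"], rows[0]["cases"])  # Example County 120
```

In a real pipeline the raw string would come from an HTTP request rather than an inline constant, and every source would need its own parser, which is why so many extra hands were needed.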
By the end of April, we were programmatically collecting numbers in all 50 states and nearly 200 counties. But the pandemic and our database both seemed to be growing exponentially.
In addition, a few notable sites changed their formats multiple times in just a few weeks, forcing us to rewrite our code again and again. The engineers in our newsroom adapted by streamlining our custom tools as we used them day after day.
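One defensive pattern that helps with shifting sources (a sketch under assumed field names, not The Times's actual tooling) is to validate every scraped record against the fields you expect, so a silent format change raises an error instead of letting partial data into the database:

```python
# Fields a normalized record is assumed to carry (illustrative names).
EXPECTED_FIELDS = {"date", "county", "fips", "cases", "deaths"}

def check_record(record: dict) -> dict:
    """Fail loudly if a source's format drifts or its values look wrong."""
    missing = EXPECTED_FIELDS - record.keys()
    if missing:
        raise ValueError(f"source format changed; missing: {sorted(missing)}")
    if int(record["cases"]) < 0 or int(record["deaths"]) < 0:
        raise ValueError("counts must be non-negative")
    return record

good = {"date": "2020-05-01", "county": "Example County",
        "fips": "99001", "cases": 10, "deaths": 0}
check_record(good)  # passes silently

try:
    # Simulate a source that renamed or dropped most of its fields.
    check_record({"date": "2020-05-01", "cases": 12})
except ValueError as e:
    print("caught:", e)
```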
No fewer than 50 people beyond the scraper team have been actively involved in the day-to-day management and verification of the data we collect. Some data is still entered by hand, and all of it is manually verified by journalists and researchers, a seven-day-a-week operation. Thorough reporting and subject-matter expertise were essential to every role, from journalists to data reviewers to engineers.
In addition to posting data to the Times website, we made our dataset publicly available on GitHub in late March 2020 for everyone to use.
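The public dataset reports cumulative totals per day. As a short sketch of working with that layout, here is how daily new cases can be derived from consecutive cumulative counts, using a tiny inline sample in the same CSV column format (date, state, fips, cases, deaths); the numbers below are illustrative:

```python
import csv
import io

# Inline sample mimicking the column layout of the public dataset;
# values are made up for illustration.
SAMPLE = """date,state,fips,cases,deaths
2020-03-25,Washington,53,2591,133
2020-03-26,Washington,53,2906,147
2020-03-27,Washington,53,3207,151
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE)))

# The file reports cumulative totals, so daily new cases are the
# difference between consecutive days.
new_cases = [int(b["cases"]) - int(a["cases"]) for a, b in zip(rows, rows[1:])]
print(new_cases)  # [315, 301]
```

Reading the real files works the same way, just with the CSV fetched from the repository instead of an inline string.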
As vaccinations reduce the toll of the virus across the country – overall, 33.5 million cases have been reported – a number of health departments and other sources are updating their data less often. Conversely, the federal Centers for Disease Control and Prevention has expanded its reports to include complete figures that were only partially available in 2020.
All of this means that some of our custom data-collection efforts can be wound down. Since April 2021, our number of programmatic sources has dropped by almost 44 percent.
Our goal is to reach around 100 active scrapers by late summer or early fall, primarily for tracking potential hot spots.
The dream, of course, is to conclude our efforts as the threat of the virus eases considerably.
A version of this article was originally published on NYT Open, The New York Times blog about designing and building news products.