An Approach to Interactive Model Development on Big Data

Authors: Benjamin Lewis, Harvard University, Devika Kakkar, Harvard University, Weihe Guan*, Harvard University, Ryan Enos, Harvard University, Jacob R Brown, Harvard University
Topics: Geographic Information Science and Systems, Cyberinfrastructure, Quantitative Methods
Keywords: ‘big data’, gis, geospatial, ‘social science’, election, voter, gpu, postgis, omnisci, cuda, nosql, analytics, analysis, parallel, geohash, ‘open source’, ‘apache arrow’
Session Type: Virtual Paper
Day: 4/9/2021
Start / End Time: 8:00 AM / 9:15 AM
Room: Virtual 22
Presentation File: No File Uploaded

Increasingly, social scientists must work on problems that involve big data, such as voter or social media data in raw form. Such datasets can contain millions of records and may generate billions during modeling. At this scale, analysis requires new tools and approaches, especially when geospatial visualization is required.

The goal of this work is to let social scientists work with big datasets the way they work with small ones: interactively, with model iterations reduced from days to minutes or even seconds, even on billions of records. Beyond faster modeling, an important goal is faster creation of visualizations, so that researchers can explore large raw datasets before and during model development.

For this project, researchers had individual voter data and wanted to explore partisan geographic sorting at a level of detail not previously studied. To support this, we combined two platforms. First, a custom PostGIS configuration was developed to support K-nearest neighbor clustering on a dataset of 180 million voter records with K=1000, resulting in 180 billion calculations. Second, to model and visualize the results, we used OmniSci Immerse, a GPU-based analytics platform, and rewrote the original R model using cuDF (the CUDA GPU DataFrame library).
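As a rough illustration of what a K-nearest neighbor lookup computes, the following minimal Python sketch finds the nearest neighbors of one record by brute force. It is not the PostGIS configuration described above, which is not detailed in this abstract; the function names, the haversine distance, and the four sample coordinates are assumptions chosen only for illustration.

```python
import heapq
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, used by the haversine formula


def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    h = math.sin(dlat / 2) ** 2 + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def k_nearest(points, index, k):
    """Return the indices of the k points closest to points[index], nearest first."""
    origin = points[index]
    others = (i for i in range(len(points)) if i != index)
    return heapq.nsmallest(k, others, key=lambda i: haversine_km(origin, points[i]))


# Hypothetical sample locations (lat, lon); not taken from the voter dataset.
voters = [
    (42.3736, -71.1097),  # Cambridge, MA
    (42.3601, -71.0589),  # Boston, MA
    (40.7128, -74.0060),  # New York, NY
    (41.8240, -71.4128),  # Providence, RI
]

nearest = k_nearest(voters, 0, 2)
```

A brute-force scan like this compares each record against every other record, which is impractical at 180 million records; a spatial index is the usual way to avoid that cost at scale.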

A rate of 200,000 calculations per second was achieved with PostGIS. Partisan weighting in OmniSci proved to be 200 times faster than in R. The researchers can now re-run models and visualize results in a manner that is close to interactive.
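To put the reported PostGIS rate in context, the short calculation below estimates how long the full nearest-neighbor job would take. It assumes, purely for illustration, that the 200,000 calculations per second rate is sustained across all 180 billion calculations; the abstract does not report the total run time.

```python
# Figures reported in the abstract.
RECORDS = 180_000_000          # voter records
K = 1_000                      # neighbors per record
RATE_PER_SECOND = 200_000      # PostGIS calculations per second

# Assumption: the rate holds for the entire job.
total_calculations = RECORDS * K
seconds = total_calculations / RATE_PER_SECOND
days = seconds / 86_400        # seconds in one day

print(f"{total_calculations:,} calculations, about {days:.1f} days")
```

Under that assumption the neighbor search is a one-time batch step measured in days, while the interactive iteration described here applies to the modeling and visualization that follow it.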
