Authors: Susan Burtner*, UC Santa Barbara
Topics: Quantitative Methods
Keywords: word embeddings, data harmonization, microdata
Session Type: Paper
Start / End Time: 1:20 PM / 3:00 PM
Room: Grand Ballroom A, Astor, 2nd Floor
Presentation File: No File Uploaded
Questions in a demographic survey often come with an exhaustive list of possible responses. Institutions that collect surveys from several different areas (regionally or globally) may then attempt to harmonize the data by manually recoding labels until a desired level of compatibility is reached. The IPUMS International dataset provides both unharmonized and harmonized data so that users may choose the form that best serves their needs. The harmonized data conform to a comparable coding scheme designed by IPUMS researchers, while the unharmonized data provide the microdata records in English with little or no restructuring. Outside researchers who need the unharmonized data must therefore do the harmonizing themselves, which can prove difficult when dealing with multiple variables and data on a global scale. The goal of this project is to borrow from work on word embeddings and map values from several variables and samples into a vector space from which the strength of similarity between values or labels can be deduced. This can then lead to a global perspective on harmonized categories that is informed by similarities among attributes within categories as well as across categories of interest.
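The idea of placing response labels in a vector space and reading similarity off the vectors can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the labels are invented rather than taken from IPUMS International, the names `embed`, `cosine`, and `best_matches` are hypothetical, and a simple token-count vector stands in for a learned word embedding. It is not the method used in this project.

```python
import math
from collections import Counter

# Invented response labels from two hypothetical unharmonized samples.
labels = {
    "sample_a": ["currently married or in union", "single never married", "widowed"],
    "sample_b": ["currently married", "never married", "widow or widower"],
}

def embed(label):
    """Token-count vector; a stand-in for a learned word embedding."""
    return Counter(label.lower().split())

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[t] * v[t] for t in u)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def best_matches(source, target):
    """For each source label, return the most similar target label and its score."""
    result = {}
    for s in source:
        score, match = max((cosine(embed(s), embed(t)), t) for t in target)
        result[s] = (match, score)
    return result

matches = best_matches(labels["sample_a"], labels["sample_b"])
for label, (match, score) in matches.items():
    print(f"{label!r} -> {match!r} (similarity {score:.2f})")
```

In this toy setup "widowed" shares no token with "widow or widower", so its similarity score is zero. Exact token overlap cannot capture that kind of relationship, which is one motivation for using learned embeddings instead.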