psvr
Abstract
Digital data from the political sphere is abundant, omnipresent, and more and more directly
accessible through the Internet. Project Vote Smart (PVS) is a prominent example of this
big public data and covers various aspects of U.S. politics in astonishing detail. Despite the
vast potential of PVS’ data for political science, economics, and sociology, it is hardly used
in empirical research. The systematic compilation of semi-structured data can be complicated and time consuming as the data format is not designed for conventional scientific
research. This paper presents a new tool that makes the data easily accessible to a broad
scientific community. We provide the software called pvsR as an add-on to the R programming environment for statistical computing. This open source interface (OSI) serves as a
direct link between a statistical analysis and the large PVS database. The free and open
code is expected to substantially reduce the cost of research with PVS’ new big public data
in a vast variety of possible applications. We discuss its advantages vis-à-vis traditional
methods of data generation as well as already existing interfaces. The validity of the library
is documented based on an illustration involving female representation in local politics. In
addition, pvsR facilitates the replication of research with PVS data at low costs, including
the pre-processing of data. Similar OSIs are recommended for other big public databases.
Citation: Matter U, Stutzer A (2015) pvsR: An Open Source Interface to Big Data on the American Political Sphere. PLoS ONE 10(7): e0130501. doi:10.1371/journal.pone.0130501
Academic Editor: José Javier Ramasco, Instituto de Fisica Interdisciplinar y Sistemas Complejos IFISC (CSIC-UIB), SPAIN
Received: August 14, 2014
Accepted: May 20, 2015
Published: July 1, 2015
Copyright: © 2015 Matter, Stutzer. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability Statement: All relevant data are within the paper and its Supporting Information files or can be directly fetched from a web API as explicitly explained in the code examples throughout the paper.
Funding: Ulrich Matter acknowledges financial support from the WWZ Forum (FV-27, https://wwz.unibas.ch/wwz-forum/foerderverein-des-wwz/). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Competing Interests: The authors have declared that no competing interests exist.
Introduction
In recent years, the dawn of the new discipline of ‘computational’ social science has been widely
discussed (see, e.g., [1, 2, 3, 4]). This has been brought about by increased computational power
and immensely rich sources of digital data covering every-day human activities such as consumer histories from online shopping outlets, records of social interactions based on web platforms such as Facebook and Twitter, and medical histories via the digitization of health
insurance processes, as well as tracks of geographic movements recorded by mobile applications on smart phones (these developments and the applications of related research methods
are also referred to as eScience and data science). While such data offer exciting potential for
quantitative research in diverse fields such as sociology, economics, political science and social
psychology, the associated privacy concerns are not to be underestimated (see, e.g., [5, 6, 7]).
There is, however, a comparatively smaller but nevertheless important domain of data sources
which has received limited attention from the emerging computational social science community so far; i.e., big data on the public and political sphere. These data pose only minor ethical
concerns, yet offer major advantages.
A prominent example is the data offered by Project Vote Smart (PVS). PVS provides the
public in the United States with detailed information on various political issues at all levels of
government via its web platform www.votesmart.org and thereby maintains a detailed online
data collection on multiple aspects of U.S. politics. The idea behind PVS’ web platform is to
increase transparency of the political process and thereby facilitate voters’ decisions. An important aspect for social scientists is that PVS not only serves as a stage for elected officials but also
for candidates running for a public office. Thereby, the central research subjects provide a large
amount of data about themselves, including details on their biographical background and political opinions. Separately, large records of these politicians’ voting behavior and other actions in
office are collected. Despite the vast potential of PVS’ data for diverse fields in the social sciences, they have hardly been used for scientific studies. The reason is probably that their use is
not straightforward. Technically, access is facilitated via an application programming interface
(hereafter API).
We understand the term (web) API as it is used in the web development context, i.e., a collection of defined HTTP requests and their respective response messages as documents in
Extensible Markup Language (XML) or JavaScript Object Notation (JSON) format for the purpose of exchanging data over the Internet. More specifically, we refer here to APIs based on the
representational state transfer (REST) principle (see [8] for a general introduction to RESTful
web APIs). The term ‘web service’ is often used as synonym for an API in the web development
context and stresses the server-side implementation of APIs. Client-side applications, also
referred to as API client libraries, are programs written to either manually or automatically
interact with an API (i.e., in the form of sending requests to the API and handling the returned
web data). Typical API client libraries in the context of web development facilitate the embedding of web data provided via the API in dynamic websites. API client libraries are often written in the programming languages that are frequently used in web development such as PHP
or Java.
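The request/response pattern described above can be sketched in a few lines. The example is written in Python with the standard library only; the host api.example.org, the method name, the parameters, and the sample XML response are all invented for illustration and are not taken from the PVS API documentation.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

BASE_URL = "https://api.example.org"  # hypothetical host, not a real service


def build_request_url(method, **params):
    """Compose the URL of an HTTP GET request for one API method."""
    query = urlencode(sorted(params.items()))  # sorted for a stable URL
    return f"{BASE_URL}/{method}?{query}"


# What a server might send back for such a request (made-up content).
SAMPLE_RESPONSE = """<officials>
  <official><id>1</id><lastName>Doe</lastName></official>
  <official><id>2</id><lastName>Roe</lastName></official>
</officials>"""


def parse_last_names(xml_text):
    """Extract one field from every record in the XML response document."""
    root = ET.fromstring(xml_text)
    return [node.findtext("lastName") for node in root.iter("official")]


url = build_request_url("officials", state="NY", key="DEMO_KEY")
names = parse_last_names(SAMPLE_RESPONSE)
print(url)    # https://api.example.org/officials?key=DEMO_KEY&state=NY
print(names)  # ['Doe', 'Roe']
```

An API client library essentially wraps these two steps, composing the request and handling the returned document, behind functions that are easier to call.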
The PVS API is predominantly provided for web and software developers who write
mobile applications and dynamic websites which embed PVS data. Due to the primary
intended use of the data, its format is not designed for conventional scientific research. The
data entries first need to be parsed and formatted in a (table-like) flat representation. The compilation of such semi-structured web data for scholarly analyses can thus be complicated and
requires a scientific understanding of the information in the data as well as a solid computational background.
In this article, we introduce a software package called pvsR that automates the systematic
compilation and transformation of data from PVS via its API for scientific analysis. The software is freely available as an R-package for academic/non-profit purposes. It is platform-independent and can be directly installed from the R command line with the command
install.packages('pvsR') or downloaded from http://CRAN.R-project.org/package=pvsR. See [9] for details on the R-package. This free open source interface is expected
to substantially reduce the costs of research using new big public data provided by PVS.
We use the term Open Source Interface (hereafter OSI) to describe the API client libraries
that are specifically tailored for social science research. Such libraries work as well-documented
add-ons to statistical software packages such as R and offer well-guided high-level access to
data from a web service. An OSI substantially facilitates the compilation of data drawn from an
API that was not initially designed to provide data for statistical analysis, but was instead
designed for integration in dynamic websites and smart phone applications. OSIs thus
preprocess, flatten, bind, and link the raw tree-structured web data into convenient table-like
formats for statistical analysis. We use the term big data, as defined by Michael Franklin (in
[10]: 4), for data that is “expensive to manage” and “hard to get value from”. While the size of
such data (in terms of bytes) is a challenge, it is by far not the only one. The data format, for
example, is often the biggest obstacle to analysis. In the case of big public data, the purpose for which the data was initially collected usually differs from the purpose of the later analysis,
often requiring a completely different data structure. This is particularly the case for
many recent studies in the social sciences that rely on big data collected from programmable
web sources such as the Twitter API. There are, of course, other definitions of the term big
data, depending on the academic field or the area of application (see, e.g., the expert survey
conducted by the Berkeley School of Information: http://datascience.berkeley.edu/what-is-bigdata/). In particular, complexity might be added as a substantial attribute to the definition of
big data. In the future, new public data sources from government offices might best be qualified
as big public data due to this attribute.
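The "bind" and "link" steps that an OSI performs can be illustrated with a minimal sketch, assuming the records have already been flattened into rows. It is written in Python with plain dictionaries; the function names, field names, and values are hypothetical and do not reproduce pvsR's implementation.

```python
def bind_rows(*tables):
    """Stack row lists; fields missing from a row are filled with None."""
    columns = []
    for table in tables:
        for row in table:
            for name in row:
                if name not in columns:
                    columns.append(name)
    return [{name: row.get(name) for name in columns}
            for table in tables for row in table]


def link(left, right, key):
    """Left-join two tables on a shared key column."""
    lookup = {row[key]: row for row in right}
    return [{**row, **lookup.get(row[key], {})} for row in left]


# Rows as they might come back from two separate requests (made-up values).
bios_a = [{"id": "1", "name": "Doe", "gender": "F"}]
bios_b = [{"id": "2", "name": "Roe"}]           # no gender reported
votes = [{"id": "1", "yea_votes": 12}]          # a second data source

bound = bind_rows(bios_a, bios_b)    # one table with a common set of columns
linked = link(bound, votes, "id")    # vote counts attached where available
```

Rows without a match in the second table keep only their original columns here; a fuller implementation would pad them so that every row has the same fields.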
The application of pvsR in social science research on U.S. politics offers various advantages
over existing data-gathering methods and research practices. First, pvsR provides instant
access to highly granular data on all levels of government in the United States in a format that
can be directly integrated in a statistical analysis. From the U.S. President to a local council
member, data on officials is made easily accessible in the same format and includes detailed
biographical information such as gender, professional experience, or religious affiliation. Second, compiling and coding data via pvsR (as well as via any other API-client for PVS) can
replace the querying of candidates and officials which is often loaded with a specific survey
context. The provision of the raw data to PVS by the individual candidates and officials happens completely independently of any later ‘computational surveying’ by a researcher via
pvsR. The raw information is often not forced into predefined categorical answers and is less
likely to suffer from response bias than ordinary survey data. Third, the application of pvsR
automatically facilitates the replicability of the resulting research. This is a crucial aspect, as
there is currently a paradoxical development regarding the transparency of research in the context of social science with big data from the Internet. With the increase in publicly available big
data, there are also more studies which base their research on “unique” and “original” datasets.
While this development is welcome, the new studies are harder to replicate than empirical analyses based on traditional sources from official statistics. Even if a dataset is made available,
there are substantial up-front costs associated with data selection and data cleaning when a
dataset has to be reconstructed or extended. In the case of research with PVS data, pvsR substantially reduces the costs of generating datasets for reproduction and replication. As pvsR
starts at the data retrieval stage, the pre-processing of data is automatically included when
research is reproduced. Simply providing the code of how pvsR was applied in a study is a sufficient documentation of what raw data was used and how it can be accessed for a reproduction
or a replication of the study. We illustrate some of the advantages of our software approach
compared to traditional methods of data generation based on an application on the representation of women in local politics.
The arguments developed above suggest applying OSIs to other databases that offer access via an API as well. An OSI simplifies their use and allows studies based
on this new data to be replicated. A look at the emerging empirical literature that exploits publicly available big
data on various aspects of society outside the political sphere indicates the potential of OSIs. In
particular, popular social media sites such as Twitter, Facebook, and Weibo, as well as Wikipedia are increasingly used as data sources for quantitative analyses covering many aspects of
society. Recent research has shown, for example, that stock market moves can be predicted by
sentiment analysis of micro-blog messages [11], the analysis of Wikipedia usage patterns [12],
as well as the analysis of specific Google search volumes [13]. In [12], Wikipedia usage patterns
are analyzed as an approximation of investors’ information gathering to detect early signs of
stock market moves. The authors present evidence suggesting that the number of page views of
Wikipedia articles related to financial topics increases before stock market falls. In [14], new
approaches are presented to measure labor market flows by creating indexes of job loss, job
search, and job posting based on data from Twitter. Other studies based on newly available big
data aim, for example, at the quantification of an individual’s mood and happiness based on a
sentiment analysis of tweets ([15, 16] and [17]) as well as the study of collective human attention on natural disasters via Flickr-tags [18].
The remainder of this paper is organized as follows. In Section 2, we present an overview
regarding access to data via APIs. Moreover, we introduce the PVS API as well as the basic idea
behind pvsR. In Section 3, we show how to work with pvsR and present a working example.
This example documents how data gathering with pvsR supersedes existing data collection
practices in political science. In Section 4, we demonstrate pvsR’s advantages in terms of replicability of research by replicating a part of an existing study based on PVS data. Section 5 provides some technical background to pvsR and Section 6 compares its core features to other
API client libraries for PVS not particularly developed for social science research and written
in other programming languages. The concluding discussion in Section 7 summarizes the
advantages of pvsR and OSIs in general, based on a conceptual perspective, and points out
what could be done to foster the provision of OSIs to big public data.
Background
The costs of gathering and distributing information for the public’s political engagement have
significantly fallen with higher Internet penetration. Democratic movements now have numerous opportunities to pursue their political goals using the Internet and social media. As a consequence, entries in political blogs and tweets leave electronic traces which can be used to
generate, for example, a new kind of data on political opinion formation and behavior [19], on
political polarization [20], as well as on political discourse [21] (note that the author of [20]
also contributes an OSI-like R package to compile data from the Twitter Streaming API [22]).
These data are explicitly meant to be publicly accessible (and the dissemination is approved by
the individuals generating it). Moreover, the Internet has given rise to citizens’ groups and
non-governmental organizations whose aim it is to make the democratic process more transparent. These bodies gather data, for example, on political candidates, public officials, and campaign finances in order to inform voters.
This development offers researchers who want to investigate the public and political sphere
various opportunities to undertake descriptive analyses as well as hypothesis testing. New
descriptive insights into the structure of the political system are expected, as computationally
intense methods from fields such as network science are applied to vast data sets (see, e.g., [23]
who unveils the topology of legislators’ co-sponsorship networks based on all 280,000 items of
legislation proposed in the U.S. Congress between 1973 and 2004). Regarding hypothesis testing, new light can be shed on existing theories, because more concepts and variables, such as
those relating to campaign contributions and political behavior, can be empirically captured
and measured.
Data generated by the processes described above are generally published on dynamic websites to inform the public on diverse political issues. The editing and presentation of the data is
optimized for individual users interested in a particular issue or person. Hence, the normal
user accesses the data through his or her web browser or a similar device with a graphical user
interface (e.g., getting information on the candidates in a specific electoral district via PVS’s
website). Gathering web data in such a manual way is also still a widespread practice in the
social sciences. Data is retrieved piecewise with a single query or repeated queries through a
web browser and then stored in a spreadsheet to build a unique data
set.
However, web data sources can also be accessed increasingly through an API, which enables
specific queries to be run programmatically on the data (instead of extracting the information
from hundreds of web pages). A prominent example is the initiative PVS (see http://www.votesmart.org).
PVS is a non-profit organization engaged in informing citizens about U.S. politics. The organization is non-partisan and collects and distributes information ranging from
local to federal elections and voting on ballot measures. Of particular relevance is the background information that it collects on candidates and elected officials. This involves information about their previous occupations, education, family life, organizational memberships and,
importantly, also their voting behavior in key votes at the federal and state level. PVS also asks
politicians about their political position, based on a standardized so-called Political Courage
Test. Furthermore, information about campaign finances, interest group ratings of politicians
and ballot measure descriptions for states with direct legislation is provided. PVS offers its data
(subject to an annual registration fee) free of charge and provides a set of tools that allow users
to integrate specific items of information in their web pages (see, e.g. [24]). Access is granted
via an API that is developed on the REST architectural style. Such APIs usually return data in a
hierarchically (or tree-)structured XML or JSON format. Data in such a format can easily be
embedded in other websites (i.e., so-called mashups) or implemented in mobile phone applications (in the case of PVS, e.g., AT&T’s 2010 VoterHub Mobile App; see http://voterhub.us/).
This is also the main motivation for non-governmental organizations like PVS for opening
access to the public through an API. The dissemination of information is part of their mission
to increase transparency in the political process. Besides PVS, there are various other NGOs
offering access to public data on U.S. politics via an API. Examples are the Center for Responsive Politics (CRP), which provides (through its database OpenSecrets.org) detailed information on federal campaign contributions and lobbying activities in the United States (http://www.opensecrets.org/resources/create/apis.php), and MapLight, whose Bill Positions API offers
access to data on organizations’ interests for legislative bills (http://maplight.org/apis/billpositions). A good example of a company that offers access to data on the public sphere is the
New York Times (NYT). The APIs offered by NYT provide data on various aspects of U.S. federal and state politics, as well as on news and readers’ comments (http://developer.nytimes.com/docs/read/Home).
Finally, as an example of an API provided by a public authority, the
Parliamentary Services of Switzerland offer access to data on diverse parliamentary activities
(http://www.parlament.ch/e/dokumentation/webservices-opendata/pages/default.aspx). Fig 1
illustrates the conventional use of APIs as an integrated component of a dynamic website.
The design of APIs facilitates web and software developers’ access to the data in order to
embed the data in their applications. However, this design is not per se suited to systematic scientific data compilation and analysis (an exception is, of course, the type of API that is specifically designed by research institutes for the storage and exchange of data sets between
academics). In most cases, the tree-structured data returned from such an API cannot be analyzed directly using standard statistical procedures. Accordingly, another interface is needed to
make such data easily accessible to researchers; i.e., an interface that translates the tree-structured data provided by the API to a corresponding flat data representation. This is also the case
with the PVS API. To overcome this challenge, we provide pvsR as an OSI to the PVS API.
Other OSI-like client libraries for other APIs are, e.g., the Sunlight Foundation’s Python library
to the Influence Explorer API [25] and the R package WDI [26].
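The translation from a tree-structured response to a flat representation can be sketched as follows. This is a minimal Python illustration using the standard library, not pvsR's actual code; the XML document, its tag names, and the dotted column-naming scheme are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# A made-up tree-structured response with nested and optional elements.
RESPONSE = """<bio>
  <candidate>
    <id>1</id>
    <name><first>Jane</first><last>Doe</last></name>
    <office><title>Mayor</title></office>
  </candidate>
  <candidate>
    <id>2</id>
    <name><first>Rick</first><last>Roe</last></name>
  </candidate>
</bio>"""


def flatten(node, prefix=""):
    """Map every leaf below `node` to a dotted column name."""
    row = {}
    for child in node:
        name = prefix + child.tag
        if len(child):                       # inner node: descend further
            row.update(flatten(child, name + "."))
        else:                                # leaf: keep its text
            row[name] = (child.text or "").strip()
    return row


def to_table(xml_text, record_tag):
    """One flat row per record element; absent fields become None."""
    rows = [flatten(rec) for rec in ET.fromstring(xml_text).iter(record_tag)]
    columns = sorted({col for row in rows for col in row})
    return columns, [[row.get(col) for col in columns] for row in rows]


columns, table = to_table(RESPONSE, "candidate")
print(columns)  # ['id', 'name.first', 'name.last', 'office.title']
print(table)    # [['1', 'Jane', 'Doe', 'Mayor'], ['2', 'Rick', 'Roe', None]]
```

The sketch assumes that sibling elements within a record have distinct tags. Repeated siblings (for example, several offices per candidate) would overwrite each other and would need their own table, which is where the linking of tables by an identifier comes in.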
pvsR automates the data retrieval requests to the PVS API, handles HTTP and API errors
as well as malformed XML, reshapes the data format, and stores it or makes it available for
econometric analyses. In addition, several high-level functions are provided that combine different API methods and conveniently allow for fetching data on a specific data entity. Another
interface is thus created which bridges data retrieval and analysis. Fig 2 illustrates this alternative use of APIs to web platforms such as PVS for social science research.
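The kind of error handling mentioned above (HTTP errors, API errors, malformed XML) might look roughly like the following Python sketch. It operates on already received (status, body) pairs so that it runs without network access. The error format with an errorMessage element and all function names are assumptions for illustration, not a description of how pvsR or the PVS API behave.

```python
import xml.etree.ElementTree as ET


def read_response(status, body):
    """Return (records, problem); problem is None when parsing succeeded."""
    if status != 200:
        return [], f"HTTP error {status}"
    try:
        root = ET.fromstring(body)
    except ET.ParseError as exc:
        return [], f"malformed XML: {exc}"
    if root.tag == "error":                  # assumed API-level error format
        return [], "API error: " + (root.findtext("errorMessage") or "unknown")
    records = [{field.tag: field.text for field in rec} for rec in root]
    return records, None


def collect(responses):
    """Process a batch of responses, logging failures instead of stopping."""
    table, log = [], []
    for request_id, (status, body) in responses.items():
        records, problem = read_response(status, body)
        table.extend(records)
        if problem:
            log.append((request_id, problem))
    return table, log


# Four made-up responses: one valid, one HTTP failure, one truncated
# document, and one well-formed document that carries an API error.
responses = {
    "a": (200, "<list><item><id>1</id></item></list>"),
    "b": (404, ""),
    "c": (200, "<list><item><id>2</id>"),
    "d": (200, "<error><errorMessage>No data</errorMessage></error>"),
}
table, log = collect(responses)
```

Collecting problems in a log rather than aborting means that a long series of requests still yields a data set, with a record of which requests need to be repeated or checked.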
Concluding discussion
OSI programs, like pvsR, build the missing link between a non-scientific API (or web service) providing public data and the statistical analysis of this data. OSIs have a remarkable capacity to facilitate researchers’ access to big public data sources in the programmable web. Moreover, by simply referring to the archived OSI and supplying a short documentation of how it was used, the cost for replicating an entire empirical analysis can be reduced dramatically, not only for the statistical computations, but also for the retrieval, compilation and preparation of the data. In a broad sense, this implies a quantitative analysis architecture that allows for replicable research at low costs.
The application of an OSI to public data in order to make research replicable takes the ideas proposed in [48] one step further. The published code implicitly provides the analysis and the data. OSIs might thus partly supersede the duplication of big data in journal archives in the cases where the raw data is compiled via an API. The current best practice of data availability policy is to store the final data sets centrally with a publisher or an association. Despite rapid growth in storage capacity, this practice is costly, particularly if the final data is itself big data. With OSIs, the datasets are built on demand, and it is sufficient that journals store the protocols to document the data generation process. The existing problem emphasized in [49] regarding the non-availability of important publications’ primary raw data could be alleviated.
In sum, the proposal is thus for an empirical analysis architecture with OSIs as an integral component rather than individual data retrieval and compilation. These OSIs must be provided with some basic information so that they can be properly referenced. In particular, the following items of information should be included: author, year of publication, title of the program, a short note on the type (e.g., R package) and the version of the program, and a link to the code repository (an example could be: Ulrich Matter (2014). pvsR: An R package to interact with the Project Vote Smart API for scientific research. R package version 0.3. http://CRAN.R-project.org/package=pvsR).
In this paper, we introduced an OSI for the big database of the organization PVS for the R statistical computing environment. Other interesting data sources offering APIs that could be more easily used if OSIs were available are LegiScan (http://legiscan.com/legiscan), ProPublica’s Free the Files API (https://projects.propublica.org/free-the-files/api), and Buhl & Rasmussen’s API for European Union legislation (http://api.epdb.eu/), just to mention a few. The web platform programmableweb (http://www.programmableweb.com) offers a large directory of web APIs from all over the world. Around 290 APIs listed in programmableweb’s directory concern the public and political sphere. For many of them no suitable OSI is yet available for scientific use.
However, how will this “service-oriented science” [50] with OSIs for easier data access come about? This is an important question for the organization of scientific research, not least because OSIs might well increase the multiplier effect associated with publicly-funded research based on big public data. Research that uses this data is a challenge for funding agencies if data preparation consumes a large fraction of the budgeted funds. If programs for data retrieval and compilation tailored for social scientists are written in and for open source software, they can easily be used by other researchers to leverage their skills and capacity. Funding agencies might therefore welcome OSIs as a by-product of the funded research in order to create additional positive spillovers.
Overall, we would like to offer three suggestions for improving the conditions for the application of OSI packages. First, it is still necessary that researchers receive greater acknowledgement for freely providing research services that enable other researchers to replicate or extend scientific work. Public archives for service software like OSIs are an important step in that direction. Second, in the age of big public data from the programmable web, exact reproduction based on ‘unique’ novel data sets should be contrasted with replication including raw data compilation based on OSIs. Finally, services have to be rewarded in the same currency as original research, i.e. with citations. Once these conditions are secured, the opportunities presented by new big public data from the programmable web may well spur social science research to new heights.