psvr

Abstract

Digital data from the political sphere is abundant, omnipresent, and more and more directly accessible through the Internet. Project Vote Smart (PVS) is a prominent example of this big public data and covers various aspects of U.S. politics in astonishing detail. Despite the vast potential of PVS’ data for political science, economics, and sociology, it is hardly used in empirical research. The systematic compilation of semi-structured data can be complicated and time consuming as the data format is not designed for conventional scientific research. This paper presents a new tool that makes the data easily accessible to a broad scientific community. We provide the software called pvsR as an add-on to the R programming environment for statistical computing. This open source interface (OSI) serves as a direct link between a statistical analysis and the large PVS database. The free and open code is expected to substantially reduce the cost of research with PVS’ new big public data in a vast variety of possible applications. We discuss its advantages vis-à-vis traditional methods of data generation as well as already existing interfaces. The validity of the library is documented based on an illustration involving female representation in local politics. In addition, pvsR facilitates the replication of research with PVS data at low costs, including the pre-processing of data. Similar OSIs are recommended for other big public databases.

Introduction

In recent years, the dawn of the new discipline of ‘computational’ social science has been widely discussed (see, e.g., [1, 2, 3, 4]).
This has been brought about by increased computational power and immensely rich sources of digital data covering every-day human activities such as consumer histories from online shopping outlets, records of social interactions based on web platforms such as Facebook and Twitter, and medical histories via the digitization of health insurance processes, as well as tracks of geographic movements recorded by mobile applications on smart phones (these developments and the applications of related research methods are also referred to as eScience and data science). While such data offer exciting potential for quantitative research in diverse fields such as sociology, economics, political science and social psychology, the associated privacy concerns are not to be underestimated (see, e.g., [5, 6, 7]).

PLOS ONE | DOI:10.1371/journal.pone.0130501 | July 1, 2015
Citation: Matter U, Stutzer A (2015) pvsR: An Open Source Interface to Big Data on the American Political Sphere. PLoS ONE 10(7): e0130501. doi:10.1371/journal.pone.0130501
Academic Editor: José Javier Ramasco, Instituto de Fisica Interdisciplinar y Sistemas Complejos IFISC (CSIC-UIB), SPAIN
Received: August 14, 2014. Accepted: May 20, 2015. Published: July 1, 2015.
Copyright: © 2015 Matter, Stutzer. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability Statement: All relevant data are within the paper and its Supporting Information files or can be directly fetched from a web API as explicitly explained in the code examples throughout the paper.
Funding: Ulrich Matter acknowledges financial support from the WWZ Forum (FV-27, https://wwz.unibas.ch/wwz-forum/foerderverein-des-wwz/). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Competing Interests: The authors have declared that no competing interests exist.

There is, however, a comparatively smaller but nevertheless important domain of data sources which has received limited attention from the emerging computational social science community so far; i.e., big data on the public and political sphere. These data pose only minor ethical concerns, yet offer major advantages. A prominent example is the data offered by Project Vote Smart (PVS). PVS provides the public in the United States with detailed information on various political issues at all levels of government via its web platform www.votesmart.org and thereby maintains a detailed online data collection on multiple aspects of U.S. politics. The idea behind PVS’ web platform is to increase transparency of the political process and thereby facilitate voters’ decisions. An important aspect for social scientists is that PVS not only serves as a stage for elected officials but also for candidates running for a public office. Thereby, the central research subjects provide a large amount of data about themselves, including details on their biographical background and political opinions. Separately, large records of these politicians’ voting behavior and other actions in office are collected.

Despite the vast potential of PVS’ data for diverse fields in the social sciences, they have hardly been used for scientific studies. The reason is probably that their use is not straightforward. Technically, access is facilitated via an application programming interface (hereafter API). We understand the term (web) API as it is used in the web development context, i.e., a collection of defined HTTP requests and their respective response messages as documents in Extensible Markup Language (XML) or JavaScript Object Notation (JSON) format for the purpose of exchanging data over the Internet.
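To make this definition of a web API concrete, the following minimal Python sketch shows what one "defined HTTP request" and its JSON "response message" might look like. It is an illustration only: the base address `api.example.org`, the method name `Candidate.getBio`, the parameters `candidateId` and `o`, and the sample response are all hypothetical and are not taken from the PVS API documentation. No network access is involved.

```python
import json
from urllib.parse import urlencode

# Hypothetical base address, used for illustration only (not the PVS endpoint).
BASE_URL = "https://api.example.org"


def build_request_url(method, **params):
    """Return the URL of one API request: a method name plus query parameters."""
    query = urlencode(sorted(params.items()))
    return f"{BASE_URL}/{method}?{query}"


def decode_response(body):
    """Turn the JSON document in a response body into nested Python objects."""
    return json.loads(body)


# A made-up response body of the kind such a request might return.
SAMPLE_RESPONSE = (
    '{"candidate": {"id": "1", "name": {"first": "Jane", "last": "Doe"}}}'
)

# The method name and parameters below are assumptions for this sketch.
url = build_request_url("Candidate.getBio", candidateId=1, o="JSON")
record = decode_response(SAMPLE_RESPONSE)

print(url)
print(record["candidate"]["name"]["last"])
```

The decoded response is a nested (tree-like) object rather than a table, which is the property of web data that the rest of the paper is concerned with.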
More specifically, we refer here to APIs based on the representational state transfer (REST) principle (see [8] for a general introduction to RESTful web APIs). The term ‘web service’ is often used as a synonym for an API in the web development context and stresses the server-side implementation of APIs. Client-side applications, also referred to as API client libraries, are programs written to either manually or automatically interact with an API (i.e., in the form of sending requests to the API and handling the returned web data). Typical API client libraries in the context of web development facilitate the embedding of web data provided via the API in dynamic websites. API client libraries are often written in the programming languages that are frequently used in web development such as PHP or Java. The PVS API is predominantly provided for the web and software developers who write mobile applications and dynamic websites which embed PVS data. Due to the primary intended use of the data, its format is not designed for conventional scientific research. The data entries first need to be parsed and formatted in a (table-like) flat representation. The compilation of such semi-structured web data for scholarly analyses can thus be complicated and requires a scientific understanding of the information in the data as well as a solid computational background.

In this article, we introduce a software package called pvsR that automates the systematic compilation and transformation of data from PVS via its API for scientific analysis. The software is freely available as an R-package for academic/non-profit purposes. It is platform-independent and can be directly installed from the R command line with the command install.packages("pvsR") or downloaded from http://CRAN.R-project.org/package=pvsR. See [9] for details on the R-package. This free open source interface is expected to substantially reduce the costs of research using new big public data provided by PVS.
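The step of parsing tree-structured entries into a (table-like) flat representation can be sketched in a few lines. The example below is a language-neutral illustration written in Python with the standard library only; it is not pvsR's implementation, and the XML document, the tag names, and the helper names `flatten` and `to_table` are invented for this sketch.

```python
import xml.etree.ElementTree as ET

# Made-up XML of the tree-structured kind a web API might return.
SAMPLE_XML = """
<candidateList>
  <candidate>
    <candidateId>1</candidateId>
    <name><first>Jane</first><last>Doe</last></name>
    <office><title>Mayor</title></office>
  </candidate>
  <candidate>
    <candidateId>2</candidateId>
    <name><first>John</first><last>Roe</last></name>
  </candidate>
</candidateList>
"""


def flatten(element, prefix=""):
    """Map every leaf below `element` to a 'parent.child' column name."""
    row = {}
    for child in element:
        name = f"{prefix}{child.tag}"
        if len(child):  # the child has children of its own: descend
            row.update(flatten(child, prefix=name + "."))
        else:
            row[name] = (child.text or "").strip()
    return row


def to_table(xml_text, record_tag):
    """Return (columns, rows): one row per record, missing cells set to None."""
    root = ET.fromstring(xml_text)
    records = [flatten(rec) for rec in root.iter(record_tag)]
    columns = []
    for rec in records:
        for key in rec:
            if key not in columns:
                columns.append(key)
    rows = [[rec.get(col) for col in columns] for rec in records]
    return columns, rows


columns, rows = to_table(SAMPLE_XML, "candidate")
print(columns)
for row in rows:
    print(row)
```

Nested tags become dotted column names, and a record that lacks a branch (here the second candidate has no office) receives an empty cell, so that all records fit one rectangular table.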
We use the term Open Source Interface (hereafter OSI) to describe the API client libraries that are specifically tailored for social science research. Such libraries work as well-documented add-ons to statistical software packages such as R and offer well-guided high-level access to data from a web service. An OSI substantially facilitates the compilation of data drawn from an API that was not initially designed to provide data for statistical analysis, but was instead designed for integration in dynamic websites and smart phone applications. OSIs thus preprocess, flatten, bind, and link the raw tree-structured web data into convenient table-like formats for statistical analysis.

We use the term big data, as defined by Michael Franklin (in [10]: 4), for data that is “expensive to manage” and “hard to get value from”. While the size of such data (in terms of bytes) is a challenge, it is by far not the only one. The data format, for example, is often the biggest obstacle to analysis. In the case of big public data, the purpose for which the data is initially collected usually differs from the purpose of the later analysis, which often requires a completely different data structure. This is particularly the case for many recent studies in the social sciences that rely on big data collected from programmable web sources such as the Twitter API. There are, of course, other definitions of the term big data, depending on the academic field or the area of application (see, e.g., the expert survey conducted by the Berkeley School of Information: http://datascience.berkeley.edu/what-is-bigdata/). In particular, complexity might be added as a substantial attribute to the definition of big data. In the future, new public data sources from government offices might best be qualified as big public data due to this attribute.

The application of pvsR in social science research on U.S.
politics offers various advantages over existing data-gathering methods and research practices. First, pvsR provides instant access to highly granular data on all levels of government in the United States in a format that can be directly integrated in a statistical analysis. From the U.S. President to a local council member, data on officials is made easily accessible in the same format and includes detailed biographical information such as gender, professional experience, or religious affiliation. Second, compiling and coding data via pvsR (as well as via any other API-client for PVS) can replace the querying of candidates and officials which is often loaded with a specific survey context. The provision of the raw data to PVS by the individual candidates and officials happens completely independently of any later ‘computational surveying’ by a researcher via pvsR. The raw information is often not forced into predefined categorical answers and is less likely to suffer from response bias than ordinary survey data. Third, the application of pvsR automatically facilitates the replicability of the resulting research. This is a crucial aspect, as there is currently a paradoxical development regarding the transparency of research in the context of social science with big data from the Internet. With the increase in publicly available big data, there are also more studies which base their research on “unique” and “original” datasets. While this development is welcome, the new studies are harder to replicate than empirical analyses based on traditional sources from official statistics. Even if a dataset is made available, there are substantial up-front costs associated with data selection and data cleaning when a dataset has to be reconstructed or extended. In the case of research with PVS data, pvsR substantially reduces the costs of generating datasets for reproduction and replication. 
As pvsR starts at the data retrieval stage, the pre-processing of data is automatically included when research is reproduced. Simply providing the code of how pvsR was applied in a study is a sufficient documentation of what raw data was used and how it can be accessed for a reproduction or a replication of the study. We illustrate some of the advantages of our software approach compared to traditional methods of data generation based on an application on the representation of women in local politics. The developed arguments suggest the application of OSIs also in connection with other databases offering access via an API. It simplifies their use and allows studies to be replicated based on this new data.

A look at the emerging empirical literature that exploits publicly available big data on various aspects of society outside the political sphere indicates the potential of OSIs. In particular, popular social media sites such as Twitter, Facebook, and Weibo, as well as Wikipedia are increasingly used as data sources for quantitative analyses covering many aspects of society. Recent research has shown, for example, that stock market moves can be predicted by sentiment analysis of micro-blog messages [11], the analysis of Wikipedia usage patterns [12], as well as the analysis of specific Google search volumes [13]. In [12], Wikipedia usage patterns are analyzed as an approximation of investors’ information gathering to detect early signs of stock market moves. The authors present evidence suggesting that the number of page views of Wikipedia articles related to financial topics increases before stock market falls. In [14], new approaches are presented to measure labor market flows by creating indexes of job loss, job search, and job posting based on data from Twitter.
Other studies based on newly available big data aim, for example, at the quantification of an individual’s mood and happiness based on a sentiment analysis of tweets ([15, 16] and [17]) as well as the study of collective human attention on natural disasters via Flickr-tags [18].

The remainder of this paper is organized as follows. In Section 2, we present an overview regarding access to data via APIs. Moreover, we introduce the PVS API as well as the basic idea behind pvsR. In Section 3, we show how to work with pvsR and present a working example. This example documents how data gathering with pvsR supersedes existing data collection practices in political science. In Section 4, we demonstrate pvsR’s advantages in terms of replicability of research by replicating a part of an existing study based on PVS data. Section 5 provides some technical background to pvsR and Section 6 compares its core features to other API client libraries for PVS not particularly developed for social science research and written in other programming languages. The concluding discussion in Section 7 summarizes the advantages of pvsR and OSIs in general, based on a conceptual perspective, and points out what could be done to foster the provision of OSIs to big public data.

Background

The costs of gathering and distributing information for the public’s political engagement have significantly fallen with higher Internet penetration. Democratic movements now have numerous opportunities to pursue their political goals using the Internet and social media. As a consequence, entries in political blogs and tweets leave electronic traces which can be used to generate, for example, a new kind of data on political opinion formation and behavior [19], on political polarization [20], as well as on political discourse [21] (note that the author of [20] also contributes an OSI-like R package to compile data from the Twitter Streaming API [22]).
These data are explicitly meant to be publicly accessible (and the dissemination is approved by the individuals generating it). Moreover, the Internet has given rise to citizens’ groups and non-governmental organizations whose aim it is to make the democratic process more transparent. These bodies gather data, for example, on political candidates, public officials, and campaign finances in order to inform voters. This development offers researchers who want to investigate the public and political sphere various opportunities to undertake descriptive analyses as well as hypothesis testing. New descriptive insights into the structure of the political system are expected, as computationally intense methods from fields such as network science are applied to vast data sets (see, e.g., [23] who unveils the topology of legislators’ co-sponsorship networks based on all 280,000 items of legislation proposed in the U.S. Congress between 1973 and 2004). Regarding hypothesis testing, new light can be shed on existing theories, because more concepts and variables, such as those relating to campaign contributions and political behavior, can be empirically captured and measured.

Data generated by the processes described above are generally published on dynamic websites to inform the public on diverse political issues. The editing and presentation of the data is optimized for individual users interested in a particular issue or person. Hence, the normal user accesses the data through his or her web browser or a similar device with a graphical user interface (e.g., getting information on the candidates in a specific electoral district via PVS’s website). Gathering web data in such a manual way is also still a widespread practice in the social sciences.
Data is retrieved piecewise with a single query or repeated queries through a web browser, after which the data can then be stored in a spreadsheet to build a unique data set. However, web data sources can also be accessed increasingly through an API, which enables specific queries to be run programmatically on the data (instead of extracting the information from hundreds of web pages).

A prominent example is the initiative PVS (see http://www.votesmart.org). PVS is a non-profit organization engaged in informing citizens about U.S. politics. The organization is non-partisan and collects and distributes information ranging from local to federal elections and voting on ballot measures. Of particular relevance is the background information that it collects on candidates and elected officials. This involves information about their previous occupations, education, family life, organizational memberships and, importantly, also their voting behavior in key votes at the federal and state level. PVS also asks politicians about their political position, based on a standardized so-called Political Courage Test. Furthermore, information about campaign finances, interest group ratings of politicians and ballot measure descriptions for states with direct legislation is provided. PVS offers its data (subject to an annual registration fee) free of charge and provides a set of tools that allow users to integrate specific items of information in their web pages (see, e.g. [24]). Access is granted via an API that is built on the REST architectural style. Such APIs usually return data in a hierarchically (or tree-)structured XML or JSON format. Data in such a format can easily be embedded in other websites (i.e., so-called mashups) or implemented in mobile phone applications (in the case of PVS, e.g., AT&T’s 2010 VoterHub Mobile App; see http://voterhub.us/).
This is also the main motivation for non-governmental organizations like PVS for opening access to the public through an API. The dissemination of information is part of their mission to increase transparency in the political process. Besides PVS, there are various other NGOs offering access to public data on U.S. politics via an API. Examples are the Center for Responsive Politics (CRP), which provides (through its database OpenSecrets.org) detailed information on federal campaign contributions and lobbying activities in the United States (http://www.opensecrets.org/resources/create/apis.php), and MapLight, whose Bill Positions API offers access to data on organizations’ interests for legislative bills (http://maplight.org/apis/billpositions). A good example of a company that offers access to data on the public sphere is the New York Times (NYT). The APIs offered by NYT provide data on various aspects of U.S. federal and state politics, as well as on news and readers’ comments (http://developer.nytimes.com/docs/read/Home). Finally, as an example of an API provided by a public authority, the Parliamentary Services of Switzerland offer access to data on diverse parliamentary activities (http://www.parlament.ch/e/dokumentation/webservices-opendata/pages/default.aspx).

Fig 1 illustrates the conventional use of APIs as an integrated component of a dynamic website. The design of APIs facilitates web and software developers’ access to the data in order to embed the data in their applications. However, this design is not per se suited to systematic scientific data compilation and analysis (an exception is, of course, the type of APIs that are specifically designed by research institutes for the storage and exchange of data sets between academics). In most cases, the tree-structured data returned from such an API cannot be analyzed directly using standard statistical procedures.
Accordingly, another interface is needed to make such data easily accessible to researchers; i.e., an interface that translates the tree-structured data provided by the API to a corresponding flat data representation. This is also the case with the PVS API. To overcome this challenge, we provide pvsR as an OSI to the PVS API. Other OSI-like client libraries for other APIs are, e.g., the Sunlight Foundation’s Python library to the Influence Explorer API [25] and the R package WDI [26].

pvsR automates the data retrieval requests to the PVS API, handles HTTP and API errors as well as malformed XML, reshapes the data format, and stores it or makes it available for econometric analyses. In addition, several high-level functions are provided that combine different API methods and conveniently allow for fetching data on a specific data entity. Another interface is thus created which bridges data retrieval and analysis. Fig 2 illustrates this alternative use of APIs to web platforms such as PVS for social science research.
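The kind of defensive processing described above (coping with API error messages and malformed XML while binding the usable records into one data set) can be illustrated with a short Python sketch. It is a generic illustration under stated assumptions, not pvsR's code: the shape of the error document (`<error><errorMessage>…</errorMessage></error>`), the tag names, and the helper names `parse_response` and `collect` are assumptions made for this example.

```python
import xml.etree.ElementTree as ET


def parse_response(xml_text):
    """Return (records, problem); `problem` is None when parsing succeeded."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError as exc:
        # Malformed XML: report it instead of aborting the whole compilation.
        return [], f"malformed XML: {exc}"
    if root.tag == "error":
        # Assumed shape of an API-level error document.
        message = root.findtext("errorMessage") or "unknown"
        return [], f"API error: {message}"
    records = [
        {child.tag: (child.text or "").strip() for child in rec} for rec in root
    ]
    return records, None


def collect(responses):
    """Bind the records of all usable responses and log the failed ones."""
    rows, problems = [], []
    for index, text in enumerate(responses):
        records, problem = parse_response(text)
        rows.extend(records)
        if problem is not None:
            problems.append((index, problem))
    return rows, problems


# Three made-up response bodies: one valid, one API error, one truncated.
responses = [
    "<bio><candidate><id>1</id><gender>Female</gender></candidate></bio>",
    "<error><errorMessage>No data</errorMessage></error>",
    "<bio><candidate><id>2</id>",
]
rows, problems = collect(responses)
print(rows)
print(problems)
```

Keeping a log of failed requests alongside the compiled rows lets a researcher see which parts of a data set are missing because of the data source rather than because of the query.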

OSI programs, like pvsR, build the missing link between a non-scientific API (or web service) providing public data and the statistical analysis of this data. OSIs have a remarkable capacity to facilitate researchers’ access to big public data sources in the programmable web. Moreover, by simply referring to the archived OSI and supplying a short documentation of how it was used, the cost for replicating an entire empirical analysis can be reduced dramatically—not only for the statistical computations, but also for the retrieval, compilation and preparation of the data. In a broad sense, this implies a quantitative analysis architecture that allows for replicable research at low costs. The application of an OSI to public data in order to make research replicable takes the ideas proposed in [48] one step further. The published code implicitly provides the analysis and the data. OSIs might thus partly supersede the duplication of big data in journal archives in the cases where the raw data is compiled via an API. The current best practice of data availability policy is to store the final data sets centrally with a publisher or an association. Despite rapid growth in storage capacity, this practice is costly, particularly if the final data is itself big data. With OSIs, the datasets are built on demand, and it is sufficient that journals store the protocols to document the data generation process. The existing problem emphasized in [49] regarding the non-availability of important publications’ primary raw data could be alleviated. In sum, the proposal is thus for an empirical analysis architecture with OSIs as an integral component rather than individual data retrieval and compilation. These OSIs must be provided with some basic information so that they can be properly referenced. 
In particular, the following items of information should be included: author, year of publication, title of the program, a short note on the type (i.e., R package) and the version of the program, and a link to the code repository (an example could be: Ulrich Matter (2014). pvsR: An R package to interact with the Project Vote Smart API for scientific research. R package version 0.3. http://CRAN.R-project.org/package=pvsR).

In this paper, we introduced an OSI for the big database of the organization PVS for the R statistical computing environment. Other interesting data sources offering APIs that could be more easily used if OSIs were available are LegiScan (http://legiscan.com/legiscan), ProPublica’s Free the Files API (https://projects.propublica.org/free-the-files/api), and Buhl & Rasmussen’s API for European Union legislation (http://api.epdb.eu/), just to mention a few. The web platform programmableweb (http://www.programmableweb.com) offers a large directory of web APIs from all over the world. Around 290 APIs listed in programmableweb’s directory concern the public and political sphere. For many of them no suitable OSI is yet available for scientific use.

However, how will this “service-oriented science” [50] with OSIs for easier data access come about? This is an important question for the organization of scientific research, not least because OSIs might well increase the multiplier effect associated with publicly-funded research based on big public data. Research that uses this data is a challenge for funding agencies if data preparation consumes a large fraction of the budgeted funds. If programs for data retrieval and compilation tailored for social scientists are written in and for open source software, they can easily be used by other researchers to leverage their skills and capacity.
Funding agencies might therefore welcome OSIs as a by-product of the funded research in order to create additional positive spillovers.

Overall, we would like to offer three suggestions for improving the conditions for the application of OSI packages. First, it is still necessary that researchers receive greater acknowledgement for freely providing research services that enable other researchers to replicate or extend scientific work. Public archives for service software like OSIs are an important step in that direction. Second, in the age of big public data from the programmable web, exact reproduction based on ‘unique’ novel data sets should be contrasted with replication that includes raw data compilation based on OSIs. Finally, services have to be rewarded in the same currency as original research, i.e. with citations. Once these conditions are secured, the opportunities presented by new big public data from the programmable web may well spur social science research to new heights.
