Data Handling: Import, Cleaning and Visualisation

Course Content

Short summary

This course introduces students to the fundamental practices of Data Science in the context of economic research. The course covers basic theoretical concepts and practical skills in gathering, preparing/cleaning, visualizing, storing, and analyzing digital data for research purposes.

Description

The increasing abundance of digital data covering everyday human activities offers opportunities and poses challenges for empirical research in economics and the social sciences at large. Data used in economics come more and more often from novel digital sources (e.g., social media, web applications, or sensors), in diverse formats (e.g., JSON, unstructured text), and in large quantities. In order to engage effectively and efficiently with these developments, economists need a basic understanding of data technologies and practical skills in working with digital data.

This course covers basic theoretical concepts and practical skills in (automatically) gathering, preparing, visualizing, and storing digital data for research purposes. It thus covers the crucial first steps underlying empirical research projects. These steps are often rather neglected in traditional social science methodology but are of great relevance in the age of Big Data; this course aims to fill this gap and thereby exploit synergies with other methodology courses such as Statistics and Empirical Economic Research. Hands-on exercises and case studies from current real-world research projects are meant to deepen the taught concepts and train students in the basics of programming with data.

The course covers both theoretical concepts in handling digital data and practical hands-on exercises focusing on different data structures and data formats (CSV, HTML, JSON). All exercises are based on freely available open-source tools (R, RStudio, Atom). Students are expected to install these tools and work with them on their own machines.

In the first part of the course, students learn about the relevance and challenges of Big Data for research in economics and related fields, and are introduced to basic data formats and how their use in everyday life has evolved in recent years (with a particular focus on the spread of the Internet and online data). Building on this, the second part of the course introduces concepts and practices to gather and prepare digital data from various sources. In this part, students acquire basic programming skills with R in order to apply these practices to real-world datasets. The last part of the course focuses on analysis and visualization as well as storage and documentation of (relatively) large data sets, and discusses the implications of the contents covered in the course for econometric research and applied data science. The structure of the course offers the opportunity to invite guest speakers (in the second and third part of the course) who can give insights into social science research with Big Data and/or applied Data Science in the industry.

1 intro

1 The recent rise of big data and data science

Lower computing costs, a stark decrease in storage costs for digital data, and the diffusion of the Internet have led to the development of new products (e.g., smartphones) and services (e.g., web search engines, cloud computing) over the last few decades. A side product of these developments is a strong increase in the availability of digital data describing all kinds of everyday human activities (Einav and Levin 2014; Matter and Stutzer 2015). As a consequence, new business models and economic structures are emerging with data as their core commodity (i.e., AI-related technological and economic change). For example, the current hype surrounding ‘Artificial Intelligence’ (AI) - largely fueled by the broad application of machine-learning techniques such as ‘deep learning’ (a form of neural networks) - would not be conceivable without the increasing abundance of large amounts of digital data on all kinds of socio-economic entities and activities. In short, without understanding and handling the underlying data streams properly, the AI-driven economy cannot function. The same rationale applies, of course, to other ways of making use of digital data, be it traditional big data analytics or scientific research (e.g., applied econometrics).

The need for proper handling of large amounts of digital data has given rise to the interdisciplinary field of ‘Data Science’ as well as an increasing demand for ‘Data Scientists’. While nothing within Data Science is particularly new on its own, it is the combination of skills and insights from different fields (particularly Computer Science and Statistics) that has proven to be very productive in meeting the new challenges posed by a data-driven economy. In that sense, Data Science is more a craft than a scientific field. As such, it presupposes a more practical and broader understanding of the data than traditional Computer Science and Statistics, from which Data Science borrows its methods. The skill set this course focuses on can be described as ‘hacking skills’, that is, the skills necessary for acquiring, cleaning, and manipulating massive amounts of electronic data. Moreover, this course revisits and applies/integrates concepts learned in the introductory Statistics course (3,222), and will generally presuppose ‘substantive expertise’ in undergraduate economics. Finally, the aim is to give you first practical insights into each part of the data science pipeline.

2 data processing

1 Data in human history

In order to better understand the role of data in today’s economy and society, we study the usage forms and purposes of data records in human history. In a second step, we look at how a computer processes digital data.

Throughout human history, the recording and storage of data has primarily been motivated by measuring, quantifying, and keeping record of both our social and natural environments. Early on, the recording of data was related to economic activity and scientific endeavor. The neolithic transition from hunter-gatherer societies to agriculture and settlements (an economic development sometimes referred to as the ‘first industrial revolution’) came along with a division of labor and more complex organizational structures of society. The change to agricultural food production meant that more people could be fed. But it also implied that food production would require more careful planning (e.g., the right time to seed and harvest) and that the produced food (e.g., grains) would partly be stored and not consumed entirely on the spot. It is believed that, partly due to these two practical problems - keeping track of time and keeping record of production quantities - neolithic societies started to use signs (numbers/letters) carved in stone or wood (Hogben 1983). Keeping track of time and, later, measuring and recording production quantities in order to store and trade likely led to the first ‘data sets’. At the same time, the development of mathematics, particularly geometry, took shape.

3 data storage and data structures

A core part of the practical skills taught in this course has to do with writing, executing, and storing computer code (i.e., instructions to a computer, in a language that it understands) as well as storing and reading data. The way data is typically stored on a computer follows quite naturally from the outlined principles of how computers process data (both technically speaking and in terms of practical applications). Based on a given standard of how 0s and 1s are translated into the more meaningful symbols we see on our keyboard, we simply write data (or computer code) to a text file and save this file on the hard-disk drive of our computer. Again, what is stored on disk ultimately consists only of 0s and 1s. But, given the standards outlined in the previous lecture, these 0s and 1s properly map to characters that we can understand when we read the file back from disk and look at its content on our computer screen.
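The following short sketch in base R illustrates this round trip: write a string to a plain text file, read it back, and inspect how a single character is represented as a byte and as individual bits. The file name example.txt is just an illustrative choice.

```r
text <- "Hello, world!"

# write the text to a plain text file on disk
writeLines(text, con = "example.txt")

# read the file back from disk and print its content
content <- readLines(con = "example.txt")
content
## [1] "Hello, world!"

# inspect how the character 'H' is actually stored:
# as a byte (shown in hexadecimal) and as individual bits
charToRaw("H")
## [1] 48
rawToBits(charToRaw("H"))
## [1] 00 00 00 01 00 00 01 00
```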

4 big data from the web

1 Flat vs. hierarchical data

So far, we have only looked at data structured in a flat/table-like representation (e.g., CSV files). In applied econometrics/statistics it is common to only work with data sets stored in such formats. The main reason is that data manipulation, filtering, aggregation, etc. presuppose data in a table-like format (i.e., matrices). Hence, it makes perfect sense to store the data in this format in the first place.
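A minimal sketch (assuming the jsonlite package is installed) contrasts a flat representation of some example records with a hierarchical one; the data values and file names are hypothetical.

```r
library(jsonlite)

# flat/table-like representation: one row per person, one column per variable
persons_df <- data.frame(name = c("Anna", "Ben"),
                         age  = c(32, 41))
write.csv(persons_df, "persons.csv", row.names = FALSE)

# hierarchical representation: nested lists, here allowing a varying number
# of phone numbers per person (hard to express in a single flat table)
persons_list <- list(
  list(name = "Anna", age = 32, phones = c("011 123 45 67", "079 456 78 90")),
  list(name = "Ben",  age = 41, phones = "044 789 01 23")
)
toJSON(persons_list, pretty = TRUE, auto_unbox = TRUE)
```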

5 programming with r

6 data sources gathering and import

Putting it all together

In this lecture we put the key concepts learned so far (text files for data storage, parsers, encoding, data structures) together and apply them in order to master the first key bottleneck in the data pipeline: how to import raw data from various sources and export/store it for further processing in the pipeline.

In the first two lectures we have learned how data is stored in text files and how different data structures/formats/syntaxes help to organize the data in these files. Along the way, we have encountered key data formats that are used in various settings to store and transfer data:

• CSV (typical for rectangular/table-like data)
• Variants of CSV (tab-delimited, fixed-width, etc.)
• XML and JSON (useful for complex/high-dimensional data sets)
• HTML (a markup language to define the structure and layout of webpages)
• Unstructured text

Depending on the data source, data might come in one or the other form. With the increasing importance of the Internet as a data source for economic research, handling XML, JSON, and HTML properly is becoming more important. However, in applied economic research various other, more specific formats can be encountered:

• Excel spreadsheets (.xls)
• Formats specific to statistical software packages (SPSS: .sav, Stata: .dta, etc.)
• Built-in R datasets
• Binary formats

While we will cover/revisit how to import all of these formats here (see the sketch below), it is important to keep in mind that the fundamental concepts learned are at least as important as knowing which function to call in R for each of these cases. New formats for which no R function yet exists might emerge and become more relevant in the future. However, the underlying logic of how data formats are structured will hardly change.
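Below is a hedged sketch of how these formats are typically imported in R with commonly used packages (readr, readxl, haven, jsonlite); all file names are hypothetical placeholders.

```r
library(readr)    # CSV and other delimited text files
library(readxl)   # Excel spreadsheets
library(haven)    # SPSS and Stata files
library(jsonlite) # JSON

csv_data   <- read_csv("data/trade.csv")                 # comma-separated values
tsv_data   <- read_tsv("data/survey.tsv")                # tab-delimited variant
excel_data <- read_excel("data/budget.xlsx", sheet = 1)  # Excel spreadsheet
spss_data  <- read_sav("data/panel.sav")                 # SPSS
stata_data <- read_dta("data/panel.dta")                 # Stata
json_data  <- fromJSON("data/api_response.json")         # JSON

# built-in R dataset: no import needed, simply load it by name
data(swiss)
```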

7 data preparation

1 Wrangling with data

Importing a dataset properly is just the first of several milestones until an analysis-ready dataset is generated. In some cases, cleaning the raw data is a necessary step to facilitate/enable proper parsing of the data set in order to import it. However, most of the cleaning/preparing (‘wrangling’) of the data follows after the proper parsing of structured data. Many aspects of data wrangling are specific to certain datasets, and an entire curriculum could be filled with different approaches and tools to address specific problems. Moreover, proficiency in data wrangling is generally a matter of experience in working with data, gained over many years. Here, we focus on two quite general and broadly applicable techniques that are central to cleaning and preparing a dataset for analysis: simple string operations (find/replace parts of text strings) and reshaping rectangular data (wide to long/long to wide). The former focuses on individual variables one at a time, while the latter typically happens at the level of the entire dataset.

1.1 Cleaning data with basic string operations

Recall that most of the data we read into R for analytic purposes is essentially a collection of raw text (structured with special characters). When parsing the data in order to read it into R with high-level functions such as the ones provided in the readr package, both the structure and the types of the data are considered. The resulting data.frame/tibble might thus contain variables (different columns) of type character, factor, integer, etc. At this stage it often happens that the raw data is not clean enough for the parser to recognize the data types in each column correctly, and it resorts to just parsing them as character. Indeed, if we have to deal with a very messy dataset, it can make a lot of sense to constrain the parser such that it reads each column as character, and to convert the columns to the intended types only after cleaning their values.
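The following minimal sketch (assuming the stringr and tidyr packages) illustrates both techniques on a small, made-up dataset: removing currency labels and separators from a character column with a string operation, and reshaping the table from wide to long format.

```r
library(stringr)
library(tidyr)

# made-up messy data: prices were parsed as character because of the
# currency label and the apostrophe used as thousands separator
prices <- data.frame(canton     = c("SG", "ZH"),
                     price_2019 = c("CHF 1'200", "CHF 980"),
                     price_2020 = c("CHF 1'250", "CHF 1'010"))

# 1) basic string operations: remove "CHF " and "'", then convert to numeric
prices$price_2019 <- as.numeric(str_replace_all(prices$price_2019, "CHF |'", ""))
prices$price_2020 <- as.numeric(str_replace_all(prices$price_2020, "CHF |'", ""))

# 2) reshaping: from wide (one column per year) to long (one row per canton-year)
prices_long <- pivot_longer(prices,
                            cols = starts_with("price_"),
                            names_to = "year",
                            names_prefix = "price_",
                            values_to = "price")
prices_long
```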

10 data analysis and basic statistics with r

1 Data analysis with R

In the first part of this lecture we take a look at some key functions for applied data analysis in R. At this point, we have already implemented the collecting/importing and cleaning of the raw data. The analysis part can be thought of as a collection of tasks with the aim of making sense of the data. In practice, this can be exploratory (discovering interesting patterns in the data) or inductive (testing a specific hypothesis). Moreover, it typically involves both functions for the actual statistical analysis as well as various functions to select, combine, filter, and aggregate data. Similar to the topic of data cleaning/preparation, covering all aspects of applied data analysis with R goes well beyond the scope of one lecture. The aim is thus to give a practical overview of some of the basic concepts and their corresponding R functions (here from the tidyverse).
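A short sketch of such basic steps, using dplyr from the tidyverse and the built-in swiss dataset, could look as follows (the grouping variable and regression specification are purely illustrative).

```r
library(dplyr)

data(swiss)

# select variables, filter observations, and aggregate by group
swiss %>%
  select(Fertility, Education, Agriculture, Catholic) %>%
  filter(Education > 5) %>%
  mutate(catholic_majority = Catholic > 50) %>%
  group_by(catholic_majority) %>%
  summarise(mean_fertility = mean(Fertility),
            n_provinces = n())

# a basic statistical analysis: OLS regression with base R
fit <- lm(Fertility ~ Education + Agriculture, data = swiss)
summary(fit)
```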

11 data display

Data display

In the last part of a data pipeline we typically deal with the visualisation of data and statistical results for presentation/communication. Typical output formats are reports, a thesis (BA, MA, dissertation chapter), interactive dashboards, and websites. R (and particularly RStudio) provides a very flexible framework to manage the steps involved in visualisation/presentation for all of these output formats. A first (low-level) step in preparing data/results for publication is formatting data values for publication. Typically, this involves some string operations to make numbers and text look nicer before we show them in a table or graph.
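For instance, a minimal sketch of such low-level formatting with base R (the numbers are made up) might look like this:

```r
gdp_per_capita <- 81867.462   # made-up value
share_services <- 0.73812     # made-up share

# round and add a thousands separator for display in a table
format(round(gdp_per_capita), big.mark = ",")   # "81,867"

# turn a share into a percentage string
paste0(round(share_services * 100, 1), "%")     # "73.8%"

# sprintf gives full control over the number format
sprintf("%.2f", share_services)                 # "0.74"
```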
