Center for Historical Information and Analysis: Big Data in History
July 18, 2013, 10:30 | Panel, CBA 143
Session Abstract
This session addresses the development of a significant new digital humanities resource in the form of a world-historical dataset. The session provides an overview of the project and details of two key elements in its early stages of development. The commentator will set this project in the context of other complex, multidisciplinary datasets. The Collaborative for Historical Information and Analysis (http://chia.pitt.edu) is a multi-institutional collaborative of scholars in humanities and in social, natural, and information sciences. The Collaborative, structured as a Research Collaborative, Headquarters, and a wider informal network, has recently received major support from the National Science Foundation, which has provided CHIA with an award in its Building Capacity and Community program. On the Research Collaborative side, the award is to strengthen the organizational and technical infrastructure for linking participating institutions that are collecting data on historical population, climate, and other topics with a crowdsourcing tool, in order to demonstrate the feasibility of building a continuously growing collection of diverse historical data and metadata. On the Headquarters side, the award is to assemble and develop knowledge on repository design in order to build a repository sufficient to house the incoming data and permit global and interactive analysis. The Collaborative for Historical Information and Analysis’s future plans include expanding its collection and processing of historical data, broadening its community of social and natural science researchers, analyzing historical patterns of global change, and sharing its resources with researchers, policy-makers, teachers and students.
CHIA is headquartered at the University of Pittsburgh with formal affiliates at the University of California – Merced and the International Institute of Social History (Amsterdam) plus two affiliates at Harvard University (Institute for Quantitative Social Sciences and Center for Geographic Analysis). Other associated groups are at Boston University and Michigan State University (participating in the NSF-funded project) as well as the University of Portsmouth, the University of California, Irvine, and the Council for the Development of Economic and Social Research in Africa (CODESRIA).
Three papers discuss varying aspects of CHIA and its work. Patrick Manning of the University of Pittsburgh, director of the project, provides an overview of the project objectives, philosophy, structure, and its practical milestones. Ruth Mostern of the University of California - Merced, a member of the governing Executive Committee, presents on the issue of soliciting, linking, and evaluating scholarly datasets. Kai Cao, the technical lead on the project at the University of Pittsburgh, describes the work of creating the prototype archive. The discussant, Ian Johnson, the developer of a parallel project at the University of Sydney, will comment on both generalities and specifics of the CHIA project as presented.
The Collaborative for Historical Information and Analysis: Framework for Creating a World-Historical Data Repository
Big Data in history will provide a new, comprehensive level of documentation on the past. Currently available historical information, while immense in its overall quantity, is scattered and dispersed. Libraries and archives in great cities hold treasure troves of data on trade, politics, and religion for national and imperial centers, but each archive is separate from the others, and the totality of their records provides scanty information on people of rural areas. The idea of Big Data in history is to digitize a growing portion of existing historical documentation, to link the scattered records to each other (by place, time, and topic), and to create a comprehensive picture of changes in human society over the past four or five centuries.
The Collaborative for Historical Information and Analysis (CHIA), formed in 2011, now has five institutional members based in the U.S. and Europe and four additional, informally associated institutions, and expects to grow further through links with such organizations as the Council for the Development of Economic and Social Research in Africa (CODESRIA, based in Dakar). The Collaborative, directed by an international Executive Committee, is administered at the University of Pittsburgh. Its collaborative projects include the creation of a prototype and a full archive able to contain consistent and documented world historical data, along with systems for gathering and incorporating new data and for analyzing and visualizing the data. Data included are expected to begin with demographic, economic, social, political, health, and environmental variables, and to display their patterns and interactions. This presentation will trace the development of the overall project over the past five years and identify the main problems to be taken on in the next three years.
Tasks addressed from 2007 have included initial articulation of the objective, overall design of the research project, and work on the initial steps of implementing several aspects of that design. In particular, project members have emphasized recruiting and sustaining collaborating groups at several institutions. In addition, work has included collection of various sorts of historical data, design of the repository for worldwide historical data, development of an ontology describing world-historical data at various levels of detail, and systems of cleaning and documenting data.
At an empirical level, major advances have taken place in collecting, displaying, and linking U.S. data on disease, climate, and population.
The NSF-supported project includes three years of a projected ten-year project to create a global-historical dataset, with the hope that it would be taken over and sustained through the efforts of UNESCO or the World Bank. The CHIA project pledges to maintain open-source, open-access, non-proprietary standards throughout its work in constructing a world-historical archive. We have allocated the range of our activities among three basic missions.
Mission 1. Gather historical data.
Mission 2. Aggregate data up to the global level.
Mission 3. Visualize, analyze, and mine the data.
Currently funded infrastructure projects include:
- Crowd-sourcing application for collecting and archiving historical and social data will open the bottleneck that has so far prevented systematic study of human society at a large scale (University of Pittsburgh).
- Prototype archive – programming a prototype for global integration and visualization of data, relying on the model of the Dataverse Network and a selection of world-wide data for the twentieth century (University of Pittsburgh, Harvard University).
- “Hoovering” data, the collection of available datasets and a survey of social scientists to determine the availability of historical datasets (UC-Merced).
- Data retrieval for South Asia and Southeast Asia led by the Asian Studies group at Michigan State University.
- Colonialism – integration and visualization of data collected in a previous project, CLIO, at Boston University.
- Collaboration as infrastructure among social scientists all over the world, through the creation and maintenance of a global system of historical data, will bring additional sharing of data and analysis.
Additional areas of activity
- Peer-reviewing of datasets through the Journal of World-Historical Information will bring recognition of the scholarly value of creating datasets, and will ensure that high standards for creating historical datasets are created and maintained.
- Archive design at Big-Data scale – design and programming through the XSEDE program in association with the Pittsburgh Supercomputing Center.
- Labour Relations – a program of distributed historical research to assess the structure of labor forces in numerous historical situations, supported by the Netherlands National Science Foundation (International Institute of Social History).
- Technical and analytical skills of social scientists will advance through the process of collecting and analyzing data, and demonstrate the parallels and the links of social sciences and natural sciences.
- Theoretical debate. The expanded effort to link and apply social science theories, especially in order to fill in missing historical data, will strengthen theory and analysis in social science.
To understand global social patterns as they exist today, it is increasingly clear that we need to understand how they have evolved over recent centuries. The Collaborative for Historical Information and Analysis responds to this need and takes historical analysis into the realm of Big Data. It is expected that the data resources will grow to several terabytes in size. This project will stimulate development of more efficient research collaborations, enabling systematic large-scale consolidation of diverse historical data sources. Once collected and integrated, the data repository and analytical system will allow scholars to address a wider set of questions testing hypotheses about long-term and short-term social change at the global scale and catalyzing an expansion of the evidence base in humanities and social sciences. For example, our understanding of important societal issues can advance by linking health to demography and by incorporating climate and health factors into economic studies. Disciplinary theory will advance through interaction among the various scientific fields, so that a global network of humanities and social-science researchers will emerge.
The project addresses the global dynamics in humanities issues and social-science variables over the past several centuries. Contemporary globalization and concerns about future global trends naturally raise questions about past patterns of global change. What were the interactions of population, economy, governance, and social inequality with each other and with climate and disease? Historical social science, focused at national and subnational levels, has scarcely addressed global issues. Our group expects to collect, document, and analyze historical data to permit cross-disciplinary analysis of human society over time. The overall topic is immense, but we believe we have found an orderly and productive way to work on it.
How are we to create consistent data at regional and global levels over time? Our group, rather than tunneling within a single discipline, seeks to coordinate data collection and research in multiple disciplines. We advance an explicit focus on the global and historical character of human society. The existing data sources are mostly used for regional comparative efforts; they vary widely in degree of consistency, reliability, completeness, as well as in data representation format. The task of large-scale data utilization can only be resolved via collaborative efforts within a large network of researchers.
What criteria distinguish this global strategy in humanities research from other large-scale projects? Our project is not simply to archive large quantities of data but to define and link them into a single overarching set of interacting, historical data. We require a coherent metadata framework to link data to their sources and each other. Creating this mass of new metadata—as we incorporate, integrate, and aggregate data—requires a strong ontological base and a crowdsourcing procedure to link many contributors. Our collection of base data on population worldwide is to go back four centuries and to include migration and other extra-census data—it is thus complementary to rather than competitive with the Terra Populus (University of Minnesota) collection of census data.
The Collaborative responds to an imperative of the current moment: the need to understand global social patterns not only as they exist today but as they have evolved over some four centuries. The program argues that some national resources should be put into research at this broad level, to clarify and diffuse a global strategy in social-science research. It also focuses on population and climate as key layers of global data. In research agenda, CHIA addresses human interaction with the natural world, global population change, patterns in social inequality, and local and global patterns of governance.
Soliciting, Integrating and Evaluating World-Historical Data
A signal characteristic of world-historical data is that much of it will need to be assembled piecemeal from datasets created by specialists. Some of these datasets, such as those which concern climate information, are quite large. However, at the global historical scale, even climate data needs to be integrated based on local and regional analyses. Ocean sediment samples, ice cores, or dendrochronologies may offer centuries of continuous information, but they concern particular locations. Census and epidemiology datasets often record information about millions of people, but they are episodic in nature and regional in spatial scale. Beyond these types of data, contributions range downward in size and upward in analytical complexity. The history of commodity exchange, for instance, requires meticulous reconstruction from bills of lading and tax documents that may be difficult to locate, trust, and make commensurable. One of the challenges of the CHIA effort concerns this work. The solution involves three tasks.
- “Hoovering” data. CHIA needs to engage in a labor-intensive process to identify the specialist holders of relevant datasets and work with them to solve issues of data structure and intellectual property that may prove to be barriers to contributing them to a CHIA archive. This paper will report about the development of a CHIA survey of historians and historical social scientists regarding their creation, preservation, integration, and use of historical datasets.
- Integrating data. CHIA needs to identify appropriate standards for the formal descriptions of historical datasets. Librarians and archivists stress the importance of appropriate metadata for guiding the ingest, discovery, and integration of datasets, and many domain-specific standards exist. We historians have our own challenges, since our disciplinary traditions (rich footnoting, bibliographies, and descriptive text about method) mean that we need particular standards for describing the dataset as a work, the primary and secondary sources (including other datasets) consulted in its production, and the operations conducted upon them to create the final work. Collaboratively developed datasets have primary and secondary authors as well as technical experts and publishers. This paper will discuss existing metadata standards, best practices for less formal data description, and the promise of linked data solutions.
- Evaluating data. Historical datasets lack established conventions of form, content, and genre. Authors do not have clearly recognized models to follow or oppose even as they seek to create effective, excellent, and communicative work; and reviewers, along with readers/users, have to assess the value or character of any given dataset de novo and ad hoc, rather than engaging with a given dataset as an exemplar of a familiar category. CHIA needs to develop standards for reviewing datasets and offering the imprimatur of publication in order to overcome disincentives to pursuing digital scholarship on the part of authors, and to trusting it on the part of users. Until these matters are resolved, it will be difficult for historians to contribute data to the CHIA archive, for CHIA editors to evaluate data, and for the CHIA system to handle data content and data types that may be quite diverse. This paper will discuss CHIA efforts to evaluate datasets, in particular by publishing dataset reviews in the Journal of World-Historical Information.
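The descriptive needs named under “Integrating data” (the dataset as a work, its sources, and the operations applied to them) can be pictured as a simple record structure. The following is only a minimal sketch under stated assumptions: the class and field names (`DatasetRecord`, `Source`, `Operation`, `missing_documentation`) are hypothetical illustrations, not an existing metadata standard and not a schema adopted by CHIA.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Source:
    """A source consulted in producing the dataset (hypothetical structure)."""
    citation: str
    kind: str  # assumed vocabulary: "primary", "secondary", or "dataset"


@dataclass
class Operation:
    """One documented step applied to the sources to create the final work."""
    description: str


@dataclass
class DatasetRecord:
    """Hypothetical description of a historical dataset as a scholarly work."""
    title: str
    primary_authors: List[str]
    secondary_authors: List[str] = field(default_factory=list)
    technical_experts: List[str] = field(default_factory=list)
    publisher: str = ""
    sources: List[Source] = field(default_factory=list)
    operations: List[Operation] = field(default_factory=list)

    def missing_documentation(self) -> List[str]:
        """Return the names of core descriptive elements that are still empty."""
        gaps = []
        if not self.primary_authors:
            gaps.append("primary_authors")
        if not self.sources:
            gaps.append("sources")
        if not self.operations:
            gaps.append("operations")
        return gaps
```

A reviewer or ingest tool could call `missing_documentation()` on a submitted record to see which elements an author has yet to supply. Which elements count as required is an assumption of this sketch.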
Creating a Prototype Archive for a World-Historical Dataset
A key element of the overall project of the Collaborative for Historical Information and Analysis is construction of an appropriate archival system. Three CHIA affiliates collaborate in the initial stage of archive construction by creating a prototype archive and retrieval system based on a “faceted search,” to be created and tested by mid-2013. The collaborating groups are the World-Historical Dataverse at the University of Pittsburgh (Pitt) (with the author as lead developer), the Institute for Quantitative Social Science at Harvard (with Gustavo Durand as principal developer), and the Center for Geographic Analysis at Harvard (with Benjamin Lewis as developer). The archive will be based initially on the Dataverse Network (DVN), to enable linkage of multiple data files (which themselves include explicit spatial and temporal data). The linked data can then be searched to identify patterns at the global level and interactions among variables at various spatial and temporal levels. Multiple data files will be stored in the DVN system as “studies,” and will be accessed by the system of retrieval.
The “faceted search” is the key element of the archive. It is a search portal, which enables users to define selected data by space, by time, and by topic in a text box. The search is “faceted” in the sense that when the user adjusts the range on one dimension (e.g., space), the range on the other dimensions adjusts appropriately. Once the search criteria are entered, the program identifies the studies that meet the search criteria and displays their geographic distribution within the bounding box on the map. With that, the user can then explore the studies by clicking the dots/links visualized in the mapping area.
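The faceted behavior described above can be illustrated with a small sketch. It assumes, purely for illustration, that each study is reduced to a single point location, a year range, and one topic label. The `Study` class and `faceted_search` function are hypothetical and do not reflect the Dataverse Network's actual data model or API. The function filters studies by bounding box, time span, and topic, then reports the ranges that the remaining studies still cover on every dimension, which is one way the other facets could “adjust appropriately.”

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

# Assumed convention: (min_lat, min_lon, max_lat, max_lon)
BBox = Tuple[float, float, float, float]


@dataclass(frozen=True)
class Study:
    """Hypothetical stand-in for a stored study; not the DVN schema."""
    title: str
    lat: float
    lon: float
    start_year: int
    end_year: int
    topic: str


def faceted_search(
    studies: List[Study],
    bbox: Optional[BBox] = None,
    years: Optional[Tuple[int, int]] = None,
    topic: Optional[str] = None,
) -> Tuple[List[Study], Dict[str, object]]:
    """Return matching studies and the facet ranges those studies span."""
    hits = []
    for s in studies:
        if bbox is not None:
            min_lat, min_lon, max_lat, max_lon = bbox
            if not (min_lat <= s.lat <= max_lat and min_lon <= s.lon <= max_lon):
                continue
        # Keep a study if its year range overlaps the requested range.
        if years is not None and (s.end_year < years[0] or s.start_year > years[1]):
            continue
        if topic is not None and s.topic != topic:
            continue
        hits.append(s)
    if not hits:
        return hits, {}
    # Recompute every facet from the surviving studies, so narrowing one
    # dimension narrows the ranges offered on the others.
    facets = {
        "bbox": (min(s.lat for s in hits), min(s.lon for s in hits),
                 max(s.lat for s in hits), max(s.lon for s in hits)),
        "years": (min(s.start_year for s in hits), max(s.end_year for s in hits)),
        "topics": sorted({s.topic for s in hits}),
    }
    return hits, facets
```

This sketch ignores bounding boxes that cross the antimeridian and treats each study as a single point. A production portal would presumably index study footprints rather than scan a list.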
Four categories of data have been identified for development in this task:
- i. Global population data for the 20th century (through 2000), at the national level or, for countries with populations exceeding 100 million, at the provincial level, at 10-year or 5-year intervals.
- ii. Climate for the 20th century for identified places and times within the same units as above.
- iii. Silver flows of production and trade for the 20th century, by place, trajectory, and time (annual or quinquennial).
- iv. Wars during the 20th century, identified by time, space, national or ethnic combatants, casualties.
A graduate researcher at the University of Pittsburgh has been engaged to collect these data.
In addition to the above outline, the presentation will address as many as possible of the details involved in constructing the archive, and will discuss how it is to be used in later phases of project activity. For each study, metadata are to be defined and implemented to ensure that each individual value is fully defined. As a next step, the localized files that are entered into the archive are to be aggregated in order to yield continental and global summaries of data. In addition, we will define and impose the elements of a mid-level ontology for the data, to define relations among all the topical categories, and allow users to use unstructured tags within it.
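The aggregation step can be sketched as follows. This is a minimal illustration that assumes an additive variable (such as population counts) and a hypothetical record layout of (place, continent, year, value). Non-additive variables such as temperature would need a different rule (for example, area-weighted averaging), and the sketch does not address gaps or overlapping spatial units.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# Hypothetical record layout: (place, continent, year, value)
Record = Tuple[str, str, int, float]


def aggregate(
    records: Iterable[Record],
) -> Tuple[Dict[Tuple[str, int], float], Dict[int, float]]:
    """Sum localized values into continental and global totals per year."""
    continental: Dict[Tuple[str, int], float] = defaultdict(float)
    global_totals: Dict[int, float] = defaultdict(float)
    for _place, continent, year, value in records:
        continental[(continent, year)] += value
        global_totals[year] += value
    return dict(continental), dict(global_totals)
```

Keeping the continental and global totals in separate mappings lets a user compare a regional series against the world total for the same year without recomputing either.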
The archive and faceted search, once implemented, will enable users to envision the breadth of the world-historical analytical system, which is the ultimate objective of the work. The associated website will appear as a storytelling platform. We also expect to rely on social media to spread the word on the archive to convey the idea that its continued development can build a resource of global and interdisciplinary interest.
This archive and faceted search, after its initial development, is to be articulated with other aspects of the overall CHIA project. To begin with, based on the responses and evaluations of users, we will carry out revisions of the faceted search created in this Phase 1, advancing it to lower the barriers of entry for participants and expand the reach of the project. Further, we will encourage the collection of additional categories of data and develop models to improve the analysis of data. In perhaps the most important next stage, we will link the archive and faceted search to the crowd-sourcing data-input application under separate development at Pitt. With the linkage of these two applications, it will become possible for large quantities of data to be incorporated and integrated into the world-historical archive.