General Guidance on Data Usage and Management
- Data Usage
- Data Management and Documentation:
- Ensure long-term preservation of, and full and open access to, high-quality data sets
- Give proper credit to the researchers providing the data
- Provide thorough, yet simple, documentation: how the data were produced, what they mean
- Generate ASCII data and documentation files; they ensure readability by virtually all users
- Define variable names and units
- Point to, or provide, important publications that further document the data
CDIAC fully supports the July 1991 Policy Statements on Data Management for Global Change Research of the U.S. Global Change Research Program (the so-called "Bromley Principles"). CDIAC is sponsored by the Office of Biological and Environmental Research of the U.S. Department of Energy, a USGCRP agency. Some of the items that we consider most pertinent to our work are:
- "commitment to the establishment, maintenance, validation, description, accessibility, and distribution of high-quality, long-term data sets"
- "full and open sharing of the full suite of global data sets"
- "preservation of all data needed for long-term global change research"
- "easily accessible information about the data"
- "data should be provided at the lowest possible cost...full and open access to data"
CDIAC encourages the use of the data that you get from us. In most, but not all, cases, the data that we archive were the result of federally-funded research and monitoring programs (we also hold data contributed by state agencies, other countries, and the private sector). We feel strongly that society benefits when these data are used to understand and protect the global environment.
CDIAC requests that, when you use data obtained from us, you properly cite the data product and the data contributors. Not only is this good scholarly practice, but seeing their work properly cited encourages scientists to archive their data at centers such as CDIAC. Our data products contain suggested citations to facilitate this. For example, if you look at Keeling's landmark data on atmospheric carbon dioxide measured at Mauna Loa, Hawaii, you'll find the following suggested citation:
CITE AS: Keeling, R.F., S.C. Piper, A.F. Bollenbacher and J.S. Walker. 2009. Atmospheric CO2 records from sites in the SIO air sampling network. In Trends: A Compendium of Data on Global Change. Carbon Dioxide Information Analysis Center, Oak Ridge National Laboratory, U.S. Department of Energy, Oak Ridge, Tenn., U.S.A. doi: 10.3334/CDIAC/atg.035.
It would also be most helpful if, in the Acknowledgments section of any publication that is based on data you received from CDIAC, you include a statement to the following effect:
The --- database used in this analysis was provided by the U.S. Department of Energy through its Carbon Dioxide Information Analysis Center at Oak Ridge National Laboratory.
or, more fully:
The --- database of --- (investigator) resulted from research sponsored by the --- (agency); it was provided by the U.S. Department of Energy through its Carbon Dioxide Information Analysis Center at Oak Ridge National Laboratory.
We'll be glad to work with you on the proper wording of this credit. The Seeley G. Mudd Library at Lawrence University provides excellent guidance on how to cite electronic documents, plus links to other related online resources.
Please let us know when you cite data from CDIAC (and send us a reprint of your published papers). This is important to us (it guides our activities and helps us focus our efforts on data products that are useful to you), to the U.S. Department of Energy and other sponsoring agencies (evidence that their funding of research and data management results in useful products), and to our data contributors (encouraging them to archive their data, and sometimes leading to new collaborations).
In the fifteen years that CDIAC has been in operation, we have received hundreds of different data bases from contributors in many disciplines. In reviewing them, and in the process of quality-assuring and documenting more than 75 global-change data bases, we have seen a wide range of styles and levels of quality. We recognize that virtually every research or data collection project is unique, and no one detailed format would be appropriate across the board. In fact, different sponsors of research may well insist on a specific format, and there are many! CDIAC's policy has always been to accept data in any format that is convenient to the data contributor - relational data base files, SAS files, spreadsheet files, ASCII files, whatever. We thought it would be useful to share with you what we have found to be characteristics of data bases that make them useful and usable.
A user of the data 20 years hence, especially a user who is not an expert in the same field as the researcher producing the data, should be able to understand what measurements were made, what was done to the data before they were archived, what the data mean, what the data can be used for, and what the data should not be used for.
Keep It Simple, Stupid (the famous KISS Principle), or Why CDIAC Uses Flat ASCII Files: The simpler the data structure, the easier it will be to read, and the less likely it will be that undetected errors will sneak in. We invariably aim for a flat ASCII file as the preferred format for archiving and transmitting data files. Flat ASCII files can be read now by virtually anyone, using a wide variety of software, and we are confident that this will be true well into the future. For certain spatial data sets, GIS formats (e.g., ARC/INFO export files) may also be appropriate. But remember that the use of proprietary formats limits the usefulness of data sets for many users! Proprietary spreadsheet formats - while they may suit the needs of the researcher - may not always work for archival and transfer (e.g., users may not all have the most current version, and there may be problems with successfully reading a spreadsheet file created with one software package into another package). Whenever CDIAC receives a spreadsheet file, we immediately convert the file into a flat ASCII file (we may also distribute the spreadsheet files, but that will be done on a "caveat emptor" basis). The simplest way to ensure that a spreadsheet can ultimately be converted into a flat ASCII file is to have a unique variable definition for each column.
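As an illustration of the "one unique variable per column" idea, here is a minimal Python sketch that writes a small table as a comma-delimited flat ASCII file. The function name `write_flat_ascii`, the column names, and the values are all hypothetical; this is not CDIAC's conversion procedure, only one way the principle might be applied.

```python
import csv
import io


def write_flat_ascii(columns, rows, out):
    """Write rows as comma-delimited text with a single header line.

    Raises ValueError if column names repeat or a row has the wrong length.
    """
    if len(set(columns)) != len(columns):
        raise ValueError("each column needs a unique variable name")
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(columns)
    for row in rows:
        if len(row) != len(columns):
            raise ValueError("row length does not match the number of columns")
        writer.writerow(row)


# Made-up values, for illustration only.
columns = ["year", "site_id", "co2_ppm"]
rows = [[1990, "A01", 354.2], [1991, "A01", 355.6]]

buffer = io.StringIO()
write_flat_ascii(columns, rows, buffer)
text = buffer.getvalue()
text.encode("ascii")  # raises UnicodeEncodeError if any character is not ASCII
print(text)
```

Writing to an in-memory buffer keeps the sketch self-contained; passing an open file object instead would produce the file on disk.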
For example, use your computer to convert from degrees Fahrenheit to degrees Celsius, rather than performing the transformations offline (e.g., in your head or on the back of an envelope) and entering the transformed values into the data base. Computers can handle routine transformations more reliably than we can in our heads or with a hand-held calculator. If an unusual transformation is used, document it.
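A short Python sketch of the Fahrenheit-to-Celsius case follows. The function name and the sample readings are assumptions made for illustration; the point is that the raw values are stored as measured and the conversion is applied by code, so it can be checked and repeated.

```python
def fahrenheit_to_celsius(temp_f):
    """Convert a temperature from degrees Fahrenheit to degrees Celsius."""
    return (temp_f - 32.0) * 5.0 / 9.0


# Hypothetical raw readings, kept exactly as recorded.
raw_f = [32.0, 212.0, 50.0]

# Derived values are computed, never typed in by hand.
converted_c = [round(fahrenheit_to_celsius(t), 2) for t in raw_f]
print(converted_c)  # [0.0, 100.0, 10.0]
```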
For example, for a missing value use -999.99 or some other number that would not reasonably be encountered in the data base. Missing values are especially troublesome when combining individual data sets that handle them differently; artificial data discontinuities can appear at the boundaries.
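The sketch below shows one way to handle this in Python: recode each data set's missing-value flag to a single common code before combining, and exclude that code from any statistics. The second data set's flag (-9999.0), the function names, and all values are hypothetical.

```python
MISSING = -999.99  # common missing-value code, as suggested above


def recode_missing(values, old_code, new_code=MISSING):
    """Replace one data set's missing-value flag with the common code."""
    return [new_code if v == old_code else v for v in values]


def mean_ignoring_missing(values, missing=MISSING):
    """Average only the real measurements; return None if there are none."""
    present = [v for v in values if v != missing]
    if not present:
        return None
    return sum(present) / len(present)


# Two made-up data sets that flag missing values differently.
set_a = [12.5, -999.99, 13.1]
set_b = [12.9, -9999.0, 13.4]

combined = set_a + recode_missing(set_b, old_code=-9999.0)
mean = mean_ignoring_missing(combined)
print(combined)
print(mean)
```

Without the recoding step, the -9999.0 flag from the second data set would be averaged in as if it were a measurement.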
"Metadata" has been defined as "data about data" - that is, documentation. No data file should be considered complete or useful unless it is accompanied by information about the data. Some of this information can be contained in header records (e.g., name/number of data base, name of data contributors, names of variables); more extensive metadata can be included in a separate file, in which case the data file and the metadata file should include pointers to each other.
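A minimal Python sketch of header records follows. It assumes a convention in which header lines start with "#" and hold "key: value" pairs, including a pointer to a separate metadata file. That convention, the function names, and the file and contributor names are illustrative assumptions, not a CDIAC format.

```python
def build_data_file(header, columns, rows):
    """Return file text: '#'-prefixed header records, column names, data rows."""
    lines = [f"# {key}: {value}" for key, value in header.items()]
    lines.append(",".join(columns))
    lines.extend(",".join(str(v) for v in row) for row in rows)
    return "\n".join(lines) + "\n"


def read_header(text):
    """Parse the leading '#' records back into a dictionary."""
    header = {}
    for line in text.splitlines():
        if not line.startswith("#"):
            break  # header records end at the first non-'#' line
        key, _, value = line[1:].partition(":")
        header[key.strip()] = value.strip()
    return header


# Hypothetical names and values, for illustration only.
header = {
    "database": "example_db",
    "contributors": "A. Researcher",
    "metadata_file": "example_db_readme.txt",  # pointer to the fuller documentation
}
file_text = build_data_file(header, ["year", "value"], [[1990, 1.5], [1991, 1.7]])
parsed = read_header(file_text)
print(file_text)
```

The metadata file would carry a matching pointer back to the data file, so that either one found alone leads to the other.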
Fully describe the data base: what was measured; how it was measured; what quality-control and calibration procedures were used (blanks, controls, standard reference materials, etc.); how the data files are formatted; where the data reside and how they may be obtained; what limitations exist on use of the data.
Do not make the user guess what RTBCCSD means. (In case you were wondering, it stands for "standard deviation of the concentration of boron in roots" in a study of the effects of elevated atmospheric CO2 on plants.)
Define any non-standard (i.e., non-SI Metric) units. If in doubt, define the unit. Remember, depending upon the disciplinary background of the researcher, nm can signify either nanometers or nautical miles.
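One way to make variable and unit definitions hard to forget is to keep them in a small data dictionary and check every column against it. The Python sketch below does this; the variable names, descriptions, and the `undocumented` helper are hypothetical.

```python
# Hypothetical data dictionary: every variable gets a description and spelled-out units.
VARIABLES = {
    "wavelength": {
        "description": "wavelength of measured light",
        "units": "nanometers",
    },
    "distance_offshore": {
        "description": "distance from sampling station to shore",
        "units": "nautical miles",
    },
}


def undocumented(columns, dictionary):
    """Return the columns that lack a description or units in the dictionary."""
    missing = []
    for name in columns:
        entry = dictionary.get(name, {})
        if not entry.get("description") or not entry.get("units"):
            missing.append(name)
    return missing


problems = undocumented(["wavelength", "distance_offshore", "depth"], VARIABLES)
print(problems)  # ['depth']
```

Spelling the units out in full avoids the ambiguity of an abbreviation such as nm.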
Make the user aware of (or, in the case of unpublished or difficult-to-obtain manuscripts, provide) reports or papers that more fully describe the data, how they were obtained, and how they have been used.
For more information on many of the topics addressed here, see Best Practices for Preparing Ecological and Ground-Based Data Sets to Share and Archive and Data Provider Information, provided by the Oak Ridge National Laboratory Distributed Active Archive Center.