Nov 17, 2010

Can I get assistance from ICPSR in loading the MARC into my card catalog?

Unfortunately, ICPSR does not have a great deal of MARC expertise in-house. If there is a flaw in the XML itself, we can correct that quickly, but we cannot provide guidance on loading the MARC into your catalog. Your best source of support is likely to be other ORs.

Can I submit ICPSR MARC records to OCLC?

Not without express permission from ICPSR. ICPSR is currently in discussion with interested parties. If you would like to assist in submitting our records to OCLC, please contact web-support@icpsr.umich.edu.

Can I just obtain the MARC from OCLC?

You can, but you'll be charged for the records. In addition, there's no guarantee that the records will be up-to-date.

How do I convert MARC21 XML into MARC?

According to one of our ORs, you can use an application called MARCEdit to transform the files. Unfortunately, ICPSR cannot provide guidance beyond that, as we have no in-house expertise on true MARC.

How do I match your MARC records to existing records in my catalog?

To update your catalog records, please use the 035 field as the matchpoint. 035 contains the ICPSR study number, which is always unique.
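As a rough illustration of using 035 as the matchpoint, the sketch below pulls the 035 $a value out of each record in a MARC21 XML export. The namespace is the standard MARC21 "slim" namespace; the sample record, its study number, and the function name match_keys are made up for illustration and are not taken from an actual ICPSR export.

```python
import xml.etree.ElementTree as ET

# Standard namespace used by MARC21 XML ("MARCXML") documents.
MARC_NS = {"marc": "http://www.loc.gov/MARC21/slim"}

# Hypothetical record for illustration only; real ICPSR records will differ.
SAMPLE = """<collection xmlns="http://www.loc.gov/MARC21/slim">
  <record>
    <datafield tag="035" ind1=" " ind2=" ">
      <subfield code="a">12345</subfield>
    </datafield>
    <datafield tag="245" ind1="0" ind2="0">
      <subfield code="a">Example Study Title</subfield>
    </datafield>
  </record>
</collection>"""


def match_keys(marcxml):
    """Return the 035 $a value of every record, in document order."""
    root = ET.fromstring(marcxml)
    keys = []
    for record in root.iter("{http://www.loc.gov/MARC21/slim}record"):
        path = 'marc:datafield[@tag="035"]/marc:subfield[@code="a"]'
        for subfield in record.findall(path, MARC_NS):
            keys.append((subfield.text or "").strip())
    return keys


print(match_keys(SAMPLE))  # ['12345']
```

The returned values could then be compared against the 035 values already in your catalog to decide which records to overlay and which to add; how that comparison is done depends on your catalog software.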

Nov 16, 2010

How can I get around the 500 results maximum on the metadata export/search results?

Your export link will look something like this:

http://www.icpsr.umich.edu/icpsrweb/ICPSR/marc/studies?archive=ICPSR&q=&recency=QUARTER&paging.startRow=1&paging.rows=500

That URL will return ICPSR studies in MARC format that were added/updated in the last quarter. It will return the first 500 results. If you change paging.startRow=1 to paging.startRow=501, it will return results 501 through 1000. By doing that, you can snag more than 500 results, but you have to break them into 500-record increments.

You can change paging.rows to a higher value, but it won't let you get around the 500-record limit.
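If you need many increments, the URLs can be generated rather than edited by hand. This is a minimal sketch that only builds the URLs described above; it assumes you already know roughly how many results your search returns (total_records), and the function name export_urls is our own.

```python
from urllib.parse import urlencode

BASE = "http://www.icpsr.umich.edu/icpsrweb/ICPSR/marc/studies"
PAGE_SIZE = 500  # the export returns at most 500 records per request


def export_urls(total_records, recency="QUARTER"):
    """Build one export URL per 500-record increment."""
    urls = []
    for start in range(1, total_records + 1, PAGE_SIZE):
        params = {
            "archive": "ICPSR",
            "q": "",
            "recency": recency,
            "paging.startRow": start,
            "paging.rows": PAGE_SIZE,
        }
        urls.append(BASE + "?" + urlencode(params))
    return urls


for url in export_urls(1200):
    print(url)  # startRow is 1, then 501, then 1001
```

Each URL can then be pasted into a browser or fetched one at a time. Requesting them slowly, one after another, keeps the load on the server low.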

Why doesn't ICPSR just make links for each 500? Or the whole set?

We want to prevent the export functions from taxing the server too much. If we put those links on the site anywhere, then every search engine crawler would hit those links repeatedly, which could cause our server to grind to a halt.

Can I generate MARC for studies updated in a particular date range?

Unfortunately, no. The Filter by Recency function enables you to select studies added/updated in the last week, month, quarter, or year. We have no plans to implement a specific date range search for adds/updates. That said, ICPSR will be generating annual update files that feature all adds/updates for the prior year.

Can I obtain MARC from ICPSR? What about Dublin Core?

ICPSR has a number of export functions for study-level metadata. On the search results page, you can now export results as either MARC21 XML or comma-delimited files (with a maximum of 500 results). This export feature appears at the base of the right-hand column on the search results page. From the study home page, you can now export a study-level metadata record in a variety of formats, including DDI, Dublin Core, and MARC21 XML.

These export options replace ICPSR's previous method of disseminating MARC records. The changes are:

  • We now provide MARC21 XML, not true MARC.
  • The new MARC21 XML doesn't include all fields previously indexed; it's a concise metadata record.
  • We no longer submit our MARC to OCLC.
  • We no longer have our MARC reviewed/corrected by library staff at the University of Michigan.
  • Users can easily build custom sets of MARC21 XML, rather than having to re-import all ICPSR metadata.
  • Users can obtain updated XML at will, rather than waiting for an annual batch from ICPSR.

At the beginning of each year, our staff will generate two large XML files:

  • a file containing all ICPSR metadata as MARC21 XML
  • a file containing all studies added or updated in the last year as MARC21 XML

We'll link those files from the OR site, so that the 500 result limit doesn't prevent users from obtaining the full metadata.

If you see any flaws with the XML, please don't hesitate to let us know. The new setup makes it very easy for ICPSR to globally update our XML, and we'd like to ensure that the XML is suitable for our institutions' library catalogs. If you have any questions/concerns, please email web-support@icpsr.umich.edu.

Nov 1, 2010

May I use this question in building my own survey?

Short Answer

Not without further research on your part. The question may be part of a copyrighted instrument. Using it in that case would be copyright infringement and/or plagiarism.

Long Answer

Some instruments utilized as part of the data collection process for a project deposited with ICPSR may contain in whole or in part contents from copyrighted instruments. Reproductions of such instruments are provided as documentation for the analysis of the data of the associated collection. Restrictions on "fair use" apply to all copyrighted content.

Circular 21 from the U.S. Copyright Office provides basic information on fair use and several important legislative provisions and other documents addressing reproduction of copyrighted materials by librarians and educators.

How can I determine if the question is copyrighted?

Read the documentation carefully; contact the investigators directly.

Oct 29, 2010

What kind of data formats does the archive distribute? Do you have SPSS Portable files? SAS transport? Stata?

We primarily distribute data files in eight data formats: three plain text formats (column-delimited ASCII, comma-delimited ASCII, and tab-delimited ASCII), two SAS formats (SAS XPORT and CPORT files), two SPSS formats (SPSS SAV and portable files), and the single Stata data format. Virtually every data file is available in a plain text format. We also supply many data files in one or more of the other formats.

Plain Text

Column-, comma-, and tab-delimited ASCII data files store data, including numeric values, as lines of plain text, with one or more lines per observation (or subject or case). In the plain text format, every character of text--each digit, letter, or other symbol--is encoded in a separate byte in the data file. Thus, the number 133.5 occupies five bytes, the number 8 just one byte, and the string "computer programmer" requires nineteen bytes. Many of ICPSR's plain text data files are encoded with the ASCII character encoding system. However, some use other encodings, such as IBM PC code page 437, which is based on ASCII but supports more characters than ASCII does. Most of these use the ASCII-based ISO 8859-1 or Windows-1252 encodings.
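The byte counts above are easy to check, and the same check shows why the encoding matters for non-ASCII characters. This is a small sketch using Python's built-in codecs; the accented character is our own example, not one drawn from an ICPSR file.

```python
# Each character of an ASCII value occupies exactly one byte.
values = ["133.5", "8", "computer programmer"]
sizes = {v: len(v.encode("ascii")) for v in values}
print(sizes)  # {'133.5': 5, '8': 1, 'computer programmer': 19}

# Outside the ASCII range the encodings disagree: the same character
# is stored as a different byte value depending on the encoding.
e_acute = "\u00e9"  # lowercase e with an acute accent
print(e_acute.encode("cp437"))       # b'\x82'
print(e_acute.encode("iso-8859-1"))  # b'\xe9'
print(e_acute.encode("cp1252"))      # b'\xe9'
```

If accented letters or symbols look wrong after import, trying one of the other encodings named above is a reasonable first step.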

In all three types of plain text data files, the line(s) allocated to a given observation contain the observation's values for the file's variables. What sets the three types apart is the way the values are demarcated on the lines.

In a column-delimited ASCII data file, each variable occupies the same byte(s) on every observation. The bytes are usually called "columns," hence the name of this data format. For example, if a file with one line per observation has just three variables which occupy three bytes each, then the first variable would be located in columns 1-3, the second in columns 4-6, and the third in columns 7-9 on each line in the data file.
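To make the column idea concrete, here is a minimal sketch that reads the three-variable layout from the example above. The variable names and the sample line are hypothetical; a real file's column locations come from its codebook or setup file.

```python
# Hypothetical layout matching the example above:
# (variable name, first column, last column), 1-based and inclusive.
LAYOUT = [("var1", 1, 3), ("var2", 4, 6), ("var3", 7, 9)]


def parse_line(line, layout=LAYOUT):
    """Slice one line of a column-delimited file into named values."""
    # Columns are 1-based, Python slices are 0-based and end-exclusive.
    return {name: line[first - 1:last].strip() for name, first, last in layout}


print(parse_line("  1 42987"))  # {'var1': '1', 'var2': '42', 'var3': '987'}
```

Note that there are no separators in the line; the values are recovered only from their positions, which is why a setup file or codebook is needed.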

To facilitate the use of the column-delimited ASCII data files, which require programming expertise to import them into statistical packages for analysis, ICPSR usually provides programs, called "setups," to read them into SAS, SPSS, or Stata. The setups also assign variable labels and usually assign value labels and define missing values too.

In a comma-delimited ASCII data file, the data values are separated with commas instead of being located in fixed column locations. Thus, in this format, the length of each line varies according to the magnitude of the line's data values. For example, the first two lines of a four-variable data file could look like this:

1,133.5,plumber,250778
2,44,librarian,20000

As with the column-delimited ASCII files, ICPSR usually provides setups to read the comma-delimited ASCII files into SAS, SPSS, or Stata.
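For comparison with the column-delimited case, the two sample lines above can be read with Python's standard csv module. This is a sketch only; the variable names are invented, since the example does not name them.

```python
import csv
import io

DATA = "1,133.5,plumber,250778\n2,44,librarian,20000\n"
NAMES = ["id", "score", "occupation", "income"]  # hypothetical variable names

# Values are located by the commas, not by fixed column positions.
rows = [dict(zip(NAMES, fields)) for fields in csv.reader(io.StringIO(DATA))]
print(rows[0]["occupation"])  # plumber

# Line length varies with the size of the values on each line.
line_lengths = [len(line) for line in DATA.splitlines()]
print(line_lengths)  # [22, 20]
```

All values arrive as text; converting numeric variables to numbers is a separate step that the ICPSR setups handle for SAS, SPSS, and Stata.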

Tab-delimited ASCII data files are the same as comma-delimited ASCII files except that values are delimited with a special tab control character instead of a comma. Most of these files were created by ICPSR for use with spreadsheets, such as Excel, into which they can be easily imported. These files can also be read into statistical packages like SAS, SPSS, and Stata. However, ICPSR rarely provides setups for that purpose.

SAS

We distribute two SAS data formats: SAS transport files generated by the SAS CPORT procedure and SAS transport files written by the SAS XPORT engine. Both types of files contain specially formatted SAS data sets, which contain variable labels as well as data. Many of ICPSR's SAS CPORT files also include SAS format catalogs with value labels.

SAS CPORT files should be imported into SAS with the SAS CIMPORT procedure.

Since SAS has an engine that reads SAS XPORT files, they can be read by any SAS command that can read an ordinary SAS data set, such as the SAS set statement or the SAS FREQ procedure. SAS XPORT files can also be converted to standard SAS data sets with the SAS COPY procedure.

SPSS

We distribute two types of SPSS data files: SPSS SAV files written by the SPSS save command and SPSS portable files written by the SPSS export command. Both types of data files include variable labels and usually include value labels and missing value definitions.

To load SPSS SAV files into SPSS, use the SPSS get command.

To read SPSS portable files into SPSS, use the SPSS import command.

Stata

Like the SAS and SPSS formats, Stata's proprietary data file format, which is written by the Stata save command, is platform independent. Our Stata data files include variable labels and usually include value labels too.

Stata data files should be loaded into Stata with the Stata use command.

Using ASCII data and setup files

Sep 30, 2010

How do I import a study citation into EndNote/RefWorks/etc.?

EndNote X4+ Users

If you click on Export > Citation > RIS, EndNote should automatically import the citation. If the citation comes through with the reference type "Computer Program," you'll need to update your RIS import filter, which can be obtained from EndNote's Web site. Simply download the file and place it in the EndNote X4/Filters folder; it will replace the pre-existing "RefMan RIS.enf" file.

Older Versions of EndNote

If you click on Export > Citation > RIS, EndNote should automatically import the citation. Unfortunately, the citation will come through with the reference type "Computer Program," due to how the EndNote software interprets the "DATA" publication type in RIS. To correct this issue, you can either upgrade your software, or consult EndNote for assistance.

Other Bibliographic Software

Just click on Export > Citation > RIS. Your software will either import it automatically, or it will save a small citation file to your hard drive. Open your bibliographic software and look for an "Import" option in the "File" menu. In the dialogue box, point it at that file. There may be an additional menu that lets you specify the type of import; look for a filter labeled "RIS" or "RefMan RIS" or "Reference Information Systems."

Please note that RIS is a standard, but not all bibliographic software adheres to the standard. For example, if you import a study citation into Zotero via RIS, it will call the citation a "Web Page," even though the RIS standard labels it a "Data file."
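If you want to see what your software is being handed, an RIS file is plain text with one tagged field per line, and the reference type sits in the TY field. The sketch below reads such a file; the sample citation is invented for illustration (only the "DATA" type comes from the discussion above), and the parser is a simplification that ignores continuation lines.

```python
import re

# An RIS line is a two-character tag, two spaces, a hyphen, then the value.
TAG_LINE = re.compile(r"^([A-Z][A-Z0-9])  -\s?(.*)$")

# Hypothetical citation for illustration only.
SAMPLE = """TY  - DATA
AU  - Doe, Jane
TI  - Example Study
PY  - 2010
ER  - 
"""


def parse_ris(text):
    """Return a list of records, each a dict mapping tag -> list of values."""
    records, current = [], {}
    for line in text.splitlines():
        match = TAG_LINE.match(line)
        if not match:
            continue  # skip blank lines and anything that is not a tagged field
        tag, value = match.group(1), match.group(2).strip()
        if tag == "ER":  # ER marks the end of a reference
            records.append(current)
            current = {}
        else:
            current.setdefault(tag, []).append(value)
    return records


print(parse_ris(SAMPLE)[0]["TY"])  # ['DATA']
```

If the TY value is DATA but your software shows another reference type, the mismatch is in the software's import filter rather than in the exported file.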

What fields were dropped in the revised MARC?

We removed the following elements when we switched from distributing MARC to MARC21 XML:

  • 035a: OCLC id
  • 490: series name
  • 505a: dataset names
  • 516a: number of datasets
  • 518a: time period
  • 522a: geographic coverage
  • 536a: funding agency
  • 536c: grant number
  • 567a: universe
  • 650/651: LC subject headings added by U-M Libraries

Sep 13, 2010

I need to deposit data in an archive as a requirement for a journal article and obtain a DOI. How do I do that?

ICPSR's Publication-Related Archive (PRA) is a self-archiving mechanism that facilitates the deposit of data supporting publications. Go to the ICPSR Deposit Form and select PRA as the archive for deposit. Then follow the instructions provided for entering descriptive metadata and uploading the data. Upon publication of the data you will receive a notification that includes the Digital Object Identifier to the data, which you can include in your article. If you need a DOI sooner, please contact deposit@icpsr.umich.edu to make a special request.

Note that the PRA requires a connection between the data and the published article, so please be sure to include in the metadata you provide the article’s citation and a DOI if available.

Sep 1, 2010

How do the batch export utilities work?

The developer utilities enable you to export your current search results to a standard format so that you can import the results into various software packages. Currently, we've enabled batch export of citations (as RIS or EndNote XML) so that you can import your search results into bibliographic software (perhaps to include in a report).

What other formats are available?

ICPSR is working on a variety of study-level exports, including DDI2, DDI3, Dublin Core, and RIS/EndNote XML (for the citation of the study itself). We plan to have these done by the end of 2010.

How can I get more than 500 results in the batch export?

We had to limit how much one could export in a single request, because search engine crawlers were slowing down our site by repeatedly hitting export links that placed a heavy load on the server. Thus we instituted the 500-result cap on batch exports. To get around that, just do the following:

  1. Copy the link of the export you want.
  2. Paste the link into a new browser window.
  3. Add the following text to the end of the link: &paging.startRow=501
  4. Press "return."

This will cause the export to have results 501-1000. Adjust the number to get more. We could add a drop-down menu to enable this, but then the search engine crawlers would ping it repeatedly.
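The steps above amount to setting one query parameter on the copied link. As a sketch, the helper below does that for any link, replacing paging.startRow if it is already present. The function name and the example.org link are placeholders, not real ICPSR export links.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_start_row(link, start_row):
    """Return the link with paging.startRow set to start_row."""
    parts = urlsplit(link)
    # Keep every existing parameter except an earlier paging.startRow.
    query = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key != "paging.startRow"
    ]
    query.append(("paging.startRow", str(start_row)))
    return urlunsplit(parts._replace(query=urlencode(query)))


# Placeholder link standing in for a copied export link.
print(with_start_row("http://example.org/export?q=test", 501))
# http://example.org/export?q=test&paging.startRow=501
```

Calling it with 1001, 1501, and so on yields the links for the later increments.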

Aug 9, 2010

Is the longitudinal panel data available for Monitoring the Future?

The information below comes directly from Monitoring the Future. Please refer to their Web site for more information.

  1. A subset of high school seniors are selected each year for follow-up, which is conducted in an alternating biennial fashion, with the first half of the subset receiving their first follow-up questionnaire one year after high school, and the second half receiving their follow-up two years after high school. They receive a series of six questionnaires within this arrangement, so the second half of the subset is 12 years past high school when they receive their last young adult "FU-12" questionnaire. Then, the follow-up procedure changes to 5-year intervals to cover middle adulthood.
  2. The questionnaires in the young adult follow-ups are directly comparable to the base year questionnaires, both in content and in numbers of questionnaire forms. The core drug use questions are included along with the same types of related attitude and behavioral items, many of which are unique to each form, so respondents receive the same questionnaire form throughout the base year and young adult follow-up series.
  3. All data for a particular individual are linked (or, in the case of form-specific items, capable of being linked) in the panel dataset. The sheer amount of information greatly increases the risk of breaching confidentiality. Thus, based on policies approved by our funding source and IRB, the panel data set cannot be made available to the public in totality and without modification.
  4. Special data requests can be made through the Web site email address. Once we get a request, information about policies and procedures is sent out. Requests are considered on a case-by-case basis, and may be fulfilled - at requestor's cost - typically by providing data analytic access.
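The follow-up timing in item 1 can be worked through with a little arithmetic. This sketch assumes a strict two-year spacing between the six young adult questionnaires, which is our reading of "alternating biennial"; the function name is our own.

```python
def follow_up_years(first_year, waves=6, interval=2):
    """Years past high school at which each follow-up questionnaire arrives."""
    return [first_year + interval * wave for wave in range(waves)]


print(follow_up_years(1))  # first half:  [1, 3, 5, 7, 9, 11]
print(follow_up_years(2))  # second half: [2, 4, 6, 8, 10, 12]
```

The last value for the second half is 12, which matches the "FU-12" questionnaire described in item 1.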

Additional information about the design of the panel component and the procedures used in the study is included in our annual NIDA report, Volume II, and in more detail in the MTF "Occasional Papers." See, for example, "The Aims and Objectives of the Monitoring the Future Study and Progress Toward Fulfilling Them as of 2006" (pdf).

To make a request for this data and for further information, please contact MTF staff at: MTFinfo@isr.umich.edu

What are Quick Tables?

Quick Tables are streamlined data analysis tools that allow you to produce analytic tables by choosing from among pre-selected high-interest variables in drop-down menus. Currently, Quick Tables are available for the following series: HBSC, NSDUH, TEDS-A, and TEDS-D.

Aug 5, 2010

What enhancements are available when using SDA?

SAMHDA recently upgraded the Survey Documentation and Analysis (SDA) system to version 3.4. All of the previous statistical procedures are still available. Users who prefer to use the original interface may still use it by selecting the link entitled Use Classic Interface in the upper-left corner of the screen. SDA 3.4 improves the calculation of statistics for complex samples in the TABLES and MEANS programs.

For the TABLES program, enhancements include:

  1. Corrections to the calculation of standard errors and confidence intervals.
  2. Addition of Rao-Scott F-tests.
  3. Ability to display weighted or unweighted N of cases.
  4. Option to set the number of decimals for all statistics.

For the MEANS program, enhancements include:

  1. Corrections to the calculation of the standard errors and confidence intervals.
  2. Option to display the p-value of each difference from the cells in a base row or column.
  3. Default reporting of the weighted N of cases in each cell for weighted analyses.
  4. Option to include charts in output.
  5. Optional diagnostic table for design variables.

Further information is provided in the SDA Manual (SDA 3.4).

SDA 3.3, released in June 2009, contained the following changes to the analysis programs and features:

  1. Disclosure Protection: SAMHDA now has the ability to suppress output that may compromise the confidentiality of survey respondents by applying disclosure protection rules to a data file. Analysis programs, including RECODE and COMPUTE, now check for the presence of disclosure rules and enforce them. Disclosure rules may be specified to: a) prevent an analysis from being run; b) suppress the output after running an analysis; and c) suppress the unweighted number of cases from being reported in the output. The SDA 3.3 Documentation for Disclosure provides greater detail on the disclosure rules that may be specified.

  2. List Created Variables - View Button: The output from the listing of recoded and computed variables now includes a "View" button that provides access to descriptions of the variables. This feature can be accessed under the SDA Create Variables menu.

  3. Title: A title or label can be entered for each analysis request and will appear at the top of the HTML output produced by SDA analysis programs.

  4. Customized Subset: This procedure has also been revised in that recoded and computed variables may now be included in a subset. If pre-set selection filters have been defined by SAMHDA, these filters now apply to the interactive version of the subset procedure as well as to the analysis programs. A Comma Separated Values (CSV) file is available for output.

Content adapted from the SDA Manual (version 3.3).

For further information on SDA, please select the Getting Started button located in the upper-right portion of the screen or visit the SDA Tutorial.

What are the main components of the SDA interface?

In 2008 SAMHDA upgraded the appearance of the Survey Documentation and Analysis (SDA) system. The new interface allows users greater navigational ability within SDA. All of the statistical procedures are still available as they were previously. Users who prefer to use the original interface may still do so by selecting the link entitled Use Classic Interface in the upper-left corner of the screen. For further information about SDA, please select the Getting Started button located in the upper-right portion of the screen.

Users no longer need to open the codebook in a separate browser to view a list of variables. Also, users can now switch between the various statistical procedures without having to return to the main analysis page. These improvements are possible because the screen splits into the following four windows:

  1. Program Selection Window. Select from programs to perform analysis, create or recode variables, download the dataset or a customized data subset, view the codebook, or view the help file Getting Started.

  2. Variable Selection Window. The buttons within this window change depending on the type of analysis selected. Variables are selected and placed into the box. The user can then specify which analysis field the variable should go into (i.e., row or column for a crosstabulation, independent or dependent for a regression, or used as a control or filter variable). Users can also obtain a frequency table and accompanying question text for that variable by selecting the View button.

  3. Variable Tree Window. All variables and variable labels are listed and organized as they appear within the codebook, into groups with headings and subheadings. Click on the +/- boxes next to the heading to view all the variables within a selected group. When a variable of interest is located, select it and the program will place it into the variable selection window.

  4. Analysis Window. This screen will display the required and optional fields for the type of analysis you have selected. This screen looks identical to the classic interface for each of the analytic features available.

How do I find a study by Principal Investigator?

Perform a keyword search in the "Study descriptions" tab. Then you can use the "Filter by Author" facet in the right column.

Can I select multiple datasets for a download? What about multiple stat packages?

For every study in the archive you have the option of either downloading all of the files or selecting individual files. Use the option of downloading all files if you want more than one part (dataset) of a multiple part study, or if you want to download the study into multiple statistical packages. There is not a way to "cherry pick" datasets or statistical packages. You cannot select datasets 1 and 3 with a single click. Similarly, you cannot select SAS and Stata, but not SPSS. Once you have downloaded the entire study you can then select which individual files to extract from the zip file provided.
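Once the full zip file is on your machine, picking out only the files you want can be scripted. The sketch below selects members by file extension; the archive is built in memory with made-up file names, since the actual layout of a downloaded study is not described here.

```python
import io
import zipfile


def select_members(names, suffixes):
    """Return the archive member names that end with one of the suffixes."""
    suffixes = tuple(s.lower() for s in suffixes)
    return [name for name in names if name.lower().endswith(suffixes)]


def extract_selected(zip_source, suffixes, dest="."):
    """Extract only the matching members of a zip file into dest."""
    with zipfile.ZipFile(zip_source) as archive:
        wanted = select_members(archive.namelist(), suffixes)
        archive.extractall(path=dest, members=wanted)
    return wanted


# Demonstration with an in-memory archive and hypothetical file names.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    for name in ["study/DS0001/data.dta", "study/DS0001/data.sav",
                 "study/DS0001/data.xpt", "study/codebook.pdf"]:
        archive.writestr(name, "")

with zipfile.ZipFile(buffer) as archive:
    stata_and_docs = select_members(archive.namelist(), [".dta", ".pdf"])

print(stata_and_docs)  # ['study/DS0001/data.dta', 'study/codebook.pdf']
```

For a real download, extract_selected("path/to/study.zip", [".dta", ".pdf"]) would write just those files to disk, with the path replaced by wherever you saved the zip.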

What is SDA and why should I use it?

The Survey Documentation and Analysis (SDA) system allows users to conduct statistical analysis quickly and efficiently on the Internet using their Web browser. It was developed by the Computer-assisted Survey Methods Program (CSM) at the University of California at Berkeley. The SDA system is capable of performing a wide range of statistical analyses from bivariate crosstabulation to multiple regression and analysis of variance. The system allows users to design and implement custom recodes as well as generate subsets of data for download and analysis with traditional statistical applications.

For an overview on how to analyze data online, please consult our SDA Tutorial.

Additional information about SDA and its capabilities can be found in the SDA online documentation from Berkeley.

What is faceted searching? How does it work?

With the new SAMHDA Web site, SAMHDA has improved searching of its data holdings by using SOLR. SOLR offers the following advantages:

  • Faceted searching
  • No more limit of 500 results
  • Date searching of multiple fields
  • Same search rules for data holdings, bibliography, and variables database

Faceted searching alone is a significant improvement.

  • Easy to shift between refining and expanding search results
  • Unlikely to hit "no results found" page as facets provide an indicator of the size of your result set
  • Seamless integration with keyword searching

The SAMHDA site features two types of searches: variables and study descriptions. By default, you search the variables in the studies.

Variables

screenshot

In the screen above, I've done a search on "methamphetamine." The results page lists all the studies that have variables on methamphetamine, sorted by the number of matching variables. I'm not sure I want to limit myself to one study just yet, so I'm going to click on "Find matching variables in all studies" atop the page to return a list of all variables.

screenshot

This returns nearly 2300 variables, which is a bit more than I want to page through. Looking to the right, I can see several facets for narrowing my results. Since I'm interested in looking at relatively current data, I select "2000-2009" under "Filter by Time Period." You'll note that it provides an indication of how many results I'll find. I've also sorted by "Time Period (newest)" to pull the most current variables to the top of the list.

screenshot

Now that I've selected a facet, it no longer appears on the right, and I can see "time period:2000-2009" just above my search results. If I were to click on the "X" next to it, it would re-execute the search, removing that particular term. Thus I can narrow my search quickly by selecting a link to the right, or expand my search by removing a previously selected filter/facet. I can see a really good candidate in the list: "Ever used methamphetamine."

screenshot

Clicking on the variable label takes me to a screen with additional information, including the full question text, responses, and frequencies.

screenshot

Scrolling further down the page, I can see additional options. The option "view the study home page" will take me to the main entry point for the study, where I can read more on the sample, download the data files (in SAS, SPSS, or Stata format), and perform online analysis.

Study Descriptions

You can also search the study descriptions if you're looking for a particular study, investigator, or agency. If you click on the "Go" button without entering a search term, it automatically returns all studies.

screenshot

Thus I can see that SAMHDA has 134 different studies. Looking to the right, I can see more facets, some of which are not available for variables. The subject facet gives a broad description of SAMHDA's holdings in this case. The "more" link expanded the facet from the top 5 terms to the top 15. A "less" link appears if you want to shrink the list again. There's a "view all" link at the bottom of the list, which I'll now follow.

screenshot

The "view all" link returns a comprehensive list of all the subject terms assigned to SAMHDA studies, along with a count of how many studies were thus tagged. The book icon enables you to go to the thesaurus entry for that term if you want to find broader, narrower, or related terms, and clicking on the word itself performs a search for that term.

screenshot

I've selected "cocaine," which returned 29 results. The subject facet no longer appears, and so other facets have risen to the top. I can use the other facets to narrow my results further by geography, time period, investigator, series, or recency (added to the site).

Jul 26, 2010

I attempted to download a study and saw a message that I "had no saved files." Why can't I download the files?

This error can occur in the following situations:

  1. You may have double-clicked the download icon on the previous page. In this case, the first click should have offered you the chance to download your files and, consequently, cleared your saved files. The second click generated this error message. In spite of the error message, the files were probably successfully downloaded. If not, please try again without double-clicking.

  2. You are attempting to download member-only data and are not flagged as being part of a member institution, and this study is atypical in terms of how the documentation files are archived. On the study home page, you should have seen a note under "Access Notes" that stated, "These data are available only to users at ICPSR member institutions; you are not at a member institution. Thus you may only download the documentation files." You will need to use the "Download documentation files" option.

If you believe you received this message in error, please contact web-support@icpsr.umich.edu.

May 26, 2010

My data are collected from very vulnerable populations. How can I prevent these data from being used to portray them in an injurious way?

It is the policy of ICPSR that responsible science, which includes appropriate analytic methods and peer reviewed venues for research results, is adequate to protect vulnerable populations from inappropriate, unfair, and inaccurate portrayals. In order to participate in a valid scientific discussion of the issues that face vulnerable populations, researchers must be willing to share their data and methods in an ethically responsible manner with other researchers who wish to replicate or refute their findings. One must be willing to trust the peer review process to screen out analyses that do not conform to methods appropriate to the question at hand. ICPSR is strongly committed to protecting vulnerable individuals from being identified by data analyses, but the scientific process must be used to protect vulnerable populations from inaccurate representations.

I don't mind depositing the baseline study from my longitudinal data system, but is it possible to delay release of subsequent waves of data?

The utility of longitudinal studies lies primarily in the follow-up embedded in the research design. While the baseline data will be valuable in the short run, NAHDAP will work with depositors on a time frame for acquisition and release of the subsequent waves of data. Without a reasonable time frame, baseline studies will not be acquired for the NAHDAP. Depositors can work with NAHDAP staff to develop a method for acquiring and releasing the additional waves under a delayed-dissemination agreement. These agreements allow the subsequent waves to be acquired and prepared but not released for secondary analysis until the appropriate time.

Can my data be embargoed until I or my research team finish all our planned analyses?

ICPSR has a delayed-dissemination policy that allows researchers to deposit data earlier in the research process so that they may benefit from the data and documentation preparation services offered by staff. Delayed-dissemination contracts require depositors to commit to a timeline, which is usually two years from deposit to data release. Depositors have access to ICPSR files as soon as they are prepared and need not wait for the public release. They must, however, be willing to commit to the timeline for release.

Is it possible for me to read and approve research proposals based on my data? I wish to determine the nature of the research done with my data.

The policy of ICPSR is that responsible use of secondary data should be unfettered by the research agenda of the original data producer. When the data are distributed under restricted-use contracts, a research proposal is required in order to screen users for a credible research agenda and to ascertain whether the data will meet their research needs. The proposal, however, is screened only by the contract administrator at NAHDAP.

If I deposit data with NAHDAP, who owns the data?

ICPSR asks only for the right to redistribute the data; it does not acquire or retain the original copyright or transfer rights. To download data, ICPSR users must sign a terms-of-use agreement that includes a clause preventing the redistribution of the data for commercial purposes. The original owner of the data, which is usually the university or not-for-profit that received the grant or contract, retains copyright and other legal rights associated with the data.

In the informed consent documents, I promised the data would only be used by an approved research team. How can I now share my data with others?

Unless the informed consent document names the members of the research team specifically, an amended Institutional Review Board application that includes a plan for data protection and dissemination can be filed with the lead institution to define the research team. Restrictive informed consent documents may prevent the release of data in purely public releases, but do not preclude the possibility of a research team that is defined by a group of restricted- or limited-use contract holders. The research team may be defined as those persons known to the original researchers. In the case of restricted-use or limited-use contracts, the researchers using the data are known to ICPSR and to the original research team.

My data are on very sensitive topics; the risk to participants is very high should they be re-identified. How can I protect the respondents?

ICPSR evaluates all data files for disclosure risk using state-of-the-art techniques developed under a grant from the National Institutes of Health. From this evaluation, staff recommend a method of data release that protects the respondents from re-identification while retaining the analytic utility of the data. Release options include public release; public release with disclosure control practices put in place; restricted release with a user contract; enclave-only release; and online analysis only, with no micro-data download. A full public release is warranted only when there is little risk of re-identification or the data have been sufficiently transformed to substantially reduce that risk.

My data are very complicated. I am not sure users will be able to use the data. Will NAHDAP staff provide user support?

ICPSR has three levels of user support. Our central email and telephone service uses help desk software to track and prioritize all user support inquiries. Technical questions about data downloading and software issues are answered by tier 1 support staff. Questions about specific data files are sent to the NAHDAP staff who prepared the data for release, who provide tier 2 support on data content and structure. The NAHDAP director and manager provide more sophisticated tier 3 support for complex technical questions. Depositors will not be expected to provide ongoing user support, but rather to provide all the documentation necessary for secondary data users to make sense of the original data collection. The ICPSR archival collection includes many very complex data systems that have been successfully analyzed by responsible researchers.

My data/documentation are not in a format that can be released to secondary users. How do I find the resources to prepare it for broader distribution?

The National Institute on Drug Abuse has funded the National Addiction & HIV Data Archive Program (NAHDAP) to assist grant recipients in preparing data for release. NAHDAP staff will help clean and prepare data files, metadata, and documentation in consultation with the grant staff. NAHDAP is built on the infrastructure of the Inter-university Consortium for Political and Social Research (ICPSR), which is designed to easily create standardized, digitally stable data files, and to disseminate SAS, Stata, and SPSS files and searchable PDF codebooks and documentation. The staff of NAHDAP will standardize the data and documentation with input from the original data producers.

Why is sharing data useful to me? Why should I share data that I have worked very hard to collect and analyze?

While data sharing is primarily useful for expanding scientific knowledge, it also provides benefits for individual researchers. Data systems that are in the public domain often generate additional research that is credited to the original source. For instance, the National Longitudinal Study of Adolescent Health, which has been in the public domain since its inception, has generated over 3,000 publications in the last 20 years authored by persons not on the original research team. In addition, data citation practices and the norms of scientific practice have changed substantially in the past 20 years, so that the production of data is now considered a scholarly pursuit. A 2009 committee report by the National Academy of Sciences emphasized the emerging role of data sharing both in science and in the careers of scholars.

Apr 20, 2010

What is the importance of the Integrated Fertility Survey Series?

Though researchers from a broad range of disciplines have produced a large body of research on patterns in families and fertility, the ability to make comparisons over time -- a central task for understanding family change -- has been constrained by difficulties in using multiple datasets to perform time-series analyses. Such difficulties include changes in respondent universe, weighting procedures, imputation protocols, question wording, and variable availability across studies. This is especially true when attempting to include surveys from the earlier years. The IFSS project attempts to address these limitations by developing a data set that allows for comparisons across longer periods than were previously feasible. The primary aim of the IFSS is to establish a harmonized set of data and documentation across ten nationally representative surveys of fertility and family. It is expected that harmonization of common variables across multiple surveys will allow researchers, policymakers, students, and other constituencies to make comparisons across time.

Apr 19, 2010

What is harmonization? Why harmonize variables?

Harmonization is a process by which variables are made comparable across survey years. Harmonization schemes must be developed for each IFSS variable so that comparisons can be made across time. In general, harmonization in the IFSS project involves combining, into a single variable, information covering comparable substantive ground but from different files in the original GAF, NFS, and NSFG data sets.

Differences in question text, sample design, and respondent universes further complicate the harmonization process, and these can vary by survey and by survey year. For example, the IFSS data set will contain a variable for ever having used contraception. In the 1970 National Fertility Survey, the question is asked of all respondents in the sample. However, in the 1965 National Fertility Survey, the universe is smaller; it contains only respondents who were not pregnant at the time of the survey interview. For other variables, differences in universe are not observed. For example, respondent's age is asked of all respondents in all ten component surveys; therefore, respondent's age poses no universe problems for those using the IFSS data set. Given the subtle variation in variables across surveys, the IFSS staff must exercise care in evaluating variable comparability.

To assist users of the IFSS data, IFSS staff are developing comprehensive variable documentation that will address important comparability notes across variables and studies. Especially serious, for example, are comparability problems that are not evident from the coding structure, including alterations in the survey question wording and changes in the variable universe. Such documentation will include notes about changes in respondent universe, question design, data collection instrument design, and other information necessary for informed use of IFSS data. The IFSS project staff will seek to make transparent all decisions made in the harmonization process so researchers can choose whether to use the harmonized version or to create one of their own.
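
As a rough illustration of the kind of harmonization described above, the following Python sketch maps a source-specific "ever used contraception" item onto a single harmonized variable while respecting each survey's respondent universe. It is not the IFSS staff's actual procedure. The survey labels, field names (ever_used_raw, pregnant_at_interview), and numeric codes are hypothetical and are not taken from the NFS codebooks. Only the universe rule (the 1965 question was asked only of respondents who were not pregnant at interview, while the 1970 question was asked of everyone) comes from the text.

```python
from typing import Optional

# Hypothetical source codings for "ever used contraception".
# These numeric codes are invented for illustration only.
SOURCE_CODINGS = {
    "NFS1965": {1: "yes", 2: "no"},
    "NFS1970": {1: "yes", 5: "no"},
}

# Hypothetical harmonized coding shared by all surveys.
HARMONIZED_CODES = {"yes": 1, "no": 0}


def in_universe(survey: str, record: dict) -> bool:
    """Return True if the respondent was asked the question in this survey."""
    if survey == "NFS1965":
        # Per the text: the 1965 universe excludes respondents who were
        # pregnant at the time of the interview.
        return not record.get("pregnant_at_interview", False)
    # Per the text: the 1970 question was asked of all respondents.
    return True


def harmonize_ever_used(survey: str, record: dict) -> Optional[int]:
    """Return 1 (yes), 0 (no), or None (out of universe or uncodable)."""
    if not in_universe(survey, record):
        return None
    label = SOURCE_CODINGS[survey].get(record.get("ever_used_raw"))
    return HARMONIZED_CODES.get(label)


if __name__ == "__main__":
    sample = [
        ("NFS1965", {"ever_used_raw": 1, "pregnant_at_interview": False}),
        ("NFS1965", {"ever_used_raw": 1, "pregnant_at_interview": True}),
        ("NFS1970", {"ever_used_raw": 5}),
    ]
    for survey, record in sample:
        print(survey, harmonize_ever_used(survey, record))
```

Returning None for out-of-universe respondents, rather than coding them as "no", keeps the universe difference visible to the analyst, which is the comparability concern the documentation is meant to flag.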

Apr 18, 2010

How are variables selected to be harmonized? Are all variables harmonized across all data sets?

In consultation with the IFSS advisory panel -- composed of distinguished experts in fields as diverse as demography, economics, public health, survey methodology, and sociology -- variables will be selected for inclusion in the IFSS data set on the basis of their expected interest to social science researchers, graduate students, policymakers, and other constituencies. Not all variables in all surveys will be harmonized. In most cases, variables selected for inclusion in the IFSS data set will have comparable equivalents across three or more surveys.