Feb 22, 2007

How do I deposit a data collection with HMCA?

We recommend that you use ICPSR's Web-based, electronic Data Deposit Form to securely upload data collected under grants from the Robert Wood Johnson Foundation.

The electronic form enables the depositor to describe the data collection being deposited and to designate others to access and add information to the form if desired. It does not need to be completed in one session. If you encounter any technical issues with the form or have questions, please contact deposit@icpsr.umich.edu.

Alternatively, if you cannot or prefer not to use the electronic procedure, please complete the ICPSR Data Deposit Form (PDF 61K - Word 47K) and mail it, together with the data and technical documentation, to one of the addresses shown below. Submit your data files and any electronic documentation on removable media such as CD-ROMs. Do not send your collection in an e-mail attachment or by FTP.

Address for shipment via U.S. Postal Service:

Alon Axelrod
HMCA/ICPSR
University of Michigan
P.O. Box 1248
Ann Arbor, MI 48106-1248

Address for shipment by carriers that require a street address, e.g., UPS:

Alon Axelrod
HMCA/ICPSR
University of Michigan
330 Packard Street
Ann Arbor, MI 48104

For information on preparing your data collection so that it is optimally useful for secondary analysis, see the ICPSR Guide to Social Science Data Preparation and Archiving, 4th Edition (PDF 2MB).

How can I obtain an HMCA data collection?

All HMCA data, setup, and documentation files are freely available to the general public. Anyone can download all of the public-use files in any HMCA collection from our Web site.

How can I search the holdings of HMCA?

The main HMCA search engine scans the information in the study descriptions, the abstracts that summarize the main points of each HMCA collection. Searches can be performed on complete study descriptions or can be restricted to study titles, ICPSR study numbers, or names of principal investigators.

A successful search, using one of the methods described below, displays the title, name of the principal investigator(s), ICPSR study number, and the latest publication date of each collection found by the search. Also shown for each match are three links labeled "description," "download," and "related literature." Clicking on "description" displays the information in the study description, while selecting "download" transfers you to a page from which the study's files can be browsed and downloaded. The "related literature" link displays the citations in the collection's bibliography of data-related literature.

One can search the holdings from the HMCA home page or from a separate search page:

Searching from the home page. Below the heading "Quick Search" you will see a text entry box, a pull-down menu, an option to search either by "words" or "phrase," and "search" and "clear" buttons. Use the pull-down menu to select the search mode: by title, by study number, by investigator, or by complete abstract (study description). Next, type the text you are searching for in the text entry box. If the text you entered contains two or more words (e.g., "Community Tracking Study," "mental health," "rural physician"), select either the "word" search option to search for any word in the text, or the "phrase" search option to search for the exact combination of words that you typed. Finally, to execute the search, click the search button. Clicking the clear button will empty the text entry box and reset all the options you selected.

To list all of the collections in HMCA, leave the text entry box empty and click the search button.

Searching from the search page. Click on "Search/Download Holdings" on the home page to display the search page, which is divided into four sections. At the top of the page, the "Search by Keyword" section offers the same search features as "Quick Search" on the home page. Just below in the "Browse by Subject" section, one can list the data collections in each HMCA thematic category by clicking on the name of the category. In the third section, the link "Recent Updates and Additions" displays a list of all studies that were revised or added to HMCA in the last 90 days. Finally, from the "Bibliography of Data-related Literature" section, one can type in keywords to search the citations in the ICPSR Bibliography of Data-Related Literature and list the studies related to each citation.

What are the components of a data collection in HMCA?

A data collection comprises one or more data files, plus technical documentation that describes the data. SAS, SPSS, and/or Stata setups are included with many collections.

Data files are often provided in multiple data formats. Every data file is supplied as an ASCII text file and, for many collections, in at least one other format as well, such as Stata files, SPSS portable files, and SAS transport files generated by the SAS XPORT engine or SAS CPORT procedure. SPSS portable and SAS transport files are the most common data formats besides ASCII.

Technical documentation typically includes the following:

  • study description that summarizes the collection
  • file manifest
  • bibliography of related literature
  • description of the study's methodology
  • data collection instrument(s)
  • data map/record layout of the ASCII data file(s)
  • variable descriptions
  • univariate frequencies (for most collections)

Study descriptions, file manifests, and bibliographies of related literature are presented as separate files. Other components of the documentation may be bundled in a single file or distributed among multiple files. Documentation files are provided in Portable Document Format (PDF) and/or as ASCII text files.

The setups, which usually contain complete variable and value labels and often include missing value declarations or recodes, can be used to create software-specific system files (e.g., SAS datasets) from the ASCII data files.

How are the holdings of HMCA organized?

The data collections in HMCA are organized according to their main subject matter, using a five-category classification scheme:

  1. Health Care Providers
  2. Cost/Access to Health Care
  3. Substance Abuse & Health
  4. Chronic Health Conditions
  5. Other

Feb 19, 2007

Where do I get information about the names and meanings of variables?

All variable information may be found in the study's codebook(s). For survey data, the codebook usually includes the question text used.

Are there advantages to creating a MyData account?

Yes. The features of MyData are described in the FAQ "What is MyData?"

How will the research community know that my data are available from ICPSR?

The addition of new and updated data collections is announced on our Web site. There is also an email list that receives notification of new and updated data collections on a regular basis, usually weekly.

Feb 13, 2007

How do I use a Stata setup file to import ASCII data?

Setup files contain syntax or program code to read columnar ASCII data into a statistical package. The instructions below demonstrate how to use Stata setup files. Please note that while the examples and illustrations that follow depict Stata in a Windows environment, the steps and procedures are platform independent. Please see the appropriate Getting Started With Stata manual for operating system details.

Getting Ready to Use Stata

These instructions assume that you have already downloaded and decompressed the ASCII data and the Stata setup files from our Web site.

Be sure to note the exact location of the uncompressed files extracted from the file you downloaded from ICPSR, as you will need to enter that information in one of the setup files.

The Stata Setup Components

There are three Stata setup components:

  1. A columnar ASCII data file

  2. A dictionary file which defines the elements of the data file.

    For more information about Stata dictionaries, please refer to Stata's online help or Reference Manual Set:

    • Online Help

      • help infile2

      • help infix

    • Stata Reference Manuals

      • [R] Infile

      • [R] Infile (fixed format)

      • [R] Infix (fixed format)

  3. A Stata do-file, which contains Stata's processing instructions to import and save the data in Stata's system format.

Figure 1 displays these three files for ICPSR 6399, Homicides in Chicago, 1965-1995:

Figure 1: Stata do-file

Note that the files are located on the D drive in a folder titled "homicide". Elsewhere in this document, we refer to this address (D:\homicide) as the path.

Also note that the file extensions (.txt, .dct, and .do) are visible. If you are using Microsoft Windows, certain file extensions may be hidden. To set Windows so that the full filename, including extensions, is visible, select Tools and then Folder Options from the Windows directory menu. At the dialog box, click on the View tab and ensure that the toggle for "Hide extensions for known file types" is not checked, as shown in Figure 2.

Figure 2: Windows dialog box

The remainder of these instructions assumes that the necessary files were decompressed to D:\homicide.

Using the Stata Setup Files

The Stata setup files prepared by ICPSR are designed to import a columnar ASCII data file into a Stata system file and apply appropriate variable-level metadata, such as labels for variables and variable values. These setups are designed to work across platforms with any recent implementation of Stata.

Of the three files shown in Figure 1, only the do-file (06399-0002-Stata_setup.do) requires editing. To edit, open the file in a text editor that is capable of saving output in plain ASCII text format.

Please take note of the following caveats:

  • Stata is packaged with an editing utility, the do-file editor. Most setup files contain a header that describes the contents of the file; an example of a do-file header is shown in Figure 3. Once you have opened the setup file in your editor, read the header, if present, for important information about what the file contains.

    While the do-file editor is an adequate tool for small files, many of the Stata setup files exceed its size limit. The do-file editor is limited to files that are 32K or smaller; it cannot open larger files.

    Figure 3: do-file editor header

  • Any text editor capable of working with and saving plain ASCII text files is sufficient.

    Note that a text editor differs from a word processor. Word processors like Microsoft's Word or Corel's WordPerfect save files in proprietary formats. Stata cannot interpret files saved in those formats; Stata can only interpret ASCII text files.

    For more information about common text editors that work well with Stata, please see the FAQ "Some notes on text editors for Stata users" maintained by Boston College or Wikipedia's Text Editors page.

  • If a setup file is too large for the do-file editor and an alternative text editor is unavailable, word processors can be used for editing. However, please be sure to set the output format to plain text. Figure 4 shows how to select plain text format.

    Figure 4: selecting plain text format

Editing the Stata setup file

The setup file contains five distinct sections.

  • Section 1 defines filenames and locations.
  • Section 2 reads the raw data into memory in a Stata system format.
  • Section 3 applies value labels to appropriate variable values.
  • Section 4 recodes numeric missing value codes to Stata's system missing value.
  • Section 5 saves the dataset in a Stata system format.

Figure 5 shows the first two statements of the Stata do-file.

Figure 5: first two statements of do-file

These statements define internal system settings:

  • set mem 9m: Assigns 9 megabytes of RAM to Stata to receive and store the data. Unlike SPSS or SAS, Stata stores the entire data array in the computer's RAM. If the amount of memory allocated to Stata is insufficient to read an entire file, Stata will terminate with an error, as shown in Figure 6.

    Figure 6: error message for insufficient memory

    In Figure 6, we tried to run st6399-0002-Stata_setup.do with only 1 megabyte of RAM allocated. Though there are 12,000 observations in the file, Stata was only able to read 6,392 into the memory space. Since it could not read the entire array into memory, the process terminated.

    The default set mem allocation (in this case, 9 megabytes) was specified by ICPSR to be large enough to accommodate the corresponding data file. Therefore, this number need not be adjusted.

  • set more off: Some Stata setup files contain thousands of lines. As the do-file runs, Stata displays or echoes these lines to the screen. If more is not set to off, the system will pause each time the screen buffer fills. Setting more to off allows the do-file to run until completion.
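
As a rough sketch of what the two statements described above amount to, the opening of a setup file would look something like the lines below. The comments are ours, and the memory value is the one quoted above for ICPSR 6399.

```stata
* Opening statements of the setup do-file (sketch)
set mem 9m      // allocate 9 megabytes of RAM for the data
set more off    // do not pause each time the screen fills
```

As far as we know, Stata 12 and later manage memory automatically and simply ignore the set mem statement, so it can be left in place.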

Section 1: File Specifications

This section defines paths and filenames.

The setup files use Stata macros, a programming feature. A local macro acts as a temporary storage container, or alias, for a string of text characters. Once defined, the contents of this container can be recalled at any time within the do-file by the reference `macro' (where macro is a placeholder for the actual macro name).

Please note that the macro reference uses a left quote mark [`], sometimes called a back-tick, on the left and an apostrophe ['] on the right. Therefore, to be clear, `macro' is not the same as 'macro'.

Next, it is necessary to declare the following three macros:

  • local raw_data -- the raw ASCII data file

    If all the files are in the default directory, only the filename need be entered between the double quotes.

    • Example: local raw_data "06399-0002-Data.txt"

    If files are not located in the default directory, then the path must also be specified.

    • Example: local raw_data "D:\homicide\06399-0002-Data.txt"

  • local dict -- the Stata dictionary file

    • Example without a path: local dict "06399-0002-Stata_dictionary.dct"

    • Example with a path: local dict "D:\homicide\06399-0002-Stata_dictionary.dct"

  • local outfile -- the filename you want to associate with the final Stata system file

    • Example without a path: local outfile "homicide.dta"

    • Example with a path: local outfile "D:\data\homicide.dta"

An example of Section 1: File Specifications correctly edited for files in the default directory is shown in Figure 7.

Figure 7: section file specifications
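
A minimal sketch of Section 1, assuming the files were decompressed to D:\homicide, might read as follows. The output filename homicide.dta is only an example; any valid filename will do.

```stata
* Section 1 sketch: file specifications for ICPSR 6399 stored in D:\homicide
local raw_data "D:\homicide\06399-0002-Data.txt"
local dict     "D:\homicide\06399-0002-Stata_dictionary.dct"
local outfile  "D:\homicide\homicide.dta"

* Recall a macro with a back-tick on the left and an apostrophe on the right
display "`raw_data'"
```

Local macros exist only while the do-file that defines them is running. If you run these lines on their own and the rest of the file later, the macros will be empty, so it is safest to run the setup file from top to bottom in one pass.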

Section 2: Infile Command

The infile command (see Figure 8) applies information stored in the dictionary (06399-0002-Stata_dictionary.dct) to the data stored in the data file (06399-0002-Data.txt) and stores the file in system memory in a format optimized for Stata.

Figure 8: infile command

The dictionary shown in Figure 9 defines each variable's starting column location, type, name, format, and label. The dictionary file should not need to be edited.

Figure 9: dictionary
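
The following sketch shows how the infile command and a dictionary fit together. It assumes the macros from Section 1 are defined in the same do-file. The dictionary excerpt uses hypothetical variable names and column positions, not the actual layout of ICPSR 6399.

```stata
* Section 2 sketch: read the ASCII data through the dictionary
local raw_data "D:\homicide\06399-0002-Data.txt"
local dict     "D:\homicide\06399-0002-Stata_dictionary.dct"

infile using "`dict'", using("`raw_data'") clear
```

A dictionary file of this kind has roughly the following shape. Each line gives the starting column, storage type, variable name, input format, and variable label.

```stata
infile dictionary {
    _column(1)   int   YEAR   %4f   "Year of incident"
    _column(5)   byte  SEX    %1f   "Sex of victim"
}
```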

Section 3: Value Labels

Section 3 defines value labels (if applicable) for numeric categorical variables (see Figure 10).

Figure 10: value labels for numeric categorical variables

The command #delimit ; changes the command delimiter from a carriage return (the default delimiter) to a semicolon. This allows value label definitions to span multiple lines. At the end of Section 3, the delimiter is reset to a carriage return with the command #delimit cr.
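
A minimal sketch of a multi-line value label definition is shown below. The variable SEX, the label name SEXLBL, and the codes are hypothetical, and the way an actual ICPSR setup file attaches labels to variables may differ.

```stata
* Section 3 sketch: define a value label across several lines
#delimit ;
label define SEXLBL
    1 "Male"
    2 "Female"
    9 "Unknown" ;
label values SEX SEXLBL ;
#delimit cr
```

The #delimit command works only inside do-files, so these lines must be run from a do-file rather than typed into the Command window.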

Section 4: Missing Values

Section 4, shown in Figure 11, recodes values defined to represent missing information from numeric codes to Stata's system missing value (.). ICPSR's processing conventions use numeric values to represent such information in the ASCII data, which ensures that information is not lost across statistical packages. Although Stata supports 27 distinct missing values (the system missing value . and the extended missing values .a through .z), the do-file programmatically recodes all missing-data codes to the single value (.), so distinctions among types of missing data are lost. For this reason, the section is commented out by default. To apply missing values, remove the comment delimiters (/* */) bracketing this section.

Figure 11: recoding values
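
As an illustration, the recoding in Section 4 takes roughly the form of the first statement below. The variable names and the codes 998 and 999 are hypothetical. The second statement is not part of the ICPSR setup; it is one possible way to keep different kinds of missing data distinct.

```stata
* Section 4 sketch: recode a numeric missing-data code to the system missing value
replace AGE = . if AGE == 999

* Alternative (our suggestion): map each code to its own extended missing value
mvdecode INCOME, mv(998=.a \ 999=.b)
```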

Section 5: Save Outfile

Section 5, shown in Figure 12, is the final section and saves the data to disk in a Stata system format. If the local outfile macro was specified correctly in Section 1, this step will occur automatically.

Figure 12: saving data in Stata system format

Once a Stata system file has been saved, it can be used for subsequent analysis sessions. There is no need to run the setup file again.
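
The save step and a later analysis session can be sketched as follows. The sketch assumes the data were read into memory by Section 2 and that the output file is D:\homicide\homicide.dta. The replace option, which overwrites an existing file of the same name, is our assumption and may not appear in the setup file itself.

```stata
* Section 5 sketch: save the data held in memory as a Stata system file
local outfile "D:\homicide\homicide.dta"
save "`outfile'", replace

* In a later session, load the saved file directly
use "D:\homicide\homicide.dta", clear
describe
```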

This work is licensed under a Creative Commons Attribution-Noncommercial 3.0 United States License.

Feb 7, 2007

How do I use the Subset analysis tool?

To select the subset tool, locate the Analyze and Subset Tool on the Study Description page. Click on the "Subset" option.

Using the Select Variable(s) for Subsetting dropdown list, pick the variables that will define your subset. The variable name you selected will appear with another dropdown menu. Select a subcategory of the variable from the dropdown menu to define your subset. In this example, we are choosing only Black respondents. You must also choose a statistical software package to which the data definition statements will be written.

Multiple variables can be used to define the subset; for example, you could select a sample of Black respondents who are under 18 years of age.

Additional variables can be added to the selection criteria by clicking the "Add Variables" button. Your previous choices will continue to be displayed. Highlight the subgroup of interest and click the "Subset Data" button.

Output from the subsetting procedure will be displayed on a subsequent page. For large files, the process may take a few minutes.

Variable and case counts will be displayed at the top of the page. The data file will be in uncompressed ASCII format (a file compression feature will be added soon). The codebook and data definition statements will be in text format. Please save these files to your own computer as soon as possible. If the download times out or fails to load, please contact us at icpsrmdrc@isr.umich.edu.

To analyze the subset, you will need to run the data definition statement in your statistical software of choice. You must change the data file's pathname to a location on your own computer.

How do I use the Sample Characteristics analysis tool?

To access the Sample Characteristics Tool for a given study, click the rightmost tab labeled "Sample Characteristics."

This will display univariate marginal case counts for three or four variables in the file. The numbers in the tables represent the unweighted number of cases from whom data were collected. This table is useful when an analyst needs to assess whether a particular study is suitable for comparative analysis.

For more detailed case counts, you can generate dynamic tables using the dropdown menus displayed just above the default tables. Row and column variables can be selected from a dropdown list of demographic variables from the survey. Case counts and percentages displayed in the tables are unweighted.

Both row and column variables can be chosen in the dropdown menus.

The display then provides row percentages and sample sizes for any combination of the variables chosen in the row and column menus, for example a crosstabulation of race by father's education.

Studies with multiple data files require the user to choose the part of the study of interest before sample characteristics are displayed.
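
If you have downloaded a study and read it into Stata, tables of the same kind can be produced locally. The sketch below is an approximation, not the code behind the Sample Characteristics tool, and RACE and FATHER_EDUC are hypothetical variable names.

```stata
* Univariate unweighted case counts for one variable
tabulate RACE

* Unweighted crosstabulation with row percentages in each cell
tabulate RACE FATHER_EDUC, row
```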

Feb 1, 2007

How can I get the non-restricted version of a restricted dataset?

In many cases there is no public version of the data. If our Web site does not display a public version, then only a restricted version has been authorized for release by the Principal Investigator.

The person who is listed as a contact on my campus is not available to help me. What can I do?

Our Need Help? page answers many general questions that data users may have. There are tutorials on various statistical packages and the most frequently asked questions about data usage. Also, ICPSR User Support can help with issues related to the data.

I'm getting a "file not found" error when trying to use a setup file. What is wrong?

"I am trying to use a setup file, but I keep getting a 'file not found' error. I have entered the drive specification, folder hierarchy, and filename for the data file correctly. What is wrong?"

A "file not found" error is sometimes caused by an incomplete filename in the setup statements: the file specification may be missing the filename extension. This often happens because Windows, by default, does not display filename extensions in Windows Explorer, My Computer, or subsequent dialog boxes, so users may not be aware that the extensions exist or need to be included.

Adding the correct extension to the file specification in the setup statements should correct the problem.

To ensure that you are presented with complete filenames, the default Windows filename display option for folders should be changed.

Go to Start > My Computer > Tools (the menu at the top) > Folder Options > View (the second tab)

Under "Advanced settings:", look for "Hide extensions for known file types". The check box for this option should be cleared. Users should then be able to see the filename extension when using either Windows Explorer or My Computer and add the extension to the file specification in the setup statements.
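
For Stata users, the file specification can also be checked from within Stata before running the setup file. The sketch below assumes the ICPSR 6399 files were decompressed to D:\homicide; substitute your own path and filename.

```stata
* Stops with a "file not found" error unless the file exists under exactly this name
confirm file "D:\homicide\06399-0002-Data.txt"

* List the contents of the folder with complete filenames, extensions included
dir "D:\homicide\*"
```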

What is the difference between an ASCII and a software-specific file?

An ASCII file is a plain text file consisting of numbers, letters, and symbols with no formatting. An ASCII file can be opened in any word processing program. It can only be analyzed, however, if it is read into a database, statistical, or spreadsheet software package. Data definition statements are necessary to read fixed-format ASCII files. See How to Interpret a Record from an ASCII Data File for more information.
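
To make the role of data definition statements concrete, here is a minimal Stata sketch that reads three fields from a fixed-format file. The filename, variable names, and column positions are hypothetical; in practice they come from the study's codebook or setup file. Under this layout, a record beginning 19952034 would be read as YEAR = 1995, SEX = 2, and AGE = 34.

```stata
* Read fixed-format ASCII data by column position (hypothetical layout)
infix int YEAR 1-4 byte SEX 5 int AGE 6-8 using "mydata.txt", clear

* Inspect the first few records to confirm the columns were read as intended
list in 1/5
```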

Software-specific files, such as SPSS portable files, SAS transport files, Excel spreadsheets, or Microsoft Word documents, are configured for use with their respective software packages. These files may be used with software other than that in which they were created, if the desired software allows the conversion. Users should keep in mind that some changes, especially regarding formatting, might occur in the translation.

When I open an ASCII codebook, it displays oddly and is difficult to read. What can I do to fix this?

Windows opens .txt files with Notepad by default. Notepad is a basic text editor that does not handle UNIX line breaks, so the lines of the codebook may run together. If you open the codebook with WordPad instead, it should display properly.
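
If Stata is available, its filefilter command offers another way to rewrite the line breaks. This is a sketch of an alternative to opening the file in WordPad, using hypothetical filenames, and it assumes the original file uses UNIX line breaks throughout.

```stata
* Copy codebook.txt to a new file, converting UNIX line breaks to Windows line breaks
filefilter "codebook.txt" "codebook_windows.txt", from(\U) to(\W)
```

The original file is left unchanged. Running the command on a file that already has Windows line breaks would double the carriage returns, so apply it only once.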

I'm using SPSS on a Mac. I can't read ASCII data files using the SPSS setup file I downloaded. What do I do?

SPSS for Macintosh Version 6.1.1 and earlier versions do not recognize UNIX linefeed characters. Macintosh users must change these characters manually before reading the file into SPSS. We suggest using a text editor, such as BBEdit, to read the data file, and then saving it as a Macintosh file. This will replace the UNIX linefeeds with Macintosh carriage returns that SPSS for the Macintosh can understand. BBEdit is available from the Bare Bones Software Web site. BBEdit and most other text editors are unable to read very large files, depending on the amount of memory available. For larger files, we suggest using text conversion software, such as TextToMac, that may be less memory-intensive. You can download texttomac1.2.hqx from the University of Michigan archive.

What is a data file?

A data file contains not the analyzed findings or statistics of a study, but the raw collected data from which such statistics are derived. It usually consists of rows and columns of alphanumeric characters. The majority of our data files are ASCII fixed-format files. The storage format of a data file may be logical record length, card image, or delimited. The physical structure of data files also varies and may be rectangular, hierarchical, or relational. Some data collections may also include data available in other formats, such as SPSS portable files or SAS transport files.

How can my organization become a member of ICPSR?

If your institution is not on the list of member institutions and you are interested in ICPSR membership, please see "How to Join ICPSR" or contact the ICPSR Membership Coordinator at netmail@icpsr.umich.edu.