SYLLABUS
QUANTITATIVE DATA ANALYSIS
Prof. Donald J. Treiman
University of California, Los Angeles (UCLA)
(UM-PKU Joint Institute, Beijing University)
(30 June - 23 July, 2008)
Treiman:
Office hours: by appointment; but, for at least the first two weeks, I expect to be available
most afternoons, Monday-Thursday, and often during the weekend.
Email: treiman@ucla.edu
Co-teachers:
Li Jianxin, email: ljx@pku.edu.cn
Ren Qiang, email: renqiang@pku.edu.cn
Zhou Hao, email: Zhouh@pku.edu.cn
Office hours to be announced.

INTRODUCTION
This is a course in how to do theoretically informed quantitative social research. By the end of the course you should have a fair idea of how to make sociological sense out of a body of quantitative data. Toward this end, we will cover a variety of techniques, including tabular analysis, log linear models for tabular data, regression analysis in its various forms, regression diagnostics and robust regression, ways to cope with missing data, logistic regression, fixed- and random-effect models, factor analysis and other techniques of scale construction, measurement error, and related topics. But this is not a statistics course; the emphasis will be on using these procedures to draw substantive conclusions about how the social world works. A prior statistics course is required. School algebra, either remembered or relearned, will also be needed to get through the course. (For any of you whose school algebra is rusty, good reviews can be found in Helen Walker, Mathematics Essential for Elementary Statistics, and W. L. Bashaw, Mathematics for Statistics.)
Because time is short (and 15 lectures is a very short time for what we have to cover), the course concentrates on data analysis and the way one links theory and data. We will focus on the quantitative analysis of data from representative samples of well-defined populations. The populations can consist of almost anything: people, formal organizations, societies, occupations, pottery shards, or whatever; the analytic problems are essentially the same. Data collection procedures will be essentially ignored; that is, they will be mentioned only in the course of discussions of data-analytic issues. One good way of learning about the practical details of data collection, for those of you who are still students, is to apprentice yourself (unpaid if necessary) to someone who is about to conduct a survey and insist that you get to participate in it step-by-step even when your presence is a nuisance.
The core of the course is a series of exercises. The exercises are difficult and time consuming. Keeping up-to-date is crucial: you must understand the previous material in order to follow what comes next. So it is imperative that you do each set of exercises completely and on time. Exercises are due in class the day after they are assigned; they will be read and returned the following day. Late exercises will not be accepted. After the first few days you will be doing a good deal of analysis using a major U.S. national sample survey (NORC's General Social Survey). However, for many of the assignments you will be able to substitute a data set of your own, focusing on topics that interest you. The exercises have been adapted from a 20-week course that I have been teaching at UCLA for many years, in which I give 16 or 17 lectures, once a week, each with a required exercise, and then ask students to write term papers in the last few weeks of the course. Because we will be meeting four days a week, I have tried to adapt the assignments to make it possible realistically to complete them between noon and 9:00 am the next day. Also, insofar as the rhythm of the course permits doing so, I have tried to set the longer exercises on Thursday, to give you several days to complete them before the following Monday morning. We will find out together how well I have succeeded in adapting the assignments. Also, because I will give only 15 lectures, I have omitted the term paper. There simply isn’t enough time.

PRACTICAL DETAILS

Written exercises. You are expected to prepare your exercises on a word processor.

Calculator. You will find a calculator a necessity for this course, unless you particularly enjoy doing tedious arithmetic computations by hand. Calculators powerful enough to handle all of your needs are now available for relatively little money. Unfortunately, I am not up-to-date on what is available, but somebody at a local electronics shop probably can help you decide what is optimal. I find it somewhat useful to have a calculator with lots of memory and not at all useful to have built-in statistical functions (e.g., simple correlation and regression). A reasonable alternative is to use the “display” function in Stata, the software we will be using for the course.

Computing. Starting on the 3rd day, all assignments will require doing data analysis of one or more sample surveys, using the statistical package Stata, Version 10.0. (While I once taught this course using SPSS, more than 10 years ago I switched to Stata because it is a fast and efficient package that includes most of the statistical procedures of interest to social scientists. As software, it clearly is superior to SPSS; it is faster, more accurate, and includes a much larger range of applications. My judgement in this matter is widely shared, and Stata has become the statistical package of choice within many economics and sociology departments in U.S. universities. One of the important advantages of Stata is that Stata data sets are fully transportable across platforms. Thus, you may use the same data set to do analysis on a PC or on a UNIX machine.) The 3rd lecture will be devoted to a full-fledged introduction to Stata. During that lecture, we will discuss the practical details of using Stata in the Joint Institute.

The course web page. We have established a web page for the course, which contains this syllabus and links to the data sets we will use and the documentation for these data sets. From time to time we will add other materials, which you may then download or print for your own use. Each day we will put up an “Illustrative Answer” to the exercise you turn in that day, which you may also download or print. I urge you to check the web site frequently since it will contain the most up-to-date information regarding the course. The URL (Internet address) is: http://www.umich.cn/teaching/teaching_2008_download.html. The password will be announced in class. You should make a bookmark for this site on your computer.

Text. The text for the course is the typescript of a book I am publishing based on the version of the course I have been teaching at UCLA, Quantitative Data Analysis: Doing Social Research to Test Ideas. Since the book will not be published until December, I have made the typescript available for your use. However, I must ask you not to copy it or distribute it to others. Both my publisher and I will be very unhappy if pirated copies start showing up all around China. Thus, a condition of your use of these materials is that you do not reproduce them or distribute them to others. I thank you for this.

Other reading
I also want to recommend five other books:
• Becker, Howard S. 1986. Writing for Social Scientists: How to Start and Finish Your Thesis, Book, or Article. Chicago: University of Chicago Press.
This is a wonderful book, which you should read through just for the pleasure of it, and reread whenever you discover you are hung up on your writing. As you read through it the first time, make a point of identifying your greatest writing sin—you will probably suffer from more than one, so identify the worst one. Then fix it, of course! Read it as soon as you get it.

Four statistics books are very useful, not so much for this course as for advanced study and future reference. If you want to improve your understanding of OLS regression, work through the Fox book (which also includes some material on logistic regression). There are now two very good books for sociologists and other social scientists on logistic regression and allied topics, by Scott Long and by Daniel Powers and Yu Xie. In addition, there is a book by Long and Jeremy Freese on how to do logistic regression analysis using Stata. All five authors are sociologists; all four books were published fairly recently; and all four are fairly demanding. Of the three logistic regression texts, I have some preference for that by Powers and Xie because Long tends to emphasize standardized coefficients, which I generally find more confusing than helpful; but that may be my own limitation. Here are the bibliographic references, in chronological order. You can buy all but the Fox book through the Stata Bookstore, among other places. For some reason, the Fox book is no longer listed on the Stata Bookstore web page.

• Fox, John. 1997. Applied Regression Analysis, Linear Models, and Related Methods. Thousand Oaks: Sage.
• Long, J. Scott. 1997. Regression Models for Categorical and Limited Dependent Variables. Thousand Oaks: Sage.
• Powers, Daniel A., and Yu Xie. 2000. Statistical Methods for Categorical Data Analysis. Orlando: Academic Press.
• Long, J. Scott, and Jeremy Freese. 2006. Regression Models for Categorical Dependent Variables Using Stata. 2nd edition. College Station, TX: Stata Press.

If you are new to Stata, I recommend that you work through the tutorial in the Getting Started with Stata for Windows manual. For any of you who continue to have difficulty making sense of the material in the manuals, I recommend
• Rabe-Hesketh, Sophia, and Brian Everitt. 2007. A Handbook of Statistical Analyses Using Stata, 4th edition. Boca Raton, FL: Chapman & Hall/CRC.
This is designed as a companion to the Stata manuals. It is fairly elementary, but is a good overview of what can be done with Stata. It covers a lot of ground quickly, so it is best as a review of rather than an introduction to statistics. You can order the book from Stata Corp. There are several other guides to using Stata, which you can check out on the Stata Bookstore web page.
Finally, the classic text on logistic regression, by Hosmer and Lemeshow, first published in 1989, has been updated fairly recently. Hosmer and Lemeshow are biostatisticians and their examples reflect this orientation. In the second edition, many of the examples were generated using Stata. A particular strength of the book is the attention it pays to complex samples that require survey estimation techniques (implemented in the Stata svylogit command).
• Hosmer, David W., and Stanley Lemeshow. 2000. Applied Logistic Regression. 2nd edition. New York: Wiley.

Additional readings. The following will prove helpful to many of you. Note that publications identified as “(Sage No. ##)” are Sage University Papers, in the Quantitative Applications in the Social Sciences Series, published by Sage Publications, Thousand Oaks, CA. These are quite short, less than 100 pages each, and are intended as introductions to various topics. I have found them generally useful, although some are better than others.

On the General Social Survey
Davis, James A., and Tom W. Smith. 1992. The NORC General Social Survey: A User’s Guide. (Guides to Major Social Science Data Bases 1.) Thousand Oaks: Sage.
This gives a history and, more important, the logic of the GSS. It is a very useful document for any of you who expect to make professional research use of the GSS.
Davis, James Allan, Tom W. Smith, and Peter V. Marsden. 2005. General Social Surveys, 1972-2004: Cumulative Codebook. Chicago: National Opinion Research Center.
This is the codebook for the GSS. While you can access the codebook via the Internet, it actually is substantially more convenient to have a local copy. The easiest way to obtain a copy is to ask Libbie Stephenson, the UCLA Data Archivist, to burn a CD for you, which will cost you all of one dollar. You can then put the codebook on your computer, which makes it easy to search or, if you really, really want a paper copy, you can print it. However, since it is about 1,500 pages in length, this will be an expensive proposition.

On cross-tabulations
Davis, James A. 1985. The Logic of Causal Order. (Sage No. 55.)
This publication encompasses more than cross-tabulations but is included here because the logic of causal order is for many students most confusing with respect to cross-tabulations.
Zeisel, Hans. 1985. Say It with Figures. 6th ed. New York: Harper & Row.
This is a classic, first published in 1947 and continuously updated through 1985. You should know about it just for historical reasons. But you also are likely to find it useful as a guide to presenting clear and informative tables.

On regression analysis
Achen, Christopher H. 1982. Interpreting and Using Regression. (Sage No. 29.)
Asher, Herbert B. 1983. Causal Modeling. Second Edition. (Sage No. 3.)
Berry, William D. 1993. Understanding Regression Assumptions. (Sage No. 92.)
Berry, William D., and Stanley Feldman. 1985. Multiple Regression in Practice. (Sage No. 50.)
Cohen, Jacob, and Patricia Cohen. 1975. Applied Multiple Regression/Correlation Analysis for the Behavioral Sciences. Hillsdale, NJ: Lawrence Erlbaum Associates.
An old, but still useful, text. Lots of practical tricks, all based on school algebra.

Fox, John. 1991. Regression Diagnostics. (Sage No. 79.)
Hamilton, Lawrence C. 1992. Regression with Graphics: a Second Course in Applied Statistics. Belmont: Duxbury Press.
This is a graphically oriented book on regression by the author of Statistics with Stata 7.0. Although it can be used with any statistical package, it is clear that most of the graphics were created using Stata. It thus is a useful advanced applied statistics text for Stata users.
Hardy, Melissa. 1993. Regression with Dummy Variables. (Sage No. 93.)
Hargens, Lowell L. 1976. “A Note on Standardized Coefficients as Structural Parameters.” Sociological Methods and Research 5:247-56.
Makes a case for comparing standardized coefficients across samples. Read in conjunction with Kim and Mueller.
Jaccard, James, and Robert Turrisi. 2003. Interaction Effects in Multiple Regression. 2nd ed. (Sage No. 72.)
Kim, Jae-On, and Charles W. Mueller. 1976. “Standardized and Unstandardized Coefficients in Causal Analysis: An Expository Note.” Sociological Methods and Research 4:423-38.
Makes the case against comparing standardized coefficients across samples (the conventional view). Read in conjunction with Hargens.
Marsh, Lawrence C., and David R. Cormier. 2001. Spline Regression Models. (Sage No. 137.)
Stoltzenberg, Ross. 1974. “Estimating an Equation with Multiplicative and Additive Terms.” Sociological Methods and Research 2:313-31.
Wildt, Albert R., and Olli Ahtola. 1978. Analysis of Covariance. (Sage No. 12.)

On logistic regression and allied procedures
Aldrich, John H., and Forrest D. Nelson. 1984. Linear Probability, Logit, and Probit Models. (Sage No. 45.)
Borooah, Vani Kant. 2001. Logit and Probit: Ordered and Multinomial Models. (Sage No. 138.)

Conroy, Ronán M. 2002. “Choosing an Appropriate Real-life Measure of Effect Size: the Case of a Continuous Predictor and a Binary Outcome.” Stata Journal 2:290-295.
DeMaris, Alfred. 1992. Logit Modeling. (Sage No. 86.)
Includes a discussion of log linear models.
DeMaris, Alfred. 2002. “Explained Variance in Logistic Regression: A Monte Carlo Study of Proposed Measures.” Sociological Methods and Research 31:27-74.
Heiss, Florian. 2002. “Structural Choice Analysis with Nested Logit Models.” Stata Journal 2:227-252.
Liao, Tim Futing. 1994. Interpreting Probability Models: Logit, Probit, and Other Generalized Linear Models. (Sage No. 101.)
Menard, Scott. 2001. Applied Logistic Regression Analysis. 2nd ed. (Sage No. 106.)
A non-technical introduction to logistic regression, treated as an extension of OLS regression.
Pampel, Fred C. 2000. Logistic Regression: A Primer. (Sage No. 132.)

On log-linear and log-multiplicative analysis
Goodman, Leo A., and Michael Hout. 1998. “Statistical Methods and Graphical Displays for Analyzing How the Association between Two Qualitative Variables Differs among Countries, among Groups, or over Time: A Modified Regression-Type Approach.” Sociological Methodology 28:175-230.
Goodman, Leo A., and Michael Hout. 2001. “Statistical Methods and Graphical Displays for Analyzing How the Association between Two Qualitative Variables Differs among Countries, among Groups, or over Time - Part II: Some Exploratory Techniques, Simple Models, and Simple Examples.” Sociological Methodology 31:189-221.
These two articles represent the current state of the art. See also the extensive literature cited, particularly the papers by Xie and by Yamaguchi.
Hagenaars, Jacques A. 1993. Loglinear Models with Latent Variables. (Sage No. 94.)
An advanced topic: structural equation models for categorical variables.
Hauser, Robert M. 1978. “A Structural Model of the Mobility Table.” Social Forces 56:919-43.
Hauser, Robert M. 1980. “Some Exploratory Methods for Modeling Mobility Tables and Other Cross-Classified Data.” Sociological Methodology 1980.
Hout, Michael. 1982. Mobility Tables. (Sage No. 31.)
Consideration of an important class of log linear models. Read after Knoke and Burke.
Ishii-Kuntz, Masako. 1994. Ordinal Log-linear Models. (Sage No. 97.)
Kaufman, Robert L., and Paul G. Schervish. 1986. “Using Adjusted Crosstabulations to Interpret Log-linear Relationships.” American Sociological Review 51:717-33.
Knoke, David, and Peter Burke. 1980. Log-Linear Models. (Sage No. 20.)
An introduction to log linear models.
McCutcheon, Allan L. 1987. Latent Class Analysis. (Sage No. 64.)
Extension of log linear analysis to create an analog to factor analysis for categorical variables.
Rudas, Tamás. 1998. Odds Ratios in the Analysis of Contingency Tables. (Sage No. 119.)

On factor analysis
Dunteman, George H. 198x. Principal Components Analysis. (Sage No. 69.)
Kim, Jae-On, and Charles W. Mueller. 1978. Introduction to Factor Analysis: What It Is and How to Do It. (Sage No. 13.)
Kim, Jae-On, and Charles W. Mueller. 1978. Factor Analysis: Statistical Methods and Practical Issues. (Sage No. 14.)
McCutcheon, Allan L. 1987. Latent Class Analysis. (Sage No. 64.)
Analog to factor analysis for categorical variables.

On sample selection bias, matching, and allied topics
Becker, Sascha O., and Andrea Ichino. 2002. “Estimation of Average Treatment Effects Based on Propensity Scores.” Stata Journal 2:358-377.
Breen, Richard. 1996. Regression Models: Censored, Sample Selected, or Truncated Data. (Sage No. 111.)
Berk, Richard A. 1983. “An Introduction to Sample Selection Bias in Sociological Data.” American Sociological Review 48:386-98.
Berk, Richard A., and Subhash C. Ray. 1982. “Selection Biases in Sociological Data.” Social Science Research 11:352-98.
Smith, Herbert L. 1997. “Matching with Multiple Controls to Estimate Treatment Effects in Observational Studies.” Sociological Methodology 27:325-353.
Winship, Christopher, and Robert D. Mare. 1992. “Models for Sample Selection Bias.” Annual Review of Sociology 18:327-50.

On estimation, statistical inference, and related topics
Eliason, Scott R. 1993. Maximum Likelihood Estimation: Logic and Practice. (Sage No. 96.)
Gould, William, and William Sribney. 1999. Maximum Likelihood Estimation with Stata. College Station: Stata Press.
Henkel, Ramon E. 1976. Tests of Significance. (Sage No. 4.)
Read in conjunction with Sage No. 43, Bayesian Statistical Inference.
Iverson, Gudmund R. 1984. Bayesian Statistical Inference. (Sage No. 43.)
Read in conjunction with Sage No. 4, Tests of Significance.
Mohr, Lawrence B. 1993. Understanding Significance Testing. (Sage No. 73.)
A coherent review of what you supposedly learned in introductory statistics but perhaps didn’t quite get, with an emphasis on practical applications.
Mooney, Christopher Z., and Robert D. Duval. 1993. Bootstrapping: A Nonparametric Approach to Statistical Inference. (Sage No. 95.)
Mooney, Christopher Z. 1997. Monte Carlo Simulation. (Sage No. 116.)
Raftery, Adrian. 1995. “Bayesian Model Selection in Social Research.” Sociological Methodology 25:111-63.
Definitive discussion of BIC in the sociological literature. See also the Special Issue on the Bayesian Information Criterion, Sociological Methods and Research, Vol. 27, No. 3 (February 1999), which includes a critical evaluation of BIC by David Weakliem, a defense by Raftery, and discussions by several others.
Smithson, Michael. 2002. Confidence Intervals. (Sage No. 140.)

On coping with missing data
Allison, Paul D. 2001. Missing Data. (Sage No. 136.)
Brick, J. Michael, and Graham Kalton. 1996. “Handling Missing Data in Survey Research.” Statistical Methods in Medical Research 5:215-238.
Landerman, Lawrence R., Kenneth C. Land, and Carl F. Pieper. 1997. “An Empirical Evaluation of the Predictive Mean Matching Method for Imputing Missing Values.” Sociological Methods and Research 26:3-33.
Little, Roderick J. A., and Donald B. Rubin. 2002. Statistical Analysis with Missing Data, 2nd edition. New York: John Wiley & Sons.
The definitive treatment, by the creators of multiple-imputation.
Paul, Christopher, William M. Mason, Daniel McCaffrey, and Sarah A. Fox. 2003. “What Should We Do About Missing Data? (A Case Study Using Logistic Regression with Missing Data on a Single Covariate).” Los Angeles: California Center for Population Research, On-Line Working Paper Series CCPR-028-03.
A good overview of the strengths and weaknesses of various methods, with some useful skepticism about multiple imputation as the gold standard.
Schafer, Joseph L. 1997. Analysis of Incomplete Multivariate Data. London: Chapman and Hall.
An accessible overview.

Schafer, Joseph L. 1999. “Multiple Imputation: A Primer.” Statistical Methods in Medical Research 8:3-15.
A short version of Schafer 1997.

On other topics
Carmines, Edward G., and Richard A. Zeller. 1979. Reliability and Validity Assessment. (Sage No. 17.)
Firebaugh, Glenn. 1997. Analyzing Repeated Surveys. (Sage No. 115.)
Fox, James Alan, and Paul E. Tracy. 1986. Randomized Response: A Method for Sensitive Surveys. (Sage No. 58.)
A specialized technique for getting estimates of rates for sensitive topics.
Glenn, Norval D. 2004. Cohort Analysis. 2nd ed. (Sage No. 5.)
Jacoby, William G. 1997. Statistical Graphics for Univariate and Bivariate Data. (Sage No. 117.)
Kalton, Graham. 1983. Introduction to Survey Sampling. (Sage No. 35.)
Lee, Eun Sul, Ronald N. Forthofer, and Ronald J. Lorimor. 1989. Analyzing Complex Survey Data. (Sage No. 71.)
The analysis of stratified and clustered samples.
Luke, Douglas A. 2004. Multilevel Modeling. (Sage No. 143.)
Moore, Kristin Anderson, Tamara G. Halle, Sharon Vandivere, and Carrie L. Mariner. 2002. “Scaling Back Survey Scales: How Short is Too Short?” Sociological Methods and Research 30:530-567.
McIver, John P., and Edward G. Carmines. 1984. Unidimensional Scaling. (Sage No. 24.)
Namboodiri, Krishnan. 1984. Matrix Algebra. (Sage No. 38.)
Rudas, Tamás. 2004. Probability Theory: A Primer. (Sage No. 142.)
Sullivan, John L., and Stanley Feldman. 1979. Multiple Indicators. (Sage No. 15.)
Use of multiple indicators to assess validity and reliability.
Vanleeuwen, Dawn M., and Keith H. Mandabach. 2002. “A Note on the Reliability of Ranked Items.” Sociological Methods and Research 31:87-105.

COURSE CONTENT

Week 1
Day 1 (June 30): Introduction to cross-tabulation
Day 2 (July 1): Manipulating cross-tabulations
Day 3 (July 2): Introduction to computing
Day 4 (July 3): Direct standardization via Stata; simple (2-variable) correlation and regression

Week 2
Day 5 (July 7): Multiple correlation and regression
Day 6 (July 8): Interactions with dummies
Day 7 (July 9): More on multiple regression: various tricks
Day 8 (July 10): Multiple imputation of missing data

Week 3
Day 9 (July 14): More on regression tricks; complex samples and design effects
Day 10 (July 15): Regression diagnostics and robust regression
Day 11 (July 16): Scale construction by factor analysis
Day 12 (July 17): Log linear analysis

Week 4
Day 13 (July 21): Binomial logistic regression
Day 14 (July 22): Multinomial and ordinal logit analysis and tobit analysis
Day 15 (July 23): Additional topics—preview of advanced techniques