Title: | Build Panel Data Sets from PSID Raw Data |
---|---|
Description: | Makes it easy to build panel data in wide format from Panel Study of Income Dynamics (PSID) delivered raw data. Downloads data directly from the PSID server using the 'SAScii' package. 'psidR' takes care of merging data from each wave onto a cross-period index file, so that individuals can be followed over time. The user must specify which years they are interested in, and the 'PSID' variable names (e.g. ER21003) for each year (they differ in each year). The package offers helper functions to retrieve variable names from different waves. There are different panel data designs and sample subsetting criteria implemented ("SRC", "SEO", "immigrant" and "latino" samples). More information about the PSID can be obtained at <https://simba.isr.umich.edu/data/data.aspx>. |
Authors: | Florian Oswald [aut, cre] |
Maintainer: | Florian Oswald <[email protected]> |
License: | GPL-3 |
Version: | 2.3 |
Built: | 2025-03-07 05:35:06 UTC |
Source: | https://github.com/floswald/psidr |
Builds a panel data set with id variables pid (unique person identifier) and year from individual PSID family files and supplemental wealth files.
build.panel( datadir = NULL, fam.vars, ind.vars = NULL, heads.only = FALSE, current.heads.only = FALSE, sample = NULL, design = "balanced", loglevel = INFO )
datadir |
either |
fam.vars |
data.frame of variables to retrieve from family files; see the example for the required format. |
ind.vars |
data.frame of variables to get from the individual file. In almost all cases this will be the type of survey weights you want to use. Do not include the id variables ER30001 and ER30002. |
heads.only |
logical TRUE if user wants household heads only. Household heads in sample year. |
current.heads.only |
logical TRUE if user wants current household heads only. Distinguishes mover-out heads. |
sample |
string indicating which sample to select: "SRC" (Survey Research Center), "SEO" (Survey of Economic Opportunity), "immigrant" (immigrant sample), "latino" (Latino family sample). Defaults to NULL, so no subsetting takes place. |
design |
either the character string "balanced" or "all", or an integer. "balanced" keeps only individuals who appear in every wave; "all" keeps all individuals. An integer value is the minimum number of consecutive years of participation, e.g. design=3 means present in at least 3 consecutive waves. |
loglevel |
one of INFO, WARN and DEBUG. INFO by default. |
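To make the three design options concrete, here is a minimal base-R sketch on a toy long-format panel. It is not psidR's implementation: the helper names `longest_run` and `select_pids` and the toy data are assumptions made purely for illustration.

```r
# Toy panel: person 1 appears in all 4 waves, person 2 in 3 consecutive
# waves, person 3 in 2 non-consecutive waves.
panel <- data.frame(
  pid  = c(1, 1, 1, 1, 2, 2, 2, 3, 3),
  year = c(1985, 1986, 1987, 1988, 1986, 1987, 1988, 1985, 1987)
)

# Length of the longest run of consecutive waves in which a person appears.
longest_run <- function(years, waves) {
  runs <- rle(waves %in% years)
  if (!any(runs$values)) return(0L)
  max(runs$lengths[runs$values])
}

# Hypothetical helper (not part of psidR) mimicking the design argument.
select_pids <- function(panel, design) {
  waves <- sort(unique(panel$year))
  run <- tapply(panel$year, panel$pid, longest_run, waves = waves)
  if (identical(design, "all")) {
    keep <- rep(TRUE, length(run))       # everyone is kept
  } else if (identical(design, "balanced")) {
    keep <- run == length(waves)         # present in every wave
  } else {
    keep <- run >= design                # at least `design` consecutive waves
  }
  as.numeric(names(run)[keep])
}

select_pids(panel, "balanced")
select_pids(panel, "all")
select_pids(panel, 3)
```

In this toy data only person 1 satisfies the balanced design, while an integer design of 3 also admits person 2.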
There are several supported approaches. Approach one downloads Stata data, uses Stata to build each wave, then puts it together with 'psidR'. The second (recommended) approach downloads all data directly from the PSID servers (no Stata needed). For this approach you need to supply the precise names of the PSID variables, which vary by year; e.g. total family income has a different name in each wave. The function getNamesPSID greatly helps with collecting names for all waves.
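Because every variable needs one name per wave, it can help to check the shape of the fam.vars table before calling build.panel. The sketch below uses a hypothetical helper `make_fam_vars` (not part of psidR) and the artificial variable names from the package example; with real data, each per-wave vector could instead come from getNamesPSID.

```r
# Hypothetical helper (not part of psidR): assemble a fam.vars-style
# data.frame with one row per wave and one column per variable.
make_fam_vars <- function(years, ...) {
  cols <- list(...)
  # every variable must supply exactly one name per wave
  stopifnot(length(cols) > 0,
            all(lengths(cols) == length(years)))
  data.frame(year = years, cols, stringsAsFactors = FALSE)
}

famvars <- make_fam_vars(
  years = c(1985, 1986),
  money = c("Money85", "Money86"),
  age   = c("age85", "age86")
)
print(famvars)
```

The length check catches the common mistake of supplying a name for only some of the requested waves.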
The resulting data.table. The variable pid is the unique person identifier, constructed from ID1968 and pernum.
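As an illustration of how a single identifier can be built from the two components, the sketch below combines them as ID1968 * 1000 + pernum. That exact formula is an assumption here, chosen because person numbers have at most three digits; the source only states that pid is constructed from ID1968 and pernum.

```r
# Toy individual index: two people in 1968 family 1, one in family 2.
ind <- data.frame(ID1968 = c(1, 1, 2), pernum = c(1, 2, 1))

# ASSUMPTION: combine the components as ID1968 * 1000 + pernum.
ind$pid <- ind$ID1968 * 1000 + ind$pernum

# The combined identifier must be unique per person.
stopifnot(!anyDuplicated(ind$pid))
print(ind)
```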
The interview number variable in each family file maps to the interview number variable of the given year in the individual file. Run example(build.panel) for a demonstration.
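The interview-number mapping described above amounts to a merge of family-level data onto person-level rows. A minimal base-R sketch for a single wave follows; the column names `intnum`, `money` and `pid` are placeholders, not actual PSID variable names.

```r
# Toy family file for one wave: one row per family, keyed by interview number.
fam <- data.frame(intnum = c(10, 20), money = c(50000, 32000))

# Toy individual file: one row per person, carrying that wave's interview number.
ind <- data.frame(pid = c(1001, 1002, 2001), intnum = c(10, 10, 20))

# Every person inherits the variables of the family they were interviewed with.
merged <- merge(ind, fam, by = "intnum")
merged <- merged[order(merged$pid), ]
print(merged)
```

Repeating this for each wave and stacking the results gives one row per person and year.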
Notice that support for wealth supplements is disabled! Recent releases of the main family file include wealth data. Earlier waves must be merged manually, again by the interview number variable as above.
    # ######################################
    # reproducible example on artificial data.
    # run this with example(build.panel).
    # ######################################

    ## make reproducible family data sets for 2 years
    ## variables are: family income (Money) and age

    ## Data acquisition step:
    ## run build.panel with sascii=TRUE

    # testPSID creates artificial PSID data
    td <- testPSID(N=12,N.attr=0)
    fam1985 <- data.table::copy(td$famvars1985)
    fam1986 <- data.table::copy(td$famvars1986)
    IND2019ER <- data.table::copy(td$IND2019ER)

    # create a temporary datadir
    my.dir <- tempdir()
    # save those in the datadir
    # notice different R formats admissible
    save(fam1985,file=paste0(my.dir,"/FAM1985ER.rda"))
    save(fam1986,file=paste0(my.dir,"/FAM1986ER.RData"))
    save(IND2019ER,file=paste0(my.dir,"/IND2019ER.RData"))

    ## end Data acquisition step.

    # now define which famvars
    famvars <- data.frame(year=c(1985,1986),
                          money=c("Money85","Money86"),
                          age=c("age85","age86"))

    # create ind.vars
    indvars <- data.frame(year=c(1985,1986),ind.weight=c("ER30497","ER30534"))

    # call the builder
    # data will contain column "relation.head" holding the relationship code.
    d <- build.panel(datadir=my.dir,fam.vars=famvars,
                     ind.vars=indvars,
                     heads.only=FALSE)

    # see what happens if we drop non-heads
    # only the ones who are heads in BOTH years
    # are present (since design='balanced' by default)
    d <- build.panel(datadir=my.dir,fam.vars=famvars,
                     ind.vars=indvars,
                     heads.only=TRUE)
    print(d[order(pid)],nrow=Inf)

    # change sample design to "all":
    # we'll keep individuals if they are head in one year,
    # and drop in the other
    d <- build.panel(datadir=my.dir,fam.vars=famvars,
                     ind.vars=indvars,heads.only=TRUE,
                     design="all")
    print(d[order(pid)],nrow=Inf)

    file.remove(paste0(my.dir,"/FAM1985ER.rda"),
                paste0(my.dir,"/FAM1986ER.RData"),
                paste0(my.dir,"/IND2019ER.RData"))

    # END psidR example

    # #####################################################################
    # Please go to https://github.com/floswald/psidR for more example usage
    # #####################################################################
Builds a panel from the full PSID dataset
build.psid(datadr = "~/datasets/psid/", small = TRUE)
datadr |
string of the data directory |
small |
logical; TRUE to use only the years 2013 and 2015. |
a data.table with panel data
see https://asdfree.com/ for other usage and https://stackoverflow.com/questions/15853204/how-to-login-and-then-download-a-file-from-aspx-web-pages-with-r
get.psid(file, name, params, curl)
file |
string psid file number |
name |
string of filename on disc |
params |
'postForm' (RCurl) parameters |
curl |
'postForm' (RCurl) curl handle |
Anthony Damico <[email protected]>
The user can specify one variable name from any year. This function will find that variable's correct name in each of the years specified by the user. If the user does not specify years, the result covers all years in which the variable was present.
getNamesPSID(aname, cwf, years = NULL, file = NULL)
aname |
A variable name in any of the PSID years |
cwf |
A data.frame representation of the cross-walk file (the psid.xlsx file). |
years |
A vector of years. If NULL, all years in which that variable existed are returned |
file |
optional file name to write csv |
This uses the psid.xlsx crosswalk file from UMich, which is available at http://psidonline.isr.umich.edu/help/xyr/psid.xlsx. In the example, the package openxlsx's read.xlsx is used to import the crosswalk file.
Ask for one variable at a time.
A vector of names, one for each year.
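The lookup idea can be sketched on a toy crosswalk in base R. Everything below is an assumption for illustration: the one-column-per-wave layout, the `Y<year>` column names, the made-up variable names, and the helper `lookup_names`, which is not psidR's getNamesPSID.

```r
# Toy crosswalk: one row per concept, one column per wave.
# NA means the variable was not collected in that wave.
cwf_toy <- data.frame(
  Y1985 = c("Money85", "age85"),
  Y1986 = c("Money86", "age86"),
  Y1987 = c(NA, "age87"),
  stringsAsFactors = FALSE
)

# Hypothetical lookup: find the row containing `aname` in any wave,
# then return that row's names for the requested waves.
lookup_names <- function(aname, cwf, years = NULL) {
  ycols <- grep("^Y[0-9]{4}$", names(cwf), value = TRUE)
  hit <- which(rowSums(cwf[, ycols, drop = FALSE] == aname, na.rm = TRUE) > 0)
  stopifnot(length(hit) == 1)            # one variable at a time
  out <- unlist(cwf[hit, ycols])
  if (is.null(years)) {
    out[!is.na(out)]                     # all waves where the variable exists
  } else {
    out[paste0("Y", years)]              # only the requested waves
  }
}

lookup_names("Money85", cwf_toy)
lookup_names("age86", cwf_toy, years = c(1985, 1987))
```

Any wave's name can serve as the entry point, which is why a name from one year is enough to recover the names for all other years.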
Paul Johnson <[email protected]> and Florian Oswald
    # read UMich crosswalk from installed file
    r = system.file(package="psidR")
    cwf = openxlsx::read.xlsx(file.path(r,"psid-lists","psid.xlsx"))

    # or download directly
    # cwf <- read.xlsx("http://psidonline.isr.umich.edu/help/xyr/psid.xlsx")

    # then get names with
    getNamesPSID("ER17013", cwf, years = 2001)
    getNamesPSID("ER17013", cwf, years = 2003)
    getNamesPSID("ER17013", cwf, years = NULL)
    getNamesPSID("ER17013", cwf, years = c(2005, 2007, 2009))
helper function to convert factor to character in a data.table
make.char(x)
x |
a |
a character
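A conversion of this kind might look like the base-R sketch below. The helper `to_char` is hypothetical and is not the package's make.char; it is applied to a plain data.frame rather than a data.table to keep the example dependency-free.

```r
# Hypothetical stand-in: convert a factor to character, leave other types alone.
to_char <- function(x) if (is.factor(x)) as.character(x) else x

df <- data.frame(state = factor(c("MI", "NY")), n = c(1L, 2L))

# Apply column by column; df[] keeps the data.frame structure.
df[] <- lapply(df, to_char)
str(df)
```

Going through as.character matters because converting a factor straight to a number returns its internal level codes rather than its labels.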
this list is taken from http://ideas.repec.org/c/boc/bocode/s457040.html
makeids()
this function hardcodes the PSID variable names of "interview number" from both family and individual file for each wave, as well as "sequence number", "relation to head" and numeric value x of that variable such that "relation to head" == x means the individual is the head. Varies over time.
three year test, ind file
medium.test.ind(dd = NULL)
dd |
Data Dictionary location. If NULL, use temp dir and force download |
No return value, called for side effects
three year test, ind file and one NA variable
medium.test.ind.NA(dd = NULL)
dd |
Data Dictionary location. If NULL, use temp dir and force download |
No return value, called for side effects
three year test, ind file and one NA variable and wealth
medium.test.ind.NA.wealth(dd = NULL)
dd |
Data Dictionary location. If NULL, use temp dir and force download |
No return value, called for side effects
three year test, no ind file
medium.test.noind(dd = NULL)
dd |
Data Dictionary location |
No return value, called for side effects
psidR is a package that helps with the task of building longitudinal datasets from the Panel Study of Income Dynamics (PSID). The user must supply the PSID variable names that correspond to the variables of interest in each desired wave. Data can be supplied via Stata, or downloaded directly from the PSID servers without any need for Stata.
Maintainer: Florian Oswald [email protected]
one year test, ind file
small.test.ind(dd = NULL)
dd |
Data Dictionary location. If NULL, use temp dir and force download |
No return value, called for side effects
one year test, no ind file
small.test.noind(dd = NULL)
dd |
Data Dictionary location. If NULL, use temp dir and force download |
No return value, called for side effects
Makes artificial PSID data with variables age and income for two consecutive years, 1985 and 1986.
testPSID(N = 100, N.attr = 0)
N |
number of people in each wave |
N.attr |
number of people lost to attrition |
list with (fake) individual index file IND2019ER and family files for 1985 and 1986