Title: | Doubly Truncated Data Analysis |
---|---|
Description: | Implementation of different algorithms for analyzing randomly truncated data, one-sided and two-sided (i.e. doubly) truncated data. It serves to compute empirical cumulative distributions and also kernel density and hazard functions using different bandwidth selectors. Several real data sets are included. |
Authors: | Carla Moreira, Jacobo de Uña-Álvarez and Rosa Crujeiras |
Maintainer: | Carla Moreira <[email protected]> |
License: | GPL-2 |
Version: | 3.0.1 |
Built: | 2025-02-24 03:00:59 UTC |
Source: | https://github.com/cran/DTDA |
The data include information on 939 patients with a confirmed diagnosis of type 1 (primary spontaneous) acute coronary syndrome (ACS). Patients were consecutively admitted to the Cardiology Department of two tertiary hospitals in Portugal between August 2013 and December 2014. The age at diagnosis is doubly truncated because of the interval sampling.
data(ACS)
A data frame with 939 observations on the following 5 variables.
X
a numeric vector, age at diagnosis (in years).
U
a numeric vector, the elapsed time (in years) between birth and the beginning of the study (August 2013).
V
a numeric vector, the elapsed time (in years) between birth and end of the study (December 2014).
Sex
a numeric vector, sex of the participants (0 = female, 1 = male).
diagnosis
a numeric vector, type of diagnosis at discharge: 1 = STEMI (ST elevation myocardial infarction), 2 = NSTEACS (all other diagnoses).
The age at diagnosis X is doubly truncated due to the interval sampling. The length of the sampling interval (V - U) is 1.42 years. The NPMLE of the cumulative distribution function of X does not exist or is not unique for this dataset: the necessary and sufficient graphical condition presented by Xiao and Hudgens (2020) for the existence and uniqueness of the NPMLE is not satisfied.
Araújo C, Laszczynska O, Viana M, Melão F, Henriques A, Borges A, Severo M, Maciel MJ, Moreira I, Azevedo A (2018) Sex differences in presenting symptoms of acute coronary syndrome: the EPIHeart cohort study. BMJ Open 8.
Xiao J and Hudgens MG (2020) On nonparametric maximum likelihood estimation with double truncation. Biometrika 106, 989-996.
data(ACS)
str(ACS)
The data include information on 917 patients with a confirmed diagnosis of type 1 (primary spontaneous) acute coronary syndrome (ACS). Patients were consecutively admitted to the Cardiology Department of two tertiary hospitals in Portugal between August 2013 and December 2014. The age at diagnosis is doubly truncated because of the interval sampling. This dataset is a reduced sample of the original ACS data, guaranteeing the existence and uniqueness of the NPMLE, according to Xiao and Hudgens (2020).
data(ACSred)
A data frame with 917 observations on the following 5 variables.
X
a numeric vector, age at diagnosis (in years).
U
a numeric vector, the elapsed time (in years) between birth and the beginning of the study (August 2013).
V
a numeric vector, the elapsed time (in years) between birth and end of the study (December 2014).
Sex
a numeric vector, sex of the participants (0 = female, 1 = male).
diagnosis
a numeric vector, type of diagnosis at discharge: 1 = STEMI (ST elevation myocardial infarction), 2 = NSTEACS (all other diagnoses).
The age at diagnosis X is doubly truncated due to the interval sampling. The length of the sampling interval (V - U) is 1.42 years. The NPMLE of the cumulative distribution function of X does not exist or is not unique for the complete ACS data; this reduced sample guarantees its existence and uniqueness.
Araújo C, Laszczynska O, Viana M, Melão F, Henriques A, Borges A, Severo M, Maciel MJ, Moreira I, Azevedo A (2018) Sex differences in presenting symptoms of acute coronary syndrome: the EPIHeart cohort study. BMJ Open 8.
Moreira C, de Uña-Álvarez J, Santos AC and Barros H (2021) Smoothing Methods to estimate the hazard rate under double truncation. https://arxiv.org/abs/2103.14153.
Xiao J and Hudgens MG (2020) On nonparametric maximum likelihood estimation with double truncation. Biometrika 106, 989-996.
data(ACSred)
str(ACSred)
The data include information on the infection and induction times for 258 adults who were infected with HIV and developed AIDS by June 30, 1996. The data consist of the infection time in years, measured from April 1, 1978, for adults infected by the virus through a contaminated blood transfusion, and the waiting time to development of AIDS, measured from the date of infection. The induction times are right-truncated.
data(AIDS)
A data frame with 258 observations on the following 3 variables.
INFTime
a numeric vector, the infection time (years).
INDTime
a numeric vector, the induction time (years).
V
a numeric vector, the time from HIV infection to the end of the study (years).
Klein and Moeschberger (1997) Survival Analysis Techniques for Censored and truncated data. Springer.
Lagakos SW and Barraj LM and de Gruttola V (1988) Nonparametric Analysis of Truncated Survival Data, with Applications to AIDS. Biometrika 75, 515–523.
data(AIDS)
str(AIDS)
The data include information on cases of transfusion-related AIDS, corresponding to individuals diagnosed prior to July 1, 1986. Only the 295 patients with consistent data, for whom the infection could be attributed to a single transfusion or a short series of transfusions, were included. Since HIV was unknown before 1982, cases developing AIDS prior to this date were not reported, leading to doubly truncated data. The incubation time is doubly truncated due to the interval sampling.
data(AIDS.DT)
A data frame with 295 observations on the following 4 variables.
X
a numeric vector, the induction or incubation time: time elapsed from HIV infection to AIDS (in months).
U
a numeric vector, time from 1982 to HIV infection (in months).
V
a numeric vector, time from HIV infection to July 1, 1986 (in months).
AGE
a numeric vector, age of the individual at infection (in years).
Kalbfleisch JD and Lawless JF (1989) Inference based on retrospective ascertainment: An analysis of the data on transfusion-related AIDS. Journal of the American Statistical Association 84, 360–372.
data(AIDS.DT)
str(AIDS.DT)
This dataset corresponds to all children diagnosed with cancer between January 1, 1999 and December 31, 2003 in the region of North Portugal. The database includes information on 406 children with complete records on the age at diagnosis. Because of the interval sampling, the age at diagnosis is doubly truncated by the time from birth to the end of the study and the time from birth to the beginning of the study.
data("ChildCancer")
A data frame with 406 observations on the following 8 variables.
X
a numeric vector, age at diagnosis (time in days).
U
a numeric vector, time from birth to the beginning of the study (time in days).
V
a numeric vector, time from birth to the end of the study (time in days).
ICCGroup
a numeric vector, cancer types identified according to the International Classification of Childhood Cancer (ICCC): 1=Leukaemias, 2=Lymphoma, 3=Nervous System Tumour, 4=Neuroblastoma, 5=Retinoblastoma, 6=Renal, 7=Hepatic, 8=Bone, 9=Soft Tissues, 10=Germ Cell, 11=Melanoma and other epithelial tumours, 12=Other Tumours.
Status
a numeric vector, the status indicator at the end of the study: 0=alive, 1=dead.
SurvTime
a numeric vector, the survival time (time from birth to death or end of the study).
Residence
a numeric vector, district of residence: 1=Braga, 2=Bragança, 3=Porto, 4=Viana do Castelo, 5=Vila Real.
Sex
a numeric vector, sex of the participants (1 = female, 2 = male).
The childhood cancer data were gathered from the IPO (Registo Oncológico do Norte) service, kindly provided by Doctor Maria José Bento.
Mandel M, de Uña-Álvarez J, Simon DK and Betensky R (2018). Inverse Probability Weighted Cox Regression for Doubly Truncated Data. Biometrics 74, 481-487.
Moreira C and de Uña-Álvarez J (2010) Bootstrapping the NPMLE for doubly truncated data. Journal of Nonparametric Statistics 22, 567-583.
data(ChildCancer)
str(ChildCancer)
This function provides the nonparametric kernel density estimation of a doubly truncated random variable.
densityDT(X, U, V, bw = "DPI2", from, to, n, wg = NA)
X |
numeric vector with the values of the target variable. |
U |
numeric vector with the values of the left truncation variable. |
V |
numeric vector with the values of the right truncation variable. |
bw |
The smoothing bandwidth to be used, but can also be a character string giving a rule to choose the bandwidth. This must be one of |
from |
the left point of the grid at which the density is to be estimated. The default is min(X)+1e-04. |
to |
the right point of the grid at which the density is to be estimated. The default is max(X)-1e-04. |
n |
number of evaluation points on an equally spaced grid. |
wg |
Numeric vector of random weights to correct for double truncation. Default weights correspond to the Efron-Petrosian NPMLE. |
The nonparametric kernel density estimator for a variable observed under random double truncation is computed as proposed in Moreira and de Uña-Álvarez (2012). As usual in kernel smoothing, the estimator is obtained as a convolution between a kernel function and an appropriate estimator of the cumulative distribution function. A Gaussian kernel is used. The automatic bandwidth selection procedures for the kernel density estimator are those proposed in Moreira and Van Keilegom (2013). They are modifications, taking the double truncation into account, of methods proposed for complete data in Cao et al. (1994) and Sheather and Jones (1991): the normal reference rule, two types of plug-in procedures, least squares cross-validation and a bootstrap-based method.
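The convolution of a Gaussian kernel with a discrete estimator of the distribution function reduces to a weighted sum of Gaussian kernels centred at the observations. The base-R sketch below illustrates that form only. The function `kde_weighted` and the toy weights are hypothetical and are not part of the package; the weights are assumed to be nonnegative and to sum to one (for instance, the jumps of an NPMLE of the distribution function).

```r
# Weighted Gaussian kernel density estimate:
#   f_h(x) = sum_i w_i * K_h(x - X_i), with K_h the N(0, h^2) density.
# Hypothetical helper for illustration; not a DTDA function.
kde_weighted <- function(x_grid, X, w, h) {
  stopifnot(length(X) == length(w), h > 0)
  sapply(x_grid, function(x) sum(w * dnorm(x, mean = X, sd = h)))
}

# Toy illustration with made-up observations and weights
X <- c(0.2, 0.4, 0.5, 0.9)
w <- c(0.1, 0.3, 0.4, 0.2)
grid <- seq(-3, 4, length.out = 701)   # grid step is 0.01
dens <- kde_weighted(grid, X, w, h = 0.2)
```

With equal weights 1/n this is the usual kernel density estimator for complete data; unequal weights are what correct for the truncation.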
A list containing the following values:
x |
the n coordinates of the points where the density is estimated. |
y |
the estimated density values. |
bw |
the bandwidth used. |
Carla Moreira, Jacobo de Uña-Álvarez and Rosa Crujeiras
Cao R, Cuevas A and González-Manteiga W (1994). A comparative study of several smoothing methods in density estimation. Computational Statistics and Data Analysis 17, 153-176.
Moreira C and de Uña-Álvarez J (2012) Kernel density estimation with doubly truncated data. Electronic Journal of Statistics 6, 501-521.
Moreira C and Van Keilegom I (2013) Bandwidth selection for kernel density estimation with doubly truncated data. Computational Statistics and Data Analysis 61, 107-123.
Sheather S and Jones M (1991) A reliable data-based bandwidth selection method for kernel density estimation. Journal of the Royal Statistical Society: Series B 53, 683-690.
Silverman BW (1986) Density Estimation. London: Chapman and Hall.
set.seed(4321)
n <- 50
X <- runif(n, 0, 1)
U <- runif(n, -1/3, 1)
V <- U + 1/3
# Redraw each triplet until it satisfies U <= X <= V
for (i in 1:n) {
  while (U[i] > X[i] | V[i] < X[i]) {
    X[i] <- runif(1, 0, 1)
    U[i] <- runif(1, -1/3, 1)
    V[i] <- U[i] + 1/3
  }
}
vxDens1 <- densityDT(X, U, V, bw = "DPI1", 0, 1, 500)
plot(vxDens1, type = "l")
vxDens2 <- densityDT(X, U, V, bw = "DPI2", 0, 1, 500)
vxDens3 <- densityDT(X, U, V, bw = 0.5, 0, 1, 500)
vxDens4 <- densityDT(X, U, V, bw = "LSCV", 0, 1, 500)
data(Quasars)
densityDT(Quasars[, 1], Quasars[, 2], Quasars[, 3], bw = "DPI1", -2.5, 2.2, 500)
# n is named here so that 500 is not matched positionally to 'from'
densityDT(Quasars[, 1], Quasars[, 2], Quasars[, 3], bw = 0.5, n = 500)
This function computes the NPMLE for the cumulative distribution function of X observed under one-sided (right or left) and two-sided (double) truncation. It also provides simple bootstrap pointwise confidence limits.
efron.petrosian(X, U = NA, V = NA, wt = NA, error = NA, nmaxit = NA, boot = TRUE, B = NA, alpha = NA, display.F = FALSE, display.S = FALSE)
X |
Numeric vector with the values of the target variable. |
U |
Numeric vector with the values of the left truncation variable. If there are no truncation values from the left, put |
V |
Numeric vector with the values of the right truncation variable. If there are no truncation values from the right, put |
wt |
Numeric vector of non-negative initial solution, with the same length as |
error |
Numeric value. Maximum pointwise error when estimating the density associated to X (f) in two consecutive steps. If missing, it defaults to 1e-06. |
nmaxit |
Numeric value. Maximum number of iterations. If this is missing, it is set to |
boot |
Logical. If TRUE (default), the simple bootstrap method is applied to lifetime distribution estimation. Pointwise confidence bands are provided. |
B |
Numeric value. Number of bootstrap resamples . The default |
alpha |
Numeric value. (1- |
display.F |
Logical. Default is FALSE. If TRUE, the estimated cumulative distribution function associated to |
display.S |
Logical. Default is FALSE. If TRUE, the estimated survival function associated to |
The NPMLE for the cumulative distribution function is computed by the first algorithm proposed in Efron and Petrosian (1999). This is an iterative algorithm which converges to the NPMLE after a number of iterations. If the second (respectively third) argument is missing, the Lynden-Bell estimator for right-truncated (respectively left-truncated) data is computed. Note that individuals with NAs in the first three arguments are automatically excluded.
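To make the idea of an iterative NPMLE concrete, here is a minimal base-R sketch of a self-consistency iteration of the kind associated with Efron and Petrosian (1999): the masses f_j on the observed X_j are updated through 1/f_j proportional to the sum over i of J_ij / F_i, where J_ij indicates whether X_j lies in [U_i, V_i] and F_i is the mass currently assigned to that interval. The function `npmle_dt`, its defaults and its return names are assumptions made for illustration; this is not the package's source code.

```r
# Self-consistency iteration for the NPMLE under double truncation (sketch).
# Hypothetical helper; 'error' and 'nmaxit' only mimic the documented arguments.
npmle_dt <- function(X, U, V, error = 1e-6, nmaxit = 1000) {
  n <- length(X)
  stopifnot(length(U) == n, length(V) == n)
  # J[i, j] = 1 if X[j] falls inside the truncation interval [U[i], V[i]]
  J <- outer(seq_len(n), seq_len(n),
             function(i, j) as.numeric(U[i] <= X[j] & X[j] <= V[i]))
  f <- rep(1 / n, n)                      # initial solution: uniform masses
  for (k in seq_len(nmaxit)) {
    Fi <- as.vector(J %*% f)              # mass captured by each interval
    f_new <- 1 / as.vector(t(J) %*% (1 / Fi))
    f_new <- f_new / sum(f_new)           # renormalise to a probability vector
    converged <- max(abs(f_new - f)) < error
    f <- f_new
    if (converged) break
  }
  list(density = f, n.iterations = k)
}

# Toy data: redraw triplets until U <= X <= V
set.seed(4321)
n <- 25
X <- runif(n, 0, 1); U <- runif(n, 0, 0.5); V <- runif(n, 0.5, 1)
for (i in 1:n) {
  while (X[i] < U[i] | X[i] > V[i]) {
    U[i] <- runif(1, 0, 0.5); X[i] <- runif(1, 0, 1); V[i] <- runif(1, 0.5, 1)
  }
}
fit <- npmle_dt(X, U, V)
cdf <- cumsum(fit$density[order(X)])      # estimated CDF at the sorted X values
```

When every interval contains every observation (no effective truncation), the iteration returns the uniform masses 1/n, that is, the ordinary empirical distribution.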
A list containing the following values:
time |
The timepoint on the curve. |
n.event |
The number of events that occurred at time |
events |
The total number of events. |
density |
The estimated density values. |
cumulative.df |
The estimated cumulative distribution values. |
truncation.probs |
The probability of |
S0 |
|
Survival |
The estimated survival values. |
n.iterations |
The number of iterations used by this algorithm. |
B |
Number of bootstrap resamples computed. |
alpha |
The nominal level used to construct the confidence intervals. |
upper.df |
The estimated upper limits of the confidence intervals for F. |
lower.df |
The estimated lower limits of the confidence intervals for F. |
upper.Sob |
The estimated upper limits of the confidence intervals for S. |
lower.Sob |
The estimated lower limits of the confidence intervals for S. |
sd.boot |
The bootstrap standard deviation of F estimator. |
boot.repeat |
The number of resamples done in each bootstrap call to ensure the existence and uniqueness of the bootstrap NPMLE. |
Carla Moreira, Jacobo de Uña-Álvarez and Rosa Crujeiras
Efron B and Petrosian V (1999) Nonparametric methods for doubly truncated data. Journal of the American Statistical Association 94, 824-834.
Lynden-Bell D (1971) A method of allowing for known observational selection in small samples applied to 3CR quasars. Monthly Notices of the Royal Astronomical Society 155, 95-118.
Xiao J and Hudgens MG (2020) On nonparametric maximum likelihood estimation with double truncation. Biometrika 106, 989-996.
## Generating data which are doubly truncated
set.seed(4321)
n <- 25
X <- runif(n, 0, 1)
U <- runif(n, 0, 0.5)
V <- runif(n, 0.5, 1)
for (i in 1:n) {
  while (X[i] < U[i] | X[i] > V[i]) {
    U[i] <- runif(1, 0, 0.5)
    X[i] <- runif(1, 0, 1)
    V[i] <- runif(1, 0.5, 1)
  }
}
efron.petrosian(X = X, U = U, V = V, boot = FALSE, display.F = TRUE, display.S = TRUE)
Digitized data from Figure 2 in Ye and Tang (2016). The dataset contains (rounded) observations of 174 failure times of certain devices, observed under interval sampling. The right truncation time is the number of years between installation and 2011, and the left truncation time is the right truncation time minus 15 years. The failure time is doubly truncated due to the interval sampling.
data("EqSRounded")
A data frame with 174 observations on the following 3 variables.
X
a numeric vector, time to failure in years.
U
a numeric vector, the number of years between installation and 2011 minus 15 years.
V
a numeric vector, the number of years between installation and 2011.
Digitization of the data plot in the original paper of Ye and Tang (2016).
Ye ZS and Tang LC (2016) Augmenting the unreturned for field data with information on returned failures only. Technometrics 58, 513-523.
data(EqSRounded)
str(EqSRounded)
This function provides the nonparametric kernel hazard estimator for a variable observed under random double truncation, defined as a convolution of a kernel function with the NPMLE of the cumulative hazard. A least squares cross-validation bandwidth selection procedure is also implemented.
hazardDT(X, U, V, bw = "LSCV", from, to, n, wg = NA)
X |
Numeric vector with the values of the target variable. |
U |
Numeric vector with the values of the left truncation variable. |
V |
Numeric vector with the values of the right truncation variable. |
bw |
The smoothing bandwidth to be used, but can also be a character string giving a rule to choose the bandwidth. This must be |
from |
the left point of the grid at which the hazard is to be estimated. The default is min(X)+1e-04. |
to |
the right point of the grid at which the hazard is to be estimated. The default is max(X)-1e-04. |
n |
number of evaluation points on an equally spaced grid. |
wg |
Numeric vector of random weights to correct for double truncation. Default weights correspond to the Efron-Petrosian NPMLE. |
The nonparametric kernel hazard estimator for a variable observed under random double truncation is computed as proposed in Moreira et al. (2021). As usual in kernel smoothing, the estimator is obtained as a convolution between a kernel function and an appropriate estimator of the cumulative hazard. A Gaussian kernel is used. The automatic bandwidth selection procedure for the kernel hazard estimator is least squares cross-validation, presented in Moreira et al. (2021).
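As a rough illustration of convolving a kernel with a discrete cumulative hazard estimator, the base-R sketch below places a Gaussian kernel at each observation and weights it by the corresponding cumulative hazard jump, taken here as w_i divided by the estimated survival just before X_i. The function `hazard_weighted` and the toy weights are hypothetical; the weights are assumed to be probability masses that sum to one, and the exact estimator used by the package may differ in detail.

```r
# Kernel hazard estimate from probability weights (sketch):
#   h(x) = sum_i K_h(x - X_i) * w_i / S(X_i-),
# where S(X_i-) is the total weight of observations >= X_i.
# Hypothetical helper for illustration; not a DTDA function.
hazard_weighted <- function(x_grid, X, w, h) {
  stopifnot(length(X) == length(w), h > 0)
  S_minus <- sapply(X, function(xi) sum(w[X >= xi]))
  jumps <- w / S_minus                   # jumps of the cumulative hazard
  sapply(x_grid, function(x) sum(jumps * dnorm(x, mean = X, sd = h)))
}

# Toy illustration with equal, made-up weights
X <- c(1, 2, 3)
w <- rep(1 / 3, 3)
grid <- seq(-2, 6, by = 0.01)
haz <- hazard_weighted(grid, X, w, h = 0.3)
```

In this toy case the cumulative hazard jumps are 1/3, 1/2 and 1, so the estimated hazard integrates to their sum, 11/6, rather than to one as a density would.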
A list containing the following values:
x |
the n coordinates of the points where the hazard is estimated. |
y |
the estimated hazard values. |
bw |
the bandwidth used. |
Carla Moreira, Jacobo de Uña-Álvarez and Rosa Crujeiras
Moreira C, de Uña-Álvarez J, Santos AC and Barros H (2021) Smoothing Methods to estimate the hazard rate under double truncation. https://arxiv.org/abs/2103.14153.
set.seed(4321)
n <- 100
X <- runif(n, 0, 1)
U <- runif(n, -1/3, 1)
V <- U + 1/3
for (i in 1:n) {
  while (U[i] > X[i] | V[i] < X[i]) {
    X[i] <- runif(1, 0, 1)
    U[i] <- runif(1, -1/3, 1)
    V[i] <- U[i] + 1/3
  }
}
vxhazard1 <- hazardDT(X, U, V, bw = 0.3, 0, 1, 500)
vxhazard2 <- hazardDT(X, U, V, bw = "LSCV", 0, 1, 500)
This function computes the NPMLE for the cumulative distribution function of X observed under one-sided (right or left) and two-sided (double) truncation. It also provides simple bootstrap pointwise confidence limits. This function allows for ties in the samples of X, U and V.
lynden(X, U = NA, V = NA, error = NA, nmaxit = NA, boot = TRUE, B = NA, alpha = NA, display.F = FALSE, display.S = FALSE)
X |
Numeric vector with the values of the target variable. |
U |
Numeric vector with the values of the left truncation variable. If there are no truncation values from the left, put |
V |
Numeric vector with the values of the right truncation variable. If there are no truncation values from the right, put |
error |
Numeric value. Maximum pointwise error when estimating the density associated to X (f) in two consecutive steps. If missing, it defaults to 1e-06. |
nmaxit |
Numeric value. Maximum number of iterations. If this is missing, it is set to |
boot |
Logical. If TRUE (default), the simple bootstrap method is applied to lifetime distribution estimation. Pointwise confidence bands are provided. |
B |
Numeric value. Number of bootstrap resamples . The default |
alpha |
Numeric value. (1- |
display.F |
Logical. Default is FALSE. If TRUE, the estimated cumulative distribution function associated to |
display.S |
Logical. Default is FALSE. If TRUE, the estimated survival function associated to |
The NPMLE for the cumulative distribution function is computed by the second algorithm proposed in Efron and Petrosian (1999). This is an iterative algorithm which converges to the NPMLE after a number of iterations. If the second (respectively third) argument is missing, the Lynden-Bell estimator for right-truncated (respectively left-truncated) data is obtained. Note that individuals with NAs in the first three arguments are automatically excluded.
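For the one-sided case, the Lynden-Bell estimator has a closed product-limit form. The base-R sketch below covers left truncation only (U <= X, no right truncation) and allows ties. The function `lynden_bell_left` and its column names are hypothetical and chosen for this illustration; it is not the package's implementation. Right-truncated data can be handled with the same function by reversing the time axis, that is, by passing -X and -V.

```r
# Product-limit (Lynden-Bell) estimator for left-truncated data (sketch).
# At each distinct time t: d = number of events at t,
# at.risk = number of individuals with U <= t <= X.
# Hypothetical helper for illustration; not a DTDA function.
lynden_bell_left <- function(X, U) {
  stopifnot(length(X) == length(U), all(U <= X))
  times <- sort(unique(X))
  d <- sapply(times, function(t) sum(X == t))
  at_risk <- sapply(times, function(t) sum(U <= t & t <= X))
  surv <- cumprod(1 - d / at_risk)       # estimate of P(X > t)
  data.frame(time = times, n.event = d, at.risk = at_risk,
             cdf = 1 - surv, surv = surv)
}

# Toy left-truncated sample: redraw pairs until U <= X
set.seed(4321)
n <- 25
X <- runif(n, 0, 1); U <- runif(n, 0, 0.5)
for (i in 1:n) {
  while (X[i] < U[i]) {
    X[i] <- runif(1, 0, 1); U[i] <- runif(1, 0, 0.5)
  }
}
res_lb <- lynden_bell_left(X, U)
```

Without truncation (all U equal to -Inf) the estimate coincides with the ordinary empirical distribution function. With real truncated data the risk set can empty out early, which forces the survival estimate to zero before the largest observation.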
A list containing the following values:
time |
The timepoint on the curve. |
n.event |
The number of events that occurred at time |
events |
The total number of events. |
NJ |
The number of individuals at risk considering the left truncation times. |
density |
The estimated density values. |
cumulative.df |
The estimated cumulative distribution values. |
truncation.probs |
The probability of |
hazard |
The estimated hazard values. |
S0 |
|
Survival |
The estimated survival values. |
n.iterations |
The number of iterations used by this algorithm. |
B |
Number of bootstrap resamples computed. |
alpha |
The nominal level used to construct the confidence intervals. |
upper.df |
The estimated upper limits of the confidence intervals for F. |
lower.df |
The estimated lower limits of the confidence intervals for F. |
upper.Sob |
The estimated upper limits of the confidence intervals for S. |
lower.Sob |
The estimated lower limits of the confidence intervals for S. |
sd.boot |
The bootstrap standard deviation of F estimator. |
boot.repeat |
The number of resamples done in each bootstrap call to ensure the existence and uniqueness of the bootstrap NPMLE. |
Carla Moreira, Jacobo de Uña-Álvarez and Rosa Crujeiras
Efron B and Petrosian V (1999) Nonparametric methods for doubly truncated data. Journal of the American Statistical Association 94, 824-834.
Lynden-Bell D (1971) A method of allowing for known observational selection in small samples applied to 3CR quasars. Monthly Notices of the Royal Astronomical Society 155, 95-118.
# Generating data which are doubly truncated
set.seed(4321)
n <- 25
X <- runif(n, 0, 1)
U <- runif(n, 0, 0.25)
V <- runif(n, 0.75, 1)
for (i in 1:n) {
  while (X[i] < U[i] | X[i] > V[i]) {
    U[i] <- runif(1, 0, 0.25)
    X[i] <- runif(1, 0, 1)
    V[i] <- runif(1, 0.75, 1)
  }
}
res <- lynden(X = X, U = U, V = V, boot = FALSE, display.F = TRUE, display.S = TRUE)
# Generating data which are right truncated
set.seed(4321)
n <- 25
X <- runif(n, 0, 1)
V <- runif(n, 0.75, 1)
for (i in 1:n) {
  while (X[i] > V[i]) {
    X[i] <- runif(1, 0, 1)
    V[i] <- runif(1, 0.75, 1)
  }
}
res <- lynden(X = X, U = NA, V = V, boot = FALSE)
The sample consists of DNA from 99 Caucasian Parkinson's Disease (PD) patients with earlier onset PD (age 35-55 years). To remove the selection bias related to survival, the study was limited to patients diagnosed with PD who had their DNA sample taken within eight years after onset. Consequently, the age of onset is doubly truncated by the age at blood sampling and this age minus 8 years. This is a situation of interval sampling, the sampling interval being subject-specific.
data("PDearly")
A data frame with 99 observations on the following 5 variables.
X
a numeric vector, age at onset of PD (in years).
U
a numeric vector, age at blood sampling minus 8 years.
V
a numeric vector, age at blood sampling.
SNP_A10398G
a factor with allele levels A and G.
SNP_PGC1a
a factor with allele levels A, AG and G.
Clark et al. (2011) hypothesized that the rs8192678 PGC-1a single nucleotide polymorphism (SNP) and the A10398G mitochondrial SNP may influence risk or age of onset of PD. To test these hypotheses, genomic DNA samples from human blood samples were obtained from the National Institute of Neurological Disorders and Stroke (NINDS) Human Genetics DNA and Cell Line Repository at the Coriell Institute for Medical Research (Camden, New Jersey).
Mandel M, de Uña-Álvarez J, Simon DK and Betensky R (2018). Inverse Probability Weighted Cox Regression for Doubly Truncated Data. Biometrics 74, 481-487.
Clark J, Reddy S, Zheng K, Betensky RA and Simon DK (2011) Association of PGC-1alpha polymorphisms with age of onset and risk of Parkinson's disease. BMC Medical Genetics 12, 69.
data(PDearly)
str(PDearly)
The sample consists of DNA from 100 Caucasian Parkinson's Disease (PD) patients with late onset PD (age 63-87 years). To remove the selection bias related to survival, the study was limited to patients diagnosed with PD who had their DNA sample taken within eight years after onset. Consequently, the age of onset is doubly truncated by the age at blood sampling and this age minus 8 years. This is a situation of interval sampling, the sampling interval being subject-specific.
data("PDlate")
A data frame with 100 observations on the following 5 variables.
X
a numeric vector, age at onset of PD (in years).
U
a numeric vector, age at blood sampling minus 8 years.
V
a numeric vector, age at blood sampling.
SNP_A10398G
a factor with allele levels A and G.
SNP_PGC1a
a factor with allele levels A, AG and G.
Clark et al. (2011) hypothesized that the rs8192678 PGC-1a single nucleotide polymorphism (SNP) and the A10398G mitochondrial SNP may influence risk or age of onset of PD. To test these hypotheses, genomic DNA samples from human blood samples were obtained from the National Institute of Neurological Disorders and Stroke (NINDS) Human Genetics DNA and Cell Line Repository at the Coriell Institute for Medical Research (Camden, New Jersey).
Mandel M, de Uña-Álvarez J, Simon DK and Betensky R (2018). Inverse Probability Weighted Cox Regression for Doubly Truncated Data. Biometrics 74, 481-487.
Clark J, Reddy S, Zheng K, Betensky RA and Simon DK (2011) Association of PGC-1alpha polymorphisms with age of onset and risk of Parkinson's disease. BMC Medical Genetics 12, 69.
data(PDlate)
str(PDlate)
The original dataset studied by Efron and Petrosian (1999) comprised independently collected quadruplets of the redshift and the apparent magnitude of a quasar object. Due to experimental constraints, the distribution of each luminosity in a log-scale is truncated to a known interval.
data(Quasars)
A data frame with 210 observations on the following 3 variables.
y (adj lum)
a numeric vector, the log luminosity values.
u (lower)
a numeric vector, lower truncation limits.
v (upper)
a numeric vector, upper truncation limits.
Each quadruplet in the original data set studied by Efron and Petrosian (1999) includes the redshift of the ith quasar and its apparent magnitude. Due to experimental constraints, the distribution of each luminosity in the log-scale, obtained through a transformation which depends on the cosmological model assumed (see Efron and Petrosian (1999) for details), is truncated to a known interval. Quasars with apparent magnitude above the upper limit were too dim to yield dependable redshifts, and hence they were excluded from the study. The lower limit was used to avoid confusion with non-quasar stellar objects.
Vahé Petrosian and Bradley Efron.
Boyle BJ, Fong R, Shanks T and Peterson BA (1990) A catalogue of faint, UV-excess objects. Monthly Notices of the Royal Astronomical Society 243, 1-56.
Efron B and Petrosian V (1999) Nonparametric methods for doubly truncated data. Journal of the American Statistical Association 94, 824-834.
data(Quasars)
str(Quasars)
Random generation functions of doubly truncated data with two different patterns of observational bias.
rsim.DT(n,tau, model=NULL)
n |
number of observations to generate. |
tau |
length of the observational window. |
model |
model to be simulated. Number 1 or 2, corresponding to different patterns of observational bias. |
In both models, V = U + tau. With model=1 there is no observational bias due to double truncation, while with model=2 double truncation induces observational bias.
A matrix with n rows containing the generated triplets of doubly truncated data, in which each value of the target variable lies between its left and right truncation values.
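A generic way to generate doubly truncated triplets with a fixed window length is rejection sampling: draw the target and the left truncation time, set V = U + tau, and keep the triplet only if U <= X <= V. The base-R sketch below does this under assumed uniform distributions, X on (0, 1) and U on (-tau, 1), which are the ones used in the simulated examples elsewhere in this manual. The function `sim_dt` is hypothetical and does not claim to reproduce either model of `rsim.DT`.

```r
# Rejection sampler for doubly truncated triplets with V = U + tau (sketch).
# Assumed distributions: X ~ Unif(0, 1), U ~ Unif(-tau, 1).
# Hypothetical helper for illustration; not a DTDA function.
sim_dt <- function(n, tau) {
  stopifnot(n >= 1, tau > 0)
  out <- matrix(NA_real_, nrow = n, ncol = 3,
                dimnames = list(NULL, c("X", "U", "V")))
  i <- 1
  while (i <= n) {
    x <- runif(1, 0, 1)
    u <- runif(1, -tau, 1)
    if (u <= x && x <= u + tau) {        # keep only observable triplets
      out[i, ] <- c(x, u, u + tau)
      i <- i + 1
    }
  }
  out
}

set.seed(4321)
sim <- sim_dt(50, tau = 0.5)
```

Under these particular assumptions the probability of observing a value x is tau / (1 + tau) for every x in (0, 1), so truncation does not distort the distribution of X; a different distribution for U would generally induce observational bias.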
Carla Moreira, Jacobo de Uña-Álvarez and Rosa Crujeiras
set.seed(4321)
rsim.DT(500, 1/2, model = 2)
This function computes the NPMLE for the cumulative distribution function of X observed under one-sided (right or left) and two-sided (double) truncation. The NPMLE of the joint distribution of the truncation times, along with its marginal distributions, is also computed. It also provides bootstrap pointwise confidence limits.
shen(X, U = NA, V = NA, wt = NA, error = NA, nmaxit = NA, boot = TRUE, boot.type = "simple", B = NA, alpha = NA, display.FS = FALSE, display.UV = FALSE, plot.joint = FALSE, plot.type = NULL)
X |
Numeric vector with the values of the target variable. |
U |
Numeric vector with the values of the left truncation variable. If there are no truncation values from the left, put |
V |
Numeric vector with the values of the right truncation variable. If there are no truncation values from the right, put |
wt |
Numeric vector of non-negative initial solution, with the same length as |
error |
Numeric value. Maximum pointwise error when estimating the density associated to X (f) in two consecutive steps. If missing, it defaults to 1e-06. |
nmaxit |
Numeric value. Maximum number of iterations. If this is missing, it is set to |
boot |
Logical. If TRUE (default), the simple bootstrap method is applied to lifetime and truncation times distributions estimation. Pointwise confidence bands are provided. |
boot.type |
A character string giving the bootstrap type to be used. This must be one of |
B |
Numeric value. Number of bootstrap resamples . The default |
alpha |
Numeric value. (1- |
display.FS |
Logical. Default is FALSE. If TRUE, the estimated cumulative distribution function and the estimated survival function associated to |
display.UV |
Logical. Default is FALSE. If TRUE, the marginal distributions of |
plot.joint |
Logical. Default is FALSE. If TRUE, the joint distribution of the truncation times is plotted. |
plot.type |
A character string giving the plot type to be used to represent the joint distribution of the truncation times. This must be one of "image" or "persp", with default |
The NPMLE for the cumulative distribution function is computed by a single algorithm proposed in Shen (2010). This is an iterative algorithm which converges to the NPMLE after a number of iterations. Initial solutions are given by the ordinary empirical distribution functions. If the second (respectively third) argument is missing, the Lynden-Bell estimator for right-truncated (respectively left-truncated) data is computed. Note that individuals with NAs in the first three arguments are automatically excluded.
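The link between the distribution of X and the joint distribution of the truncation times can be illustrated by inverse probability weighting: a pair (U_i, V_i) is observed with probability F_i, the mass that the distribution of X puts on [U_i, V_i], so each observed pair receives a mass proportional to 1/F_i. The base-R sketch below computes these masses for given masses `f` on the observed X values. The function `truncation_masses`, its return names and the toy inputs are hypothetical, and `f` is a placeholder rather than a fitted NPMLE; this is not the package's implementation of Shen (2010).

```r
# Masses on the observed truncation pairs, given masses f on the X values (sketch).
#   F_i = sum_j J_ij f_j,  k_i = (1 / F_i) / sum_l (1 / F_l)
# Hypothetical helper for illustration; not a DTDA function.
truncation_masses <- function(X, U, V, f) {
  n <- length(X)
  stopifnot(length(U) == n, length(V) == n, length(f) == n)
  # J[i, j] = 1 if X[j] falls inside [U[i], V[i]]
  J <- outer(seq_len(n), seq_len(n),
             function(i, j) as.numeric(U[i] <= X[j] & X[j] <= V[i]))
  Fi <- as.vector(J %*% f)
  k <- (1 / Fi) / sum(1 / Fi)
  ord <- order(U)
  list(k = k, u = U[ord], cdf.U = cumsum(k[ord]))   # marginal CDF of U
}

# Toy inputs; f is a uniform placeholder, not an estimate
X <- c(0.2, 0.5, 0.8)
U <- c(0.0, 0.1, 0.4)
V <- c(0.6, 0.9, 1.0)
tm <- truncation_masses(X, U, V, f = rep(1 / 3, 3))
```

In the toy example the first and third intervals each contain two of the three X values and the second contains all three, so the pair masses are 0.375, 0.25 and 0.375.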
A list containing the following values:
time |
The timepoint on the curve. |
n.event |
The number of events that occurred at time |
events |
The total number of events. |
density |
The estimated density values associated to |
cumulative.df |
The estimated cumulative distribution values of |
truncation.probs |
The probability of |
S0 |
|
Survival |
The estimated survival values. |
density.joint |
The estimated joint densities values associated to |
marginal.U |
The estimated cumulative univariate marginal values of the |
marginal.V |
The estimated cumulative univariate marginal values of the |
cumulative.joint |
The estimated joint cumulative distribution values. |
n.iterations |
The number of iterations used by this algorithm. |
biasf |
The estimated probabilities of observing the lifetimes. |
Boot |
The type of bootstrap method applied. |
B |
Number of bootstrap resamples computed. |
alpha |
The nominal level used to construct the confidence intervals. |
upper.df |
The estimated upper limits of the confidence intervals for F. |
lower.df |
The estimated lower limits of the confidence intervals for F. |
upper.Sob |
The estimated upper limits of the confidence intervals for S. |
lower.Sob |
The estimated lower limits of the confidence intervals for S. |
upper.fU |
The estimated upper limits of the confidence intervals for |
lower.fU |
The estimated lower limits of the confidence intervals for |
upper.fV |
The estimated upper limits of the confidence intervals for |
lower.fV |
The estimated lower limits of the confidence intervals for |
sd.boot |
The bootstrap standard deviation of F estimator. |
boot.repeat |
The number of resamples done in each bootstrap call to ensure the existence and uniqueness of the bootstrap NPMLE. |
Carla Moreira, Jacobo de Uña-Álvarez and Rosa Crujeiras
Lynden-Bell D (1971) A method of allowing for known observational selection in small samples applied to 3CR quasars. Monthly Notices of the Royal Astronomical Society 155, 95-118.
Shen P-S (2010) Nonparametric analysis of doubly truncated data. Annals of the Institute of Statistical Mathematics 62, 835-853.
Xiao J and Hudgens MG (2020) On nonparametric maximum likelihood estimation with double truncation. Biometrika 106, 989-996.
## Generating data which are doubly truncated
set.seed(4321)
n <- 100
X <- runif(n, 0, 1)
U <- runif(n, 0, 0.67)
V <- runif(n, 0.33, 1)
for (i in 1:n) {
  while (X[i] < U[i] | X[i] > V[i]) {
    U[i] <- runif(1, 0, 0.67)
    X[i] <- runif(1, 0, 1)
    V[i] <- runif(1, 0.33, 1)
  }
}
res <- shen(X, U, V, boot = FALSE, plot.joint = TRUE, plot.type = "persp")