Date: 27.02.03 (dd.mm.yy)
Doctorate in Statistics, 09/09/1994
TURLACH Berwin, India
Supervisors:
Abstract
The thesis addresses several interrelated topics that are currently of great interest both in statistical theory and in computational statistics. The well-known generalized linear model is reconsidered under semi-parametric or non-parametric assumptions, leading to single index models and generalized additive models. Berwin Turlach proposes the algorithms needed for estimating functionals in generalized additive models or parameters in the single index model. He studies this subject thoroughly, taking into account all the problems that arise, for example that of correlated explanatory variables. He suggests using techniques based on binning to speed up the computations in the case of average derivative estimation. The algorithms are implemented in the XploRe environment.
Several statistical results in the thesis are new, for example the procedure for identifying the variables that should be included in a generalized additive model, or the misspecification test for the single index model. The results of Härdle and Tsybakov are generalized to the case of the generalized model and to the choice of interaction terms. He gives a detailed discussion of the assumption of independence between covariates and suggests new ideas.
Doctorate in Statistics, 13/03/1995
NAMORO Soiliou Daw, Togo
Supervisor: Jean-Marie ROLIN, UCL/STAT
Abstract
The objective of this work is the statistical analysis of the dynamic model used by control engineers (also called a "dynamical system"). The linear version of this model is the starting point for a more general analysis. The first chapter summarizes the standard linear theory, with emphasis on the Markovian structure of the model and on the recursive nature of inference within it (Kalman-Bucy filter). The second chapter addresses the question of the Markovian representation of processes. Results in this area indeed suggest models of the type considered, and thus provide a theoretical justification for considering such models. The optimality of Markovian representations is studied first in a linear framework (Hilbert space), then in a nonlinear framework (σ-algebraic analysis).
In the last chapter, the model is reconsidered in a general abstract form that retains only its structure of independence among the variables involved. This general form, called a "Bayesian dynamical system", involves an incidental parameter (the state parameter) and a structural parameter. The exact estimability of the structural parameter is studied. The linear model then serves as an example illustrating the results obtained in this general framework. The identification of the structural parameter is also examined in the particular case of the linear model, and this analysis establishes the link between the concepts of observability and identifiability. Also in the last chapter, the recursivity of the minimal sufficient statistic (at each instant) for the pair "state parameter, structural parameter" is established in the general case. The Kalman-Bucy filter appears as a particular example of this recursion. Finally, the consequences of the existence of such a recursion are examined in relation to the problem of constructing exact finite-dimensional filters.
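The recursive character of inference that this abstract attributes to the Kalman-Bucy filter can be illustrated with a minimal sketch. It uses the scalar discrete-time Kalman filter rather than the continuous-time Kalman-Bucy form, and the function name `kalman_filter` and the parameters `a`, `c`, `q`, `r`, `m0`, `p0` are illustrative assumptions, not notation taken from the thesis.

```python
def kalman_filter(ys, a, c, q, r, m0, p0):
    """Scalar discrete-time Kalman filter (illustrative sketch).

    Assumed model, not taken from the thesis:
        state:       x_t = a * x_{t-1} + w_t,  Var(w_t) = q
        observation: y_t = c * x_t + v_t,      Var(v_t) = r
    Returns the filtered means and variances of x_t given y_1..y_t.
    """
    m, p = m0, p0
    means, variances = [], []
    for y in ys:
        # Prediction step: propagate the previous posterior through the state equation.
        m_pred = a * m
        p_pred = a * a * p + q
        # Update step: correct the prediction with the new observation only.
        gain = p_pred * c / (c * c * p_pred + r)
        m = m_pred + gain * (y - c * m_pred)
        p = (1.0 - gain * c) * p_pred
        means.append(m)
        variances.append(p)
    return means, variances


# Hypothetical usage: a constant state observed with unit-variance noise.
means, variances = kalman_filter([5.0] * 10, a=1.0, c=1.0, q=0.0, r=1.0, m0=0.0, p0=1e6)
```

Each step uses only the previous pair (mean, variance) and the newest observation, which is the recursion the abstract refers to.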
Doctorate in Statistics, 23/06/1995
SCHEIHING Eliana, Chile
Supervisor: MOUCHART Michel, UCL/STAT
Abstract
In this work we investigate two problems involving
a Bayesian analysis of discrete data. The first problem concerns
the analysis of Bayesian admissibility of the reductions by conditioning.
We study the conditions under which the admissibility by conditioning
holds, and next we consider the evaluation of the loss of information
when a non-admissible conditioning is used for an approximation
of the exact posterior distribution. We use the Fisher test, i.e.
a test conditional on the two margins in a 2x2 contingency table
as an example where the admissibility by conditioning is not generally
satisfied and then we quantify the corresponding loss of information
by means of a simulation study. The numerical results indicate
that for a specific range of parameters the loss of information
increases with the sample size and decreases with the precision
of the prior distribution. Hence this is a small sample size approximation.
The second problem is situated in the context of
discrete choice models. Bayesian inference for a semi-parametric
binary choice model is developed. We propose a semi-parametric
binary choice model stated in terms of a latent random variable
$\lambda$ such that $Y$, the binary choice random variable, satisfies

$$P(Y = 1 \mid Z, \beta, u) = P(\lambda < Z'\beta \mid Z, \beta, u) = u(Z'\beta),$$

where $Z \in \mathbb{R}^p$ is the vector of explanatory variables, and $u$
and $\beta$ are the model parameters. The link function, $u$,
is a priori distributed according to a Dirichlet process with
parameters $(n_0, P_0)$, and $\beta$
is a priori distributed according to an arbitrary distribution
$Q_0$ on $\mathbb{R}^p$.
We consider two methods to estimate the posterior
distribution of $\lambda$.
Both methods use Gibbs sampling to simulate $\lambda$.
The first method uses a simulation procedure for the
Dirichlet process (Rolin (1992)), which makes it possible to work with the
distribution of $\lambda$ conditionally on $u$.
The second one uses the Pólya urn representation of the Dirichlet
process to compute the distribution of $\lambda$ marginally with respect to $u$.
A numerical evaluation of both methods is presented as well as
an application to real data.
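The urn representation of the Dirichlet process mentioned in this abstract can be sketched as follows. This only illustrates the generic urn scheme for drawing a sample whose law has the random measure integrated out; it is not the Gibbs sampler of the thesis. The names `polya_urn_sample`, `n0` (total mass) and `base_sampler` (a sampler for the base measure) are assumptions made for the example.

```python
import random


def polya_urn_sample(n, n0, base_sampler, rng):
    """Draw n values from the urn scheme of a Dirichlet process.

    n0 > 0 is the total mass and base_sampler(rng) returns one draw from
    the base probability measure. Both names are illustrative assumptions.
    """
    draws = []
    for i in range(n):
        # With probability n0 / (n0 + i) take a fresh draw from the base measure;
        # otherwise repeat one of the i earlier draws, chosen uniformly at random.
        if rng.random() < n0 / (n0 + i):
            draws.append(base_sampler(rng))
        else:
            draws.append(rng.choice(draws))
    return draws


# Hypothetical usage with a standard normal base measure.
rng = random.Random(0)
sample = polya_urn_sample(200, 2.0, lambda r: r.gauss(0.0, 1.0), rng)
```

Because earlier values are re-drawn with positive probability, the sample contains ties, which reflects the discreteness of a Dirichlet process realization.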
Doctorate in Statistics, 31/07/1995
KLINKE Sigbert, Germany
Supervisors:
Abstract
"Data structures" describe the way a statistical program needs to handle its data. Data here are not only data in the statistical sense but also include graphical data objects (e.g. boxplots, histograms, etc.) and the consequences their use will have on the program.
A statistical program is composed of three elements: the user interface, the statistical methods and the statistical graphics. Today the user interface is a graphical user interface (GUI) provided by the operating system. In the first chapter I explain why statisticians need interactive programming environments and describe the tools that GUIs offer. The second chapter describes some statistical graphics such as boxplots, histograms and scatterplots, and shows how linked plots can help in the analysis, e.g. of subgroups, especially in multivariate data analysis.
The third chapter describes Exploratory Projection Pursuit in detail. It is a technique for analyzing multivariate data that combines statistical methods and statistical graphics. Such combinations of mathematical and graphical methods will become more important in statistics in the future. I examine the speed of kernel-based indices and improve them by using binning techniques for the underlying kernel density estimation. The behaviour of bandwidth selection by the rule of thumb is examined, and a new method based on minimizing the mean squared error is suggested. The possibility of multivariate projections and the treatment of discrete variables are also examined.
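Binning for kernel density estimation, as mentioned in this paragraph, can be sketched in a few lines. The version below is a generic illustration under stated assumptions (simple binning onto an equispaced grid and a Gaussian kernel truncated at four bandwidths); it is not the XploRe implementation, and the name `binned_kde` and its parameters are hypothetical.

```python
import numpy as np


def binned_kde(x, lo, hi, n_bins, h):
    """Binned Gaussian kernel density estimate on an equispaced grid.

    Assumptions of this sketch: simple binning (each point is counted in
    its bin), kernel truncated at 4 * h, and a grid wider than the kernel.
    """
    x = np.asarray(x, dtype=float)
    counts, edges = np.histogram(x, bins=n_bins, range=(lo, hi))
    centers = 0.5 * (edges[:-1] + edges[1:])
    delta = edges[1] - edges[0]
    reach = int(np.ceil(4.0 * h / delta))
    if n_bins < 2 * reach + 1:
        raise ValueError("grid too coarse or too short for this bandwidth")
    # Kernel weights are evaluated once per bin offset, not once per data pair.
    offsets = np.arange(-reach, reach + 1) * delta
    kernel = np.exp(-0.5 * (offsets / h) ** 2) / (h * np.sqrt(2.0 * np.pi))
    density = np.convolve(counts, kernel, mode="same") / x.size
    return centers, density


# Hypothetical usage on simulated data.
rng = np.random.default_rng(0)
centers, density = binned_kde(rng.normal(size=2000), -6.0, 6.0, 240, 0.3)
```

The cost is driven by the number of bins rather than the number of observations, which is where the speed-up comes from.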
Other, more complicated methods of data analysis are described in the fourth chapter: cluster analysis, teachware and (non-parametric) regression. Whereas cluster analysis is again an exploratory tool, teachware is used to teach statistics to students. Teachware often fails to fulfill its aims, however, and it needs a lot of interactivity. The third part of the chapter describes (non-parametric) regression methods, and I discuss how these methods should be implemented: as black-box commands or as procedures that everyone can edit and change.
The fifth chapter describes the data structures I propose for statistical software: for the graphical objects, for the data objects and for the linking. I take a short look at what other statistical programs offer and find that they partially realize these structures. The next chapter shows the implementation of these data structures in XploRe 3.2 and XploRe 4.0. Not all aims could be fulfilled in XploRe 3.2, since the data structures are the foundation of a statistical program and decisions about them have to be made very early. Thus some mistakes were made in the development of XploRe 3.2, e.g. the use of matrices as basic data elements, the construction of different window types for some statistical graphics, etc. XploRe 4.0 has a multi-dimensional array as its basic data type, together with procedures that are able to handle these arrays.
Doctorate in Economics, 27/11/1995
DIAS PROENÇA Isabel Maria, Portugal
Supervisor: Wolfgang HÄRDLE, Humboldt-Universität, Berlin, Germany
Abstract
The use of "single index" models is becoming widespread in econometrics and biometrics. Logit and probit models are special cases, corresponding to specific link functions. This thesis considers a specification test that detects nonparametric deviations of the link function, which amounts to testing a parametric hypothesis against a semiparametric one.
The thesis is organized in seven chapters:
Doctorate in Economics, 16/04/1996
BERTSCHEK-ENTORF Irene, Germany
Supervisor: Wolfgang HÄRDLE, Humboldt-Universität, Berlin, Germany
Abstract
This thesis proposes a theoretical model of firms' innovative activity in which imports and foreign direct investment (FDI) have positive effects on the product and process innovation of domestic firms. These hypotheses are analysed empirically using semiparametric and nonparametric methods, focusing mainly on binary choice models. In addition, the practical performance of the semiparametric and nonparametric estimators is studied through simulations.
The thesis is organized as a collection of five articles:
Doctorate in Sciences (orientation: Statistics), 26/09/1997
PATILEA Valentin, Romania
Supervisor: Jean-Marie ROLIN, UCL/STAT
Abstract
In the first part we consider dominated, generally infinite-dimensional, models which can be written as a convex set of densities. We analyze the asymptotic behavior of nonparametric maximum likelihood estimators (NPMLE) when the true probability governing the independent data does not necessarily belong to the model (not even asymptotically). Using recent results from empirical process theory, which we recall at the beginning of this first part, we (re)obtain convergence of NPMLE towards a (pseudo-) true density as well as their rates of convergence. Extensions to non convex models, dependent data, M-estimators,... are also discussed. Afterwards the general results are applied to several examples: decreasing densities, mixture models, increasing and decreasing failure rate distributions,... The first part ends with a chapter on the asymptotic normality of linear functionals in mixture models. Based on the rates of convergence of the NPMLE we extend existing results on the asymptotic normality to the case of misspecified models.
In the second part we develop likelihood-based estimation methods for structural econometric models (nonlinear rational expectations, option pricing, auction models, ...). Many such econometric models characterize observable variables as highly nonlinear functions of some latent variables. These functions are one-to-one, but they depend on the unknown distribution of the latent variables through the equilibrium of the game and/or the learning process. The numerical complexity of the equilibrium definition therefore creates substantial obstacles to the direct implementation of maximum likelihood inference. Motivated by the fact that the law of motion of the latent variables is often defined in a considerably simpler way, simulation-based strategies have been developed recently. Here we propose alternative estimation strategies based on learning about the latent variables in order to perform approximate MLE directly inside the more tractable latent model. This leads us to build various indirect and recursive estimators which appear to be well suited both to the empirical implementation of nonlinear rational expectation models and to bounded rationality modeling.
Doctorate in Sciences (orientation: Statistics), 5/06/1998
WEINER Christian, Germany
Supervisor: Aloïs KNEIP, UCL/STAT
Abstract
The thesis investigates the stochastic properties of the FDH estimator for Farrell efficiency scores. The problem is of econometric interest: Farrell efficiency scores measure the degree of productive efficiency for firms and administrations, and they provide a tool to describe structural properties of a production possibilities set. Yet, these scores depend on the unknown production possibilities set, so they must be estimated. The Free Disposal Hull (FDH) is a nonparametric estimator of the production possibilities set, and the FDH estimator for Farrell efficiency scores is calculated relative to the FDH.
More precisely, the FDH is the smallest free disposal set that covers all observations. It is a very flexible estimator, because free disposability is the only condition imposed. The FDH is always a subset of the true production possibilities set, thus FDH estimates are always conservative.
If the observations are randomly drawn from a population distribution for which the support is the production possibilities set, then FDH estimates provide a consistent approximation. The thesis treats the stochastic properties of the FDH estimator in order to quantify this approximation.
Production of goods and services is usually represented by multivariate vectors, which consist of all used inputs and all produced outputs. As a consequence productivity and efficiency analysis is a multivariate problem. However, one can represent the Farrell efficiency scores as maximum or minimum of the support of a univariate random variable, and it turns out that the FDH estimator for Farrell efficiency scores is the sample maximum or minimum of iid observations. Therefore the error term of this estimator converges to a univariate extreme value distribution.
The sparseness of high-dimensional data is a general problem in nonparametric statistics, often referred to as the "curse of dimensionality", which also appears in boundary estimation.
Therefore the estimator has a poor rate of convergence for high-dimensional data. This is the price to pay for the nonparametric flexibility, and it limits the practical use of FDH estimators.
Simulations and an application to scale efficiencies for the US banking industry illustrate this effect.
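A minimal sketch of the FDH estimator of an input-oriented Farrell efficiency score may help to see why it reduces to a sample minimum. The formula used here is the usual textbook form (minimum, over observed units producing at least the target output, of the largest input ratio); the function name `fdh_input_efficiency` and the array layout are assumptions of this example, not taken from the thesis.

```python
import numpy as np


def fdh_input_efficiency(x0, y0, X, Y):
    """FDH estimate of the input-oriented Farrell efficiency of (x0, y0).

    X has shape (n, p) with strictly positive inputs, Y has shape (n, q).
    Illustrative sketch of the standard formula, not the thesis code.
    """
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    x0 = np.asarray(x0, dtype=float)
    y0 = np.asarray(y0, dtype=float)
    # Units that produce at least as much of every output as the evaluated point.
    dominating = np.all(Y >= y0, axis=1)
    if not dominating.any():
        raise ValueError("no observed unit produces at least y0")
    # For each such unit, the input contraction needed to reach it without
    # using more of any input is the largest input ratio.
    ratios = np.max(X[dominating] / x0, axis=1)
    # The estimator is a sample minimum of these univariate quantities.
    return float(ratios.min())


# Hypothetical data: three units with one input and one output.
X = np.array([[2.0], [4.0], [3.0]])
Y = np.array([[1.0], [2.0], [1.0]])
score = fdh_input_efficiency([3.0], [1.0], X, Y)
```

A score of 1 means the evaluated unit lies on the estimated FDH frontier; a score below 1 means some observed unit produces at least as much with proportionally less input.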
This thesis examines the stochastic properties of the FDH estimator of Farrell efficiency scores. The problem is of interest in econometrics: Farrell efficiency scores assess the degree of productive efficiency of firms and administrations. Moreover, they can be used to describe certain structural properties of a production set, that is, the set of all possible production processes. However, since Farrell efficiency scores depend on an unknown production set, they must be estimated. The "Free Disposal Hull" (FDH) is a nonparametric estimator of the production set, and the FDH estimator of Farrell efficiency scores is computed on the basis of the FDH.
More precisely, the FDH is the smallest set that is "free disposal" and envelops all observations. It is a very flexible estimator, because free disposability is the only condition imposed. The FDH set is always a subset of the true production set, so FDH estimators underestimate the true values. Moreover, the FDH set is a consistent approximation when the observations are drawn at random from a distribution whose support coincides with the production set.
The production of goods and services is usually represented by multivariate vectors made up of all inputs used and all outputs produced. Consequently, productivity and efficiency analysis is a multivariate problem. Nevertheless, Farrell efficiency scores can be represented as the maximum or minimum of the support of a univariate random variable; moreover, it is shown that the FDH estimator of the efficiency scores is the maximum or minimum of the observations in a sample. As a result, the error of this estimator converges to an extreme value distribution.
A general problem arises in nonparametric statistics because, in practice, high-dimensional data are sparse; in the literature this phenomenon is usually called the "curse of dimensionality". This problem also appears in the estimation of extreme values.
This is why the rate of convergence is poor when the observations are multidimensional. It is the price to pay for nonparametric flexibility. Simulations and an
application to the returns to scale of US banks illustrate this effect.
Doctorate in Sciences (orientation: Statistics), 23/10/1998
ARS Pierre, Belgium
Supervisor: José PARIS, UCL/STAT
Abstract
This work stems from the need for actuaries to control the risk
incurred by their company. We define the surplus of an insurance
company as the sum of its equity and its technical provisions.
We assume that the surplus process is a general semimartingale, and
our main objective is then to show the value, for risk theory, of
semimartingale theory and, to a lesser extent, of Malliavin calculus.
The main results concern ruin theory, but also asset-liability
management (ALM), which interacts with risk theory. We define an
adjustment process that extends the notion of local adjustment
function introduced by Asmussen and Nielsen (1995), and we obtain
generalizations of the Lundberg bound.
An extension of the duality principles (developed by
Asmussen and Petersen (1988), Dufresne and Gerber (1989), …) allows us
to derive approximate methods for determining the ruin probability in
the model, introduced by Paulsen (1993), in which the surplus is
invested in a risky asset.
Malliavin calculus leads to new results for the problem of managing
interest rate risk when the interest rate does not follow a diffusion
process. Natural extensions of stochastic duration are then developed.
Doctorate in Sciences (orientation: Statistics), 02/04/1999
COCCHI Daniela, Italy
Supervisor: Michel MOUCHART, UCL/STAT
Abstract
This thesis aims to propose a modern framework for modelling survey data. It deals with situations where individual data, i.e. data observed on distinct individuals, must be analysed. The set of individuals is considered finite, as is the case in real-world situations.
In the work we outline suitable approximations to a carefully designed hierarchical model for finite populations. The argument is divided into two parts: the development of a complete framework for inference and of the approximations for such a framework.
A particular issue in this context is the emphasis on the role of the probabilistic models involved: the structural model and the sampling model.
We propose a hierarchical linear model, which has, as a special case, a type II Anova model. Modelling considers the joint distribution of all the variables involved and introduces, in a stepwise manner, a sequence of hypotheses, which aim at obtaining admissible reductions of the underlying model and at finding operational simplifications. Within this context, special emphasis is given to the role of conditional modelling. The sampling structure is represented by a selection matrix which enters into all phases of the solution. Since, in type II Anova models, the exogenous variable is categorical, some important simplifications are possible in the first two moments of the joint distribution, which is the basis of all computations. Predictive inference for the proposed model is developed by means of Bayesian least squares approximations. When looking for approximations, two sources of arbitrariness appear: the choice of the coordinates, which depends on the object of inference, and the choice of the statistic to be conditioned on. Regarding the first issue, a careful choice of a set of approximating functions is first discussed. The solution is found conditionally on a statistic which allows conjectures about non-normality to be taken into account. As a special case, the solution obtained under the normal hypothesis is found when computations are performed conditionally on a simpler statistic.
The solutions based on least squares approximations rely on a number of linear algebra manipulations, some of which are collected in a series of appendices. Such developments of linear algebra are not a specific aim of our work, but rather a consequence of the assumed framework. The work contains a discussion of the comparison with other results in the literature performed by means of analytical comparisons and with the help of a simulation study.
Doctorate in Sciences (orientation: Statistics), 25/02/2000
SAN MARTÍN Ernesto, Chile
Supervisor: Michel MOUCHART, UCL/STAT
Abstract
This Ph.D. thesis is concerned with modelling problems in social and
behavioural sciences. The main aim of this thesis is to
propose a modelling strategy that pays particular attention to both the
contextual and the statistical meaning of the
hypotheses introduced in a statistical model.
The motivation is the
following: reading some literature on Structural Equations with
Latent Variables, it may be found that in many instances models are
presented at once with all hypotheses regrouped; often hypotheses are
redundant or assumed implicitly, and therefore it may be concluded that
the contextual meaning of each hypothesis has not been carefully thought
through and that hypotheses are motivated more by the need to justify a
numerical or an inferential procedure than by an attempt to model a real
context. The thesis makes a contribution at the level of model building,
and the underlying message is to encourage cross-fertilization between a
statistician and, for instance, a sociologist who can evaluate whether a
hypothesis is relevant or not in contextual terms. The modelling strategy
proposed in this thesis may be described as structural in the
sense of being monitored by a contextual theory.
The modelling strategy developed in this thesis essentially
consists in introducing the hypotheses progressively. The
motivation is twofold: firstly, decomposing a hypothesis into simple
elementary pieces makes the contextual interpretation of each piece
easier and, secondly, the meaning of a given hypothesis is conditional
on the hypotheses previously introduced. Consequently, the interpretation
of a statistical model critically depends on the order in which the
hypotheses have been introduced and, therefore, on the "logic" of
its construction.
Identification problems are typically involved in most structural
models and, in general, the interpretation of the parameters is crucially
conditioned by the identifying restrictions. For this class of problems,
the contribution of this thesis is twofold. Firstly, statistical models
are decomposed into a marginal and a conditional submodel and the
relationships between the identification of the submodels and the
identification of the complete model are analysed. Next the
statistical model relative to a sample a size, say
The thesis is divided into two parts:
the first one deals with general formulations concerning problems of
specification and of identification, whereas the second one applies the
general results to two particular cases: the class of Item Response
Models and the class of LISREL type models.
Doctorate in Sciences (orientation: Statistics), 29/09/2000
WALHIN Jean-François
Supervisor: José PARIS, UCL/STAT
Abstract
Since the 1980s, recursive formulae have been used in actuarial science, essentially to compute the probability function of aggregate claims distributions easily, i.e. without using the brute-force convolution formula.
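The abstract does not name a specific recursion. One well-known example of this kind is Panjer's recursion, sketched below for the compound Poisson case with claim sizes on the non-negative integers. The function name `panjer_compound_poisson` and its arguments are illustrative assumptions.

```python
import math


def panjer_compound_poisson(lam, severity_pmf, s_max):
    """Probability function of S = X_1 + ... + X_N for N ~ Poisson(lam).

    severity_pmf[j] = P(X = j) for j = 0, 1, ..., len(severity_pmf) - 1.
    Returns [P(S = 0), ..., P(S = s_max)] via Panjer's recursion
    (Poisson case), without computing any convolution power explicitly.
    """
    f = list(severity_pmf)
    # Starting value: P(S = 0) = exp(-lam * (1 - f_0)).
    g = [math.exp(-lam * (1.0 - f[0]))]
    for s in range(1, s_max + 1):
        total = 0.0
        for j in range(1, min(s, len(f) - 1) + 1):
            total += j * f[j] * g[s - j]
        # Poisson case of the recursion: g(s) = (lam / s) * sum_j j f_j g(s - j).
        g.append(lam * total / s)
    return g


# Hypothetical usage: claim sizes 1 or 2 with equal probability, lam = 2.
probs = panjer_compound_poisson(2.0, [0.0, 0.5, 0.5], 20)
```

When every claim has size 1, the aggregate claim is simply Poisson distributed, which gives an easy check of the recursion.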
Doctorate in Sciences (orientation: Statistics), 15/12/2000
BECK Benoît
Supervisor: Jean-Marie ROLIN, UCL/STAT
Abstract
Problems involving special patterns of incompleteness are thus presented. These problems have been studied with the help of two well-known Bayesian nonparametric techniques based on Pólya trees and Lévy processes. This manuscript is mainly composed of two parts, linked respectively to these two techniques. The first part deals with the set-censoring problem in a general measurable state space. This problem encompasses the interval censoring problem, the doubly censored data problem, the current status data problem, etc. Bayesian estimators are given, as well as an exact method of simulation for the posterior distribution. These estimators are explicit and rely neither on estimating equations, as in the classical solution of Turnbull (JRSSB 1976), nor on MCMC methods, as in the solution of Doss (An.Stat. 1994). As a particular case, the solution for the left censored survival problem is deduced, and a nonparametric method to estimate high quantiles based on empirical Bayesian considerations is proposed. The second part is devoted to event history (survival) data, as it only handles the real line. A multiple risk semi-parametric model with time-dependent covariates, for which both truncation and censoring mechanisms are allowed, is investigated.
This model combines Aalen's (additive hazards) model (An.Stat. 1978) and Cox's (proportional) model (JRSSB 1972). Although the posterior distribution associated with the latter model is given under general neutral to the right process priors, the result is then specialized to Dirichlet priors in order to exhibit an efficient method of simulation.
Doctorate in Sciences (orientation: Statistics), 07/12/2001
GODERNIAUX Anne-Cécile
Supervisor: Irène GIJBELS, UCL/STAT
Abstract
We consider the problem of change-point detection in nonparametric regression, i.e. nonparametric estimation of a regression curve
with jumps or change points.
The issue is that any nonparametric estimation method involves
the choice of parameters, call them smoothing parameters, and that
the performance of the estimation procedures heavily depends
on the choice of these parameters. Hence it is very important to
address the issue of how to choose these parameters in practice.
The main objective of this thesis is to propose nonparametric methods
that are automatic in
the sense that all smoothing parameters are chosen from the data.
We will first focus on data-driven estimation of the locations of the jump discontinuities.
More precisely, the objective
is to come up with an estimation procedure with data-driven choice of the bandwidth parameters and
with a built-in estimation of the number of discontinuity points, which performs well in practice.
As a basis, we use the two-step estimation method proposed by Gijbels, Hall
and Kneip (1999), for which it has been shown that the
estimator for the location of a jump discontinuity
achieves the optimal rate $n^{-1}$, where $n$ is the sample size.
This two-step estimation method involves the choice of two smoothing
parameters: the first step uses
the first derivative of a Nadaraya-Watson estimator as a diagnostic function, and the second
(least-squares) step requires the determination of a small interval around the preliminary estimator of the
jump resulting from the first step.
We propose a bootstrap algorithm to select these parameters in practice.
With this additional bootstrap procedure implemented we
obtain a fully data-driven two-step procedure for estimating a jump discontinuity
in an unknown regression function.
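The first step described above, using the derivative of a Nadaraya-Watson estimator as a diagnostic function, can be sketched roughly as follows. This is a simplified illustration and not the procedure of the thesis: the bandwidth `h` is fixed by hand rather than chosen by bootstrap, a Gaussian kernel and a numerical derivative are assumed, and the least-squares second step is omitted. The function names are hypothetical.

```python
import numpy as np


def nw_estimate(grid, x, y, h):
    """Nadaraya-Watson regression estimate on a grid (Gaussian kernel)."""
    # Rows index grid points, columns index observations.
    weights = np.exp(-0.5 * ((grid[:, None] - x[None, :]) / h) ** 2)
    return (weights @ y) / weights.sum(axis=1)


def preliminary_jump_location(x, y, h, grid):
    """Grid point where the derivative of the NW fit is largest in absolute value."""
    fitted = nw_estimate(grid, x, y, h)
    derivative = np.gradient(fitted, grid)  # numerical derivative along the grid
    return float(grid[np.argmax(np.abs(derivative))])


# Hypothetical simulated example: a linear trend with a jump of size 2 at 0.5.
rng = np.random.default_rng(1)
x = np.linspace(0.0, 1.0, 400)
y = x + 2.0 * (x >= 0.5) + rng.normal(scale=0.1, size=x.size)
grid = np.linspace(0.1, 0.9, 801)
location = preliminary_jump_location(x, y, 0.03, grid)
```

The result depends strongly on `h`, which is the practical difficulty that motivates a data-driven choice of the smoothing parameters.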
We also propose to generalize the fully data-driven procedure
for estimating jump discontinuities in a derivative curve. The method includes a data-driven way
of determining the number of discontinuities in a derivative curve.
Further, we deal with the problem of testing whether or not there is an abrupt
change in the regression function itself or in
its first derivative at certain (prespecified or not) locations.
We discuss a bootstrap procedure for this testing
problem, which does not rely on asymptotic laws. This is in contrast
with testing procedures available in the literature, which rely on
asymptotic distributions of the estimators involved.
The bootstrap testing procedures presented here use the
data-driven two-step estimation methods developed to locate jump discontinuities in
a regression function or in its derivatives.
As a consequence, the bootstrap testing procedures are also
fully data-driven.
Finally, in the bivariate setup, we consider the problem of jump detection in a regression surface.
We develop a fully data-driven two-step estimation procedure to locate the jump curve,
based on an idea similar to that used in the univariate setup.
We evaluate the performance of all proposed procedures via
an extensive simulation study, showing a good performance. The methods have also been illustrated
on some real data examples.
Doctorate in Sciences (orientation: Statistics), 14/12/2001
NICOL Florence
Supervisor: Alois KNEIP, Universität Mainz
Abstract
This thesis investigates the problem of registration in Functional Data Analysis (FDA).
FDA differs from standard statistical approaches in the nature
of the observations. Rather than individual points or vectors, the data $x_i(t)$ are functions observed for
each individual over some continuum of arguments, often called time arguments.
Examples are known in many fields of applied research, among others in biology and biomedicine
with longitudinal growth studies, in medicine with
psychophysiological studies of EEG curves and, for higher dimensional data, in medical imaging with
analysis of brain images.
Doctorate in Sciences (orientation: Statistics), 28/05/02
CLIMOV Daniela
Supervisor: Léopold SIMAR, UCL/STAT
Abstract
The ultimate aim of the research work presented in this thesis is to show that various semiparametric M-estimation methods provide a very useful means of estimating a regression model for count data, namely Poisson regression. It is also intended to show that semiparametric methods are also valuable in testing problems.
The main contribution of this thesis in this respect is threefold.
First, we propose a procedure that is robust with respect to the numerical instability inherent in the application of M-estimation methods to real data. In our Poisson regression setting, the objective function to be optimized is the pseudo likelihood function and the resulting estimator of the direction vector is called the Pseudo Maximum Likelihood (PML) estimator. We investigate, by simulation, the practical validity of the asymptotic behavior of the PML estimator and of the associated regression function estimator. In particular, it appears that the asymptotic results should not be used unless a huge number of observations is available. We propose a bootstrap procedure for approximating the variance of the direction estimator and a variant of the bagging method introduced by Breiman (1996), in order to numerically stabilize the PML estimation procedure. Our method gives reasonable results even for moderately sized samples and can therefore be used for statistical inference in practical situations.
Second, we derive two alternative M-estimation methods, based on risk estimation. For estimating the risk associated with the weighted
average squared error, we propose two data-driven selectors: weighted least-squares (WLS) and double smoothing (DS).
The first criterion is a weighted least-squares criterion plus a term which prevents undersmoothing in small samples, whereas the second method makes use of a double smoothing idea, as in Wand and Gutierrez (1997). Simulations are used to investigate the behavior of the above data-driven estimation methods in the single index Poisson model.
In small samples, our weighted least-squares and double smoothing methods outperform
both the pseudo maximum likelihood method and the weighted least-squares cross-validation method of Härdle, Hall and Ichimura (1993).
Finally, we provide a procedure for testing the validity of the Poisson assumption.
We propose a test statistic for overdispersion and derive its asymptotic distribution under the null hypothesis of a Poisson model. The distributional approximation is assessed by simulations in several regression scenarios. The results indicate that the asymptotic normal approximation is not satisfactory unless the sample size is very large. Therefore, we propose a bootstrap approach for conducting the overdispersion test in small samples.
The bootstrap procedures for the estimation of the direction vector variance and for the overdispersion test are illustrated on a real data sample.
Doctorate in Sciences (specialization: Statistics), 12/07/02
BEGUIN Claire
Supervisor: Léopold SIMAR, UCL/STAT
Abstract
In order to control health care costs, the Belgian government introduced in 1995 a regulation of the hospital payment system taking into account the pathologies measured by the Diagnosis Related Groups (DRG). In this context, the mean is usually estimated by a trimmed mean, i.e. a mean computed after excluding the values lying outside bounds built from the quartiles and the interquartile range, such as Q1 - k*(Q3-Q1) for the lower bound and Q3 + k*(Q3-Q1) for the upper bound. This thesis proposes to measure the variability of these bounds, in particular the upper bound, and to take into account the characteristics of the patients.
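The quartile-based bounds and the resulting trimmed mean can be sketched in Python as follows. This is only an illustration under stated assumptions: the function names are hypothetical, the default k = 1.5 is a conventional choice rather than a value given in the abstract, and the quartile convention (the "inclusive" method of the standard library) may differ from the one used in the regulation.

```python
import statistics


def quartile_bounds(values, k=1.5):
    """Return (Q1 - k*(Q3-Q1), Q3 + k*(Q3-Q1)) for a list of numbers.

    k = 1.5 is an assumed default, not a value taken from the thesis.
    """
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def trimmed_mean(values, k=1.5):
    """Mean of the values lying inside the quartile-based bounds."""
    lower, upper = quartile_bounds(values, k)
    kept = [v for v in values if lower <= v <= upper]
    return statistics.fmean(kept)
```

With expenses `[1, 2, 3, 4, 5, 6, 7, 8, 100]` the bounds are -3 and 13, so the stay costing 100 is excluded and the trimmed mean is 4.5 instead of the raw mean of about 15.1.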
First, the distribution of the total expenses for a pathology is estimated by a survival curve and the confidence interval of this curve is defined using a bootstrap technique. We look for the corresponding value of the bounds described above on the survival curve. When analysing myocardial infarction, the variability of the bounds, measured by the length of the confidence interval, may result in a 3.8% difference in the estimation of the mean.
We also propose the use of frontier models in order to rank hospital stays according to their expenses, taking into account the severity level of the patient. In this case, the hospital stays with the lowest total expenses for a given level of severity characterise the frontier of possibilities and are thus considered as being efficient. We work with deterministic models, parametric and nonparametric. In a univariate approach, we try to find the minimal achievable value of the length of stay (LOS) or the minimal value of the total expenses, and in a multivariate approach, we try to find simultaneously the minimal value of both LOS and medical fees, for a given level of severity of the patients. The efficiencies obtained by the parametric estimator are lower. The nonparametric model, which is more flexible than the parametric model with its restrictive hypotheses, envelops the data more closely and should be preferred for this type of analysis.
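A deterministic nonparametric frontier of this kind can be sketched in the univariate case as follows. The abstract does not say which nonparametric estimator is used, so the sketch assumes a rule in the spirit of the free disposal hull (FDH): each stay is compared with the cheapest stay whose severity is at least as high. The function name and the data layout are hypothetical.

```python
def frontier_efficiency(expenses, severity):
    """Efficiency score of each stay: frontier expense divided by own expense.

    Assumption (not from the thesis): the frontier at a given severity level is
    the lowest expense observed among stays with severity at least that high.
    A score of 1.0 means the stay lies on the frontier.
    """
    scores = []
    for own_expense, own_severity in zip(expenses, severity):
        frontier = min(
            x for x, s in zip(expenses, severity) if s >= own_severity
        )
        scores.append(frontier / own_expense)
    return scores
```

For expenses `[100, 80, 120, 60]` with severity levels `[2, 2, 3, 1]`, the first stay gets a score of 0.8 because a stay of the same severity cost only 80, while the three other stays lie on the frontier.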
However, deterministic models are very sensitive to extreme stays, so that some efficient stays could in fact be "too" efficient and considered as outliers. We try to highlight these stays using a simplified version of a method proposed by Wilson (1993) (B-W method). The too efficient stays are characterised by lower resource variables and a higher severity level. As expected, after the exclusion of the too efficient stays, the inefficient stays can be highlighted. The stays to be excluded from the hospital financing could be searched for among these inefficient stays. The inefficient stays present higher values of the resource variables for a lower severity level.
An alternative method for detecting the too efficient stays has been proposed by Simar (2001). It is based on a robust estimator of the frontier. The method is tested on the nonparametric model in the univariate approach minimizing the total expenses, and the results are compared to the previous method. The too efficient stays highlighted by the order-m frontier have a higher impact on the efficiencies of the other stays. Moreover, the order-m frontier proposes an estimation of a plausible frontier for a given level of severity. In some cases, when there are very few stays for a given level of severity, the order-m frontier fails to highlight the "too efficient" stays. For these cases, the B-W method offers an interesting alternative. So, a good way of working would be to use first the order-m frontier and second the B-W method when there is no stay with at least the same characteristics.
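The order-m idea can be sketched as the expected minimum expense among m stays drawn at random, with replacement, from the stays whose severity is at least a given level. The Python function below evaluates this expectation exactly from the empirical distribution. It is a hedged illustration only: the function name, the data layout and the way severity enters the comparison are assumptions, not the implementation of the thesis.

```python
def order_m_frontier(expenses, severity, level, m):
    """Expected minimum expense among m stays drawn with replacement.

    Only stays with severity >= level are eligible (an assumption of this
    sketch). With the eligible expenses sorted increasingly, the minimum of
    m draws is at least the i-th value (0-based) with probability
    ((n - i) / n) ** m, which yields the weights used below.
    """
    pool = sorted(x for x, s in zip(expenses, severity) if s >= level)
    n = len(pool)
    return sum(
        x * (((n - i) / n) ** m - ((n - i - 1) / n) ** m)
        for i, x in enumerate(pool)
    )
```

With m = 1 the function returns the mean expense of the eligible stays, and as m grows it approaches their minimum, so a single extreme stay influences the frontier less than it does with the full deterministic frontier.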
Doctorate in Sciences (specialization: Statistics), 20/12/02
DELOUILLE Véronique
Supervisor: Rainer von SACHS, UCL/STAT
Abstract
The recent development of wavelet transforms based on multiresolution analysis suggests new techniques
for nonparametric function estimation.
Indeed, wavelet procedures achieve denoising of a curve and adaptivity through thresholding of the empirical wavelet coefficients.
The localization both in space and frequency allows wavelet estimators to perform well in regions of high regularity, without deteriorating in the neighborhood of discontinuities.
However, standard wavelet estimators are optimal only in the presence of equispaced samples.
This thesis proposes some ways to remove this restriction by constructing wavelets which automatically adapt to the design at hand.
Within the framework of second-generation wavelets, we use some instances of the lifting scheme to achieve this objective.
This automatic adaptation avoids the use of some pre-processing steps, whose aim is to come back to the evenly spaced design case, and which may deteriorate the quality of the final estimator.
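To make the thresholding idea concrete, here is a minimal Python sketch of classical wavelet denoising with the orthonormal Haar transform on an equispaced sample whose length is a power of two. It illustrates the standard setting that the thesis starts from, not the design-adapted or lifting-based constructions it proposes; the function names and the choice of soft thresholding are assumptions.

```python
import math


def haar_forward(signal):
    """Full orthonormal Haar decomposition (length must be a power of two).

    Returns the coarsest scaling coefficient and the detail coefficients,
    ordered from the finest level to the coarsest one.
    """
    approx = list(signal)
    details = []
    while len(approx) > 1:
        pairs = list(zip(approx[0::2], approx[1::2]))
        details.append([(a - b) / math.sqrt(2) for a, b in pairs])
        approx = [(a + b) / math.sqrt(2) for a, b in pairs]
    return approx[0], details


def haar_inverse(coarse, details):
    """Invert haar_forward, rebuilding the signal from coarse to fine."""
    approx = [coarse]
    for level in reversed(details):
        finer = []
        for s, d in zip(approx, level):
            finer.extend([(s + d) / math.sqrt(2), (s - d) / math.sqrt(2)])
        approx = finer
    return approx


def soft_threshold(coefficient, threshold):
    """Shrink a coefficient towards zero by `threshold`, keeping its sign."""
    return math.copysign(max(abs(coefficient) - threshold, 0.0), coefficient)


def denoise(signal, threshold):
    """Threshold the empirical detail coefficients and reconstruct."""
    coarse, details = haar_forward(signal)
    shrunk = [[soft_threshold(d, threshold) for d in level] for level in details]
    return haar_inverse(coarse, shrunk)
```

Small detail coefficients, which mostly carry noise, are set to zero, while the large coefficient created by a jump survives the shrinkage, which is why the reconstruction stays sharp near discontinuities.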
In the first part of this thesis, we treat the stochastic and the autoregressive univariate models by using a particular weighted wavelet basis, that is, a wavelet basis of the space weighted by the
empirical measure of the regressors.
Wavelet bases of this space are called "design-adapted".
The approximation properties of an estimator based on a simple, Haar-like, design-adapted basis are investigated, and thereafter this initial basis is improved with the lifting scheme. This leads to smooth design-adapted wavelets. The resulting estimator is tested on simulated and real data sets. Under some conditions on the distribution of the regressors, the rate of decay of the smooth design-adapted wavelet coefficients is the same as in the classical case, and
the risk, in the wavelet domain, of a linear wavelet estimator is also the same as in the classical setting.
We show which thresholding scheme to use in case of an autoregressive model in order to remove all the noise, and we establish an upper bound for the corresponding l2-risk in the wavelet domain.
The use of design-adapted wavelets can easily be extended to half-regular grids.
By "half-regular", we mean a grid which is the Cartesian product of two irregular one-dimensional grids. In this construction, tensor products of spaces are used. A denoising scheme is proposed, and the resulting estimator is compared with a locally weighted regression procedure.
In the last part, several wavelet transforms adapted to an irregular bivariate design are proposed. There, another instance of the lifting scheme has to be used. Coupled with a Bayesian denoising scheme, these wavelet transforms provide different wavelet estimators. A simulation study shows their good behaviour even in the presence of discontinuities in the regression function.
Doctorate in Sciences (specialization: Statistics), 29/01/03
DELAIGLE Aurore
Supervisor: Irène GIJBELS, UCL/STAT
Abstract
Our interest is to estimate a density from an i.i.d. sample that has been contaminated by a measurement error. This problem, usually referred to as a deconvolution problem, has applications in many different fields such as astronomy, public health, chemistry, microfluorimetry or electrophoresis. The contaminating density, or error density, is often assumed to be known. In this context, a so-called deconvolution kernel density estimator has been proposed in the literature. This estimator requires the choice of a smoothing parameter called the bandwidth, which plays a crucial role in the estimation process. Despite this fact, very few papers in the literature deal with the problem of how to choose the bandwidth of the deconvolution kernel density estimator in practice. Part one of this thesis aims to provide practical methods of bandwidth selection in this deconvolution problem.
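For reference, the deconvolution kernel density estimator mentioned above is usually written as follows in the literature (see e.g. Stefanski and Carroll, 1990). The notation is an assumption, since the abstract introduces none: Y_1, ..., Y_n denote the contaminated observations Y_j = X_j + U_j, h is the bandwidth, and \varphi_K and \varphi_U are the Fourier transform of the kernel K and the characteristic function of the known error density.

```latex
\hat f_X(x; h) = \frac{1}{nh} \sum_{j=1}^{n} K^{U}\!\left(\frac{x - Y_j}{h}\right),
\qquad
K^{U}(u) = \frac{1}{2\pi} \int e^{-itu}\, \frac{\varphi_K(t)}{\varphi_U(t/h)}\, dt .
```

The division by \varphi_U(t/h) is what undoes the convolution with the error density, and it also explains why the choice of h is delicate: a small bandwidth evaluates \varphi_U far in its tails, where it is close to zero.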
After having presented the cross-validation method of Stefanski and Carroll (1990), we propose several other methods of bandwidth selection that can be used in practice: a method based on exact calculations for the sinc kernel, a normal reference method, a plug-in procedure, a solve-the-equation bandwidth selector and a bootstrap method. We illustrate each method on some simulated examples, and prove the consistency of the plug-in and bootstrap procedures. Finally, we compare the performances of the various methods via simulated examples and apply the methods to some real data. From our simulation results it appears that the plug-in and bootstrap procedures compete and both outperform the other data-driven bandwidth selection procedures. As a by-product, we study the estimation of integrated squared density derivatives.
Another situation which arises quite often in real life examples is that the domain of variation of the uncontaminated data is not the whole real line: the possible values are between a finite minimum and/or a finite maximum. In the error-free case, we know that the kernel density estimator is not consistent at a finite endpoint of its support. In order to have a consistent estimator of the density of interest at an endpoint, one has to make some modifications to the kernel estimator that take the support into account. This property extends to the error case, i.e. the deconvolution kernel estimator is not a good estimator at a (finite) boundary point. Hence it is important to be able to estimate the endpoints of the support if these are unknown. This boundary estimation problem also has applications in stochastic frontier estimation.
In the error-free case, various methods have been proposed to estimate the endpoints of a density, but this problem has been studied little in the error case. Part two of this thesis studies the boundary estimation problem for contaminated data. We propose a deconvolving kernel estimator of an endpoint, inspired by methods that exist in the error-free case to estimate a discontinuity point. The idea of the method is to estimate the location of a boundary point by the maximiser of a certain diagnostic function. The methodology can also be used to estimate a discontinuity point. We prove the consistency of the method and give some ideas of how to apply it in practice. We illustrate the performance of the method on some real and simulated examples.
Doctorate in Sciences (specialization: Statistics), 20/02/03
TAJAR Abdelouahid
Supervisors: Michel DENUIT, UCL/STAT and Jean-Marie ROLIN, UCL/STAT
Abstract
This PhD thesis is concerned with measuring and modelling stochastic dependence between random variables, and the main focus is on discrete random variables. In our study on dependence, concepts such as concordance and copula are widely used.
We first propose a copula-type representation for random couples with Bernoulli margins. Some association measures for binary data are reexamined. It is stressed that a satisfactory dependence measure should only depend on the discrete copula, and not on the margins.
We propose a systematic study of the monotonicity property of Kendall's τ and Spearman's ρ
with respect to the concordance ordering of pairs of discrete as well as continuous random variables. We also prove that various relationships between Kendall's τ and Spearman's ρ mentioned in Nelsen (1999) remain valid for discrete variables. In particular, results of Capéraà and Genest (1993) are extended to the case of discrete random pairs. We also establish that some useful stochastic dependence properties used in the actuarial literature are conserved for discrete random variables. In particular some results in Dhaene and Goovaerts (1996) continue to hold for discrete random variables. Furthermore, we propose a copula for discrete random variables using a principle given in Denuit and Lambert (2002) which transforms an integer-valued random variable to a continuous one via convolution with
a uniform random variable on [0,1].
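The difficulty that discrete margins create for concordance measures can be seen on a very small example. The Python sketch below computes a sample version of Kendall's tau by counting concordant and discordant pairs and dividing by the total number of pairs (often called tau-a). The choice of this particular version and the function name are assumptions made for illustration; the abstract does not specify a sample formula.

```python
from itertools import combinations


def kendall_tau_a(xs, ys):
    """(concordant - discordant) / (number of pairs); tied pairs count as zero."""
    concordant = discordant = 0
    pairs = list(combinations(range(len(xs)), 2))
    for i, j in pairs:
        product = (xs[i] - xs[j]) * (ys[i] - ys[j])
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1
    return (concordant - discordant) / len(pairs)
```

For continuous data without ties, perfect positive dependence gives 1. For the binary sample `x = y = [0, 0, 1, 1]`, the two variables are identical and yet the function returns 2/3, because the tied pairs contribute nothing: the attainable range depends on the margins.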
Finally, we consider a log-linear model to construct a joint distribution with uniform discrete margins. This new distribution will preserve a local association of the joint distribution with arbitrary
margins. The construction is based on a copula-type representation of an n × m table of joint probabilities with fixed margins. The idea arises from the decomposition of the joint probability in the original
table into margins and association effects leading to the uniform representation. The technique uses an iterative procedure which makes it possible to preserve local association by means of a set of local odds ratios.
In the continuous case any joint distribution function H, with margins F and G, admits a unique copula representation C, which is the part representing the dependence structure of H. While Sklar's continuous copula C preserves any dependence structure in H, the discrete uniform representation is
shown to conserve only local dependence structures and properties.
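In symbols, and with the notation of the paragraph above, Sklar's representation for continuous margins F and G can be written as follows (the inverse functions F^{-1} and G^{-1} are notation added here for illustration):

```latex
H(x, y) = C\bigl(F(x), G(y)\bigr),
\qquad
C(u, v) = H\bigl(F^{-1}(u), G^{-1}(v)\bigr), \quad (u, v) \in [0,1]^2 .
```

When the margins are discrete, F and G take only countably many values, so the first identity determines C only on a subset of the unit square; this is one way to see why uniqueness is lost in the discrete case.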
Doctorate in Sciences (specialization: Statistics), 07/05/04
DE MACQ Isabelle
Supervisor: Léopold SIMAR, UCL/STAT
Abstract
Decision trees are one of the most widely used tools for "supervised classification" problems: provided a sample of correctly classified data is available, they make it possible to establish rules for explaining and/or predicting a categorical response (class membership) on the basis of observed values of predictive variables. Besides efficiency, this hierarchical method is endowed with many additional advantages. In particular decision trees are very flexible and can handle heterogeneous data types, they are invariant under monotone transformations of the data, they perform internal feature selection as an integral part of the procedure, and they generate solutions allowing qualitative
understanding of the generated prediction rules. All these properties, often essential in data mining applications, explain the major place occupied by tree methods in this field.
A plethora of tree algorithms have been developed, most of them being but variants of the original CART and C4.5 algorithms. A large part of academic research has focused on incremental modifications to current machine learning methods, and on the speed up of existing algorithms. The main contribution of this work has been to consider a theoretically sound decision tree classifier, initiated by Devroye, Györfi and Lugosi, and to evaluate its
expected performance and efficacy as compared to traditional tree methods by evolving it into a practical tool. It is presented in a larger framework including an overview of related approaches from different fields, and potential developments.
Problems related to the splitting method of ordinary trees include the instability of the solutions, the unnecessarily complex expression of the identified structure or the failure of the classifier when complex interrelations between predictors and class cannot be detected. Hyperrectangular Space Partitioning (HSP) trees, on the other hand, are based on a different cutting method, and despite their greedy construction, they have been proven to be consistent without any additional tricks. However, their practical implementation is a challenge due to the computational burden involved. We explore their specific features, strong and weak points, by means of a massive search working tool, while a two-stage approach has been developed for approximating HSP tree solutions, making it possible to tackle moderate dimensional problems from commonly used databases.
Doctorate in Sciences (specialization: Statistics), 19/11/04
BOUEZMARNI Taoufik
Supervisor: Jean-Marie ROLIN, UCL/STAT
Abstract
This thesis is devoted to the nonparametric estimation of the class of density functions with a known bounded support. We distinguish two cases of bounded support.
For the first case, the support is [0,1]. For such a class of density functions we consider two estimators: the beta histogram and the beta kernel estimator.
Different types of convergence are proved for these estimators: uniform weak consistency and convergence to infinity at the endpoints if the density is unbounded at these points. The rate of convergence of the mean integrated absolute error is also established. The quality of the estimation of the beta histogram is very sensitive to the choice of the bandwidth parameter. In this work we adapt two methods already used for bandwidth selection with the standard kernel estimator: likelihood cross-validation and least-squares cross-validation.
The two estimators are already used in regression estimation. It will be interesting to use them in other nonparametric estimations such as the quantile and the hazard function.
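A beta kernel density estimator on [0,1] can be sketched in a few lines of Python. The sketch assumes the form commonly attributed to Chen (1999), in which the kernel at the point x is the Beta(x/b + 1, (1 - x)/b + 1) density evaluated at the observations; the abstract does not give the exact definition used in the thesis, and the function names are hypothetical.

```python
import math


def beta_pdf(u, a, b):
    """Density of the Beta(a, b) distribution at u, for 0 < u < 1."""
    if u <= 0.0 or u >= 1.0:
        # Observations are assumed to lie strictly inside (0, 1) in this sketch.
        return 0.0
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(log_norm + (a - 1.0) * math.log(u) + (b - 1.0) * math.log(1.0 - u))


def beta_kernel_density(x, sample, bandwidth):
    """Assumed beta kernel estimate of the density at a point x in [0, 1].

    The shape of the kernel changes with x, so no probability mass is
    placed outside the support [0, 1].
    """
    a = x / bandwidth + 1.0
    b = (1.0 - x) / bandwidth + 1.0
    return sum(beta_pdf(obs, a, b) for obs in sample) / len(sample)
```

On a sample spread evenly over (0, 1), the estimate stays close to 1 both in the middle of the interval and at the endpoint x = 0, whereas a fixed symmetric kernel would lose roughly half of its mass at the boundary.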
For the second case, the support is [0,∞). For such a class, we have studied the consistency of the gamma histogram, the gamma kernel, the inverse Gaussian and the reciprocal inverse Gaussian estimators. We give the rate of convergence of the mean integrated absolute error of the gamma histogram and gamma kernel estimators. The uniform weak and the uniform strong convergence are also proved. When the density is unbounded at the origin, we prove, under some conditions on the density function f, the weak convergence of the gamma histogram to infinity at this point. All the results proved for these estimators are generalised to the generalised kernel estimators. Simulations showed that the gamma histogram is better than the first gamma kernel estimator. We applied the gamma histogram and gamma kernel estimators to income data. The estimation requires a choice of an optimal bandwidth. Here the bandwidth selection method used leads to an asymptotically optimal window in the sense of minimising the L1 distance.
The practical choice of the bandwidth parameter for the gamma histogram and gamma kernel estimators is an important problem.
Last updated: 21 January 2005
Wolfgang HÄRDLE, Humboldt-Universität, Berlin, Germany
Léopold SIMAR, UCL/STAT
Berwin Turlach generalizes the theory of U-statistics to the case where the function defining the U-statistic is multi-dimensional and depends on the number of observations. The thesis is well structured and very pleasant to read. The statistical results are always followed by their implementation, and illustrated by examples. The thesis is self-contained; all the models and theoretical tools used are introduced.
Conclusion: I am very enthusiastic about this thesis, whose subject is very interesting. I am impressed both by the statistical content and by the implementation of the proposed methods. It constitutes an excellent contribution and an excellent thesis, and I am sure it will have an important impact in its field.
, is decomposed
into
individual (marginal) submodels and the
relationships between the identification of the submodels and the
identification of the complete model are also analysed.
Modern developments extend the classical univariate recursions to a multivariate setting.
The first part of the PhD thesis gives results in this sense and applies the results to the calculation of the ruin probability of Insurance Companies buying excess of loss covers with reinstatements.
Some results about the multivariate stochastic order are also derived and used in the same context.
The second part of the PhD thesis introduces the Hofmann Distribution, which is seen to be a good candidate for the fitting of count data sets with a low frequency. Theoretical and practical properties are reviewed. A comparison is made with recent models developed in the literature.
The Hofmann Distribution is then extended to a bivariate setting by using the Mixed Bivariate Poisson Distribution or the Trivariate Reduction Method. When they are available, recursions for the bivariate aggregate claims distribution are derived. Finally, applications are given in the field of bonus-malus systems, which is an important subject in Belgium nowadays because Insurance Companies will be obliged to use new and different bonus-malus systems in the near future. It is shown how to construct a bonus-malus system that accounts for the hunger for bonus developed by drivers.
In FDA, a serious drawback must be considered when the observations are shifted, owing to
time lags or general differences in dynamics. The problem due to variations can hinder
even the simplest analysis of a sample of curves. Two or more functions may differ because of two
types of variations: phase variations (horizontally) due to time lags and amplitude
variations (vertically) due to intensity differences. Often both types of variation are mixed
and it may be hard to distinguish between phase variations and amplitude variations.
A preliminary step often consists in the registration, or alignment, of the curves or images by
suitable transformations often called "warping functions". Thus, the main point of the registration
problem is to remove phase variations so that we could eventually improve the analysis of individual
differences in order to better
compare the dynamics of the functions.
A way of treating individual differences is then to use a new scale, adjusting the scale
and shift distortions in each case.
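The simplest instance of registration, a pure time shift, can be sketched in Python as follows. This is only an illustration of removing phase variation under strong assumptions (curves sampled on a common regular grid and differing by an integer lag); it is not the non-parametric warping method developed in the thesis, and the function name is hypothetical.

```python
def best_shift(reference, curve, max_shift):
    """Integer lag that best aligns `curve` with `reference`.

    The lag is chosen to minimise the mean squared difference between
    reference[i] and curve[i + lag] over the overlapping part of the grid.
    """
    def cost(lag):
        overlap = [
            (reference[i], curve[i + lag])
            for i in range(len(reference))
            if 0 <= i + lag < len(curve)
        ]
        return sum((r - c) ** 2 for r, c in overlap) / len(overlap)

    return min(range(-max_shift, max_shift + 1), key=cost)
```

If `curve` is a copy of `reference` delayed by three grid points, the function returns 3, and reading `curve[i + 3]` in place of `curve[i]` puts the two curves on a common time scale before any amplitude comparison is made.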
Parametric and semi-parametric techniques have already been explored. Yet here, we focus on
non-parametric approaches in order to estimate more complex, possibly non-linear, warping transformations.
Among non-parametric methods already proposed, the generalization to higher dimensions may be difficult to
perform. As an alternative, we present a non-parametric method based on a local linear regression
technique which could be easily generalized to high dimensional data.
We will particularly tackle the hard problem of unidentifiability in the model combining
amplitude and phase variations. Moreover, in order to register functions having discontinuities,
we will lay out a modified smoothed local approach. The method and the problem of unidentifiability
will be illustrated by using simulated two-dimensional data. An important application in the field of
medical imaging will be studied to register multiple medical images of the brain acquired
from the same patient.
This extends and completes results of Yanagimoto and Okamoto (1969) and Tchen (1980). Analytic expressions are given for the most extreme values of Kendall's τ and Spearman's ρ associated with discrete uniform variables. Some other measures of concordance are also studied. A certain number
of examples are used to highlight the drawbacks of some concordance measures noticed for Bernoulli random variables and still present for more general discrete random variables. Corrections for some measures of monotone dependence for discrete random variables are proposed to obtain a margin-free range.