UCL - Institut de statistique (STAT)

UCL/STAT - PhD degrees

Date: 27.02.03 (dd.mm.yy)


Computer-aided additive modeling

PhD in Statistics, 09/09/1994

TURLACH Berwin, India

Supervisors:
Wolfgang HÄRDLE, Humboldt-Universität, Berlin, Germany
Léopold SIMAR, UCL/STAT


Abstract

The thesis addresses several interrelated topics that are currently of great interest in both statistical theory and computation. The well-known generalized linear model is reconsidered under semi-parametric or non-parametric assumptions, leading to single index models and generalized additive models. Berwin Turlach proposes the algorithms needed for estimating functionals in generalized additive models or parameters in the single index model. He studies the subject thoroughly, taking into account practical problems such as correlated explanatory variables. He suggests using techniques based on binning to speed up the computations in average derivative estimation. The algorithms are implemented in the XploRe environment.

Several statistical results in the thesis are new, for example the procedure for identifying the variables that should be included in a generalized additive model, and the misspecification test for the single index model. The results of Härdle and Tsybakov are generalized to the case of the generalized model and to the choice of interaction terms. He gives a detailed discussion of the assumption of independence between covariates and suggests new ideas.
Berwin Turlach generalizes the theory of U-statistics to the case where the function defining the U-statistic is multi-dimensional and depends on the number of observations. The thesis is well structured and very pleasant to read. The statistical results are always followed by their implementation and illustrated by examples. The thesis is self-contained: all the models and theoretical tools used are introduced.
Conclusion: I am very enthusiastic about this thesis, whose subject is very interesting. I am impressed both by the statistical content and by the implementation of the proposed methods. It is an excellent contribution and an excellent thesis, and I am sure it will have an important impact in its field.


Théorie du filtrage (Filtering theory).

PhD in Statistics, 13/03/1995

NAMORO Soiliou Daw, Togo

Supervisor: Jean-Marie ROLIN, UCL/STAT


Abstract

The aim of this work is the statistical analysis of the dynamic model used by control engineers (also called a "dynamic system"). The linear version of this model is the starting point for a more general analysis. The first chapter summarizes the standard linear theory, with emphasis on the Markovian structure of the model and the recursive nature of inference within it (the Kalman-Bucy filter). The second chapter addresses the Markovian representation of processes. Results in this area suggest models of the type considered here and therefore provide a theoretical justification for studying them. The optimality of Markovian representations is studied first in a linear framework (Hilbert space), then in a nonlinear framework (σ-algebraic analysis).

In the last chapter, the model is reconsidered in a general abstract form that retains only its structure of independence among the variables involved. This general form, called a "Bayesian dynamic system", involves an incidental parameter (the state parameter) and a structural parameter. The exact estimability of the structural parameter is studied. The linear model then serves to illustrate the results obtained in this general framework. The identification of the structural parameter is also examined in the particular case of the linear model, and this analysis establishes the link between the concepts of observability and identifiability. Also in the last chapter, the recursiveness of the minimal sufficient statistic (at each time point) for the pair "state parameter, structural parameter" is established in the general case. The Kalman-Bucy filter appears as a particular example of this recursion. Finally, the consequences of the existence of such a recursion are examined in connection with the problem of constructing exact finite-dimensional filters.
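
The recursive character of inference in the linear model can be illustrated with a minimal sketch. It assumes a scalar, discrete-time state-space model (the Kalman-Bucy filter named in the abstract is the continuous-time counterpart), and the function names and the parameters `a`, `q`, `h`, `r` are hypothetical, chosen only for illustration. The point is that the pair (mean, variance) is updated from its previous value and the new observation alone.

```python
def kalman_step(mean, var, y, a=1.0, q=0.0, h=1.0, r=1.0):
    """One predict/update step for the assumed scalar model
    x_t = a * x_{t-1} + w_t,  y_t = h * x_t + v_t,
    with Var(w_t) = q and Var(v_t) = r."""
    # Prediction: propagate the previous state estimate through the dynamics.
    mean_pred = a * mean
    var_pred = a * a * var + q
    # Update: correct the prediction with the new observation y.
    gain = var_pred * h / (h * h * var_pred + r)
    mean_new = mean_pred + gain * (y - h * mean_pred)
    var_new = (1.0 - gain * h) * var_pred
    return mean_new, var_new


def kalman_filter(observations, mean0=0.0, var0=1.0, **model):
    """Run the recursion over a sequence of observations and return
    the list of (mean, variance) pairs, one per time point."""
    mean, var = mean0, var0
    history = []
    for y in observations:
        mean, var = kalman_step(mean, var, y, **model)
        history.append((mean, var))
    return history
```

With the default parameters the state is constant, so the filter reduces to sequential Bayesian updating of a normal mean: each new observation changes the estimate without revisiting earlier data.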


Some problems on the bayesian analysis of discrete data.

PhD in Statistics, 23/06/1995

SCHEIHING Eliana, Chile

Supervisor: MOUCHART Michel, UCL/STAT


Abstract


In this work we investigate two problems involving a Bayesian analysis of discrete data. The first problem concerns the Bayesian admissibility of reductions by conditioning. We study the conditions under which admissibility by conditioning holds, and then evaluate the loss of information when a non-admissible conditioning is used to approximate the exact posterior distribution. We use the Fisher test, i.e. a test conditional on the two margins of a 2x2 contingency table, as an example where admissibility by conditioning is not generally satisfied, and we quantify the corresponding loss of information by means of a simulation study. The numerical results indicate that, for a specific range of parameters, the loss of information increases with the sample size and decreases with the precision of the prior distribution. Hence this is a small-sample approximation.

The second problem is situated in the context of discrete choice models. We develop Bayesian inference for a semi-parametric binary choice model stated in terms of a latent random variable $l$ such that $Y$, the binary choice random variable, satisfies:

$P(Y = 1 \mid Z, b, u) = P(l < Z'b \mid Z, b, u) = u(Z'b)$

where $Z \in \mathbb{R}^p$ is the vector of explanatory variables, and $u$ and $b$ are the model parameters. The link function $u$ is a priori distributed according to a Dirichlet process with parameters $(n_0, P_0)$, and $b$ is a priori distributed according to an arbitrary distribution $Q_0$ on $\mathbb{R}^p$.

We consider two methods to estimate the posterior distribution of $l$. Both methods use Gibbs sampling to simulate $l$. The first uses a simulation procedure for the Dirichlet process (Rolin (1992)), which makes it possible to work with the distribution of $l$ conditionally on $u$. The second uses the Pólya urn representation of the Dirichlet process to compute the distribution of $l$ marginally with respect to $u$. A numerical evaluation of both methods is presented, as well as an application to real data.
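
As a rough illustration of the Pólya urn representation mentioned above, the sketch below draws a sequence of values whose joint law is that of a sample from a Dirichlet process, without ever simulating the process itself. It is not the Gibbs sampler of the thesis: the names `polya_urn_sample`, `concentration` and `base_sampler` are hypothetical, standing in for the two parameters of the Dirichlet process prior, and the standard normal base distribution in the usage note is an arbitrary assumption.

```python
import random


def polya_urn_sample(n, concentration, base_sampler, rng=None):
    """Draw n values via the Polya urn scheme.

    The draw with index i (starting at 0) is a fresh value from the base
    distribution with probability concentration / (concentration + i);
    otherwise it repeats one of the i earlier draws, chosen uniformly.
    """
    rng = rng or random.Random()
    draws = []
    for i in range(n):
        if rng.random() < concentration / (concentration + i):
            draws.append(base_sampler(rng))   # new value from the base measure
        else:
            draws.append(rng.choice(draws))   # tie with a previous draw
    return draws
```

For example, `polya_urn_sample(20, 2.0, lambda rng: rng.gauss(0.0, 1.0), random.Random(0))` returns 20 values with ties; a small `concentration` produces few distinct values, a large one produces draws that behave almost like an independent sample from the base distribution.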


Data structures in computational statistics.

PhD in Statistics, 31/07/1995

KLINKE Sigbert, Germany

Supervisors:
Wolfgang HÄRDLE, Humboldt-Universität, Berlin, Germany
Léopold SIMAR, UCL/STAT


Abstract

"Data structures" describe how a statistical program needs to handle its data. Data here are not only data in the statistical sense but also graphical data objects (e.g. boxplots, histograms, etc.), together with the consequences their use has for the program.

A statistical program is composed of three elements: the user interface, the statistical methods and the statistical graphics. Today the user interface is a graphical user interface (GUI) provided by the operating system. In the first chapter I explain why statisticians need interactive programming environments and describe the tools that GUIs offer. The second chapter describes some statistical graphics, such as boxplots, histograms and scatterplots, and shows how linked plots can help in the analysis, e.g. of subgroups, especially in multivariate data analysis.

The third chapter describes Exploratory Projection Pursuit in detail. It is a technique for analyzing multivariate data that combines statistical methods and statistical graphics. Such combinations of mathematical and graphical methods will become more important in statistics in the future. I examine the speed of kernel-based indices and improve them by using binning techniques for the underlying kernel density estimation. The behaviour of bandwidth selection by the rule of thumb is examined, and a new method based on the minimization of the mean squared error is suggested. The possibility of multivariate projections and the treatment of discrete variables are also examined.
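
The idea of binning for kernel density estimation can be sketched as follows. This is a minimal illustration under assumptions not taken from the thesis: a Gaussian kernel, simple (nearest-bin) binning on an equally spaced grid, and hypothetical function names. After binning, the kernel only has to be evaluated once per distance between bins, so the cost depends on the number of bins rather than on the number of observations.

```python
import math


def gaussian_kernel(u):
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)


def kde_direct(data, grid, h):
    """Direct kernel density estimate: one kernel evaluation per
    (grid point, observation) pair."""
    n = len(data)
    return [sum(gaussian_kernel((g - x) / h) for x in data) / (n * h)
            for g in grid]


def kde_binned(data, lo, hi, n_bins, h):
    """Binned approximation: replace each observation by the center of
    its bin, then combine bin counts with precomputed kernel weights."""
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for x in data:
        if not lo <= x <= hi:
            raise ValueError("observation outside [lo, hi]")
        counts[min(int((x - lo) / width), n_bins - 1)] += 1
    centers = [lo + (k + 0.5) * width for k in range(n_bins)]
    # The kernel weight depends only on the distance between bin indices.
    weights = [gaussian_kernel(d * width / h) for d in range(n_bins)]
    n = len(data)
    density = []
    for j in range(n_bins):
        total = sum(counts[k] * weights[abs(j - k)]
                    for k in range(n_bins) if counts[k])
        density.append(total / (n * h))
    return centers, density
```

Comparing `kde_binned(data, lo, hi, n_bins, h)` with `kde_direct(data, centers, h)` on the same bin centers shows the approximation error shrinking as `n_bins` grows relative to the bandwidth `h`.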

Other, more complicated methods of data analysis are described in the fourth chapter: cluster analysis, teachware and (non-parametric) regression. Whereas cluster analysis is again an exploratory tool, teachware is used to teach statistics to students. Nevertheless, teachware often fails to fulfill its aims, and it needs a lot of interactivity. The third part of the chapter describes (non-parametric) regression methods, and I discuss how these methods should be implemented: as black-box commands or as procedures that everyone can edit and change.

The fifth chapter describes the data structures I propose for statistical software: for the graphical objects, for the data objects and for the linking. I take a brief look at what other statistical programs offer and find that they partially realize these structures. The next chapter shows the implementation of these data structures in XploRe 3.2 and XploRe 4.0. Not all aims could be fulfilled in XploRe 3.2, since data structures are the foundation of a statistical program and decisions about them have to be made very early. Thus some mistakes were made in the development of XploRe 3.2, e.g. the use of matrices as basic data elements and the construction of different window types for some statistical graphics. XploRe 4.0 has a multi-dimensional array as its basic data type, together with procedures that can handle these arrays.


Testing the link specification in binary choice models. A semiparametric approach.

PhD in Economics, 27/11/1995

DIAS PROENÇA Isabel Maria, Portugal

Supervisor: Wolfgang HÄRDLE, Humboldt-Universität, Berlin, Germany


Abstract

Single index models are increasingly used in econometrics and biometrics. Logit and probit models are special cases, corresponding to specific link functions. This thesis considers a specification test that detects nonparametric deviations of the link function, which amounts to testing a parametric hypothesis against a semiparametric alternative.

The thesis is organized in seven chapters:

  1. Parametric binary choice model. This chapter discusses different specifications of the latent variable and analyzes the key assumptions. Several parametric models are presented.
  2. Semiparametric binary choice model. This chapter presents the semiparametric binary choice model as a single index model. It analyzes the assumptions and shows that they are more general and flexible than those of parametric models.
  3. The HH test. This chapter is devoted to the specification test of Horowitz and Härdle, applied to binary choice models to detect any deviation from a parametric link function.
  4. Finite-sample properties of the HH test. Here the finite-sample properties of the HH test are analyzed through a very detailed simulation study. The study concludes that the test has a negative bias and depends clearly on the bandwidth used for the nonparametric kernel estimator; both affect the finite-sample power of the statistic and lead to poor performance of the test in binary choice models. This makes improvements to the HH test essential, a goal pursued in the following chapters.
  5. Bootstrapping the HH test. This chapter proposes a modification of the HH test based on a bootstrap technique. The bootstrap is used to obtain, for the test statistic, a more accurate approximation of its distribution (under the null hypothesis) than the one resulting from the classical limit theorem. A simulation study shows that the critical values obtained by bootstrap are better than the standard values. The bootstrap removes the negative bias of the test and hence improves its power for small samples.
  6. The modified HH test. This chapter is drawn from a paper written jointly with Christian Ritter. It proposes a modification of the HH test, the MHH test, which corrects the negative bias and the dependence on the bandwidth. The simulations show that the MHH test performs clearly better in small samples.
  7. Empirical applications. This chapter presents two empirical applications that use the techniques presented in the thesis.
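
The general mechanism behind bootstrap critical values can be sketched as below. This is a generic illustration only: it does not implement the HH statistic, and the function names, the fair-coin null model and the absolute-deviation statistic are all assumptions made for the example. Data are simulated repeatedly from the model under the null hypothesis, the statistic is recomputed each time, and an empirical quantile of the replicates replaces the critical value of the limiting distribution.

```python
import math
import random


def bootstrap_critical_value(statistic, simulate_null, n_boot, level, rng):
    """Empirical (1 - level) quantile of the statistic over n_boot
    samples simulated under the null hypothesis."""
    if not 0.0 < level < 1.0:
        raise ValueError("level must lie strictly between 0 and 1")
    replicates = sorted(statistic(simulate_null(rng)) for _ in range(n_boot))
    index = min(n_boot - 1, max(0, math.ceil((1.0 - level) * n_boot) - 1))
    return replicates[index]


# Toy null model (assumption): 50 tosses of a fair coin.
def simulate_fair_coin(rng, n=50):
    return [1 if rng.random() < 0.5 else 0 for _ in range(n)]


# Toy statistic (assumption): distance of the sample proportion from 1/2.
def abs_deviation(sample):
    return abs(sum(sample) / len(sample) - 0.5)
```

Calling `bootstrap_critical_value(abs_deviation, simulate_fair_coin, 500, 0.05, random.Random(0))` gives a 5% critical value for the toy statistic; the null hypothesis would be rejected when the observed statistic exceeds it.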


Semiparametric analysis of innovation behavior.

PhD in Economics, 16/04/1996

BERTSCHEK-ENTORF Irene, Germany

Supervisor: Wolfgang HÄRDLE, Humboldt-Universität, Berlin, Germany


Abstract

This thesis proposes a theoretical model of firms' innovative activity in which imports and foreign direct investment (FDI) have positive effects on the product and process innovation of domestic firms. These hypotheses are analyzed empirically using semiparametric and nonparametric methods, focusing mainly on binary choice models. In addition, the practical performance of semiparametric and nonparametric estimators is studied through simulations.

The thesis is organized as a collection of five articles:

  1. On nonparametric estimation of the Schumpeterian link between innovation and firm size: This paper, written jointly with H. Entorf, presents some nonparametric tools and a semiparametric method suited to analyzing the relationship between innovation and firm size. Data from the Belgian, French and German manufacturing industries are used.
  2. A theoretical framework of innovative behavior: The economic model assumes that the domestic market is characterized by monopolistic competition. Product and process innovations of domestic firms are positively influenced by imports and FDI through decreasing prices.
  3. Estimating binary choice models with free link function: A semiparametric average derivative estimator is applied to the analysis of product innovation. Simulations study the performance of the estimator when individual-level variables are combined with aggregate variables.
  4. Estimating binary choice models with partially linear index: This paper discusses the semiparametric model with a partially linear index and a probit-type link function. The model is estimated by a quasi maximum likelihood approach that allows a completely flexible effect of the firm size variable.
  5. GMM estimation of the panel probit model: Nonparametric estimation of the optimal instruments: This article, written jointly with M. Lechner, proposes an estimator for the binary choice model with panel data. The generalized method of moments (GMM) is combined with nonparametric estimation of the optimal instruments. A Monte Carlo study shows that this estimator has good small-sample properties. Product innovation is analyzed with several estimators.


On the maximum likelihood.

PhD in Sciences (Statistics), 26/09/1997

PATILEA Valentin, Romania

Supervisor: Jean-Marie ROLIN, UCL/STAT


Abstract

In the first part we consider dominated, generally infinite-dimensional, models which can be written as a convex set of densities. We analyze the asymptotic behavior of nonparametric maximum likelihood estimators (NPMLE) when the true probability governing the independent data does not necessarily belong to the model (not even asymptotically). Using recent results from empirical process theory, which we recall at the beginning of this first part, we (re)obtain convergence of NPMLE towards a (pseudo-) true density as well as their rates of convergence. Extensions to non convex models, dependent data, M-estimators,... are also discussed. Afterwards the general results are applied to several examples: decreasing densities, mixture models, increasing and decreasing failure rate distributions,... The first part ends with a chapter on the asymptotic normality of linear functionals in mixture models. Based on the rates of convergence of the NPMLE we extend existing results on the asymptotic normality to the case of misspecified models.

In the second part we develop likelihood-based estimation methods for structural econometric models (nonlinear rational expectations, option pricing, auction models, ...). Many such econometric models characterize observable variables as highly nonlinear functions of some latent variables. These functions are one-to-one, but they depend on the unknown distribution of the latent variables through the equilibrium of the game and/or the learning process. The numerical complexity of the equilibrium definition therefore creates substantial obstacles to the direct implementation of maximum likelihood inference. Motivated by the fact that the law of motion of the latent variables is often defined in a considerably simpler way, simulation-based strategies have been developed recently. Here we propose alternative estimation strategies based on learning about the latent variables, in order to perform approximate MLE directly inside the more tractable latent model. This leads us to build various indirect and recursive estimators which appear to be well suited both to the empirical implementation of nonlinear rational expectation models and to bounded rationality modeling.


Nonparametric Statistical Analysis of Productivity and Efficiency with the Free Disposal Hull

PhD in Sciences (Statistics), 5/06/1998

WEINER Christian, Germany

Supervisor: Aloïs KNEIP, UCL/STAT


Abstract

The thesis investigates the stochastic properties of the FDH estimator for Farrell efficiency scores. The problem is of econometric interest: Farrell efficiency scores measure the degree of productive efficiency for firms and administrations, and they provide a tool to describe structural properties of a production possibilities set. Yet, these scores depend on the unknown production possibilities set, so they must be estimated. The Free Disposal Hull (FDH) is a nonparametric estimator of the production possibilities set, and the FDH estimator for Farrell efficiency scores is calculated relative to the FDH.

More precisely, the FDH is the smallest free disposal set that covers all observations. It is a very flexible estimator, because free disposability is the only condition imposed. The FDH is always a subset of the true production possibilities set, thus FDH estimates are always conservative. If the observations are randomly drawn from a population distribution whose support is the production possibilities set, then FDH estimates provide a consistent approximation. The thesis treats the stochastic properties of the FDH estimator in order to quantify this approximation.

Production of goods and services is usually represented by multivariate vectors, which consist of all used inputs and all produced outputs. As a consequence productivity and efficiency analysis is a multivariate problem. However, one can represent the Farrell efficiency scores as maximum or minimum of the support of a univariate random variable, and it turns out that the FDH estimator for Farrell efficiency scores is the sample maximum or minimum of iid observations. Therefore the error term of this estimator converges to a univariate extreme value distribution.
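
The statement that the FDH estimator is a sample minimum can be made concrete with a small sketch. It assumes the input-oriented Farrell score, for which the FDH estimate at a point is the smallest proportional input contraction that is still matched by some observed unit producing at least as much of every output; the function name and the data layout (lists of input and output vectors) are hypothetical.

```python
def fdh_input_efficiency(x0, y0, inputs, outputs):
    """Input-oriented FDH efficiency estimate at the point (x0, y0).

    Among observed units whose outputs dominate y0 componentwise, take
    the minimum over units of the largest input ratio x_i[j] / x0[j].
    A value of 1.0 means the point is on the estimated frontier.
    """
    best = None
    for x, y in zip(inputs, outputs):
        if all(yi >= y0j for yi, y0j in zip(y, y0)):
            ratio = max(xi / x0j for xi, x0j in zip(x, x0))
            if best is None or ratio < best:
                best = ratio
    if best is None:
        raise ValueError("no observed unit produces at least y0")
    return best
```

For instance, with three units using inputs `[2, 2]`, `[4, 4]` and `[1, 3]` to produce one unit of a single output, the second unit gets a score of 0.5 because the first unit produces the same output with half of each input.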

The sparseness of high-dimensional data is a general problem in nonparametric statistics, often referred to as the "curse of dimensionality", which also appears in boundary estimation. As a result the estimator has a poor rate of convergence for high-dimensional data. This is the price to pay for nonparametric flexibility, and it limits the practical use of FDH estimators. Simulations and an application to scale efficiencies for the US banking industry illustrate this effect.

This thesis examines the stochastic properties of the FDH estimator of Farrell efficiency scores. The problem is of interest in econometrics: Farrell efficiency scores rate the degree of productive efficiency of firms and administrations. They can also be used to describe certain structural properties of a production set, that is, the set of all possible production processes. However, since Farrell efficiency scores depend on an unknown production set, they must be estimated. The Free Disposal Hull (FDH) is a nonparametric estimator of the production set, and the FDH estimator of Farrell efficiency scores is computed from the FDH.

More precisely, the FDH is the smallest set that is free disposal and envelops all observations. It is a very flexible estimator, because free disposability is the only condition imposed. The FDH set is always a subset of the true production set, so FDH estimators underestimate the true values. Moreover, the FDH set is a consistent approximation when the observations are drawn at random from a distribution whose support coincides with the production set.

The production of goods and services is usually represented by multivariate vectors, consisting of all inputs used and all outputs produced. Productivity and efficiency analysis is therefore a multivariate problem. Nevertheless, Farrell efficiency scores can be represented as the maximum or minimum of the support of a univariate random variable; moreover, the FDH estimator of the efficiency scores is shown to be the maximum or minimum of the observations in a sample. Consequently, the error of this estimator converges to an extreme value distribution.

A general problem arises in nonparametric statistics because, in practice, high-dimensional data are sparse; in the literature this phenomenon is usually called the "curse of dimensionality". The problem also appears in the estimation of extreme values. For this reason the rate of convergence is slow when the observations are multidimensional. This is the price to pay for nonparametric flexibility. Simulations and an application to returns to scale of US banks illustrate this effect.


Une nouvelle approche semimartingale en théorie du risque (A new semimartingale approach in risk theory)

PhD in Sciences (Statistics), 23/10/1998

ARS Pierre, Belgium

Supervisor: José PARIS, UCL/STAT


Abstract

This work arises from the need for actuaries to control the risk incurred by their company. We define the surplus of an insurance company as the sum of its equity and its technical provisions.

We assume that the surplus process is a general semimartingale; our main objective is then to show the value for risk theory of semimartingale theory and, to a lesser extent, of Malliavin calculus.

The main results concern ruin theory but also asset-liability management (ALM), which interacts with risk theory. We define an adjustment process that extends the notion of local adjustment function introduced by Asmussen and Nielsen (1995), and we obtain generalizations of the Lundberg bound.

An extension of the duality principles (developed by Asmussen and Petersen (1988), Dufresne and Gerber (1989), …) allows us to derive approximate methods for determining the ruin probability in the model with investment of the surplus in a risky asset introduced by Paulsen (1993).

Malliavin calculus leads to new results for the problem of managing interest rate risk when the rate does not follow a diffusion process. Natural extensions of stochastic duration are then developed.
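
For orientation, the classical Lundberg bound that the thesis generalizes can be computed numerically in the simplest setting. The sketch below assumes the classical compound Poisson surplus model, not the semimartingale framework of the thesis: the adjustment coefficient R is the positive root of claim_rate * (M(r) - 1) = premium_rate * r, where M is the moment generating function of the claim sizes, and the ruin probability with initial surplus u is then bounded by exp(-R u). All names are hypothetical, and the caller must supply a bracket around the root.

```python
import math


def adjustment_coefficient(claim_rate, premium_rate, claim_mgf, lo, hi, tol=1e-12):
    """Positive root of the Lundberg equation, found by bisection.

    The bracket must satisfy lundberg(lo) < 0 < lundberg(hi), which holds
    for small lo > 0 when premiums exceed expected claims per unit time.
    """
    def lundberg(r):
        return claim_rate * (claim_mgf(r) - 1.0) - premium_rate * r

    if not (lundberg(lo) < 0.0 < lundberg(hi)):
        raise ValueError("bracket must satisfy lundberg(lo) < 0 < lundberg(hi)")
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if lundberg(mid) < 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def lundberg_bound(initial_surplus, adjustment):
    """Classical upper bound exp(-R u) on the ruin probability."""
    return math.exp(-adjustment * initial_surplus)
```

With exponential claims of mean 1 (moment generating function `lambda r: 1.0 / (1.0 - r)`), claim rate 1 and premium rate 1.5, the root has the closed form 1 - 1/1.5 = 1/3, which provides a check on the numerical result.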


Bayesian least squares approximations in finite populations

PhD in Sciences (Statistics), 02/04/1999

COCCHI Daniela, Italy

Supervisor: Michel MOUCHART, UCL/STAT


Abstract

This thesis proposes a modern framework for modelling survey data. It deals with situations where individual data, i.e. data observed on distinct individuals, must be analysed. The set of individuals is considered finite, as is actually the case in real-world situations.

In the work we outline suitable approximations to a carefully designed hierarchical model for finite populations. The argument is divided into two parts: the development of a complete framework for inference and of the approximations for such a framework.

A particular issue in this context is the emphasis on the role of the probabilistic models involved: the structural model and the sampling model.

We propose a hierarchical linear model, which has a type II Anova model as a special case. The modelling considers the joint distribution of all the variables involved and introduces, in a stepwise manner, a sequence of hypotheses aimed at obtaining admissible reductions of the underlying model and at finding operational simplifications. Within this context, special emphasis is given to the role of conditional modelling. The sampling structure is represented by a selection matrix which enters in all phases of the solution. Since the exogenous variable is categorical in type II Anova models, some important simplifications are possible in the first two moments of the joint distribution, which is the basis of all computations. Predictive inference for the proposed model is developed by means of Bayesian least squares approximations. When looking for approximations, two sources of arbitrariness appear: the choice of the coordinates, which depends on the object of inference, and the choice of the statistic to be conditioned on. Regarding the first issue, a careful choice of a set of approximating functions is first discussed. The solution is found conditionally on a statistic that makes it possible to take conjectures on non-normality into account. As a special case, the solution obtained under the normal hypothesis is found when computations are performed conditionally on a simpler statistic.

The solutions based on least squares approximations rely on a number of linear algebra manipulations, some of which are collected in a series of appendices. Such developments of linear algebra are not a specific aim of our work, but rather a consequence of the assumed framework. The work contains a discussion of the comparison with other results in the literature performed by means of analytical comparisons and with the help of a simulation study.


Latent Structural Models: Specification and Identification Problems

PhD in Sciences (Statistics), 25/02/2000

SAN MARTÍN Ernesto, Chile

Supervisor: Michel MOUCHART, UCL/STAT


Abstract

This Ph.D. thesis is concerned with modelling problems in social and behavioural sciences. Its main aim is to propose a modelling strategy that pays particular attention to both the contextual and the statistical meaning of the hypotheses introduced in a statistical model.

The motivation is the following: in the literature on Structural Equations with Latent Variables, models are in many instances presented at once with all hypotheses grouped together; hypotheses are often redundant or assumed implicitly. It may therefore be concluded that the contextual meaning of each hypothesis has not been carefully thought through, and that hypotheses are motivated more by the justification of a numerical or inferential procedure than by the attempt to model a real context. The thesis makes a contribution at the level of model building, and its underlying message is to provoke a cross-fertilization between a statistician and, for instance, a sociologist who can evaluate whether a hypothesis is relevant or not in contextual terms. The modelling strategy proposed in this thesis may be described as structural, in the sense of being guided by a contextual theory.

The modelling strategy developed in this thesis essentially consists in introducing the hypotheses progressively. The motivation is twofold: firstly, decomposing a hypothesis into simple elementary pieces makes the contextual interpretation of each piece easier and, secondly, the meaning of a given hypothesis is conditional on the hypotheses previously introduced. Consequently, the interpretation of a statistical model critically depends on the order in which the hypotheses have been introduced and, therefore, on the "logic" of its construction.

Identification problems are typically involved in most structural models and, in general, the interpretation of the parameters is crucially conditioned by the identifying restrictions. For this class of problems, the contribution of this thesis is twofold. Firstly, statistical models are decomposed into a marginal and a conditional submodel, and the relationships between the identification of the submodels and the identification of the complete model are analysed. Next, the statistical model relative to a sample of size, say, $n$ is decomposed into $n$ individual (marginal) submodels, and the relationships between the identification of the submodels and the identification of the complete model are also analysed.

The thesis is divided into two parts: the first one deals with general formulations concerning problems of specification and of identification, whereas the second one applies the general results to two particular cases: the class of Item Response Models and the class of LISREL type models.


Recursions for actuaries and applications in the field of reinsurance and bonus-malus systems

PhD in Sciences (Statistics), 29/09/2000

WALHIN Jean-François

Supervisor: José PARIS, UCL/STAT


Abstract

Since the 1980s, recursive formulae have been used in actuarial science, essentially to compute the probability function of aggregate claims distributions easily, i.e. without using the brute-force convolution formula.
Modern developments extend the classical univariate recursions to a multivariate setting.
The first part of the PhD thesis gives results in this direction and applies them to the calculation of the ruin probability of insurance companies buying excess of loss covers with reinstatements.
Some results about the multivariate stochastic order are also derived and used in the same context.
The second part of the PhD thesis introduces the Hofmann distribution, which is seen to be a good candidate for fitting count data sets with a low frequency. Theoretical and practical properties are reviewed. A comparison is made with recent models developed in the literature.
The Hofmann distribution is then extended to a bivariate setting by using the mixed bivariate Poisson distribution or the trivariate reduction method. When they are available, recursions for the bivariate aggregate claims distribution are derived. Finally, applications are given in the field of bonus-malus systems, an important subject in Belgium at present because insurance companies will be obliged to use new and different bonus-malus systems in the near future. It is shown how to construct a bonus-malus system that takes into account the hunger for bonus developed by drivers.
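
The classical univariate recursion referred to at the start of this abstract can be sketched as follows. The abstract does not name a specific formula; the best-known recursion of this kind is Panjer's (1981), shown here for the special case of a Poisson number of claims with claim sizes on the non-negative integers. The function name and argument layout are hypothetical.

```python
import math


def panjer_poisson(lam, severity, s_max):
    """Probability function of aggregate claims S = X_1 + ... + X_N,
    N ~ Poisson(lam), via Panjer's recursion.

    severity[j] is P(X = j) for j = 0, 1, 2, ...
    Returns [P(S = 0), ..., P(S = s_max)].
    """
    # Pad the severity distribution with zeros up to s_max.
    f = list(severity) + [0.0] * max(0, s_max + 1 - len(severity))
    # Starting value: probability that all claims (if any) are zero.
    g = [math.exp(-lam * (1.0 - f[0]))]
    for s in range(1, s_max + 1):
        total = sum(j * f[j] * g[s - j] for j in range(1, s + 1))
        g.append(lam / s * total)
    return g
```

Each probability is obtained from the previously computed ones, so no convolution powers of the claim size distribution are needed. When every claim equals 1, the aggregate claim amount is simply the Poisson claim count, which gives an easy check: `panjer_poisson(2.0, [0.0, 1.0], 3)` reproduces the Poisson(2) probabilities.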


Nonparametric bayesian analysis for special patterns of incompleteness.

Doctorat en Sciences (orientation : Statistique), 15/12/2000

BECK Benoît

Promoteur: Jean-Marie ROLIN, UCL/STAT


Abstract

Problems involving special patterns of incompleteness are presented. These problems are studied with the help of two well-known Bayesian nonparametric techniques based on Pólya trees and Lévy processes. The manuscript consists mainly of two parts, linked respectively to these two techniques. The first part deals with the set-censoring problem in a general measurable state space. This problem encompasses the interval censoring problem, the doubly censored data problem, the current status data problem, etc. Bayesian estimators are given, as well as an exact method of simulation for the posterior distribution. These estimators are explicit and rely neither on estimating equations, as in the classical solution of Turnbull (JRSSB 1976), nor on MCMC methods, as in the solution of Doss (Ann. Stat. 1994). As a particular case, the solution for the left-censored survival problem is deduced, and a nonparametric method to estimate high quantiles based on empirical Bayesian considerations is proposed. The second part is devoted to event history (survival) data, as it handles only the real line. A multiple risk semi-parametric model with time-dependent covariates, for which both truncation and censoring mechanisms are allowed, is investigated. This model combines Aalen's (additive hazards) model (Ann. Stat. 1978) and Cox's (proportional hazards) model (JRSSB 1972). Although the posterior distribution associated with the latter model is given under general neutral-to-the-right process priors, the result is then specialized to Dirichlet priors in order to exhibit an efficient method of simulation.


Automatic detection of change points in nonparametric regression.

Doctorat en Sciences (orientation : Statistique), 07/12/2001

GODERNIAUX Anne-Cécile

Promoteur: Irène GIJBELS, UCL/STAT


Abstract

We consider the problem of change-point detection in nonparametric regression, i.e. nonparametric estimation of a regression curve with jumps or change points. The issue is that any nonparametric estimation method involves the choice of parameters, called smoothing parameters, and that the performance of the estimation procedures depends heavily on the choice of these parameters. Hence it is very important to address the issue of how to choose these parameters in practice. The main objective of this thesis is to propose nonparametric methods that are automatic in the sense that all smoothing parameters are chosen from the data.

We first focus on data-driven estimation of the locations of the jump discontinuities. More precisely, the objective is to come up with an estimation procedure with a data-driven choice of the bandwidth parameters and a built-in estimation of the number of discontinuity points, which performs well in practice. As a basis, we use the two-step estimation method proposed by Gijbels, Hall and Kneip (1999), for which it has been shown that the estimator of the location of a jump discontinuity achieves the optimal rate n^{-1}, where n is the sample size. This two-step estimation method involves the choice of two smoothing parameters: the first step uses the first derivative of a Nadaraya-Watson estimator as a diagnostic function, and the second (least-squares) step requires the determination of a small interval around the preliminary estimator of the jump resulting from the first step. We propose a bootstrap algorithm to select these parameters in practice. With this additional bootstrap procedure implemented, we obtain a fully data-driven two-step procedure for estimating a jump discontinuity in an unknown regression function.
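
The two-step idea described above can be sketched as follows. This is only an illustration under stated assumptions: the Gaussian kernel, the evaluation grid, the piecewise-constant fit in the second step and all function names are choices made for this example, and the two smoothing parameters (bandwidth `h` and interval half-width `delta`) are supplied by hand rather than selected by the bootstrap algorithm of the thesis.

```python
import math

def nw_derivative(x0, xs, ys, h):
    """First derivative at x0 of the Nadaraya-Watson estimator (Gaussian kernel)."""
    s0 = s1 = d0 = d1 = 0.0
    for x, y in zip(xs, ys):
        u = (x0 - x) / h
        k = math.exp(-0.5 * u * u)
        dk = -u / h * k  # derivative of the kernel weight with respect to x0
        s0 += k
        s1 += k * y
        d0 += dk
        d1 += dk * y
    # Quotient rule applied to m(x0) = s1 / s0.
    return (d1 * s0 - s1 * d0) / (s0 * s0)

def locate_jump(xs, ys, h, delta, grid_size=200):
    """Two-step jump location: diagnostic step, then local least squares."""
    lo, hi = min(xs), max(xs)
    grid = [lo + (hi - lo) * i / (grid_size - 1) for i in range(grid_size)]
    # Step 1: the preliminary estimate maximises the absolute derivative.
    t0 = max(grid, key=lambda t: abs(nw_derivative(t, xs, ys, h)))
    # Step 2: least-squares fit of a one-break step function to the
    # observations lying within delta of the preliminary estimate.
    local = sorted((x, y) for x, y in zip(xs, ys) if abs(x - t0) <= delta)
    best_rss, best_t = float("inf"), t0
    for i in range(1, len(local)):
        left = [y for _, y in local[:i]]
        right = [y for _, y in local[i:]]
        ml, mr = sum(left) / len(left), sum(right) / len(right)
        rss = sum((y - ml) ** 2 for y in left) + sum((y - mr) ** 2 for y in right)
        if rss < best_rss:
            best_rss = rss
            best_t = 0.5 * (local[i - 1][0] + local[i][0])
    return best_t
```

The second step returns the midpoint between two neighbouring design points, so its resolution is limited only by the spacing of the data and not by the bandwidth used in the first step.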

We also generalize the fully data-driven procedure to the estimation of jump discontinuities in a derivative curve. The method includes a data-driven way of determining the number of discontinuities in a derivative curve.

Further, we deal with the problem of testing whether or not there is an abrupt change in the regression function itself or in its first derivative at certain (prespecified or not) locations. We discuss a bootstrap procedure for this testing problem, which does not rely on asymptotic laws. This is in contrast with the testing procedures available in the literature, which rely on asymptotic distributions of the estimators involved. The bootstrap testing procedures presented here use the data-driven two-step estimation methods developed to locate jump discontinuities in a regression function or in its derivatives. As a consequence, the bootstrap testing procedures are also fully data-driven.

Finally, in the bivariate setup, we consider the problem of jump detection in a regression surface. We develop a fully data-driven two-step estimation procedure to locate the jump curve, based on a similar idea as in the univariate setup.

We evaluate the performance of all proposed procedures via an extensive simulation study, which shows good performance. The methods are also illustrated on some real data examples.


The problem of registration in functional data analysis: a local regression approach.

Doctorat en Sciences (orientation : Statistique), 14/12/2001

NICOL Florence

Promoteur: Alois KNEIP, Universität Mainz


Abstract

This thesis investigates the problem of registration in Functional Data Analysis (FDA). FDA differs from standard statistical approaches in the nature of the observations. Rather than individual points or vectors, the data x_i(t) are functions observed for each individual over some continuum of the argument t, often called time. Examples are known in many fields of applied research, among others in biology and biomedicine with longitudinal growth studies, in medicine with psychophysiological studies of EEG curves and, for higher dimensional data, in medical imaging with the analysis of brain images.
In FDA, a serious drawback must be considered when the observations are shifted, owing to time lags or general differences in dynamics. Such variations can hinder even the simplest analysis of a sample of curves. Two or more functions may differ because of two types of variation: phase variations (horizontal) due to time lags and amplitude variations (vertical) due to intensity differences. Often both types of variation are mixed, and it may be hard to distinguish between phase variations and amplitude variations.
A preliminary step often consists in the registration, or alignment, of the curves or images by suitable transformations, often called "warping functions". Thus, the main point of the registration problem is to remove phase variations so that we can improve the analysis of individual differences and better compare the dynamics of the functions. A way of treating individual differences is then to use a new scale, adjusting the scale and shift distortions in each case.
Parametric and semi-parametric techniques have already been explored. Here, however, we focus on non-parametric approaches in order to estimate more complex, possibly non-linear, warping transformations. For the non-parametric methods already proposed, generalization to higher dimensions may be difficult. As an alternative, we present a non-parametric method based on a local linear regression technique, which can easily be generalized to high dimensional data.
We will particularly tackle the hard problem of unidentifiability in the model combining amplitude and phase variations. Moreover, in order to register functions having discontinuities, we will lay out a modified smoothed local approach. The method and the problem of unidentifiability will be illustrated by using simulated two-dimensional data. An important application in the field of medical imaging will be studied to register multiple medical images of the brain acquired from the same patient.


Semiparametric Analysis of Single Index Poisson Regression Models.

Doctorat en Sciences (orientation : Statistique), 28/05/02

CLIMOV Daniela

Promoteur: Léopold SIMAR, UCL/STAT


Abstract

The ultimate aim of the research work presented in this thesis is to show that various semiparametric M-estimation methods provide a very useful means of estimating a regression model for count data, namely Poisson regression. It is also intended to show that semiparametric methods are valuable in testing problems. The main contribution of this thesis in this respect is threefold.

First, we propose a procedure that is robust with respect to the numerical instability inherent in the application of M-estimation methods to real data. In our Poisson regression setting, the objective function to be optimized is the pseudo likelihood function, and the resulting estimator of the direction vector is called the Pseudo Maximum Likelihood (PML) estimator. We investigate, by simulation, the practical validity of the asymptotic behavior of the PML estimator and of the associated regression function estimator. In particular, it appears that the asymptotic results should not be used unless a huge number of observations is available. We propose a bootstrap procedure for approximating the variance of the direction estimator and a variant of the bagging method introduced by Breiman (1996), in order to numerically stabilize the PML estimation procedure. Our method gives reasonable results even for moderate-sized samples and can therefore be used for statistical inference in practical situations.

Second, we derive two alternative M-estimation methods, based on risk estimation. For estimating the risk associated with the weighted average squared error, we propose two data-driven selectors: weighted least-squares (WLS) and double smoothing (DS). The first criterion is a weighted least-squares criterion plus a term which prevents undersmoothing in small samples, whereas the second method makes use of a double smoothing idea, as in Wand and Gutierrez (1997). Simulations are used to investigate the behavior of the above data-driven estimation methods in the single index Poisson model. In small samples, our weighted least-squares and double smoothing methods outperform both the pseudo maximum likelihood method and the weighted least-squares cross-validation method of Härdle, Hall and Ichimura (1993).

Finally, we provide a procedure for testing the validity of the Poisson assumption. We propose a test statistic for overdispersion and derive its asymptotic distribution under the null hypothesis of a Poisson model. The distributional approximation is assessed by simulations in several regression scenarios. The results indicate that the asymptotic normal approximation is not satisfactory unless the sample size is very large. Therefore, we propose a bootstrap approach for conducting the overdispersion test in small samples.

The bootstrap procedures for the estimation of the direction vector variance and for the overdispersion test are illustrated on a real data sample.


Analysing expenses linked to hospital stays: a frontier approach.

Doctorat en Sciences (orientation : Statistique), 12/07/02

BEGUIN Claire

Promoteur: Léopold SIMAR, UCL/STAT


Abstract

In order to control health care costs, the Belgian government introduced in 1995 a regulation of the hospital payment system that takes into account the pathologies, as measured by the Diagnosis Related Groups (DRG). In this context, the mean is usually estimated by a trimmed mean, i.e. a mean computed within bounds based on the quartiles and the interquartile range, such as Q1 - k*(Q3-Q1) and Q3 + k*(Q3-Q1). This thesis proposes to measure the variability of these bounds, in particular the upper bound, and to take into account the characteristics of the patients.
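
A minimal sketch of a trimmed mean with bounds of the form Q1 - k*(Q3-Q1) and Q3 + k*(Q3-Q1) is given below. The value k = 1.5, the quantile convention used by Python's `statistics.quantiles` and the function names are assumptions made for this example; the abstract does not state which value of k or which quantile definition is used in the Belgian system.

```python
import statistics

def trimming_bounds(values, k=1.5):
    """Lower and upper bounds Q1 - k*IQR and Q3 + k*IQR (k is an assumed default)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def trimmed_mean(values, k=1.5):
    """Mean of the observations lying within the trimming bounds."""
    low, high = trimming_bounds(values, k)
    kept = [v for v in values if low <= v <= high]
    return statistics.mean(kept)

# Usage with made-up expenses: the single very large stay is excluded.
expenses = [1, 2, 3, 4, 5, 6, 7, 100]
print(trimmed_mean(expenses))
```

Because the bounds are themselves computed from sample quartiles, they vary from sample to sample, which is the variability the thesis sets out to measure.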

First, the distribution of the total expenses related to a pathology is estimated by a survival curve, and the confidence interval of this curve is defined using a bootstrap technique. We look for the corresponding value of the bounds described above on the survival curve. When analysing myocardial infarction, the variability of the bounds, measured by the length of the confidence interval, may result in a difference of 3.8% in the estimation of the mean.

We also propose the use of frontier models in order to rank hospital stays according to their expenses, taking into account the severity level of the patient. In this case, the hospital stays with the lowest total expenses for a given level of severity characterise the frontier of possibilities and are thus considered efficient. We work with deterministic models, both parametric and nonparametric. In a univariate approach, we try to find the minimal achievable value of the length of stay (LOS) or the minimal value of the total expenses, and in a multivariate approach, we try to find simultaneously the minimal value of both LOS and medical fees, for a given level of severity of the patients. The efficiencies obtained by the parametric estimator are lower. The nonparametric model, more flexible than the parametric model with its restrictive hypothesis, envelops the data more closely and should be preferred for this type of analysis.

However, deterministic models are very sensitive to extreme stays, so that some efficient stays could in fact be "too" efficient and considered as outliers. We try to highlight these stays using a simplified version of a method proposed by Wilson (1993) (B-W method). The too efficient stays are characterised by lower resource variables and a higher severity level. As expected, after the exclusion of the too efficient stays, the inefficient stays can be highlighted. The stays to be excluded from the hospital financing could be searched for among these inefficient stays. The inefficient stays present higher values of the resource variables for a lower severity level.

An alternative method for detecting the too efficient stays has been proposed by Simar (2001). It is based on a robust estimator of the frontier. The method is tested on the nonparametric model in the univariate approach minimizing the total expenses, and the results are compared to the previous method. The too efficient stays highlighted by the order-m frontier have a higher impact on the efficiencies of the other stays. Moreover, the order-m frontier proposes an estimation of a plausible frontier for a given level of severity. In some cases, when there are very few stays for a given level of severity, the order-m frontier fails to highlight the "too efficient" stays. For these cases, the B-W method offers an interesting alternative. So, a good way to proceed would be to use first the order-m frontier and second the B-W method when there is no stay with at least the same characteristics.


Nonparametric stochastic regression using design-adapted wavelets.

Doctorat en Sciences (orientation : Statistique), 20/12/02

DELOUILLE Véronique

Promoteur: Rainer von SACHS, UCL/STAT


Abstract

The recent development of wavelet transforms based on multiresolution analysis suggests new techniques for nonparametric function estimation. Indeed, wavelet procedures achieve denoising of a curve and adaptivity through thresholding of the empirical wavelet coefficients. The localization both in space and frequency allows the wavelet estimators to perform well in regions of high regularity, without deteriorating in the neighborhood of discontinuities. However, standard wavelet estimators are optimal only in the presence of equispaced samples.

This thesis proposes some ways to remove this restriction by constructing wavelets which automatically adapt to the design at hand. Within the framework of second-generation wavelets, we use some instances of the lifting scheme to achieve this objective. This automatic adaptation avoids the use of some pre-processing steps, whose aim is to come back to the evenly spaced design case, and which may deteriorate the quality of the final estimator.

In the first part of this thesis, we treat the stochastic and the autoregressive univariate models by using a particular weighted wavelet basis, that is, a wavelet basis of the space weighted by the empirical measure of the regressors. Wavelet bases of this space are called "design-adapted". The approximation properties of an estimator based on a simple, Haar-like, design-adapted basis are investigated, and thereafter this initial basis is improved with the lifting scheme. This leads to smooth design-adapted wavelets. The resulting estimator is tested on simulated and real data sets. Under some conditions on the distribution of the regressors, the rate of decay of the smooth design-adapted wavelet coefficients is the same as in the classical case, and the risk, in the wavelet domain, of a linear wavelet estimator is also the same as in the classical setting. We show which thresholding scheme to use in case of an autoregressive model in order to remove all the noise, and we establish an upper bound for the corresponding l2-risk in the wavelet domain.

The use of design-adapted wavelets can easily be extended to half-regular grids. By "half-regular", we mean a grid which is the Cartesian product of two irregular one-dimensional grids. In this construction, tensor products of spaces are used. A denoising scheme is proposed, and the resulting estimator is compared with a locally weighted regression procedure. In the last part, several wavelet transforms adapted to an irregular bivariate design are proposed. There, another instance of the lifting scheme has to be used. Coupled with a Bayesian denoising scheme, these wavelet transforms provide different wavelet estimators. A simulation study shows their good behaviour even in the presence of discontinuities in the regression function.


Kernel estimation in deconvolution problems.

Doctorat en Sciences (orientation : Statistique), 29/01/03

DELAIGLE Aurore

Promoteur: Irène GIJBELS, UCL/STAT


Abstract

Our interest is to estimate a density from an i.i.d. sample that has been contaminated by a measurement error. This problem, usually referred to as a deconvolution problem, has applications in many different fields such as astronomy, public health, chemistry, microfluorimetry or electrophoresis. The contaminating density, or error density, is often assumed to be known. In this context, a so-called deconvolution kernel density estimator has been proposed in the literature. This estimator requires the choice of a smoothing parameter called the bandwidth, which plays a crucial role in the estimation process. Despite this fact, very few papers in the literature deal with the problem of how to choose the bandwidth of the deconvolution kernel density estimator in practice. Part one of this thesis aims to provide practical methods of bandwidth selection in this deconvolution problem.
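
For reference, the deconvolution kernel density estimator mentioned above is usually written in the following form, which goes back to the literature cited in this abstract (Stefanski and Carroll, 1990). The notation is an assumption of this note and does not come from the abstract: Y_1, ..., Y_n are the contaminated observations, h is the bandwidth, and \varphi_K and \varphi_Z are the characteristic functions of the kernel K and of the error density.

```latex
\hat f_X(x;h) = \frac{1}{nh}\sum_{j=1}^{n} K^{Z}\!\left(\frac{x - Y_j}{h};h\right),
\qquad
K^{Z}(u;h) = \frac{1}{2\pi}\int e^{-itu}\,\frac{\varphi_K(t)}{\varphi_Z(t/h)}\,dt .
```

The bandwidth h enters both the usual scaling and the deconvoluting kernel K^Z itself, which is one reason why bandwidth selection is more delicate here than in the error-free case.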

After presenting the cross-validation method of Stefanski and Carroll (1990), we propose several other methods of bandwidth selection that can be used in practice: a method based on exact calculations for the sinc kernel, a normal reference method, a plug-in procedure, a solve-the-equation bandwidth selector and a bootstrap method. We illustrate each method on some simulated examples, and prove the consistency of the plug-in and bootstrap procedures. Finally, we compare the performances of the various methods via simulated examples and apply the methods to some real data. Our simulation results suggest that the plug-in and bootstrap procedures are comparable and both outperform the other data-driven bandwidth selection procedures. As a by-product, we study the estimation of integrated squared density derivatives.

Another situation which arises quite often in real life examples is that the domain of variation of the uncontaminated data is not the whole real line: the possible values are between a finite minimum and/or a finite maximum. In the error-free case, we know that the kernel density estimator is not consistent at a finite endpoint of its support. In order to have a consistent estimator of the density of interest at an endpoint, one has to make some modifications to the kernel estimator that take the support into account. This property extends to the error case, i.e. the deconvolution kernel estimator is not a good estimator at a (finite) boundary point. Hence it is important to be able to estimate the endpoints of the support if these are unknown. This boundary estimation problem also has applications in stochastic frontier estimation.

In the error-free case, various methods have been proposed to estimate the endpoints of a density, but this problem has been studied little in the error case. Part two of this thesis studies the boundary estimation problem for contaminated data. We propose a deconvolving kernel estimator of an endpoint, inspired by methods that exist in the error-free case to estimate a discontinuity point. The idea of the method is to estimate the location of a boundary point by the maximiser of a certain diagnostic function. The methodology can also be used to estimate a discontinuity point. We prove the consistency of the method and give some ideas of how to apply it in practice. We illustrate the performance of the method on some real and simulated examples.


Measuring and modelling dependence.

Doctorat en Sciences (orientation : Statistique), 20/02/03

TAJAR Abdelouahid

Promoteurs: Michel DENUIT, UCL/STAT and Jean-Marie ROLIN, UCL/STAT


Abstract

This PhD thesis is concerned with measuring and modelling stochastic dependence between random variables, and the main focus is on discrete random variables. In our study on dependence, concepts such as concordance and copula are widely used.

We first propose a copula-type representation for random couples with Bernoulli margins. Some association measures for binary data are re-examined. It is stressed that a satisfactory dependence measure should only depend on the discrete copula, and not on the margins.

We propose a systematic study of the monotonicity property of Kendall's τ and Spearman's ρ with respect to the concordance ordering of pairs of discrete as well as continuous random variables.
This extends and completes results of Yanagimoto and Okamoto (1969) and Tchen (1980). Analytic expressions are given for the most extreme values of Kendall's τ and Spearman's ρ associated with discrete uniform variables. Some other measures of concordance are also studied. A certain number of examples are used to highlight the drawbacks of some concordance measures noticed for Bernoulli random variables and still present for more general discrete random variables. Corrections for some measures of monotone dependence for discrete random variables are proposed to obtain a margin-free range.

We also prove that various relationships between Kendall's τ and Spearman's ρ mentioned in Nelsen (1999) remain valid for discrete variables. In particular, results of Capéraà and Genest (1993) are extended to the case of discrete random pairs. We also establish that some useful stochastic dependence properties used in the actuarial literature are conserved for discrete random variables. In particular, some results in Dhaene and Goovaerts (1996) continue to hold for discrete random variables. Furthermore, we propose a copula for discrete random variables using a principle given in Denuit and Lambert (2002), which transforms an integer-valued random variable to a continuous one via convolution with a uniform random variable on [0,1].
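
The continuous extension principle mentioned in the last sentence can be sketched as follows: each integer value x is replaced by x + (U - 1), where U is an independent uniform variable, and a concordance measure such as Kendall's tau is then computed on the resulting tie-free sample. The shift by -1, the tau-a formula and the function names are assumptions of this example and are not taken from the thesis.

```python
import random

def continuous_extension(values, rng):
    """Map each integer x to x + (U - 1), with U uniform on [0, 1) drawn independently."""
    return [x + (rng.random() - 1.0) for x in values]

def kendall_tau(xs, ys):
    """Sample Kendall's tau (tau-a): (concordant - discordant) / number of pairs."""
    n = len(xs)
    score = 0
    for i in range(n):
        for j in range(i + 1, n):
            prod = (xs[i] - xs[j]) * (ys[i] - ys[j])
            score += (prod > 0) - (prod < 0)
    return score / (n * (n - 1) / 2)

# Usage on a small made-up sample of counts with many ties.
rng = random.Random(0)
x = [0, 0, 1, 1, 2, 3]
y = [0, 1, 1, 2, 2, 3]
tau_extended = kendall_tau(continuous_extension(x, rng), continuous_extension(y, rng))
```

The extension leaves the ordering of distinct integer values unchanged and only breaks ties at random, which is why it is a natural device for carrying copula-based tools over to discrete data.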

Finally, we consider a loglinear model to construct a joint distribution with uniform discrete margins. This new distribution preserves the local association of the joint distribution with arbitrary margins. The construction is based on a copula-type representation of an n × m table of joint probabilities with fixed margins. The idea arises from the decomposition of the joint probability in the original table into margins and association effects, leading to the uniform representation. The technique uses an iterative procedure which makes it possible to preserve local association by means of a set of local odds ratios. In the continuous case, any joint distribution function H, with margins F and G, admits a unique copula representation C, which is the part representing the dependence structure of H. While Sklar's continuous copula C preserves any dependence structure in H, the discrete uniform representation is shown to conserve only local dependence structures and properties.


Hyperrectangular space partitioning trees

Doctorat en Sciences (orientation : Statistique), 07/05/04

DE MACQ Isabelle

Promoteur: Léopold SIMAR, UCL/STAT


Abstract

Decision trees are one of the most widely used tools for "supervised classification" problems: provided a sample of correctly classified data is available, they make it possible to establish rules for explaining and/or predicting a categorical response (class membership) on the basis of observed values of predictive variables. Besides efficiency, this hierarchical method is endowed with many additional advantages. In particular, decision trees are very flexible and can handle heterogeneous data types, they are invariant under monotone transformations of the data, they perform internal feature selection as an integral part of the procedure, and they generate solutions allowing qualitative understanding of the generated prediction rules. All these properties, often essential in data mining applications, explain the major place occupied by tree methods in this field.

A plethora of tree algorithms have been developed, most of them being but variants of the original CART and C4.5 algorithms. A large part of academic research has focused on incremental modifications to current machine learning methods, and on the speed up of existing algorithms. The main contribution of this work has been to consider a theoretically sound decision tree classifier, initiated by Devroye, Györfi and Lugosi, and to evaluate its expected performance and efficacy as compared to traditional tree methods by evolving it into a practical tool. It is presented in a larger framework including an overview of related approaches issued from different fields, and potential developments.

Problems related to the splitting method of ordinary trees include the instability of the solutions, the unnecessarily complex expression of the identified structure, and the failure of the classifier when complex interrelations between predictors and class cannot be detected. Hyperrectangular Space Partitioning (HSP) trees, on the other hand, are based on a different cutting method and, despite their greedy construction, have been proven to be consistent without any additional tricks. However, their practical implementation is a challenge due to the computational burden involved. We explore their specific features, strong and weak points, by means of a massive search working tool, while a two-stage approach has been developed for approximating HSP tree solutions, making it possible to tackle moderate dimensional problems from commonly used data bases.


Smoothed Histograms and Asymmetric kernels Estimation For Density Function

Doctorat en Sciences (orientation : Statistique), 19/11/04

BOUEZMARNI Taoufik

Promoteur: Jean-Marie ROLIN, UCL/STAT


Abstract

This thesis is devoted to the nonparametric estimation of the class of density functions with a known bounded support. We distinguish two cases of bounded support.

For the first case, the support is [0,1]. For such a class of density functions we consider two estimators: the beta histogram and the beta kernel estimator. Different types of convergence are proved for these estimators: uniform weak consistency and convergence to infinity at the endpoints if the density is unbounded at these points. The rate of convergence of the mean integrated absolute error is also established. The quality of the estimation of the beta histogram is very sensitive to the choice of the bandwidth parameter. In this work we adapt two methods already used to select the bandwidth of the standard kernel estimator: likelihood cross-validation and least-squares cross-validation. The two estimators are already used in regression estimation. It will be interesting to use them in other nonparametric estimation problems, such as the estimation of the quantile and hazard functions.
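
To make the beta kernel estimator concrete, here is a minimal sketch in the form commonly used in the literature, where the estimate at x is the average of Beta(x/b + 1, (1 - x)/b + 1) densities evaluated at the sample points. This specific parametrisation, the bandwidth name `b`, the assumption that all observations lie strictly inside (0,1) and the function names are assumptions of this example; the abstract does not spell out the exact definition used in the thesis.

```python
import math

def beta_pdf(t, p, q):
    """Density of the Beta(p, q) distribution at t, for t strictly inside (0, 1)."""
    log_norm = math.lgamma(p + q) - math.lgamma(p) - math.lgamma(q)
    return math.exp(log_norm + (p - 1.0) * math.log(t) + (q - 1.0) * math.log(1.0 - t))

def beta_kernel_density(x, sample, b):
    """Beta kernel density estimate at x in [0, 1] with bandwidth b > 0."""
    p = x / b + 1.0
    q = (1.0 - x) / b + 1.0
    # The kernel shape changes with x, so no mass is placed outside [0, 1].
    return sum(beta_pdf(xi, p, q) for xi in sample) / len(sample)

# Usage on an evenly spread sample, whose density is close to uniform.
sample = [(i + 0.5) / 1000 for i in range(1000)]
estimate_at_boundary = beta_kernel_density(0.0, sample, 0.05)
```

Unlike a symmetric kernel, the beta kernel adapts its shape near 0 and 1, which is how this type of estimator avoids the usual boundary bias on a bounded support.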

For the second case, the support is [0,∞). For such a class, we study the consistency of the gamma histogram, the gamma kernel, the inverse Gaussian and the reciprocal inverse Gaussian estimators. We give the rate of convergence of the mean integrated absolute error of the gamma histogram and gamma kernel estimators. The uniform weak and the uniform strong convergence are also proved. When the density is unbounded at the origin, we prove, under some conditions on the density function f, the weak convergence to infinity of the gamma histogram at this point. All the results proved for these estimators are generalised to the generalised kernel estimators. Simulations show that the gamma histogram is better than the first gamma kernel estimator. We apply the gamma histogram and gamma kernel estimators to income data. The estimation requires the choice of an optimal bandwidth. Here the bandwidth selection method used leads to an asymptotically optimal window in the sense of minimising the L1 distance. The practical choice of the bandwidth parameter for the gamma histogram and gamma kernel estimators is an important problem.


Last updated: 21 January 2005 - Contact: Sophie Malali <www@stat.ucl.ac.be>