Contributions à la statistique computationnelle et à la classification non supervisée
soutenue le 12 Décembre 2014
à la Faculté des Sciences
Institut de Mathématiques et Modélisation de Montpellier (I3M)
École Doctorale Information, Structures et Systèmes (I2S)
Université Montpellier 2 (UM2)
pour l’obtention d’une Habilitation à Diriger des Recherches
par
Pierre Pudlo
devant le jury composé de:
Prof Mark Beaumont, University of Bristol, rapporteur
Prof Gérard Biau, Université Pierre et Marie Curie, rapporteur
Dr Gilles Celeux, INRIA, président du jury
Dr Arnaud Estoup, INRA, examinateur
Prof Jean-Michel Marin, Université Montpellier 2, coordinateur
Prof Didier Piau, Université Joseph Fourier, examinateur
Montpellier, UM2, 2014
“Et tout d’un coup le souvenir m’est apparu. Ce goût, c’était celui du petit morceau de madeleine que
le dimanche matin à Combray (parce que ce jour-là je ne sortais pas avant l’heure de la
messe), quand j’allais lui dire bonjour dans sa chambre, ma tante Léonie m’offrait après l’avoir
trempé dans son infusion de thé ou de tilleul. La vue de la petite madeleine ne m’avait rien
rappelé avant que je n’y eusse goûté; peut-être parce que, en ayant souvent aperçu depuis,
sans en manger, sur les tablettes des pâtissiers, leur image avait quitté ces jours de Combray
pour se lier à d’autres plus récents; peut-être parce que de ces souvenirs abandonnés si
longtemps hors de la mémoire, rien ne survivait, tout s’était désagrégé; les formes — et celle
aussi du petit coquillage de pâtisserie, si grassement sensuel, sous son plissage sévère et dévot
— s’étaient abolies, ou, ensommeillées, avaient perdu la force d’expansion qui leur eût permis
de rejoindre la conscience. Mais, quand d’un passé ancien rien ne subsiste, après la mort des
êtres, après la destruction des choses, seules, plus frêles mais plus vivaces, plus immatérielles,
plus persistantes, plus fidèles, l’odeur et la saveur restent encore longtemps, comme des âmes,
à se rappeler, à attendre, à espérer, sur la ruine de tout le reste, à porter sans fléchir, sur leur
gouttelette presque impalpable, l’édifice immense du souvenir.”
— À la recherche du temps perdu, Marcel Proust
Remerciements
L’écriture de ce mémoire d’habilitation à diriger des recherches était l’occasion de faire le
bilan de mes travaux de recherche depuis ma thèse. Il doit beaucoup à l’ensemble de mes co-auteurs, à mes nombreuses rencontres dans la communauté statistique et au-delà ; cette page est l’occasion de les remercier tous, ainsi que ma famille et mes amis.
Je voudrais d’abord exprimer ma profonde gratitude à Mark Beaumont, Gérard Biau, Chris
Holmes et Adrian Raftery qui ont accepté de rapporter cette habilitation et se sont acquittés
de cette lourde tâche dans les délais impartis par les nombreuses contraintes administratives
françaises. Je remercie vivement mes deux premiers rapporteurs, Mark Beaumont et Gérard
Biau, ainsi que Gilles Celeux, Arnaud Estoup et Didier Piau d’avoir bien voulu faire partie de
ce jury, malgré leurs emplois du temps chargés.
Ces travaux de recherche doivent beaucoup à Jean-Michel Marin, qui coordonne cette habilitation. Sa curiosité scientifique, sa disponibilité et son enthousiasme constants m’ont porté
depuis son arrivée à Montpellier. J’ai découvert avec lui les méthodes de Monte-Carlo et la
statistique bayésienne. Ses compétences et son rayonnement scientifique m’ont guidé jusqu’à
ce mémoire. Naturellement, je dois aussi un grand merci à Bruno Pelletier avec qui tout a
commencé.
J’ai bénéficié d’excellentes conditions de travail à l’Institut de Mathématiques et Modélisation
de Montpellier, au sein de son équipe de probabilités et statistique. Ses membres ont su me faire confiance en me recrutant après ma thèse en probabilités appliquées. C’est aussi l’occasion de
réitérer mes remerciements à Didier Piau qui a encadré ma thèse, et m’a beaucoup appris, y compris dans ses cours de licence et maîtrise. Je remercie tous les membres passés et présents
de mon équipe. Je n’ose pas me lancer ici dans une liste exhaustive au risque d’en oublier un ;
ils ont tous contribué.
Je remercie chaleureusement l’INRA, et le Centre de Biologie pour la Gestion des Populations
qui m’a accueilli pendant deux ans. J’y ai trouvé des conditions idéales pour mener des travaux
de recherche en statistique appliquée à la génétique des populations et d’excellents co-auteurs,
Jean-Marie Cornuet, Alex Dehne-Garcia, Arnaud Estoup, Mathieu Gautier, Raphael Leblois et
Renaud Vitalis auxquels s’ajoute François Rousset. C’est Jean-Michel qui m’a présenté à cette
équipe, ainsi qu’à Christian Robert, avec qui j’ai eu la chance de travailler. Merci Christian
d’avoir accepté de faire partie de ce jury, mais tes nombreux talents n’incluent pas encore le
don d’ubiquité.
Enfin, de nombreuses structures ont financé ces travaux de recherches. Je pense en particulier
à l’ANR, au LabEx NUMEV et à l’Institut de Biologie Computationnelle que je remercie. Et
un grand merci aux étudiants qui se sont lancés dans une thèse sous ma co-direction, sans
peut-être savoir ce qui les attendait : Mohammed Sedki, Julien Stoehr, Coralie Merle et
Paul-Marie Grollemund. Je leur souhaite un brillant avenir.
Montpellier, le 4 Décembre 2014
Pierre Pudlo
Contents

Remerciements

List of figures

1 Résumé en français
   1 Classification non supervisée
   2 Statistique computationnelle et applications à la génétique des populations
      2.1 Échantillonnage préférentiel et algorithme SIS
      2.2 Algorithmes ABC
      2.3 Premiers pas vers les jeux de données moléculaires issus des technologies NGS

2 Graph-based clustering
   1 Spectral clustering
      1.1 Graph-based clustering
      1.2 Consistency results
   2 Consistency of the graph cut problem
   3 Selecting the number of groups
   4 Perspectives

3 Computational statistics and intractable likelihoods
   1 Approximate Bayesian computation
      1.1 Recap
      1.2 Auto-calibrated SMC sampler
      1.3 ABC model choice
   2 Bayesian inference via empirical likelihood
   3 Importance sampling
      3.1 Computing the likelihood
      3.2 Sample from the posterior with AMIS
   4 Inference in neutral population genetics
      4.1 A single, isolated population at equilibrium
      4.2 Complex models
      4.3 Perspectives

Bibliography

A Published papers

B Preprints

C Curriculum vitæ
   Thèmes de recherche
   Liste de publications
List of Figures

2.1 Hartigan (1975)’s definition of clusters in terms of connected components of the t-level set of the density: at the chosen threshold, there are two clusters, C1 and C2. If t is much larger, only the right cluster remains, and if t is smaller than the value on the plot, both clusters merge into a single group.

2.2 Results of our algorithm on a toy example. (left) the partition given by the spectral clustering procedure (points in black remain unclassified); (right) the spectral representation of the dataset, i.e., the ρ(X_i)’s: the relatively large spread of the points on the top of the plot is due to poor mixing properties of the random walk on the red points.

2.3 Graph-cut on a toy example. The red line represents the bottleneck, the h-neighborhood graph is in gray. The partition returned by the spectral clustering method with the h-neighborhood graph corresponds to the color of the crosses (blue or orange).

2.4 The graph-based selection of k: (left) the datapoints and the spectral clustering output; (right) the eigenvalues of the matrix Q: the eigengap is clearly between λ3 and λ4.

3.1 Gene genealogy of a sample of five genes numbered from 1 to 5. The inter-coalescence times T5, . . . , T2 are represented on the vertical time axis.

3.2 Simulation of the genotypes of a sample of eight genes. As for microsatellite loci with the stepwise mutation model, the set of alleles is an interval of integer numbers A ⊂ N. The mutation process Q_mut adds +1 or −1 to the genotype with equal probability. Once the genealogy has been drawn, the MRCA is genotyped at random, here 100, and we run the mutation Markov process along the vertical lines of the dendrogram. For instance, the red and green lines are the lineages from the MRCA to genes number 2 and 4 respectively.

3.3 Example of an evolutionary scenario: four populations Pop1, . . . , Pop4 have been sampled at time t = 0. Branches of the history can be considered as tubes in which the gene genealogy should be drawn. The historical model includes two unobserved populations (Pop5 and Pop6) and fifteen parameters: six dates t1, . . . , t6, seven population sizes Ne1, . . . , Ne6 and Ne40, and two admixture rates r, s.
1 Résumé en français
Mes travaux de recherche ont touché à des thématiques diversifiées, allant de résultats
théoriques en probabilités au développement d’algorithmes d’inférence et à leurs mises
en œuvre. Deux caractéristiques dominantes se dégagent : (1) l’omniprésence d’algorithmes
(depuis l’algorithme glouton de ma thèse aux algorithmes de Monte-Carlo dans mes derniers
travaux), et (2) leur adaptation en biologie, principalement en génétique des populations. Du
fait de la taille croissante des données produites notamment en génomique, les méthodes
statistiques doivent gagner en efficacité sans perdre le détail de l’information incluse dans ces grandes bases de données. Mes travaux détaillés ci-dessous ont apporté des contributions importantes aussi bien à l’analyse et à la compréhension des performances de ces algorithmes qu’à la conception de nouveaux algorithmes gagnant en précision ou en efficacité d’estimation.
1 Classification non supervisée
Publications. (A2), (A3) et (A4), voir page 64. Un brevet international.
Mots clés. Machine learning, classification spectrale, théorèmes asymptotiques, constante de
Cheeger, graph-cut, graphes de voisinages, théorie spectrale d’opérateurs.
Après ma thèse en probabilités appliquées, je me suis tourné à Montpellier vers la statistique.
J’ai ainsi obtenu plusieurs résultats en Machine Learning, sur des problèmes de classification
non supervisée (voir, par exemple, Hastie et al., 2001). Cette méthode d’analyse de données
consiste à partitionner les individus (taxons) d’un échantillon en groupes relativement similaires et homogènes, sans aucune information sur l’appartenance des individus aux groupes,
ni même sur le nombre de groupes. De plus, les informations organisées selon un réseau,
ou graphe, comme les réseaux d’interaction, sont de plus en plus fréquentes et les méthodes
permettant d’analyser la structure de tels réseaux de plus en plus nécessaires. Ces méthodes
se doivent d’être algorithmiquement efficaces du fait de la taille croissante des données.
Les techniques de classification spectrale (von Luxburg, 2007) auxquelles je me suis intéressé
occupent une place importante dans ce champ de recherche. Comme dans les méthodes à
noyau (Shawe-Taylor and Cristianini, 2004), nous utilisons une comparaison par paire des
individus, mais au travers d’une fonction de similarité qui associe un nombre positif à chaque
paire d’observations, reflétant leur proximité. L’un des avantages de cette méthode récente est
sa grande maniabilité et sa faculté d’adaptation à de nombreux types de données. Elle permet
en effet de détecter des groupes d’observations de forme quelconque, contrairement aux
k-means ou méthodes de mélange, qui ne détectent que des groupes convexes. L’algorithme
étudié (Ng et al., 2002) s’appuie sur une marche aléatoire qui se déplace sur les individus à
classer proportionnellement à leur similarité. On reconstruit alors les groupes homogènes en
cherchant des ensembles d’états dont la marche aléatoire sort avec faible probabilité.
Récemment, von Luxburg et al. (2008) ont montré que de tels algorithmes convergent. Mais
l’obtention d’une caractérisation géométrique simple du partitionnement limite restait une question largement ouverte. Pourtant, dans le contexte de la classification non supervisée, Hartigan (1975) a proposé une définition précise et intuitive d’un groupe en termes d’ensembles
de niveaux de la densité sous-jacente. En modifiant l’algorithme de classification spectrale,
nous avons montré, avec Bruno Pelletier (PR, Rennes 2), que le partitionnement limite
coïncide avec cette définition ((A02) Pelletier and Pudlo, 2011). Cette démonstration repose sur
un mode de convergence fort d’opérateurs associés aux matrices de similarité de l’échantillon.
La méthode de graphe de voisinage (Biau et al., 2007) s’interprète comme une classification
spectrale dont la fonction de similarité est binaire. Avec Benoît Cadre (PR, ENS Cachan, Rennes) et Bruno Pelletier (PR, Rennes 2), nous avons largement amélioré les résultats sur
l’estimation du nombre de groupes k dans ce cadre particulier ((A04) Cadre, Pelletier, and
Pudlo, 2013). L’estimateur de k est la multiplicité de la valeur propre nulle du laplacien de
graphe. Ce résultat est une première justification formelle de l’heuristique de trou spectral
(von Luxburg, 2007) dans le cadre général.
Enfin, l’algorithme de classification spectrale peut se voir comme une approximation du problème NP-difficile de graph-cut ou de détection de goulets d’étranglement dans ces graphes
ou réseaux d’interaction (von Luxburg, 2007). La constante de Cheeger et le problème de
minimisation associé détectent ces goulets d’étranglement dans des graphes valués, comme
par exemple les graphes de voisinages. Avec Ery Arias-Castro (A-PR, University of California, San Diego) et Bruno Pelletier (PR, Rennes 2), nous avons étudié la constante de Cheeger sur de tels graphes aléatoires construits par échantillonnage ((A03) Arias-Castro, Pelletier, and Pudlo, 2012). Nous avons obtenu des résultats asymptotiques donnant la limite continue
lorsque la taille de l’échantillon grandit en adaptant la fonction de similarité binaire à cette
taille.
Récemment, nous avons été contactés par le CHU, avec André Mas (PR UM2, I3M), pour traiter des jeux de données d’analyse sanguine par cytomètre en flux. La question initiale était de détecter un petit cluster de cellules rares (des cellules circulantes à cause d’un cancer par exemple) parmi un très grand nombre d’observations. L’algorithme que nous avons développé
avec l’équipe Tatoo du LIRMM (UMR d’informatique de l’UM2) a été partiellement breveté
(brevet international, voir page 64). Avec la société d’accélération du transfert de technologies
de Montpellier (SATT AxLR), nous cherchons un partenaire industriel pour valoriser ce brevet.
Cette innovation fait l’objet d’une demande de financement de maturation auprès de la SATT AxLR ; elle soulève aussi des questions de transfert de méthodologies statistiques bien connues, mais peu utilisées par l’industrie qui fournit les logiciels d’analyse de données issues de cytomètres en flux.
2 Statistique computationnelle et applications à la génétique
des populations
Publications. (A5), (A6), (A7), (A8), (A9), (A12), (A13), (A14) et (A15), ainsi que les prépublications (A10), (A11), (A16) et (A17), voir page 64.
Mots clés. Méthodes de Monte Carlo, statistique computationnelle, méthodes ABC, échantillonnage préférentiel, vraisemblance empirique, génétique des populations.
Depuis 2010, je me suis intéressé à la statistique computationnelle pour la génétique des
populations. Cette thématique de recherche m’a permis de mettre en valeur mes compétences
en statistique, en probabilités, ainsi que sur les questions d’implémentation informatique. J’ai
souhaité profiter de l’environnement scientifique exceptionnel de Montpellier en génétique
des populations pour en faire un axe majeur de mes travaux de recherche.
Sous neutralité, l’évolution génétique est modélisée par des processus stochastiques complexes (notamment diffusion de Kimura (1968) et coalescent de Kingman (1982)) prenant
en compte simultanément les mutations et la dérive génétique. Répondre à des questions
d’intérêt biologique (quantifier un taux de migration, une réduction de taille de population,
dater des fondations de populations, ou retracer les voies d’invasion d’espèces étudiées au
CBGP) est un problème méthodologique délicat. Une modélisation fine permet de distinguer
des effets confondants comme, par exemple, sélection vs. variations démographiques. La
généalogie (i.e., les liens de parenté de l’échantillon de copies de gènes étudié), les dates des
mutations et les génotypes ancestraux, décrits par le modèle stochastique, ne sont pas observés directement. On parle de processus latents ou cachés. La vraisemblance des données
s’obtient alors en sommant sur toutes les possibilités, ce qui n’est pas faisable en temps fini.
On peut citer deux classes d’algorithmes de Monte-Carlo qui permettent d’inférer les paramètres
ou le modèle évolutif sous-jacent (on parle plus couramment de scénario d’évolution) malgré
la présence de ce processus stochastique latent. La première classe de méthodes repose sur
un échantillonnage préférentiel séquentiel (Sequential Importance Sampling ou SIS, voir
Stephens and Donnelly, 2000; De Iorio and Griffiths, 2004a,b; De Iorio et al., 2005). Elles
attaquent directement le calcul de la somme en tirant aléatoirement (échantillonnage) le
processus latent. Ce tirage séquentiel (qui remonte le temps progressivement) est dirigé par
une loi (dite loi d’importance) qui charge les généalogies supposées contribuer le plus à cette
somme (d’où l’adjectif préférentiel). Cette première famille de méthodes est la plus précise,
dans le sens où elle minimise l’erreur d’estimation dans un modèle donné. Mais elle est
aussi la plus gourmande en temps de calcul et la plus restreinte dans le champ des scénarii
d’évolution couverts. En effet, elle nécessite d’adapter la loi d’importance qui échantillonne
les arbres les plus importants à chaque situation démographique et historique considérée
(voir par exemple (A14) Leblois, Pudlo, Néron, Bertaux, Beeravolu, Vitalis, and Rousset (2014)).
La seconde classe de méthodes comprend les méthodes bayésiennes approchées (ABC ou
approximate Bayesian computation, voir par exemple Beaumont, 2010; (A05) Marin, Pudlo,
Robert, and Ryder, 2012; (A13) Baragatti and Pudlo, 2014) qui contournent le calcul de cette
somme en comparant des jeux de données simulés aux données observées au travers de quantités numériques (statistiques résumées) supposées informatives. Les estimations obtenues
par ABC sont moins précises que celles obtenues par SIS. Mais elles sont beaucoup plus
souples car ne reposent que sur notre capacité à (1) simuler suivant le modèle stochastique,
et (2) capter l’information importante au travers de statistiques résumées. Pour cette raison,
elles sont considérées comme les plus prometteuses pour répondre aux questions complexes
de génétique des populations. Nous avons passé en revue les principes statistiques ainsi
que les principaux résultats de cette dernière méthode dans deux articles qui se complètent
(A05) Marin, Pudlo, Robert, and Ryder (2012), (A13) Baragatti and Pudlo (2014).
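À titre d’illustration seulement, voici une esquisse minimale, en Python, d’un échantillonneur ABC par rejet ; le simulateur, la loi a priori et les statistiques résumées y sont des fonctions hypothétiques, et ce code n’est pas extrait des logiciels ou articles cités.

import numpy as np

def abc_rejet(y_obs, tirage_prior, simulateur, resume, n_sim=100_000, tolerance=0.01):
    """ABC par rejet : on conserve les paramètres dont les statistiques résumées
    simulées sont les plus proches de celles du jeu de données observé."""
    s_obs = np.asarray(resume(y_obs))
    thetas, distances = [], []
    for _ in range(n_sim):
        theta = tirage_prior()            # tirage dans la loi a priori
        y_sim = simulateur(theta)         # simulation suivant le modèle stochastique
        s_sim = np.asarray(resume(y_sim))
        thetas.append(theta)
        distances.append(np.linalg.norm(s_sim - s_obs))
    seuil = np.quantile(distances, tolerance)   # tolérance fixée par un quantile des distances
    return [t for t, d in zip(thetas, distances) if d <= seuil]

Les algorithmes développés dans (A11) et (A17) raffinent ce schéma élémentaire, respectivement par un échantillonnage séquentiel auto-calibré et par des forêts aléatoires pour le choix de modèle.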
2.1 Échantillonnage préférentiel et algorithme SIS
Avec Jean-Michel Marin (Montpellier 2), j’ai co-encadré la thèse en biostatistique de Mohammed Sedki. Nous nous sommes principalement intéressés à l’algorithme d’échantillonnage
préférentiel adaptatif et multiple (AMIS pour Adaptive Multiple Importance Sampling), voir
Cornuet et al. (2012a). Lorsque la vraisemblance est calculée avec SIS, son évaluation en chaque point de l’espace des paramètres est coûteuse. Dans une perspective bayésienne, le
schéma AMIS permet d’échantillonner cet espace des paramètres suivant la loi a posteriori, en
ne nécessitant que peu d’appels au calcul de la vraisemblance en un point. D’où son efficacité
computationnelle. AMIS approche sa cible par un système de particules pondérées, mis à jour
séquentiellement et recycle l’ensemble des calculs obtenus. Des tests numériques effectués
sur des modèles de génétique des populations ont montré les performances numériques de
l’algorithme AMIS (efficacité et stabilité). Toutefois, la question de la convergence des estimateurs obtenus par cette technique restait largement ouverte. Nous avons montré ((A16) Marin,
Pudlo, and Sedki, 2014) des résultats de convergence d’une version légèrement modifiée de
cet algorithme, mais conservant les qualités numériques du schéma original.
Dans le schéma SIS, ce sont les lois d’importance qui sont chargées de proposer des généalogies supposées contribuer le plus à la vraisemblance. Malheureusement, ces lois d’importance
sont conçues pour des situations en équilibre démographique. Il est possible de les utiliser
dans des situations où la taille d’une population varie au cours du temps, au prix d’un coût
de calcul beaucoup plus important pour conserver la même qualité d’approximation. Avec
Raphaël Leblois (CR INRA, CBGP) et Renaud Vitalis (DR INRA, CBGP), nous nous sommes attaqués dans (A14) Leblois et al. (2014) au cas d’une unique population panmictique dont la taille varie au cours du temps, et avons comparé les résultats obtenus avec d’autres algorithmes de
la littérature, moins efficaces ou moins bien justifiés.
J’encadre avec Raphaël Leblois la thèse de Coralie Merle (UM2 I3M & INRA CBGP), dont l’un des objectifs est de proposer des pistes pour pallier ce coût de calcul, en prolongement
direct de son stage de M2. Nous avons cherché à comprendre comment appliquer un rééchantillonnage (Liu et al., 2001) dont le but est d’apprendre automatiquement quels sont
les arbres proposés par la loi d’importance qui contribueront le plus à la vraisemblance. Des
premiers résultats à publier montrent que ce ré-échantillonnage permet de diviser le coût de
calcul par un facteur 10 dans des modèles de dynamique comparables à ceux étudiés dans
(A14) Leblois et al. (2014).
2.2 Algorithmes ABC
Avec mon premier étudiant en thèse, Mohammed Sedki, et Jean-Marie Cornuet (DR INRA,
CBGP), nous avons également développé un nouvel algorithme d’inférence ABC sur des scénarii évolutifs dans le paradigme bayésien ((A11) Sedki, Pudlo, Marin, Robert, and Cornuet,
2013). Comparé à l’état de l’art, cet algorithme est auto-calibré (auto-tuning) et plus efficace, d’où un gain de temps pour obtenir une réponse de même qualité que l’algorithme
standard. Nous avons illustré cette méthode sur un jeu de données portant sur quatre populations d’abeilles domestiques européennes (Apis mellifera) et un scénario démographique
précédemment validé sur une étude de l’ADN mitochondrial.
Au Centre de Biologie pour la Gestion des Populations (UMR INRA SupAgro Cirad IRD, Montpellier), je me suis fortement impliqué dans le codage de la seconde version de DIYABC
(Cornuet et al., 2008, 2010), qui vient de sortir ((A12) Cornuet et al., 2014). Ce logiciel condense
toute l’expérience acquise au sein de l’ANR EMILE sur les méthodes ABC pour la génétique des
populations. En particulier, pour gérer les situations où le nombre de statistiques résumées
est important, nous avons proposé avec Arnaud Estoup (DR INRA, CBGP) et Jean-Marie Cornuet (DR INRA, CBGP) d’estimer la probabilité a posteriori d’un modèle via une analyse
discriminante linéaire ((A06) Estoup et al., 2012).
Je co-encadre actuellement les travaux de thèse de Julien Stoehr, qui portent sur la sélection de modèles pour des champs de Markov latents, question de difficulté comparable à celle du choix de scénarii d’évolution en génétique des populations. Notons ici que les modèles de champs markoviens ou markoviens cachés intéressent également le département MIA pour l’analyse de réseaux d’interaction. Je pense en particulier à Nathalie Peyrard (MIA Toulouse),
qui fait partie du comité de suivi de thèse de Julien, et au champ thématique qu’elle anime
sur l’analyse des réseaux. Dans (A15) Stoehr, Pudlo, and Cucala (2014), nous avons mis en
place une procédure ABC de choix de modèle, qui renonce à l’approximation de la probabilité
a posteriori de chacun des modèles (qui représentent différentes structures de dépendance)
pour améliorer le taux de mauvaise classification (c’est-à-dire de mauvais choix de modèle),
via une procédure des k plus proches voisins parmi les simulations ABC. Nous inférons
ensuite localement autour du jeu de données observé un taux de mauvaise classification qui
fournit un succédané à l’estimation de la difficulté locale du choix fourni par la probabilité
a posteriori. Cette approche nous a permis de diminuer le nombre de simulations ABC
nécessaires pour prendre une décision correcte, et permet donc de diminuer de façon importante
le temps de calcul, tout en gagnant en qualité de décision. Ce nouvel indicateur nous permet
finalement de construire une procédure de sélection de statistiques résumées qui s’adapte au
jeu de données observé, et de réduire ainsi la dimension du problème.
Avec Arnaud Estoup (DR INRA, CBGP), Jean-Michel Marin (PR I3M, UM2) et Christian P. Robert (PR CEREMADE, Dauphine), nous prolongeons ces travaux en remplaçant la méthode des k plus proches voisins par des forêts aléatoires (Breiman, 2001), entraînées sur les
simulations ABC. Ce type de classifieur, qui prédit un modèle en fonction des statistiques
résumées, est bien moins sensible à la dimension que la méthode des k plus proches voisins et
donne de bien meilleurs résultats en génétique des populations où le nombre de statistiques
résumées est de l’ordre de la centaine (à comparer avec la dizaine de statistiques résumées
dans les questions de champs markoviens cachés des travaux de Julien Stoehr). En outre,
nous fournissons un autre succédané à la probabilité a posteriori qui est un taux de mauvaise
classification intégré contre la loi a posteriori de prédiction, intégration que l’on peut facilement réaliser avec une méthode ABC, voir (A17) Pudlo, Marin, Estoup, Gautier, Cornuet, and
Robert (2014).
2.3 Premiers pas vers les jeux de données moléculaires issus des technologies NGS
Les données de polymorphisme collectées par séquençage ultra-haut débit (Next Generation Sequencing data ou NGS) fournissent des marqueurs de type SNP (Single Nucleotide
Polymorphism) bi-alléliques. L’intérêt de ces données de grande dimension est la quantité
d’information qu’elles portent. Pour réduire les coûts financiers, il est possible d’utiliser un
génotypage par lots (ou pool d’individus), chacun d’entre eux représentant une population
d’intérêt pour l’espèce étudiée. Avec Mathieu Gautier (CR INRA, CBGP) et Arnaud Estoup
(DR INRA, CBGP), j’ai participé à des premiers travaux ((A08) Gautier et al., 2013) qui nous
permettent de mieux comprendre l’information perdue par ces schémas de génotypage.
Le séquençage RAD (Restriction site–associated DNA) est une technique récente basée sur la
caractérisation de fragments du génome adjacents à des sites de restriction. Lorsque le site
de restriction associé au marqueur d’intérêt a muté et perdu sa fonction, il est impossible de
séquencer le fragment d’ADN correspondant pour ce site. Ceci introduit possiblement un
biais, connu sous le nom d’Allele Drop Out. J’ai participé, avec les mêmes collaborateurs, à
l’étude détaillée ((A09) Gautier et al., 2012) de ce biais : il s’avère relativement faible dans la
plupart des cas. Nous proposons en outre une méthode pour filtrer les sites où ce biais est
important.
Il est essentiel de souligner ici que les données NGS nécessitent un lourd travail pour mettre les
algorithmes d’inférence à l’échelle de la dimension des données produites et tirer le meilleur
parti de cette information. Une première piste finalisée depuis peu correspond aux travaux
que j’ai développés avec Kerrie Mengersen (Queensland University of Technology, Brisbane, Australie) et Christian P. Robert (Paris-Dauphine & IUF). Il s’agit dans (A07) Mengersen, Pudlo,
and Robert (2013) d’une utilisation originale de la vraisemblance empirique (Owen, 1988,
2010) pour construire un algorithme de calcul bayésien (BCel pour Bayesian Computation
via empirical likelihood). Cette méthode BCel utilise des équations d’estimation intra-locus
dérivées du modèle via une vraisemblance composite (Lindsay, 1988) par paires de gènes. Au
lieu de résoudre ces équations pour trouver une estimation des paramètres d’intérêt par locus
et de réconcilier ces différents estimateurs en prenant leur moyenne ou leur médiane, nous
utilisons ces équations d’estimation comme entrée dans l’algorithme de vraisemblance empirique. Cette dernière reconstruit alors une fonction de vraisemblance à partir des données
et de ces équations. Nous avons testé cette méthodologie dans différentes situations. En particulier, en génétique des populations, sur de gros jeux de données composés de marqueurs
microsatellites, nous avons montré que notre méthode fournit des résultats plus précis que les
méthodes ABC, pourtant considérées comme la référence étalon dans ce domaine. En outre,
BCel permet de réduire grandement les temps de calcul (plusieurs heures en ABC deviennent
ici autant de minutes), donc de traiter des jeux de données de dimension plus grande. Cette
piste est donc prometteuse pour l’avenir.
2 Graph-based clustering
Keywords. Machine learning, spectral clustering, asymptotic theorems, Cheeger constant, graph-cut, neighborhood graphs, spectral theory of operators.
Papers. See page 64.
• (A02) Pelletier and Pudlo (2011)
• (A03) Arias-Castro, Pelletier, and Pudlo (2012)
• (A04) Cadre, Pelletier, and Pudlo (2013)
Clustering or cluster analysis Under this generic term we gather data analysis methods that aim at grouping observations or items in such a way that objects in the same group (also called a cluster) are similar to each other. The precise sense of similar will be defined below, though we should stress here that it is subjective. The grouping that ensues depends on the algorithm as well as on its tuning. The most famous methods to achieve this goal are hierarchical clustering, k-means and other centroid-based methods, and model-based methods such as mixtures of multivariate Gaussian distributions, see e.g. Chapter 14 in Hastie et al. (2009), Chapter 10 in Duda et al. (2012) and McLachlan and Peel (2000). Hierarchical clustering algorithms (at least the bottom-up approach) progressively merge data points, starting from the nearest pair of points, to build a whole hierarchy of groups usually displayed as a dendrogram. Groups of interest are then produced by cutting the dendrogram at a certain level. k-means clustering recovers groups by minimizing the within-cluster sum of squares. Various methods have been derived from this idea by changing the criterion being minimized. Finally, clustering with mixture distributions recovers the groups by assigning each data point to the mixture component with the highest posterior probability, after seeking the maximum likelihood estimator with an EM algorithm. The latter class of methods permits selecting the number of groups by resorting to penalized likelihood criteria such as BIC or ICL. Though these popular methods have proved useful in numerous applications, they suffer from some internal limitations, can be unstable (hierarchical methods) or fail to uncover clusters of complex (e.g., non-convex) shapes.
As in the case of kernel-based learning (see, e.g., Shawe-Taylor and Cristianini, 2004), the methods I have studied with Bruno Pelletier and other colleagues are based on pairwise comparison of individuals via a similarity function returning a non-negative real number
which reflects their proximity. Often the input of these methods is a pairwise similarity or
distance matrix, and they ignore the other details of the dataset. Hence we can also resort to
these methods for analyzing networks.
Graph-based methods I have studied two procedures, namely spectral clustering (von Luxburg, 2007) and a neighborhood graph scheme (Biau et al., 2007), which can both be seen as graph-based algorithms. The vertices of the undirected graph symbolize the observations, i.e. the items we are trying to gather into groups, and the edges link comparable items; weights on the edges of the graph indicate the strength of the link, see Subsection 1.1 below. The ideal clusters form a partition of the vertices which tends to minimize the number of edges that have their endpoints in different subsets of the partition. When the edges are weighted, the number of edges is replaced by the sum of the weights. Generally, the above graph cut problem is NP-hard. But the spectral clustering algorithm provides the solution of a relaxed optimization problem in polynomial time (von Luxburg, 2007). The notable exception, used in Section 3, is when the graph has multiple connected components, in which case the partition can be obtained in linear time with the help of well-known algorithms based on a breadth-first or depth-first search over the graph.
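For illustration, when the graph does split into connected components, the partition can indeed be read off a simple graph traversal; the breadth-first-search sketch below is given only as an illustration (it is not code from the thesis) and is written against a dense adjacency matrix for simplicity, a sparse adjacency list would make it linear in the number of edges.

import numpy as np
from collections import deque

def connected_components(adj):
    # adj: symmetric 0/1 (or weighted) adjacency matrix of an undirected graph.
    n = adj.shape[0]
    labels = np.full(n, -1)          # component label of each vertex
    current = 0
    for start in range(n):
        if labels[start] != -1:
            continue
        labels[start] = current
        queue = deque([start])
        while queue:                 # breadth-first search from 'start'
            u = queue.popleft()
            for v in np.where(adj[u] > 0)[0]:
                if labels[v] == -1:
                    labels[v] = current
                    queue.append(v)
        current += 1
    return labels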
Preprocessing The success of clustering methods in revealing an interesting structure of the dataset often depends on preprocessing steps such as normalization of covariates, deletion of outliers, etc., as well as on a fine tuning of the algorithm. The results we have obtained in (A02) Pelletier and Pudlo (2011) and in (A04) Cadre, Pelletier, and Pudlo (2013) show that we can learn which part of the data space has low density during a first stage of the algorithms and learn how to cluster the data during a second stage on the same dataset, without any over-fitting or bias. In both procedures, the observations falling into areas of low density are set apart and we do not attempt to assign these items to any revealed group: they are considered as background noise or outliers. Besides, the preprocessing step stabilizes the spectral clustering algorithm, as discussed in Subsection 1.2. The last advantage of the above preprocessing is that it fits the intuitive, geometric definition of clusters given by Hartigan (1975) as connected components of an (upper) level set of the density, see below.
The literature on cluster analysis is wide, and many clustering algorithms have been developed, so that a comprehensive review would be too long for this summary of my own research. The references of this chapter reflect only the way I entered this subject. Cluster analysis always implies a subjective dimension, mainly depending on the context of its end-use; evaluating objectively the result of an algorithm is almost impossible. As defended by von Luxburg et al. (2012), when facing a concrete dataset the sole judge of the accuracy of the revealed partition is whether or not the partition is useful in practice. Whence the importance of studying some algorithms theoretically and proving that the partition obtained on data converges to a solution that depends only on the underlying distribution. The first result in this direction is the consistency of k-means (Pollard, 1981), whose limit is well characterized. However, the limit of the cluster centers depends on the Euclidean distance used to assess proximity of data points, thus on a subjective choice of the user. Gaussian mixture clustering (see, e.g., McLachlan and Peel, 2000) is also a consistent method whose limit is independent of a distance choice. The partitions provided straightaway by the mixture components are always convex, though we can merge them (Baudry et al., 2010).
Figure 2.1 – Hartigan (1975)’s definition of clusters in terms of connected components of the
t -level set of the density: at the chosen threshold, there are two clusters, C1 and C2 . If t is
much larger, only the right cluster remains, and if t is smaller than the value on the plot, both
clusters merge into a single group.
There is no clear, mathematical definition of a cluster that everyone agrees with, but Hartigan (1975) outlined the following. Fix a real, non-negative number t. The clusters are the connected components of the upper level set of the density f, namely

L(t) := {x : f(x) ≥ t}.

See Figure 2.1. This definition depends of course on the value of t, but also on the reference measure used to define the density, another way of hiding normalization of the covariates, distance issues, etc. Other loose definitions can be found in the literature; an example is given by
Friedman and Meulman (2004) whose clusters are defined in terms of proximity on different
sets of variables (a set which depends on the group the item belongs to).
1 Spectral clustering
The class of spectral clustering algorithms ((A02) Pelletier and Pudlo, 2011) is a recent alternative that has become one of the most popular modern clustering methods, outperforming
classical clustering algorithms, see von Luxburg (2007). As with kernel methods (Shawe-Taylor
and Cristianini, 2004), the input of the procedure is a pairwise comparison of the items in the
dataset via a similarity function s.
1.1 Graph-based clustering
Consider a dataset X_1, . . . , X_n of size n in R^d. The similarity matrix is defined by

S(i, j) := s(h^{-1}(X_j − X_i)),

where s is a given similarity function whose support is the unit ball (or any convex, open, bounded set including the origin), taking values in [0; +∞), and where h is a scale parameter we have to tune. If s(u) = s(−u) for all u, the above matrix is symmetric and its coefficients are non-negative. Examples are given in von Luxburg (2007). Thus it can be seen as the weighted adjacency matrix of a similarity graph (that is allowed to have self-loops): each node i of the graph represents one data point X_i; if S(i, j) = 0, there is no edge between i and j, and if S(i, j) > 0, there is an edge between i and j with weight equal to S(i, j).
Note that, if s is the indicator function of the unit ball, namely s(u) = 1{‖u‖ ≤ 1}, the graph is actually unweighted (entries of the adjacency matrix are either 0 or 1) and is called the h-neighborhood graph. It connects all points of the dataset whose pairwise distances are smaller than h. More generally, a benefit of spectral clustering is its flexibility and its ability to adapt to different kinds of covariates with the help of a symmetric similarity s, though we have assumed here that all covariates are continuous.
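To make the construction concrete, here is a minimal Python sketch, given only as an illustration (the function names are hypothetical and the code does not come from the cited papers), that builds the similarity matrix S from a compactly supported similarity function; with the indicator of the unit ball it returns the adjacency matrix of the h-neighborhood graph.

import numpy as np

def similarity_matrix(X, h, s):
    # X: (n, d) array of observations, h: scale parameter,
    # s: similarity function supported on the unit ball (s(u) = 0 if ||u|| > 1).
    n = X.shape[0]
    S = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            S[i, j] = s((X[j] - X[i]) / h)
    return S

# Indicator of the unit ball: yields the unweighted h-neighborhood graph.
indicator = lambda u: float(np.linalg.norm(u) <= 1.0)

# A smooth, compactly supported alternative (Epanechnikov-type profile).
bump = lambda u: max(0.0, 1.0 - np.linalg.norm(u) ** 2)

X = np.vstack([np.random.randn(100, 2), np.random.randn(100, 2) + 5.0])
S = similarity_matrix(X, h=1.0, s=indicator)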
The normalized algorithm (Ng et al., 2002) is based on the properties of a random walk on the similarity graph: when the walk is at a vertex i, it will jump to another vertex j with probability proportional to S(i, j). Its transition matrix is then

Q := D^{-1} S,    (2.1)

where D is the diagonal matrix defined by

D(i, i) = \sum_j S(i, j).

Note that D(i, i) can be interpreted as the weighted degree of node i in the similarity graph. Then the algorithm tries to recover clusters as sets of nodes in which the random walk stays trapped for a long time. Other variants of the spectral clustering algorithm rely on other ways of normalizing the similarity matrix, see von Luxburg (2007) or Maier et al. (2013).
A simplified case To give intuition on the spectral clustering algorithm, we begin with a simplified case where the similarity graph has more than one connected component, say C_1, . . . , C_k. Then, the set of nodes (i.e., the dataset) can be partitioned into recurrent classes. And, if the random walk starts from one node of the graph, it is trapped forever in the recurrent class where this first state lies. The connected components, thus the clusters, can be recovered with the help of clever graph algorithms (breadth-first search or depth-first search). But this method cannot be applied in the general case. The number of recurrent classes, k, is actually the multiplicity of the eigenvalue 1 of the matrix Q. And the linear subspace of (right) eigenvectors corresponding to this largest eigenvalue of Q represents the set of harmonic functions on the graph. Note that, in this setting, a vector V = (V(1), . . . , V(n)) indexed by {1, . . . , n} can be considered as a function v whose argument varies in {X_1, . . . , X_n} with v(X_i) = V(i). The k vectors V_1 = (V_1(1), . . . , V_1(n)), . . . , V_k = (V_k(1), . . . , V_k(n)) defined by

V_ℓ(i) = 1 if i ∈ C_ℓ, and 0 otherwise,

form a basis of the eigenspace Ker(Q − I). In other words, these eigenvectors are piecewise constant on the connected components C_ℓ, ℓ = 1, . . . , k. Consider now any basis V_1, . . . , V_k of this eigenspace of dimension k. We have that, for any pair i, j,

• if i and j belong to the same component, then the i-th and j-th coordinates of any vector of the basis are equal: V_1(i) = V_1(j), . . . , V_k(i) = V_k(j) (because the vectors of the basis are harmonic functions);

• if i and j do not belong to the same component, then V_ℓ(i) and V_ℓ(j) differ for at least one vector of the basis, i.e., for at least one value of ℓ (because these vectors form a basis of the eigenspace).

With the help of the basis we can send the original dataset into R^k (where k is the number of clusters) as follows: the i-th item, X_i, is replaced with the vector ρ(X_i) = (V_1(i), . . . , V_k(i)) whose coordinates are the i-th coordinates of each vector in the basis. The two above properties of the basis imply that the representation of the dataset in R^k is composed of exactly k different points, in a one-to-one correspondence with the clusters.
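The following short numerical check, a toy illustration added here (not an excerpt from (A02)), makes these two properties concrete on a block-structured similarity matrix: the eigenvalue 1 of Q has multiplicity 2, and the spectral representation ρ collapses each connected component onto a single point of R^2.

import numpy as np

# Toy similarity matrix with two connected components, {0, 1, 2} and {3, 4}.
S = np.array([
    [1.0, 0.8, 0.5, 0.0, 0.0],
    [0.8, 1.0, 0.7, 0.0, 0.0],
    [0.5, 0.7, 1.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 1.0, 0.9],
    [0.0, 0.0, 0.0, 0.9, 1.0],
])
D = np.diag(S.sum(axis=1))
Q = np.linalg.inv(D) @ S                      # transition matrix (2.1)

eigval, eigvec = np.linalg.eig(Q)
ones = np.where(np.isclose(eigval, 1.0))[0]
print(len(ones))                              # 2 = number of connected components
rho = eigvec[:, ones].real                    # rows are the rho(X_i)'s
print(np.round(rho, 6))                       # only two distinct rows appear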
The general algorithm Recall that we consider here only the normalized algorithm of Ng et al. (2002). When the similarity graph cannot be decomposed into connected components as above, the random walk is irreducible and the multiplicity of the (largest) eigenvalue 1 of Q is then one. As considered in von Luxburg et al. (2008) and in (A02) Pelletier and Pudlo (2011), the general situation can be seen as a perturbation of the simplified case, where some null coefficients of the similarity matrix have been replaced by small ones. Then the k largest eigenvalues of Q (counted with their multiplicity) replace the eigenvalue 1 of multiplicity k, and the corresponding eigenvectors V_1, . . . , V_k are substitutes for the general basis described above. Likewise, we can send the dataset into R^k via this basis:

ρ(X_i) = (V_1(i), . . . , V_k(i)).    (2.2)

Because of the perturbation, the eigenvectors are now only almost constant on the clusters. Hence the representation is composed of n different points, but they concentrate around k well-separated centers. Finally, the algorithm recovers the clusters by applying a k-means procedure on this new representation. This leads to Algorithm 1.
Algorithm 1: Normalized spectral clustering
Input: the similarity matrix S; the number of groups k
compute the transition matrix Q as in (2.1);
compute the k largest eigenvalues of Q;
set V_1, . . . , V_k as the corresponding eigenvectors;
for i ← 1 to n do
    set ρ_i = (V_1(i), . . . , V_k(i));
end
apply a k-means algorithm on the n new data points ρ_1, . . . , ρ_n with k groups;
Output: the partition of the dataset into k groups
It is worth noting that the matrix Q is similar (or conjugate) to a symmetric matrix:

Q = D^{-1/2} (D^{-1/2} S D^{-1/2}) D^{1/2}.

Hence, first, Q is diagonalizable and, second, the spectra of Q and of D^{-1/2} S D^{-1/2} coincide. In particular, all eigenvalues of Q are real, and a vector V is an eigenvector of Q if and only if D^{1/2} V is an eigenvector of the symmetric matrix D^{-1/2} S D^{-1/2}. And finally, all eigenvalues of Q lie in [−1; 1] because, for each i,

\sum_j Q(i, j) = \sum_j |Q(i, j)| = 1.

So the computation of the largest eigenvalues and the corresponding eigenvectors of Q can be performed with an efficient algorithm for symmetric matrices.
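Putting Algorithm 1 and this conjugation trick together, a compact Python sketch (given only as an illustration under simplifying assumptions; it is not the implementation used in the cited papers) computes the top eigenvectors of the symmetric matrix D^{-1/2} S D^{-1/2}, maps them back to eigenvectors of Q, and finishes with k-means.

import numpy as np
from scipy.cluster.vq import kmeans2

def normalized_spectral_clustering(S, k):
    """Sketch of Algorithm 1 (normalized spectral clustering of Ng et al., 2002)."""
    d = S.sum(axis=1)                            # weighted degrees D(i, i)
    d_isqrt = 1.0 / np.sqrt(d)
    M = d_isqrt[:, None] * S * d_isqrt[None, :]  # D^{-1/2} S D^{-1/2}, symmetric

    _, U = np.linalg.eigh(M)                     # eigenvalues in ascending order
    U = U[:, -k:]                                # eigenvectors of the k largest eigenvalues

    rho = d_isqrt[:, None] * U                   # V = D^{-1/2} U are eigenvectors of Q
    _, labels = kmeans2(rho, k, minit="++")      # k-means on the rho(X_i)'s
    return labels

On the toy block-structured matrix of the previous snippet, normalized_spectral_clustering(S, 2) recovers the two components.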
Graph cut, conductance, . . . There is another interpretation of Algorithm 1 in terms of conductance, graph cut and mixing time of the Markov chain defined by the random walk on the graph. To simplify the discussion, let us assume that the number of groups k equals 2. Since the weights of the edges represent the strength of the similarity between data points, it makes sense to partition the dataset by minimizing the weights of the edges with endpoints in different groups. So, if A is a subset of nodes and Ā its complement, we seek a bottleneck, or a subset A of nodes not well connected with the rest of the graph, namely Ā. Thus one might be tempted to minimize

S(A, Ā) := \sum_{i ∈ A, j ∈ Ā} S(i, j).

Very often, the minimum of this optimization problem is attained when A (or equivalently Ā) is a singleton composed of one observation rather far from the rest of the dataset. To avoid such undesirable solutions, we normalize S(A, Ā) by a statistic that takes into account the sizes of A and Ā. Various normalizations are possible, see e.g. von Luxburg (2007), but we can consider the Cheeger cut, also called the normalized cut, defined as

Cut(A, Ā) := S(A, Ā) / min(S(A), S(Ā)),    where S(A) := \sum_{i ∈ A, j ∈ A} S(i, j).    (2.3)

The minimization of the normalized cut can be transformed into the optimization of a quadratic form on R^n with very complex constraints, and if f = (f(1), . . . , f(n)) is a point where the quadratic form is minimal, the partition can be recovered as A = {i : f(i) > 0} and Ā = {i : f(i) < 0}, see Shi and Malik (2000) and von Luxburg (2007). Due to the constraints, the optimization problem is NP-hard in general, and Algorithm 1 gives an approximate solution by removing some of the constraints (such optimization problems are related to the eigendecomposition of symmetric matrices via the Rayleigh–Ritz theorem). Moreover, the minimum value of Cut(A, Ā) is called the Cheeger constant, which controls how fast the random walk converges to its stationary distribution. Indeed, the presence of a narrow bottleneck between A and Ā will prevent the chain from moving easily from A to its complement. The precise result is a control on the second largest eigenvalue of Q, see e.g. Diaconis and Stroock (1991).
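For concreteness, the following helper (a hypothetical sketch added here, not code from the thesis) evaluates the normalized cut (2.3) of a candidate partition; the brute-force minimizer below is exponential in n and only usable on very small graphs, which is exactly why the spectral relaxation of Algorithm 1 is valuable.

import numpy as np
from itertools import combinations

def cheeger_cut(S, in_A):
    # in_A: boolean mask of length n selecting the subset A.
    A = np.where(in_A)[0]
    Abar = np.where(~in_A)[0]
    cross = S[np.ix_(A, Abar)].sum()                              # S(A, Abar)
    denom = min(S[np.ix_(A, A)].sum(), S[np.ix_(Abar, Abar)].sum())
    return cross / denom

def brute_force_min_cut(S):
    # Exhaustive search over nontrivial subsets A (exponential cost).
    n = S.shape[0]
    best_value, best_mask = np.inf, None
    for size in range(1, n // 2 + 1):
        for A in combinations(range(n), size):
            mask = np.zeros(n, dtype=bool)
            mask[list(A)] = True
            value = cheeger_cut(S, mask)
            if value < best_value:
                best_value, best_mask = value, mask
    return best_value, best_mask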
1.2 Consistency results
Previous results from the literature Assume that the dataset is an iid sample X_1, . . . , X_n from an unknown distribution P on R^d. If the number of groups k is known, von Luxburg et al.
(2008) have proven that, when the data size n tends to infinity, and when the scale parameter h
is fixed, the Markovian transition matrix Q tends (in some weak sense) to a Markovian operator
whose state space is the support of the underlying distribution P . Their proof follows the
footsteps of Koltchinskii (1998); it has been simplified by Rosasco et al. (2010) a few years
later. As a consequence, with the help of functional analysis results, they can prove that the
eigendecomposition of Q converges to the eigendecomposition of the limit operator which is
an integral operator. Yet the limit partition is difficult to interpret and its geometrical meaning
in relation to the unknown distribution P is neither explicit nor discussed in either paper. The
geometry of spectral clustering is still a hot topic, see e.g. Schiebinger et al. (2014).
We believed that this is maybe not the correct limit to consider. Indeed the scale parameter h can be compared to a bandwidth, and the weighted degree of node i,

D(i, i) = \sum_j S(i, j) = \sum_j s(h^{-1}(X_j − X_i)),

is a kernel estimate of the density f of P at point X_i (up to a constant). It is well known that kernel density estimators converge to some function f_h when h is fixed and the size of the dataset n → ∞. But this is a biased estimator and f_h ≠ f. In the similarity graph context, when h remains constant while the number of data points increases, the narrowness of the bottleneck decreases. In other words, there are more points in between clusters and the random walk defined by Q will jump from one group to another more easily. Our intuition with Bruno Pelletier was that the bottleneck would not vanish if h → 0 while n → ∞. But the limit problem becomes harder, see Section 2.
Figure 2.2 – Results of our algorithm on a toy example. (left) the partition given by the spectral
clustering procedure (points in black remain unclassified); (right) the spectral representation
of the dataset, i.e., the ρ(X i )’s: the relatively large spread of the points on the top of the plot is
due to poor mixing properties of the random walk on the red points.
Consistency proven in (A02) Pelletier and Pudlo (2011) The cluster analysis becomes much easier if we remove the data points in between groups. This is what we have proposed in (A02) Pelletier and Pudlo (2011). We start with a preprocessing step that sets apart points in areas of low density. This preprocessing is performed with the help of a consistent density estimator f̂_n (note that we have at our disposal a kernel density estimator if we multiply the weighted degrees of the similarity graph by the correct constant). We fix a threshold t and keep the points whose indices lie in

J(t) := { j : f̂_n(X_j) ≥ t }.

The density estimator f̂_n is (of course) computed on the dataset X_1, . . . , X_n we want to study; we do not need to divide the dataset into different folds (one for density estimation, one for the cluster analysis). We then apply Algorithm 1 on the subset of kept observations. Because of the preprocessing, the data points we use as input of the spectral clustering algorithm easily group into well-separated areas of R^d whatever the size n. Hence we do not need an intricate calibration of h. An example is given in Figure 2.2. Under mild assumptions not recalled here, we have proven in (A02) Pelletier and Pudlo (2011) that the whole procedure is consistent and that the groups converge to the connected components of the upper level set of the density f defined as

L(t) = {x : f(x) ≥ t},

as long as the scale parameter h is below a threshold that does not depend on the size of the dataset! Additionally, the limit corresponds to the definition of a cluster given by Hartigan (1975) and can be easily interpreted. Yet the data points whose indices fall outside J(t) remain unclassified. Besides, we have proven that the Markovian transition matrices Q converge strongly (in a certain operator norm sense) to a compact integral operator, which is a stronger result than the one obtained by von Luxburg et al. (2008). The trick was to consider the appropriate Banach space

W(L(t)) := { v of class C^1 on L(t) }

equipped with the norm

‖v‖_W := sup_x |v(x)| + \sum_i sup_x |∂v/∂x_i (x)|.
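A minimal sketch of this two-stage procedure, given only as an illustration (an indicator-kernel density estimate, reusing the normalized_spectral_clustering helper defined earlier; it is not the code of (A02)), reads as follows.

import numpy as np
from math import gamma, pi

def two_stage_clustering(X, h, t, k):
    # Stage 1: kernel density estimate built from the h-neighborhood degrees.
    n, d = X.shape
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    S = (dist <= h).astype(float)                 # adjacency of the h-neighborhood graph
    ball_volume = pi ** (d / 2) / gamma(d / 2 + 1)
    f_hat = S.sum(axis=1) / (n * ball_volume * h ** d)

    # Keep the indices J(t) whose estimated density is at least t.
    keep = f_hat >= t

    # Stage 2: Algorithm 1 on the kept points; the other points stay unclassified (-1).
    labels = np.full(n, -1)
    labels[keep] = normalized_spectral_clustering(S[np.ix_(keep, keep)], k)
    return labels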
2 Consistency of the graph cut problem
Difficulty of studying the spectral clustering algorithm As explained above, we believed that a correct calibration of h in the spectral clustering algorithm would lead to a scale h that goes to 0 when n → ∞. In this regime, the matrices Q converge to the identity operator (which is neither an integral operator nor a compact operator anymore) and we have to study the so-called normalized graph Laplacian

(n h^{d+2})^{-1} (I − Q)    (2.4)

(correctly scaled) in order to obtain a non-trivial limit.

Indeed, as shown in Belkin and Niyogi (2005) or Giné and Koltchinskii (2006) for the case where the distribution is uniform over an unknown set of the Euclidean space, the limit operator is a Laplace–Beltrami operator (i.e., a second-order differential operator). At least intuitively, this implies that the random walk on the similarity graph converges to a diffusive continuous Markov process on the support of the underlying distribution when time is correctly scaled. A formal proof has not been written, but this could be done by resorting to theorems in Ethier and Kurtz (2009). But the spectral decomposition of such unbounded operators is much more difficult to handle, and thus the asymptotic study of the spectral algorithm requires much deeper theorems of functional analysis than the (not trivial!) ones manipulated in von Luxburg et al. (2008), Rosasco et al. (2010) or our own (A02) Pelletier and Pudlo (2011).
Laplace–Beltrami operators The simplest Laplace operator is the one-dimensional ∆ = −∂²/(∂x)², which is the infinitesimal generator of the Brownian motion. On the real line R, any real number λ is an eigenvalue of ∆, and the eigenfunctions are either

x ↦ C sin(√|λ| x) + C′ cos(√|λ| x)   or   x ↦ C sinh(√|λ| x) + C′ cosh(√|λ| x),

depending on the sign of λ. Hence the spectrum of ∆ is far from being countable. By contrast, on the circle S¹, i.e. the segment [0; 2π] whose endpoints have been glued, the spectrum of ∆ is countable, limited to non-positive integer numbers λ, and, up to an additive constant, the eigenfunctions are now

x ↦ C sin(√|λ| x) + C′ cos(√|λ| x).

We face here the very difference between the Fourier expansion of a (2π)-periodic function and the Fourier transform of a real function on the real line.
To circumvent the functional analysis difficulties, we studied the graph cut problem in
(A03) Arias-Castro, Pelletier, and Pudlo (2012). Recall from Section 1.1 that the spectral
algorithm can be seen as an approximation of the graph cut optimization. Thus we discard
the evaluation of the accuracy of the latter approximation. What remains to study is the
convergence of a sequence of optimization problems when n → ∞ and h → 0 although they
are practically NP-hard to solve.
Results The results we have obtained in (A03) Arias-Castro, Pelletier, and Pudlo (2012) are
rather limited in their statistical consequences despite the difficulty of the proofs. Yet we have
shown that our intuition on a relevant calibration of the scale h with respect to the data size
was correct: in this setting, the correct scaling implies that h → 0 when n → ∞. The main
assumptions are as follows:
• the distribution of the iid dataset X_1, . . . , X_n is the uniform distribution on a compact set M of R^d;
• the similarity function is the indicator function of the unit ball of R^d, so that the similarity graph is the h-neighborhood graph (see above in Section 1.1);
• and n → ∞ and h → 0 such that n h^{2d+1} → ∞.
The toy example of Figure 2.3 is a uniform sample of size 1,500, where M is the union of two overlapping circles. Since the graph-cut problem is NP-hard, the approximate solution drawn in Figure 2.3 was computed with the spectral clustering procedure (here without any preprocessing).

Figure 2.3 – Graph-cut on a toy example. The red line represents the bottleneck, the h-neighborhood graph is in gray. The partition returned by the spectral clustering method with the h-neighborhood graph corresponds to the color of the crosses (blue or orange).
To avoid peculiarities of the discrete optimization problem on a finite graph and non-regular solutions, we optimize (2.3) under the constraint that A has a smooth boundary in R^d and that the curvature of the boundary of A is not too large; measured in terms of reach (a mild way to define a radius of curvature), we constrained the reach of ∂A to be above a threshold ρ, with ρ → 0 such that, for some α > 0, when n → ∞,

• h ρ^{−α} → 0 and
• n h^{2d+1} ρ^{α} → ∞.
In a first theorem, we have proven that, when n → ∞ and if well normalized, the function defined in (2.3) converges almost surely to

c(A; M \ A) := perimeter(M ∩ ∂A) / min( volume(A ∩ M), volume(M \ A) ),    (2.5)

where ∂A denotes the boundary of A.
Minimizing such a function of the set A detects the narrowest bottleneck of the compact set M: if c(A; M \ A) is small for some A, then we can divide M into two sets, A and M \ A, of rather large volume (because of the volumes in the denominator of (2.5)) but separated by a hypersurface of small perimeter. This is known in geometry as the Cheeger problem: the smallest value of c(A; M \ A) is the Cheeger (isoperimetric) constant of the compact set M. The two other theorems we have proven in (A03) Arias-Castro, Pelletier, and Pudlo (2012) show that, with the above constraints, the smallest value of (2.3) converges almost surely to the smallest value of (2.5) up to a normalizing constant, and that the discrete sets we obtain on the data can serve as a skeleton to recover the optimal set A at the limit.
3 Selecting the number of groups
Calibration of the threshold t We mentioned earlier that Hartigan (1975)'s clusters depend on the value of the threshold t we have to set. A clear procedure to fix the value of t is to assume that we want to discard a small part of the dataset, say of probability p. In (A04) Cadre, Pelletier, and Pudlo (2013), we studied the consistency of the algorithm estimating t as the p-quantile of f̂_n(X_1), . . . , f̂_n(X_n), where f̂_n is a density estimator trained on X_1, . . . , X_n; this quantile can be easily computed using an order statistic. Under mild assumptions, we established concentration inequalities for t when n → ∞, depending on the supremum norm sup_x |f̂_n(x) − f(x)|. Moreover, when f̂_n is a kernel density estimator, we controlled the Lebesgue measure of the symmetric difference between the true level set of f of probability (1 − p) and the level set of f̂_n using the calibrated value of t. The exact convergence rate we obtained is proportional to (nε^d)^{−1/2}, where ε is the bandwidth of the kernel density estimator (see also Rigollet and Vert, 2009). Note that ε tends to 0 when n → ∞, so that the rate of convergence is (as is common in nonparametric settings) much slower than n^{−1/2}.
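As a minimal illustration of this calibration (not the estimator actually studied in (A04)), the sketch below uses a Gaussian kernel density estimator and sets t to the p-quantile of its values at the observations; the mixture data and the value of p are arbitrary.

# Sketch: calibrate the threshold t as the p-quantile of fhat(X_1), ..., fhat(X_n),
# so that roughly a fraction p of the sample is discarded.
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(1)
X = np.concatenate([rng.normal(-2, 0.5, 400), rng.normal(2, 0.5, 600)])

p = 0.05
fhat = gaussian_kde(X)                    # kernel density estimator trained on the sample
values = fhat(X)                          # estimated density at each observation
t = np.quantile(values, p)                # order-statistic estimate of the p-quantile

kept = X[values >= t]                     # points inside the estimated level set {fhat >= t}
print(f"t = {t:.4f}, kept {kept.size} of {X.size} points")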
Estimating the number of components in the level set A few years earlier, Biau et al. (2007) provided a graph-based method to select the number of groups in such nonparametric settings. Their estimator, inspired by Cuevas et al. (2001), can be described as follows: it is the multiplicity of the eigenvalue 1 of the matrix Q described in Section 1.1 when the similarity function s is the indicator function of the unit ball and when the scale factor h is well calibrated. We recall here that the similarity graph is then the unweighted h-neighborhood graph. The results of (A04) Cadre, Pelletier, and Pudlo (2013) were originally accompanied by concentration results on this estimator of k, but these were not published in the journal. They can be found on the French preprint server HAL, see
http://hal.archives-ouvertes.fr/hal-00397437.
In comparison with the original results in Biau et al. (2007), the proven concentration inequalities do not require any assumption on the gradient of f̂_n and take into account the estimation of t.
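The following hedged sketch illustrates the resulting estimate of the number of groups: for the unweighted h-neighborhood graph (without isolated vertices), the multiplicity of the eigenvalue 1 of Q is the number of connected components of the graph, so it suffices to count components, for instance on the points kept in the estimated level set. The helper name and the toy data are illustrative.

# Sketch: count the connected components of the h-neighborhood graph, which equals
# the multiplicity of the eigenvalue 1 of the random-walk matrix Q of that graph.
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components
from scipy.spatial.distance import cdist

def n_groups(points, h):
    pts = np.asarray(points, dtype=float).reshape(len(points), -1)
    adj = csr_matrix(cdist(pts, pts) <= h)          # unweighted h-neighborhood graph
    k, _ = connected_components(adj, directed=False)
    return k

rng = np.random.default_rng(2)
X = np.concatenate([rng.normal(-2, 0.3, 200), rng.normal(2, 0.3, 200)])
print(n_groups(X, h=0.3))                           # two well-separated groups in this toy sample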
The eigengap heuristic The idea behind generalizing the above reasoning to the spectral clustering algorithm is again based on perturbation theory. Indeed, if the number of groups is k, we expect to observe k eigenvalues around 1, and a gap between these largest eigenvalues and the (k + 1)-th eigenvalue. Thus, if λ_1 > · · · > λ_n are the eigenvalues of Q, we seek the smallest value of k for which λ_k − λ_{k+1} is relatively large compared to λ_ℓ − λ_{ℓ+1}, ℓ = 1, . . . , k − 1. An example is given in Figure 2.4.

Figure 2.4 – The graph-based selection of k: (left) the data points and the spectral clustering output; (right) the eigenvalues of the matrix Q: the eigengap is clearly between λ_3 and λ_4.

The eigengap heuristic usually works well when
• the data contains well pronounced clusters,
• the scale parameter is well calibrated and
• the clusters are not far from being convex.
For instance, the toy example of Figure 2.3 does not present any eigengap. We have tried several toy examples with Mohammed Sedki during a graduate internship, but we were unable to get a general algorithm that produces reasonable results whatever the example; see my paper in the conference proceedings (T10), 41ièmes Journées de Statistique, Bordeaux, 2009.
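For cases where the heuristic does apply, here is a small sketch of it under simplifying assumptions: a Gaussian similarity is used instead of the h-neighborhood graph, and k is chosen at the largest gap among the leading eigenvalues rather than by the relative comparison described above.

# Sketch of the eigengap heuristic: build Q = D^{-1} S, sort its eigenvalues in
# decreasing order and pick k where the gap lambda_k - lambda_{k+1} is the largest.
import numpy as np
from scipy.spatial.distance import cdist

def eigengap_k(X, h, kmax=10):
    S = np.exp(-cdist(X, X) ** 2 / (2 * h ** 2))     # Gaussian similarity (one possible choice)
    np.fill_diagonal(S, 0.0)
    Q = S / S.sum(axis=1, keepdims=True)             # random-walk normalization
    lam = np.sort(np.real(np.linalg.eigvals(Q)))[::-1]
    gaps = lam[:kmax] - lam[1:kmax + 1]              # lambda_k - lambda_{k+1}, k = 1, ..., kmax
    return int(np.argmax(gaps)) + 1, lam[:kmax + 1]

rng = np.random.default_rng(4)
X = np.vstack([rng.normal(c, 0.2, size=(100, 2)) for c in [(0, 0), (3, 0), (0, 3)]])
k, leading = eigengap_k(X, h=0.5)
print(k, np.round(leading, 3))                       # here the gap should appear after lambda_3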
4 Perspectives
Calibration The major problem with unsupervised learning is calibration, since we cannot resort to cross-validation. As we have seen above, many properties we are able to prove on the spectral clustering algorithm depend on the fact that the scale parameter h is well chosen. A possible road to solve this problem is based on the fact that the weighted degree of each node (i.e., of each observation) forms a density estimate. Indeed, we have

(nh^d)^{−1} D(i, i) = (nh^d)^{−1} ∑_j S(i, j) = n^{−1} ∑_j h^{−d} s( h^{−1}(X_j − X_i) ),

and we can thus calibrate h using cross-validation of this density estimate. We have no theoretical guarantee that this way
of proceeding is the best, but at least, the similarity graph captures some important features
of the data. Another more standard procedure to set h is to rely on stability of the partition
returned by the algorithm when we subsample the dataset or when we add small noise, see
e.g. Ben-David et al. (2006) or von Luxburg et al. (2012), but such procedures need careful
handling.
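The sketch below illustrates this remark: the rescaled weighted degrees are kernel density estimates at the data points, and a leave-one-out likelihood criterion (only one possible choice, assumed here for illustration) can then be maximized over a grid of values of h.

# Sketch: D(i, i) / ((n - 1) h^d), up to the kernel normalization, is a leave-one-out
# kernel density estimate at X_i; h is then chosen by cross-validated log-likelihood.
import numpy as np
from scipy.spatial.distance import cdist

def degree_density(X, h):
    n, d = X.shape
    S = np.exp(-cdist(X, X) ** 2 / (2 * h ** 2))        # Gaussian similarity s(h^{-1}(X_j - X_i))
    np.fill_diagonal(S, 0.0)                            # leave-one-out: drop the self term
    return S.sum(axis=1) / ((n - 1) * h ** d * (2 * np.pi) ** (d / 2))

def loo_log_likelihood(X, h):
    return np.sum(np.log(degree_density(X, h) + 1e-300))

rng = np.random.default_rng(5)
X = np.vstack([rng.normal(0.0, 1.0, size=(200, 2)), rng.normal(4.0, 1.0, size=(200, 2))])
grid = np.linspace(0.1, 2.0, 20)
best_h = grid[np.argmax([loo_log_likelihood(X, h) for h in grid])]
print(best_h)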
Selecting the number of groups The eigengap heuristic is not optimal to select the number of groups. Another possible road is to focus on the concentration into clusters of the spectral representation of the data points, i.e., the ρ(X_i)'s defined in (2.2). In this perspective, we are looking for the largest value of k such that the within-cluster sum of squares of the ρ(X_i)'s remains small. The difficulty is that the dimension of the spectral representation is k and thus varies with the number of groups.
3 Computational statistics and intractable likelihoods
Keywords. Monte Carlo methods, computational statistics, approximate Bayesian computation,
importance sampling, empirical likelihood, population genetics.
Papers. See page 64.
• (A05) Marin, Pudlo, Robert, and Ryder (2012)
• (A06) Estoup, Lombaert, Marin, Robert, Guillemaud, Pudlo, and Cornuet (2012)
• (A07) Mengersen, Pudlo, and Robert (2013)
• (A08) Gautier, Foucaud, Gharbi, Cezard, Galan, Loiseau, Thomson, Pudlo, Kerdelhué, and Estoup (2013)
• (A09) Gautier, Gharbi, Cezard, Foucaud, Kerdelhué, Pudlo, Cornuet, and Estoup (2012)
• (A10) Ratmann, Pudlo, Richardson, and Robert (2011)
• (A11) Sedki, Pudlo, Marin, Robert, and Cornuet (2013)
• (A12) Cornuet, Pudlo, Veyssier, Dehne-Garcia, Gautier, Leblois, Marin, and Estoup (2014)
• (A13) Baragatti and Pudlo (2014)
• (A14) Leblois, Pudlo, Néron, Bertaux, Beeravolu, Vitalis, and Rousset (2014)
• (A15) Stoehr, Pudlo, and Cucala (2014)
• (A16) Marin, Pudlo, and Sedki (2014)
• (A17) Pudlo, Marin, Estoup, Gauthier, Cornuet, and Robert (2014)
Intractable likelihoods Since 2010 and the arrival of Jean-Michel Marin at Montpellier, I have been interested in computational statistics for Bayesian inference with intractable likelihoods. When conducting a parametric Bayesian analysis, Monte Carlo methods aim at approximating the posterior via the empirical distribution of a simulated sample on the parameter space, Φ say. More precisely, in the regular case, the posterior has density

π(φ | x_obs) ∝ p_prior(φ) f(x_obs | φ),    (3.1)

where p_prior(φ) is the prior density on Φ and f(x | φ) the density of the stochastic model. But
the likelihood may be unavailable for mathematical reasons (no closed form as a function of φ) or for computational reasons (too expensive to calculate). Difficulties are obvious when the stochastic model is based on a latent process u ∈ U, i.e.,

f(x | φ) = ∫_U f(x, u | φ) du,

and the above integral cannot be computed explicitly, or when the likelihood is known up to a normalising constant depending on φ, i.e.,

f(x | φ) = g(x, φ) / Z_φ,    where Z_φ = ∫ g(x, φ) dx.
And some models such as hidden Markov random fields suffer from both difficulties, see, e.g.,
Everitt (2012). Monte Carlo algorithms, for instance Metropolis-Hastings algorithms (see e.g.
Robert and Casella, 2004), require numerical evaluations of the likelihood f (x obs | φ) at many
values of φ. Or, using a Gibbs sampler, one needs to be able to simulate from both conditionals f(φ | u, x_obs) and f(u | x_obs, φ). Moreover, even if the above is often possible, the increase in dimension induced by the data augmentation from φ to (φ, u) may be such that the properties of the MCMC algorithms are too poor for the algorithm to be considered.
The limits of MCMC (poor mixing properties in complex models and multi-modal likelihoods) have been reached, for instance, in population genetics. Authors in this field of evolutionary sciences, which has been my favorite application domain in the past years, proposed a new methodology named approximate Bayesian computation (ABC) to deal with intractable likelihoods. A major benefit (which contributes to its success) of this class of algorithms is that computations of the likelihoods are replaced by simulations from the stochastic models. Since its introduction by Tavaré et al. (1997), it has been widely used and has provoked a flurry of research. My contributions to ABC methods are described in Section 1, including fast ABC samplers, high performance computing strategies and ABC model choice issues. When
simulating datasets is slow (for instance with high dimensional data), ABC is no longer an
option. Thus we have proposed to replace the intractable likelihood with an approximation
given by the empirical likelihood of Owen (1988, 2010), see Section 2. On the contrary, when
the dimension of the latent process u and of the parameter φ is moderate, we can rely on
importance sampling to approximate the relevant integrals, and my work in this last class of methods is described in Section 3. We end the chapter with a short presentation of the
stochastic models in population genetics and a few perspectives to face the drastic increase of
the size of genetic data due to next generation sequencing (NGS) techniques. These models
are almost all available in the DIYABC software to which I contributed, see (A12) Cornuet et al.
(2014).
1 Approximate Bayesian computation
We have written two reviews on ABC, namely (A05) Marin, Pudlo, Robert, and Ryder (2012)
completed by (A13) Baragatti and Pudlo (2014). Before exposing our original developments,
we begin with a short presentation of ABC methods.
1.1 Recap
Basic samplers and their targets The basic idea underlying ABC is as follows. Using simulations from the stochastic model, we can produce a simulated sample from the joint distribution

π(φ, x) := p_prior(φ) f(x | φ).    (3.2)
The posterior distribution (3.1) is then the conditional distribution of (3.2) knowing that
the data x is equal to the observation x obs . From the joint, simulated sample, ABC derives
approximations of the conditional density π(φ | x obs ) and of other functionals of the posterior
such as moments or quantiles.
The previous elegant idea suffers from two shortcomings. First, the algorithm might be
time consuming if simulating from the stochastic model is not straightforward. But the
most profound problem is that estimating the conditional distribution of φ knowing x = x obs
requires that some simulations fall into the neighbourhood of x obs . If the data space X is
not of very small dimension, we face the curse of dimensionality, namely that it is almost
impossible to get a simulated dataset near the observed one. To solve the problem, ABC
schemes perform a (non linear) projection of the (observed and simulated) datasets on a space
of reasonable dimension d via some summary statistics η : X → R^d and set a metric δ[s, s′] on R^d. Hence Algorithm 2.
Algorithm 2: ABC acceptation-rejection algorithm with a given threshold ε
for i = 1 → N_prior do
    draw φ_i from the prior p_prior(φ);
    draw x from the likelihood f(x | φ_i);
    compute s_i = η(x);
    store the particle (φ_i, s_i);
end
for i = 1 → N_prior do
    compute the distance δ[s_i, s_obs];
    reject the particle if δ[s_i, s_obs] > ε;
end
Note that the first loop does not depend on the dataset x_obs, hence the simulated particles might be reused to analyse other datasets. In the second loop, the particles can also be weighted by w_i = K_ε(δ[s_i, s_obs]), where K_ε is some smoothing kernel with bandwidth ε. Then, the acceptation-rejection algorithm is just a particular case of the weighted algorithm, with K_ε(δ) = 1{δ ≤ ε}.
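A minimal sketch of Algorithm 2 on a toy model (a normal mean, with the sample mean as the only summary statistic); the model, the prior and the 1% quantile used to set ε are illustrative choices.

# Sketch of Algorithm 2: x | phi ~ N(phi, 1) repeated 10 times, eta(x) = sample mean,
# prior phi ~ N(0, 10); the threshold eps is set to a quantile of the distances.
import numpy as np

rng = np.random.default_rng(6)
n_prior = 100_000
x_obs = rng.normal(2.0, 1.0, size=10)
s_obs = x_obs.mean()

# First loop: the reference table does not depend on x_obs, hence it can be reused.
phi = rng.normal(0.0, np.sqrt(10.0), size=n_prior)                    # prior draws
s = rng.normal(phi[:, None], 1.0, size=(n_prior, 10)).mean(axis=1)    # summaries of simulated data

# Second loop: keep the particles whose summaries fall close to s_obs.
dist = np.abs(s - s_obs)
eps = np.quantile(dist, 0.01)
kept = phi[dist <= eps]
print(kept.mean(), kept.std())        # crude ABC approximation of the posterior mean and spread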
Target of the algorithm and approximations The holy grail of the ABC scheme is the posterior distribution (3.1). To bypass the curse of dimensionality, ABC introduces the non linear projection η : X → R^d. Hence, we cannot recover anything better than the conditional distribution of φ knowing that the simulated summary statistics η(x) equal s_obs = η(x_obs), i.e.,

π( φ | η(x) = s_obs ).    (3.3)

Moreover, the event η(x) = s_obs might have very low probability, if not null. Hence, the output of the ABC algorithm is a sample from the distribution conditioned on the larger event A_ε = { δ[η(x), s_obs] ≤ ε }, namely

π( φ | δ[η(x), s_obs] ≤ ε ).    (3.4)

The distribution (3.4) tends to (3.3) when ε → 0. But, if the user wants to control the size of the output, decreasing the threshold ε might be problematic in terms of processing time: indeed, if we want a sample of size N from (3.4), the algorithm requires an average number of simulations N_prior = N/π(A_ε), and the probability π(A_ε) of the event A_ε can decrease very fast toward 0 when ε → 0. Actually, the error we commit by estimating the density of the output of the algorithm rather than computing (3.3) explicitly has been widely studied in nonparametric statistics since the seminal paper of Rosenblatt (1969). But the discrepancy between (3.3) and the genuine posterior (3.1) remains insufficiently explored.
1.2 Auto-calibrated SMC sampler
Rejection sampling (Algorithm 2) and ABC-MCMC methods (Marjoram et al., 2003) can
perform poorly if the tolerance level ε is small. Various sequential Monte Carlo algorithms (see
Doucet et al., 2001; Del Moral, 2004; Liu, 2008a, for general references) have been constructed
as an alternative to these two methods: Sisson et al. (2007, 2009), Beaumont et al. (2009),
Drovandi and Pettitt (2011) and Del Moral et al. (2012). These algorithms start from a large
tolerance level ε0 , and at each iteration the tolerance level decreases, εt < εt −1 . The simulation
problem therefore becomes more and more difficult, whereas the proposal distribution for the parameters φ gets closer and closer to the posterior. In practice, the tolerance level ε used in the rejection sampling algorithm is not fixed in advance, but corresponds to a quantile of the distances between the observed dataset and some simulated ones — see the interpretation of this calibration in terms of nearest neighbors in Biau et al. (2012), as well as in
Section 1.3.
The algorithm of Beaumont et al. (2009) corrects the bias introduced by Sisson et al. (2007)
and is a particular population Monte Carlo scheme (Cappé et al., 2004). It requires fixing
a sequence of decreasing tolerance levels ε0 > ε1 > . . . > εT which is not very realistic for
practical problems. In contrast, the proposals of Del Moral et al. (2012) and Drovandi and
Pettitt (2011) are adapted likelihood-free versions of the Sequential Monte Carlo sampler (Del
Moral et al., 2006) and include a self-calibration mechanism for the sequence of decreasing
tolerance levels. Sadly, in some situations, these auto-calibrated algorithms do not permit any gain in computation time, see our test presented in Table 1 of (A11) Sedki, Pudlo, Marin, Robert, and Cornuet (2013) when all other calculations are negligible compared with the simulation of a single dataset from the model. This is typically true for complex models, e.g. for complex scenarios in population genetics.
With my former PhD student Mohammed Sedki, we have transformed the likelihood-free
SMC sampler of Del Moral et al. (2012) in order to keep the number of simulated datasets from
the stochastic model as low as possible. Indeed, the rejection sampler of Algorithm 2 proposes
values of φ according to the prior distribution, hence probably in areas of the parameter
space with low posterior density. Almost any value of φ in such areas will produce simulated
datasets far from the observed one. The idea of sequential ABC algorithms is to learn gradually
the posterior distribution, i.e. in which area of the parameter space we should draw the
proposed parameter φ. It is also worth noting that even if we draw parameters from the (unknown) posterior distribution, the simulated datasets are distributed according to the posterior predictive law, and such simulated datasets do not automatically fall in a small neighborhood of the observed data (measured in terms of summary statistics). Hence, even if we learn perfectly how to sample the parameter space, reducing the value of the threshold ε always induces an irreducible computational cost.
The algorithm we proposed in (A11) Sedki, Pudlo, Marin, Robert, and Cornuet (2013) is based
on the above assessments and can be described as follows:
(a) begin with a first run of the rejection sampler proposing Nprior = N0 particles and
accepting N1 particles (N1 < N0 , usually N1 is one tenth of N0 );
(b) use a few iterations of ABC-SMC which update the sample of size N1 and reduce the
threshold ε;
(c) end with a last run of the rejection sampler on the N1 particles produced by ABC-SMC
and return to the user a sample of particles of size N2 < N1 .
Step (a) produces a crude estimate of the posterior distribution, and has proved crucial (in numerical examples) to reduce the computational cost of the whole procedure. Step (b) includes a self-calibration of the sequence of thresholds, resulting from a trade-off between small thresholds and the quality of the sample of size N1. This step also includes a stopping criterion to detect when ABC-SMC loses efficiency and faces the irreducible computational cost mentioned above. The ABC approximation of the posterior is often far from admissible when the stopping criterion is met: it is designed to detect when it becomes useless to learn where to draw parameters φ in order to reduce the threshold ε easily, but not to serve as a guarantee that the final sample is a good approximation of the posterior. Hence, we end the whole algorithm with a rejection step in (c), which decreases the threshold by keeping only a small proportion of the particles returned by ABC-SMC. We have illustrated the numerical behavior of the proposed scheme on simulated data and a challenging real-data example from population genetics, studying the European honeybee Apis mellifera. In the latter example, the computational cost was about half that of Algorithm 2.
Finally, in a conference paper (Marin, Pudlo, and Sedki, 2012), we compared different parallelization strategies for our own algorithm and for Algorithm 2. Algorithm 2 is embarrassingly parallel: each drawing from the joint distribution can be done independently by some core of the CPU, or by some computer of the cluster. On the other hand, between each iteration of the ABC-SMC, there is a step (sorting the particles with respect to their distances to s_obs and calibrating ε) that cannot be done in parallel. We show indeed that the parallel overhead of sequential ABC samplers prohibits their use on large clusters of computers, while the simplest algorithm can take advantage of the many computers in the cluster without losing time when we have to distribute some relevant information to the computers.
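A hedged sketch of this embarrassingly parallel first loop with Python's multiprocessing, on the same toy model as in the sketch of Algorithm 2 above; the worker function and the chunk sizes are illustrative, and the (non-parallel) rejection step then runs on the merged reference table.

# Sketch: each worker simulates an independent chunk of the reference table.
import numpy as np
from multiprocessing import Pool

def simulate_chunk(args):
    seed, n_sim = args
    rng = np.random.default_rng(seed)
    phi = rng.normal(0.0, np.sqrt(10.0), size=n_sim)                     # prior draws
    s = rng.normal(phi[:, None], 1.0, size=(n_sim, 10)).mean(axis=1)     # summaries
    return phi, s

if __name__ == "__main__":
    n_workers, n_per_worker = 4, 25_000
    with Pool(n_workers) as pool:
        chunks = pool.map(simulate_chunk, [(seed, n_per_worker) for seed in range(n_workers)])
    phi = np.concatenate([c[0] for c in chunks])
    s = np.concatenate([c[1] for c in chunks])
    print(phi.shape, s.shape)         # the rejection step is then applied to the merged table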
1.3 ABC model choice
There have been several attempts to improve models by interpreting lacks of fit between simulated and observed values of interpretable summary statistics, see our own work (A10) Ratmann, Pudlo, Richardson, and Robert (2011). But ABC model choice is generally conducted with Algorithm 3, which rearranges Algorithm 2 for Bayesian model choice. Posterior probabilities of each model are then estimated with the frequency of each model among the kept particles. If the goal of the Bayesian analysis is the selection of the model that best fits the observed data x_obs (to decide between various possible histories in population genetics for instance), it is performed through the maximum a posteriori (MAP) model number, replacing the unknown probabilities with their ABC approximations. Sadly, Robert et al. (2011) and Didelot et al. (2011) have raised important warnings regarding model choice with ABC, since there is a fundamental discrepancy between the genuine posterior probabilities and the ones based on summary statistics.
Moving away from the rejection algorithms From the standpoint of machine learning, the
reference table simulated during the first loop of Algorithm 3 serves as a training database
composed of iid replicates drawn from the joint Bayesian distribution (model × parameter
× summaries of dataset), which can be seen as a hierarchical model. To select the best
model, we have drifted gradually to more sophisticated machine learning procedures. With
my PhD student Julien Stoehr, we have considered the ABC model choice algorithm as a
k-nearest neighbor (knn) method: the calibration of ε in Algorithm 3 is thus transformed into the calibration of k.

Algorithm 3: ABC acceptation-rejection algorithm for model choice with a given threshold ε
for i = 1 → N_prior do
    choose a model m_i at random from the prior on the model number;
    draw φ_i from the prior p_{m_i−prior}(φ);
    draw x from the likelihood f(x | φ_i, m_i) of model m_i;
    compute s_i = η(x);
    store the particle (m_i, φ_i, s_i);
end
for i = 1 → N_prior do
    compute the distance δ[s_i, s_obs];
    reject the particle if δ[s_i, s_obs] > ε;
end

This knn interpretation was also introduced by Biau et al. (2012) for
parameter inference. Indeed, the theoretical results of Marin et al. (2013) require a shift from the approximation of the posterior probabilities: the only issue that ABC can address with reliability is classification (i.e., decision in favor of the model that best fits the data). In this first paper, we proposed to calibrate k by minimizing the misclassification error rate of knn on a test reference table, drawn independently from the simulations that have been used to train the knn classifier. Recall that, if m̂ is a classifier, the misclassification error rate is

τ := P( m̂(X) ≠ M ) = ∑_m p(m) ∬ 1{ m̂(x) ≠ m } p_{m−prior}(φ) f(x | φ, m) dx dφ.    (3.5)
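A minimal sketch of this calibration with scikit-learn, on two illustrative toy models: k is chosen by minimizing the misclassification error rate, i.e. an estimate of τ in (3.5), computed on an independent test reference table.

# Sketch: ABC model choice as knn classification of the model index from the summaries,
# with k calibrated on an independent test reference table.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def reference_table(n, rng):
    m = rng.integers(0, 2, size=n)                     # model index, uniform prior
    phi = rng.normal(0.0, 2.0, size=n)                 # parameter drawn from its prior
    loc = np.where(m == 0, phi, np.abs(phi))           # two illustrative toy models
    x = rng.normal(loc[:, None], 1.0, size=(n, 20))
    eta = np.c_[x.mean(axis=1), x.std(axis=1)]         # two summary statistics
    return m, eta

rng = np.random.default_rng(7)
m_train, s_train = reference_table(50_000, rng)
m_test, s_test = reference_table(10_000, rng)          # independent test reference table

errors = {}
for k in (10, 50, 100, 500, 1000):
    knn = KNeighborsClassifier(n_neighbors=k).fit(s_train, m_train)
    errors[k] = np.mean(knn.predict(s_test) != m_test)  # estimate of tau in (3.5)
print(errors, min(errors, key=errors.get))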
With the help of Nadaraya-Watson estimators, we also proposed to disintegrate the misclassification rate to get the conditional expected value of the misclassification loss knowing the observed data (or, more precisely, knowing some summaries of the observed data). The conditional error is used to assess the difficulty of the model choice at a given value of x and the confidence we can put on the decision m̂(x). This is crucial since, in the end, the classifier is built to take a decision at a single value of x, which is the observed dataset. More precisely, since ABC relies on summary statistics, we introduced

τ(s) := P( m̂(η_1(X)) ≠ M | η_2(X) = s ),    (3.6)
where η 1 (·) and η 2 (·) are two (different or equal) functions we can use to summarize a dataset:
the first one is used by the knn classifier, the second one to disintegrate the error rate. And
finally, we have proposed an adaptive ABC scheme based on this local assessment of the error
to adjust the projection, i.e. the selection of summary statistics, to the data point within knn.
This attempt to fight against the curse of dimensionality locally around the observed data
x obs contrasts with most projection methods which are often performed on the whole set of
particles (i.e. not adapted to x obs ) and most often limited to parameter inference (Blum et al.,
2012). Yet the methodology (Nadaraya-Watson estimators) is limited to a modest number
of summary statistics, as when selecting between dependency structures of discrete hidden
Markov random fields.
ABC model choice via random forest Real population genetics examples are often based on many summary statistics (for example 86 numerical summaries in Lombaert et al. (2011)) and require machine learning techniques that are less sensitive to the dimension than knn. In high dimensional spaces, knn classifiers are often strongly outperformed by other classifiers. Various methods have been proposed to reduce the dimension, see the review of Blum et al. (2013), and our own proposal based on LDA axes ((A06) Estoup et al., 2012). But, when the decision can be taken with reliability on a small (but unknown) subset of the summaries (possibly depending on the point of the data space), random forests are a good choice, since their performance depends mainly on the intrinsic dimension of the classification problem (Biau, 2012; Scornet et al., 2014). However, machine learning solutions such as random forests miss the distinct advantage of posterior probabilities, namely that they evaluate the degree of confidence in the selected model. Indeed, in a well trained forest, the proportion of trees in favor of a given model is an estimate of the posterior probability that is largely biased toward 0 or 1. We thus
developed a new posterior indicator to assess the confidence we can put in the model choice
for the observed dataset. This index is the integral of the misclassification loss with respect to
the posterior predictive and can be evaluated with an ABC algorithm. It can be written as
error(s) := ∑_m p( m | η(x) = s ) ∬ 1{ m̂(η(y)) ≠ m } p( φ | η(x) = s, m ) f( y | φ, m ) dφ dy,    (3.7)

and should be compared with τ defined in (3.5). The difference is that the prior probabilities and densities are replaced by the posterior probabilities and densities knowing that η(x) = s.
Unlike tests based on posterior predictive p-values (Meng, 1994), our indicator does not commit the sin of “using the data twice” and is simply an exploratory statistic. Additionally, it does not suffer from the drastic limitation in dimension of the conditional error rate (3.6) we developed with Julien Stoehr, which is due to the resort to Nadaraya-Watson estimators (see above). Last
but not least, the training of random forests requires a much lower simulation effort when
compared with the standard ABC model choice algorithm. The gain in computation time is
large: our population genetics examples show that the number of simulations can be divided by ten or twenty.
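A hedged sketch of this alternative with scikit-learn's RandomForestClassifier, on the same illustrative toy models as in the knn sketch above; as discussed, the proportion of trees voting for a model should not be read directly as a posterior probability.

# Sketch: ABC model choice with a random forest trained on the reference table;
# the decision at the observed summaries is the majority vote among the trees.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(8)
n = 20_000
m_train = rng.integers(0, 2, size=n)                          # model index, uniform prior
phi = rng.normal(0.0, 2.0, size=n)
loc = np.where(m_train == 0, phi, np.abs(phi))                # same toy models as above
x = rng.normal(loc[:, None], 1.0, size=(n, 20))
s_train = np.c_[x.mean(axis=1), x.std(axis=1)]

forest = RandomForestClassifier(n_estimators=500, n_jobs=-1, random_state=0)
forest.fit(s_train, m_train)

x_obs = rng.normal(1.5, 1.0, size=20)                         # pseudo-observed dataset
s_obs = np.array([[x_obs.mean(), x_obs.std()]])
print("selected model:", int(forest.predict(s_obs)[0]),
      "tree votes:", forest.predict_proba(s_obs)[0])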
Conclusion and perspectives
My favorite way to present ABC is now the following two-stage algorithm.
The general ABC algorithm
(A) Generate a large set of particles (m, φ, η(x)) from the joint p(m)p(φ | m) f (x|m, φ)
(B) Use learning methods to infer about m or φ at s obs = η(x obs ).
If we are only interested in parameter inference, we can drop m in the above algorithm (and the trivial prior which puts probability 1 on a single model number). We have conducted research to speed up the computation either by improving the standard ABC algorithms in stage (A) of the algorithm, see Section 1.2, or in stage (B), see Section 1.3. The gain in computation time we have obtained in the latter case is much larger than the gain we obtained with sequential ABC algorithms (although the setting is not exactly the same). For obvious reasons regarding computer memory, ABC algorithms will never be able to keep track of the whole details of simulated datasets: they commonly save vectors of summary statistics. Thus the error between (3.3) and (3.1) will remain. The only work that studies this discrepancy is Marin et al. (2013), and it is limited to model choice issues. For parameter inference, the problem remains open.
Additionally, machine learning methods will be of considerable interest for the statistical processing of massive SNP datasets whose production is on the increase in the field of population genetics, even if such sequencing methods can introduce some bias (see, e.g., our work in (A09) Gautier et al., 2012; (A08) Gautier et al., 2013). Those powerful methods have not really been tested for parameter inference, which means that there is still room for improvements in stage (B) of the algorithm.
2 Bayesian inference via empirical likelihood
In (A07) Mengersen, Pudlo, and Robert (2013), we have also explored another likelihood-free
approximation for parameter inference in the Bayesian paradigm. Instead of bypassing the
computation of intractable likelihoods with numerous simulations, the proposed methodology
relied on the empirical likelihood of Owen (1988, 2010). The latter defines a pseudo-likelihood
on a vector of parameters φ as follows
L_el(φ | x_obs) = max { ∏_{i=1}^{n} p_i : 0 ≤ p_i ≤ 1, ∑_i p_i = 1, ∑_i p_i h(x_i, φ) = 0 },    (3.8)

when the data x_obs = (x_1, . . . , x_n) are composed of iid replicates from an (unknown) distribution P and when, for some known function h, the parameter of interest φ satisfies

∫ h(x_i, φ) P(dx_i) = 0.    (3.9)
Note that the above equation might be interpreted as a non-parametric definition of the
parameter of interest φ since there is no assumption on the distribution P . The original
framework introduced by Owen (1988) dealt with the mean φ of an unknown distribution
P, in which case h(x_i, φ) = x_i − φ. And, with the empirical likelihood ratio test, we can produce confidence intervals on φ in a non-parametric setting, i.e. more robust than confidence intervals from a parametric model. Other important examples are moments of various orders or quantiles. In all these cases, the function h in (3.9) is well known.
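A minimal sketch of (3.8) in this prototypical case h(x, φ) = x − φ, using the standard dual representation p_i = 1/(n(1 + λ(x_i − φ))) and a scalar root-find; it is only an illustration, not the Newton-Lagrange implementation mentioned in the next paragraph.

# Sketch: empirical log-likelihood (3.8) for a scalar mean, with h(x, phi) = x - phi.
# The maximizing weights are p_i = 1 / (n (1 + lam (x_i - phi))) where lam solves
# sum_i (x_i - phi) / (1 + lam (x_i - phi)) = 0 on the interval keeping all p_i > 0.
import numpy as np
from scipy.optimize import brentq

def emp_log_lik(x, phi):
    h = x - phi
    n = len(x)
    if h.max() <= 0 or h.min() >= 0:            # phi outside the convex hull of the data
        return -np.inf
    lo = -1.0 / h.max() + 1e-10
    hi = -1.0 / h.min() - 1e-10
    lam = brentq(lambda l: np.sum(h / (1.0 + l * h)), lo, hi)
    return np.sum(np.log(1.0 / (n * (1.0 + lam * h))))

rng = np.random.default_rng(9)
x_obs = rng.normal(1.0, 1.0, size=100)
for phi in (0.5, 1.0, 1.5):
    print(phi, emp_log_lik(x_obs, phi))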
Bayesian computation via empirical likelihood, BCel Once the function h in (3.9) is known,
we can compute the value of L el (φ | x obs ) with (3.8) at any value of φ. Solving the optimization
problem can be done with an efficient Newton-Lagrange algorithm (the constraint is linear
in p). The method we have proposed in (A07) Mengersen, Pudlo, and Robert (2013) replaces
the true likelihood with the empirical likelihood in samplers from the posterior distribution.
It is well known, see Owen (2010), that empirical likelihoods are not normalized, hence we refrain from using them to compute the evidence or the posterior probability of each competing model in the model choice framework. Moreover, the paper presents two samplers from the
approximation of the posterior. The first one draws the parameters from the prior and weights
them with the empirical likelihood; the second one relies on AMIS, see below in Section 3.2, to
learn gradually how proposed values of φ should be drawn.
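A sketch of the first of these samplers, under the same assumptions: draw φ from its prior and weight each draw by its empirical likelihood, reusing the emp_log_lik function of the previous sketch (the prior and the data are again illustrative).

# Sketch of the prior-weighting BCel sampler (uses emp_log_lik from the previous sketch).
import numpy as np

rng = np.random.default_rng(10)
x_obs = rng.normal(1.0, 1.0, size=100)
phi_draws = rng.normal(0.0, 5.0, size=5_000)              # draws from an illustrative prior
log_w = np.array([emp_log_lik(x_obs, phi) for phi in phi_draws])
w = np.exp(log_w - log_w.max())                           # stabilized weights
w /= w.sum()
print(np.sum(w * phi_draws), 1.0 / np.sum(w ** 2))        # posterior mean and effective sample size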
Empirical likelihood in population genetics It was a real challenge to use the empirical
likelihood in population genetics, even if we assumed that the data are composed of iid blocks,
each one corresponding to the genetic data at some locus. Actually empirical likelihoods have
already been used to produce robust confidence intervals of parameters from a likelihood
L(φ, x i ). In this case, the function h in the constraint (3.9) is the score function
h(x i , φ) = ∇φ log L(φ, x i ),
whose zero is the maximum likelihood estimator. But, in population genetics, interesting
parameters such as dates of divergence between populations, effective population sizes,
mutation rates etc. cannot be expressed as moments of the sampling distribution at a given
locus; and additionally they are generally parameters of an intractable likelihood. Hence
we rely on another ersatz of the likelihood which is also used in the context of intractable
likelihoods: composite likelihoods (Lindsay, 1988), and composite score functions. At the i-th locus, since the information is mainly in the dependency between individuals of the sample x_i, we rely on the pairwise composite likelihood, which we can compute explicitly. Indeed, the pairwise composite likelihood of the data x_i at the i-th locus is a product of likelihoods when the dataset is restricted to some pair of genes from the sample:

L_comp(x_i | φ) = ∏_{a,b} L(x_i(a), x_i(b) | φ),

where {a, b} ranges over the set of pairs of genes, and x_i(a) (resp. x_i(b)) is the allele carried by gene a (resp. b) at the i-th locus. When there are only two genes in a sample, the combinatorial complexity due to the gene genealogy disappears and each L(x_i(a), x_i(b) | φ) is known explicitly. Thus, we can compute explicitly the composite score function at each locus, then set

h(x_i, φ) = ∑_{a,b} ∇_φ log L(x_i(a), x_i(b) | φ) = ∇_φ log L_comp(x_i | φ),

and rely on the empirical likelihood on the whole dataset composed of n loci.
The paper ((A07) Mengersen, Pudlo, and Robert, 2013) also includes synthetic examples with samples from two or three populations genotyped at one hundred independent microsatellite loci. Replacing the intractable likelihood with (3.8) in Monte Carlo algorithms computing the posterior distribution of φ, we obtained an inference algorithm much faster than ABC (no need to simulate from the model). Bayesian estimators such as the posterior expectation, and the coverage of credible intervals, were at least as accurate as those obtained with ABC. In particular, the coverage of credible intervals was equal to or larger than their nominal probability. Hence, in the presence of iid replicates in the data, the empirical likelihood can be used as a black box to adjust the composite likelihood and obtain appropriate inference; in other words, the empirical likelihood methodology appears as a major competitor to the calibration of composite likelihoods proposed by Ribatet et al. (2012).
In the theoretical works of Owen (2010), there is no assumption on the function h of (3.9). When φ is a vector of parameters, h takes values in a vector space of the same dimension as φ. Quite surprisingly, our simulation studies performed badly when the coordinates of h were not scaled properly with respect to each other. Gradients of criteria such as the log-likelihood or the composite log-likelihood provide a natural way to scale each coordinate of h, and this scaling performs quite well. But we have no theoretical justification for this.
3 Importance sampling
Importance sampling is an old trick to approximate or compute numerically an expected value or an integral in small or moderate dimensional spaces. Suppose that we aim at computing

E[ψ(X)] = ∫ ψ(x) Π(dx),    (3.10)

where Π(·) is the distribution of X under the probability measure P. Let Q(·) denote another distribution on the same space, such that |ψ(x)|Π(dx) is absolutely continuous with respect to Q(·), i.e. |ψ|Π ≪ Q. Then the target integral can be written as

∫ ψ(x) Π(dx) = ∫ ψ̃(x) Q(dx),    where ψ̃(x) = [d(ψΠ)/dQ](x)

is the Radon-Nikodym derivative of the signed measure ψΠ with respect to Q. Note that, if Π and Q both have densities with respect to a reference measure µ, i.e. Π(dx) = π(x)µ(dx) and Q(dx) = q(x)µ(dx), then we can compute the derivative in practice as ψ̃(x) = ψ(x)π(x)/q(x). If additionally the X_i's are sampled independently from Q(·), then the average

(1/n) ∑_{i=1}^{n} ψ̃(X_i)    (3.11)

is the importance sampling estimator of (3.10) with importance (or instrumental, or proposal, or sampling) distribution Q. This approximation is useful in moderate dimensional spaces to reduce the variance of a crude Monte Carlo estimate based on a sample from Π, or when
it is difficult to sample from Π. From another viewpoint, maybe the most significant when Π is the posterior distribution of a Bayesian model, we can also use importance sampling to approximate the distribution Π itself. Indeed, if Π ≪ Q, and if ω(x) is the Radon-Nikodym derivative of Π with respect to Q, then the empirical distribution

(1/n) ∑_{i=1}^{n} ω(X_i) δ_{X_i}

provides a Monte Carlo approximation of the distribution Π(·), and integrals as in (3.10) can be estimated by replacing Π with this empirical distribution.
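A minimal numerical sketch of the estimator (3.11): Π = N(0, 1), proposal Q = N(0, 2²) and ψ(x) = x², so the target integral equals 1; the self-normalized variant at the end is the one used when π is only known up to a constant.

# Sketch of importance sampling: estimate E[psi(X)] under Pi using draws from Q.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(11)
n = 100_000
x = rng.normal(0.0, 2.0, size=n)                         # X_i ~ Q = N(0, 2^2)
w = norm.pdf(x, 0.0, 1.0) / norm.pdf(x, 0.0, 2.0)        # Radon-Nikodym derivative pi / q
psi = x ** 2

print(np.mean(w * psi))                    # estimator (3.11), should be close to 1
print(np.sum(w * psi) / np.sum(w))         # self-normalized version (weighted empirical measure)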
The calibration of the proposal distribution Q is paramount in most implementations of the importance sampling methodology if one wants to avoid estimates with large (if not infinite) variance. As explained in many textbooks, when ψ(x) is non-negative for all x, the best sampling distribution is

Q(dx) = ψ(x) Π(dx) / ∫ ψ(z) Π(dz),    (3.12)

which leads to a zero-variance estimator. Yet the best distribution depends on the unknown integral, and simulating from this distribution is generally more demanding than the original problem of computing (3.10). There exist various classes of adaptive algorithms to learn online how to adapt the proposal distribution; the adaptive multiple importance sampling (AMIS)
of Cornuet et al. (2012a) is one of the most efficient when the likelihood is complex to compute,
or when it should be approximated via another Monte Carlo algorithm. Section 3.1 is devoted
to importance sampling methods to integrate over the latent processes and approximate the
likelihood. Section 3.2 presents theoretical results on AMIS which can be used with the latter
approximation of the likelihood to compute the posterior distribution.
3.1 Computing the likelihood
Framework Since the seminal paper of Stephens and Donnelly (2000), importance sampling has been used to integrate over all possible realizations of the latent process and compute the likelihood in population genetics. Forward in time, the coalescent-based model can be described as a pure jump, measure-valued Markov process where Z(t) describes the genetic ancestors of the sample at time t, see Section 4. The observed sample (composed of n individuals, or genes) is modeled by the distribution of Z(σ − 0) where σ is the optional time defined as the first time at which Z(t) is composed of (n + 1) genes. Thus, if {X(k)} is the embedded Markov chain of the pure jump process, and τ the stopping time defined as the first time at which X(k) is composed of (n + 1) genes, the likelihood of the parameter φ is

f(x_obs | φ) = P_φ( X(τ − 1) = x_obs ),    (3.13)
that can be written as an integral of some indicator function with respect to the distribution of the embedded Markov chain:

f(x_obs | φ) = ∑_{k=1}^{∞} ∑_{x_0} · · · ∑_{x_k} 1{ x_{k−1} = x_obs, card(x_k) = n + 1 } p_0(x_0) ∏_{i=0}^{k−1} Π(x_i, x_{i+1}),    (3.14)

where the x_i range over the set of counting measures on the set of possible alleles, p_0 is the initial distribution of the embedded Markov chain and Π its transition matrix.
The best importance distribution (3.12) is the distribution of the embedded chain conditioned
by the event {X (τ − 1) = x obs }. Reversing time, this is the distribution of a process starting
from x_obs. But due to the combinatorial complexity of the coalescent, we are generally unable to simulate from the distribution of the reversed Markov chain. Stephens and Donnelly (2000) proposed an approximation of the conditioned, reversed chain which serves as importance distribution. In the specific case of “parent-independent” mutation, the latter corresponds exactly to the reversed chain, which leads to a zero-variance approximation. The resulting
algorithm falls into the class of sequential importance sampling (SIS), see Algorithm 4. Nevertheless, the biological framework of the seminal paper is limited as the authors consider
only a single population at equilibrium, i.e., with a constant population size. De Iorio and
Griffiths (2004a,b) have extended the biological scope to samples from different populations
linked by migrations between them. Among other terms that we do not describe precisely here, the importance distribution they proposed depends on some matrix inversion. De Iorio
et al. (2005) have replaced this matrix inversion by an explicit formula when the mutation
process of microsatellite loci is the stepwise mutation model (SMM).
Algorithm 4: Sequential importance sampling
Input: the transition matrix Q of the importance distribution (backward in time);
       the transition matrix Π of the model (forward in time);
       the equilibrium distribution p_0 of the mutation model;
       the observed data x_obs
set L ← 1;
set x ← x_obs;
while sample size of x ≥ 2 do
    draw y according to Q(x, ·);
    set L ← L × Π(y, x) / Q(x, y);
    set x ← y;
end
set L ← L × p_0(x);
Output: the estimate L
Our work on populations in disequilibrium When considering population sizes that vary over time, the above Markov process {Z(t)} becomes inhomogeneous: transition rates depend on the population size Ne(t) at time t. In (A14) Leblois et al. (2014) we face the case
of a single population whose size grows or shrinks exponentially on some interval of time.
The importance distribution is now an inhomogeneous Markov process, backward in time,
starting from the observed data. Moreover the explicit formula of De Iorio et al. (2005) replacing numerical matrix inversion has been extended to a more complex mutation model on
microsatellite loci called the generalized stepwise model (GSM). The motivation of the switch
to a GSM model is that misspecification of the mutation model can lead to false bottleneck
signals, as is also shown in the paper. The efficiency of the algorithm (in terms of variance of the likelihood estimates) decreases when moving away from homogeneity of the process. One sometimes hears the claim that importance sampling is inefficient in high dimensional spaces because the variance of the likelihood blows up. There is certainly some truth to this, and it is a well known problem of sequential importance sampling for long realizations of a Markov chain. See e.g. Proposition 5.5 in Chapter XIV of Asmussen and Glynn (2007), proving that the variance of the importance sampling estimator grows exponentially with the length of the paths. Hence, to obtain a reasonable accuracy, the computational cost increases considerably.
Perspectives With my student Coralie Merle (during a graduate internship and part of the first year of her PhD scholarship), we are trying to increase the accuracy for the same computational budget by resorting to sequential importance sampling with resampling (SIR; Liu et al., 2001). The core idea is to resample N instances of Algorithm 4 at various iterations according to a function of the current weights L, and thus get rid of simulations that will not contribute to the average (3.11). Adaptive methods to tune the importance distribution can also help increase the sharpness of the likelihood estimate. Nevertheless, many adaptive algorithms are designed to approximate (3.10) for a large class of functions ψ, while here the integrand ψ is fixed to a given indicator function.
3.2 Sample from the posterior with AMIS
Being able to compute the likelihood is not the final goal of a statistical analysis; in a Bayesian framework, MCMC algorithms provide a Monte Carlo approximation of the posterior distribution, approximations of point estimators such as the posterior expected value and the posterior median, and approximate credible intervals. The pseudo-marginal MCMC algorithm
proposed by Beaumont (2003) and studied by Andrieu and Roberts (2009) and Andrieu and
Vihola (2012) replaces evaluations of the likelihood with importance sampling estimates as
presented in Section 3.1. In particular, the authors proved that, when the estimates of the
likelihood are unbiased, this scheme is exact since it provides samples from the true posterior.
But such Metropolis-Hastings algorithms can suffer from poor mixing properties: if the chain
is at some value φ of the parameter space, and if the likelihood has been largely over-estimated
(which can happen because of the large, if not infinite, variance), then the chain is stuck at this value for a very long time. Of course, one can always improve the accuracy of the likelihood estimate by increasing the number of runs of the SIS algorithm, but this also increases the time complexity. We can also replace MCMC algorithms with other Monte Carlo
methods to solve this problem.
AMIS In general, particle algorithms are less sensitive to such poor mixing properties and can better explore the different modes of posterior distributions. The adaptive multiple importance sampling (AMIS; Cornuet et al., 2012b) is a good example, which combines multiple importance sampling and adaptive techniques to draw parameters from the posterior. The sequential particle scheme is in the same vein as Cappé et al. (2008). But it introduces a recycling of the successive samples generated during the learning process of the importance distribution on the parameter space. In particular, AMIS does not throw away approximations (obtained via importance sampling in the gene genealogy space) of the likelihood at given values φ of the parameter, and can be less time consuming than other methods. On various numerical experiments where the target is the posterior distribution of some population genetics data sets, Cornuet et al. (2012b) and Sirén et al. (2010) show considerable improvements of the AMIS in effective sample size (see, e.g., Liu, 2008b, chapter 2). In such settings, where calculating the posterior density (or an estimate of it) is drastically time consuming, a recycling process makes sense. In the rest of the section, π(φ) denotes the posterior density, which is the target of the AMIS.
During the learning process, the AMIS tries successive proposal distributions from a parametric family of distributions, say Q(θ̂_1), . . . , Q(θ̂_T). Each stage of the iterative process estimates a better proposal Q(θ̂_{t+1}) by minimising a criterion such as, for instance, the Kullback-Leibler divergence between Q(θ) and the target Π, which in our context is the posterior distribution. The novelty of the AMIS is the following recycling procedure of all past simulations. At iteration t, the AMIS has already produced t samples:

φ^1_1, . . . , φ^1_{N_1} ∼ Q(θ̂_1),
φ^2_1, . . . , φ^2_{N_2} ∼ Q(θ̂_2),
. . .
φ^t_1, . . . , φ^t_{N_t} ∼ Q(θ̂_t),

with respective sizes N_1, N_2, . . . , N_t. Then the scheme derives a new parameter θ̂_{t+1} from all those past simulations. To that purpose, the weight of φ^k_i (k ≤ t, i ≤ N_k) is updated with

π(φ^k_i) / [ ∑_{ℓ=1}^{t} (N_ℓ / Ω_t) q(φ^k_i, θ̂_ℓ) ],    (3.15)

where π(φ) is the density of the posterior distribution, or an IS estimate of this density, θ̂_1, . . . , θ̂_t are the parameters generated throughout the t past iterations, x ↦ q(x, θ) is the density of Q(θ) with respect to the reference measure dx, and Ω_t = N_1 + N_2 + · · · + N_t is the total number of past particles. The importance weight (3.15) is inspired by ideas of Veach and Guibas (1995), which had been popularized by Owen and Zhou (2000) to merge several independent importance samples.
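A hedged sketch of this recycling weight for a Gaussian proposal family: each past particle is reweighted by the target density divided by the deterministic mixture of all past proposals; the target, the proposal parameters and the sample sizes are illustrative, not the implementation of the original paper.

# Sketch of the weight (3.15): pi(phi) / sum_l (N_l / Omega_t) q(phi, theta_l).
import numpy as np
from scipy.stats import norm

def amis_weights(samples, thetas, sizes, log_pi):
    omega = float(np.sum(sizes))
    mix = np.zeros(len(samples))
    for (mu, sd), n_l in zip(thetas, sizes):
        mix += (n_l / omega) * norm.pdf(samples, mu, sd)     # mixture of past proposals
    return np.exp(log_pi(samples)) / mix

rng = np.random.default_rng(12)
thetas, sizes = [(0.0, 3.0), (2.5, 1.5)], [1000, 1000]       # two past Gaussian proposals
samples = np.concatenate([rng.normal(mu, sd, size=n) for (mu, sd), n in zip(thetas, sizes)])
w = amis_weights(samples, thetas, sizes, lambda x: norm.logpdf(x, 3.0, 1.0))   # target N(3, 1)
print(np.sum(w * samples) / np.sum(w))                       # self-normalized estimate of the mean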
Our results Before our work ((A16) Marin, Pudlo, and Sedki, 2014), no proof of convergence
had been provided, neither in Cornuet et al. (2012b) nor elsewhere. It is worth noting that
the weight (3.15) introduces long memory dependence between the samples, and even a bias
which was not controlled by theoretical results. The main purpose of (A16) Marin, Pudlo, and
Sedki (2014) was to fill in this gap, and to prove the consistency of the algorithm at the cost of
a slight modification in the adaptive process. We suggested learning the new parameter θ̂_{t+1} on the last sample φ^t_1, . . . , φ^t_{N_t}, weighted with the classical π(φ^t_i)/q(φ^t_i, θ̂_t) for all i = 1, . . . , N_t. The only recycling procedure was in the final output, which merges all the previously generated samples using (3.15). Contrary to what has been done in the literature dealing with other adaptive schemes, we decided to adopt a more realistic asymptotic setting in our paper. In
Douc et al. (2007) for instance, the consistency of the adaptive population Monte Carlo (PMC)
schemes is proven assuming that the number of iterations, say T , is fixed and that the number
of simulations within each iteration, N = N1 = N2 = · · · = NT , goes to infinity. The convergence
we have proven in (A16) Marin, Pudlo, and Sedki (2014) holds when N1 , . . . , NT , . . . is a growing,
but fixed sequence and T goes to infinity. Hence the proofs of our theorems provide new
insights on adaptive PMC in that last asymptotic regime. The convergence of θbt to the target
θ ∗ relies on limit theorems on triangular arrays (see Chopin, 2004; Douc and Moulines, 2008;
Cappé et al., 2005, Chapter 9). The consistency of the final merging with weights given by
(3.15) is not a straightforward consequence of asymptotic theorems. Its proof requires the
introduction of a new weighting

π(φ^k_i) / q(φ^k_i, θ∗),    (3.16)

that is simpler to study, although biased and not explicitly computable (because θ∗ is
unknown). Under the set of assumptions given below, this last weighting scheme is consistent
(see the proposition in the last part of the paper) and is comparable to the actual weighting
given by (3.15), which yields the consistency of the modified AMIS algorithms.
Perspectives One of the restrictive assumptions that have appeared during the proof of
consistency is that the sample sizes satisfy

∑_{t=1}^{∞} 1/N_t < ∞,    (3.17)

which means that the sample size goes to infinity much faster than linearly. Following the proofs, one can see that we used a uniform Chebyshev inequality and uniform square integrability of some family of random variables to control the asymptotic behavior of θ̂_t. These results have practical consequences for the design of N_t when running AMIS, though we do not claim this assumption is necessary for consistency. Actually, we believe that we can replace the Chebyshev inequality with Chernoff or large deviation bounds to obtain consistency of θ̂_t, and then replace (3.17) with a less restrictive assumption. The price to pay then is uniform exponential integrability of the same family of random variables.
Besides, the proof techniques we have developed can lead to new results on adaptive PMC. Nevertheless, the setting of such adaptive algorithms is quite different, as they do not target an ideal proposal distribution Q(θ∗) but claim to approximate a sequence of ideal proposal distributions Q(θ∗_t) at each stage of the iterative algorithm. The sequence θ∗_t is defined recursively, namely

θ∗_{t+1} = ∫ h(x, θ∗_t) Π(dx),

and the above integral is approximated with an importance sampling estimate at each iteration of the sequential algorithm. Such settings are obviously more complex than the one we considered in (A16) Marin, Pudlo, and Sedki (2014).
4 Inference in neutral population genetics
Under neutrality, genetic evolution is modeled by complex stochastic models (in particular Kimura (1968)'s diffusion and Kingman (1982)'s coalescent) that take into account mutations and genetic drift simultaneously. Answering important biological questions (such as assessing a migration rate, detecting a contraction in population size, dating the foundation of a population or other important events) is often a delicate methodological issue. Such subtle models allow us to discriminate between confusing effects like, for instance, misspecification of the mutation model vs a contraction in population size (a bottleneck). The statistical issues fall into the class of inference on hidden or latent processes, because the genealogy (which is a graph that represents the genetic kinship of the sample), the mutation dates and the ancestral genotypes are not observed directly. To shed some light on the difficulties that occur in my favorite fields of application of the methods I have developed, we start with a short description of the stochastic models in Sections 4.1 and 4.2 and we end the section with a few perspectives.
Most of the models exposed in this last part are implemented in the DIYABC software ((A12) Cornuet et al., 2014). This user-friendly program permits a comprehensive analysis of population history using ABC on genetic data, offers various panels to set the models and the prior distributions, and includes several ABC analyses. In order to stay as efficient as possible, the Monte Carlo simulations and numerical computations have been implemented in C++. We have made an effort to take advantage of modern multi-core or multi-CPU computer architectures and have parallelized the computations.
4.1 A single, isolated population at equilibrium
Many statistical models (linear regression, mixed models, linear discriminant analysis, time series, Brownian motion, stochastic differential equations...) are based on the Gaussian distribution, the latter being the limit distribution of empirical averages in many settings. When it comes to modeling the kinship of a sample of individuals, Kingman (1982)'s coalescent plays this role of standard distribution. In both cases, the Gaussian distribution and Kingman's coalescent are far from being universal distributions. We can mention the Λ-coalescent, which allows multiple collisions, and refer the reader to the review of Berestycki (2009), though it has rarely been used to analyse genetic datasets.
Algorithm 5: Simulation of the coalescent
Input: number n of genes in the sample;
       effective size Ne of the population
set time t ← 0;
set the ancestral sample size k ← n;
while k ≥ 2 do
    draw T_k from the exponential distribution with rate k(k − 1)/(2Ne);
    set time t ← t + T_k;
    choose the pair of genes that coalesce at random amid the k(k − 1)/2 pairs;
    set k ← k − 1;
end
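Here is a direct, illustrative Python transcription of Algorithm 5; it also records which pair of lineages coalesces at each event, so that a dendrogram such as the one of Figure 3.1 can be reconstructed. The labels and the value of Ne are arbitrary.

# Sketch: simulate Kingman's coalescent for a sample of n genes (time in generations).
import numpy as np

def kingman_coalescent(n, Ne, rng):
    t = 0.0
    lineages = list(range(1, n + 1))            # labels of the current ancestral lineages
    events = []
    k = n
    while k >= 2:
        t += rng.exponential(2.0 * Ne / (k * (k - 1)))       # T_k with rate k(k-1)/(2 Ne)
        i, j = rng.choice(k, size=2, replace=False)          # pair chosen uniformly at random
        a, b = lineages[i], lineages[j]
        events.append((t, (a, b)))
        lineages = [x for idx, x in enumerate(lineages) if idx not in (i, j)] + [(a, b)]
        k -= 1
    return events                                            # the last event creates the MRCA

rng = np.random.default_rng(13)
for time, pair in kingman_coalescent(5, Ne=1000, rng=rng):
    print(f"t = {time:8.1f}  coalescence of {pair}")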
The coalescent For the sake of clarity, let us begin with a single closed population of constant size Ne (i.e., at equilibrium) and a sample of n haploid individuals from this population (i.e. a sample of n/2 diploid individuals). As is often the case in population genetics, we will call these haploid individuals “genes” (in particular here, the word “gene” does not mean a coding sequence of DNA). In the following paragraphs, we do not intend to give a comprehensive description of genetic material: we neglect sex chromosomes, mitochondrial DNA, etc., and will focus on
the autosomal genome. Backward in time, pairs of genes find common ancestors until they
reach the most recent common ancestor (MRCA) of the whole sample. Kingman’s coalescent
stays as simple as possible: it is a memoryless process and genes are exchangeable. Hence
times between coalescences are exponentially distributed and the pair that coalesces is chosen
at random amid the k(k − 1)/2 pairs of genes in the ancestral sample of size k ≤ n. Moreover
the population size Ne influences the rate at which pairs coalesce: it is more complex to find a
common ancestor in large populations than in small ones and the average time for a given
pair of genes before coalescence is Ne . Whence the simulation of Kingman’s coalescent in
Algorithm 5. A realization of the whole process is often displayed as a dendrogram whose tips represent the observed sample and whose root is the most recent common ancestor of the sample, see Figure 3.1.
Figure 3.1 – Gene genealogy of a sample of five genes numbered from 1 to 5. The inter-coalescence times T5, . . . , T2 are represented on the vertical time axis.
Mutations Over time, the genes mutate within a set A of possible alleles. The mutation process is
commonly described by a Markov transition matrix Qmut and a mutation rate µ per gene
per time unit. If we follow a lineage forward in time, from the MRCA to a given tip of the
dendrogram, we face a pure jump Markov process with intensity matrix Λ = µ(Qmut − I).
When possible, these Markov processes along the lineages of the genealogy are assumed to
be at equilibrium, so that the marginal distributions of the allele of the MRCA, as well as of any gene
from the observed sample, are all equal to the stationary distribution of Qmut. Additionally, the
stationary distribution does not carry any information (regarding the parameters µ or Ne), and the
relevant information of the dataset lies in the dependence between genes.
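With the intensity matrix Λ = µ(Qmut − I), an equivalent construction places mutation events at rate µ along each branch and applies one draw from Qmut at each event. The Python sketch below follows this construction; the `step` argument is a hypothetical stand-in for one draw from Qmut and defaults to the stepwise mutation model (±1 with equal probability) used later in Figure 3.2.

import random

def mutate_along_branch(allele, branch_length, mu, step=None, rng=None):
    """Run the mutation process along one lineage: mutation events occur at
    rate mu, and each event applies one transition drawn from Q_mut (here
    encoded by the callable `step`)."""
    rng = rng or random.Random()
    if step is None:
        # stepwise mutation model: +1 or -1 with equal probability
        step = lambda a: a + rng.choice((-1, 1))
    t = rng.expovariate(mu)
    while t < branch_length:
        allele = step(allele)
        t += rng.expovariate(mu)
    return allele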
It is worth noting that the above description of the stochastic model directly provides a
simulation algorithm, with a first stage backward in time to build the gene genealogy, followed
by a second stage forward in time which simulates the Markov mutation processes along the lineages.
Yet the whole model can also be described as a measure-valued process {Z(t)} forward in time.
In this setting, Z(t) is a counting measure on A which describes the ancestral sample at
time t. In particular, Z(0) is a Dirac mass that puts mass one on the allele type of the MRCA.
The process {Z(t)} is still a pure jump Markov process, and it is explosive because Kingman's
coalescent comes down from infinity, see e.g. Berestycki (2009). Yet, if σ is the stopping
time defined as the first time the total mass of Z(t) hits (n + 1), then Z(σ − 0), the last measure
visited before σ, models the genetic sample of size n. The intensity matrix of the forward-in-time
process can be written explicitly. Reversing time, the measure-valued process is still
a pure jump Markov process, but its intensity matrix cannot be computed explicitly because of the
dependence between branches of the genealogy (only pairs of genes carrying the same allele
can coalesce).

[Figure 3.2 appears here: dendrogram of eight genes; the MRCA carries allele 100 and ±1 mutations along the branches lead to tip genotypes between 96 and 103.]
Figure 3.2 – Simulation of the genotypes of a sample of eight genes. As for microsatellite loci with the stepwise mutation model, the set of alleles is an interval of integers A ⊂ N. The mutation process Qmut adds +1 or −1 to the genotype with equal probability. Once the genealogy has been drawn, the MRCA is genotyped at random (here 100) and the mutation Markov process is run along the vertical lines of the dendrogram. For instance, the red and green lines are the lineages from the MRCA to genes number 2 and 4 respectively.
Finally, the sample is often composed of multilocus data, which means that the individuals
have been genotyped at several positions of their genome. If these loci lie on different
chromosomes, or if they are distant enough on the same chromosome, recombination allows us
to safely assume independence between the loci. Note that this implies
that the gene genealogies of different loci are independent.
4.2 Complex models
Answering important biological questions regarding the evolution of a given species requires
much more complex models than the one presented in Section 4.1, though this simplest model
serves as the foundation stone of the others.
Varying population size To study changes in population size, we let Ne(t) be a function of time.
Because the time of the most recent common ancestor is unknown, while
the date at which we have sampled the individuals from the population is known, the function
Ne(t) describes the population size backward in time (t = 0 is the sampling date). The Markov
properties of the gene genealogy remain, but the process becomes inhomogeneous as the
jump rates depend on time. Algorithm 5 can be adapted with a Gillespie-type thinning algorithm as long as
we can explicitly bound from above the jump rate k(k − 1)/(2Ne(s)) after the current time t, i.e. for
any s ≥ t, which amounts to bounding Ne(s) from below; see Algorithm 6. The function t ↦ Ne(t) can be, for example, a piecewise constant
function or, as in (A14) Leblois et al. (2014), a continuous function which remains constant
except on a given interval of time where the size increases or decreases exponentially.
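As an illustration of the second case, here is a sketched Ne(t) of the kind used in (A14) Leblois et al. (2014), together with the lower bound B(t) required by Algorithm 6 below. The parameter names (N_current, N_ancestral, t_start, t_end) are our own and only indicative; this is a sketch, not the parametrization of the cited paper.

import math

def Ne_exponential_change(t, N_current, N_ancestral, t_start, t_end):
    """Backward-in-time population size: N_current before t_start, N_ancestral
    after t_end, and exponential (hence continuous) change in between."""
    if t <= t_start:
        return N_current
    if t >= t_end:
        return N_ancestral
    frac = (t - t_start) / (t_end - t_start)
    # log-linear interpolation between the two sizes
    return N_current * math.exp(frac * math.log(N_ancestral / N_current))

def Ne_lower_bound(t, N_current, N_ancestral, t_start, t_end):
    """Lower bound B(t) of Ne(s) over s >= t, as required by Algorithm 6:
    since Ne is monotone on [t_start, t_end] and constant elsewhere, the
    infimum over [t, +infinity) is reached either at t or in the ancestral
    regime."""
    return min(Ne_exponential_change(t, N_current, N_ancestral, t_start, t_end),
               N_ancestral)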
Algorithm 6: Simulation of the gene genealogy with varying population size
Input: Number n of genes in the sample;
       A procedure that computes the effective size Ne(t) of the population;
       A procedure that computes a lower bound B(t) of Ne(s) for all s ≥ t
set time t ← 0;
set the ancestral sample size k ← n;
while k ≥ 2 do
    compute the bound B ← B(t);
    draw T from the exponential distribution with rate k(k − 1)/(2B);
    set time t ← t + T;
    draw U from a uniform distribution on [0; 1);
    if U ≤ B/Ne(t) then
        choose the pair of genes that coalesce at random amid the k(k − 1)/2 pairs;
        set k ← k − 1;
    end
end
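The following Python sketch mirrors Algorithm 6, taking Ne and B as callables; recall that B(t) must be a lower bound of Ne(s) for all s ≥ t, so that the proposal rate k(k − 1)/(2B(t)) dominates the true coalescence rate and candidate times can be thinned. Again, this is only an illustrative sketch.

import random

def coalescent_varying_size(n, Ne, B, rng=None):
    """Thinning simulation of the inhomogeneous coalescent (Algorithm 6).
    Ne(t) is the backward-in-time population size and B(t) a lower bound of
    Ne(s) for all s >= t.  Returns the list of (time, pair of indices)."""
    rng = rng or random.Random()
    t, k, events = 0.0, n, []
    while k >= 2:
        bound = B(t)
        # candidate waiting time at the dominating rate k(k-1)/(2 B(t))
        t += rng.expovariate(k * (k - 1) / (2.0 * bound))
        # accept the candidate with probability B/Ne(t) <= 1 (thinning step)
        if rng.random() <= bound / Ne(t):
            events.append((t, tuple(rng.sample(range(k), 2))))
            k -= 1
    return events

For instance, one may pass Ne = lambda t: Ne_exponential_change(t, 1000.0, 100.0, 0.0, 500.0) and B = lambda t: Ne_lower_bound(t, 1000.0, 100.0, 0.0, 500.0), reusing the sketches above; the numerical values are arbitrary.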
More than one population We consider here the class of models implemented in
the software DIYABC ((A12) Cornuet et al., 2014), see also Donnelly and Tavare (1995). An
example is given in Figure 3.3. In this setting, the history of the sampled populations is
described (backward in time) in terms of punctual events, divergences and admixtures, until
a single ancestral population is reached. Locally in each population of the history, between
those punctual events, the gene genealogy is built following the Markov process given
in Section 4.1 if the population size is constant, and following the above algorithm when the
population size varies. Once we reach the date of a punctual event, we have to move the ancestral
genes from one population to another according to this event.
• If the event says that, forward in time, a new population named B has diverged from
a population named A at this date, then, backward in time, the ancestral sample of
population B is moved into population A.
• If the event says that, forward in time, a new population named C is an admixture
between populations A and B, then, reversing time, the ancestral genes that are in
population C are sent to population A with probability r and to population B with
probability 1 − r, where r is a parameter of the model (see the sketch below).
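A minimal Python sketch of these two backward-in-time updates is given below; the dictionary-of-lists representation of the ancestral samples is our own illustrative choice and is not the data structure used in DIYABC.

import random

def apply_divergence(pops, new_pop, source_pop):
    """Backward in time, at a divergence event the ancestral sample of the
    population that appeared forward in time (new_pop) is merged into the
    population it diverged from (source_pop)."""
    pops[source_pop].extend(pops.pop(new_pop))

def apply_admixture(pops, admixed_pop, pop_a, pop_b, r, rng=None):
    """Backward in time, at an admixture event each ancestral gene of the
    admixed population is sent to pop_a with probability r and to pop_b
    with probability 1 - r."""
    rng = rng or random.Random()
    for gene in pops.pop(admixed_pop):
        (pops[pop_a] if rng.random() < r else pops[pop_b]).append(gene)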
4.3 Perspectives
The challenge we face in population genetics is a drastic increase in the size of datasets, due to
next generation sequencing (NGS). This sequencing technology produces datasets including up to millions of SNP loci. The interest of such high-dimensional data is the amount of
information they carry. But developing methods to reveal this information requires substantial
work on inference algorithms. Gold standard methods such as sequential importance sampling (when the evolutionary scenario is relatively simple) or ABC are limited in the number
of loci they can analyse. In the paragraphs below, we sketch a few research directions to scale
inference algorithms to NGS data.
Understanding and modelling SNP loci
SNP mutations Two models coexist in the literature to explain SNP data. The first and
simplest model, which can be simulated with Hudson's algorithm, considers that the gene
genealogy of a SNP locus is given by Kingman's coalescent and that one and only one
mutation event occurs during the past history of the gene sample at a given locus. The second
model assumes that, at each base pair of the DNA strand, a gene genealogy is drawn following
Kingman's coalescent, independently of the gene genealogies of the other base pairs.
Then, mutations are placed at random on the branches of the genealogies at some
very low rate. Most of the gene genealogies will not carry any mutation event and the others
will carry only one mutation event (hence the presence of bi-allelic SNP loci).
The gene genealogies with mutation event(s) are characterized by a total branch length which
is larger than that of genealogies without mutation. The probability distribution of gene genealogies
with mutation event(s) is often referred to in the literature as the distribution of Unique Event Polymorphism (UEP)
genealogies (see for instance Markovtsova et al., 2000). The Hudson and UEP models are
clearly different. Therefore, simulating data with one model and estimating parameters with
the other leads to a bias due to misspecification of the model (independently
of the effect of other biases). Markovtsova et al. (2001) discussed the consequences of this
misspecification in the particular context of an evolutionary neutrality test assuming a simple
demographic scenario with a single population and an infinite-sites mutation model. The
paper of Markovtsova et al. (2001) provoked a flurry of responses and comments, which globally
suggest that the Hudson approximation is correct, at least for the tests that were carried out under
infinite-sites models, provided that some conditions on the parameters of the mutation model are
satisfied. Additional work is certainly needed to investigate the effect of this misspecification
bias on SNP data when more complex demographic histories involving several populations
are considered.
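To fix ideas, the sketch below derives one SNP locus under the first (Hudson-type) model from a genealogy such as the one returned by the simulate_kingman sketch given earlier. Placing the unique mutation on a branch chosen with probability proportional to its length is our own assumption (it corresponds to conditioning a low-rate mutation process on exactly one event); this detail is not spelled out in the text above.

import random

def snp_from_genealogy(events, n, rng=None):
    """Derive one bi-allelic SNP from a gene genealogy: a single mutation is
    placed on a branch chosen with probability proportional to its length,
    and all sampled genes below that branch carry the derived allele (coded
    1).  `events` is a list of coalescence events (time, child_i, child_j,
    ancestor) as returned, e.g., by the simulate_kingman sketch above."""
    rng = rng or random.Random()
    birth = {g: 0.0 for g in range(1, n + 1)}     # time at which each node appears
    leaves = {g: {g} for g in range(1, n + 1)}    # sampled genes below each node
    branches = []                                 # (node, length of the branch above it)
    for t, gi, gj, anc in events:
        branches.append((gi, t - birth[gi]))
        branches.append((gj, t - birth[gj]))
        birth[anc] = t
        leaves[anc] = leaves[gi] | leaves[gj]
    # choose the mutated branch with probability proportional to its length
    u = rng.random() * sum(length for _, length in branches)
    mutated = branches[-1][0]
    for node, length in branches:
        u -= length
        if u <= 0:
            mutated = node
            break
    return [1 if g in leaves[mutated] else 0 for g in range(1, n + 1)]

For instance, snp_from_genealogy(simulate_kingman(20, 1000.0), 20) returns the 0/1 pattern of one simulated SNP for a sample of twenty genes (the numerical values are arbitrary).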
[Figure 3.3 appears here: schematic of the evolutionary scenario, with populations Pop1 to Pop6, divergence and admixture events at times t1, . . . , t6, admixture rates r and s, and population sizes Ne1, . . . , Ne6 and Ne 40.]
Figure 3.3 – Example of an evolutionary scenario: four populations Pop1, . . . , Pop4 have been sampled at time t = 0. Branches of the history can be considered as tubes in which the gene genealogy should be drawn. The historical model includes two unobserved populations (Pop5 and Pop6) and fifteen parameters: six dates t1, . . . , t6, seven population sizes Ne1, . . . , Ne6 and Ne 40, and two admixture rates r, s.
Linkage disequilibrium and time scale Autosomal loci are commonly considered as independent as long as they are not too numerous. Thus each locus has its own gene genealogy, independently
of the others, and the likelihood factorizes as a product over loci. When the number of loci increases, this assumption becomes debatable: genetic markers become dense along the genome,
and recombination events between neighbouring markers become rare. The stochastic model that explains
the hidden genealogies along the genome is the ancestral recombination graph (Griffiths and
Marjoram, 1997), which has been approximated by Markov processes along the genome (Wiuf
and Hein, 1999; McVean and Cardin, 2005; Marjoram and Wall, 2005). But very few inference
algorithms are based on these models; the first attempts to design such methods are limited in
the number of individuals they can handle (Hobolth et al., 2007; Li and Durbin, 2011). Yet
modelling recombination would make it possible to retrieve a natural time scale for the parameters of
interest. Indeed, these models are often over-parametrized: there is no way to set the time scale of a
latent genealogy from current genetic data alone, but a Bayesian model can provide information
on this time scale through the prior distribution of the recombination rate.
Infer and predict
Sequential importance sampling (SIS) algorithms These are currently the most accurate
methods for parameter inference, but also the most intensive in terms of computation time,
and their implementation is limited to realistic, but still simple, evolutionary scenarios. We
face major obstacles in scaling these algorithms to NGS data sizes: the lack of efficient proposal
distributions in demographic models which are not at equilibrium, and the slow calculation
speed. Additionally, a major drawback of their accuracy is that they are very sensitive to
misspecification of the model. The future of such methods on high-dimensional data is
compromised, unless we make a huge leap forward in computation time.
ABC algorithms This class of algorithms is currently one of the gold standard methods to
conduct a statistical analysis in complex evolutionary settings including numerous populations.
They draw strength and flexibility from the fact that they require only (1) an efficient simulation
algorithm for the stochastic model and (2) our ability to summarize the relevant information
in numerical statistics of smaller dimension than the original data. Yet major challenges
remain to adapt this class of inference procedures to NGS data, in particular the design and the number of
summary statistics, though we have mitigated this issue by resorting to random forests for
model choice. Moreover, machine-learning-based algorithms such as ABC-EP (Barthelmé and
Chopin, 2011) can take advantage of the specific structure of genetic data, split into independent
or Markov-dependent loci. ABC algorithms can also approximate the posterior predictive
distribution of genetic polymorphism under neutrality with complex demographic models.
This can help detect loci under selection as outliers of the posterior predictive distribution.
BCel algorithms Finally, this class of inference methods is one of the most promising for
handling large NGS datasets efficiently, as we have shown in (A07) Mengersen, Pudlo, and Robert
(2013). To turn it into a routine procedure for analyzing data under complex evolutionary
scenarios, we need to (1) better understand SNP mutation models and recombination, as
explained in Section 4.3, and (2) extend the explicit formulas for the
pairwise composite likelihood to these models.
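As a reminder of the quantity mentioned in point (2), here is a generic sketch of a pairwise composite log-likelihood in the sense of Lindsay (1988). The function pair_loglik, returning the exact log-likelihood of a pair of observations, is a hypothetical user-supplied ingredient; nothing in this snippet is specific to the genetic models discussed above.

from itertools import combinations

def pairwise_composite_loglik(data, theta, pair_loglik):
    """Pairwise composite log-likelihood: the sum, over all pairs of observed
    units (e.g. pairs of genes at a locus), of the exact log-likelihood of
    the pair under parameter theta."""
    return sum(pair_loglik(x, y, theta) for x, y in combinations(data, 2))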
Bibliography
(A01) Pudlo, P. (2009). Large deviations and full Edgeworth expansions for finite Markov chains
with applications to the analysis of genomic sequences. ESAIM: Probab. and Statis. 14,
435–455.
(A02) Pelletier, B. and P. Pudlo (2011). Operator norm convergence of spectral clustering on
level sets. The Journal of Machine Learning Research 12, 385–416.
(A03) Arias-Castro, E., B. Pelletier, and P. Pudlo (2012). The normalized graph cut and Cheeger
constant: from discrete to continuous. Advances in Applied Probability 44(4), 907–937.
(A04) Cadre, B., B. Pelletier, and P. Pudlo (2013). Estimation of density level sets with a given
probability content. Journal of Nonparametric Statistics 25(1), 261–272.
(A05) Marin, J.-M., P. Pudlo, C. P. Robert, and R. Ryder (2012). Approximate Bayesian computational methods. Statistics and Computing 22(6), 1167–1180.
(A06) Estoup, A., E. Lombaert, J.-M. Marin, C. Robert, T. Guillemaud, P. Pudlo, and J.-M.
Cornuet (2012). Estimation of demo-genetic model probabilities with Approximate Bayesian
Computation using linear discriminant analysis on summary statistics. Molecular Ecology
Ressources 12(5), 846–855.
(A07) Mengersen, K. L., P. Pudlo, and C. P. Robert (2013). Bayesian computation via empirical
likelihood. Proc. Natl. Acad. Sci. USA 110(4), 1321–1326.
(A08) Gautier, M., J. Foucaud, K. Gharbi, T. Cezard, M. Galan, A. Loiseau, M. Thomson, P. Pudlo,
C. Kerdelhué, and A. Estoup (2013). Estimation of population allele frequencies from next-generation sequencing data: pooled versus individual genotyping. Molecular Ecology 22(4),
3766–3779.
(A09) Gautier, M., K. Gharbi, T. Cezard, J. Foucaud, C. Kerdelhué, P. Pudlo, J.-M. Cornuet, and
A. Estoup (2012). The effect of RAD allele dropout on the estimation of genetic variation
within and between populations. Molecular Ecology 22(11), 3165–3178.
(A10) Ratmann, O., P. Pudlo, S. Richardson, and C. P. Robert (2011). Monte Carlo algorithms for
model assessment via conflicting summaries. Technical report, arXiv preprint 1106.5919.
(A11) Sedki, M., P. Pudlo, J.-M. Marin, C. P. Robert, and J.-M. Cornuet (2013). Efficient learning
in ABC algorithms. Technical report, arXiv preprint 1210.1388.
(A12) Cornuet, J.-M., P. Pudlo, J. Veyssier, A. Dehne-Garcia, M. Gautier, R. Leblois, J.-M. Marin,
and A. Estoup (2014). DIYABC v2.0: a software to make Approximate Bayesian Computation
inferences about population history using Single Nucleotide Polymorphism, DNA sequence
and microsatellite data. Bioinformatics 30(8), 1187–1189.
(A13) Baragatti, M. and P. Pudlo (2014). An overview on Approximate Bayesian Computation.
ESAIM: Proc. 44, 291–299.
(A14) Leblois, R., P. Pudlo, J. Néron, F. Bertaux, C. R. Beeravolu, R. Vitalis, and F. Rousset (2014).
Maximum likelihood inference of population size contractions from microsatellite data.
Molecular Biology and Evolution (to appear), 19 pages.
(A15) Stoehr, J., P. Pudlo, and L. Cucala (2014). Adaptive ABC model choice and geometric
summary statistics for hidden Gibbs random fields. Statistics and Computing, in press, 15
pages.
(A16) Marin, J.-M., P. Pudlo, and M. Sedki (2012, 2014). Consistency of the Adaptive Multiple
Importance Sampling. Technical report, arXiv preprint arXiv:1211.2548.
(A17) Pudlo, P., J.-M. Marin, A. Estoup, M. Gauthier, J.-M. Cornuet, and C. P. Robert (2014).
ABC model choice via Random Forests. Technical report, arXiv preprit 1406.6288.
Andrieu, C. and G. O. Roberts (2009). The pseudo-marginal approach for efficient Monte Carlo
computations. The Annals of Statistics 37(2), 697–725.
Andrieu, C. and M. Vihola (2012). Convergence properties of pseudo-marginal Markov chain
Monte Carlo algorithms. Technical report, arXiv preprint arXiv:1210.1484.
Asmussen, S. and P. W. Glynn (2007). Stochastic Simulation, Algorithms and Analysis, Volume 57
of Stochastic modelling and applied probability. New York: Springer.
Barthelmé, S. and N. Chopin (2011). ABC-EP: Expectation Propagation for Likelihood-free
Bayesian Computation. In L. Getoor and T. Scheffer (Eds.), ICML 2011 (Proceedings of the
28th International Conference on Machine Learning), pp. 289–296.
Baudry, J.-P., A. E. Raftery, G. Celeux, K. Lo, and R. Gottardo (2010). Combining mixture
components for clustering. Journal of Computational and Graphical Statistics 19(2), 332–
353.
Beaumont, M. A. (2003). Estimation of population growth or decline in genetically monitored
populations. Genetics 164(3), 1139–1160.
Beaumont, M. A. (2010). Approximate Bayesian Computation in Evolution and Ecology. Annu.
Rev. Ecol. Evol. Syst. 41, 379–406.
Beaumont, M. A., J.-M. Cornuet, J.-M. Marin, and C. P. Robert (2009). Adaptive approximate
Bayesian computation. Biometrika 96(4), 983–990.
Belkin, M. and P. Niyogi (2005). Towards a theoretical foundation for laplacian-based manifold
methods. In Learning theory, Volume 3559 of Lecture Notes in Comput. Sci., pp. 486–500.
Berlin: Springer.
Ben-David, S., U. von Luxburg, and D. Pal (2006). A sober look on clustering stability. In
G. Lugosi and H. Simon (Eds.), Proceedings of the 19th Annual Conference on Learning
Theory (COLT), pp. 5–19. Springer, Berlin.
Berestycki, N. (2009). Recent progress in coalescent theory. Ensaios Matematicos 16, 1–193.
Biau, G. (2012). Analysis of a random forests model. The Journal of Machine Learning Research 13(1), 1063–1095.
Biau, G., B. Cadre, and B. Pelletier (2007). A graph-based estimator of the number of clusters.
ESAIM Probab. Stat. 11, 272–280.
Biau, G., F. Cérou, and A. Guyader (2012). New insights into approximate Bayesian computation. Technical report, arXiv preprint arXiv:1207.6461.
Blum, M., M. Nunes, D. Prangle, and S. Sisson (2012). A comparative review of dimension
reduction methods in approximate Bayesian computation. Technical report, arXiv preprint
arXiv:1202.3819.
Blum, M. G. B., M. A. Nunes, D. Prangle, and S. A. Sisson (2013). A comparative review of dimension reduction methods in Approximate Bayesian computation. Statistical Science 28(2),
189–208.
Breiman, L. (2001). Random forests. Machine learning 45(1), 5–32.
Cappé, O., A. Guillin, J. M. Marin, and C. P. Robert (2004). Population Monte Carlo. J. Comput.
Graph. Statist. 13(4), 907–929.
Cappé, O., A. Guillin, J.-M. Marin, and C. P. Robert (2008). Adaptive importance sampling in
general mixture classes. Statistics and Computing 18, 587–600.
Cappé, O., E. Moulines, and T. Rydén (2005). Inference in hidden Markov models. Springer,
New York.
Chopin, N. (2004). Central Limit Theorem for Sequential Monte Carlo Methods and Its
Application to Bayesian Inference. The Annals of Statistics 32(6), 2385–2411.
Cornuet, J.-M., J.-M. Marin, A. Mira, and C. Robert (2012a). Adaptive multiple importance
sampling. Scandinavian Journal of Statistics 39(4), 798–812.
Cornuet, J.-M., J.-M. Marin, A. Mira, and C. P. Robert (2012b). Adaptive Multiple Importance
Sampling. Scandinavian Journal of Statistics 39(4), 798–812.
Cornuet, J.-M., V. Ravigné, and A. Estoup (2010). Inference on population history and model
checking using DNA sequence and microsatellite data with the software DIYABC (v1.0).
BMC Bioinformatics 11(1), 401.
Cornuet, J.-M., F. Santos, M. A. Beaumont, C. P. Robert, J.-M. Marin, D. J. Balding, T. Guillemaud,
and A. Estoup (2008). Inferring population history with DIYABC: a user-friendly approach
to Approximate Bayesian Computation. Bioinformatics 24(23), 2713–2719.
Cuevas, A., M. Febrero, and R. Fraiman (2001). Cluster analysis: a further approach based on
density estimation. Comput. Statist. Data Anal. 36, 441–459.
De Iorio, M. and R. C. Griffiths (2004a). Importance sampling on coalescent histories, I.
Advances in Applied Probability 36, 417–433.
De Iorio, M. and R. C. Griffiths (2004b). Importance sampling on coalescent histories, II.
Advances in Applied Probability 36, 434–454.
De Iorio, M., R. C. Griffiths, R. Leblois, and F. Rousset (2005). Stepwise mutation likelihood computation by sequential importance sampling in subdivided population models.
Theoretical Population Biology 68, 41–53.
Del Moral, P. (2004). Feynman-Kac formulae. Probability and its Applications (New York). New
York: Springer-Verlag. Genealogical and interacting particle systems with applications.
Del Moral, P., A. Doucet, and A. Jasra (2006). Sequential Monte Carlo samplers. J. Royal Statist.
Society Series B 68(3), 411–436.
Del Moral, P., A. Doucet, and A. Jasra (2012). An adaptive sequential Monte Carlo method for
approximate Bayesian computation. Statistics and Computing 22(5), 1009–1020.
Diaconis, P. and D. Stroock (1991). Geometric bounds for eigenvalues of Markov chains. The
Annals of Applied Probability 1(1), 36–61.
Didelot, X., R. G. Everitt, A. M. Johansen, D. J. Lawson, et al. (2011). Likelihood-free estimation
of model evidence. Bayesian analysis 6(1), 49–76.
Donnelly, P. and S. Tavare (1995). Coalescents and genealogical structure under neutrality.
Annual review of genetics 29(1), 401–421.
Douc, R., A. Guillin, J. M. Marin, and C. P. Robert (2007). Convergences of adaptive mixtures of
importance sampling schemes. The Annals of Statistics 35(1), 420–448.
Douc, R. and E. Moulines (2008). Limit theorems for weighted samples with applications to
Sequential Monte Carlo Methods. The Annals of Statistics 36(5), 2344–2376.
Doucet, A., N. de Freitas, and N. Gordon (2001). Sequential Monte Carlo Methods in Practice.
Springer-Verlag, New York.
Drovandi, C. C. and A. N. Pettitt (2011). Estimation of parameters for macroparasite population
evolution using approximate Bayesian computation. Biometrics 67(1), 225–233.
Duda, R. O., P. E. Hart, and D. G. Stork (2012). Pattern classification (2nd ed.). John Wiley &
Sons.
Ethier, S. N. and T. G. Kurtz (2009). Markov processes: characterization and convergence,
Volume 282 of Wiley series in Probability and Statistics. John Wiley & Sons.
Everitt, R. G. (2012). Bayesian parameter estimation for latent Markov random fields and social
networks. Journal of Computational and Graphical Statistics 21(4), 940–960.
Friedman, J. H. and J. J. Meulman (2004). Clustering objects on subsets of attributes (with
discussion). Journal of the Royal Statistical Society: Series B (Statistical Methodology) 66(4),
815–849.
Giné, E. and V. Koltchinskii (2006). Empirical graph Laplacian approximation of Laplace–Beltrami operators: large sample results. In High dimensional probability, pp. 238–259.
Institute of Mathematical Statistics.
Griffiths, R. C. and P. Marjoram (1997). An ancestral recombination graph. Institute for
Mathematics and its Applications 87, 257.
Hartigan, J. (1975). Clustering Algorithms. New-York: Wiley.
Hastie, T., R. Tibshirani, and J. Friedman (2001). The Elements of Statistical Learning: Data
Mining, Inference, and Prediction. New-York: Springer.
Hastie, T., R. Tibshirani, and J. Friedman (2009). The Elements of Statistical Learning: Data
Mining, Inference, and Prediction (2nd ed.). Springer.
Hobolth, A., O. F. Christensen, T. Mailund, and M. H. Schierup (2007). Genomic relationships
and speciation times of human, chimpanzee, and gorilla inferred from a coalescent hidden
Markov model. PLoS genetics 3(2), e7.
Kimura, M. (1968). Evolutionary rate at the molecular level. Nature 217, 624–626.
Kingman, J. (1982). The coalescent. Stoch. Proc. and Their Applications 13, 235–248.
Koltchinskii, V. I. (1998). Asymptotics of spectral projections of some random matrices approximating integral operators. In High dimensional probability (Oberwolfach, 1996), Volume 43
of Progr. Probab., pp. 191–227. Basel: Birkhäuser.
Li, H. and R. Durbin (2011). Inference of human population history from individual whole-genome sequences. Nature 475, 493–496.
Lindsay, B. G. (1988). Composite likelihood methods. In Statistical inference from stochastic
processes (Ithaca, NY, 1987), Volume 80 of Contemp. Math., pp. 221–239. Providence, RI:
Amer. Math. Soc.
Liu, J. S. (2008a). Monte Carlo strategies in scientific computing. Springer Series in Statistics.
New York: Springer.
Liu, J. S. (2008b). Monte Carlo Strategies in Scientific Computing. Series in Statistics. Springer.
Liu, J. S., R. Chen, and T. Logvinenko (2001). A theoretical framework for sequential importance
sampling with resampling. In Sequential Monte Carlo methods in practice, pp. 225–246.
Springer.
Lombaert, E., T. Guillemaud, C. Thomas, et al. (2011). Inferring the origin of populations introduced from a genetically structured native range by Approximate Bayesian Computation:
case study of the invasive ladybird Harmonia axyridis. Molecular Ecology 20, 4654–4670.
Maier, M., U. von Luxburg, and M. Hein (2013). How the result of graph clustering methods
depends on the construction of the graph. ESAIM: Probability and Statistics 17, 370–418.
Marin, J.-M., N. Pillai, C. P. Robert, and J. Rousseau (2013). Relevant statistics for
Bayesian model choice.
Journal of the Royal Statistical Society: Series B Early
View(doi:10.1111/rssb.12056), 21 pages.
Marin, J.-M., P. Pudlo, and M. Sedki (2012). Optimal parallelization of a sequential approximate Bayesian computation algorithm. In IEEE Proceedings of the 2012 Winter Simulation
Conference, pp. Article number 29, 7 pages.
Marjoram, P., J. Molitor, V. Plagnol, and S. Tavaré (2003, December). Markov chain Monte
Carlo without likelihoods. Proc. Natl. Acad. Sci. USA 100(26), 15324–15328.
Marjoram, P. and J. Wall (2005). Fast "coalescent" simulation. BMC Genetics 7, 16.
Markovtsova, L., P. Marjoram, and S. Tavaré (2000). The age of a unique event polymorphism.
Genetics 156(1), 401–409.
Markovtsova, L., P. Marjoram, and S. Tavaré (2001). On a test of Depaulis and Veuille. Molecular
Biology and Evolution 18(6), 1132–1133.
McLachlan, G. and D. Peel (2000). Finite mixture models. John Wiley & Sons.
McVean, G. A. and N. J. Cardin (2005). Approximating the coalescent with recombination.
Philosophical Transactions of the Royal Society B: Biological Sciences 360(1459), 1387–1393.
Meng, X.-L. (1994). Posterior predictive p-values. The Annals of Statistics 22(3), 1142–1160.
Ng, A., M. Jordan, and Y. Weiss (2002). On spectral clustering: Analysis and an algorithm. In
T. Dietterich, S. Becker, and Ghahramani (Eds.), Advances in Neural Information Processing
Systems, Volume 14, pp. 849–856. MIT Press.
Owen, A. and Y. Zhou (2000). Safe and Effective Importance Sampling. Journal of the American
Statistical Association 95(449), 135–143.
Owen, A. B. (1988). Empirical likelihood ratio confidence intervals for a single functional.
Biometrika 75, 237–249.
Owen, A. B. (2010). Empirical likelihood. CRC press.
Pollard, D. (1981). Strong consistency of k-means clustering. The Annals of Statistics 9(1),
135–140.
Ribatet, M., D. Cooley, and A. C. Davison (2012). Bayesian inference from composite likelihoods, with an application to spatial extremes. Statistica Sinica 22, 813–845.
Rigollet, P. and R. Vert (2009). Optimal rates for plug-in estimators of density level sets.
Bernoulli 15(4), 1154–1178.
Robert, C. and G. Casella (2004). Monte Carlo Statistical Methods (second ed.). New York: Springer.
Robert, C. P., J.-M. Cornuet, J.-M. Marin, and N. Pillai (2011). Lack of confidence in approximate
Bayesian computation model choice. Proc. Natl. Acad. Sci. USA 108(37), 15112–15117.
Rosasco, L., M. Belkin, and E. De Vito (2010). On learning with integral operators. Journal of
Machine Learning Research 11, 905–934.
Rosenblatt, M. (1969). Conditional probability density and regression estimators. In Multivariate Analysis II, pp. 25–31.
Schiebinger, G., M. Wainwright, and B. Yu (2014). The Geometry of Kernelized Spectral Clustering. arXiv preprint arXiv:1404.7552.
Scornet, E., G. Biau, and J.-P. Vert (2014). Consistency of Random Forests. Technical report,
(arXiv) Technical Report 1405.2881.
Shawe-Taylor, J. and N. Cristianini (2004). Kernel Methods for Pattern Analysis. Cambridge
University Press.
Shi, J. and J. Malik (2000). Normalized cuts and image segmentation. IEEE Transactions on
Pattern Analysis and Machine Intelligence 22(8), 888–905.
Sirén, J., P. Marttinen, and J. Corander (2010). Reconstructing population histories from
single-nucleotide polymorphism data. Molecular Biology and Evolution 28(1), 673–683.
Sisson, S. A., Y. Fan, and M. Tanaka (2007). Sequential Monte Carlo without likelihoods. Proc.
Natl. Acad. Sci. USA 104, 1760–1765.
Sisson, S. A., Y. Fan, and M. Tanaka (2009). Sequential Monte Carlo without likelihoods: Errata.
Proc. Natl. Acad. Sci. USA 106, 16889.
Stephens, M. and P. Donnelly (2000). Inference in molecular population genetics. J. R. Statist.
Soc. B 62, 605–655.
Tavaré, S., D. Balding, R. Griffiths, and P. Donnelly (1997). Inferring coalescence times from
DNA sequence data. Genetics 145, 505–518.
Veach, E. and L. Guibas (1995, August). Optimally Combining Sampling Techniques for
Monte Carlo Rendering. In SIGGRAPH'95 Proceedings, pp. 419–428. Addison-Wesley.
von Luxburg, U. (2007). A tutorial on spectral clustering. Stat. Comput. 17(4), 395–416.
von Luxburg, U., M. Belkin, and O. Bousquet (2008). Consistency of spectral clustering. Ann.
Statis. 36(2), 555–586.
von Luxburg, U., B. Williamson, and I. Guyon (2012). Clustering: Science or art? In ICML Unsupervised and Transfer Learning, JMLR Workshop and Conference Proceedings, Volume 27,
pp. 65–80.
Wiuf, C. and J. Hein (1999). Recombination as a point process along sequences. Theoretical
population biology 55(3), 248–259.
A Published papers
See web page http://www.math.univ-montp2.fr/~pudlo/HDR
• (A2) B. Pelletier and P. Pudlo (2011) Operator norm convergence of spectral clustering
on level sets. Journal of Machine Learning Research, 12, pp. 349–380
• (A3) E. Arias-Castro, B. Pelletier and P. Pudlo (2012) The Normalized Graph Cut and
Cheeger Constant: from Discrete to Continuous. Advances in Applied Probability, 44(4),
dec 2012
• (A4) B. Cadre, B. Pelletier and P. Pudlo (2013) Estimation of density level sets with a given
probability content. Journal of Nonparametric Statistics 25(1), pp. 261–272.
• (A5) J.–M. Marin, P. Pudlo, C. P. Robert and R. Ryder (2012) Approximate Bayesian
Computational methods. Statistics and Computing 22(6), pp. 1167–1180.
• (A06) Estoup, A., E. Lombaert, J.-M. Marin, C. Robert, T. Guillemaud, P. Pudlo, and
J.-M. Cornuet (2012). Estimation of demo-genetic model probabilities with Approximate Bayesian Computation using linear discriminant analysis on summary statistics.
Molecular Ecology Ressources 12(5), 846–855.
• (A7) Mengersen, K.L., Pudlo, P. and Robert, C. P. (2013) Bayesian computation via empirical likelihood. Proc. Natl. Acad. Sci. USA 110(4), pp. 1321–1326.
• (A8) Gautier, M., J. Foucaud, K. Gharbi, T. Cezard, M. Galan, A. Loiseau, M. Thomson,
P. Pudlo, C. Kerdelhué, and A. Estoup (2013). Estimation of population allele frequencies
from next-generation sequencing data: pooled versus individual genotyping. Molecular
Ecology 22(4), 3766–3779.
• (A9) Gautier, M., K. Gharbi, T. Cezard, J. Foucaud, C. Kerdelhué, P. Pudlo, J.-M. Cornuet,
and A. Estoup (2012). The effect of RAD allele dropout on the estimation of genetic
variation within and between populations. Molecular Ecology 22(11), 3165–3178.
• (A12) Cornuet J.-M., Pudlo P., Veyssier J., Dehne-Garcia A., Gautier M., Leblois R., Marin
J.-M., Estoup A. (2014) DIYABC v2.0: a software to make Approximate Bayesian Computation inferences about population history using Single Nucleotide Polymorphism, DNA
sequence and microsatellite data. Bioinformatics 30(8), pp. 1187–1189.
• (A13) Baragatti, M. and P. Pudlo (2014). An overview on Approximate Bayesian Computation. ESAIM: Proc. 44, 291–299.
• (A14) Leblois, R., Pudlo, P., Néron, J., Bertaux, F., Beeravolu, C. R., Vitalis, R. and Rousset,
F. Maximum likelihood inference of population size contractions from microsatellite
data. Molecular Biology and Evolution, in press.
• (A15) Stoehr, J., Pudlo, P. and Cucala, L. (2014) Adaptive ABC model choice and geometric summary statistics for hidden Gibbs random fields. Accepted in Statistics and
Computing.
B Preprints
See web page http://www.math.univ-montp2.fr/~pudlo/HDR
• (A10) Ratmann, O., P. Pudlo, S. Richardson, and C. P. Robert (2011). Monte Carlo algorithms for model assessment via conflicting summaries. Technical report, arXiv preprint
1106.5919.
• (A11) Sedki, M., P. Pudlo, J.-M. Marin, C. P. Robert, and J.-M. Cornuet (2013). Efficient
learning in ABC algorithms. Technical report, arXiv preprint 1210.1388.
• (A16) Marin, J.-M., P. Pudlo, M. Sedki (2013). Consistency of the Adaptive Multiple
Importance Sampling. Technical report, arXiv preprint 1211.2548.
• (A17) Pudlo, P., Marin, J.-M., Estoup, A., Gautier, M., Cornuet, J.-M. and Robert, C. P. ABC
model choice via random forests. Technical report, arXiv preprint 1406.6288.
C Curriculum vitæ
MAÎTRE DE CONFÉRENCES (associate professor) at Université Montpellier 2, Faculté des Sciences
I3M - Institut de Mathématiques et Modélisation de Montpellier UMR CNRS 5149
Place Eugène Bataillon ; 34095 Montpellier CEDEX, France
Phone: +33 4 67 14 42 11 / +33 6 85 17 78 46
Email: pierre.pudlo@univ-montp2.fr
URL: http://www.math.univ-montp2.fr/˜pudlo
Born on 20 September 1977 in Villers-Semeuse (08 – Ardennes)
French nationality.
Education
2001–2004 PHD THESIS at the Laboratoire de Probabilités, Combinatoire et Statistique, Université Claude Bernard Lyon 1, under the supervision of Didier PIAU: Estimations précises de grandes déviations et applications à la statistique des séquences biologiques
1998–2002 Studies at the École Normale Supérieure de Lyon: MAGISTÈRE in mathematics and applications (MMA)
LICENCE, MAÎTRISE and DEA degrees at Université Lyon 1
AGRÉGATION in mathematics (ranked 52nd)
Positions held
September 2011–August 2013 INRA délégation (research secondment) at the Centre de Biologie pour la Gestion des Populations
September 2006 Appointed maître de conférences at Université Montpellier 2
2005–2006 ATER (temporary teaching and research position), Université de Franche-Comté, Laboratoire de Mathématiques de Besançon (UMR 6623)
2002–2005 PhD fellowship with teaching duties (allocataire-moniteur), Université Lyon 1
1998–2002 Trainee civil servant (fonctionnaire stagiaire), École Normale Supérieure de Lyon
Research themes
Classification: spectral clustering • density-based clustering • machine learning • asymptotic theorems • Cheeger constant • random graphs.
Funded by the ANR project CLARA (2009–2013): Clustering in High Dimension: Algorithms and Applications, led by B. PELLETIER (I am in charge of the Montpellier node).
Numerical probability and Bayesian statistics: Monte Carlo methods • population genetics • ABC (Approximate Bayesian Computation) • importance sampling • empirical likelihood.
Funded by
• a two-year INRA délégation (SPE department) at the Centre de Biologie pour la Gestion des Populations (CBGP, UMR INRA SupAgro Cirad IRD, Montpellier);
• the ANR project EMILE (2009–2013): Études de Méthodes Inférentielles et Logiciels pour l'Évolution, led by J.-M. CORNUET, then R. VITALIS;
• the Labex NUMEV, Montpellier;
• the Institut de Biologie Computationnelle (Investissement d'Avenir project, Montpellier);
• the PEPS project (CNRS) « Comprendre les maladies émergentes et les épidémies : modélisation, évolution, histoire et société », which I lead.
Main collaborators
Mathematical community: Benoît CADRE (Prof., École Normale Supérieure de Cachan, Rennes site), Jean-Michel MARIN (Prof., Université Montpellier 2), Kerrie MENGERSEN (Prof., Queensland University of Technology, Brisbane, Australia), Bruno PELLETIER (Prof., Université Rennes 2), Didier PIAU (Prof., Université Joseph Fourier Grenoble 1) and Christian P. ROBERT (Prof., Université Paris-Dauphine & IUF).
Biological community: Jean-Marie CORNUET (DR, INRA CBGP), Arnaud ESTOUP (DR, INRA CBGP), Mathieu GAUTIER (CR, INRA CBGP), Raphaël LEBLOIS (CR, INRA CBGP) and François ROUSSET (DR, CNRS ISE-M).
Administrative and scientific responsibilities
2013–present Co-head of the Algorithms & Computations axis of the Labex NUMEV
2008–present Organizer of the probability and statistics seminar in Montpellier
2010–present Elected member of the council of the UMR I3M
2014 Member of the scientific committee of the third Rencontres R in Montpellier (25–27 June 2014)
2012, 2013, 2014 Member of the organizing committees of the workshops "Mathematical and Computational Evolutionary Biology", June 2012, 2013 and 2014.
2012 Member of the hiring committee at Université Lyon 1 for the recruitment of a maître de conférences in statistics.
2010 Member of the organizing committee of the Journées de Statistiques du Sud in Mèze (June 2010): "Modelling and Statistics in System Biology".
2009–2010 Member of hiring committees at Montpellier 2.
2008 Elected member of the CNU section 26 commission de spécialistes at Montpellier 2.
2004–2005 Administrator of the server of the LaPCS (Laboratoire de Probabilités, Combinatoire et Statistique, Lyon 1)
Supervision
Besides about ten first-year Master's students, I have supervised the work listed below.
February–June 2009 M2 internship of Mohammed SEDKI.
2009–2012 PhD thesis of Mohammed SEDKI with J.-M. MARIN (Prof., Montpellier 2): Échantillonnage préférentiel adaptatif et méthodes bayésiennes approchées appliquées à la génétique des populations (defended on 31 October 2012). M. SEDKI has been maître de conférences at the Université d'Orsay (faculty of medicine – Le Kremlin-Bicêtre) since September 2013.
March–June 2012 M2 internship of Julien STOEHR, student at the École Normale Supérieure de Cachan
2012–present PhD thesis of Julien STOEHR, co-supervised with Jean-Michel MARIN (Prof., Montpellier 2) and Lionel CUCALA (MCF, Montpellier 2): Choix de modèles pour les champs de Gibbs (in particular via ABC)
April–July 2013 M2 internship of Coralie MERLE (Master MathSV, Université Paris Sud – École Polytechnique) with Raphaël LEBLOIS (CR, CBGP).
2013–present PhD thesis of Coralie MERLE, in effective co-supervision with Raphaël LEBLOIS (CR, CBGP): Nouvelles méthodes d'inférence de l'histoire démographique à partir de données génétiques.
Teaching
Since my appointment in 2006, I have taught a large number of course hours, among which:
PhD In charge of the doctoral module "Programmation orientée objet : modélisation probabiliste & calcul numérique en statistique pour la biologie" (30h)
M2R In charge of the module on supervised and unsupervised classification (20h)
M1 In charge of the module on stochastic processes / networks and queues (50h)
L3 In charge of the module on data processing (50h) for the Bachelor's programmes in Biology and in Geology-Biology-Environment
List of publications
Articles
For the standing of the journals, see Table C.1.
(A1) P. Pudlo@ (2009) Large deviations and full Edgeworth expansions for finite Markov chains
with applications to the analysis of genomic sequences. ESAIM: Probab. and Statis. 14,
pp. 435–455.
(A2) B. Pelletier and P. Pudlo@ (2011) Operator norm convergence of spectral clustering on
level sets. Journal of Machine Learning Research, 12, pp. 349–380.
(A3) E. Arias-Castro, B. Pelletier and P. Pudlo@ (2012) The Normalized Graph Cut and Cheeger
Constant: from Discrete to Continuous. Advances in Applied Probability, 44(4), dec
2012.
(A4) B. Cadre, B. Pelletier and P. Pudlo@ (2013) Estimation of density level sets with a given
probability content. Journal of Nonparametric Statistics 25(1), pp. 261–272.
(A5) J.–M. Marin, P. Pudlo@ , C. P. Robert and R. Ryder (2012) Approximate Bayesian Computational methods. Statistics and Computing 22(6), pp. 1167–1180.
(A6) A. Estoup, E. Lombaert, J.–M. Marin, T. Guillemaud, P. Pudlo, C. P. Robert and J.–M. Cornuet (2012) Estimation of demo-genetic model probabilities with Approximate Bayesian
Computation using linear discriminant analysis on summary statistics. Molecular Ecology Resources 12(5), pp. 846–855.
(A7) Mengersen, K.L., Pudlo, P.@ and Robert, C. P. (2013) Bayesian computation via empirical
likelihood. Proc. Natl. Acad. Sci. USA 110(4), pp. 1321–1326.
(A8) Gautier, M., Foucaud, J., Gharbi, K., Cezard, T., Galan, M., Loiseau, A., Thomson, M.,
Pudlo, P., Kerdelhué, C., Estoup, A. (2013) Estimation of population allele frequencies
64
from next-generation sequencing data: pooled versus individual genotyping. Molecular
Ecology 22(14), pp. 3766–3779.
(A9) Gautier, M., Gharbi, K., Cezard, T., Foucaud, J., Kerdelhué, C., Pudlo, P., Cornuet, J.-M.,
Estoup, A. (2013) The effect of RAD allele dropout on the estimation of genetic variation
within and between populations. Molecular Ecology 22(11), pp. 3165–3178.
(A12) Cornuet J.-M., Pudlo P., Veyssier J., Dehne-Garcia A., Gautier M., Leblois R., Marin J.-M.,
Estoup A. (2014) DIYABC v2.0: a software to make Approximate Bayesian Computation inferences about population history using Single Nucleotide Polymorphism, DNA
sequence and microsatellite data. Bioinformatics, btt763.
(A13) Baragatti, M. and Pudlo, P.@ (2014) An overview on Approximate Bayesian computation.
ESAIM: Proc. 44, pp. 291–299.
(A14) Leblois, R., Pudlo, P., Néron, J., Bertaux, F., Beeravolu, C. R., Vitalis, R. and Rousset, F.
Maximum likelihood inference of population size contractions from microsatellite data.
Molecular Biology and Evolution, in press.
(A15) Stoehr, J., Pudlo, P. and Cucala, L. (2014) Adaptive ABC model choice and geometric summary statistics for hidden Gibbs random fields. Accepted in Statistics and Computing.
See arXiv:1402.1380
Submitted articles
(A10) Ratmann, O., Pudlo, P., Richardson, S. and Robert, C. P. (2011) Monte Carlo algorithms
for model assessment via conflicting summaries. See arXiv:1106.5919
(A11) Sedki, M., Pudlo, P., J.–M. Marin, C. P. Robert and J.–M. Cornuet (2013) Efficient learning
in ABC algorithms. Submitted. See arXiv:1210.1388
(A16) Marin, J.-M., Pudlo, P.@ and Sedki, M. (2012 ; 2014) Consistency of the Adaptive Multiple
Importance Sampling. Submitted. See arXiv:1211.2548.
(A17) Pudlo, P., Marin, J.-M., Estoup, A., Gautier, M., Cornuet, J.-M. and Robert, C. P. ABC
model choice via random forests. Submitted. See arXiv:1406.6288
International patent
(B1) PCT application no. EP13153512.2: Process for identifying rare events (2014).
@
Note: in mathematics, it is customary to list authors in alphabetical order.
This applies to articles (A1), (A2), (A3), (A4), (A5), (A7), (A10) and (A13), marked with an @.
Software and documents intended for transfer
(D1) Cornuet J-M, Pudlo P, Veyssier J, Dehne-Garcia A, Estoup A (2013) DIYABC V2.0, a user-friendly package for inferring population history through Approximate Bayesian Computation using microsatellites, DNA sequence and SNP data. Program available at http://www1.montpellier.inra.fr/CBGP/diyabc/
(D2) Cornuet J-M, Pudlo P, Veyssier J, Dehne-Garcia A, Estoup A (2013) DIYABC V2.0, a user-friendly package for inferring population history through Approximate Bayesian Computation using microsatellites, DNA sequence and SNP data. Detailed 91-page user manual available at
http://www1.montpellier.inra.fr/CBGP/diyabc/
Oral presentations
(T1) Population genetics seminar, Vienna, Austria, April 2013.
http://www.popgen-vienna.at/news/seminars.html
(T2) Journées of the SMAI group Modélisation Aléatoire et Statistique, Clermont-Ferrand, August 2012.
(T3) Mathematical and Computational Evolutionary Biology, Montpellier, June 2012.
(T4) International Workshop on Applied Probability, Jerusalem, June 2012.
(T5) Séminaires Statistique Mathématique et Applications, Fréjus, August–September 2011.
(T6) 5èmes Journées Statistiques du Sud, Nice, June 2011.
(T7) Approximate Bayesian Computation in London, May 2011.
(T8) 3rd conference of the International Biometric Society Channel Network, April 2011.
(T9) 42èmes Journées de Statistique, Marseille, 2010.
(T10) 41èmes Journées de Statistique, Bordeaux, 2009.
(T11) XXXIVème École d'Été de Probabilités de Saint-Flour, 2004.
(T12) Journées de probabilités, La Rochelle, 2002.
VARIOUS SEMINARS IN FRANCE: among others, Toulouse (2012), AgroParisTech (2012), Grenoble (LECA, 2011), Avignon (2011), Besançon (2010, 2013), Rennes (2010), Grenoble (2008)...
Journal standing
Table C.1 – Journal standing (Référentiel de notoriété)

Journal                                  Impact factor (5-year IF)   Standing
Advances in Applied Probability          0.900 (0.841)               ∗∗
ESAIM: Probab. and Statis.               0.408 (–)                   ∗
Journal of Machine Learning Research     3.420 (4.284)               ∗∗∗
Journal of Nonparametric Statistics      0.533 (0.652)               ∗
Molecular Biology and Evolution          10.353 (11.221)             ∗∗∗∗
Molecular Ecology                        6.275 (6.792)               ∗∗∗
Molecular Ecology Resources              7.432 (4.150)               ∗∗∗
Proc. Natl. Acad. Sci. USA               9.737 (10.583)              ∗∗∗∗
Statistics and Computing                 1.977 (2.663)               ∗∗∗

∗∗∗∗ = Exceptional, ∗∗∗ = Excellent, ∗∗ = Fair, ∗ = Mediocre
The journal standing is taken from the Référentiel de notoriété 2012, Erist de Jouy-en-Josas – Crebi, M. Désiré, M.-H. Magri and A. Solari. It is based on a study of the distribution of impact factors by discipline.