Statistics and Data Science Working Group

Statistics & Data Science Working Group @ LMNO

  • Usual day : Friday at 16:00
  • Organizers : Faïcel Chamroukhi and José G. Gómez García
  • Contact : chamroukhi at unicaen dot fr and jose-gregorio.gomez-garcia at unicaen dot fr
  • How to find us : Address: Université de Caen, Campus 2 Côte de Nacre, Boulevard du Maréchal Juin, 14000 Caen-Cedex, Bâtiment Sciences 3.
    • By train: If you come by train from Paris or elsewhere, get off at Caen station. Then take tramway "A" in the direction Caen - Campus 2 and get off at the last stop (Campus 2). The tramway stop is right in front of the train station, and the Sciences 3 building, where the seminar takes place, is a few meters from the tram stop.
    • By car: If you come by car, follow the "périphérique nord" (Caen northern ring road) and take exit n°5, direction Douvres-la-Délivrande.
    • By bus: From Caen city center, you can also take bus lines 10, 13 and 14 and get off at the stop "Maréchal Juin", or take bus line 7 and get off at the stop "centre commercial campus 2".

 

Programme 2018/2019 :

 

Bao Tuyen Huynh

  • Date : May 10, 2019 at 16:00
  • Room : S3 247
  • Affiliation : Laboratoire de Mathématiques Nicolas Oresme (LMNO), Université de Caen
  • Title : "Regularized Maximum Likelihood Estimation and Feature Selection in Mixtures of Experts Models"
  • Abstract: Mixtures of Experts (MoE) are successful models for heterogeneous data in many statistical learning problems such as regression, clustering and classification. They are generally fitted by maximum likelihood estimation via the well-known EM algorithm, and their application to high-dimensional problems is still challenging. We consider the problem of fitting and feature selection in MoE models for generalized linear models, and propose a regularized maximum likelihood estimation approach that encourages sparse solutions for heterogeneous regression data models with potentially high-dimensional predictors.
    First, we develop hybrid EM algorithms based on proximal Newton-type methods for the Gaussian regression model. The proposed algorithms automatically yield sparse solutions without thresholding, and avoid matrix inversion by allowing univariate parameter updates.
    Then, we extend the method to the Poisson regression model and to the classification model.
    An experimental study shows the good performance of the algorithms in terms of recovering the actual sparse solutions, parameter estimation, and clustering of heterogeneous regression data.
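
To make the idea of univariate, inversion-free updates concrete, here is a minimal sketch in Python. It is a deliberately simplified stand-in, not the speaker's hybrid EM algorithm: it runs plain cyclic coordinate descent on a single lasso-penalized Gaussian linear regression (no mixture, no gating network). The function names, the toy data and the penalty value are all assumptions made for illustration. Each coefficient is updated in closed form through the proximal operator of the L1 penalty, so some coefficients become exactly zero and no matrix is ever inverted.

```python
def soft_threshold(z, lam):
    """Proximal operator of lam * |.|: shrink z toward zero by lam."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def lasso_coordinate_descent(X, y, lam, n_sweeps=100):
    """Minimise (1/2n) * ||y - X b||^2 + lam * ||b||_1 by cyclic univariate updates.

    X is a list of rows, y a list of responses (hypothetical toy interface).
    """
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    residual = list(y)  # equals y - X @ beta while beta is all zeros
    for _ in range(n_sweeps):
        for j in range(p):
            col = [X[i][j] for i in range(n)]
            col_sq = sum(c * c for c in col) / n
            if col_sq == 0.0:
                continue  # an all-zero column carries no information
            # Correlation of column j with the residual that ignores beta[j].
            rho = sum(c * (r + c * beta[j]) for c, r in zip(col, residual)) / n
            new_bj = soft_threshold(rho, lam) / col_sq
            delta = new_bj - beta[j]
            if delta != 0.0:
                residual = [r - c * delta for c, r in zip(col, residual)]
                beta[j] = new_bj
    return beta


# Toy data (assumed): the response depends on the first predictor only.
X_toy = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
y_toy = [2.0, 0.0, 2.0, 0.0]
beta_toy = lasso_coordinate_descent(X_toy, y_toy, lam=0.1)
print(beta_toy)
```

On this toy design the second coefficient is set exactly to zero by the update itself, with no after-the-fact cut-off of small values, which is one way to read "sparse solutions without thresholding".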

 

José G. Gómez García

  • Date : March 29, 2019 at 16:00
  • Room : S3 247
  • Affiliation : Laboratoire de Mathématiques Nicolas Oresme (LMNO), Université de Caen
  • Webpage : https://gomezgarcia.users.lmno.cnrs.fr/
  • Title : "Infinite-memory CHARME models built from deep neural networks"
  • Abstract: We consider a model called CHARME (Conditional Heteroscedastic Autoregressive Mixture of Experts), which can be seen as a kind of mixture of nonparametric, nonlinear AR-ARCH time series. Under certain Lipschitz-type conditions on the autoregression and volatility functions, we prove that this model is tau-weakly dependent in the sense of Dedecker & Prieur (2004), and therefore ergodic and stationary. This result lays the theoretical groundwork for an asymptotic theory of the underlying nonparametric estimation. As an application, using the universal approximation property of neural networks, the autoregression and volatility functions can then be estimated by deep neural networks, for which the consistency of the estimator (or trainer) of the neurons and biases is guaranteed.
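
As a rough illustration of the kind of process the CHARME abstract describes, the following Python sketch simulates a mixture of AR-ARCH-type experts. Several choices here are assumptions made for the example and are not taken from the talk: the memory is truncated to one lag (the talk concerns infinite memory), the hidden regimes are drawn independently at each step, the innovations are standard Gaussian, and the two experts and their weights are arbitrary. At each step a regime k is drawn, and the next value is f_k(x) + g_k(x) * eps, where f_k plays the role of the autoregression function and g_k that of the volatility function.

```python
import math
import random


def simulate_charme(n_steps, experts, weights, x0=0.0, seed=0):
    """Simulate a one-lag mixture of AR-ARCH experts (illustrative sketch).

    experts: list of (f_k, g_k) pairs, autoregression and volatility functions.
    weights: regime probabilities (assumed i.i.d. regimes).
    """
    rng = random.Random(seed)
    x = x0
    path, regimes = [], []
    for _ in range(n_steps):
        # Draw the hidden regime for this time step.
        k = rng.choices(range(len(experts)), weights=weights)[0]
        f_k, g_k = experts[k]
        eps = rng.gauss(0.0, 1.0)  # innovation
        x = f_k(x) + g_k(x) * eps
        path.append(x)
        regimes.append(k)
    return path, regimes


# Two hypothetical experts with Lipschitz autoregression and volatility functions.
demo_experts = [
    (lambda x: 0.5 * x, lambda x: math.sqrt(0.1 + 0.2 * x * x)),
    (lambda x: -0.3 * math.tanh(x), lambda x: 1.0),
]
demo_path, demo_regimes = simulate_charme(500, demo_experts, weights=[0.7, 0.3])
print(len(demo_path), sum(demo_regimes))
```

In the estimation setting of the abstract, the functions f_k and g_k would be unknown and approximated by neural networks; here they are fixed by hand only to generate data.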

     
     

Van-Ha Hoang (Univ. Rouen)

  • Date : March 1st, 2019 at 16:00
  • Room : S3 247
  • Affiliation : Laboratoire de Mathématiques Raphaël Salem (LMRS), Université de Rouen.
  • Webpage : http://vanhahoang.perso.math.cnrs.fr/
  • Title : "Adaptive multivariate wavelet regression with errors in variables"
  • Abstract: In the multidimensional setting, we consider the errors-in-variables model. Our goal is to estimate the unknown multivariate nonparametric regression function when the covariates are observed with errors. We design an adaptive estimator based on projection kernels on a wavelet basis and a deconvolution operator. We propose a procedure for selecting the wavelet resolution level, inspired by the method of Goldenshluger and Lepski. We obtain an oracle inequality and optimal rates of convergence over anisotropic Hölder spaces. The theoretical results are illustrated by numerical simulations.

     
    This is joint work with M. Chichignoud, T. M. Pham Ngoc and V. Rivoirard.
 
 

Marius Bartcus

  • Date : February 15, 2019 at 16:00
  • Room : S3 247
  • Affiliation : Laboratoire de Mathématiques Nicolas Oresme (LMNO), Université de Caen
  • Title : "Latent data models for clustering large data sets"
  • Abstract: In this talk, I’ll focus on the paradigm in which we analyse a large data set in an unsupervised way, where the observed data are incomplete or require recovering some hidden structure. I’ll show how large data sets can be processed in such a context, from both a statistical and a computational point of view, and discuss some perspectives for scaling up some of our existing unsupervised algorithms.

 

Faïcel Chamroukhi

  • Date : February 1st, 2019 at 16:00
  • Room : S3 247
  • Affiliation : Laboratoire de Mathématiques Nicolas Oresme (LMNO), Université de Caen
  • Webpage : https://chamroukhi.com/
  • Title : "Statistical data science: on some unsupervised learning problems"
  • Abstract: Statistics and machine learning are foundational to data science, and the unsupervised learning of useful information from heterogeneous and unlabeled raw data is one of the most popular and challenging problems in statistical data science. It is attracting broader interest in today's statistics & data science community, in particular with the increasing prevalence of large-scale (big) data. In this talk, I will introduce some statistical latent data models and unsupervised algorithms that are able to discern knowledge from raw data while addressing problems related to data complexity, including heterogeneity, high dimensionality, dynamical behaviour, and missing information. Some of the questions discussed are connected to the ongoing research projects AStERiCs and ANR SMILES, which aim at introducing an unsupervised learning framework and scaled inference algorithms for transforming large-scale data into knowledge.

 

 

Programme 2017/2018 :

José G. Gómez García (Cergy)

  • Date : 17 May 2018 at 14:00
  • Room : S3 124
  • Affiliation : Cergy
  • Title : Limit theorems for cluster functionals of extremes of weakly dependent processes.
  • Abstract: Limit theorems for empirical processes of cluster functionals of extremes of stationary time series are given by Drees & Rootzén (2010) under absolute regularity (or "β-mixing") conditions. However, these mixing-type dependence conditions are very restrictive: they are particularly suited to models in finance and in history, and they are moreover complicated to verify. In general, for other models frequently encountered in applied fields, the mixing conditions are not satisfied. In contrast, the weak dependence conditions of Doukhan & Louhichi (1999) and Dedecker & Prieur (2004a) are more general and cover a large list of models. Starting from these more favorable conditions, we extend some of the limit theorems of Drees & Rootzén (2010) to weakly dependent processes. In addition, as an application of these theorems, we show the convergence in distribution of the extremogram estimator of Davis & Mikosch (2009) and of the functional estimator of the extremal index of Drees (2011) under weak dependence. The talk will conclude with a discussion of possible extensions of these results to functional stochastic processes and of their applications.
 
Vincent Roger (Toulon)
  • Date : 22 November 2017 at 14:00
  • Room : S3 124
  • Affiliation : Univ Toulon/LIS
  • Title : Learning representations with generative deep networks. Application to bioacoustics
  • Slides : TBA

Programme 2016/2017 :

Wajdi Farhani

  • Date : 23 February 2017 at 14:00
  • Room : S3 124
  • Affiliation : Artfact http://www.artfact-online.fr/about.html
  • Title :  Online clustering with MCMC
  • Abstract : Cluster analysis is increasingly used in modern industry, and classical algorithms do not fit every use case. In this talk, I will present an online clustering algorithm based on MCMC methods developed within Artfact. With each incoming data point, the algorithm aims to detect the optimal number of clusters and their geometrical positions within the n-dimensional space (n: number of variables/columns).
    I will begin with a presentation of the algorithm before discussing the challenges of its implementation. Then, I will give some concrete examples of industrial use cases where such an algorithm has been or can be used.
  • Slides : TBA

Tuyen B. Huynh

  • Date : 09 February 2017 at 14:00
  • Room : S3 124
  • Affiliation : Laboratoire de Mathématiques Nicolas Oresme (LMNO), Université de Caen
  • Webpage : 
  • Title :  Statistical learning in large-scale scenarios: State of the art.
  • Abstract : TBA.
  • Slides : TBA

Faïcel Chamroukhi

  • Date : 02 February 2017 at 14:00
  • Room : S3 124
  • Affiliation : Laboratoire de Mathématiques Nicolas Oresme (LMNO), Université de Caen
  • Webpage : https://chamroukhi.users.lmno.cnrs.fr/
  • Title :  Hierarchical dynamical mixture models for high-dimensional data.
  • Abstract : The unsupervised statistical analysis of high-dimensional data, in particular functional data, is a popular topic in modern statistics and is related to the field of statistical inference for latent data models. In this talk, I will present latent data models and inference algorithms to learn from heterogeneous temporal and functional data. First, I will present hidden process regression models for non-stationary temporal data modeling and segmentation. Then, I will consider the problem of statistical modeling when the basic unit of information is a curve, that is, the framework of functional data analysis, and present hierarchical dynamical mixtures for simultaneous clustering and segmentation of heterogeneous functional data. The presented models will be illustrated on real-world applications.
  • Slides : TBA

Antoine Channarond

  • Date : 19 January 2017 at 14:00
  • Room : S3 124
  • Affiliation : Laboratoire de Mathématiques Raphaël Salem (LMRS), Université de Rouen
  • Webpage : http://lmrs.univ-rouen.fr/Persopage/Channarond/
  • Title : Latent position random graph model and statistical applications
  • Abstract : We consider the following random graph model: the nodes are randomly placed in a Euclidean space according to some nonparametric density f, and the probability of connection between two nodes depends only on the distance between them. From a statistical point of view, the node positions are not observed: they are said to be latent. A major challenge in this setting is to obtain information about the latent space from the graph alone. The talk will address the problems of estimating distances and of clustering the nodes of the graph: the clusters are defined as the connected components of a level set t of the density f, and the aim is to infer which nodes lie in the level set, and in which cluster.
  • Slides : TBA
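
The generative model in the abstract above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the density f is taken to be a two-component Gaussian mixture in the plane, the link function exp(-d^2) is an arbitrary choice, and all names are hypothetical. The sketch covers the data-generating side only, not the estimation or clustering procedures discussed in the talk.

```python
import math
import random


def sample_latent_position_graph(n_nodes, sample_position, link_prob, seed=0):
    """Draw latent positions, then connect each pair with a distance-based probability."""
    rng = random.Random(seed)
    positions = [sample_position(rng) for _ in range(n_nodes)]
    edges = set()
    for i in range(n_nodes):
        for j in range(i + 1, n_nodes):
            d = math.dist(positions[i], positions[j])
            # The connection probability depends on the pair only through d.
            if rng.random() < link_prob(d):
                edges.add((i, j))
    return positions, edges


def two_cluster_position(rng):
    """Assumed density f: equal mixture of two Gaussians centred at (-2, 0) and (2, 0)."""
    centre_x = -2.0 if rng.random() < 0.5 else 2.0
    return (rng.gauss(centre_x, 0.5), rng.gauss(0.0, 0.5))


def gaussian_link(d):
    """Assumed link function: nearby nodes connect with high probability."""
    return math.exp(-d * d)


latent_positions, observed_edges = sample_latent_position_graph(
    30, two_cluster_position, gaussian_link
)
print(len(observed_edges))
```

In the statistical problem, only `observed_edges` would be available; `latent_positions` is what the methods of the talk try to recover information about.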

Vincent Roger

  • Date : 17 November 2016 at 14:00
  • Room : S3 279
  • Affiliation : Laboratoire des Sciences de L'information et des Systèmes (LSIS), Université de Toulon
  • Title : Unsupervised learning from large-scale bioacoustic data
  • Abstract : Understanding communication or interpreting different animal signals is
    an important topic in bioacoustics. We investigate probabilistic
    models on challenging real-world bioacoustic sound scenes. These challenging
    problems do not have ground truth, and we do not have prior knowledge as in
    speech analysis.
    Thus, we investigate Bayesian non-parametric models to segment the
    different sounds. First, we study sequential models based on the Hidden Markov
    Model (HMM), namely the Hierarchical Dirichlet Process HMM (HDP-HMM).
    Next, we study a non-sequential model: the Dirichlet Process Gaussian Mixture
    Model (DPGMM). The main problem with such approaches is the evaluation of
    the results. We give a first answer, and we will show the next steps we want
    to follow.
  • Slides :

Charles Bouveyron

  • Date : 10 November 2016 at 14:00
  • Room : S3 263
  • Affiliation : MAP5, Université Paris Descartes
  • Webpage : http://w3.mi.parisdescartes.fr/~cbouveyr/
  • Title : The Stochastic Topic Block Model for the Clustering of Vertices in Networks with Textual Edges
  • Abstract :
    Due to the significant increase of communications between individuals via social media (Facebook, Twitter, LinkedIn) or electronic formats (email, web, e-publication) in the past two decades, network analysis has become an unavoidable discipline. Many random graph models have been proposed to extract information from networks based on person-to-person links only, without taking into account information on the contents. This paper introduces the stochastic topic block model (STBM), a probabilistic model for networks with textual edges. We address here the problem of discovering meaningful clusters of vertices that are coherent from both the network interactions and the text contents. A classification variational expectation-maximization (C-VEM) algorithm is proposed to perform inference. Simulated data sets are considered in order to assess the proposed approach and to highlight its main features. Finally, we demonstrate the effectiveness of our methodology on two real-world data sets: a directed communication network and an undirected co-authorship network.
  • Slides :  link

Emeline Perthame

  • Date : 27 October 2016 at 14:00
  • Room : S3 279
  • Affiliation : INRIA MISTIS, Grenoble
  • Webpage : https://emelineperthame.wordpress.com/
  • Title : Inverse regression approach to non-linear high-to-low dimensional mapping
  • Abstract :
    During this presentation, I will introduce a model that addresses non-linear regression issues when the number of covariates is large with regard to the number of responses. In the proposed method, non-linearity is handled via a mixture of regressions. Mixture models and, paradoxically, the so-called mixture of regression models are mostly used to handle clustering issues, and few articles refer to mixture models for actual regression and prediction purposes. Interestingly, it was shown in (Deleforge et al., 2015 [1]) that a prediction approach based on a mixture of regressions and on an inverse regression trick in a Gaussian setting achieves low prediction errors compared to the literature. However, the method developed by these authors is not designed to perform robust regression. Indeed, under a Gaussian setting, outliers are known to affect the stability of the results and can lead to misleading predictions. Robust approaches that are tractable in high dimension are therefore needed in order to improve the accuracy of regression methods in the presence of outliers.

    The goal of this talk is to present how we refine the work in [1] by considering mixtures of Student distributions, which are able to handle outliers. As in [1], we propose to handle high-dimensional data by using an inverse regression trick. However, in the Student mixture context, a joint modelling approach on both responses and regressors is necessary in order to guarantee the tractability of the inverse regression of interest.

    This work is a collaboration with Florence Forbes (INRIA, Grenoble) and Antoine Deleforge (INRIA, Rennes). 

    [1] Deleforge, A., Forbes, F. and Horaud, R. (2015). High-dimensional regression with Gaussian mixtures and partially-latent response variables. Statistics and Computing, 25(5):893–911.
    [2] Perthame, E., Forbes, F. and Deleforge, A. (2016).  Inverse regression approach to robust non-linear high-to-low dimensional mapping. Submitted, 2016.
  • Slides : Perthame.pdf