The First Symposium of Geometry and Statistics


General Information

The purpose of the conference is to promote research at the emerging interface of statistics and geometry among researchers in China and beyond. This is a continuing effort, following the recent Harvard conference on geometry and statistics. Anyone who has received a Ph.D. or expects to receive a Ph.D. by the end of 2023 in a relevant field is eligible to attend. Participants from under-represented groups are especially encouraged to attend.

The conference will take place at the Yanqi Lake Beijing Institute of Mathematical Sciences and Applications (BIMSA), sponsored by BIMSA and the Yau Mathematical Sciences Center at Tsinghua University, during July 29 - 31, 2023. This conference is a satellite conference of the International Congress of Basic Science scheduled in Beijing during July 16 - 28, 2023.

Registration (Group photo)

Invited speakers

Louis Christie (Cambridge)
Ke Deng (Tsinghua)
Scott V. Edwards (Harvard)
Xinqi Gong (Renmin U)
Yang-Hui He (London Institute for Mathematical Sciences)
Stephan Huckemann (Georg-August-Universität Göttingen)
Yongdai Kim (Seoul National U)
Xiangdong Li (AMSS/CAS)
Kefeng Liu (UCLA)
Ezra Miller (Duke)
Stefan Sommer (U of Copenhagen)
Zaiwen Wen (Peking U)
Hao Xu (Zhejiang U)
Zhigang Yao (NUS/CMSA Harvard)
Stephen Yau (Tsinghua)
Chunming Zhang (UW Madison)
Jian Zhang (U of Kent)

Organizing Committee

Rongling Wu (BIMSA)
Lijian Yang (Tsinghua)
Zhigang Yao (NUS/Harvard CMSA and Committee Chair)

Scientific Advisors:

Shiu-Yuen Cheng (Tsinghua)
Shing-Tung Yau (Tsinghua)

Contact Information

Scientific Aspects Enquiries: zhigang.yao(AT)nus.edu.sg

Schedule (PDF)

Saturday, July 29, 2023 (Beijing Time)

8:30-8:45 am	Check in at BIMSA
8:45–9:00 am	Zhigang Yao/Rongling Wu	Welcome Remarks
	Morning Session Chair: Zhigang Yao
9:00–10:00 am	Stephan Huckemann (Göttingen)	Title: The wald space for phylogenetic trees Abstract: Most existing metrics between phylogenetic trees directly measure differences in topology and edge weights, and are unrelated to the models of evolution used to infer trees. We describe metrics which instead are based on distances between the probability models of discrete or continuous characters induced by trees. We describe how construction of information-based geodesics leads to the recently proposed wald space of phylogenetic trees. As a point set, it sits between the BHV space (Billera, Holmes and Vogtmann, 2001) and the edge-product space (Moulton and Steel, 2004). It has a natural embedding into the space of positive definite matrices, equipped with the information geometry. Thus, singularities such as overlapping leaves are infinitely far away. Proper forests, however, which comprise the "BHV-boundary at infinity", are part of the wald space, adding boundary correspondences to groves (corresponding to orthants in the BHV space). In fact, the wald space contracts to the completely disconnected forest. Further, it is a geodesic space, exhibiting the structure of a Whitney stratified space of type (A) where strata carry compatible Riemannian metrics. We explore some more geometric properties, but the full picture remains open. We conclude by identifying open problems we deem interesting.
10:00–10:10 am	Break
10:10–11:10 am	Yang-Hui He (London Institute for Mathematical Sciences)	Title: AI for mathematics Abstract: We summarize how AI can approach mathematics in three ways: theorem-proving, conjecture formulation, and language processing. Inspired by initial experiments in geometry and string theory, we present a number of recent experiments on how various standard machine-learning algorithms can help with pattern detection across disciplines ranging from algebraic geometry to representation theory, to combinatorics, and to number theory.
11:10–11:20 am	Break
11:20 am–12:20 pm	Zhigang Yao (NUS/Harvard CMSA)	Title: Random fixed boundary flows: a twin sister of principal flow? Abstract: While classical statistics has dealt with observations which are real numbers or elements of a real vector space, nowadays many statistical problems of high interest in the sciences deal with the analysis of data which consist of more complex objects, taking values in spaces which are naturally not (Euclidean) vector spaces but which still feature some geometric structure. We consider fixed boundary flows with canonical interpretability as principal components extended on non-linear Riemannian manifolds. We aim to find a flow with fixed starting and ending points for noisy multivariate data sets lying near an embedded non-linear Riemannian manifold. In geometric terms, the fixed boundary flow is defined as an optimal curve that moves in the data cloud with two fixed end points. At any point on the flow, we maximize the inner product of the vector field, which is calculated locally, and the tangent vector of the flow. The rigorous definition is derived from an optimization problem using the intrinsic metric on the manifolds. For random data sets, we name the fixed boundary flow the random fixed boundary flow and analyze its limiting behavior under noisy observed samples. We show that the fixed boundary flow yields a concatenation of three segments, one of which coincides with the usual principal flow when the manifold is reduced to the Euclidean space. We further prove that the random fixed boundary flow converges largely to the population fixed boundary flow with high probability. Finally, we illustrate how the random fixed boundary flow can be used and interpreted, and demonstrate its application in real data sets.
12:20–1:50 pm	12:20 pm Group Photo followed by Lunch
	Afternoon Session Chair: Ezra Miller
1:50–2:50 pm	Stephen Yau (Tsinghua)	Title: Grand Biological Universe: The geometric construction of genome space and its applications Abstract: Imitating Hilbert, who proposed 23 problems in mathematics in 1900, the Defense Advanced Research Projects Agency (DARPA) proposed 23 problems in pure and applied mathematics in 2008. These problems will prove to be very influential for the development of mathematics in the 21st century. In the DARPA problems, we are asked to understand “The Geometry of Genome Space” (number 15) and “What are the Fundamental Laws of Biology” (number 23). Our convex hull principle for molecular biology states that the convex hull formed from Natural Vectors of one biological group does not intersect with the convex hull formed from any other biological group. This can be viewed as one of the Fundamental Laws of Biology that DARPA has been looking for since 2008. On the basis of the convex hull principle, we can construct the geometry of the genome space. A genome space consists of all known genomes of living beings and provides insights into their relationships. The genome space can be considered as the moduli space in mathematics, and genome sequences can be canonically embedded in a high-dimensional Euclidean space by means of Natural Vectors. In this space, a sequence is uniquely represented as a point by the nucleotide distribution information of the sequence. Similar sequences lie close together, and convex hulls of different groups are disjoint according to the convex hull principle. The geometry of the space is reflected in the similarity of sequences. The similarity of sequences can be measured by the Natural Metric, which is different from the metric induced from the ambient Euclidean space. As in our physical world, dark matter and dark energy play a crucial role in the construction of the correct Natural Metric in genome space. Our goal is to construct the genome spaces of seven kingdoms with Natural Metrics. These metrics are quite different in each genome space because different dark matter and dark energy may bend space-time as predicted by Einstein’s theory. As applications, we provide the first mathematical method to find undiscovered genome sequences. Our theory allows us to explore the phylogenetic relationships of biological sequences and where SARS-CoV-2 originated. It provides a novel geometric perspective to study molecular biology. It also gives an accurate way to compare sequences at large scale in real time.
2:50–3:00 pm	Break
3:00-4:00 pm	Jian Zhang (U of Kent)	Title: Cross-Validated Estimation for Penalised Skew Normal and Beyond Abstract: The skew normal model is often used in scientific research in bioscience, business and finance. The skew normal distribution extends the normal distribution by including one more parameter, called the shape parameter, which is used to gauge the magnitude of skewness. However, the skew normal is a singular model; when the shape parameter approaches zero, the corresponding Fisher information matrix fails to be invertible. This makes standard maximum likelihood estimation ill-posed, and the standard Bayesian information criterion may not work. Here, we address the problem by penalised likelihood estimation with the penalty coefficient determined by cross-validation. We show phase-transition behaviour of the cross-validated coefficient when the shape parameter is close to zero. We establish a large sample theory for the penalised MLE. We evaluate the performance of the proposed method in multiple anti-cancer drug studies.
4:00-4:10 pm	Break
4:10-5:10 pm	Chunming Zhang (UW Madison)	Title: New Statistical Learning Method for Independent Component Analysis with Applications to Brain EEG Abstract: Independent Component Analysis (ICA) is a widely used unsupervised learning method in medical imaging and signal processing, aimed at extracting non-Gaussian independent components (ICs) from multi-dimensional data. However, existing optimization methods often recover ICs from observed signals in unrealistic noiseless settings, with limited theoretical guarantees. We propose a new framework for "noisy ICA" that tackles this challenge from different perspectives, inspired by the desire to identify latent components resembling neural sources of cortical origin from electroencephalography (EEG) recordings of brain activity. Our approach not only directly estimates ICs but also enables the estimation of the unknown number of latent ICs. We have developed a computationally efficient algorithm that solves the non-convex and non-smooth optimization problem with guaranteed convergence. Furthermore, we prove that our estimator is consistent under mild conditions. Numerical simulations demonstrate that our approach outperforms existing methods. Finally, we apply our method to EEG data and show that it can reveal brain source signals with improved quantity and quality.

Sunday, July 30, 2023 (Beijing Time)

	Morning Session Chair: Zhigang Yao
9:00-10:00 am	Ezra Miller (Duke)	Title: Geometry of measures on stratified spaces Abstract: The central limit theorem (CLT) is commonly thought of as occurring on the real line, or in multivariate form on a real vector space. Motivated by statistical applications involving nonlinear data, such as angles or phylogenetic trees, the past twenty years have seen CLTs proved for Fréchet means on manifolds and on certain examples of singular spaces built from flat pieces glued together in combinatorial ways. These CLTs reduce to the linear case by tangent space approximation or by gluing. What should a CLT look like on general non-smooth spaces, where tangent spaces are not linear and no combinatorial gluing or flat pieces are available? Answering this question involves figuring out appropriate classes of spaces and measures, correct analogues of Gaussian random variables, how the geometry of the space (think "curvature") is reflected in the limiting distribution, and generally how the geometry of sampling from measures on singular spaces behaves. This talk provides an overview of these answers, concluding with gateways this investigation opens to further advances in geometry, probability, topology, and statistics. Joint work with Jonathan Mattingly and Do Tran.
10:00-10:10 am	Break
10:10-11:10 am	Scott Edwards (Harvard)	Title: Bayesian inference of genome-phenotype associations using phylogenies and genome sequence data Abstract: Connecting genotype and phenotype is of ongoing interest in evolutionary biology. Comparative genomics is now allowing us to map genes for traits using phylogenetic approaches (‘PhyloG2P’), which leverage phenotypically unique lineages or convergent evolution to provide surprisingly precise mapping of loci underlying evolutionarily labile traits. An example focusing on loss of flight in birds reveals a strong role for non-coding regulatory evolution in the origin of key adaptations of birds. I will introduce two statistical models for associating rates of genome change and change in binary or continuous traits using phylogenetic trees. Such models will improve our power to detect associations between genome and phenotype evolution. Additionally, we employ ATAC-seq and high-throughput enhancer screens to sift through hundreds of potential candidate enhancers whose evolution could influence traits associated with loss of flight.
11:10-11:20 am	Break
11:20 am-12:20 pm	Xinqi Gong (Renmin U)	Title: Geometry enhanced deep learning prediction of multibody protein interaction complex structures Abstract: Building on our dimer protein-protein docking methods, in the last several years we have designed new geometry-enhanced deep learning algorithms to predict the interface residue pairs in trimer, tetramer and even bigger multibody protein complex structures. Furthermore, we assembled a holistic procedure for multibody protein interaction complex structure prediction, which can produce results from monomer sequences. Trained and tested on an experimental dataset, our procedure shows promising advances and advantages.

12:20–1:50 pm	Lunch
	Afternoon Session Chair: Stephan Huckemann
1:50-2:50 pm	Stefan Sommer (U of Copenhagen)	Title: Diffusion means in geometric statistics Abstract: Analysis and statistics of shape variation and, more generally, manifold-valued data can be formulated probabilistically with geodesic distances between shapes exchanged with (-log)likelihoods. This leads to new statistics and estimation algorithms. One example is the notion of diffusion mean. In the talk, I will discuss the motivation behind and construction of diffusion means and discuss properties of the mean, including reduced smeariness when estimating diffusion variance together with the mean. I will connect this to most probable paths to data and algorithms for computing diffusion means, particularly bridge sampling algorithms. Finally, we will discuss computationally efficient approaches to mean estimation and links to conditioned diffusion processes in morphology and phylogenetic analysis.
2:50-3:00 pm	Break
3:00-4:00 pm	Yongdai Kim (SNU)	Title: On the use of the beta-VAE for learning probabilistic generative models Abstract: Probabilistic generative models have received much attention in AI, and various learning algorithms such as GAN (Generative Adversarial Networks), VAE (Variational Auto-Encoder) and diffusion models have been proposed. In this talk, I consider the VAE algorithm based on the binomial likelihood. I derive the convergence rate of the corresponding estimator under regularity conditions. An interesting message obtained from the convergence rate is that the beta-VAE, a variation of the original VAE, would be more appropriate for learning probabilistic generative models.
4:00-4:10 pm	Break
4:10-5:00 pm	Zaiwen Wen (Peking U)	Title: A Monte Carlo Policy Gradient Method with Local Search for Binary Optimization Abstract: Binary integer programming problems are ubiquitous in practical applications, including the MaxCut and Cheeger cut problems, MIMO detection, and MaxSAT. They are NP-hard due to their combinatorial structure. In this talk, we present a policy gradient method using deep Monte Carlo local search to ensure sufficient exploration in discrete spaces. The local search method is proved to improve the quality of integer solutions, and the policy gradient descent converges to stationary points in expectation. Numerical results show that this framework provides near-optimal solutions efficiently for quite a few binary optimization problems.

Monday, July 31, 2023 (Beijing Time)

	Morning Session Chair: Zhigang Yao
9:00-10:00 am	Hao Xu (Zhejiang U) and Kefeng Liu (UCLA)	Title: Frobenius algebra structure of statistical manifold Abstract: In information geometry, a statistical manifold is a Riemannian manifold (M,g) equipped with a totally symmetric (0,3)-tensor. We show that the tangent bundle of a statistical manifold has a Frobenius algebra structure if and only if the sectional K-curvature vanishes. This gives a statistical-geometric curvature interpretation of the WDVV equation, and thus solving for constant sectional K-curvature becomes a natural generalization of the WDVV equation. We also study natural statistical structures on the tangent bundle of a statistical manifold and give a new proof of Alekseevsky-Cortés' geometric construction of r-maps that associates a special real manifold to a special Kähler manifold.
10:00-10:10 am	Break
10:10-11:10 am	Xiangdong Li (AMSS/CAS)	Title: Optimal transport problem and random matrices theory Abstract: In 1776, G. Monge raised the optimal transport problem from the study of a military engineering problem. In 1939, L. Kantorovich reformulated the optimal transport problem and used it to study the optimal allocation problem. In 1975, Kantorovich shared the Nobel prize for economics with T. Koopmans "for their contribution on the optimal allocation of scarce resources". In 1992, Y. Brenier solved the optimal transport problem with quadratic distance cost function. Since then, the optimal transport problem has received a lot of attention from both theoretical and applied mathematics. In this talk, I will give a short survey of the history of the optimal transport problem, and then present our recent work on random matrix theory using the optimal transport approach.
11:10-11:20 am	Break
11:20 am-12:20 pm	Ke Deng (Tsinghua)	Title: TBA Abstract: TBA
12:20-1:50 pm	Lunch
	Afternoon Session Chair: Ezra Miller
1:50-2:50 pm	Stephan Huckemann (Göttingen)	Title: Statistical Challenges in Shape Prediction of Biomolecules Abstract: The three-dimensional/higher-order structure of biomolecules determines their functionality. While their primary structure is fairly easily accessible, reconstruction of the higher-order structure is costly. It often requires elaborate correction of atomic clashes, frequently not fully successful. Using RNA data, we describe a purely statistical method, learning error correction, drawing power from a two-scale approach: Our microscopic scale describes single suites by dihedral angles of individual atom bonds. Here, addressing the challenge of torus principal component analysis (PCA) leads to a fundamentally new approach to PCA building on principal nested spheres by Jung et al. (2012). Based on an observed relationship with a mesoscopic scale, landmarks describing several suites, we use Fréchet means for angular shape and size-and-shape, correcting within-suite-backbone-to-backbone clashes. We validate this method by comparison to reconstructions obtained from simulations approximating biophysical chemistry and illustrate its power by the RNA example of SARS-CoV-2. This is joint work with Benjamin Eltzner, Kanti V. Mardia and Henrik Wiechers.
2:50-3:00 pm	Break
3:00-4:00 pm	Louis Christie (Cambridge)	Title: Estimating Maximal Symmetries of Regression Functions Abstract: Often the objects we model have many symmetries, and estimators that utilise these symmetries have been shown empirically and theoretically to vastly outperform those that do not. However, if an incorrect symmetry is used, we introduce bias that persists asymptotically. Without a priori knowledge of suitable symmetries, we present a method to estimate the symmetries of non-parametric regression functions. Symmetry estimation is carried out using hypothesis testing for invariance strategically over the subgroup lattice of a search group G acting on the feature space. We show that the estimation of the unique maximal invariant subgroup of G generalises useful tools from linear dimension reduction to a nonlinear context. We demonstrate the performance of this estimator in synthetic settings and apply the methods to satellite measurements of the earth's magnetic field.
4:00-4:10 pm	Zhigang Yao	Closing Remarks