Advanced Studies Institute in Mathematics of Data Science & Machine Learning
Date: January 5-13, 2024
Venue: Urgench State University (Uzbekistan)
Contact: Zair Ibragimov (California State University, Fullerton)
E-mail: zibragimov@fullerton.edu
Key Lecturer: Guido Montufar (University of California, Los Angeles)
Overview
This series of lectures will cover introductory and advanced topics in data science and machine learning, with a focus on mathematical and statistical aspects. Knowledge outcomes:
- Understanding of the mathematical and statistical foundations of data science and machine learning.
- Trade-offs in machine learning among approximation, estimation, and optimization.
- Contemporary views of machine learning with overparametrized models, learning regimes, and algorithmic regularization.
- Parameter space and function space perspectives in learning.
- Quantitative analysis of parameter optimization and statistical generalization in overparametrized learning models, covering methods such as the neural tangent kernel (NTK) and neural network Gaussian processes (a minimal NTK sketch follows this list).
- Overview of learning modalities and architectures, such as graph neural networks, generative models, and transformers.
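To make the kernel viewpoint concrete, here is a minimal, illustrative sketch (my own, not from the lectures) of the empirical neural tangent kernel of a one-hidden-layer ReLU network; the width m and the 1/sqrt(m) scaling are standard assumptions.

    import numpy as np

    rng = np.random.default_rng(0)
    m, d = 2000, 3                       # hidden width and input dimension
    W = rng.normal(size=(m, d))          # hidden-layer weights
    a = rng.normal(size=m)               # output-layer weights

    def grad_f(x):
        # f(x) = a . relu(W x) / sqrt(m); gradient of f wrt all parameters
        pre = W @ x
        da = np.maximum(pre, 0.0) / np.sqrt(m)                     # df/da
        dW = ((a * (pre > 0))[:, None] * x[None, :]) / np.sqrt(m)  # df/dW
        return np.concatenate([da, dW.ravel()])

    x1, x2 = rng.normal(size=d), rng.normal(size=d)
    # Empirical NTK: inner product of parameter gradients at two inputs.
    print("NTK(x1, x2) =", grad_f(x1) @ grad_f(x2))  # concentrates as m grows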
Topics
- Learning from data
- Statistical learning theory
- Models, architectures, regimes
- Geometric techniques
- Optimization and algorithmic biases
Key Lecturer: Volodymyr Melnykov (University of Alabama)
Overview
Cluster analysis is one of the fundamental unsupervised machine learning problems; it aims to construct data groups in such a way that observations within each group are similar while data points in different groups are relatively distinct. Applications of cluster analysis can be found in image analysis, pattern recognition, and social network analysis. Model-based clustering is a popular clustering technique relying on the notion of finite mixture models. It assumes a one-to-one correspondence between data groups and mixture components and provides a highly interpretable and remarkably flexible approach to data partitioning. We will consider several recent developments in model-based clustering, supporting the discussion with various illustrative real-life applications.
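To make the mixture-model idea concrete, the following minimal sketch (illustrative; scikit-learn is an assumed tool, not the lecturer's own software) treats each Gaussian mixture component as one cluster:

    import numpy as np
    from sklearn.mixture import GaussianMixture

    rng = np.random.default_rng(0)
    # Two synthetic groups; each mixture component plays the role of one cluster.
    X = np.vstack([rng.normal(loc=0.0, scale=1.0, size=(100, 2)),
                   rng.normal(loc=4.0, scale=0.5, size=(100, 2))])

    gmm = GaussianMixture(n_components=2, covariance_type="full", random_state=0)
    labels = gmm.fit_predict(X)        # hard partition via maximum posterior probability
    posteriors = gmm.predict_proba(X)  # soft memberships make the partition interpretable
    print(gmm.bic(X))                  # BIC-type criteria guide the number of components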
Topics
- Introduction to unsupervised machine learning and model-based clustering
- Modeling matrix- and tensor-variate data
- Semi-supervised machine learning and model-based clustering
- Modeling time-dependent categorical sequences
- Finite mixture modeling in stylometry
Invited Speakers
Jamolbek Mattiev
Urgench State University, Uzbekistan
Title: Associative classification model based on clustering
Abstract: The size of collected datasets is increasing, and the number of rules generated from them is growing accordingly. Producing compact and accurate models has become one of the most important tasks in data mining. In this talk, we present a new associative classifier that utilizes agglomerative hierarchical clustering. Experimental evaluations show that the proposed method produces significantly more compact rule sets than classical rule-learning algorithms on larger datasets, while maintaining classification accuracy on those datasets.
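The abstract does not spell out the construction, but the hypothetical sketch below conveys the general idea of compressing a rule set with agglomerative hierarchical clustering: encode rule antecedents as binary item-indicator vectors, cluster them, and keep one representative rule per cluster.

    import numpy as np
    from scipy.spatial.distance import pdist
    from scipy.cluster.hierarchy import linkage, fcluster

    # Toy rule antecedents as binary item-indicator vectors (hypothetical data).
    rules = np.array([[1, 1, 0, 0, 1],
                      [1, 1, 0, 0, 0],
                      [0, 0, 1, 1, 0],
                      [0, 0, 1, 1, 1]], dtype=bool)

    d = pdist(rules, metric="jaccard")               # dissimilarity between antecedents
    Z = linkage(d, method="average")                 # agglomerative hierarchical clustering
    groups = fcluster(Z, t=2, criterion="maxclust")  # cut the dendrogram into 2 clusters
    print(groups)  # rules sharing a label can be merged into one representative rule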
Michael Murray
University of California, Los Angeles
Title: An Introduction to Benign Overfitting
Abstract: Conventional machine learning wisdom suggests that the generalization error of a complex model will typically be worse than that of a simpler model when both are trained to interpolate data. Indeed, the bias-variance trade-off implies that although choosing a complex model is advantageous in terms of approximation error, it comes at the price of an increased risk of overfitting. The traditional solution to managing this trade-off is to use some form of regularization, allowing the optimizer to select a predictor from a rich class of functions while at the same time encouraging it to choose one that is in some sense simple. However, in recent years it has been observed that many models, including deep neural networks, trained with minimal if any explicit regularization, can almost perfectly interpolate noisy data with nominal cost to their generalization performance. This phenomenon is referred to as benign or tempered overfitting, and there is now great interest in characterizing it mathematically. In this talk I'll give an introduction and motivation for the topic, describe some of the key results derived so far, and highlight open questions.
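As a toy numerical companion (my construction, not the speaker's), the snippet below fits an overparametrized linear model to noisy labels using the minimum-norm least-squares solution: the training data are interpolated exactly, yet the test error stays on the order of the noise variance rather than blowing up.

    import numpy as np

    rng = np.random.default_rng(1)
    n, d = 50, 500                              # far more parameters than samples
    X = rng.normal(size=(n, d)) / np.sqrt(d)
    w_star = np.zeros(d); w_star[:5] = 1.0      # simple ground truth
    y = X @ w_star + 0.1 * rng.normal(size=n)   # noisy labels

    w_hat = np.linalg.lstsq(X, y, rcond=None)[0]         # minimum-norm interpolator
    print("train MSE:", np.mean((X @ w_hat - y) ** 2))   # ~0: noise fitted exactly

    X_test = rng.normal(size=(1000, d)) / np.sqrt(d)
    # Comparable to the noise variance, not catastrophic.
    print("test MSE:", np.mean((X_test @ (w_hat - w_star)) ** 2))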
Michael Porter
University of Virginia
Title: Modeling contagion, excitation, and social influence with Hawkes point processes
Abstract: Many social and physical processes (e.g., crime, conflict, social media activity, financial markets, new product adoption, social network communication, earthquakes, neural spiking, disease spread) produce event point patterns that exhibit clustering. Hawkes, or self-exciting, point process models are a popular choice for modeling these clustering patterns, which can be driven both by exogenous influences and by endogenous forces like contagion/self-excitement. These models stipulate that each event can be triggered by past events, creating a branching structure that produces the endogenous clustering. The contagion effects are modeled by a shot-noise term which aggregates the influence of past events to temporarily increase the event rate following each event. This talk will introduce the Hawkes process and illustrate some extensions and uses of Hawkes models in three areas: modeling contagion in terrorist attacks, forecasting crime hotspots, and identifying social influence in Yelp restaurant reviews.
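For concreteness, here is a minimal simulation sketch (the exponential kernel and the parameter values are illustrative assumptions) of a Hawkes process via Ogata's thinning algorithm; the conditional intensity is lambda(t) = mu + sum over past events t_i of alpha * exp(-beta * (t - t_i)).

    import numpy as np

    mu, alpha, beta = 0.5, 0.8, 1.2   # baseline rate, excitation jump, decay (alpha/beta < 1)
    rng = np.random.default_rng(2)

    def intensity(t, events):
        # lambda(t) = mu + sum of exponentially decaying kicks from past events
        past = events[events < t]
        return mu + alpha * np.exp(-beta * (t - past)).sum()

    T, t, events = 100.0, 0.0, np.array([])
    while t < T:
        lam_bar = intensity(t, events) + alpha   # upper bound on the intensity after t
        t += rng.exponential(1.0 / lam_bar)      # candidate event time
        if t < T and rng.uniform() * lam_bar <= intensity(t, events):
            events = np.append(events, t)        # accept: the rate jumps, clusters emerge
    print(len(events), "events; the stationary rate is mu / (1 - alpha/beta) = 1.5")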
Rishi Sonthalia
University of California, Los Angeles
Title: From Classical Regression to the Modern Regime: Surprises for Linear Least Squares Problems
Abstract: Linear regression is a problem that has been extensively studied. However, modern machine learning has brought to light many new and exciting phenomena due to overparameterization. In this talk, I briefly introduce the new phenomena observed in recent years. Then, building on this, I present recent theoretical work on linear denoising. Despite the importance of denoising in modern machine learning and ample empirical work on supervised denoising, its theoretical understanding is still relatively scarce. One concern about studying supervised denoising is that one might not always have noiseless training data from the test distribution. It is more reasonable to assume access to noiseless training data from a dataset different from the test dataset. Motivated by this, we study supervised denoising and noisy-input regression under distribution shift. We add three considerations to increase the applicability of our theoretical insights to real-life data and modern machine learning. First, we assume that our data matrices are low-rank. Second, we drop independence assumptions on our data. Third, the rise in computational power and the dimensionality of data have made it essential to study non-classical learning regimes; thus, we work in the non-classical proportional regime, where the data dimension $d$ and the number of samples $N$ grow with $d/N = c + o(1)$. For this setting, we derive general test error expressions for both denoising and noisy-input regression and study when overfitting the noise is benign, tempered, or catastrophic. We show that the test error exhibits double descent under general distribution shifts, providing insights for data augmentation and the role of noise as an implicit regularizer. We also perform experiments using real-life data, matching the theoretical predictions with under 1% MSE for low-rank data.
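The double-descent shape is easy to reproduce in a toy isotropic setting (my illustration; the talk's low-rank, dependent-data setting is more refined): sweep the ratio d/N and watch the test error of the minimum-norm least-squares fit peak at the interpolation threshold d = N.

    import numpy as np

    rng = np.random.default_rng(3)
    N, trials = 100, 20
    for d in [20, 50, 90, 100, 110, 200, 500]:
        errs = []
        for _ in range(trials):
            w = rng.normal(size=d) / np.sqrt(d)           # ground-truth coefficients
            X = rng.normal(size=(N, d))
            y = X @ w + 0.5 * rng.normal(size=N)          # noisy labels
            w_hat = np.linalg.lstsq(X, y, rcond=None)[0]  # min-norm fit in both regimes
            X_te = rng.normal(size=(500, d))
            errs.append(np.mean((X_te @ (w_hat - w)) ** 2))
        print(f"d/N = {d / N:4.1f}  test MSE = {np.mean(errs):.3f}")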
Angelica Torres
Max Planck Institute, Germany
Title: Algebraic Geometry meets Structure from Motion
Abstract: The Structure from Motion (SfM) pipeline aims to create a 3D model of a scene using two-dimensional images as input. The process has four main stages: feature detection, feature matching, camera pose estimation, and triangulation. In the first step, features such as points and lines are detected in the images; they are then matched to features appearing in other images. After the matching stage, the actual images are forgotten, and the data that remain are tuples of points or lines that are believed to come from the same world object. This is geometric data, hence the toolbox of algebraic geometry can be used to estimate the camera positions and to triangulate the objects in the scene. In this talk I will introduce the SfM pipeline and present some of the algebraic varieties that arise when the pinhole camera model is assumed. During the talk we will highlight how some properties of the varieties translate into properties of the data and how this can affect the image reconstruction process.
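As a small taste of the triangulation stage, this sketch (illustrative: identity calibration and hand-picked camera poses are assumptions) recovers a 3D point from its projections in two pinhole cameras with the linear DLT method, solving the homogeneous system by SVD.

    import numpy as np

    # Hypothetical pinhole cameras P = K [R | t]; here K = I for simplicity.
    P1 = np.hstack([np.eye(3), np.zeros((3, 1))])                  # camera at the origin
    P2 = np.hstack([np.eye(3), np.array([[-1.0], [0.0], [0.0]])])  # translated camera

    X_true = np.array([0.2, -0.1, 5.0, 1.0])  # homogeneous world point
    x1 = P1 @ X_true; x1 /= x1[2]             # matched image points (no noise here)
    x2 = P2 @ X_true; x2 /= x2[2]

    # DLT: each view contributes two linear constraints on X; solve A X = 0 by SVD.
    A = np.vstack([x1[0] * P1[2] - P1[0], x1[1] * P1[2] - P1[1],
                   x2[0] * P2[2] - P2[0], x2[1] * P2[2] - P2[1]])
    X_hat = np.linalg.svd(A)[2][-1]           # right singular vector of least singular value
    print(X_hat / X_hat[3])                   # matches X_true up to scale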
Program Schedule
January 5, 2024
08:30 – 10:00 Arrival/Hotel Check-in
10:00 – 12:30 Rest
12:30 – 14:00 Lunch
14:00 – 14:20 Registration
14:20 – 14:50 Opening Remarks
Bakhrom Abdullaev - Rector, Urgench State University
Guido Montufar - University of California, Los Angeles
Volodymyr Melnykov - University of Alabama
Zair Ibragimov - California State University, Fullerton
15:00 – 15:50 Speaker: Guido Montufar
Title: Learning from data, I
16:00 – 16:50 Speaker: Volodymyr Melnykov
Title: Introduction to unsupervised machine learning and model-based clustering, I
17:00 – 17:50 Speaker: Jamolbek Mattiev
Title: Associative classification model based on clustering
18:30 – 21:30 Welcome Reception and Dinner
January 6, 2024
09:00 – 09:50 Speaker: Guido Montufar
Title: Learning from data, II
10:00 – 10:50 Speaker: Guido Montufar
Title: Statistical learning theory, I
11:00 – 11:30 Coffee Break
11:30 – 12:20 Recitation Session (Moderator: Kedar Karhadkar)
12:30 – 14:00 Lunch
14:00 – 16:00 Free Time/Rest (Faculty Housing)
16:00 – 16:50 Speaker: Volodymyr Melnykov
Title: Introduction to unsupervised machine learning and model-based clustering, II
17:00 – 17:50 Speaker: Volodymyr Melnykov
Title: Advances in model-based clustering
18:30 – 21:30 Dinner
January 7, 2024
09:00 – 09:50 Speaker: Guido Montufar
Title: Statistical learning theory, II
10:00 – 10:50 Speaker: Guido Montufar
Title: Models, architectures, regimes
11:00 – 11:30 Coffee Break
11:30 – 12:20 Recitation Session (Moderator: Kedar Karhadkar)
12:30 – 14:00 Lunch
14:00 – 16:00 Free Time/Rest (Faculty Housing)
16:00 – 16:50 Speaker: Volodymyr Melnykov
Title: Modeling matrix- and tensor-variate data
17:00 – 17:50 Speaker: Volodymyr Melnykov
Title: Semi-supervised machine learning and model-based clustering
18:30 – 21:30 Dinner
January 8, 2024 (Khorazm Mamun Academy in Khiva)
09:00 – 09:50 Speaker: Rishi Sonthalia
Title: From Classical Regression to the Modern Regime: Surprises for Linear Least Squares Problems
10:00 – 10:50 Speaker: Michael Murray
Title: An Introduction to Benign Overfitting
11:00 – 11:50 Guided Tour of Mamun Academy
12:00 – 12:50 Speaker: Michael Porter
Title: Modeling contagion, excitation, and social influence with Hawkes point processes
13:30 – 14:30 Lunch (Ichan Kala)
14:30 – 18:00 Guided Tour/Free Time/Shopping in Ichan Kala
18:30 – 21:30 Dinner (Ichan Kala)
January 9, 2024
09:00 – 09:50 Speaker: Guido Montufar
Title: Geometric techniques
10:00 – 10:50 Speaker: Guido Montufar
Title: Optimization and algorithmic biases
11:00 – 11:50 Speaker: Angelica Torres
Title: Algebraic Geometry meets Structure from Motion
13:00 – 14:00 Lunch
14:00 – 16:00 Free Time/Rest (Faculty Housing)
16:20 – 17:10 Speaker: Volodymyr Melnykov
Title: Modeling time-dependent categorical sequences
17:30 – 18:20 Speaker: Volodymyr Melnykov
Title: Finite mixture modeling in stylometry
19:00 – 22:00 Banquet in Urgench
List of Student Participants
- Ryan Anderson (University of California, Los Angeles)
- Navya Annapareddy (University of Virginia)
- Shoira Atanazarova (Romanovski Institute of Mathematics)
- Oygul Babajanova (Romanovski Institute of Mathematics)
- Sardor Bekchanov (Urgench State University)
- Joshua Berlinski (Iowa State University)
- Hao Duan (University of California, Los Angeles)
- Adriana Duncan (University of Texas at Austin)
- Isabella Foes (University of Alabama)
- Juliann Geraci (University of Nebraska-Lincoln)
- Chase Holcombe (University of Alabama)
- Sarvar Iskandarov (Urgench State University)
- Kedar Karhadkar (University of California, Los Angeles)
- Adrienne Kinney (University of Arizona)
- Elmurod Kuriyozov (Urgench State University)
- Sarah McWaid (Sonoma State University)
- Sean Mulherin (University of California, Los Angeles)
- Abhijeet Mulgund (University of Illinois at Chicago)
- Alexander Myers (University of Nebraska-Lincoln)
- Klara Olloberganova (Novosibirsk State University)
- Mahliyo Qodirova (Novosibirsk State University)
- Ilhom Rahimov (Urgench State University)
- Ulugbek Salaev (Urgench State University)
- Raj Sawhney (Claremont Graduate University)
- Nodirbek Shavkatov (Urgench State University)
- Jonah Smith (University of Kentucky)
- Ogabek Sobirov (Urgench State University)
- Shakhnoza Takhirova (Bowling Green State University)
- Spencer Wadsworth (Iowa State University)
- Sheila Whitman (University of Arizona)