Advanced Studies Institute in Mathematics of Data Science & Machine Learning
Date: January 5-13, 2024
Venue: Urgench State University (Uzbekistan)
Contact: Zair Ibragimov (California State University, Fullerton)
E-mail: zibragimov@fullerton.edu
Key Lecturer: Guido Montufar (University of California, Los Angeles)
Overview
This series of lectures will cover introductory and advanced topics in data science and machine learning, with a focus on mathematical and statistical aspects. Knowledge outcomes:
- Understanding of the mathematical and statistical foundations of data science and machine learning.
- Trade-offs in machine learning among approximation, estimation, and optimization.
- Contemporary views of machine learning with overparametrized models, learning regimes, and algorithmic regularization.
- Parameter space and function space perspectives in learning.
- Quantitative analysis of parameter optimization and statistical generalization in overparametrized learning models, covering methods such as the neural tangent kernel (NTK) and neural network Gaussian processes (a minimal NTK sketch follows this list).
- Overview of learning modalities and architectures, such as graph neural networks, generative models, and transformers.
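To make the kernel viewpoint concrete, here is a minimal, illustrative sketch (my own, not from the lectures) of the empirical neural tangent kernel of a one-hidden-layer ReLU network; the width m and the 1/sqrt(m) scaling are standard assumptions.

    import numpy as np

    rng = np.random.default_rng(0)
    m, d = 2000, 3                       # hidden width and input dimension
    W = rng.normal(size=(m, d))          # hidden-layer weights
    a = rng.normal(size=m)               # output-layer weights

    def grad_f(x):
        # f(x) = a . relu(W x) / sqrt(m); gradient of f wrt all parameters
        pre = W @ x
        da = np.maximum(pre, 0.0) / np.sqrt(m)                     # df/da
        dW = ((a * (pre > 0))[:, None] * x[None, :]) / np.sqrt(m)  # df/dW
        return np.concatenate([da, dW.ravel()])

    x1, x2 = rng.normal(size=d), rng.normal(size=d)
    # Empirical NTK: inner product of parameter gradients at two inputs.
    print("NTK(x1, x2) =", grad_f(x1) @ grad_f(x2))  # concentrates as m grows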
Topics
- Learning from data
- Statistical learning theory
- Models, architectures, regimes
- Geometric techniques
- Optimization and algorithmic biases
Key Lecturer: Volodymyr Melnykov (University of Alabama)
Overview
Cluster analysis is one of the fundamental unsupervised machine learning problems; it aims to construct data groups in such a way that observations within each group are similar while data points in different groups are relatively distinct. Applications of cluster analysis can be found in image analysis, pattern recognition, and social network analysis. Model-based clustering is a popular clustering technique relying on the notion of finite mixture models. It assumes a one-to-one correspondence between data groups and mixture components and provides a highly interpretable and remarkably flexible approach to data partitioning. We will consider several recent developments in model-based clustering, supporting the discussion with various illustrative real-life applications.
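To make the mixture-model idea concrete, the following minimal sketch (illustrative; scikit-learn is an assumed tool, not the lecturer's own software) treats each Gaussian mixture component as one cluster:

    import numpy as np
    from sklearn.mixture import GaussianMixture

    rng = np.random.default_rng(0)
    # Two synthetic groups; each mixture component plays the role of one cluster.
    X = np.vstack([rng.normal(loc=0.0, scale=1.0, size=(100, 2)),
                   rng.normal(loc=4.0, scale=0.5, size=(100, 2))])

    gmm = GaussianMixture(n_components=2, covariance_type="full", random_state=0)
    labels = gmm.fit_predict(X)        # hard partition via maximum posterior probability
    posteriors = gmm.predict_proba(X)  # soft memberships make the partition interpretable
    print(gmm.bic(X))                  # BIC-type criteria guide the number of components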
Topics
- Introduction to unsupervised machine learning and model-based clustering
- Modeling matrix- and tensor-variate data
- Semi-supervised machine learning and model-based clustering
- Modeling time-dependent categorical sequences
- Finite mixture modeling in stylometry
Invited Speakers
Jamolbek Mattiev
Urgench State University, Uzbekistan
Title: Associative classification model based on clustering
Abstract: The size of collected datasets is increasing, and the number of rules generated from them is growing accordingly. Producing compact and accurate models has become one of the most important tasks in data mining. In this talk, we present a new associative classifier that utilizes agglomerative hierarchical clustering. Experimental evaluations show that the proposed method produces significantly more compact rule sets than classical rule-learning algorithms on larger datasets, while maintaining classification accuracy on those datasets.
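The abstract does not spell out the construction, but the hypothetical sketch below conveys the general idea of compressing a rule set with agglomerative hierarchical clustering: encode rule antecedents as binary item-indicator vectors, cluster them, and keep one representative rule per cluster.

    import numpy as np
    from scipy.spatial.distance import pdist
    from scipy.cluster.hierarchy import linkage, fcluster

    # Toy rule antecedents as binary item-indicator vectors (hypothetical data).
    rules = np.array([[1, 1, 0, 0, 1],
                      [1, 1, 0, 0, 0],
                      [0, 0, 1, 1, 0],
                      [0, 0, 1, 1, 1]], dtype=bool)

    d = pdist(rules, metric="jaccard")               # dissimilarity between antecedents
    Z = linkage(d, method="average")                 # agglomerative hierarchical clustering
    groups = fcluster(Z, t=2, criterion="maxclust")  # cut the dendrogram into 2 clusters
    print(groups)  # rules sharing a label can be merged into one representative rule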
Michael Murray
University of California, Los Angeles
Title: An Introduction to Benign Overfitting
Abstract: Conventional machine learning wisdom suggests that the generalization error of a complex model will typically be worse than that of a simpler model when both are trained to interpolate data. Indeed, the bias-variance trade-off implies that although choosing a complex model is advantageous in terms of approximation error, it comes at the price of an increased risk of overfitting. The traditional solution to managing this trade-off is to use some form of regularization, allowing the optimizer to select a predictor from a rich class of functions while at the same time encouraging it to choose one that is in some sense simple. However, in recent years it has been observed that many models, including deep neural networks, trained with minimal if any explicit regularization, can almost perfectly interpolate noisy data with nominal cost to their generalization performance. This phenomenon is referred to as benign or tempered overfitting, and there is now great interest in characterizing it mathematically. In this talk I'll give an introduction and motivation for the topic, describe some of the key results derived so far, and highlight open questions.
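As a toy numerical companion (my construction, not the speaker's), the snippet below fits an overparametrized linear model to noisy labels using the minimum-norm least-squares solution: the training data are interpolated exactly, yet the test error stays on the order of the noise variance rather than blowing up.

    import numpy as np

    rng = np.random.default_rng(1)
    n, d = 50, 500                              # far more parameters than samples
    X = rng.normal(size=(n, d)) / np.sqrt(d)
    w_star = np.zeros(d); w_star[:5] = 1.0      # simple ground truth
    y = X @ w_star + 0.1 * rng.normal(size=n)   # noisy labels

    w_hat = np.linalg.lstsq(X, y, rcond=None)[0]         # minimum-norm interpolator
    print("train MSE:", np.mean((X @ w_hat - y) ** 2))   # ~0: noise fitted exactly

    X_test = rng.normal(size=(1000, d)) / np.sqrt(d)
    # Comparable to the noise variance, not catastrophic.
    print("test MSE:", np.mean((X_test @ (w_hat - w_star)) ** 2))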
Michael Porter
University of Virginia
Title: Modeling contagion, excitation, and social influence with Hawkes point processes
Abstract: Many social and physical processes (e.g., crime, conflict, social media activity, financial markets, new product adoption, social network communication, earthquakes, neural spiking, disease spread) produce event point patterns that exhibit clustering. Hawkes, or self-exciting, point process models are a popular choice for modeling these clustering patterns, which can be driven both by exogenous influences and by endogenous forces like contagion/self-excitement. These models stipulate that each event can be triggered by past events, creating a branching structure that produces the endogenous clustering. The contagion effects are modeled by a shot-noise term which aggregates the influence of past events to temporarily increase the event rate following each event. This talk will introduce the Hawkes process and illustrate some extensions and uses of Hawkes models in three areas: modeling contagion in terrorist attacks, forecasting crime hotspots, and identifying social influence in Yelp restaurant reviews.
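For concreteness, here is a minimal simulation sketch (the exponential kernel and the parameter values are illustrative assumptions) of a Hawkes process via Ogata's thinning algorithm; the conditional intensity is lambda(t) = mu + sum over past events t_i of alpha * exp(-beta * (t - t_i)).

    import numpy as np

    mu, alpha, beta = 0.5, 0.8, 1.2   # baseline rate, excitation jump, decay (alpha/beta < 1)
    rng = np.random.default_rng(2)

    def intensity(t, events):
        # lambda(t) = mu + sum of exponentially decaying kicks from past events
        past = events[events < t]
        return mu + alpha * np.exp(-beta * (t - past)).sum()

    T, t, events = 100.0, 0.0, np.array([])
    while t < T:
        lam_bar = intensity(t, events) + alpha   # upper bound on the intensity after t
        t += rng.exponential(1.0 / lam_bar)      # candidate event time
        if t < T and rng.uniform() * lam_bar <= intensity(t, events):
            events = np.append(events, t)        # accept: the rate jumps, clusters emerge
    print(len(events), "events; the stationary rate is mu / (1 - alpha/beta) = 1.5")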
Rishi Sonthalia
University of California, Los Angeles
Title: From Classical Regression to the Modern Regime: Surprises for Linear Least Squares Problems
Abstract: Linear regression is a problem that has been extensively studied. However, modern machine learning has brought to light many new and exciting phenomena due to overparameterization. In this talk, I briefly introduce the new phenomena observed in recent years. Then, building on this, I present recent theoretical work on linear denoising. Despite the importance of denoising in modern machine learning and ample empirical work on supervised denoising, its theoretical understanding is still relatively scarce. One concern about studying supervised denoising is that one might not always have noiseless training data from the test distribution. It is more reasonable to assume access to noiseless training data from a dataset different from the test dataset. Motivated by this, we study supervised denoising and noisy-input regression under distribution shift. We add three considerations to increase the applicability of our theoretical insights to real-life data and modern machine learning. First, we assume that our data matrices are low-rank. Second, we drop independence assumptions on our data. Third, the rise in computational power and the dimensionality of data have made it essential to study non-classical learning regimes; thus, we work in the non-classical proportional regime, where the data dimension $d$ and the number of samples $N$ grow with $d/N = c + o(1)$. For this setting, we derive general test error expressions for both denoising and noisy-input regression and study when overfitting the noise is benign, tempered, or catastrophic. We show that the test error exhibits double descent under general distribution shifts, providing insights for data augmentation and the role of noise as an implicit regularizer. We also perform experiments using real-life data, matching the theoretical predictions with under 1% MSE for low-rank data.
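The double-descent shape is easy to reproduce in a toy isotropic setting (my illustration; the talk's low-rank, dependent-data setting is more refined): sweep the ratio d/N and watch the test error of the minimum-norm least-squares fit peak at the interpolation threshold d = N.

    import numpy as np

    rng = np.random.default_rng(3)
    N, trials = 100, 20
    for d in [20, 50, 90, 100, 110, 200, 500]:
        errs = []
        for _ in range(trials):
            w = rng.normal(size=d) / np.sqrt(d)           # ground-truth coefficients
            X = rng.normal(size=(N, d))
            y = X @ w + 0.5 * rng.normal(size=N)          # noisy labels
            w_hat = np.linalg.lstsq(X, y, rcond=None)[0]  # min-norm fit in both regimes
            X_te = rng.normal(size=(500, d))
            errs.append(np.mean((X_te @ (w_hat - w)) ** 2))
        print(f"d/N = {d / N:4.1f}  test MSE = {np.mean(errs):.3f}")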
Angelica Torres
Max Planck Institute, Germany
Title: Algebraic Geometry meets Structure from Motion
Abstract: The Structure from Motion (SfM) pipeline aims to create a 3D model of a scene using two-dimensional images as input. The process has four main stages: feature detection, feature matching, camera pose estimation, and triangulation. In the first step, features such as points and lines are detected in the images; they are then matched to features appearing in other images. After the matching stage, the actual images are forgotten, and the data that remain are tuples of points or lines that are believed to come from the same world object. This is geometric data, hence the toolbox of algebraic geometry can be used to estimate the camera positions and to triangulate the objects in the scene. In this talk I will introduce the SfM pipeline and present some of the algebraic varieties that arise when the pinhole camera model is assumed. During the talk we will highlight how some properties of the varieties translate into properties of the data and how this can affect the image reconstruction process.
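As a small taste of the triangulation stage, this sketch (illustrative: identity calibration and hand-picked camera poses are assumptions) recovers a 3D point from its projections in two pinhole cameras with the linear DLT method, solving the homogeneous system by SVD.

    import numpy as np

    # Hypothetical pinhole cameras P = K [R | t]; here K = I for simplicity.
    P1 = np.hstack([np.eye(3), np.zeros((3, 1))])                  # camera at the origin
    P2 = np.hstack([np.eye(3), np.array([[-1.0], [0.0], [0.0]])])  # translated camera

    X_true = np.array([0.2, -0.1, 5.0, 1.0])  # homogeneous world point
    x1 = P1 @ X_true; x1 /= x1[2]             # matched image points (no noise here)
    x2 = P2 @ X_true; x2 /= x2[2]

    # DLT: each view contributes two linear constraints on X; solve A X = 0 by SVD.
    A = np.vstack([x1[0] * P1[2] - P1[0], x1[1] * P1[2] - P1[1],
                   x2[0] * P2[2] - P2[0], x2[1] * P2[2] - P2[1]])
    X_hat = np.linalg.svd(A)[2][-1]           # right singular vector of least singular value
    print(X_hat / X_hat[3])                   # matches X_true up to scale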
Program Schedule
January 5, 2024
08:30 – 10:00 Arrival/Hotel Check-in
10:00 – 12:30 Rest
12:30 – 14:00 Lunch
14:00 – 14:20 Registration
14:20 – 14:50 Opening Remarks
Bakhrom Abdullaev - Rector, Urgench State University
Guido Montufar - University of California, Los Angeles
Volodymyr Melnykov - University of Alabama
Zair Ibragimov - California State University, Fullerton
15:00 – 15:50 Speaker: Guido Montufar
Title: Learning from data, I
16:00 – 16:50 Speaker: Volodymyr Melnykov
Title: Introduction to unsupervised machine learning and model-based clustering, I
17:00 – 17:50 Speaker: Jamolbek Mattiev
Title: Associative classification model based on clustering
18:30 – 21:30 Welcome Reception and Dinner
January 6, 2024
09:00 – 09:50 Speaker: Guido Montufar
Title: Learning from data, II
10:00 – 10:50 Speaker: Guido Montufar
Title: Statistical learning theory, I
11:00 – 11:30 Coffee Break
11:30 – 12:20 Recitation Session (Moderator: Kedar Karhadkar)
12:30 – 14:00 Lunch
14:00 – 16:00 Free Time/Rest (Faculty Housing)
16:00 – 16:50 Speaker: Volodymyr Melnykov
Title: Introduction to unsupervised machine learning and model-based clustering, II
17:00 – 17:50 Speaker: Volodymyr Melnykov
Title: Advances in model-based clustering
18:30 – 21:30 Dinner
January 7, 2024
09:00 – 09:50 Speaker: Guido Montufar
Title: Statistical learning theory, II
10:00 – 10:50 Speaker: Guido Montufar
Title: Models, architectures, regimes
11:00 – 11:30 Coffee Break
11:30 – 12:20 Recitation Session (Moderator: Kedar Karhadkar)
12:30 – 14:00 Lunch
14:00 – 16:00 Free Time/Rest (Faculty Housing)
16:00 – 16:50 Speaker: Volodymyr Melnykov
Title: Modeling matrix- and tensor-variate data
17:00 – 17:50 Speaker: Volodymyr Melnykov
Title: Semi-supervised machine learning and model-based clustering
18:30 – 21:30 Dinner
January 8, 2024 (Khorazm Mamun Academy in Khiva)
09:00 – 09:50 Speaker: Rishi Sonthalia
Title: From Classical Regression to the Modern Regime: Surprises for Linear Least Squares Problems
10:00 – 10:50 Speaker: Michael Murray
Title: An Introduction to Benign Overfitting
11:00 – 11:50 Guided Tour of Mamun Academy
12:00 – 12:50 Speaker: Michael Porter
Title: Modeling contagion, excitation, and social influence with Hawkes point processes
13:30 – 14:30 Lunch (Ichan Kala)
14:30 – 18:00 Guided Tour/Free Time/Shopping in Ichan Kala
18:30 – 21:30 Dinner (Ichan Kala)
January 9, 2024
09:00 – 09:50 Speaker: Guido Montufar
Title: Geometric techniques
10:00 – 10:50 Speaker: Guido Montufar
Title: Optimization and algorithmic biases
11:00 – 11:50 Speaker: Angelica Torres
Title: Algebraic Geometry meets Structure from Motion
13:00 – 14:00 Lunch
14:00 – 16:00 Free Time/Rest (Faculty Housing)
16:20 – 17:10 Speaker: Volodymyr Melnykov
Title: Modeling time-dependent categorical sequences
17:30 – 18:20 Speaker: Volodymyr Melnykov
Title: Finite mixture modeling in stylometry
19:00 – 22:00 Banquet in Urgench
List of Student Participants
- Ryan Anderson (University of California, Los Angeles)
- Navya Annapareddy (University of Virginia)
- Shoira Atanazarova (Romanovski Institute of Mathematics)
- Oygul Babajanova (Romanovski Institute of Mathematics)
- Sardor Bekchanov (Urgench State University)
- Joshua Berlinski (Iowa State University)
- Hao Duan (University of California, Los Angeles)
- Adriana Duncan (University of Texas at Austin)
- Isabella Foes (University of Alabama)
- Juliann Geraci (University of Nebraska-Lincoln)
- Chase Holcombe (University of Alabama)
- Sarvar Iskandarov (Urgench State University)
- Kedar Karhadkar (University of California, Los Angeles)
- Adrienne Kinney (University of Arizona)
- Elmurod Kuriyozov (Urgench State University)
- Sarah McWaid (Sonoma State University)
- Sean Mulherin (University of California, Los Angeles)
- Abhijeet Mulgund (University of Illinois at Chicago)
- Alexander Myers (University of Nebraska-Lincoln)
- Klara Olloberganova (Novosibirsk State University)
- Mahliyo Qodirova (Novosibirsk State University)
- Ilhom Rahimov (Urgench State University)
- Ulugbek Salaev (Urgench State University)
- Raj Sawhney (Claremont Graduate University)
- Nodirbek Shavkatov (Urgench State University)
- Jonah Smith (University of Kentucky)
- Ogabek Sobirov (Urgench State University)
- Shakhnoza Takhirova (Bowling Green State University)
- Spencer Wadsworth (Iowa State University)
- Sheila Whitman (University of Arizona)