CV | Sidong Zhang

General Information

Full Name	Sidong Zhang
Languages	English, Mandarin

Education

Sep. 2020 - present
Doctor of Philosophy

UMass Amherst’s College of Information and Computer Sciences, Amherst, MA, USA
- Teaching assistant for graduate-level CS 589 (Machine Learning) and CS 651 (Optimization)
- Research assistant in the Information Fusion Lab, currently funded by an NIH R03 grant
Sep. 2018 - Jan. 2021

Master of Science

UMass Amherst’s College of Information and Computer Sciences, Amherst, MA, USA
Sep. 2018 - Jan. 2021

Bachelor of Engineering

Nanjing University, Software Institute, Nanjing, China

Experience

Feb. 2026 - Apr. 2026
Video2Reaction: Training Foundation Video Models to Predict Audience Reaction

UMass Amherst's College of Information and Computer Sciences
- Accepted to CVPR 2026 Workshop on Emerging Directions in Data for Multimodal Foundation Models
- Built on V2R, a novel dataset of audience emotional reactions to movie clips with distributional labels across 21 emotion categories
- Developed a cross-taxonomy emotion transfer pipeline that fine-tunes VLMs on V2R via word-token scoring, enabling zero-shot transfer to VCE with up to +0.269 top-3 accuracy gain over baseline
- Showed that V2R-pretrained VLMs (LLaVA-NeXT-Video-7B, Qwen2.5-VL-7B) match VideoMAE trained on the full 50K-sample target domain using only 1% of its training data
Feb. 2024 - Sep. 2024
Audio-Visual Speech Separation via Bottleneck Iterative Network

UMass Amherst's College of Information and Computer Sciences & Dolby Laboratories
- Accepted to ICML 2025 Workshop on Machine Learning for Audio
- Addressed audio-visual speech separation on noisy mixtures from NTCD-TIMIT and LRS3
- Proposed a multimodal fusion framework combining audio and visual modalities via a bottleneck iterative architecture
- Achieved state-of-the-art SI-SDRi while reducing training time by 50% compared to prior SOTA
- Project webpage: https://stonezhng.github.io/projects/avssbin/
Sep. 2020 - Jul. 2025
Longitudinal Multimodal Modeling for Alzheimer's Early Detection

UMass Amherst's College of Information and Computer Sciences
- In preparation for submission to Medical Image Analysis
- Showed that task-agnostic representations learned from T1-weighted brain MRIs via mutual information maximization capture knowledge predictive of cognitive decline scores
- Trained a CNN from scratch and fine-tuned BiomedCLIP as two representation learning backbones
- Designed a staged training strategy for an existing AD forecasting model, stabilizing performance on small validation splits
- Evaluated using micro F1 on CN/MCI/AD classification and MCI-to-AD transition timing precision across 100 repeated experiments for statistical robustness
- Task-agnostic MRI representations from both CNN and BiomedCLIP improved 2-year forecasting F1 and transition timing accuracy
Feb. 2026 - present
Understanding 3D Brain MRI Scans in Foundation Modeling

UMass Amherst's College of Information and Computer Sciences
- Ongoing work
- Existing foundation models (BiomedCLIP, MedGemma) use 2D medical image encoders, and their 3D extensions largely treat volumes as sequences of 2D slices, which loses inter-slice structural context
- Proposing native 3D conv patch embeddings to better capture volumetric structure in brain MRIs, where information is equally distributed across all three axes
- Fine-tuning BiomedCLIP and MedSigLIP image encoders with 3D conv patch embedding layers on brain MRI data
- Applying atlas-guided flexible patch sizing, assigning finer-grained patches to cognition-relevant regions (e.g., hippocampi) for cognitive state fine-tuning tasks
Jan. 2025 - May 2025
Encoding Domain Insights into Multi-modal Fusion: Improved Performance at the Cost of Robustness

UMass Amherst's College of Information and Computer Sciences
- Accepted to ICML 2025 Workshop on Methods and Opportunities at Small Scale
- Compared fusion architectures with and without domain knowledge embedded in the model structure
- Evaluated on MMSD for sarcasm detection and a synthetic task with controllable domain knowledge levels; added Gaussian noise to inputs to probe robustness
- Found that aligning fusion design with domain priors improves clean-data accuracy under limited data but significantly degrades robustness to noisy inputs
Feb. 2025 - May 2025
Video2Reaction: Mapping Video to Audience Reaction in the Wild

UMass Amherst's College of Information and Computer Sciences
- Under review at ECCV
- Curated a multimodal dataset mapping short movie clips to distributional emotional reactions of real-world viewers
- Collected movie clips and YouTube comments, extracting per-clip emotion distributions from viewer responses
- Deployed 3 LLMs to rephrase and independently label comment emotions; applied majority voting per comment and discarded samples with no consensus across all three models
- Established a comprehensive benchmark for distributional video-to-reaction modeling

Skills

Languages	English (Proficient), Mandarin (Native)
Programming languages	Java, Python, C, Lisp, Markdown, LaTeX
Development tools	PyTorch
