CV
General Information
| Full Name | Sidong Zhang |
| Languages | English, Mandarin |
Education
-
Sep. 2020 - present Doctor of Philosophy
UMass Amherst’s College of Information and Computer Sciences, Amherst, MA, USA
- Teaching assistant for graduate-level CS 589 (Machine Learning) and CS 651 (Optimization)
- Research assistant in the Information Fusion lab, currently funded by an NIH R03 grant
-
Sep. 2018 - Jan. 2021 Master of Science
UMass Amherst’s College of Information and Computer Sciences, Amherst, MA, USA
-
Sep. 2018 - Jan. 2021 Bachelor of Engineering
Nanjing University, Software Institute, Nanjing, China
Experience
-
Feb. 2026 - Apr. 2026 Video2Reaction: Training Foundation Video Models to Predict Audience Reaction
UMass Amherst's College of Information and Computer Sciences
- Accepted to CVPR 2026 Workshop on Emerging Directions in Data for Multimodal Foundation Models
- Built on V2R, a novel dataset of audience emotional reactions to movie clips with distributional labels across 21 emotion categories
- Developed a cross-taxonomy emotion transfer pipeline that fine-tunes VLMs on V2R via word-token scoring, enabling zero-shot transfer to VCE with up to +0.269 top-3 accuracy gain over baseline
- Showed that V2R-pretrained VLMs (LLaVA-NeXT-Video-7B, Qwen2.5-VL-7B) match VideoMAE trained on the full 50K-sample target domain using only 1% of its training data
-
Feb. 2024 - Sep. 2024 Audio-Visual Speech Separation via Bottleneck Iterative Network
UMass Amherst's College of Information and Computer Sciences & Dolby Laboratories
- Accepted to ICML 2025 Workshop on Machine Learning for Audio
- Addressed audio-visual speech separation on noisy mixtures from NTCD-TIMIT and LRS3
- Proposed a multimodal fusion framework combining audio and visual modalities via a bottleneck iterative architecture
- Achieved state-of-the-art SI-SDRi while reducing training time by 50% compared to prior SOTA
- Project webpage: https://stonezhng.github.io/projects/avssbin/
-
Sep. 2020 - July 2025 Longitudinal Multimodal Modeling for Alzheimer's Early Detection
UMass Amherst's College of Information and Computer Sciences
- Submitting to Medical Image Analysis
- Showed that task-agnostic representations learned from T1-weighted brain MRIs via mutual information maximization capture knowledge predictive of cognitive decline scores
- Trained a CNN from scratch and fine-tuned BiomedCLIP as two representation learning backbones
- Designed a staged training strategy for an existing AD forecasting model, stabilizing performance on small validation splits
- Evaluated using micro F1 on CN/MCI/AD classification and MCI-to-AD transition timing precision across 100 repeated experiments for statistical robustness
- Task-agnostic MRI representations from both CNN and BiomedCLIP improved 2-year forecasting F1 and transition timing accuracy
-
Feb. 2026 - present Understanding 3D Brain MRI Scans in Foundation Modeling
UMass Amherst's College of Information and Computer Sciences
- Ongoing work
- Existing foundation models (BiomedCLIP, MedGemma) use 2D medical image encoders, with 3D extensions largely treating volumes as sequences of 2D slices — losing inter-slice structural context
- Proposing native 3D conv patch embeddings to better capture volumetric structure in brain MRIs, where information is equally distributed across all three axes
- Fine-tuning BiomedCLIP and MedSigLIP image encoders with 3D conv patch embedding layers on brain MRI data
- Applying atlas-guided flexible patch sizing, assigning finer-grained patches to cognition-relevant regions (e.g., hippocampi) for cognitive state fine-tuning tasks
-
Jan. 2025 - May 2025 Encoding Domain Insights into Multi-modal Fusion: Improved Performance at the Cost of Robustness
UMass Amherst's College of Information and Computer Sciences
- Accepted to ICML 2025 Workshop on Methods and Opportunities at Small Scale
- Compared fusion architectures with and without domain knowledge embedded in the model structure
- Evaluated on MMSD for sarcasm detection and a synthetic task with controllable domain knowledge levels; added Gaussian noise to inputs to probe robustness
- Found that aligning fusion design with domain priors improves clean-data accuracy under limited data but significantly degrades robustness to noisy inputs
-
Feb. 2025 - May 2025 Video2Reaction: Mapping Video to Audience Reaction in the Wild
UMass Amherst's College of Information and Computer Sciences
- Under review at ECCV
- Curated a multimodal dataset mapping short movie clips to distributional emotional reactions of real-world viewers
- Collected movie clips and YouTube comments, extracting per-clip emotion distributions from viewer responses
- Deployed 3 LLMs to rephrase and independently label comment emotions; applied majority voting per comment and discarded samples with no consensus across all three models
- Established a comprehensive benchmark for distributional video-to-reaction modeling
Skills
- Languages: English (Proficient), Mandarin (Native)
- Programming languages: Java, Python, C, Lisp, Markdown, LaTeX
- Development tools: PyTorch