CV

General Information

Full Name Sidong Zhang
Languages English, Mandarin

Education

  • Sep. 2020 - present
    Doctor of Philosophy
    UMass Amherst’s College of Information and Computer Sciences, Amherst, MA, USA
    • Teaching assistant for graduate-level CS 589 (Machine Learning) and CS 651 (Optimization)
    • Research assistant in the Information Fusion Lab, currently funded by an NIH R03 grant
  • Sep. 2018 - Jan. 2021
    Master of Science
    UMass Amherst’s College of Information and Computer Sciences, Amherst, MA, USA
  • Sep. 2018 - Jan. 2021
    Bachelor of Engineering
    Nanjing University, Software Institute, Nanjing, China

Experience

  • Feb. 2026 - Apr. 2026
    Video2Reaction: Training Foundation Video Models to Predict Audience Reaction
    UMass Amherst's College of Information and Computer Sciences
    • Accepted to CVPR 2026 Workshop on Emerging Directions in Data for Multimodal Foundation Models
    • Built on V2R, a novel dataset of audience emotional reactions to movie clips with distributional labels across 21 emotion categories
    • Developed a cross-taxonomy emotion transfer pipeline that fine-tunes VLMs on V2R via word-token scoring, enabling zero-shot transfer to VCE with up to +0.269 top-3 accuracy gain over baseline
    • Showed that V2R-pretrained VLMs (LLaVA-NeXT-Video-7B, Qwen2.5-VL-7B) match VideoMAE trained on the full 50K-sample target domain using only 1% of its training data
  • Feb. 2024 - Sep. 2024
    Audio-Visual Speech Separation via Bottleneck Iterative Network
    UMass Amherst's College of Information and Computer Sciences & Dolby Laboratories
    • Accepted to ICML 2025 Workshop on Machine Learning for Audio
    • Addressed audio-visual speech separation on noisy mixtures from NTCD-TIMIT and LRS3
    • Proposed a multimodal fusion framework combining audio and visual modalities via a bottleneck iterative architecture
    • Achieved state-of-the-art SI-SDRi while reducing training time by 50% compared to prior SOTA
    • Project webpage: https://stonezhng.github.io/projects/avssbin/
  • Sep. 2020 - Jul. 2025
    Longitudinal Multimodal Modeling for Alzheimer's Early Detection
    UMass Amherst's College of Information and Computer Sciences
    • In preparation for submission to Medical Image Analysis
    • Showed that task-agnostic representations learned from T1-weighted brain MRIs via mutual information maximization capture knowledge predictive of cognitive decline scores
    • Trained a CNN from scratch and fine-tuned BiomedCLIP as two representation learning backbones
    • Designed a staged training strategy for an existing AD forecasting model, stabilizing performance on small validation splits
    • Evaluated using micro F1 on CN/MCI/AD classification and MCI-to-AD transition timing precision across 100 repeated experiments for statistical robustness
    • Found that task-agnostic MRI representations from both the CNN and BiomedCLIP improved 2-year forecasting F1 and transition timing accuracy
  • Feb. 2026 - present
    Understanding 3D Brain MRI Scans in Foundation Modeling
    UMass Amherst's College of Information and Computer Sciences
    • Ongoing work
    • Existing foundation models (BiomedCLIP, MedGemma) use 2D medical image encoders, and 3D extensions largely treat volumes as sequences of 2D slices, which loses inter-slice structural context
    • Proposing native 3D convolutional patch embeddings to better capture volumetric structure in brain MRIs, where information is equally distributed across all three axes
    • Fine-tuning BiomedCLIP and MedSigLIP image encoders with 3D conv patch embedding layers on brain MRI data
    • Applying atlas-guided flexible patch sizing, assigning finer-grained patches to cognition-relevant regions (e.g., hippocampi) for cognitive state fine-tuning tasks
  • Jan. 2025 - May 2025
    Encoding Domain Insights into Multi-modal Fusion: Improved Performance at the Cost of Robustness
    UMass Amherst's College of Information and Computer Sciences
    • Accepted to ICML 2025 Workshop on Methods and Opportunities at Small Scale
    • Compared fusion architectures with and without domain knowledge embedded in the model structure
    • Evaluated on MMSD for sarcasm detection and a synthetic task with controllable domain knowledge levels; added Gaussian noise to inputs to probe robustness
    • Found that aligning fusion design with domain priors improves clean-data accuracy under limited data but significantly degrades robustness to noisy inputs
  • Feb. 2025 - May 2025
    Video2Reaction: Mapping Video to Audience Reaction in the Wild
    UMass Amherst's College of Information and Computer Sciences
    • Under review at ECCV
    • Curated a multimodal dataset mapping short movie clips to distributional emotional reactions of real-world viewers
    • Collected movie clips and YouTube comments, extracting per-clip emotion distributions from viewer responses
    • Deployed three LLMs to rephrase and independently label comment emotions; applied majority voting per comment and discarded samples with no consensus across all three models
    • Established a comprehensive benchmark for distributional video-to-reaction modeling

Skills

  • Languages: English (Proficient), Mandarin (Native)
  • Programming languages: Java, Python, C, Lisp, Markdown, LaTeX
  • Development tools: PyTorch