Hello 🌎

My name is Hasan. I'm a Machine Learning Engineer and recent MSCS graduate from NYU Courant, living in Brooklyn, New York. My work involves building RAG frameworks, fine-tuning LLMs, and deploying ML models to production. I have 4+ years of experience as a Data Scientist and ML Engineer, with previous experience in the federal consulting space.

Academically, I'm interested in the intersection of natural language processing and cognitive science, with a particular focus on curriculum learning and knowledge distillation in large language models (see projects).

On the side I run Rumuuz, a poetry collective based in NYC.

I also enjoy 🏕️ 🎹 & 🧗



Projects

Curriculum Learning for Knowledge Distillation in LMs


Knowledge distillation (KD) is a powerful, well-established model compression technique, but its performance can suffer when the capacity gap between student and teacher models is too large (Cho and Hariharan, 2019), or when multiple teachers introduce competing distillation objectives (Du et al., 2020). To address these issues and improve KD for large language models, I explore two ideas: curriculum learning during KD, where training data is ordered by difficulty, and Selective Reliance during KD, where a student language model selectively leverages the teacher distillation loss for data samples the curriculum deems difficult.
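A rough sketch of how these two ideas can fit together is shown here. It is illustrative only and not the project's actual implementation: the difficulty score (the teacher's cross-entropy on each example), the fixed `threshold`, the mixing weight `alpha`, and every function name are assumptions made for the example.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, label):
    """Cross-entropy of the predicted distribution against a hard label."""
    return -math.log(softmax(logits)[label])


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def curriculum_order(teacher_logits, labels):
    """Order examples from easy to hard.

    Assumption: difficulty is the teacher's cross-entropy on each example.
    """
    difficulty = [cross_entropy(t, y) for t, y in zip(teacher_logits, labels)]
    order = sorted(range(len(labels)), key=lambda i: difficulty[i])
    return order, difficulty


def selective_kd_loss(student_logits, teacher_logits, label, difficulty,
                      threshold, alpha=0.5, temperature=2.0):
    """Use the distillation term only for examples deemed difficult.

    Easy examples (difficulty <= threshold) use the hard-label loss alone;
    hard examples mix in a temperature-scaled KL term against the teacher.
    """
    hard_loss = cross_entropy(student_logits, label)
    if difficulty <= threshold:
        return hard_loss
    soft_teacher = softmax(teacher_logits, temperature)
    soft_student = softmax(student_logits, temperature)
    distill = temperature ** 2 * kl_divergence(soft_teacher, soft_student)
    return (1 - alpha) * hard_loss + alpha * distill
```

In a training loop, batches would be drawn following `order`, and each example's loss would come from `selective_kd_loss`. A hard threshold is only one way to gate the teacher signal; a soft weighting by difficulty would be another.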
Read more ⟶

Neural Question Generation with GPT-J


Neural Question Generation (QG) systems aim to automate question construction by generating novel questions from a given context, reducing the time and cost of writing questions for educators and test developers. We propose Question Generation using GPT-J in a few-shot setting. Generating questions in this manner reduces the time and resources required to construct datasets and fine-tune increasingly complex models like GPT-J, which could make such systems more accessible for educational purposes such as adaptive education. We compare our results against a GPT-J model fine-tuned on the task.
Read more ⟶

Selective Teacher Reliance for Knowledge Distillation


Knowledge distillation (KD) is a powerful, well-established model compression technique that can face performance limitations when student models attempt to mimic large teacher models on high-dimensional tasks like image classification (Cho and Hariharan, 2019). Motivated by ideas in curriculum learning, we explore selective reliance on the task of image recognition, where a student model relies more heavily on teacher guidance for data samples deemed difficult by a teacher-generated curriculum. Experimental results show that the curriculum setting and selective reliance techniques have minimal effect on student accuracy and convergence.
Read more ⟶

Detecting Atrial Fibrillation Burden Using 1-D CNNs


1-D CNNs have recently been used to classify various types of arrhythmias from ECG sequences. In this paper, a 1-D CNN is used to predict the atrial fibrillation (AF) burden of ECG sequences, a continuous metric that helps gauge AF severity and can be predictive of near-term stroke risk.
Read more ⟶