About

Marco Garosi

I'm Marco, a PhD student at the Department of Information Engineering and Computer Science (DISI) at the University of Trento (Italy). I work in the Multimedia and Human Understanding Group (MHUG) under the supervision of Prof. Elisa Ricci. My primary research interest lies at the intersection of computer vision and natural language processing, and I'm focusing on vision-language models (VLMs). Nonetheless, I enjoy many other research topics as well as software engineering.

EXPERIENCE

I have worked on various research topics, including log file modelling, 3D point cloud part segmentation, and visual attribute recognition.

I am also a software engineer. As such, I have developed several projects, many of which are open-sourced on my GitHub. The largest and most challenging project is the "IFS Simulator", a company and business simulator created for high-school students and commissioned by Confao.

EDUCATION

I earned my Bachelor's Degree in Computer Science at the University of Verona, and my Master's Degree in Artificial Intelligence Systems at the University of Trento. My thesis was about visual attribute recognition with vision-language models, in a training-free setting.

I am now pursuing a PhD in artificial intelligence, dedicating my time to research on vision-language models.

LIBRARIES AND TOOLS (Misc.)

✔ PyTorch

✔ Diffusers

✔ Transformers

✔ PyTorch Lightning

✔ CLIP

✔ OpenCLIP

✔ LAION

✔ DEAP

✔ NumPy

✔ Django

✔ MySQL and PostgreSQL

✔ LangChain and LangGraph

✔ ChromaDB

✔ ...

With a solid background in computer science and software engineering, I love crafting solutions at any scale, from small personal projects to large systems that handle huge amounts of data and require substantial processing power.

As a PhD student, I have experience in zero-shot 3D point cloud part segmentation (i.e., decomposing objects into parts) and training-free fine-grained attribute detection.
At a higher level, I am focusing on vision-language models.

I have experience with large language models (LLMs), vision-language models (VLMs), and diffusion models.
I love applying them to many tasks, improving them, and exploiting what they have learned from large-scale pre-training.
