Openstream.ai, Summer 2024

Multimodal Personality Detection

A research-driven multimodal learning project exploring how audio and video signals can predict OCEAN personality traits from live interaction.

Machine Learning Intern

Models trained: 100+
Traits scored: 5 (OCEAN)

Python, PyTorch, Multimodal fusion, Audio / video

At Openstream.ai — a multimodal, plan-based conversational AI platform — I researched how audio and video signals could be fused to infer stable personality traits beyond transcript content.

Problem

Conversational systems are richer when they can read tone and presence, not only transcripts. That means fusing audio and video that are temporally aligned but carry very different signals, and turning them into a stable, interpretable output.

The hard part is that personality is not visible in one frame or one word. The signal is distributed across voice dynamics, facial presence, timing, and context. The model had to handle noisy real-time inputs while still producing a compact set of scores people could understand.

Multimodal fusion

Reading personality from synchronized audio and video

The pipeline runs in stages: live signal becomes embeddings, temporal aggregation stabilizes those features, and fusion produces interpretable OCEAN scores.

Audio + video to OCEAN

5 traits, 100+ models, real-time feed

Research exploration

The research question was which signals were worth trusting.

I evaluated audio and visual backbones, then trained model variants around fusion and temporal aggregation strategy to study which representations produced the most stable personality predictions.
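To make "temporal aggregation strategy" concrete, here is a minimal sketch that compares two common smoothing choices on a noisy per-frame signal. Everything in it is an assumption for illustration: the function names, the sample values, and the use of standard deviation as a stand-in for stability are not taken from the project, whose actual method is proprietary.

```python
from statistics import mean, pstdev

def sliding_mean(values, window):
    """Average each value with up to `window - 1` preceding values."""
    out = []
    for i in range(len(values)):
        start = max(0, i - window + 1)
        out.append(mean(values[start:i + 1]))
    return out

def ema(values, alpha):
    """Exponential moving average; a smaller alpha smooths more heavily."""
    out = []
    state = values[0]
    for v in values:
        state = alpha * v + (1 - alpha) * state
        out.append(state)
    return out

# Hypothetical per-frame feature with one noisy spike (illustrative numbers only).
frames = [0.50, 0.52, 0.48, 0.95, 0.51, 0.49, 0.50, 0.53]

smoothed_window = sliding_mean(frames, window=4)
smoothed_ema = ema(frames, alpha=0.3)

# Lower spread over time is one simple proxy for a more stable signal.
spread = {
    "raw": pstdev(frames),
    "sliding_mean": pstdev(smoothed_window),
    "ema": pstdev(smoothed_ema),
}
```

Both strategies damp the single spike, at the cost of reacting more slowly to genuine changes. A comparison between aggregation strategies has to weigh that trade-off.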

Approach

I researched and designed a multimodal fusion mechanism for temporally aligned audio and video inputs. The pipeline encoded audio and visual streams separately, aggregated features over time, then fused the modalities before a final regression layer produced personality scores.
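The encode, aggregate, fuse, regress flow described above can be sketched in PyTorch as follows. This is a generic late-fusion baseline under stated assumptions, not the proprietary mechanism: the class name `LateFusionRegressor`, the feature dimensions, the linear stand-in encoders, mean pooling, concatenation, and the sigmoid output range are all hypothetical choices made for illustration.

```python
import torch
from torch import nn

OCEAN = ["openness", "conscientiousness", "extraversion",
         "agreeableness", "neuroticism"]

class LateFusionRegressor(nn.Module):
    """Illustrative only: encode each stream, pool over time, fuse, regress."""

    def __init__(self, audio_dim=40, video_dim=128, hidden_dim=64):
        super().__init__()
        # Linear stand-ins for real backbones, which the source does not name.
        self.audio_encoder = nn.Sequential(nn.Linear(audio_dim, hidden_dim), nn.ReLU())
        self.video_encoder = nn.Sequential(nn.Linear(video_dim, hidden_dim), nn.ReLU())
        # Regression head maps the fused vector to one value per trait.
        self.head = nn.Linear(2 * hidden_dim, len(OCEAN))

    def forward(self, audio, video):
        # audio: (batch, time, audio_dim); video: (batch, time, video_dim)
        a = self.audio_encoder(audio).mean(dim=1)  # temporal aggregation: mean pooling
        v = self.video_encoder(video).mean(dim=1)
        fused = torch.cat([a, v], dim=-1)          # fusion: concatenation (assumed)
        return torch.sigmoid(self.head(fused))     # scores squashed into (0, 1)

model = LateFusionRegressor()
audio = torch.randn(2, 50, 40)    # 2 clips, 50 audio frames, 40 features each
video = torch.randn(2, 50, 128)   # 2 clips, 50 video frames, 128 features each
scores = model(audio, video)      # shape: (2, 5), one row of trait scores per clip
```

Because each stream is pooled over time before fusion, the two modalities do not need the same number of frames in this sketch. A design that fuses frame by frame would need aligned time steps.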

The work involved comparing audio and visual representation backbones, testing temporal aggregation strategies, and training 100+ model variants to understand which signals were most predictive and stable. The output was a vector across the five OCEAN traits: openness, conscientiousness, extraversion, agreeableness, and neuroticism.
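One way a study like this can reach 100+ variants is by crossing a handful of design choices. The sketch below enumerates such a grid. The source does not say how the variants were organized, so the search space, the placeholder option names, and the `enumerate_variants` helper are purely hypothetical.

```python
from itertools import product

# Hypothetical search space; none of these option names come from the project.
search_space = {
    "audio_backbone": ["audio_a", "audio_b", "audio_c"],
    "video_backbone": ["video_a", "video_b", "video_c"],
    "temporal_pooling": ["mean", "max", "attention"],
    "fusion": ["concat", "gated", "cross_attention", "weighted_sum"],
}

def enumerate_variants(space):
    """Return one config dict per combination of options in `space`."""
    keys = list(space)
    return [dict(zip(keys, combo)) for combo in product(*(space[k] for k in keys))]

variants = enumerate_variants(search_space)
# 3 * 3 * 3 * 4 = 108 configurations, each a candidate training run.
```

Each config would then be trained and scored on the same held-out data so that differences can be attributed to the backbone, pooling, or fusion choice.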

Result

The result was a research prototype that translated multimodal signal exploration into a real-time personality scoring system, giving the conversational AI platform a richer signal than transcript-only analysis.

Note: kept high-level — the underlying algorithm is proprietary.

Pipeline diagram stages: audio and video arrive together; each modality becomes an embedding; temporal windows smooth noisy moments; audio and visual evidence meet; the output is five interpretable trait scores.