# Methodology: Introduction to MMRF

### **Developing MMRF to Render 3D Scenes Based on Multimodal Inputs**

![Diagram of the MMRF multimodal model](https://1521529408-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FRf6rxD6VHFAbFQeTaCq0%2Fuploads%2FwcL7fWLiAjTADQ2ROFjK%2F14.png?alt=media)

*Figure 9: MMRF Model*

**Initialising MMRF for Textual Description Inputs**

We introduce a Multimodal Radiance Field (MMRF) model (Fig. 9), built on the aforementioned pre-trained datasets. MMRF combines 2D images and textual descriptions, drawing inspiration from NeRF [22]. By initialising NeRF's architecture with a generated Signed Distance Function (SDF), MMRF produces features that instantiate geometrically specific 3D mesh models, such as "reinforced concrete beams".
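As a rough illustration, the sketch below shows one way an SDF field conditioned on a text embedding could look in PyTorch. The class name, layer sizes, and the source of `text_embed` (e.g. a CLIP-style text encoder) are illustrative assumptions, not the actual MMRF implementation.

```python
# Minimal sketch, assuming PyTorch; all names here are hypothetical.
import torch
import torch.nn as nn

class SDFConditionedField(nn.Module):
    """MLP that maps a 3D point plus a text embedding to an SDF value
    and a feature vector, regressing a signed distance rather than
    NeRF's raw density so that a surface is defined geometrically."""
    def __init__(self, text_dim: int = 512, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(3 + text_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1 + hidden),  # SDF value + features
        )

    def forward(self, xyz: torch.Tensor, text_embed: torch.Tensor):
        # Broadcast one text embedding across all sampled points.
        cond = text_embed.expand(xyz.shape[0], -1)
        out = self.net(torch.cat([xyz, cond], dim=-1))
        sdf, features = out[:, :1], out[:, 1:]
        return sdf, features

# Usage: query the field at sampled points; the zero level set of
# `sdf` defines the mesh surface (extractable via marching cubes).
field = SDFConditionedField()
points = torch.rand(1024, 3)        # sampled ray points
prompt_embed = torch.randn(1, 512)  # e.g. "reinforced concrete beams"
sdf, feats = field(points, prompt_embed)
```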

**Audio-Driven Scene Dynamics**

MMRF will incorporate an advanced audio-processing module for acoustic scene analysis and auditory scene synthesis, combining self-supervised speech representations with neural-network-based audio-processing algorithms. This module translates real-world auditory data into dynamic scene influencers.
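A minimal sketch of this pipeline follows, assuming torchaudio's wav2vec 2.0 bundle as the self-supervised speech representation; the linear mapping head and the scene parameters it predicts are hypothetical stand-ins, not the module's actual design.

```python
# Hedged sketch: self-supervised audio features -> scene influencers.
import torch
import torchaudio

bundle = torchaudio.pipelines.WAV2VEC2_BASE
encoder = bundle.get_model().eval()

# Hypothetical head turning pooled audio features into scene inputs,
# e.g. ambient vibration amplitude and activity-level cues.
head = torch.nn.Linear(768, 2)

waveform = torch.zeros(1, int(bundle.sample_rate))  # 1 s placeholder audio
with torch.no_grad():
    features, _ = encoder.extract_features(waveform)
    pooled = features[-1].mean(dim=1)  # time-average the last layer
    scene_params = head(pooled)        # -> dynamic scene influencers
```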

**Adaptive Resolution Scaling**

MMRF further implements adaptive resolution scaling. With a fixed camera location, we apply the Transvoxel algorithm to SDF points expressed in spherical coordinates. This concentrates computational resources on key focal areas, such as beam-column intersections, while relegating peripheral zones to lower resolution for efficiency.
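The sketch below illustrates only the level-of-detail assignment step; the Transvoxel stitching between adjacent resolution levels is omitted, and the camera pose, focal direction, and angular thresholds are assumed values chosen for demonstration.

```python
# Illustrative LOD assignment for SDF sample points around a fixed camera.
import numpy as np

def lod_levels(points, camera_pos, focal_dir, thresholds=(0.2, 0.6)):
    """Assign a resolution level per SDF sample point: 0 = finest
    (near the focal axis, e.g. a beam-column joint), 2 = coarsest."""
    rel = points - camera_pos                       # camera-centred frame
    r = np.linalg.norm(rel, axis=1, keepdims=True)  # spherical radius
    dirs = rel / np.maximum(r, 1e-8)
    # Angular distance from the focal direction, in radians.
    ang = np.arccos(np.clip(dirs @ focal_dir, -1.0, 1.0))
    return np.digitize(ang, thresholds)             # bucket by angle

points = np.random.uniform(-5, 5, size=(10000, 3))
levels = lod_levels(points, camera_pos=np.zeros(3),
                    focal_dir=np.array([0.0, 0.0, 1.0]))
# Finest voxels where levels == 0; halve resolution per level outward.
```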

**Modular Design**

Architecturally, MMRF is modular, with a dedicated processor for each modality (visual, textual, auditory), mirroring 3D-GPT's multi-agent approach [19] for future scalability. MMRF thus manages the rendering of our 3D assets and open world. Graphically, we will convert the open world into a VR asset using game-engine plugins, while a separate physics engine handles the customised physics.
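To make the modular routing concrete, the structural sketch below shows one possible shape for such a pipeline; the registry API and the processor stand-ins are hypothetical, intended only to show how per-modality processors can be added or swapped without touching the others.

```python
# Structural sketch only; real processors would wrap the SDF field,
# text encoder, and audio head sketched in the sections above.
from typing import Any, Callable, Dict

class MMRFPipeline:
    """Routes each modality to its dedicated processor, then returns
    one conditioning dict for the renderer."""
    def __init__(self):
        self.processors: Dict[str, Callable[[Any], Any]] = {}

    def register(self, modality: str, fn: Callable[[Any], Any]):
        # e.g. "visual", "textual", "auditory"
        self.processors[modality] = fn

    def render_inputs(self, inputs: Dict[str, Any]) -> Dict[str, Any]:
        # Modalities are processed independently, so new ones (or new
        # agents, as in 3D-GPT) slot in without changing existing code.
        return {m: self.processors[m](x) for m, x in inputs.items()}

pipeline = MMRFPipeline()
pipeline.register("textual", lambda prompt: f"embed({prompt})")
pipeline.register("auditory", lambda wav: "scene_params")
conditioning = pipeline.render_inputs({"textual": "concrete beam hall"})
```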
