KEYWORDS: Education and training, 3D modeling, Magnetic resonance imaging, Acoustics, Tongue, Motion detection, Data modeling, Motion models, Performance modeling, Diseases and disorders
Understanding the relationship between tongue motion patterns during speech and their resulting speech acoustic outcomes—i.e., the articulatory-acoustic relation—is of great importance in assessing speech quality and developing innovative treatment and rehabilitative strategies. This is especially important when evaluating and detecting abnormal articulatory features in patients with speech-related disorders. In this work, we aim to develop a framework for detecting speech motion anomalies in conjunction with their corresponding speech acoustics. This is achieved through the use of a deep cross-modal translator trained on data from healthy individuals only, which bridges the gap between 4D motion fields obtained from tagged MRI and 2D spectrograms derived from speech acoustic data. The trained translator is used as an anomaly detector by measuring the spectrogram reconstruction quality on healthy individuals or patients. In particular, the cross-modal translator is expected to generalize poorly to patient data, which contains unseen out-of-distribution patterns, and therefore to yield lower reconstruction quality than on data from healthy individuals. A one-class SVM is then used to distinguish the spectrograms of healthy individuals from those of patients. To validate our framework, we collected a total of 39 paired tagged MRI and speech waveform recordings, consisting of data from 36 healthy individuals and 3 tongue cancer patients. We used both 3D convolutional and transformer-based deep translation models, training them on the healthy training set and then applying them to both the healthy and patient testing sets. Our framework demonstrates a capability to detect abnormal patient data, thereby illustrating its potential in enhancing the understanding of the articulatory-acoustic relation for both healthy individuals and patients.
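As a minimal illustration of the detection step, the sketch below assumes a trained motion-to-spectrogram translator is available as a callable and fits a one-class SVM on per-subject reconstruction errors from healthy data only; the interface and hyperparameters are illustrative, not the configuration used in this work.

import numpy as np
from sklearn.svm import OneClassSVM

def reconstruction_errors(translator, motion_fields, spectrograms):
    """Per-subject mean squared error between predicted and true spectrograms."""
    errors = []
    for motion, spec in zip(motion_fields, spectrograms):
        pred = translator(motion)          # hypothetical 4D-motion -> spectrogram call
        errors.append(np.mean((pred - spec) ** 2))
    return np.asarray(errors).reshape(-1, 1)

def fit_anomaly_detector(healthy_errors):
    # Fit on healthy errors only; predict() later returns +1 for healthy-like
    # reconstructions and -1 for out-of-distribution (potentially patient) data.
    detector = OneClassSVM(kernel="rbf", gamma="scale", nu=0.1)
    detector.fit(healthy_errors)
    return detector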
The detection of anatomical structures in medical imaging data plays a crucial role as a preprocessing step for various downstream tasks. It, however, poses a significant challenge due to highly variable appearances and intensity values within medical imaging data. In addition, there is a scarcity of annotated datasets in medical imaging, due to high costs and the requirement for specialized knowledge. These limitations motivate researchers to develop automated and accurate few-shot object detection approaches. While there are general-purpose deep learning models available for detecting objects in natural images, the applicability of these models to medical imaging data remains uncertain and needs to be validated. To address this, we carry out an unbiased evaluation of the state-of-the-art few-shot object detection methods for detecting head and neck anatomy in CT images. In particular, we choose Query Adaptive Few-Shot Object Detection (QA-FewDet), Meta Faster R-CNN, and Few-Shot Object Detection with Fully Cross-Transformer (FCT) methods and apply each model to detect various anatomical structures using novel datasets containing only a few images, ranging from 1- to 30-shot, during the fine-tuning stage. Our experimental results, carried out under the same setting, demonstrate that few-shot object detection methods can accurately detect anatomical structures, showing promising potential for integration into the clinical workflow.
Multimodal Magnetic Resonance (MR) Imaging plays a crucial role in disease diagnosis due to its ability to provide complementary information by analyzing the relationships between multimodal images of the same subject. Acquiring all MR modalities, however, can be expensive, and, during a scanning session, certain MR images may be missed depending on the study protocol. The typical solution is to synthesize the missing modalities from the acquired images, for example using generative adversarial networks (GANs). Yet, GANs constructed with convolutional neural networks (CNNs) are likely to suffer from a lack of global relationship modeling and of mechanisms to condition on the desired modality. To address this, in this work, we propose a transformer-based modality infuser designed to synthesize multimodal brain MR images. In our method, we extract modality-agnostic features from the encoder and then transform them into modality-specific features using the modality infuser. Furthermore, the modality infuser captures long-range relationships among all brain structures, leading to the generation of more realistic images. We carried out experiments on the BraTS 2018 dataset, translating between four MR modalities, and our experimental results demonstrate the superiority of our proposed method in terms of synthesis quality. In addition, we conducted experiments on a downstream brain tumor segmentation task and compared different conditioning methods.
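As a rough sketch of how such conditioning can be wired up, the PyTorch snippet below prepends a learned embedding of the target modality as a conditioning token before a transformer encoder; the layer sizes, token-based conditioning, and modality codes are assumptions for illustration rather than the exact architecture proposed here.

import torch
import torch.nn as nn

class ModalityInfuser(nn.Module):
    def __init__(self, dim=256, n_modalities=4, n_layers=4, n_heads=8):
        super().__init__()
        self.modality_embed = nn.Embedding(n_modalities, dim)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=n_heads, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=n_layers)

    def forward(self, features, target_modality):
        # features: (B, N, dim) modality-agnostic tokens from the shared encoder
        # target_modality: (B,) integer codes, e.g. 0=T1, 1=T1ce, 2=T2, 3=FLAIR (assumed)
        cond = self.modality_embed(target_modality).unsqueeze(1)  # (B, 1, dim)
        tokens = torch.cat([cond, features], dim=1)               # prepend condition token
        out = self.transformer(tokens)                            # long-range mixing
        return out[:, 1:]                                         # modality-specific features

# Usage sketch: feats = encoder(image); out = decoder(infuser(feats, modality_code))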
We propose a method that computes subtle motion variation patterns as principal components of a subject group’s dynamic motion fields. Coupled with the real-time speech audio recordings made during image acquisition, the key time frames that contain the maximum speech variations are identified from the principal components of the temporally aligned audio waveforms, which in turn inform the temporal location of the maximum spatial deformation variation. The motion fields between these key frames and the reference frame for each subject are then computed and warped into the common atlas space, enabling direct extraction of motion variation patterns via quantitative analysis.
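A minimal sketch of the two principal component steps, under the assumptions stated above (motion fields already warped to the common atlas space and audio segments temporally aligned frame by frame); array layouts and component counts are illustrative.

import numpy as np
from sklearn.decomposition import PCA

# motion: (n_subjects, n_frames, X, Y, Z, 3) displacement fields in atlas space
# audio:  (n_subjects, n_frames, n_samples_per_frame) aligned waveform segments

def key_frame_from_audio(audio):
    """Frame whose aligned audio contributes most to the first principal component."""
    n_subj, n_frames, n_samp = audio.shape
    pca = PCA(n_components=1)
    scores = pca.fit_transform(audio.reshape(n_subj * n_frames, n_samp))
    per_frame = np.abs(scores[:, 0]).reshape(n_subj, n_frames).mean(axis=0)
    return int(np.argmax(per_frame))

def motion_variation_modes(motion, key_frame, n_components=3):
    """Principal components of the warped motion fields at the key time frame."""
    fields = motion[:, key_frame].reshape(motion.shape[0], -1)  # one row per subject
    pca = PCA(n_components=n_components)
    pca.fit(fields)
    return pca.components_, pca.explained_variance_ratio_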
Investigating the relationship between internal tissue point motion of the tongue and oropharyngeal muscle deformation measured from tagged MRI and intelligible speech can aid in advancing speech motor control theories and developing novel treatment methods for speech-related disorders. However, elucidating the relationship between these two sources of information is challenging, due in part to the disparity in data structure between spatiotemporal motion fields (i.e., 4D motion fields) and one-dimensional audio waveforms. In this work, we present an efficient encoder-decoder translation network for exploring the predictive information inherent in 4D motion fields via 2D spectrograms as a surrogate of the audio data. Specifically, our encoder is based on 3D convolutional spatial modeling and transformer-based temporal modeling. The extracted features are processed by an asymmetric 2D convolution decoder to generate spectrograms that correspond to 4D motion fields. Furthermore, we incorporate a generative adversarial training approach into our framework to further improve the synthesis quality of our generated spectrograms. We experiment on 63 paired motion field sequences and speech waveforms, demonstrating that our framework enables the generation of clear audio waveforms from a sequence of motion fields. Thus, our framework has the potential to improve our understanding of the relationship between these two modalities and inform the development of treatments for speech disorders.
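The layout described above can be summarized in a compact PyTorch sketch; channel counts, the pooling strategy, and the mel-spectrogram target size below are illustrative assumptions, and the adversarial training branch is omitted.

import torch
import torch.nn as nn

class MotionToSpectrogram(nn.Module):
    def __init__(self, feat_dim=256, n_heads=8, n_layers=4, n_mels=80, n_time=64):
        super().__init__()
        # 3D convolutional spatial encoder: one feature vector per motion frame
        self.spatial = nn.Sequential(
            nn.Conv3d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv3d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1), nn.Flatten(), nn.Linear(64, feat_dim),
        )
        # transformer temporal encoder over the frame sequence
        layer = nn.TransformerEncoderLayer(d_model=feat_dim, nhead=n_heads, batch_first=True)
        self.temporal = nn.TransformerEncoder(layer, num_layers=n_layers)
        # 2D convolutional decoding: frame features -> spectrogram
        self.to_mel = nn.Linear(feat_dim, n_mels)
        self.refine = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=(3, 5), padding=(1, 2)), nn.ReLU(),
            nn.Conv2d(16, 1, kernel_size=(3, 5), padding=(1, 2)),
        )
        self.n_time = n_time

    def forward(self, motion):                                # motion: (B, T, 3, X, Y, Z)
        b, t = motion.shape[:2]
        feats = self.spatial(motion.flatten(0, 1)).view(b, t, -1)
        feats = self.temporal(feats)                          # (B, T, feat_dim)
        spec = self.to_mel(feats).transpose(1, 2).unsqueeze(1)   # (B, 1, n_mels, T)
        spec = nn.functional.interpolate(spec, size=(spec.shape[2], self.n_time))
        return self.refine(spec).squeeze(1)                   # (B, n_mels, n_time)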
Lesions or organ boundaries visible through medical imaging data are often ambiguous, thus resulting in significant variations in multi-reader delineations, i.e., the source of aleatoric uncertainty. In particular, quantifying the inter-observer variability of manual annotations with Magnetic Resonance (MR) Imaging data plays a crucial role in establishing a reference standard for various diagnosis and treatment tasks. Most segmentation methods, however, simply model a mapping from an image to its single segmentation map and do not take the disagreement of annotators into consideration. In order to account for inter-observer variability, without sacrificing accuracy, we propose a novel variational inference framework to model the distribution of plausible segmentation maps, given a specific MR image, which explicitly represents the multi-reader variability. Specifically, we resort to a latent vector to encode the multi-reader variability and counteract the inherent information loss in the imaging data. Then, we apply a variational autoencoder network and optimize its evidence lower bound (ELBO) to efficiently approximate the distribution of the segmentation map, given an MR image. Experimental results, carried out with the QUBIQ brain growth MRI segmentation datasets with seven annotators, demonstrate the effectiveness of our approach.
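For reference, a minimal sketch of the ELBO objective for such a conditional variational model is given below, assuming the encoder outputs the mean and log-variance of a Gaussian posterior over the latent vector and the decoder outputs segmentation logits; the weighting and distributional choices are illustrative, not the exact formulation used here.

import torch
import torch.nn.functional as F

def reparameterize(mu, logvar):
    """Sample z = mu + sigma * eps so the sampling step stays differentiable."""
    return mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)

def negative_elbo(pred_logits, rater_mask, mu, logvar, beta=1.0):
    """Reconstruction term + beta * KL( q(z | image, annotation) || N(0, I) )."""
    recon = F.binary_cross_entropy_with_logits(pred_logits, rater_mask, reduction="mean")
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + beta * kl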
Cycle reconstruction regularized adversarial training—e.g., CycleGAN, DiscoGAN, and DualGAN—has been widely used for image style transfer with unpaired training data. Several recent works, however, have shown that local distortions are frequent and structural consistency cannot be guaranteed. Targeting this issue, prior works usually relied on additional segmentation or consistent feature extraction steps that are task-specific. To counter this, we aim in this work to learn a general add-on structural feature extractor by explicitly enforcing structural alignment between an input and its synthesized image. Specifically, we propose a novel input-output image patch self-training scheme to achieve a disentanglement of underlying anatomical structures and imaging modalities. The translator and structure encoder are updated following an alternating training protocol. In addition, the information w.r.t. imaging modality can be eliminated with an asymmetric adversarial game. We train, validate, and test our network on 1,768, 416, and 1,560 unpaired subject-independent slices of tagged and cine magnetic resonance imaging from a total of twenty healthy subjects, respectively, demonstrating superior performance over competing methods.
Unsupervised domain adaptation (UDA) has been widely used to transfer knowledge from a labeled source domain to an unlabeled target domain to counter the difficulty of labeling in a new domain. The training of conventional solutions usually relies on the existence of both source and target domain data. However, the privacy of the large-scale, well-labeled source-domain data and of the trained model parameters can become a major concern in cross-center/domain collaborations. In this work, to address this, we propose a practical solution to UDA for segmentation with a black-box segmentation model trained in the source domain only, rather than original source data or a white-box source model. Specifically, we resort to a knowledge distillation scheme with exponential mixup decay (EMD) to gradually learn target-specific representations. In addition, unsupervised entropy minimization is applied to regularize the confidence of target-domain predictions. We evaluated our framework on the BraTS 2018 database, achieving performance on par with white-box source model adaptation approaches.
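The sketch below illustrates our reading of the knowledge distillation scheme with exponential mixup decay: pseudo-labels start close to the frozen black-box source predictions and gradually shift toward the target model's own predictions, with an entropy term regularizing target confidence; the decay schedule and weights are illustrative assumptions.

import torch

def emd_pseudo_label(source_prob, target_prob, step, decay_rate=0.999):
    """Convex mix of frozen source and current target probabilities."""
    lam = decay_rate ** step                    # lam -> 0 as training proceeds
    return lam * source_prob + (1.0 - lam) * target_prob

def target_loss(target_prob, pseudo_label, ent_weight=0.1, eps=1e-8):
    # distillation toward the mixed pseudo-label plus entropy minimization
    distill = torch.mean(-pseudo_label * torch.log(target_prob + eps))
    entropy = torch.mean(-target_prob * torch.log(target_prob + eps))
    return distill + ent_weight * entropy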
Accurate measurement of strain in a deforming organ has been an important step in motion analysis using medical images. In recent years, the in vivo motion and strain of internal tissue have mostly been computed from dynamic magnetic resonance (MR) imaging. However, such data lack information on the tissue’s intrinsic fiber directions, preventing computed strain tensors from being projected onto a direction of interest. Although diffusion-weighted MR imaging excels at providing fiber tractography, it yields static images that are unmatched with the dynamic MR data. In this work, we report an algorithmic workflow that estimates strain values in the diffusion MR space by matching corresponding tagged dynamic MR images. We focus on processing a dataset of various human tongue deformations in speech. The geometry of tongue muscle fibers is provided by diffusion tractography, while spatiotemporal motion fields are provided by tagged MR analysis. The tongue’s deforming shapes are determined by segmenting a synthetic cine dynamic MR sequence generated from the tagged data using a deep neural network. Estimated motion fields are transformed into the diffusion MR space using diffeomorphic registration, eventually leading to strain values computed in the direction of muscle fibers. The method was tested on 78 time volumes acquired during three sets of specific tongue deformations, including both speech and protrusion motion. Strain in the line of action of seven internal tongue muscles was extracted and compared both intra- and inter-subject. The resulting compression and stretching patterns revealed the unique behavior of individual muscles and their potential activation patterns.
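The final projection step can be written compactly; the sketch below computes the Green-Lagrange strain from the spatial gradient of a displacement field at a voxel and projects it onto a fiber direction from tractography, which is one standard way to obtain strain in the line of action (illustrative, not the exact pipeline code).

import numpy as np

def fiber_strain(grad_u, fiber_dir):
    """Green-Lagrange strain in the direction of a muscle fiber.

    grad_u    : (3, 3) spatial gradient of the displacement field at a voxel
    fiber_dir : (3,) muscle fiber direction from diffusion tractography
    """
    F = np.eye(3) + grad_u                    # deformation gradient
    E = 0.5 * (F.T @ F - np.eye(3))           # Green-Lagrange strain tensor
    f = fiber_dir / np.linalg.norm(fiber_dir)
    return float(f @ E @ f)                   # negative = compression, positive = stretch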
To advance our understanding of speech motor control, it is essential to image and assess dynamic functional patterns of internal structures caused by the complex muscle anatomy inside the human tongue. Speech pathologists are investigating new tools that help assess the cooperative mechanics of internal tongue muscles on top of their anatomical differences. Previous studies using dynamic magnetic resonance imaging (MRI) of the tongue revealed that tongue muscles tend to function in different groups during speech, especially the floor-of-the-mouth (FOM) muscles. In this work, we developed a method that analyzes the unique functional pattern of the FOM muscles in speech. First, four-dimensional motion fields of the whole tongue were computed using tagged MRI. Meanwhile, a statistical atlas of the tongue was constructed to form a common space for subject comparison, while a manually delineated mask of internal tongue muscles was used to separate individual muscles' motion. Then we computed the four-dimensional motion correlation between each muscle and the FOM muscle group. Finally, the dynamic correlations of different muscle groups were compared and evaluated. We used data from a study group of nineteen subjects including both healthy controls and oral cancer patients. Results revealed that most internal tongue muscles coordinated in a similar pattern during speech, while the FOM muscles followed a unique pattern that helped support the tongue body and pivot its rotation. The proposed method can help provide further interpretation of clinical observations and speech motor control from an imaging point of view.
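A minimal sketch of the correlation step, assuming atlas-space motion fields and binary muscle masks are already available; averaging over each mask and using a Pearson-style correlation of the mean trajectories is an illustrative choice, not necessarily the exact measure used here.

import numpy as np

def muscle_motion_correlation(motion, mask_a, mask_b):
    """Correlation between the mean motion time courses of two muscle regions.

    motion : (T, X, Y, Z, 3) motion field sequence in the atlas space
    mask_a, mask_b : (X, Y, Z) boolean masks, e.g. one muscle vs. the FOM group
    """
    # average the 3D displacement over each mask at every time frame: (T, 3)
    traj_a = motion[:, mask_a].mean(axis=1)
    traj_b = motion[:, mask_b].mean(axis=1)
    a = (traj_a - traj_a.mean(axis=0)).ravel()
    b = (traj_b - traj_b.mean(axis=0)).ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))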
The tongue is capable of producing intelligible speech because of the successful orchestration of muscle groupings—i.e., functional units—of its highly complex musculature over time. Because the tongue produces a wide range of motions, functional units are transitional structures that transform muscle activity into surface tongue geometry, and they vary significantly from one subject to another. In order to compare and contrast the location and size of functional units in the presence of such substantial inter-person variability, it is essential to study both common and subject-specific functional units in a group of people carrying out the same speech task. In this work, a new normalization technique is presented to simultaneously identify the common and subject-specific functional units of the tongue as tracked by tagged magnetic resonance imaging. To achieve our goal, a joint sparse non-negative matrix factorization framework is used, which learns a set of building blocks and subject-specific as well as common weighting matrices from motion quantities extracted from the displacements. A spectral clustering technique is then applied to the subject-specific and common weighting matrices to determine the subject-specific functional units for each subject and the common functional units across subjects. Our experimental results using in vivo tongue motion data show that our approach is able to identify common and subject-specific functional units of tongue motion during speech with reduced size variability.
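The sketch below gives a simplified, illustrative version of this pipeline using off-the-shelf NMF and spectral clustering; the sparsity constraints and the joint coupling between subjects that define the actual framework are deliberately omitted, and all names are assumptions.

import numpy as np
from sklearn.decomposition import NMF
from sklearn.cluster import SpectralClustering

def functional_units(subject_features, n_blocks=8, n_units=4):
    """subject_features: list of (n_voxels, n_quantities) non-negative matrices,
    one per subject, with voxels defined in a common atlas space."""
    # common weighting: factor the concatenation of all subjects' motion quantities
    stacked = np.concatenate(subject_features, axis=1)
    W_common = NMF(n_components=n_blocks, max_iter=500).fit_transform(stacked)
    cluster = SpectralClustering(n_clusters=n_units, affinity="nearest_neighbors")
    common_units = cluster.fit_predict(W_common)

    # subject-specific weightings: factor each subject separately (the joint,
    # sparse coupling of the original formulation is not reproduced here)
    subject_units = []
    for feats in subject_features:
        W_s = NMF(n_components=n_blocks, max_iter=500).fit_transform(feats)
        subject_units.append(cluster.fit_predict(W_s))
    return common_units, subject_units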
Amyotrophic Lateral Sclerosis (ALS) is a neurological disease that causes death of neurons controlling muscle movements. Loss of speech and swallowing function is a major consequence of tongue muscle degeneration. In speech studies using magnetic resonance (MR) techniques, diffusion tensor imaging (DTI) is used to capture internal tongue muscle fiber structures in three dimensions (3D) in a non-invasive manner. Tagged magnetic resonance images (tMRI) are used to record tongue motion during speech. In this work, we aim to combine information obtained with both MR imaging techniques to compare the functional characteristics of the tongue between normal and ALS subjects. We first extracted 3D motion of the tongue using tMRI from fourteen normal subjects during speech. The estimated motion sequences were then warped using diffeomorphic registration into the b0 spaces of the DTI data of two normal subjects and an ALS patient. We then constructed motion atlases by averaging all warped motion fields in each b0 space, and computed strain in the line of action along the muscle fiber directions provided by tractography. Strain in line with the fiber directions provides a quantitative map of the potentially active regions of the tongue during speech. Comparison between normal and ALS subjects explores how the volume of compressing tongue tissue during speech changes in the presence of muscle degradation. The proposed framework provides for the first time a dynamic map of contracting fibers in ALS speech patterns, and has the potential to provide more insight into the detrimental effects of ALS on speech.
Representation of human tongue motion using three-dimensional vector fields over time can be used to better understand tongue function during speech, swallowing, and other lingual behaviors. To characterize the inter-subject variability of the tongue’s shape and motion in a population carrying out one of these functions, it is desirable to build a statistical model of the four-dimensional (4D) tongue. In this paper, we propose a method to construct a spatio-temporal atlas of tongue motion using magnetic resonance (MR) images acquired from fourteen healthy human subjects. First, cine MR images revealing the anatomical features of the tongue are used to construct a 4D intensity image atlas. Second, tagged MR images acquired to capture internal motion are used to compute a dense motion field at each time frame using a phase-based motion tracking method. Third, motion fields from each subject are pulled back to the cine atlas space using the deformation fields computed during the cine atlas construction. Finally, a spatio-temporal motion field atlas is created to show a sequence of mean motion fields and their inter-subject variation. The quality of the atlas was evaluated by deforming cine images into the atlas space; comparison between the deformed and original cine images showed high correspondence. The proposed method provides, for the first time, a quantitative representation for observing the commonality and variability of the tongue motion field, and shows potential for evaluating common properties such as strains and other tensors derived from the motion fields.
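As a minimal illustration of the final step, assuming the per-subject motion fields have already been pulled back to the cine atlas space, the mean motion field and a simple inter-subject variability map can be computed as follows (the array layout is illustrative).

import numpy as np

def motion_field_atlas(warped_motion):
    """warped_motion: (n_subjects, T, X, Y, Z, 3) motion fields in atlas space.

    Returns the mean motion field per time frame and the voxel-wise
    inter-subject standard deviation of the displacement magnitude."""
    mean_field = warped_motion.mean(axis=0)                  # (T, X, Y, Z, 3)
    magnitude = np.linalg.norm(warped_motion, axis=-1)       # (n_subjects, T, X, Y, Z)
    variability = magnitude.std(axis=0)                      # (T, X, Y, Z)
    return mean_field, variability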
The human tongue is composed of multiple internal muscles that work collaboratively during the production of speech. Assessment of muscle mechanics can help understand the creation of tongue motion, interpret clinical observations, and predict surgical outcomes. Although various methods have been proposed for computing the tongue’s motion, associating motion with muscle activity in an interdigitated fiber framework has not been studied. In this work, we aim to develop a method that reveals different tongue muscles’ activities in different time phases during speech. We use four-dimensional tagged magnetic resonance (MR) images and static high-resolution MR images to obtain tongue motion and muscle anatomy, respectively. Then we compute strain tensors and local tissue compression along the muscle fiber directions in order to reveal their shortening pattern. This process relies on the support of multiple image analysis methods, including super-resolution volume reconstruction from MR image slices, segmentation of internal muscles, tracking the incompressible motion of tissue points using tagged images, propagation of muscle fiber directions over time, and calculation of strain in the line of action. We evaluated the method on a control subject and two post-glossectomy patients in a controlled speech task. The normal subject’s tongue muscle activity shows high correspondence with the production of speech at different time instants, while both patients’ muscle activities show patterns that differ from the control due to their resected tongues. This method shows potential for relating overall tongue motion to particular muscle activity, which may provide novel information for future clinical and scientific studies.
Image labeling is an essential step for quantitative analysis of medical images. Many image labeling algorithms require seed identification in order to initialize segmentation algorithms such as region growing, graph cuts, and the random walker. Seeds are usually placed manually by human raters, which makes these algorithms semi-automatic and can be prohibitive for very large datasets. In this paper, an automatic algorithm for placing seeds using multi-atlas registration and statistical fusion is proposed. Atlases containing the centers of mass of a collection of neuroanatomical objects are deformably registered to a training set to determine where these centers of mass map after the labels are transformed by registration. The biases of these transformations are determined and incorporated into a continuous form of Simultaneous Truth And Performance Level Estimation (STAPLE) fusion, thereby improving the estimates (on average) over a single-registration strategy that does not incorporate bias or fusion. We evaluate this technique using real 3D brain MR image atlases and demonstrate its efficacy in correcting the data bias and reducing the fusion error.
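A simplified, illustrative sketch of the fusion idea is shown below: registered center-of-mass seeds are first corrected by the biases learned from the training set and then combined with inverse-variance weights, which stands in for the continuous STAPLE fusion described above; variable names and the weighting scheme are assumptions.

import numpy as np

def fuse_seed_estimates(transformed_seeds, biases, variances):
    """transformed_seeds: (n_atlases, n_objects, 3) registered seed locations
    biases:            (n_atlases, 3) mean registration offset per atlas (from training)
    variances:         (n_atlases,) per-atlas spread, used for weighting"""
    corrected = transformed_seeds - biases[:, None, :]       # remove learned bias
    weights = 1.0 / (variances + 1e-12)                      # inverse-variance weights
    weights = weights / weights.sum()
    return np.tensordot(weights, corrected, axes=(0, 0))     # (n_objects, 3) fused seeds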
Image labeling is an essential task for evaluating and analyzing morphometric features in medical imaging data. Labels can be obtained by either human interaction or automated segmentation algorithms. However, both approaches for labeling suffer from inevitable error due to noise and artifacts in the acquired data. The Simultaneous Truth And Performance Level Estimation (STAPLE) algorithm was developed to combine multiple rater decisions and simultaneously estimate the unobserved true labels as well as each rater's level of performance (i.e., reliability). A generalization of STAPLE for the case of continuous-valued labels has also been proposed. In this paper, we first show that, under the proposed Gaussian distribution assumption, this continuous STAPLE formulation yields equivalent likelihoods for the bias parameter, meaning that the bias parameter, one of the key performance indices, is actually indeterminate. We resolve this ambiguity by augmenting the STAPLE expectation maximization formulation to include a priori probabilities on the performance level parameters, which enables simultaneous, meaningful estimation of both the rater bias and variance performance measures. We evaluate and demonstrate the efficacy of this approach in simulations and also through a human rater experiment involving the identification of the intersection points of the right and left ventricles in CINE cardiac data.
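The sketch below illustrates one way such an augmented EM formulation can look, assuming a model in which each rater's decision equals the true value plus a rater bias and Gaussian noise, with a zero-mean Gaussian prior on the bias supplying the missing constraint; this is our simplified reading, not the paper's exact derivation.

import numpy as np

def continuous_staple_map(D, prior_bias_var=1.0, n_iter=50):
    """D: (n_points, n_raters) continuous rater decisions (e.g., landmark coordinates).

    Assumed model: D[i, j] = truth[i] + bias[j] + noise, noise ~ N(0, var[j]),
    with prior bias[j] ~ N(0, prior_bias_var)."""
    n_points, n_raters = D.shape
    bias = np.zeros(n_raters)
    var = np.ones(n_raters)
    for _ in range(n_iter):
        # E-step: estimate the true values as an inverse-variance weighted mean
        w = 1.0 / var
        truth = ((D - bias) * w).sum(axis=1) / w.sum()
        resid = D - truth[:, None]
        # M-step with a zero-mean Gaussian prior on the bias (MAP update)
        bias = resid.sum(axis=0) / (n_points + var / prior_bias_var)
        var = ((resid - bias) ** 2).mean(axis=0) + 1e-12
    return truth, bias, var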