Developed by Stéphanie Battini, Ph.D. in Medical Sciences from the University of Strasbourg
From native research files to standards-aligned, repository-ready metadata
BioMetaXtract©DSBU is a tool developed by Stéphanie Battini at the Data Stewardship Biomed Unit to help researchers extract, organize and standardize metadata directly from native research files.
Many scientific instruments automatically record rich technical information during data acquisition. This information is often stored inside the files themselves or in associated sidecar files, but it is not always easy to access, interpret or reuse. BioMetaXtract makes these internal metadata visible, structured and reusable.
The objective is to reduce the manual burden of dataset documentation and to support researchers in preparing their data for FAIR sharing, repository submission and long-term preservation.
BioMetaXtract extracts metadata from acquisition files, maps them to relevant community standards, and reorganizes them into repository-ready formats adapted to the data type and target repository.
What BioMetaXtract does
BioMetaXtract supports the documentation workflow from raw research files to FAIR metadata outputs.
It can:
- read native metadata embedded in research files;
- detect key acquisition parameters produced by instruments or acquisition software;
- extract information from associated sidecar files when available;
- reorganize metadata according to discipline-specific standards;
- prepare structured metadata outputs that can support deposition in recognized repositories;
- facilitate dataset documentation for FAIR data sharing and Open Research Data requirements.
In practice, BioMetaXtract helps researchers transform instrument-generated metadata into understandable, standardized and reusable information.
Why this matters
Research datasets are often difficult to reuse because important contextual information is missing, scattered or stored in technical formats that are hard to interpret.
For example, a microscopy file may contain information about the microscope, objective, detector, laser wavelengths, channels and acquisition settings. A flow cytometry file may contain parameters related to the cytometer configuration and measured channels. A DICOM file may contain information about imaging modality, acquisition date, scanner settings and image structure.
BioMetaXtract helps capture this information automatically, so that datasets can be better described, checked, shared and preserved.
This supports:
- improved dataset documentation;
- better traceability of data acquisition;
- easier preparation of repository submissions;
- alignment with FAIR principles;
- reduced manual metadata entry;
- better long-term reuse of research data.
Supported data modalities and standards
BioMetaXtract is designed as a multimodal metadata extraction tool. It supports several types of research data commonly produced within biomedical and life science research environments.
Flow cytometry
For flow cytometry data, BioMetaXtract extracts metadata from .fcs files and aligns them with the MIFlowCyt standard.
The extracted metadata can support documentation and deposition workflows, for example in Zenodo or other appropriate repositories depending on the sharing strategy.
Current support includes extraction of key metadata from FCS files and from FlowJo .wsp sidecar files.
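As an illustration of what reading metadata from an .fcs file can involve, the following minimal sketch parses the TEXT segment of an FCS file using only the Python standard library. It is not BioMetaXtract's implementation. The function names and the output field names (`instrument`, `acquisition_date`, `event_count`, `channels`) are hypothetical; only the FCS keywords (`$CYT`, `$DATE`, `$TOT`, `$PAR`, `$PnN`) come from the FCS format itself. The demo file is synthetic and built in memory.

```python
def read_fcs_text_segment(raw: bytes) -> dict:
    """Return the keyword/value pairs stored in the TEXT segment of an FCS file."""
    # FCS header: bytes 10-17 and 18-25 hold the ASCII byte offsets
    # of the first and last byte of the TEXT segment.
    start = int(raw[10:18].decode("ascii").strip())
    end = int(raw[18:26].decode("ascii").strip())
    text = raw[start:end + 1].decode("latin-1")

    # The first character of TEXT is the delimiter used between keys and values.
    delimiter = text[0]
    fields = text[1:].split(delimiter)
    if fields and fields[-1] == "":
        fields.pop()  # drop the empty string left by the trailing delimiter

    # Simplification: escaped (doubled) delimiters inside values are not handled.
    return {
        fields[i].strip().upper(): fields[i + 1].strip()
        for i in range(0, len(fields) - 1, 2)
    }


def summarize_for_miflowcyt(keywords: dict) -> dict:
    """Collect a few instrument-level fields (hypothetical output names)."""
    n_params = int(keywords.get("$PAR", "0"))
    return {
        "instrument": keywords.get("$CYT"),
        "acquisition_date": keywords.get("$DATE"),
        "event_count": keywords.get("$TOT"),
        "channels": [keywords.get(f"$P{i}N") for i in range(1, n_params + 1)],
    }


def build_demo_fcs() -> bytes:
    """Build a tiny synthetic FCS byte string (header + TEXT only) for the demo."""
    text = (
        "|$CYT|DemoCytometer|$DATE|01-JAN-2024|$TOT|1000"
        "|$PAR|2|$P1N|FSC-A|$P2N|SSC-A|"
    ).encode("ascii")
    start = 58  # the FCS header occupies the first 58 bytes
    end = start + len(text) - 1
    header = b"FCS3.1" + b" " * 4 + f"{start:>8}{end:>8}".encode("ascii")
    return header.ljust(58, b" ") + text  # DATA/ANALYSIS offsets left blank


demo_keywords = read_fcs_text_segment(build_demo_fcs())
print(summarize_for_miflowcyt(demo_keywords))
```

A real file would be read with `open(path, "rb").read()` in place of `build_demo_fcs()`. A production parser would also need to handle escaped delimiters and the supplemental TEXT segment.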
Microscopy
For light microscopy data, BioMetaXtract supports formats such as:
- .czi
- .nd2
- .oir
- .lsm
- .lif
- .tif
The extracted metadata are mapped to the REMBI standard used by the BioImage Archive.
BioMetaXtract extracts and structures metadata related to image acquisition, including instrument information and, when available, detailed acquisition parameters such as:
- microscope information;
- objective information;
- detector information;
- laser wavelengths;
- filter information;
- channel-level metadata;
- pinhole size;
- detector gain and offset.
These metadata are useful for preparing submissions to the BioImage Archive and for improving the documentation of bioimaging datasets.
Current support includes extraction of key metadata from image files, from metadata sidecar files and from Imaris files.
Electron microscopy
For electron microscopy data, BioMetaXtract supports formats such as:
- .mrc
- .tif
- .mdoc sidecar files
The extracted metadata are aligned with EMPIAR repository guidelines.
This supports documentation workflows for electron microscopy datasets, including tomograms, snapshots, montage data and associated acquisition metadata.
Target repositories may include:
- BioImage Archive
- EMPIAR
- EMDB
depending on the nature of the dataset and the scientific domain.
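To give an idea of how acquisition metadata can be read from an .mdoc sidecar file, here is a minimal sketch that assumes the plain "key = value" layout with bracketed section headers such as `[ZValue = 0]`. The function name, the `_section` and `_index` keys, and the sample content are hypothetical, and this is not a description of how BioMetaXtract itself parses these files.

```python
def parse_mdoc(text: str):
    """Split .mdoc text into global fields and a list of per-section fields."""
    global_fields = {}
    sections = []
    current = global_fields  # keys before the first [section] are global

    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith("[") and line.endswith("]"):
            # A header such as "[ZValue = 0]" opens a new section.
            name, _, index = line[1:-1].partition("=")
            current = {"_section": name.strip(), "_index": index.strip()}
            sections.append(current)
        elif "=" in line:
            key, _, value = line.partition("=")
            current[key.strip()] = value.strip()
    return global_fields, sections


# Synthetic example content, for illustration only.
DEMO_MDOC = """
PixelSpacing = 2.7
Voltage = 300

[ZValue = 0]
TiltAngle = -60.0
ExposureTime = 1.2

[ZValue = 1]
TiltAngle = -57.0
ExposureTime = 1.2
"""

global_fields, sections = parse_mdoc(DEMO_MDOC)
tilt_angles = [float(s["TiltAngle"]) for s in sections if "TiltAngle" in s]
print(global_fields, tilt_angles)
```

Values are kept as strings so that nothing is lost before a later mapping step decides on units and types.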
Metabolomics
For metabolomics data, BioMetaXtract supports formats such as:
- .mzML
- Agilent .d
The metadata are aligned with MSI reporting recommendations and ISA-Tab structures used for metabolomics data deposition.
The tool supports metadata preparation for repositories such as MetaboLights, and can also support FAIR documentation workflows for datasets deposited in Zenodo, depending on the publication and sharing requirements.
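Because mzML is an XML format, instrument-level metadata can be read with a standard XML parser. The sketch below, which is only an illustration and not BioMetaXtract's code, lists the `cvParam` entries attached to each `instrumentConfiguration` element. The function name and the embedded sample document are hypothetical, and parameters referenced indirectly through `referenceableParamGroupRef` are deliberately ignored.

```python
import xml.etree.ElementTree as ET

# XML namespace used by mzML documents.
MZML_NS = {"mz": "http://psi.hupo.org/ms/mzml"}


def instrument_params(mzml_text: str) -> list:
    """Return the cvParam entries found under each instrumentConfiguration."""
    root = ET.fromstring(mzml_text.strip())
    params = []
    for config in root.iterfind(".//mz:instrumentConfiguration", MZML_NS):
        for cv in config.iterfind("mz:cvParam", MZML_NS):
            params.append(
                {
                    "config_id": config.get("id"),
                    "accession": cv.get("accession"),
                    "name": cv.get("name"),
                    "value": cv.get("value", ""),
                }
            )
    return params


# Minimal synthetic mzML fragment, for illustration only.
DEMO_MZML = """
<mzML xmlns="http://psi.hupo.org/ms/mzml" version="1.1.0">
  <instrumentConfigurationList count="1">
    <instrumentConfiguration id="IC1">
      <cvParam cvRef="MS" accession="MS:1000031" name="instrument model" value=""/>
      <cvParam cvRef="MS" accession="MS:1000529" name="instrument serial number" value="DEMO-0001"/>
    </instrumentConfiguration>
  </instrumentConfigurationList>
</mzML>
"""

for param in instrument_params(DEMO_MZML):
    print(param["config_id"], param["accession"], param["name"], param["value"])
```

Keeping the controlled-vocabulary accession next to each value makes it easier to map the result onto ISA-Tab fields later, since the accession is stable even when the display name changes.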
Clinical DICOM imaging
For clinical imaging data, BioMetaXtract supports DICOM files (.dcm) from modalities such as:
- CT
- MR
- PET
- NM
The extracted metadata are based on native DICOM information and can support mapping toward BIDS-compatible documentation structures.
Depending on the research context and the sensitivity of the data, outputs may support deposition or metadata sharing through appropriate repositories or catalogues.
For clinical datasets, additional attention must always be given to:
- de-identification;
- consent;
- access restrictions;
- sensitive metadata pathways;
- institutional and legal requirements.
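One way to picture a separate pathway for sensitive metadata is to split extracted DICOM attributes into technical and potentially identifying groups before anything is exported. The sketch below assumes the attributes have already been read into a dictionary keyed by DICOM keyword. The function name and the keyword list are illustrative assumptions; the list is deliberately short and is not a complete de-identification profile.

```python
# Illustrative subset of DICOM keywords that can identify a patient or site.
# This is NOT a complete de-identification profile.
SENSITIVE_KEYWORDS = {
    "PatientName",
    "PatientID",
    "PatientBirthDate",
    "PatientAddress",
    "AccessionNumber",
    "InstitutionName",
    "ReferringPhysicianName",
}


def split_sensitive(attributes: dict):
    """Separate technical attributes from potentially identifying ones."""
    technical = {k: v for k, v in attributes.items() if k not in SENSITIVE_KEYWORDS}
    flagged = sorted(k for k in attributes if k in SENSITIVE_KEYWORDS)
    return technical, flagged


# Synthetic attribute set, for illustration only.
demo_attributes = {
    "Modality": "MR",
    "Manufacturer": "DemoVendor",
    "StudyDate": "20240101",
    "PatientName": "DOE^JANE",
    "PatientID": "000000",
}

technical, flagged = split_sensitive(demo_attributes)
print("exportable:", technical)
print("needs review:", flagged)
```

Only the names of the flagged attributes are reported, not their values, so that the review list itself does not leak identifying information. Dates such as `StudyDate` may also need handling depending on institutional policy.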
MRI, PET-CT and preclinical imaging
For MRI and PET-CT data, including preclinical imaging workflows, BioMetaXtract supports metadata extraction from imaging files and aims to organize them according to BIDS-compatible documentation structures.
For neuroscience-related imaging datasets, EBRAINS may be an appropriate target repository. For other disciplines, Zenodo may be more appropriate, depending on the dataset type, sensitivity and publication requirements.
Current BIDS-related outputs should be considered as structured support for documentation and future repository preparation. Full BIDS compliance may require additional curation depending on the dataset and repository requirements.
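The kind of curation involved can be illustrated with a small sketch that turns a few DICOM-style MRI attributes into a BIDS-style JSON sidecar. It relies on two conventions: DICOM stores RepetitionTime and EchoTime in milliseconds while BIDS sidecars expect seconds, and the DICOM keyword `ManufacturerModelName` corresponds to the BIDS field `ManufacturersModelName`. The function name and the input dictionary are hypothetical, and the result is far from a complete BIDS sidecar.

```python
import json

# DICOM keyword -> BIDS sidecar field, copied without unit conversion.
COPY_MAP = {
    "Manufacturer": "Manufacturer",
    "ManufacturerModelName": "ManufacturersModelName",
    "MagneticFieldStrength": "MagneticFieldStrength",
    "FlipAngle": "FlipAngle",
}

# Fields stored in milliseconds in DICOM but in seconds in BIDS.
MS_TO_S_FIELDS = ("RepetitionTime", "EchoTime")


def to_bids_sidecar(dicom_attrs: dict) -> dict:
    """Build a partial BIDS-style sidecar from DICOM-style attributes."""
    sidecar = {}
    for source, target in COPY_MAP.items():
        if source in dicom_attrs:
            sidecar[target] = dicom_attrs[source]
    for key in MS_TO_S_FIELDS:
        if key in dicom_attrs:
            sidecar[key] = float(dicom_attrs[key]) / 1000.0
    return sidecar


# Synthetic attribute set, for illustration only.
demo_attrs = {
    "Manufacturer": "DemoVendor",
    "ManufacturerModelName": "DemoScanner",
    "MagneticFieldStrength": 3,
    "RepetitionTime": "2000",
    "EchoTime": "30",
}

print(json.dumps(to_bids_sidecar(demo_attrs), indent=2))
```

Fields that are absent from the input are simply omitted rather than filled with placeholders, which keeps gaps visible for later manual curation.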
microCT, IVIS and TIFF stacks
BioMetaXtract can also support metadata extraction from modalities such as:
- microCT;
- IVIS;
- TIFF image stacks;
- derived imaging outputs.
For these modalities, a single mature community minimum-information standard may not be available. BioMetaXtract therefore applies a best-effort structured metadata approach, capturing available native metadata and organizing them according to FAIR documentation principles.
For neuroscience-related imaging datasets, EBRAINS may be an appropriate target repository. For other disciplines, Zenodo may be more appropriate, depending on the dataset type and publication requirements.
From extracted metadata to repository-ready outputs
BioMetaXtract is not only a metadata extraction tool. Its goal is also to help researchers move toward repository-ready documentation.
The workflow can be summarized as follows:
- Native datasets: researchers provide native acquisition files generated by instruments or acquisition software.
- Metadata extraction: BioMetaXtract reads embedded metadata and available sidecar files.
- Standards mapping: the extracted metadata are reorganized according to relevant community standards, such as MIFlowCyt, REMBI, EMPIAR guidelines, MSI / ISA-Tab, DICOM or BIDS-inspired structures.
- Repository-ready exports: the output can support dataset documentation and preparation for deposition in appropriate FAIR repositories.
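The mapping and export steps of this workflow can be sketched as a small lookup-driven function. Everything named here is an assumption made for illustration: the mapping table, the target field names and the function names are hypothetical and do not reproduce the actual REMBI schema or BioMetaXtract's internal mappings.

```python
import json

# Hypothetical mapping from native metadata keys to target field names.
DEMO_MAPPING = {
    "Microscope": "image_acquisition.instrument",
    "Objective": "image_acquisition.objective",
    "LaserWavelength": "image_acquisition.excitation_wavelength_nm",
}


def map_to_standard(native: dict, mapping: dict) -> dict:
    """Rename known keys to target fields and keep unknown keys aside."""
    mapped, unmapped = {}, {}
    for key, value in native.items():
        target = mapping.get(key)
        if target is None:
            unmapped[key] = value  # preserved so that no native metadata is lost
        else:
            mapped[target] = value
    return {"mapped": mapped, "unmapped": unmapped}


def export_json(record: dict) -> str:
    """Serialize the mapped record in a stable, human-readable form."""
    return json.dumps(record, indent=2, sort_keys=True)


# Synthetic extracted metadata, for illustration only.
demo_native = {
    "Microscope": "DemoScope",
    "Objective": "63x/1.4 Oil",
    "LaserWavelength": 488,
    "VendorSpecificFlag": "A1",
}

record = map_to_standard(demo_native, DEMO_MAPPING)
print(export_json(record))
```

Keeping unmapped keys in a separate group, rather than discarding them, means a curator can later decide whether they belong in the target standard or in free-text documentation.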
What researchers gain
BioMetaXtract helps researchers save time and improve the quality of dataset documentation.
It supports:
- automatic extraction of technical metadata;
- more complete dataset descriptions;
- better alignment with FAIR principles;
- easier preparation for repository submission;
- improved reproducibility and reuse;
- better traceability from acquisition files to published datasets.
By extracting and structuring metadata early, BioMetaXtract helps ensure that important acquisition information is not lost when data are prepared for sharing or long-term preservation.
BioMetaXtract is therefore part of an evolving DSBU ecosystem designed to support researchers throughout the FAIR data lifecycle: from acquisition metadata to dataset documentation, repository submission and long-term preservation.