EMBED consists of images and two primary tables: the clinical data and the image metadata.
When working with EMBED, it is important to understand how its data hierarchies interact with one another. An overview of the hierarchy is shown in Figure 1.
- Cohort (`cohortnum`): cohorts do not overlap with one another (the public Open Data version only contains the first 2 cohorts).
- Patient (`empi_anon`): all data from 1 unique patient. All exams, images, and findings for a patient will be linked with this column.
- Exam (`acc_anon`), also called an accession: all data from 1 unique exam for a patient. All images and findings for a given exam will be linked with this column.
- Side: given by the `side` column in the clinical data, and the `ImageLateralityFinal` column in the metadata.
- Finding: identified by the `numfind` column, which enumerates each unique finding within an exam.
- … (`numfind` should be used to identify unique findings in these cases).
Figure 1: Data hierarchy diagram for EMBED
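As a quick sanity check of the hierarchy, the sketch below counts exams per patient and findings per exam on a toy table. It assumes pandas and uses made-up rows; only the column names (`empi_anon`, `acc_anon`, `side`, `numfind`) come from EMBED, and the uniqueness check assumes the clinical table holds one row per finding.

```python
import pandas as pd

# Toy clinical rows: values are invented, only the column names are EMBED's.
clinical = pd.DataFrame({
    "empi_anon": [1, 1, 1, 2],      # patient
    "acc_anon":  [10, 10, 11, 20],  # exam (accession)
    "side":      ["L", "L", "R", "L"],
    "numfind":   [1, 2, 1, 1],      # finding number within the exam
})

# Exams per patient: distinct accessions under each patient.
exams_per_patient = clinical.groupby("empi_anon")["acc_anon"].nunique()

# Findings per exam: distinct numfind values under each accession.
findings_per_exam = clinical.groupby("acc_anon")["numfind"].nunique()

# Assumption: one row per finding, so (acc_anon, numfind) should be unique.
finding_key_is_unique = not clinical.duplicated(["acc_anon", "numfind"]).any()
```

Running the same three checks on the real tables is a cheap way to confirm that each level nests inside the one above it before any filtering or merging.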
Figure 2 shows how these hierarchies are represented in the EMBED tables (cohort not shown). Patient, exam, and side identifiers are present in both tables, while images are unique to the metadata and findings/procedures are unique to the clinical data.
Figure 2: Visual representation of the EMBED data hierarchy in the clinical (shown here as: Magview) and metadata tables
Figure 3 shows the data engineering order of operations we generally recommend when working with EMBED. The two tables can technically be merged at any point (even at the very start), but keeping them separate for as long as possible generally keeps operations clearer and prevents unintended duplication.
Figure 3: Flowchart of the recommended data engineering approach for the clinical (shown here as: Magview) and metadata tables
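To illustrate the kind of unintended duplication an early merge can cause, here is a minimal sketch with toy data, assuming pandas. Two findings and three images on the same exam side produce six merged rows, so every image appears once per finding. The rows and file names are invented for illustration.

```python
import pandas as pd

# Clinical table: two findings on the left side of one exam (toy values).
clinical = pd.DataFrame({
    "acc_anon": [10, 10],
    "side":     ["L", "L"],
    "numfind":  [1, 2],
})

# Metadata table: three images of the same exam side (hypothetical paths).
metadata = pd.DataFrame({
    "acc_anon": [10, 10, 10],
    "ImageLateralityFinal": ["L", "L", "L"],
    "anon_dicom_path": ["a.dcm", "b.dcm", "c.dcm"],
})

# Merging before reducing either table pairs every finding with every image.
merged = clinical.merge(
    metadata,
    left_on=["acc_anon", "side"],
    right_on=["acc_anon", "ImageLateralityFinal"],
    how="inner",
)

n_rows = len(merged)                            # findings x images
n_images = merged["anon_dicom_path"].nunique()  # distinct images only
```

Comparing the row count with the number of distinct image paths after any merge is a simple way to spot this multiplication.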
For example, when trying to define a cancer versus no-cancer dataset with only 3D images, we would do the following:
- Filter the metadata to `FinalImageType == '3D'`.
- Merge the two tables on the `acc_anon` and side columns (`side` in clinical, `ImageLateralityFinal` in metadata). Ensure data is deduplicated and handle any final splitting or cleaning.
- Use the `anon_dicom_path` column to match the relevant images to your dataset.
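The example above could be sketched roughly as follows, assuming pandas and toy data. The `is_cancer` column is hypothetical and stands in for however you define cancer from the clinical table. Collapsing findings to one label per exam side (any cancer finding marks the side as cancer) is an assumption of this sketch, not a rule stated by EMBED.

```python
import pandas as pd

clinical = pd.DataFrame({
    "empi_anon": [1, 1, 2],
    "acc_anon":  [10, 10, 20],
    "side":      ["L", "L", "R"],
    "numfind":   [1, 2, 1],
    "is_cancer": [True, False, False],  # hypothetical label column
})
metadata = pd.DataFrame({
    "acc_anon": [10, 10, 20, 20],
    "ImageLateralityFinal": ["L", "L", "R", "R"],
    "FinalImageType": ["3D", "2D", "3D", "3D"],
    "anon_dicom_path": ["p/a.dcm", "p/b.dcm", "p/c.dcm", "p/d.dcm"],
})

# Clinical side: reduce to one label per exam side before merging
# (assumption: a side counts as cancer if any of its findings is cancer).
labels = clinical.groupby(["acc_anon", "side"], as_index=False)["is_cancer"].max()

# Metadata side: keep only 3D images.
meta_3d = metadata[metadata["FinalImageType"] == "3D"]

# Merge on exam and side, then guard against duplicated images.
dataset = labels.merge(
    meta_3d,
    left_on=["acc_anon", "side"],
    right_on=["acc_anon", "ImageLateralityFinal"],
    how="inner",
).drop_duplicates("anon_dicom_path")

# Paths of the images that belong to the final dataset.
image_paths = dataset["anon_dicom_path"].tolist()
```

Because each table is reduced on its own first, the merged result has exactly one row per 3D image, each carrying the label of its exam side.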