EMBED contains 364,000 screening and diagnostic mammographic exams for 110,000 patients from four hospitals over an 8-year period. EMBED consists of ten equal patient-level cohorts; the Open Data version (which is publicly available for non-commercial research purposes) contains data from the first two cohorts only.
A full discussion of screening mammography is beyond the scope of this documentation, but we provide a brief primer. In the United States, women are recommended annual screening mammography beginning at age 40 and biennially after age 55. Screening mammograms are recommended for asymptomatic patients to detect occult breast cancer. Approximately 90% of screening mammograms are normal and are assigned BIRADS 1 (negative) or BIRADS 2 (benign), after which the patient will return to screening in 1-2 years. However, approximately 10% of screening exams demonstrate an abnormality that requires further imaging and are assigned BIRADS 0 (additional evaluation). BIRADS 3 is a special case where there is low suspicion and the patient will return for screening again in 6 months.
If a patient is assigned BIRADS 0, they will proceed to a diagnostic mammogram with special views including compression and magnification paddles. They may also undergo concurrent ultrasound. In 2/3 of diagnostic mammograms, the finding disappears or is resolved as benign and the patient will be assigned BIRADS 1 or 2. In 1/3 of diagnostic mammograms, the finding persists and the patient is assigned a BIRADS score of 4 (suspicious) or 5 (highly suspicious) and will proceed to biopsy. Approximately 2/3 of biopsy results are benign and the patient will return to annual screening. Approximately 1/3 of biopsy results are malignant, in which case the patient will go for further imaging (MRI to evaluate extent of disease) and/or surgical resection (lumpectomy or mastectomy). The patient will then receive diagnostic mammograms for several years before returning to annual screening.
Lastly, symptomatic patients do not receive screening mammography. Any patient with pain, discharge, or other symptoms would go straight to diagnostic mammography where they will be evaluated, assigned a BIRADS score, and proceed to biopsy or further imaging if indicated. In these cases, patients may have a pathology result linked to a diagnostic exam that is NOT preceded by a screening exam.
Images in the dataset are structured as /cohort/patient_ID/study_ID/series_ID/DICOMS
Each cohort contains 11,000 unique patients and all exams for those patients. There is no overlap in patients or exams between cohorts. All exams over time for a patient will be contained in the same /cohort/patient_ID/ folder. Similarly, all images for an exam will be contained in the same /cohort/patient_ID/study_ID/ folder.
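
As a minimal sketch of working with this layout, the snippet below splits a file path into its cohort, patient, study, and series components by taking the last five path components. The helper name `parse_dicom_path`, the `DicomLocation` type, and the example path are hypothetical and made up for illustration; actual folder and file names in the release may look different.

```python
from pathlib import PurePosixPath
from typing import NamedTuple


class DicomLocation(NamedTuple):
    cohort: str
    patient_id: str
    study_id: str
    series_id: str
    filename: str


def parse_dicom_path(path: str) -> DicomLocation:
    """Interpret the last five path components as cohort/patient/study/series/file."""
    parts = PurePosixPath(path).parts
    if len(parts) < 5:
        raise ValueError(f"expected at least 5 path components, got {len(parts)}: {path!r}")
    return DicomLocation(*parts[-5:])


# Hypothetical path, not taken from the dataset.
location = parse_dicom_path("/cohort_1/12345/1.2.3/1.2.3.4/image.dcm")
print(location.patient_id, location.study_id)
```

Taking components from the end of the path means the same helper works whether the path is relative to the dataset root or includes a longer prefix.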
V. Data Tables
EMBED consists of several image modalities and two primary tables that are used to assign clinical factors and image metadata.
Magview contains all clinical and demographic data collected for an exam, as detailed below. In this file, each row represents a single finding on mammography, and therefore there can be one to several rows per exam. These data are entered by the radiologist at the time of interpretation. Pathology information is entered retroactively by an administrator for all patients who receive breast biopsies, lumpectomies, or mastectomies and is attributed to the original finding that led to the biopsy/surgery.
Feature name | Description |
---|---|
empi_anon | Unique anonymized patient ID; all exams for a patient will have the same ID |
acc_anon | Unique ID per exam; all rows for an exam will have the same acc_anon (and the same empi_anon). Negative exams or exams with only one finding will have a single row per acc_anon, and exams with multiple findings will have multiple rows. |
study_date_anon | Anonymized date that the exam was signed. This may differ slightly from the date the exam was acquired. All dates are shifted randomly across patients, but by the same amount within a patient to maintain temporality between multiple exams and pathology results for a patient. |
desc | The study description, such as screening or diagnostic mammogram |
tissueden | BIRADS breast density. 1: The breasts are almost entirely fat (BIRADS A); 2: Scattered fibroglandular densities (BIRADS B); 3: Heterogeneously dense (BIRADS C); 4: Extremely dense (BIRADS D); 5: Normal male |
asses | The BI-RADS score of the exam. This is assigned globally for an exam but repeated in each finding row. BIRADS 0: A - Additional evaluation; BIRADS 1: N - Negative; BIRADS 2: B - Benign; BIRADS 3: P - Probably benign; BIRADS 4: S - Suspicious; BIRADS 5: M - Highly suggestive of malignancy; BIRADS 6: K - Known biopsy proven. Screening exams may have BIRADS 0, 1, 2, or 3. Diagnostic exams may have BIRADS 4, 5, or 6. |
numfind | Index of the finding number for an exam, beginning with 1 |
side | Side of the finding described in the current row. L: left; R: right; B: bilateral |
total_l_find | Number of unique findings for the left breast for a given exam. |
total_r_find | Number of unique findings for the right breast for a given exam. |
massshape | Mass shape according to BIRADS descriptors. Also includes asymmetries and architectural distortion (see ./tables/clinical_legend.csv) |
massmargin | Mass margin according to BIRADS descriptors (see ./tables/clinical_legend.csv) |
massdens | Mass density according to BIRADS descriptors (see ./tables/clinical_legend.csv) |
calcfind | Type of calcification according to BIRADS descriptors (see ./tables/clinical_legend.csv) |
calcdistri | Distribution of calcifications according to BIRADS descriptors (see ./tables/clinical_legend.csv) |
bside | Laterality of any pathology result. L: left; R: right |
procdate_anon | Date of pathology result. |
type | Source of the tissue specimen obtained: biopsy, FNA, lumpectomy, etc. (see ./tables/clinical_legend.csv). This is helpful in cases where there is a biopsy followed by a lumpectomy. The pathology entries will contain information from both events; however, the lumpectomy pathology results would typically supersede biopsy results. |
path1 - path10 | Individual pathologic diagnoses from a given specimen. For example, a given specimen may contain invasive ductal carcinoma (ID), ductal carcinoma in situ (DC), and radial scar (RS). This row would contain these entries in path1 - path3 (see ./tables/clinical_legend.csv) |
path_severity | The most severe pathology result from a given specimen, abstracted from path1 - path10 (see ./tables/pathology_legend.csv for classification schema). 0: invasive cancer; 1: non-invasive cancer; 2: high-risk lesion; 3: borderline lesion; 4: benign findings; 5: negative (normal breast tissue); 6: non-breast cancer |
RACE_DESC | Patient Race |
ETHNIC_GROUP_DESC | Patient Ethnicity |
The metadata table contains image-level information and is structured as one row per file. Information includes DICOM metadata, ROI location, image type (2D, C-view), and file path.
Feature name | Description |
---|---|
empi_anon | Unique anonymized patient ID; all exams for a patient will have the same ID |
acc_anon | Unique ID per exam; all images within an exam will have the same acc_anon (and the same empi_anon) |
anon_dicom_path | Anonymized full DICOM file path |
png_path | Full png image path (not relevant for AWS Open Data release) |
png_filename | Filename only of the png (not relevant for AWS Open Data release). Filenames are hashed and therefore are unique across all files, allowing this field to be used as an index if desired. |
study_date_anon | Anonymized date of acquisition of the exam. This may differ slightly from study_date_anon in the clinical data, which represents the date the report was signed. |
StudyDescription | The exam type, for example a screening or diagnostic mammogram. On occasion, the study type may differ from what was recorded in the clinical data sheet due to mistakes in data entry at the time of acquisition. Images and other fields can be reviewed to troubleshoot, or these exams can be discarded. |
SeriesDescription | The name of the series for an exam. This can vary depending on 2D, 3D, or C-view images but typically contains the type of view (CC, MLO, etc.) and/or the laterality. This field is not frequently required. See FinalImageType and ImageLateralityFinal. |
FinalImageType | Derived by combining information from several other DICOM fields to ascertain the image type. 2D: standard 2D digital mammogram. 3D: digital breast tomosynthesis (DBT); these cannot currently be used as they are locked in a proprietary container and are not part of the AWS Open Data release. C-view: synthetic 2D image derived from the DBT. ROI_SSC/ROI_SS: screensave images annotated by the radiologist with a circular ROI burned into the pixel data; these ROIs have already been extracted and mapped to their source image, and the screensaves are not part of the AWS Open Data release. |
ImageLateralityFinal | Derived by combining information from several other DICOM fields. L: left breast; R: right breast |
ViewPosition | Type of view acquired, such as CC or MLO |
spot_mag | Indicates if the image is a special view such as spot compression or magnification. 0: image is a full field digital mammogram (FFDM); all screening studies are FFDM. 1: image is a special view; often used in diagnostic exams but should not occur for screening exams. |
num_roi | Number of ROIs for a given file |
ROI_coords | Coordinate(s) of any detected ROI on the image, represented as a list of lists. Each sublist contains the corner coordinates for one ROI in the format ‘ymin, xmin, ymax, xmax’. For 2D and C-view images, this field is the location of the ROI on the image. For the screensave (ROI_SSC/ROI_SS) images, this field is the location of the burned-in ROI on the screensave image, which serves as the source of the ROI location. Screensaves are not part of the AWS Open Data release. |
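
When the metadata table is loaded from a text file, the ROI_coords cell typically arrives as a string. The sketch below shows one way to turn it into numeric boxes. It assumes the cell is serialized as a Python-style literal list of lists (for example `[[ymin, xmin, ymax, xmax], ...]`) and that empty cells read as an empty string, `nan`, or an empty list; the exact serialization in the released file is an assumption here, and the helper names `parse_roi_coords` and `box_size` are hypothetical.

```python
import ast


def parse_roi_coords(raw):
    """Return a list of (ymin, xmin, ymax, xmax) tuples from one ROI_coords cell."""
    if raw is None:
        return []
    text = str(raw).strip()
    if text in ("", "nan", "[]", "()"):
        return []  # no ROI recorded for this file
    boxes = ast.literal_eval(text)  # safe parsing of literals only, no code execution
    return [tuple(int(v) for v in box) for box in boxes]


def box_size(box):
    """Height and width of a box given in 'ymin, xmin, ymax, xmax' order."""
    ymin, xmin, ymax, xmax = box
    return ymax - ymin, xmax - xmin


# Made-up coordinates for illustration.
rois = parse_roi_coords("[[100, 200, 300, 450], [10, 20, 30, 40]]")
print(len(rois), box_size(rois[0]))
```

Note the row-first ordering: the first and third values are y coordinates, so libraries that expect `(xmin, ymin, xmax, ymax)` need the values reordered.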
A recurring challenge for many new users is merging information from the clinical and metadata files due to differences in the way these data are indexed. The clinical data is indexed by individual findings, which can correspond to either breast and vary in number by exam. The metadata is indexed by file, wherein each row corresponds to one image. Both of these files can be linked by patient ID (empi_anon), exam ID (acc_anon), and laterality of the clinical finding and image.
If clinical and metadata are merged directly on the exam ID alone (acc_anon), findings that relate to one breast in the clinical data will be mapped to images from both breasts in the metadata. Instead, we must match clinical findings of the left breast with files of the left breast, clinical findings of the right breast with files of the right breast, and clinical findings of both breasts with all files.
To do this, use the side field in clinical data (L, R, B) and only merge rows where side matches the ImageLateralityFinal field in metadata.
- side L -> metadata: ImageLateralityFinal L
- side R -> metadata: ImageLateralityFinal R
- side B -> metadata: all files (L and R)
- side NaN -> metadata: all files (L and R)

The resulting table will have each clinical finding mapped to its corresponding files in metadata. Note that files can still be repeated in the joined table. For example, if a given accession has 2 findings in the left breast, each finding will be mapped to all the left breast images, resulting in each left breast image appearing twice in the resultant dataframe: once for numfind 1 and again for numfind 2. This is expected behavior.
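
The laterality-aware merge described above can be sketched with pandas as follows. This is one possible approach, not an official loader: the two small dataframes are made-up stand-ins for the clinical and metadata tables, and the function name `merge_by_laterality` is hypothetical. Only the column names come from the tables above.

```python
import pandas as pd

# Made-up stand-ins for the clinical and metadata tables.
clinical = pd.DataFrame({
    "empi_anon": [1, 1, 1, 2],
    "acc_anon": [10, 10, 11, 20],
    "numfind": [1, 2, 1, 1],
    "side": ["L", "L", "B", None],
})
metadata = pd.DataFrame({
    "empi_anon": [1, 1, 1, 1, 2, 2],
    "acc_anon": [10, 10, 11, 11, 20, 20],
    "ImageLateralityFinal": ["L", "R", "L", "R", "L", "R"],
    "anon_dicom_path": ["a.dcm", "b.dcm", "c.dcm", "d.dcm", "e.dcm", "f.dcm"],
})


def merge_by_laterality(clinical, metadata):
    """Map each finding to the images of the matching breast."""
    keys = ["empi_anon", "acc_anon"]
    is_one_sided = clinical["side"].isin(["L", "R"])

    # Findings with side L or R: also require the image laterality to match.
    matched = clinical[is_one_sided].merge(
        metadata,
        left_on=keys + ["side"],
        right_on=keys + ["ImageLateralityFinal"],
        how="inner",
    )
    # Findings with side B or NaN: map to all images of the exam.
    unrestricted = clinical[~is_one_sided].merge(metadata, on=keys, how="inner")
    return pd.concat([matched, unrestricted], ignore_index=True)


merged = merge_by_laterality(clinical, metadata)
print(merged[["acc_anon", "numfind", "side", "ImageLateralityFinal", "anon_dicom_path"]])
```

In this toy data, exam 10 has two left-breast findings, so its single left image appears twice in the result (once per numfind) and its right image does not appear at all.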
In a typical mammography screening workflow, an abnormal screening study (BIRADS 0) proceeds to a diagnostic study. If the diagnostic study is deemed suspicious (BIRADS 4 or 5), the patient will proceed to biopsy, followed by surgery if indicated.
Each of these pathology results is recorded by a human administrator and linked to the exam and finding from which it originated. That means that if a screening study has a finding that triggers a diagnostic exam and biopsy, the resultant pathology result would be recorded on rows for both the screening and diagnostic exams in the clinical dataframe. Similarly, if the patient proceeds to surgery, the pathology result from surgery will also be recorded in the appropriate rows for both exams. However, because this is a manual process, there is potential for error in which the pathology result is recorded only for the diagnostic study and not the screening exam. In this case, it may be useful to create a table of patient ID, pathology specimen date, and pathology results and then manually attribute results back to prior studies based on a date range.
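
The date-range attribution suggested above might look like the following pandas sketch. The 180-day window, the function name `attribute_pathology`, and the toy data are assumptions chosen for illustration; an appropriate window depends on the study design. The sketch also ignores laterality, so in practice bside would likely need to be compared against side as well.

```python
import pandas as pd

# Made-up clinical rows: the pathology result is recorded only on the diagnostic exam.
clinical = pd.DataFrame({
    "empi_anon": [1, 1, 2],
    "acc_anon": [10, 11, 20],
    "desc": ["screening", "diagnostic", "screening"],
    "study_date_anon": pd.to_datetime(["2015-01-05", "2015-01-20", "2016-03-01"]),
    "procdate_anon": pd.to_datetime([None, "2015-02-02", None]),
    "path_severity": [None, 0.0, None],
})

WINDOW_DAYS = 180  # assumed look-back window, not specified by the dataset


def attribute_pathology(clinical, window_days=WINDOW_DAYS):
    """Attach the most severe pathology result found within window_days after each exam."""
    # Table of patient ID, specimen date, and pathology result.
    path = (
        clinical.dropna(subset=["procdate_anon", "path_severity"])
        [["empi_anon", "procdate_anon", "path_severity"]]
        .drop_duplicates()
    )
    exams = clinical[["empi_anon", "acc_anon", "study_date_anon"]].drop_duplicates()

    # Pair every exam with every pathology result of the same patient.
    pairs = exams.merge(path, on="empi_anon", how="inner")
    days = (pairs["procdate_anon"] - pairs["study_date_anon"]).dt.days
    in_window = pairs[(days >= 0) & (days <= window_days)]

    # Lower path_severity means more severe (0 is invasive cancer), so take the minimum.
    worst = in_window.groupby("acc_anon", as_index=False)["path_severity"].min()
    return worst.rename(columns={"path_severity": "attributed_path_severity"})


print(attribute_pathology(clinical))
```

Here the screening exam (acc_anon 10) inherits the result recorded on the later diagnostic exam because the specimen date falls within the window, while the unrelated patient is left without an attributed result.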