This guide provides instructions on how to install the AWS Command Line Interface (CLI), configure your credentials, and download the open-access portion of the Emory Breast Imaging Dataset (EMBED).
These instructions assume:
- You have already created an AWS account.
- You have already reviewed the Data Use Agreement for EMBED Open Data and your intended use conforms to these terms.
- You have already requested access on the online form and received an email confirming that your access has been granted.
Before proceeding, please ensure you have your AWS Access Key ID and Secret Access Key.
The AWS CLI is a unified tool to manage your AWS services. You will use this tool to interact with the S3 bucket hosting the dataset.
On macOS, the easiest way to install the AWS CLI is via the official GUI (PKG) installer available from the AWS website.
Alternatively, if you use Homebrew:
brew install awscli
curl "[https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip](https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip)" -o "awscliv2.zip"
unzip awscliv2.zip
sudo ./aws/install
To verify the installation on any operating system, open your terminal or command prompt and run:
aws --version
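You should see output along these lines (the exact versions and platform string will differ on your machine):
aws-cli/2.15.30 Python/3.11.8 Darwin/23.4.0 exe/x86_64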
Once installed, you must configure the CLI with your credentials to authenticate your requests.
Run the following command:
aws configure
You will be prompted to enter the following information:
- AWS Access Key ID
- AWS Secret Access Key
- Default region name: us-east-1 (or your preferred region)
- Default output format: json
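A typical session looks like this (the key values shown are placeholders; paste the credentials you received):
AWS Access Key ID [None]: AKIAXXXXXXXXXXXXXXXX
AWS Secret Access Key [None]: xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
Default region name [None]: us-east-1
Default output format [None]: json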
The EMBED Open dataset is located at: s3://embed-dataset-open
To list the contents of the root directory, use the ls command:
aws s3 ls s3://embed-dataset-open/
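If your credentials are configured correctly, the output should resemble the following (column spacing approximate):
PRE images/
PRE tables/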
You should see two primary directories:
- tables/
- images/

The tables/ directory contains CSV files describing the clinical features and image metadata. Because these files are relatively small, you can download them individually using the cp (copy) command.
To download the entire tables directory to your current local folder:
aws s3 cp s3://embed-dataset-open/tables/ ./tables/ --recursive
To download specific files:
Clinical Data: Contains features extracted from the enterprise radiology reporting system (BI-RADS, breast density, pathology, etc.).
aws s3 cp s3://embed-dataset-open/tables/EMBED_OpenData_clinical.csv .
Clinical Data Legend: Explains the variables in the clinical data table.
aws s3 cp s3://embed-dataset-open/tables/AWS_Open_Data_Clinical_Legend.csv .
Image Metadata: Contains DICOM attributes (from FFDM and C-View images) and derived columns. This file is essential for linking clinical data to specific image paths.
aws s3 cp s3://embed-dataset-open/tables/EMBED_OpenData_metadata.csv .
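Once downloaded, a quick way to sanity-check the tables is to load them with Pandas (paths below assume the files are in your current directory, as in the commands above):
import pandas as pd

df_clinical = pd.read_csv("EMBED_OpenData_clinical.csv")
df_meta = pd.read_csv("EMBED_OpenData_metadata.csv")
print(df_clinical.shape, df_meta.shape)  # row/column counts of each table
print(df_meta.columns.tolist()[:10])     # first few metadata column names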
The images/ directory contains the actual DICOM files. Please review the sections below carefully to determine the best download strategy for your available storage and research needs.
Warning: The complete dataset is approximately 2.5 TB in size. Ensure you have sufficient network bandwidth and local storage capacity before initiating a full download.
You can download the full dataset using either the cp (copy) or sync command.
Option A: Using aws s3 sync (Recommended)
We highly recommend using sync for large datasets. Unlike cp, the sync command compares the source and destination directories. If your download is interrupted (e.g., internet failure or computer shutdown), running the command again will resume from where it left off rather than starting over.
aws s3 sync s3://embed-dataset-open/images/ ./local_embed_images/
Option B: Using aws s3 cp
The cp command with the --recursive flag will also download the data, but it does not natively support resuming interrupted transfers as efficiently as sync.
aws s3 cp s3://embed-dataset-open/images/ ./local_embed_images/ --recursive
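If you only need part of the bucket but prefer to stay on the command line, both commands accept --exclude and --include filters, and --dryrun previews the transfer without downloading anything. For example, to fetch a single (hypothetical) patient folder, EMBED001:
aws s3 sync s3://embed-dataset-open/images/ ./local_embed_images/ --exclude "*" --include "EMBED001/*" --dryrun
Remove --dryrun once the listed files look correct.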
If you have storage limitations or only require specific sub-cohorts, you should filter the data tables first to generate a specific list of files to download.
The Workflow:
1. Download the tables/ directory as described in Section 4.
2. Load EMBED_OpenData_clinical.csv and EMBED_OpenData_metadata.csv into Python (Pandas).
3. Filter the tables to your cohort of interest, e.g. df_clinical[df_clinical.path_severity.isin([0.0, 1.0])] or df_meta[df_meta.manufacturer.str.contains("hologic", case=False)].
4. Save the filtered file list, including the anon_dicom_path column, as a CSV (e.g., my_cohort.csv) for the download script below; a sketch of steps 2-4 follows.
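A minimal sketch of steps 2-4, assuming the clinical table also carries the acc_anon exam identifier (adjust the filter and join to your own cohort definition):
import pandas as pd

# Step 2: load the tables (adjust paths to where you downloaded them)
df_clinical = pd.read_csv("EMBED_OpenData_clinical.csv")
df_meta = pd.read_csv("EMBED_OpenData_metadata.csv")

# Step 3: example filter -- exams with path_severity 0.0 or 1.0
cohort = df_clinical[df_clinical.path_severity.isin([0.0, 1.0])]

# Keep only metadata rows belonging to the filtered exams
cohort_meta = df_meta[df_meta.acc_anon.isin(cohort.acc_anon)]

# Step 4: save the file list expected by the download script
cohort_meta.to_csv("my_cohort.csv", index=False)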
Python Download Script:
This script uses the boto3 library to download files listed in your filtered CSV. It handles directory creation automatically.
Prerequisites: pip install boto3 pandas tqdm
import os
import boto3
import pandas as pd
from tqdm import tqdm
# --- CONFIGURATION ---
BUCKET_NAME = "embed-dataset-open"
# The CSV file containing your filtered cohort
INPUT_CSV = "my_cohort.csv"
# Where you want the images to be saved locally
DEST_DIR = "./my_cohort_images"
def download_cohort_images():
    # 1. Set up the AWS client
    s3 = boto3.client('s3')

    # 2. Load the CSV
    # Ensure your CSV was saved with the 'anon_dicom_path' column
    try:
        df = pd.read_csv(INPUT_CSV)
        if 'anon_dicom_path' not in df.columns:
            raise ValueError("CSV must contain 'anon_dicom_path' column.")
    except Exception as e:
        print(f"Error reading CSV: {e}")
        return

    # 3. Iterate and download
    print(f"Starting download of {len(df)} files from {BUCKET_NAME}...")
    for _, row in tqdm(df.iterrows(), total=len(df), unit="file"):
        s3_key = row['anon_dicom_path']

        # Remove any leading slash to avoid path issues
        if s3_key.startswith('/'):
            s3_key = s3_key[1:]

        # Construct the local path and create its directory if needed
        local_path = os.path.join(DEST_DIR, s3_key)
        local_folder = os.path.dirname(local_path)
        os.makedirs(local_folder, exist_ok=True)

        # Download the file
        try:
            s3.download_file(BUCKET_NAME, s3_key, local_path)
        except Exception as e:
            print(f"Failed to download {s3_key}: {e}")

    print("Download complete.")

if __name__ == "__main__":
    download_cohort_images()
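To run it, save the script under any name (e.g., download_cohort.py; the filename is arbitrary) in the directory containing my_cohort.csv and execute:
python download_cohort.py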
When working with the downloaded data, use the EMBED_OpenData_metadata.csv file as your index.
- The anon_dicom_path column in the metadata CSV provides the relative path to each image file (e.g., images/EMBED001/...).
- Use the empi_anon (Patient ID) and acc_anon (Exam ID) columns to correlate pixel data with clinical outcomes and pathology results.
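For instance, a sketch of that correlation in Pandas (assuming the clinical table also contains the empi_anon and acc_anon columns) attaches exam-level clinical fields to every image row:
import pandas as pd

df_clinical = pd.read_csv("EMBED_OpenData_clinical.csv")
df_meta = pd.read_csv("EMBED_OpenData_metadata.csv")

# One row per image, annotated with the matching exam's clinical fields
merged = df_meta.merge(df_clinical, on=["empi_anon", "acc_anon"], how="left")
print(merged[["anon_dicom_path", "empi_anon", "acc_anon"]].head())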