Working with DICOM files

Before we start analysing our files, let’s install some additional libraries, which add support for various medical imaging formats:

conda install -c glueviz gdcm # Python 3.5 and 3.6
conda install -c conda-forge gdcm # Python 3.7

We’ll be working with a subset of the CT Lymph Nodes dataset which can be downloaded here.

path = '~/dicom_data/'

Crawling

join_tree is the main function that collects the DICOM files’ metadata:

from dicom_csv import join_tree

df = join_tree(path, relative=True, verbose=False)

It recursively visits the subfolders of path, also it adds some additional attributes: NoError, HasPixelArray, PathToFolder, FileName:

len(df), df.NoError.sum(), df.HasPixelArray.sum()
(2588, 2587, 2587)

Thre resulting dataframe has 2588 files’ metadata in it, and only one file was openned with errors, let’s check which one:

df.loc[~df.NoError, ['FileName', 'PathToFolder']]
FileName PathToFolder
0 readme.txt .

There is a file readme.txt in the root of the folders tree, which is obvisously not a DICOM file.

Note that PathToFolder is relative to path, this is because we passed relative=True to join_tree.

# leave only dicoms that contain images (Pixel Arrays)
dicoms = df[df.NoError & df.HasPixelArray]

dicoms.FileName[1], dicoms.PathToFolder[1]
('000466.dcm',
 'ABD_LYMPH_001/09-14-2014-ABDLYMPH001-abdominallymphnodes-30274/abdominallymphnodes-26828')

Aggregation

Next, we can join the dicom files into series, which are often easier to operate with:

from dicom_csv import aggregate_images

images = aggregate_images(dicoms)
len(images)
4

aggregate_images also adds some attributes: SlicesCount, FileNames, InstanceNumbers, check its docstring for more information.

For example FileNames contains all the files that are part of a particular series:

images.FileNames[0][:50] + '...'
'000466.dcm/000312.dcm/000150.dcm/000357.dcm/000311...'

As you can see, they are not ordered by default, but you can change this behaviour by passing the process_series argument which receives a subset of the dataframe, containing files from the same series:

images = aggregate_images(dicoms, process_series=lambda series: series.sort_values('FileName'))

images.FileNames[0][:50] + '...'
'000000.dcm/000001.dcm/000002.dcm/000003.dcm/000004...'

Loading

You can load a particular series’ images stacked into a numpy array using the following function:

img = load_series(images.loc[0], path)

it expects a row from the aggregated dataframe and, optionally, the path argument, if the paths are relative.

The image’s orientation as well as the slices’ order can be determined automatically, if you pass orientation=True:

img = load_series(images.loc[0], path, orientation=True)
print(img.shape, images.PixelArrayShape[0], images.SlicesCount[0])
(512, 512, 661) 512,512 661