LUNA16_Challange数据预处理2

ps 直接上好像有点困难，那么先整理下LUNA16_Challange中平安科技公司的技术说明中预处理部分（还是比较好理解，理解错误欢迎指正）

Data Preprocessing

At first, we get the lung area by using traditional methods, and then preprocessing is performed on the lung area. Use -600HU as a threshold to get a 0-1 3D map, based on the size of the 3D map block and the average distance traveled by the tile to the center, and make up for all small cave depths. The final 0-1 three-dimensional map is the lung area. As the CT picture up and down there will be some slices connected with the outside world, should be removed. The final image pixel values are clip to [-1200,600], then zoom to [0,255]. Pixels for non-lung regions are set to 170. Pretreatment can eliminate the noise, such as the bright spots of the bones, the metal lines of the CT bed. And finally, we get 128*128*128 cube.

For false positive reduction track, we use a multi-scale strategy, and prepare two sizes little cube: 36*48*48 and 20*36*36. We crop little cube from the whole lung area to feed into the two classification networks. Obtain different fields of vision size of nodules to predict the overall results.

More importantly, the training dataset has extremely high false positive to true positive ratio (735418:1557). To solve the problem of category imbalance, in addition to the focal loss function, we used oversampling to increase the number of positive samples. Specific methods are sliding window crop, flip respect to x-axis, y-axis and z-axis, rotate 90, 180, 270 degrees, multi-scale transform. Finally, we expanded the positive sample more than 300 times.

数据预处理

把-600hu作为阈值获得一个0-1二值三维图。并把其中的小块都补上，最终得到一个仅含肺区域三维图。其他链接外部组织的切片去掉，最终图像值在[-1200,600]，然后缩放到[0,255],非肺区域设置为170.预处理可以消除噪声，如骨骼的亮点，CT床的金属线。最终得到128*128*128的立方体。

这短短几句话其实工作量还是有点的，很多讲得不够细。下面先了解下hu值。参考：http://shartoo.github.io/medical_image_process/

substance HU 空气-1000肺-500脂肪-100到-50水0CSF15肾30血液+30到+45肌肉+10到+40灰质+37到+45白质+20到+30Liver+40到+60软组织,constrast+100到+300骨头+700(软质骨)到+3000(皮质骨) 这是具体值。

计算方法像素值转换到hu值：Hounsfield Unit = pixel_value * rescale_slope + rescale_intercept

可以查看病人的扫描HU值分布情况代码代码如下：

# -*- coding:utf-8 -*-
'''
this script is used for basic process of lung 2017 in Data Science Bowl
'''
import glob
import os
import pandas as pd
import SimpleITK as sitk
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import skimage, os
from skimage.morphology import ball, disk, dilation, binary_erosion, remove_small_objects, erosion, closing, reconstruction, binary_closing
from skimage.measure import label,regionprops, perimeter
from skimage.morphology import binary_dilation, binary_opening
from skimage.filters import roberts, sobel
from skimage import measure, feature
from skimage.segmentation import clear_border
from skimage import data
from scipy import ndimage as ndi
import matplotlib
#matplotlib.use('Agg')
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d.art3d import Poly3DCollection
import pydicom
import scipy.misc
import numpy as npdef load_scan(path):slices = [pydicom.read_file(path + '/' + s) for s in os.listdir(path)]slices.sort(key = lambda x: int(x.ImagePositionPatient[2]))try:slice_thickness = np.abs(slices[0].ImagePositionPatient[2] - slices[1].ImagePositionPatient[2])except:slice_thickness = np.abs(slices[0].SliceLocation - slices[1].SliceLocation)for s in slices:s.SliceThickness = slice_thicknessreturn slicesdef get_pixels_hu(slices):image = np.stack([s.pixel_array for s in slices])# Convert to int16 (from sometimes int16),# should be possible as values should always be low enough (<32k)image = image.astype(np.int16)# Set outside-of-scan pixels to 0# The intercept is usually -1024, so air is approximately 0image[image == -2000] = 0# Convert to Hounsfield units (HU)for slice_number in range(len(slices)):intercept = slices[slice_number].RescaleInterceptslope = slices[slice_number].RescaleSlopeif slope != 1:image[slice_number] = slope * image[slice_number].astype(np.float64)image[slice_number] = image[slice_number].astype(np.int16)image[slice_number] += np.int16(intercept)return np.array(image, dtype=np.int16)first_patient = load_scan('E:/DcmData/xlc/Fracture_data/Me/3004276169/3302845/')
first_patient_pixels = get_pixels_hu(first_patient)
plt.hist(first_patient_pixels.flatten(), bins=80, color='c')
plt.xlabel("Hounsfield Units (HU)")
plt.ylabel("Frequency")
plt.show()

结果如下：

这里的读入数据是一个ct序列（dcm文件）就是在E:/DcmData/xlc/Fracture_data/Me/3004276169/3302845/路径下，由于是公司的数据这里就不共享了。转为luna16数据集该怎么处理呢。一直在寻找MHD格式数据的处理方法，对于dicom格式的CT有很多论文根据其HU值域可以轻易地分割肺、骨头、血液等，但是对于MHD没有这样的参考。从LUNA16论坛得到的解释是，LUNA16的MHD数据已经转换为HU值了。

图像值在[-1200,600]，然后缩放到[0,255],非肺区域设置为170.代码：

import SimpleITK as sitk
import matplotlib.pyplot as plt
import numpy as np
# -*- coding:utf-8 -*-
'''
this script is used for basic process of lung 2017 in Data Science Bowl
'''
import glob
import os
import pandas as pd
import SimpleITK as sitk
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import skimage, os
from skimage.morphology import ball, disk, dilation, binary_erosion, remove_small_objects, erosion, closing, reconstruction, binary_closing
from skimage.measure import label,regionprops, perimeter
from skimage.morphology import binary_dilation, binary_opening
from skimage.filters import roberts, sobel
from skimage import measure, feature
from skimage.segmentation import clear_border
from skimage import data
from scipy import ndimage as ndi
import matplotlib
#matplotlib.use('Agg')
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d.art3d import Poly3DCollection
import pydicom
import scipy.misc
import numpy as npfilename='E:\\JLS\\dcm_data\\luna\\subsets\\subset1\\1.3.6.1.4.1.14519.5.2.1.6279.6001.173106154739244262091404659845.mhd'
#filename='E:\\DcmData\\xlc\\Fracture_data\\mhd_raw\\1.3.12.2.1107.5.1.4.75751.30000018103122041563700126965.mhd'
itkimage = sitk.ReadImage(filename)#读取.mhd文件
numpyImage = sitk.GetArrayFromImage(itkimage)#获取数据，自动从同名的.raw文件读取#numpyImage[numpyImage > -600] = 1
#numpyImage[numpyImage <= -600] = 0
def get_segmented_lungs(im, plot=False):'''This funtion segments the lungs from the given 2D slice.'''if plot == True:f, plots = plt.subplots(4, 2, figsize=(5, 40))'''Step 1: Convert into a binary image.'''binary = im < -600if plot == True:plots[0][0].axis('off')plots[0][0].set_title('binary image')plots[0][0].imshow(binary, cmap=plt.cm.bone)'''Step 2: Remove the blobs connected to the border of the image.'''cleared = clear_border(binary)if plot == True:plots[0][1].axis('off')plots[0][1].set_title('after clear border')plots[0][1].imshow(cleared, cmap=plt.cm.bone)'''Step 3: Label the image.'''label_image = label(cleared)if plot == True:plots[1][0].axis('off')plots[1][0].set_title('found all connective graph')plots[1][0].imshow(label_image, cmap=plt.cm.bone)'''Step 4: Keep the labels with 2 largest areas.'''areas = [r.area for r in regionprops(label_image)]areas.sort()if len(areas) > 2:for region in regionprops(label_image):if region.area < areas[-2]:for coordinates in region.coords:label_image[coordinates[0], coordinates[1]] = 0binary = label_image > 0if plot == True:plots[1][1].axis('off')plots[1][1].set_title(' Keep the labels with 2 largest areas')plots[1][1].imshow(binary, cmap=plt.cm.bone)'''Step 5: Erosion operation with a disk of radius 2. This operation isseperate the lung nodules attached to the blood vessels.'''selem = disk(2)binary = binary_erosion(binary, selem)if plot == True:plots[2][0].axis('off')plots[2][0].set_title('seperate the lung nodules attached to the blood vessels')plots[2][0].imshow(binary, cmap=plt.cm.bone)'''Step 6: Closure operation with a disk of radius 10. This operation isto keep nodules attached to the lung wall.'''selem = disk(10)binary = binary_closing(binary, selem)if plot == True:plots[2][1].axis('off')plots[2][1].set_title('keep nodules attached to the lung wall')plots[2][1].imshow(binary, cmap=plt.cm.bone)'''Step 7: Fill in the small holes inside the binary mask of lungs.'''edges = roberts(binary)binary = ndi.binary_fill_holes(edges)if plot == True:plots[3][0].axis('off')plots[3][0].set_title('Fill in the small holes inside the binary mask of lungs')plots[3][0].imshow(binary, cmap=plt.cm.bone)'''Step 8: Superimpose the binary mask on the input image.'''im=(255/1800 *im +1200*255/1800)get_high_vals = binary == 0im[get_high_vals] = 170im=np.rint(im)if plot == True:plots[3][1].axis('off')plots[3][1].set_title('Superimpose the binary mask on the input image')plots[3][1].imshow(im, cmap=plt.cm.bone)return im
def plot_ct_scan(scan):'''plot a few more images of the slices:param scan::return:'''f, plots = plt.subplots(int(scan.shape[0] / 20) + 1, 4, figsize=(50, 50))for i in range(0, scan.shape[0], 5):plots[int(i / 20), int((i % 20) / 5)].axis('off')plots[int(i / 20), int((i % 20) / 5)].imshow(scan[i])
for idx in range(1):print(len(numpyImage))plot_ct_scan(numpyImage)data = numpyImage[50]print(data[314][303])plt.figure(100)plt.imshow(data, cmap='gray')im=get_segmented_lungs(data, plot=True)print(im[7][7])print(im[314][303],np.max(np.max(im)),np.min(np.min(im)))plt.figure(200)plt.imshow(im, cmap='gray')
plt.show()

结果如下：

原图与最终结果如下：

没缩放到[0,255]时的最终结果与缩放之后的图是一致的，原图与最终结果中间颜色不太一致应该是画图自适应的缘故，中间部分值是一样的（没缩放到[0,255]之前）。

把-600hu作为阈值获得一个0-1二值三维图。并把其中的小块都补上，最终得到一个仅含肺区域三维图这步直接做卡在了参考的http://shartoo.github.io/medical_image_process/中measure.marching_cubes函数已经被移除无法画出3d图，结果很不直观，其实上面一步代码中已经用到-600hu作为阈值，而并把其中的小块都补上，最终得到一个仅含肺区域三维图这一步在二维层面也已经得到了。重新拼接起来就是3维。

其他链接外部组织的切片去掉这步比较麻烦，感觉还是在上面代码中根据肺部面积占总面积的比重来判断，上面代码中添加代码如下：

    '''Step 8: Superimpose the binary mask on the input image.'''sum=0for r in regionprops(label_image):sum=sum+r.areaproportion=sum/(512*512)LP.append(proportion)

函数添加一个LP=[] 参数，最后返回LP（lung proportion），最后要选取128*128*128.根据比重最大值来选那128个切片。

但是还有一个问题是还没有归一化尺度，感觉是要放之之前要做的。那么先归一化尺度为[1，0.5556，0.5556]（zxy）另一片技术文档。代码是根据上一篇博文中的参考中的代码修改的：

from os import listdir
import numpy as np
import Provider
from scipy import ndimage# Directories
baseSubsetDirectory = r'E:/JLS/dcm_data/luna/subsets/subset'
targetSubsetDirBase = r'E:/JLS/dcm_data/luna/Resampled_1_0.7_0.7/subsets/subset'RESIZE_SPACING = [1, 0.5556, 0.5556]
NUM_SUBSET = 10for setNumber in range(NUM_SUBSET):subsetDirectory = baseSubsetDirectory + str(setNumber)list = listdir(subsetDirectory)subsetList = []# Create Subset Listfor file in list:if file.endswith(".mhd"):subsetList.append(file)for file in subsetList:try:fileName = file[:-4]filePath = subsetDirectory + '/' + filevolumeImage, numpyOrigin, numpySpacing = Provider.load_itk_image(filePath)resize_factor = numpySpacing / RESIZE_SPACINGnew_real_shape = volumeImage.shape * resize_factor# volumeImage[volumeImage <= -600]= 0# volumeImage[volumeImage > -600] = 1new_shape = np.round(new_real_shape)real_resize = new_shape / volumeImage.shapenew_volume = ndimage.zoom(volumeImage, zoom=real_resize)

然后将两段代码拼接起来，博文太长了就先不放了，等下篇博文再放。获得128*128*128的立方体的代码也是。