Charles R. Qi et al. CVPR 2017
Converting irregular point clouds (PC) to a regular format (image grids or 3D voxels) makes the data unnecessarily voluminous and introduces quantization artifacts.
This paper designs a novel neural network that directly consumes point clouds and respects the permutation invariance of the input points.
PointNet takes a point cloud directly as input, each point given by its three coordinates (x, y, z), and supports object classification, part segmentation, and scene semantic parsing.
Designs a novel deep net architecture suitable for consuming unordered point sets in 3D.
Shows how such a net can be trained to perform 3D shape classification, shape part segmentation, and scene semantic parsing.
Provides thorough empirical and theoretical analysis of the stability and efficiency of the method.
Illustrates the 3D features computed by selected neurons in the net and develops intuitive explanations for its performance.

PC Features: mostly handcrafted for specific tasks. They can be intrinsic or extrinsic, local or global. Finding the optimal feature combination is not trivial.
DL on 3D Data:
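A PC is represented as a set of 3D points {P_i | i = 1, ..., n}, where each point P_i is a vector of its (x, y, z) coordinates plus extra feature channels (color, normal, etc.). Unless otherwise noted, PointNet uses only the (x, y, z) coordinates.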
Object Classification & Semantic Segmentation.
Unordered; interaction among points (not isolated, combinatorial interactions); invariance under transformations (rotation, translation).

Three key modules: a max pooling layer (a symmetric function that aggregates information from all points), a local and global information combination structure, and two joint alignment networks (which align both the input points and the point features).
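The role of max pooling as a symmetric function can be sketched in NumPy as below. This is a minimal illustration, not the paper's implementation: the layer widths, the random weights, and the helper names `shared_mlp` and `global_feature` are assumptions chosen for the example. It checks that shuffling the points leaves the pooled feature unchanged.

```python
import numpy as np

def shared_mlp(points, weights, biases):
    """Apply the same small MLP (with ReLU) to every point independently."""
    h = points
    for w, b in zip(weights, biases):
        h = np.maximum(h @ w + b, 0.0)
    return h

def global_feature(points, weights, biases):
    """Max-pool the per-point features over the point axis (symmetric in the points)."""
    return shared_mlp(points, weights, biases).max(axis=0)

rng = np.random.default_rng(0)
dims = [3, 16, 32]  # hypothetical layer widths, far smaller than the paper's
weights = [rng.normal(size=(i, o)) for i, o in zip(dims[:-1], dims[1:])]
biases = [rng.normal(size=o) for o in dims[1:]]

cloud = rng.normal(size=(100, 3))                 # n x 3 point cloud
shuffled = cloud[rng.permutation(len(cloud))]     # same points, different order

g1 = global_feature(cloud, weights, biases)
g2 = global_feature(shuffled, weights, biases)
print(np.allclose(g1, g2))
```

Because each point passes through the same MLP and the maximum does not depend on the order of its arguments, any reordering of the rows of `cloud` yields the same global feature.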

Local and Global Information Aggregation: For classification, an SVM or MLP classifier is trained on the global feature. For segmentation, the global feature is fed back to the per-point features by concatenating it with each point feature (n x 1088 = n x 64 concatenated with n x 1024). The segmentation network then extracts new per-point features (n x 128) from the combined features, so each point feature is aware of both local and global information.
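The shape bookkeeping of this concatenation can be sketched in NumPy. The random arrays below are placeholders for learned features, and the variable names are assumptions made for this example only.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5                                      # number of points (tiny, for illustration)
point_feats = rng.normal(size=(n, 64))     # per-point local features, n x 64
high_feats = rng.normal(size=(n, 1024))    # per-point features before pooling, n x 1024

global_feat = high_feats.max(axis=0)       # global feature, shape (1024,)

# Repeat the global feature for every point and append it to the local features.
tiled = np.broadcast_to(global_feat, (n, 1024))
combined = np.concatenate([point_feats, tiled], axis=1)
print(combined.shape)
```

Every row of `combined` holds one point's own 64 local channels followed by the same 1024 global channels, which is the n x 1088 input to the segmentation layers.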
Joint Alignment Network: Semantic labeling has to be invariant to geometric transformations of the point cloud, so the representation learned from the point set should be invariant to them as well. A natural solution is to align every input set to a canonical space before feature extraction. A mini-network (T-Net), built from the same basic modules as the main network (point-independent feature extraction, max pooling, and fully connected layers), predicts an affine transformation matrix that is applied directly to the input coordinates. The same idea is applied in feature space, but the feature transformation matrix has a much higher dimension than the spatial one, which makes optimization harder. To counter this, a regularization term is added to the softmax training loss that constrains the feature transformation matrix A to be close to an orthogonal matrix (an orthogonal transformation does not lose information): L_reg = ||I - A A^T||_F^2.
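The orthogonality regularizer can be sketched in NumPy as the squared Frobenius norm of I - A A^T. The function name `orthogonality_penalty` and the 64 x 64 test matrices are assumptions for this example. In training, A would be predicted by the feature T-Net rather than built by hand.

```python
import numpy as np

def orthogonality_penalty(a):
    """Squared Frobenius norm of (I - A A^T); zero exactly when A is orthogonal."""
    k = a.shape[0]
    diff = np.eye(k) - a @ a.T
    return float(np.sum(diff ** 2))

rng = np.random.default_rng(0)
k = 64
# The Q factor of a QR decomposition is an orthogonal matrix.
q, _ = np.linalg.qr(rng.normal(size=(k, k)))

penalty_orthogonal = orthogonality_penalty(q)       # close to 0
penalty_scaled = orthogonality_penalty(2.0 * q)     # A A^T = 4I, so the penalty is 9 * k

print(penalty_orthogonal, penalty_scaled)
```

The scaled matrix shows what the penalty discourages: a transformation that stretches the feature space moves A A^T away from the identity and is charged for it.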



(a) says that f(S) is unchanged under input corruption as long as all points in C_S are preserved, and that it is also unchanged when extra noise points are added, up to N_S. (b) says that C_S contains only a bounded number of points, at most K. In other words, f(S) is fully determined by a finite subset C_S of S with at most K elements. C_S is called the critical point set of S, and K is the bottleneck dimension of f.
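The critical point set can be illustrated with a small NumPy sketch. It assumes a hypothetical one-layer point feature map `h` with K = 8 output channels and random weights, followed by max pooling. The critical points are those that attain the maximum in at least one channel.

```python
import numpy as np

rng = np.random.default_rng(0)
K = 8                                   # bottleneck dimension (hypothetical, small)
w = rng.normal(size=(3, K))             # random weights standing in for a trained net

def h(points):
    """Per-point feature map: one linear layer with ReLU."""
    return np.maximum(points @ w, 0.0)

def f(points):
    """Global feature: channel-wise max over all points."""
    return h(points).max(axis=0)

cloud = rng.normal(size=(200, 3))

# Indices of points that achieve the maximum in at least one of the K channels.
critical_idx = np.unique(h(cloud).argmax(axis=0))
critical = cloud[critical_idx]

# Any subset that still contains the critical points gives the same global feature.
extra_idx = rng.choice(len(cloud), size=50, replace=False)
kept = cloud[np.union1d(critical_idx, extra_idx)]

print(len(critical_idx), np.allclose(f(critical), f(cloud)), np.allclose(f(kept), f(cloud)))
```

Each channel contributes at most one index, so the critical set has at most K points, and dropping every other point does not change the pooled feature.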
The paper proposes PointNet, a novel deep neural network that directly consumes point clouds and handles 3D recognition tasks including classification, part segmentation, and semantic segmentation, with results on par with or better than the state of the art.
