4 Implementation of the 3D Virtual Learning System

4.1 Stereo Camera

Considering that the two target applications are e-learning and e-business, mature and low-cost stereo imaging technology should be used. Webcams were chosen as the image sensors for our system. Two high-quality webcams with VGA resolution and a 30 fps frame rate are physically aligned and fixed on a metal base; they can easily be mounted on computer screens or tripods. The physical alignment makes the optical axes of the two cameras parallel and pointing in the same direction. Due to manufacturing defects and imperfect alignment, the output images must be undistorted and rectified before they are used to extract depth information.

4.2 Camera Calibration and Rectification

Single-camera checkerboard calibration is performed for both the left and right cameras. We use Heikkilä and Silvén's [11] camera model, which takes the focal lengths and principal points as the camera intrinsic parameters. Lens distortion, including radial distortion and tangential distortion, is described by 5 parameters. 16 different checkerboard images are taken to guarantee a robust estimation of the camera parameters. The stereo calibration then estimates the translation vector T and rotation matrix R characterizing the position of the right camera relative to the left (reference) camera.
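The calibration pipeline described above can be sketched with OpenCV's standard checkerboard routines. This is a hedged illustration, not the paper's exact procedure: the board dimensions, square size and flag choices here are assumptions.

```python
import numpy as np

def checkerboard_object_points(cols, rows, square_size):
    """3D coordinates of the inner checkerboard corners on the z = 0 plane."""
    objp = np.zeros((rows * cols, 3), np.float32)
    objp[:, :2] = np.mgrid[0:cols, 0:rows].T.reshape(-1, 2) * square_size
    return objp

def calibrate_stereo(left_imgs, right_imgs, cols=9, rows=6, square_size=25.0):
    """Single-camera calibration for each camera, then stereo calibration
    giving R and T of the right camera w.r.t. the left (reference) camera."""
    import cv2  # imported here so the pure-geometry helper runs without OpenCV

    objp = checkerboard_object_points(cols, rows, square_size)
    obj_pts, left_pts, right_pts = [], [], []
    for li, ri in zip(left_imgs, right_imgs):
        ok_l, corners_l = cv2.findChessboardCorners(li, (cols, rows))
        ok_r, corners_r = cv2.findChessboardCorners(ri, (cols, rows))
        if ok_l and ok_r:  # keep only views where both cameras see the board
            obj_pts.append(objp)
            left_pts.append(corners_l)
            right_pts.append(corners_r)

    size = (left_imgs[0].shape[1], left_imgs[0].shape[0])
    # intrinsics K plus 5 distortion coefficients (k1, k2, p1, p2, k3) per camera
    _, K1, d1, _, _ = cv2.calibrateCamera(obj_pts, left_pts, size, None, None)
    _, K2, d2, _, _ = cv2.calibrateCamera(obj_pts, right_pts, size, None, None)
    # relative pose of the right camera: rotation R and translation T
    _, K1, d1, K2, d2, R, T, _, _ = cv2.stereoCalibrate(
        obj_pts, left_pts, right_pts, K1, d1, K2, d2, size,
        flags=cv2.CALIB_FIX_INTRINSIC)
    return K1, d1, K2, d2, R, T
```

With 16 checkerboard views per camera, as in the text, the loop collects up to 16 corresponding corner sets before the two calibration stages run.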

With the intrinsic parameters, an undistortion process [11] is applied to each camera in every frame to suppress tangential and radial distortion. To simplify the computation of pixel correspondences, the two image planes first need to be rectified. A. Fusiello et al. [12] proposed a rectification procedure that consists of an image plane rotation, a principal point adjustment and a focal length adjustment. Let m = [u v 1]^T be the homogeneous coordinates of a pixel on the right camera's image plane. The transformation of the right camera's image plane is

m^new = (K R_n)(K R_o)^(-1) m^old,

where m^old and m^new are the homogeneous coordinates of a pixel on the right camera's image plane before and after rectification, R_n is the identity matrix, and R_o is the rotation matrix of the camera before the rotation.
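In this formulation the rectifying map is a single 3×3 homography, so it can be illustrated in a few lines of NumPy. This is a sketch; the intrinsic matrix values below are made-up placeholders, not calibration results.

```python
import numpy as np

def rectifying_homography(K, R_o, R_n=np.eye(3)):
    """H maps old pixel coordinates to rectified ones:
    m_new = (K R_n)(K R_o)^(-1) m_old."""
    return (K @ R_n) @ np.linalg.inv(K @ R_o)

def apply_homography(H, uv):
    """Apply H to a pixel (u, v) via homogeneous coordinates."""
    m = H @ np.array([uv[0], uv[1], 1.0])
    return m[:2] / m[2]

# Placeholder intrinsics: focal length 500 px, principal point (320, 240).
K = np.array([[500.0,   0.0, 320.0],
              [  0.0, 500.0, 240.0],
              [  0.0,   0.0,   1.0]])
```

A quick sanity check: if the camera was already aligned (R_o equal to the identity), H reduces to the identity and every pixel maps to itself.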

4.3 Hand Gesture Recognition

For the purpose of generating skin color statistics, luminance and chrominance need to be separated. We convert the image sequence from RGB color space to YCbCr [13] by:

Y = 0.299R + 0.587G + 0.114B
Cr = R - Y
Cb = B - Y,    (1)

where Y is the luminance component, and Cb and Cr are the chrominance components. This color space conversion is applied to the images from both the left and right cameras.
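Equation (1) in vectorized form. Note that this variant of YCbCr keeps Cr = R − Y and Cb = B − Y without the scaling constants of broadcast-standard YCbCr, matching the equation above.

```python
import numpy as np

def rgb_to_ycbcr(img):
    """Convert an (H, W, 3) RGB array to (Y, Cb, Cr) per equation (1)."""
    R, G, B = img[..., 0], img[..., 1], img[..., 2]
    Y = 0.299 * R + 0.587 * G + 0.114 * B   # luminance
    Cr = R - Y                              # red chrominance
    Cb = B - Y                              # blue chrominance
    return Y, Cb, Cr
```

A useful sanity check: the luminance weights sum to 1, so gray pixels (R = G = B) keep their value in Y and have zero chrominance.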

Color-based segmentation is used to discriminate hands from their background.

S. L. Phung et al. [14] showed that a Bayesian classifier outperforms a linear classifier as well as single and mixture Gaussian models. Whether a pixel X is considered a skin pixel is decided by a threshold τ:

p(X | ω_0) / p(X | ω_1) ≥ τ,    (2)

where ω_0 and ω_1 denote the skin color and non-skin color classes, and p(X | ω_0) and p(X | ω_1) are the conditional probability density functions of skin and non-skin colors. A color calibration procedure is needed when users first use the system: users are asked to wave their hands in the camera view so that training data for the skin color can be acquired. With this, the system adaptively learns each user's skin color as well as the lighting conditions.
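The decision rule of equation (2) can be sketched with histogram density estimates. This is an illustration on made-up one-dimensional color values; the real system estimates the densities in chrominance space from the hand-waving calibration data.

```python
import numpy as np

def fit_histogram_pdf(samples, bins=32, rng=(0.0, 1.0)):
    """Estimate a pdf from samples via a normalized histogram."""
    hist, edges = np.histogram(samples, bins=bins, range=rng, density=True)
    def pdf(x):
        i = np.clip(np.searchsorted(edges, x, side="right") - 1, 0, bins - 1)
        return hist[i]
    return pdf

def is_skin(x, p_skin, p_nonskin, tau=1.0, eps=1e-12):
    """Equation (2): classify x as skin if p(x|w0)/p(x|w1) >= tau."""
    return p_skin(x) / (p_nonskin(x) + eps) >= tau

# Hypothetical training samples gathered while the user waves a hand.
gen = np.random.default_rng(0)
skin = np.clip(gen.normal(0.7, 0.05, 2000), 0, 1)     # skin chrominance values
nonskin = np.clip(gen.normal(0.3, 0.10, 2000), 0, 1)  # background values
p0 = fit_histogram_pdf(skin)
p1 = fit_histogram_pdf(nonskin)
```

Pixels near the skin mode are accepted and pixels near the background mode rejected; τ trades false positives against false negatives.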

We want to discriminate between open and closed hand poses by learning geometrical features extracted from the hands. A contour retrieving algorithm is applied to topologically extract all possible contours in the segmented images. We empirically take the two largest segmented areas as the hand segments, because the two hands are normally the largest skin-colored areas in the view. A convex hull and its vertex set are then computed [15]. The number of vertices after a polygon approximation procedure should be in the range of 8 to 15, considering both computational cost and accuracy. Several features can be extracted from the convexity defects: the distance between the starting point A and the ending point B of each defect, and the distance between the depth point C and the farthest point on the hand D. The distances l_AB and l_CD fully describe the configuration of two adjacent fingers.
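The two defect distances are plain Euclidean distances between the defect points, as sketched below. The point coordinates here are hypothetical; in the actual pipeline A, B and C come from the convexity defects of the detected hand contour.

```python
import numpy as np

def defect_features(A, B, C, D):
    """l_AB: gap between the start A and end B of a convexity defect;
    l_CD: distance between the depth point C and the fingertip point D."""
    l_ab = float(np.linalg.norm(np.subtract(A, B)))
    l_cd = float(np.linalg.norm(np.subtract(C, D)))
    return l_ab, l_cd

# Hypothetical defect between two extended fingers (pixel coordinates):
# fingertips at A and B, valley between the fingers at C, farthest point D.
l_ab, l_cd = defect_features(A=(100, 40), B=(130, 40), C=(115, 90), D=(115, 10))
```

For an open hand both distances are large; on a fist the fingertips collapse toward the valleys and both distances shrink, which is what makes the pair discriminative.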

To help discriminate the open and closed hand poses, we train a classifier on the Cambridge Hand Gesture Dataset [16]. We use this dataset because its images have a camera position similar to ours, and it provides sequences of hand actions that are suitable for learning hand dynamics. We select 182 images from the dataset and manually label them as ω_0 (open hand) or ω_1 (closed hand). For each image, we extract the l_AB and l_CD distances from all convexity defects of the hand. A training vector is described as {L, ω_i}, where L is the set of l_AB and l_CD distances of a hand. A support vector machine is trained on the resulting 14-dimensional descriptor vectors. A radial basis function is used as the kernel to nonlinearly map the vectors into a higher-dimensional space in which a separating hyperplane can be found.
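Training such a classifier can be sketched with scikit-learn's `SVC`. The descriptors below are synthetic stand-ins for the labeled dataset vectors, purely for illustration.

```python
import numpy as np
from sklearn.svm import SVC

gen = np.random.default_rng(1)
# Synthetic stand-in descriptors: 7 convexity defects x (l_AB, l_CD) = 14 dims.
open_hands = gen.normal(40.0, 4.0, size=(60, 14))    # wide finger gaps (pixels)
closed_hands = gen.normal(8.0, 2.0, size=(60, 14))   # small gaps on a fist
X = np.vstack([open_hands, closed_hands])
y = np.array([0] * 60 + [1] * 60)  # 0 = open hand, 1 = closed hand

clf = SVC(kernel="rbf", gamma="scale")  # RBF kernel, as in the paper
clf.fit(X, y)
```

At run time, the 14-dimensional descriptor of each detected hand is fed to `clf.predict` to decide the pose.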

Since there is no need to track individual finger movements, the positions of the hands in the two camera views are described by two coordinates: (x_L, y_L) and (x_R, y_R). The coordinate of a hand in each camera view is taken as the center of gravity of the hand segment, which smooths the jitter caused by segmentation. After image rectification we have y_L = y_R. The disparity along the x direction is computed as d = x_L - x_R. The depth z of the point is given by:

z = fT / d,    (3)

where f is the focal length and T is the baseline of the stereo camera. Note that the quantities in equation (3) are expressed in pixels.
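Equation (3) in code, as a minimal sketch; the focal length and baseline values below are placeholders, not the system's calibrated values.

```python
def depth_from_disparity(x_left, x_right, f, T):
    """z = f * T / d with d = x_left - x_right (all quantities in pixels)."""
    d = x_left - x_right
    if d <= 0:
        raise ValueError("disparity must be positive for a point in front of the rig")
    return f * T / d

# Placeholder values: f = 600 px, baseline T = 60 px, disparity d = 12 px.
z = depth_from_disparity(412.0, 400.0, f=600.0, T=60.0)  # -> 3000.0
```

The inverse relation between z and d is why nearby hands, which have large disparities, are located much more precisely than distant ones.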

Existing hand interaction is severely limited by the current two-hand rotation gesture, due to the lack of research on fist kinematics. A single-fist rotation detector (FRD) is crucial for implementing the SH-GMR activity, which makes it possible to control different objects with the two hands simultaneously. With this in mind, a feature-based FRD was proposed to extract robust and accurate fist rotation angles [9]. The features we find on fists are called "fist lines": the three clearly dark lines between the index, middle, ring and pinky fingers.

The FRD is a three-step approach. The first step is fist shape segmentation, which locates a single fist in a search window; a clustering process is used to decide the fist position along the arm. The second step finds a rough rotation angle from histograms of feature gradients computed with a Laplacian of Gaussian (LoG) filter, and then refines the angle to higher accuracy within (-90°, 90°) using constrained multiple linear regression. The third step resolves the angle within (-360°, 360°) by making use of the distribution of other edge features on the fist.
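The rough-angle step can be illustrated with a plain gradient-orientation histogram. This is a deliberately simplified sketch: it uses finite-difference gradients rather than the paper's LoG filtering, and synthetic stripes stand in for the dark fist lines.

```python
import numpy as np

def dominant_gradient_orientation(img, bins=36, mag_thresh=0.1):
    """Dominant gradient orientation in (-90, 90) degrees from a histogram.
    Dark lines in the image run perpendicular to this direction."""
    gy, gx = np.gradient(img.astype(float))
    mag = np.hypot(gx, gy)
    ang = np.degrees(np.arctan2(gy, gx))
    ang = (ang + 90.0) % 180.0 - 90.0   # fold 180-ambiguous angles into [-90, 90)
    strong = ang[mag > mag_thresh]       # keep only confident edge pixels
    hist, edges = np.histogram(strong, bins=bins, range=(-90.0, 90.0))
    i = int(np.argmax(hist))
    return 0.5 * (edges[i] + edges[i + 1])  # center of the peak bin
```

For an upright fist the fist lines are roughly vertical, so the dominant gradient direction is near 0° (horizontal), and rotating the fist shifts the histogram peak accordingly.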
