
3 The Developed Solution

The proposed system is designed to accomplish three different tasks:

(1) sign acquisition and recognition (front-end), (2) sign conversion and transmission, and (3) sign synthesis (back-end). These are performed by three different sub-blocks: the input module, the transmission module, and the robotic hand module.

The input module is connected to a depth camera (the acquisition device) and is able to identify signs made by a human hand in front of the device. The transmission module is in charge of encoding the information generated by the first block, sending it through the web, and decoding it into a form suitable for the last block. Finally, the robotic hand module is composed of the robotic haptic interface and a controller that uses the information from the first module to drive the robotic hand properly.
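The three-module split above can be illustrated with a minimal sketch; the class names, the 19-angle pose representation, and the comma-separated encoding are illustrative assumptions, not the actual implementation.

```python
# Minimal sketch of the three-module pipeline: acquisition -> encoding/
# transmission -> actuation. Interfaces are illustrative placeholders.
from typing import List

class InputModule:
    """Front-end: acquires depth frames and recognizes the performed sign."""
    def recognize(self, depth_frame) -> List[float]:
        # In the real system this runs the RF-based hand tracker;
        # here a fixed pose stands in as a placeholder.
        return [0.0] * 19  # 19 joint angular positions

class TransmissionModule:
    """Encodes joint positions, sends them over the web, decodes on arrival."""
    def encode(self, joints: List[float]) -> bytes:
        return b",".join(b"%.3f" % j for j in joints)
    def decode(self, packet: bytes) -> List[float]:
        return [float(v) for v in packet.split(b",")]

class RoboticHandModule:
    """Back-end: drives the robotic haptic interface to the received pose."""
    def actuate(self, joints: List[float]) -> None:
        assert len(joints) == 19  # one target angle per controllable joint

# End-to-end flow through the three sub-blocks.
inp, tx, hand = InputModule(), TransmissionModule(), RoboticHandModule()
joints = inp.recognize(depth_frame=None)
hand.actuate(tx.decode(tx.encode(joints)))
```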

3.1 The Input Module

The proposed implementation of the input module follows the work proposed in [16], where the authors propose a full-DoF, appearance-based hand tracking approach that uses a random forest (RF) classifier [23]. RF is a classification and regression technique that has recently become popular due to its efficiency and simplicity [16].

In the proposed system, a low-cost depth camera (see Fig. 1) is used as the only input to the hand segmentation phase, i.e., the task of isolating the hand from the background (RGB information is discarded). Once the foreground pixels have been recognized and separated from the background, the hand pose can be reconstructed by resorting to two main blocks: the hand labelling block and the joint position estimation block. Hand labelling is an appearance-based method that aims at recognizing the individual sub-parts of the hand in order to isolate the joints, while the joint position estimation block aims at approximating the 3D positions of the joints starting from the noisy labelling and depth measurements. As done in [23], in our approach the RF classifier is employed to label the pixels of the depth image according to the region of the hand they should belong to, and then each region is clustered in order to find the position of the centre of that region. Regions are chosen so as to be centred over the joints of the hand, so that, at the end of the clustering process, the algorithm outputs the 3D position of each joint of the hand.

Fig. 1. The hand tracking input system

The developed code can recognize 22 different sub-parts of the hand: the palm, the wrist, and 4 joints for each of the 5 fingers. Each part is centred around a specific joint. Parts are tagged with different encodings, and the tags are visually represented by different colours.

The hand is first segmented by thresholding the depth values. The segmented hand is isolated from the background and tracked by resorting to the OpenNI tracker [1]. Finally, a point cloud for further processing is obtained by taking into consideration all the points within the sphere centred at the tracking centre and with a conservative radius τ.
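The segmentation step can be sketched as follows; the threshold values, the identity back-projection (a real system would use the camera intrinsics), and the function name are assumptions for illustration.

```python
import numpy as np

def segment_hand(depth, t_near, t_far, centre, tau):
    """Sketch of the segmentation step: threshold the depth values, then
    keep only the 3D points within a sphere of radius tau around the
    tracked hand centre. `depth` is an (H, W) depth image; `centre` is a
    3D point. Thresholds and tau are application-dependent assumptions."""
    mask = (depth > t_near) & (depth < t_far)   # foreground pixels by depth
    ys, xs = np.nonzero(mask)
    # Crude back-projection to a 3D point cloud (identity intrinsics for
    # illustration only).
    points = np.stack([xs, ys, depth[ys, xs]], axis=1).astype(float)
    dist = np.linalg.norm(points - np.asarray(centre, dtype=float), axis=1)
    return points[dist <= tau]                  # cloud around the hand
```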

To label the hand, an approach based on machine learning algorithms has been developed. At the very beginning, an RF classifier [6] is trained on thousands of different hands performing different signs, also rotated or oriented differently. The classifier examines these signs and computes the same set of features for all of them; it then keeps the most discriminative ones. Such features can later be used to distinguish, with a certain confidence, the different hand sub-parts, and in particular the pixels that belong to different labels. Finally, the joint positions are approximated by applying the mean shift clustering algorithm [8] to the hand sub-parts. This approach provides promising results: first experiments with real-world depth map images show that it can properly label most parts of the hand in real time without requiring excessive computational resources.
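The label-then-cluster stage can be sketched with off-the-shelf scikit-learn stand-ins for the RF classifier [6] and mean shift [8]; the random features, the bandwidth value, and the 2D (rather than 3D) clustering are simplifications for illustration.

```python
# Sketch: RF labels each pixel with a hand part, then mean shift on the
# locations of same-labelled pixels finds that part's centre (joint proxy).
import numpy as np
from sklearn.cluster import MeanShift
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Synthetic training data: 500 pixels, 32-dim features, 22 part labels.
X_train = rng.normal(size=(500, 32))
y_train = rng.integers(0, 22, size=500)
rf = RandomForestClassifier(n_estimators=10, random_state=0).fit(X_train, y_train)

# Label the pixels of a new frame, then cluster the locations of the
# pixels assigned to one part to estimate that part's centre.
X_frame = rng.normal(size=(200, 32))
labels = rf.predict(X_frame)
locations = rng.uniform(0, 100, size=(200, 2))   # stand-in pixel coordinates
part_pixels = locations[labels == labels[0]]
centre = MeanShift(bandwidth=20.0).fit(part_pixels).cluster_centers_[0]
```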

Fig. 2. 3D model in different poses used to generate the synthetic training set

In our approach we perform a per-pixel classification, where each pixel x of the hand is described using the following feature:

f_{u,v}(I, x) = I(x + u/I(x)) − I(x + v/I(x))    (1)

where I is the depth image, so that I(·) represents the depth value of the image at a given point, while u and v are two offsets limited to a finite length R.
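A sketch of this per-pixel feature (the depth-normalized two-offset difference used in [16]) follows; the out-of-image convention of returning a large constant depth and the rounding of probe coordinates are assumptions.

```python
import numpy as np

def depth_feature(I, x, u, v):
    """Depth-difference feature of Eq. (1): the offsets u, v are divided by
    the depth at x, which makes the feature invariant to the hand's
    distance from the camera. I is an (H, W) depth image; x, u, v are
    (row, col) pairs. Probes outside the image return a large constant
    depth (a common convention, assumed here)."""
    BIG = 1e4
    h, w = I.shape
    d = I[x[0], x[1]]
    def probe(offset):
        r = int(round(x[0] + offset[0] / d))
        c = int(round(x[1] + offset[1] / d))
        return I[r, c] if 0 <= r < h and 0 <= c < w else BIG
    return probe(u) - probe(v)
```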

We use this feature because, in combination with RF, it has proved to succeed very quickly in discriminating hand parts, as shown in [16]. Hand poses can be estimated from the labelled segmented hands by resorting to mean shift [8]. In addition, we resort to the mean shift local mode finding algorithm (as in [24]) to reduce the risk of outliers, which might have a large effect on the computation of the centroids of the pixel locations belonging to a hand part. In this way, we obtain a more reliable and coherent estimation of the joint set S.

Note that (1) is not invariant to rotations, while on the other hand it is invariant to distance and 3D translations (thanks to the normalization factor I(x)). It is therefore necessary to build a wide training set composed of the same sign framed from different points of view; for this reason, we have also investigated ways to effectively and automatically build comprehensive, large training sets.
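As a toy illustration of why many orientations are needed, each labelled training image can be replicated under in-plane rotations; this is only a sketch (the actual training set is rendered synthetically, as described next), and the angles and nearest-neighbour interpolation are assumptions.

```python
# Replicate a (depth, labels) training pair under several in-plane
# rotations, since feature (1) is not rotation-invariant.
import numpy as np
from scipy.ndimage import rotate

def augment_rotations(depth, labels, angles):
    """Yield (depth, labels) pairs rotated by each angle (degrees).
    order=0 (nearest neighbour) keeps depth values and integer label ids
    from being blended by interpolation."""
    for a in angles:
        yield (rotate(depth, a, reshape=False, order=0, cval=0.0),
               rotate(labels, a, reshape=False, order=0, cval=0))

depth = np.zeros((64, 64)); depth[20:40, 20:40] = 1.0
labels = (depth > 0).astype(int)
pairs = list(augment_rotations(depth, labels, angles=range(0, 360, 30)))
```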

To train the algorithm, a training set with labelled samples is necessary. Since manually building a dataset is a tedious, time-consuming and error-prone process, a system able to create a synthetic training set was developed. Such system is based on the 3D model of a human hand shown in Fig. 2. Some examples of the outcomes of the synthetic training tool are shown in Fig. 3.

The main parameters describing the RF we trained were chosen as those providing the best results after several tests, and are summarized in Table 1. Each tree we use is trained with 2000 random pixels from each training image. The offset vectors u and v from (1) are sampled uniformly between −30 and 30 pixels.
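The sampling just described can be sketched as follows; the function names and the foreground-only pixel selection are illustrative assumptions, while the counts (2000 pixels per image, offsets in [−30, 30]) come from the text.

```python
# Training-time sampling: 2000 random pixels per labelled image, and
# (u, v) offset pairs drawn uniformly in [-30, 30] pixels.
import numpy as np

rng = np.random.default_rng(42)

def sample_training_pixels(label_image, n=2000):
    """Pick n random foreground pixel coordinates from a labelled image."""
    ys, xs = np.nonzero(label_image > 0)
    idx = rng.choice(len(ys), size=n, replace=True)
    return np.stack([ys[idx], xs[idx]], axis=1)

def sample_offsets(n_features):
    """Draw a (u, v) offset pair uniformly in [-30, 30] for each feature."""
    return rng.uniform(-30, 30, size=(n_features, 2, 2))

label_image = np.zeros((120, 160), dtype=int)
label_image[40:80, 60:100] = 1                      # a fake hand region
pixels = sample_training_pixels(label_image)        # (2000, 2) coordinates
offsets = sample_offsets(100)                       # 100 (u, v) pairs
```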

Finally, using a look-up table, the module converts the recognized hand pose into a list of 19 joint positions, representing the angular positions that each joint of the hand has to reach in order to perform the sign. The global hand rotation (3 DoF) is at the moment discarded, as the robotic hand used cannot rotate over the palm base.

Fig. 3. Outcomes from the synthetic training tool: depth images and related labelling in 3 different poses

Fig. 4. Structure of the packet with the joints positions
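This final conversion step can be sketched as follows; the sign names, the angle values, and the packet layout (a plain sequence of 32-bit floats, standing in for the structure of Fig. 4) are all illustrative assumptions.

```python
# Sketch: a look-up table maps each recognized sign to the 19 angular
# joint positions the robotic hand must reach, and the list is packed
# into a binary packet for transmission.
import struct

SIGN_TO_JOINTS = {
    "open_hand":   [0.0] * 19,    # hypothetical example entries
    "closed_fist": [90.0] * 19,
}

def pack_joints(sign: str) -> bytes:
    """Encode the 19 target joint angles as little-endian 32-bit floats."""
    return struct.pack("<19f", *SIGN_TO_JOINTS[sign])

def unpack_joints(packet: bytes):
    """Decode a received packet back into the 19 joint angles."""
    return list(struct.unpack("<19f", packet))
```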
