```
lip-movement-net-master/
├── models/
│   ├── 2_64_False_True_0.5_lip_motion_net_model.h5
│   ├── 1_128_True_False_0.5_lip_motion_net_model.h5
│   ├── 1_64_True_True_0.0_lip_motion_net_model.h5
│   ├── the_shape_predictor_is_from_the_dlib_project.txt
│   ├── 1_64_True_True_0.25_lip_motion_net_model.h5
│   └── 1_32_False_True_0.25_lip_motion_net_model.h5
├── src/
├── scripts/
│   ├── train.bat
│   ├── prepare.bat
│   └── train_using_grid_search.bat
├── assets/
│   ├── dataset_folder_structure.txt
│   └── figure1.png
├── videos/
│   ├── 044610038-close-astronaut-buzz-aldrin-sm.mp4
│   ├── 044610066-astronaut-michael-collins-talk.mp4
│   ├── copyright.txt
│   └── 044598381-carl-rowan-delivering-speech-a.mp4
├── csv/
│   └── grid_options.csv
└── data/
    └── dataset_source/
        └── create_source_dataset_directory_structure_here.txt
```
# Speaker detection by watching lip movements

A simple RNN-based detector that determines whether someone is speaking by watching their lip movements for 1 second of video (i.e. a sequence of 25 video frames). By using a sliding-window technique, the detector can be run in real time on a video file, or on the output of a webcam. See below for details.

**The RNN's detection algorithm is as follows:**

1. Accept a sequence of 25 frames (i.e. 1 second's worth of frames of a 25 fps video) from the input video file or webcam. The sequence length is configurable.
2. For each video frame in this sequence:
   1. Convert it to a grayscale frame. This speeds up the next few steps (fewer pixels to process).
   2. Detect the face in the frame using some face detector. I have used the HOG-based detector from dlib. You can replace it with a detector of your choice.
   3. Feed the detector's bounding box to a facial landmark predictor. I have used the shape predictor that comes with dlib.
   4. From the landmark predictor, fetch the points that mark the inner edges of the top and bottom lip. In the case of the dlib predictor, these are part numbers 61, 62, 63 for the top lip, and 67, 66, 65 for the bottom lip. See Figure 1 below.
   5. Calculate the average pixel separation between part pairs 61 & 67, 62 & 66, and 63 & 65, i.e. (dist(61, 67) + dist(62, 66) + dist(63, 65)) / 3.
   6. Store this distance value in the lip-separation sequence.
3. Once all 25 frames are processed this way, perform **min-max scaling** over the 25-length sequence. **This normalization step is very important.**
4. Feed this normalized lip-separation sequence into the RNN.
5. The RNN outputs a 2-element tuple (speech, silence) containing the respective probabilities that the speaker was speaking or silent during the preceding 25 video frames.
6. Slide the 25-frame window forward by one frame through the input video, and repeat steps 2 through 5.
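The per-frame feature extraction (steps 2.4–2.6) and the min-max normalization (step 3) can be sketched as follows. This assumes dlib-style 68-point landmarks; `lip_separation` and `normalize_sequence` are illustrative helper names, not functions from this codebase:

```python
import math

# Inner-lip landmark pairs from the dlib 68-point model:
# top-lip parts 61, 62, 63 paired with bottom-lip parts 67, 66, 65.
LIP_PAIRS = [(61, 67), (62, 66), (63, 65)]

def lip_separation(points):
    """Average pixel distance between the paired inner-lip landmarks.

    `points` maps a landmark part number to its (x, y) coordinate,
    as produced by a shape predictor such as dlib's.
    """
    dists = [math.hypot(points[t][0] - points[b][0],
                        points[t][1] - points[b][1])
             for t, b in LIP_PAIRS]
    return sum(dists) / len(dists)

def normalize_sequence(seq):
    """Min-max scale a sequence of lip separations into [0, 1].

    This is step 3 of the algorithm; a flat sequence (lips never
    moving) maps to all zeros.
    """
    lo, hi = min(seq), max(seq)
    if hi == lo:
        return [0.0] * len(seq)
    return [(v - lo) / (hi - lo) for v in seq]
```

The 25 normalized values, reshaped into a (1, 25, 1) batch, form the sequence the RNN classifies as (speech, silence).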
### Figure 1: Calculate the lip separation distance

![](assets/figure1.png)

## How to use the codebase

You can either use one of the pre-trained networks, or you can train your own model using the codebase.

### Using pre-trained models

The models directory contains 10 models with various network configurations.

**To use one of the pre-trained models on a test video, run the following command:**

```bash
python -v <path to test video file> -p <path to shape predictor file> -m <path to model .h5 file>
```

The [**videos**](videos/) directory contains a few test videos. You can refer to the [scripts/](scripts/) directory for a sample command.

The name of the model file encodes the network's configuration as follows:

```bash
<num RNN layers>_<num neurons in each RNN layer>_<is_RNN_bidirectional>_<is_RNN_a_GRU>_<Dropout_Fraction_On_output_of_last_RNN_layer>.h5
```

When `<is_RNN_a_GRU>` is False, the net contains an RNN layer based on simple RNN cells. For example, the model file:

```bash
2_64_False_False_0.5_lip_motion_net_model.h5
```

contains:

- Two stacked RNN layers.
- Each layer composed of 64 non-bidirectional, simple RNN cells.
- A dropout of 0.5 applied to the output of the second RNN layer before the output is fed to the final softmax classification layer.

In all pre-trained models in the models directory, the final output layer is a 2-neuron softmax classifier, and the network is trained using an ADAM optimizer.

### How to train the Lip Movement Net model

Training the model involves the following steps:

1. Prepare the training, validation and test datasets.
2. Run the trainer on the training data.

### Step 1 of 2: Prepare the training, validation and test datasets

1. Grab a large set of video files, or frame sequences. Some datasets to look at are:
   1. GRID
   2. AMFED
   3. DISFA
   4. HMDB
   5. Cohn-Kanade
2. **IMPORTANT**: Be sure to read the EULAs before you use any of these datasets.
   **Some datasets, such as the Cohn-Kanade dataset, tend to be for non-commercial use only. If you are going to use this Lip Movement Net codebase for commercial use, you may not be able to use such datasets without the dataset owner's consent. Better check with the dataset's owner first.**
3. Organize the datasets as follows:
   1. Create a top-level dataset/ directory.
   2. Create train/, val/, test/, models/ and tensorboard/ directories under dataset/.
   3. Under each one of the \[dataset/train/, dataset/val/, dataset/test/\] directories, create a directory named after each class you want to detect. For e.g. dataset/train/speech/ and dataset/train/silence/, dataset/val/speech/ and dataset/val/silence/, and so on.
   4. Under each class directory, such as dataset/train/speech/ and dataset/train/silence/, create a directory for each appropriate dataset. For e.g. if you are using the GRID and HMDB datasets for generating speech sequences, and the HMDB dataset for generating the silence-class sequences, create directories as follows: dataset/train/speech/GRID/, dataset/train/speech/HMDB/, dataset/val/speech/HMDB/, dataset/train/silence/HMDB/ and so on.
   5. Next, under each one of the dataset directories created in the step above, such as GRID/ and HMDB/, create a directory for each person in that dataset. For e.g. the GRID dataset contains speakers numbered 1 through 34, so you would create directories called dataset/train/speech/GRID/1/, dataset/train/speech/GRID/2/, dataset/train/speech/GRID/3/, etc. The HMDB dataset contains video files organized by category, such as talk, chew, etc. Each video file is for one individual person, so just put all the video files under chew into the dataset/train/silence/HMDB/ or dataset/val/silence/HMDB/ directories. No need to create individual person directories; the prepare script will assume that each video file belongs to a different person.
4. Ensure that you have a good mix of speakers in train, val and test, and that **there are no common video files among train, val and test!** Preferably keep a disjoint set of speakers or persons among the train/, val/ and test/ directories for better training results.
5. Download and expand the folder-structure file in [assets/](assets/) to get started with the folder structure.
6. Open the prepare script.
7. Set the following parameters to the values you want for your network:
   1. VIDEO_START_OFFSETS
   2. VIDEO_END_OFFSETS
   3. CROSS_FILE_BOUNDARIES
8. Run the following command:

```bash
python -i <path to input directory> -o <path to output sequences dataset directory> -p <path to facial landmarks model file>
```

Refer to [scripts/prepare.bat](scripts/prepare.bat) for an example command.

### Step 2 of 2: Run the trainer on the training data that you prepared in Step 1

Run the following command:

```bash
python -i <path to sequences dataset directory>
```

Refer to [scripts/train.bat](scripts/train.bat) for an example training command.

### Running the trainer in GRID SEARCH mode

You can OPTIONALLY run the training algorithm iteratively over a grid of network hyperparameters, so as to find a parameter combination that best fits the needs of your domain. In this mode, the trainer does the following:

1. Trains the network on each hyperparameter combination that you have specified in the grid_options.csv file.
2. Creates a directory under the \<dataset\>/tensorboard directory, in which it creates and updates the tensorboard log file for that training run.
3. Saves the model generated from that training run into an .h5 file in the \<dataset\>/models directory.
4. Logs the results of the training as a line in a CSV file.
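The dataset layout described in Step 1 can be scaffolded with a short script. This is a sketch using the example class and source-dataset names from the text; substitute the datasets you actually use:

```python
import os

SPLITS = ["train", "val", "test"]
CLASSES = ["speech", "silence"]
# Example source datasets from the text; replace with your own.
SOURCE_DATASETS = {"speech": ["GRID", "HMDB"], "silence": ["HMDB"]}

def create_dataset_tree(root="dataset"):
    """Create the dataset/, models/ and tensorboard/ skeleton described
    in Step 1. Per-person subdirectories (e.g. GRID/1/ ... GRID/34/)
    are added later, when you copy in the source videos."""
    for extra in ["models", "tensorboard"]:
        os.makedirs(os.path.join(root, extra), exist_ok=True)
    for split in SPLITS:
        for cls in CLASSES:
            for ds in SOURCE_DATASETS[cls]:
                os.makedirs(os.path.join(root, split, cls, ds),
                            exist_ok=True)
```

Running `create_dataset_tree()` once gives you the empty tree; remember to keep the sets of speakers in train/, val/ and test/ disjoint when you fill it.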