Computer-Vision-Lip-Reading-2

Category: Pattern recognition (vision/speech, etc.)
Development tool: Jupyter Notebook
File size: 1105649 KB
Downloads: 0
Upload date: 2023-04-13 07:03:20
Uploader: sh-1993
Description: Computer-Vision-Lip-Reading-2.0, a speech recognition system using 3D CNNs. The final model achieves 97.4% training accuracy and 99.2% testing accuracy, and the system can accurately recognize spoken words from a set of predefined words in real time.

File list:
.DS_Store (10244, 2023-04-13)
collected_data (0, 2023-04-13)
collected_data\.DS_Store (12292, 2023-04-13)
collected_data\a_1 (0, 2023-04-13)
collected_data\a_1\0.png (11667, 2023-04-13)
collected_data\a_1\1.png (11803, 2023-04-13)
collected_data\a_1\10.png (13295, 2023-04-13)
collected_data\a_1\11.png (12909, 2023-04-13)
collected_data\a_1\12.png (12529, 2023-04-13)
collected_data\a_1\13.png (12398, 2023-04-13)
collected_data\a_1\14.png (12144, 2023-04-13)
collected_data\a_1\15.png (11967, 2023-04-13)
collected_data\a_1\16.png (11836, 2023-04-13)
collected_data\a_1\17.png (11872, 2023-04-13)
collected_data\a_1\18.png (11852, 2023-04-13)
collected_data\a_1\19.png (11919, 2023-04-13)
collected_data\a_1\2.png (11719, 2023-04-13)
collected_data\a_1\20.png (12036, 2023-04-13)
collected_data\a_1\21.png (11964, 2023-04-13)
collected_data\a_1\3.png (11728, 2023-04-13)
collected_data\a_1\4.png (12572, 2023-04-13)
collected_data\a_1\5.png (13235, 2023-04-13)
collected_data\a_1\6.png (13671, 2023-04-13)
collected_data\a_1\7.png (13589, 2023-04-13)
collected_data\a_1\8.png (13525, 2023-04-13)
collected_data\a_1\9.png (13386, 2023-04-13)
collected_data\a_1\data.txt (3247233, 2023-04-13)
collected_data\a_1\video.mp4 (8158, 2023-04-13)
collected_data\a_10 (0, 2023-04-13)
collected_data\a_10\0.png (11625, 2023-04-13)
collected_data\a_10\1.png (11696, 2023-04-13)
collected_data\a_10\10.png (12143, 2023-04-13)
collected_data\a_10\11.png (12020, 2023-04-13)
collected_data\a_10\12.png (11818, 2023-04-13)
collected_data\a_10\13.png (11626, 2023-04-13)
collected_data\a_10\14.png (11550, 2023-04-13)
collected_data\a_10\15.png (11376, 2023-04-13)
... ...

# Computer-Vision-Lip-Reading-2.0

Why "2.0"? This project builds upon one of my past projects, in which I attempted to solve the same problem using a vastly different strategy. Visit my old project [here](https://github.com/allenye66/Computer-Vision-Lip-Reading).

### Read the paper for my project [here](https://docs.google.com/document/d/1FLVwjXf4BfxgjIBl9CszCMwwwQ-Tm0crAv71qGPmLCM/edit)!

### View some demos [here](https://www.youtube.com/watch?v=E6bSWkDdcQM), [here](https://www.youtube.com/watch?v=RDC2C3MRoEU&feature=youtu.be), and [here](https://www.youtube.com/watch?v=6S--eCpAwHk).

## Synopsis

The goal of this project is to build a speech recognition system that can accurately recognize spoken words from a set of predefined words. The recognizer combines computer vision and deep learning and is trained on a large dataset generated by myself and a friend. The dataset consists of around 700 individual video clips (found in the `/collected_data/` folder), totaling approximately 3 GB of data. The model architecture includes several convolutional and dense layers and was trained using TensorFlow and Keras. The training process achieved a 95.7% training accuracy and a ***.5% validation accuracy, demonstrating strong classification performance. Once trained, the system can be used to recognize spoken commands in a live setting. A minimal sketch of this kind of architecture is shown below.
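The exact layer stack lives in `/training/3DCNN.ipynb` and is not reproduced in this README; what follows is a minimal, hypothetical Keras sketch of a 3D CNN of the kind described (22 mouth-region frames in, softmax over the predefined word set out). The input resolution, filter counts, and `NUM_WORDS` are illustrative assumptions, not the trained model's actual values.

```python
# Hypothetical 3D-CNN sketch in the spirit of /training/3DCNN.ipynb.
# FRAMES=22 matches the collection script's clip length; the spatial size,
# channel count, filter sizes, and NUM_WORDS are assumed values.
import tensorflow as tf
from tensorflow.keras import layers, models

FRAMES, HEIGHT, WIDTH, CHANNELS = 22, 64, 64, 1  # assumed input shape
NUM_WORDS = 10                                   # assumed vocabulary size

model = models.Sequential([
    layers.Input(shape=(FRAMES, HEIGHT, WIDTH, CHANNELS)),
    layers.Conv3D(32, kernel_size=(3, 3, 3), activation="relu"),
    layers.MaxPooling3D(pool_size=(1, 2, 2)),   # pool spatially, keep time
    layers.Conv3D(64, kernel_size=(3, 3, 3), activation="relu"),
    layers.MaxPooling3D(pool_size=(1, 2, 2)),
    layers.Flatten(),
    layers.Dense(128, activation="relu"),
    layers.Dropout(0.5),
    layers.Dense(NUM_WORDS, activation="softmax"),  # one class per word
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```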
## Training Data

You can view the data in the `/collected_data/` folder or on Kaggle [here](https://www.kaggle.com/datasets/allenye66/best-lip-reading-dataset). Since a suitable dataset was not available for this problem, I took the initiative to create my own by collecting approximately 700 video clips of words being spoken. Each video clip was manually labeled with a word from a predefined set; in the end, I had around 3 gigabytes of data in total.

How does my `/data_collection/collect.py` script work? (A code sketch of this loop follows the demo link below.)

- First, detect whether someone is talking (based on whether the distance between the upper lip and lower lip is greater than a threshold).
- If someone is talking:
  - Start "recording", i.e., storing the current frames.
  - The user may close their mouth for short periods, but if their lips stay closed for too many frames, the script assumes they have finished talking.
- Once the user finishes talking, we stop saving frames and then pad the saved frames with "previous" and "after" frames:
  - "previous" frames are frames stored (in a circular buffer) before we started saving frames;
  - "after" frames are continuously appended until we reach 22 frames (a predefined constant).
- All 22 frames are stored in a NumPy array and written to a text file (we also record the frames as images and turn those into a video file).

### View a demo [here](https://www.youtube.com/watch?v=qKRgHJYTVJ0)!

Example outputs: *(four GIFs in the original; see the linked write-up to view them.)*
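Here is a minimal sketch of the recording loop described above. `lip_distance()` is a hypothetical stand-in for the real landmark-based measurement, and `PREV_BUFFER`, `CLOSED_LIMIT`, and `THRESHOLD` are assumed values, not the script's actual constants.

```python
# Sketch of the collect.py recording loop: detect talking via lip distance,
# buffer pre-roll frames, and pad the clip to a fixed 22-frame length.
from collections import deque

import cv2
import numpy as np

TOTAL_FRAMES = 22   # predefined constant from the README
PREV_BUFFER = 5     # assumed size of the "previous"-frame circular buffer
CLOSED_LIMIT = 8    # assumed max consecutive closed-lip frames
THRESHOLD = 12.0    # assumed upper/lower-lip distance threshold, in pixels

def lip_distance(frame) -> float:
    """Hypothetical helper: distance between the upper and lower lip.
    The real script measures this from facial landmarks."""
    raise NotImplementedError

def record_word(cap: cv2.VideoCapture) -> np.ndarray:
    previous = deque(maxlen=PREV_BUFFER)  # circular buffer of pre-roll frames
    saved, closed_count, talking = [], 0, False
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if not talking:
            previous.append(frame)
            if lip_distance(frame) > THRESHOLD:  # lips apart: talking started
                talking = True
        else:
            saved.append(frame)
            if lip_distance(frame) <= THRESHOLD:
                closed_count += 1
                if closed_count > CLOSED_LIMIT:  # closed too long: finished
                    break
            else:
                closed_count = 0
    # Pad with "previous" frames from the buffer, then with "after" frames,
    # until the clip reaches the fixed length.
    clip = list(previous) + saved
    while len(clip) < TOTAL_FRAMES:
        ok, frame = cap.read()
        if not ok:
            break
        clip.append(frame)
    return np.array(clip[:TOTAL_FRAMES])  # 22 frames, ready to serialize
```

In the real script, the 22-frame array is written to a text file, and the frames are also saved as PNGs and assembled into an MP4, matching the layout of `/collected_data/` shown in the file list above.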
## Demo

https://user-images.githubusercontent.com/46653284/228344036-ad19e45c-d329-4040-9e05-ef3bca7b3e50.mov

## Files/Folders

- `writeup.pdf`: detailed documentation of my entire project. To view the GIFs, please visit [here](https://docs.google.com/document/d/1FLVwjXf4BfxgjIBl9CszCMwwwQ-Tm0crAv71qGPmLCM/edit).
- `/data_collection/collect.py`: collects the data for training the speech recognition model. It records video clips of people speaking the different commands and saves them to a directory.
- `/training/3DCNN.ipynb`: the Jupyter Notebook containing the code used to train the speech recognition model. It uses a 3D convolutional neural network to extract features from the video clips and then classifies them using a softmax layer.
- `/demo/predict_live.py`: tests my trained speech recognition model in a live demo. It reuses the logic from the data collection script: it collects frames, feeds them into the model, and then displays the predicted command. (A minimal sketch appears at the end of this README.)
- `/model/`: various weights for trained models.
- `/demo_examples/`: MP4 files containing recorded demos of my model predicting words that I spoke in real time.

## Usage

To use my project, please read the following:

1. Run `collect.py` to record video clips of yourself speaking a chosen word and save them to a directory.
   - First, enter the word you want to collect data for.
   - If this is your first time running the script, do not enter a lip distance (it will calibrate itself).
2. View `/training/3DCNN.ipynb` to see how I trained the speech recognition model on my collected data.
3. Use `predict_live.py` to test the trained model in a live demo.

Note: the model currently wired into `predict_live.py` is a less accurate one. I am unable to upload the weights for the model showcased in `/training/3DCNN.ipynb` because the file is too large.

## Dependencies

The Python packages required to run the scripts are listed in `requirements.txt`. There may also be some OS-specific issues; I am running this project on macOS Catalina with an Intel Core i7 processor.

## Note

The live prediction script `/demo/predict_live.py` may not work on all systems. You may need to modify it for your specific setup, including using the same operating system and webcam as the ones used during training.
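For reference, here is a minimal, hypothetical sketch of a live-prediction loop like the one `predict_live.py` implements: capture a fixed-length clip, preprocess it, and classify it with the trained model. `MODEL_PATH`, `WORDS`, and the preprocessing steps are assumptions, not the script's actual values, and the real script gates recording on lip distance just as `collect.py` does.

```python
# Hypothetical live-prediction loop in the spirit of /demo/predict_live.py.
# MODEL_PATH, WORDS, and the preprocessing are assumed; the real script
# starts and stops recording based on lip distance, as in collect.py.
import cv2
import numpy as np
import tensorflow as tf

MODEL_PATH = "model/lip_reader.h5"           # assumed weights file in /model/
WORDS = ["apple", "banana", "hello", "yes"]  # assumed predefined word set

model = tf.keras.models.load_model(MODEL_PATH)
cap = cv2.VideoCapture(0)                    # default webcam

clip = []                                    # grab a fixed 22-frame clip
while len(clip) < 22:
    ok, frame = cap.read()
    if ok:
        clip.append(frame)

# Assumed preprocessing: grayscale, resize to the model input, scale to [0, 1].
frames = [cv2.resize(cv2.cvtColor(f, cv2.COLOR_BGR2GRAY), (64, 64))
          for f in clip]
batch = np.asarray(frames, dtype=np.float32)[None, ..., None] / 255.0

probs = model.predict(batch)[0]              # softmax scores over the word set
print("Predicted word:", WORDS[int(np.argmax(probs))])
cap.release()
```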
