news_doc_token_classification

Category: Clustering algorithms
Development tool: Jupyter Notebook
File size: 0KB
Downloads: 0
Upload date: 2023-09-26 13:52:45
Uploader: sh-1993
Description: news doc token classification

File list:
.DS_Store (6148, 2023-09-26)
.ipynb_checkpoints/ (0, 2023-09-26)
.ipynb_checkpoints/NLP_HW4-checkpoint.ipynb (34615, 2023-09-26)
.ipynb_checkpoints/NLP_HW4_1-checkpoint.ipynb (156304, 2023-09-26)
.ipynb_checkpoints/NLP_HW4_doc_clf (3)-checkpoint.ipynb (79955368, 2023-09-26)
NER (Extra).ipynb (113734, 2023-09-26)
doc_clf.ipynb (48803400, 2023-09-26)
token_clf_HMM.ipynb (110427, 2023-09-26)
token_clf_transformer.ipynb (174883, 2023-09-26)

# News Document & Token Classification

### Document Classification
- Two methods are used for classification: Naive Bayes and Transformers.
- The results of each method are evaluated for effectiveness.
- In the Naive Bayes approach, the data is vectorized with TF-IDF.
- In the Transformer approach, the data is tokenized with the 'SajjadAyoubi/distil-bigbird-fa-zwnj' model.

### Token Classification (HMM)
- The model takes a text and a question as inputs and returns an answer extracted directly from the text.
- It uses the 'SajjadAyoubi/persian_qa' dataset.
- Labels are created from the start and end indices of the answer within the text. These labels are vectorized using TF-IDF.

### Token Classification (Transformers)
- The model takes a text and a question as inputs and returns an answer extracted directly from the text.
- It uses the 'SajjadAyoubi/persian_qa' dataset.
- Labels are created from the start and end indices of the answer within the text. These labels are then vectorized using a pretrained model from HuggingFace, which is also used to predict the answers to the corresponding questions.

### NER
- Named entity recognition extracts information from unstructured text by identifying named entities and assigning them to predefined categories: Person (PER), Location (LOC), Main location (mainLoc), Event (EVE), Date (DAT), Organization (ORG), Time (TIM), Facility (FAC), Money (MON), Percent (PCT), and Product (PRO).
- The dataset is preprocessed and its labels are translated into model labels.
- The model is trained on the prepared data.
- Finally, the true and predicted outputs are compared to evaluate performance.
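The Naive Bayes approach from the Document Classification section can be illustrated with a minimal sketch. It assumes scikit-learn (`TfidfVectorizer`, `MultinomialNB`), which the README does not name, and it uses a tiny made-up English corpus with hypothetical `sport` and `economy` labels in place of the actual news data, so it is not the notebook's implementation.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy, made-up training data (assumption: stands in for the real news corpus).
train_texts = [
    "the team won the football match",
    "the striker scored two goals in the final",
    "the central bank raised interest rates",
    "stock markets fell after the inflation report",
]
train_labels = ["sport", "sport", "economy", "economy"]

# TF-IDF turns each document into a weighted bag-of-words vector;
# multinomial Naive Bayes is then fitted on those vectors.
clf = make_pipeline(TfidfVectorizer(), MultinomialNB())
clf.fit(train_texts, train_labels)

test_texts = [
    "the goalkeeper saved a penalty in the match",
    "inflation and interest rates are rising",
]
predictions = list(clf.predict(test_texts))
print(predictions)
```

Wrapping both steps in one pipeline keeps the vocabulary and IDF weights learned on the training texts and reuses them unchanged at prediction time.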
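The step "labels are created from the start and end indices of the answer" in the Token Classification sections could look roughly like the sketch below. It assumes a SQuAD-style record (a context string, the answer text, and its character start offset), whitespace tokenization, and a binary inside/outside labeling scheme. The function name `span_to_token_labels` and the example sentence are hypothetical, and the notebooks may label tokens differently.

```python
import re

def span_to_token_labels(context, answer_start, answer_text):
    """Mark each whitespace token with 1 if it overlaps the answer span, else 0."""
    answer_end = answer_start + len(answer_text)  # exclusive character offset
    tokens, labels = [], []
    for match in re.finditer(r"\S+", context):
        tokens.append(match.group())
        # A token is inside the answer if its character range overlaps the span.
        overlaps = match.start() < answer_end and match.end() > answer_start
        labels.append(1 if overlaps else 0)
    return tokens, labels

# Made-up example record (assumption: not taken from the persian_qa dataset).
context = "Tehran is the capital of Iran."
answer_text = "the capital"
answer_start = context.index(answer_text)

tokens, labels = span_to_token_labels(context, answer_start, answer_text)
for token, label in zip(tokens, labels):
    print(token, label)
```

Using character-range overlap rather than exact boundary matches means a token is still labeled when the answer starts or ends in the middle of it, for example next to punctuation.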
