CN-WikiCatReader

Category: Blog
Development tool: Java
File size: 0KB
Downloads: 0
Upload date: 2019-12-31 05:37:20
Uploader: sh-1993
Description: CN-WikiCatReader

File list:
.classpath (916, 2019-12-30)
.project (366, 2019-12-30)
.settings/ (0, 2019-12-30)
.settings/org.eclipse.jdt.core.prefs (587, 2019-12-30)
bin/ (0, 2019-12-30)
bin/com/ (0, 2019-12-30)
bin/com/ansj/ (0, 2019-12-30)
bin/com/ansj/vec/ (0, 2019-12-30)
bin/com/ansj/vec/Learn.class (11231, 2019-12-30)
bin/com/ansj/vec/Word2VEC.class (9418, 2019-12-30)
bin/com/ansj/vec/domain/ (0, 2019-12-30)
bin/com/ansj/vec/domain/HiddenNeuron.class (389, 2019-12-30)
bin/com/ansj/vec/domain/Neuron.class (718, 2019-12-30)
bin/com/ansj/vec/domain/WordEntry.class (1207, 2019-12-30)
bin/com/ansj/vec/domain/WordNeuron.class (1610, 2019-12-30)
bin/com/ansj/vec/util/ (0, 2019-12-30)
bin/com/ansj/vec/util/Haffman.class (1475, 2019-12-30)
bin/com/ansj/vec/util/MapCount.class (2979, 2019-12-30)
bin/com/ansj/vec/util/WordKmeans$Classes$1.class (1516, 2019-12-30)
bin/com/ansj/vec/util/WordKmeans$Classes.class (3278, 2019-12-30)
bin/com/ansj/vec/util/WordKmeans.class (3509, 2019-12-30)
bin/isa/ (0, 2019-12-30)
bin/isa/HypernymExpander.class (3416, 2019-12-30)
bin/isa/ProjectionBasedIsAGenerator$1Stat.class (921, 2019-12-30)
bin/isa/ProjectionBasedIsAGenerator.class (7691, 2019-12-30)
bin/isa/ProjectionBasedIsARelationExtractor.class (4042, 2019-12-30)
bin/isa/ProjectionModelTrainer.class (6347, 2019-12-30)
bin/isa/RuleBasedIsAGenerator.class (4124, 2019-12-30)
blacklist.txt (1552, 2019-12-30)
cat.txt (46326168, 2019-12-30)
dd_extracted_relations.txt (19420812, 2019-12-30)
extracted_is_a_relations.txt (43133348, 2019-12-30)
fdnlp/ (0, 2019-12-30)
fdnlp/TimeExp.m (3041, 2019-12-30)
fdnlp/ar.m (1414, 2019-12-30)
fdnlp/dep.m (27408624, 2019-12-30)
fdnlp/dict.txt (251, 2019-12-30)
fdnlp/dict_ambiguity.txt (310, 2019-12-30)
fdnlp/dict_dep.txt (192, 2019-12-30)
... ...

# CN-WikiCatReader: Fine-grained Relation Miner from Chinese Wikipedia Categories

### By Chengyu Wang (https://chywang.github.io)

**Introduction:** This software extracts various types of fine-grained semantic relations from the Chinese Wikipedia category system, serving as a "deep reader" for Chinese short texts. It employs a rule-based extractor, a word-embedding-based projection learner, a collective inference step and hypernym expansion techniques to extract hypernymy relations from Chinese Wikipedia categories. We further design several fully unsupervised, data-driven algorithms to identify non-taxonomic relations (i.e., relations other than hypernymy) from these category names.

**Papers**

1. Wang et al. Learning Fine-grained Relations from Chinese User Generated Categories. EMNLP 2017
2. Wang et al. Decoding Chinese User Generated Categories for Fine-grained Knowledge Harvesting. TKDE (2019) (extended version)
3. Wang et al. Open Relation Extraction for Chinese Noun Phrases. TKDE (accepted)

**APIs**

#### Part I: Hypernymy Relation Extraction

Please run the programs in the following order.

+ RuleBasedIsAGenerator (in the isa package)

Required Inputs:

1. blacklist.txt: The list of thematic words in Chinese (provided in this project).
2. whitelist: The list of conceptual words in Chinese (provided in this project).
3. cat.txt: Entity and category names in Chinese Wikipedia.

> NOTE: We provide the names from the 20170120 dump, which is rather old. Readers are advised to replace it with an up-to-date version.

+ ProjectionModelTrainer (in the isa package)

Required Inputs:

1. positive.txt and negative.txt: Automatically generated training sets. Refer to our paper for details.
2. The Word2Vec model: Due to the large size of neural language models, we do not provide the model here.
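The rule-based step above can be sketched roughly as follows. This is a minimal illustration, not the project's actual API: the class and method names are hypothetical, and the exact matching rules of RuleBasedIsAGenerator (see the EMNLP 2017 paper) are more involved. The sketch assumes that a category whose head (suffix) word is a thematic word from blacklist.txt yields no is-a relation, while a category whose head word is a conceptual word from the whitelist yields that head word as the hypernym.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class RuleBasedIsASketch {

    // Illustrative only: returns the head word of a category name as a
    // hypernym candidate if it is a conceptual word (whitelist), and
    // rejects the category if its head word is thematic (blacklist).
    static String extractHypernym(String category,
                                  Set<String> whitelist, Set<String> blacklist) {
        for (String thematic : blacklist) {
            if (category.endsWith(thematic)) {
                return null;  // thematic category: no is-a relation
            }
        }
        for (String concept : whitelist) {
            if (category.endsWith(concept)) {
                return concept;  // conceptual head word taken as hypernym
            }
        }
        return null;  // no confident head word found
    }

    public static void main(String[] args) {
        Set<String> whitelist = new HashSet<>(Arrays.asList("作家"));  // "writer"
        Set<String> blacklist = new HashSet<>(Arrays.asList("文学"));  // "literature"
        // An entity in the category 中国作家 ("Chinese writers") gets the
        // conceptual head word 作家 ("writer") as its hypernym.
        System.out.println(extractHypernym("中国作家", whitelist, blacklist));
        // A thematic category such as 中国文学 ("Chinese literature") is rejected.
        System.out.println(extractHypernym("中国文学", whitelist, blacklist));
    }
}
```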
Please use your own neural language model instead, and replace the values of the parameters "dimension" (the dimensionality of the word embeddings) and "w2vModel" (the path of the model) if you would like to try the algorithm on your own datasets.

> NOTE: The inputs and outputs of the preceding programs are omitted here.

+ ProjectionBasedIsAGenerator (in the isa package)
+ ProjectionBasedIsARelationExtractor (in the isa package)
+ HypernymExpander (in the isa package)

Final Output:

1. total-isa-expand.txt: The is-a relations extracted from Chinese Wikipedia categories.

#### Part II: Pattern-based Non-hypernymy Relation Extraction

Please run the programs in the following order.

+ WikiDicGenerator (in the nontaxonomic package)

Required Input:

1. cat.txt: Entity and category names in Chinese Wikipedia.

It generates dictionaries of Chinese Wikipedia entities.

+ RelationPatternMiner (in the nontaxonomic package)

It extracts frequent category patterns from Chinese Wikipedia categories.

+ RelationPatternConfCalculator (in the nontaxonomic package)

It computes confidence scores for the frequent category patterns.

+ VerbBasedFilter (in the nontaxonomic package)

It selects confident category patterns based on threshold filtering and POS rules.

+ VerbBasedRelationExtractor (in the nontaxonomic package)

It extracts non-hypernymy relations from the selected patterns.

Final Outputs:

1. verb-relations.txt
2. verb-relations-infer.txt

#### Part III: Data-driven Non-hypernymy Relation Extraction

Please run the programs in the following order.

+ WikiSentenceExtractor (in the wiki package)

It extracts all the sentences from the Chinese Wikipedia data dump (XML format).

+ Indexer and Searcher (in the lucene package)

They build a sentence-level inverted index using Apache Lucene.

> NOTE: Due to the large size of the texts and the index, we do not provide the data here. Users can download the newest Wikipedia data dumps and build the index using our code.
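The pattern-mining step of Part II can be illustrated with a minimal sketch. The class and method names below are hypothetical, not the project's actual API, and the real RelationPatternMiner together with the confidence scoring in RelationPatternConfCalculator is more involved: the idea shown is simply that a category name is reduced to a pattern by replacing a known entity mention (from the WikiDicGenerator dictionaries) with a slot, after which pattern frequencies can be counted.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class CategoryPatternSketch {

    // Illustrative only: reduces a category name to a pattern by replacing
    // a known entity mention with the slot "<E>"; returns null if the
    // category contains no known entity.
    static String toPattern(String category, Set<String> entities) {
        for (String e : entities) {
            if (category.contains(e)) {
                return category.replace(e, "<E>");
            }
        }
        return null;
    }

    // Counts how often each pattern occurs across a list of category names.
    static Map<String, Integer> countPatterns(List<String> categories,
                                              Set<String> entities) {
        Map<String, Integer> freq = new HashMap<>();
        for (String c : categories) {
            String p = toPattern(c, entities);
            if (p != null) {
                freq.merge(p, 1, Integer::sum);
            }
        }
        return freq;
    }

    public static void main(String[] args) {
        Set<String> entities = new HashSet<>(Arrays.asList("鲁迅", "老舍"));
        List<String> cats = Arrays.asList("鲁迅作品", "老舍作品", "老舍小说");
        // Both 鲁迅作品 and 老舍作品 ("works of X") collapse to "<E>作品".
        System.out.println(countPatterns(cats, entities).get("<E>作品"));  // prints 2
    }
}
```

Frequent, high-confidence patterns of this kind are then filtered by POS rules (VerbBasedFilter) before relation tuples are read off the matched categories.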
+ MPS: ModSegmenter (in the mps package)

It runs the modifier-sensitive phrase segmentation algorithm.

> NOTE: Due to the large size of the language models, we do not provide the data here.

+ CRG: CandidateRelGen, RawRelationCount, CatVerbBasedExtractor, CatVerbComplexBasedExtractor (in the crg package)

They run the candidate relation generation algorithm.

+ MRPD: RelationGenerator (in the mrpd package)

> NOTE: The algorithm in the paper is implemented directly over Baidu Baike. We are unable to provide the data and processing details here due to copyright issues. Instead, we release a simple heuristic algorithm to extract commonsense relations from Wikipedia categories.

Final Output:

1. dd_extracted_relations.txt

**More Notes on the Algorithms**

The code in this project consists of updated versions of the algorithms proposed in our papers. We have made slight changes and added more heuristics to extract more semantic relations.

**Dependencies**

1. This software runs in the JavaSE-1.8 environment. It will most likely run properly under other versions of JavaSE as well, but there is no guarantee.
2. It requires the FudanNLP toolkit for Chinese NLP analysis (https://github.com/FudanNLP/fnlp/), Apache Lucene (version 4.7.2) (https://lucene.apache.org) and the JAMA library for matrix computation (https://math.nist.gov/javanumerics/jama/). We use Jama-1.0.3.jar in this project.
3. Please refer to the Java implementation of the Word2Vec model here: https://github.com/NLPchina/Word2VEC_java.

**Citations**

If you find this software useful for your research, please cite the following papers.

> @inproceedings{emnlp2017a,
   author = {Chengyu Wang and Yan Fan and Xiaofeng He and Aoying Zhou},
   title = {Learning Fine-grained Relations from Chinese User Generated Categories},
   booktitle = {Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing},
   pages = {2567–2577},
   year = {2017}
}

> @article{tkde2018,
   author = {Chengyu Wang and Yan Fan and Xiaofeng He and Aoying Zhou},
   title = {Decoding Chinese User Generated Categories for Fine-grained Knowledge Harvesting},
   journal = {IEEE Transactions on Knowledge and Data Engineering},
   volume = {31},
   number = {8},
   pages = {1491–1505},
   year = {2018}
}

> @article{tkde2019,
   author = {Chengyu Wang and Xiaofeng He and Aoying Zhou},
   title = {Open Relation Extraction for Chinese Noun Phrases},
   journal = {IEEE Transactions on Knowledge and Data Engineering},
   doi = {10.1109/TKDE.2019.2953839},
   year = {2019}
}

More research works can be found here: https://chywang.github.io.
