Natural Language Processing (2): Text Corpora
Published: 2019-06-11


Natural Language Processing (2): Text Corpora

1. Accessing Text Corpora

This chapter starts from a concrete corpus instance, nltk.corpus.gutenberg, and uses it to explore text corpora. We use help() to inspect its type:

```
>>> import nltk
>>> help(nltk.corpus.gutenberg)
Help on PlaintextCorpusReader in module nltk.corpus.reader.plaintext object:

class PlaintextCorpusReader(nltk.corpus.reader.api.CorpusReader)
 |  Reader for corpora that consist of plaintext documents.  Paragraphs
 |  are assumed to be split using blank lines.  Sentences and words can
 |  be tokenized using the default tokenizers, or by custom tokenizers
 |  specificed as parameters to the constructor.
 |
 |  This corpus reader can be customized (e.g., to skip preface
 |  sections of specific document formats) by creating a subclass and
 |  overriding the ``CorpusView`` class variable.
 |
 |  Method resolution order:
 |      PlaintextCorpusReader
 |      nltk.corpus.reader.api.CorpusReader
 |      __builtin__.object
 |
 |  Methods defined here:
 |
 |  __init__(self, root, fileids, word_tokenizer=WordPunctTokenizer(pattern='\\w+|[^\\w\\s]+', gaps=False, discard_empty=True, flags=56), sent_tokenizer=LazyLoader('tokenizers/punkt/english.pickle'), para_block_reader=<function read_blankline_block>, encoding=None)
 |      Construct a new plaintext corpus reader for a set of documents
 |      located at the given root directory.  Example usage:
 |
 |          >>> root = '/usr/local/share/nltk_data/corpora/webtext/'
 |          >>> reader = PlaintextCorpusReader(root, '.*\.txt')
 |
 |      :param root: The root directory for this corpus.
 |      :param fileids: A list or regexp specifying the fileids in this corpus.
 |      :param word_tokenizer: Tokenizer for breaking sentences or
 |          paragraphs into words.
 |      :param sent_tokenizer: Tokenizer for breaking paragraphs
 |          into words.
 |      :param para_block_reader: The block reader used to divide the
 |          corpus into paragraph blocks.
 |
 |  paras(self, fileids=None, sourced=False)
 |      :return: the given file(s) as a list of
 |          paragraphs, each encoded as a list of sentences, which are
 |          in turn encoded as lists of word strings.
 |      :rtype: list(list(list(str)))
 |
 |  raw(self, fileids=None, sourced=False)
 |      :return: the given file(s) as a single string.
 |      :rtype: str
 |
 |  sents(self, fileids=None, sourced=False)
 |      :return: the given file(s) as a list of
 |          sentences or utterances, each encoded as a list of word
 |          strings.
 |      :rtype: list(list(str))
 |
 |  words(self, fileids=None, sourced=False)
 |      :return: the given file(s) as a list of words
 |          and punctuation symbols.
 |      :rtype: list(str)
 |
 |  ----------------------------------------------------------------------
 |  Data and other attributes defined here:
 |
 |  CorpusView = <class 'nltk.corpus.reader.util.StreamBackedCorpusView'>
 |      A 'view' of a corpus file, which acts like a sequence of tokens:
 |      it can be accessed by index, iterated over, etc.  However, the
 |      tokens are only constructed as-needed -- the entire corpus is
 |      never stored in memory at once.
 |
 |      The constructor to ``StreamBackedCorpusView`` takes two arguments:
 |      a corpus fileid (specified as a string or as a ``PathPointer``);
 |      and a block reader.  A "block reader" is a function that reads
 |      zero or more tokens from a stream, and returns them as a list.  A
 |      very simple example of a block reader is:
 |
 |          >>> def simple_block_reader(stream):
 |          ...     return stream.readline().split()
 |
 |      This simple block reader reads a single line at a time, and
 |      returns a single token (consisting of a string) for each
 |      whitespace-separated substring on the line.
 |
 |      When deciding how to define the block reader for a given
 |      corpus, careful consideration should be given to the size of
 |      blocks handled by the block reader.  Smaller block sizes will
 |      increase the memory requirements of the corpus view's internal
 |      data structures (by 2 integers per block).  On the other hand,
 |      larger block sizes may decrease performance for random access to
 |      the corpus.  (But note that larger block sizes will *not*
 |      decrease performance for iteration.)
 |
 |      Internally, ``CorpusView`` maintains a partial mapping from token
 |      index to file position, with one entry per block.  When a token
 |      with a given index *i* is requested, the ``CorpusView`` constructs
 |      it as follows:
 |
 |      1. First, it searches the toknum/filepos mapping for the token
 |         index closest to (but less than or equal to) *i*.
 |
 |      2. Then, starting at the file position corresponding to that
 |         index, it reads one block at a time using the block reader
 |         until it reaches the requested token.
 |
 |      The toknum/filepos mapping is created lazily: it is initially
 |      empty, but every time a new block is read, the block's
 |      initial token is added to the mapping.  (Thus, the toknum/filepos
 |      map has one entry per block.)
 |
 |      In order to increase efficiency for random access patterns that
 |      have high degrees of locality, the corpus view may cache one or
 |      more blocks.
 |
 |      :note: Each ``CorpusView`` object internally maintains an open file
 |          object for its underlying corpus file.  This file should be
 |          automatically closed when the ``CorpusView`` is garbage collected,
 |          but if you wish to close it manually, use the ``close()``
 |          method.  If you access a ``CorpusView``'s items after it has been
 |          closed, the file object will be automatically re-opened.
 |
 |      :warning: If the contents of the file are modified during the
 |          lifetime of the ``CorpusView``, then the ``CorpusView``'s behavior
 |          is undefined.
 |
 |      :warning: If a unicode encoding is specified when constructing a
 |          ``CorpusView``, then the block reader may only call
 |          ``stream.seek()`` with offsets that have been returned by
 |          ``stream.tell()``; in particular, calling ``stream.seek()`` with
 |          relative offsets, or with offsets based on string lengths, may
 |          lead to incorrect behavior.
 |
 |      :ivar _block_reader: The function used to read
 |          a single block from the underlying file stream.
 |      :ivar _toknum: A list containing the token index of each block
 |          that has been processed.  In particular, ``_toknum[i]`` is the
 |          token index of the first token in block ``i``.  Together
 |          with ``_filepos``, this forms a partial mapping between token
 |          indices and file positions.
 |      :ivar _filepos: A list containing the file position of each block
 |          that has been processed.  In particular, ``_toknum[i]`` is the
 |          file position of the first character in block ``i``.  Together
 |          with ``_toknum``, this forms a partial mapping between token
 |          indices and file positions.
 |      :ivar _stream: The stream used to access the underlying corpus file.
 |      :ivar _len: The total number of tokens in the corpus, if known;
 |          or None, if the number of tokens is not yet known.
 |      :ivar _eofpos: The character position of the last character in the
 |          file.  This is calculated when the corpus view is initialized,
 |          and is used to decide when the end of file has been reached.
 |      :ivar _cache: A cache of the most recently read block.  It
 |          is encoded as a tuple (start_toknum, end_toknum, tokens), where
 |          start_toknum is the token index of the first token in the block;
 |          end_toknum is the token index of the first token not in the
 |          block; and tokens is a list of the tokens in the block.
 |
 |  ----------------------------------------------------------------------
 |  Methods inherited from nltk.corpus.reader.api.CorpusReader:
 |
 |  __repr__(self)
 |
 |  abspath(self, fileid)
 |      Return the absolute path for the given file.
 |
 |      :type file: str
 |      :param file: The file identifier for the file whose path
 |          should be returned.
 |      :rtype: PathPointer
 |
 |  abspaths(self, fileids=None, include_encoding=False, include_fileid=False)
 |      Return a list of the absolute paths for all fileids in this corpus;
 |      or for the given list of fileids, if specified.
 |
 |      :type fileids: None or str or list
 |      :param fileids: Specifies the set of fileids for which paths should
 |          be returned.  Can be None, for all fileids; a list of
 |          file identifiers, for a specified set of fileids; or a single
 |          file identifier, for a single file.  Note that the return
 |          value is always a list of paths, even if ``fileids`` is a
 |          single file identifier.
 |
 |      :param include_encoding: If true, then return a list of
 |          ``(path_pointer, encoding)`` tuples.
 |
 |      :rtype: list(PathPointer)
 |
 |  encoding(self, file)
 |      Return the unicode encoding for the given corpus file, if known.
 |      If the encoding is unknown, or if the given file should be
 |      processed using byte strings (str), then return None.
 |
 |  fileids(self)
 |      Return a list of file identifiers for the fileids that make up
 |      this corpus.
 |
 |  open(self, file, sourced=False)
 |      Return an open stream that can be used to read the given file.
 |      If the file's encoding is not None, then the stream will
 |      automatically decode the file's contents into unicode.
 |
 |      :param file: The file identifier of the file to read.
 |
 |  readme(self)
 |      Return the contents of the corpus README file, if it exists.
 |
 |  ----------------------------------------------------------------------
 |  Data descriptors inherited from nltk.corpus.reader.api.CorpusReader:
 |
 |  __dict__
 |      dictionary for instance variables (if defined)
 |
 |  __weakref__
 |      list of weak references to the object (if defined)
 |
 |  root
 |      The directory where this corpus is stored.
 |
 |      :type: PathPointer
```

In PlaintextCorpusReader you can see many of the methods used in this article's examples, such as fileids() and words().
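The docstring above introduces the notion of a "block reader". Its behaviour can be tried without any corpus data; a minimal sketch, with io.StringIO standing in for the corpus file (the driver loop is illustrative, not NLTK's implementation):

```python
import io

def simple_block_reader(stream):
    # Read one line and return its whitespace-separated tokens as one block.
    return stream.readline().split()

stream = io.StringIO("a b c\nd e\n")
blocks = []
while True:
    block = simple_block_reader(stream)
    if not block:   # readline() returns '' at EOF, so the block comes back empty
        break
    blocks.append(block)
print(blocks)
```

In the real StreamBackedCorpusView, blocks are read lazily like this, and a toknum/filepos map is kept so that a token can be fetched by index without loading the whole file into memory.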

1.1 fileids() returns the corpus's file identifiers

```
>>> from nltk.corpus import gutenberg
>>> gutenberg.fileids()
['austen-emma.txt', 'austen-persuasion.txt', 'austen-sense.txt', 'bible-kjv.txt', 'blake-poems.txt', 'bryant-stories.txt', 'burgess-busterbrown.txt', 'carroll-alice.txt', 'chesterton-ball.txt', 'chesterton-brown.txt', 'chesterton-thursday.txt', 'edgeworth-parents.txt', 'melville-moby_dick.txt', 'milton-paradise.txt', 'shakespeare-caesar.txt', 'shakespeare-hamlet.txt', 'shakespeare-macbeth.txt', 'whitman-leaves.txt']
```

1.2 words() returns the list of words in a file

```
>>> from nltk.corpus import gutenberg
>>> gutenberg.fileids()
['austen-emma.txt', 'austen-persuasion.txt', 'austen-sense.txt', 'bible-kjv.txt', 'blake-poems.txt', 'bryant-stories.txt', 'burgess-busterbrown.txt', 'carroll-alice.txt', 'chesterton-ball.txt', 'chesterton-brown.txt', 'chesterton-thursday.txt', 'edgeworth-parents.txt', 'melville-moby_dick.txt', 'milton-paradise.txt', 'shakespeare-caesar.txt', 'shakespeare-hamlet.txt', 'shakespeare-macbeth.txt', 'whitman-leaves.txt']
>>> gutenberg.words('austen-emma.txt')
['[', 'Emma', 'by', 'Jane', 'Austen', '1816', ']', ...]
>>> len(gutenberg.words('austen-emma.txt'))
192427
```

Use concordance() to search for a word in a text:

```
>>> emma = nltk.Text(gutenberg.words('austen-emma.txt'))
>>> emma
<Text: Emma by Jane Austen 1816>
>>> emma.concordance('surperize')
Building index...
No matches
>>> emma.concordance('surprize')
Displaying 25 of 37 matches:
er father , was sometimes taken by surprize at his being still able to pity `
hem do the other any good ." " You surprize me ! Emma must do Harriet good : a
Knightley actually looked red with surprize and displeasure , as he stood up ,
r . Elton , and found to his great surprize , that Mr . Elton was actually on
d aid ." Emma saw Mrs . Weston ' s surprize , and felt that it must be great ,
father was quite taken up with the surprize of so sudden a journey , and his f
y , in all the favouring warmth of surprize and conjecture . She was , moreove
he appeared , to have her share of surprize , introduction , and pleasure . Th
ir plans ; and it was an agreeable surprize to her , therefore , to perceive t
talking aunt had taken me quite by surprize , it must have been the death of m
f all the dialogue which ensued of surprize , and inquiry , and congratulation
 the present . They might chuse to surprize her ." Mrs . Cole had many to agre
 the mode of it , the mystery , the surprize , is more like a young woman ' s s
 to her song took her agreeably by surprize -- a second , slightly but correct
" " Oh ! no -- there is nothing to surprize one at all .-- A pretty fortune ;
t to be considered . Emma ' s only surprize was that Jane Fairfax should accep
of your admiration may take you by surprize some day or other ." Mr . Knightle
ation for her will ever take me by surprize .-- I never had a thought of her i
 expected by the best judges , for surprize -- but there was great joy . Mr .
 sound of at first , without great surprize . " So unreasonably early !" she w
d Frank Churchill , with a look of surprize and displeasure .-- " That is easy
; and Emma could imagine with what surprize and mortification she must be retu
tled that Jane should go . Quite a surprize to me ! I had not the least idea !
. It is impossible to express our surprize . He came to speak to his father o
g engaged !" Emma even jumped with surprize ;-- and , horror - struck , exclai
```
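concordance() prints each hit inside a fixed-width window of context. The core idea can be sketched in a few lines of plain Python (the function name and window size here are illustrative, not NLTK's implementation):

```python
def concordance(tokens, word, width=2):
    # Collect `width` tokens of context on each side of every
    # case-insensitive match of `word`.
    hits = []
    for i, tok in enumerate(tokens):
        if tok.lower() == word.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            hits.append(left + " [" + tok + "] " + right)
    return hits

tokens = "You surprize me ! Emma must do Harriet good".split()
print(concordance(tokens, "surprize"))
```

NLTK additionally builds the index once (hence "Building index...") so that repeated lookups do not rescan the whole token list.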

Here we used the nltk.Text class. Inspecting it with help() and looking through its methods shows that this class is very useful.

```
class Text(__builtin__.object)
 |  A wrapper around a sequence of simple (string) tokens, which is
 |  intended to support initial exploration of texts (via the
 |  interactive console).  Its methods perform a variety of analyses
 |  on the text's contexts (e.g., counting, concordancing, collocation
 |  discovery), and display the results.  If you wish to write a
 |  program which makes use of these analyses, then you should bypass
 |  the ``Text`` class, and use the appropriate analysis function or
 |  class directly instead.
 |
 |  A ``Text`` is typically initialized from a given document or
 |  corpus.  E.g.:
 |
 |      >>> import nltk.corpus
 |      >>> from nltk.text import Text
 |      >>> moby = Text(nltk.corpus.gutenberg.words('melville-moby_dick.txt'))
 |
 |  Methods defined here:
 |
 |  __getitem__(self, i)
 |
 |  __init__(self, tokens, name=None)
 |      Create a Text object.
 |
 |      :param tokens: The source text.
 |      :type tokens: sequence of str
 |
 |  __len__(self)
 |
 |  __repr__(self)
 |      :return: A string representation of this FreqDist.
 |      :rtype: string
 |
 |  collocations(self, num=20, window_size=2)
 |      Print collocations derived from the text, ignoring stopwords.
 |
 |      :seealso: find_collocations
 |      :param num: The maximum number of collocations to print.
 |      :type num: int
 |      :param window_size: The number of tokens spanned by a collocation (default=2)
 |      :type window_size: int
 |
 |  common_contexts(self, words, num=20)
 |      Find contexts where the specified words appear; list
 |      most frequent common contexts first.
 |
 |      :param word: The word used to seed the similarity search
 |      :type word: str
 |      :param num: The number of words to generate (default=20)
 |      :type num: int
 |      :seealso: ContextIndex.common_contexts()
 |
 |  concordance(self, word, width=79, lines=25)
 |      Print a concordance for ``word`` with the specified context window.
 |      Word matching is not case-sensitive.
 |      :seealso: ``ConcordanceIndex``
 |
 |  count(self, word)
 |      Count the number of times this word appears in the text.
 |
 |  dispersion_plot(self, words)
 |      Produce a plot showing the distribution of the words through the text.
 |      Requires pylab to be installed.
 |
 |      :param words: The words to be plotted
 |      :type word: str
 |      :seealso: nltk.draw.dispersion_plot()
 |
 |  findall(self, regexp)
 |      Find instances of the regular expression in the text.
 |      The text is a list of tokens, and a regexp pattern to match
 |      a single token must be surrounded by angle brackets.  E.g.
 |
 |      >>> from nltk.book import text1, text5, text9
 |      >>> text5.findall("<.*><.*><bro>")
 |      you rule bro; telling you bro; u twizted bro
 |      >>> text1.findall("<a>(<.*>)<man>")
 |      monied; nervous; dangerous; white; white; white; pious; queer; good;
 |      mature; white; Cape; great; wise; wise; butterless; white; fiendish;
 |      pale; furious; better; certain; complete; dismasted; younger; brave;
 |      brave; brave; brave
 |      >>> text9.findall("<th.*>{3,}")
 |      thread through those; the thought that; that the thing; the thing
 |      that; that that thing; through these than through; them that the;
 |      through the thick; them that they; thought that the
 |
 |      :param regexp: A regular expression
 |      :type regexp: str
 |
 |  generate(self, length=100)
 |      Print random text, generated using a trigram language model.
 |
 |      :param length: The length of text to generate (default=100)
 |      :type length: int
 |      :seealso: NgramModel
 |
 |  index(self, word)
 |      Find the index of the first occurrence of the word in the text.
 |
 |  plot(self, *args)
 |      See documentation for FreqDist.plot()
 |      :seealso: nltk.prob.FreqDist.plot()
 |
 |  readability(self, method)
 |
 |  similar(self, word, num=20)
 |      Distributional similarity: find other words which appear in the
 |      same contexts as the specified word; list most similar words first.
 |
 |      :param word: The word used to seed the similarity search
 |      :type word: str
 |      :param num: The number of words to generate (default=20)
 |      :type num: int
 |      :seealso: ContextIndex.similar_words()
 |
 |  vocab(self)
 |      :seealso: nltk.prob.FreqDist
 |
 |  ----------------------------------------------------------------------
 |  Data descriptors defined here:
 |
 |  __dict__
 |      dictionary for instance variables (if defined)
 |
 |  __weakref__
 |      list of weak references to the object (if defined)
```

1.3 The difference between raw, sents, and words

The following example shows the difference between raw(), sents(), and words():

```
#!/usr/bin/env python
from nltk.corpus import gutenberg
for fileid in gutenberg.fileids():
    num_chars = len(gutenberg.raw(fileid))                              # number of characters
    num_words = len(gutenberg.words(fileid))                            # number of words
    num_sents = len(gutenberg.sents(fileid))                            # number of sentences
    num_vocab = len(set([w.lower() for w in gutenberg.words(fileid)]))  # number of distinct words
    print int(num_chars/num_words), int(num_words/num_sents), int(num_words/num_vocab), fileid
```

Output (average word length, average words per sentence, average repetitions per distinct word):

```
4 21 26 austen-emma.txt
4 23 16 austen-persuasion.txt
4 23 22 austen-sense.txt
4 33 79 bible-kjv.txt
4 18 5 blake-poems.txt
4 17 14 bryant-stories.txt
4 17 12 burgess-busterbrown.txt
4 16 12 carroll-alice.txt
4 17 11 chesterton-ball.txt
4 19 11 chesterton-brown.txt
4 16 10 chesterton-thursday.txt
4 17 24 edgeworth-parents.txt
4 24 15 melville-moby_dick.txt
4 52 10 milton-paradise.txt
4 11 8 shakespeare-caesar.txt
4 12 7 shakespeare-hamlet.txt
4 12 6 shakespeare-macbeth.txt
4 35 12 whitman-leaves.txt
```
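The three ratios can be reproduced without the Gutenberg download, using a toy text and deliberately crude tokenization (everything below is illustrative; NLTK's tokenizers are far more careful):

```python
raw = "Emma Woodhouse was happy. She was clever. Emma was rich."
words = raw.replace(".", " .").split()              # crude word tokenization
sents = [s for s in raw.split(".") if s.strip()]    # crude sentence split
vocab = set(w.lower() for w in words)

print(len(raw) // len(words))    # average word length (characters per word)
print(len(words) // len(sents))  # average sentence length (words per sentence)
print(len(words) // len(vocab))  # average repetitions per distinct word
```

This also shows why raw(), words(), and sents() are all needed: the character, word, and sentence counts come from three different segmentations of the same underlying string.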

Get and inspect the longest sentence in the shakespeare-macbeth.txt text:

```
#!/usr/bin/env python
from nltk.corpus import gutenberg
macbeth_sentences = gutenberg.sents('shakespeare-macbeth.txt')   # get the list of sentences
print macbeth_sentences
print macbeth_sentences[1037]
longest_len = max([len(s) for s in macbeth_sentences])           # length of the longest sentence
print [s for s in macbeth_sentences if longest_len == len(s)]    # content of the longest sentence
```

Output:

```
[['[', 'The', 'Tragedie', 'of', 'Macbeth', 'by', 'William', 'Shakespeare', '1603', ']'], ['Actus', 'Primus', '.'], ...]
['Good', 'night', ',', 'and', 'better', 'health', 'Attend', 'his', 'Maiesty']
[['Doubtfull', 'it', 'stood', ',', 'As', 'two', 'spent', 'Swimmers', ',', 'that', 'doe', 'cling', 'together', ',', 'And', 'choake', 'their', 'Art', ':', 'The', 'mercilesse', 'Macdonwald', '(', 'Worthie', 'to', 'be', 'a', 'Rebell', ',', 'for', 'to', 'that', 'The', 'multiplying', 'Villanies', 'of', 'Nature', 'Doe', 'swarme', 'vpon', 'him', ')', 'from', 'the', 'Westerne', 'Isles', 'Of', 'Kernes', 'and', 'Gallowgrosses', 'is', 'supply', "'", 'd', ',', 'And', 'Fortune', 'on', 'his', 'damned', 'Quarry', 'smiling', ',', 'Shew', "'", 'd', 'like', 'a', 'Rebells', 'Whore', ':', 'but', 'all', "'", 's', 'too', 'weake', ':', 'For', 'braue', 'Macbeth', '(', 'well', 'hee', 'deserues', 'that', 'Name', ')', 'Disdayning', 'Fortune', ',', 'with', 'his', 'brandisht', 'Steele', ',', 'Which', 'smoak', "'", 'd', 'with', 'bloody', 'execution', '(', 'Like', 'Valours', 'Minion', ')', 'caru', "'", 'd', 'out', 'his', 'passage', ',', 'Till', 'hee', 'fac', "'", 'd', 'the', 'Slaue', ':', 'Which', 'neu', "'", 'r', 'shooke', 'hands', ',', 'nor', 'bad', 'farwell', 'to', 'him', ',', 'Till', 'he', 'vnseam', "'", 'd', 'him', 'from', 'the', 'Naue', 'toth', "'", 'Chops', ',', 'And', 'fix', "'", 'd', 'his', 'Head', 'vpon', 'our', 'Battlements']]
```
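The longest-sentence scan above makes two passes, one to find the maximum length and one to filter. With max() and key=len it can be done in a single pass; a toy sketch:

```python
sents = [["Good", "night"], ["Doubtfull", "it", "stood", ","], ["Exeunt"]]
longest = max(sents, key=len)   # compare sentences by token count in one pass
print(longest)
```

Note the trade-off: if several sentences tie for the maximum length, max() returns only the first one, whereas the list comprehension keeps all of them.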

1.4 The NPSChatCorpusReader class

Next we look at another reader class. nltk provides the instance nltk.corpus.nps_chat; again we use help() to inspect it. At a glance we can see that this class deals with XML-format files.

```
nps_chat = class NPSChatCorpusReader(nltk.corpus.reader.xmldocs.XMLCorpusReader)
 |  Method resolution order:
 |      NPSChatCorpusReader
 |      nltk.corpus.reader.xmldocs.XMLCorpusReader
 |      nltk.corpus.reader.api.CorpusReader
 |      __builtin__.object
 |
 |  Methods defined here:
 ...
```
```
>>> from nltk.corpus import nps_chat
>>> nps_chat.fileids()
['10-19-20s_706posts.xml', '10-19-30s_705posts.xml', '10-19-40s_686posts.xml', '10-19-adults_706posts.xml', '10-24-40s_706posts.xml', '10-26-teens_706posts.xml', '11-06-adults_706posts.xml', '11-08-20s_705posts.xml', '11-08-40s_706posts.xml', '11-08-adults_705posts.xml', '11-08-teens_706posts.xml', '11-09-20s_706posts.xml', '11-09-40s_706posts.xml', '11-09-adults_706posts.xml', '11-09-teens_706posts.xml']
>>> chatroom = nps_chat.posts('10-19-20s_706posts.xml')
>>> chatroom[123]
['i', 'do', "n't", 'want', 'hot', 'pics', 'of', 'a', 'female', ',', 'I', 'can', 'look', 'in', 'a', 'mirror', '.']
```

1.5 The CategorizedTaggedCorpusReader class

This article uses the brown corpus as an instance of the CategorizedTaggedCorpusReader class.

```
>>> from nltk.corpus import brown
>>> help(brown)
class CategorizedTaggedCorpusReader(nltk.corpus.reader.api.CategorizedCorpusReader, TaggedCorpusReader)
 |  A reader for part-of-speech tagged corpora whose documents are
 |  divided into categories based on their file identifiers.
 |
 |  Method resolution order:
 |      CategorizedTaggedCorpusReader
 |      nltk.corpus.reader.api.CategorizedCorpusReader
 |      TaggedCorpusReader
 |      nltk.corpus.reader.api.CorpusReader
 |      __builtin__.object
 |
 |  Methods defined here:
 |
 |  __init__(self, *args, **kwargs)
 |      Initialize the corpus reader.  Categorization arguments
 |      (``cat_pattern``, ``cat_map``, and ``cat_file``) are passed to
 |      the ``CategorizedCorpusReader`` constructor.  The remaining arguments
 |      are passed to the ``TaggedCorpusReader``.
 |
 |  paras(self, fileids=None, categories=None)
 |
 |  raw(self, fileids=None, categories=None)
 |
 |  sents(self, fileids=None, categories=None)
 |
 |  tagged_paras(self, fileids=None, categories=None, simplify_tags=False)
 |
 |  tagged_sents(self, fileids=None, categories=None, simplify_tags=False)
 |
 |  tagged_words(self, fileids=None, categories=None, simplify_tags=False)
 |
 |  words(self, fileids=None, categories=None)
 |
 |  ----------------------------------------------------------------------
 |  Methods inherited from nltk.corpus.reader.api.CategorizedCorpusReader:
 |
 |  categories(self, fileids=None)
 |      Return a list of the categories that are defined for this corpus,
 |      or for the file(s) if it is given.
 |
 |  fileids(self, categories=None)
 |      Return a list of file identifiers for the files that make up
 |      this corpus, or that make up the given category(s) if specified.
 |
 |  ----------------------------------------------------------------------
 |  Data descriptors inherited from nltk.corpus.reader.api.CategorizedCorpusReader:
 |
 |  __dict__
 |      dictionary for instance variables (if defined)
 |
 |  __weakref__
 |      list of weak references to the object (if defined)
 |
 |  ----------------------------------------------------------------------
 |  Methods inherited from nltk.corpus.reader.api.CorpusReader:
 |
 |  ... (identical to the CorpusReader methods shown in the PlaintextCorpusReader listing above)
 |
 |  ----------------------------------------------------------------------
 |  Data descriptors inherited from nltk.corpus.reader.api.CorpusReader:
 |
 |  root
 |      The directory where this corpus is stored.
 |
 |      :type: PathPointer
```

Let's look at brown's contents, and how to get the categories and files of the brown corpus:

```
>>> from nltk.corpus import brown
>>> brown.categories()              # the categories (topics) of the brown corpus
['adventure', 'belles_lettres', 'editori', 'editorial', 'fiction', 'government', 'hobbies', 'humor', 'learned', 'lore', 'mystery', 'news', 'religion', 'reviews', 'romance', 'science_fiction']
>>> brown.fileids()[1:10]           # the files in the brown corpus
['ca02', 'ca03', 'ca04', 'ca05', 'ca06', 'ca07', 'ca08', 'ca09', 'ca10']
>>> brown.words(categories='news')  # the 'news' category, split into words
['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', ...]
>>> brown.words(fileids=['cg22'])   # the file 'cg22', split into words
['Does', 'our', 'society', 'have', 'a', 'runaway', ',', ...]
>>> brown.sents(categories=['news','editori','reviews'])  # several categories, split into sentences
[['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', 'Friday', 'an', 'investigation', 'of', "Atlanta's", 'recent', 'primary', 'election', 'produced', '``', 'no', 'evidence', "''", 'that', 'any', 'irregularities', 'took', 'place', '.'], ['The', 'jury', 'further', 'said', 'in', 'term-end', 'presentments', 'that', 'the', 'City', 'Executive', 'Committee', ',', 'which', 'had', 'over-all', 'charge', 'of', 'the', 'election', ',', '``', 'deserves', 'the', 'praise', 'and', 'thanks', 'of', 'the', 'City', 'of', 'Atlanta', "''", 'for', 'the', 'manner', 'in', 'which', 'the', 'election', 'was', 'conducted', '.'], ...]
```

Count particular words within one brown genre:

```
from nltk.corpus import brown
import nltk
news_text = brown.words(categories='news')             # the 'news' category, split into words
fdist = nltk.FreqDist([w.lower() for w in news_text])  # frequency distribution over the news words
modals = ['can', 'could', 'may', 'might', 'must', 'will']
for m in modals:
    print m + ':', fdist[m],                           # count of each modal word
```

Output:

```
can: 94 could: 87 may: 93 might: 38 must: 53 will: 389
```
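nltk.FreqDist behaves like a counting dictionary that returns 0 for unseen words. collections.Counter from the standard library gives the same behaviour, shown here on a made-up token list instead of the news category:

```python
from collections import Counter

# A toy token list standing in for brown.words(categories='news').
news_text = ["We", "will", "see", "what", "we", "can", "do", ",", "and", "we", "will"]
fdist = Counter(w.lower() for w in news_text)   # Counter plays the role of nltk.FreqDist

for m in ["can", "could", "will"]:
    print(m + ":", fdist[m])   # Counter returns 0 for words never seen
```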

Tabulate counts of several words across several categories:

```
from nltk.corpus import brown
import nltk
cfd = nltk.ConditionalFreqDist(
        (genre, word)
        for genre in brown.categories()
        for word in brown.words(categories=genre))
genres = ['news', 'religion', 'hobbies', 'science_fiction', 'romance', 'humor']
modals = ['can', 'could', 'may', 'might', 'must', 'will']
cfd.tabulate(conditions=genres, samples=modals)
```

Output:

```
                  can could  may might must will
           news   93   86   66   38   50  389
       religion   82   59   78   12   54   71
        hobbies  268   58  131   22   83  264
science_fiction   16   49    4   12    8   16
        romance   74  193   11   51   45   43
          humor   16   30    8    8    9   13
```
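A ConditionalFreqDist is essentially one frequency distribution per condition. It can be mimicked with a defaultdict of Counters, using toy (genre, word) pairs in place of the Brown word stream:

```python
from collections import Counter, defaultdict

# Toy (genre, word) pairs standing in for the generator over brown.words().
pairs = [("news", "will"), ("news", "can"), ("news", "will"),
         ("romance", "could"), ("romance", "could"), ("romance", "will")]

cfd = defaultdict(Counter)
for genre, word in pairs:        # same (condition, sample) shape as above
    cfd[genre][word] += 1

print(cfd["news"]["will"])
print(cfd["romance"]["could"])
```

tabulate() then just walks the requested conditions (rows) and samples (columns) of this nested mapping and prints them aligned.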

1.6 The CategorizedPlaintextCorpusReader class

Compared with brown (CategorizedTaggedCorpusReader), reuters (CategorizedPlaintextCorpusReader) lets you look up the topics covered by one or more documents, and also the documents contained in one or more categories.

```
>>> from nltk.corpus import reuters
>>> reuters.fileids()[1:10]
['test/14828', 'test/14829', 'test/14832', 'test/14833', 'test/14839', 'test/14840', 'test/14841', 'test/14842', 'test/14843']
>>> reuters.categories()
['acq', 'alum', 'barley', 'bop', 'carcass', 'castor-oil', 'cocoa', 'coconut', 'coconut-oil', 'coffee', 'copper', 'copra-cake', 'corn', 'cotton', 'cotton-oil', 'cpi', 'cpu', 'crude', 'dfl', 'dlr', 'dmk', 'earn', 'fuel', 'gas', 'gnp', 'gold', 'grain', 'groundnut', 'groundnut-oil', 'heat', 'hog', 'housing', 'income', 'instal-debt', 'interest', 'ipi', 'iron-steel', 'jet', 'jobs', 'l-cattle', 'lead', 'lei', 'lin-oil', 'livestock', 'lumber', 'meal-feed', 'money-fx', 'money-supply', 'naphtha', 'nat-gas', 'nickel', 'nkr', 'nzdlr', 'oat', 'oilseed', 'orange', 'palladium', 'palm-oil', 'palmkernel', 'pet-chem', 'platinum', 'potato', 'propane', 'rand', 'rape-oil', 'rapeseed', 'reserves', 'retail', 'rice', 'rubber', 'rye', 'ship', 'silver', 'sorghum', 'soy-meal', 'soy-oil', 'soybean', 'strategic-metal', 'sugar', 'sun-meal', 'sun-oil', 'sunseed', 'tea', 'tin', 'trade', 'veg-oil', 'wheat', 'wpi', 'yen', 'zinc']
>>> reuters.categories('training/9865')
['barley', 'corn', 'grain', 'wheat']
>>> reuters.categories(['training/9865','training/9880'])
['barley', 'corn', 'grain', 'money-fx', 'wheat']
>>> reuters.categories('training/9880')
['money-fx']
```
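The two-way lookup (document to topics, topic to documents) comes from an index that can be sketched with plain dicts; the fileid-to-category mapping below is a toy stand-in for the Reuters index, not how NLTK stores it internally:

```python
from collections import defaultdict

# Toy fileid -> categories mapping, echoing the session above.
doc_cats = {"training/9865": ["barley", "corn", "grain", "wheat"],
            "training/9880": ["money-fx"]}

# Invert the mapping so lookups work in either direction.
cat_docs = defaultdict(list)
for fid, cats in doc_cats.items():
    for c in cats:
        cat_docs[c].append(fid)

def categories(fileids):
    # Sorted union of categories over one or more documents, matching
    # how reuters.categories() merges a list of fileids.
    return sorted({c for f in fileids for c in doc_cats[f]})

print(categories(["training/9865", "training/9880"]))
```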

Compare with brown:

```
>>> from nltk.corpus import brown
>>> brown.categories(['news','reviews'])   # cannot look up multiple topics this way
[]
>>> brown.fileids(['cr05','cr06'])
[]
```

1.7 Basic corpus functions

Example                      Description
fileids()                    the files of the corpus
fileids([categories])        the files of the corpus corresponding to these categories
categories()                 the categories of the corpus
categories([fileids])        the categories of the corpus corresponding to these files
raw()                        the raw content of the corpus
raw(fileids=[f1,f2,f3])      the raw content of the specified files
raw(categories=[c1,c2])      the raw content of the specified categories
words()                      the words of the whole corpus
words(fileids=[f1,f2,f3])    the words of the specified files
words(categories=[c1,c2])    the words of the specified categories
sents()                      the sentences of the whole corpus
sents(fileids=[f1,f2,f3])    the sentences of the specified files
sents(categories=[c1,c2])    the sentences of the specified categories
abspath(fileid)              the location of the given file on disk
encoding(fileid)             the encoding of the file (if known)
open(fileid)                 open a stream for reading the given corpus file
root                         the path to the root of the locally installed corpus
readme()                     the contents of the corpus README file

1.8 Loading your own corpus

```
>>> from nltk.corpus import PlaintextCorpusReader
>>> corpus_root = '/Users/rcf/workspace/python/python_test/NLP_WITH_PYTHON/chapter_2'
>>> wordlist = PlaintextCorpusReader(corpus_root, '.*')   # corpus_root: corpus path; '.*': matches all files
>>> wordlist.fileids()
['1.py', '2.py', '3.py', '4.py']
>>> wordlist.words('3.py')
['from', 'nltk', '.', 'corpus', 'import', 'brown', ...]
```
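What PlaintextCorpusReader does with the root directory and the fileids pattern can be mimicked with the standard library alone, with no nltk data involved; the temporary directory and file names below are made up for illustration:

```python
import re
import tempfile
from pathlib import Path

# A throwaway "corpus root" containing a couple of plain-text files.
root = Path(tempfile.mkdtemp())
(root / "a.txt").write_text("hello corpus world")
(root / "b.txt").write_text("another file")
(root / "notes.md").write_text("ignored")

pattern = re.compile(r".*\.txt$")   # plays the role of the fileids regexp
fileids = sorted(p.name for p in root.iterdir() if pattern.match(p.name))
tokens = (root / "a.txt").read_text().split()   # crude stand-in for .words()

print(fileids)
print(tokens)
```

Passing '.*' as in the session above would match every file in the root, which is why the .py files all show up as fileids.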

 

Reprinted from: https://www.cnblogs.com/rcfeng/p/3930464.html
