Intelligent Book Categorizer
For the course Speech and Language Processing 1 we had to build a language classifier. We chose to determine the category of a book from the language used in it. We picked four categories: crime non-fiction, mythology, science fiction and children's books. For each category, ten books served as training data and five books as test data.
The methods used to determine the category were:
- Unigrams, bigrams, trigrams (of words)
- Word length
- Sentence length
- Part-of-speech (POS)
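
As an illustration of the first three feature types, here is a minimal sketch of how they could be extracted from raw text. The function names, the regex-based tokenizer and the sentence splitting on ".", "!" and "?" are assumptions made for this example and are not taken from the project code.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and return its word tokens (simple regex, an assumption)."""
    return re.findall(r"[a-z']+", text.lower())


def ngrams(tokens, n):
    """Count all word n-grams of order n (n=1 unigrams, n=2 bigrams, n=3 trigrams)."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def mean_word_length(tokens):
    """Average number of characters per word."""
    return sum(len(t) for t in tokens) / len(tokens) if tokens else 0.0


def mean_sentence_length(text):
    """Average number of words per sentence, splitting on '.', '!' and '?'."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    if not sentences:
        return 0.0
    return sum(len(tokenize(s)) for s in sentences) / len(sentences)
```

For a whole book, the n-gram counts would typically be computed once per order and the two length features reduced to a single average each.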
The classifier categorized the books quite accurately. Unigrams in particular gave a high accuracy. Word length and sentence length were only useful for distinguishing some of the categories. Part-of-speech features also worked quite well, but a combination of the methods would probably give the best classifier.
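
The description above does not say how the unigram counts were turned into a category decision. One common option is sketched below, purely as an assumption: a unigram model per category with add-one smoothing, where a book is assigned to the category with the highest log-likelihood. The names `train` and `classify` are hypothetical.

```python
import math
from collections import Counter


def train(books_by_category):
    """Build one unigram count table per category from tokenized training books."""
    return {
        category: Counter(token for book in books for token in book)
        for category, books in books_by_category.items()
    }


def classify(tokens, models):
    """Return the category whose smoothed unigram model gives the highest log-likelihood."""
    vocabulary = set().union(*models.values())
    vocab_size = len(vocabulary) + 1  # one extra slot for unseen words
    best_category, best_score = None, -math.inf
    for category, counts in models.items():
        total = sum(counts.values())
        # Add-one smoothing keeps unseen words from producing log(0).
        score = sum(
            math.log((counts[token] + 1) / (total + vocab_size))
            for token in tokens
        )
        if score > best_score:
            best_category, best_score = category, score
    return best_category
```

With ten training books per category, `books_by_category` would map each of the four category names to a list of ten token lists, and each test book would be passed to `classify` as a single token list.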
Tools used
Python
TreeTagger (part-of-speech tagger, used from Python)