Intelligent Book Categorizer

For the course Speech and Language Processing 1 we had to build a language classifier. We chose to determine the category of a book based on the language used in it. Four categories were chosen: crime non-fiction, mythology, science fiction and children's books. For each category, ten books were used as training data and five books were used for testing.

The methods used to determine the category were:

Unigrams, bigrams, trigrams (of words)
Word length
Sentence length
Part-of-speech (POS)
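
To make the listed features concrete, here is a minimal sketch of how the word n-gram counts, word length and sentence length could be computed, and how n-gram counts could be compared between a book and a category. It is not the project's original code: the function names, the regular-expression tokenizer, the cosine-similarity comparison and the `profiles` dictionary are all assumptions made for illustration, and the part-of-speech feature is left out because it needs an external tagger.

```python
import re
from collections import Counter
from math import sqrt


def tokenize(text):
    """Lowercase the text and return its word tokens (assumed tokenizer)."""
    return re.findall(r"[a-z']+", text.lower())


def ngrams(tokens, n):
    """Count word n-grams: n=1 unigrams, n=2 bigrams, n=3 trigrams."""
    return Counter(zip(*(tokens[i:] for i in range(n))))


def mean_word_length(tokens):
    """Average number of characters per word."""
    return sum(len(t) for t in tokens) / len(tokens) if tokens else 0.0


def mean_sentence_length(text):
    """Average number of words per sentence, splitting on '.', '!' and '?'."""
    sentences = [s for s in re.split(r"[.!?]+", text) if tokenize(s)]
    if not sentences:
        return 0.0
    return sum(len(tokenize(s)) for s in sentences) / len(sentences)


def cosine(a, b):
    """Cosine similarity between two n-gram count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def classify(text, profiles, n=1):
    """Return the category whose n-gram profile is most similar to the text.

    `profiles` is a hypothetical dict mapping a category name to the summed
    n-gram counts of that category's training books.
    """
    counts = ngrams(tokenize(text), n)
    return max(profiles, key=lambda category: cosine(counts, profiles[category]))
```

In this sketch a category profile would be built by adding up the `ngrams` counters of the ten training books, after which each test book is passed to `classify`. The word and sentence length averages could be compared per category in the same way, for example by picking the category with the closest average.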

The classifier categorized the books quite accurately. Unigrams in particular gave a high accuracy. Word and sentence length were only useful for distinguishing some of the categories. Part-of-speech features also worked quite well, but a combination of the methods would probably give the best classifier.

Tools used

Python

TreeTagger (part-of-speech tagger, used from Python)
