NClassifier

19 July 2006 by Ismail

This is a pretty cool library I found the other day: NClassifier. It's a .NET port of the Java library Classifier4J. There are some nice little classes in there, like:

  • BayesianClassifier - uses Bayes' theorem to rate the text against a known input
  • VectorClassifier - uses the vector space search algorithm
  • Summariser - automatically summarizes long text

    The area of most interest to me is the summarizer, which would be nice for auto-generating teasers for news items or events in Umbraco. It also has GetMostFrequentWords, which could possibly be used to auto-generate keywords.
    Just wrap the classifier up as an XSLT extension and it's ready to use in Umbraco.
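
To give a feel for what a GetMostFrequentWords-style call might hand back for keyword generation, here is a rough sketch in Python rather than .NET. It is not NClassifier's code: the function name most_frequent_words, the tiny stop word list and the tokenising regex are all assumptions made for illustration.

```python
import re
from collections import Counter

# Tiny illustrative stop word list (an assumption; a real one would be much longer).
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "it", "for", "on", "with"}


def most_frequent_words(text, count=5):
    """Return up to `count` of the most frequent non-stop words, most frequent first."""
    # Lowercase the text and pull out runs of letters (and apostrophes) as words.
    words = re.findall(r"[a-z']+", text.lower())
    # Ignore stop words and very short tokens, then count what is left.
    counts = Counter(w for w in words if w not in STOP_WORDS and len(w) > 2)
    # Sort by descending count, then alphabetically so ties come out in a stable order.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [word for word, _ in ranked[:count]]


if __name__ == "__main__":
    teaser = "Umbraco news: Umbraco events and news for Umbraco"
    print(most_frequent_words(teaser, 2))
```

Raw frequency like this tends to surface generic words, so whether the result is good enough for real keywords depends heavily on the stop word list and the content.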

    If you wanted to be really brave, you could create your own datatype, possibly something based on myURL, to auto-generate abstracts for long pieces of content.

    If you're feeling really, really adventurous, you could have a stab at building some kind of automatic classification control. You would need to create a training set first so that the Bayesian classifier has a reference set to work with, and you really need to know what you're doing to get this going. More information on this is at the original Classifier4J site.
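
The training set idea can be sketched with a bare-bones naive Bayes classifier. This is Python, not .NET, and it is not how NClassifier's BayesianClassifier is implemented; the class name TinyNaiveBayes, its train and classify methods and the two sample categories are hypothetical, chosen only to show why the classifier needs labelled examples before it can rate new text.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


class TinyNaiveBayes:
    """Minimal multinomial naive Bayes with add-one smoothing (illustrative only)."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # category -> word -> count
        self.doc_counts = Counter()              # category -> number of training texts
        self.vocabulary = set()                  # every word seen during training

    def train(self, category, text):
        """Add one labelled example to the training set."""
        words = tokenize(text)
        self.word_counts[category].update(words)
        self.doc_counts[category] += 1
        self.vocabulary.update(words)

    def classify(self, text):
        """Return the category with the highest log-probability for `text`."""
        if not self.doc_counts:
            raise ValueError("train the classifier before classifying")
        total_docs = sum(self.doc_counts.values())
        vocab_size = len(self.vocabulary) + 1  # +1 slot for words never seen in training
        words = tokenize(text)
        scores = {}
        for category, docs in self.doc_counts.items():
            # Start from the prior: how common this category is in the training set.
            score = math.log(docs / total_docs)
            total_words = sum(self.word_counts[category].values())
            for word in words:
                count = self.word_counts[category][word]
                # Add-one smoothing keeps unseen words from zeroing the probability.
                score += math.log((count + 1) / (total_words + vocab_size))
            scores[category] = score
        return max(scores, key=scores.get)


if __name__ == "__main__":
    nb = TinyNaiveBayes()
    nb.train("news", "council announces new budget and election results")
    nb.train("events", "summer festival concert tickets on sale")
    print(nb.classify("festival tickets"))
```

With only one example per category this is a toy; a usable classification control would need many labelled documents per category.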

5 comment(s) for “NClassifier”

  1. Darren,

    Nice one! I'm gonna have a play at some point when I get some free time.

    Cheers

    Ismail
  2. Darren,

    I'm not planning anything too complex with it at the moment, but if we got a job that needed it then I may get onto it.

    The Bayesian auto classifier looks pretty solid. However, I need to play with GetMostFrequentWords to see what it comes up with and then see if it is good enough.

    Do a search on SourceForge for auto keyword generator libraries. If there are no .NET ones out there, I'm sure there will be Java ones that could quite easily be ported.
  3. Simple keyword suggestion is pretty easy to produce from the Lucene index with a few tweaks - you can store the term frequency vectors in the index and then pretty simply extract the most significant terms for the document.

    Adaptive classification into a taxonomy like EGMS is going to be trickier as there isn't necessarily any direct relationship between the text of the category and the content / other meta data.

    I've been reading some painful papers on the subject and it looks like it might be possible to do for well known stuff like the EGMS with sufficient data mining and maximum entropy modelling. Custom taxonomies will be trickier to automate as the size of the potential training set will be much smaller.
  4. Hi,

    Finally got around to classifying blog posts based on a simple configuration.

    http://www.darren-ferguson.com/classify.xml

    If you click through to the detail view of any of my blog posts you should see the result of the classification.

    As you can see it's fairly basic at the moment, but the potential is there.

    Not very impressed with the summarizer though.
  5. Ismail, this sounds very cool. In the ECM world, I work with Interwoven and I've been hunting down something open source that does the job of MetaTagger. Please let me know if you are planning to do any dev with this. I'd really like to do some keyword suggestion work for Umbraco.
