Published 23 October 2013

The goal

I want a system that helps me cross-navigate my organization's Alfresco repository.

Alfresco lets me browse content by category, so to make this possible I need to enrich my content with relevant categories. There are 2 kinds of solutions:

  • semantic improvement through content analysis, which identifies specific elements in the content such as places, people and organizations

  • semantic improvement through analysis of path and content names

The solution: semantic improvement

People don't like entering metadata such as categories. We're all lazy, and even if people were eager to enter this information, there would still be a risk that they choose the wrong category.

Semantic improvement through content analysis

There are at least 2 solutions:

  • Stanbol, an Apache project, which you can download and install on your own system

  • OpenCalais, a proprietary cloud service from Reuters, which you can use online

Both solutions analyse the content of a document and tag it when they identify specific information that can be linked to an entry in an external database. For example, if a document contains the word "Paris", it will be suggested as a category because a matching entry exists in the external database (DBpedia). The ability to detect places therefore depends directly on the solution and on the richness of its external database.

For example, if your document contains the word "Marseille", the second-largest French city by population, Stanbol won't return anything, which is disappointing enough to discourage people from using such a system.

More on this in another post, yet to be written...
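To make the dependency on the external database concrete, here is a toy sketch in Python. It is not how Stanbol or OpenCalais work internally, and it uses none of their APIs: the `KNOWN_ENTITIES` dictionary and the `suggest_categories` function are names I made up to stand in for a knowledge base lookup. The dictionary is deliberately incomplete, so "Paris" is recognised while "Marseille" is silently ignored.

```python
import re

# Hypothetical stand-in for an external knowledge base such as DBpedia.
# It is deliberately incomplete: "marseille" has no entry.
KNOWN_ENTITIES = {
    "paris": "Place",
    "reuters": "Organization",
}


def suggest_categories(text, known=KNOWN_ENTITIES):
    """Return (word, entity type) pairs for words found in the knowledge base."""
    suggestions = []
    seen = set()
    # [^\W\d_]+ matches runs of letters, including accented ones.
    for word in re.findall(r"[^\W\d_]+", text):
        key = word.lower()
        if key in known and key not in seen:
            seen.add(key)
            suggestions.append((word, known[key]))
    return suggestions


text = "Our offices in Paris and Marseille signed a contract with Reuters."
print(suggest_categories(text))
# [('Paris', 'Place'), ('Reuters', 'Organization')]
```

Whatever the engine, a word can only become a category if the database behind it knows about it.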

Semantic improvement through path and content name analysis

Another kind of solution consists of analyzing paths and content names to extract relevant information. Indeed, while people are reluctant to enter metadata, there is one kind they are used to entering: directory and file names. Moreover, this information is related to the content.

Analyzing your repository this way leaves you with numerous candidate categories. Just eliminate those that appear only once, since they add no value for cross-navigation, and think about the others. Of course, the more a category is used, the more interesting it is. It's up to you to choose the right level of analysis. You may end up with 200 or 300 categories structured into 2 or 3 levels.

You can then cross-navigate your repository content and relate things like never before.
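The analysis described above can be sketched in a few lines of Python. This is only a minimal illustration under my own assumptions: the paths are invented, the `STOP_WORDS` set is a hypothetical noise filter, and `path_terms` and `candidate_categories` are names chosen for this example. A real run would read the paths from your Alfresco repository instead of a hard-coded list.

```python
import re
from collections import Counter
from pathlib import PurePosixPath

# Hypothetical noise words; adapt this set to your own repository.
STOP_WORDS = {"the", "and", "for", "of", "final", "draft"}


def path_terms(path):
    """Return the distinct terms found in the directory names and the file name."""
    p = PurePosixPath(path)
    # Directory names, plus the file name without its extension.
    parts = [part for part in p.parts[:-1] if part != "/"] + [p.stem]
    terms = set()
    for part in parts:
        for token in re.split(r"[\W_]+", part.lower()):
            if len(token) > 1 and not token.isdigit() and token not in STOP_WORDS:
                terms.add(token)
    return terms


def candidate_categories(paths, min_count=2):
    """Count how many paths use each term and drop terms used fewer than min_count times."""
    counts = Counter()
    for path in paths:
        counts.update(path_terms(path))
    kept = [(term, n) for term, n in counts.items() if n >= min_count]
    # Most used first, then alphabetical, so the result is stable.
    return sorted(kept, key=lambda item: (-item[1], item[0]))


paths = [
    "/Projects/Paris/2013/budget_report.xlsx",
    "/Projects/Marseille/2013/budget_forecast.docx",
    "/Projects/Paris/contracts/supplier-contract.pdf",
    "/HR/contracts/template.docx",
]
print(candidate_categories(paths))
# [('projects', 3), ('budget', 2), ('contracts', 2), ('paris', 2)]
```

Terms such as "marseille" or "template" appear only once and are dropped. Raising `min_count` is one way to choose the analysis level; deciding which of the remaining terms deserve to become categories, and how to arrange them into levels, remains a manual step.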
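As a rough idea of what cross-navigation means once documents carry categories, here is a small Python sketch. It assumes the category assignments already exist; the document names, the `build_index` and `cross_navigate` functions and the data are all invented for illustration and have nothing to do with Alfresco's own category API.

```python
from collections import defaultdict


def build_index(doc_categories):
    """Invert a {document: categories} mapping into {category: documents}."""
    index = defaultdict(set)
    for doc, categories in doc_categories.items():
        for category in categories:
            index[category].add(doc)
    return index


def cross_navigate(index, *categories):
    """Return the documents tagged with every one of the given categories."""
    doc_sets = [index.get(category, set()) for category in categories]
    return set.intersection(*doc_sets) if doc_sets else set()


# Invented sample data: each document with the categories assigned to it.
doc_categories = {
    "budget_report.xlsx": {"paris", "budget"},
    "budget_forecast.docx": {"marseille", "budget"},
    "supplier-contract.pdf": {"paris", "contracts"},
}
index = build_index(doc_categories)

print(sorted(cross_navigate(index, "budget")))
# ['budget_forecast.docx', 'budget_report.xlsx']
print(sorted(cross_navigate(index, "paris", "budget")))
# ['budget_report.xlsx']
```

Each extra category narrows the result, whatever folder the documents live in.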

Conclusion

I think this technique quickly adds value to your repository.

The next post will provide an implementation example so you can set up this cross-navigation.

Resources

[stanbol] http://stanbol.apache.org

[semantics4alfresco] https://code.google.com/p/semantics4alfresco/



