New language auto-detection over Blogs
February 19, 2009
We are pleased to announce that improved language detection for blogs in the UGC Metabase will launch in two weeks. We’re also introducing new blog lists sorted by language, so you can see all the English, French, German, Chinese and other blogs in our index.
And we’re adding a new date field, showing the time we indexed a particular post. This is in addition to the publish date already provided, as copied from the original XML/RSS feed.
1. Improved language detection at post level
Blog feeds normally state which language they are in. However, this isn’t always reliable: blog publishing platforms typically have a default language setting, and bloggers do not always change it to their own language. As a result, a significant portion of blog feeds declare the wrong language.
We’ve been working hard in the background to produce a more reliable approach to language detection. We’ll be rolling this out next month as the basis for setting the post’s language, as provided in the <language> tag. Only when this approach cannot confidently determine the language will we fall back to the language tag provided in the original XML.
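The fallback rule described above can be sketched roughly as follows. This is a minimal illustration, not Moreover's implementation: the `detect` callable, its `(language, confidence)` return value, the `toy_detector` stand-in and the 0.8 threshold are all assumptions made for the example.

```python
from typing import Callable, Optional, Tuple

# Hypothetical detector signature: returns (language_code, confidence in [0, 1]).
Detector = Callable[[str], Tuple[Optional[str], float]]


def choose_post_language(
    text: str,
    declared: Optional[str],
    detect: Detector,
    threshold: float = 0.8,  # assumed cut-off, not a documented value
) -> Optional[str]:
    """Prefer the detected language; fall back to the feed's declared one."""
    language, confidence = detect(text)
    if language is not None and confidence >= threshold:
        return language
    # Detection was not confident enough: use the language from the original XML.
    return declared


def toy_detector(text: str) -> Tuple[Optional[str], float]:
    """Stand-in detector for demonstration only (spots one French word)."""
    if " le " in f" {text.lower()} ":
        return ("fr", 0.95)
    return (None, 0.0)
```

With this sketch, `choose_post_language("Le chat dort", "en", toy_detector)` picks the detected language, while a post the detector cannot classify keeps the declared `"en"`.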
2. New language tagging at feed level
Further to this, we are adding a new <feedLanguage> tag, showing the language of the blog feed. This is in addition to the existing <language> tag referred to above, which is at post level.
Adding language categorisation at feed level makes it possible to better organise the index by language – for example we can identify exactly which blogs are in French, which are in English, etc, and provide and manage these in lists.
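As a rough illustration of how feed-level language values could be used to build such lists, here is a minimal sketch. It assumes the feed URL and its <feedLanguage> value have already been extracted into pairs; the sample URLs and the `"unknown"` bucket are hypothetical.

```python
from collections import defaultdict


def group_feeds_by_language(feeds):
    """Map each feed language to the sorted list of feed URLs in that language."""
    groups = defaultdict(list)
    for url, feed_language in feeds:
        # Feeds without a language value go into an assumed "unknown" bucket.
        groups[feed_language or "unknown"].append(url)
    return {language: sorted(urls) for language, urls in groups.items()}


# Hypothetical sample data: (feed URL, feedLanguage value) pairs.
sample_feeds = [
    ("http://example.org/b", "French"),
    ("http://example.org/a", "French"),
    ("http://example.org/c", "English"),
    ("http://example.org/d", None),
]
by_language = group_feeds_by_language(sample_feeds)
```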
The new language tag will appear in the UGC XML as follows:
3. Introducing a new Harvest Date field
Lastly, we’re adding a new <itemHarvestDate> field to the feed. This gives the time Moreover actually indexed the item. We already pass on the publish date of the post, as provided in the original XML/RSS feed. The new harvest date complements it and can show, for example, how much latency there is between publication and indexing across the feeds.
The new harvest date tag will appear in the UGC XML as follows:
All times are shown in GMT.
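To illustrate the latency use case, the sketch below computes the gap between a post's publish date and its harvest date. The timestamp format is an assumption made for the example (the actual format in the UGC XML may differ), and it also assumes the publish date has already been normalised to GMT.

```python
from datetime import datetime, timedelta, timezone

# Assumed timestamp layout; adjust to whatever the feed actually delivers.
TIMESTAMP_FORMAT = "%Y-%m-%dT%H:%M:%SZ"


def parse_gmt(value: str) -> datetime:
    """Parse a timestamp string and mark it as GMT/UTC."""
    return datetime.strptime(value, TIMESTAMP_FORMAT).replace(tzinfo=timezone.utc)


def indexing_latency(publish_date: str, harvest_date: str) -> timedelta:
    """Time between publication and indexing (negative if the dates disagree)."""
    return parse_gmt(harvest_date) - parse_gmt(publish_date)


# Hypothetical values: published at 10:00:00, harvested at 10:07:30.
latency = indexing_latency("2009-02-19T10:00:00Z", "2009-02-19T10:07:30Z")
print(latency.total_seconds())  # 450.0
```

A negative result would usually point to an unreliable publish date in the source feed rather than to the harvest time.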
We believe in being open and transparent about our crawling performance, and are confident about our technology. We invite comparison with other, similar services (for example, see Technorati and a recent comment on ReadWriteWeb), and welcome any feedback you, as customers and users, have.