We cope, but it isn’t getting much better.
And sometimes finding what we’re looking for is like finding a needle in a field of haystacks. Or a leaf in a forest of trees.
Search alone is rarely enough to find what you need in very large data spaces. For example, Google searches and Monster candidate listings often return thousands of close hits. Matching engines help by efficiently applying criteria to a two-sided search: both employer and worker have demands to be met, and each supplies ways to meet the other’s demands.
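To make the two-sided idea concrete, here is a minimal sketch in Python. It is not how any particular matching engine works. The `Party` class, the `mutual_matches` function, and the sample names and criteria are all made up for illustration, and it assumes demands and offers can be compared as plain sets of labels.

```python
from dataclasses import dataclass, field


@dataclass
class Party:
    """One side of the match (hypothetical structure for this sketch)."""
    name: str
    offers: set = field(default_factory=set)
    demands: set = field(default_factory=set)


def mutual_matches(employers, workers):
    """Return (employer, worker) name pairs where each side's demands
    are fully covered by what the other side offers."""
    return [
        (e.name, w.name)
        for e in employers
        for w in workers
        if e.demands <= w.offers and w.demands <= e.offers
    ]


# Invented sample data: one employer, three candidates.
employers = [
    Party("Acme", offers={"remote", "health plan"}, demands={"python", "sql"}),
]
workers = [
    Party("Ana", offers={"python", "sql", "java"}, demands={"remote"}),
    Party("Bo", offers={"python"}, demands={"remote"}),
    Party("Cy", offers={"python", "sql"}, demands={"relocation"}),
]

print(mutual_matches(employers, workers))
```

A one-sided search would return all three candidates as close hits for "python". Checking both directions leaves only the pairing where neither side has an unmet demand.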
Taxonomies are another approach. Yahoo! and Open Directory show the value of navigating through clumps and clusters of related sites. But you have your own data to mine. And creating a taxonomy by hand is expensive and slow.
Enter taxonomy helpers. They do several things:
- Analyze source files: Suck metadata from your diverse resources (documents, web pages, emails, news feeds, etc.) into a common and comparable format.
- Define clusters: Help define your topics and how the topics are related. This is compute- and storage-intensive, so it is often done bit by bit, starting with broad categories and then refining and splitting them as they fill up.
- Categorize: Assign each resource into one or more categories in the taxonomy, typically using metadata.
- Serve: Manage a user experience for surfing or flying through the taxonomy.
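The first three steps can be sketched in a few lines of Python. This is a toy under loud assumptions, not how any of the products below work: the metadata is just a bag of words, a topic is a hand-written keyword set, and "filling up" means passing an arbitrary `limit`. The function names, the stopword list, and the sample documents are all invented. The serve step is left out.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption, not a standard one).
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "for", "is", "on"}


def analyze(text):
    """Analyze: reduce any source text to a comparable bag of lowercase terms."""
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(w for w in words if w not in STOPWORDS)


def categorize(terms, taxonomy):
    """Categorize: return every topic whose keywords overlap the resource's terms."""
    hits = [topic for topic, keywords in taxonomy.items() if keywords & set(terms)]
    return hits or ["uncategorized"]


def needs_split(assignments, limit):
    """Define clusters, bit by bit: flag topics holding more than `limit` resources."""
    counts = Counter(topic for topics in assignments.values() for topic in topics)
    return sorted(topic for topic, n in counts.items() if n > limit)


# Hypothetical starting taxonomy: two broad categories.
taxonomy = {
    "search": {"search", "query", "engine"},
    "hiring": {"resume", "candidate", "employer"},
}
docs = {
    "doc1": "Tuning the search engine for long queries",
    "doc2": "Candidate resume screening for the employer",
    "doc3": "Search for a candidate by query",
    "doc4": "Quarterly budget memo",
}

assignments = {name: categorize(analyze(text), taxonomy) for name, text in docs.items()}
print(assignments)
print(needs_split(assignments, limit=1))
```

Note that a resource can land in more than one category, and that anything matching no topic falls into an "uncategorized" bucket. A growing bucket like that is a hint that the taxonomy is missing a topic.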
Here’s a roundup of some shipping categorizers.
First, I noted Quiver, a tool that recommends topics for human review and approval.
Taxonomy technology greatly assists the sharing of enterprise knowledge. But don’t expect to sit back and watch it go. Experts agree that those searching for an out-of-the-box solution shouldn’t hold their breath. Count on adding a little elbow grease, but the results will be worth it.
They mentioned taxonomy vendors:
Autonomy creates and maintains outlines using pattern and cluster analysis. Separate components analyze documents for their content and categorize them to taxonomy branches and leaves.
Inxight Software’s Categorizer filters, classifies and delivers content to users and corporate knowledge bases. It scales to millions of documents and thousands of topics in multiple languages. A sister product, MetaText Server, elicits structured data from unstructured sources.
Lotus Discovery Server extracts, analyzes, and categorizes structured and unstructured content to reveal the relationships between the information as well as the people, topics, and user activity in an organization.
Semio’s SemioTagger autocategorizes content.
Sopheon autocategorizes content from multiple sources, including sources external to the enterprise.
They also pointed out taxonomy visualization sites.
Antarcti.ca uses cartography to map clusters of information spatially.
Now eWeek reviews three more products in this space:
- Applied Semantics’ Auto-Categorizer 1.1
- Interwoven’s Metatagger 3.0
- Thunderstone Software’s Texis Categorizer 4.1.
eWeek’s overview of the comparison findings is worth reading, as is their eVal Scorecard: Content Categorization. Note that they used very small record sets, in the low thousands. Even a small company will organize hundreds of thousands of records, if not millions.
Now where should I categorize this post? [a klog apart]