Taxonomy/Thesaurus/Ontology

E-discovering Language to Launch Your Taxonomy

New enterprise initiatives, whether for implementing search solutions or beginning a new product development program, demand communication among team leaders and participants. Language matters; defining terminology for common parlance is essential to smooth progress toward initiative objectives.

Glossaries, dictionaries, taxonomies, thesauri and ontologies are all mechanisms we use routinely in education and work to clarify terms we use to engage and communicate understanding of any specialized domain. Electronic social communication added to the traditional mix of shared information (e.g. documents, databases, spreadsheets, drawings, standardized forms) makes business transactional language more complex. Couple this with the use of personal devices for capturing and storing our work content, notes, writings, correspondence, design and diagram materials and we all become content categorizing managers. Some of us are better than others at organizing and curating our piles of information resources.

As recent brain studies reveal, humans, and probably any animal with a brain, have established cognitive areas in our brains with pathways and relationships among categories of grouped concepts. This reinforces our propensity for expending thought and effort to order all aspects of our lives. That we all organize differently across a huge spectrum of concepts and objects makes it wondrous that we can live and work collaboratively at all. Why after 30+ years of marriage do I arrange my kitchen gadget drawer according to use or purpose of devices while my husband attempts to store the same items according to size and shape? Why do icons and graphics placed in strange locations in software applications and web pages rarely impart meaning and use to me, while others “get it” and adapt immediately?

The previous paragraph may seem a pointless digression from the subject of this post, but it makes two points. First, we all organize both objects and information to facilitate how we navigate life, including work. Without organization that is somehow rationalized, and established according to our own rules for functioning, our lives descend into dysfunctional chaos. People who don’t organize well, or who struggle to organize consistently, struggle in school, work and life skills. Second, diversity of practice in organizing is a challenge for working and living with others when we need to share the same spaces and work objectives. This brings me to the very challenging task of organizing information for a website, a discrete business project, or an entire enterprise, especially when a diverse group of participants is engaged as a team.

So, let me make a few bold suggestions about where to begin with your team:

  • Establish categories of inquiry based on the existing culture of your organization and vertical industry. Avoid being inventive, clever or idiosyncratic. Find category labels that everyone understands similarly.
  • Agree on common behaviors and practices for finding by sharing openly the ways in which members of the team need to find, the kinds of information and answers that need discovering, and the conditions under which information is required. These are the basis for findability use cases. Again, begin with the usual situations and save the unusual for later insertion.
  • Start with what you have in the form of finding aids: places, language and content that are already being actively used; examine how they are organized. Solicit and gather experiences about what is good, helpful and “must have” and note interface elements and navigation aids that are not used. Harvest any existing glossaries, dictionaries, taxonomies, organization charts or other definition entities that can provide feeds to terminology lists.
  • Use every discoverable repository as a resource (including email stores, social sites, and presentations) for establishing terminology and eventually writing rules for applying terms. Research repositories that are heavily used by groups of specialists and treat them as crops of terminology to be harvested for language that is meaningful to experts. Seek or develop linguistic parsing and term extraction tools and processes to discover words and phrases that are in common use. Use histograms to determine frequency of use, alphabetize the lists to spot similar terms that are conceptually related, and use semantic net tools to group discovered terms according to conceptual relationships. Segregate initialisms, acronyms, and abbreviations for analysis and insertion into final lists, either as valid terms or as synonyms of valid terms.
  • Talk to the gurus and experts that are the “go-to people” for learning about a topic and use their experience to help determine the most important broad categories for information that needs to be found. Those will become your “top term” groups and facets. Think of top terms as topical in nature (e.g. radar, transportation, weapons systems) and facets as other categories by which people might want to search (e.g. company names, content types, conference titles).
  • Simplify your top terms and facets into the broadest categories for launching your initiative. You can always add more but you won’t really know where to be the most granular until you begin using tags applied to content. Then you will see what topics have the most content and require narrower topical terms to avoid having too much content piling up under a very broad category.
  • Select and authorize one individual to be the ultimate decider. Ambiguity of categorizing principles, purpose and needs is always a given due to variations in cognitive functioning. However, the earlier steps outlined here will have been based on broad agreement. When it comes to the more nuanced areas of terminology and understanding, a subject savvy and organizationally mature person with good communication skills and solid professional respect within the enterprise will be a good authority for making final decisions about language. A trusted professional will also know when a change is needed and will seek guidance when necessary.
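The term harvesting step in the list above (count frequency of use, alphabetize, and set acronyms aside) can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the sample text, the stop list, and the function name `extract_terms` are all hypothetical, and a real project would rely on proper linguistic parsing tools and a much larger body of content.

```python
import re
from collections import Counter

# Hypothetical sample text standing in for content harvested from a repository.
SAMPLE = (
    "The radar team reviewed the SAR imaging report. "
    "Radar calibration and SAR processing were discussed with the FAA. "
    "Calibration notes for the radar are stored with the imaging report."
)

# Small illustrative stop list; a real project would use a fuller one.
STOP_WORDS = {"the", "and", "were", "with", "for", "are", "a", "of", "to"}


def extract_terms(text):
    """Return (term_counts, acronym_counts) for the words found in text."""
    # Words of two or more characters, allowing internal hyphens.
    tokens = re.findall(r"[A-Za-z][A-Za-z\-]+", text)
    # All-uppercase tokens are set aside as candidate acronyms or initialisms.
    acronyms = Counter(t for t in tokens if t.isupper())
    # Everything else is lowercased and counted, minus stop words.
    terms = Counter(
        t.lower() for t in tokens
        if not t.isupper() and t.lower() not in STOP_WORDS
    )
    return terms, acronyms


terms, acronyms = extract_terms(SAMPLE)

# Frequency view: most used terms first.
for term, count in terms.most_common(5):
    print(f"{count:3d}  {term}")

# Alphabetical view: similar word forms end up next to each other.
print(sorted(terms))

# Acronyms kept apart for separate review.
print(dict(acronyms))
```

The two views serve different purposes: the frequency list suggests which terms matter most to the people who wrote the content, while the alphabetical list makes near-duplicates easier to spot by eye.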

Revisit the successes and failures of the applied term store routinely: survey users, review search logs, observe information retrieval bottlenecks and troll for new electronic discourse and content as a source of new terminology. A recent post by taxonomy expert Heather Hedden gives more technical guidance about evaluating and sustaining your taxonomy maintenance.


Taxonomy and Glossaries for Enterprise Search Terminology

Two years ago when I began blogging for the Gilbane Group on enterprise search, the extent of my vision was reflected in the blog categories I defined and expected to populate with content over time. They represented my personal “top terms” that were expected to each have meaningful entries to educate and illuminate what readers might want to know about search behind the firewall of enterprises.
A recent examination of those early decisions showed me where there are gaps in content, perhaps reflecting that some of those topics were:

  • Not so important
  • Not currently in my thinking about the industry
  • Or not well defined

I also know that on several occasions I couldn’t find a good category in my list for a blog I had just written. Being a former indexer and heavy user of controlled vocabularies, on most occasions I resisted the urge to create a new category and found instead the “best fit” for my entry. I know that when the corpus of content or domain is small, too many categories are useless for the reader. But now, as I approach 100 entries, it is time to reconsider where I want to go with blogging about enterprise search.
In the short term, I am going to try to provide entries for scantily covered topics because I still think they are all relevant. I’ll probably add a few more along the way or perhaps make some topics a little more granular.
Taxonomies are never static, and require periodic review, even when the amount of content is small. Taxonomists need to keep pace with current use of terminology and target audience interests. New jargon creeps in, although I prefer to use generic terms that are broadly understood in the technology and business world.
That gives you an idea of some of my own taxonomy process. To add to the entries on terminology (definitions) and taxonomies, I am posting a glossary I wrote for last year’s report on the enterprise search market and recently updated for the Gilbane Workshop on taxonomies. While the definitions were all crafted by me, they are validated through the heavy use of the Google “define” feature. If you aren’t already a user, you will find it highly useful when trying to pin down a definition. At the Google search box, simply type define: xxx xxx (where xxx represents a word or phrase for which you seek a definition). Google returns all the public definition entries it finds on the Internet. My definitions are then refined based on what I learn from a variety of sources I discover using this technique. It’s a great way to build your knowledge-base and discover new meanings.
Glossary Taxonomy and Search-012009.pdf


Taxonomy, Yes, but for What?

The term taxonomy crept into the search lexicon by stealth and is now firmly entrenched. The very early search engines, circa 1972-73, presented searchers with the retrieval option of selecting content using controlled vocabularies from a standardized thesaurus of terminology in a particular discipline. With no neat graphical navigation tools, searches were crafted on a typewriter-like device, painfully typed in an arcane syntax. A stray hyphen, period or space would render the query un-computable, so after deciphering the error message, the searcher would try again. Each minute and each result cost money, so errors were a real expense.
We entered the Web search era bundling content into a directory structure, like the “Yellow Pages,” or organizing query results into “folders” labeled with broad topics. The controlled vocabulary that represented directory topics or folder labels became known as a taxonomic structure, with the early ones at NorthernLight and Yahoo crafted by experts with knowledge of the rules of controlled vocabulary, thesaurus development and maintenance. Google derailed that search model with its simple “search box” requiring only a word or phrase to grab heaps of results. Today we are in a new era. Some people like searching by typing keywords in a box, while others prefer the suggestions of a directory or tree structure. Building taxonomic structures for more than e-commerce sites is now serious business for searches within enterprises where many employees prefer to navigate through the terminology to browse and discover the full scope of what is there.
Navigation is but one purpose for taxonomies in search. Depending on the application domain, richness of the subject matter, scope and depth of topics, these lists can become quite large and complex. The more cross-references (e.g. cell phones USE wireless phones) are embedded in the list, the more likely the searcher’s preferred term will be present. There is a diminishing return, however: if the user has to navigate to a system’s preferred term too often, the entire process of searching becomes unwieldy and is abandoned. On the other hand, if the system automates the smooth transition from one term to another, the richness and complexity of a taxonomy can be an asset.
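To illustrate what automating that transition might look like, here is a minimal Python sketch that follows USE references from a searcher’s term to the system’s preferred term. The cross-reference table and the function name `preferred_term` are hypothetical, made up for this example; only the "cell phones USE wireless phones" pair comes from the text above.

```python
# Hypothetical cross-reference table: entry term -> preferred term.
USE_REFERENCES = {
    "cell phones": "wireless phones",
    "mobile phones": "wireless phones",
    "cellular telephones": "cell phones",
}


def preferred_term(term, use_refs=USE_REFERENCES):
    """Follow USE references until a term with no further reference is reached."""
    current = term.strip().lower()
    seen = set()  # guards against accidental circular references
    while current in use_refs and current not in seen:
        seen.add(current)
        current = use_refs[current]
    return current


print(preferred_term("Cell Phones"))          # wireless phones
print(preferred_term("cellular telephones"))  # wireless phones
print(preferred_term("radar"))                # radar (no cross-reference)
```

With a lookup like this applied behind the search box, the searcher never has to click through from one term to another, so adding more cross-references costs the user nothing.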
In more sophisticated applications of taxonomies, the thesaurus model of relationships becomes a necessity. When a search engine has embedded algorithms that can interpret explicit term relationships, it indexes content according to a taxonomy and all its cross-references. Taxonomy here informs the index engine. It requires substantial maintenance and governance of a much more granular nature than for navigation. To work well, a large corpus of terminology needs to be built to assure that what the content says and means matches what the searcher expects to see in the results. If a search returns unsatisfactory results due to a poor taxonomy, trust in the search system fails rapidly and the benefits of whatever effort was put into building a taxonomy are lost.
I bring this up because the intent of any taxonomy is the first step in deciding whether to start building one. Either model is an on-going commitment but the latter is a much larger investment in sophisticated human resources. The conditions that must be met to have any taxonomy succeed must be articulated in selling the project and value proposition.


Ontologies and Semantic Search

Recent studies describe the negative effect of media including video, television and on-line content on attention spans and even comprehension. One such study suggests that the piling on of content accrued from multiple sources throughout our work and leisure hours has saturated us to the point of making us information filterers more than information “comprehenders”. Hold that thought while I present a second one.
Last week’s blog entry reflected on intellectual property (IP) and knowledge assets and the value of taxonomies as aids to organizing and finding these valued resources. The idea of making search engines better or more precise in finding relevant content is edging into our enterprises through semantic technologies. These are search tools that are better at finding concepts, synonymous terms, and similar or related topics when we execute a search. You’ll find an in-depth discussion of some of these in the forthcoming publication, Beyond Search by Steve Arnold. However, semantic search requires more sophisticated concept maps than a taxonomy. It requires an ontology: a rich representation of a web of concepts, complete with all types of term relationships.
My first comment, about a trend toward just browsing and filtering content for relevance to our work, and my second, about assembling semantically relevant content for better search precision, are two sides of a business problem that hundreds of entrepreneurs are grappling with: semantic technologies.
Two weeks ago, I helped to moderate a meeting on the subject, entitled Semantic Web – Ripe for Commercialization? While the assumed audience was to be a broad business group of VCs, financiers, legal and business management professionals, it turned out to have a lot of technology types. They had some pretty heavy questions and comments about how search engines handle inference and their methods for extracting meaning from content. Semantic search engines need to understand both the query and the target content to retrieve contextually relevant content.
Keynote speakers and some of the panelists introduced the concept of ontologies as being an essential backbone to semantic search. From that came a lot of discussion about how and where these ontologies originate, how and who vets them for authoritativeness, and how their development in under-funded subject areas will occur. There were no clear answers.
Here I want to give a quick definition for ontology. It is a concept map of terminology which, when richly populated, reflects all the possible semantic relationships that might be inferred from different ways that terms are assembled in human language. A subject specific ontology is more easily understood in a graphical representation. Ontologies also help to inform semantic search engines by contributing to an automated deconstruction of a query (making sense out of what the searcher wants to know) and automated deconstruction of the content to be indexed and searched. Good semantic search, therefore, depends on excellent ontologies.
To see a very simple example of an ontology related to “roadway”, check out this image. Keep in mind that before you aspire to implementing a semantic search engine in your enterprise, you want to be sure that there is a trusted ontology somewhere in the mix of tools to help the search engine retrieve results relevant to your unique audience.
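For readers who think in code rather than pictures, a tiny concept map of this kind can be approximated as a list of subject, relationship, object triples. The sketch below is a hypothetical illustration only: the concepts, the relationship names, and the functions `build_index` and `expand_query` are invented for the example and are not taken from the image or from any particular semantic search product.

```python
from collections import defaultdict

# Hypothetical triples (subject, relationship, object) about "roadway".
TRIPLES = [
    ("highway", "is_a", "roadway"),
    ("street", "is_a", "roadway"),
    ("lane", "part_of", "roadway"),
    ("roadway", "used_by", "vehicle"),
    ("truck", "is_a", "vehicle"),
]


def build_index(triples):
    """Index every concept by the relationships it takes part in, both ways."""
    index = defaultdict(list)
    for subject, relation, obj in triples:
        index[subject].append((relation, obj))
        index[obj].append(("inverse_" + relation, subject))
    return index


def expand_query(term, index):
    """Return the term followed by concepts one relationship away, sorted."""
    related = {other for _, other in index.get(term, [])}
    return [term] + sorted(related)


index = build_index(TRIPLES)
print(expand_query("roadway", index))
# ['roadway', 'highway', 'lane', 'street', 'vehicle']
```

Even this toy version shows why the quality of the ontology matters: the search engine can only broaden or interpret a query along relationships that someone has recorded and vetted.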


Taxonomy and Enterprise Search

This blog entry on the “Taxonomy Watch” website prompts me to correct the impression that I believe naysayers who say that taxonomies take too much time and effort to be valuable. Nothing could be further from the truth. I believe in and have always been highly vested in taxonomies because I am convinced that an investment in pre-processing enterprise generated content into meaningfully organized results brings large returns in time savings for a searcher. S/he, otherwise, needs to invest personally in the laborious post-processing activity of sifting and rejecting piles of non-relevant content. Consider that categorizing content well and only once brings benefit repeatedly to all who search an enterprise corpus.
Prime assets of enterprises are people and their knowledge; the resulting captured information can be leveraged as knowledge assets (KA). However, there is a serious problem “herding” KA into a form that results in leverageable knowledge. Bringing content into a focus that is meaningful to a diverse but specialized audience of users, even within a limited company domain, is tough because the language of the content is so messy.
So, what does this have to do with taxonomies and enterprise search, and how they factor into leveraging KA? Taxonomies have a role as a device to promote and secure the meaningful retrievability of content when we need it most or fastest, just-in-time retrieval. If no taxonomies exist to pre-collocate and contextualize content for an audience, we will be perpetually stuck in a mode of having to do individual human filtering of excessive search results that come from “keyword” queries. If we don’t begin with taxonomies for helping search engines categorize content, we will certainly never get to the holy grail of semantic search. We need every device we can create and sustain to make information more findable and understandable; we just don’t have time to both filter and read, comprehensively, everything a keyword search throws our way to gain the knowledge we need to do our jobs.
Experts recognize that organizing content with pre-defined terminology (aka controlled vocabularies) that can be easily displayed in an expandable taxonomic structure is a useful aid for a certain type of searcher. The audience for navigated search is one that appreciates the clustering of search results into groups that are easily understood. They find value in being able to move easily from broad concepts to narrower ones. They especially like it when the categories and terminology are a close match to the way they view a domain of content in which they are subject experts. It shows respect for their subject area and gives them a level of trust that those maintaining the repository know what they need.
Taxonomies, when properly employed, serve triple duty. Exposing them to search engines that are capable of categorizing content puts them into play as training data. Setting them up within content management systems provides a control mechanism and validation table for human assigned metadata. Finally, when used in a navigated search environment, they provide a visual map of the content landscape.
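As a rough illustration of the second duty, the sketch below checks human-assigned tags against a controlled vocabulary, accepting preferred terms, mapping known synonyms to their preferred term, and flagging everything else for review. The vocabulary, its synonyms, and the function name `validate_tags` are hypothetical; a real content management system would have its own mechanism for this.

```python
# Hypothetical controlled vocabulary: preferred term -> accepted synonyms.
VOCABULARY = {
    "radar": {"radar systems"},
    "transportation": {"transport"},
    "weapons systems": set(),
}


def validate_tags(tags, vocabulary=VOCABULARY):
    """Split tags into (accepted preferred terms, rejected tags)."""
    # Reverse lookup from each synonym to its preferred term.
    synonyms = {syn: pref for pref, syns in vocabulary.items() for syn in syns}
    accepted, rejected = [], []
    for tag in tags:
        key = tag.strip().lower()
        if key in vocabulary:
            accepted.append(key)
        elif key in synonyms:
            accepted.append(synonyms[key])
        else:
            rejected.append(tag)  # left as typed, for a human to review
    return accepted, rejected


print(validate_tags(["Radar", "transport", "drones"]))
# (['radar', 'transportation'], ['drones'])
```

Rejected tags are worth keeping rather than discarding, since they are a ready source of candidate terms when the taxonomy comes up for review.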
U.S. businesses are woefully behind in “getting it;” they need to invest in search and surrounding infrastructure that supports search. Comments from a recent meeting I attended reflected the belief that the rest of the world is far ahead in this respect. As if to highlight this fact, a colleague just forwarded this news item yesterday. “On February 13, 2008, the XBRL-based financial listed company taxonomy formulated by the Shanghai Stock Exchange (SSE) was “Acknowledged” by the XBRL International. The acknowledgment information has been released on the official website of the XBRL International (http://www.xbrl.org/FRTaxonomies/)….”.
So, let’s get on with selling the basic business case for taxonomies in the enterprise to ensure that the best of our knowledge assets will be truly findable when we need them.
