Last night I somehow stumbled on a link to the March 19, 1998 issue of David Weinberger’s JOHO (Journal of the Hyperlinked Organization), where David posits The Death of Documents and the End of Doneness - because of the Web of course - and I disagree that documents are dead. David and I are old friends and I am sure we each had more to say to each other on this topic, but I can’t remember if he ever accepted my corrections to his obviously misguided position, whether he just decided to spare me the embarrassment of pointing out gaping inconsistencies in my argument and gloat privately, or whether we figured out a weaselly way to agree. I have a vague memory of the latter – perhaps in an AIIM publication?
In any case, I was gratified to find that I still agree with my 1998-self, and will check with David to see whether he is the same self he was. You can reach your own conclusions and also have a fun read (if you don’t know him, David is very funny) at http://www.hyperorg.com/backissues/joho-march19-98.html.
See David’s response at http://www.hyperorg.com/blogger/2012/05/22/documents-dead-or-grizzled-survivors/
Aha! We now agree and in a non-squirrely way. You didn’t have to say you were wrong, now I am going to have to admit the same when it’s my turn. :( …Besides, you were only a little wrong…
In the past few months, it is rare that I have been briefed on an enterprise search product without a claim to provide “federated search.” Having worked with the original ANSI standard, Z39.50, on one of its many review committees back in the early 1990s, I find it a topic that always catches my attention.
Some of the history of search federation is described in this rather sketchy article at Wikipedia. However, I want to clarify the original call for such a standard. It comes from the days when public access to search technologies was available primarily through on-line catalogs in public and academic institutional libraries. A demand arose for the ability to search not only one’s local library system and network (e.g. a university often standardized on one library system to include the holdings of a number of its own libraries), but also the holdings of other universities or major public libraries. The problem was that the data structures and protocols varied from one library system product to the next in ways that made it difficult for the search engine of the first system to penetrate the database of records in another system. Records might have been meta-tagged similarly, but the way the metadata were indexed and made accessible to retrieval algorithms differed enough that cross-system retrieval was not possible without a translating layer between systems. Thus, the Z39.50 standard was established, originally to let one library system’s user search from that system into the contents of other libraries running different systems.
Ideally, results were presented to the searcher in a uniform citation format, organized to help the user easily recognize duplicated records, each marked with location and availability. Usually there was a very esoteric results presentation that could only be readily interpreted by librarians and research scholars.
Now we live in a digitized content environment in which the dissimilarities across content management systems, content repositories, publishers’ databases, and library catalogs have increased a hundredfold. The need for federating or translation layers to bring order to this metadata or metadata-less chaos has only become stronger. The ANSI standard is largely ignored by content platform vendors, thus leaving the federating solution to non-embedded search products. A buyer of search must do deep testing to determine whether the enterprise search engine they have acquired actually stands up well under a load of retrieving across numerous disparate repositories. And they need a very astute and experienced searcher with expert familiarity with the content in all the repositories to evaluate suitability for the circumstance in which the engine will be used.
So, let’s just recap what you need to know before you select and license a product claiming to support what you expect from search federation:
- Federated search is a process for retrieving content either serially or concurrently from multiple targeted sources that are indexed separately, and then presenting results in a unified display. You can imagine that there will be a huge variation in how well those claims might be satisfied.
- Federation is an expansion of the concept of content aggregation. It comes into play in a multi-domain environment of only internal sites OR a mix of internal and external sites that might include the deep (hidden) web. Across multiple domains, complete federation supports at least four distinct functions:
o Integration of the results from a number of targeted searchable domains, each with its own search engine
o Disambiguation of content results when similar but non-identical pieces of content might be included
o Normalization of search results so that content from different domains is presented similarly
o Consolidation of the search operation (standardizing a query to each of the target search engines) and standardizing the results so they appear to be coming from a single search operation
In order to do this effectively and cleanly, the federating layer of software, which probably comes from a third party like MuseGlobal, must have “connectors” that recognize the structures of all the repositories that will be targeted from the “home” search engine.
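The four functions above can be sketched in a few lines of code. This is only an illustrative toy, not any real federation product: the two "sources" and their field names are invented stand-ins for the connectors a real federating layer would provide.

```python
# Toy sketch of the four federation functions: integration,
# disambiguation, normalization, and consolidation.
# Both "sources" and all field names are hypothetical.

def search_source_a(query):
    # Each source returns results in its own native format.
    return [{"ttl": "Z39.50 Primer", "loc": "Library A"}]

def search_source_b(query):
    return [{"title": "Z39.50 PRIMER", "repository": "Library B"}]

def normalize(result, source):
    # Normalization: map each source's fields onto one uniform schema.
    if source == "a":
        return {"title": result["ttl"], "source": result["loc"]}
    return {"title": result["title"], "source": result["repository"]}

def federated_search(query):
    # Consolidation: one query fans out to every target engine.
    raw = ([(r, "a") for r in search_source_a(query)] +
           [(r, "b") for r in search_source_b(query)])
    merged, seen = [], set()
    for result, source in raw:
        record = normalize(result, source)
        # Disambiguation: flag near-duplicates by a normalized key.
        key = record["title"].strip().lower()
        record["duplicate"] = key in seen
        seen.add(key)
        merged.append(record)  # Integration: one unified result list.
    return merged

for record in federated_search("z39.50"):
    print(record)
```

Real connectors, of course, must cope with different protocols, authentication, and ranking schemes; the hard part of federation is hidden inside the two stub functions here.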
Why is this relevant? In short, because users expect that when they search, the results they are looking at represent all the content from all the repositories they believed they were searching, presented in a format that makes sense to them. It is a very tall order for any search system, but when enterprise information managers are trying to meet a business manager’s or executive’s lofty expectations, anything less is viewed as the failure of enterprise search. Or else they had better set expectations lower.
Two years ago when I began blogging for the Gilbane Group on enterprise search, the extent of my vision was reflected in the blog categories I defined and expected to populate with content over time. They represented my personal “top terms” that were expected to each have meaningful entries to educate and illuminate what readers might want to know about search behind the firewall of enterprises.
A recent examination of those early decisions showed me where there are gaps in content, perhaps reflecting that some of those topics were:
- Not so important
- Not currently in my thinking about the industry
- OR Not well defined
I also know that on several occasions I couldn’t find a good category in my list for a blog I had just written. Being a former indexer and heavy user of controlled vocabularies, on most occasions I resisted the urge to create a new category and found instead the “best fit” for my entry. I know that when the corpus of content or domain is small, too many categories are useless for the reader. But now, as I approach 100 entries, it is time to reconsider where I want to go with blogging about enterprise search.
In the short term, I am going to try to provide entries for scantily covered topics because I still think they are all relevant. I’ll probably add a few more along the way or perhaps make some topics a little more granular.
Taxonomies are never static and require periodic review, even when the amount of content is small. Taxonomists need to keep pace with current use of terminology and target audience interests. New jargon creeps in, although I prefer to use generic terms broadly understood in the technology and business world.
That gives you an idea of some of my own taxonomy process. To add to the entries on terminology (definitions) and taxonomies, I am posting a glossary I wrote for last year’s report on the enterprise search market and recently updated for the Gilbane Workshop on taxonomies. While the definitions were all crafted by me, they are validated through heavy use of the Google “define” feature. If you aren’t already a user, you will find it highly useful when trying to pin down a definition. At the Google search box, simply type define: xxx xxx (where xxx represents a word or phrase for which you seek a definition). Google returns all the public definition entries it finds on the Internet. My definitions are then refined based on what I learn from a variety of sources I discover using this technique. It’s a great way to build your knowledge base and discover new meanings.
Glossary Taxonomy and Search-012009.pdf
This title by Mike Altendorf in CIO Magazine, October 31, 2008, mystifies me: Search Will Outshine KM. I did a little poking around to discover who he is and found a similar statement by him back in September: “Search is being implemented in enterprises as the new knowledge management and what’s coming down the line is the ability to mine the huge amount of untapped structured and unstructured data in the organisation.”
Because I follow enterprise search for the Gilbane Group while maintaining a separate consulting practice in knowledge management, I am struggling with his conflation of the two terms, or even the migration of one to the other. The search we talk about is a set of software technologies that retrieve content. I’m tired of the debate about the terminology “enterprise search” vs. “behind the firewall search.” I tell vendors and buyers that my focus is on software products supporting search executed within (or from outside looking in) the enterprise, on content that originates from within the enterprise or that is collected by the enterprise. I don’t judge whether the product is for an exclusive domain, content type or audience, or whether it is deployed with the “intent” of finding and retrieving every last scrap of content lying around the enterprise. No product ever does, nor will do, the latter, but if that is what an enterprise aspires to, theirs is a judgment call I might help them re-evaluate in consultation.
It is pretty clear that Mr. Altendorf is impressed with the potential of FAST and Microsoft, and he knows they are firmly entrenched in the software business. But knowledge management (KM) is not now, nor has it ever been, a software product or even a suite of products. I will acknowledge that KM is a messy thing to talk about and the label means many things even to those of us who focus on it as a practice area. It clearly got derailed as a useful “discipline” of focus in the 90s when tool vendors decided to place their products into a new category called “knowledge management.”
It sounded so promising and useful, this idea of KM software that could just suck the brains out of experts and pull the business know-how of enterprises out of hidden and lurking content. We know better, we who try to refine the art of leveraging knowledge by assisting our clients with blending people and technology to establish workable business practices around knowledge assets. We bring together IT, business managers, librarians, content managers, taxonomists, archivists, and records managers to facilitate good communication among many types of stakeholders. We work to define how to apply behavioral business practices and tools to business problems. Understanding how a software product can be helpful in a process, what its potential applications are, and how to encourage usability standards is part of the knowledge manager’s toolkit. It is quite an art, the KM process of bringing tools together with knowledge assets (people and content) into a productive balance.
Search is one of the tools that can facilitate leveraging knowledge assets and help us find the experts who might share some “how-to” knowledge, but it is not, nor will it ever be a substitute for KM. You can check out these links to see how others line up on the definitions of KM: CIO introduction to KM and Wikipedia. Let’s not have the “KM is dead” discussion again!
I am surprised how often various content organizing mechanisms on the Web are compared to the Dewey Decimal System. As a former librarian, I am disheartened to be reminded how often students were lectured on the Dewey Decimal System, apparently to the exclusion of learning about subject categorization schemes. The two complemented each other, but that seems to be a secret to all but librarians.
I’ll try to share a clearer view of the model and explain why new systems of organizing content in enterprise search are quite different from the decimal model.
Classification is a good generic term for defining physical organizing systems. Unique animals and plants are distinguished by a single classification in the biological naming system. So too are books in a library. There are two principal classification systems for arranging books on the shelf in Western libraries: Dewey Decimal and Library of Congress (LC). They each use coding (numeric for Dewey decimal and alpha-numeric for Library of Congress) to establish where a book belongs logically on a shelf, relative to other books in the collection, according to the book’s most prominent content topic. A book on nutrition for better health might be given a classification number for some aspect of nutrition or one for a health topic, but a human being has to make a judgment which topic the book is most “about” because the book can only live in one section of the collection. It is probably worth mentioning that the Dewey and LC systems are both hierarchical but with different priorities. (e.g. Dewey puts broad topics like Religion and Philosophy and Psychology at top levels and LC puts those two topics together while including more scientific and technical topics at the top of the list, like Agriculture and Military Science.)
So why classify books to reside in topic order? It requires a lot of labor to move the collections around to make space for new books. It is for the benefit of the users, to enable “browsing” through the collection, although it may be hard to accept that the term browsing was a staple of library science decades before the internet. Library leaders established eons ago the need for a system of physical organization to help readers peruse the book collection by topic, leading from the general to the specific.
You might ask what kind of help that was for finding the book on nutrition that was classified under “health science.” This is where another system, largely hidden from the public or often made annoyingly inaccessible, comes in. It is a system of categorization in which any content, book or otherwise, can be assigned an unlimited number of categories. Wandering through the stacks, one would never suspect this secret way of finding a nugget in a book about your favorite hobby if that book was classified to live elsewhere. The standard lists of terms for further describing books under multiple headings are called “subject headings,” and you had to use a library catalog to find them. Unfortunately, they contain mysterious conventions called “sub-divisions,” designed to pre-coordinate any topic with other generic topics (e.g. Handbooks, etc. and United States). Today we would call these generic subdivision terms “facets.” One reflects a kind of book and the other reveals the geographical scope covered by the book.
With the marvel of the Web page, hyperlinking, and “clicking through” hierarchical lists of topics we can click a mouse to narrow a search for handbooks on nutrition in the United States for better health beginning at any facet or topic and still come up with the book that meets all four criteria. We no longer have to be constrained by the Dewey model of browsing the physical location of our favorite topics, probably missing a lot of good stuff. But then we never did. The subject card catalog gave us a tool for finding more than we would by classification code alone. But even that was a lot more tedious than navigating easily through a hierarchy of subject headings, narrowing the results by facets on a browser tab and further narrowing the results by yet another topical term until we find just the right piece of content.
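The difference between a single shelf position and faceted narrowing can be made concrete with a toy catalog. In this sketch, the records, subject terms, and facet names are all invented; the point is only that each record carries many subject terms and facets, and a query can require all of them at once, which one Dewey shelf position could never do.

```python
# Toy faceted narrowing: each record carries multiple subject terms
# plus generic facets ("form", "region"). All data here is invented.

catalog = [
    {"title": "Eating Well", "subjects": {"nutrition", "health"},
     "form": "handbook", "region": "United States"},
    {"title": "World Diets", "subjects": {"nutrition"},
     "form": "survey", "region": "global"},
]

def narrow(records, subjects=(), **facets):
    # Keep only records carrying every requested subject heading...
    results = [r for r in records if set(subjects) <= r["subjects"]]
    # ...then narrow further by each facet value in turn.
    for facet, value in facets.items():
        results = [r for r in results if r.get(facet) == value]
    return results

# Handbooks on nutrition in the United States for better health:
hits = narrow(catalog, subjects=["nutrition", "health"],
              form="handbook", region="United States")
print([r["title"] for r in hits])
```

Only the record meeting all four criteria survives, no matter which subject or facet you narrow by first, which is exactly the order-independence that browsing physical shelves lacks.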
Taking the next leap we have natural language processing (NLP) that will answer the question, “Where do I find handbooks on nutrition in the United States for better health?” And that is the Holy Grail for search technology – and a long way from Mr. Dewey’s idea for browsing the collection.
This one almost slipped right past me but I see we are in another shoot-out in the naming of search market segments. Probably it is because we have too many offerings in the search industry. When any industry reaches a critical mass, players need to find a way to differentiate what they sell. Products have to be positioned as, well, “something else.”
In my consulting practice “knowledge management” has been hot (1980s and 90s), dead (late ’90s and early 2000s), relevant again (now). In my analyst role for “enterprise search” Gilbane has been told by experts that the term is meaningless and should be replaced with “behind the firewall search,” as if that clarifies everything. Of course, marketing directories might struggle with that as a category heading.
For the record, “search” has two definitions in my book. The first is a verb referring to the activity of looking for anything. The second, newer, definition is a noun referring to technologies that support finding “content.” Both are sufficiently broad to cover a lot of activities, technologies and stuff. “Enterprises” are organizations of any type (for-profit, not-for-profit, or government) in which business is being conducted. Let us quibble no more.
But I digress. In any number of press releases, Endeca has broadened its self-classification, referring to products that were “search” products last year as “information access software.” This is the major category used by IDC to include “search.” That’s what we called library systems in the 1970s and 80s. New products still aim for accessing content, albeit with richer functions and features, but where are we going to put them in our family of software lists? One could argue that Endeca’s products are really a class of “search,” search on steroids, a specialized form of search. What are the defining differentiators between “search software” and “information access software?” When does a search product become more than it was, or narrower, refined in scope? (This is a rhetorical question, but I’m sure each vendor in this new category will break it out for me in their own terms.)
Having just finished reviewing the market for enterprise search, I believe that many of the products are reaching for the broader scope of functionality defined by IDC as being: search and retrieval, text analytics, and BI. But are they really going to claim to be content management and data warehousing software, as well? Those are included in IDC’s definition of “information access software.” Maybe we are going back to single-vendor platforms with everything bundled and integrated. Sigh… it makes me tired, trying to keep up with all this categorizing and redefining.
Steve Arnold’s Beyond Search report is finally launched and ready for purchase. Reviewing it gave me a different perspective on how to look at the array of 83 search companies I am juggling in my upcoming report: Enterprise Search Markets and Applications. For example, technological differentiators can channel your decisions about must-haves/have-nots in your system selection. Steve codifies considerations and details 15 technology tips that will help you frame those considerations.
We are getting ready for the third Gilbane Conference in which “search” has been a significant part of the presentation landscape, in San Francisco, June 17 – 20th. Six sessions will be filled with case studies and enlightening “how-to-do-it-better” guidance from search experts with significant “hands-on” experience in the field. I will be conducting a workshop, immediately after the conference, How to Successfully Adopt and Deploy Search. Presentations by speakers and the workshop will focus on users’ experiences and guidance for evaluating, buying and implementing search. Viewing search from a usage perspective calls for a different set of classification criteria for divvying up the products.
In February, Business Trends published an interview I gave them in December, Revving up Search Engines in the Enterprise. There probably isn’t much new in it for those who routinely follow this topic but if you are trying to find ways to explain what it is, why and how to get started, you might find some ideas for opening the discussion with others in your business setting. The intended audience is those who don’t normally wallow in search jargon. This interview pretty much covers the what, why, who, and when to jump into procuring search tools for the enterprise.
For my report, I have been very pleased with discussions I’ve had with a couple dozen people immersed in evaluating and implementing search for their organizations. Hearing them describe their experiences has suggested other ways to organize a potpourri of search products and how buyers should approach their selection. With over eighty products we have a challenge in how to parse the domain. I am segmenting the market space into multiple dimensions, from the content type being targeted by “search” to the packaging models the vendors offer. By laying out a simple “ontology” of concepts surrounding the search product domain, I hope to clarify why there are so many ways of grouping the tools and products being offered. If vendors read the report to decide which buckets they belong in for marketing, and buyers are able to sort out the type of product they need, the report will have achieved one positive outcome. In the meantime, read Frank Gilbane’s take on the whole topic of “enterprise” tacked onto any group of products.
As serendipity would have it, a colleague from Boston KM Forum, Marc Solomon, just wrote a blog on a new way of thinking of the business of classifying anything, “Word Algebra.” And guess who gave him the inspiration, Mr. Search himself, Steve Arnold. As a former indexer and taxonomist I appreciate this positioning of applied classification. Thinking about why we search gives us a good idea for how to parse content for consumption. Our parameters for search selection must be driven by that WHY?
Called to account for the nomenclature “enterprise search,” which is my area of practice for The Gilbane Group, I will confess that the term has become as tiresome as any other category to which the marketplace gives full attention. But what is in a name, anyway? It is just a label and should not be expected to fully express every attribute it embodies. A year ago I defined it to mean any search done within the enterprise with a primary focus of internal content. “Enterprise” can be an entire organization, division, or group with a corpus of content it wants to have searched comprehensively with a single search engine.
A search engine does not need to be exclusive of all other search engines, nor must it be deployed to crawl and index every single repository in its path to be referred to as enterprise search. There are good and justifiable reasons to leave select repositories un-indexed that go beyond even security concerns, implied by the label “search behind the firewall.” I happen to believe that you can deploy enterprise search for enterprises that are quite open with their content and do not keep it behind a firewall (e.g. government agencies, or not-for-profits). You may also have enterprise search deployed with a set of content for the public you serve and for the internal audience. If the content being searched is substantively authored by the members of the organization or procured for their internal use, enterprise search engines are the appropriate class of products to consider. As you will learn from my forthcoming study, Enterprise Search Markets and Applications: Capitalizing on Emerging Demand, and that of Steve Arnold (Beyond Search) there are more than a lot of flavors out there, so you’ll need to move down the food chain of options to get it right for the application or problem you are trying to solve.
OK! Are you yet convinced that Microsoft is pitting itself squarely against Google? The Yahoo announcement of an offer to purchase for something north of $44 billion makes the previous acquisition of FAST for $1.2 billion pale. But I want to know how this squares with IBM, which has a partnership with Yahoo in the Yahoo edition of IBM’s OmniFind. This keeps the attorneys busy. Or maybe Microsoft will buy IBM, too.
Finally, this dog fight exposed in the Washington Post caught my eye, or did one of the dogs walk away with his tail between his legs? Google slams Autonomy – now, why would they do that?
I had other plans for this week’s blog but all the Patriots Super Bowl talk puts me in the mode for looking at other competitions. It is kind of fun.
It has been a week since the annual Gilbane Boston 2007 Conference closed and I am still searching for the most important message that came out of Enterprise Search and Semantic Web Technology sessions. There were so many interesting case studies that I’ll begin with a search function that illustrates one major enterprise search requirement – aggregation.
Besides illustrating a business case for aggregating disparate content using search, the case studies shared three themes:
> Search is just a starting point for many business processes
> While few very large organizations present all of their organization’s content through a single portal, the technology options to manage such an ideal design are growing and up to supporting entire enterprises
> All systems were implemented and operational for delivering value in less than one year, underscoring the trend toward practical and more out-of-the box solutions
Here is a brief take on what came out of just the first two of seven sessions.
> Use of ISYS to manipulate search results and function as a back-office data analysis tool for DirectEDGAR, the complete SEC filings, presented by Prof. Burch Kealey of the University of Nebraska. Presentation
> Support for search by serendipity across the shareable content domains of members of a trade association (ARF) by finding results that satisfy the searcher in his pursuit of understanding with Exalead, presented by Alain Heurtebise CEO of Exalead. Presentation
> A knowledge portal enabling rapid and efficient retrieval of the complete technical documentation for field service engineers at Otis Elevator to meet rapid response goals when supporting customers using a customized implementation of dtSearch, presented by project consultant Rob Wiesenberg of Contegra Systems, Inc. Presentation
Large solutions calling for search across multi-million record domains:
> Hosted Vivisimo solution federating over 40 million documents across 22,000 government web sites, with search results clustered; it records over a half million page views per day on http://USA.gov and was deployed in 8 weeks, presented by Vivisimo co-founder Jerome Pesenti. Presentation
> Intranet knowledge portal for improving customer services by enabling access to internal knowledge assets (over half a million customer cases with all their associated documents) at USi (an AT&T company) using Endeca, a search product USi had experience deploying and hosting for very large e-commerce catalogs, presented by development leader Toby Ford of USi. With one developer it was running in six months. Presentation
> Within a large law firm (Morrison Foerster) and the legal departments of two multi-national pharmaceutical companies (Pfizer and Novartis), Recommind aggregates and indexes content for numerous internal application repositories, file shares and external content sources for unified search across millions of documents, contributing a direct ROI in saved labor by ensuring that required documents are retrieved in a single search process. Presentation
In each of these cases, content from numerous sources was aggregated through the crawling and indexing algorithms of a particular search engine pointed at a bounded and defined corpus of content, with or without associated metadata to solve a particular business problem. In each case, there were surrounding technologies, human architected design elements, and interfaces to present the search interface and results for a predefined audience. This is what we can expect from search in the coming months and years, deployments to meet specialized enterprise needs, an evolving array of features and tools to leverage search results, and a rapid scaling of capabilities to match the explosion of enterprise content that we all need to find and manipulate to do our jobs.
Next week, I will reconstruct more themes and messages from the conference.
Structured search (noun) was rooted firmly in the enterprise when publishers of print index resources (e.g. Chemical Abstracts, Index Medicus from the National Library of Medicine, GRA&I from the National Technical Information Service) became available on-line in the early 1970s. The Systems Development Corporation launched ORBIT, developed by a team led by Carlos Cuadra. ORBIT was a command-driven search tool accessible to professional searchers. In those days searchers were usually special librarians in corporations, large public libraries, government agencies and major universities. Using the ORBIT command language through a terminal connected by a phone line to remote large computers, librarians would type search commands to find data in specific structured fields. These remote computers held electronic versions of paper indices. Citations resulting from a query for specific chemical compounds, diseases, or government reports would contain the information needed to retrieve articles, patents or books from library shelves.
Corporations spent hundreds of thousands of dollars each year to access external specialized and structured indices, and the journals, conference proceedings, patents and government documents to which the indices pointed. Hard copy (paper or microform) was the only practical way to read content. Computer screens were not accessible to most researchers and, even if they had been, content could not be rendered on them in easily readable forms. Also, until computer storage technologies became cheap, indexing large amounts of text (full-text, or unstructured content) was not affordable.
Even with the advent of graphical interfaces, searching for non-specialists made only minor advances in the early 1980s, when library systems offered index browsing to find citations. Library users still needed to read content in hard copy. It was only in the late 1980s and early 90s that full-text content began to be searchable by large numbers of library users on CD-ROMs. Users would go to a library computer, which held multiple CD-ROMs containing journals and other subscriptions, and use a menu to find content on the CD-ROMs by typing keywords that would look through all the content to find matches. This was the first routine use of full-text searching by library users.
These technologies are just memories for a few of us, and unknown to most, but they do point to the differentiation between structured and unstructured searching. Both have been around for a couple of decades but it has taken Web search engines to put search in the hands of everyone. Only recently is frustration with retrieving buckets of unfiltered content pushing enterprises to reconfirm the added value of structured searching.
Technical and business users are appreciating the value of being able to search for a precise title, all documents contributed to a specific project, or all presentations delivered by the CEO in the past two years. Each of these searches requires a defined set of data points, stored with the content and retrievable with a search interface that can support the “structured” query.
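Each of those example queries is really a fielded query against stored metadata, and a small sketch makes the mechanics plain. The documents, field names, and helper function below are all invented for illustration; the point is only that a “structured” query matches defined data points rather than scanning free text.

```python
# Toy "structured" (fielded) search over content stored with defined
# metadata points. All documents and field names here are invented.

from datetime import date

documents = [
    {"title": "Q3 Results", "type": "presentation", "author": "CEO",
     "project": "Apollo", "date": date(2007, 9, 1)},
    {"title": "Q3 Results", "type": "memo", "author": "CFO",
     "project": "Apollo", "date": date(2006, 10, 1)},
]

def structured_search(docs, since=None, **fields):
    # Match each named metadata field exactly...
    hits = [d for d in docs
            if all(d.get(k) == v for k, v in fields.items())]
    # ...and optionally bound the result set by date.
    if since is not None:
        hits = [d for d in hits if d["date"] >= since]
    return hits

# "All presentations delivered by the CEO in the past two years":
hits = structured_search(documents, type="presentation",
                         author="CEO", since=date(2006, 1, 1))
print([d["title"] for d in hits])
```

Note that both documents share the same title; it is the defined fields (type, author, date) that let the query return exactly the intended one, which is the precision a full-text keyword match cannot guarantee.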
Yes, librarians have been here before but, just now, the rest of the organization is learning how they managed to get such good search results all along. Structured searching is now a lot simpler than it was in the 1970s. It is only one aspect in enterprise search but it is an important requirement for most enterprise users when they need reliable and clearly defined search results. And, by the way, Carlos is still around building systems for enterprises to manage and search their critical proprietary content.