Competency E
Design, query and evaluate information retrieval systems.
Introduction
Libraries hold the records of human knowledge and have proudly served as its depository for thousands of years. Yet information is only as good as it is useful. So how do we control, provide access to, and expand upon all that recorded information? Technology has multiplied the sheer quantity of recorded information, yet many of the classification principles that emerged before the Internet are still in use today, such as controlled vocabulary, pre- and post-coordinate systems, and Boolean operators. Information science, also known as informatics, is defined as the “scientific study of gathering, manipulation, classification, storage and retrieval of recorded knowledge” (Webster’s Dictionary, 2001), and the information profession continues to create and apply sophisticated classification and information retrieval systems (IRS) to accommodate the growing number of formats and users.
Many standards apply to cataloging and classification, often determined by the niche or format being classified. Bibliographic control standards are articulated in the Anglo-American Cataloguing Rules (AACR2), while specific types of formats use metadata standards, such as Dublin Core or Encoded Archival Description (EAD). Subject access points are described through the Dewey Decimal Classification (DDC) or the Library of Congress Classification (LCC). Standards seek to bring collaboration and uniformity to an otherwise complex goal, which is to give enough information so that the searcher reading the descriptions can tell whether the item is a fair match to what he or she had in mind when formulating the search (Chan, p.13). That goal is further complicated by formats, search behaviors and local cataloging procedures. The Association for Library Collections & Technical Services (ALCTS), a division of the American Library Association (ALA), promotes cataloging, acquisition, organization and management through education, publication, and collaboration.
Designing of Information Retrieval Systems
Designing an information retrieval system requires in-depth knowledge of its users, whose needs and abilities must be considered: do they want to browse for results, or do they want specific responses to a query? For example, Yahoo was the leading search tool of the 1990s, built upon a directory search platform, but it has been all but eclipsed by Google’s query-specific design. Defining the problem is essential to building the IRS, and the designer must decide early on how the records will relate to one another: will they use a simple one-to-one relationship, or a more complicated many-to-many relationship? There are many different points of entry into the “bibliographic universe” (Tillett, p.2), which has both breadth and depth. Understanding how information seekers organize their choices in the IRS (whether by category, subgroup, word, iteration, or topic) is an important consideration when designing a database. Good design relies upon the information-seeking behaviors of the searcher, along with the creativity, logic, insight and high tolerance for uncertainty characteristic of the designer.
Language is essentially the database’s product: how does a designer represent the many variations of language, and how do the different methods of searching address these nuances? Authority control is the process of defining entry access points into the information system, and aids in identifying, collocating, evaluating, selecting, and locating the information represented (Taylor, in Haycock, p.103-4). Bibliographic control has been developed as a standard entry point to deliver recall and precision of information (Chan, 2007, p.12). Natural language searching is a popular means of searching, exemplified best in search engines such as Google, Bing, Yahoo, and many others. Keyword searching is another access point: the system searches fields of the bibliographic record, such as title, author or abstract, and returns results based upon the keyword or free text entered. Controlled vocabulary, by contrast, is a surrogate term designated by the designer that best represents the term or subject and links all records bearing that term (Chan, p.195). An example of a controlled vocabulary database would be Dialog, which provides extensive thesauri and subject heading lists, and often requires many years of experience with the controlled vocabulary (CV) to search its very powerful databases.
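To make the idea of a controlled vocabulary concrete, the following is a minimal Python sketch, not a description of how Dialog or any real catalog is built. The authority file, the headings, and the three records are all hypothetical. It shows variant terms entered by a searcher being mapped to one preferred heading, which then collocates every record bearing that heading.

```python
# Hypothetical authority file: each variant term points to one preferred heading.
AUTHORITY_FILE = {
    "car": "Automobiles",
    "cars": "Automobiles",
    "auto": "Automobiles",
    "automobile": "Automobiles",
    "film": "Motion pictures",
    "movies": "Motion pictures",
}

# Hypothetical records, each indexed under a single preferred heading.
RECORDS = [
    {"title": "A History of the Motor Car", "heading": "Automobiles"},
    {"title": "Silent Cinema", "heading": "Motion pictures"},
    {"title": "Engines Explained", "heading": "Automobiles"},
]


def preferred_heading(term):
    """Map a searcher's term to its preferred heading, or None if unknown."""
    return AUTHORITY_FILE.get(term.strip().lower())


def search_by_heading(term):
    """Return the titles of all records that share the term's preferred heading."""
    heading = preferred_heading(term)
    if heading is None:
        return []
    return [record["title"] for record in RECORDS if record["heading"] == heading]


print(search_by_heading("Cars"))
print(search_by_heading("bicycle"))
```

In this sketch a search for "Cars" and a search for "auto" lead to the same two records, while a term missing from the authority file retrieves nothing, which mirrors the experience a searcher needs with a controlled vocabulary.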
Pre-coordinate and post-coordinate indexing are important concepts in the organization and representation of information. The pre-coordinate system relies upon a single heading, assembled in advance by the indexer, to sum up the entire concept of a work. It was used particularly for the card catalog, where extensive cross-referencing could not be included simply due to space and resources. Post-coordination, particularly in the age of the Internet and computer search engines, allows the searcher to combine simple terms from any vocabulary at search time (using Boolean and other operators), with a virtually unlimited number of cross-references and combinations.
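The difference between the two approaches can be sketched in a few lines of Python. This is an illustration under stated assumptions: the three records, their heading strings, and the function names are hypothetical. The pre-coordinate lookup only succeeds when the searcher supplies the heading exactly as the indexer built it, while the post-coordinate lookup combines single terms in any order at search time.

```python
from collections import defaultdict

# Hypothetical records. "precoordinated" is a heading string built by the
# indexer; "terms" are single concepts assigned independently of one another.
RECORDS = {
    1: {"precoordinated": "Libraries--History--19th century",
        "terms": {"libraries", "history", "19th century"}},
    2: {"precoordinated": "Libraries--Automation",
        "terms": {"libraries", "automation"}},
    3: {"precoordinated": "Museums--History",
        "terms": {"museums", "history"}},
}


def precoordinate_lookup(heading):
    """Find records filed under one exact, indexer-built heading."""
    return sorted(rid for rid, rec in RECORDS.items()
                  if rec["precoordinated"] == heading)


def build_inverted_index(records):
    """Map every single term to the set of record ids that carry it."""
    index = defaultdict(set)
    for rid, rec in records.items():
        for term in rec["terms"]:
            index[term].add(rid)
    return index


def postcoordinate_lookup(index, *terms):
    """Combine single terms at search time (an implicit Boolean AND)."""
    if not terms:
        return []
    sets = [index.get(term, set()) for term in terms]
    return sorted(set.intersection(*sets))


index = build_inverted_index(RECORDS)
print(precoordinate_lookup("History--Libraries"))           # wrong order: no match
print(postcoordinate_lookup(index, "history", "libraries"))  # order does not matter
```

The design choice worth noting is that post-coordination moves the work of combining concepts from the indexer to the searcher and the system, which is why it depends on technology and encoded metadata.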
Querying of Information Retrieval Systems
Querying is the process of requesting information within a created system, retrieving information based upon recall and precision to bring together variant forms and related terms. To improve recall and precision, Boolean operators (AND, OR, NOT) are applied to the search to limit or expand the combination of subjects. Truncation is another important tool when querying a database. It works differently per database, but usually involves entering the first 3-5 letters, or root word, of the term followed by an asterisk (*). This forces the system to search for all variations of the term. For example, truncating friend* can retrieve friends, friendship, friendly, etc.
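The Boolean operators and truncation described above can be sketched with Python sets. The four-record collection and the match function are hypothetical, and real databases differ in their truncation symbols and rules, so this is only an illustration of the logic.

```python
# Hypothetical mini-collection: record id -> words appearing in that record.
DOCS = {
    1: {"friends", "school"},
    2: {"friendship", "letters"},
    3: {"friendly", "dogs"},
    4: {"dogs", "school"},
}


def match(term):
    """Return ids of records matching a term; a trailing * truncates the term."""
    if term.endswith("*"):
        stem = term[:-1]
        return {rid for rid, words in DOCS.items()
                if any(word.startswith(stem) for word in words)}
    return {rid for rid, words in DOCS.items() if term in words}


# Truncation expands the search to every variant of the root word.
print(sorted(match("friend*")))                    # [1, 2, 3]

# AND narrows, OR broadens, NOT excludes.
print(sorted(match("friend*") & match("dogs")))    # AND -> [3]
print(sorted(match("friend*") | match("school")))  # OR  -> [1, 2, 3, 4]
print(sorted(match("friend*") - match("dogs")))    # NOT -> [1, 2]
```

Set intersection, union, and difference stand in for AND, OR, and NOT here, which makes it easy to see why AND tends to raise precision while OR and truncation tend to raise recall.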
Evaluation of Information Retrieval Systems
Successful system development requires constant evaluation, and may require input from patrons or users in order to identify problems or weaknesses and to gauge how users are really using the system. Relevancy is an important measure in evaluating an IRS, and within that measure, recall and precision indicate how well the search engine or database is working. Recall is how close the system gets to retrieving “all of the relevant documents”, while precision is how close it gets to retrieving “only the relevant documents” (Taylor, in Haycock, p.123). Relevancy can have variant meanings. For example, search engines such as Google or Bing deliver results ranked by relevancy, but the degree to which a result matches the query is often debatable and depends upon multiple factors, such as popularity, proximity, and often commercial bias.
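Recall and precision are commonly expressed as ratios: recall is the number of relevant documents retrieved divided by all relevant documents in the collection, and precision is the number of relevant documents retrieved divided by all documents retrieved. A minimal Python sketch follows; the function name and the sample document ids are hypothetical.

```python
def recall_and_precision(retrieved, relevant):
    """Compute (recall, precision) for one search.

    retrieved: ids the system returned
    relevant:  ids a human judge considers relevant to the query
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant  # relevant documents that were actually retrieved
    recall = len(hits) / len(relevant) if relevant else 0.0
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    return recall, precision


# Hypothetical search: 4 documents retrieved, 5 relevant documents exist,
# and 2 of the retrieved documents are among the relevant ones.
recall, precision = recall_and_precision({1, 2, 3, 4}, {2, 4, 6, 8, 10})
print(recall, precision)  # 0.4 0.5
```

In this example the system found two of the five relevant documents (recall of 0.4), and half of what it returned was relevant (precision of 0.5). Both figures depend on a human judgment of what counts as relevant, which is why relevancy remains debatable.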
Evidence
As my first piece of evidence, I demonstrate my understanding of the elements required to design a database, primarily the pre-coordinate and post-coordinate indexing used to cross-reference and classify terms. In LIBR 202 Information Retrieval (LIBR 202-Pre & Post-Co Group Assign), a group of four was assigned subject headings and had to designate post-coordinate and pre-coordinate terms for each subject. I was assigned to the pre-coordinate team, and through this exercise we as a group came to appreciate the difficulty of assigning subject terms within the different systems. The project particularly highlighted the difficulty of establishing subject authority for subjective terms. Post-coordination yielded a vastly expanded depth and breadth of information possibilities within the index and subject fields, though it requires technology and encoded metadata to process.
As my second piece of evidence, I provide an artifact that demonstrates my ability to evaluate a search system by contrasting two major image repositories, ArtStor and Flickr.com (LIBR 202-Evaluate Flickr). The comparison began as an exploration of methods for identifying and creating an image bibliography, but evolved into a contrast between controlled vocabulary, as used in ArtStor, and tagging or user folksonomies, as used in Flickr. The merits of both were discussed, but the linking capabilities of Flickr revealed a robustness and a new natural language that contributed to its success and depth.
Conclusion
Designing, querying and evaluating information retrieval systems are essential duties of the information professional. Design must reflect what the user needs now and account for trends and expectations in use. Querying relies on principles of language, bibliographic control, controlled vocabulary and natural language, and on vigilant evaluation of the database.
References
Chan, L. M. (2007). Cataloging and classification: An introduction. Lanham, MD: The Scarecrow Press, Inc.
Haycock, K., and Sheldon, B. E. (2008). The portable MLIS: Insights from the experts. Westport, CT: Libraries Unlimited.
Hock, R. (2012). The extreme searcher’s Internet handbook: A guide for the serious searcher, 3rd ed. Medford, NJ: CyberAge Books.