Competency E
E. design, query and evaluate information retrieval systems;
Explication:
Information must be organized so that researchers, students, and the average person can retrieve it. As Meadow notes, “In brief, IR (Information Retrieval) involves finding some desired information in a store of information or a database. Implicit in this view is the concept of selectivity; to exercise selectivity usually requires that a price be paid in effort, time, money, or all three” (Meadow, 2000, p. 2).
“The process of selectively searching for information in a database can be viewed as starting from at least two different points,” explains Meadow. One is that of the searcher or end user; the other is that of the institution or collector of the information, which organizes it in such a way that people will be able to search for it successfully later (Meadow, 2000, p. 4).
A catalog is one way to organize information. Most subject catalogs are set up in a hierarchical manner, which can be limiting. They depend on a vision of what subjects should be subsets of other subjects. For example, if you want to learn about folklore from Bosnia, would you first look under Folklore, subset Europe, further subset Bosnia or would you look for Europe, then Bosnia, and then a subset of folklore?
The World Wide Web, on the other hand, is like an ocean full of all kinds of fish. You throw in a hook and often are not sure what will be found. It may be useful and factual or it may be junk. Part of the job of a librarian or teacher librarian is to teach people how to search so that they find what they are looking for as quickly as possible. As more and more people turn first to the Web for their searches, librarians need to be able to help them learn to use different sorts of databases (e.g., ABC-CLIO for history) and to test the validity of the material they find. It is interesting to see library websites following the marketing design of Amazon.com, offering a Visual Catalog which shows the cover of the book as well as reviews.
Principles of Database Design
In my Information Retrieval (IR) class, LIBR 202, we received a handout titled “Principles for the Design of Rules for a Bibliographic System,” derived from Svenonius (2000, p. 68). It describes the principles implicit in the Anglo-American Cataloging Rules (AACR). These principles include:
- Principle of user convenience, with the subprinciple of common usage. This principle involves keeping the common user in mind when making descriptions and using normal language that the general user would know and understand.
- Principle of representation, with the subprinciple of accuracy. This principle concerns describing an entity “faithfully,” as accurately as possible and in the way the information entity describes itself.
- Principle of sufficiency and necessity, with the subprinciple of significance. This principle has to do with simple descriptions. One should describe only what is necessary to achieve the goal of unambiguous identification and avoid including unnecessary additional information.
- Principle of standardization. This principle states that descriptions of information items should be standardized as much as is practical.
- Principle of integration. This principle states that descriptions for all types of materials (audio recordings, film, books, etc.) should be based on a common set of rules as far as is practical. In other words, one should avoid special case rules as much as is possible.
In Natural Language systems, the language is taken from the title and subject matter of the item, and each relevant word of the text is separately indexed and searchable.
“A Controlled Vocabulary is one for which some authority decides which words or codes are to be used and defines the meanings of these terms or the relationships among them,” explains Meadow (2000, p. 91). Use of a controlled vocabulary cannot ensure that all searchers will know to use a particular term to describe an item. However, each term in the controlled vocabulary “can be explained as to assumed meaning and differentiated from other terms” (Ibid.). This method does remove some of the ambiguity of searchable terms found in the Natural Language method.
In the Classification method, a hierarchical structure is created and then the indexer assigns each item to the appropriate category, near other similar items. Generally, we assign the item an alphanumeric code to denote its membership in a class. For example, in the Dewey Decimal Classification (DDC), 398.2 denotes folk tales or fairy tales, so fairy tales from various cultures and geographical locations will be found in this section. A problem with this system is that it is sometimes up to the cataloger to decide where to shelve or file the item. For example, when I interned in an elementary school library, I found it interesting that using DDC, books on animals were split into at least three sections. There was a section for books on wild animals, divided by geographic region and type of animal (Africa, Lions). Another section had to do with agriculture and held the domestic animals such as horses and cattle. A third section had books on pets such as cats and hamsters. Could not a horse be regarded as an agricultural animal by some users and a pet by others? We get around this problem by following the classification laid out on the title page verso of the book, which shows the Library of Congress code and subject classifications.
Another term found in discussions of IR is Attribute. An attribute is how we designate in symbols some aspect of an actual physical object such as a book, a painting or a film. “It has a name and a value or content” (Meadow, 2000, p. 68). For example, a book called Huckleberry Finn by Mark Twain may be cataloged under Subject = American Fiction, Author = Twain, Mark (or Samuel Clemens), and Title = Huckleberry Finn. It can be searched using the title, the author, or the subject matter. In most public libraries, fiction is shelved by the author’s last name. If you only knew the title Huckleberry Finn, you could type that into the search box of the catalog program and the catalog record would give you the author’s name so that you could look for the book on the shelf.
The symbols for attributes are most often a series of numbers, letters, other symbols, or a combination of all three, such as the Dewey Decimal Classification (DDC) call number 398.2 Th for the fairy tale “Three Billy Goats Gruff.”
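As a minimal sketch of the name-and-value idea, the record below stores each attribute as a named field and lets a searcher match on any one of them. The class name, the `search` helper, the fiction call number "FIC TWA" and the "(traditional)" author value are illustrative assumptions, not taken from any real catalog program.

```python
from dataclasses import dataclass


@dataclass
class CatalogRecord:
    """A surrogate record: each field is one attribute (name = value)."""
    title: str
    author: str
    subject: str
    call_number: str


def search(records, field, text):
    """Return records whose named attribute contains the text (case-insensitive)."""
    needle = text.lower()
    return [r for r in records if needle in getattr(r, field).lower()]


# Hypothetical two-item catalog built from the examples in the text.
catalog = [
    CatalogRecord("Huckleberry Finn", "Twain, Mark", "American Fiction", "FIC TWA"),
    CatalogRecord("Three Billy Goats Gruff", "(traditional)", "Folk Tales", "398.2 Th"),
]

# Knowing only the title, the searcher recovers the author used for shelving.
hits = search(catalog, "title", "huckleberry finn")
print(hits[0].author)
```

Searching on a different attribute only changes the field name passed in, for example `search(catalog, "subject", "folk")`.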
Disambiguation is the process of making a term unambiguous. For example, the term Mercury could pertain to the Roman god, the element or the planet. The search system needs to be set up so that people searching for this term can easily get to the type of Mercury they are looking for (Wikipedia: Disambiguation). This might require an area which defines the different forms of Mercury, so that the person can use parts of the definition to narrow the search results. For example, for an astronomy class, the student could search under “Mercury, planet”.
Precoordination and postcoordination are two approaches to combining index terms for searching. Mortimer Taube adapted the idea of using a set of single words or short phrases (which he called Uniterms) to describe a document’s content. Let us say we describe a document with the separate terms America, Fiction, and Historical, which could be “precoordinated, preformed into a syntactic unit (American historical fiction). In any given controlled vocabulary, only one term would be valid. His idea of entering each term separately into an index allowed the searcher to look for any combination of the terms of interest. This came to be known as postcoordination, where the terms are associated after indexing, at search time, as the needs of the search dictated” (Meadow, 2000, p. 22).
Some of the advantages of using precoordinated terms are that the terms are appropriately specific and may suggest other terms the user would want to search. They are good for “browsing.” The disadvantage is that indexers need a higher degree of training, because more of the burden of coming up with the terms falls on them and on the thesaurus.
For postcoordinated terms, the advantages are that consistent indexing is easier to achieve and that the system is easier to apply and more flexible. It is also easier to computerize. Disadvantages are that the burden of coordination is on the user. One cannot express more complex relationships between terms than AND, OR or NOT (used in the Boolean search). There can be more ambiguity of terms and more false drops in searches.
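The trade-off between the two approaches can be sketched in a few lines of code. This is only an illustration under stated assumptions: the three records, their headings and their Uniterms are hypothetical, and the two search functions are invented for the example rather than taken from any indexing system.

```python
from collections import defaultdict

# Hypothetical records: a precoordinated heading plus separate Uniterms.
records = {
    "novel": {
        "heading": "American historical fiction",
        "uniterms": {"america", "fiction", "historical"},
    },
    "survey": {
        "heading": "American fiction, history and criticism",
        "uniterms": {"america", "fiction", "historical"},
    },
    "atlas": {
        "heading": "American historical geography",
        "uniterms": {"america", "geography", "historical"},
    },
}

# Postcoordinate index: one entry per Uniterm, listing the records that carry it.
index = defaultdict(set)
for record_id, record in records.items():
    for term in record["uniterms"]:
        index[term].add(record_id)


def postcoordinate_search(*terms):
    """Combine Uniterms at search time with Boolean AND (set intersection)."""
    postings = [index.get(term, set()) for term in terms]
    return set.intersection(*postings) if postings else set()


def precoordinate_search(heading):
    """Match the whole heading exactly as the indexer assigned it."""
    return {rid for rid, rec in records.items() if rec["heading"] == heading}


post_hits = postcoordinate_search("america", "fiction", "historical")
pre_hits = precoordinate_search("American historical fiction")
print(sorted(post_hits))  # "survey" is a false drop for someone wanting a novel
print(sorted(pre_hits))
```

The postcoordinated search is more flexible, since any combination of terms can be tried, but it cannot tell a historical novel from a historical survey of fiction. The precoordinated heading can, at the cost of the indexer having chosen it in advance.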
Principles of Querying
Since I began the SLIS program in January 2010, I have become a more sophisticated searcher, partly by performing more searches than the average person on more topics. I have taught students at Berkeley High School to use the ABC-CLIO system for searching historical journals and have spent more time searching computerized databases than at any other point in my life. I have also worked one-on-one with Berkeley High School students on the research for their history papers, pulling books and searching catalogs and book indexes for topical material. I have found that some keywords are too broad (e.g., searching under information AND literacy), some too narrow (Bathory for information on Elizabeth Bathory), and some are dead ends. I have learned to think like a librarian in terms of keywords. Even my Google searches are now often “spot on,” with success at the first trial search.
I have learned how to do a Boolean search and that I will get the biggest set of results using Term A OR Term B. I will get a smaller set with Term A AND Term B, and likewise a smaller set using Term A NOT Term B. I have learned how to choose keywords for searches and that it is often more effective to search for a book by its author than by its title. When I am searching a database for a journal article, I try to limit the results by date of publication as much as possible. I have to admit that I often find a recent journal article more readily by doing a Google search than by checking subscription databases.
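The relative sizes of the three Boolean result sets can be checked with ordinary set operations. The record numbers below are made up for illustration; only the relationships between the sets matter.

```python
# Hypothetical result sets: record numbers retrieved for each term alone.
term_a = {1, 2, 3, 4, 5, 6}   # e.g. records indexed under "information"
term_b = {4, 5, 6, 7, 8}      # e.g. records indexed under "literacy"

a_or_b = term_a | term_b    # OR: records carrying either term (union)
a_and_b = term_a & term_b   # AND: records carrying both terms (intersection)
a_not_b = term_a - term_b   # NOT: records with A but without B (difference)

print(len(a_or_b), len(a_and_b), len(a_not_b))
```

Whatever the data, the AND and NOT sets are always contained in the OR set, which is why OR broadens a search and the other two narrow it.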
Principles of Evaluating a Database
You know a database is working properly when you can easily and efficiently find the item you are searching for. There are two mathematical formulae that can be used to measure the effectiveness of a search: Precision and Recall. The formal definitions were created in 1955 by Kent, Berry, Luehrs, and Perry, “based on our ability to partition the entire database into relevant and irrelevant records and into retrieved and not retrieved records” (Meadow, 2000, p. 322).
Precision is the ratio of the number of relevant records retrieved to the total number of records retrieved. This can usually be determined fairly easily by asking the end user who conducted the search which of the results were relevant. Recall is the ratio of the number of relevant records actually retrieved to the total number of relevant records that could have been retrieved. There have been a number of research projects on the relationship between Precision and Recall. There would seem to be a negative correlation, in that the higher the recall, the lower the precision, and vice versa. The more skilled the searcher, the higher the precision. With care in the search, the recall can also be higher. The ultimate measure of a search is how satisfied the searcher is with the results (Meadow, 2000, p. 333).
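In the standard formulation, precision is (relevant records retrieved) / (all records retrieved) and recall is (relevant records retrieved) / (all relevant records in the database). The sketch below computes both for two hypothetical searches. The record numbers are invented, and returning 0.0 for an empty set is a convention assumed here only to avoid dividing by zero.

```python
def precision(retrieved, relevant):
    """Share of the retrieved records that are relevant."""
    if not retrieved:
        return 0.0  # assumed convention for an empty result set
    return len(retrieved & relevant) / len(retrieved)


def recall(retrieved, relevant):
    """Share of the relevant records that were actually retrieved."""
    if not relevant:
        return 0.0  # assumed convention when nothing is relevant
    return len(retrieved & relevant) / len(relevant)


# Hypothetical database in which records 1-10 are the relevant ones.
relevant = set(range(1, 11))

narrow = {1, 2, 3, 4}                            # careful search: 4 hits, all relevant
broad = set(range(1, 9)) | set(range(11, 23))    # wide search: 20 hits, 8 relevant

for name, result in (("narrow", narrow), ("broad", broad)):
    print(name, precision(result, relevant), recall(result, relevant))
```

In this made-up case the narrow search has perfect precision but finds fewer than half of the relevant records, while the broad search finds most of them at the cost of many irrelevant hits, which is the inverse relationship described above.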
Artifacts:
My first artifact is Exercise 1: Attribute Elicitation from LIBR 202, Information Retrieval (Artifact: Attribute Elicitation from LIBR 202). In the exercise, the class was given a collection of postcards with photographs on them. The assignment was to index them, create a data structure for them, and discuss the challenges of creating the data structure. This was an independent project as opposed to a group project. I enjoyed it because in a former career as an archaeologist I spent time thinking about attributes and classification of pottery. I discussed how one must think about the keywords the “average person” would search for, as opposed to terms better known to a professional in a certain field. This approach breaks down somewhat if you are developing a classification system for an academic library or the library of a specialized business. In those cases, the professional term may have fewer ambiguities.
My second artifact is Assignment 2: Subject Analysis, which was a large project, done in parts, for LIBR 202, Information Retrieval (Artifact: Subject Analysis LIBR 202). I have only selected parts of the project to show. The objective was to design, create and then evaluate a DB/TextWorks database. We used a software program that only allowed you to create a limited number of entries without paying for the program. I created surrogate records for fifteen articles I read over the semester. I had to decide what fields were important and to perform test searches for my records. This culminating project shows my competency at using a simple database program to construct records for journal articles and to construct precoordinated and postcoordinated vocabularies.
My third artifact is my WordPress blog (www.klevenson.wordpress.com). I have used metatags for each posting of a book or film review or my reflections from a class prompt. Thus, the reader can search for books for tweens, or books relating to quests, history, or fishing. Folksonomies are also becoming a popular tool on the web. These depend on how many people categorize an item in a certain way. On a graphic you can see which topical tags are viewed more often. For example, if World War II is in very large letters and the term butterflies is in very small letters, you can tell that more people are searching for World War II on that website than are searching for material on butterflies. In this way, lay people rather than library professionals are creating classifications. This shows competency in understanding metatags and folksonomies, both fairly new tools used on the World Wide Web.
My fourth artifact is a Discussion Post from Week 5 of LIBR 248, Basic Cataloging (Artifact: LIBR 248 Discussion on LOC Search). I performed a Basic Search and a Guided Search using the online Library of Congress catalog, looking for the play No Exit by Jean-Paul Sartre. In the post I discuss which search terms worked best and what some of the unexpected results were. For example, two listings were for theatrical programs, housed in the Rare Book Room. I also mention the tip of searching by author when a title contains very common words. The post shows my competency in performing LOC searches. Part of the course focused on the differences between the organization of topics in the LOC versus the Dewey Decimal (DDC) system.
Conclusion
Information Retrieval is a very important topic in library management, whether in a physical or a virtual library. Users must be able to find the items they are searching for. Most people are short on time, so you want to make the search as quick as possible. Being able to search library catalogs by computer database and from off-site has sped the process up significantly compared with searching through each paper card in a series of small wooden drawers. Having a collection of items but no way of locating them is reminiscent of the final scene in the film Raiders of the Lost Ark, where a crate containing the Ark of the Covenant is wheeled into some back corner of a vast government warehouse, never, probably, to be located again.
References:
Meadow, C.T., Boyce, B.R., and Kraft, D.H. (2000). Text Information Retrieval Systems, 2nd ed. San Diego: Academic Press.
Tucker, V. (2000). Principles for the Design of Rules for a Bibliographic System. Handout derived from Svenonius for LIBR 202, Information Retrieval. San Jose, CA: San Jose State University.
Wikipedia: Disambiguation. http://en.wikipedia.org/wiki/Wikipedia:Disambiguation. Accessed 09/11/2012.
Attribute Elicitation from LIBR 202
An assignment for LIBR 202, Information Retrieval, where I developed a set of attributes for a group of postcards. Then I indexed the postcards, created a data structure with a field name and field values, and discussed the challenges of creating a data structure.
Subject Analysis LIBR 202
This was the culminating project for LIBR 202, Information Retrieval. The project was a Subject Analysis for which I created: A3. User Guide; A4. Data Structure, Rules and Statement of Purpose; A5. Postco and Preco Vocabulary Lists; A6. Database Records for 15 journal articles read over the semester; B. Retrieval Analysis; and C. Evaluation. As the final project was about 30 pages long, I uploaded the title page and a couple of pages from each of the first four sections.
Levenson LIBR 248 Discussion on LOC Search
From the Week 5 Discussion Board for LIBR 248, Basic Cataloging: a discussion of using a Basic Search vs. a Guided Search and of which keywords I found most useful in a search for all copies of Sartre's play, No Exit.