IN - Conflation Methods
Terminology
- conflate (general)
- act of fusing or combining
- conflate (IR)
- matching (approx.) morphological variants
- stem (noun)
- prefix, sometimes a root, of a word
- stem (verb)
- reduce a word to its stem form
- stemmer
- program to reduce words to stems
- truncate
- (manual) removal of some ending or suffix
Conflation processes are language dependent; herein we consider English only.
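The difference between truncating and stemming can be sketched in a few lines. This is only an illustration: the function names, the cutoff of 5 characters, and the plural-removal rules are assumptions made here, not a published stemmer, and the rules are deliberately crude (English only).

```python
def truncate(word, n=5):
    """Truncation: keep the first n characters, whatever they are."""
    return word[:n]


def strip_plural(word):
    """Toy weak stemmer (hypothetical rules): remove common plural endings."""
    w = word.lower()
    if w.endswith(("ss", "us")):          # glass, corpus: leave alone
        return w
    if w.endswith("ies") and len(w) > 4:  # libraries -> library
        return w[:-3] + "y"
    if w.endswith("s") and len(w) > 3:    # stems -> stem
        return w[:-1]
    return w


for word in ["generations", "general", "libraries", "glass"]:
    print(word, truncate(word), strip_plural(word))
```

Note that truncation to 5 characters maps both "generations" and "general" to "gener", while the plural remover keeps them apart.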
Measures
- linguistic correctness (should "generations" reduce to "generate"? "disinterested" to "interest"?)
- retrieval effectiveness
- space saving / compression
- speed
- overstemming (conflate when should not)
- understemming (not conflate when should)
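One way to put numbers on overstemming and understemming is to compare a conflation function against hand-made groups of words that ought to conflate. The sketch below assumes such groups exist; the function name stemming_errors, the sample groups, and the two conflation functions tried are all made up for illustration.

```python
from itertools import combinations


def stemming_errors(groups, conflate):
    """Count pairwise errors of `conflate` against hand-made word groups.

    groups: list of lists; words in the same list should conflate.
    Returns (understemmed_pairs, overstemmed_pairs).
    """
    labelled = [(word, gid) for gid, group in enumerate(groups) for word in group]
    under = over = 0
    for (w1, g1), (w2, g2) in combinations(labelled, 2):
        same_stem = conflate(w1) == conflate(w2)
        if g1 == g2 and not same_stem:
            under += 1   # should conflate, did not
        elif g1 != g2 and same_stem:
            over += 1    # should not conflate, did
    return under, over


groups = [
    ["generate", "generates", "generation"],
    ["general", "generally"],
    ["interest", "interests"],
]
print(stemming_errors(groups, lambda w: w[:5]))  # truncation: (0, 6)
print(stemming_errors(groups, lambda w: w))      # no conflation: (5, 0)
```

Truncation to 5 characters never understems here but merges the "generate" and "general" groups; doing nothing never overstems but misses every variant.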
Issues
- Purposes
- increase recall
- reduce size of index
- When: indexing vs. search time
- Who: computer vs. searcher
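The "when" question can be made concrete with a small inverted index. This is a minimal sketch under stated assumptions: the function names, the sample documents, and the stand-in stemmer (keep the first 4 characters) are hypothetical and chosen only to keep the example short.

```python
from collections import defaultdict


def build_index(docs, normalize=lambda w: w):
    """Inverted index: normalized term -> set of document ids."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[normalize(word)].add(doc_id)
    return index


def search_index_time(stemmed_index, term, stem):
    """Index-time conflation: index keys are stems, so stem the query term too."""
    return set(stemmed_index.get(stem(term), set()))


def search_query_time(raw_index, term, stem):
    """Search-time conflation: expand the query to every indexed variant of its stem."""
    target = stem(term)
    hits = set()
    for indexed_term, doc_ids in raw_index.items():
        if stem(indexed_term) == target:
            hits |= doc_ids
    return hits


docs = {
    1: "stemming reduces index size",
    2: "the stemmer stems each word",
    3: "indexes grow with vocabulary",
}
stem = lambda w: w[:4]  # stand-in stemmer, an assumption for this example

raw = build_index(docs)
stemmed = build_index(docs, stem)
print(len(raw), len(stemmed))                    # 13 10
print(search_index_time(stemmed, "stem", stem))  # {1, 2}
print(search_query_time(raw, "stem", stem))      # {1, 2}
```

Both routes retrieve the same documents in this example. Conflating at indexing time also shrinks the index (13 terms down to 10), while conflating at search time leaves the original word forms in the index and pays for the expansion on each query.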
Conflation Methods
Conflation methods vary in how they are done, what the end results are, and how effective they are.
Evaluation of Conflation Methods
Evaluation Issues
- Test collection (esp. vocabulary size)
- Measure(s): E, recall-precision, space saved
- Question / what are compared: stemming vs. not, stemming vs. truncation, method A vs. B
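Two of the measures above are simple to compute. The sketch below assumes that E is van Rijsbergen's measure, E = 1 - (1 + b^2)PR / (b^2 P + R), and that space saved is the fraction by which conflation reduces the number of distinct index terms; the function and parameter names are invented for this example.

```python
def e_measure(precision, recall, beta=1.0):
    """E measure: 0 is best, 1 is worst. With beta = 1, E = 1 - F1."""
    denominator = beta ** 2 * precision + recall
    if denominator == 0:
        return 1.0  # nothing relevant retrieved
    return 1.0 - (1 + beta ** 2) * precision * recall / denominator


def space_saved(distinct_words, distinct_stems):
    """Fraction of index terms removed by conflation."""
    return (distinct_words - distinct_stems) / distinct_words


print(e_measure(0.5, 0.5))    # 0.5
print(e_measure(0.25, 0.75))  # 0.625
print(space_saved(1000, 700)) # 0.3
```

Comparing stemming against no stemming, or method A against method B, then amounts to computing these values on the same test collection for each variant.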
Results
- Stemming saves at least a little space.
- Effectiveness effects are equivocal, perhaps confounded by the many variables that differ between tests, or situation dependent.
- Collection effects may be significant, and interact with methods.
- Truncation is probably not worthwhile.
- A number of algorithms give similar performance.
- Stemming is usually as good as or better than not stemming.
- Full or strong stemming may be worse than weak stemming or plural removal.
- Avoid "stupid" errors in a "smart" process.
- Speed may be the deciding quality.
- In the future, linguistic accuracy may be of greatest importance.