learningmili.blogg.se - String similarity

Looking up a gene/protein dictionary is a common task for both computer systems and researchers in biomedical research. Many of the information extraction systems developed for biomedical documents provide a mapping between gene/protein names found in text and their corresponding identifiers (IDs) in biological databases (Hoffmann and Valencia, 2005; Miyao et al., 2006; Morgan et al., 2004). Databases of genes and proteins usually provide an interface that allows the user to search for the entry of interest using a name.

One of the major obstacles that hinder the effective use of a gene/protein dictionary is the problem of term variation. This is also one of the reasons why text mining systems often fail to find genes or proteins mentioned in the text (Crim et al., 2005; Hanisch et al., 2005; Morgan and Hirschman, 2007; Yeganova et al., 2004). Types of term variation include orthographic variation (e.g. ‘IL2’ and ‘IL-2’), morphological variation (e.g. ‘GHF-1 transcriptional factor’ and ‘GHF-1 transcription factor’), Roman-Arabic (e.g. ‘Synapsin 3’ and ‘Synapsin III’), acronym-definition (e.g. ‘IL-2’ and ‘interleukin-2’), extra words (e.g. ‘Zfp580’ and ‘Zfp580 protein’), different word ordering (e.g. ‘Serotonin receptor 1D’ and ‘Serotonin 1D receptor’) and parenthetical material. Attested term variants often result from a combination of these and can be very complex.

One way to alleviate the problem is to normalize the terms (Fang et al., 2006). For example, converting capital letters to lower case and deleting hyphens and spaces can resolve some of the mismatches caused by orthographic variation. Another approach, which may be employed in conjunction with the normalization approach, is to use soft string matching methods. Soft matching gives similarity scores between strings, which allows us to associate term forms even when they are not identical. Moreover, soft matching can provide the user with multiple candidates that are ranked according to their similarity scores. The effectiveness of soft matching almost exclusively depends on the design of the similarity measure that quantifies the degree of similarity between two given strings.
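
To make the two remedies discussed in this section concrete, here is a minimal Python sketch of term normalization (lowercasing, deleting hyphens and spaces) combined with soft matching that returns ranked candidates. The function names `normalize`, `similarity` and `rank_candidates` are hypothetical, and the similarity measure is only a stand-in: it uses the generic `difflib.SequenceMatcher` ratio from the Python standard library, not a measure designed for gene/protein names.

```python
from difflib import SequenceMatcher


def normalize(term: str) -> str:
    """Lowercase the term and delete hyphens and spaces."""
    return term.lower().replace("-", "").replace(" ", "")


def similarity(a: str, b: str) -> float:
    """Stand-in similarity measure (an assumption, not the source's method):
    difflib's ratio on the normalized strings, in the range [0.0, 1.0]."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def rank_candidates(query: str, dictionary: list[str], top_k: int = 3):
    """Return up to top_k (name, score) pairs, highest score first."""
    scored = [(name, similarity(query, name)) for name in dictionary]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


if __name__ == "__main__":
    # Toy dictionary built from the example terms in the text.
    names = ["IL-2", "Zfp580", "Synapsin III", "Serotonin receptor 1D"]
    for name, score in rank_candidates("Zfp580 protein", names):
        print(f"{score:.2f}  {name}")
```

Normalization alone is enough to match ‘IL2’ with ‘IL-2’, since both reduce to the same string. It does not help with cases such as ‘Synapsin 3’ versus ‘Synapsin III’ or ‘IL-2’ versus ‘interleukin-2’, which is where the choice of similarity measure starts to matter.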