Taxonomic Concept Standard Workshop, Edinburgh 12/05/2004
Key points
- General agreement that the overall structure of the schema seems appropriate and suitable for different groups.
- Some concern over the ultimate meaningfulness of comparing widely disparate treatments using concept models based on bibliography, specimens and descriptions.
- Recognition that Metadata elements should be unified with current SDD/ABCD metadata coordination, Voucher elements should reuse ABCD elements, and CharacterCircumscription elements should reuse SDD.
- Protocol issues need separate consideration prior to TDWG. Three main transfer modes. First is simple point-to-point transfer of data sets between applications (export/import), second is browse mode similar to SPICE interface or current SEEK test implementation, third (probably) is a DiGIR-style search interface to locate concepts using any of the included elements (but some additional complexity for DiGIR in that document structure includes three top-level containers). GBIF to liaise with SPICE and SEEK to try to establish a common model for the second mode. More work needed to understand what may be needed for a DiGIR-style interface.
- Discussion of GUIDs. SEEK currently prototyping a system based on Handle system, particularly because of its acceptability to journal community. LSIDs remain a clear alternative candidate.
- Continuing discussion of what events and judgments lead to new concepts. Unclear whether this needs any standardisation to make it useful.
- Main point in schema requiring immediate general input is determination of appropriate vocabulary of relationship types between concepts to ensure that different groups can map their synonymy (and other) relationships to the standard while still guaranteeing that it is possible to perform set logic on concepts in all suitable situations.
- GUIDs should be associated with concepts to ensure that common concepts can be related to one another. Grand challenge here is probably in managing identifiers for publications. If we can control identifiers for these, concept identifiers become rather trivial (similar issue to controlling identifiers for collections as a means to simplify specimen identifiers).
- Important to get as much testing of transformation of individual data sets into the schema (and importing of data from schema) as is possible now. Recommendation for any projects represented to investigate prototyping use of the standard prior to TDWG 2004 to ensure that the standard is convincing.
- Next steps must include assimilation of immediate comments on schema and then circulation in wider community for comments and buy-in.
JessieKennedy, “Why do we need a taxonomic concept transfer standard and for whom?”
· Most taxonomy databases are name based. They are a major source of on-line taxonomic information. How to relate names not on synonymy lists?
· Some databases model taxonomic concepts, but different concepts, not much data.
· Taxonomic concepts needed for serious communication about taxa and to match names from disparate sources
· What a taxonomic concept is depends on perspective and usage. Need a common definition.
· GBIF/SEEK funding
· Consultation with major taxonomic database developers. Determine similarities and differences. Amend/extend abstract model into transfer schema. Follow-on consultation and final version TDWG October 2004.
· Berlin Model, GBIF, IPNI/APNI, ITIS, Nomencurator, Prometheus, SEEK, Species 2000/BDWorld, Taxonomer, VegBank
- Berlin Model, Prometheus: Revisionary taxonomist perspective, full classification hierarchy for a group, explicit opinion on concept synonymy (Berlin), information on type and non-type specimens (Prometheus).
- Taxonomer, Nomencurator: Taxonomy as recorded in publications, taxonomic assertions, recording any useful statement regarding a taxon in any publication.
- Species 2000, ITIS, BDWorld: Species-focused taxonomy, unclear what the concept for each name is, uses experts to agree on what species concepts exist and what their valid names are.
- IPNI, APNI: Name-based taxonomy, recording all published names with validity
- GBIF, SEEK, VegBank: Database taxonomy, taxonomic assertions as recorded in existing databases, whatever information is stored, all notions, cross-references
· Three distinct but related areas in taxonomy: Classification, Nomenclature, Identification/determination. These are not kept distinct: names and taxa get confused, defining new taxa gets confused with data on identification or description.
· Names versus Concepts
- Taxon Name: label/word/string used for communicating ideas about organisms, meaningless without at least implied definition (originally in mind of taxonomist introducing name), full scientific name implies an original concept.
- Original Concept: full scientific name “according to” author + publication + date (+ definition)
- Revisionary taxonomy combined with rules of nomenclature means that names have more than one meaning.
- Revised Concept: Same as original concept but with different “according to” part.
- Reference Concept: Reference from a revision concept to some other concept that may not be well referenced and will not be defined (full scientific name with or without “according to”). Over time these should be replaced by either original or revision concepts
- Vernacular Concept: Used to allow access to possible original or revision concepts.
· All names can be treated as concepts.
· How do we define a concept?
- Character circumscription? Context dependent, differentiates rather than describes, natural language.
- Taxon circumscription? In terms of lower level taxa.
- Specimen circumscription? Type specimens, complete specimen set used in revision.
- Relationship to other taxa? Synonymy
- Instance of publication in which concept was defined?
· Aims of workshop
- Present taxonomic concept transfer schema (semantics, features)
- Globally unique identifiers
- Experience from mapping models to transfer schema
- Proposed exchange protocols
- Results of discussion to form report on workshop
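The name-versus-concept distinction from this talk can be sketched as a small data structure. This is only an illustration, not the transfer schema itself: the class and field names (AccordingTo, TaxonConcept, kind) are assumptions, and the taxon name, authors and publications are placeholders. The point it shows is that one name string can label several concepts that differ only in their "according to" part.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class AccordingTo:
    """The "according to" part of a concept: who published it, where, and when."""
    author: str
    publication: str
    year: int


@dataclass(frozen=True)
class TaxonConcept:
    """A name plus the opinion that gives it meaning (hypothetical structure)."""
    name: str                                   # full scientific name or vernacular label
    kind: str                                   # "original", "revision", "reference" or "vernacular"
    according_to: Optional[AccordingTo] = None  # may be absent for reference concepts

    def label(self) -> str:
        """Render the concept as 'name according to author year' where known."""
        if self.according_to is None:
            return self.name
        a = self.according_to
        return f"{self.name} according to {a.author} {a.year}"


# Placeholder data: the same name used for two different concepts.
original = TaxonConcept("Aus bus Smith", "original",
                        AccordingTo("Smith", "Hypothetical Journal", 1900))
revised = TaxonConcept("Aus bus Smith", "revision",
                       AccordingTo("Jones", "Hypothetical Flora", 1950))
reference = TaxonConcept("Aus bus Smith", "reference")
```

Here original and revised share a name but compare as different objects, which is the sense in which names have more than one meaning.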
WalterBerendsohn
· Why do we need these systems? The usability of the results is the key issue. The aim is to enable non-taxonomists to be able to use the names.
Frank Bisby
· For aggregation, need to be able to cover whole domains with a taxonomic treatment (know whether we have included all entities within a genus once and only once). ILDIS would regard itself as performing its own revision. Jessie: e.g. ITIS is taking other people’s concepts but it is not necessarily clear what their concept is.
RobertKukla, “Taxonomic Concept Transfer Schema”
· Transfer schema: taxonomic entities of interest, relationships between them
· Metadata element for human consumption
· Publications element based on simple endnote style structure
· Concept defined here as an opinion about a group of organisms, by a person/group of people, having a name/label, definition, record is available, “time stamped”
· TaxonConcepts have @type attribute. May be own concept (original or revision). May be referenced (related to other authors’ concepts). May be vernacular.
· NameDetailed using ABCD element.
· Relationship @type attribute allows for (rarely found) direct assertion of relationship (boolean), synonymy via typification, lineage if derived from other concepts, vernacular.
· SpecimenCircumscription @type specimen (for holotype, etc.)
· TaxonConceptCircumscription has @type (one level higher in hierarchy than other @type attributes).
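As a rough illustration of the structure described in this talk, the sketch below builds a tiny XML document in which each concept carries a type attribute and directed relationships to other concepts. The element and attribute names used here (DataSet, TaxonConcept, Name, Relationship, id, ref) are assumptions made for the example and are not taken from the actual schema; only the idea of typed concepts and typed relationships comes from the talk.

```python
import xml.etree.ElementTree as ET


def build_concept(concept_id, concept_type, name, relationships=()):
    """Build one concept element (hypothetical element and attribute names).

    relationships is an iterable of (relationship_type, target_concept_id) pairs.
    """
    concept = ET.Element("TaxonConcept", {"id": concept_id, "type": concept_type})
    ET.SubElement(concept, "Name").text = name
    for rel_type, target_id in relationships:
        # Relationships are directed: from this concept to the referenced one.
        ET.SubElement(concept, "Relationship", {"type": rel_type, "ref": target_id})
    return concept


# Placeholder data: a revision concept derived from an original concept.
doc = ET.Element("DataSet")
doc.append(build_concept("c1", "original", "Aus bus"))
doc.append(build_concept("c2", "revision", "Aus bus", [("lineage", "c1")]))

xml_text = ET.tostring(doc, encoding="unicode")
print(xml_text)
```

A receiving application could parse xml_text with ET.fromstring and follow the ref attributes to reconstruct the links between concepts.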
Is schema available? From SEEK web site. Will be on NeSC site.
WalterBerendsohn:
· Metadata standardisation will be handled in SDD workshop next week. Let’s defer until then. Specimens (“Units”) could use ABCD elements. SDD has description and circumscription issues.
· You stated that Relationships are always directed – assumes that always have an expert who references earlier concepts.
JessieKennedy: Author of revision can just identify e.g. that his concept is congruent with several other people’s concepts. This is not an implementation model. Walter: Can allow relationships to be related to the author of the relationship.
JamesYtow
· What of inter-regnal organisms with names under two codes? Jessie: treat these as two congruent concepts.
FrankBisby
· Should the vernacular name have a structure (and hence support NameDetailed)?
· Should there be a congruenceOrInclusion (or synonymy) relationship to cover the majority of cases? Bob Peet: the existing values are candidates to which many others could be added.
WalterBerendsohn
· Need to include all nomenclatural relationships. Some of this information might be best placed in name part rather than relationship part.
DonaldHobern
· Are all of these relationships suitable for inclusion in a single attribute? Jessie: use multiple relationships
JamesYtow:
· Need locale for vernacular names
WalterBerendsohn
· Each revised concept has two author strings.
FrankBisby
· How do Relationships and TaxonConceptCircumscription relate to each other? Don’t they blur into each other? Both are set relationships. Jessie: can find e.g. that two genera are stated to be equivalent, but that included species represent different sets.
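Jessie's example can be made concrete with a small sketch that treats each circumscription as a set of member identifiers and derives a set relationship from the data, so it can be checked against a stated one. The relationship vocabulary used here (congruent, includes, included in, overlaps, excludes) is an assumption for illustration, since the workshop left the list of relationship types open, and the species identifiers are placeholders.

```python
def compare_circumscriptions(a: frozenset, b: frozenset) -> str:
    """Return the set relationship of circumscription a to circumscription b.

    The returned labels are a hypothetical vocabulary, not the schema's.
    """
    if a == b:
        return "congruent"
    if a > b:            # a is a proper superset of b
        return "includes"
    if a < b:            # a is a proper subset of b
        return "included in"
    if a & b:            # some shared members, neither contains the other
        return "overlaps"
    return "excludes"


# Two treatments of the same genus, stated by an author to be equivalent.
genus_author_1 = frozenset({"sp1", "sp2"})
genus_author_2 = frozenset({"sp1", "sp2", "sp3"})

stated = "congruent"
derived = compare_circumscriptions(genus_author_1, genus_author_2)
print(stated, derived)  # the stated and derived relationships disagree
```

The stated relationship and the one derived from circumscription can differ, which is why the two elements carry distinct information.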
Will this be mapped to OWL or some other semantic language? Jessie: This would be an enormous job for which we don’t have the necessary information. Dave Thau: We have looked into doing this for the schema (without the data), but the usefulness of such a representation is unclear. OWL representations are hard to query.
DonaldHobern
· Any thoughts on bidirectional links to simplify processing these large documents (moving up hierarchy)? Jessie: Intended purely as a transfer schema.
DaveThau, “Globally Unique Identifiers, why, where and how?”
· What?
- GUIDs (e.g. ISBN, patent numbers, GenBank, etc.), Digital Object Identifiers (10.121/3212, all start with “10.”), LSID.
- Common features: Short names for complex entities, resolvable, identifies only one entity, permanence
- Differences: may or may not have multiple ids for single entity (patent numbers: no, GenBank accession numbers: yes, Web URLs: yes), whether issuing is decentralised (patent numbers: no, ISBNs: yes, in blocks, LSIDs: yes)
· Why?
- Standard ways to resolve ids
- Useful internally for systems dealing with data objects
- Useful for communicating between separate unaffiliated systems dealing with data objects
- Integration with other communities (e.g. publishing industry)
- Short, permanent, unique, resolvable
- Can’t just use existing databases such as IPNI, ITIS since generally name-based and cannot guarantee permanence
- Taxon+author+year+publication is too long, hard to represent publication, non-ascii character problems, not resolvable by themselves
· When?
- What gets a GUID? Taxonomic concepts, Publications, Specimens, Data Providers?, Authors?, Journals?
- When are they assigned? (Issue particularly if want each object only to have a single GUID, which would be more efficient, but may be difficult) Assigned when a “new” concept is added. How do you define a concept? When is a concept new enough to get a new GUID? What minor changes are allowable?
- Examples of a new concept: Revision adds new species to genus (species is new concept, genus is new concept), Revision adds synonym to taxon, Flora misspells a scientific name (?) What about: wrong page number, journal title misspelled, author misspelled? Best to leave choice to data provider
· Which?
- Home grown identifiers not resolvable
- GRID resource locators seem not stable
- LSID good candidate
- Handle system (e.g. DOI)
- LSID: lots of backers, web service protocols, caching, authentication, metadata, completely decentralised, less than 2 years old
- Handle system (e.g. 1883/t3_17555): underlies DOIs, many journals using them, mature (>10 years old), proprietary central system to assign and resolve prefixes, not using internet standards
- SEEK prototyping using handle system (maturity, better authentication, easier to separate handles from issuers, likely to be accepted by publishers, no tie between domain names and handles)
· What now?
- Specification and implementation
- When is one concept different from another
- Can there be more than 1 GUID per concept?
- What will encourage people to assign and use GUIDs?
· Discuss
- Are they necessary
- Are there other GUID systems
- Is the 1 GUID per concept rule necessary
- How to encourage use
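To make the syntactic difference between the candidate systems concrete, the sketch below classifies an identifier string by its form only. It assumes the usual LSID layout urn:lsid:authority:namespace:object with an optional revision, and treats a handle as prefix/suffix with a numeric dotted prefix, a DOI being a handle whose prefix starts with "10.". It does not resolve anything, the function name is hypothetical, and the LSID shown is a made-up example.

```python
import re

# Simplifying assumption: handle prefixes are digits separated by dots.
_HANDLE_PREFIX = re.compile(r"^[0-9]+(\.[0-9]+)*$")


def classify_identifier(identifier: str) -> str:
    """Guess the identifier system from the string's syntax alone."""
    parts = identifier.split(":")
    if (len(parts) in (5, 6)
            and parts[0].lower() == "urn"
            and parts[1].lower() == "lsid"
            and all(parts[2:5])):       # authority, namespace, object are non-empty
        return "lsid"
    prefix, sep, suffix = identifier.partition("/")
    if sep and suffix and _HANDLE_PREFIX.match(prefix):
        return "doi" if prefix.startswith("10.") else "handle"
    return "unknown"


for example in ("10.121/3212", "1883/t3_17555",
                "urn:lsid:example.org:concepts:42", "Aus bus Smith 1900"):
    print(example, "->", classify_identifier(example))
```

A name string such as taxon plus author plus year falls through to "unknown", which mirrors the point that such strings are not resolvable by themselves.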
AlexGray
· Note that uniqueness is only within context defined for given identifier type.
RobertKukla, “Experience from Mapping Existing Models to the Transfer Schema”
· ITIS plants, Berlin mosses (both text files), Taxonomer fishes (Access database)
· Imported into MySQL
· Java program to generate XML
· 3 main aspects: Identifying concepts, extracting relationships, concept details
· No CharacterCircumscription or SpecimenCircumscription information
· No hybrids as implications are not fully understood
· ITIS: 97741 plants, 206649 concepts, ITIS’ own concepts (usage=”accepted” -> type=”revision”), synonyms (usage=”not accepted” -> type=”referenced”), vernaculars (type=”vernacular”), concept circumscription (parent_tsn field), synonymy (explicit + vernaculars), lineage relationships (to concept of same name according to different publication), NameSimple calculated
· Berlin: 24368 concepts, explicit concept relationships and name-synonymy, many different relationship types (some very rare)
· Taxonomer: Parent links, but no relationships
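The ITIS mapping rules listed above can be sketched as follows. The talk used a Java program over MySQL; this is a simplified Python illustration of the same rules, not that program. Only the usage values and the parent_tsn field come from the talk. The other field names (tsn, name), the output keys and the sample records are assumptions.

```python
# Mapping stated in the talk: usage value -> concept type.
USAGE_TO_TYPE = {"accepted": "revision", "not accepted": "referenced"}


def map_itis_record(record: dict) -> dict:
    """Turn one ITIS-like row into a concept dictionary (hypothetical keys)."""
    return {
        "id": record["tsn"],
        "name": record["name"],
        "type": USAGE_TO_TYPE[record["usage"]],
        "parent": record.get("parent_tsn"),   # None when no parent is recorded
    }


def circumscriptions(concepts: list) -> dict:
    """Group child concept ids under their parent id (from parent_tsn)."""
    result = {}
    for concept in concepts:
        if concept["parent"] is not None:
            result.setdefault(concept["parent"], []).append(concept["id"])
    return result


# Placeholder rows, not real ITIS data.
rows = [
    {"tsn": 1, "name": "Aus", "usage": "accepted", "parent_tsn": None},
    {"tsn": 2, "name": "Aus bus", "usage": "accepted", "parent_tsn": 1},
    {"tsn": 3, "name": "Aus cus", "usage": "accepted", "parent_tsn": 1},
    {"tsn": 4, "name": "Aus dus", "usage": "not accepted", "parent_tsn": None},
]
concepts = [map_itis_record(row) for row in rows]
members = circumscriptions(concepts)
print(members)
```

Vernaculars, synonymy and lineage relationships would need further rules of the same kind and are left out of this sketch.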
Protocol questions
· Need an interface like the SPICE interface to allow users to find concepts (treewalking, etc.). SPICE itself is a candidate.
· SEEK has an early implementation of an API using the Napier schema.
· Bring these together to see what can be done to resolve them
· May have three different kinds of use for schema:
- Point-to-point transfer of complete data sets
- SPICE-like treewalking
- DiGIR-style free-form query
Question and Answer
FrankBisby
· Healthy level of unanimity in acceptance of basic model (with possible exception of what a concept is)
DonaldHobern
· The acceptability of the standard may be precisely because it does not seek to overdefine this.
SallyHinchcliffe
· Important question is how compatible the concepts may be at the end of the day and hence how useful it may be to bring the datasets together.
FrankBisby
· Lots of work still in getting a complete list of relationship types