(Photo by Gilzee)
Most genealogists have at one point or another run across the term “GEDCOM”. They might not know exactly what it is, but they have at least heard of it. The term most commonly comes up when moving data from one genealogy application to another, since most applications allow exporting and importing data in the GEDCOM file format.
A GEDCOM file is made up of a limited set of record types and data fields that are formatted to follow a predefined syntax called the GEDCOM grammar. All of your genealogy data must fit into these available record types in order for your data to be transferred seamlessly. Unfortunately, very few genealogy applications use the GEDCOM data model as their native database format (Family Historian being one notable exception), so each application must translate its internal data fields to the appropriate GEDCOM data fields. This is called translation mapping. However, since GEDCOM only supports a limited set of record types, and because it is not universally extensible, this leads to translation mapping errors and loss of data.
This is where GEDCOM alternatives come in. A GEDCOM alternative (GA) is similar to GEDCOM in that it represents a genealogy database using what are in effect record types and data fields, and it follows a predefined syntax. Where the alternatives differ is in the data models they use, resulting in different record types and data fields. A primary design goal of any GA should be to provide for better translation mapping. This is done by defining a more useful set of record types and data fields, and by building universal extensibility into the design, so that when users and applications need to create new record types or data fields to hold unmapped data, they can do so in such a way that an importing application can recognize these additions and make use of them without loss of data.
As a professional software engineer who has been writing genealogy applications based on the GEDCOM “standard” for over a decade (including VGed, the GEDCOM validator, and Adam, the website generator currently generating The Forsythe Saga website), and who has some experience writing GEDCOM alternatives (The Genealogy Format (GDF 1.2) [2000-2002], The Genealogical Record Exchange and Description Language (GREnDL 1.1), and, currently in development, GREnDL 2.0), I have developed an opinion or two on some additional design goals any GEDCOM alternative should incorporate. The following design goals are, I believe, the most fundamental. The list is short, comprising only 10 fundamental goals, and is therefore not meant to be inclusive of all GA requirements.
A GEDCOM Alternative Construction Kit
Goal 1: As I’ve already said, one of the primary goals should be to build into the design a more useful set of record types and data fields than those that currently exist in the GEDCOM standard. I do not think this needs much explanation. Anyone who has used genealogy software for any length of time realizes that there are hundreds of obvious record types that could be added, such as a ship’s name (needed for all immigrants) and dozens of estate record types like probates, inventories, appraisements, estate sales, various administration records, etc. It should not, however, be a goal to include every possible record type. Since a GA’s primary use is in genealogy applications, and it is not intended for general historical record use, the record types and data fields built into the design should attempt to encompass genealogically relevant records only. Building in good extensibility will allow it to be expanded for other uses, such as historical record documentation. It will of course be impossible to garner a consensus on the eventual selection, so this extensibility will be necessary in any case.
Goal 2: Extensibility is paramount to a good design. No matter how hard one tries to build a complete set of record types into the design, there will always be some that were left out. This may be because the designers were not familiar with the types of records kept in some countries in the past, or because certain non-genealogical information was intentionally omitted, but extending the available record fields without modifying the protocol will be necessary. GEDCOM attempted to accomplish this by adding user-defined fields. This was only partially effective, though, because it failed to allow any kind of discovery of what those fields were. A good design should allow not only new record types and data fields, but also a way to specify what those record types are and the format of those data fields. At a minimum, full and abbreviated titles should be specifiable, along with the format of the data fields, and data fields should adhere to recognized international standards. On export, applications should then be able to create any extended records they need to map their unplaced data into, and to define those fields in such a way that an importing application can read them, interpret their type and format, and make at least limited use of them. A well-written application should at a minimum be able to display these fields so that the data is not lost to the user. If it provides editing capabilities, it should allow full editing of these fields, and it should allow exporting those fields without modifying their type or format.

There will always be some information that does not readily fit into whatever the design supports, and this is probably unavoidable. To help get around this, there should always be some type of generic record into which applications can dump any data that they have no way of mapping. Importing applications should still have access to these records, though they may not be able to display them in a readable form. This should not prevent them from exporting these records unchanged.
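As a rough illustration of the self-describing extension fields Goal 2 calls for, here is a minimal Python sketch. All names here (ExtensionField, the "_SHIP_NAME" tag, the field layout) are hypothetical, not drawn from any existing specification:

```python
from dataclasses import dataclass

# Hypothetical sketch of a self-describing extension field: the exporting
# application supplies titles and a data format so an importing application
# can display and preserve a field it has never seen before.
@dataclass
class ExtensionField:
    tag: str          # machine-readable identifier, e.g. "_SHIP_NAME"
    full_title: str   # full human-readable title
    abbr_title: str   # abbreviated title for compact displays
    data_format: str  # a recognized standard, e.g. "ISO 8601" for dates

ship = ExtensionField(
    tag="_SHIP_NAME",
    full_title="Name of Immigration Vessel",
    abbr_title="Ship",
    data_format="text/plain",
)

# Even without understanding "_SHIP_NAME", an importer can show the field
# under its declared title and re-export it unchanged.
print(f"{ship.full_title} ({ship.abbr_title}): format={ship.data_format}")
```

The point of the sketch is only that the definition travels with the data, so nothing is lost between applications.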
Goal 3: Anyone designing a GA should seriously consider using one of the many recognized grammars (XML, SQL, etc.) in existence today. I would think that genealogy application manufacturers would be less resistant to adopting a new “standard” if the grammar was supported by off-the-shelf libraries. I have tested the import and export capabilities of many of the leading genealogy applications on several occasions, and it is quite apparent that they do not use compliant GEDCOM parsers. I am not sure why this is; the grammar is about as simple as they come. Mapping problems can account for some of their issues, but when an exported file violates the GEDCOM syntax, it would appear the application manufacturers were not trying very hard. To make this process as simple as possible for these vendors, I highly recommend using a standardized grammar.
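To show what Goal 3 buys a vendor, here is a sketch using Python's standard-library XML parser. The record layout (a person element containing claim elements) is purely hypothetical, but the parser itself is off-the-shelf:

```python
import xml.etree.ElementTree as ET

# If a GA is expressed in a recognized grammar like XML, vendors get a
# compliant parser for free instead of hand-rolling one. This record
# structure is an invented example, not a real GA format.
sample = """
<person id="I1">
  <claim type="name" certainty="certain">
    <value>John Forsythe</value>
    <evidence ref="S1"/>
  </claim>
</person>
"""

root = ET.fromstring(sample)
claim = root.find("claim")
print(claim.get("type"), "->", claim.findtext("value"))
```

A file that parsed with a standard library on export will parse with a standard library on import, which removes a whole class of syntax violations.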
Goal 4: A good GA design should aim at reducing complexity. The less complex the protocol, the more likely it will be implemented correctly by genealogy vendors and understood by users. Its structure should lend itself to standard database models so that translation mapping can be easily implemented, and applications should be able to grasp the data model in such a way that the internal linking of associated records is not subject to misinterpretation.
Goal 5: As stated in Goal #4, the data model used in a GA should be as simple as possible. It should also support good genealogical practice by encouraging users to add supporting evidence for their claims and, wherever reasonably possible, deter poor practice by not providing shortcuts and by preventing loose associations between data model elements. Genealogy is, for all intents and purposes, nothing more than a set of claims supported by reliable evidence. Individuals are simply a group of associated claims based around their name (itself a claim), and families are groups of individuals based around their relationship claims. This data model can be represented as:
Evidence <-> Claim <-> Person
Where evidence may make one or more claims, claims may be associated with one or more persons, persons may be associated with one or more claims, and claims may be supported by one or more pieces of evidence. This basic ECP data model should be at the core of any GA.
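The ECP model above can be sketched in a few lines of Python; every association is many-to-many in both directions. Class and field names here are illustrative, not from any specification:

```python
from dataclasses import dataclass, field

# Minimal sketch of the Evidence <-> Claim <-> Person (ECP) data model:
# evidence supports claims, and claims attach to persons, with every link
# stored in both directions.
@dataclass
class Evidence:
    ident: str
    claims: list = field(default_factory=list)

@dataclass
class Person:
    ident: str
    claims: list = field(default_factory=list)

@dataclass
class Claim:
    kind: str    # e.g. "name", "birth-date"
    value: str
    evidence: list = field(default_factory=list)
    persons: list = field(default_factory=list)

def link(ev, claim, person):
    # Wire up both directions of each association.
    ev.claims.append(claim)
    claim.evidence.append(ev)
    person.claims.append(claim)
    claim.persons.append(person)

census = Evidence("S1")
anna = Person("I1")
born = Claim("birth-date", "1852-03-04")
link(census, born, anna)
```

Note there is no family class anywhere in the sketch; as the following goals argue, groupings like families fall out of the claims rather than being stored.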
Goal 6: No information should be built into the data model that can be discovered and represented programmatically by software applications. This supports Goal #5 by reducing the complexity of the data, but works against Goal #4 by increasing the complexity of the representing software, so some sort of happy medium should be aspired to. Giving software manufacturers increased control over how they discover and represent the associations in the data will allow them to distinguish themselves from their competitors and better vie for your genealogical dollar. A prime example of the type of information that should not be built into a data model is the family record. This will, I expect, be a contentious issue, as many GAs build complex record structures into their models to support groups like families.
From a data model perspective, however, families are a grouping of individuals based around their relationship claims, and as such do not exist as separate entities. Families can easily be discovered in software and then represented on individual profiles. One of the advantages of the ECP data model is that there are no preconceived notions of what a family is; this is determined by the originator of the database. A user could, for instance, set up relationships for parents, spouses, children (biological, adopted, fostered, step, or otherwise), godparents, and whatever else strikes their fancy. Same-sex marriages and polygamy are also not prohibited. Other types of groups can also be discovered and represented. A baptism, for instance, where there is a child, parents, godparents, perhaps witnesses, and officiators, can be set up through relationships as well. A well-written software application could find this grouping by any of several different methods: it could find all claims associated with the baptism record using the evidence record, or search for these claims by date and evidence references. More sophisticated programs might allow users to predefine group types (i.e. families, baptisms, etc.) so that all the information could be edited on a single page by adding dates and locations, linking persons, and assigning their relationships (roles in the event).

One of the problems with building groups into a data model is that when claims are added for the group instead of for the individuals separately, it can be very confusing as to which claims apply to which members of the group. When the data model only supports individual claims, programs have the option of letting users specify which claims apply to all members of a group and which to specific members, using check boxes and the like. If implemented correctly, the end user should not be able to tell that those group structures are not contained in the underlying data model.
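Discovering a family from relationship claims, rather than storing a family record, can be sketched as below. The (person, relation, person) triples are a hypothetical flattening of relationship claims, invented for this example:

```python
# Sketch: derive a "family" group from individual relationship claims.
# Each claim is (child_id, relation, parent_id); there is no family record.
relationship_claims = [
    ("I3", "biological-child-of", "I1"),
    ("I3", "biological-child-of", "I2"),
    ("I4", "adopted-child-of", "I1"),
    ("I4", "adopted-child-of", "I2"),
]

def discover_family(parent_ids, claims):
    """Collect everyone who claims any child relationship to the given parents."""
    children = {c for (c, rel, p) in claims
                if p in parent_ids and rel.endswith("child-of")}
    return sorted(parent_ids) + sorted(children)

family = discover_family({"I1", "I2"}, relationship_claims)
print(family)  # parents first, then discovered children
```

Because the grouping is computed, biological, adopted, or any user-defined child relationships all participate without the data model having to anticipate them.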
Goal 7: The evidence for a claim comes from the source material from which that claim was derived. No claim can exist without this evidence. We commonly refer to evidence as sources. It is important to be able to categorize sources in order to infer their reliability, so a good design must allow for sources to be categorized. The FAQ goes into some detail on how this can be accomplished with source categories that were designed expressly to aid in algorithmic inference. The Genealogical Proof Standard also has source categories that partially overlap these guidelines, but they are more limited in scope and rely more on human inferences that do not translate well to software algorithms.
Goal 8: Some of the common types of claims are a person’s name, the relationship between one person and another (i.e. biological child, step child, godmother, etc.), the date of an event, the location of an event, the age of a person at the time of an event, and the type of an event (i.e. birth, death, etc.). There are of course many other types of claims, but this short list should serve as an example. Claims must also have the property of certainty. As was discussed in Goal #7, source categorization can be used to infer claim certainty in a general sense. Algorithmic inferences can greatly simplify, and therefore reduce, the effort required of users to add certainty properties to their claims (Goal #4), but any inferences made algorithmically must allow for overriding by direct user input. Algorithmic inferences are, of course, software-application specific, giving vendors another avenue to compete for your allegiance. By providing a certainty data field for all claims, users have the option of overriding the automatic inferences, but at the cost of more data entry.
Certainty assessments should support at a minimum a range of values from Disproved through Certain. Other assessments like Assumed may be useful as well. A certainty assessment of Proved is more detrimental than not and should be avoided.
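The infer-then-override behavior described in Goal 8 can be sketched in a few lines. The source categories and their default certainty values below are purely illustrative placeholders:

```python
# Sketch: infer a default certainty from the source category, but let
# direct user input override the inference. This mapping is invented
# for illustration; a real application would use its own categories.
DEFAULT_CERTAINTY = {
    "original-record": "certain",
    "derivative-record": "assumed",
    "unsourced-tree": "doubtful",
}

def claim_certainty(source_category, user_override=None):
    # An explicit user assessment always wins over the inferred default.
    if user_override is not None:
        return user_override
    return DEFAULT_CERTAINTY.get(source_category, "unknown")

print(claim_certainty("original-record"))              # inferred default
print(claim_certainty("unsourced-tree", "disproved"))  # user override wins
```

The inference table is where vendors could compete; the override hook is what keeps the user in control.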
Goal 9: A good design should also allow the privatization of data at every level (i.e. claims, evidence, persons). Privatization should also be categorized by role, such as Role 1, Role 2, etc. Assigning roles to the privatization fields allows software applications to provide group features to their users, who can then define which viewers have access to which data by role. For instance, an online genealogy service may wish to assign its members to various groups based on subscription or membership levels, thereby controlling which members have access to which data. These roles could represent family members, copyrighted material, living persons (though this can usually be calculated algorithmically as well), etc.
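Role-based privatization as described in Goal 9 amounts to a simple filter at display time. The record shape and role names below are hypothetical:

```python
# Sketch of role-based privatization: each record may carry a privatization
# role, and a viewer sees public records plus those matching a role they hold.
records = [
    {"ident": "C1", "text": "birth 1852",       "private_role": None},
    {"ident": "C2", "text": "living address",   "private_role": "family"},
    {"ident": "C3", "text": "scanned book page", "private_role": "subscribers"},
]

def visible_to(viewer_roles, records):
    """Return public records plus those whose privatization role the viewer holds."""
    return [r for r in records
            if r["private_role"] is None or r["private_role"] in viewer_roles]

guest = visible_to(set(), records)                       # public data only
member = visible_to({"family", "subscribers"}, records)  # sees everything
```

Because the roles live in the data, an online service can remap them to its own membership tiers without touching the records themselves.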
Goal 10: Another important property of any claim is its occurrence event type. Some claims are single occurrence events (SOEs) and others are multiple occurrence events (MOEs). An example of a single occurrence event is a person’s birth: obviously they can only have one actual birth date. This is not to imply they can have only one birth date claim; they should be allowed any number of claims, which makes the distinction all the more important. An example of a multiple occurrence event would be a land deed. Some events are usually thought of as SOEs but could actually be MOEs on rare occasions, like burials and baptisms. I would argue that the first baptism and first burial are SOEs, any subsequent ones are MOEs, and the two should probably be different claim types. The occurrence event type should be definable for all claim types, including any extensions to the set of claims included in the data model, and should allow individual claims to be overridden by user input. Providing an occurrence event type property for all claims gives software applications valuable tools that may be used for programmatic features, from estimating birth dates for individuals to calculating claim improbabilities. In a similar vein, claims should also be categorized as those that occur while living, those that occur while an adult, and those that occur while flourishing.
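One concrete use of the SOE/MOE property from Goal 10 is flagging conflicting claims, as sketched below. The occurrence table and claim tuples are illustrative only:

```python
# Sketch: claim types carry an occurrence property (SOE or MOE), so software
# can flag claim types where multiple differing values exist for a
# single-occurrence event. The table below is an invented example.
OCCURRENCE = {"birth": "SOE", "death": "SOE", "land-deed": "MOE"}

def conflicting_soe_claims(claims):
    """Return the claim types that have multiple differing values for an SOE."""
    seen = {}
    conflicts = set()
    for kind, value in claims:
        if OCCURRENCE.get(kind) != "SOE":
            continue  # multiple land deeds are fine; skip MOEs
        if kind in seen and seen[kind] != value:
            conflicts.add(kind)
        seen.setdefault(kind, value)
    return conflicts

claims = [("birth", "1852"), ("birth", "1853"),
          ("land-deed", "1880"), ("land-deed", "1885")]
print(conflicting_soe_claims(claims))  # two differing birth dates conflict
```

Two land deeds coexist quietly, while two differing birth dates are surfaced to the user, which is exactly the kind of programmatic feature the occurrence property enables.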
This set of 10 fundamental design goals should be in, or influence, the design of any GA in order to make it a viable contender among the plethora of competing models. There are without question dozens of other important goals that should be considered as well, such as internationalization of text and dates, comment data formatting, alternate calendars, photo and URL support for local and external linking and submission identification, record modification date tracking, and the like.
Louis Kessler responded (Oct 16, 2011 11:08 PM)
So Tim, why don’t you consider getting involved with the BetterGEDCOM initiative? Or even better – why not take the initiative to lead it?
They’ve got the organization going with connections into the industry. You’ve got great technical insight into just what’s required.
Nudge, nudge. Wink, wink.
I responded (Oct 17, 2011 12:30 AM)
Louis, thanks for the vote of confidence, but unfortunately my project list is already stacked too deep for me to handle as it is, so I don’t see how I could take on another one, and especially one of this magnitude, anytime in the foreseeable future. I do wish the team the best of luck in coming up with a good design strategy that will be widely adopted. Here’s to better standards, cheers.
Taco Goulooze responded (Oct 20, 2011 10:40 PM)
Your proposed “Certainty assessments” values include ‘disproved’. Has there ever been a discussion about the fact that you have no way of keeping track of facts that have been disproven within most (if not all) genealogical database software? Or am I generalizing and just haven’t been using the right software?
I responded (Oct 21, 2011 04:26 AM)
I think you are correct, but I don’t use most genealogical applications, so cannot be certain. I’ve heard that some apps are adding support for Mills’ Genealogical Proof Standard, but I don’t think the GPS goes nearly far enough. I think Mills over-compromised for simplicity. I don’t think my source categories and certainty assessments are overly complex – I use them every day. The GPS has both ‘negative’ and ‘conflicting’ categories for source evidence, but that only applies indirectly to the claims that are referenced by that source. It is always better to assess the claims themselves for exactitude.
Taco Goulooze responded (Oct 21, 2011 07:50 AM)
Especially in a time where gedcoms and online family trees become more readily available and in larger quantities, you have to be able to record a claim that you or someone else has proven to be false, and be able to support that with proof (and keep record of its sources). Otherwise, it will keep on popping up. If 100 people claim something without sourcing it, it will gradually be more accepted as ‘true’ if there is no record kept of why it is a falsehood.