Project Details:

IUPAC - International Chemical Identifier

Project No.: 2000-025-1-800
Start date: 2001-01-01
End date: 2005-04-15
Division: Chemical Nomenclature and Structure Representation Division
Objective:

The objective of the IUPAC Chemical Identifier Project is to establish a unique label, the IUPAC Chemical Identifier, which would be a non-proprietary identifier for chemical substances that could be used in printed and electronic data sources thus enabling easier linking of diverse data compilations.

Remarks:
- Initiated by the ad hoc Committee on Chemical Identity and Nomenclature Systems.
- In July 2004, the Identifier was renamed INChI (formerly IChI) to acknowledge the development work at NIST.
- In November 2004, the Identifier was renamed IUPAC International Chemical Identifier (InChI), to allow trademark, copyright and licensing issues to be resolved.

Description:

Develop a set of algorithms for the standard representation of chemical structures that will be readily accessible to chemists in all countries at no cost. The standard chemical representation could be used as input into existing and newly developed computer programs to generate an IUPAC name and a unique IUPAC identifier.

> See detailed description

Progress:

Our initial work has focussed on the development of algorithms for converting an input organic chemical structure to a unique (canonical) form. This, in effect, involves the unique numbering of each atom, with equivalent atoms being assigned identical numbers. "Serializing" the result to create a string is the final, straightforward step in creating an identifier.
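
The numbering-then-serializing idea can be illustrated with a minimal sketch. It is not the project's algorithm: it assumes a simple iterative refinement in the style of the Morgan algorithm, and the function names and the output string format are hypothetical. Atoms that the refinement cannot tell apart receive the same number, and the string is then built from those numbers, so it does not depend on the order in which the atoms were drawn. Simple refinement of this kind cannot distinguish every pair of structures, so a production procedure needs further steps.

```python
from collections import defaultdict

def rank(keys):
    """Map each key to a 1-based rank; equal keys share a rank."""
    order = {k: r for r, k in enumerate(sorted(set(keys)), start=1)}
    return [order[k] for k in keys]

def equivalence_classes(elements, bonds):
    """Number the atoms so that atoms the refinement cannot distinguish
    share a number (hypothetical helper, Morgan-style refinement)."""
    n = len(elements)
    neighbors = defaultdict(list)
    for a, b in bonds:
        neighbors[a].append(b)
        neighbors[b].append(a)
    # Start from properties that do not depend on the input numbering.
    ranks = rank([(elements[i], len(neighbors[i])) for i in range(n)])
    while True:
        # Refine each atom's class using the classes of its neighbours.
        keys = [(ranks[i], tuple(sorted(ranks[j] for j in neighbors[i])))
                for i in range(n)]
        new_ranks = rank(keys)
        if new_ranks == ranks:  # partition is stable
            return ranks
        ranks = new_ranks

def serialize(elements, bonds):
    """Build a string from class numbers only (format is illustrative)."""
    ranks = equivalence_classes(elements, bonds)
    atoms = sorted((ranks[i], elements[i]) for i in range(len(elements)))
    edges = sorted(tuple(sorted((ranks[a], ranks[b]))) for a, b in bonds)
    atom_part = ",".join(f"{r}{e}" for r, e in atoms)
    edge_part = ",".join(f"{a}-{b}" for a, b in edges)
    return f"{atom_part}/{edge_part}"

# Ethanol heavy atoms drawn in two different orders give the same string.
print(serialize(["C", "C", "O"], [(0, 1), (1, 2)]))
print(serialize(["O", "C", "C"], [(0, 1), (1, 2)]))
```

In propane (three carbons in a chain) the two terminal carbons end up with the same number, which is the "equivalent atoms being assigned identical numbers" behaviour described above.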

As discussed in the Cambridge IUPAC meeting to consider the feasibility of the project in August 2000, most of the ideas employed in this work have been reported in the technical literature. The principal task of this project has been to identify and implement a workable, robust set of procedures that will provide effective IChI processing for a large proportion of organic chemical structures in common use.

At the Cambridge meeting it was agreed to develop a "layered" approach, where different levels of structural information are separately represented in the identifier. Work has consequently proceeded by step-by-step building of the individual layers. Since the order of application of the layers could affect the final labeling, this process is somewhat more complex than might initially appear.

The layers under development are:

  • constitutional - expresses pure connectivity of the atoms
  • stereochemical - includes conventional C-atom sp2 and sp3 stereochemistry
  • isotopic - enables isotopes to be distinguished
  • tautomeric - implements simple forms of rapid H-migration isomerization

Initial implementation and testing of this work have been completed, with the exception of the following two items:

  • Representing stereochemistry in systems with moving (mesomeric) bonds and electrons.
  • Representing H-migration tautomerism in systems containing 5-membered rings.

The first of these items does not seem to have been addressed adequately in the literature, although appropriate processing algorithms have been found in mathematical journals.
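
As a rough illustration of the layered approach described above, the sketch below keeps each level of structural information in its own field and joins the layers in a fixed order. The class name, the one-letter tags, the "/" separator and the layer contents are all hypothetical and do not reproduce the actual Identifier format. The point is only that a layer can be dropped or added without disturbing the others.

```python
from dataclasses import dataclass
from typing import Optional

# Fixed layer order, following the list of layers above.
LAYER_ORDER = ("constitutional", "stereochemical", "isotopic", "tautomeric")

@dataclass(frozen=True)
class LayeredIdentifier:
    """Hypothetical container: one string per layer, optional except connectivity."""
    constitutional: str
    stereochemical: Optional[str] = None
    isotopic: Optional[str] = None
    tautomeric: Optional[str] = None

    def serialize(self, layers=LAYER_ORDER):
        """Join the requested, non-empty layers in the fixed order."""
        parts = []
        for name in LAYER_ORDER:
            value = getattr(self, name)
            if name in layers and value:
                # Tag each layer with its first letter (illustrative only).
                parts.append(f"{name[0]}={value}")
        return "/".join(parts)

# Two records that differ only in the isotopic layer (contents are made up).
plain = LayeredIdentifier("C-C-O")
labelled = LayeredIdentifier("C-C-O", isotopic="1:13C")

print(plain.serialize())
print(labelled.serialize())
# Restricting to the constitutional layer makes the two records match.
print(labelled.serialize(layers=("constitutional",)) == plain.serialize())
```

Keeping the layers separate in this way lets data sources be linked at whichever level of detail they share, for example by connectivity alone.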

We hope to complete these remaining tasks within two months and then to implement the IChI processor as a standalone program that can automatically process standard "MOL-files". When this is available, assistance will be sought to further test, and possibly refine, the IChI name generation process.

Depending on results of these tests and discussions, it will be decided whether improvements or additional features are desirable, and, if so, whether these need to be followed by another round of testing. For instance, it needs to be determined whether the first version should allow a canonical representation of partially-specified stereochemical structures.

Finally, as discussed in the Cambridge meeting, there are no plans to include the following structural representations in the first version:

  • non C-atom sp2 and sp3 stereochemistry
  • ring-chain tautomerism (or any other variety not involving simple H-migration)
  • non-covalent bonds

    March 2002 update
    The first beta-test version of the program is now available. It runs as a conventional Windows application under 32-bit Microsoft Windows operating systems. Neither the underlying algorithms nor the program have been perfected - this distribution is intended primarily to allow others to participate in the further development.

    This program treats only covalently bonded compounds and uses Molfiles (and SDfiles) as input. Along with the executable programs, the distribution package contains documentation and example structure files.

    The package can be obtained from Steve Stein by e-mail to steve.stein@nist.gov. Unless requested otherwise, the package will be delivered as a 'zip' file in an e-mail attachment to the return address.

    A demonstration of Identifier generation within a (Windows) structure-drawing program, working in conjunction with the beta test program, can be obtained from Alan McNaught by e-mail to mcnaughta@rsc.org.

    There was a discussion of the project at the "CAS/IUPAC Conference on Chemical Identifiers and XML for Chemistry" on July 1, 2002 in Columbus, Ohio. On the preceding day (June 30th) at the same location the Project Group met to review progress and consider comments received.

    July 2002 update
    At the Task group meeting in Columbus, OH, on 30 June 2002, Steve Stein reviewed the progress made by NIST in developing the test version of the IUPAC Chemical Identifier. The test version handles simple organic molecules. To date, in all of the testing (almost 70 copies have been distributed) there are no known examples of chemicals that the program does not handle. A number of suggestions (described below) were made regarding testing and output. The overall view was that the project is progressing considerably faster than expected.
    > Download report - pdf file (118KB)

    A lecture by Steve Stein on the project was given the following day at the CAS/IUPAC Conference on Chemical Identifiers and XML for Chemistry, and a copy of the slides presented can be viewed at: http://www.hellers.com/steve/pub-talks/columbus-702/frame.htm

    November 2003 update
    A combined meeting for two related IUPAC projects, the XML Data Dictionary Project (#2002-022-1-024) and this Chemical Identifier Project (#2000-025-1-800), was held at the National Institute of Standards and Technology (NIST, Gaithersburg, Maryland, US) on November 12-14, 2003.

    A report on that meeting is published in Chem. Int. July-Aug 2004.
    A full account of the meeting is available at <www.warr.com/inchi.pdf>

    July 2004 update
    A new test version of the IUPAC-NIST Chemical Identifier (INChI) is now available. It replaces the previous test version issued last November. All features planned for inclusion in the final release have now been implemented and the final format for the Identifier has been proposed. The new name of the Identifier (formerly IUPAC Chemical Identifier, IChI) acknowledges the development work at NIST. The test program accepts input in the form of MOLfiles (or SDfiles) and CML files. An Application Program Interface (API) for communicating with external programs is under development.

    A single INChI is generated for a single input structure, which can contain multiple components. Identifiers can be created for organic compounds with Z/E and sp3 stereochemistry, tautomers, and isotopes as well as salts, organometallic compounds and protonated forms of a compound.

    Test programs (for Microsoft Windows), documentation and sample structure files are available upon request from Steve Stein <steve.stein@nist.gov>. The project team very much welcomes comments concerning the INChI and will be glad to assist in its testing or implementation.

    November 2004 update
    To allow trademark, copyright and licensing issues to be resolved before distribution of version 1.0, the name of the Identifier was changed to IUPAC International Chemical Identifier (InChI).

    April 2005 - project completed
    Version 1 of IUPAC's International Chemical Identifier (InChI) has now been released; software, documentation, source code and licensing conditions are available from the IUPAC website at www.iupac.org/inchi.

    Promotion and extension continue through project 2004-039-1-800.

    > see release; > FAQ (prepared by Nick Day of the Unilever Centre for Molecular Informatics, Cambridge University; http://wwmm.ch.cam.ac.uk/inchifaq/)


    Clipping
    > That INChI Feeling, Reactive Reports, Sep 2004 (issue 40)
    > Unique labels for compounds, C&EN, 26 Nov 2002
    > That ICHI feeling ..., The Alchemist, 24 Apr 2002
    > What's in a Name?, The Alchemist, 21 Mar 2002


    > project announcement published in Chem. Int. 23(3) 2001

    Chairman: