Small library for validating and normalising persistent identifiers used in scholarly communication.

Features

  • Addition of custom schemes supporting all features of predefined schemes.

  • Validation and normalization of persistent identifiers.

  • Detection of persistent identifier scheme.

  • Generation of resolving links for persistent identifiers.

  • Supported schemes: ISBN10, ISBN13, ISSN, ISTC, DOI, Handle, EAN8, EAN13, ISNI, ORCID, ARK, PURL, LSID, URN, Bibcode, arXiv, PubMed ID, PubMed Central ID, GND, SRA, BioProject, BioSample, Ensembl, UniProt, RefSeq, Genome Assembly, GEO, ArrayExpress.

Installation

The IDUtils package is on PyPI so all you need is:

$ pip install idutils

API

Small library for persistent identifiers used in scholarly communication.

idutils.detect_identifier_schemes(val)[source]

Detect persistent identifier scheme for a given value.

Note

Some schemes like PMID are very generic.
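
The note above matters because detection can only report which validators accept a value. The following is a minimal standalone sketch of that idea, not the idutils implementation: the two validators and the name detect_schemes_sketch are hypothetical simplifications.

```python
import re

# Hypothetical, simplified validators for illustration only; the real
# idutils validators are stricter and cover many more schemes.
SKETCH_VALIDATORS = [
    ("doi", lambda v: re.match(r"^10\.\d{4,9}/\S+$", v) is not None),
    ("pmid", lambda v: v.isdigit()),
]


def detect_schemes_sketch(val):
    """Return the names of all sketch schemes whose validator accepts val."""
    return [name for name, is_valid in SKETCH_VALIDATORS if is_valid(val)]


print(detect_schemes_sketch("10.1234/foo"))  # ['doi']
print(detect_schemes_sketch("12345"))        # ['pmid']
print(detect_schemes_sketch("not an id"))    # []
```

Because the "pmid" check accepts any run of digits, a caller who knows the expected scheme should test for it explicitly rather than rely on detection alone.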

idutils.is_ads(val)[source]

Test if argument is an ADS bibliographic code.

idutils.is_arrayexpress_array(val)[source]

Test if argument is an ArrayExpress array accession.

idutils.is_arrayexpress_experiment(val)[source]

Test if argument is an ArrayExpress experiment accession.

idutils.is_arxiv(val)[source]

Test if argument is an arXiv ID.

See http://arxiv.org/help/arxiv_identifier and http://arxiv.org/help/arxiv_identifier_for_services.

idutils.is_arxiv_post_2007(val)[source]

Test if argument is a post-2007 arXiv ID.

idutils.is_arxiv_pre_2007(val)[source]

Test if argument is a pre-2007 arXiv ID.

idutils.is_bioproject(val)[source]

Test if argument is a BioProject accession.

idutils.is_biosample(val)[source]

Test if argument is a BioSample accession.

idutils.is_doi(val)[source]

Test if argument is a DOI.

idutils.is_ean(val)[source]

Test if argument is an International Article Number (EAN-13 or EAN-8).

See http://en.wikipedia.org/wiki/International_Article_Number_(EAN).

idutils.is_ean13(val)[source]

Test if argument is an International Article Number (EAN-13).

idutils.is_ean8(val)[source]

Test if argument is an International Article Number (EAN-8).
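
EAN-8 and EAN-13 numbers end in a mod-10 check digit. The sketch below shows that standard check in isolation; it is not taken from idutils, and the function name ean_checksum_ok is an assumption made for illustration.

```python
import re


def ean_checksum_ok(val):
    """Check the mod-10 check digit of an EAN-8 or EAN-13 string."""
    if not re.fullmatch(r"[0-9]{8}|[0-9]{13}", val):
        return False
    # Counting from the right, digits are weighted 1, 3, 1, 3, ...
    total = sum(
        int(digit) * (3 if index % 2 else 1)
        for index, digit in enumerate(reversed(val))
    )
    return total % 10 == 0


print(ean_checksum_ok("4006381333931"))  # True (EAN-13)
print(ean_checksum_ok("73513537"))       # True (EAN-8)
print(ean_checksum_ok("4006381333932"))  # False (wrong check digit)
```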

idutils.is_ensembl(val)[source]

Test if argument is an Ensembl accession.

idutils.is_genome(val)[source]

Test if argument is a GenBank or RefSeq genome assembly accession.

idutils.is_geo(val)[source]

Test if argument is a Gene Expression Omnibus (GEO) accession.

idutils.is_gnd(val)[source]

Test if argument is a GND Identifier.

idutils.is_handle(val)[source]

Test if argument is a Handle.

Note: DOIs are also Handles, and Handles are very generic, so this function will also match, e.g., any URL you parse.

idutils.is_isbn(val)[source]

Test if argument is an ISBN-10 or ISBN-13 number.

idutils.is_isbn10(isbn10)[source]

Validate as ISBN-10.

idutils.is_isbn13(isbn13)[source]

Validate as ISBN-13.
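
For readers unfamiliar with what ISBN validation involves, here is a standalone sketch of the ISBN-10 mod-11 check digit. It is an illustration under stated assumptions, not the code idutils runs (the changelog notes that idutils delegates ISBN handling to an external library), and isbn10_checksum_ok is a hypothetical name.

```python
import re


def isbn10_checksum_ok(val):
    """Check the mod-11 check digit of an ISBN-10 (hyphens/spaces allowed)."""
    compact = re.sub(r"[\s-]", "", val).upper()
    if not re.fullmatch(r"[0-9]{9}[0-9X]", compact):
        return False
    # Weights run 10, 9, ..., 1; a trailing "X" stands for the value 10.
    total = sum(
        (10 if char == "X" else int(char)) * weight
        for char, weight in zip(compact, range(10, 0, -1))
    )
    return total % 11 == 0


print(isbn10_checksum_ok("0-306-40615-2"))  # True
print(isbn10_checksum_ok("080442957X"))     # True
print(isbn10_checksum_ok("0306406153"))     # False
```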

idutils.is_isni(val)[source]

Test if argument is an International Standard Name Identifier.

idutils.is_issn(val)[source]

Test if argument is an ISSN number.

idutils.is_istc(val)[source]

Test if argument is an International Standard Text Code.

See http://www.istc-international.org/html/about_structure_syntax.aspx

idutils.is_lsid(val)[source]

Test if argument is an LSID.

idutils.is_orcid(val)[source]

Test if argument is an ORCID ID.

See http://support.orcid.org/knowledgebase/articles/116780-structure-of-the-orcid-identifier
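
The ORCID structure referenced above ends in an ISO 7064 MOD 11-2 check character. The following standalone sketch shows that calculation; it is not the idutils implementation (which, per the changelog, also checks ISNI block ranges), and orcid_checksum_ok is a hypothetical name.

```python
import re


def orcid_checksum_ok(val):
    """Check the ISO 7064 MOD 11-2 check character of an ORCID iD."""
    compact = val.replace("-", "").upper()
    if not re.fullmatch(r"[0-9]{15}[0-9X]", compact):
        return False
    total = 0
    for digit in compact[:15]:
        # Add the digit, then double the running total.
        total = (total + int(digit)) * 2
    result = (12 - total % 11) % 11
    expected = "X" if result == 10 else str(result)
    return compact[15] == expected


print(orcid_checksum_ok("0000-0002-1825-0097"))  # True
print(orcid_checksum_ok("0000-0002-1694-233X"))  # True
print(orcid_checksum_ok("0000-0002-1825-0098"))  # False
```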

idutils.is_pmcid(val)[source]

Test if argument is a PubMed Central ID.

idutils.is_pmid(val)[source]

Test if argument is a PubMed ID.

Warning: PMIDs are plain integers with no internal structure, so this function accepts any integer as a PubMed ID.

idutils.is_purl(val)[source]

Test if argument is a PURL.

idutils.is_refseq(val)[source]

Test if argument is a RefSeq accession.

idutils.is_sra(val)[source]

Test if argument is an SRA accession.

idutils.is_uniprot(val)[source]

Test if argument is a UniProt accession.

idutils.is_url(val)[source]

Test if argument is a URL.

idutils.is_urn(val)[source]

Test if argument is a URN.

idutils.normalize_ads(val)[source]

Normalize an ADS bibliographic code.

idutils.normalize_arxiv(val)[source]

Normalize an arXiv identifier.

idutils.normalize_doi(val)[source]

Normalize a DOI.

idutils.normalize_gnd(val)[source]

Normalize a GND identifier.

idutils.normalize_handle(val)[source]

Normalize a Handle identifier.

idutils.normalize_orcid(val)[source]

Normalize an ORCID identifier.

idutils.normalize_pid(val, scheme)[source]

Normalize an identifier.

E.g. doi:10.1234/foo and http://dx.doi.org/10.1234/foo and 10.1234/foo will all be normalized to 10.1234/foo.
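
The DOI case described above amounts to stripping a known prefix. A minimal standalone sketch of that behaviour follows; the prefix pattern and the name normalize_doi_sketch are assumptions for illustration, and the patterns idutils actually uses may differ.

```python
import re

# Hypothetical prefix pattern; covers only the three forms quoted above.
DOI_PREFIX = re.compile(r"^(doi:\s*|https?://(dx\.)?doi\.org/)", re.IGNORECASE)


def normalize_doi_sketch(val):
    """Strip a leading 'doi:' tag or doi.org resolver URL from a DOI."""
    return DOI_PREFIX.sub("", val.strip())


for raw in ("doi:10.1234/foo", "http://dx.doi.org/10.1234/foo", "10.1234/foo"):
    print(normalize_doi_sketch(raw))  # prints 10.1234/foo each time
```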

idutils.normalize_pmid(val)[source]

Normalize a PubMed ID.

idutils.to_url(val, scheme, url_scheme='http')[source]

Convert a resolvable identifier into a URL for a landing page.

Parameters:
  • val – The identifier’s value.

  • scheme – The identifier’s scheme.

  • url_scheme – Scheme to use for URL generation, ‘http’ or ‘https’.

Returns:

URL for the identifier.

Added in version 0.3.0: url_scheme used for URL generation.
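
To illustrate how the three parameters fit together, here is a standalone sketch that builds a landing-page URL from a per-scheme template. The template table, the empty-string fallback for unknown schemes, and the name to_url_sketch are all assumptions; the resolvers and fallback behaviour of idutils.to_url may differ.

```python
# Hypothetical resolver templates for illustration only.
RESOLVER_TEMPLATES = {
    "doi": "{url_scheme}://doi.org/{pid}",
    "orcid": "{url_scheme}://orcid.org/{pid}",
}


def to_url_sketch(val, scheme, url_scheme="http"):
    """Build a landing-page URL, or return '' for an unknown scheme."""
    template = RESOLVER_TEMPLATES.get(scheme)
    if template is None:
        return ""  # assumption: unknown schemes yield an empty string
    return template.format(url_scheme=url_scheme, pid=val)


print(to_url_sketch("10.1234/foo", "doi"))
print(to_url_sketch("10.1234/foo", "doi", url_scheme="https"))
```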

Changes

Version 1.5.0 (2025-07-14)

  • chores: replaced importlib_metadata with importlib.metadata

Version 1.4.5 (2025-06-05)

  • ark: fix regex to match new ARK identifiers without slash

Version 1.4.4 (2025-06-03)

  • swhid: improved SWHID validation

  • tests: additional tests

Version 1.4.3 (2025-05-12)

  • is_url: allow URL parameters (i.e. semicolon)

  • gnd: improve validation and normalization

  • pmcid: fix url to a working location

  • pmid: add trailing slash

  • new: email and sha1 identifiers

Version 1.4.2 (2024-11-01)

  • setup: remove pytest-invenio to make imports cleaner

  • setup: install importlib_metadata for compatibility

  • bibcode/ads: normalize unicode

Version 1.4.1 (2024-10-18)

  • install: add importlib_metadata

Version 1.4.0 (2024-10-17)

  • Restructure module to be configurable and readable.

  • Adds a new entrypoint to register new custom schemes

  • Adds deprecations for direct imports of schemes

Version 1.3.0 (2024-10-15) (yanked due to undesired flask dependency)

  • Restructure module to be configurable and readable.

  • Adds a new entrypoint to register new custom schemes

  • Adds deprecations for direct imports of schemes

Version 1.2.1 (2023-03-02)

  • Fixes ORCiD validation, by adding the new ISNI block range.

Version 1.2.0 (2023-01-30)

  • schemes: add support for viaf and urn

Version 1.1.12 (2022-02-28)

  • Replaces isbnid_fork with isbnlib

Version 1.1.11 (2022-01-28)

  • Normalize pmid + their URL identifiers

Version 1.1.10 (2022-01-11)

  • Add purl.fdlp.gov as a valid PURL netloc

  • Normalize ror identifiers

Version 1.1.9 (2021-08-30)

Version 1.1.8 (2020-08-13)

  • Adds support for GEO and ArrayExpress identifiers.

Version 1.1.7 (2020-06-22)

  • Updates Software Heritage identifiers

  • Adds Research Organization Registry identifiers

  • Fixes DeprecationWarnings by using raw strings for regular expressions

Version 1.1.6 (2020-05-07)

  • Deprecates Python versions lower than 3.6.0. Now supporting 3.6.0 and 3.7.0.

Version 1.1.5 (2020-02-26)

  • Adds support for Software Heritage identifiers.

  • Fixes handling of non-digit characters in DOI detection.

Version 1.1.4 (2019-09-27)

  • Adds support for ASCL identifiers.

  • Fixes the ADS identifier regex to also detect lower-case author initials.

Version 1.1.3 (2019-09-17)

  • Adds support for HTTPS ORCiD identifiers.

Version 1.1.2 (2019-02-12)

  • Adds support for HAL identifiers.

Version 1.1.1 (2018-11-18)

Version 1.1.0 (2018-08-17)

  • Adds support for genomic identifiers: SRA, BioProject, BioSample, Ensembl, UniProt, RefSeq, GenBank/RefSeq.

  • Fixes bug in bibcode detection for non-capitalized journals.

Version 1.0.1 (2018-05-02)

  • Fixes bug causing invalid DOIs to be accepted.

Version 1.0.0 (2017-12-07)

  • Fixes handling of unicode characters in DOIs.

  • Adds support for APS style arXiv identifiers.

Version 0.2.4 (2017-01-30)

  • Removes Python 3.3 from a list of supported Python versions and adds Python 3.6

  • Moves from isbnid (v0.3.4) to isbnid_fork (v0.4.4) library.

Version 0.2.3 (2016-09-21)

  • Adds an optional parameter in idutils.to_url to use HTTPS scheme for PID providers that support it.

  • Detects and parses Handles and DOIs without the “http(s)://”, and ignores whitespace after scheme tags (e.g. “doi: 10.123/456”).

Version 0.2.2 (2016-09-16)

  • Fixes issue where a valid ISBN with dashes and spaces could not be normalized.

Version 0.2.1 (2016-06-17)

  • Changes ISBN normalization to use isbnid instead of isbnlib. Importing this library no longer changes the default socket timeout, which previously caused unwanted side effects.

Version 0.2.0 (2016-04-07)

Version 0.1.1 (2015-07-22)

  • Fixes GND validation and normalization.

  • Replaces invalid package name in run-tests.sh and makes the run-tests.sh file executable. One can now use docker-compose run --rm web /code/run-tests.sh to run all the CI tests (pep257, sphinx, test suite).

  • Initial release of Docker configuration suitable for local development. docker-compose build rebuilds the image, docker-compose run --rm web /code/run-tests.sh runs the test suite.

Version 0.1.0 (2015-07-02)

  • First public release.

Contributing

Bug reports, feature requests, and other contributions are welcome. If you find a demonstrable problem that is caused by the code of this library, please:

  1. Search for already reported problems.

  2. Check if the issue has been fixed or is still reproducible on the latest master branch.

  3. Create an issue with a test case.

If you create a feature branch, you can run the tests to ensure everything is operating correctly:

$ ./run-tests.sh

How to add your own schemes

Extension class to collect and register new schemes via entrypoints.

To define your own custom schemes, register them through the following entry point:

[options.entry_points]
idutils.custom_schemes =
    my_new_scheme = my_module.get_scheme_config_func

The entry point 'my_new_scheme = my_module.get_scheme_config_func' defines an entry point named my_new_scheme pointing to the function my_module.get_scheme_config_func which returns the config for your new registered scheme.

That function must return a dictionary with the following format:

def get_scheme_config_func():
    return {
        # Returns True if the value is valid for this scheme, else False.
        # See examples in the `idutils.validators` file.
        "validator": lambda value: True,
        # Returns the normalized value (placeholder: returns it unchanged).
        # Used in the `idutils.normalizers.normalize_pid` function.
        "normalizer": lambda value: value,
        # See examples in the `idutils.detectors.IDUTILS_SCHEME_FILTER` config.
        "filter": ["list_of_schemes_to_filter_out"],
        # Used in the `idutils.normalizers.to_url` function.
        "url_generator": lambda scheme, normalized_pid: "normalized_url",
    }

Each key is optional. If a key is not provided, its default value is set by the idutils.ext._set_default_custom_scheme_config() function.

Note: You can only add new schemes but not override existing ones.
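
As a concrete illustration of the dictionary format, the sketch below defines a config function for an invented scheme whose identifiers look like "LAB-" followed by six digits. The scheme, the regular expression, and the samples.example.org URL are hypothetical; only the four dictionary keys and the lambda signatures come from the description above.

```python
import re

# Hypothetical identifier format: "LAB-" followed by exactly six digits.
LAB_ID = re.compile(r"^lab-[0-9]{6}$", re.IGNORECASE)


def get_scheme_config_func():
    """Return the config for the hypothetical 'lab' scheme."""
    return {
        "validator": lambda value: LAB_ID.match(value.strip()) is not None,
        "normalizer": lambda value: value.strip().upper(),
        "filter": [],  # assumption: no other schemes need filtering out
        "url_generator": lambda scheme, normalized_pid: (
            "https://samples.example.org/" + normalized_pid
        ),
    }


config = get_scheme_config_func()
pid = config["normalizer"](" lab-000123 ")
print(config["validator"]("lab-000123"))    # True
print(pid)                                  # LAB-000123
print(config["url_generator"]("lab", pid))  # https://samples.example.org/LAB-000123
```

Calling the function directly, as done here, is a convenient way to unit-test a scheme before wiring it into the entry point.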

class idutils.ext.CustomSchemesRegistry[source]

Singleton class for loading and storing custom schemes from entry points.

property custom_schemes

Return the registered custom schemes.

Each item of the registry has the format:

{
    "custom_scheme": {
        # Returns True if the value is valid, else False.
        # See examples in the idutils.validators file.
        "validator": lambda value: True,
        # Returns the normalized value.
        # Used in the idutils.normalizers.normalize_pid function.
        "normalizer": lambda value: value,
        # See examples in the idutils.detectors.IDUTILS_SCHEME_FILTER config.
        "filter": ["list_of_schemes_to_filter_out"],
        # Used in the idutils.normalizers.to_url function.
        "url_generator": lambda scheme, normalized_pid: "normalized_url",
    }
}

pick_scheme_key(key)[source]

Serialize the registered custom schemes by key.

Return a list of tuples [(<scheme_name>, <scheme_config_key_value>)]
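
The return shape can be pictured with a plain dictionary standing in for the registry. This is a sketch of the described output format only, not the method's actual code; pick_scheme_key_sketch and the sample registry are hypothetical.

```python
def pick_scheme_key_sketch(custom_schemes, key):
    """Return [(scheme_name, config[key]), ...] for every registered scheme."""
    return [(name, config[key]) for name, config in custom_schemes.items()]


# Hypothetical registry contents with a single custom scheme.
registry = {
    "lab": {
        "validator": str.isdigit,
        "filter": ["pmid"],
    },
}

print(pick_scheme_key_sketch(registry, "filter"))  # [('lab', ['pmid'])]
```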

idutils.ext.entry_points(group)[source]

Entry points.

Copied here from invenio-base so that we do not introduce a dependency on invenio-base.

License

IDUtils is free software; you can redistribute it and/or modify it under the terms of the Revised BSD License quoted below.

Copyright (C) 2015-2018 CERN. Copyright (C) 2018 Alan Rubin.

All rights reserved.

Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:

  • Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.

  • Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.

  • Neither the name of the copyright holder nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS “AS IS” AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDERS OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

In applying this license, CERN does not waive the privileges and immunities granted to it by virtue of its status as an Intergovernmental Organization or submit itself to any jurisdiction.

Authors

  • Adrian Pawel Baran

  • Alan Rubin

  • Alexander Ioannidis

  • Antoine Lambert

  • Bruno Marmol

  • Guillaume Viger

  • Jiri Kuncar

  • Lars Holm Nielsen

  • Pedro Gaudencio

  • Tibor Simko