nmdc-client

This module provides functions to interact with the NMDC API, including retrieving metadata into dataframes and linking across schema objects.

Getting started

To install:

pip install git+https://github.com/jeffbaumes/nmdc-client.git

To use:

from nmdc_client import NmdcClient
from nmdc_schema import nmdc

client = NmdcClient()
studies = client.find(nmdc.Study)

User guide

The NmdcClient.find and NmdcClient.lookup functions return data as a DataFrame. To retrieve data as a list of dictionaries, use NmdcClient.find_dict and NmdcClient.lookup_dict. To retrieve data as a list of nmdc_schema.nmdc objects, use NmdcClient.find_full and NmdcClient.lookup_full. See the NMDC schema documentation for more information on NMDC schema classes.

When working with DataFrames, use NmdcClient.merge_related to merge related objects into one DataFrame. It relies on the lower-level NmdcClient.related_ids function, which finds related objects of a given type for a list of IDs.

Use the following import to access Pythonic representations of the NMDC schema.

>>> from nmdc_schema import nmdc

Names of classes and slots can be discovered by autocompleting nmdc. in your IDE, or by visiting the NMDC schema documentation.

To interact with the NMDC API, create an NmdcClient instance.

>>> from nmdc_client import NmdcClient
>>> client = NmdcClient()

The following retrieves all studies with Wrighton as principal investigator as a DataFrame. Queries are specified using MongoDB query syntax.

>>> query = {'principal_investigator.has_raw_value': {'$regex': 'Wrighton'}}
>>> studies = client.find(nmdc.Study, query=query)
>>> len(studies)
2
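To make the shape of such a query concrete, here is a minimal offline sketch of what a dotted-path `$regex` filter selects. The records and the `get_path` and `matches` helpers are hypothetical and exist only for illustration; the real filtering is done by the NMDC API, not by client-side code like this.

```python
import re
from typing import Any, Dict, List


def get_path(record: Dict[str, Any], dotted: str) -> Any:
    """Follow a dotted path such as 'a.b' through nested dictionaries."""
    value: Any = record
    for key in dotted.split("."):
        if not isinstance(value, dict) or key not in value:
            return None
        value = value[key]
    return value


def matches(record: Dict[str, Any], query: Dict[str, Any]) -> bool:
    """Return True if every condition in the query holds for the record.

    Only plain equality and the $regex operator are handled in this sketch.
    """
    for path, condition in query.items():
        value = get_path(record, path)
        if isinstance(condition, dict) and "$regex" in condition:
            # $regex matches anywhere in the string unless the pattern is anchored.
            if not isinstance(value, str) or not re.search(condition["$regex"], value):
                return False
        elif value != condition:
            return False
    return True


# Hypothetical study records, not real NMDC data.
records: List[Dict[str, Any]] = [
    {"id": "study-1", "principal_investigator": {"has_raw_value": "A. Wrighton"}},
    {"id": "study-2", "principal_investigator": {"has_raw_value": "B. Example"}},
    {"id": "study-3"},  # no principal investigator recorded
]

query = {"principal_investigator.has_raw_value": {"$regex": "Wrighton"}}
selected = [record["id"] for record in records if matches(record, query)]
print(selected)
```

The dotted key reaches into the nested principal_investigator object, and a record that lacks that path simply does not match.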

The following retrieves a specific biosample by its identifier and associates it with study metadata. The first time NmdcClient.merge_related is called, it fetches all linkages between objects in the NMDC schema and caches them in a file local to the package. You can force an update of this cache later by calling NmdcClient.fetch_links with force set to True.

>>> ids = ['nmdc:bsm-13-7qxjvr77']
>>> biosample = client.lookup(nmdc.Biosample, ids, fields=['id', 'name'])
>>> add_study = client.merge_related(biosample, 'id', nmdc.Study, fields=['id', 'title'])
>>> for col in add_study.columns:
...     print(col)
id
name
id_Study
title

The following retrieves 10 samples, then merges them with all associated WorkflowExecution objects.

>>> samp = client.find(nmdc.Biosample, fields=["id", "type"], limit=10)
>>> execs = client.merge_related(samp, "id", nmdc.WorkflowExecution, fields=["id", "type"])
>>> for col in execs.columns:
...     print(col)
id
type
id_WorkflowExecution
type_WorkflowExecution
class nmdc_client.NmdcClient(base_url: str = 'https://api.microbiomedata.org', max_page_size: int = 1000, sleep_seconds: float = 0.5, timeout: float = 30, verbose: bool = False, links: Dict[str, Tuple[List[str], List[str]]] | None = None)

Bases: object

Class to manage calls to the NMDC API service.

Parameters:
  • base_url – The base URL of the NMDC API.

  • max_page_size – The maximum number of objects to retrieve per request.

  • sleep_seconds – The number of seconds to wait between API requests when paginating.

  • timeout – The number of seconds to wait for an API request to complete before raising an exception.

  • verbose – Whether to print URLs of API requests.

  • links – A dictionary of links between objects in the NMDC schema. Defaults to None which fetches and caches the links on the first call to related_ids() or merge_related().

Example:

>>> client = NmdcClient(max_page_size=100)
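
The interplay of max_page_size, sleep_seconds and a result limit can be illustrated with a small offline sketch. This is not the client's actual implementation: the `paginate` helper, the `fake_fetch_page` stand-in for an API request, and the token-based paging scheme are all assumptions made for illustration.

```python
import time
from typing import Callable, Dict, List, Optional, Tuple

Item = Dict[str, str]
Page = Tuple[List[Item], Optional[str]]  # (items, token for the next page or None)


def paginate(
    fetch_page: Callable[[int, Optional[str]], Page],
    max_page_size: int = 1000,
    sleep_seconds: float = 0.5,
    limit: Optional[int] = None,
) -> List[Item]:
    """Collect items page by page, pausing between requests."""
    items: List[Item] = []
    token: Optional[str] = None
    while limit is None or len(items) < limit:
        # Never request more than is still needed to satisfy the limit.
        remaining = max_page_size if limit is None else limit - len(items)
        page, token = fetch_page(min(max_page_size, remaining), token)
        items.extend(page)
        if token is None or (limit is not None and len(items) >= limit):
            break
        time.sleep(sleep_seconds)  # wait before asking for the next page
    return items


# Hypothetical in-memory "server" holding seven items.
DATA: List[Item] = [{"id": f"item-{i}"} for i in range(7)]
calls: List[Tuple[int, int]] = []


def fake_fetch_page(size: int, token: Optional[str]) -> Page:
    start = int(token) if token is not None else 0
    calls.append((start, size))
    next_start = start + size
    next_token = str(next_start) if next_start < len(DATA) else None
    return DATA[start:next_start], next_token


result = paginate(fake_fetch_page, max_page_size=3, sleep_seconds=0.0)
print(len(result), calls)
```

A smaller page size means more requests for the same data, and each extra request adds one sleep_seconds pause to the total retrieval time.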
database_collections_for_type(cls: Type[SchemaClass]) → List[str]

Find the database collection names for a given schema class, if any.

Parameters:

cls – The schema class to find the collection name for. Must be a subclass of nmdc_schema.nmdc.NamedThing.

Returns:

The list of collection names containing items of that class or subclass, or an empty list if none are found.
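
The "class or subclass" rule can be sketched with a toy class hierarchy. The classes and the `COLLECTIONS` mapping below are simplified stand-ins chosen for illustration, not the real schema or the library's lookup code.

```python
from typing import Dict, List, Type


class NamedThing:
    """Toy root class standing in for nmdc_schema.nmdc.NamedThing."""


class PlannedProcess(NamedThing):
    """Toy intermediate class with two concrete subclasses."""


class WorkflowExecution(PlannedProcess):
    pass


class DataGeneration(PlannedProcess):
    pass


class Study(NamedThing):
    pass


class Unstored(NamedThing):
    """A class that no collection stores."""


# Hypothetical mapping from collection name to the class stored in it.
COLLECTIONS: Dict[str, Type[NamedThing]] = {
    "study_set": Study,
    "workflow_execution_set": WorkflowExecution,
    "data_generation_set": DataGeneration,
}


def collections_for_type(cls: Type[NamedThing]) -> List[str]:
    """Return collections whose stored class is cls or a subclass of cls."""
    return sorted(
        name for name, stored in COLLECTIONS.items() if issubclass(stored, cls)
    )


print(collections_for_type(PlannedProcess))
print(collections_for_type(Unstored))
```

Asking for a parent class returns every collection that holds one of its subclasses, while a class with no matching collection yields an empty list.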

fetch_codes()

Fetch and cache typecodes from the NMDC typecode API.

fetch_links(force)

Construct and cache “major” links between objects in the NMDC schema.

This function constructs links between objects in the NMDC schema and caches them in a pickle file. If the links have already been constructed and cached, they are loaded from the pickle file.

The “major” links through the NMDC schema objects follow this directional flow. The (process)* notation represents a process that may be repeated zero or more times.

Study (→ Study)* → Biosample (→ MaterialProcessing → ProcessedSample)* → DataGeneration → DataObject → (WorkflowExecution → DataObject)*

Parameters:

force – Whether to force the construction of links even if they have already been cached.
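
The load-or-build behaviour described above, including the effect of force, can be sketched as follows. This is an illustration under stated assumptions rather than the library's code: the `load_or_build` helper, the `build_links` stand-in, the cache file name and the toy link data are all hypothetical.

```python
import pickle
import tempfile
from pathlib import Path
from typing import Callable, Dict, List, Tuple

Links = Dict[str, Tuple[List[str], List[str]]]


def load_or_build(cache_path: Path, build: Callable[[], Links], force: bool = False) -> Links:
    """Return cached links if present, otherwise build them and cache the result."""
    if cache_path.exists() and not force:
        with cache_path.open("rb") as handle:
            return pickle.load(handle)
    links = build()
    with cache_path.open("wb") as handle:
        pickle.dump(links, handle)
    return links


build_calls = 0


def build_links() -> Links:
    """Stand-in for the expensive step that would query the API."""
    global build_calls
    build_calls += 1
    return {"sample-1": (["study-1"], ["run-1"])}


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "links.pkl"
    first = load_or_build(path, build_links)              # builds and writes the cache
    second = load_or_build(path, build_links)             # served from the pickle file
    third = load_or_build(path, build_links, force=True)  # rebuilt despite the cache

print(build_calls)
```

Only the first call and the forced call run the expensive build step; the second call reads the pickle file.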

find(cls: Type[SchemaClass] | str, query: Dict[str, Any] | None = None, fields: List[str] | None = None, limit: int | None = None) → DataFrame

Retrieve a DataFrame of objects of the specified class from the NMDC schema.

Parameters:
  • cls – The class type or collection of the objects to retrieve. Must be a subclass of nmdc_schema.nmdc.NamedThing or the name of an NMDC database collection.

  • query – A dictionary representing the query to filter the objects. Defaults to None which returns all objects.

  • fields – A list of fields to include in the result. Defaults to None which returns all fields.

  • limit – The maximum number of objects to retrieve. Defaults to None which retrieves all matching objects.

Returns:

A DataFrame containing the retrieved objects.

find_dict(cls: Type[SchemaClass] | str, query: Dict[str, Any] | None = None, fields: List[str] | None = None, limit: int | None = None) → List[Dict[str, Any]]

Retrieve a list of dictionaries of objects of the specified class from the NMDC schema.

Parameters:
  • cls – The class type or collection of the objects to retrieve. Must be a subclass of nmdc_schema.nmdc.NamedThing or the name of an NMDC database collection.

  • query – A dictionary representing the query to filter the results. Defaults to None which returns all objects.

  • fields – A list of fields to include in the result. Defaults to None which returns all fields.

  • limit – The maximum number of objects to retrieve. Defaults to None which retrieves all matching objects.

Returns:

A list of dictionaries representing the retrieved objects.

find_full(cls: Type[SchemaClass], query: Dict[str, Any] | None = None, limit: int | None = None) → List[SchemaClass]

Retrieve a list of full objects of the specified class from the NMDC schema.

This function queries the NMDC API to retrieve objects of the specified class, converts the resulting dictionaries to instances of the specified schema class, and returns them as a list.

Parameters:
  • cls – The class type of the objects to retrieve. Must be a subclass of nmdc_schema.nmdc.NamedThing.

  • query – A dictionary representing the query to filter the results. Defaults to None which returns all objects.

  • limit – The maximum number of objects to retrieve. Defaults to None which retrieves all matching objects.

Returns:

A list of instances of the specified class type.
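
The dictionary-to-object conversion step can be pictured with a toy dataclass. `ToyStudy` and `to_objects` are hypothetical names used only for this sketch; the real schema classes come from nmdc_schema and carry many more fields.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional, Type, TypeVar

T = TypeVar("T")


@dataclass
class ToyStudy:
    """Hypothetical stand-in for a schema class such as nmdc.Study."""

    id: str
    title: Optional[str] = None


def to_objects(cls: Type[T], records: List[Dict[str, Any]]) -> List[T]:
    """Build one instance of cls from each dictionary of field values."""
    return [cls(**record) for record in records]


# Toy dictionaries shaped like the output of a dictionary-returning query.
records = [{"id": "study-1", "title": "Toy study"}, {"id": "study-2"}]
studies = to_objects(ToyStudy, records)
print(studies[0].title, studies[1].title)
```

Working with instances rather than dictionaries gives attribute access and IDE autocompletion, at the cost of requiring that each record fits the class definition.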

follow_links(root_id, direction, steps=None)

Find all object identifiers linked to the specified object identifier.

This function finds all object identifiers linked to the specified object identifier by traversing the links in the specified direction.

Parameters:
  • root_id – The object identifier to start from.

  • direction – The direction to traverse the links. Use SearchDirection.BACK for incoming links and SearchDirection.FORWARD for outgoing links.

  • steps – The maximum number of steps to follow. Defaults to None which follows all links.

Returns:

A list of object identifiers linked to the specified object identifier.

Example:

>>> from nmdc_client import NmdcClient, SearchDirection
>>> client = NmdcClient()
>>> linked = client.follow_links('nmdc:bsm-13-7qxjvr77', SearchDirection.FORWARD, 1)
>>> linked.sort()
>>> for item_id in linked:
...     print(item_id)
nmdc:omprc-11-z841e208
nmdc:omprc-13-359qhn38
nmdc:omprc-13-sdcsk511
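
The effect of direction and steps can be sketched as a breadth-first traversal over a small in-memory link table. Everything here is illustrative: the toy identifiers, the `follow` helper, and the assumed layout of the table (each ID mapping to a pair of incoming and outgoing IDs) are not taken from the library.

```python
from collections import deque
from enum import Enum
from typing import Deque, Dict, List, Optional, Tuple


class SearchDirection(Enum):
    """Local stand-in mirroring the documented enum values."""

    BACK = 0
    FORWARD = 1


# Assumed layout: id -> (incoming ids, outgoing ids). Toy data only.
LINKS: Dict[str, Tuple[List[str], List[str]]] = {
    "study-1": ([], ["sample-1"]),
    "sample-1": (["study-1"], ["run-1", "run-2"]),
    "run-1": (["sample-1"], ["file-1"]),
    "run-2": (["sample-1"], []),
    "file-1": (["run-1"], []),
}


def follow(root_id: str, direction: SearchDirection, steps: Optional[int] = None) -> List[str]:
    """Collect ids reachable from root_id in one direction, up to `steps` hops."""
    seen = {root_id}
    found: List[str] = []
    queue: Deque[Tuple[str, int]] = deque([(root_id, 0)])
    while queue:
        current, depth = queue.popleft()
        if steps is not None and depth >= steps:
            continue  # do not expand nodes that are already `steps` hops away
        # In this sketch the enum value doubles as the index into the pair.
        for neighbour in LINKS.get(current, ([], []))[direction.value]:
            if neighbour not in seen:
                seen.add(neighbour)
                found.append(neighbour)
                queue.append((neighbour, depth + 1))
    return found


print(follow("sample-1", SearchDirection.FORWARD, 1))
print(follow("file-1", SearchDirection.BACK))
```

With steps set to 1 only direct neighbours are returned; leaving steps as None follows the chain until no new identifiers are found.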
lookup(cls: Type[SchemaClass] | str, ids: List[str], fields: List[str] | None = None) → DataFrame

Retrieve a DataFrame of objects by their identifiers.

Parameters:
  • cls – The class type or collection of the objects to retrieve. Must be a subclass of nmdc_schema.nmdc.NamedThing or the name of an NMDC database collection.

  • ids – A list of object identifiers.

  • fields – A list of fields to include in the result. Defaults to None which includes all fields.

Returns:

A DataFrame representing the retrieved objects.

lookup_dict(cls: Type[SchemaClass] | str, ids: List[str], fields: List[str] | None = None) → List[Dict[str, Any]]

Retrieve a list of dictionaries of objects by their identifiers.

Parameters:
  • cls – The class type or collection of the objects to retrieve. Must be a subclass of nmdc_schema.nmdc.NamedThing or the name of an NMDC database collection.

  • ids – A list of object identifiers.

  • fields – A list of fields to include in the result. Defaults to None which includes all fields.

Returns:

A list of dictionaries representing the retrieved objects.

lookup_full(cls: Type[SchemaClass] | str, ids: List[str]) → List[SchemaClass]

Retrieve a list of schema objects by their identifiers.

Parameters:
  • cls – The class type or collection of the objects to retrieve. Must be a subclass of nmdc_schema.nmdc.NamedThing or the name of an NMDC database collection.

  • ids – A list of object identifiers.

Returns:

A list of instances of the specified class type.

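merge_related(data, id_column, cls, fields=None)

Merge related objects of the specified class into the specified DataFrame.

For each ID in id_column, this function finds the related objects of the specified class and merges their fields into the matching rows. In the user guide examples, a merged column whose name already exists in the DataFrame is suffixed with the class name (for example, id_Study).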

Parameters:
  • data – The DataFrame to merge related objects into.

  • id_column – The name of the column containing the object IDs.

  • cls – The type of objects to merge into the DataFrame.

  • fields – A list of fields to include in the result. Defaults to None which includes all fields.

Returns:

A DataFrame containing the merged data.
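
The merge can be pictured as a join driven by (source ID, related ID) pairs. The sketch below uses plain lists of dictionaries instead of DataFrames so that it runs without dependencies; `merge_rows`, the toy data, and the choice to drop rows that have no related object are assumptions for illustration and may differ from what the library does.

```python
from typing import Dict, List, Tuple

Row = Dict[str, str]


def merge_rows(
    data: List[Row],
    id_column: str,
    pairs: List[Tuple[str, str]],
    related: List[Row],
    suffix: str,
) -> List[Row]:
    """Join each row to its related rows, suffixing clashing column names."""
    related_by_id = {row["id"]: row for row in related}
    merged: List[Row] = []
    for row in data:
        targets = [target for source, target in pairs if source == row[id_column]]
        for target_id in targets:
            combined = dict(row)
            for key, value in related_by_id[target_id].items():
                # Rename only the columns that would clash with existing ones.
                name = f"{key}_{suffix}" if key in row else key
                combined[name] = value
            merged.append(combined)
    return merged


# Toy inputs, not real NMDC data.
samples = [{"id": "sample-1", "name": "Soil A"}, {"id": "sample-2", "name": "Soil B"}]
pairs = [("sample-1", "study-1")]
studies = [{"id": "study-1", "title": "Toy study"}]

merged = merge_rows(samples, "id", pairs, studies, suffix="Study")
print(list(merged[0]))
```

The resulting column order (id, name, id_Study, title) mirrors the user guide example, where only the clashing id column receives the class-name suffix.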

related_ids(ids: List[str], cls: Type[SchemaClass]) → List[Tuple[str, str]]

Find related object identifiers of the specified class for the specified IDs.

This function finds related objects of the specified class for the specified IDs, and returns a list of (source ID, related ID) tuples, one for each related object that matches the specified class.

Parameters:
  • ids – A list of object IDs.

  • cls – The type of related objects to find.

Returns:

A list of tuples mapping each source ID to a related target ID.

Example:

>>> from nmdc_client import NmdcClient
>>> from nmdc_schema import nmdc
>>> client = NmdcClient()
>>> related = client.related_ids(['nmdc:bsm-13-7qxjvr77'], nmdc.WorkflowExecution)
>>> related.sort()
>>> for source_id, target_id in related:
...     print(source_id, target_id)
nmdc:bsm-13-7qxjvr77 nmdc:wfmb-13-w61ppf20.1
nmdc:bsm-13-7qxjvr77 nmdc:wfmgan-11-szz9bq42.1
nmdc:bsm-13-7qxjvr77 nmdc:wfmgas-13-a7e90z13.1
nmdc:bsm-13-7qxjvr77 nmdc:wfmp-11-hpexdy53.1
nmdc:bsm-13-7qxjvr77 nmdc:wfrbt-13-8z2h4m87.1
nmdc:bsm-13-7qxjvr77 nmdc:wfrqc-13-zntcxa44.1
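
Because the result is a flat list of pairs, a per-source lookup table has to be built by the caller when one is needed. A minimal sketch, using two of the pairs printed above as literal input so that it runs without network access:

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Pairs copied from the example output above; no API call is made here.
related: List[Tuple[str, str]] = [
    ("nmdc:bsm-13-7qxjvr77", "nmdc:wfmb-13-w61ppf20.1"),
    ("nmdc:bsm-13-7qxjvr77", "nmdc:wfmgan-11-szz9bq42.1"),
]

# Group the related IDs under the source ID they belong to.
by_source: Dict[str, List[str]] = defaultdict(list)
for source_id, target_id in related:
    by_source[source_id].append(target_id)

print(dict(by_source))
```

The flat pair format is convenient for joining onto a table of source rows, while the grouped form suits per-object lookups.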
class nmdc_client.SearchDirection

Bases: Enum

Enumeration to specify the direction to follow linkages among objects in the schema. See NmdcClient.fetch_links() for details.

BACK = 0
FORWARD = 1