<p><em>John A. Bachman, johnbachman.net: Systems and computational biology, programming, and statistics.</em></p>
<p><strong>We need to talk about how we talk about protein families and complexes</strong> (2018-07-08)</p>
<p><em>(Our paper on FamPlex, a semantic resource for improving text mining and
biocuration for protein families and complexes, is available at BMC
Bioinformatics</em> <a class="reference external" href="https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-018-2211-5">here.</a>
<em>Synopsis follows below.)</em></p>
<p>When <a class="reference external" href="http://scholar.harvard.edu/bgyori">Ben Gyori</a> and I
started working with natural language processing (NLP) systems as part of the
<a class="reference external" href="https://www.darpa.mil/program/big-mechanism">DARPA Big Mechanism program</a>,
we found that all the systems we worked with had a common problem: they were
frequently unable to correctly and uniformly identify common protein families
and complexes.</p>
<p>When I say "identify," I mean assign a database identifier (e.g., Uniprot ID,
HGNC ID, Gene Ontology ID, etc.) to a text string denoting a protein family or
complex (e.g., <tt class="docutils literal"><span class="pre">NF-kappaB</span></tt>, a complex, or <tt class="docutils literal">Ras</tt>, a gene family). In the NLP
world, this process is called <a class="reference external" href="https://en.wikipedia.org/wiki/Entity_linking">"named entity linking",</a> "named entity normalization",
or simply "grounding."</p>
<p>Why is this important? Like many others in the systems/computational biology
community, we are interested in <strong>mining the scientific literature for
mechanistic information that we can use to analyze data and build
predictive, explanatory models.</strong> The problem in a nutshell is that scientists
often write and talk about biological mechanisms in terms of protein families
and functional complexes, whereas biological datasets are invariably expressed
in terms of the abundances or activities of specific <cite>genes or proteins</cite>. So if
we are going to make use of the (large!) amount of information expressed in
terms of families and complexes, we have to:
<ol class="arabic simple">
<li><cite>Recognize and ground</cite> these terms to standard identifiers, and</li>
<li><cite>Link</cite> these family/complex identifiers to their gene/protein-level
constituents.</li>
</ol>
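<p>In code, the two steps might look like the following sketch; the mapping tables and identifiers here are purely illustrative, not the actual FamPlex entries:</p>

```python
# Hypothetical grounding and member tables; the identifiers are
# illustrative stand-ins, not the actual FamPlex entries.
GROUNDING_MAP = {
    'NF-kappaB': ('FPLX', 'NFkappaB'),
    'NFkappaB': ('FPLX', 'NFkappaB'),
    'Ras': ('FPLX', 'RAS'),
}
FAMILY_MEMBERS = {
    ('FPLX', 'NFkappaB'): ['NFKB1', 'NFKB2', 'REL', 'RELA', 'RELB'],
    ('FPLX', 'RAS'): ['HRAS', 'KRAS', 'NRAS'],
}

def ground_and_link(text):
    """Step 1: ground the string; step 2: link to gene-level members."""
    identifier = GROUNDING_MAP.get(text)
    if identifier is None:
        return None, []
    return identifier, FAMILY_MEMBERS.get(identifier, [])
```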
<p>So why is this difficult for NLP algorithms in practice? Having looked at the
types of errors made by two different reading systems, <a class="reference external" href="https://github.com/clulab/reach">REACH</a> (developed by Mihai Surdeanu's group at the
University of Arizona) and <a class="reference external" href="http://trips.ihmc.us/parser/">TRIPS</a> (developed
by the Institute for Human and Machine Cognition), we concluded that the problem
was not <cite>primarily</cite> with the NLP systems and their grounding algorithms, but
rather with the lack of uniform resources for grounding and linking families and
complexes.</p>
<p>For example, sometimes family/complex entities in text came back from the
machine readers with no associated identifiers at all. The reason was usually
straightforward: the databases indexed by the reading systems for grounding
either didn't have much coverage of families and complexes, or those databases
lacked the lexical synonyms necessary for accurate matching. For example,
REACH, which indexed Uniprot, InterPro, Pfam, HMDB, ChEBI, Gene Ontology, MeSH,
and other ontologies, found no grounding for the string <tt class="docutils literal">NFkappaB</tt>,
one of the most frequently occurring in our corpus. In fact, in a corpus of
~215,000 articles (a mix of full texts and abstracts), we found that seven of
the ten most frequently occurring <cite>ungrounded</cite> entities were families and
complexes!</p>
<div class="center figure">
<img alt="FamPlex Table 4." src="images/famplex_table4.png" style="width: 800px;" />
<p class="caption">Most frequently occurring ungrounded entity texts, with and without
FamPlex. Families and complexes ungrounded without FamPlex are
<tt class="docutils literal"><span class="pre">NF-kappaB</span></tt>, <tt class="docutils literal">ERK1/2</tt>, <tt class="docutils literal">mTORC1</tt>, <tt class="docutils literal">NFkappaB</tt>, <tt class="docutils literal">PDGF</tt>, <tt class="docutils literal">IKK</tt>, and
<tt class="docutils literal">histone H3</tt>.</p>
</div>
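<p>Tallying ungrounded strings from reader output is straightforward; here is a sketch using a hypothetical list of (entity text, grounding) pairs:</p>

```python
from collections import Counter

# Hypothetical reader output: (entity text, grounding) pairs, with None
# for entities the reader could not ground.
extractions = [
    ('NF-kappaB', None), ('ERK1/2', None), ('NF-kappaB', None),
    ('TP53', 'HGNC:11998'), ('mTORC1', None),
]

# Count only the entity texts that came back without a grounding
ungrounded = Counter(text for text, grounding in extractions
                     if grounding is None)
```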
<p>In other cases, family/complex names were <cite>incorrectly</cite> grounded to specific
genes due to spurious exact matches in unexpected places. For example, <tt class="docutils literal">ERK</tt>,
the common name for the MAPK1/MAPK3 gene family, was incorrectly grounded to
<a class="reference external" href="https://www.uniprot.org/uniprot/P29323#names_and_taxonomy">EPHB2</a> due to
<tt class="docutils literal">ERK</tt> being listed in Uniprot as a synonym for that gene. Similarly, human
gene families were sometimes grounded to the single ortholog of the family in a
different organism. My personal favorite was the grounding of <tt class="docutils literal">AKT</tt> to the
<cite>Dictyostelium</cite> (slime mold) gene <a class="reference external" href="https://www.uniprot.org/uniprot/P54644#names_and_taxonomy">pkbA</a> instead of the
human gene family consisting of the human genes AKT1/AKT2/AKT3.</p>
<div class="center figure">
<img alt="Uniprot entry for pkbA." src="images/pkba.png" style="width: 600px;" />
<p class="caption">Uniprot entry for the <cite>Dictyostelium</cite> gene <cite>pkbA</cite> has "akt" as a
synonym, causing a spurious match.</p>
</div>
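<p>The failure mode is easy to reproduce with a toy synonym table: a bare case-insensitive exact match over all synonyms, with no species filter or synonym prioritization, happily returns whatever entry matches first:</p>

```python
# Toy synonym table illustrating the failure mode; only these two
# real spurious synonyms (from the text above) are included.
SYNONYMS = [
    ('ERK', 'P29323', 'EPHB2', 'Homo sapiens'),
    ('akt', 'P54644', 'pkbA', 'Dictyostelium discoideum'),
]

def naive_ground(text):
    # Exact (case-insensitive) match, no species filtering or priority
    for synonym, uniprot_id, gene, species in SYNONYMS:
        if synonym.lower() == text.lower():
            return uniprot_id, gene, species
    return None

# naive_ground('ERK') yields EPHB2 rather than the MAPK1/MAPK3 family,
# and naive_ground('AKT') lands on a slime mold gene.
```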
<p>In our evaluations, we found that the TRIPS system did a bit better, finding a
higher percentage of matches using a different matching algorithm and a
different set of databases, in particular the <a class="reference external" href="https://ncit.nci.nih.gov/ncitbrowser/">NCI Thesaurus (NCIT)</a> and <a class="reference external" href="https://www.nextprot.org/">NextProt</a>. Here, though, we found problems with
resolving relationships to specific genes: 41% of the entities grounded to NCIT
did not have any gene members defined, making it difficult to use this
information in downstream data analysis.</p>
<p>More generally, among NLP event extraction systems <cite>and</cite> biocuration projects
there seemed to be a complete lack of consistency in the resources used
to identify families and complexes. Among the sources we encountered,
there was literally no overlap!</p>
<table border="1" class="docutils">
<colgroup>
<col width="29%" />
<col width="15%" />
<col width="55%" />
</colgroup>
<thead valign="bottom">
<tr><th class="head">Source</th>
<th class="head">Type</th>
<th class="head">Family/Complex Databases</th>
</tr>
</thead>
<tbody valign="top">
<tr><td>REACH</td>
<td>NLP</td>
<td>InterPro, Pfam, GO</td>
</tr>
<tr><td>TRIPS</td>
<td>NLP</td>
<td>NextProt, NCIT</td>
</tr>
<tr><td>MedScan</td>
<td>NLP</td>
<td>Medscan IDs, Enzyme codes</td>
</tr>
<tr><td>TEES</td>
<td>NLP</td>
<td>Homologene</td>
</tr>
<tr><td>BEL</td>
<td>Curation</td>
<td>Selventa families and complexes</td>
</tr>
<tr><td>Pathway Commons</td>
<td>Curation</td>
<td>(genes enumerated)</td>
</tr>
<tr><td>SIGNOR</td>
<td>Curation</td>
<td>SIGNOR families and complexes</td>
</tr>
<tr><td>Reactome</td>
<td>Curation</td>
<td>Reactome families and complexes</td>
</tr>
<tr><td>EMBO Sourcedata</td>
<td>Curation</td>
<td>Ungrounded or genes enumerated</td>
</tr>
</tbody>
</table>
<p>This was a problem for us, because we were interested primarily in aggregating
and assembling information from both text mining and curated databases.</p>
<p>So...we started curating identifiers ourselves, defining IDs for protein
families and complexes, linking in the synonyms that we were seeing in the
literature, mapping them to the identifiers in both the protein databases and
pathway resources, and defining the hierarchical relationships between
complexes, families and their members. Mindful of <a class="reference external" href="https://xkcd.com/927/">how standards proliferate</a>, <strong>our goal was not to supplant the existing
resources but instead provide an extensible "bridging resource" for NLP
developers and biocurators to ground and link the most commonly occurring
entities and thereby combine information from multiple sources.</strong></p>
<p>The project grew from a handful of CSV files intended for internal use to a
fairly robust resource that improved grounding significantly enough that
several of the NLP teams in the Big Mechanism program started to use it. We
named it "FamPlex", for "Families, Complexes and their Lexicalizations". It
consists of a set of files specifying identifiers for 441 human protein
families and complexes, their synonyms, gene-level constituents, and equivalent
identifiers in other resources.</p>
<div class="center figure">
<img alt="Structure of the FamPlex resource." src="images/famplex_fig1a.png" style="width: 600px;" />
<p class="caption">Structure of the FamPlex resource.</p>
</div>
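<p>Because FamPlex is distributed as plain CSV files, loading it requires nothing beyond the standard library. The five-column layout assumed below (namespace, id, relation, namespace, id) is an illustration; check the files in the repository for the actual format:</p>

```python
import csv

def load_relations(path):
    # Assumed layout: namespace, id, relation, namespace, id per row;
    # see the FamPlex repository for the actual file format.
    relations = []
    with open(path, newline='', encoding='utf-8') as f:
        for ns1, id1, rel, ns2, id2 in csv.reader(f):
            assert rel in ('isa', 'partof')
            relations.append(((ns1, id1), rel, (ns2, id2)))
    return relations
```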
<p>We also included a curated list of gene/protein affixes annotated with their
semantic meaning, useful for normalizing entity names and correctly interpreting
extracted events.</p>
<div class="center figure">
<img alt="Gene/protein prefixes in FamPlex." src="images/famplex_table2.png" style="width: 400px;" />
<p class="caption">Gene/protein prefixes in FamPlex.</p>
</div>
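<p>Such an affix list can drive a simple normalization step; the affixes and meanings below are hypothetical stand-ins for the curated entries:</p>

```python
# Hypothetical affix table; the real FamPlex list annotates each affix
# with its curated semantic meaning.
PREFIXES = {
    'p': 'phosphorylated',
    'GST-': 'GST fusion tag',
}

def strip_affix(text):
    """Split an entity string into (base name, affix meaning or None)."""
    for prefix, meaning in PREFIXES.items():
        if text.startswith(prefix) and len(text) > len(prefix):
            return text[len(prefix):], meaning
    return text, None
```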
<p>We realized fairly early on that a simple two-layer mapping between
families/complexes and their members would not be sufficient to capture the
range of entities described in the literature. Articles frequently referred not
just to families/complexes and their gene-level members, but often to
<cite>intermediate groupings</cite> of entities, such as a class of subunits forming a
part of a functional complex. To handle this, we defined two relationships,
<tt class="docutils literal">isa</tt> and <tt class="docutils literal">partof</tt>, that could be nested to define the relationships within
a family/complex, as for example with AMPK (a heterotrimer consisting of
different combinations of genes drawn from three subunit families, alpha, beta,
and gamma). We further defined synonyms for each element in the hierarchy to
help NLP systems extract information about mechanisms at any level.</p>
<div class="center figure">
<img alt="Hierarchical structure of FamPlex relationships." src="images/ampk.png" style="width: 400px;" />
<p class="caption">Hierarchical structure of FamPlex relationships, shown here for AMPK.</p>
</div>
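<p>Given <tt class="docutils literal">isa</tt>/<tt class="docutils literal">partof</tt> relations, resolving an entity to its gene-level constituents is a small recursive walk. The AMPK subunit genes below are real, but the family/complex names are illustrative rather than the actual FamPlex identifiers:</p>

```python
# Relations written as (child, relation, parent), following the AMPK
# example: subunit genes belong to families, families to the complex.
RELATIONS = [
    ('PRKAA1', 'isa', 'AMPK_alpha'), ('PRKAA2', 'isa', 'AMPK_alpha'),
    ('PRKAB1', 'isa', 'AMPK_beta'), ('PRKAB2', 'isa', 'AMPK_beta'),
    ('PRKAG1', 'isa', 'AMPK_gamma'), ('PRKAG2', 'isa', 'AMPK_gamma'),
    ('PRKAG3', 'isa', 'AMPK_gamma'),
    ('AMPK_alpha', 'partof', 'AMPK'), ('AMPK_beta', 'partof', 'AMPK'),
    ('AMPK_gamma', 'partof', 'AMPK'),
]

def gene_members(entity):
    """Recursively resolve an entity to its gene-level members."""
    children = [c for c, rel, p in RELATIONS if p == entity]
    if not children:
        return [entity]  # a leaf, i.e. a gene
    members = []
    for child in children:
        members.extend(gene_members(child))
    return members
```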
<p>In sharing and publishing this resource with the community we sought to
quantify the improvements attributable to the incorporation of FamPlex in an
NLP or biocuration setting. We compared grounding accuracy for two readers
(TRIPS and REACH) compiled with and without FamPlex. For REACH, the performance
improvement was especially significant, with accuracy for protein families and
complexes rising from 15% to 71%.</p>
<div class="center figure">
<img alt="Grounding accuracy with/without FamPlex." src="images/famplex_fig3.png" style="width: 300px;" />
<p class="caption">Improvements in grounding accuracy for proteins/genes and
families/complexes, with and without the use of FamPlex.</p>
</div>
<p>To get an estimate of the coverage of the resource relevant to a real-world
biocuration project, we also quantified the proportion of family/complex-level
annotations curated in the EMBO Sourcedata dataset that had relevant identifiers
in FamPlex and obtained figures of 80-90%.</p>
<p>We also found empirically that the hierarchical structure of FamPlex allowed us
to identify relationships between events extracted from families/complexes,
genes, and intermediate levels. For some entities, for example
Activin (<a class="reference external" href="https://en.wikipedia.org/wiki/Activin_and_inhibin#Activin">a class of protein complexes belonging to the TGF-beta superfamily</a>), the
intermediate levels of representation (e.g., Activin A, Activin B, and Activin
AB) were more frequently mentioned in extracted events than either genes or the
top-level category!</p>
<div class="center figure">
<img alt="Hierarchical resolution of entities in FamPlex." src="images/famplex_fig4b.png" style="width: 350px;" />
<p class="caption">The proportion of groundings at gene-level, intermediate-level, or
top-level entities for five multi-level families/complexes in FamPlex.</p>
</div>
<p>Amid a push to facilitate data integration and <a class="reference external" href="https://www.genenames.org/useful/journals">standardize nomenclature for
human genes</a> in the published
literature, we speculate that FamPlex could be used to annotate text and data
dealing with functional complexes and gene families (e.g., the results of
antibody-based experiments involving multiple family members).</p>
<p>For more on the details of the construction and evaluation of FamPlex, see the
<a class="reference external" href="https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-018-2211-5">paper!</a>
We've made FamPlex available under a <a class="reference external" href="https://creativecommons.org/choose/zero/">CC0 license</a> so that people can extend, remix,
and combine FamPlex with other resources as necessary. We hope that this will
be a useful tool for the computational biology and biocuration community. Feel
free to ask questions or suggest additions via the <a class="reference external" href="https://github.com/sorgerlab/famplex/issues">Issues page on Github</a>.</p>
<p>Links:</p>
<ul class="simple">
<li>Paper: <a class="reference external" href="https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-018-2211-5">https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-018-2211-5</a></li>
<li>Code (GitHub): <a class="reference external" href="https://github.com/sorgerlab/famplex">https://github.com/sorgerlab/famplex</a></li>
<li>Code for paper: <a class="reference external" href="https://github.com/sorgerlab/famplex_paper">https://github.com/sorgerlab/famplex_paper</a></li>
<li>Browse FamPlex at NCBO BioPortal: <a class="reference external" href="http://purl.bioontology.org/ontology/FPLX">http://purl.bioontology.org/ontology/FPLX</a></li>
<li>identifiers.org prefix: <a class="reference external" href="https://www.ebi.ac.uk/miriam/main/datatypes/MIR:00000651">fplx</a></li>
</ul>
<p><strong>Building a Python 2/3 compatible Unicode Sandwich</strong> (2017-03-10)</p>
<p>So you've decided that your code needs to be compatible with both Python 2 and
Python 3. Most likely, you're upgrading your Python 2 code to work in Python 3,
and know that you need to do things like:</p>
<ul class="simple">
<li>Replace all calls to <tt class="docutils literal">print</tt> with <tt class="docutils literal">print()</tt></li>
<li>Use absolute rather than relative imports</li>
<li>Call <tt class="docutils literal">dict.items()</tt> instead of <tt class="docutils literal">dict.iteritems()</tt></li>
<li>etc.</li>
</ul>
<p>But for me, and perhaps for you as well, by far the biggest and most
complicated issue in getting code to be jointly Python 2/3 compatible is
maintaining the <strong>Unicode Sandwich.</strong> If you don't know what a Unicode Sandwich
is, please see these slides by Ned Batchelder: <a class="reference external" href="https://nedbatchelder.com/text/unipain/unipain.html#1">Pragmatic Unicode, or How do I
stop the Pain?</a> The
idea is that in the 21st century, all text inside your application should be
Unicode, with conversions to and from various specific character encodings at
the margins.</p>
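<p>The whole idea fits in a few lines: decode once on the way in, work only in Unicode, encode once on the way out:</p>

```python
# The sandwich: decode bytes at the input boundary, work in Unicode
# internally, encode back to bytes only at the output boundary.
def read_input(raw_bytes, encoding='utf-8'):
    return raw_bytes.decode(encoding)   # bottom slice: bytes -> text

def process(text):
    return text.upper()                 # the filling: Unicode only

def write_output(text, encoding='utf-8'):
    return text.encode(encoding)        # top slice: text -> bytes

out = write_output(process(read_input(b'caf\xc3\xa9')))
```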
<p>Implementing a Unicode Sandwich in Python 3 isn't too bad because Python 3
explicitly distinguishes between decoded Unicode text (of type <tt class="docutils literal">str</tt>) and
encoded characters (of type <tt class="docutils literal">bytes</tt>). This means that if you do a little bit
of extra work to always explicitly convert <tt class="docutils literal">bytes</tt> (that you might get from a
web service, text file, or some other source) into Unicode, and do the reverse
conversion when writing outputs, then voila, problem solved.</p>
<p>However, in Python 2 the <tt class="docutils literal">str</tt> type contains bytes, which are
implicitly converted (decoded) to Unicode when mixing bytes and Unicode. When
you have an application that deals with lots of external files, services, and
resources, some of which return Unicode content and others encoded bytes, along
with the string literals in your own code (which are bytes in Python 2 and
Unicode in Python 3), you have a recipe for confusion.</p>
<p><strong>Making a Unicode sandwich work in both Python 2 and 3 requires
systematically going through your code and enforcing the bytes/Unicode
conversions at all the appropriate places, using code that works for both
versions.</strong> The trick is that every data source and library is a little bit
different--some accept only bytes to their key functions, others only Unicode
strings, so the way of doing the appropriate conversions takes a little
figuring. What follows are some notes I compiled when rewriting our <a class="reference external" href="http://github.com/sorgerlab/indra">INDRA
software</a> (which deals with natural
language text and several different types of databases) to support Unicode in a
Python 2/3 compatible way. Hopefully these notes will point you in the right
direction if you are trying to do something similar.</p>
<div class="section" id="boilerplate-imports">
<h2>Boilerplate imports</h2>
<p>If you're going to maintain a Unicode sandwich, you'll need any strings that
you define in your code to be Unicode strings:</p>
<div class="highlight"><pre><span></span><span class="kn">from</span> <span class="nn">__future__</span> <span class="kn">import</span> <span class="n">unicode_literals</span>
</pre></div>
<p>You'll also want (at least) the following builtins in your imports in every file:</p>
<div class="highlight"><pre><span></span><span class="kn">from</span> <span class="nn">builtins</span> <span class="kn">import</span> <span class="nb">dict</span><span class="p">,</span> <span class="nb">str</span>
</pre></div>
<p>Redefining <tt class="docutils literal">dict</tt> and <tt class="docutils literal">str</tt> in this way in Python 2 causes them to behave
like the corresponding types in Python 3, e.g. <tt class="docutils literal">dict.items()</tt> returns a
generator rather than a list, and <tt class="docutils literal">str</tt> can be treated as a Unicode string
rather than a bytestring. Because of the redefinition of <tt class="docutils literal">str</tt>,
<tt class="docutils literal">isinstance(u'foo', str)</tt> will be <tt class="docutils literal">True</tt> in Python 2, so you can use the
same code for both Python 2 and 3. Also, this redefines the <tt class="docutils literal">str()</tt>
constructor so that when you convert an object to a <tt class="docutils literal">str</tt> in Python 2 (e.g.,
<tt class="docutils literal">str(5)</tt>) you'll end up with a Unicode-compatible string, not bytes.</p>
</div>
<div class="section" id="string-type-checking">
<h2>String type checking</h2>
<p>In Python 2, it's common to check whether an argument is a string by checking
<tt class="docutils literal">isinstance(foo, basestring)</tt>, because <tt class="docutils literal">basestring</tt> is the supertype of
both Python 2's <tt class="docutils literal">str</tt> and <tt class="docutils literal">unicode</tt> types. This is a handy way to, for
example, tell whether the argument to a function is a string or a list.
However, because <tt class="docutils literal">basestring</tt> doesn't exist in Python 3, this has to be
changed.</p>
<p>The best solution is to use Unicode everywhere in Python 2, importing <tt class="docutils literal">from
builtins import str</tt> (as recommended above) and then using <tt class="docutils literal">isinstance(foo,
str)</tt>, where you would have previously used <tt class="docutils literal">basestring</tt>. This is then
compatible with both Python 2 and 3.</p>
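<p>For example, a helper that accepts either a single string or a list of strings can then be written once for both versions:</p>

```python
# Works as-is on Python 3; on Python 2 it additionally relies on
# `from builtins import str` having been done, as recommended above.
def as_list(arg):
    """Wrap a lone string in a list; pass other iterables through."""
    if isinstance(arg, str):
        return [arg]
    return list(arg)
```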
<p>However, if your Unicode sandwich isn't completely airtight and there's a
possibility that <tt class="docutils literal">foo</tt> might be a Python 2 bytestring, then <tt class="docutils literal">isinstance(foo,
str)</tt> will be False when using the above approach, possibly leading to silent
failures. In this case you might want to stick with the following workaround
that retains <tt class="docutils literal">basestring</tt>:</p>
<div class="highlight"><pre><span></span><span class="c1"># Will be OK in Python 2</span>
<span class="k">try</span><span class="p">:</span>
<span class="nb">basestring</span>
<span class="c1"># Allows isinstance(foo, basestring) to work in Python 3</span>
<span class="k">except</span> <span class="ne">NameError</span><span class="p">:</span>
<span class="nb">basestring</span> <span class="o">=</span> <span class="nb">str</span>
</pre></div>
</div>
<div class="section" id="custom-str-methods">
<h2>Custom __str__ methods</h2>
<p>In Python 2, custom <tt class="docutils literal">__str__</tt> methods are expected to return a bytestring (Python
2 <tt class="docutils literal">str</tt>), not a Unicode string (<tt class="docutils literal">unicode</tt>). This can be a problem once your
objects contain only hard-won Unicode strings. Fortunately the <tt class="docutils literal">future</tt>
package contains a decorator, <a class="reference external" href="http://python-future.org/what_else.html#custom-str-methods">@python_2_unicode_compatible,</a> to make your
<tt class="docutils literal">__str__</tt> method work in both Python 2 and 3. However, make sure that you
apply the decorator only once in your object hierarchy, or you will get an
error (i.e., don't apply the decorator both in a superclass and a subclass).</p>
</div>
<div class="section" id="web-services-urllib-and-requests">
<h2>Web services: urllib and requests</h2>
<p>The structure of the <tt class="docutils literal">urllib</tt> library is very different between Python 2 and
3, and <tt class="docutils literal">urllib</tt> calls require much more care in converting bytes/unicode. Do
yourself a favor and rewrite any web service calls using <tt class="docutils literal">requests</tt> instead
of any of the <tt class="docutils literal">urllib</tt> methods. <tt class="docutils literal">requests</tt> takes a dict of query parameters
directly, eliminating the need to urlencode and/or UTF-8 encode request
content. The <tt class="docutils literal">response</tt> object returned by the <tt class="docutils literal">requests</tt> library gives you
access to both the underlying bytes (in <tt class="docutils literal">response.content</tt>) as well as a
decoded Unicode version (in <tt class="docutils literal">response.text</tt>). You can also get a JSON object
directly using <tt class="docutils literal">response.json()</tt>.</p>
<p>If you're stuck using <tt class="docutils literal">urllib</tt> for some reason, here are some things to note.
First, wrap your imports in a try/catch, e.g.:</p>
<div class="highlight"><pre><span></span><span class="c1"># Python 3 version</span>
<span class="k">try</span><span class="p">:</span>
<span class="kn">from</span> <span class="nn">urllib.request</span> <span class="kn">import</span> <span class="n">urlopen</span>
<span class="kn">from</span> <span class="nn">urllib.error</span> <span class="kn">import</span> <span class="n">HTTPError</span>
<span class="kn">from</span> <span class="nn">urllib.parse</span> <span class="kn">import</span> <span class="n">urlencode</span>
<span class="c1"># Python 2 version</span>
<span class="k">except</span> <span class="ne">ImportError</span><span class="p">:</span>
<span class="kn">from</span> <span class="nn">urllib</span> <span class="kn">import</span> <span class="n">urlencode</span>
<span class="kn">from</span> <span class="nn">urllib2</span> <span class="kn">import</span> <span class="n">urlopen</span><span class="p">,</span> <span class="n">HTTPError</span>
</pre></div>
<p>When calling <tt class="docutils literal">urlopen</tt>, the second argument must be bytes, so you'll need
to do:</p>
<div class="highlight"><pre><span></span><span class="n">result</span> <span class="o">=</span> <span class="n">urlopen</span><span class="p">(</span><span class="n">url</span><span class="p">,</span> <span class="n">data</span><span class="o">.</span><span class="n">encode</span><span class="p">(</span><span class="s1">'utf-8'</span><span class="p">))</span>
</pre></div>
<p>And when reading from the object, you'll need to decode back to Unicode:</p>
<div class="highlight"><pre><span></span><span class="n">response_text</span> <span class="o">=</span> <span class="n">result</span><span class="o">.</span><span class="n">read</span><span class="p">()</span><span class="o">.</span><span class="n">decode</span><span class="p">(</span><span class="s1">'utf-8'</span><span class="p">)</span>
</pre></div>
<p>In some cases you'll want to keep the response content in bytes form if you're
passing it to another library that expects only bytes. For example, <tt class="docutils literal">rdflib</tt>
and <tt class="docutils literal">json</tt> perform the bytes/unicode conversion internally (see below).</p>
</div>
<div class="section" id="reading-writing-csv-files">
<h2>Reading/writing CSV files</h2>
<p>There are a few differences in the procedure for reading and writing CSV files
between Python 2 and 3. In Python 3, the encoding should be specified when
opening the file in text mode, so that the csv reader/writer objects then
operate on Unicode strings. A <tt class="docutils literal"><span class="pre">newline=''</span></tt> argument is also required when opening
the file.</p>
<p>For Python 2, the <tt class="docutils literal">encoding</tt> and <tt class="docutils literal">newline</tt> arguments are not permitted, so
the file opening step has to occur in an alternative block. Also, the
<tt class="docutils literal">delimiter</tt> and <tt class="docutils literal">quotechar</tt> arguments can be Unicode in Python 3, but in
Python 2 they must be bytes (the <tt class="docutils literal">lineterminator</tt> argument does not need to
be encoded to bytes in Python 2, however).</p>
<p>Finally, in Python 2 the csv reader returns byte strings, so each field must be explicitly
decoded into Unicode. Here is an example that handles the complete process and
returns a generator. Note that the Python 2 version assumes that it is getting
Unicode strings as arguments (which is the case when using <tt class="docutils literal">unicode_literals</tt>
as I recommend), which is why they have to be encoded:</p>
<div class="highlight"><pre><span></span><span class="k">def</span> <span class="nf">read_unicode_csv</span><span class="p">(</span><span class="n">filename</span><span class="p">,</span> <span class="n">delimiter</span><span class="o">=</span><span class="s1">','</span><span class="p">,</span> <span class="n">quotechar</span><span class="o">=</span><span class="s1">'"'</span><span class="p">,</span>
<span class="n">quoting</span><span class="o">=</span><span class="n">csv</span><span class="o">.</span><span class="n">QUOTE_MINIMAL</span><span class="p">,</span> <span class="n">lineterminator</span><span class="o">=</span><span class="s1">'</span><span class="se">\n</span><span class="s1">'</span><span class="p">,</span>
<span class="n">encoding</span><span class="o">=</span><span class="s1">'utf-8'</span><span class="p">):</span>
<span class="c1"># Python 3 version</span>
<span class="k">if</span> <span class="n">sys</span><span class="o">.</span><span class="n">version_info</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">>=</span> <span class="mi">3</span><span class="p">:</span>
<span class="c1"># Open the file in text mode with given encoding</span>
<span class="c1"># Set newline arg to ''</span>
<span class="c1"># (see https://docs.python.org/3/library/csv.html)</span>
<span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="n">filename</span><span class="p">,</span> <span class="s1">'r'</span><span class="p">,</span> <span class="n">newline</span><span class="o">=</span><span class="s1">''</span><span class="p">,</span> <span class="n">encoding</span><span class="o">=</span><span class="n">encoding</span><span class="p">)</span> <span class="k">as</span> <span class="n">f</span><span class="p">:</span>
<span class="c1"># Next, get the csv reader, with unicode delimiter and quotechar</span>
            <span class="n">csv_reader</span> <span class="o">=</span> <span class="n">csv</span><span class="o">.</span><span class="n">reader</span><span class="p">(</span><span class="n">f</span><span class="p">,</span> <span class="n">delimiter</span><span class="o">=</span><span class="n">delimiter</span><span class="p">,</span>
                                    <span class="n">quotechar</span><span class="o">=</span><span class="n">quotechar</span><span class="p">,</span>
                                    <span class="n">quoting</span><span class="o">=</span><span class="n">quoting</span><span class="p">,</span>
                                    <span class="n">lineterminator</span><span class="o">=</span><span class="n">lineterminator</span><span class="p">)</span>
            <span class="c1"># Now, iterate over the (already decoded) csv_reader generator</span>
            <span class="k">for</span> <span class="n">row</span> <span class="ow">in</span> <span class="n">csv_reader</span><span class="p">:</span>
                <span class="k">yield</span> <span class="n">row</span>
    <span class="c1"># Python 2 version</span>
    <span class="k">else</span><span class="p">:</span>
        <span class="c1"># Open the file in bytes mode</span>
        <span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="n">filename</span><span class="p">,</span> <span class="s1">'rb'</span><span class="p">)</span> <span class="k">as</span> <span class="n">f</span><span class="p">:</span>
            <span class="c1"># Next, get the csv reader, passing delimiter and quotechar as</span>
            <span class="c1"># bytestrings rather than unicode</span>
            <span class="n">csv_reader</span> <span class="o">=</span> <span class="n">csv</span><span class="o">.</span><span class="n">reader</span><span class="p">(</span><span class="n">f</span><span class="p">,</span> <span class="n">delimiter</span><span class="o">=</span><span class="n">delimiter</span><span class="o">.</span><span class="n">encode</span><span class="p">(</span><span class="n">encoding</span><span class="p">),</span>
                                    <span class="n">quotechar</span><span class="o">=</span><span class="n">quotechar</span><span class="o">.</span><span class="n">encode</span><span class="p">(</span><span class="n">encoding</span><span class="p">),</span>
                                    <span class="n">quoting</span><span class="o">=</span><span class="n">quoting</span><span class="p">,</span>
                                    <span class="n">lineterminator</span><span class="o">=</span><span class="n">lineterminator</span><span class="p">)</span>
            <span class="c1"># Iterate over the file and decode each string into unicode</span>
            <span class="k">for</span> <span class="n">row</span> <span class="ow">in</span> <span class="n">csv_reader</span><span class="p">:</span>
                <span class="k">yield</span> <span class="p">[</span><span class="n">cell</span><span class="o">.</span><span class="n">decode</span><span class="p">(</span><span class="n">encoding</span><span class="p">)</span> <span class="k">for</span> <span class="n">cell</span> <span class="ow">in</span> <span class="n">row</span><span class="p">]</span>
</pre></div>
<p>Follow the corresponding procedure for writing CSV files.</p>
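<p>For writing, a mirror-image sketch (the function name and signature here are illustrative, not from the original code; the Python 3 branch lets the text-mode file object handle encoding, while the Python 2 branch encodes each cell by hand):</p>

```python
import csv
import sys

def write_unicode_csv(filename, rows, delimiter=u',', encoding='utf-8'):
    # Python 3 version: open in text mode with an explicit encoding;
    # the csv module reads and writes str (unicode) directly
    if sys.version_info[0] >= 3:
        with open(filename, 'w', encoding=encoding, newline='') as f:
            csv_writer = csv.writer(f, delimiter=delimiter)
            for row in rows:
                csv_writer.writerow(row)
    # Python 2 version: open in bytes mode, pass the delimiter as a
    # bytestring, and encode each cell before writing
    else:
        with open(filename, 'wb') as f:
            csv_writer = csv.writer(f, delimiter=delimiter.encode(encoding))
            for row in rows:
                csv_writer.writerow([cell.encode(encoding) for cell in row])
```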
</div>
<div class="section" id="pickling">
<h2>Pickling</h2>
<p>Pickling and unpickling must always be done with files opened explicitly in
binary mode. To maintain compatibility of pickled files with both Python 2 and
3, pickle files should be generated with protocol level 2, i.e., by
<tt class="docutils literal">pickle.dump(foo, fp, protocol=2)</tt>. These files can be opened by Python 2 as
well as 3.</p>
<p>If there are pre-existing pickle files generated by Python 2 that need to be
openable by Python 3, there is an optional <tt class="docutils literal">encoding</tt> argument to
<tt class="docutils literal">pickle.load</tt> that tells Python 3 how it should interpret non-ASCII byte
strings that were encoded into pickle files by Python 2. For some reason,
Python 2 pickles can sometimes fail to load in Python 3 unless the <tt class="docutils literal">encoding</tt>
argument to <tt class="docutils literal">pickle.load</tt> is set to <tt class="docutils literal"><span class="pre">latin-1</span></tt> (even if they were encoded in
Python 2 using UTF-8). This has been reported in quite a few places, including:</p>
<ul class="simple">
<li><a class="reference external" href="http://stackoverflow.com/questions/28218466/unpickling-a-python-2-object-with-python-3">Unpickling a python 2 object with python 3</a></li>
<li><a class="reference external" href="https://github.com/zopefoundation/ZODB/wiki/Pickles">Pickle interoperability between Python 2 and Python 3</a></li>
</ul>
<p>Annoyingly, the <tt class="docutils literal">encoding</tt> argument does not exist in Python 2, so you
will need two code paths for loading pickle files. If possible, it is
far better to recreate and/or repickle the data in Python 3.</p>
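<p>Both halves can be sketched as follows (a minimal sketch; the function names are illustrative):</p>

```python
import sys
import pickle

def save_pickle(obj, filename):
    # Always open pickle files in binary mode; protocol 2 is the
    # highest protocol that Python 2 can still read
    with open(filename, 'wb') as f:
        pickle.dump(obj, f, protocol=2)

def load_pickle(filename):
    with open(filename, 'rb') as f:
        if sys.version_info[0] >= 3:
            # latin-1 can decode any byte value, so legacy Python 2
            # str data loads without UnicodeDecodeErrors
            return pickle.load(f, encoding='latin-1')
        else:
            # Python 2's pickle.load has no encoding argument
            return pickle.load(f)
```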
</div>
<div class="section" id="parsing-xml-with-xml-etree-elementtree">
<h2>Parsing XML with xml.etree.ElementTree</h2>
<p>This XML parser expects bytes, not Unicode, converting the bytes into Unicode
internally. However, it is important to note that in Python 2, elements in the
parsed XML will contain <tt class="docutils literal">unicode</tt> text in the <tt class="docutils literal">et.text</tt> field <cite>only when
the element contains a non-ASCII character.</cite> This means that if the XML contains
mostly ASCII-compatible strings, they will come back as Python 2 <tt class="docutils literal">str</tt>,
leaking bytes into your otherwise pure Unicode sandwich. This would generally
be OK, except that if you have unit tests that check objects for Unicode
objects this will lead to failures. Moreover, explicitly converting the
ASCII-compatible strings with <tt class="docutils literal">unicode(foo)</tt> is problematic in cases where
the string can be None, as it will introduce the string <tt class="docutils literal">'None'</tt> into the
data!</p>
<p>Here's a tricky solution (hack?) that I adapted from <a class="reference external" href="http://www.gossamer-threads.com/lists/python/python/728903">this thread</a> that ensures
that <tt class="docutils literal">etree</tt> returns only Unicode strings and uses the same syntax in the
caller between Python 2 and 3. It involves subclassing
<tt class="docutils literal">xml.etree.ElementTree.XMLTreeBuilder</tt> in Python 2 and overriding a single
method. The trick is that in Python 3, a corresponding function is defined that
simply returns <tt class="docutils literal">None</tt>:</p>
<div class="highlight"><pre><span></span><span class="kn">import</span> <span class="nn">xml.etree.ElementTree</span> <span class="kn">as</span> <span class="nn">ET</span>
<span class="k">if</span> <span class="n">sys</span><span class="o">.</span><span class="n">version_info</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">>=</span> <span class="mi">3</span><span class="p">:</span>
    <span class="k">def</span> <span class="nf">UnicodeXMLTreeBuilder</span><span class="p">():</span>
        <span class="k">return</span> <span class="bp">None</span>
<span class="k">else</span><span class="p">:</span>
    <span class="k">class</span> <span class="nc">UnicodeXMLTreeBuilder</span><span class="p">(</span><span class="n">ET</span><span class="o">.</span><span class="n">XMLTreeBuilder</span><span class="p">):</span>
        <span class="c1"># See this thread:</span>
        <span class="c1"># http://www.gossamer-threads.com/lists/python/python/728903</span>
        <span class="k">def</span> <span class="nf">_fixtext</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">text</span><span class="p">):</span>
            <span class="k">return</span> <span class="n">text</span>
<span class="c1"># Get XML content as bytes, e.g., via urlopen</span>
<span class="n">response</span> <span class="o">=</span> <span class="n">urlopen</span><span class="p">(</span><span class="o">...</span><span class="p">)</span>
<span class="n">tree</span> <span class="o">=</span> <span class="n">ET</span><span class="o">.</span><span class="n">parse</span><span class="p">(</span><span class="n">response</span><span class="p">,</span> <span class="n">parser</span><span class="o">=</span><span class="n">UnicodeXMLTreeBuilder</span><span class="p">())</span>
<span class="c1"># Or, parse directly from a bytestring</span>
<span class="n">xml_str</span> <span class="o">=</span> <span class="sa">b</span><span class="s1">'<foo><bar>baz</bar></foo>'</span>
<span class="n">tree</span> <span class="o">=</span> <span class="n">ET</span><span class="o">.</span><span class="n">XML</span><span class="p">(</span><span class="n">xml_str</span><span class="p">,</span> <span class="n">parser</span><span class="o">=</span><span class="n">UnicodeXMLTreeBuilder</span><span class="p">())</span>
</pre></div>
<p>In Python 2, the call to <tt class="docutils literal">UnicodeXMLTreeBuilder()</tt> returns an instance of the
appropriate parser, whereas in Python 3, it returns <tt class="docutils literal">None</tt> and allows the
<tt class="docutils literal">ElementTree.XML</tt> and <tt class="docutils literal">ElementTree.parse</tt> functions to operate normally.
The upshot is that the <tt class="docutils literal">parser</tt> argument should always be passed when using
either function.</p>
</div>
<div class="section" id="json">
<h2>JSON</h2>
<p>When writing a Unicode-containing Python object to a JSON file or string using
<tt class="docutils literal">json.dump</tt> or <tt class="docutils literal">json.dumps</tt>, note that the object produced is,
counterintuitively, a <tt class="docutils literal">str</tt> in Python 2 (bytestring), but with all non-ASCII
characters escaped ("Python encoded") and hence suitable for writing to a file
in text mode.</p>
<div class="highlight"><pre><span></span><span class="c1"># Python 2</span>
<span class="o">>>></span> <span class="kn">import</span> <span class="nn">json</span>
<span class="o">>>></span> <span class="n">foo</span> <span class="o">=</span> <span class="sa">u</span><span class="s1">'</span><span class="se">\U0001F4A9</span><span class="s1">'</span>
<span class="o">>>></span> <span class="nb">type</span><span class="p">(</span><span class="n">foo</span><span class="p">)</span>
<span class="o"><</span><span class="nb">type</span> <span class="s1">'unicode'</span><span class="o">></span>
<span class="o">>>></span> <span class="n">bar</span> <span class="o">=</span> <span class="n">json</span><span class="o">.</span><span class="n">dumps</span><span class="p">(</span><span class="n">foo</span><span class="p">)</span>
<span class="o">>>></span> <span class="n">bar</span>
<span class="s1">'"</span><span class="se">\\</span><span class="s1">ud83d</span><span class="se">\\</span><span class="s1">udca9"'</span>
<span class="o">>>></span> <span class="nb">type</span><span class="p">(</span><span class="n">bar</span><span class="p">)</span>
<span class="o"><</span><span class="nb">type</span> <span class="s1">'str'</span><span class="o">></span>
<span class="o">>>></span> <span class="n">baz</span> <span class="o">=</span> <span class="n">json</span><span class="o">.</span><span class="n">loads</span><span class="p">(</span><span class="n">bar</span><span class="p">)</span>
<span class="o">>>></span> <span class="n">baz</span>
<span class="sa">u</span><span class="s1">'</span><span class="se">\U0001f4a9</span><span class="s1">'</span>
<span class="o">>>></span> <span class="nb">type</span><span class="p">(</span><span class="n">baz</span><span class="p">)</span>
<span class="o"><</span><span class="nb">type</span> <span class="s1">'unicode'</span><span class="o">></span>
</pre></div>
<p>In Python 3 <tt class="docutils literal">json.dumps</tt> returns <tt class="docutils literal">str</tt>, suitable for writing to text mode
files.</p>
<p>Similarly, to load a JSON object with <tt class="docutils literal">load</tt> or <tt class="docutils literal">loads</tt>, in both Python 2
and 3 the <tt class="docutils literal">json</tt> module expects a <tt class="docutils literal">str</tt> (not a Python 3 <tt class="docutils literal">bytes</tt>). This
means that all JSON files should be opened in text (not binary) mode, and
should be created by <tt class="docutils literal">json.dump</tt> rather than by some other process that would
leave encoded byte strings in the file.</p>
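<p>A minimal round-trip sketch that works unchanged in both versions (the function names are illustrative):</p>

```python
import json

def save_json(obj, filename):
    # Open in text (not binary) mode; with the default ensure_ascii=True,
    # json.dump writes only ASCII characters, so this is safe in both
    # Python 2 and 3
    with open(filename, 'w') as f:
        json.dump(obj, f)

def load_json(filename):
    # json.load expects a text-mode file object, not bytes
    with open(filename, 'r') as f:
        return json.load(f)
```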
<p>Three other things are worth noting:</p>
<ul class="simple">
<li>When dumping with <tt class="docutils literal">json.dumps(foo)</tt> in Python 2, <tt class="docutils literal">foo</tt> itself can contain a
mix of <tt class="docutils literal">str</tt> and <tt class="docutils literal">unicode</tt> strings as long as the <tt class="docutils literal">str</tt> objects are ASCII
only.</li>
<li>When loading with <tt class="docutils literal">foo = <span class="pre">json.loads(...)</span></tt> in Python 2, the object returned
will contain only <tt class="docutils literal">unicode</tt> strings, even if those strings were <tt class="docutils literal">str</tt>
when they were dumped. For example:</li>
</ul>
<div class="highlight"><pre><span></span><span class="c1"># Python 2</span>
<span class="o">>>></span> <span class="kn">import</span> <span class="nn">json</span>
<span class="c1"># All strings are str, not unicode</span>
<span class="o">>>></span> <span class="n">foo</span> <span class="o">=</span> <span class="p">[</span><span class="s1">'foo'</span><span class="p">,</span> <span class="p">{</span><span class="s1">'bar'</span><span class="p">:</span> <span class="p">(</span><span class="s1">'baz'</span><span class="p">,</span> <span class="bp">None</span><span class="p">,</span> <span class="mf">1.0</span><span class="p">,</span> <span class="mi">2</span><span class="p">)}]</span>
<span class="c1"># Will come back with all strings unicode</span>
<span class="o">>>></span> <span class="n">json</span><span class="o">.</span><span class="n">loads</span><span class="p">(</span><span class="n">json</span><span class="o">.</span><span class="n">dumps</span><span class="p">(</span><span class="n">foo</span><span class="p">))</span>
<span class="p">[</span><span class="sa">u</span><span class="s1">'foo'</span><span class="p">,</span> <span class="p">{</span><span class="sa">u</span><span class="s1">'bar'</span><span class="p">:</span> <span class="p">[</span><span class="sa">u</span><span class="s1">'baz'</span><span class="p">,</span> <span class="bp">None</span><span class="p">,</span> <span class="mf">1.0</span><span class="p">,</span> <span class="mi">2</span><span class="p">]}]</span>
</pre></div>
<ul class="simple">
<li>In Python 3, calling <tt class="docutils literal">json.dumps</tt> on an object containing any bytestrings
will lead to a <tt class="docutils literal">TypeError</tt>.</li>
</ul>
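<p>A quick check of the last point (a minimal sketch):</p>

```python
import json

# In Python 3, bytes are not JSON-serializable by default
try:
    json.dumps({'key': b'raw bytes'})
    serialized = True
except TypeError:
    serialized = False
# serialized is False: json.dumps raised TypeError on the bytestring
```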
</div>
<div class="section" id="rdf-and-rdflib">
<h2>RDF and rdflib</h2>
<p>When serializing an <tt class="docutils literal">rdflib.Graph</tt> object (e.g., for writing to a file), the
encoding can be specified by an argument to the <tt class="docutils literal">serialize</tt> function, which
returns bytes:</p>
<div class="highlight"><pre><span></span><span class="n">g</span><span class="o">.</span><span class="n">serialize</span><span class="p">(</span><span class="n">format</span><span class="o">=</span><span class="s1">'xml'</span><span class="p">,</span> <span class="n">encoding</span><span class="o">=</span><span class="s1">'utf-8'</span><span class="p">)</span>
</pre></div>
<p>This can then be written to a file opened in bytes mode (i.e., with the <tt class="docutils literal">wb</tt>
arg), e.g.:</p>
<div class="highlight"><pre><span></span><span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="n">file_path</span><span class="p">,</span> <span class="s1">'wb'</span><span class="p">)</span> <span class="k">as</span> <span class="n">out_file</span><span class="p">:</span> <span class="c1"># Binary mode</span>
<span class="n">xml_bytes</span> <span class="o">=</span> <span class="n">g</span><span class="o">.</span><span class="n">serialize</span><span class="p">(</span><span class="n">format</span><span class="o">=</span><span class="s1">'xml'</span><span class="p">,</span> <span class="n">encoding</span><span class="o">=</span><span class="s1">'utf-8'</span><span class="p">)</span>
<span class="n">out_file</span><span class="o">.</span><span class="n">write</span><span class="p">(</span><span class="n">xml_bytes</span><span class="p">)</span>
</pre></div>
</div>
What fraction of articles can you expect to be available for text mining from Pubmed Central?2016-05-20T00:00:00-05:002016-09-05T16:30:51-05:00John A. Bachmantag:johnbachman.net,2016-05-20:/what-fraction-of-articles-can-you-expect-to-be-available-for-text-mining-from-pubmed-central.html<p>In a <a class="reference external" href="http://johnbachman.net/assembling-a-text-mining-corpus-for-the-ras-pathway.html">previous post,</a> I described
assembling sets of Pubmed references relevant to <a class="reference external" href="http://www.cancer.gov/research/key-initiatives/ras/ras-central/blog/ras-pathway-v2">227 genes in the Ras pathway.</a>
The next problem was <strong>getting access to mineable content.</strong></p>
<p>The most readily available source of content for text mining by researchers is
the <a class="reference external" href="http://www.ncbi.nlm.nih.gov/pmc/tools/openftlist/">Pubmed Central Open Access article subset</a>, and most demonstrations
of text mining tools I have seen within the <a class="reference external" href="http://www.darpa.mil/program/big-mechanism">Big Mechanism program</a> have used this set of articles
almost exclusively. These articles have licenses allowing access and reuse
suitable for our own (non-commercial research) purposes, though beyond those
uses what is permitted depends on the specific, article-by-article license.</p>
<p>For those new to text-mining, as I am, it's worth noting that <em>just because you
can read all 4 million full-text articles on the Pubmed Central website doesn't
mean you can mine them.</em> In fact, only about 1.2 million of the articles in
Pubmed Central are in the Open Access subset--the majority are off-limits to
bulk downloading and mining due to copyright restrictions. And that's out of a
total of more than <a class="reference external" href="https://www.nlm.nih.gov/pubs/factsheets/dif_med_pub.html">25 million articles in Pubmed</a>. So depending on
what's in your denominator, the fraction of open access articles relevant to
your topic could be pretty small.</p>
<p>Another source of articles for text mining is Pubmed Central's <a class="reference external" href="http://www.ncbi.nlm.nih.gov/pmc/about/mscollection/">Author's
Manuscript Collection</a>.
These consist of accepted manuscripts uploaded by authors in compliance with
the NIH Public Access Policy, and they are also eligible for text mining. This
dataset currently consists of ~386,000 articles. This brings the total number
of mineable articles from Pubmed Central to ~1.6 million.</p>
<p>Some publishers are beginning to make the full texts of non-open-access
articles available to subscribers via the <a class="reference external" href="http://tdmsupport.crossref.org/researcher-faq/">CrossRef text and data mining API</a>, which is a subject for
another day. Nevertheless, by far the most straightforward way to gain access
to the full text of an article for mining is from Pubmed Central--if it's in
the Pubmed Central Open Access or Author's Manuscript subsets.</p>
<p><strong>I downloaded the Open Access and Author's Manuscript subsets from PMC and
cross-referenced the available content to the full list of PMC articles.</strong>
Before getting into the fraction of articles in these subsets, it's worth
noting that <strong>many articles in PMC do not have PMIDs or DOIs associated with
them,</strong> as shown in the following Venn diagram:</p>
<div class="center figure">
<img alt="IDs for articles in Pubmed Central" src="images/pmc_ids_venn.png" style="width: 50%;" />
<p class="caption"><em>IDs (in addition to PMC ID) associated with the 3,998,986 articles in
Pubmed Central.</em> (<a class="reference external" href="images/pmc_ids_venn.pdf">PDF</a>)</p>
</div>
<p>This has two important consequences: first, since our current approach to
building a corpus of papers involves searching Pubmed, we will always be
starting with PMIDs, so the ~500,000 articles in Pubmed Central with no PMID
are not relevant. Second, for papers that don't turn out to be available in
mineable form from PMC, we'll have to look elsewhere (i.e., CrossRef) for full
text content, for which we'll need DOIs. Since PMC doesn't have DOIs on file
for nearly a third of its articles, we have to obtain them from somewhere else.</p>
<p><strong>Starting with the subset of ~3.5 million articles in PMC that have PMIDs, I
looked at how many of these are in the Open Access subset, the Author's
Manuscript collection, or both.</strong> As the Venn diagram shows, these two
collections of mineable articles are mostly complementary, with a relatively
small number of articles appearing in both sets.</p>
<div class="center figure">
<img alt="Subsets of articles in PMC with PMIDs available for mining." src="images/pmc_oa_auth_venn.png" style="width: 50%;" />
<p class="caption"><em>Articles in Pubmed Central that have PMIDs and are available for mining
in either the Open Access subset (OA subset) or the Author's Manuscript
collection (Author MS).</em> (<a class="reference external" href="images/pmc_oa_auth_venn.pdf">PDF</a>)</p>
</div>
<p><strong>Next, I looked at the fraction of mineable articles in PMC (OA subset or
Author's MS) that one tends to get from Pubmed search results, using the two
sets of references for the 227 Ras genes I described in my previous post as
examples</strong> (a set of ~356,000 papers obtained by gene name searches, and a
smaller set of ~54,000 papers obtained from the Entrez Gene database).</p>
<p>The results for the larger corpus of 356k papers show a roughly consistent
fraction of mineable papers in PMC that appears to be independent of the total
number of citations for the gene:</p>
<div class="center figure">
<img alt="Percentage of references with full text in Pubmed Central" src="images/pmids_by_name_ft_line.png" />
<p class="caption"><em>Percentage of references for each gene search with full text in Pubmed
Central, sorted by number of total references</em> (<a class="reference external" href="images/pmids_by_name_ft_line.pdf">PDF</a>).</p>
</div>
<p>The results for the smaller corpus resulting from searching Entrez Gene by gene
ID were similar:</p>
<div class="center figure">
<img alt="Percentage of references with full text in Pubmed Central" src="images/pmids_by_gene_ft_line.png" />
<p class="caption"><em>Percentage of references for each gene search with full text in Pubmed
Central, sorted by number of total references</em> (<a class="reference external" href="images/pmids_by_gene_ft_line.pdf">PDF</a>).</p>
</div>
<p>For unique references combined across all 227 gene searches, the fraction of
mineable articles in Pubmed Central made up roughly 16-18%:</p>
<table border="1" class="docutils">
<colgroup>
<col width="40%" />
<col width="33%" />
<col width="28%" />
</colgroup>
<thead valign="bottom">
<tr><th class="head"> </th>
<th class="head">By gene name</th>
<th class="head">By gene ID</th>
</tr>
</thead>
<tbody valign="top">
<tr><td>Unique refs</td>
<td>355,781</td>
<td>54,308</td>
</tr>
<tr><td>Mineable in PMC</td>
<td>58,719</td>
<td>9,885</td>
</tr>
<tr><td>Percentage</td>
<td>16.5%</td>
<td>18.2%</td>
</tr>
</tbody>
</table>
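<p>The percentages in the last row follow directly from the counts (values copied from the table above):</p>

```python
# Unique references returned by each search strategy (from the table)
unique_refs = {'by_name': 355781, 'by_id': 54308}
# Of those, the number available in the OA or Author's MS subsets
mineable = {'by_name': 58719, 'by_id': 9885}

# Fraction of unique references that are mineable in PMC
percentages = {key: 100.0 * mineable[key] / unique_refs[key]
               for key in unique_refs}
# percentages: roughly 16.5% by gene name, 18.2% by gene ID
```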
<p>There is another way to look at the data, which is not across the set of
references for all genes, but on a gene-by-gene basis. In other words, <strong>if
you are working on a particular gene, what fraction of the articles on your
gene can you expect to find on PMC in a mineable form?</strong></p>
<p>Here is the distribution for the 227 Ras gene searches by gene name:</p>
<div class="center figure">
<img alt="Distribution of full text ratios for different gene name searches" src="images/pmids_by_name_ft_hist.png" />
<p class="caption"><em>Distribution of full text ratios for different gene name searches</em>
(<a class="reference external" href="images/pmids_by_name_ft_hist.pdf">PDF</a>).</p>
</div>
<p>And for the searches in Entrez Gene by gene ID:</p>
<div class="center figure">
<img alt="Distribution of full text ratios for references in Entrez gene" src="images/pmids_by_gene_ft_hist.png" />
<p class="caption"><em>Distribution of full text ratios for references in Entrez gene</em>
(<a class="reference external" href="images/pmids_by_gene_ft_hist.pdf">PDF</a>).</p>
</div>
<p>The mean and standard deviation for full text percentages across the full set
of genes:</p>
<table border="1" class="docutils">
<colgroup>
<col width="37%" />
<col width="34%" />
<col width="29%" />
</colgroup>
<thead valign="bottom">
<tr><th class="head"> </th>
<th class="head">By gene name</th>
<th class="head">By gene ID</th>
</tr>
</thead>
<tbody valign="top">
<tr><td>Mean % in PMC</td>
<td>20.8%</td>
<td>18.7%</td>
</tr>
<tr><td>Std Deviation</td>
<td>9.4%</td>
<td>6.9%</td>
</tr>
</tbody>
</table>
<p>As the histograms and the summary statistics show, <strong>while on average you might
expect to find 1 out of every 5 articles on your gene available for mining in
PMC, there is a lot of variability.</strong> If you're unlucky, you could easily end
up with less than 1 in 10. Moreover, these results are specifically for a set
of gene-based searches related to cancer biology. Our experience in
obtaining references for two other less molecularly-focused projects (in
diabetes and drug-induced cardiotoxicity) suggests that <strong>in other domains of
biology, the fraction of open access articles may be substantially less,
possibly due to different journal or publication practices.</strong></p>
Assembling a text-mining corpus for the Ras pathway2016-05-19T00:00:00-05:002016-05-24T22:33:35-05:00John A. Bachmantag:johnbachman.net,2016-05-19:/assembling-a-text-mining-corpus-for-the-ras-pathway.html<p>Over the last year and a half or so I've been involved in the <a class="reference external" href="http://www.darpa.mil/program/big-mechanism">Big Mechanism
program</a> sponsored by DARPA. The
practical goal of this program is to develop software systems to extract facts
from the scientific literature by text mining and, from these facts, assemble
causal, mechanistic models that can be used to explain and predict phenomena.
The bigger picture goal is to explore an approach to science in which machines
assume a greater share of the burden in aggregating and integrating research.
Though DARPA envisions applications of this type of technology in multiple
domains, the initial focus of the Big Mechanism program is in cancer biology,
specifically <a class="reference external" href="http://www.cancer.gov/research/key-initiatives/ras">Ras-driven cancer.</a></p>
<p>Most of my work on the Big Mechanism program up to this point has been to
<a class="reference external" href="https://github.com/sorgerlab/indra">develop tools that assemble mechanisms into models</a>, deconflicting, cleaning, and assembling
findings into different formats. Having had some success in automated assembly
of signaling models from databases (such as <a class="reference external" href="http://pathwaycommons.org/">Pathway Commons</a>) we are now looking to see how much more we can
enrich these models using large-scale machine reading.</p>
<p>I started looking into this in the context of a specific use case: assembling a
large-scale, high-quality model of the Ras signaling pathway, which I've been
developing along with Ben Gyori, Kartik Subramanian and other collaborators
here at HMS. As a starting point, we've defined the Ras signaling pathway
according to <a class="reference external" href="http://www.cancer.gov/research/key-initiatives/ras/ras-central/blog/ras-pathway-v2">Frank McCormick's RAS Pathway v2.0 diagram and accompanying table,</a>
which includes 227 genes organized into 65 groups.</p>
<p>The first question that arises is <strong>what is the best way to find papers
relevant to a set of genes?</strong> A requirement is that the process of querying for
publications should be automated, with minimal human intervention or curation.
I tried two (very simple) approaches:</p>
<ol class="arabic simple">
<li>Query Pubmed using the canonical (HGNC) gene name</li>
<li>Get the set of references associated with the gene from the
<a class="reference external" href="http://www.ncbi.nlm.nih.gov/gene">Entrez Gene</a> database.</li>
</ol>
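<p>Both approaches can be driven programmatically through the NCBI E-utilities (<tt class="docutils literal">esearch</tt> for free-text Pubmed queries, <tt class="docutils literal">elink</tt> for the curated gene-to-PMID links). A minimal sketch of the URL construction and JSON parsing involved (the function names are mine, not from an existing library; the endpoints and JSON layout follow the public E-utilities interface):</p>

```python
try:
    from urllib.parse import urlencode  # Python 3
except ImportError:
    from urllib import urlencode  # Python 2

EUTILS = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils'

def pubmed_search_url(gene_name, retmax=100000):
    # Approach 1: free-text Pubmed search on the canonical gene name
    params = {'db': 'pubmed', 'term': gene_name,
              'retmax': retmax, 'retmode': 'json'}
    return '%s/esearch.fcgi?%s' % (EUTILS, urlencode(params))

def gene_pubmed_link_url(gene_id):
    # Approach 2: curated gene -> PMID links from the Entrez Gene record
    params = {'dbfrom': 'gene', 'db': 'pubmed', 'id': gene_id,
              'retmode': 'json'}
    return '%s/elink.fcgi?%s' % (EUTILS, urlencode(params))

def pmids_from_esearch(response_json):
    # esearch's JSON output puts the matching PMIDs in esearchresult.idlist
    return response_json['esearchresult']['idlist']
```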
<p>The first approach produces a ton of results, but has some issues: for one, it
picks up false positives due to gene names that match unrelated terms:
for example, the gene name <em>JUN</em> seems to pick up any paper published in the month
of June. On the other hand, this approach also seems to <em>miss</em> relevant papers
due to the fact that most genes have several synonyms and many papers may refer
to the gene using non-standard names.</p>
<p>The second approach pulls the curated PMID references associated with the gene
from the Entrez Gene database. The set of PMIDs obtained by pulling all PMIDs
out of the XML result for the gene corresponds closely to the "Bibliography"
section of the Entrez Gene information page (e.g., see the <a class="reference external" href="https://www.ncbi.nlm.nih.gov/gene/672#bibliography">Bibliography
section for BRCA1</a>).</p>
<p>As expected, searching by gene name returns a much larger set of PMIDs (more
than 6 times larger) than obtaining the references from Entrez Gene. In both
cases there was a substantial fraction of papers that were returned by searches
for multiple genes, as might be expected for genes identified <em>a priori</em> as
being involved in a common biological process. In both cases roughly 75% of the
assembled list of PMIDs were unique.</p>
<table border="1" class="docutils">
<colgroup>
<col width="33%" />
<col width="36%" />
<col width="31%" />
</colgroup>
<thead valign="bottom">
<tr><th class="head"> </th>
<th class="head">By gene name</th>
<th class="head">By gene ID</th>
</tr>
</thead>
<tbody valign="top">
<tr><td>Total refs</td>
<td>464,917</td>
<td>74,529</td>
</tr>
<tr><td>Unique refs</td>
<td>355,781</td>
<td>54,308</td>
</tr>
</tbody>
</table>
<p><strong>How many citations do we tend to get by gene?</strong> The figure below shows the
distribution of the number of PMIDs returned for each gene, sorted by the
number of PMIDs returned by gene name search, and plotted on a log scale. The
distribution roughly follows a power law, with deviations for the most-cited
and least-cited genes.</p>
<div class="center figure">
<img alt="Citation distribution for 227 Ras genes" src="images/citations_by_gene.png" />
<p class="caption"><em>Citation distribution for 227 Ras genes, sorted by citation count for
name-based search</em> (<a class="reference external" href="pdfs/citations_by_gene.pdf">PDF</a>).</p>
</div>
<p>Reassuringly, the number of references returned by the gene ID search roughly
follows the number of references returned by the name search, but with
substantially fewer references overall. The least-cited genes appear to be an
exception to this pattern: for these the gene ID search appears to return a
larger number of references than the name search. This appears to be due to the
fact that the least-cited genes often appear in the literature under different
names, and Entrez Gene collates citations across multiple names.</p>
<p>The list of the top 10 genes (by citations) returns reassuringly familiar
names. If anything, the gene ID search returns a list closer to what one might
expect from how "famous" the genes tend to be, suggesting that it's less
susceptible to variability due to the use of the particular name in the
literature. For example, it's surprising that <em>TP53</em> doesn't make the top 10 in
the gene name search, probably because it's more frequently referred to by its
protein name, <em>p53,</em> than its official gene name, <em>TP53</em>. Similarly, <em>FOS</em> is
number 4 on the gene name list, but it's certainly not as well known as <em>NFKB1</em>
or <em>KRAS</em>, both of which make the top 10 by gene ID but not by gene name. A
quick scan of the search results for "FOS" revealed hits not only for the gene
<em>FOS</em>, but also false positives like "fructooligosaccharide" (FOS), "Framingham
Offspring Study" (FOS), and "foot orthoses" (FOs).</p>
<table border="1" class="docutils">
<colgroup>
<col width="13%" />
<col width="46%" />
<col width="41%" />
</colgroup>
<thead valign="bottom">
<tr><th class="head">Rank</th>
<th class="head">By gene name (refs)</th>
<th class="head">By gene ID (refs)</th>
</tr>
</thead>
<tbody valign="top">
<tr><td>1</td>
<td>CASP3 (47320)</td>
<td>TP53 (7598)</td>
</tr>
<tr><td>2</td>
<td>EGFR (38072)</td>
<td>EGFR (4056)</td>
</tr>
<tr><td>3</td>
<td>MYC (30819)</td>
<td>NFKB1 (2508)</td>
</tr>
<tr><td>4</td>
<td>FOS (27521)</td>
<td>AKT1 (2370)</td>
</tr>
<tr><td>5</td>
<td>ERBB2 (23076)</td>
<td>BRCA1 (2304)</td>
</tr>
<tr><td>6</td>
<td>MTOR (18677)</td>
<td>ERBB2 (2107)</td>
</tr>
<tr><td>7</td>
<td>MAPK1 (12766)</td>
<td>MAPK1 (1719)</td>
</tr>
<tr><td>8</td>
<td>BRCA1 (12458)</td>
<td>KRAS (1609)</td>
</tr>
<tr><td>9</td>
<td>CDKN1A (12266)</td>
<td>PTEN (1571)</td>
</tr>
<tr><td>10</td>
<td>MAPK3 (12144)</td>
<td>BRAF (1503)</td>
</tr>
</tbody>
</table>
<p>The genes with the fewest citations have a surprisingly small number of
references given that they were explicitly included in a curated set of key Ras
pathway genes. Many of them are lesser-known isoforms of widely studied gene
families (e.g., <em>SPRED3, RASA2, PIK3R5/6</em>):</p>
<table border="1" class="docutils">
<colgroup>
<col width="13%" />
<col width="46%" />
<col width="41%" />
</colgroup>
<thead valign="bottom">
<tr><th class="head">Rank</th>
<th class="head">By gene name (refs)</th>
<th class="head">By gene ID (refs)</th>
</tr>
</thead>
<tbody valign="top">
<tr><td>218</td>
<td>PIK3R5 (12)</td>
<td>RALGAPA2 (13)</td>
</tr>
<tr><td>219</td>
<td>SPRED3 (10)</td>
<td>RASGRP4 (13)</td>
</tr>
<tr><td>220</td>
<td>EXOC1 (7)</td>
<td>SPRY3 (12)</td>
</tr>
<tr><td>221</td>
<td>RALGAPA1 (7)</td>
<td>RASA2 (11)</td>
</tr>
<tr><td>222</td>
<td>RASSF9 (6)</td>
<td>RASSF10 (11)</td>
</tr>
<tr><td>223</td>
<td>CYTH2 (4)</td>
<td>RGL1 (11)</td>
</tr>
<tr><td>224</td>
<td>EXOC6 (4)</td>
<td>RASSF9 (10)</td>
</tr>
<tr><td>225</td>
<td>RALGAPA2 (4)</td>
<td>SPRED3 (8)</td>
</tr>
<tr><td>226</td>
<td>RASAL3 (3)</td>
<td>RASAL3 (7)</td>
</tr>
<tr><td>227</td>
<td>PIK3R6 (1)</td>
<td>RGL3 (5)</td>
</tr>
</tbody>
</table>
<p>There are of course many other ways to assemble corpora, including systematic
use of gene synonyms, exploitation of MeSH terms and other metadata, and
use of other search tools (e.g., CrossRef). These were two very simple ways to
get a sense of the scale of the relevant literature, with the expansive and
restricted searches giving rough upper and lower bounds. My conclusion is that
<strong>the curated references in Entrez Gene are less likely to contain false
positives, with the downside of missing many potentially relevant articles.</strong>
Given that the corpus returned by the Entrez Gene search is smaller,
I'll use this set of roughly 54,000 papers for an initial pilot study in machine
reading for mechanisms.</p>
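<p>To actually assemble that corpus, the per-gene Entrez Gene links can be
pooled into a single deduplicated PMID set. A minimal sketch, again assuming
the <tt class="docutils literal">elink</tt> endpoint with the <tt class="docutils literal">gene_pubmed</tt> link and JSON output
(the function names here are illustrative, not from any library):</p>

```python
import urllib.parse

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def pooled_elink_url(gene_ids):
    """Build an elink URL covering several Entrez Gene IDs at once.
    Passing the IDs in a single comma-separated 'id' parameter asks
    NCBI to merge the linked PMIDs into one linkset."""
    params = urllib.parse.urlencode({
        "dbfrom": "gene",
        "db": "pubmed",
        "id": ",".join(str(g) for g in gene_ids),
        "linkname": "gene_pubmed",
        "retmode": "json",
    })
    return f"{EUTILS}/elink.fcgi?{params}"


def pmid_union(elink_json):
    """Extract the deduplicated set of PMIDs from a parsed elink JSON
    response (as returned by retmode=json)."""
    pmids = set()
    for linkset in elink_json.get("linksets", []):
        for linksetdb in linkset.get("linksetdbs", []):
            pmids.update(linksetdb.get("links", []))
    return pmids
```

<p>Fetching <tt class="docutils literal">pooled_elink_url(...)</tt> for the curated gene list and passing
the parsed response to <tt class="docutils literal">pmid_union</tt> would yield the corpus, with
duplicates (articles curated to more than one gene) counted only once.</p>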
<p>In a subsequent post, I'll look at what fraction of the articles in these two
corpora are available for text mining from PubMed Central.</p>