<p><em>John A. Bachman, johnbachman.net: Systems and computational biology, programming, and statistics.</em></p>
<p><strong>We need to talk about how we talk about protein families and complexes</strong> (2018-07-08)</p>
<p><em>(Our paper on FamPlex, a semantic resource for improving text mining and
biocuration for protein families and complexes, is available at BMC
Bioinformatics</em> <a class="reference external" href="https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-018-2211-5">here.</a>
<em>Synopsis follows below.)</em></p>
<p>When <a class="reference external" href="http://scholar.harvard.edu/bgyori">Ben Gyori</a> and I
started working with natural language processing (NLP) systems as part of the
<a class="reference external" href="https://www.darpa.mil/program/big-mechanism">DARPA Big Mechanism program</a>,
we found that all the systems we worked with had a common problem: they were
frequently unable to correctly and uniformly identify common protein families
and complexes.</p>
<p>When I say "identify," I mean assign a database identifier (e.g., Uniprot ID,
HGNC ID, Gene Ontology ID, etc.) to a text string denoting a protein family or
complex (e.g., <tt class="docutils literal"><span class="pre">NF-kappaB</span></tt>, a complex, or <tt class="docutils literal">Ras</tt>, a gene family). In the NLP
world, this process is called <a class="reference external" href="https://en.wikipedia.org/wiki/Entity_linking">"named entity linking",</a> "named entity normalization",
or simply "grounding."</p>
<p>Why is this important? Like many others in the systems/computational biology
community, we are interested in <strong>mining the scientific literature for
mechanistic information that we can use to analyze data and build
predictive, explanatory models.</strong> The problem in a nutshell is that scientists
often write and talk about biological mechanisms in terms of protein families
and functional complexes, whereas biological datasets are invariably expressed
in terms of the abundances or activities of specific <cite>genes or proteins</cite>. So if
we are going to make use of the (large!) amount of information expressed in
terms of families and complexes, we have to:
<ol class="arabic simple">
<li><cite>Recognize and ground</cite> these terms to standard identifiers, and</li>
<li><cite>Link</cite> these family/complex identifiers to their gene/protein-level
constituents.</li>
</ol>
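<p>In code, the two steps might look like the following sketch; the mapping tables and identifiers here are purely illustrative, not the actual FamPlex entries:</p>

```python
# Hypothetical grounding and member tables; the identifiers are
# illustrative stand-ins, not the actual FamPlex entries.
GROUNDING_MAP = {
    'NF-kappaB': ('FPLX', 'NFkappaB'),
    'NFkappaB': ('FPLX', 'NFkappaB'),
    'Ras': ('FPLX', 'RAS'),
}
FAMILY_MEMBERS = {
    ('FPLX', 'NFkappaB'): ['NFKB1', 'NFKB2', 'REL', 'RELA', 'RELB'],
    ('FPLX', 'RAS'): ['HRAS', 'KRAS', 'NRAS'],
}

def ground_and_link(text):
    """Step 1: ground the string; step 2: link to gene-level members."""
    identifier = GROUNDING_MAP.get(text)
    if identifier is None:
        return None, []
    return identifier, FAMILY_MEMBERS.get(identifier, [])
```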
<p>So why is this difficult for NLP algorithms in practice? Having looked at the
types of errors made by two different reading systems, <a class="reference external" href="https://github.com/clulab/reach">REACH</a> (developed by Mihai Surdeanu's group at the
University of Arizona) and <a class="reference external" href="http://trips.ihmc.us/parser/">TRIPS</a> (developed
by the Institute for Human and Machine Cognition), we concluded that the problem
was not <cite>primarily</cite> with the NLP systems and their grounding algorithms, but
rather with the lack of uniform resources for grounding and linking families and
complexes.</p>
<p>For example, sometimes family/complex entities in text came back from the
machine readers with no associated identifiers at all. The reason was usually
straightforward: the databases indexed by the reading systems for grounding
either didn't have much coverage of families and complexes, or those databases
lacked the lexical synonyms necessary for accurate matching. For example,
REACH, which indexed Uniprot, InterPro, Pfam, HMDB, ChEBI, Gene Ontology, MeSH,
and other ontologies, found no grounding for the string <tt class="docutils literal">NFkappaB</tt>,
one of the most frequently occurring in our corpus. In fact, in a corpus of
~215,000 articles (a mix of full texts and abstracts), we found that seven of
the ten most frequently occurring <cite>ungrounded</cite> entities were families and
complexes!</p>
<div class="center figure">
<img alt="FamPlex Table 4." src="images/famplex_table4.png" style="width: 800px;" />
<p class="caption">Most frequently occurring ungrounded entity texts, with and without
FamPlex. Families and complexes ungrounded without FamPlex are
<tt class="docutils literal"><span class="pre">NF-kappaB</span></tt>, <tt class="docutils literal">ERK1/2</tt>, <tt class="docutils literal">mTORC1</tt>, <tt class="docutils literal">NFkappaB</tt>, <tt class="docutils literal">PDGF</tt>, <tt class="docutils literal">IKK</tt>, and
<tt class="docutils literal">histone H3</tt>.</p>
</div>
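<p>Tallying ungrounded strings from reader output is straightforward; here is a sketch using a hypothetical list of (entity text, grounding) pairs:</p>

```python
from collections import Counter

# Hypothetical reader output: (entity text, grounding) pairs, with None
# for entities the reader could not ground.
extractions = [
    ('NF-kappaB', None), ('ERK1/2', None), ('NF-kappaB', None),
    ('TP53', 'HGNC:11998'), ('mTORC1', None),
]

# Count only the entity texts that came back without a grounding
ungrounded = Counter(text for text, grounding in extractions
                     if grounding is None)
```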
<p>In other cases, family/complex names were <cite>incorrectly</cite> grounded to specific
genes due to spurious exact matches in unexpected places. For example, <tt class="docutils literal">ERK</tt>,
the common name for the MAPK1/MAPK3 gene family, was incorrectly grounded to
<a class="reference external" href="https://www.uniprot.org/uniprot/P29323#names_and_taxonomy">EPHB2</a> due to
<tt class="docutils literal">ERK</tt> being listed in Uniprot as a synonym for that gene. Similarly, human
gene families were sometimes grounded to the single ortholog of the family in a
different organism. My personal favorite was the grounding of <tt class="docutils literal">AKT</tt> to the
<cite>Dictyostelium</cite> (slime mold) gene <a class="reference external" href="https://www.uniprot.org/uniprot/P54644#names_and_taxonomy">pkbA</a> instead of the
human gene family consisting of the human genes AKT1/AKT2/AKT3.</p>
<div class="center figure">
<img alt="Uniprot entry for pkbA." src="images/pkba.png" style="width: 600px;" />
<p class="caption">Uniprot entry for the <cite>Dictyostelium</cite> gene <cite>pkbA</cite> has "akt" as a
synonym, causing a spurious match.</p>
</div>
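<p>The failure mode is easy to reproduce with a toy synonym table: a bare case-insensitive exact match over all synonyms, with no species filter or synonym prioritization, happily returns whatever entry matches first:</p>

```python
# Toy synonym table illustrating the failure mode; only these two
# real spurious synonyms (from the text above) are included.
SYNONYMS = [
    ('ERK', 'P29323', 'EPHB2', 'Homo sapiens'),
    ('akt', 'P54644', 'pkbA', 'Dictyostelium discoideum'),
]

def naive_ground(text):
    # Exact (case-insensitive) match, no species filtering or priority
    for synonym, uniprot_id, gene, species in SYNONYMS:
        if synonym.lower() == text.lower():
            return uniprot_id, gene, species
    return None

# naive_ground('ERK') yields EPHB2 rather than the MAPK1/MAPK3 family,
# and naive_ground('AKT') lands on a slime mold gene.
```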
<p>In our evaluations, we found that the TRIPS system did a bit better, finding a
higher percentage of matches using a different matching algorithm and a
different set of databases, in particular the <a class="reference external" href="https://ncit.nci.nih.gov/ncitbrowser/">NCI Thesaurus (NCIT)</a> and <a class="reference external" href="https://www.nextprot.org/">NextProt</a>. Here, though, we found problems with
resolving relationships to specific genes: 41% of the entities grounded to NCIT
did not have any gene members defined, making it difficult to use this
information in downstream data analysis.</p>
<p>More generally, among NLP event extraction systems <cite>and</cite> biocuration projects
there seemed to be a complete lack of consistency in the resources used
to identify families and complexes. Among the sources we encountered,
there was literally no overlap!</p>
<table border="1" class="docutils">
<colgroup>
<col width="29%" />
<col width="15%" />
<col width="55%" />
</colgroup>
<thead valign="bottom">
<tr><th class="head">Source</th>
<th class="head">Type</th>
<th class="head">Family/Complex Databases</th>
</tr>
</thead>
<tbody valign="top">
<tr><td>REACH</td>
<td>NLP</td>
<td>InterPro, Pfam, GO</td>
</tr>
<tr><td>TRIPS</td>
<td>NLP</td>
<td>NextProt, NCIT</td>
</tr>
<tr><td>MedScan</td>
<td>NLP</td>
<td>Medscan IDs, Enzyme codes</td>
</tr>
<tr><td>TEES</td>
<td>NLP</td>
<td>Homologene</td>
</tr>
<tr><td>BEL</td>
<td>Curation</td>
<td>Selventa families and complexes</td>
</tr>
<tr><td>Pathway Commons</td>
<td>Curation</td>
<td>(genes enumerated)</td>
</tr>
<tr><td>SIGNOR</td>
<td>Curation</td>
<td>SIGNOR families and complexes</td>
</tr>
<tr><td>Reactome</td>
<td>Curation</td>
<td>Reactome families and complexes</td>
</tr>
<tr><td>EMBO Sourcedata</td>
<td>Curation</td>
<td>Ungrounded or genes enumerated</td>
</tr>
</tbody>
</table>
<p>This was a problem for us, because we were interested primarily in aggregating
and assembling information from both text mining and curated databases.</p>
<p>So...we started curating identifiers ourselves, defining IDs for protein
families and complexes, linking in the synonyms that we were seeing in the
literature, mapping them to the identifiers in both the protein databases and
pathway resources, and defining the hierarchical relationships between
complexes, families and their members. Mindful of <a class="reference external" href="https://xkcd.com/927/">how standards proliferate</a>, <strong>our goal was not to supplant the existing
resources but instead provide an extensible "bridging resource" for NLP
developers and biocurators to ground and link the most commonly occurring
entities and thereby combine information from multiple sources.</strong></p>
<p>The project grew from a handful of CSV files intended for internal use to a
fairly robust resource that improved grounding significantly enough that
several of the NLP teams in the Big Mechanism program started to use it. We
named it "FamPlex", for "Families, Complexes and their Lexicalizations". It
consists of a set of files specifying identifiers for 441 human protein
families and complexes, their synonyms, gene-level constituents, and equivalent
identifiers in other resources.</p>
<div class="center figure">
<img alt="Structure of the FamPlex resource." src="images/famplex_fig1a.png" style="width: 600px;" />
<p class="caption">Structure of the FamPlex resource.</p>
</div>
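<p>Because FamPlex is distributed as plain CSV files, loading it requires nothing beyond the standard library. The five-column layout assumed below (namespace, id, relation, namespace, id) is an illustration; check the files in the repository for the actual format:</p>

```python
import csv

def load_relations(path):
    # Assumed layout: namespace, id, relation, namespace, id per row;
    # see the FamPlex repository for the actual file format.
    relations = []
    with open(path, newline='', encoding='utf-8') as f:
        for ns1, id1, rel, ns2, id2 in csv.reader(f):
            assert rel in ('isa', 'partof')
            relations.append(((ns1, id1), rel, (ns2, id2)))
    return relations
```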
<p>We also included a curated list of gene/protein affixes annotated with their
semantic meaning, useful for normalizing entity names and correctly interpreting
extracted events.</p>
<div class="center figure">
<img alt="Gene/protein prefixes in FamPlex." src="images/famplex_table2.png" style="width: 400px;" />
<p class="caption">Gene/protein prefixes in FamPlex.</p>
</div>
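<p>Such an affix list can drive a simple normalization step; the affixes and meanings below are hypothetical stand-ins for the curated entries:</p>

```python
# Hypothetical affix table; the real FamPlex list annotates each affix
# with its curated semantic meaning.
PREFIXES = {
    'p': 'phosphorylated',
    'GST-': 'GST fusion tag',
}

def strip_affix(text):
    """Split an entity string into (base name, affix meaning or None)."""
    for prefix, meaning in PREFIXES.items():
        if text.startswith(prefix) and len(text) > len(prefix):
            return text[len(prefix):], meaning
    return text, None
```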
<p>We realized fairly early on that a simple two-layer mapping between
families/complexes and their members would not be sufficient to capture the
range of entities described in the literature. Articles frequently referred not
just to families/complexes and their gene-level members, but often to
<cite>intermediate groupings</cite> of entities, such as a class of subunits forming a
part of a functional complex. To handle this, we defined two relationships,
<tt class="docutils literal">isa</tt> and <tt class="docutils literal">partof</tt>, that could be nested to define the relationships within
a family/complex, as for example with AMPK (a heterotrimer consisting of
different combinations of genes drawn from three subunit families, alpha, beta,
and gamma). We further defined synonyms for each element in the hierarchy to
help NLP systems extract information about mechanisms at any level.</p>
<div class="center figure">
<img alt="Hierarchical structure of FamPlex relationships." src="images/ampk.png" style="width: 400px;" />
<p class="caption">Hierarchical structure of FamPlex relationships, shown here for AMPK.</p>
</div>
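<p>Given <tt class="docutils literal">isa</tt>/<tt class="docutils literal">partof</tt> relations, resolving an entity to its gene-level constituents is a small recursive walk. The AMPK subunit genes below are real, but the family/complex names are illustrative rather than the actual FamPlex identifiers:</p>

```python
# Relations written as (child, relation, parent), following the AMPK
# example: subunit genes belong to families, families to the complex.
RELATIONS = [
    ('PRKAA1', 'isa', 'AMPK_alpha'), ('PRKAA2', 'isa', 'AMPK_alpha'),
    ('PRKAB1', 'isa', 'AMPK_beta'), ('PRKAB2', 'isa', 'AMPK_beta'),
    ('PRKAG1', 'isa', 'AMPK_gamma'), ('PRKAG2', 'isa', 'AMPK_gamma'),
    ('PRKAG3', 'isa', 'AMPK_gamma'),
    ('AMPK_alpha', 'partof', 'AMPK'), ('AMPK_beta', 'partof', 'AMPK'),
    ('AMPK_gamma', 'partof', 'AMPK'),
]

def gene_members(entity):
    """Recursively resolve an entity to its gene-level members."""
    children = [c for c, rel, p in RELATIONS if p == entity]
    if not children:
        return [entity]  # a leaf, i.e. a gene
    members = []
    for child in children:
        members.extend(gene_members(child))
    return members
```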
<p>In sharing and publishing this resource with the community we sought to
quantify the improvements attributable to the incorporation of FamPlex in an
NLP or biocuration setting. We compared grounding accuracy for two readers
(TRIPS and REACH) compiled with and without FamPlex. For REACH, the performance
improvement was especially significant, with accuracy for protein families and
complexes rising from 15% to 71%.</p>
<div class="center figure">
<img alt="Grounding accuracy with/without FamPlex." src="images/famplex_fig3.png" style="width: 300px;" />
<p class="caption">Improvements in grounding accuracy for proteins/genes and
families/complexes, with and without the use of FamPlex.</p>
</div>
<p>To get an estimate of the coverage of the resource relevant to a real-world
biocuration project, we also quantified the proportion of family/complex-level
annotations curated in the EMBO Sourcedata dataset that had relevant identifiers
in FamPlex and obtained figures of 80-90%.</p>
<p>We also found empirically that the hierarchical structure of FamPlex allowed us
to identify relationships between events extracted from families/complexes,
genes, and intermediate levels. For some entities, for example
Activin (<a class="reference external" href="https://en.wikipedia.org/wiki/Activin_and_inhibin#Activin">a class of protein complexes belonging to the TGF-beta superfamily</a>), the
intermediate levels of representation (e.g., Activin A, Activin B, and Activin
AB) were more frequently mentioned in extracted events than either genes or the
top-level category!</p>
<div class="center figure">
<img alt="Hierarchical resolution of entities in FamPlex." src="images/famplex_fig4b.png" style="width: 350px;" />
<p class="caption">The proportion of groundings at gene-level, intermediate-level, or
top-level entities for five multi-level families/complexes in FamPlex.</p>
</div>
<p>Amid a push to facilitate data integration and <a class="reference external" href="https://www.genenames.org/useful/journals">standardize nomenclature for
human genes</a> in the published
literature, we speculate that FamPlex could be used to annotate text and data
dealing with functional complexes and gene families (e.g., the results of
antibody-based experiments involving multiple family members).</p>
<p>For more on the details of the construction and evaluation of FamPlex, see the
<a class="reference external" href="https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-018-2211-5">paper!</a>
We've made FamPlex available under a <a class="reference external" href="https://creativecommons.org/choose/zero/">CC0 license</a> so that people can extend, remix,
and combine FamPlex with other resources as necessary. We hope that this will
be a useful tool for the computational biology and biocuration community. Feel
free to ask questions or suggest additions via the <a class="reference external" href="https://github.com/sorgerlab/famplex/issues">Issues page on Github</a>.</p>
<p>Links:</p>
<ul class="simple">
<li>Paper: <a class="reference external" href="https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-018-2211-5">https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-018-2211-5</a></li>
<li>Code (GitHub): <a class="reference external" href="https://github.com/sorgerlab/famplex">https://github.com/sorgerlab/famplex</a></li>
<li>Code for paper: <a class="reference external" href="https://github.com/sorgerlab/famplex_paper">https://github.com/sorgerlab/famplex_paper</a></li>
<li>Browse FamPlex at NCBO BioPortal: <a class="reference external" href="http://purl.bioontology.org/ontology/FPLX">http://purl.bioontology.org/ontology/FPLX</a></li>
<li>identifiers.org prefix: <a class="reference external" href="https://www.ebi.ac.uk/miriam/main/datatypes/MIR:00000651">fplx</a></li>
</ul>
<p><strong>Building a Python 2/3 compatible Unicode Sandwich</strong> (2017-03-10)</p>
<p>So you've decided that your code needs to be compatible with both Python 2 and
Python 3. Most likely, you're upgrading your Python 2 code to work in Python 3,
and know that you need to do things like:</p>
<ul class="simple">
<li>Replace all calls to <tt class="docutils literal">print</tt> with <tt class="docutils literal">print()</tt></li>
<li>Use absolute rather than relative imports</li>
<li>Call <tt class="docutils literal">dict.items()</tt> instead of <tt class="docutils literal">dict.iteritems()</tt></li>
<li>etc.</li>
</ul>
<p>But for me, and perhaps for you as well, by far the biggest and most
complicated issue in getting code to be jointly Python 2/3 compatible is
maintaining the <strong>Unicode Sandwich.</strong> If you don't know what a Unicode Sandwich
is, please see these slides by Ned Batchelder: <a class="reference external" href="https://nedbatchelder.com/text/unipain/unipain.html#1">Pragmatic Unicode, or How do I
stop the Pain?</a> The
idea is that in the 21st century, all text inside your application should be
Unicode, with conversions to and from various specific character encodings at
the margins.</p>
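<p>The whole idea fits in a few lines: decode once on the way in, work only in Unicode, encode once on the way out:</p>

```python
# The sandwich: decode bytes at the input boundary, work in Unicode
# internally, encode back to bytes only at the output boundary.
def read_input(raw_bytes, encoding='utf-8'):
    return raw_bytes.decode(encoding)   # bottom slice: bytes -> text

def process(text):
    return text.upper()                 # the filling: Unicode only

def write_output(text, encoding='utf-8'):
    return text.encode(encoding)        # top slice: text -> bytes

out = write_output(process(read_input(b'caf\xc3\xa9')))
```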
<p>Implementing a Unicode Sandwich in Python 3 isn't too bad because Python 3
explicitly distinguishes between decoded Unicode text (of type <tt class="docutils literal">str</tt>) and
encoded characters (of type <tt class="docutils literal">bytes</tt>). This means that if you do a little bit
of extra work to always explicitly convert <tt class="docutils literal">bytes</tt> (that you might get from a
web service, text file, or some other source) into Unicode, and do the reverse
conversion when writing outputs, then voila, problem solved.</p>
<p>However, in Python 2 the <tt class="docutils literal">str</tt> type contains bytes, which are
implicitly converted (decoded) to Unicode when mixing bytes and Unicode. When
you have an application that deals with lots of external files, services, and
resources, some of which return Unicode content and others encoded bytes, along
with the string literals in your own code (which are bytes in Python 2 and
Unicode in Python 3), you have a recipe for confusion.</p>
<p><strong>Making a Unicode sandwich work in both Python 2 and 3 requires
systematically going through your code and enforcing the bytes/Unicode
conversions at all the appropriate places, using code that works for both
versions.</strong> The trick is that every data source and library is a little bit
different--some accept only bytes to their key functions, others only Unicode
strings, so the way of doing the appropriate conversions takes a little
figuring. What follows are some notes I compiled when rewriting our <a class="reference external" href="http://github.com/sorgerlab/indra">INDRA
software</a> (which deals with natural
language text and several different types of databases) to support Unicode in a
Python 2/3 compatible way. Hopefully these notes will point you in the right
direction if you are trying to do something similar.</p>
<div class="section" id="boilerplate-imports">
<h2>Boilerplate imports</h2>
<p>If you're going to maintain a Unicode sandwich, you'll need any strings that
you define in your code to be Unicode strings:</p>
<div class="highlight"><pre><span></span><span class="kn">from</span> <span class="nn">__future__</span> <span class="kn">import</span> <span class="n">unicode_literals</span>
</pre></div>
<p>You'll also want (at least) the following builtins in your imports in every file:</p>
<div class="highlight"><pre><span></span><span class="kn">from</span> <span class="nn">builtins</span> <span class="kn">import</span> <span class="nb">dict</span><span class="p">,</span> <span class="nb">str</span>
</pre></div>
<p>Redefining <tt class="docutils literal">dict</tt> and <tt class="docutils literal">str</tt> in this way in Python 2 causes them to behave
like the corresponding types in Python 3, e.g. <tt class="docutils literal">dict.items()</tt> returns a
generator rather than a list, and <tt class="docutils literal">str</tt> can be treated as a Unicode string
rather than a bytestring. Because of the redefinition of <tt class="docutils literal">str</tt>,
<tt class="docutils literal">isinstance(u'foo', str)</tt> will be <tt class="docutils literal">True</tt> in Python 2, so you can use the
same code for both Python 2 and 3. Also, this redefines the <tt class="docutils literal">str()</tt>
constructor so that when you convert an object to a <tt class="docutils literal">str</tt> in Python 2 (e.g.,
<tt class="docutils literal">str(5)</tt>) you'll end up with a Unicode-compatible string, not bytes.</p>
</div>
<div class="section" id="string-type-checking">
<h2>String type checking</h2>
<p>In Python 2, it's common to check whether an argument is a string by checking
<tt class="docutils literal">isinstance(foo, basestring)</tt>, because <tt class="docutils literal">basestring</tt> is the supertype of
both Python 2's <tt class="docutils literal">str</tt> and <tt class="docutils literal">unicode</tt> types. This is a handy way to, for
example, tell whether the argument to a function is a string or a list.
However, because <tt class="docutils literal">basestring</tt> doesn't exist in Python 3, this has to be
changed.</p>
<p>The best solution is to use Unicode everywhere in Python 2, importing <tt class="docutils literal">from
builtins import str</tt> (as recommended above) and then using <tt class="docutils literal">isinstance(foo,
str)</tt>, where you would have previously used <tt class="docutils literal">basestring</tt>. This is then
compatible with both Python 2 and 3.</p>
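<p>For example, a helper that accepts either a single string or a list of strings can then be written once for both versions:</p>

```python
# Works as-is on Python 3; on Python 2 it additionally relies on
# `from builtins import str` having been done, as recommended above.
def as_list(arg):
    """Wrap a lone string in a list; pass other iterables through."""
    if isinstance(arg, str):
        return [arg]
    return list(arg)
```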
<p>However, if your Unicode sandwich isn't completely airtight and there's a
possibility that <tt class="docutils literal">foo</tt> might be a Python 2 bytestring, then <tt class="docutils literal">isinstance(foo,
str)</tt> will be False when using the above approach, possibly leading to silent
failures. In this case you might want to stick with the following workaround
that retains <tt class="docutils literal">basestring</tt>:</p>
<div class="highlight"><pre><span></span><span class="c1"># Will be OK in Python 2</span>
<span class="k">try</span><span class="p">:</span>
<span class="nb">basestring</span>
<span class="c1"># Allows isinstance(foo, basestring) to work in Python 3</span>
<span class="k">except</span> <span class="ne">NameError</span><span class="p">:</span>
<span class="nb">basestring</span> <span class="o">=</span> <span class="nb">str</span>
</pre></div>
</div>
<div class="section" id="custom-str-methods">
<h2>Custom __str__ methods</h2>
<p>In Python 2, custom <tt class="docutils literal">__str__</tt> methods are expected to return a bytestring (Python
2 <tt class="docutils literal">str</tt>), not a Unicode string (<tt class="docutils literal">unicode</tt>). This can be a problem once your
objects contain only hard-won Unicode strings. Fortunately the <tt class="docutils literal">future</tt>
package contains a decorator, <a class="reference external" href="http://python-future.org/what_else.html#custom-str-methods">@python_2_unicode_compatible,</a> to make your
<tt class="docutils literal">__str__</tt> method work in both Python 2 and 3. However, make sure that you
apply the decorator only once in your object hierarchy, or you will get an
error (i.e., don't apply the decorator both in a superclass and a subclass).</p>
</div>
<div class="section" id="web-services-urllib-and-requests">
<h2>Web services: urllib and requests</h2>
<p>The structure of the <tt class="docutils literal">urllib</tt> library is very different between Python 2 and
3, and <tt class="docutils literal">urllib</tt> calls require much more care in converting bytes/unicode. Do
yourself a favor and rewrite any web service calls using <tt class="docutils literal">requests</tt> instead
of any of the <tt class="docutils literal">urllib</tt> methods. <tt class="docutils literal">requests</tt> takes a dict of query parameters
directly, eliminating the need to urlencode and/or UTF-8 encode request
content. The <tt class="docutils literal">response</tt> object returned by the <tt class="docutils literal">requests</tt> library gives you
access to both the underlying bytes (in <tt class="docutils literal">response.content</tt>) as well as a
decoded Unicode version (in <tt class="docutils literal">response.text</tt>). You can also get a JSON object
directly using <tt class="docutils literal">response.json()</tt>.</p>
<p>If you're stuck using <tt class="docutils literal">urllib</tt> for some reason, here are some things to note.
First, wrap your imports in a try/catch, e.g.:</p>
<div class="highlight"><pre><span></span><span class="c1"># Python 3 version</span>
<span class="k">try</span><span class="p">:</span>
<span class="kn">from</span> <span class="nn">urllib.request</span> <span class="kn">import</span> <span class="n">urlopen</span>
<span class="kn">from</span> <span class="nn">urllib.error</span> <span class="kn">import</span> <span class="n">HTTPError</span>
<span class="kn">from</span> <span class="nn">urllib.parse</span> <span class="kn">import</span> <span class="n">urlencode</span>
<span class="c1"># Python 2 version</span>
<span class="k">except</span> <span class="ne">ImportError</span><span class="p">:</span>
<span class="kn">from</span> <span class="nn">urllib</span> <span class="kn">import</span> <span class="n">urlencode</span>
<span class="kn">from</span> <span class="nn">urllib2</span> <span class="kn">import</span> <span class="n">urlopen</span><span class="p">,</span> <span class="n">HTTPError</span>
</pre></div>
<p>When calling <tt class="docutils literal">urlopen</tt>, the second argument must be bytes, so you'll need
to do:</p>
<div class="highlight"><pre><span></span><span class="n">result</span> <span class="o">=</span> <span class="n">urlopen</span><span class="p">(</span><span class="n">url</span><span class="p">,</span> <span class="n">data</span><span class="o">.</span><span class="n">encode</span><span class="p">(</span><span class="s1">'utf-8'</span><span class="p">))</span>
</pre></div>
<p>And when reading from the object, you'll need to decode back to Unicode:</p>
<div class="highlight"><pre><span></span><span class="n">response_text</span> <span class="o">=</span> <span class="n">result</span><span class="o">.</span><span class="n">read</span><span class="p">()</span><span class="o">.</span><span class="n">decode</span><span class="p">(</span><span class="s1">'utf-8'</span><span class="p">)</span>
</pre></div>
<p>In some cases you'll want to keep the response content in bytes form if you're
passing it to another library that expects only bytes. For example, <tt class="docutils literal">rdflib</tt>
and <tt class="docutils literal">json</tt> perform the bytes/unicode conversion internally (see below).</p>
</div>
<div class="section" id="reading-writing-csv-files">
<h2>Reading/writing CSV files</h2>
<p>There are a few differences in the procedure for reading and writing CSV files
between Python 2 and 3. In Python 3, the encoding should be specified when
opening the file in text mode, so that the csv reader/writer objects then
operate on Unicode strings. A <tt class="docutils literal"><span class="pre">newline=''</span></tt> argument is also required when opening
the file.</p>
<p>For Python 2, the <tt class="docutils literal">encoding</tt> and <tt class="docutils literal">newline</tt> arguments are not permitted, so
the file opening step has to occur in an alternative block. Also, the
<tt class="docutils literal">delimiter</tt> and <tt class="docutils literal">quotechar</tt> arguments can be Unicode in Python 3, but in
Python 2 they must be bytes (the <tt class="docutils literal">lineterminator</tt> argument does not need to
be encoded to bytes in Python 2, however).</p>
<p>Finally, in Python 2 the csv reader returns byte strings, so each field must be explicitly
decoded into Unicode. Here is an example that handles the complete process and
returns a generator. Note that the Python 2 version assumes that it is getting
Unicode strings as arguments (which is the case when using <tt class="docutils literal">unicode_literals</tt>
as I recommend), which is why they have to be encoded:</p>
<div class="highlight"><pre><span></span><span class="k">def</span> <span class="nf">read_unicode_csv</span><span class="p">(</span><span class="n">filename</span><span class="p">,</span> <span class="n">delimiter</span><span class="o">=</span><span class="s1">','</span><span class="p">,</span> <span class="n">quotechar</span><span class="o">=</span><span class="s1">'"'</span><span class="p">,</span>
<span class="n">quoting</span><span class="o">=</span><span class="n">csv</span><span class="o">.</span><span class="n">QUOTE_MINIMAL</span><span class="p">,</span> <span class="n">lineterminator</span><span class="o">=</span><span class="s1">'</span><span class="se">\n</span><span class="s1">'</span><span class="p">,</span>
<span class="n">encoding</span><span class="o">=</span><span class="s1">'utf-8'</span><span class="p">):</span>
<span class="c1"># Python 3 version</span>
<span class="k">if</span> <span class="n">sys</span><span class="o">.</span><span class="n">version_info</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">>=</span> <span class="mi">3</span><span class="p">:</span>
<span class="c1"># Open the file in text mode with given encoding</span>
<span class="c1"># Set newline arg to ''</span>
<span class="c1"># (see https://docs.python.org/3/library/csv.html)</span>
<span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="n">filename</span><span class="p">,</span> <span class="s1">'r'</span><span class="p">,</span> <span class="n">newline</span><span class="o">=</span><span class="s1">''</span><span class="p">,</span> <span class="n">encoding</span><span class="o">=</span><span class="n">encoding</span><span class="p">)</span> <span class="k">as</span> <span class="n">f</span><span class="p">:</span>
<span class="c1"># Next, get the csv reader, with unicode delimiter and quotechar</span>
            <span class="n">csv_reader</span> <span class="o">=</span> <span class="n">csv</span><span class="o">.</span><span class="n">reader</span><span class="p">(</span><span class="n">f</span><span class="p">,</span> <span class="n">delimiter</span><span class="o">=</span><span class="n">delimiter</span><span class="p">,</span>
                                    <span class="n">quotechar</span><span class="o">=</span><span class="n">quotechar</span><span class="p">,</span>
                                    <span class="n">quoting</span><span class="o">=</span><span class="n">quoting</span><span class="p">,</span>
                                    <span class="n">lineterminator</span><span class="o">=</span><span class="n">lineterminator</span><span class="p">)</span>
            <span class="c1"># Now, iterate over the (already decoded) csv_reader generator</span>
            <span class="k">for</span> <span class="n">row</span> <span class="ow">in</span> <span class="n">csv_reader</span><span class="p">:</span>
                <span class="k">yield</span> <span class="n">row</span>
    <span class="c1"># Python 2 version</span>
    <span class="k">else</span><span class="p">:</span>
        <span class="c1"># Open the file in bytes mode</span>
        <span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="n">filename</span><span class="p">,</span> <span class="s1">'rb'</span><span class="p">)</span> <span class="k">as</span> <span class="n">f</span><span class="p">:</span>
            <span class="c1"># Next, get the csv reader, passing delimiter and quotechar as</span>
            <span class="c1"># bytestrings rather than unicode</span>
            <span class="n">csv_reader</span> <span class="o">=</span> <span class="n">csv</span><span class="o">.</span><span class="n">reader</span><span class="p">(</span><span class="n">f</span><span class="p">,</span> <span class="n">delimiter</span><span class="o">=</span><span class="n">delimiter</span><span class="o">.</span><span class="n">encode</span><span class="p">(</span><span class="n">encoding</span><span class="p">),</span>
                                    <span class="n">quotechar</span><span class="o">=</span><span class="n">quotechar</span><span class="o">.</span><span class="n">encode</span><span class="p">(</span><span class="n">encoding</span><span class="p">),</span>
                                    <span class="n">quoting</span><span class="o">=</span><span class="n">quoting</span><span class="p">,</span>
                                    <span class="n">lineterminator</span><span class="o">=</span><span class="n">lineterminator</span><span class="p">)</span>
            <span class="c1"># Iterate over the file and decode each string into unicode</span>
            <span class="k">for</span> <span class="n">row</span> <span class="ow">in</span> <span class="n">csv_reader</span><span class="p">:</span>
                <span class="k">yield</span> <span class="p">[</span><span class="n">cell</span><span class="o">.</span><span class="n">decode</span><span class="p">(</span><span class="n">encoding</span><span class="p">)</span> <span class="k">for</span> <span class="n">cell</span> <span class="ow">in</span> <span class="n">row</span><span class="p">]</span>
</pre></div>
<p>Follow the corresponding procedure for writing CSV files.</p>
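<p>For writing, a mirror-image sketch (the function name and signature here are illustrative, not from the original code; the Python 3 branch lets the text-mode file object handle encoding, while the Python 2 branch encodes each cell by hand):</p>

```python
import csv
import sys

def write_unicode_csv(filename, rows, delimiter=u',', encoding='utf-8'):
    # Python 3 version: open in text mode with an explicit encoding;
    # the csv module reads and writes str (unicode) directly
    if sys.version_info[0] >= 3:
        with open(filename, 'w', encoding=encoding, newline='') as f:
            csv_writer = csv.writer(f, delimiter=delimiter)
            for row in rows:
                csv_writer.writerow(row)
    # Python 2 version: open in bytes mode, pass the delimiter as a
    # bytestring, and encode each cell before writing
    else:
        with open(filename, 'wb') as f:
            csv_writer = csv.writer(f, delimiter=delimiter.encode(encoding))
            for row in rows:
                csv_writer.writerow([cell.encode(encoding) for cell in row])
```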
</div>
<div class="section" id="pickling">
<h2>Pickling</h2>
<p>Pickling and unpickling must always be done with files opened explicitly in
binary mode. To maintain compatibility of pickled files with both Python 2 and
3, pickle files should be generated with protocol level 2, i.e., by
<tt class="docutils literal">pickle.dump(foo, fp, protocol=2)</tt>. These files can be opened by Python 2 as
well as 3.</p>
<p>If there are pre-existing pickle files generated by Python 2 that need to be
openable by Python 3, there is an optional <tt class="docutils literal">encoding</tt> argument to
<tt class="docutils literal">pickle.load</tt> that tells Python 3 how it should interpret non-ASCII byte
strings that were encoded into pickle files by Python 2. For some reason,
Python 2 pickles can sometimes fail to load in Python 3 unless the <tt class="docutils literal">encoding</tt>
argument to <tt class="docutils literal">pickle.load</tt> is set to <tt class="docutils literal"><span class="pre">latin-1</span></tt> (even if they were encoded in
Python 2 using UTF-8). This has been reported in quite a few places, including:</p>
<ul class="simple">
<li><a class="reference external" href="http://stackoverflow.com/questions/28218466/unpickling-a-python-2-object-with-python-3">Unpickling a python 2 object with python 3</a></li>
<li><a class="reference external" href="https://github.com/zopefoundation/ZODB/wiki/Pickles">Pickle interoperability between Python 2 and Python 3</a></li>
</ul>
<p>Annoyingly, the <tt class="docutils literal">encoding</tt> argument does not exist in Python 2, so you
will need two code paths for loading pickle files. If possible, it is
far better to recreate and/or repickle the data in Python 3.</p>
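<p>Both halves can be sketched as follows (a minimal sketch; the function names are illustrative):</p>

```python
import sys
import pickle

def save_pickle(obj, filename):
    # Always open pickle files in binary mode; protocol 2 is the
    # highest protocol that Python 2 can still read
    with open(filename, 'wb') as f:
        pickle.dump(obj, f, protocol=2)

def load_pickle(filename):
    with open(filename, 'rb') as f:
        if sys.version_info[0] >= 3:
            # latin-1 can decode any byte value, so legacy Python 2
            # str data loads without UnicodeDecodeErrors
            return pickle.load(f, encoding='latin-1')
        else:
            # Python 2's pickle.load has no encoding argument
            return pickle.load(f)
```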
</div>
<div class="section" id="parsing-xml-with-xml-etree-elementtree">
<h2>Parsing XML with xml.etree.ElementTree</h2>
<p>This XML parser expects bytes, not Unicode, converting the bytes into Unicode
internally. However, it is important to note that in Python 2, elements in the
parsed XML will contain <tt class="docutils literal">unicode</tt> text in the <tt class="docutils literal">et.text</tt> field <cite>only when
the element contains a non-ASCII character.</cite> This means that if the XML contains
mostly ASCII-compatible strings, they will come back as Python 2 <tt class="docutils literal">str</tt>,
leaking bytes into your otherwise pure Unicode sandwich. This would generally
be OK, except that if you have unit tests that check objects for Unicode
objects this will lead to failures. Moreover, explicitly converting the
ASCII-compatible strings with <tt class="docutils literal">unicode(foo)</tt> is problematic in cases where
the string can be None, as it will introduce the string <tt class="docutils literal">'None'</tt> into the
data!</p>
<p>Here's a tricky solution (hack?) that I adapted from <a class="reference external" href="http://www.gossamer-threads.com/lists/python/python/728903">this thread</a> that ensures
that <tt class="docutils literal">etree</tt> returns only Unicode strings and uses the same syntax in the
caller between Python 2 and 3. It involves subclassing
<tt class="docutils literal">xml.etree.ElementTree.XMLTreeBuilder</tt> in Python 2 and overriding a single
method. The trick is that in Python 3, a corresponding function is defined that
simply returns <tt class="docutils literal">None</tt>:</p>
<div class="highlight"><pre><span></span><span class="kn">import</span> <span class="nn">xml.etree.ElementTree</span> <span class="kn">as</span> <span class="nn">ET</span>
<span class="k">if</span> <span class="n">sys</span><span class="o">.</span><span class="n">version_info</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">>=</span> <span class="mi">3</span><span class="p">:</span>
    <span class="k">def</span> <span class="nf">UnicodeXMLTreeBuilder</span><span class="p">():</span>
        <span class="k">return</span> <span class="bp">None</span>
<span class="k">else</span><span class="p">:</span>
    <span class="k">class</span> <span class="nc">UnicodeXMLTreeBuilder</span><span class="p">(</span><span class="n">ET</span><span class="o">.</span><span class="n">XMLTreeBuilder</span><span class="p">):</span>
        <span class="c1"># See this thread:</span>
        <span class="c1"># http://www.gossamer-threads.com/lists/python/python/728903</span>
        <span class="k">def</span> <span class="nf">_fixtext</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">text</span><span class="p">):</span>
            <span class="k">return</span> <span class="n">text</span>
<span class="c1"># Get XML content as bytes, e.g., via urlopen</span>
<span class="n">response</span> <span class="o">=</span> <span class="n">urlopen</span><span class="p">(</span><span class="o">...</span><span class="p">)</span>
<span class="n">tree</span> <span class="o">=</span> <span class="n">ET</span><span class="o">.</span><span class="n">parse</span><span class="p">(</span><span class="n">response</span><span class="p">,</span> <span class="n">parser</span><span class="o">=</span><span class="n">UnicodeXMLTreeBuilder</span><span class="p">())</span>
<span class="c1"># Or, parse directly from a bytestring</span>
<span class="n">xml_str</span> <span class="o">=</span> <span class="sa">b</span><span class="s1">'<foo><bar>baz</bar></foo>'</span>
<span class="n">tree</span> <span class="o">=</span> <span class="n">ET</span><span class="o">.</span><span class="n">XML</span><span class="p">(</span><span class="n">xml_str</span><span class="p">,</span> <span class="n">parser</span><span class="o">=</span><span class="n">UnicodeXMLTreeBuilder</span><span class="p">())</span>
</pre></div>
<p>In Python 2, the call to <tt class="docutils literal">UnicodeXMLTreeBuilder()</tt> returns an instance of the
appropriate parser, whereas in Python 3, it returns <tt class="docutils literal">None</tt> and allows the
<tt class="docutils literal">ElementTree.XML</tt> and <tt class="docutils literal">ElementTree.parse</tt> functions to operate normally.
The upshot is that the <tt class="docutils literal">parser</tt> argument should always be passed when using
either function.</p>
</div>
<div class="section" id="json">
<h2>JSON</h2>
<p>When writing a Unicode-containing Python object to a JSON file or string using
<tt class="docutils literal">json.dump</tt> or <tt class="docutils literal">json.dumps</tt>, note that the object produced is,
counterintuitively, a <tt class="docutils literal">str</tt> in Python 2 (bytestring), but with all non-ASCII
characters escaped ("Python encoded") and hence suitable for writing to a file
in text mode.</p>
<div class="highlight"><pre><span></span><span class="c1"># Python 2</span>
<span class="o">>>></span> <span class="kn">import</span> <span class="nn">json</span>
<span class="o">>>></span> <span class="n">foo</span> <span class="o">=</span> <span class="sa">u</span><span class="s1">'</span><span class="se">\U0001F4A9</span><span class="s1">'</span>
<span class="o">>>></span> <span class="nb">type</span><span class="p">(</span><span class="n">foo</span><span class="p">)</span>
<span class="o"><</span><span class="nb">type</span> <span class="s1">'unicode'</span><span class="o">></span>
<span class="o">>>></span> <span class="n">bar</span> <span class="o">=</span> <span class="n">json</span><span class="o">.</span><span class="n">dumps</span><span class="p">(</span><span class="n">foo</span><span class="p">)</span>
<span class="o">>>></span> <span class="n">bar</span>
<span class="s1">'"</span><span class="se">\\</span><span class="s1">ud83d</span><span class="se">\\</span><span class="s1">udca9"'</span>
<span class="o">>>></span> <span class="nb">type</span><span class="p">(</span><span class="n">bar</span><span class="p">)</span>
<span class="o"><</span><span class="nb">type</span> <span class="s1">'str'</span><span class="o">></span>
<span class="o">>>></span> <span class="n">baz</span> <span class="o">=</span> <span class="n">json</span><span class="o">.</span><span class="n">loads</span><span class="p">(</span><span class="n">bar</span><span class="p">)</span>
<span class="o">>>></span> <span class="n">baz</span>
<span class="sa">u</span><span class="s1">'</span><span class="se">\U0001f4a9</span><span class="s1">'</span>
<span class="o">>>></span> <span class="nb">type</span><span class="p">(</span><span class="n">baz</span><span class="p">)</span>
<span class="o"><</span><span class="nb">type</span> <span class="s1">'unicode'</span><span class="o">></span>
</pre></div>
<p>In Python 3 <tt class="docutils literal">json.dumps</tt> returns <tt class="docutils literal">str</tt>, suitable for writing to text mode
files.</p>
<p>Similarly, to load a JSON object with <tt class="docutils literal">load</tt> or <tt class="docutils literal">loads</tt>, in both Python 2
and 3 the <tt class="docutils literal">json</tt> module expects a <tt class="docutils literal">str</tt> (not a Python 3 <tt class="docutils literal">bytes</tt>). This
means that all JSON files should be opened in text (not binary) mode, and
should be created by <tt class="docutils literal">json.dump</tt> rather than by some other process that would
leave encoded byte strings in the file.</p>
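<p>A minimal round-trip sketch that works unchanged in both versions (the function names are illustrative):</p>

```python
import json

def save_json(obj, filename):
    # Open in text (not binary) mode; with the default ensure_ascii=True,
    # json.dump writes only ASCII characters, so this is safe in both
    # Python 2 and 3
    with open(filename, 'w') as f:
        json.dump(obj, f)

def load_json(filename):
    # json.load expects a text-mode file object, not bytes
    with open(filename, 'r') as f:
        return json.load(f)
```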
<p>Three other things are worth noting:</p>
<ul class="simple">
<li>When dumping with <tt class="docutils literal">json.dumps(foo)</tt> in Python 2, <tt class="docutils literal">foo</tt> itself can contain a
mix of <tt class="docutils literal">str</tt> and <tt class="docutils literal">unicode</tt> strings as long as the <tt class="docutils literal">str</tt> objects are ASCII
only.</li>
<li>When loading with <tt class="docutils literal">foo = <span class="pre">json.loads(...)</span></tt> in Python 2, the object returned
will contain only <tt class="docutils literal">unicode</tt> strings, even if those strings were <tt class="docutils literal">str</tt>
when they were dumped. For example:</li>
</ul>
<div class="highlight"><pre><span></span><span class="c1"># Python 2</span>
<span class="o">>>></span> <span class="kn">import</span> <span class="nn">json</span>
<span class="c1"># All strings are str, not unicode</span>
<span class="o">>>></span> <span class="n">foo</span> <span class="o">=</span> <span class="p">[</span><span class="s1">'foo'</span><span class="p">,</span> <span class="p">{</span><span class="s1">'bar'</span><span class="p">:</span> <span class="p">(</span><span class="s1">'baz'</span><span class="p">,</span> <span class="bp">None</span><span class="p">,</span> <span class="mf">1.0</span><span class="p">,</span> <span class="mi">2</span><span class="p">)}]</span>
<span class="c1"># Will come back with all strings unicode</span>
<span class="o">>>></span> <span class="n">json</span><span class="o">.</span><span class="n">loads</span><span class="p">(</span><span class="n">json</span><span class="o">.</span><span class="n">dumps</span><span class="p">(</span><span class="n">foo</span><span class="p">))</span>
<span class="p">[</span><span class="sa">u</span><span class="s1">'foo'</span><span class="p">,</span> <span class="p">{</span><span class="sa">u</span><span class="s1">'bar'</span><span class="p">:</span> <span class="p">[</span><span class="sa">u</span><span class="s1">'baz'</span><span class="p">,</span> <span class="bp">None</span><span class="p">,</span> <span class="mf">1.0</span><span class="p">,</span> <span class="mi">2</span><span class="p">]}]</span>
</pre></div>
<ul class="simple">
<li>In Python 3, calling <tt class="docutils literal">json.dumps</tt> on an object containing any bytestrings
will lead to a <tt class="docutils literal">TypeError</tt>.</li>
</ul>
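<p>A quick check of the last point (a minimal sketch):</p>

```python
import json

# In Python 3, bytes are not JSON-serializable by default
try:
    json.dumps({'key': b'raw bytes'})
    serialized = True
except TypeError:
    serialized = False
# serialized is False: json.dumps raised TypeError on the bytestring
```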
</div>
<div class="section" id="rdf-and-rdflib">
<h2>RDF and rdflib</h2>
<p>When serializing an <tt class="docutils literal">rdflib.Graph</tt> object (e.g., for writing to a file), the
encoding can be specified by an argument to the <tt class="docutils literal">serialize</tt> function, which
returns bytes:</p>
<div class="highlight"><pre><span></span><span class="n">g</span><span class="o">.</span><span class="n">serialize</span><span class="p">(</span><span class="n">format</span><span class="o">=</span><span class="s1">'xml'</span><span class="p">,</span> <span class="n">encoding</span><span class="o">=</span><span class="s1">'utf-8'</span><span class="p">)</span>
</pre></div>
<p>This can then be written to a file opened in bytes mode (i.e., with the <tt class="docutils literal">wb</tt>
arg), e.g.:</p>
<div class="highlight"><pre><span></span><span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="n">file_path</span><span class="p">,</span> <span class="s1">'wb'</span><span class="p">)</span> <span class="k">as</span> <span class="n">out_file</span><span class="p">:</span> <span class="c1"># Binary mode</span>
<span class="n">xml_bytes</span> <span class="o">=</span> <span class="n">g</span><span class="o">.</span><span class="n">serialize</span><span class="p">(</span><span class="n">format</span><span class="o">=</span><span class="s1">'xml'</span><span class="p">,</span> <span class="n">encoding</span><span class="o">=</span><span class="s1">'utf-8'</span><span class="p">)</span>
<span class="n">out_file</span><span class="o">.</span><span class="n">write</span><span class="p">(</span><span class="n">xml_bytes</span><span class="p">)</span>
</pre></div>
</div>
What fraction of articles can you expect to be available for text mining from Pubmed Central?2016-05-20T00:00:00-05:002016-09-05T16:30:51-05:00John A. Bachmantag:johnbachman.net,2016-05-20:/what-fraction-of-articles-can-you-expect-to-be-available-for-text-mining-from-pubmed-central.html<p>In a <a class="reference external" href="http://johnbachman.net/assembling-a-text-mining-corpus-for-the-ras-pathway.html">previous post,</a> I described
assembling sets of Pubmed references relevant to <a class="reference external" href="http://www.cancer.gov/research/key-initiatives/ras/ras-central/blog/ras-pathway-v2">227 genes in the Ras pathway.</a>
The next problem was <strong>getting access to mineable content.</strong></p>
<p>The most readily available source of content for text mining by researchers is
the <a class="reference external" href="http://www.ncbi.nlm.nih.gov/pmc/tools/openftlist/">Pubmed Central Open Access article subset</a>, and most demonstrations
of text mining tools I have seen within the <a class="reference external" href="http://www.darpa.mil/program/big-mechanism">Big Mechanism program</a> have used this set of articles
almost exclusively. These articles have licenses allowing access and reuse
suitable for our own (non-commercial research) purposes, though beyond those
uses what is permitted depends on the specific, article-by-article license.</p>
<p>For those new to text-mining, as I am, it's worth noting that <em>just because you
can read all 4 million full-text articles on the Pubmed Central website doesn't
mean you can mine them.</em> In fact, only about 1.2 million of the articles in
Pubmed Central are in the Open Access subset--the majority are off-limits to
bulk downloading and mining due to copyright restrictions. And that's out of a
total of more than <a class="reference external" href="https://www.nlm.nih.gov/pubs/factsheets/dif_med_pub.html">25 million articles in Pubmed</a>. So depending on
what's in your denominator, the fraction of open access articles relevant to
your topic could be pretty small.</p>
<p>Another source of articles for text mining is Pubmed Central's <a class="reference external" href="http://www.ncbi.nlm.nih.gov/pmc/about/mscollection/">Author's
Manuscript Collection</a>.
These consist of accepted manuscripts uploaded by authors in compliance with
the NIH Public Access Policy, and they are also eligible for text mining. This
dataset currently consists of ~386,000 articles. This brings the total number
of mineable articles from Pubmed Central to ~1.6 million.</p>
<p>Some publishers are beginning to make the full texts of non-open-access
articles available to subscribers via the <a class="reference external" href="http://tdmsupport.crossref.org/researcher-faq/">CrossRef text and data mining API</a>, which is a subject for
another day. Nevertheless, by far the most straightforward way to gain access
to the full text of an article for mining is from Pubmed Central--if it's in
the Pubmed Central Open Access or Author's Manuscript subsets.</p>
<p><strong>I downloaded the Open Access and Author's Manuscript subsets from PMC and
cross-referenced the available content to the full list of PMC articles.</strong>
Before getting into the fraction of articles in these subsets, it's worth
noting that <strong>many articles in PMC do not have PMIDs or DOIs associated with
them,</strong> as shown in the following Venn diagram:</p>
<div class="center figure">
<img alt="IDs for articles in Pubmed Central" src="images/pmc_ids_venn.png" style="width: 50%;" />
<p class="caption"><em>IDs (in addition to PMC ID) associated with the 3,998,986 articles in
Pubmed Central.</em> (<a class="reference external" href="images/pmc_ids_venn.pdf">PDF</a>)</p>
</div>
<p>This has two important consequences: first, since our current approach to
building a corpus of papers involves searching Pubmed, we will always be
starting with PMIDs, so the ~500,000 articles in Pubmed Central with no PMID
are not relevant. Second, for papers that don't turn out to be available in
mineable form from PMC, we'll have to look elsewhere (i.e., CrossRef) for full
text content, for which we'll need DOIs. Since PMC doesn't have DOIs on file
for nearly a third of its articles, we have to obtain them from somewhere else.</p>
<p><strong>Starting with the subset of ~3.5 million articles in PMC that have PMIDs, I
looked at how many of these are in the Open Access subset, the Author's
Manuscript collection, or both.</strong> As the Venn diagram shows, these two
collections of mineable articles are mostly complementary, with a relatively
small number of articles appearing in both sets.</p>
<div class="center figure">
<img alt="Subsets of articles in PMC with PMIDs available for mining." src="images/pmc_oa_auth_venn.png" style="width: 50%;" />
<p class="caption"><em>Articles in Pubmed Central that have PMIDs and are available for mining
in either the Open Access subset (OA subset) or the Author's Manuscript
collection (Author MS).</em> (<a class="reference external" href="images/pmc_oa_auth_venn.pdf">PDF</a>)</p>
</div>
<p><strong>Next, I looked at the fraction of mineable articles in PMC (OA subset or
Author's MS) that one tends to get from Pubmed search results, using the two
sets of references for the 227 Ras genes I described in my previous post as
examples</strong> (a set of ~356,000 papers obtained by gene name searches, and a
smaller set of ~54,000 papers obtained from the Entrez Gene database).</p>
<p>The results for the larger corpus of 356k papers show a roughly consistent
fraction of mineable papers in PMC that appears to be independent of the total
number of citations for the gene:</p>
<div class="center figure">
<img alt="Percentage of references with full text in Pubmed Central" src="images/pmids_by_name_ft_line.png" />
<p class="caption"><em>Percentage of references for each gene search with full text in Pubmed
Central, sorted by number of total references</em> (<a class="reference external" href="images/pmids_by_name_ft_line.pdf">PDF</a>).</p>
</div>
<p>The results for the smaller corpus resulting from searching Entrez Gene by gene
ID were similar:</p>
<div class="center figure">
<img alt="Percentage of references with full text in Pubmed Central" src="images/pmids_by_gene_ft_line.png" />
<p class="caption"><em>Percentage of references for each gene search with full text in Pubmed
Central, sorted by number of total references</em> (<a class="reference external" href="images/pmids_by_gene_ft_line.pdf">PDF</a>).</p>
</div>
<p>For unique references combined across all 227 gene searches, the fraction of
mineable articles in Pubmed Central made up roughly 16-18%:</p>
<table border="1" class="docutils">
<colgroup>
<col width="40%" />
<col width="33%" />
<col width="28%" />
</colgroup>
<thead valign="bottom">
<tr><th class="head"> </th>
<th class="head">By gene name</th>
<th class="head">By gene ID</th>
</tr>
</thead>
<tbody valign="top">
<tr><td>Unique refs</td>
<td>355,781</td>
<td>54,308</td>
</tr>
<tr><td>Mineable in PMC</td>
<td>58,719</td>
<td>9,885</td>
</tr>
<tr><td>Percentage</td>
<td>16.5%</td>
<td>18.2%</td>
</tr>
</tbody>
</table>
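<p>The percentages in the last row follow directly from the counts (values copied from the table above):</p>

```python
# Unique references returned by each search strategy (from the table)
unique_refs = {'by_name': 355781, 'by_id': 54308}
# Of those, the number available in the OA or Author's MS subsets
mineable = {'by_name': 58719, 'by_id': 9885}

# Fraction of unique references that are mineable in PMC
percentages = {key: 100.0 * mineable[key] / unique_refs[key]
               for key in unique_refs}
# percentages: roughly 16.5% by gene name, 18.2% by gene ID
```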
<p>There is another way to look at the data, which is not across the set of
references for all genes, but on a gene-by-gene basis. In other words, <strong>if
you are working on a particular gene, what fraction of the articles on your
gene can you expect to find on PMC in a mineable form?</strong></p>
<p>Here is the distribution for the 227 Ras gene searches by gene name:</p>
<div class="center figure">
<img alt="Distribution of full text ratios for different gene name searches" src="images/pmids_by_name_ft_hist.png" />
<p class="caption"><em>Distribution of full text ratios for different gene name searches</em>
(<a class="reference external" href="images/pmids_by_name_ft_hist.pdf">PDF</a>).</p>
</div>
<p>And for the searches in Entrez Gene by gene ID:</p>
<div class="center figure">
<img alt="Distribution of full text ratios for references in Entrez gene" src="images/pmids_by_gene_ft_hist.png" />
<p class="caption"><em>Distribution of full text ratios for references in Entrez gene</em>
(<a class="reference external" href="images/pmids_by_gene_ft_hist.pdf">PDF</a>).</p>
</div>
<p>The mean and standard deviation for full text percentages across the full set
of genes:</p>
<table border="1" class="docutils">
<colgroup>
<col width="37%" />
<col width="34%" />
<col width="29%" />
</colgroup>
<thead valign="bottom">
<tr><th class="head"> </th>
<th class="head">By gene name</th>
<th class="head">By gene ID</th>
</tr>
</thead>
<tbody valign="top">
<tr><td>Mean % in PMC</td>
<td>20.8%</td>
<td>18.7%</td>
</tr>
<tr><td>Std Deviation</td>
<td>9.4%</td>
<td>6.9%</td>
</tr>
</tbody>
</table>
<p>As the histograms and the summary statistics show, <strong>while on average you might
expect to find 1 out of every 5 articles on your gene available for mining in
PMC, there is a lot of variability.</strong> If you're unlucky, you could easily end
up with less than 1 in 10. Moreover, these results are specifically for a set
of gene-based searches related to cancer biology. Our experience in
obtaining references for two other less molecularly-focused projects (in
diabetes and drug-induced cardiotoxicity) suggests that <strong>in other domains of
biology, the fraction of open access articles may be substantially less,
possibly due to different journal or publication practices.</strong></p>
Assembling a text-mining corpus for the Ras pathway2016-05-19T00:00:00-05:002016-05-24T22:33:35-05:00John A. Bachmantag:johnbachman.net,2016-05-19:/assembling-a-text-mining-corpus-for-the-ras-pathway.html<p>Over the last year and a half or so I've been involved in the <a class="reference external" href="http://www.darpa.mil/program/big-mechanism">Big Mechanism
program</a> sponsored by DARPA. The
practical goal of this program is to develop software systems to extract facts
from the scientific literature by text mining and, from these facts, assemble
causal, mechanistic models that can be used to explain and predict phenomena.
The bigger picture goal is to explore an approach to science in which machines
assume a greater share of the burden in aggregating and integrating research.
Though DARPA envisions applications of this type of technology in multiple
domains, the initial focus of the Big Mechanism program is in cancer biology,
specifically <a class="reference external" href="http://www.cancer.gov/research/key-initiatives/ras">Ras-driven cancer.</a></p>
<p>Most of my work on the Big Mechanism program up to this point has been to
<a class="reference external" href="https://github.com/sorgerlab/indra">develop tools that assemble mechanisms into models</a>, deconflicting, cleaning, and assembling
findings into different formats. Having had some success in automated assembly
of signaling models from databases (such as <a class="reference external" href="http://pathwaycommons.org/">Pathway Commons</a>) we are now looking to see how much more we can
enrich these models using large-scale machine reading.</p>
<p>I started looking into this in the context of a specific use case: assembling a
large-scale, high-quality model of the Ras signaling pathway, which I've been
developing along with Ben Gyori, Kartik Subramanian and other collaborators
here at HMS. As a starting point, we've defined the Ras signaling pathway
according to <a class="reference external" href="http://www.cancer.gov/research/key-initiatives/ras/ras-central/blog/ras-pathway-v2">Frank McCormick's RAS Pathway v2.0 diagram and accompanying table,</a>
which includes 227 genes organized into 65 groups.</p>
<p>The first question that arises is <strong>what is the best way to find papers
relevant to a set of genes?</strong> A requirement is that the process of querying for
publications should be automated, with minimal human intervention or curation.
I tried two (very simple) approaches:</p>
<ol class="arabic simple">
<li>Query Pubmed using the canonical (HGNC) gene name</li>
<li>Get the set of references associated with the gene from the
<a class="reference external" href="http://www.ncbi.nlm.nih.gov/gene">Entrez Gene</a> database.</li>
</ol>
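<p>Both approaches can be driven programmatically through the NCBI E-utilities (<tt class="docutils literal">esearch</tt> for free-text Pubmed queries, <tt class="docutils literal">elink</tt> for the curated gene-to-PMID links). A minimal sketch of the URL construction and JSON parsing involved (the function names are mine, not from an existing library; the endpoints and JSON layout follow the public E-utilities interface):</p>

```python
try:
    from urllib.parse import urlencode  # Python 3
except ImportError:
    from urllib import urlencode  # Python 2

EUTILS = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils'

def pubmed_search_url(gene_name, retmax=100000):
    # Approach 1: free-text Pubmed search on the canonical gene name
    params = {'db': 'pubmed', 'term': gene_name,
              'retmax': retmax, 'retmode': 'json'}
    return '%s/esearch.fcgi?%s' % (EUTILS, urlencode(params))

def gene_pubmed_link_url(gene_id):
    # Approach 2: curated gene -> PMID links from the Entrez Gene record
    params = {'dbfrom': 'gene', 'db': 'pubmed', 'id': gene_id,
              'retmode': 'json'}
    return '%s/elink.fcgi?%s' % (EUTILS, urlencode(params))

def pmids_from_esearch(response_json):
    # esearch's JSON output puts the matching PMIDs in esearchresult.idlist
    return response_json['esearchresult']['idlist']
```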
<p>The first approach produces a ton of results, but has some issues: for one, it
picks up false positives due to gene names that match unrelated terms:
for example, the gene name <em>JUN</em> seems to pick up any paper published in the month
of June. On the other hand, this approach also seems to <em>miss</em> relevant papers
due to the fact that most genes have several synonyms and many papers may refer
to the gene using non-standard names.</p>
<p>The second approach pulls the curated PMID references associated with the gene
from the Entrez Gene database. The set of PMIDs obtained by pulling all PMIDs
out of the XML result for the gene corresponds closely to the "Bibliography"
section of the Entrez Gene information page (e.g., see the <a class="reference external" href="https://www.ncbi.nlm.nih.gov/gene/672#bibliography">Bibliography
section for BRCA1</a>).</p>
<p>As expected, searching by gene name returns a much larger set of PMIDs (more
than 6 times larger) than obtaining the references from Entrez Gene. In both
cases there was a substantial fraction of papers that were returned by searches
for multiple genes, as might be expected for genes identified <em>a priori</em> as
being involved in a common biological process. In both cases roughly 75% of the
assembled list of PMIDs were unique.</p>
<table border="1" class="docutils">
<colgroup>
<col width="33%" />
<col width="36%" />
<col width="31%" />
</colgroup>
<thead valign="bottom">
<tr><th class="head"> </th>
<th class="head">By gene name</th>
<th class="head">By gene ID</th>
</tr>
</thead>
<tbody valign="top">
<tr><td>Total refs</td>
<td>464,917</td>
<td>74,529</td>
</tr>
<tr><td>Unique refs</td>
<td>355,781</td>
<td>54,308</td>
</tr>
</tbody>
</table>
<p><strong>How many citations do we tend to get by gene?</strong> The figure below shows the
distribution of the number of PMIDs returned for each gene, sorted by the
number of PMIDs returned by gene name search, and plotted on a log scale. The
distribution roughly follows a power law, with deviations for the most-cited
and least-cited genes.</p>
<div class="center figure">
<img alt="Citation distribution for 227 Ras genes" src="images/citations_by_gene.png" />
<p class="caption"><em>Citation distribution for 227 Ras genes, sorted by citation count for
name-based search</em> (<a class="reference external" href="pdfs/citations_by_gene.pdf">PDF</a>).</p>
</div>
<p>Reassuringly, the number of references returned by the gene ID search roughly
follows the number of references returned by the name search, but with
substantially fewer references overall. The least-cited genes appear to be an
exception to this pattern: for these the gene ID search appears to return a
larger number of references than the name search. This appears to be due to the
fact that the least-cited genes often appear in the literature under different
names, and Entrez Gene collates citations across multiple names.</p>
<p>The list of the top 10 genes (by citations) returns reassuringly familiar
names. If anything, the gene ID search returns a list closer to what one might
expect from how "famous" the genes tend to be, suggesting that it's less
susceptible to variability due to the use of the particular name in the
literature. For example, it's surprising that <em>TP53</em> doesn't make the top 10 in
the gene name search, probably because it's more frequently referred to by its
protein name, <em>p53,</em> than its official gene name, <em>TP53</em>. Similarly, <em>FOS</em> is
number 4 on the gene name list, but it's certainly not as well known as <em>NFKB1</em>
or <em>KRAS</em>, both of which make the top 10 by gene ID but not by gene name. A
quick scan of the search results for "FOS" revealed hits not only for the gene
<em>FOS</em>, but also false positives like "fructooligosaccharide" (FOS), "Framingham
Offspring Study" (FOS), and "foot orthoses" (FOs).</p>
<table border="1" class="docutils">
<colgroup>
<col width="13%" />
<col width="46%" />
<col width="41%" />
</colgroup>
<thead valign="bottom">
<tr><th class="head">Rank</th>
<th class="head">By gene name (refs)</th>
<th class="head">By gene ID (refs)</th>
</tr>
</thead>
<tbody valign="top">
<tr><td>1</td>
<td>CASP3 (47320)</td>
<td>TP53 (7598)</td>
</tr>
<tr><td>2</td>
<td>EGFR (38072)</td>
<td>EGFR (4056)</td>
</tr>
<tr><td>3</td>
<td>MYC (30819)</td>
<td>NFKB1 (2508)</td>
</tr>
<tr><td>4</td>
<td>FOS (27521)</td>
<td>AKT1 (2370)</td>
</tr>
<tr><td>5</td>
<td>ERBB2 (23076)</td>
<td>BRCA1 (2304)</td>
</tr>
<tr><td>6</td>
<td>MTOR (18677)</td>
<td>ERBB2 (2107)</td>
</tr>
<tr><td>7</td>
<td>MAPK1 (12766)</td>
<td>MAPK1 (1719)</td>
</tr>
<tr><td>8</td>
<td>BRCA1 (12458)</td>
<td>KRAS (1609)</td>
</tr>
<tr><td>9</td>
<td>CDKN1A (12266)</td>
<td>PTEN (1571)</td>
</tr>
<tr><td>10</td>
<td>MAPK3 (12144)</td>
<td>BRAF (1503)</td>
</tr>
</tbody>
</table>
<p>The genes with the fewest citations have a surprisingly small number of
references given that they were explicitly included in a curated set of key Ras
pathway genes. Many of them are lesser-known isoforms of widely studied gene
families (e.g., <em>SPRED3, RASA2, PIK3R5/6</em>):</p>
<table border="1" class="docutils">
<colgroup>
<col width="13%" />
<col width="46%" />
<col width="41%" />
</colgroup>
<thead valign="bottom">
<tr><th class="head">Rank</th>
<th class="head">By gene name (refs)</th>
<th class="head">By gene ID (refs)</th>
</tr>
</thead>
<tbody valign="top">
<tr><td>218</td>
<td>PIK3R5 (12)</td>
<td>RALGAPA2 (13)</td>
</tr>
<tr><td>219</td>
<td>SPRED3 (10)</td>
<td>RASGRP4 (13)</td>
</tr>
<tr><td>220</td>
<td>EXOC1 (7)</td>
<td>SPRY3 (12)</td>
</tr>
<tr><td>221</td>
<td>RALGAPA1 (7)</td>
<td>RASA2 (11)</td>
</tr>
<tr><td>222</td>
<td>RASSF9 (6)</td>
<td>RASSF10 (11)</td>
</tr>
<tr><td>223</td>
<td>CYTH2 (4)</td>
<td>RGL1 (11)</td>
</tr>
<tr><td>224</td>
<td>EXOC6 (4)</td>
<td>RASSF9 (10)</td>
</tr>
<tr><td>225</td>
<td>RALGAPA2 (4)</td>
<td>SPRED3 (8)</td>
</tr>
<tr><td>226</td>
<td>RASAL3 (3)</td>
<td>RASAL3 (7)</td>
</tr>
<tr><td>227</td>
<td>PIK3R6 (1)</td>
<td>RGL3 (5)</td>
</tr>
</tbody>
</table>
<p>There are of course many other ways to assemble corpora, including systematic
use of gene synonyms, exploitation of MeSH terms and other metadata, and
use of other search tools (e.g., CrossRef). These were two very simple ways to
get a sense of the scale of the relevant literature, with the expansive and
restricted searches giving rough upper and lower bounds. My conclusion is that
<strong>the curated references in Entrez Gene are less likely to contain false
positives, with the downside of missing many potentially relevant articles.</strong>
Given that the corpus returned by the Entrez Gene search is smaller,
I'll use this set of roughly 54,000 papers for an initial pilot study in machine
reading for mechanisms.</p>
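<p>To actually assemble that corpus, the per-gene Entrez Gene links can be
pooled into a single deduplicated PMID set. A minimal sketch, again assuming
the <tt class="docutils literal">elink</tt> endpoint with the <tt class="docutils literal">gene_pubmed</tt> link and JSON output
(the function names here are illustrative, not from any library):</p>

```python
import urllib.parse

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def pooled_elink_url(gene_ids):
    """Build an elink URL covering several Entrez Gene IDs at once.
    Passing the IDs in a single comma-separated 'id' parameter asks
    NCBI to merge the linked PMIDs into one linkset."""
    params = urllib.parse.urlencode({
        "dbfrom": "gene",
        "db": "pubmed",
        "id": ",".join(str(g) for g in gene_ids),
        "linkname": "gene_pubmed",
        "retmode": "json",
    })
    return f"{EUTILS}/elink.fcgi?{params}"


def pmid_union(elink_json):
    """Extract the deduplicated set of PMIDs from a parsed elink JSON
    response (as returned by retmode=json)."""
    pmids = set()
    for linkset in elink_json.get("linksets", []):
        for linksetdb in linkset.get("linksetdbs", []):
            pmids.update(linksetdb.get("links", []))
    return pmids
```

<p>Fetching <tt class="docutils literal">pooled_elink_url(...)</tt> for the curated gene list and passing
the parsed response to <tt class="docutils literal">pmid_union</tt> would yield the corpus, with
duplicates (articles curated to more than one gene) counted only once.</p>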
<p>In a subsequent post, I'll look at what fraction of the articles in these two
corpora are available for text mining from PubMed Central.</p>