Predicting carcinogenicity
Introduction: Predicting chemical carcinogenesis
Prevention of environmentally-induced cancers is a health issue of unquestionable
importance. Almost every sphere of human activity in an industrialised
society faces potential chemical hazards of some form. It is estimated
that nearly 100,000 chemicals are in use in large amounts every day. A
further 500--1000 are added every year. Only a small fraction of these
chemicals have been evaluated for toxic effects like carcinogenicity. The
U.S. National Toxicology Program
(NTP) contributes to this enterprise by conducting standardised chemical
bioassays -- exposure of rodents to a range of chemicals -- to help identify
substances that may have carcinogenic effects on humans. However, obtaining
empirical evidence from such bioassays is expensive and usually too slow
to cope with the number of chemicals that can result in adverse effects
on human exposure. This has resulted in an urgent need for models that
propose molecular mechanisms for carcinogenesis.
The data here come from tests conducted by the NTP. These have so far
resulted in a data base of more than 300 compounds that have been shown
to be carcinogenic or otherwise in rodents. Amongst other criteria, the
chemicals have been selected on the basis of their carcinogenic potential
-- for example, positive mutagenicity tests -- and on evidence of substantial
human exposure. Using rat and mouse strains (of both genders) as predictive
surrogates for humans, levels of evidence of carcinogenicity are obtained
from the incidence of tumors after long-term (two-year) exposure to the chemicals.
The NTP assigns the following levels of evidence: CE, clear evidence; SE,
some evidence; E, equivocal evidence; and NE, no evidence. For the experiments
here, we are concerned only with overall levels of activity: +, if CE or
SE; or -, otherwise.
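As a minimal sketch of the labelling rule just described (not code from the
original experiments), the mapping from an NTP level of evidence to the overall
activity label could be written in Python as follows. The function name
overall_activity is a hypothetical choice made for illustration.

```python
# NTP levels of evidence as listed in the text above.
ALL_LEVELS = {"CE", "SE", "E", "NE"}
# Levels that count as overall activity "+".
POSITIVE_LEVELS = {"CE", "SE"}


def overall_activity(level: str) -> str:
    """Map an NTP level of evidence to '+' (CE or SE) or '-' (otherwise).

    Hypothetical helper: the name and the error handling are assumptions.
    """
    if level not in ALL_LEVELS:
        raise ValueError(f"unknown level of evidence: {level!r}")
    return "+" if level in POSITIVE_LEVELS else "-"


if __name__ == "__main__":
    for level in ("CE", "SE", "E", "NE"):
        print(level, overall_activity(level))
```

Rejecting unknown codes, rather than silently mapping them to -, is a design
choice of this sketch and is not something the source specifies.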
A complete listing of all chemicals tested is available at the NTP Home Page.
The diversity of these compounds presents a general problem for many
conventional structure-activity relationship (SAR) techniques. Most of these,
such as the regression-based techniques under the broad category called Hansch
Analysis, can only be applied to model compounds that have similar mechanisms
of action. This
``congeneric'' assumption does not hold for the chemicals in the NTP data
base, thus limiting the applicability of such methods. The Predictive Toxicology
Evaluation project undertaken by the NIEHS aims to obtain an unbiased comparison
of prediction methods by specifying compounds for blind trials. One such
trial, PTE-1, is now complete. Complete results of NTP tests for compounds
in the second trial, PTE-2, will be available by mid 1998.
ILP experiments with this data use Progol to predict carcinogenic activity
for compounds in PTE-1 and PTE-2. These experiments use the obvious generic
description of compounds, consisting of atoms and their bond connectivities.
The atom and bond predicates follow the same representation that we used
earlier in predicting the mutagenic activity of nitro-aromatic compounds. The
background knowledge available to the ILP program was as follows.
-
Atom-bond description. Bond information consists of facts of the
form bond(compound,atom1,atom2,bondtype), stating that compound
has a bond of type bondtype between the atoms atom1 and atom2.
Atomic structure consists of facts of the form
atm(compound,atom,element,atomtype,charge), stating that in compound,
atom is of element element, has atom type atomtype, and carries
partial charge charge.
-
Generic structural groups. These are generic structural groups
(methyl groups, benzene rings, etc.) that can be defined directly from
the atom and bond description of the compounds. Here we use definitions
for 29 different structural groups, which expand on the 12 definitions
used in our mutagenesis study.
-
Genotoxicity. These are results of short-term assays used to detect
and characterize chemicals that may pose genetic risks. These assays include
the Salmonella assay and in vivo tests for the induction of micronuclei
in rat and mouse bone marrow, among others. Results are usually + or -,
indicating a positive or negative response. These results are encoded as
Prolog facts of the form has_property(compound,type,result), which states
that compound returned result in the genetic toxicology assay type.
Here result is one of p (positive) or n (negative).
In cases where more than one set of results is available for a given type,
we have adopted the position of returning the majority result. When positive
and negative results are returned in equal numbers, no result is recorded
for that test.
-
Mutagenicity. Progol rules from the earlier experiments on obtaining
structural rules for mutagenesis are included. Mutagenic chemicals have
often been found to be carcinogenic, and we use all the rules found with
Progol.
-
Structural indicators. We have been able to encode some structural
alerts thought to be associated with carcinogenesis. The NTP proposes to
make available nearly 80 additional structural attributes for the chemicals.
Unfortunately, these are not yet available for use in the experiments here.
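To make the fact formats in the list above concrete, the following Python
sketch renders bond, atm and has_property facts as text and applies the
majority rule described for genotoxicity results. It is an illustration under
stated assumptions, not the encoding script used in the experiments: the helper
names, the compound and atom identifiers (d1, d1_1, d1_2), the bond type 7, the
atom type 22, the charge value, and the assay name salmonella are all
hypothetical.

```python
from collections import Counter
from typing import Iterable, Optional


def bond_fact(compound, atom1, atom2, bondtype) -> str:
    """Render a bond/4 fact in the form given in the text."""
    return f"bond({compound},{atom1},{atom2},{bondtype})."


def atm_fact(compound, atom, element, atomtype, charge) -> str:
    """Render an atm/5 fact in the form given in the text."""
    return f"atm({compound},{atom},{element},{atomtype},{charge})."


def majority_result(results: Iterable[str]) -> Optional[str]:
    """Return 'p' or 'n' by majority; None on a tie (including no results)."""
    counts = Counter(results)
    unexpected = set(counts) - {"p", "n"}
    if unexpected:
        raise ValueError(f"unexpected result codes: {sorted(unexpected)}")
    if counts["p"] == counts["n"]:
        return None  # equal numbers: no result is recorded for that test
    return "p" if counts["p"] > counts["n"] else "n"


def has_property_fact(compound, assay_type, results) -> Optional[str]:
    """Render a has_property/3 fact, or None when no result is recorded."""
    result = majority_result(results)
    if result is None:
        return None
    return f"has_property({compound},{assay_type},{result})."


if __name__ == "__main__":
    # All identifiers and values below are illustrative, not taken from the data.
    print(atm_fact("d1", "d1_1", "c", 22, -0.133))
    print(bond_fact("d1", "d1_1", "d1_2", 7))
    print(has_property_fact("d1", "salmonella", ["p", "p", "n"]))
    print(has_property_fact("d1", "salmonella", ["p", "n"]))  # tie
```

Returning None on a tie mirrors the statement that no result is recorded for
that test, so a caller would simply skip writing a fact in that case.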
The Progol datasets
All data is as used in the Progol experiments, stored as one compressed TAR
file. The files are in a format suitable for a Prolog
implementation of Progol called P-Progol. Within the archive, background
knowledge is in files with a ``.b'' suffix. Positive and negative examples are
in files with ``.f'' and ``.n'' suffixes respectively. Examples fall into three
categories: (1) compounds comprising PTE-1; (2) compounds comprising PTE-2;
and (3) compounds comprising the rest. Progol experiments used compounds
in category (3) as ``training'' data. Compounds in categories (1) and (2)
formed ``test'' sets.
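The file and category conventions described above can be summarised in a short
Python sketch. This is only an illustration: the function names and the example
file names (such as pte1.f) are hypothetical, and the actual names inside the
TAR file may differ.

```python
from pathlib import PurePath
from typing import Optional

# Suffix conventions stated in the text for the P-Progol files.
ROLE_BY_SUFFIX = {
    ".b": "background knowledge",
    ".f": "positive examples",
    ".n": "negative examples",
}

# Example categories stated in the text: (1) PTE-1, (2) PTE-2, (3) the rest.
SPLIT_BY_CATEGORY = {1: "test", 2: "test", 3: "training"}


def file_role(filename: str) -> Optional[str]:
    """Return the role implied by a file's suffix, or None if unrecognised."""
    return ROLE_BY_SUFFIX.get(PurePath(filename).suffix)


def split_role(category: int) -> str:
    """Return 'training' or 'test' for example category 1, 2 or 3."""
    return SPLIT_BY_CATEGORY[category]


if __name__ == "__main__":
    # Hypothetical file names, used only to exercise the suffix lookup.
    for name in ("pte1.b", "pte1.f", "pte1.n"):
        print(name, "->", file_role(name))
    for category in (1, 2, 3):
        print("category", category, "->", split_role(category))
```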