TREES
Trees. [Figure: trees over Human, Chimp and Gorilla. Drawings that differ only in the order in which the leaves are drawn are the same tree (=); drawings that group a different pair of species are different trees (≠).]
Same thing… [Figure: two drawings of one tree over s1, s2, s3, s4, s5, shown as equal.]
The maximum parsimony principle: evaluation of the tree topology
Genes: 0 = absent, 1 = present. [Table: species (rows) by genes g1 to g6 (columns); cell values missing.]
Evaluate this tree… [Figure: candidate tree over s1, s4, s3, s2, s5, with the possible placements of changes ("options") drawn for each gene.]
Gene number 1, option number 2: number of changes for g1 = 1
Gene number 2: number of changes for g2 = 2
Gene number 3: number of changes for g3 = 1
Gene number 4: number of changes for g4 = 2
Gene number 5 is the same as gene number 4: number of changes for g5 = 2
Gene number 6, one option only: number of changes for g6 = 1
Sum of changes: g1 = 1, g2 = 2, g3 = 1, g4 = 2, g5 = 2, g6 = 1. Sum of changes for this tree topology = 9. Can we do better?
The MP (most parsimonious) tree: sum of changes for this tree topology = 8. [Figure: the MP tree over s1, s4, s3, s2, s5.]
How many rooted trees? TR = “TREE ROOTED”. N=2, TR(2) = 1; N=3, TR(3) = 3; N=4, TR(4) = 15. [Figure: all rooted trees over the leaves a, b (N=2), a, b, c (N=3) and a, b, c, d (N=4).]
How many rooted trees? 2 sequences: 1 tree; 3 sequences: 3 trees; 4 sequences: 3*5 = 15 trees; 5 sequences: 3*5*7 = 105 trees; … TR(n) = 1*3*5*7*…*(2n-3)
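A minimal sketch of the counting formula, to show how fast TR(n) grows. The function name `rooted_tree_count` is a hypothetical helper, not something from the slides.

```python
def rooted_tree_count(n: int) -> int:
    """Number of rooted binary trees on n labelled leaves: 1*3*5*...*(2n-3)."""
    if n < 2:
        raise ValueError("need at least two sequences")
    count = 1
    # Multiply the odd factors 3, 5, ..., 2n-3 (the range is empty for n = 2).
    for odd in range(3, 2 * n - 2, 2):
        count *= odd
    return count


# Values for the cases listed on the slide, plus one larger n.
counts = {n: rooted_tree_count(n) for n in (2, 3, 4, 5, 10)}
```

Already at 10 sequences there are more than 34 million rooted trees, which is why evaluating every topology quickly becomes impractical.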
Rooting the tree
Rooted vs. unrooted trees
Rooted vs. unrooted: the position of the root does not affect the MP score.
Intuition why rooting doesn’t change the score: the change will always be on the same branch, no matter where the root is positioned… [Figure: gene number 1 on the tree over s1, s4, s3, s2, s5.]
How can we root the tree? We want rooted trees!
Gorilla gorilla (gorilla); Homo sapiens (human); Pan troglodytes (chimpanzee); Gallus gallus (chicken)
Evaluate all 3 possible UNROOTED trees: (Human, Chimp | Chicken, Gorilla); (Human, Gorilla | Chimp, Chicken); (Human, Chicken | Chimp, Gorilla). [Figure: the three trees, one of them marked as the MP tree.]
Rooting based on a priori knowledge. [Figure: the unrooted tree over Human, Chimp, Chicken, Gorilla next to its rooted version.]
Ingroup / Outgroup. INGROUP: Human, Chimp, Gorilla. OUTGROUP: Chicken. [Figure: rooted tree over the four species.]
Monophyletic groups: Gorilla + Human + Chimp are monophyletic. [Figure: tree over Human, Chimp, Gorilla, Chicken.]
How to efficiently compute the MP score of a tree
The Fitch algorithm (1971): post-order tree scan. In each node, if the intersection between the child-node sets is empty, we apply a union operator; otherwise, an intersection. [Figure: tree over Human, Chimp, Gorilla, Chicken, Duck with leaf states A, G, C, C, A and internal sets {A,G}, {A,C,G}, {A,C}.]
Number of changes: total number of changes = number of union operators. [Figure: the same tree with internal sets {A,G}, {A,C,G}, {A,C}.]
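The post-order scan can be sketched as below. This is a minimal sketch, not a definitive implementation: the nested-tuple tree encoding, the function name `fitch`, and the assignment of the states A, G, C, C, A to particular species are assumptions made for illustration, chosen so that the internal sets come out as {A,G}, {A,C,G} and {A,C}.

```python
def fitch(tree, leaf_states):
    """Return (state_set, changes) for a rooted binary tree.

    tree: a leaf name (str) or a pair (left, right) of subtrees.
    leaf_states: dict mapping each leaf name to its observed character.
    """
    if isinstance(tree, str):
        # Leaf: the set holds only the observed state, no change counted.
        return {leaf_states[tree]}, 0
    # Post-order: handle both children before the node itself.
    left_set, left_changes = fitch(tree[0], leaf_states)
    right_set, right_changes = fitch(tree[1], leaf_states)
    common = left_set & right_set
    if common:
        # Non-empty intersection: keep it, no extra change.
        return common, left_changes + right_changes
    # Empty intersection: take the union and count one change.
    return left_set | right_set, left_changes + right_changes + 1


# Assumed topology and leaf assignment (illustrative only).
tree = ((("Human", "Chimp"), "Gorilla"), ("Chicken", "Duck"))
states = {"Human": "A", "Chimp": "G", "Gorilla": "C", "Chicken": "C", "Duck": "A"}
root_set, changes = fitch(tree, states)  # three unions, so changes == 3
```

The scan visits each node once, so scoring one site costs time proportional to the number of nodes instead of enumerating every assignment of ancestral states.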
Parsimony has many shortcomings. To name a few: (1) All changes are counted the same, which is not true for biological systems (Leu->Ile is much more likely than Leu->His). (2) Cannot take biological context into account (secondary structures, dependencies among sites, evolutionary distances between the analyzed organisms, etc). (3) Statistical basis questionable.
Alternative: MAXIMUM-LIKELIHOOD METHOD
Maximum likelihood uses a probabilistic model of evolution. Each amino acid has a certain probability to change, and this probability depends on the evolutionary distance. Evolutionary distances are inferred from the entire set of sequences.
Evolutionary distances. Positions in an alignment can be conserved for two reasons: either because of functional constraints, or because only a short evolutionary time has elapsed since the divergence of the organisms. 5 replacements in 10 positions between 2 chimps is considered very variable. 5 replacements between human and cucumber is not considered too variable… Maximum likelihood takes this information into account.
Maximum Parsimony | Maximum Likelihood
All changes are considered the same | Different probabilities for different types of substitutions
Statistically questionable | Statistically robust
Ignores biological context | Accounts for biological context
The likelihood computations. With likelihood models we can: 1. Infer the most likely phylogenetic tree. 2. Compute conservation for each site. [Figure: tree with branch lengths t1 to t6 and node labels X, Y, Z, C, K, M, A.]
Maximum likelihood tree reconstruction: this is computationally very demanding, but efficient algorithms that find approximate solutions have been developed.
Tree reconstruction using distance-based methods. Two steps: 1. Compute a distance D(i,j) between any two sequences i and j. 2. Find the tree that agrees most with the distance table.
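Step 1 can be sketched with the simplest possible distance: the fraction of aligned positions at which two sequences differ. This choice of distance is an assumption for illustration (the slides do not say which D(i,j) to use here), and the sequences and the names `p_distance` and `distance_table` are hypothetical.

```python
from itertools import combinations


def p_distance(seq_a: str, seq_b: str) -> float:
    """Fraction of aligned positions at which the two sequences differ."""
    if len(seq_a) != len(seq_b) or not seq_a:
        raise ValueError("sequences must be non-empty and aligned to the same length")
    differences = sum(1 for a, b in zip(seq_a, seq_b) if a != b)
    return differences / len(seq_a)


def distance_table(sequences: dict) -> dict:
    """Symmetric table D[(i, j)] for every pair of sequence names."""
    table = {}
    for i, j in combinations(sequences, 2):
        d = p_distance(sequences[i], sequences[j])
        table[i, j] = table[j, i] = d
    return table


# Toy aligned sequences, made up for this example.
toy = {"s1": "ACGTACGT", "s2": "ACGTACGA", "s3": "TCGAACGA"}
D = distance_table(toy)
```

Step 2 then searches for the tree whose branch lengths reproduce this table as closely as possible.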
Neighbor-joining is based on star decomposition. In each step we cluster a pair so that the sum of branches is minimal. [Figure: a star tree over A, B, C, D, E is resolved step by step; the best pair to group together is shown in red: first (C,B), then ((C,B),E).]
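One way to sketch the pair-selection step is the standard neighbor-joining criterion Q(i,j) = (n-2)·d(i,j) - r_i - r_j, where r_i is the sum of distances from i to all other taxa; the pair with the smallest Q is joined. The slide only states that the sum of branches is minimized, so using this particular formula is an assumption. The distances below are made up, and `best_pair` is a hypothetical helper covering a single step, not the full algorithm.

```python
from itertools import combinations


def best_pair(dist: dict, taxa: list) -> tuple:
    """Pick the pair of taxa with the smallest neighbor-joining Q value."""
    n = len(taxa)
    # r[i]: sum of distances from taxon i to every other taxon.
    row_sum = {
        i: sum(dist[frozenset((i, k))] for k in taxa if k != i) for i in taxa
    }

    def q(i, j):
        return (n - 2) * dist[frozenset((i, j))] - row_sum[i] - row_sum[j]

    return min(combinations(taxa, 2), key=lambda pair: q(*pair))


# Hypothetical pairwise distances for five taxa (keys are unordered pairs).
taxa = ["A", "B", "C", "D", "E"]
pairs = {
    ("B", "C"): 5, ("B", "E"): 9, ("A", "B"): 9, ("B", "D"): 8, ("C", "E"): 10,
    ("A", "C"): 10, ("C", "D"): 9, ("A", "E"): 8, ("D", "E"): 7, ("A", "D"): 3,
}
dist = {frozenset(p): d for p, d in pairs.items()}
first_join = best_pair(dist, taxa)
```

With these numbers the first pair chosen is (B, C), even though A and D have the smallest raw distance: the criterion corrects for how far each taxon is from everything else.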
A few words on Human Immunodeficiency Virus (HIV). The virus = HIV. The disease/syndrome = Acquired Immunodeficiency Syndrome (AIDS). First recognized clinically in […]. By 1992, it had become the major cause of death in individuals of […] years of age in the U.S.
HIV. Until Dec 2002, 20 million people had died of AIDS. Infected in 2002: 5 million. Number currently infected: ~42 million, 1 out of every 100 adults of age […] in the world population.
HIV. HIV is the leading cause of death in sub-Saharan Africa. In some parts of this region 25-30% of the population is infected. 1 out of 3 children in these areas has lost at least one parent.
Sub-Saharan Africa refers to the territories south of the Sahara. In the past the term ‘Black Africa’ was also used for the same region; today it is obsolete because it is considered politically incorrect. ‘Tropical Africa’ might be taken as an alternative label for the same region; however, it excludes South Africa, which lies outside the tropics.
HIV is a lentivirus. Species = HIV; Genus = Lentivirus; Family = Retroviridae. Lentiviruses have a long incubation time, and are thus called “slow viruses”.
HIV-1 and HIV-2. In 1986, a distinct type of HIV prevalent in certain regions of West Africa was discovered and was termed HIV type 2. Individuals infected with type 2 also had AIDS, but with a longer incubation time and lower morbidity.
Morbidity vs. Mortality. Morbidity: the prevalence of a disease (the morbidity rate), i.e. the probability that a randomly selected person out of the entire population is ill at time t.
Morbidity vs. Mortality. Mortality: deaths from a disease or in general. Mortality rate = death rate.
Origin of HIV-1 in the chimpanzee Pan troglodytes troglodytes. Nature, Vol. […], Pages: […]
Five lines of evidence have been used to substantiate zoonotic transmission of primate lentivirus: 1. Similarities in viral genome organization; 2. Phylogenetic relatedness; 3. Geographic coincidence; 4. Plausible routes of transmission; 5. Prevalence in the natural host.
For HIV-2, a virus (SIVsm) that is genomically indistinguishable and phylogenetically closely related was found in substantial numbers of wild-living sooty mangabeys whose natural habitat coincides with the epicenter of the HIV-2 epidemic.
Mangabey: a long-tailed monkey of the genus Cercocebus, found in the forest regions of Africa.
Close contact between sooty mangabeys and humans is common because these monkeys are hunted for food and kept as pets. No fewer than six independent transmissions of SIVsm to humans have been proposed. The origin of HIV-1 is much less certain.
HIV-1 is most similar in sequence and genomic organization to viruses found in chimpanzees (SIVcpz).
BUT several findings cast doubt on the theory that chimpanzees are the natural host and reservoir for HIV-1: 1. There is a wide spectrum of diversity between HIV-1 and SIVcpz. 2. There is an apparently low prevalence of SIVcpz infection in wild-living animals. 3. Chimpanzees are present in geographic regions of Africa where AIDS was not initially recognized.
Rather, it has been suggested that another, yet unidentified, primate species could be the natural host for SIVcpz and HIV-1.
Marilyn. “We recently identified a fourth chimpanzee with natural SIVcpz infection…” This animal (Marilyn) was wild-caught in Africa (country of origin unknown), exported to the United States as an infant, and used as a breeding female in a primate facility until her death at age 26.
How was the SIV found? During a serosurvey in 1985, Marilyn was the only chimpanzee of 98 tested who had antibodies strongly reactive against HIV-1 by enzyme-linked immunosorbent assay (ELISA) and western immunoblot.
Maybe Marilyn was infected with HIV during her stay in the U.S.? “She has never been used in AIDS research and had not received human blood products after […]. She died in 1985 after giving birth to still-born twins.”
To show that she did not have AIDS: “An autopsy revealed endometritis, retained placental elements and sepsis as the final cause of death. Depletion of lymphoid tissues was not noted.” Endometritis: inflammation of the lining of the uterus. Sepsis: blood poisoning.
“PCR was used to amplify HIV- or SIV-related DNA sequences directly from uncultured (frozen) spleen and lymph-node tissue obtained at the autopsy in order to characterize the infection responsible for Marilyn’s HIV-1 seropositivity.”
Amplification and sequence analysis of subgenomic gag (508 base pairs (bp)) and pol (766 bp) fragments revealed the presence of a virus related to, but distinct from, known SIVcpz and HIV-1 strains.
PCR was used to amplify and sequence four overlapping subgenomic fragments that together comprised a complete proviral genome. The genome was termed SIVcpzUS.
Provirus: the "provirus" is the form of the virus which is capable of being integrated into the host genome. In the case of HIV it means the DNA "copy" of the HIV genome (HIV normally carries its genes around in RNA form).
Provirus: as far as the host cell's machinery is concerned, this extra DNA is no different from the cell's own DNA.
Only three other SIVcpz strains have been reported: two from animals wild-caught in Gabon (SIVcpzGAB1 and SIVcpzGAB2) and one from a chimpanzee exported to Belgium from Zaire (SIVcpzANT).
SIVcpzGAB1 and SIVcpzANT have been sequenced completely, but only 280 bp of the pol sequence are available for SIVcpzGAB2.
To determine the evolutionary relationships of SIVcpzUS to these and other HIV and SIV sequences: 1. Sequences were downloaded from the HIV sequence database. 2. Neighbour-joining was used to construct the tree, based on the full-length Pol sequences. 3. Maximum likelihood was also used and “yielded very similar topologies”.
The neighbour-joining method was applied to protein-sequence distances calculated by the method of Kimura. Clade support values were computed with 1,000 bootstrap replicates. NJ computations were performed using the CLUSTAL_X program.
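A bootstrap replicate is usually built by resampling alignment columns with replacement; the tree is rebuilt from each replicate, and the support of a clade is the fraction of replicates in which it appears. The sketch below shows only the resampling step, under that assumption. It is not CLUSTAL_X's implementation; the alignment is made up and `bootstrap_alignment` is a hypothetical helper.

```python
import random


def bootstrap_alignment(alignment: dict, rng: random.Random) -> dict:
    """Resample alignment columns with replacement, keeping rows aligned."""
    length = len(next(iter(alignment.values())))
    # Draw as many column indices as the alignment has columns.
    columns = [rng.randrange(length) for _ in range(length)]
    return {
        name: "".join(seq[c] for c in columns) for name, seq in alignment.items()
    }


# Toy protein alignment, made up for this example.
toy_alignment = {"seq1": "MKVLA", "seq2": "MKILA", "seq3": "MRVLG"}
rng = random.Random(0)  # fixed seed so the sketch is repeatable
replicates = [bootstrap_alignment(toy_alignment, rng) for _ in range(1000)]
```

Each replicate has the same sequences and the same length as the original alignment; only the set of columns changes, so well-supported clades should survive most resamplings.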
These analyses identified SIVcpzUS unambiguously as a new member of the HIV-1/SIVcpz group of viruses.