49 Comments

The problem is simple - the provenance of the genetic material used to sequence the so-called sars-cov-2 genome has not been determined i.e show me the virus (isolated, purified, seperated from other biological and genetic material)

Expand full comment

Excellent, thank you very much for this.

Expand full comment

"I could not find any compelling evidence, that the SARS-CoV-2 reference genome is valid"

"Valid" means what exactly? ;-)

Expand full comment

That it is the genome, that's contained in the virus.

Expand full comment

Circular reasoning. There is no control sequence from a sample of proven pathogenic "virus" to validate an in silico sequence. No institution or lab or country has yet demonstrated the presence of isolated virus. https://christinemasseyfois.substack.com/

Expand full comment

1/ HI Ben, It has been a while since your first post on assembly, and my comments there. At that time I did not yet read the Wu paper, nor did know the details of the assembly program Megahit. I imagined how an assembly program might work and imagined that it would take a very long time to run the program. I see that Megahit extracts "K-mers" of odd length (eg 27nt). A 27nt segment of the viral genome would be covered by many reads, but in the ideal world, the very same k-mer would be extracted from many reads, reducing redundancy, and making the program fast.

Expand full comment

RACE 3/ Back to the 5' end. I got all excited that you have many reads with the pattern "GGGG-bluebox", where bluebox is an insert of 4 or 5nt. Are the 4s and the 5s all the same? I want to see if "bluebox" extends the stem of "stem loop 1 of 3' UTR" (for ref https://doi.org/10.1016/j.cell.2021.02.008 figure 2c). Once we have the RACE data, do any of those reads that contain GGGG-bluebox remain, or do they vanish because they are just sequencing artifacts?

Expand full comment

Bowtie2 doesn't allow gaps within 4 bases of read extremes by default, because the default value of the `--gbar` flag is 4. Most of the blue box reads actually have a 4-base insert at the start but a match for the next 4 bases, but Bowtie2 is not allowed to insert gaps within the first 4 bases so it has to treat the first 4 bases as mismatches and delete the next 4 bases. So the reads have CIGAR strings like 4M4I143M (which means 4 non-indel bases followed by 4 inserted bases and 143 non-indel bases, where the inserted bases are inserts in the reads relative to the reference so they're actually deleted in the alignment). With local alignment, the CIGAR string would become 4S147M (which means four soft-clipped bases followed by 147 non-indels).

There's 38 reads with starting position 1, but they all come from the reverse reads and they all have between 3 and 5 extra bases before the first nucleotides of Wuhan-Hu-1.

Wu et al. wrote that they constructed an RNA library "using the SMARTer Stranded Total RNA-Seq kit v.2 (TaKaRa)". The manual of the kit says "The first three nucleotides of the second sequencing read (Read 2) are derived from the Pico v2 SMART Adapter (shown as Xs). These three nucleotides must be trimmed prior to mapping if performing paired-end sequencing." (https://www.takarabio.com/documents/User%20Manual/SMARTer%20Stranded%20Total%20RNA/SMARTer%20Stranded%20Total%20RNA-Seq%20Kit%20v2%20-%20Pico%20Input%20Mammalian%20User%20Manual_050619.pdf) In Wu et al.'s reads, the first 3 bases of read 2 have much lower average Phred scores than the first 3 bases of read 1. But you can use `fastp -F 3` to trim the first 3 bases from read 2 as was suggested in the Takara manual.

In a tweet about this Substack post, Kevin McKernan wrote "This is the Takara template switching kit. You can see the 5’adaptor with XXXX in the red box. Entirely expected to see variable bases as the adaptors." (https://twitter.com/Kevin_McKernan/status/1674642686920278016) And McKernan tweeted: "While Takara's kit is proprietary we can guess what they are doing by looking at the sequence of the adaptors. They have an primer-NNKG adaptor. Penultimate base [actually he meant the last base] is always G as it should be with template switching." (https://twitter.com/Kevin_McKernan/status/1674635520939360258)

However I don't know if the number of extra bases at the start of read 2 can be variable, because Figure 2 in the Takara manual shows the number of bases as 5 instead of 3.

Expand full comment

Oh wait, now I'm confused. After a good night sleep, I looked up Takara in the Wu paper and it seems that Takara made the PCR kit (the second sequencing) and the RACE kit (the third sequencing). It seems that you are discussing artifacts of the RACE kit (combined with Bowtie display behavior). This implies that Ben's Wuhan dataset does include RACE reads. But I was guessing that there would have been hundreds of RACE reads and earlier you were thinking that the RACE reads would have been in a different file.

Expand full comment

The Takara manual I linked didn't mention anything about RACE. It was the manual for the Takara RNA-Seq kit that was used in the metagenomic sequencing.

The maximum length of the reads in SRR10971381 is 151 bases. Supplementary Table 8 of Wu et al. says that the predicted length of their RACE amplicons ranged from 515 bases to 688 bases, because using MN908947.2 coordinates, the 5' primers matched positions 573-599 and 491-515 and the 3' primers matched positions 29,212-29,239 and 29,398-29,423: https://static-content.springer.com/esm/art%3A10.1038%2Fs41586-020-2008-3/MediaObjects/41586_2020_2008_MOESM1_ESM.pdf.

The RACE reads for RaTG13 also range in length from 336 bases to 533 bases: https://trace.ncbi.nlm.nih.gov/Traces/?view=run_browser&page_size=10&acc=SRR16979454&display=reads.

Wu et al. wrote that they used 56,565,928 reads to do the de novo assembly with MEGAHIT, but that's identical to the number of reads in SRR10971381. And they needed to do RACE after they had first done the metagenomic sequencing so they would know what primers they should use for RACE.

Expand full comment

OK, thanks.

Expand full comment

OK Mongol, thanks for your two replies. So then, sequencing artifact at the 5' end it is, and not evidence of degenerating 5' ends generating "dud" virions within a viral population.

Expand full comment

Hi Ben, we have a discord which discusses Metagenomics, please feel free to join here. Mongol is there as well, and others: https://discord.gg/WMTJ88cH

Expand full comment

Thanks for the invite. This has no browser access? But requires an app? I'm using a 14yr old Mac OS 10.11. I'm set up to boot into Ubuntu Linux, haven't updated in 1-2 yrs & forgot how to use it, LOL. If an app is required I'll pass.

Edit: Just looked at Wiki. A fancy chat program. I spend "hours" editing and rewriting before posting; Live chats aren't for me.

Expand full comment

Edit: Extending the stem may imply that "bluebox" may be a real variant.

Expand full comment

7/ Considering all potential sources of mismatches, some k-mers will be "problematic", and generate branches in the assembly graph. This is where my current understanding of assembly programs stops: "how do different assembly programs trim branches". Apart from the problem of ends, the problem of alternative branches is the very crux of the matter. Why did Trinity stumble and output a shorter contig than Megahit? And perhaps why you do not achieve a perfect replication of genome length.

And why Wuhan did not rely on Megahit alone, but required refinement from Bowtie afterwards.

Expand full comment

I think the reason why the original 30,474-base contig cannot be reproduced is that Wu et al. replaced human reads with N bases in the reads they uploaded to the SRA, and the original 30,474-base contig accidentally included a 618-base segment of human DNA at the 3' end: https://output.jsbin.com/suwuxoy#Why_is_the_longest_MEGAHIT_contig_not_reproducible_.

Expand full comment

Hi Mongol. 1) Thanks for solving the 618 nt mystery.

2) Thanks for your work on the other alleged viruses. In the comments for the 1st post in the series, there was a comment that Zika was also found in the Wu dataset. I suspected that this was mostly a size/mismatch issue with an evolutionary signal mixed in.

3) Do you see any subgenomic junctions in the Wu dataset? I assume that the virus excludes subgenomic RNAs during virion packaging, but that this exclusion isn't foolproof. Subgenomic RNAs would be more of an issue for lysed cell cultures as opposed to sputum or swab samples.

Expand full comment

How do you think the human DNA "accidentally" slipped in?

Did they run Megahit on the unfiltered reads?

Expand full comment

Yeah probably, because in the methods section of Wu et al., they didn't describe removing host reads before they ran MEGAHIT. They should've easily noticed the insertion though if would've just aligned the 30,474-base contig against other sarbecoviruses, and they would've noticed that their contig had a long insert at the end which was missing from other sarbecoviruses, and then they would've done a BLAST search for the insert.

In the bat sarbecovirus Rs7907 which was published in 2021 by Shengli Zhi's team at the WIV, the first 96 bases are also 100% identical with mammalian rRNA and DNA. In the paper where Rs7907 and 7 other sarbecoviruses were published, they didn't describe removing host reads before they did de-novo assembly: "For SARSr-CoV-positive RNA extraction, next-generation sequencing (NGS) was performed using BGI MGISEQ 2000. NGS reads were first processed using Cutadapt (v.1.18) to eliminate possible contamination. Thereafter, the clean reads were assembled into genomes using Geneious (v11.0.3) and MEGAHIT (v1.2.9). PCR and Sanger sequencing were used to fill the genome gaps. To amplify the terminal ends, a SMARTer RACE 5`/3`kit (Takara) was used." (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8344244/) (However I don't think they did RACE for the 5' end of Rs7907 either, because they would've noticed the assembly error, or maybe they just published the contig they got with de-novo assembly without combining it with the RACE sequences.)

Expand full comment

RACE 2/ I looked at your figure 2 again, of sequences just upstream of the polyA tail. I got all excited that you had 3 reads with the variant pattern "CnnCnnA". But I couldn't find them in your trimmed reads. Again, I guess that your dataset doesn't contain the RACE reads for the 3' end.

Expand full comment

RACE 1/ I looked at figure 1 again. Roughly 50 reads claiming to be within 10nt downstream of the 5' cap. I guess that if your dataset contained the RACE reads, you would have hundreds or thousands of reads at that end?

Expand full comment

Yeah I think usually the metagenomic reads, PCR-based reads, and RACE reads would be published separately. For example the BioProject for RaTG13 consists of three experiments titled "RNA-Seq of Rhinolophus affinis:Fecal swab", "amplicon_sequences of RaTG13", and "RACE_sequences of RaTG13": https://www.ncbi.nlm.nih.gov/sra?linkname=bioproject_sra_all&from_uid=606165.

Expand full comment

9/ "Just one more thing" - TV detective Columbo.

If a next generation sequence technology has an error rate of 1%, and if the reads are a bit over 100nt in length, then on average, each read is expected to contain one error (some 0, some 2, etc). So don't ask Kevin or anybody else to assemble a perfect match.

If we have a many-fold coverage, many of these errors should cancel out in the consensus.

There is one example of the "entire" genome of a single individual coronavirus being sequenced by nanopore (typical error rate of 10%), and 90% accuracy would be good enough to distinguish SARS-CoV-1 from SARS-CoV-2, but not good enough to track Covid19 "variants", sub-variants, sub-sub-variants, etc.

When constructing evolutionary trees, matching up long stretches of "only" 50% identity wouldn't be unusual. Matching 90% identity of >100nt reads isn't anything to complain about.

Expand full comment

1% error rate, given - ok!

but that'd should be reflected in the quality scores as well right? and fairly randomly distributed, and not scattered at the front!

Expand full comment

8/ Apparently your results within the center of the genome mostly agrees with the Wuhan assembly. So why freak out? It is a matter of perspective. Is the glass 99% full or 1% empty? A glass that is 1% empty is no reason to doubt the validity of the existence of viruses.

But by all means, please continue to hunt down the remaining 1% discrepancy. I am hunting also, but I took a little break. Time to get back into the game. I want to know too.

/end (for now)

ps, part 4, I compared chimp vs human. Note that as far as I know, the distance between Covid19 variants is smaller than the distance between chimp and human, and more like the distance among humans.

Expand full comment

Without the ends, one cannot establish validity of the entire genome.

Expand full comment

ends 1/ For the human genome project, Pres Clinton in his last year had a press conference with the two teams (Private Ventner vs Public NIH) present announcing its "completion". (Or maybe Bush in his first year, I don't remember for sure.) In any case, all the genome scientists knew that it wasn't really "complete", just the first phase, the easier stuff. Large chunks of the sequencing could be aligned onto the (classical) genetic map, but they could not rule out unknown gaps between the known sequenced chunks. That came later, perhaps, but I'm not really sure if there are still some gaps remaining even >20yrs later. I haven't followed it that closely.

Expand full comment

ends 2/ The technical issue with the resistant portion involves "repetitive" DNA. The concept overlaps with what is sometimes (falsely?) referred to as "junk" DNA.

Assembly programs like Megahit would have a problem with sequential repeats if the k-mers are too small. Even large k-mers (maybe 81nt), could not deal with a tract of hundreds of small repeats.

Assembly programs could even have a problem with well separated repeats. Well separated repeats within the SARS-CoV-2 genome could present problems for Megahit. I mentioned something about "subgenomic RNA junctions" in part "6/". These "junctions" are generated by the repeats, which are subject to "cut and paste" operations (actually, more like, "jump from one repeat to the next"). For the human genome, "cut and paste" operations result in birth defects (eg Down's syndrome) and cancers.

Expand full comment

ends 3/ The problem with failure to "complete" (in the absolute sense) the human genome probably could not be solved before very long read technology such as Nanopore and Pacific Biosystems came about.

None of this directly addresses Covid19 ends, except as context. First lesson from the human genome project: "Take your victory where you can" (99% glass full), and "keep plugging away at the rest" (1% glass empty).

Second lesson from the HGP: "Failure to resolve gaps do not invalidate the concept of the human genome", nor does it mean that humans don't exist.

Expand full comment

ends 4/ Apart from the variable length of the 3' PolyA tract, which is real and may have some biological meaning, you are complaining about high sequence variation just within the polyA tract, and just within the 5' cap. I don't know if this variation is real and has biological meaning for individuals within a "swarm", or if it is a technical issue (sequencing errors, not real variation), or a combination of both.

Recall that the Wuhan sequencing project had 3 phases, the last being a very special technique for the ends. I have not yet read about this technique. I don't know if these special end reads are included in your dataset, nor do I know if they would resolve your issues if they were included and considered separately from the rest.

Keep plugging away. It could be technical. Or maybe it is real and SARS-CoV-2 doesn't care if its ends are "frayed".

Expand full comment

ends 5/ "Just one more thing" - Columbo

You may actually be on the track of something extremely important.

Individual viral particles with "frayed ends" may be "duds". If you look at Covid19 tracking data, it seems that most people that catch Covid19 fail to transmit to somebody else (it fizzles out for these) while a few people infect lots of others. It could be that "failure to infect others" involves the accumulation of bad ends.

Those among the "there is no virus crowd", cite studies where transmission failed. (These were in the 1970s, before ethics rules probably forbid them now.) Perhaps this is a matter in timing, where bad ends accumulate towards the end of an infection cycle.

You may be the key that reconciles the two camps.

Expand full comment

6/ For mismatches resulting from real biology we have mutations.

There is also the problem of "subgenomic" RNAs that generate "junctions".

Expand full comment

5/ A mismatch at a specific segment would generate at least two different k-mers that cover that segment. Megahit filters out some sequencing error mismatches because its default setting requires that a given k-mer must exist in at least two reads. For those types of sequencing errors that are random, it is unlikely that two reads would contain the same error, although a few will exist and pass through this filter.

Expand full comment

4/ In my comments from 6 months ago, I kept harping on the concept of "mismatch tolerance". Mismatches arise from two sources (1) sequencing errors, and (2) real biology.

Note well that any genome represents a statistical average where mismatches mostly cancel out.

If you were to sequence your genome, probably every cell contains at least one mutation that would generate a mismatch, but these should cancel out within a consensus sequence. Same all humans on the planet, all 8 billion of them, each one with a unique genome. Same for chimps, but human sequences would cluster with human and chimp with chimp, even if all human/ape sequences are 97% identical. So we expect the same for infections, the population of SARS-CoV-2 virions infecting a single person would exist as a swarm of related mutants.

Expand full comment

3/ In my comments from 6 months ago, I kept repeating that the problem would be at the ends. Note that I usually read papers on bacterial DNA sequencing, and sometimes I slip into the DNA mindset even though I know that we are dealing with an RNA virus. I forgot about the 5' cap and the 3' PolyA Tail. My concern about the ends was one of read abundance, not cap or tail. Looking at the PacBio Covid Kit, one can see the problem of end abundance. In an ideal world, extraction of k-mers from the Wuhan dataset, should overcome the the problem of end abundance, because I think that they had enough "depth of coverage".

If you read the Wuhan paper, you may notice that they sequenced their sample 3 times. First to get a de novo template. Second, using PCR. Third, a special technique for both ends. I don't know if your dataset only includes the first or includes the latter two.

In regards to the polyA tail, other papers claim that each individual virion differs in the length of the polyA tail. Maybe the answer to some of your questions lies there.

Expand full comment

2/ In my comments from 6 months ago, I suggested that you communicate with Kevin McKernan. Today I see that the two of you are now engaging on Twitter. Good. Kevin can be rough at times (at least impatient). Ouch! Ouch! Ouch! Please look past that and keep the dialog up. I don't post on Twitter, just watch, otherwise I would have jumped in.

Expand full comment

Kevin McKernan posted the following comments about this blog post (https://twitter.com/Kevin_McKernan/status/1674047174357680131):

"Reads don't map to the margin of a reference because the given insert size you have selected in library construction won't exist. If you have 200bp inserts, you won't get the 1st ~50bases of the reference. This is known by all in space except Ben who thinks it's a discovery."

"Read mapping 101. If you make 200bp insert libraries, the likelihood of you capturing the 1st 50bp of the reference is reduced. This is just poisson sampling. https://irepertoire.com/ngs-considerations-coverage-read-length-multiplexing/"

"Secondly, there are biological reasons why those pieces of DNA aren't captured. The 5' end has a 5' cap which can't be cloned unless it's decapped. The 3' end is a variable poly A tail which won't sequence or amplify without polymerase slippage. Reduced end seq is expected."

Expand full comment