[Frontiers in Bioscience E4, 1542-1555, January 1, 2012]

Database documentation of retrotransposon insertion polymorphisms

Ping Liang1, Wanxiangfu Tang1

1Department of Biological Sciences, Brock University, 500 Glenridge Ave, St. Catharines, Ontario, Canada L2S 3A1

TABLE OF CONTENTS

1. Abstract
2. Introduction
3. Documentation of retrotransposon insertion polymorphisms (RIPs)
3.1. Methods for identification of RIPs
3.2. Special issues related to describing RIPs in details
3.3. Database documentation of RIPs
3.3.1. Current status of RIP database documentation
3.3.2. Overview of dbRIP
3.3.3. dbRIP utilities
3.3.4. dbRIP data tracking and data releases
3.3.5. Future development of dbRIP
4. Summary and concluding remarks
5. Acknowledgements
6. References

1. ABSTRACT

Retrotransposons constitute more than 40% of the human genome, with L1, Alu, SVA, and HERVs known to remain active in transposition. Retrotransposition contributes to genetic diversity in the form of retrotransposon insertion polymorphism (RIP), which is defined as the presence or absence of a retrotransposon insertion among human populations at a specific genomic location. So far, close to 5000 cases of RIPs have been identified, with more than 50 cases associated with disease. A large number of new RIPs are being, and will continue to be, identified from newly available personal genome data, making RIPs an important source of genetic variations/mutations that deserve proper documentation. In this review, we discuss the special characteristics of RIPs and the challenges in compiling and annotating them, examine the current status of database documentation of RIPs, and describe in detail the design, data schema, and utilities of dbRIP, which is currently the only database dedicated to the documentation of retrotransposon insertion polymorphisms. Some future perspectives and outstanding issues associated with documentation of RIPs are also presented.

2. INTRODUCTION-RETROTRANSPOSONS AND THEIR CLASSIFICATION

Retrotransposon elements (REs) are a group of transposable elements (TEs) that propagate themselves into different places in the genome via an intermediate process of reverse transcription. In a sense, REs proliferate in the genome in a copy-and-paste fashion. In the human genome, as in most other mammalian and plant genomes, REs exist in millions of copies, and altogether they constitute more than 40% of the human genome. REs have played very important roles in shaping the evolution of human and other primate genomes. They impact the functions of genes and the genome via a variety of mechanisms, which include, but are not limited to, generation of insertion mutations and genomic instability, creation of new genes or gene isoforms, and alteration of gene expression regulation and epigenetic regulation (1-13). The major types of retrotransposons in the human genome include the LTR retrotransposons, i.e., the endogenous retroviruses (ERVs), which are characterized by the presence of two long terminal repeats (LTRs), and the non-LTR retrotransposons LINE1 (L1) and Alus. L1s, Alus, and ERVs comprise approximately 500,000, 1,000,000 and 300,000 copies and constitute 17%, 11% and 8.5% of the human genome, respectively (14, 15). These REs represent the groups that were once very successful during the evolution of mammals and primates and to certain degrees have remained active in current human genomes. Also worthy of mention are the SVAs, a type of non-LTR retrotransposon, which in sequence represent chimeras of SINE, VNTR and Alu-like regions. SVAs are very young and highly active, despite their small population size of a few thousand copies (16, 17).

Each of these major types of retrotransposons can be divided into subfamilies of more closely related elements based on a set of diagnostic nucleotide sequences. For example, Alus can be divided into more than 200 subfamilies (18), whereas SVAs have only 6 subfamilies (16). The formation of subfamilies, which often exist in a hierarchical structure, reflects the evolutionary dynamics of retrotransposon amplification in a sequential or a linear accumulation fashion. In other words, as explained by the "master gene model", a limited number of "master" copies that are competent for retrotransposition in the genome are responsible for the generation of most new copies, with all progeny copies from each "master" copy forming a subfamily whose size depends on the activity level of the "master" copy and the length of its existence. New "master" copies may emerge from a subfamily and form a new cluster within that parent subfamily. For example, AluYb9 originated from AluYb8 by carrying one extra diagnostic nucleotide variation, whereas AluYb8 came from Yb5, which belongs to the larger AluYb subfamily, which in turn is part of the relatively older AluY family. The proliferation rate of retrotransposons during the evolution of mammals and primates has not been constant, with significant differences seen among the major types of retrotransposons and among subfamilies of the same RE type. For example, the ERVs were more active during the early evolution of primates but have become much less active, at least in the human genome (19-21). Furthermore, different REs have in many cases shown quite different activities among closely related species after their divergence, as well demonstrated by the dramatically different profiles of Alu elements among different primate species (22, 23). 
For instance, AluYb and AluYa subfamilies have been highly active in the human genomes, but not in the chimpanzee genome (23, 24), whereas the activity of AluYc5 subfamily is relatively small in the human genome but is much larger in the chimpanzee genome, apparently as the most active Alu subfamily in this genome (unpublished data).

As a result of this past and ongoing proliferation activity of retrotransposons, which fluctuates both vertically and horizontally (cross-species), a significant amount of genome diversity has been generated between human and other closely related primate species and among different human populations, as well as among individuals within populations, leading to the generation of retrotransposon insertions that are species-, population- and family lineage-specific. For example, there are ~2000 L1, ~7000 Alu, and ~1000 SVA insertions that are found only in the human genome, and together these insertions contributed more than 8 Mb of nucleotide sequences to the human genome, a major factor leading to the increase in genome size in humans in comparison with chimpanzees (25-32). There are also close to 17,000 copies of ERVs that are specific to human genomes (human endogenous retroviruses or HERVs), and they make up more than 22 Mb of sequences. In this case, the generation of the HERVs may be a combined effect of the proliferation of existing ERVs and of newly domesticated viruses and their proliferation in the human genome (19, 33).

Different from other types of regular sequence variation, Retrotransposon Insertion Polymorphism (RIP) refers to the presence or absence of a retrotransposon insertion at a specific genomic location in populations of a given species. Due to their significant impact on genes and the genome as a whole, and despite the relatively little attention received so far, RIPs constitute a very important source of human genetic polymorphisms that, together with other types of genetic variations, are responsible for the full spectrum of the vivid phenotypic differences observed among human individuals, such as physical appearance and susceptibility to diseases. In this review, we examine the history of research related to identification of RIPs in humans and the associated methodologies, and we discuss the special characteristics of retrotransposon insertion polymorphisms and the challenges in compiling the data. We also examine the current status of database documentation of RIPs and describe in detail the design, data schema, and utilities of dbRIP, currently the only database specially designed for the documentation of RIPs.

3. COMPUTATIONAL DOCUMENTATION OF RETROTRANSPOSON INSERTION POLYMORPHISM

3.1. Methods for identification and ascertainment of retrotransposon insertion polymorphism

The currently known polymorphic retrotransposon insertions were identified using a number of approaches and methodologies, all within the last two decades. Earlier studies using genomic library screening with probes/primers specific for young Alu elements contributed to the discovery of a small number of RIPs (34-37). A recent study employing the library screening approach combined with high-throughput paired-end Sanger sequencing successfully identified 198 L1 insertions, as well as 1 HERV-K insertion, not present in the reference genome from the analysis of 17 genomes (38). Most of the disease-related retrotransposon insertions were discovered from the mutational screening of candidate genes, using methods including Southern blot, DNA sequencing, etc. (e.g. 39, 40 and reviews 41, 42).

The task of finding RIPs among millions of copies that are highly similar in sequence in a genome is essentially like "finding a needle in a hay stack". For this reason, no large-scale comprehensive study was possible until the human genome sequences became available (14, 43). The use of the human genome sequences for identification of RIPs was first explored by Batzer's group. In this approach, Alu elements belonging to young subfamilies were identified by computational sequence analysis based on the level of sequence divergence among family members, and polymerase chain reactions (PCRs) using primers designed in regions flanking the insertion were used to ascertain the polymorphism status of these candidates by screening DNA samples from diverse human populations. The first study using such a strategy identified 106 polymorphic Alu insertions out of 475 Ya5 and Yb8 insertions (44). Subsequently, this method was extensively used to analyze almost all Y subfamilies including Ya, Yb, Yc, Yd, Yg and Yi, Ye, and multiple AluY subfamily members on the X chromosome (23, 45-51). Together these studies are responsible for the identification of over 400 polymorphic Alu insertions. While successful, the use of this strategy was limited to REs that are covered in the public version of the human genome sequence and the selection of candidates was biased towards certain relatively small and young subfamilies for which the numbers of candidates are manageable for PCR assays.

To identify RIPs that are absent in the reference genome, genomic DNA sequences from more human individuals representing different populations are needed. With genomic sequences becoming available for multiple human individuals, computational comparative genomic approaches were developed to more effectively identify RIPs. The first attempt with this type of strategy used partial human trace genomic sequences representing 36 diverse humans to compare with the reference genome and identified over 600 Alu, L1 and SVA insertion polymorphisms (52). The first study comparing two genome sequences was performed by our group, in which we took advantage of the availability of the public and Celera versions of human genome sequences, which roughly represent two different individual genomes despite the mixed nature of the DNA used for genome sequencing. In that study, we identified more than 800 new Alu insertion polymorphisms, the largest set of polymorphic Alu insertions identified by a single study at that time (26). Among these Alu RIPs, more than one third were insertions outside the public version of the genome sequence. Subsequently, the same approach was used for identification of ~150 polymorphic L1 insertions (27). Somewhat surprising was the fact that there is very little overlap among the lists of RIPs identified from the above three large-scale computational studies. This is likely because each method used a different genomic sequence source and identified an incomplete list among a large number of possible RIPs, and it served as the first strong hint that the actual level of RIPs may be much higher than what we could have expected from the limited number of RIPs previously identified. 
More recent studies using the diploid Venter genome in comparison with the reference genome revealed more polymorphic insertions of Alus, L1s and SVAs, confirmed our previous speculation, and demonstrated the usefulness of diploid genome sequences for identification of new RIPs (29, 53).

With the advent of next generation sequencing (NGS) technologies and their applications in sequencing a large number of personal genomes, a few more approaches have been developed for identification of de novo retrotransposon insertions in the genomes of the individuals in question. One of the strategies that has been tested in a few laboratories is the use of NGS to selectively sequence the junction areas between RE insertions and their flanking genomic sequences. For example, using this approach Witherspoon et al. identified a large number of novel Alu RIPs from several Japanese individuals (54). Similarly, Ewing & Kazazian devised an NGS approach for L1, and by surveying 26 individuals, they identified 367 L1s not present in the reference genome, the majority of which are novel polymorphic L1s (55). Furthermore, the availability of personal genome sequences in large numbers, such as those that have been and are being generated by the 1000 Genomes Project, permits identification of novel RIPs via computational comparative genomic analysis (56, 57). An unprecedentedly large number of novel RIPs from the known families of active retrotransposons, Alu, L1, and SVA, are being identified, including those that are specific to populations or groups of populations (58-60). In addition to the sequencing approaches, a microarray-based method has also been explored for identifying polymorphic L1 insertions (61). We can expect the discovery of many more novel RIPs from the analysis of a sufficiently large number of individual genomes representing diverse populations, particularly ancient or highly isolated populations, such as the Bushmen and Neanderthal genomes (62-64). In addition to the novel RIPs identified from these analyses, mostly as population- and individual-specific insertions, we can also expect a certain number of RE insertions present in the reference genome to be recognized as RIPs, particularly from the analysis of ancient populations.

The current gold standard for ascertaining a RIP is PCR, in which a pair of PCR primers is designed in the flanking regions of the insertion, such that the presence and absence of the RE insertion lead to differences in PCR product size, i.e., the product for the insertion-positive allele is larger than that for the insertion-negative allele roughly by the size of the insertion. This strategy works well when the RE insertion is relatively small, i.e. below 2 kb, and it can distinguish among the three genotypes of a RE insertion, "+/+", "+/-", and "-/-", as having one large product, one large and one small product, and one small product, respectively (see illustration in Figure 1A). When the insertion size is large, such as insertions of full-length L1s and HERVs, which can be as long as 10 kb, it becomes difficult to obtain a product for the insertion-positive allele even with long-range PCR. In this case, as used for genotyping HERV RIPs by Belshaw et al (21), a better strategy is to design two additional primers inside the RE, oriented outwards, such that in the presence of the RE insertion these two primers work with the two primers in the flanking regions to generate two shorter products, while for the insertion-negative allele only one product is generated from the two primers in the flanking regions. Due to the variability of subgroup sequences, these internal primers generally need to be designed for each specific subgroup of RE, unless a "universal" primer can be found based on a region highly conserved among the larger RE family. With these two sets of primers, 2, 3, and 1 product(s) are expected for samples with a genotype of "+/+", "+/-", and "-/-", respectively (see illustration in Figure 1B). Therefore, a PCR assay can provide complete and accurate genotyping of the RIP by distinguishing among the three possible genotypes. 
In addition, a PCR assay can also provide DNA for sequencing to obtain the sequence of the RE insertion, which may be highly desirable for RIPs outside the reference sequence, for which the RE insertion and target site duplication (TSD) sequences are usually not available.
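
The band-pattern logic described above can be sketched in a few lines of Python. This is only an illustrative sketch: the function name, the assay labels ("flanking" for the two-primer design, "internal" for the design with two additional outward-facing primers), and the product labels ("filled", "empty", "junction_5p", "junction_3p") are assumptions introduced here and are not taken from any published protocol. The "internal" case also assumes that the full-length insertion allele is too long to be amplified by the flanking primers alone.

```python
def call_genotype(products, assay="flanking"):
    """Map the set of PCR products seen for one sample to a RIP genotype.

    products: iterable of product labels observed on the gel (hypothetical labels).
    assay: "flanking" (two flanking primers) or "internal" (flanking primers
           plus two outward-facing primers inside the RE).
    Returns "+/+", "+/-", "-/-", or None if the pattern is not interpretable.
    """
    observed = frozenset(products)
    if assay == "flanking":
        # One large product, one large plus one small, or one small product.
        patterns = {
            frozenset({"filled"}): "+/+",
            frozenset({"filled", "empty"}): "+/-",
            frozenset({"empty"}): "-/-",
        }
    elif assay == "internal":
        # 2, 3, and 1 product(s) for "+/+", "+/-", and "-/-", respectively.
        patterns = {
            frozenset({"junction_5p", "junction_3p"}): "+/+",
            frozenset({"junction_5p", "junction_3p", "empty"}): "+/-",
            frozenset({"empty"}): "-/-",
        }
    else:
        raise ValueError(f"unknown assay type: {assay!r}")
    return patterns.get(observed)
```

Returning None for unexpected patterns (for example, a single junction product) keeps failed or ambiguous reactions from being silently assigned a genotype.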

Among the known RIPs, most were identified by computational analysis of genomic sequences, and usually only a small portion of them was subjected to PCR verification, more as a way of assessing the accuracy of the methods than as an attempt at validation. In these studies, it is often not feasible to experimentally verify all candidate RIPs due to their large numbers and the prohibitive cost of validation by PCR. However, to make the RIP data reliable and usable as genetic variation data for the genetics community, it is important to experimentally validate all RIPs identified computationally or by any other method that does not provide complete sequences for the insertions and the associated sequences, such as TSDs.

3.2. Special characteristics and issues associated with RIPs

In comparison with other types of genetic variations, such as single nucleotide polymorphisms (SNPs), indels, genomic rearrangements, and copy number variations (CNVs), RIPs have several unique features. They require special handling in characterization, compilation, and data display, as discussed below.

First, a RIP represents an evolutionary event that has a definitive ancestral status, which is always the absence of the insertion. In other words, the pre-integration sequence is always the ancestral form. Since the chance for two individuals to have the same retrotransposon insertion as a result of two independent insertion events is almost zero and there is no known mechanism that specifically removes a retrotransposon insertion, the only reason for two individuals to share a retrotransposon insertion is their shared ancestry. For this very reason, RIPs are considered to be homoplasy-free, and it is this characteristic that makes RIPs a very useful type of genetic marker in population studies, particularly in resolving ancestral relationships (65, 66). Also for these reasons, RIPs are always true insertion polymorphisms, and they should not be called "deletions" even when absent in the reference sequences. This does, however, present a challenge when using the current reference-based nomenclature for documenting polymorphisms (67, 68) (http://www.hgvs.org/mutnomen/recs.html).

Second, the final outcome of a retrotransposon insertion event often carries complex sequence rearrangements beyond the insertion of the RE sequence. These rearrangements include the more common generation of TSDs at variable lengths, ranging from a few bps to a few hundred bps, or the less common deletion of flanking region at the integration site. Furthermore, in some cases, more often seen in association with L1 and SVA insertions, extra sequences flanking the parent copy of the retrotransposons can be carried to the progeny copy via 5' or 3' transduction (2, 4-6, 8, 9, 13, 69, 70). The sizes of these transduced sequences range from a few to a few hundred bps, and can lead to exon shuffling if coding sequences are included in the transduced sequences. Accurate identification of the transduced sequence associated with a RE insertion is important for assessing the impact of the RIP. Again, using the current nomenclature for polymorphism/mutations presents difficulties for documenting these extra sequence rearrangements.

Third, insertions caused by certain types of retrotransposons, such as LTR retrotransposons (e.g. HERVs), can undergo post-insertion secondary changes within the RE insertion, such as homologous sequence-mediated recombination between the two LTRs, leading to the deletion of the internal ERV sequence and generation of a solo-LTR (19, 71, 72). As a result, different forms of the insertion sequence, despite their origin from the same insertion event, may co-exist at the same site among populations. In the case of a HERV insertion, at least three forms of sequence at the site, i.e., the pre-integration sequence, the insertion containing the full HERV sequence, and the insertion containing only the solo-LTR, can exist, as shown by Belshaw et al (21). Therefore, the genotype data in this case have to be handled and presented differently.

Lastly, RE insertions, including the associated TSDs, are also sources of SNPs and microsatellite variations (73). The two copies of a TSD can be subjected to random mutations and become different from each other, while the poly-dA tracts carried by Alus, L1s and SVAs are a major source of microsatellite DNA subject to a high level of sequence variation. These SNP variations, carried by a RE sequence that is itself polymorphic by way of presence or absence, add an additional dimension to the genetic diversity, and it is a challenge to report and document them.

For all these reasons, it is much more challenging to document retrotransposon insertion polymorphisms than other types of sequence polymorphisms. Therefore, RIPs warrant treatment as a special type of genetic variation.

3.3. Database documentation of retrotransposon insertion polymorphism in dbRIP

3.3.1. Current RIP database documentation status

Due to the large number of known RIPs and the many more expected to be identified, it is essential that these data are compiled in a way that is accurate and easy to access. Accuracy here refers not only to the reliability of all components of the information, including the sequence of the insertion, its location, and its classification, but also to the completeness of the data. For example, just knowing the presence of the insertion at a specific location does not provide sufficient information about the potential impact of the insertion; it is also important to know the exact sequence of the insertion and the TSDs or deletion of the flanking sequence and/or the presence of 5' or 3' transduced sequence. Other types of information, such as the source of the polymorphism (i.e. the specific population or individual showing the presence or absence of the insertion), the ascertaining/genotyping method, the insertion allele frequency in the examined populations, the phenotype association, etc. are also very useful. The sample source is very important for future study of rare RIPs. Currently, RIP data can be found in a few databases, which include dbSNP at NCBI (http://www.ncbi.nlm.nih.gov/projects/SNP/), the database of retrotransposon insertion polymorphisms (dbRIP) at Brock University (http://dbrip.org) and the database of genome variants (DGV) at the Centre of Applied Genomics (http://projects.tcag.ca/variation/) (25, 74). Some mouse RIP data were included in MouseIndelDB (http://variation.osu.edu/) (75). In all these databases other than dbRIP, it is not straightforward, if possible at all, to find specifically the entries related to retrotransposon insertion polymorphism, nor do they allow queries by disease, gene context-based location or RE class. 
For example, unless the dbSNP IDs for polymorphic retrotransposon insertions, such as those by Bennett et al (52), are known at the time of query, it is very hard to find these data in dbSNP, and there is no indication of the insertion sequence or the TSDs, nor is the classification of the RE provided, due to the different purpose of the database. Among all these databases, only dbRIP was designed specifically to accommodate the special needs of retrotransposon insertion polymorphism data (25). dbRIP has been recognized as an important reference resource for the research community, as demonstrated by the large number of citations it has received in the relatively short time since its inception (29, 53, 54, 58, 75-80). In the subsequent sections, we describe in detail the design, database schema, utilities, and future development of dbRIP.

3.3.2. Overview of dbRIP

In designing dbRIP, instead of having it as a standalone database like many other biological databases, we decided to have it integrated with a genome browser, and among the existing genome browsers, we chose the UCSC Genome Browser for its easy-to-use interface and comprehensive coverage of functional genome data. This integration makes dbRIP very user friendly, and more importantly, it allows the RIP data to be viewed in the context of genome sequences, genes and many other related genomic and functional genomic data that are made available via the UCSC Genome Browser (81, 82).

In the first full release of dbRIP data in June of 2006, there were 2095 non-redundant entries from a total of 2897 reported cases, including 1625, 407 and 63 cases of Alus, L1s and SVAs, respectively (25), and we have recently extended the coverage to include RIPs derived from HERVs. As of writing, dbRIP covers a total of 2,771 non-redundant RIP entries, including 2086, 598, 77, and 10 cases of Alus, L1s, SVAs, and HERVs, respectively (Table 1; Tang et al, manuscript in submission). These RIP data were collected from over 70 publications and were curated manually to characterize all properties associated with a RIP.

One of the unique features of dbRIP, which is very important for retrotransposon insertion polymorphism, is that we provide the detailed sequence information associated with a RIP by distinguishing the sequences of the RE insertion, the target site duplications (TSDs) and the flanking regions (Figure 2). For LTR retrotransposons, which are characterized by the presence of two long terminal repeats (LTRs) at the ends, we also define and label the LTR sequences differently from the internal endogenous retrovirus-related sequence, which is useful for assessing the impact of a HERV RIP. Other informational items we currently collect and provide for RIPs include identifications (including the original ID(s)), RE classification, known disease associations, available PCR conditions (primers, TM, and allele sizes) used for genotyping, methods of ascertainment, source of the RIPs, genomic location in gene context, available genotyping data in detail, and the original reference(s) reporting each RIP (Figure 2).

3.3.3. dbRIP utilities

A key function of a database is to allow querying its data by all properties associated with the data. The current major utilities of dbRIP include a data-search interface and a position-mapping tool. In dbRIP, the RIP data can be searched using the standard utilities provided by the UCSC Genome Browser. For example, one can use the genome position, gene name, and BLAT to find all RIPs located in the specified regions or associated with the specified genes or query sequence. Advanced users can also use the Table Browser utility to perform more sophisticated queries (83). To further facilitate the search of dbRIP data, a "SearchdbRIP" tool was developed; it allows querying of RIP data based on one or more RIP properties, which may not be possible or easy to perform using the UCSC Genome Browser utilities. The SearchdbRIP utilities are divided into two sections. The first part allows quick search by RIP IDs (either the dbRIP ID or the names used in the original study) and chromosome coordinates. The second part provides advanced search by one or more of the RIP data properties, including the chromosome, location in gene context, source of RIP, RE subfamily, population, allele frequency range, disease association, and author name (Figure 4). For location in gene context, we break the genome into 5 categorical regions, including exon, promoter (1 kb upstream of the transcription start site), intron, downstream (500 bp downstream of the end of the gene), and inter-genic regions, in priority order from high to low to handle situations where two or more categories can be assigned to the same location. Exons are further divided into 5'-UTR, CDS, 3'-UTR and non-coding RNA. Using this search parameter, one can query all RIPs falling into a specific gene location category, for instance, all RIPs in coding regions, as a way to study the functional impact of RIPs. The currently available categories for sources of the RIPs include "reference (e.g. hg18, hg19)", "Venter" and "others". This allows users to collect RIPs from a specific genome or data source. It is our plan to expand this list in the near future to cover other important sources, such as Watson, 1000 Genome Project, Bushmen, Neanderthal, etc. In short, SearchdbRIP allows users to collect a specific set of RIPs by using one or more of the search parameters. The output of SearchdbRIP is a detailed list of matched entries. For each matched entry, two hyperlinks are provided: the link to "Detailed" provides access to the detailed record page of the RIP as shown in Figure 2, and the link to "Browser" brings users to the genome browser at a default window size of the RIP insertion size plus 5 kb on each side as shown in Figure 3.
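
The priority rule for gene-context assignment can be illustrated with a short Python sketch. It is not the dbRIP implementation; the function name and the category strings are assumptions chosen here for illustration, and the sketch only shows how a single category could be selected when an insertion site qualifies for more than one.

```python
# Gene-context categories in priority order, from high to low, as listed in the text.
GENE_CONTEXT_PRIORITY = ["exon", "promoter", "intron", "downstream", "intergenic"]


def assign_gene_context(candidates):
    """Pick the highest-priority category among those that apply to a site.

    candidates: iterable of category names that the insertion position overlaps,
                e.g. ["intron", "promoter"] when two gene models overlap.
    An empty input is treated as intergenic (an assumption of this sketch).
    """
    candidates = list(candidates)
    unknown = set(candidates) - set(GENE_CONTEXT_PRIORITY)
    if unknown:
        raise ValueError(f"unknown categories: {sorted(unknown)}")
    if not candidates:
        return "intergenic"
    # The category with the smallest index in the priority list wins.
    return min(candidates, key=GENE_CONTEXT_PRIORITY.index)
```

For example, a site that lies in the intron of one gene and within 1 kb upstream of another would be reported as "promoter" under this rule.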

The tool, PositionMapping, is designed for users to determine in a batch style among a list of newly identified candidate RIP entries which represent known RIPs (i.e. those in dbRIP) and which represent novel RIPs (i.e. not in dbRIP). The utility compares the positions of the user's input list with the coordinates of all dbRIP entries and identifies overlapping entries. It should be a useful tool for researchers who identify large lists of RIPs and want to compare with the data in dbRIP. As output, the utility provides a summary of the mapping result and generates a list of input IDs that are novel RIPs and a list of IDs that overlap with the data in dbRIP.
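
The kind of coordinate comparison performed by such a tool can be sketched as follows. This is a minimal illustration rather than the actual PositionMapping code: the function name, the tuple layout (id, chromosome, start, end), the closed-interval convention, and the optional "slop" tolerance are all assumptions made for the example.

```python
from collections import defaultdict


def map_positions(candidates, known, slop=0):
    """Split candidate RIPs into those overlapping known entries and novel ones.

    candidates, known: iterables of (entry_id, chrom, start, end) with closed
                       intervals on the same genome assembly (assumed layout).
    slop: number of bases by which candidate intervals are extended on each
          side before testing for overlap (hypothetical tolerance parameter).
    Returns (matched, novel), where matched maps each candidate ID to the list
    of overlapping known IDs and novel lists candidate IDs without any overlap.
    """
    known_by_chrom = defaultdict(list)
    for known_id, chrom, start, end in known:
        known_by_chrom[chrom].append((start, end, known_id))

    matched, novel = {}, []
    for cand_id, chrom, start, end in candidates:
        hits = [
            known_id
            for k_start, k_end, known_id in known_by_chrom.get(chrom, [])
            # Two closed intervals overlap when each starts before the other ends.
            if start - slop <= k_end and k_start <= end + slop
        ]
        if hits:
            matched[cand_id] = hits
        else:
            novel.append(cand_id)
    return matched, novel
```

The linear scan per chromosome is adequate for a few thousand database entries; for much larger inputs, sorting the intervals or using an interval tree would be a natural refinement.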

3.3.4. dbRIP record tracking and data releases

In designing the first version of dbRIP, we used an ID system that reveals the RE type, the chromosome, the position of a RIP in million base pairs, and the number of RIPs in the same position designation. For example, "RIP_Alu_chr7_003_01" indicates the first Alu record in the 3-million-bp position designation of chromosome 7. The intention was to provide as much information as possible about the RIP via the ID. However, this causes a problem of consistency when migrating to a new genome version due to the changes of the chromosome coordinates, making it either necessary to change the IDs with every genome migration or possibly rendering the ID meaningless. To avoid this problem, starting from release 2 of dbRIP data, we changed it to a 7-digit numerical ID system, similar to the one used in the OMIM database (http://www.ncbi.nlm.nih.gov/omim). In this ID system, the first digit is used to indicate the major type of retrotransposon (1xxxxxx, 2xxxxxx, 3xxxxxx, 4xxxxxx for Alu, L1, SVA and HERV, respectively) and the remaining 6 digits are used to indicate the sequential order of RIPs of this type deposited into dbRIP. For example, dbRIP 1000001 is the first Alu RIP, while 2000100 is the 100th L1 RIP. The remaining 5 digits (i.e., 5-9) for the 1st position in the ID are reserved for new types of RIPs and for accommodating existing types that exceed 1 million in number. Therefore, the system allows a maximum of 3 million RIP entries for one type of RE and a total of 10 million entries for all REs. The new ID system provides stability and consistency not affected by migrating to newer genome versions, and it allows referencing a RIP record via a permanent identification.
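
As an illustration of the 7-digit ID scheme, the following Python sketch decodes an ID into its RE type and sequential number. The function name and the handling of unassigned prefixes are assumptions of this sketch: prefixes 5-9 are reported as "reserved", as described above, while a leading 0 is rejected because the text does not assign it a meaning.

```python
# First-digit prefixes as described in the text.
RE_TYPE_BY_PREFIX = {"1": "Alu", "2": "L1", "3": "SVA", "4": "HERV"}


def parse_dbrip_id(dbrip_id):
    """Decode a 7-digit dbRIP ID into (RE type, sequential number).

    Accepts the ID as a string or an integer.
    Prefixes 5-9 are returned as "reserved"; a leading 0 raises ValueError
    (an assumption of this sketch, since the text does not describe it).
    """
    text = str(dbrip_id)
    if len(text) != 7 or not (text.isascii() and text.isdigit()):
        raise ValueError(f"not a 7-digit dbRIP ID: {dbrip_id!r}")
    prefix, serial = text[0], int(text[1:])
    if prefix == "0":
        raise ValueError(f"prefix 0 is not assigned: {dbrip_id!r}")
    return RE_TYPE_BY_PREFIX.get(prefix, "reserved"), serial
```

With this sketch, parse_dbrip_id("1000001") yields ("Alu", 1) and parse_dbrip_id(2000100) yields ("L1", 100), matching the two examples given in the text.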

To better track the changes of the dbRIP data, we developed a data-version system using release numbers. We assign a sequential data release version number, such as release 1 and 2, for each major update and use the decimal number after the major release number to label each minor update. For instance, release 2.1 will be used for the first minor update for release 2 data (either addition of a small number of RIPs and/or modifications to the existing data). The data release version is an indication of status of RIPs data (e.g. the total number of RIPs) and is not necessarily directly tied to a genome version, since one data release may be provided simultaneously for more than one genome version. For example, release 2.0 is made available for both hg18 and hg19. For the last version of each major data release, we provide a summary statistics table (see Table 1 for example).

3.3.5. Future development of dbRIP

We are committed to maintaining and updating dbRIP as a free research resource for the community. Future maintenance of dbRIP will focus on: 1) timely collection of newly published RIP data, 2) accommodation of new RIP data identified using new methodologies that may not provide complete information for a RIP, 3) support for newer versions of the reference human genome and 4) possible expansion to other model organisms, such as mouse. With the expected availability of large numbers of personal genome sequences and the development of new strategies for experimental identification of novel RIPs, a large number of novel RIPs have been and will be identified via computational comparative genomics, microarray or next-generation sequencing approaches (54, 55, 58-61). Timely updates to accommodate these new data will be in high demand. In terms of utilities, we plan to develop an interface through which users can submit new RIP data, to facilitate the deposition of data into dbRIP. In addition, it may be useful to implement a mechanism, e.g. an email alerting service, to inform interested users about database updates.

Future improvements to dbRIP need to address a few outstanding issues, including 1) how to document rare cases of RIPs with deletions in the flanking regions; 2) how to report and describe the transduced sequences associated with RIPs; and 3) the development of controlled vocabularies and/or nomenclatures for describing RIP properties, such as the population name, the methods used to identify or ascertain RIPs, genotype data and allele frequency, disease association, etc. To make RIP data accessible to a broader user community, we are working with the teams of dbVar at NCBI (http://www.ncbi.nlm.nih.gov/dbvar) and the Database of Genomic Variants (http://projects.tcag.ca/variation/) to exchange data with these databases, which already cover or intend to cover RIP data. We will also continue to work with the UCSC Genome Browser team to make the dbRIP data track available for all human genome versions on their genome browser server (http://genome.ucsc.edu). To make dbRIP a more valuable resource for the community, we welcome suggestions and contributions from users regarding future data updates and improvement of the interface and utilities.

4. SUMMARY AND CONCLUDING REMARKS

Over the past decade, with the availability of the human reference genome sequences and those of other primate genomes, we have obtained a panoramic view of retrotransposons, the major class of mobile elements in these genomes. We have begun to learn more about their evolutionary history and proliferation dynamics, and to appreciate their important impact on genome evolution, gene function and genome diversity. The process of retrotransposition not only allows these retrotransposons to propagate, many with great success, and thereby exert their impact on the genome, but also serves as a major mechanism for generating inter- and intra-species genome diversity. Retrotransposon insertion polymorphisms represent an important source of genetic polymorphism, not only because of the ever-increasing number of loci and the large amount of sequence involved, but also because of their significant and complex impact on genome structure and gene function. Their highly complex characteristics make them distinct from other types of genetic variation and require them to be handled differently in curation and documentation. As of writing, close to 5000 RIP entries, amounting to over 25 million base pairs of sequence, have been identified, mostly from genome-wide surveys and comparative genomic analyses. Now, with the advent of newer generations of genome sequencing technologies and an ever-increasing number of personal genomes, we have an exciting new opportunity to obtain a more complete picture of the level of genetic polymorphism contributed by these retrotransposons, the related mechanisms, and their functional impact. From these analyses, we expect to identify a large number of new RIPs, speculatively triple what is known so far, extending to those that represent very rare de novo insertion events from the known active retrotransposons, those generated by non-canonical mechanisms, and perhaps also those from mobile elements not currently known to be active.

Complete and accurate documentation of such complex types of genetic variation in an integrated and intuitive manner is essential for fully realizing the benefits of these research data. In the meantime, this also poses many challenges and calls for the research community to develop new standards and nomenclature. dbRIP is currently the only database specially designed to accommodate the documentation of RIPs. Databases that target specific data types and user communities offer many benefits over general databases, mainly owing to their high-quality data curation and clearly defined data applications. Future improvements with community support are needed to meet these needs and challenges. Last but not least, as a community of researchers on mobile elements, we need to advocate more strongly for the importance of retrotransposon insertion polymorphisms in genetics and human population studies and to solicit more funding to support related research, particularly the large-scale validation and genotyping of RIPs, as well as their database documentation.

5. ACKNOWLEDGEMENTS

We thank Scott Golem and the two anonymous reviewers for their critical review of the manuscript and their suggestions. This work was supported in part by grants from the Canada Research Chair program, the Canada Foundation for Innovation (CFI), the Ontario Ministry of Research & Innovation (OMRI), Brock University, and the Natural Sciences and Engineering Research Council (NSERC) to PL, and was made possible by the facilities of the Shared Hierarchical Academic Research Computing Network (SHARCNET, http://www.sharcnet.ca) and Compute/Calcul Canada (https://computecanada.org/).

6. REFERENCES

1. R Cordaux, MA Batzer: The impact of retrotransposons on human genome evolution. Nat Rev Genet 10, 691-703 (2009)
doi:10.1038/nrg2640
PMID:19763152    PMCid:2884099

2. A Damert, J Raiz, AV Horn, J Lower, H Wang, J Xing, MA Batzer, R Lower, GG Schumann: 5'-Transducing SVA retrotransposon groups spread efficiently throughout the human genome. Genome Res (2009)

3. J Xing, H Wang, VP Belancio, R Cordaux, PL Deininger, MA Batzer: Emergence of primate genes by retrotransposon-mediated sequence transduction. Proc Natl Acad Sci U S A 103, 17608-17613 (2006)
doi:10.1073/pnas.0603224103
PMID:17101974    PMCid:1693794

4. PA Callinan, J Wang, SW Herke, RK Garber, P Liang, MA Batzer: Alu retrotransposition-mediated deletion. J Mol Biol 348, 791-800 (2005)
doi:10.1016/j.jmb.2005.02.043
PMID:15843013

5. K Han, SK Sen, J Wang, PA Callinan, J Lee, R Cordaux, P Liang, MA Batzer: Genomic rearrangements by LINE-1 insertion-mediated deletion in the human and chimpanzee lineages. Nucleic Acids Res 33, 4040-4052 (2005)
doi:10.1093/nar/gki718
PMID:16034026    PMCid:1179734

6. K Han, J Lee, TJ Meyer, J Wang, SK Sen, D Srikanta, P Liang, MA Batzer: Alu recombination-mediated structural deletions in the chimpanzee genome. PLoS Genet 3, 1939-1949 (2007)
doi:10.1371/journal.pgen.0030184
PMID:17953488    PMCid:2041999

7. JS Han, ST Szak, JD Boeke: Transcriptional disruption by the L1 retrotransposon and implications for mammalian transcriptomes. Nature 429, 268-274 (2004)
doi:10.1038/nature02536
PMID:15152245

8. JD Boeke, OK Pickeral: Retroshuffling the genomic deck. Nature 398, 108-9, 111 (1999)
doi:10.1038/18118
PMID:10086353

9. DC Hancks, AD Ewing, JE Chen, K Tokunaga, HH Kazazian Jr: Exon-trapping mediated by the human retrotransposon SVA. Genome Res (2009)

10. DV Babushok, EM Ostertag, HH Kazazian Jr: Current topics in genome evolution: molecular mechanisms of new gene formation. Cell Mol Life Sci 64, 542-554 (2007)
doi:10.1007/s00018-006-6453-4
PMID:17192808

11. HH Kazazian Jr: Mobile elements: drivers of genome evolution. Science 303, 1626-1632 (2004)
doi:10.1126/science.1089670
PMID:15016989

12. PL Deininger, JV Moran, MA Batzer, HH Kazazian Jr: Mobile elements and mammalian genome evolution. Curr Opin Genet Dev 13, 651-658 (2003)
doi:10.1016/j.gde.2003.10.013

13. SK Sen, K Han, J Wang, J Lee, H Wang, PA Callinan, M Dyer, R Cordaux, P Liang, MA Batzer: Human genomic deletions mediated by recombination between Alu elements. Am J Hum Genet 79, 41-53 (2006)
doi:10.1086/504600
PMID:16773564

14. ES Lander, LM Linton, B Birren, C Nusbaum, MC Zody, J Baldwin, K Devon, K Dewar, M Doyle, W FitzHugh, R Funke, D Gage, K Harris, A Heaford, J Howland, L Kann, J Lehoczky, R LeVine, P McEwan, K McKernan, J Meldrim, JP Mesirov, C Miranda, W Morris, J Naylor, C Raymond, M Rosetti, R Santos, A Sheridan, C Sougnez, N Stange-Thomann, N Stojanovic, A Subramanian, D Wyman, J Rogers, J Sulston, R Ainscough, S Beck, D Bentley, J Burton, C Clee, N Carter, A Coulson, R Deadman, P Deloukas, A Dunham, I Dunham, R Durbin, L French, D Grafham, S Gregory, T Hubbard, S Humphray, A Hunt, M Jones, C Lloyd, A McMurray, L Matthews, S Mercer, S Milne, JC Mullikin, A Mungall, R Plumb, M Ross, R Shownkeen, S Sims, RH Waterston, RK Wilson, LW Hillier, JD McPherson, MA Marra, ER Mardis, LA Fulton, AT Chinwalla, KH Pepin, WR Gish, SL Chissoe, MC Wendl, KD Delehaunty, TL Miner, A Delehaunty, JB Kramer, LL Cook, RS Fulton, DL Johnson, PJ Minx, SW Clifton, T Hawkins, E Branscomb, P Predki, P Richardson, S Wenning, T Slezak, N Doggett, JF Cheng, A Olsen, S Lucas, C Elkin, E Uberbacher, M Frazier, RA Gibbs, DM Muzny, SE Scherer, JB Bouck, EJ Sodergren, KC Worley, CM Rives, JH Gorrell, ML Metzker, SL Naylor, RS Kucherlapati, DL Nelson, GM Weinstock, Y Sakaki, A Fujiyama, M Hattori, T Yada, A Toyoda, T Itoh, C Kawagoe, H Watanabe, Y Totoki, T Taylor, J Weissenbach, R Heilig, W Saurin, F Artiguenave, P Brottier, T Bruls, E Pelletier, C Robert, P Wincker, DR Smith, L Doucette-Stamm, M Rubenfield, K Weinstock, HM Lee, J Dubois, A Rosenthal, M Platzer, G Nyakatura, S Taudien, A Rump, H Yang, J Yu, J Wang, G Huang, J Gu, L Hood, L Rowen, A Madan, S Qin, RW Davis, NA Federspiel, AP Abola, MJ Proctor, RM Myers, J Schmutz, M Dickson, J Grimwood, DR Cox, MV Olson, R Kaul, C Raymond, N Shimizu, K Kawasaki, S Minoshima, GA Evans, M Athanasiou, R Schultz, BA Roe, F Chen, H Pan, J Ramser, H Lehrach, R Reinhardt, WR McCombie, M de la Bastide, N Dedhia, H Blocker, K Hornischer, G Nordsiek, R Agarwala, L Aravind, JA Bailey, A Bateman, S Batzoglou, E Birney, P Bork, DG Brown, CB Burge, L Cerutti, HC Chen, D Church, M Clamp, RR Copley, T Doerks, SR Eddy, EE Eichler, TS Furey, J Galagan, JG Gilbert, C Harmon, Y Hayashizaki, D Haussler, H Hermjakob, K Hokamp, W Jang, LS Johnson, TA Jones, S Kasif, A Kaspryzk, S Kennedy, WJ Kent, P Kitts, EV Koonin, I Korf, D Kulp, D Lancet, TM Lowe, A McLysaght, T Mikkelsen, JV Moran, N Mulder, VJ Pollara, CP Ponting, G Schuler, J Schultz, G Slater, AF Smit, E Stupka, J Szustakowski, D Thierry-Mieg, J Thierry-Mieg, L Wagner, J Wallis, R Wheeler, A Williams, YI Wolf, KH Wolfe, SP Yang, RF Yeh, F Collins, MS Guyer, J Peterson, A Felsenfeld, KA Wetterstrand, A Patrinos, MJ Morgan, P de Jong, JJ Catanese, K Osoegawa, H Shizuya, S Choi, YJ Chen, International Human Genome Sequencing Consortium: Initial sequencing and analysis of the human genome. Nature 409, 860-921 (2001)
doi:10.1038/35057062
PMID:11237011

15. PL Deininger, MA Batzer: Mammalian retroelements. Genome Res 12, 1455-1465 (2002)
doi:10.1101/gr.282402
PMID:12368238

16. H Wang, J Xing, D Grover, DJ Hedges, K Han, JA Walker, MA Batzer: SVA elements: a hominid-specific retroposon family. J Mol Biol 354, 994-1007 (2005)
doi:10.1016/j.jmb.2005.09.085
PMID:16288912

17. RE Mills, EA Bennett, RC Iskow, SE Devine: Which transposable elements are active in the human genome? Trends Genet 23, 183-191 (2007)
doi:10.1016/j.tig.2007.02.006
PMID:17331616

18. AL Price, E Eskin, PA Pevzner: Whole-genome analysis of Alu repeat elements reveals complex evolutionary history. Genome Res 14, 2245-2252 (2004)
doi:10.1101/gr.2693004
PMID:15520288    PMCid:525682

19. N Bannert, R Kurth: Retroelements and the human genome: new perspectives on an old relation. Proc Natl Acad Sci U S A 101 Suppl 2, 14572-14579 (2004)
doi:10.1073/pnas.0404838101
PMID:15310846    PMCid:521986

20. N Bannert, R Kurth: The evolutionary dynamics of human endogenous retroviral families. Annu Rev Genomics Hum Genet 7, 149-173 (2006)
doi:10.1146/annurev.genom.7.080505.115700
PMID:16722807

21. R Belshaw, AL Dawson, J Woolven-Allen, J Redding, A Burt, M Tristem: Genomewide screening reveals high levels of insertional polymorphism in the human endogenous retrovirus family HERV-K(HML2): implications for present-day activity. J Virol 79, 12507-12514 (2005)
doi:10.1128/JVI.79.19.12507-12514.2005
PMID:16160178    PMCid:1211540

22. GE Liu, C Alkan, L Jiang, S Zhao, EE Eichler: Comparative analysis of Alu repeats in primate genomes. Genome Res 19, 876-885 (2009)
doi:10.1101/gr.083972.108
PMID:19411604    PMCid:2675976

23. AB Carter, AH Salem, DJ Hedges, CN Keegan, B Kimball, JA Walker, WS Watkins, LB Jorde, MA Batzer: Genome-wide analysis of the human Alu Yb-lineage. Hum Genomics 1, 167-178 (2004)
PMID:15588477

24. R Gibbons, LJ Dugaiczyk, T Girke, B Duistermars, R Zielinski, A Dugaiczyk: Distinguishing humans from great apes with AluYb8 repeats. J Mol Biol 339, 721-729 (2004)
doi:10.1016/j.jmb.2004.04.033
PMID:15165846

25. J Wang, L Song, D Grover, S Azrak, MA Batzer, P Liang: dbRIP: a highly integrated database of retrotransposon insertion polymorphisms in humans. Hum Mutat 27, 323-329 (2006)
doi:10.1002/humu.20307
PMID:16511833    PMCid:1855216

26. J Wang, L Song, MK Gonder, S Azrak, DA Ray, MA Batzer, SA Tishkoff, P Liang: Whole genome computational comparative genomics: A fruitful approach for ascertaining Alu insertion polymorphisms. Gene 365, 11-20 (2006)
doi:10.1016/j.gene.2005.09.031
PMID:16376498    PMCid:1847407

27. MK Konkel, J Wang, P Liang, MA Batzer: Identification and characterization of novel polymorphic LINE-1 insertions through comparison of two human genome sequence assemblies. Gene 390, 28-38 (2007)
doi:10.1016/j.gene.2006.07.040
PMID:17034961

28. J Wang, W Wang, R Li, Y Li, G Tian, L Goodman, W Fan, J Zhang, J Li, J Zhang, Y Guo, B Feng, H Li, Y Lu, X Fang, H Liang, Z Du, D Li, Y Zhao, Y Hu, Z Yang, H Zheng, I Hellmann, M Inouye, J Pool, X Yi, J Zhao, J Duan, Y Zhou, J Qin, L Ma, G Li, Z Yang, G Zhang, B Yang, C Yu, F Liang, W Li, S Li, D Li, P Ni, J Ruan, Q Li, H Zhu, D Liu, Z Lu, N Li, G Guo, J Zhang, J Ye, L Fang, Q Hao, Q Chen, Y Liang, Y Su, A San, C Ping, S Yang, F Chen, L Li, K Zhou, H Zheng, Y Ren, L Yang, Y Gao, G Yang, Z Li, X Feng, K Kristiansen, GK Wong, R Nielsen, R Durbin, L Bolund, X Zhang, S Li, H Yang, J Wang: The diploid genome sequence of an Asian individual. Nature 456, 60-65 (2008)
doi:10.1038/nature07484
PMID:18987735    PMCid:2716080

29. J Xing, Y Zhang, K Han, AH Salem, SK Sen, CD Huff, Q Zhou, EF Kirkness, S Levy, MA Batzer, LB Jorde: Mobile elements create structural variation: analysis of a complete human genome. Genome Res 19, 1516-1526 (2009)
doi:10.1101/gr.091827.109
PMID:19439515    PMCid:2752133

30. J Lee, R Cordaux, K Han, J Wang, DJ Hedges, P Liang, MA Batzer: Different evolutionary fates of recently integrated human and chimpanzee LINE-1 retrotransposons. Gene 390, 18-27 (2007)
doi:10.1016/j.gene.2006.08.029
PMID:17055192    PMCid:1847406

31. MC Seleme, MR Vetter, R Cordaux, L Bastone, MA Batzer, HH Kazazian Jr: Extensive individual variation in L1 retrotransposition capability contributes to human genetic diversity. Proc Natl Acad Sci U S A 103, 6611-6616 (2006)
doi:10.1073/pnas.0601324103
PMID:16618923    PMCid:1458931

32. G Liu, NISC Comparative Sequencing Program, S Zhao, JA Bailey, SC Sahinalp, C Alkan, E Tuzun, ED Green, EE Eichler: Analysis of primate genomic variation reveals a repeat-driven expansion of the human genome. Genome Res 13, 358-368 (2003)
doi:10.1101/gr.923303
PMID:12618366    PMCid:430288

33. D Moyes, DJ Griffiths, PJ Venables: Insertional polymorphisms: a new lease of life for endogenous retroviruses in human disease. Trends Genet 23, 326-333 (2007)
doi:10.1016/j.tig.2007.05.004
PMID:17524519

34. SS Arcot, TH Shaikh, J Kim, L Bennett, M Alegria-Hartman, DO Nelson, PL Deininger, MA Batzer: Sequence diversity and chromosomal distribution of "young" Alu repeats. Gene 163, 273-278 (1995)
doi:10.1016/0378-1119(95)00317-Y

35. MA Batzer, SS Arcot, JW Phinney, M Alegria-Hartman, DH Kass, SM Milligan, C Kimpton, P Gill, M Hochmeister, PA Ioannou, RJ Herrera, DA Boudreau, WD Scheer, BJ Keats, PL Deininger, M Stoneking: Genetic variation of recent Alu insertions in human populations. J Mol Evol 42, 22-29 (1996)
doi:10.1007/BF00163207
PMID:8576959

36. AM Roy, ML Carroll, DH Kass, SV Nguyen, AH Salem, MA Batzer, PL Deininger: Recently integrated human Alu repeats: finding needles in the haystack. Genetica 107, 149-161 (1999)
doi:10.1023/A:1003941704138
PMID:10952208

37. MA Batzer, VA Gudi, JC Mena, DW Foltz, RJ Herrera, PL Deininger: Amplification dynamics of human-specific (HS) Alu family members. Nucleic Acids Res 19, 3619-3623 (1991)
doi:10.1093/nar/19.13.3619
PMID:1649453    PMCid:328388

38. JM Kidd, T Graves, TL Newman, R Fulton, HS Hayden, M Malig, J Kallicki, R Kaul, RK Wilson, EE Eichler: A human genome structural variation sequencing resource reveals insights into mutational mechanisms. Cell 143, 837-847 (2010)
doi:10.1016/j.cell.2010.10.027
PMID:21111241    PMCid:3026629

39. HH Kazazian Jr, C Wong, H Youssoufian, AF Scott, DG Phillips, SE Antonarakis: Haemophilia A resulting from de novo insertion of L1 sequences represents a novel mechanism for mutation in man. Nature 332, 164-166 (1988)
doi:10.1038/332164a0
PMID:2831458

40. Y Miki, I Nishisho, A Horii, Y Miyoshi, J Utsunomiya, KW Kinzler, B Vogelstein, Y Nakamura: Disruption of the APC gene by a retrotransposal insertion of L1 sequence in a colon cancer. Cancer Res 52, 643-645 (1992)
PMID:1310068

41. PA Callinan, MA Batzer: Retrotransposable elements and human disease. Genome Dyn 1, 104-115 (2006)
doi:10.1159/000092503
PMID:18724056

42. R Druker, E Whitelaw: Retrotransposon-derived elements in the mammalian genome: a potential source of disease. J Inherit Metab Dis 27, 319-330 (2004)
doi:10.1023/B:BOLI.0000031096.81518.66
PMID:15190191

43. JC Venter, MD Adams, EW Myers, PW Li, RJ Mural, GG Sutton, HO Smith, M Yandell, CA Evans, RA Holt, JD Gocayne, P Amanatides, RM Ballew, DH Huson, JR Wortman, Q Zhang, CD Kodira, XH Zheng, L Chen, M Skupski, G Subramanian, PD Thomas, J Zhang, GL Gabor Miklos, C Nelson, S Broder, AG Clark, J Nadeau, VA McKusick, N Zinder, AJ Levine, RJ Roberts, M Simon, C Slayman, M Hunkapiller, R Bolanos, A Delcher, I Dew, D Fasulo, M Flanigan, L Florea, A Halpern, S Hannenhalli, S Kravitz, S Levy, C Mobarry, K Reinert, K Remington, J Abu-Threideh, E Beasley, K Biddick, V Bonazzi, R Brandon, M Cargill, I Chandramouliswaran, R Charlab, K Chaturvedi, Z Deng, V Di Francesco, P Dunn, K Eilbeck, C Evangelista, AE Gabrielian, W Gan, W Ge, F Gong, Z Gu, P Guan, TJ Heiman, ME Higgins, RR Ji, Z Ke, KA Ketchum, Z Lai, Y Lei, Z Li, J Li, Y Liang, X Lin, F Lu, GV Merkulov, N Milshina, HM Moore, AK Naik, VA Narayan, B Neelam, D Nusskern, DB Rusch, S Salzberg, W Shao, B Shue, J Sun, Z Wang, A Wang, X Wang, J Wang, M Wei, R Wides, C Xiao, C Yan, A Yao, J Ye, M Zhan, W Zhang, H Zhang, Q Zhao, L Zheng, F Zhong, W Zhong, S Zhu, S Zhao, D Gilbert, S Baumhueter, G Spier, C Carter, A Cravchik, T Woodage, F Ali, H An, A Awe, D Baldwin, H Baden, M Barnstead, I Barrow, K Beeson, D Busam, A Carver, A Center, ML Cheng, L Curry, S Danaher, L Davenport, R Desilets, S Dietz, K Dodson, L Doup, S Ferriera, N Garg, A Gluecksmann, B Hart, J Haynes, C Haynes, C Heiner, S Hladun, D Hostin, J Houck, T Howland, C Ibegwam, J Johnson, F Kalush, L Kline, S Koduru, A Love, F Mann, D May, S McCawley, T McIntosh, I McMullen, M Moy, L Moy, B Murphy, K Nelson, C Pfannkoch, E Pratts, V Puri, H Qureshi, M Reardon, R Rodriguez, YH Rogers, D Romblad, B Ruhfel, R Scott, C Sitter, M Smallwood, E Stewart, R Strong, E Suh, R Thomas, NN Tint, S Tse, C Vech, G Wang, J Wetter, S Williams, M Williams, S Windsor, E Winn-Deen, K Wolfe, J Zaveri, K Zaveri, JF Abril, R Guigo, MJ Campbell, KV Sjolander, B Karlak, A Kejariwal, H Mi, B Lazareva, T Hatton, A Narechania, K Diemer, A Muruganujan, N Guo, S Sato, V Bafna, S Istrail, R Lippert, R Schwartz, B Walenz, S Yooseph, D Allen, A Basu, J Baxendale, L Blick, M Caminha, J Carnes-Stine, P Caulk, YH Chiang, M Coyne, C Dahlke, A Mays, M Dombroski, M Donnelly, D Ely, S Esparham, C Fosler, H Gire, S Glanowski, K Glasser, A Glodek, M Gorokhov, K Graham, B Gropman, M Harris, J Heil, S Henderson, J Hoover, D Jennings, C Jordan, J Jordan, J Kasha, L Kagan, C Kraft, A Levitsky, M Lewis, X Liu, J Lopez, D Ma, W Majoros, J McDaniel, S Murphy, M Newman, T Nguyen, N Nguyen, M Nodell, S Pan, J Peck, M Peterson, W Rowe, R Sanders, J Scott, M Simpson, T Smith, A Sprague, T Stockwell, R Turner, E Venter, M Wang, M Wen, D Wu, M Wu, A Xia, A Zandieh, X Zhu: The sequence of the human genome. Science 291, 1304-1351 (2001)
doi:10.1126/science.1058040
PMID:11181995

44. ML Carroll, AM Roy-Engel, SV Nguyen, AH Salem, E Vogel, B Vincent, J Myers, Z Ahmad, L Nguyen, M Sammarco, WS Watkins, J Henke, W Makalowski, LB Jorde, PL Deininger, MA Batzer: Large-scale analysis of the Alu Ya5 and Yb8 subfamilies and their contribution to human genomic diversity. J Mol Biol 311, 17-40 (2001)
doi:10.1006/jmbi.2001.4847
PMID:11469855

45. AC Otieno, AB Carter, DJ Hedges, JA Walker, DA Ray, RK Garber, BA Anders, N Stoilova, ME Laborde, JD Fowlkes, CH Huang, B Perodeau, MA Batzer: Analysis of the human Alu Ya-lineage. J Mol Biol 342, 109-118 (2004)
doi:10.1016/j.jmb.2004.07.016
PMID:15313610

46. AM Roy-Engel, ML Carroll, E Vogel, RK Garber, SV Nguyen, AH Salem, MA Batzer, PL Deininger: Alu insertion polymorphisms for the study of human genomic diversity. Genetics 159, 279-290 (2001)
PMID:11560904    PMCid:1461783

47. RK Garber, DJ Hedges, SW Herke, NW Hazard, MA Batzer: The Alu Yc1 subfamily: sorting the wheat from the chaff. Cytogenet Genome Res 110, 537-542 (2005)
doi:10.1159/000084986
PMID:16093706

48. J Xing, AH Salem, DJ Hedges, GE Kilroy, WS Watkins, JE Schienman, CB Stewart, J Jurka, LB Jorde, MA Batzer: Comprehensive analysis of two Alu Yd subfamilies. J Mol Evol 57 Suppl 1, S76-89 (2003)
doi:10.1007/s00239-003-0009-0
PMID:15008405

49. AH Salem, GE Kilroy, WS Watkins, LB Jorde, MA Batzer: Recently integrated Alu elements and human genomic diversity. Mol Biol Evol 20, 1349-1361 (2003)
doi:10.1093/molbev/msg150
PMID:12777511

50. AH Salem, DA Ray, DJ Hedges, J Jurka, MA Batzer: Analysis of the human Alu Ye lineage. BMC Evol Biol 5, 18 (2005)
doi:10.1186/1471-2148-5-18
PMID:15725352    PMCid:554112

51. PA Callinan, DJ Hedges, AH Salem, J Xing, JA Walker, RK Garber, WS Watkins, MJ Bamshad, LB Jorde, MA Batzer: Comprehensive analysis of Alu-associated diversity on the human sex chromosomes. Gene 317, 103-110 (2003)
doi:10.1016/S0378-1119(03)00662-0

52. EA Bennett, LE Coleman, C Tsui, WS Pittard, SE Devine: Natural genetic variation caused by transposable elements in humans. Genetics 168, 933-951 (2004)
doi:10.1534/genetics.104.031757
PMID:15514065    PMCid:1448813

53. S Levy, G Sutton, PC Ng, L Feuk, AL Halpern, BP Walenz, N Axelrod, J Huang, EF Kirkness, G Denisov, Y Lin, JR MacDonald, AW Pang, M Shago, TB Stockwell, A Tsiamouri, V Bafna, V Bansal, SA Kravitz, DA Busam, KY Beeson, TC McIntosh, KA Remington, JF Abril, J Gill, J Borman, YH Rogers, ME Frazier, SW Scherer, RL Strausberg, JC Venter: The diploid genome sequence of an individual human. PLoS Biol 5, e254 (2007)
doi:10.1371/journal.pbio.0050254
PMID:17803354    PMCid:1964779

54. DJ Witherspoon, J Xing, Y Zhang, WS Watkins, MA Batzer, LB Jorde: Mobile element scanning (ME-Scan) by targeted high-throughput sequencing. BMC Genomics 11, 410 (2010)
doi:10.1186/1471-2164-11-410
PMID:20591181    PMCid:2996938

55. AD Ewing, HH Kazazian Jr: High-throughput sequencing reveals extensive variation in human-specific L1 content in individual human genomes. Genome Res 20, 1262-1270 (2010)
doi:10.1101/gr.106419.110
PMID:20488934    PMCid:2928504

56. J Kaiser: DNA sequencing. A plan to capture human diversity in 1000 genomes. Science 319, 395 (2008)
doi:10.1126/science.319.5862.395
PMID:18218868

57. N Siva: 1000 Genomes project. Nat Biotechnol 26, 256 (2008)
doi:10.1038/nbt0308-256b
PMID:18327223

58. AD Ewing, HH Kazazian Jr: Whole-genome resequencing allows detection of many rare LINE-1 insertion alleles in humans. Genome Res (2010)

59. 1000 Genomes Project Consortium, RM Durbin, GR Abecasis, DL Altshuler, A Auton, LD Brooks, RM Durbin, RA Gibbs, ME Hurles, GA McVean: A map of human genome variation from population-scale sequencing. Nature 467, 1061-1073 (2010)
PMID:20981092    PMCid:3042601

60. F Hormozdiari, C Alkan, M Ventura, I Hajirasouliha, M Malig, F Hach, D Yorukoglu, P Dao, M Bakhshi, SC Sahinalp, EE Eichler: Alu repeat discovery and characterization within human genomes. Genome Res (2010)

61. CR Huang, AM Schneider, Y Lu, T Niranjan, P Shen, MA Robinson, JP Steranka, D Valle, CI Civin, T Wang, SJ Wheelan, H Ji, JD Boeke, KH Burns: Mobile interspersed repeats are major structural variants in the human genome. Cell 141, 1171-1182 (2010)
doi:10.1016/j.cell.2010.05.026
PMID:20602999    PMCid:2943426

62. SC Schuster, W Miller, A Ratan, LP Tomsho, B Giardine, LR Kasson, RS Harris, DC Petersen, F Zhao, J Qi, C Alkan, JM Kidd, Y Sun, DI Drautz, P Bouffard, DM Muzny, JG Reid, LV Nazareth, Q Wang, R Burhans, C Riemer, NE Wittekindt, P Moorjani, EA Tindall, CG Danko, WS Teo, AM Buboltz, Z Zhang, Q Ma, A Oosthuysen, AW Steenkamp, H Oostuisen, P Venter, J Gajewski, Y Zhang, BF Pugh, KD Makova, A Nekrutenko, ER Mardis, N Patterson, TH Pringle, F Chiaromonte, JC Mullikin, EE Eichler, RC Hardison, RA Gibbs, TT Harkins, VM Hayes: Complete Khoisan and Bantu genomes from southern Africa. Nature 463, 943-947 (2010)
doi:10.1038/nature08795
PMID:20164927

63. JP Noonan: Neanderthal genomics and the evolution of modern humans. Genome Res 20, 547-553 (2010)
doi:10.1101/gr.076000.108
PMID:20439435    PMCid:2860157

64. JP Noonan, G Coop, S Kudaravalli, D Smith, J Krause, J Alessi, F Chen, D Platt, S Paabo, JK Pritchard, EM Rubin: Sequencing and analysis of Neanderthal genomic DNA. Science 314, 1113-1118 (2006)
doi:10.1126/science.1131412
PMID:17110569    PMCid:2583069

65. DJ Witherspoon, EE Marchani, WS Watkins, CT Ostler, SP Wooding, BA Anders, JD Fowlkes, S Boissinot, AV Furano, DA Ray, AR Rogers, MA Batzer, LB Jorde: Human population genetic structure and diversity inferred from polymorphic L1(LINE-1) and Alu insertions. Hum Hered 62, 30-46 (2006)
doi:10.1159/000095851
PMID:17003565

66. NT Perna, MA Batzer, PL Deininger, M Stoneking: Alu insertion polymorphism: a new type of marker for human population studies. Hum Biol 64, 641-648 (1992)
PMID:1328024

67. JT den Dunnen, SE Antonarakis: Mutation nomenclature extensions and suggestions to describe complex mutations: a discussion. Hum Mutat 15, 7-12 (2000)
doi:10.1002/(SICI)1098-1004(200001)15:1<7::AID-HUMU4>3.0.CO;2-N

68. JT den Dunnen, SE Antonarakis: Nomenclature for the description of human sequence variations. Hum Genet 109, 121-124 (2001)
doi:10.1007/s004390100505

69. OK Pickeral, W Makalowski, MS Boguski, JD Boeke: Frequent human genomic DNA transduction driven by LINE-1 retrotransposition. Genome Res 10, 411-415 (2000)
doi:10.1101/gr.10.4.411
PMID:10779482    PMCid:310862

70. F Charbonnier, S Baert-Desurmont, P Liang, F Di Fiore, C Martin, S Frerot, S Olschwang, Q Wang, MP Buisine, B Gilbert, M Nilbert, A Lindblom, T Frebourg: The 5' region of the MSH2 gene involved in hereditary non-polyposis colorectal cancer contains a high density of recombinogenic sequences. Hum Mutat 26, 255-261 (2005)
doi:10.1002/humu.20216
PMID:16086322

71. JF Hughes, JM Coffin: Human endogenous retrovirus K solo-LTR formation and insertional polymorphisms: implications for human and viral evolution. Proc Natl Acad Sci U S A 101, 1668-1672 (2004)
doi:10.1073/pnas.0307885100
PMID:14757818    PMCid:341815

72. AR Jha, SK Pillai, VA York, ER Sharp, EC Storm, DJ Wachter, JN Martin, SG Deeks, MG Rosenberg, DF Nixon, KE Garrison: Cross-sectional dating of novel haplotypes of HERV-K 113 and HERV-K 115 indicate these proviruses originated in Africa before Homo sapiens. Mol Biol Evol 26, 2617-2626 (2009)
doi:10.1093/molbev/msp180
PMID:19666991    PMCid:2760466

73. SS Arcot, Z Wang, JL Weber, PL Deininger, MA Batzer: Alu repeats: a source for the genesis of primate microsatellites. Genomics 29, 136-144 (1995)
doi:10.1006/geno.1995.1224
PMID:8530063

74. AJ Iafrate, L Feuk, MN Rivera, ML Listewnik, PK Donahoe, Y Qi, SW Scherer, C Lee: Detection of large-scale variation in the human genome. Nat Genet 36, 949-951 (2004)
doi:10.1038/ng1416
PMID:15286789

75. K Akagi, RM Stephens, J Li, E Evdokimov, MR Kuehn, N Volfovsky, DE Symer: MouseIndelDB: a database integrating genomic indel polymorphisms that distinguish mouse strains. Nucleic Acids Res 38, D600-6 (2010)
doi:10.1093/nar/gkp1046
PMID:19933259    PMCid:2808983

76. JS Mattick, RJ Taft, GJ Faulkner: A global view of genomic information--moving beyond the gene and the master regulator. Trends Genet 26, 21-28 (2010)
doi:10.1016/j.tig.2009.11.002
PMID:19944475

77. SH Rangwala, L Zhang, HH Kazazian Jr: Many LINE1 elements contribute to the transcriptome of human somatic cells. Genome Biol 10, R100 (2009)
doi:10.1186/gb-2009-10-9-r100
PMID:19772661    PMCid:2768975

78. R Khaja, J Zhang, JR MacDonald, Y He, AM Joseph-George, J Wei, MA Rafiq, C Qian, M Shago, L Pantano, H Aburatani, K Jones, R Redon, M Hurles, L Armengol, X Estivill, RJ Mural, C Lee, SW Scherer, L Feuk: Genome assembly comparison identifies structural variants in the human genome. Nat Genet 38, 1413-1418 (2006)
doi:10.1038/ng1921
PMID:17115057    PMCid:2674632

79. JM Chen, C Ferec, DN Cooper: Mechanism of Alu integration into the human genome. Genomic Med 1, 9-17 (2007)
doi:10.1007/s11568-007-9002-9
PMID:18923924    PMCid:2276890

80. JT Simpson, K Wong, SD Jackman, JE Schein, SJ Jones, I Birol: ABySS: a parallel assembler for short read sequence data. Genome Res 19, 1117-1123 (2009)
doi:10.1101/gr.089532.108
PMID:19251739    PMCid:2694472

81. D Karolchik, G Bejerano, AS Hinrichs, RM Kuhn, W Miller, KR Rosenbloom, AS Zweig, D Haussler, WJ Kent: Comparative genomic analysis using the UCSC genome browser. Methods Mol Biol 395, 17-34 (2007)
doi:10.1007/978-1-59745-514-5_2

82. WJ Kent, CW Sugnet, TS Furey, KM Roskin, TH Pringle, AM Zahler, D Haussler: The human genome browser at UCSC. Genome Res 12, 996-1006 (2002)
PMID:12045153    PMCid:186604

83. D Karolchik, AS Hinrichs, TS Furey, KM Roskin, CW Sugnet, D Haussler, WJ Kent: The UCSC Table Browser data retrieval tool. Nucleic Acids Res 32, D493-6 (2004)
doi:10.1093/nar/gkh103
PMID:14681465    PMCid:308837

Abbreviations: RE: retrotransposon element; RIPs: retrotransposon insertion polymorphisms; dbRIP: database of retrotransposon insertion polymorphisms; TSD: target site duplication

Key Words: Retrotransposon, Mobile elements, DNA transposition, Database, Computational comparative genomics, Polymorphism, dbRIP, Human, Review

Send correspondence to: Ping Liang, Department of Biological Sciences, Brock University, 500 Glenridge Avenue, St. Catharines, Ontario, Canada L2S 3A1, Tel: 905-688-5550 Ext 5922, Fax: 905-688-1855, E-mail: pliang@brocku.ca