Web accounts

Gene Updater: a web-based tool that automatically corrects and updates misidentified gene names in Excel

When gene expression datasets are opened with Excel under default settings (Microsoft Corp., Redmond, WA), a recurring issue with converting gene names to dates occurs. Similarly, if gene names are copied from another application (eg, word processors) and pasted into an Excel spreadsheet without specifying cell formatting, converting gene names to dates may occur1. Although Excel is popular and widely used in data analysis, these automatic conversions can affect channel enrichment analysis because many channel enrichment tools such as Enrichr2Gene Set Enrichment Analysis (GSEA)3.4 and analysis of the path of ingenuity5 rely on gene symbols to query pathway databases such as Gene Ontology6.7 and reactome8. Since dates are not recognized by these lane databases, this can lead to gaps in the lane enrichment analysis. For example, septins (eg SEPT1), which are involved in cell division, are internally converted to SEP-01 in Excel, which cannot be recognized by other databases. This problem has become so widespread that approximately one-fifth of published papers with additional Excel gene lists contain erroneous gene name conversions.9.10. As many of these datasets are frequently consulted by other data scientists, such errors can be carried over to other scientific publications, causing further distortion in downstream data analysis.

To solve this problem, the HUGO Gene Nomenclature Committee (HGNC) announced in 2017 the update of gene names that can be unintentionally converted into dates in Excel files.11. This move was welcomed by researchers and data scientists, as the move to updated gene names would allow sharing of gene expression data without worrying about automatically converting gene symbols to dates in Excel. . However, at present, most published gene expression data are not updated with newly approved gene names, especially in microarray datasets. We have therefore developed a web-based Gene Updater tool that allows researchers to convert previous gene names to newly approved gene names recommended by HGNC. Additionally, if gene names are accidentally converted to dates by Excel, the web tool allows researchers to rectify those terms to the correct gene names. We believe these efforts will facilitate the sharing of gene expression data between researchers who may be working on different analysis platforms.

Code availability and problem reporting

The Gene Updater web tool is publicly available at: https://share.streamlit.io/kuanrongchan/date-to-gene-converter/main/date_gene_tool.py. The code was written with the Python programming language (https://www.python.org/) and the web tool is made with Streamlit (https://www.streamlit.io). To run the application locally, several freely available packages are required: pandas, numpy, regex, inflect, dateparser, streamlit, streamlit-tags, openpyxl, xlrd and XlsxWriter. Recommended versions are: pandas >= 1.2.5, numpy >= 1.19.5, regex >= 2021.8.3, inflect >= 5.3.0, dateparser >= 1.1.0, streamlit >= 1.8.1, streamlit -tags >= 1.2.8, openpyxl >= 3.0.9, xlrd >= 2.0.1, XlsxWriter >= 3.0.2. The MIT license version 0.1.0 is also added to the source package on GitHub as an open source license.

Updated codes and new releases will be available on GitHub, including step-by-step protocol information on running the app locally: https://github.com/kuanrongchan/date-to-gene-converter( Zenodo; https://doi.org/10.5281/zenodo.6845701). This page may also be used to communicate issues, queries, or request features. Alternatively, users can contact the developers via email.

Introducing Gene Updater

Users can directly upload data from excel spreadsheets or csv files containing gene names into Gene Updater (https://share.streamlit.io/kuanrongchan/date-to-gene-converter/main/date_gene_tool .py). Old gene names will be automatically updated with new gene names with Gene Updater. If the genes have been converted to dates by Excel, these dates will also automatically be converted to updated gene names, with the exception of Mar-01 and Mar-02 as these dates can potentially map to more than one gene (Fig. 1). The conversion of Mar-01 to MTARC1 or MARCHF1, as well as Mar-02 to MTARC2 or MARCHF2 can be assigned by the user in the Gene Updater web tool. The output is an Excel data file containing updated HUGO gene names that can be downloaded for further downstream analysis.

Figure 1

Diagram of Gene Updater. If old gene names are provided, these genes will automatically be converted to updated approved gene names. If dates are provided, all genes except MAR-01 and MAR-02 will be converted to the new approved gene names. For MAR-01 and MAR-02, users can assign the genes to MTARC1, MARCHF1, MTARC2, or MARCHF2 in Gene Updater.

Data input

The user interface starts with a file uploader which allows users to upload their .csv or .xlsx file(s). Multiple files can also be uploaded, as long as the first column contains gene names. If the genetic file contains Mar-01 and Mar-02, we encourage you to have an identifier (e.g. gene description) on the second column so that the identities of MARCH1/MARC1 and MARCH2/MARC2 can be easily resolved with the Gene Updater. Users do not have to delete other columns of data in their Excel or csv files to use the web tool. The checkbox in the sidebar allows users to verify that the correct data file is uploaded. To demonstrate the functionality of the application, a demo dataset containing gene terms converted to Excel, gene descriptions and numeric values ​​is pre-loaded if no data file is uploaded to Gene Updater.

Identity and characteristics of genes modified by Gene Updater

The human gene names that are converted to dates in Excel, along with their updated approved gene names and descriptions, are detailed in Table 1. With the exception of Mar-01 and Mar-02, all other modified genes by Excel can be mapped to a single HUGO Gene (Table 1). To examine the impact of omission of these genes due to Excel conversions on pathway analysis, we examined the biological processes modulated by these genes. Pathway Enrichment Analysis Against the Gene Ontology (GO) Biological Processes Database6.7 pointed out that these genes play critical roles in cell division, exocytosis, cilium assembly, ubiquitination, and nitric oxide biosynthesis (Fig. 2). Specifically, SEPTIN1-14 is encoded for pathways related to cytoskeleton-dependent cytokinesis, some of which are also involved in other specialized functions such as exocytosis, secretion, and cilia assembly. On the other hand, MARCH1-8 are involved in protein ubiquitination while MTARC1 and MTARC2 are involved in nitric oxide biosynthesis and metabolic processes (Fig. 2). Overall, our results highlight gaps in pathway enrichment analyzes if these gene names are automatically converted to dates in Excel.

Table 1 Most frequently converted human gene names to dates in Excel. The respective updated gene name and gene description are also provided.
Figure 2
Figure 2

Top 10 enriched pathways based on frequently converted genes to dates in Excel. Genes were analyzed against the GO Biological Processes database with Enrichr, and all pathways presented were adjusted p-values ​​

Conversion of dates associated with several genes

To resolve gene duplicates related to Mar-01, Gene Updater converts the first instance of Mar-01 to Mar-01_1st and the second instance to Mar-01_2nd. Using the drop-down widget, users can assign the Mar-01_1st and Mar-01_2nd based on the gene description or any other unique identifier (Fig. 3). The same procedure is then repeated for duplicates linked to Mar-02. After converting these dates, the output data frame or file should have gene names updated with the new HUGO gene names.

picture 3
picture 3

Gene Updater Dashboard which is used to resolve duplicate gene symbols. Red arrows indicate drop-down widgets that allow users to assign the correct gene name for Mar-01 and Mar-02.

Production

The converted data frame or file can be inspected in the Gene Updater web tool. Users can use the multi-query search bar to verify that gene names are updated successfully. The output file with updated gene names can then be exported as an Excel file.

Gene Updater performance in automatically correcting misidentified Excel gene names

To assess the usefulness of Gene Updater in resolving misidentified gene names in Excel, we relied on Mark Ziemann’s dashboard (http://ziemann-lab.net/public/gene_name_errors/Report_2022-05 .html#journals-affected) to extract text and Excel files from various journals, including BMC Genomics, Nature, Genome Biology, Nucleic Acids Research, Human Molecular Genetics, BMC Bioinformatics, Nature Communications, PLoS One, Genome Research, Genes Development and RNA that were published in June 2022 (Table 2). A total of 356 text and Excel files were found, of which 81 of these files contain gene terms or gene symbols. Notably, 28 (34.6%) of the files with gene symbols contained date-related errors, highlighting the importance of having a tool to correct these misidentified date terms (Table 2). Gene Updater was able to automatically fix the majority of these files (78.6%), except for files containing the terms Mar-01 and Mar-02, but did not provide a unique identifier for accompaniment such as gene description or gene information (Table 2, rightmost column). These results underscore the importance and usefulness of Gene Updater in rectifying misidentified gene terms and highlight the need to include a gene description column to better resolve gene terms MARC1, MARCH1, MARC2, and MARCH2.

Table 2 Journals publishing text or Excel files with date-related errors in June 2022.

Comparison with other existing web tools

Currently, the two tools that can potentially convert dates to gene names are Oct4th (https://oct4th.sandbox.bio) and Truke (http://maplab.imppc.org/truke/). However, Oct4th only works on genetic data files that have not been manipulated and processed in Excel. Additionally, the tool is currently unable to convert updated gene names, which are more resistant to automatic conversion. Truke can potentially convert date formats to gene names, but can only convert tagged dates to dd/mm/yy format and process files one at a time. On the other hand, our Gene Updater Streamlit web tool can process multiple .csv and .xlsx files, and takes into account the different types of date formatting that are converted by Excel, thus allowing faster and more efficient processing of gene dates by compared to other existing web tools.