Innovations in big data

Annual Report 2015, Strategic initiatives

A young researcher gathers crop data using a tablet. Photo by O. Adebayo, IITA

Modern advancements in Internet data, computer hardware and software, global standards, online collaboration platforms, and mobile technologies now allow the easy integration of data management processes that were previously kept strictly separate. In agricultural research, these advancements have revolutionized how information is captured, pre-processed, analyzed, and accessed by multiple users in different locations in near real time, with much of the work now automated. This has accelerated related processes and improved the management, quality, and compatibility of collected data. At IITA, 2015 saw significant strides in the modernization of its data capture and management systems.

Automating field data collection

Crop breeding experiments are data-intensive, with a typical breeding program producing hundreds of thousands of datasets in any given year. Inefficient and poor handling of these datasets can significantly hamper the activities of a breeding program and set back its targeted outputs. IITA researchers have traditionally used a “pen-and-paper” approach to data collection and transcription, which is time-consuming and error-prone.

The Cassava Breeding Program of IITA developed an innovative method to securely capture cassava field data by using electronic field book applications on tablets, which capture data in milliseconds. A barcode reader in these tablets reads generated barcode labels, which are used, for example, for accurate and efficient plot identification. The tablets are then connected to a multifunction platform called Cassavabase (https://cassavabase.org), which makes the collected data readily available for downstream analysis, in compliance with the institute’s Open Access policy.
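The idea of tying each observation to a barcode-scanned plot label can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the field book application or Cassavabase itself: the `FieldBook` and `Observation` names, the plot ID format, and the trait name are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Observation:
    plot_id: str      # value decoded from the plot's barcode label
    trait: str        # hypothetical trait name, e.g. "plant_height_cm"
    value: float
    recorded_at: str  # ISO 8601 timestamp in UTC


class FieldBook:
    """Hypothetical in-memory field book for a single trial."""

    def __init__(self, known_plot_ids):
        # Plot IDs for which barcode labels were generated before the trial.
        self.known_plot_ids = set(known_plot_ids)
        self.observations = []

    def record(self, scanned_barcode, trait, value):
        # Scanners often append a newline; strip it before the lookup.
        plot_id = scanned_barcode.strip()
        if plot_id not in self.known_plot_ids:
            # Rejecting unknown labels is what prevents mislabelled plots.
            raise ValueError(f"Unknown plot barcode: {plot_id!r}")
        obs = Observation(
            plot_id=plot_id,
            trait=trait,
            value=float(value),
            recorded_at=datetime.now(timezone.utc).isoformat(),
        )
        self.observations.append(obs)
        return obs


# Illustrative plot IDs; a real trial would load these from its design file.
book = FieldBook(["TRIAL1-PLOT-001", "TRIAL1-PLOT-002"])
first = book.record("TRIAL1-PLOT-001\n", "plant_height_cm", 152)
```

Because the plot ID comes from the scanned label rather than from typing, a transcription error shows up as a rejected scan instead of as a silently misfiled value.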

The program uses Cassavabase as its primary data management tool for uploading both phenotyping and genotyping data. These data are useful for implementing genomic selection, which is expected to estimate breeding values and genetic gain for quantitative traits more accurately than traditional breeding methods. Currently, Cassavabase holds over 1,500 phenotyping trials with ~8 million phenotypic observations and ~2 billion genotypic data points, and has more than 400 registered users.

The Cassava Breeding Program has successfully implemented tablet-based data collection in almost all its test environments. About 100 tablets are presently in use, with efforts geared towards extending implementation to handier smartphones in 2016. The program has also initiated several training workshops on tablet-based data collection for its field technicians. These workshops have since been extended to other crop breeding programs.

An integrated breeding management system

Modern breeding programs need to integrate diverse data types and exchange information with partners globally. IITA has developed and implemented the Breeding Management System (BMS), a comprehensive and easy-to-use software suite designed to help breeders conduct their routine activities more efficiently. Developed by the Integrated Breeding Platform (IBP) based in IITA-Nairobi in Kenya, the BMS provides interconnected tools for breeding program management, data analysis, and decision support. It also provides a database that works seamlessly to manage pedigree information, phenotypic and molecular characterization, and germplasm evaluation.

Bioinformatics and big data

High-throughput sequencing is an emerging technology that allows for fast and inexpensive sequencing of a whole genome, which makes the process affordable to many researchers and leads to the production of large amounts of data. However, this technology demands high computer processing power to efficiently store and analyze large data sets. IITA has been using these sequencing data for more than a year for gene discovery and genotyping to accelerate breeding cycles.

The Bioinformatics Unit of IITA, based in Ibadan, offers high-throughput sequencing as well as storage and processing of big data. Currently, the unit holds more than 4 TB of compressed sequencing data from different crops. To put this volume in perspective, if just the text of these sequencing data were printed, the printout would cover about 300 km end to end. For large-scale data processing, the Bioinformatics Unit is equipped with upgraded computing power consisting of 64 cores and a combined 900 gigabytes of RAM. The current setup can store 30 TB of data and process 2 TB of compressed data in a single analysis run. This allows IITA to handle large-scale genotyping, gene expression, and whole-genome sequencing data for advanced research in plant genomics. This capacity enables IITA researchers to correlate traits, including complex traits, with markers more precisely, which in turn contributes to faster and more efficient crop breeding.

In the pipeline

Following up on the success of Cassavabase, IITA is currently developing sister platforms: Musabase and Yambase. IITA is also an active contributor to the development of the CGIAR Consortium’s “Big Data Platform Project”. The envisaged data pool to be generated from this multi-CGIAR center platform could be used, for example, to directly feed agronomic information and advice to farmers through electronic or mobile technology-based means.
