Search results for Computational Biology and Bioinformatics

Contents
Tim J. Stevens, MRC Laboratory of Molecular Biology, Cambridge, Wayne Boucher, University of Cambridge
Book:

Python Programming for Biology

Published online:

05 February 2015

Print publication:

12 February 2015, pp v-viii
- Chapter

20 - Databases
Tim J. Stevens, MRC Laboratory of Molecular Biology, Cambridge, Wayne Boucher, University of Cambridge
Book:

Python Programming for Biology

Published online:

05 February 2015

Print publication:

12 February 2015, pp 401-420
- Chapter
Summary

A brief introduction to relational databases
Any collection of data can be considered to be a database, however it is stored or utilised. However, the common use of the word ‘database’ usually refers to a relational database. Relational databases were introduced in the 1970s and model their data in terms of tables with rows and columns. There is an associated language, SQL (Structured Query Language), that can be used to send messages to the database to allow the database to be queried and modified: inserting, changing and deleting data elements. SQL also provides the ability to make connections between the data in different tables. In terms of mathematics, relational databases can be thought of as ‘first-order predicate logic’, and this mathematical underpinning of the principles of relational databases is one reason they are conceptually attractive.
Tables
A table in a relational database has a name and also has some named columns and each row in the table represents one record of data. The type of the data in each column can be specified and the data in any column can be stated to be mandatory, or not. One or more columns in each table define the key. Each record in the table must have a unique key; the key identifies the individual record. Sometimes a table has a ‘natural’ key but sometimes there is no obvious key and so instead a counter (a ‘serial’ or ‘ID’ number) is used, which is set (in many database implementations automatically) to 1 for the first record, 2 for the second, etc. At a simplistic level, spreadsheets (e.g. as used in Excel) can be thought of as tables in a weak substitute for a relational database. A table can have one or more columns that refer to one or more other tables, and this is a way that information between tables can be linked. In the database jargon, a query that involves relating information across more than one table is called a join.
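The ideas of typed columns, mandatory values, serial keys and joins can be sketched with Python's built-in sqlite3 module. This is only a minimal illustration: the table names (organism, gene), the column names and the inserted records are assumptions made for this example and do not come from the chapter.

```python
import sqlite3

# In-memory database; all table and column names are illustrative only.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Each table has a serial integer key; NOT NULL marks a mandatory column.
cur.execute("""
    CREATE TABLE organism (
        id INTEGER PRIMARY KEY,
        name TEXT NOT NULL
    )""")
cur.execute("""
    CREATE TABLE gene (
        id INTEGER PRIMARY KEY,
        symbol TEXT NOT NULL,
        organism_id INTEGER NOT NULL REFERENCES organism(id)
    )""")

# The key is assigned automatically: 1 for the first record, 2 for the second.
cur.execute("INSERT INTO organism (name) VALUES (?)", ("Homo sapiens",))
human_id = cur.lastrowid
cur.execute("INSERT INTO organism (name) VALUES (?)", ("Mus musculus",))
mouse_id = cur.lastrowid

cur.executemany(
    "INSERT INTO gene (symbol, organism_id) VALUES (?, ?)",
    [("TP53", human_id), ("BRCA1", human_id), ("Trp53", mouse_id)],
)
conn.commit()

# A join relates rows of the two tables through the organism_id column.
cur.execute("""
    SELECT gene.symbol, organism.name
    FROM gene JOIN organism ON gene.organism_id = organism.id
    ORDER BY gene.id""")
rows = cur.fetchall()
for symbol, name in rows:
    print(symbol, name)
```

The organism_id column in the gene table is what links the two tables; the join query follows that link to report each gene alongside the name of its organism.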

24 - Machine learning
Tim J. Stevens, MRC Laboratory of Molecular Biology, Cambridge, Wayne Boucher, University of Cambridge
Book:

Python Programming for Biology

Published online:

05 February 2015

Print publication:

12 February 2015, pp 511-544
- Chapter
Summary

A guide to machine learning
When using computers to solve scientific problems there can be situations where you have some measured data and a related property of the data, but there is no known or fixed formula to link the two. Sometimes the link between the two sets of data may be easy for a human to see, but otherwise difficult to encode in a computer algorithm. A simple example of this would be in the reading of handwriting; humans do not write in a fixed typeface, every letter of a given kind will be written slightly differently, and yet we can read most other people’s handwriting without much effort. When we look at writing we attempt to recognise the letters and words, and where there is ambiguity we can use our intelligence to infer what was intended by using the context of what the writing means, or any other clues that we can glean. Writing a computer program to read handwriting is difficult, and not nearly as reliable as a person would be. Nevertheless it can be done, and is put to good use in the mechanised sorting of mail by postal (zip) code. The common trick to getting a computer to perform tasks like this is not to program it with a designed and elaborate rule, but rather to bestow a computer program with a degree of artificial intelligence so that it can come up with its own rules and learn. The exercise whereby a program comes up with its own rules to solve a problem is often referred to as machine learning. It should be noted, however, that we usually don’t expect a computer to learn a task perfectly; if perfection were possible we generally wouldn’t have to resort to such means. Instead it is best to think of machine learning algorithms as making predictions, and as such the predictive power should be tested before we rely upon it. There are two kinds of machine learning which are commonly discussed, supervised learning and unsupervised learning, and we will give examples of both in this chapter.
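As a small illustration of supervised learning and of testing predictive power on data kept separate from training, the sketch below uses a nearest-neighbour rule. The choice of method, the function name nearest_neighbour_label and the toy data points are all assumptions made for this example; the summary does not say which algorithms the chapter uses.

```python
import math


def nearest_neighbour_label(query, training_data):
    """Return the label of the training point closest to the query point."""
    best_label = None
    best_dist = math.inf
    for features, label in training_data:
        dist = math.dist(query, features)  # Euclidean distance
        if dist < best_dist:
            best_dist = dist
            best_label = label
    return best_label


# Toy training data: (measured features, known property). Values are made up.
training = [
    ((1.0, 1.2), "A"),
    ((0.8, 0.9), "A"),
    ((3.1, 3.0), "B"),
    ((2.9, 3.3), "B"),
]

# Separate test data, never used for training, to estimate predictive power.
test_set = [((1.1, 1.0), "A"), ((3.0, 2.8), "B")]

correct = sum(nearest_neighbour_label(x, training) == y for x, y in test_set)
accuracy = correct / len(test_set)
print(accuracy)
```

No rule linking features to labels is written by hand here; the prediction comes entirely from the labelled examples, and the held-out test set gives a rough measure of how far the predictions can be trusted.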

12 - Pairwise sequence alignments
Tim J. Stevens, MRC Laboratory of Molecular Biology, Cambridge, Wayne Boucher, University of Cambridge
Book:

Python Programming for Biology

Published online:

05 February 2015

Print publication:

12 February 2015, pp 208-231
- Chapter
Summary

Sequence alignment
The alignment of biological sequences is probably the most widely used operation in bioinformatics. In essence sequences are aligned so that we can determine how similar they are, and from this all sorts of useful information can come, such as whether two sequences are related by evolution (they have a common ancestor) or whether they have a similar biological function. The process of comparison is called alignment because the trickiest part of the process is to say which bits of two sequences are equivalent to one another; how residues of the different sequences can be paired up. Usually when we align sequences we seek to determine the best alignment out of the vast number of possible comparisons by finding the combination of residue pairs, one from each sequence, which gives the highest overall score for similarity.
Once a sequence alignment has been achieved, and assuming you trust the results, you can treat the aligned regions as having a degree of equivalency. If the alignment is good enough you might be able to say, for example, that two DNA sequences relate to the same kind of gene, despite the nucleotides not being exactly the same. It should always be remembered, however, that a sequence alignment can only give a limited amount of information about the underlying biology, but it is often an excellent starting point. Even where the knowledge gained is distinctly incomplete, a sequence alignment is quick to perform and often helpful to guide experiments. You might significantly narrow down the number of possibilities of what a section of DNA or protein could be, or say what it definitely is not, with one simple database search, i.e. doing alignments against a database of well-studied sequences. Sequence alignments are also done in a laboratory setting to guide procedures, for example to determine which part of a protein to investigate.
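One common way to find the highest-scoring pairing of residues is dynamic programming in the style of the Needleman-Wunsch global alignment method. The sketch below computes only the best score, not the alignment itself. The scoring values (match, mismatch, gap) and the function name are assumptions chosen for illustration, and the summary does not state that this is the method the chapter implements.

```python
def global_alignment_score(seq_a, seq_b, match=1, mismatch=-1, gap=-2):
    """Best global alignment score of two sequences (illustrative scoring)."""
    n, m = len(seq_a), len(seq_b)

    # score[i][j] is the best score for aligning seq_a[:i] with seq_b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]

    # Aligning a prefix against nothing costs one gap per residue.
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            pair = match if seq_a[i - 1] == seq_b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + pair,  # pair the two residues
                score[i - 1][j] + gap,       # gap in seq_b
                score[i][j - 1] + gap,       # gap in seq_a
            )
    return score[n][m]


print(global_alignment_score("ACGT", "ACGT"))
print(global_alignment_score("ACGT", "AGT"))
```

Real comparisons of proteins would normally replace the simple match and mismatch values with a residue substitution matrix, but the table-filling logic stays the same.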

Acknowledgements
Tim J. Stevens, MRC Laboratory of Molecular Biology, Cambridge, Wayne Boucher, University of Cambridge
Book:

Python Programming for Biology

Published online:

05 February 2015

Print publication:

12 February 2015, pp x-x
- Chapter

Frontmatter
Tim J. Stevens, MRC Laboratory of Molecular Biology, Cambridge, Wayne Boucher, University of Cambridge
Book:

Python Programming for Biology

Published online:

05 February 2015

Print publication:

12 February 2015, pp i-iv
- Chapter

6 - Files
Tim J. Stevens, MRC Laboratory of Molecular Biology, Cambridge, Wayne Boucher, University of Cambridge
Book:

Python Programming for Biology

Published online:

05 February 2015

Print publication:

12 February 2015, pp 78-99
- Chapter
Summary

Computer files
A computer file is a means by which data is stored on a permanent basis, or at least until it is deleted. It is held in a place such as a hard disk drive or removable storage device that is separate from the active, temporary memory of a computer. While the active memory may hold the current program and an amount of data, files represent a larger archive of stored data and the general idea is that this should survive when the computer is switched off. Parts of this saved data may be copied into the active memory as required. Loading data from files (which may be stored locally or transmitted via a network) places data into the active memory so that it can be worked upon efficiently. This data might be the code for a computer program which can then be executed to do a job. Naturally we save program instruction code as a file so that it may be used as many times as desired, without having to rewrite anything.
This chapter will focus on data files that store information for programs to work with, rather than the program files themselves, given that we can trust the Python interpreter to handle the loading and running of Python code. We will show how data can be read into a program and written out from a program, e.g. to and from files stored on disk. Such data files come in a large variety of shapes, sizes and forms (unlike Python files, which conform to a single, precise standard). Information can be stored in an endless number of ways, sometimes at the whim of the programmer, but fortunately in the spirit of cooperation (including mutual financial interest) particular types of data are often stored in a standardised way, with a known specification.
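Writing data out from a program and reading it back in can be sketched as follows. The file name, the tab-separated layout and the example records are assumptions made for this illustration; a temporary directory is used so that nothing is left on disk afterwards.

```python
import os
import tempfile

# Example records: (name, sequence). The values are made up.
records = [("seq1", "ACGT"), ("seq2", "GGCA")]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "sequences.tsv")

    # Write one tab-separated record per line.
    with open(path, "w") as file_obj:
        for name, seq in records:
            file_obj.write(f"{name}\t{seq}\n")

    # Read the file back, line by line, into active memory.
    loaded = []
    with open(path) as file_obj:
        for line in file_obj:
            name, seq = line.rstrip("\n").split("\t")
            loaded.append((name, seq))

print(loaded)
```

The with statement closes each file automatically, even if an error occurs part-way through reading or writing.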

Appendix 3 - Standard module highlights
Tim J. Stevens, MRC Laboratory of Molecular Biology, Cambridge, Wayne Boucher, University of Cambridge
Book:

Python Programming for Biology

Published online:

05 February 2015

Print publication:

12 February 2015, pp 634-652
- Chapter

19 - Signal processing
Tim J. Stevens, MRC Laboratory of Molecular Biology, Cambridge, Wayne Boucher, University of Cambridge
Book:

Python Programming for Biology

Published online:

05 February 2015

Print publication:

12 February 2015, pp 382-400
- Chapter
Summary

Signals
In science many different kinds of experiment involve the recording of signals: series of measurements that represent the variation in some kind of underlying physical property. The signal can then be interpreted, based on some theoretical model of the experiment. Commonly the recorded signal is one that varies over time, such as sound or radio waves, but it could also represent a variation in space, or indeed along any other kind of axis. In general a signal is represented by values that are directly recorded by instruments at specific, usually regular, intervals; although in some situations derived data, like a DNA sequence, can also be thought of in terms of signals.
If a signal varies in a regular manner, i.e. oscillates, then it is often the frequencies that occur within the signal that are of interest, rather than the original signal itself. This is because the underlying frequencies are generally characteristic of what made the signal. To take a toy example, if we have a peal of bells, where each bell has a different tone, we can record the variation of the overall sound signal over time. Then, by looking at the component frequencies we can discern the tones of the individual bells that made the sound. As we will illustrate, it is possible to convert the time signal into a spectrum of its component frequencies using what is known as a Fourier transform.
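The bell example can be imitated with a short, deliberately naive discrete Fourier transform written with the standard library only. The two 'tones' (5 and 12 cycles per recording), the number of sample points and the function name dft are assumptions for this sketch; in practice a fast Fourier transform routine from a numerical library would normally be used instead.

```python
import cmath
import math


def dft(signal):
    """Naive discrete Fourier transform; O(n^2), for illustration only."""
    n = len(signal)
    return [
        sum(signal[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


n_points = 64
tones = (5, 12)  # cycles per recording; two hypothetical 'bells'

# The recorded signal is the sum of the two oscillations over time.
signal = [
    sum(math.sin(2 * math.pi * f * t / n_points) for f in tones)
    for t in range(n_points)
]

spectrum = dft(signal)

# Only the first half of the spectrum is distinct for a real-valued signal.
magnitudes = [abs(c) for c in spectrum[: n_points // 2]]

# The two strongest frequency components recover the original tones.
strongest = sorted(range(len(magnitudes)), key=lambda k: magnitudes[k], reverse=True)
peaks = sorted(strongest[:2])
print(peaks)
```

Although the two oscillations are mixed together in the time signal, they appear as two separate peaks in the frequency spectrum.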

11 - Biological sequences
Tim J. Stevens, MRC Laboratory of Molecular Biology, Cambridge, Wayne Boucher, University of Cambridge
Book:

Python Programming for Biology

Published online:

05 February 2015

Print publication:

12 February 2015, pp 181-207
- Chapter
Summary

Bio-molecules for non-biologists
This section is aimed at programmers who do not have much biological training, to explain a little about where biological sequences come from. Naturally we must omit a large amount of detail if we’re going to keep things short enough for this book. The emphasis will be about how information is stored, transferred and interpreted in biological systems to ultimately give the chemistry of life. We leave details of the current understanding of the precise mechanisms to your enthusiasm and further reading.
Life can be thought of as a set of controlled chemical reactions and interactions that build and maintain organisms. When there is no control over biochemistry the raw materials of an organism soon succumb to decay; complex biological molecules turn into much simpler, more stable forms. The specific set of chemical reactions and interactions that allow life to live and reproduce are mostly directed by protein molecules with occasional roles for RNA molecules.
Proteins
The different kinds of protein molecules that direct the processes needed for life are different because they are made up of different sequences of smaller entities, amino acids (see Figure 11.1). This sequence specifies their shape, physical properties, movement and chemical activity. There are 20 common amino acid types that are joined together into chains of varying length to make the various proteins. The amino acids are joined together into a linear sequence by chemical bonds that are referred to as peptide links, thus protein chains are frequently referred to as polypeptides. Most proteins adopt a particular three-dimensional arrangement as segments of the amino acid entities within the chain come together into one or more compact globules. The final shape of a protein is usually vitally important for its function, and this shape is governed by the combination and order of amino acids in the polypeptide chain. However, it should be noted that the relationship between protein sequence and structure is exceedingly complex, such that we cannot generally predict a protein’s structure directly from its sequence alone.

Appendices
Tim J. Stevens, MRC Laboratory of Molecular Biology, Cambridge, Wayne Boucher, University of Cambridge
Book:

Python Programming for Biology

Published online:

05 February 2015

Print publication:

12 February 2015, pp 606-606
- Chapter
Summary

These appendices contain simple explanations and definitions for a subset of the Python language, its standard libraries and a few of the key modules used in this book. The objective is not to give a complete description of every possible option, which is already documented (and will be more up-to-date) on the Internet. Rather, the aim is to cover all of the components used in this book as well as a few extra useful details in relatively plain English, to help with learning the language. Hence, it is deliberate that we have simplified or omitted certain details to avoid obfuscating the main points for novice programmers. While we describe most of the core components of standard Python, for some of the libraries we will only highlight some parts we have found particularly useful. In some cases, where we don’t describe individual components, we will describe what a library or a module is generally used for, in order to guide further investigation.
In addition to the material presented here, the website http://www.cambridge.org/pythonforbiology provides links to full, in-depth documentation for Python and the associated libraries that are used throughout this book.

Appendix 6 - Further statistics
Tim J. Stevens, MRC Laboratory of Molecular Biology, Cambridge, Wayne Boucher, University of Cambridge
Book:

Python Programming for Biology

Published online:

05 February 2015

Print publication:

12 February 2015, pp 668-670
- Chapter
Summary

RPy and the R statistical package
The R statistical package is one of the most commonly used ones for analysing statistical data. It has its own language. There is a Python wrapper around it called RPy. The main reason to use RPy would be if you have lots of existing R code that you wish to interface to in Python.
There are a few things to keep in mind when using RPy. Standard Python collection types or NumPy arrays have to be converted into special RPy data types, and results that are returned from R have to be suitably interpreted. Reading the R documentation is crucial to using RPy.
We will illustrate the use of R via RPy for a few standard examples.
Binomial test
First we consider the binomial test, which is concerned with the number of occurrences of an event that has a fixed probability of occurring, given a certain number of trials. R has a method, ‘binom.test’, to do the binomial test. We create a function, binomialTailTest(), which calls this method via RPy, and which has the same arguments as in our previous version of the function in Chapter 22, which used SciPy.
First we need to import the RPy module, rpy2.robjects, which we call R below. This has an object inside it, R.r, which is what we use to get hold of R methods, using dictionary syntax keyed on the name of the R method. Here we want to use the R method binom.test, and so R.r['binom.test'] is the Python version of this R method.
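For readers without R installed, the quantity behind a one-sided binomial test can be sketched with the standard library alone. This is not the RPy-based binomialTailTest() described above, whose exact arguments are not given in this summary; the function name binomial_tail_probability and its parameters are hypothetical. It computes the probability of observing at least a given number of events in a fixed number of trials.

```python
from math import comb


def binomial_tail_probability(n_success, n_trials, p_event):
    """P(X >= n_success) for X ~ Binomial(n_trials, p_event).

    Hypothetical helper for illustration; not the book's function.
    """
    return sum(
        comb(n_trials, k) * p_event**k * (1 - p_event) ** (n_trials - k)
        for k in range(n_success, n_trials + 1)
    )


# Chance of seeing 8 or more events in 10 trials when each has probability 0.5.
p_value = binomial_tail_probability(8, 10, 0.5)
print(p_value)
```

A small tail probability suggests that the observed count would be unusual if the assumed event probability were correct.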

17 - High-throughput sequence analyses
Tim J. Stevens, MRC Laboratory of Molecular Biology, Cambridge, Wayne Boucher, University of Cambridge
Book:

Python Programming for Biology

Published online:

05 February 2015

Print publication:

12 February 2015, pp 341-360
- Chapter
Summary

High-throughput sequencing
Given the decreasing cost required to determine the sequence of nucleic acids, sequencing is used in increasingly wider contexts. Rather than only determining the genome sequence of an organism, high-throughput techniques allow researchers to investigate much more, such as the variation within individuals of a population, the amount of expression of individual genes in a given sample (e.g. by detecting RNAs) and the sequences which are bound to particular protein components. A sequencing run on one of the latest-generation sequencing machines may generate many gigabases (>10^9 bp) of data and so much of the task for bioinformatics is to make sense of the raw sequence data: to put it into a genomic, biological context. For organisms with a known genomic sequence the primary task when processing high-throughput sequence data is to simply map relatively short bits of sequence called ‘reads’ that come from the sequencing machine to a reference genome. Only then can the detected sequences be understood. By mapping newly acquired sequences on to the known chromosomes the whole database of information that annotates the genome, such as the position of genes and regulatory sequences, indicates which DNA features were detected. In this chapter we will give an introduction to various basic computational procedures involving high-throughput sequence data which can be achieved, or at least handled, using Python. Because this is a vast and rapidly expanding subject we can only lightly touch on the core concepts here, though hopefully we have provided solid starting points for further development.
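The idea of mapping reads to a reference can be shown with a toy exact-match search. This is only a conceptual sketch: the function name map_reads and the short sequences are assumptions for illustration, and practical read mappers rely on indexed data structures and tolerate mismatches, which this example does not attempt.

```python
def map_reads(reference, reads):
    """Map each read to every exact-match start position (0-based)."""
    positions = {}
    for read in reads:
        hits = []
        start = reference.find(read)
        while start != -1:
            hits.append(start)
            # Continue searching just after the previous hit.
            start = reference.find(read, start + 1)
        positions[read] = hits
    return positions


# A made-up miniature 'genome' and three short reads.
reference = "ACGTACGTTAGC"
reads = ["ACGT", "TTAG", "GGGG"]

mapping = map_reads(reference, reads)
for read, hits in mapping.items():
    print(read, hits)
```

A read that occurs more than once in the reference maps ambiguously, and a read that does not occur at all is left unmapped; both situations also arise with real sequencing data.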

Appendix 1 - Simplified language reference
Tim J. Stevens, MRC Laboratory of Molecular Biology, Cambridge, Wayne Boucher, University of Cambridge
Book:

Python Programming for Biology

Published online:

05 February 2015

Print publication:

12 February 2015, pp 607-620
- Chapter

3 - Python basics
Tim J. Stevens, MRC Laboratory of Molecular Biology, Cambridge, Wayne Boucher, University of Cambridge
Book:

Python Programming for Biology

Published online:

05 February 2015

Print publication:

12 February 2015, pp 17-42
- Chapter
Summary

Introducing the fundamentals
Python is a powerful, general-purpose computing language. It can be used for large and complicated tasks or for small and simple ones. Naturally, to get people started with its use, we begin with relatively straightforward examples and then afterwards increase the complexity. Hence, in the next two chapters we cover most of the day-to-day fundamentals of the language. You will need to be, at least a little, familiar with these ideas to appreciate the subsequent chapters. Much of what we illustrate here is called scripting, although there is no hard and fast rule about what is deemed to be a program and what is ‘merely’ a script. We will use the terminology interchangeably.
Here we describe most of the common operations and the basic types of data, but some aspects will be left to dedicated chapters. Initially the focus will be on the core data types handled by Python, which basically means numbers and text. With numbers we will of course describe doing arithmetic operations and how this can be moved from the specific into the abstract using variables. All the other kinds of data in Python can also be worked with in a similarly abstract manner, although the operations that are used to manipulate non-numeric data won’t be mathematical. Moving on from simple numbers and text we will describe some of the other standard types of Python data. Most notable of these are the collection types that act as containers for other data, so, for example, we could have a list of words or a set of numbers, and the list or set is a named entity in itself; just another item that we can give a label to in our programs. Python also has the ability to let you describe your own types of data, by making an object specification called a class. However, this will be discussed in Chapter 7. We will end this chapter by introducing the idea of importing Python modules, which is a mechanism to allow a program to access extra functionality contained in separate files.
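A few of these fundamentals can be seen together in a short script. The variable names and values below are arbitrary choices for illustration.

```python
import math  # importing a module gives access to extra functionality

# Numbers: arithmetic moved from the specific into the abstract via variables.
width = 3
height = 4
diagonal = math.sqrt(width**2 + height**2)

# Text: manipulated with non-mathematical operations.
sequence = "atgc"
upper = sequence.upper()

# Collections: containers for other data, each a named entity in itself.
words = ["gene", "protein", "gene"]  # a list keeps order and duplicates
unique_words = set(words)            # a set keeps only distinct items

print(diagonal, upper, len(words), len(unique_words))
```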

22 - Statistics
Tim J. Stevens, MRC Laboratory of Molecular Biology, Cambridge, Wayne Boucher, University of Cambridge
Book:

Python Programming for Biology

Published online:

05 February 2015

Print publication:

12 February 2015, pp 454-485
- Chapter
Summary

Statistical analyses
In this chapter we look at the analysis and interpretation of collections of data in a mathematical way. In order to understand the basics of statistics we will assume some familiarity with the basics of probability, as discussed in Chapter 21.
Generally when we gather numerical measurements we don’t get identical results, rather we get a spread of values. The underlying reason for this variation could be a natural variation in what we are measuring, an error in the way we make the measurements or, as is almost always the case, a combination of both of these. Statistics helps us to make sense of variations in numerical data and commonly we are asking the question whether what we measure is statistically significant, according to some prior hypothesis. Depending on the result this naturally then drives further investigations, based on a belief of a hypothesis being true or untrue. Statistics is a vast subject, so in this chapter we can only cover a few of the more important aspects that we either refer to elsewhere in this book or that are otherwise commonly used in biology.
Samples and significance
One of the key principles, which underpins most statistical analyses, is the idea that the data we collect contains a limited number of samples from some kind of underlying probability distribution. This probability distribution can be thought of as the mechanism by which the data values are generated, but naturally the data arises due to some physical process and by ascribing a probability distribution we are merely forming a mathematical model, which is often significantly simplified, to approximate the data-generation process.
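The relationship between an underlying probability distribution and a limited number of samples drawn from it can be simulated. In this sketch the distribution is assumed to be normal with mean 10 and standard deviation 2; these parameters, the sample sizes and the random seed are arbitrary choices for illustration.

```python
import random
import statistics

random.seed(1)  # fixed seed so that the simulation is repeatable

# Parameters of the assumed underlying (normal) distribution.
true_mean, true_sd = 10.0, 2.0

# A small and a large collection of samples from the same distribution.
small = [random.gauss(true_mean, true_sd) for _ in range(10)]
large = [random.gauss(true_mean, true_sd) for _ in range(10000)]

small_mean = statistics.mean(small)
large_mean = statistics.mean(large)
large_sd = statistics.stdev(large)  # sample standard deviation

print(small_mean, large_mean, large_sd)
```

The sample mean and standard deviation only estimate the parameters of the underlying distribution, and the estimates generally become more reliable as the number of samples grows.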

23 - Clustering and discrimination
Tim J. Stevens, MRC Laboratory of Molecular Biology, Cambridge, Wayne Boucher, University of Cambridge
Book:

Python Programming for Biology

Published online:

05 February 2015

Print publication:

12 February 2015, pp 486-510
- Chapter
Summary

Separating and grouping data
When dealing with biological information, the question at hand often relates to the ability to separate a pool of data into different groups. This may be a simple two-way split, for example between people who do or do not have a disease, or it may involve many more data categories. Sometimes, however, the number of groups may not be known and it may not even be appropriate to think in terms of rigidly defined groups. Rather, it might be better to first determine the most discriminating features that separate the data and then investigate afterwards whether groups are present, and if so how many. Any kind of discrimination exercise naturally requires some form of information on which a judgement may be based, such as the results from an experiment, which can even include things like DNA sequences. Implicit in this sort of analysis is the notion that units of data are being separated, but each unit may relate to several pieces of information. For example, if a unit of data corresponds to a person they may be diagnosed by several different parameters and test measurements, or if a unit is a biological molecule it may be categorised by many different properties and experimental results.
Whatever the situation and type of data, sometimes the question being asked tries to place each unit of data in one group or another, where there is no possibility of something being in more than one group. Naturally, whether this is a valid assumption will depend on context and the formulation of the problem. In reality, a hard boundary between groups might not actually be as useful as a more fuzzy membership. Referring again to the problem of diagnosing a condition in people using experimental test results, it may be that two people with identical test results have different outcomes; there may not be a simple dividing line between groups. We may have official values to distinguish between ‘underweight’, ‘normal’ and ‘overweight’ people to help guide healthcare, but of course it is a continuous scale, so it may be sufficient to merely separate people (e.g. using height, weight and gender information) and be able to make more flexible decisions, not based on rigid categories.

Python Programming for Biology

Bioinformatics and Beyond
Tim J. Stevens, Wayne Boucher
Published online:

05 February 2015

Print publication:

12 February 2015
- Book
Do you have a biological question that could be readily answered by computational techniques, but little experience in programming? Do you want to learn more about the core techniques used in computational biology and bioinformatics? Written in an accessible style, this guide provides a foundation for both newcomers to computer programming and those interested in learning more about computational biology. The chapters guide the reader through: a complete beginners' course to programming in Python, with an introduction to computing jargon; descriptions of core bioinformatics methods with working Python examples; scientific computing techniques, including image analysis, statistics and machine learning. This book also functions as a language reference written in straightforward English, covering the most common Python language elements and a glossary of computing and biological terms. This title will teach undergraduates, postgraduates and professionals working in the life sciences how to program with Python, a powerful, flexible and easy-to-use language.

Flexible Pattern Matching in Strings

Practical On-Line Search Algorithms for Texts and Biological Sequences
Gonzalo Navarro, Mathieu Raffinot
Published online:

18 December 2014

Print publication:

27 May 2002
- Book
String matching problems range from the relatively simple task of searching a single text for a string of characters to searching a database for approximate occurrences of a complex pattern. Recent years have witnessed a dramatic increase of interest in sophisticated string matching problems, especially in information retrieval and computational biology. This book presents a practical approach to string matching problems, focusing on the algorithms and implementations that perform best in practice. It covers searching for simple, multiple and extended strings, as well as regular expressions, and exact and approximate searching. It includes all the most significant new developments in complex pattern searching. The clear explanations, step-by-step examples, algorithm pseudocode, and implementation efficiency maps will enable researchers, professionals and students in bioinformatics, computer science, and software engineering to choose the most appropriate algorithms for their applications.

Index
Ran Libeskind-Hadas, Harvey Mudd College, California, Eliot Bush, Harvey Mudd College, California
Book:

Computing for Biologists

Published online:

28 May 2018

Print publication:

22 September 2014, pp 206-207
- Chapter

1070 results in Computational Biology and Bioinformatics