All DNA is composed of a series of nucleotides abbreviated as A, C, G, and T, for example: "ACGAATTCCG". When studying DNA, it is sometimes useful to identify repeated sequences within the DNA.
Write a function to find all the 10-letter-long sequences (substrings) that occur more than once in a DNA molecule.
For example,
Given s = "AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT", Return: ["AAAAACCCCC", "CCCCCAAAAA"].
Analysis:
We want to code a 10-letter-long substring into a integer, to perform hashset add and check for duplication.
Since each letter only has 4 cases: A,C,G,T, we can use 2-bit to represent it. Therefore, we can use a 20-bits integer to represent the substring.
Solution:
public class Solution { // Use mask to only maintain the last 20 bits. int mask = (1 << 20) - 1; public List<String> findRepeatedDnaSequences(String s) { List<String> resList = new ArrayList<String>(); if (s.length() < 10) return resList; HashSet<Integer> codeSet = new HashSet<Integer>(); HashSet<Integer> resSet = new HashSet<Integer>(); char[] charArray = s.toCharArray(); // Get code of the first 9 letters. int code = 0; for (int i = 0; i < 9; i++) { code = moveCode(code, charArray[i]); } for (int i = 9; i < s.length(); i++) { // Get code. code = moveCode(code, charArray[i]); // if current code has existed and have not appeared twice (i.e., // not added to resList), then add it into resList. if (!codeSet.add(code) && resSet.add(code)) { resList.add(s.substring(i - 9, i + 1)); } } return resList; } public int moveCode(int value, char c) { value <<= 2; // if (c=='A') value += 0; if (c == 'C') value += 1; if (c == 'G') value += 2; if (c == 'T') value += 3; value &= mask; return value; } }