LeetCode() Repeated DNA Sequences 看的非常的过瘾!

it2022-07-02  120

All DNA is composed of a series of nucleotides abbreviated as A, C, G, and T, for example: "ACGAATTCCG". When studying DNA, it is sometimes useful to identify repeated sequences within the DNA.

Write a function to find all the 10-letter-long sequences (substrings) that occur more than once in a DNA molecule.

For example,

Given s = "AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT", Return: ["AAAAACCCCC", "CCCCCAAAAA"].

非常好的思路: 转换成位操作。

算法分析

首先考虑将ACGT进行二进制编码

A -> 00

C -> 01

G -> 10

T -> 11

在编码的情况下,每10位字符串的组合即为一个数字,且10位的字符串有20位;一般来说int有4个字节,32位,即可以用于对应一个10位的字符串。例如

ACGTACGTAC -> 00011011000110110001

AAAAAAAAAA -> 00000000000000000000

20位的二进制数,至多有2^20种组合,因此hash table的大小为2^20,即1024 * 1024,将hash table设计为bool hashTable[1024 * 1024];

vector<string> findRepeatedDnaSequences(string s) { int hashMap[1048576] = {0}; vector<string> ans; int len = s.size(),hashNum = 0; if (len < 11) return ans; for (int i = 0;i < 9;++i) hashNum = hashNum << 2 | (s[i] - 'A' + 1) % 5; for (int i = 9;i < len;++i) if (hashMap[hashNum = (hashNum << 2 | (s[i] - 'A' + 1) % 5) & 0xfffff]++ == 1) ans.push_back(s.substr(i-9,10)); return ans; }

  

转载于:https://www.cnblogs.com/yanqi110/p/4996001.html

相关资源:数据结构—成绩单生成器

最新回复(0)