MadSci Network: Computer Science |
Andrea, This is a really neat assignment that has a lot more subtlety buried in the problem than perhaps you realize. I hope you have some fun with it. Behind any computer program is an algorithm, which is the rules or procedures for solving a problem. The program is the implementation of the algorithm, or problem-solving strategy. So let's talk about some prime number finding strategies. The first strategy relies on the brute-force idea that a prime number is one which has no divisors other than itself and 1, so all I need to do is check all possible divisors. Algorithm #1: Declare N=2 a prime. For each value of N=3 to N=2,000,000 check to see if number N is prime by checking whether the division N/K leaves a remainder for all integer values of K from K=2 to K=N-1. If all values of K leave a remainder, N is prime; if any value of K leaves no remainder, N is not prime. As I said, algorithm #1 is based on a very straightforward interpretation of the definition of a prime number. It's also horribly inefficient (i.e., S-L-O-W). Algorithm #1a: Similar to #1, but start with K=2 and stop as soon as you find that N/K divides evenly (does not leave remainder) and declare "N is not a prime" or stop when K = (N/2)-1 and declare "N is prime". This is somewhat more efficient; no sense trying all possible divisors after you've found one that divides into N without remainder (you've proven that N is not prime, so quit, already), and there's no value in trying numbers larger than N/2, since you know beforehand that they can't divide into N without remainder. Algorithm #1b: similar to 1a, but instead of trying all integer values trying K=2 to K=(N/2)-1, keep a list of previously found primes and only check to see if THEY divide into N. This strategy trims down the list of potential divisors immensely and speeds things up at the expense of a little bit more complicated bookkeeping. The reason you only need to check prime numbers as possible divisors of N is the following: If K is not a prime number, then it has some prime factorization itself. For argument's sake, let's say K=p1*p2. If you get to the point in algorithm 1b where you are trying to see if K=p1*p2 is a factor of N, then you have already concluded that neither p1 nor p2 are factors of N. And if THAT's the case, why bother with p1*p2? The same argument applies for values of K that are the product of more than two prime factors, regardless of whether these factors are unique or repeated in the factorization of K. For example, 12 is obviously not a factor of 29 or 30. There's no point checking in whether 12=2*2*3 divides into N=29, because if it does, then 4 and 3 must all divided into it as well, but if 4 divides into it, so must have 2. Therefore, we would have discarded N=29 as a candidate for a prime number back when we were checking 2 or 3 as possible factors. But we haven't yet discarded 29, because neither 2 nor 3 are factors of 29 (nor are any other numbers). 12=2*2*3 can't be a factor of 29 because NONE of its factors are divisors of 29; there's no need to check 12 as a possible factor as it's already been "checked" as a consequence of not being a prime number itself. Furthermore, after we're done checking 29 (and concluding that it's a prime number), we go on to check 30, and find that 2 is a divisor, so we stop and declare that 30 is not a prime number, and go on to start checking 31 for primehood. We never have to check whether 12 is a divisor of 30, because an earlier test involving one of 12's factors (2) has already "flunked" 30 as a candidate for primehood. These versions Algorithm 1 are based on long division. Now let's try a different concept. Back in times of the ancient Greeks, a mathematician named Eratosthenes invented a method to find prime numbers. His name commemorates his algorith, "The Sieve of Eratosthenes". What Eratosthenes realized is that long division is a lot of work, and all it's really doing for us when we're looking for primes is identifying numbers which are multiples of other numbers (by the way, the name for a non-prime number is "composite number"). Algorithm 2 (Sieve of Eratosthenes). List all numbers between 2 and N. Cross out all multiples of 2, but not the number 2 itself. Then cross out all multiples of 3 but not the number 3 itself. Continue the process for 4, then 5, etc. Eventually, you will be ready to cross out multiples of (N/2)-1, at which point you will cross out N-2 and stop. Any number remaining on the list is prime. I thought the Sieve of Eratosthenes was a really cool idea (ahem, "elegant concept") when I first stumbled across a description of it in high school (not quite in the time of the ancient Greeks, but suffice it to say I was using FORTRAN II on an IBM 1620 with a 50 kHz clock speed... yes, that really was 50 kilohertz, not megahertz. Last Sunday's newspaper flyer from Walmart says that they're selling a computer that's about 6000 times faster, with a 300 MHz clock). But I digress... Algorithm 2 is pretty fast, in that you don't ask the computer to do a whole bunch of long divisions. It can be made faster, though: Algorithm 2a: List all numbers between 1 and N. Cross out multiples of 2 except for the number 2 itself. Find the next number on the list which hasn't been crossed out already (this will be 3) and cross out all multiples of 3 except for 3 itself. Look for the next number on the list which hasn't been crossed out already (this will be 5) and cross out its multiples (except for itself). This version is somewhat faster than Algorithm 2 because any starting number K which hasn't already been crossed out is prime; any potential starting number which HAS already been crossed IS NOT prime. Because it is not prime, it can be factored by some smaller prime numbers. All multiples of K have the same prime factorization (although the factors will appear a different number of times). But all multiples of these smaller prime factors will have been already crossed off the list, which means all multiples of K will have already been crossed off the list, so why bother by crossing them off again? The means Algorithm 2a is even speedier, but you run into a difficulty with the Sieve of Eratosthenes that I haven't mentioned yet: your computer may have problems dealing with a list of 2 billion numbers. So how do you deal with this? Algorithm 2b: Imagine executing Algorithm 2a yourself. Because you can't find a sheet of paper big enough write down 2 billion numbers, you will list the numbers on multiple pieces of paper. Page 1 might contain 1 through 1000, page 2 the numbers 1001 through 2000, etc. You would apply algorithm 2a first to page 1 to build a list of prime numbers between 2 and 1000. Then you could use the list of existing prime numbers to cross off all multiples of the primes on page 2 (the numbers from 1001 to 2000), and any remaining numbers must be prime so you add these to end of your list of prime numbers. (Note that any multiple of a number between 1001 and 2000 is greater than 2000, and is NOT on the current page). Now repeat the process for page 3, crossing off all multiples of the numbers on the existing prime number list and adding what remains to your list of prime numbers. Then continue for page 4, then page 5, etc., until you've completed your effort with page 2,000,000. The parallels between "sheets of paper" and "maximum allowabe array size" (be it driven by language, operating system, or hardware memory constraints) ought to be fairly obvious. Also, note that while some languages and operating systems might be able to handle a data array with 2 billion entries, it will probably do so by swapping pages of memory between RAM and your hard drive. That would make Algorithm 2a much, much slower than what it's capable of, and much slower than Algorithm 2b, assuming you work out the bookeeping so that page swapping never has to happen. Also, to make this method work efficiently, you have to do a little computation to decide where the first and last multiple of each prime number is, or if there are any multiples at all on the page. So there's a tradeoff: some added bookkeeping complexity lets Algorithm 2b achieves very high speed. On the other hand, Algorithm 1b is slower, but it's not a memory hog. Life is full of hard choices :-) Experiment with these ideas and see which you like best (and why). Happy Computing! Steve Czarnecki
Try the links in the MadSci Library for more information on Computer Science.