Notes on Generator 1

2015/10/23 18:05

Since English is really a programmer's native language, I might as well try writing this post in English.


Iterators

Iteration is the process of stepping through an iterable object; common iterable objects include dicts, strings, files, and so on.

Iterating over an iterator consumes its contents: once it is exhausted, it yields nothing more.

Functions such as sum(), min(), list() and tuple(), as well as the in operator, consume an iterator; after any of them runs, the iterator is exhausted.

To get an iterator from a list, simply call iter(item_list); repeated calls to next() on the result then return the elements one by one.

Any object that implements __iter__() and next() (__next__() in Python 3) is considered an iterator.
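
A minimal sketch of these two points: the list itself stays intact, but the iterator obtained from iter() is consumed (next() works in both Python 2 and Python 3).

items = [1, 2, 3]
it = iter(items)        # ask the list for a fresh iterator

print(next(it))         # 1
print(next(it))         # 2
print(next(it))         # 3
# next(it) would now raise StopIteration -- the iterator is exhausted

it = iter(items)
print(sum(it))          # 6  -- sum() consumed the iterator...
print(list(it))         # [] -- ...so nothing is left for list()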

# The for ... in statement

for x in obj:
    # statements

# What happens under the hood

_iter = iter(obj)           # get an iterator from the iterable
while 1:
    try:
        x = _iter.next()    # next(_iter) in Python 3
    except StopIteration:   # iterator exhausted, leave the loop
        break
    # statements
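
For completeness, here is a hand-written iterator implementing that protocol; this is only a sketch, and the Generator section below gets the same behaviour in far fewer lines.

class Countdown(object):
    """Iterator that counts down from n to 1 by hand."""
    def __init__(self, n):
        self.n = n

    def __iter__(self):
        return self

    def __next__(self):         # Python 3 protocol method
        if self.n <= 0:
            raise StopIteration
        value = self.n
        self.n -= 1
        return value

    next = __next__             # Python 2 compatibility alias

for x in Countdown(3):
    print(x)                    # 3, 2, 1 on separate lines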

Generator

A generator is essentially an easier way to write an iterator.

def countdown(n):
    print "Counting down from", n
    while n > 0:
        yield n
        n -= 1
# Note that calling countdown() below does not run the function body;
# the body only runs when next() is called, and each yield produces n
# and then suspends the function until the next call to next().
>>> x = countdown(10)
>>> x
<generator object at 0x58490>
>>> x.next()
Counting down from 10
10
>>> x.next()
9
...
>>> x.next()
1
# When the function body returns, the next call to next() raises StopIteration.
>>> x.next()
Traceback (most recent call last):
 File "<stdin>", line 1, in ?
StopIteration
>>>
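
In practice you rarely call next() by hand; a for loop drives the generator and handles the StopIteration for you, as in this quick sketch:

>>> for i in countdown(3):
...     print(i)
...
Counting down from 3
3
2
1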

Here is the Python 3.4 version:

def countdown(n):
    print("Counting down from", n)
    while n > 0:
        yield n
        n -= 1
    return 'exits'
>>> x = countdown(3)
>>> x
<generator object countdown at 0x101bd7288>
>>> next(x)
Counting down from 3
3
>>> next(x)
2
>>> next(x)
1
>>> next(x)
# In Python 3, a generator function may also return a value; that value is
# attached to the StopIteration exception raised when the generator finishes,
# as shown below.
# Returning a value from a generator is a SyntaxError in Python 2.7.
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
StopIteration: exits
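
If you want that returned value programmatically, Python 3 stores it on the StopIteration instance as its value attribute (yield from also propagates it); a minimal sketch:

x = countdown(1)
next(x)                  # prints "Counting down from 1", yields 1
try:
    next(x)
except StopIteration as e:
    print(e.value)       # 'exits' -- the generator's return value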

Generators vs. Iterators

  • A generator function is not itself an iterable object; each call to it creates a new generator object.
  • Iteration over a generator is one-time only: once it is exhausted, you have to call the generator function again to get a fresh one (see the sketch below).
  • Unlike generators, iterables such as lists and dicts can be iterated over any number of times, since every loop asks them for a new iterator.
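
A minimal sketch of the difference (the squares() helper here is just an illustration):

def squares(n):
    for i in range(n):
        yield i * i

g = squares(3)
print(list(g))       # [0, 1, 4]
print(list(g))       # []  -- the generator is already exhausted

nums = [0, 1, 4]     # a plain list is an iterable, not an iterator
print(list(nums))    # [0, 1, 4]
print(list(nums))    # [0, 1, 4]  -- it can be iterated again and again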

Generator Expressions

In the example below, the variable b is a generator.

>>> a = [1,2,3,4]
>>> b = (2*x for x in a)
>>> b
<generator object at 0x58760>
>>> for i in b: print i,
...
2 4 6 8

When the list a is very large, using a generator expression saves a lot of memory, simply because it does not build another big list in memory, whereas the equivalent list comprehension below does.

>>> a = [1,2,3,4]
>>> b = [2*x for x in a]
>>> b
[2, 4, 6, 8]
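
The memory difference can be seen directly with sys.getsizeof, which reports the size of the container object itself (a rough sketch; exact numbers vary by platform and Python version):

import sys

a = list(range(1000000))          # one million ints

b_list = [2 * x for x in a]       # list comprehension builds the whole list
b_gen  = (2 * x for x in a)       # generator expression builds nothing yet

print(sys.getsizeof(b_list))      # several MB -- grows with len(a)
print(sys.getsizeof(b_gen))       # roughly a hundred bytes, regardless of len(a)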

A generator example

Suppose we have a 1 GB access.log from nginx, and the task is to sum up the sizes of all the responses it records.

Every line of access.log looks like this:

xx.xx.xx.xx - - [01/Jul/2014:10:06:06 +0800] "GET /share/ajax/?image_id=xxx&user_id=xxx HTTP/1.1" 200 72 "http://www.baidu.com/" "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/31.0.1650.63 Safari/537.36"

We compare two solutions: one implemented with generators, the other with a plain for loop.

import cProfile, pstats, StringIO

def gene():
    with open('access.log', 'r') as f:
        # a lazy pipeline: nothing is read until sum() starts pulling values
        lines = (line.split(' ', 11)[9] for line in f)
        sizes = (int(size) for size in lines if size != '-')
        print "Generators Result: ", sum(sizes)

pr = cProfile.Profile()
pr.enable()
gene()
pr.disable()
s = StringIO.StringIO()
sortby = 'cumulative'
ps = pstats.Stats(pr, stream=s).sort_stats(sortby)
ps.print_stats()
print s.getvalue()


def loop():
    size_sum = 0
    with open('access.log', 'r') as f:
        # readlines() loads the entire file into memory at once
        for line in f.readlines():
            size = line.split(' ', 11)[9]
            if size != '-':
                size_sum += int(size)
        print "Forloop Result: ", size_sum

pr = cProfile.Profile()
pr.enable()
loop()
pr.disable()
s = StringIO.StringIO()
sortby = 'cumulative'
ps = pstats.Stats(pr, stream=s).sort_stats(sortby)
ps.print_stats()
print s.getvalue()


Sh4n3@Macintosh:~% python ger.py
Generators Result: 13678125506
         12481726 function calls in 41.487 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000   41.487   41.487 ger.py:3(gene)
        1    1.864    1.864   41.487   41.487 {sum}
  4160297   17.209    0.000   39.623    0.000 ger.py:6(<genexpr>)
  4160713   11.972    0.000   22.414    0.000 ger.py:5(<genexpr>)
  4160712   10.442    0.000   10.442    0.000 {method 'split' of 'str' objects}
        1    0.000    0.000    0.000    0.000 {open}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}

Forloop Result: 13678125506
         4160716 function calls in 142.672 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1   84.979   84.979  142.672  142.672 ger.py:9(loop)
        1   47.609   47.609   47.609   47.609 {method 'readlines' of 'file' objects}
  4160712   10.084    0.000   10.084    0.000 {method 'split' of 'str' objects}
        1    0.000    0.000    0.000    0.000 {open}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}

So the result shows the generator version is roughly 3.4x faster than the for-loop version (41.5 s vs. 142.7 s), and it never holds the whole file in memory.
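
For reference, the generator pipeline in Python 3 syntax would look roughly like this (a sketch; the profiling scaffolding stays the same except that StringIO lives in the io module):

def gene():
    with open('access.log', 'r') as f:
        lines = (line.split(' ', 11)[9] for line in f)
        sizes = (int(size) for size in lines if size != '-')
        print("Generators Result:", sum(sizes))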

