String and Data Processing


1 String and Data Processing

2 Sequential Processing
Processing each element in a sequence:
    for e in [1,2,3,4]: print e
    for c in "hello": print c
    for e in (1,2,3,"Name"): print e

3 List Comprehension
Use a list comprehension to create a new list by filtering with a condition or applying a mapping:
    [ x for x in [1,2,3] ]  →  [1,2,3]
    [ x*x for x in [1,2,3] ]  →  [1,4,9]
    [ x for x in [1,2,3,…,10] if x%2 == 0 ]  →  [2,4,6,8,10]
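The three comprehensions above can be checked directly. This is a minimal sketch in Python 3 syntax (the slides themselves use Python 2), and the variable names are only illustrative:

```python
# Copy, map, and filter with list comprehensions.
numbers = [1, 2, 3]

copied = [x for x in numbers]                     # plain copy
squares = [x * x for x in numbers]                # mapping
evens = [x for x in range(1, 11) if x % 2 == 0]   # filtering 1..10

print(copied, squares, evens)
```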

4 Operation with data records
    L = [ (1, 3), (1, 4), (1, 5), (2, 1), (2, 2) ]
Each tuple has the form (group, number).
How do we count the number of records in group 1? Create a list with the data from group 1:
    L0 = [ x for x in L if x[0] == 1 ]
Then count its elements with len(L0).
How do we sum the numbers in group 1?

5 Operation with data records
Create another list that contains only the numbers of the records in group 1:
    N0 = [ x[1] for x in L if x[0] == 1 ]
Then apply the built-in sum() function:
    sum(N0)  →  12
What other built-in functions work on a sequence?
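Put together, counting and summing the group-1 records might look like the sketch below (Python 3 syntax). The names count and total are illustrative and not from the slides:

```python
# Records are (group, number) pairs, as on the slides.
L = [(1, 3), (1, 4), (1, 5), (2, 1), (2, 2)]

# Keep only the records whose group is 1, then count them.
L0 = [x for x in L if x[0] == 1]
count = len(L0)

# Keep only the numbers of group 1, then add them up.
N0 = [x[1] for x in L if x[0] == 1]
total = sum(N0)

print(count, total)
```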

6 Built-in Functions for a sequence
To compute the maximum, use max(L):
    max([1,2,3,4,5,4,3,2,1])  →  5
To compute the minimum, use min(L):
    min([1,2,3,4,5,4,3,2,1])  →  1
To create a sorted list, use sorted(L):
    sorted([1,2,3,4,5,4,3,2,1])  →  [1,1,2,2,3,3,4,4,5]

7 Built-in Functions for a sequence
What if we need to compute an inner product? The zip(L1,L2,…,Ln) function will help!
    zip([1,2,3],[4,5,6])  →  [ (1,4), (2,5), (3,6) ]

8 Built-in Functions for a sequence
How do we use it in a loop? Use packing and unpacking!
    for (a,b) in zip([1,2,3],[4,5,6]):
        print a,b
Each pair (1,4), (2,5), (3,6) is unpacked into a and b, so the loop prints 1 4, then 2 5, then 3 6.

9 Built-in Functions for a sequence
Need an index in a for statement? Use the enumerate() function:
    for (i, e) in enumerate(["Tom", "Jack", "Bob"]):
        print i, e
This will print
    0 Tom
    1 Jack
    2 Bob
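enumerate() counts from 0 unless a different start value is given. A small sketch in Python 3 syntax, where names, from_zero, and from_one are illustrative:

```python
names = ["Tom", "Jack", "Bob"]

# Default: indices start at 0.
from_zero = list(enumerate(names))

# The optional start argument shifts the first index.
from_one = list(enumerate(names, start=1))

for i, name in from_zero:
    print(i, name)
```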

10 Built-in Functions for a sequence
Functions:
    min(sequence)  →  the minimum element
    max(sequence)  →  the maximum element
    sum(sequence, start)  →  the sum of the elements, added to start (default 0)
    zip(sequences of the same length)  →  tuples pairing up the elements at each position
    enumerate(sequence)  →  tuples, each holding an index (0,1,2,…) and the corresponding element of the given sequence

11 Built-in Functions for a sequence
Packing and unpacking are useful for handling multiple values in one operation.
Collecting each element whose index is a multiple of 2:
    [ x for (j,x) in enumerate([4,5,6]) if j%2 == 0 ]  →  [4, 6]
Computing the inner product:
    sum( [ a*b for (a,b) in zip((1,2),(1,2)) ] )  →  5
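Both one-liners can be wrapped in small functions for reuse. The function names below are hypothetical, and the sketch uses Python 3 syntax:

```python
def even_indexed(seq):
    """Return the elements of seq whose index is a multiple of 2."""
    return [x for j, x in enumerate(seq) if j % 2 == 0]


def inner_product(u, v):
    """Sum of pairwise products; zip() stops at the shorter input."""
    return sum(a * b for a, b in zip(u, v))


print(even_indexed([4, 5, 6]))
print(inner_product((1, 2), (1, 2)))
```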

12 Playing with Real Data

13 File
Hopefully, you didn't forget how to read a file.
    file = open(<filename>, <mode>)
    lines = file.readlines()
    file.close()
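A common alternative to calling close() by hand is a with block, which closes the file even if an error occurs. The sketch below (Python 3 syntax) assumes a hypothetical helper read_lines and writes a throwaway file with made-up contents so that it runs on its own:

```python
import os
import tempfile


def read_lines(path):
    """Return the file's lines with surrounding whitespace removed."""
    with open(path, "r") as f:
        return [line.strip() for line in f]


# Demo on a temporary file; the two lines below are invented sample data.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("1 40 1 1\n2 41 1 1\n")
    demo_path = tmp.name

demo_lines = read_lines(demo_path)
os.remove(demo_path)
print(demo_lines)
```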

14 Data Processing
Data processing is essentially dealing with a list of lines. However, you should have a clear picture of the structure of your data.

15 Data Processing
Pre-existing data is mostly a subject for statistical analysis.
Basic description: count, sum, min, max, set operations
Descriptive statistics such as mean and variance
Information visualization
Comparative analysis: cross correlation, hypothesis testing
Modeling and validation: linear regression via least squares
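The basic and descriptive measures listed above need only a few lines each. The sketch below (Python 3 syntax) uses hypothetical helpers mean and variance on made-up values, and computes the population variance, which is one of several conventions:

```python
def mean(values):
    """Arithmetic mean of a non-empty sequence of numbers."""
    return sum(values) / len(values)


def variance(values):
    """Population variance: the mean squared deviation from the mean."""
    m = mean(values)
    return sum((v - m) ** 2 for v in values) / len(values)


times = [40, 41, 45, 39]  # invented response times, for illustration only

print(len(times), sum(times), min(times), max(times))
print(mean(times), variance(times))
```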

16 Pythonic Way for Descriptive Statistics

17 Sternberg's experiment
Do people process a set of numbers in parallel or sequentially?

18 Data that we have
In Excel, we have: [spreadsheet image not included in the transcript]

19 Data that we have
You can download the previous data as a text file:
https://www.cs.unc.edu/~joohwi/comp116/ResponseTime.txt
Download the file into your project directory.
Let's make a list of tuples of the form (id, response time, number of digits, presence):
0th element: the id of each trial
1st element: the response time in 1/100 sec
2nd element: the number of digits
3rd element: 1 if the digit is included, 2 if not

20 Transforming data
Let's make the data more readable. For example: [example image not included in the transcript]

21 The structure of data
The first (uppermost) group: the number of digits (1/3/5)
The second group: the presence of the digit in the given number (Y/N)
Let's make a hierarchical structure.

22 The structure of data

23 The structure of data
Make a tuple of (1st level data, 2nd level data, 3rd level data):
    (1,1,40)  (3,1,73)  (5,1,39)
    (1,2,45)  (3,2,73)  (5,2,66)
    …         …         …

24 The structure of data
1. Read a list of strings from the file
2. Strip whitespace using the strip() function
3. Separate each line into a list of four words using split()
4. Create a list of tuples with a list comprehension

25 Reading data step by step
    file = open("<filename>", "r")
    lines = file.readlines()
    file.close()
    lines = [ l.strip() for l in lines ]
    words = [ l.split() for l in lines ]

26 Reading data step by step
    data = [ (int(w[1]), int(w[2]), int(w[3])) for w in words ]
    print data
    [ (1,1,40), (1,1,41), … ]
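The strip, split, and convert steps can be tried without the data file. In this Python 3 sketch the sample lines are invented, and their column order is only assumed to match the indices w[1], w[2], w[3] used on the slide:

```python
# Invented lines standing in for the output of file.readlines().
sample_lines = ["1 1 1 40\n", "2 1 1 41\n", "  3 3 2 73  \n", "\n"]

# strip() removes surrounding whitespace; blank lines are skipped.
stripped = [line.strip() for line in sample_lines if line.strip()]

# split() with no argument separates on any run of whitespace.
words = [line.split() for line in stripped]

# Keep columns 1..3 as integers; column 0 (the trial id) is dropped.
data = [(int(w[1]), int(w[2]), int(w[3])) for w in words]
print(data)
```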


28 Now, we have data
Now, let's make a list of strings, each containing 15 reaction-time values. For example,
    L = [ 1,2,3,4,5,…,100 ]
    D = [ [1,2,3,4,…,15], [16,17,18,…,30], [31,32,…,45], … ]
How could we do that? There are many ways.

29 Collection by counting
1. Create a counter variable and collect a sublist every 15 elements.
    D = []
    for j in range(0, len(L), 15):
        S = []
        for k in range(j, min(j+15, len(L))):
            S.append(L[k])
        D.append(S)

30 Collection by counting
2. Use list comprehension
    D = [ [ L[k] for k in range(j, min(j+15, len(L))) ] \
          for j in range(0, len(L), 15) ]
3. Use list slicing to replace the inner loop
    D = [ L[j:j+15] for j in range(0, len(L), 15) ]
4. Convert each inner list into a string (join() needs strings, so convert each number with str())
    output = [ " ".join([ str(n) for n in sublist ]) for sublist in D ]
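The loop and the slicing version should produce the same groups, including a shorter last group when len(L) is not a multiple of 15. A sketch in Python 3 syntax that compares them, where the names size, by_loop, and by_slice are illustrative:

```python
L = list(range(1, 101))  # 1..100, as in the slide's example
size = 15

# Nested loops; min() keeps the inner index inside the list.
by_loop = []
for j in range(0, len(L), size):
    chunk = []
    for k in range(j, min(j + size, len(L))):
        chunk.append(L[k])
    by_loop.append(chunk)

# Slicing never runs past the end, so no bounds check is needed.
by_slice = [L[j:j + size] for j in range(0, len(L), size)]

# join() accepts only strings, so each number is converted first.
output = [" ".join(str(n) for n in chunk) for chunk in by_slice]
print(output[0])
```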

31 Repeat for each group
Before repeating the process, define a function to give the piece of code a name.
    def format_data(L):
        D = [ L[j:j+15] for j in range(0, len(L), 15) ]
        return " ".join([ " ".join([ str(n) for n in sublist ]) for sublist in D ])
The signature of our function is format_data(list) → string.
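The transcript does not make clear which separator joins the rows, so the sketch below takes it as a parameter. The size and row_sep parameters are assumptions added for illustration and are not part of the slide's function (Python 3 syntax):

```python
def format_data(values, size=15, row_sep="\n"):
    """Group values into rows of `size` items and join them into one string."""
    rows = [values[j:j + size] for j in range(0, len(values), size)]
    return row_sep.join(" ".join(str(v) for v in row) for row in rows)


print(format_data([1, 2, 3, 4, 5], size=2))
```

Passing a different row_sep, for example an HTML line break, would adapt the same function to a report template.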

32 Creating a report
Let's make an HTML report.
    html_template = "… %s … %s … %s …"
    fo = open("<filename>", "w")
    o1 = format_data(L1)
    o2 = format_data(L2)
    …
    fo.write(html_template % (o1, o2, o3, …))
    fo.close()
Open the file with your browser.
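The slide's template is elided, so the sketch below substitutes a made-up two-slot template and a made-up file name to show the % substitution and the write step (Python 3 syntax). It writes to the system's temporary directory and removes the file afterwards:

```python
import os
import tempfile

# Hypothetical template with two %s slots; the real one is not in the transcript.
html_template = "<html><body><pre>%s</pre><pre>%s</pre></body></html>"

o1 = "1 2 3"  # stand-ins for the strings returned by format_data()
o2 = "4 5 6"

# The tuple must have exactly as many items as there are %s slots.
report = html_template % (o1, o2)

path = os.path.join(tempfile.gettempdir(), "report_sketch.html")
with open(path, "w") as fo:
    fo.write(report)

with open(path, "r") as fi:
    written = fi.read()
os.remove(path)
print(written)
```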

33 Textual Visualization
A more informative visualization comes from frequency counting. Make a histogram!

34 Multiple countings
A classic problem that uses another data structure, called a dictionary or an associative array.
Make a tuple (key, value).
A list holds multiple tuples while keeping each key unique within the list.

35 Dictionary
Whenever inserting a tuple, test whether the key already exists.
If the key exists, overwrite the value.
If not, append the tuple to the list.
Inserting (1, 3), (2, 4), (3, 1), (1, 4) in order: (1, 3) is overwritten by (1, 4), leaving (1, 4), (2, 4), (3, 1).
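The insertion rule can be written out on a plain list of tuples. This is a sketch of the idea only, not how Python's own dictionary works, and the function name insert is hypothetical (Python 3 syntax):

```python
def insert(pairs, key, value):
    """Overwrite the value if key is already present, otherwise append."""
    for i, (k, _) in enumerate(pairs):
        if k == key:
            pairs[i] = (key, value)
            return
    pairs.append((key, value))


pairs = []
for key, value in [(1, 3), (2, 4), (3, 1), (1, 4)]:
    insert(pairs, key, value)

print(pairs)
```

Each insertion scans the whole list, which is the cost a real dictionary avoids.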

36 Dictionary
A dictionary is created with the {} constructor. For example,
    data = { }  # an empty dictionary
    data = { 1: 3, 2: 4 }
Each key and value pair is written as key:value within the constructor.

37 Dictionary
Accessing an element requires its key. The bracket [] operator takes a key.
    data = { 1: 3, 2: 4 }
    print data[1]  # will print 3
    print data[2]  # will print 4
Don't confuse this with sequence indexing: 1 and 2 here are keys, not positions.

38 Dictionary
A value can be of any type, and a key can be of any hashable (immutable) type, such as a number, a string, or a tuple of those!
    data = { "Tom": "Cruise", 1: 3, (0,2): (3,4) }
This is different from a sequence.
    print data["Tom"]  # will print Cruise
    print data[1]  # will print 3
    print data[(0,2)]  # will print (3, 4)

39 Dictionary
Accessing a key that does not exist raises a KeyError (here d is the dictionary from the previous slide).
    print d["Nicole"]
    KeyError: 'Nicole'
The in operator tests whether a key exists.
    print "Tom" in d  # True
    print "Nicole" in d  # False
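There are three usual ways to deal with a key that may be missing: test with in, catch the KeyError, or ask get() for a fallback. A sketch in Python 3 syntax, where the fallback string "unknown" is an arbitrary choice:

```python
d = {"Tom": "Cruise", 1: 3, (0, 2): (3, 4)}

# 1. Test for the key before accessing it.
has_tom = "Tom" in d
has_nicole = "Nicole" in d

# 2. Access it and handle the error.
try:
    d["Nicole"]
    raised = False
except KeyError:
    raised = True

# 3. Let get() return a default instead of raising.
fallback = d.get("Nicole", "unknown")

print(has_tom, has_nicole, raised, fallback)
```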

40 Dictionary
A list of its keys is obtained with the keys() method.
    print d.keys()
    >>> [ 1, (0,2), "Tom" ]
Note that the insertion order is not preserved here; it depends on the implementation of Python. (Since Python 3.7, dictionaries do preserve insertion order.)

41 Dictionary
A list of its values is obtained with the values() method.
    print d.values()
    >>> [ 3, (3,4), "Cruise" ]
As with keys(), the order depends on the implementation of Python, not on the insertion order.

42 Dictionary
A for loop works well with the dictionary type; it iterates over the keys.
    for k in d:
        print "(", k, ",", d[k], ")"
There is dictionary comprehension as well!
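Iterating over a dictionary yields its keys, so the value has to be looked up with d[k]. The items() method yields both at once. A short sketch in Python 3 syntax with illustrative names:

```python
d = {1: 3, 2: 4}

# Iterate over keys and look each value up.
by_key = []
for k in d:
    by_key.append((k, d[k]))

# Iterate over (key, value) pairs directly.
by_item = [(k, v) for k, v in d.items()]

print(by_key, by_item)
```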

43 Dictionary
    { j:j+1 for j in range(0,3) }
    >>> { 0:1, 1:2, 2:3 }
    { j:e for (j,e) in enumerate(L) }
    >>> { 0:L[0], 1:L[1], …, n-1:L[n-1] }  # n = len(L)
    { e:[] for e in L }  # assume L = [ 1, 1, 2 ]
    >>> { 1:[], 2:[] }

44 Frequency counting
Make each response time value a key in the dictionary of counts.
    L = [ (1,1,40), (1,1,41), … ]
    F = { e[2]: 0 for e in L }
Each value is initialized with 0. Duplicate keys are ignored automatically!

45 Frequency counting
    L = [ (1,1,40), (1,1,41), … ]
    F = { 36:0, 37:0, … }
Loop over each element in L and update F[key]:
    for e in L:
        F[ e[2] ] += 1
    F = { 36: 1, 37: 1, … }
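The two-step count (initialize every key to 0, then increment) can be run on a few made-up records. The sketch below, in Python 3 syntax, also compares the result with collections.Counter from the standard library, which the slides do not use:

```python
from collections import Counter

# Invented records of the form (digits, presence, response time).
L = [(1, 1, 40), (1, 1, 41), (1, 2, 40), (3, 1, 73)]

# Step 1: one key per distinct response time, each starting at 0.
F = {e[2]: 0 for e in L}

# Step 2: add 1 for every occurrence.
for e in L:
    F[e[2]] += 1

# Counter performs the same counting in one call.
check = Counter(e[2] for e in L)
print(F, F == check)
```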

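46 Frequency counting
Interestingly, the keys come out in increasing order! The reason is the implementation, but not because the keys are kept sorted. A Python dictionary is a hash table: each key is located through its hash value rather than by searching sorted data. Small integers hash to themselves, so in the Python 2 used here they often land in the table in increasing order. This is a side effect, not a guarantee; use sorted(F) when the order matters.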
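Sorting the keys explicitly gives a predictable order however the dictionary stores them. The sketch below (Python 3 syntax) turns a dictionary of counts into the kind of text histogram suggested on slide 33. The function name and the counts are made up for illustration:

```python
def text_histogram(counts):
    """Return one line per key, in increasing key order, with a bar of '*'."""
    return ["%3d | %s" % (key, "*" * counts[key]) for key in sorted(counts)]


F = {41: 2, 36: 1, 40: 3}  # invented counts, deliberately out of order
lines = text_histogram(F)
print("\n".join(lines))
```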

