Regular Expressions

2017-12-18 钊钖

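A regular expression is a logical formula for filtering strings.

It can determine whether a given string matches a pattern.
It can extract specific parts of a string.

These notes work through the Dataquest exercises to pick up some common usage.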

1. Introduction (instructions)

In the code cell, assign to the variable regex a regular expression that's four characters long and matches every string in the list strings.

strings = ["data science", "big data", "metadata"]
regex = 'data'

2. Wildcards in Regular Expressions (instructions)

In Python, we use the re module to work with regular expressions. The module's documentation provides a list of these special characters.

For instance, we use the special character "." to indicate that any character can be put in its place.

Assign a regular expression that is three characters long and matches every string in strings to the variable regex.

strings = ["bat",'robotics','megabyte']
regex = "b.t"
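As a quick check, the sketch below uses re.search (covered in more detail later in these notes) to confirm that the three-character pattern is found in every string. The helper name matches_all is made up for illustration and is not part of the exercise.

```python
import re

strings = ["bat", "robotics", "megabyte"]
regex = "b.t"

def matches_all(pattern, candidates):
    # Hypothetical helper: True only if the pattern is found in every candidate.
    return all(re.search(pattern, s) is not None for s in candidates)

result = matches_all(regex, strings)
print(result)  # True: "bat", "bot" and "byt" all fit "b.t"
```

The "." stands for exactly one character, so a string such as "bt" would not match.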

3. Searching the Beginnings and Endings of Strings (instructions)

We can use the caret symbol ("^") to match the beginning of a string, and the dollar sign ("$") to match the end of a string.

Assign a regular expression that's seven characters long and matches every string in strings (except for bad_string) to the variable regex.

strings = ["better not put too much", "butter in the", "batter"]
bad_string = "We also wouldn't want it to be bitter"
regex = '^b.tter'
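To see what the anchors do, here is a small sketch that tries a start-anchored and an end-anchored version of the pattern on the exercise strings. The names start_regex and end_regex are only for this illustration.

```python
import re

strings = ["better not put too much", "butter in the", "batter"]
bad_string = "We also wouldn't want it to be bitter"

start_regex = "^b.tter"  # "b.tter" must sit at the very beginning
end_regex = "b.tter$"    # "b.tter" must sit at the very end

start_hits = [re.search(start_regex, s) is not None for s in strings]
bad_start = re.search(start_regex, bad_string) is not None
bad_end = re.search(end_regex, bad_string) is not None

print(start_hits)  # [True, True, True]
print(bad_start)   # False: bad_string starts with "We"
print(bad_end)     # True: bad_string ends with "bitter"
```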

4. Introduction to the AskReddit Data Set

The AskReddit data set has five columns that appear in the following order:

Title -- The title of the post
Score -- The number of upvotes the post received
Time -- When the post was posted
Gold -- How much Reddit Gold users gave the post
NumComs -- The number of comments the post received

5. Reading and Printing the Data Set (instructions)

Title|Score|Time|Gold|NumComs
---|---|---|---|---
What's your internet "white whale", something you've been searching for years to find with no luck?|11510|1433213314|1|26195
What's your favorite video that is 10 seconds or less?|8656|1434205517|4|8479
What are some interesting tests you can take to find out about yourself?|8480|1443409636|1|4055
PhD's of Reddit. What is a dumbed down summary of your thesis?|7927|1440188623|0|13201
What is cool to be good at, yet uncool to be REALLY good at?|7711|1440082910|0|20325

Let's use the csv module to read and print our data file, "askreddit_2015.csv". Recall that we can use the csv module by performing the following steps:

Import csv.

Open the file that contains our CSV data in 'r' mode.

Call the csv.reader() function with the file object as input.

Convert the result to a list.

Use the csv module to read our data set and assign it to posts_with_header.
Use list slicing to exclude the first row, which represents the column names. Assign this sliced data set to posts.
Use a for loop and string slicing to print the first 10 rows. See if you notice any patterns in this sample of the data set.

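import csv

# Read every row into a list; the with-block closes the file afterwards.
with open("askreddit_2015.csv", "r") as f:
    posts_with_header = list(csv.reader(f))
# Skip the header row, then print the first 10 posts.
posts = posts_with_header[1:]
for post in posts[:10]:
    print(post)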

6. Counting Simple Matches in the Data Set with re()

We mentioned the re module earlier, and now we'll begin to use it in our code. One useful function the module provides is re.search.

With re.search(regex, string), we can check whether string is a match for regex. If it is, the expression will return a match object. If it isn't, it will return None. For now, we won't worry about returning the actual matches - we'll just compare the result to None to see whether we have a match or not.


import re
if re.search("needle", "haystack") is not None:
    print("We found it!")
else:
    print("Not a match")

The code above will print Not a match, because "haystack" is not a match for the regex "needle".

You may have noticed that many of the posts in our AskReddit database are directed towards particular groups of people, using phrases like "Soldiers of Reddit". These types of posts are common, and always follow a similar format. We can use regular expressions to count how many of them are in the top 1,000. Let's do this in our next exercise. We've already read the data set into the variable posts.

Instructions

Count the number of posts in our data set that match the regex "of Reddit". Assign the count to of_reddit_count.

import re
of_reddit_count = 0 
for post in posts:
    if re.search('of Reddit',post[0]) is not None:
        of_reddit_count += 1
print(of_reddit_count)

7. Using Square Brackets to Match Multiple Characters

For example, the regex "[bcr]at" would match the substrings "bat", "cat", and "rat", but nothing else. We indicate that the first character in the regex can be either "b", "c" or "r".
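A minimal sketch of the "[bcr]at" example, run against a made-up word list:

```python
import re

words = ["bat", "cat", "rat", "hat", "at"]  # made-up sample words

# Keep only the words in which b, c or r is immediately followed by "at".
matched = [w for w in words if re.search("[bcr]at", w) is not None]
print(matched)  # ['bat', 'cat', 'rat']
```

The bracketed set consumes exactly one character, which is why "at" on its own does not match.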

Instructions

Use square bracket notation to make the code account for both capitalizations of "Reddit", and count how many posts contain "of Reddit" or "of reddit" in the title.
Assign the resulting count to of_reddit_count.

import re
of_reddit_count = 0
for post in posts:
    # [rR] accepts either capitalization of the first letter.
    if re.search('of [rR]eddit', post[0]) is not None:
        of_reddit_count += 1

8. Escaping Special Characters

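Square brackets have a special meaning in a regex, so a pattern such as "[Serious]" matches any single one of the characters inside the brackets rather than the literal tag. To deal with this sort of problem, we need to escape special characters by placing a backslash ("\") in front of them.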
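A minimal sketch of why escaping matters, using two made-up titles. The raw-string prefix (r"...") is a choice made here so that Python passes the backslashes to the regex engine unchanged; it is not required by the exercise.

```python
import re

tagged = "[Serious] What changed your mind?"  # made-up title with the tag
untagged = "Is this about us?"                # made-up title without the tag

# Unescaped, "[Serious]" is a character set: any one of S, e, r, i, o, u, s.
false_positive = re.search("[Serious]", untagged) is not None
# Escaped, the brackets are matched as literal characters.
escaped_untagged = re.search(r"\[Serious\]", untagged) is not None
escaped_tagged = re.search(r"\[Serious\]", tagged) is not None

print(false_positive)    # True, although the tag is absent
print(escaped_untagged)  # False
print(escaped_tagged)    # True
```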

Instructions

Escape the square bracket characters to count the number of posts in our data set that contain the "[Serious]" tag.

Assign the count to serious_count.

import re
serious_count = 0
for post in posts:
    # Raw string keeps the backslashes intact for the regex engine.
    if re.search(r'\[Serious\]', post[0]) is not None:
        serious_count += 1

9. Combining Escaped Characters and Multiple Matches

Some people tag serious posts as "[Serious]", and others as "[serious]". We should account for both capitalizations.

Instructions

Refine the code to count how many posts have either "[Serious]" or "[serious]" in the title.
Assign the count to serious_count.

import re
serious_count = 0
for post in posts:
    # [Ss] accepts either capitalization inside the escaped brackets.
    if re.search(r'\[[Ss]erious\]', post[0]) is not None:
        serious_count += 1

10. Adding More Complexity to Your Regular Expression

In our data set, some users have tagged their posts with "(Serious)" or "(serious)", including the parentheses. Therefore, we should account for both square brackets and parentheses. We can do this by using square bracket notation, and escaping the "[", "]", "(", and ")" characters with the backslash.

Instructions

Refine the code so that it counts how many posts have the serious tag enclosed in either square brackets or parentheses.
Assign the count to serious_count.

import re
serious_count = 0
for post in posts:
    # Opening "[" or "(", then "Serious"/"serious", then closing "]" or ")".
    if re.search(r'[\[\(][Ss]erious[\]\)]', post[0]) is not None:
        serious_count += 1
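To sanity-check a pattern of this shape without the data set, one option is to run it over a few made-up titles, as in this sketch:

```python
import re

pattern = r"[\[\(][Ss]erious[\]\)]"
samples = [  # made-up titles, not taken from the data set
    "[Serious] first",
    "(serious) second",
    "third [serious]",
    "Seriously, fourth",
]

flags = [re.search(pattern, s) is not None for s in samples]
print(flags)  # [True, True, True, False]
```

Because the opening and closing sets are independent, a mismatched tag such as "[Serious)" would also match; for a simple count that is probably acceptable.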

11. Combining Multiple Regular Expressions

To combine regular expressions, we use the "|" character.
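A small sketch of "|" combined with the anchors from earlier, on made-up strings. Note that "|" splits the whole pattern, so "^" applies only to the left alternative and "$" only to the right one.

```python
import re

pattern = "^cat|dog$"  # "cat" at the start, or "dog" at the end
samples = ["cat food", "hot dog", "dog and cat"]  # made-up strings

hits = [re.search(pattern, s) is not None for s in samples]
print(hits)  # [True, True, False]
```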

Instructions

Use the "^" character to count how many posts include the serious tag at the beginning of the title. Assign this count to serious_start_count.
Use the "$" character to count how many posts include the serious tag at the end of the title. Assign this count to serious_end_count.
Use the "|" character to count how many posts include the serious tag at either the beginning or end of the title. Assign this count to serious_count_final.

import re

serious_start_count = 0
serious_end_count = 0
serious_count_final = 0

# Raw strings keep the backslashes intact for the regex engine.
for row in posts:
    if re.search(r'^[\[\(][Ss]erious[\]\)]', row[0]) is not None:
        serious_start_count += 1
for row in posts:
    if re.search(r'[\[\(][Ss]erious[\]\)]$', row[0]) is not None:
        serious_end_count += 1
for row in posts:
    if re.search(r'^[\[\(][Ss]erious[\]\)]|[\[\(][Ss]erious[\]\)]$', row[0]) is not None:
        serious_count_final += 1

12. Using Regular Expressions to Substitute Strings

The re module provides a sub() function that takes the following parameters (in order):

pattern: The regex to match
repl: The string that should replace the substring matches
string: The string containing the pattern we want to search
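A minimal sketch of the argument order, using a made-up title. One detail worth noticing is that sub() returns a new string rather than changing the original, so the result has to be assigned somewhere.

```python
import re

title = "What is your favorite colour? Why that colour?"  # made-up title

# pattern, repl, string: every match of the pattern is replaced.
new_title = re.sub("colour", "color", title)

print(new_title)  # What is your favorite color? Why that color?
print(title)      # the original string is unchanged
```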

Instructions

Replace "[serious]", "(Serious)", and "(serious)" with "[Serious]" for all of the titles in posts.
You should only need to use one call to sub(), and one regex.
Recall that the repl argument is an ordinary string. It's not a regex, so you don't need to escape characters like "[".

Hint

"[\[\(][Ss]erious[\]\)]" is the pattern argument to sub(), and "[Serious]" is the repl argument.

import re
for row in posts:
    # sub() returns a new string, so store it back in the row.
    row[0] = re.sub(r'[\[\(][Ss]erious[\]\)]', '[Serious]', row[0])

13. Matching Years with Regular Expressions

We can indicate that we're looking for integers in a pattern by using square brackets ("[" and "]"), along with a dash ("-"). For example, "[0-9]" will match any character that falls between 0 and 9 (all of which will be one-digit integers). Similarly, "[a-z]" would match any lowercase letter. We can also specify smaller ranges like "[3-5]" or "[d-g]".
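A short sketch of the range notation on made-up strings:

```python
import re

samples = ["room 4", "room 7", "room x"]  # made-up strings

# "[3-5]" matches a single character that is 3, 4 or 5.
in_range = [s for s in samples if re.search("[3-5]", s) is not None]
# "[d-g]" matches a single lowercase letter from d to g.
has_d_to_g = re.search("[d-g]", "ABCe") is not None

print(in_range)    # ['room 4']
print(has_d_to_g)  # True, because of the "e"
```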

A pattern such as "[0-9][0-9][0-9][0-9]" matches any four-digit number. This would work for finding years, but let's also add the condition that we only want to match years after year 999 and before year 3000 (any other four-digit numbers in a string are probably not years).

Instructions

We've loaded a number of strings into the strings variable for you.
Loop through strings and use re.search() to determine whether each string contains a year between 1000 and 2999.
Store every string that contains a year in year_strings. The .append() function will help here.

import re
year_strings = []
for string in strings:
    # First digit 1 or 2, then any three digits: 1000 through 2999.
    if re.search('[1-2][0-9][0-9][0-9]', string) is not None:
        year_strings.append(string)

14. Repeating Characters in Regular Expressions

We can use curly brackets ("{" and "}") to indicate that a pattern should repeat. To match any four-digit number, for example, we could repeat the pattern "[0-9]" four times by writing "[0-9]{4}".
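The sketch below compares a written-out year pattern with its curly-bracket form on a few made-up strings; both are expected to give the same answers.

```python
import re

long_form = "[1-2][0-9][0-9][0-9]"
short_form = "[1-2][0-9]{3}"  # {3} repeats the preceding "[0-9]" three times
samples = ["founded in 1848", "room 12", "year 3021", "call 555-0100"]  # made up

long_hits = [re.search(long_form, s) is not None for s in samples]
short_hits = [re.search(short_form, s) is not None for s in samples]

print(long_hits)   # [True, False, False, False]
print(short_hits)  # [True, False, False, False]
```

Neither form checks what surrounds the four digits, so a longer run such as "12345" would still count as containing a year.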

Instructions

We've loaded a number of strings into the strings variable for you.
Loop through strings and use re.search() to determine whether each string contains a year between 1000 and 2999. Use a regex that takes advantage of curly brackets.
Store every string that contains a year in year_strings. The .append() function will help here.

import re
year_strings = []
for string in strings:
    # {3} repeats [0-9] three times after the leading 1 or 2.
    if re.search('[1-2][0-9]{3}', string) is not None:
        year_strings.append(string)

15. Challenge: Extracting All Years

Finally, let's extract years from a string. The re module contains a findall() function that returns a list of substrings matching the regex. re.findall("[a-z]", "abc123") would return ["a", "b", "c"], because those are the substrings that match the regex.

Instructions

Use re.findall() to generate a list of all years between 1000 and 2999 in the string years_string.
Assign the result to years.

import re
years = re.findall('[1-2][0-9]{3}', years_string)
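For a feel of what findall() returns, here is a sketch on a made-up sentence (the variable sample stands in for the exercise's years_string, whose real contents are not shown in these notes):

```python
import re

# Made-up text; the exercise supplies its own years_string.
sample = "1066 and 1999 were memorable; 0999 and 3000 are out of range."

found = re.findall("[1-2][0-9]{3}", sample)
print(found)  # ['1066', '1999']
```

The results come back as strings in the order they appear, so they would need int() before any arithmetic.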
