Regular Expressions

2017-12-18  钊钖

A regular expression is a logical formula for filtering strings.

These notes work through some common usages from the Dataquest exercises.

1. Introduction (instructions)

In the code cell, assign to the variable regex a regular expression that's four characters long and matches every string in the list strings.

strings = ["data science", "big data", "metadata"]
regex = "data"

2. Wildcards in Regular Expressions (instructions)

In Python, we use the re module to work with regular expressions. The module's documentation provides a list of these special characters.

For instance, we use the special character "." to indicate that any character can be put in its place.

Assign a regular expression that is three characters long and matches every string in strings to the variable regex.

strings = ["bat", "robotics", "megabyte"]
regex = "b.t"
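
One way to check a candidate pattern like this is to run re.search over every string in the list. The sketch below is only illustrative; the helper name matches_all is hypothetical and not part of the exercise.

```python
import re

def matches_all(regex, strings):
    """Hypothetical helper: True if regex is found somewhere in every string."""
    return all(re.search(regex, s) is not None for s in strings)

strings = ["bat", "robotics", "megabyte"]

# "." stands for any single character, so "b.t" finds "bat", "bot" and "byt".
wildcard_ok = matches_all("b.t", strings)

# The literal pattern "bat" only occurs in the first string.
literal_ok = matches_all("bat", strings)

print(wildcard_ok, literal_ok)
```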

3. Searching the Beginnings and Endings of Strings (instructions)

We can use the caret symbol ("^") to match the beginning of a string, and the dollar sign ("$") to match the end of a string.

Assign a regular expression that's seven characters long and matches every string in strings (except for bad_string) to the variable regex.

strings = ["better not put too much", "butter in the", "batter"]
bad_string = "We also wouldn't want it to be bitter"
regex = "^b.tter"
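
To see what the caret contributes, it can help to compare the anchored pattern with an unanchored one on the same strings. This is a small sketch using the exercise's own strings; the variable names other than strings and bad_string are made up for illustration.

```python
import re

strings = ["better not put too much", "butter in the", "batter"]
bad_string = "We also wouldn't want it to be bitter"

anchored = "^b.tter"    # must occur at the start of the string
unanchored = "b.tter"   # may occur anywhere

# All three good strings start with b, any character, then "tter".
anchored_hits = [s for s in strings if re.search(anchored, s) is not None]

# bad_string contains "bitter", but only at the end.
bad_anchored = re.search(anchored, bad_string) is not None
bad_unanchored = re.search(unanchored, bad_string) is not None

# "$" anchors the other side: "bitter" is the last word of bad_string.
bad_ends_with = re.search("bitter$", bad_string) is not None

print(len(anchored_hits), bad_anchored, bad_unanchored, bad_ends_with)
```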

4. Introduction to the AskReddit Data Set

The AskReddit data set has five columns, which appear in the following order:

Title -- The title of the post
Score -- The number of upvotes the post received
Time -- When the post was posted
Gold -- How much Reddit Gold users gave the post
NumComs -- The number of comments the post received

5. Reading and Printing the Data Set (instructions)

Title|Score|Time|Gold|NumComs
---|---|---|---|---
What's your internet "white whale", something you've been searching for years to find with no luck?|11510|1433213314|1|26195
What's your favorite video that is 10 seconds or less?|8656|1434205517|4|8479
What are some interesting tests you can take to find out about yourself?|8480|1443409636|1|4055
PhD's of Reddit. What is a dumbed down summary of your thesis?|7927|1440188623|0|13201
What is cool to be good at, yet uncool to be REALLY good at?|7711|1440082910|0|20325

Let's use the csv module to read and print our data file, "askreddit_2015.csv". Recall that we can use the csv module by performing the following steps:

  1. Import csv.
  2. Open the file that contains our CSV data in 'r' mode.
  3. Call the csv.reader() function with the file object as input.
  4. Convert the result to a list.

import csv

# Read every row, then drop the header row.
with open("askreddit_2015.csv", "r") as f:
    post_with_header = list(csv.reader(f))
posts = post_with_header[1:]
for post in posts[:10]:
    print(post)

6. Counting Simple Matches in the Data Set with re

We mentioned the re module earlier, and now we'll begin to use it in our code. One useful function the module provides is re.search.

With re.search(regex, string), we can check whether string is a match for regex. If it is, the expression will return a match object. If it isn't, it will return None. For now, we won't worry about returning the actual matches - we'll just compare the result to None to see whether we have a match or not.


import re

if re.search("needle", "haystack") is not None:
    print("We found it!")
else:
    print("Not a match")

The code above will print Not a match, because "haystack" is not a match for the regex "needle".
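
For contrast, here is a short sketch of the successful case, showing that a match object carries the matched text and its position. The pattern "hay" is chosen only for illustration.

```python
import re

hit = re.search("hay", "haystack")      # "hay" occurs in "haystack"
miss = re.search("needle", "haystack")  # "needle" does not

print(hit is not None)   # True
print(miss is None)      # True

# A match object exposes the matched substring and where it was found.
print(hit.group(), hit.start(), hit.end())  # hay 0 3
```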

You may have noticed that many of the posts in our AskReddit database are directed towards particular groups of people, using phrases like "Soldiers of Reddit". These types of posts are common, and always follow a similar format. We can use regular expressions to count how many of them are in the top 1,000. Let's do this in our next exercise. We've already read the data set into the variable posts.

Instructions

Count the number of posts in our data set that match the regex "of Reddit". Assign the count to of_reddit_count.

import re
of_reddit_count = 0
for post in posts:
    if re.search('of Reddit', post[0]) is not None:
        of_reddit_count += 1
print(of_reddit_count)

7. Using Square Brackets to Match Multiple Characters

For example, the regex "[bcr]at" would match the substrings "bat", "cat", and "rat", but nothing else. We indicate that the first character in the regex can be either "b", "c" or "r".
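
A quick sketch of that character class on a few sample words (the word list is made up for illustration). Note that re.search looks for the pattern anywhere in the string, so a longer word that contains "bat" also counts as a match.

```python
import re

words = ["bat", "cat", "rat", "hat", "combat"]

# "[bcr]at" accepts exactly one of b, c or r, followed by "at".
matched = [w for w in words if re.search("[bcr]at", w) is not None]

print(matched)  # ['bat', 'cat', 'rat', 'combat']
```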

Instructions

import re
of_reddit_count = 0
for post in posts:
    if re.search('of [rR]eddit', post[0]) is not None:
        of_reddit_count += 1

8. Escaping Special Characters

To deal with this sort of problem, we need to escape special characters by placing a backslash ("\") in front of them.
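
The sketch below illustrates why escaping matters for a tag like "[Serious]": left unescaped, the square brackets define a character class, so the pattern matches any single one of those letters. The sample titles are invented for the example, and the raw-string prefix r is used so Python passes the backslashes through to the re module unchanged.

```python
import re

title = "What is your favorite color?"            # made-up title, no tag
tagged = "[Serious] What is your favorite color?"  # made-up title, with tag

# Unescaped: a character class, so it matches a single letter such as "i".
unescaped = re.search("[Serious]", title)

# Escaped: matches the literal text "[Serious]" only.
escaped_on_plain = re.search(r"\[Serious\]", title)
escaped_on_tagged = re.search(r"\[Serious\]", tagged)

# re.escape builds the escaped pattern from a literal string.
auto_escaped = re.escape("[Serious]")

print(unescaped.group(), escaped_on_plain, escaped_on_tagged.group())
```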

Instructions

Escape the square bracket characters to count the number of posts in our data set that contain the "[Serious]" tag.

import re
serious_count = 0
for post in posts:
    # Raw string, so the backslashes reach the regex engine unchanged.
    if re.search(r'\[Serious\]', post[0]) is not None:
        serious_count += 1

9. Combining Escaped Characters and Multiple Matches

Some people tag serious posts as "[Serious]", and others as "[serious]". We should account for both capitalizations.

Instructions

import re
serious_count = 0
for post in posts:
    if re.search(r'\[[Ss]erious\]', post[0]) is not None:
        serious_count += 1

10. Adding More Complexity to Your Regular Expression

In our data set, some users have tagged their posts with "(Serious)" or "(serious)", including the parentheses. Therefore, we should account for both square brackets and parentheses. We can do this by using square bracket notation, and escaping the "[", "]", "(", and ")" characters with the backslash.

Instructions

import re
serious_count = 0
for post in posts:
    # Opening "[" or "(", then "Serious" or "serious", then closing "]" or ")".
    if re.search(r'[\[\(][Ss]erious[\]\)]', post[0]) is not None:
        serious_count += 1

11. Combining Multiple Regular Expressions

To combine regular expressions, we use the "|" character.
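
As a small illustration of alternation, the sketch below combines a start-anchored pattern with an end-anchored one. The patterns and sample strings are invented for the example. The "|" operator binds more loosely than everything around it, so "^cat|dog$" reads as "starts with cat" or "ends with dog".

```python
import re

pattern = "^cat|dog$"

samples = ["cat nap", "hot dog", "a cat and a dog walk"]

# True where either side of the "|" matches.
results = [re.search(pattern, s) is not None for s in samples]

print(results)  # [True, True, False]
```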

Instructions

import re

serious_start_count = 0
serious_end_count = 0
serious_count_final = 0

for row in posts:
    if re.search(r'^[\[\(][Ss]erious[\]\)]', row[0]) is not None:
        serious_start_count += 1
for row in posts:
    if re.search(r'[\[\(][Ss]erious[\]\)]$', row[0]) is not None:
        serious_end_count += 1
for row in posts:
    if re.search(r'^[\[\(][Ss]erious[\]\)]|[\[\(][Ss]erious[\]\)]$', row[0]) is not None:
        serious_count_final += 1

12. Using Regular Expressions to Substitute Strings

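The re module provides a sub() function that takes the following parameters (in order): pattern (the regex to match), repl (the string that replaces each match), and string (the string to search).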

Instructions

Hint

"[\[\(][Ss]erious[\]\)]" is the pattern argument to sub(), and "[Serious]" is the repl argument.

import re
for row in posts:
    # re.sub returns a new string, so store the result back in the row.
    row[0] = re.sub(r'[\[\(][Ss]erious[\]\)]', '[Serious]', row[0])
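
Here is a minimal sketch of re.sub on a few invented titles, assuming the goal is to normalize every variant of the tag to "[Serious]". Because Python strings are immutable, re.sub returns a new string rather than changing its input, so the result has to be kept somewhere.

```python
import re

# Opening "[" or "(", "Serious" in either capitalization, closing "]" or ")".
tag_pattern = r"[\[\(][Ss]erious[\]\)]"

titles = ["(serious) Why?", "[Serious] How?", "No tag here"]  # made-up titles

# Titles without a tag come back unchanged.
normalized = [re.sub(tag_pattern, "[Serious]", t) for t in titles]

print(normalized)
```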

13. Matching Years with Regular Expressions

We can indicate that we're looking for integers in a pattern by using square brackets ("[" and "]"), along with a dash ("-"). For example, "[0-9]" will match any character that falls between 0 and 9 (all of which will be one-digit integers). Similarly, "[a-z]" would match any lowercase letter. We can also specify smaller ranges like "[3-5]" or "[d-g]".

This would work, but let's also add the condition that we only want to match years after year 999 and before year 3000 (any other four-digit numbers in a string are probably not years).

Instructions

import re
year_strings = []
for string in strings:
    if re.search('[1-2][0-9][0-9][0-9]', string) is not None:
        year_strings.append(string)
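
The range syntax can be tried out on small inputs before applying it to real data. The sketch below uses invented sample strings; it shows two of the smaller ranges and then the effect of requiring the first digit to be 1 or 2.

```python
import re

# Ranges pick out single characters that fall between the two endpoints.
digits_3_to_5 = re.findall("[3-5]", "1234567")
letters_d_to_g = re.findall("[d-g]", "abcdefgh")

# Made-up strings: only the first contains a number between 1000 and 2999.
samples = ["born in 1987", "room 0421", "year 3000", "call 555"]
year_like = [s for s in samples
             if re.search("[1-2][0-9][0-9][0-9]", s) is not None]

print(digits_3_to_5, letters_d_to_g, year_like)
```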

14. Repeating Characters in Regular Expressions

We can use curly brackets ("{" and "}") to indicate that a pattern should repeat. To match any four-digit number, for example, we could repeat the pattern "[0-9]" four times by writing "[0-9]{4}"

Instructions

import re
year_strings = []
for string in strings:
    if re.search('[1-2][0-9]{3}', string) is not None:
        year_strings.append(string)
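
As a sanity check, the repeated form should behave exactly like the spelled-out form. The sketch below compares the two on a few invented strings. It also shows a limitation: nothing in the pattern bounds the number, so a longer number such as 12345 still contains a four-digit match.

```python
import re

long_form = "[1-2][0-9][0-9][0-9]"
short_form = "[1-2][0-9]{3}"

samples = ["In 1999", "code 12345", "no digits", "year 3021"]  # made-up strings

long_hits = [re.search(long_form, s) is not None for s in samples]
short_hits = [re.search(short_form, s) is not None for s in samples]

print(long_hits == short_hits)  # True
print(short_hits)               # [True, True, False, False]
```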

15. Challenge: Extracting All Years

Finally, let's extract years from a string. The re module contains a findall() function that returns a list of substrings matching the regex. re.findall("[a-z]", "abc123") would return ["a", "b", "c"], because those are the substrings that match the regex.

Instructions

import re
years = re.findall('[1-2][0-9]{3}', years_string)
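
To make the behavior concrete, here is a sketch of findall on a sentence invented for the example (the variable name text is hypothetical). Each non-overlapping match is returned as a string, in the order it appears.

```python
import re

# Made-up sentence containing three years.
text = "The band formed in 1968, split in 1980, and reunited in 2007."

found_years = re.findall("[1-2][0-9]{3}", text)

print(found_years)  # ['1968', '1980', '2007']

# With no match, findall returns an empty list rather than None.
print(re.findall("[1-2][0-9]{3}", "no years here"))  # []
```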