
Perl Tip

kuerant
Published 2013/03/19 16:16

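Perl one-liners as an iconv replacement

Either command converts its input from UTF-8 to GBK. The -p switch already supplies the read-and-print loop, so the extra -n is not needed.

perl -mEncode -pe 'Encode::from_to($_, "utf-8", "gbk")'

perl -mEncode -pe '$_ = Encode::encode("gbk", Encode::decode("utf-8", $_))'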
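The same conversion can be written as a small script, which makes the decode and encode steps explicit. This is a minimal sketch: the helper name utf8_to_gbk is hypothetical, and the strict FB_CROAK check is an added choice that the one-liners above do not use.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: turn UTF-8 bytes into GBK bytes.
sub utf8_to_gbk {
    my ($bytes) = @_;
    # FB_CROAK makes malformed input die instead of being silently replaced.
    my $chars = decode('UTF-8', $bytes, Encode::FB_CROAK);   # bytes -> characters
    return encode('gbk', $chars, Encode::FB_CROAK);          # characters -> GBK bytes
}

my $utf8_bytes = "\xe4\xbd\xa0\xe5\xa5\xbd";   # the UTF-8 bytes of "你好"
my $gbk_bytes  = utf8_to_gbk($utf8_bytes);
print unpack('H*', $gbk_bytes), "\n";          # hex dump of the GBK bytes
```

Going through decode and encode separately is usually easier to debug than from_to, which rewrites its argument in place.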

 

------------------------------------------------------------------------------

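use strict;
use warnings;
use Encode;

# Assumes this script file is saved in GBK (cp936) and the terminal uses it too.
my $bytes = "abc你好wert";
my $text  = decode('cp936', $bytes);    # GBK bytes -> Perl characters
my ($han) = $text =~ m/(\p{Han}+)/;     # first run of Han characters
print encode('cp936', $han), "\n";      # characters -> GBK bytes for output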

Match any character that is not a Han character: \P{Han}
Match any Han character: \p{Han}
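As a rough illustration of the two properties, the sketch below pulls the Han runs out of a string and then strips each class in turn. The sample string and variable names are made up for the example, and the characters are written as \x{...} escapes so the result does not depend on how the script file is encoded.

```perl
use strict;
use warnings;

# "abc你好wert世界" written with code point escapes.
my $text = "abc\x{4f60}\x{597d}wert\x{4e16}\x{754c}";

my @han_runs = $text =~ /(\p{Han}+)/g;       # every run of Han characters
(my $no_han   = $text) =~ s/\p{Han}+//g;     # delete the Han characters
(my $only_han = $text) =~ s/\P{Han}+//g;     # delete everything that is not Han

binmode STDOUT, ':encoding(UTF-8)';          # avoid "Wide character" warnings
print scalar(@han_runs), " Han runs\n";
print "$no_han\n$only_han\n";
```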

The Perl FAQ entry "How do I strip blank space from the beginning/end of a string?" states that using

s/^\s+|\s+$//g;

is slower than doing it in two steps:

s/^\s+//;
s/\s+$//;

Why is this combined statement noticeably slower than the separate ones (for any input string)?

The Perl regex engine runs much faster when it works with 'fixed' or 'anchored' substrings rather than 'floating' ones. A substring is fixed when it can be locked to a known place in the source string, and both '^' and '$' provide that anchoring. However, when you use alternation '|', the compiler doesn't recognize the choices as fixed, so it falls back to less optimized code that scans the whole string. In the end, looking for fixed strings twice is much, much faster than looking for a floating string once. On a related note, reading perl's regcomp.c will make you go blind.
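One way to check the claim on your own machine is the core Benchmark module. The sketch below is only an assumption-laden test harness: the sub names and the sample string are invented, and the size of the gap (or whether there is one) depends on the Perl version and the input.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Single substitution with alternation.
sub trim_combined {
    my ($s) = @_;
    $s =~ s/^\s+|\s+$//g;
    return $s;
}

# Two anchored substitutions.
sub trim_two_step {
    my ($s) = @_;
    $s =~ s/^\s+//;
    $s =~ s/\s+$//;
    return $s;
}

# Arbitrary sample: padding on both ends, ordinary words in the middle.
my $sample = '   ' . ('some words ' x 20) . "  \n";

# A negative count means "run each case for at least that many CPU seconds".
cmpthese(-1, {
    combined => sub { trim_combined($sample) },
    two_step => sub { trim_two_step($sample) },
});
```

cmpthese prints a rate table comparing the two cases, so no timing figures are quoted here.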

Update: Here are some additional details. If your perl was compiled with debugging support, you can run it with the '-Dr' flag and it will dump the regex compilation data. Here's what you get:

~# debugperl -Dr -e 's/^\s+//g'
Compiling REx `^\s+'
size 4 Got 36 bytes for offset annotations.
first at 2
synthetic stclass "ANYOF[\11\12\14\15 {unicode_all}]".
   1: BOL(2)
   2: PLUS(4)
   3:   SPACE(0)
   4: END(0)
stclass "ANYOF[\11\12\14\15 {unicode_all}]" anchored(BOL) minlen 1
# debugperl -Dr -e 's/^\s+|\s+$//g'
Compiling REx `^\s+|\s+$'
size 9 Got 76 bytes for offset annotations.

   1: BRANCH(5)
   2:   BOL(3)
   3:   PLUS(9)
   4:     SPACE(0)
   5: BRANCH(9)
   6:   PLUS(8)
   7:     SPACE(0)
   8:   EOL(9)
   9: END(0)
minlen 1

Note the word 'anchored' in the first dump.
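If no debugging build of perl is at hand, the core re pragma offers a similar view. The following is a sketch under that assumption, not the method used for the dumps above: 'use re "debug"' is lexically scoped, writes to STDERR, and its output format differs between Perl versions, so it will not match the listings character for character.

```perl
use strict;
use warnings;

my ($anchored, $alternation);
{
    use re 'debug';                 # only regexes compiled in this block are dumped
    $anchored    = qr/^\s+/;        # look for "anchored" in the dump
    $alternation = qr/^\s+|\s+$/;   # the dump shows BRANCH nodes instead
}

print "compiled both patterns\n";
```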

How do I strip blank space from the beginning/end of a string?

(contributed by brian d foy)

A substitution can do this for you. For a single line, you want to replace all the leading or trailing whitespace with nothing. You can do that with a pair of substitutions:

s/^\s+//;
s/\s+$//;

You can also write that as a single substitution, although it turns out the combined statement is slower than the separate ones. That might not matter to you, though:

s/^\s+|\s+$//g;

In this regular expression, the alternation matches either at the beginning or the end of the string, because alternation has the lowest precedence: the pattern reads as "^\s+" or "\s+$", with each anchor bound to its own branch. With the /g flag, the substitution makes all possible matches, so it gets both. Remember, the trailing newline matches the \s+, and the $ anchor can match at the absolute end of the string, so the newline disappears too. Just add the newline to the output, which has the added benefit of preserving "blank" lines (those consisting entirely of whitespace), which the ^\s+ would remove all by itself:

while( <> ) {
    s/^\s+|\s+$//g;
    print "$_\n";
}
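To see the blank-line behaviour without piping a file in, the loop can be run over an in-memory filehandle. This is a small sketch with made-up sample input, and it collects the trimmed lines in an array only so the result is easy to inspect.

```perl
use strict;
use warnings;

# Sample input: a padded line, a whitespace-only line, and a tab-indented line.
my $input = "  alpha  \n   \n\tbeta\n";
open my $in, '<', \$input or die "cannot open in-memory handle: $!";

my @lines;
while (my $line = <$in>) {
    $line =~ s/^\s+|\s+$//g;   # strips both ends, including the newline
    push @lines, $line;
}
close $in;

# Re-adding the newline keeps the whitespace-only line as an empty line.
print "$_\n" for @lines;
```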

For a multi-line string, you can apply the regular expression to each logical line in the string by adding the /m flag (for "multi-line"). With the /m flag, the $ matches before an embedded newline, so it doesn't remove it. This pattern still removes the newline at the end of the string:

$string =~ s/^\s+|\s+$//gm;

Remember that lines consisting entirely of whitespace will disappear, since the first part of the alternation can match the entire string and replace it with nothing. If you need to keep embedded blank lines, you have to do a little more work. Instead of matching any whitespace (since that includes a newline), just match the other whitespace:

$string =~ s/^[\t\f ]+|[\t\f ]+$//mg;
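The difference between the two multi-line patterns is easiest to see side by side. The sketch below applies both to the same made-up string, which contains one empty line and one line of spaces.

```perl
use strict;
use warnings;

my $string = "  first  \n\n   \n  second\t\n";

# \s also matches newlines, so the blank lines are swallowed.
(my $collapsed = $string) =~ s/^\s+|\s+$//gm;

# Matching only tab, form feed and space leaves every newline in place.
(my $kept = $string) =~ s/^[\t\f ]+|[\t\f ]+$//mg;

# Show the newlines as a visible marker for comparison.
for my $result ($collapsed, $kept) {
    (my $visible = $result) =~ s/\n/\\n/g;
    print "$visible\n";
}
```

With this input the first result is "first\nsecond" and the second is "first\n\n\nsecond\n": the character-class version keeps both blank lines and the final newline.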

© Copyright belongs to the author
